Hacker News

Did Claude increase bugs in rsync? (https://alexispurslane.github.io)

203 points by logicprog about 9 hours ago | 205 comments

dvt 1 minute ago |

It's always the most insufferable people that make the biggest hullabaloo about a project they have nothing to do with and have never contributed to. People with literally zero skin in the game using the AI boogeyman to push some agenda or some anti-agenda. OSS has become so incredibly toxic in the past decade, and consumers of OSS have become extremely entitled.

I run a smallish project with ~1k stars and I stopped maintaining it last year because people feel like they're absolutely owed features or bug-fixes or whatever. It's tiring and a complete shame that the author has to make such an insane deep dive into a random accusation that just caught on social media.

aesthesia about 3 hours ago |

I don't have a dog in this fight, but a few points that look a little suspicious:

- The release with the highest number of attributed bugs is the release _right before_ the first release with Claude-coauthored commits, released in January; is there a chance that unattributed LLM-authored commits made it into this release?

- The release attribution methodology is not great, since it will tend to attribute bugs introduced in a minor version update to the longest-lived patch release of that minor version. I doubt that 3.4.1 actually introduced a lot of bugs, but since it was released a day after 3.4.0, bugs that were introduced in that release get attributed to 3.4.1.

- Relatedly, more recent releases have had less time to have bugs filed against them, so there may be a bit of a bias toward evaluating recent releases as less buggy.
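
A rough sketch of the second point, assuming bugs are attributed to whichever release was the newest at the time of the report (that is my reading of the methodology, not necessarily what the article's scripts do). The version labels, dates, and the `attribute_by_date` helper are all invented for illustration.

```python
from bisect import bisect_right
from collections import Counter
from datetime import date


def attribute_by_date(releases, report_date):
    """Return the version of the latest release on or before report_date.

    `releases` is a list of (release_date, version) tuples sorted by date.
    Returns None when the report predates every release.
    """
    dates = [released for released, _ in releases]
    index = bisect_right(dates, report_date)
    return releases[index - 1][1] if index else None


# Hypothetical data: X.0 ships the regressions, X.1 follows one day later.
releases = [
    (date(2025, 1, 1), "X.0"),
    (date(2025, 1, 2), "X.1"),
    (date(2025, 6, 1), "X.2"),
]
# Every report below concerns a bug introduced in X.0, filed weeks later.
reports = [date(2025, 1, 20), date(2025, 2, 3), date(2025, 3, 15), date(2025, 4, 9)]

counts = Counter(attribute_by_date(releases, filed) for filed in reports)
print(counts)  # all four reports land on X.1, none on X.0
```

Under that assumption the one-day-old X.0 can never collect a report filed after X.1 shipped, which is the skew described above.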

thorum about 3 hours ago |

Unfortunately for the people mad about this, I predict the only thing they will accomplish by pressuring the rsync maintainers is to discourage everyone else from responsibly disclosing their use of AI. You’re just going to make people disable Claude attribution on their commits to avoid drama.

scsh about 9 hours ago |

> It does not control for commit complexity, security intensity, or bug severity. It does not distinguish between a one-line typo fix and a CVE patch. It is a blunt instrument. But the critics' accusation is also blunt: "Claude is making things worse." A blunt instrument is the fairest response.

If by fairest you mean to say that this analysis and response is sufficient, then I'm sorry but I have to disagree. We really need to understand whether the nature of the bugs is worse from a user's perspective. Even if the rate stayed unchanged, if the result is that the perceived quality of the software declined, then I would personally consider that worse, especially if I were a project maintainer.

That's not meant to be wholly dismissive either. But in general, I don't think quantitative analysis alone is enough to fully answer this type of question.

mikaeluman about 3 hours ago |

Not going to critique this survey. Must have taken a lot of time and required a lot of patience. Great work!

I think it will be up to some group in academia to make a real full-blown study across several repositories.

There must be tons to learn on how LLMs have changed software development and perhaps the cleanest separation will simply be going by what repositories declare e.g. "No LLM involved" vs those that proudly do the opposite or are neutral.

Bugs are not the only variable of interest here. I am guessing someone is already doing this as we discuss it here...

lbrito about 1 hour ago |

Wait, how is any of this relevant if there were only 2 Claude commits? My statistics courses are far behind me, but don't you need at least 30 data points to conclude anything?

AEVL 34 minutes ago |

How does the analysis look if we only count the >=90 severity cases—that is, if we downgrade the severity of all <90 cases to 0?
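
One way to sketch that re-run, assuming each bug carries a numeric severity on a 0 to 100 scale (the scale is inferred from the 90 cutoff, and the function name and sample numbers are made up):

```python
def severe_bugs_per_10_commits(severities, commit_count, threshold=90):
    """Count only bugs at or above `threshold`; everything else is treated as 0."""
    if commit_count <= 0:
        raise ValueError("commit_count must be positive")
    severe = sum(1 for severity in severities if severity >= threshold)
    return 10 * severe / commit_count


# Made-up release: 40 commits, six bug reports with assorted severities.
severities = [12, 35, 90, 47, 95, 60]
print(severe_bugs_per_10_commits(severities, 40))  # 0.5
```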

tiahura 4 minutes ago |

Write with your own voice and then polish with ai.

faitswulff about 9 hours ago |

> The analysis uses a single metric: bugs per 10 commits (bugs/10c).

Bugs per commit as a metric papers over severity, both in terms of security severity as well as the effect on the user. A mislabeled button has the same weight as the entire app crashing in this framework.
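
To make that concrete, here is a minimal sketch comparing the plain rate with a severity-weighted one. The weighting scheme, the 0 to 100 severity scale, and the two sample releases are assumptions for illustration, not anything taken from the article.

```python
def bugs_per_10_commits(bug_count, commit_count):
    """Unweighted rate: every bug counts as 1, whatever its impact."""
    if commit_count <= 0:
        raise ValueError("commit_count must be positive")
    return 10 * bug_count / commit_count


def weighted_bugs_per_10_commits(severities, commit_count):
    """Hypothetical alternative: each bug contributes severity / 100."""
    if commit_count <= 0:
        raise ValueError("commit_count must be positive")
    return 10 * sum(severities) / (100 * commit_count)


# Two made-up releases with 20 commits and 4 bugs each.
cosmetic = [5, 5, 10, 10]      # e.g. mislabeled buttons
crashing = [90, 95, 100, 85]   # e.g. crashes or data loss

print(bugs_per_10_commits(len(cosmetic), 20))      # 2.0
print(bugs_per_10_commits(len(crashing), 20))      # 2.0
print(weighted_bugs_per_10_commits(cosmetic, 20))  # 0.15
print(weighted_bugs_per_10_commits(crashing, 20))  # 1.85
```

Both releases score 2.0 on bugs/10c, and only the weighted variant separates them.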

logicprog about 9 hours ago |

Okay, I really have to point out to everyone: the numbers and report cards are TEMPLATED IN BY A SCRIPT. Hallucinations are a moot point. https://github.com/alexispurslane/rsync-analysis/blob/main/s...

geraneum about 9 hours ago |

> But the critics' accusation is also blunt: "Claude is making things worse." A blunt instrument is the fairest response.

So the criticism was bad, and that somehow makes it ok to use a bad metric?

parliament32 30 minutes ago |

Thank you for (re)writing this in your own voice. Despite how much effort might be put into methodology, data collection, etc., reading slop is unbearable, full stop. It's not intentional, but I have almost a nauseated reaction when the "AI tone" comes through, regardless of how good the data or how accurate the writing is.

Your verbosity and sentence structure are not a problem. I hope that publishing this gives you a bit more confidence in your writing, because it's legitimately good.

tptacek about 2 hours ago |

This is a neat post and I'm glad it got written and this is a little bit off-topic but:

Hey, 'logicprog, your writing is fine!

Use LLMs to critique your writing, check its structure, vet your choice of topic sentences, check flow from graf to graf and section to section, look for passive voice and overused words. LLMs are fantastic for that. But don't use a single word an LLM suggests in your actual writing. If it suggests something really fucking good, too bad, those words are disqualified. It's an easy red line to adhere to, easier than it sounds, and it'll keep your writing human.

(You ended up somewhere around here anyways, but that was after you posted something with LLM-written language because you weren't confident enough in your own writing. The things you do "worse" than an LLM are what make you you; be protective of them!)

mmonaghan about 1 hour ago |

I think there's evolution at play here - if you dislike AI enough to opt out of using any AI-generated code, you will likely suffer. I think there's definitely a conversation to be had about whether to disclose AI use or not, but that's a separate issue if you assume that everyone is using it in some respect.

rovr138 about 9 hours ago |

I'm just curious about testing.

Is this a configuration that's not common and thus not tested?

If people think they can do better, I want to see their forks and them keeping up with it.

https://github.com/RsyncProject/rsync/graphs/contributors?fr...

PunchyHamster about 2 hours ago |

The fact that the last few commits were attributed to Claude doesn't mean previous ones didn't use it.

Also, if you wrote a paper where you drew statistical conclusions from a whole 2 data points, you'd be laughed out of the room.

logicprog about 3 hours ago |

Another update: did an automated severity analysis on each bug report (~2000 of them!) using an LLM at temp=0 with a very strict rubric (and I checked to make sure that it rated things in a consistent, stable way using it). The rubric, LLM used, and some example ratings are included in the methodology section. For now, the information was just stored per-bug in the DuckDB and used to filter out non-bug bugs, to get a clearer signal. I'm going to try to use it to see if the post-Claude bugs were more severe in any way next.

Polarity about 9 hours ago |

so the answer is: no. actually less bugs. thanks

WesolyKubeczek about 1 hour ago |

The discussions around this have devolved to excrement anyway. I feel tempted to invoke the meme where the goose, rather than asking a guy what his jacket is made of, asks “where is your reproducer case!?”

Instead we have a shitstorm over a presumably legit issue, for which the only source is some Mastodon post.

One command that used to work in 3.4.1 and stopped working in 3.4.3. Just one! We could have already bisected the living shit out of this and gone home, but no.

steno132 about 1 hour ago |

This is just narrow thinking. Say Claude did increase the bugs in rsync by a negligible factor.

So what? You've saved a significant amount of time for a decent number of humans, and if those humans are working on other projects, the overall output for the world is net positive compared to without LLMs.

You have to broaden your perspective. It's not just about how rsync was affected.

KronisLV about 2 hours ago |

Pretty cool site!

> v3.4.3 has been out long enough that its rate (5.00) is already comparable to historical releases. The "wait and see" argument is an appeal to an unknowable future that shifts the burden of proof away from the critics. If more bugs surface, they will enter the distribution like every other release. There is no reason to expect a regime break.

I mean, as someone who uses LLMs, it might be a good idea to consider how one might limit the number of bugs that will appear in the future at least a little bit: parallel iterative code review loops would probably be the easiest and most applicable to LLMs, though I guess test coverage and other code analysis tools help too.

overgard about 3 hours ago |

The TLDR seems to be: needs more data.

gadrev about 9 hours ago |

Ok.

  $ apt-cache policy rsync | grep Installed
    Installed: 3.4.1+ds1-7ubuntu0.2
  $ sudo apt-mark hold rsync     
    rsync set on hold.

themafia about 1 hour ago |

> If anyone complains about my verbosity or sentence structure — as they usually do, which is the reason I originally let the AI write the prose, among other reasons obsoleted by templating — they can go fuck themselves.

You can write for an audience or you can write for yourself. Which is fine either way but you shouldn't pass the blame for bad results on to your audience.

> and recieving almost no substantive input, discussion, or response on the actual content of the article

Well did you write it for that purpose?

> "Just wait, more bugs will surface" -- v3.4.3 has been out long enough

Wait for _more releases_. As your own data shows, the bug rate is not consistent between releases. So this is probably not a worthwhile metric. Perhaps systems touched, new features included, or attempted fixes would be a better way to contextualize releases and the goals of the author.

yobid20 about 2 hours ago |

needs a tldr; im not reading all that. maybe claude can summarize it for me.

pushcx about 7 hours ago |

    What followed was extraordinary: 329 comments and counting, ranging from thoughtful concern to outright harassment.
    The thread did not stop at words. One user posted My Little Pony drawings of themselves strangling the "project janitor that pushed vibecoded commits":
    It spread to Hacker News and Lobsters, generating hundreds more comments.

This is false, it did not appear on Lobsters. Here is the function in the codebase that prohibits this kind of brigading: https://github.com/lobsters/lobsters/blob/main/app/models/st...

Please correct your article.

nairboon about 9 hours ago |

Is this an analysis made by/with Claude?

dang about 3 hours ago |

[stub for offtopicness]

[see https://news.ycombinator.com/item?id=48416020 for how all this happened in the first place]

jrflowers about 2 hours ago |

Tl;dr:

Yes, it did. Here is some math showing that you shouldn’t care about that.

the_real_cher about 9 hours ago |

Is there a non vibe coded fork of rsync?

MagicMoonlight about 3 hours ago |

Typical AI slop post. It’s pretty hilarious that he just added spaces before the emdashes and claims it’s human written.

If I’m hiring and I see this kind of slop, I ain’t hiring you.

wookmaster about 9 hours ago |

Claude is just a tool? The developers who merged that code and didn't properly test it are the ones who increased the bugs.

mwkaufma about 2 hours ago |

Smokescreen of highly contingent analysis and appeals to authority for a premotivated conclusion.