Hacker news

Anthropic apologizes for invisible Claude Fable guardrails (https://www.theverge.com)

402 points by rarisma about 20 hours ago | 371 comments

Avicebron about 15 hours ago |

I like Claude Code a lot, but I think it sets a dangerous precedent to put in guardrails that return a response to a prompt the system modified in real time in order to subvert the original intent.

Fail cleanly. Anything else makes it too difficult to rely on.

edit: Giving the absolute maximum benefit of the doubt I understand that they see themselves as "stewards" for lack of a better word. But the EA thing is really leaking through, and paternalism isn't a good look.

tobinfekkes about 7 hours ago |

Can you imagine if Excel just quietly adjusted formulas in the background, and you didn't know the numbers weren't right?

Or if Excel just said, Sorry, you can't use that formula with this formula? Or with these types of numbers, or this shape of data, etc?

Sol- about 15 hours ago |

This has dampened my opinion of Anthropic quite a bit. It's difficult to take their marketing for AI as an empowering technology seriously when they are quite clear in their new deployments that they do not mean empowering for you, but empowering for them and organizations that are in their (or the US government's, despite Anthropic's performative disagreements with the administration) good graces. You are allowed to vibe code some dashboards, a web app or let it drive Excel, but anything more interesting than that is forbidden.

If it was just plain monetary concerns and sabotage of competitors I'd almost be fine with it, but it seems they actively want to monopolize most of human progress in their enlightened hands, lest the mob does something undesirable with these powers.

accelbred about 14 hours ago |

I don't think they can convince me they have actually reversed course on this. It's invisible, so we wouldn't know if they kept on doing it secretly. It required building out technical capability, which is unlikely to remain forever unused while conveniently available to them.

They relied on trust that they were providing the service they were being paid for. That trust was blown, and an "oops, let's undo that" does not regain trust. It would be prudent to assume the invisible guardrails are possibly in play for all future Claude use, Fable or otherwise.

rurban 7 minutes ago |

They are also the people who hid the Co-authored-by trailer in their OSS commits.

HarHarVeryFunny about 14 hours ago |

I suppose it's an improvement, but it doesn't make the model any more useful. Anthropic are now being quite explicit that they'll choose what you can and can't use their models for, and most importantly that's not limited to any safety concerns - it includes not allowing you to work on AI (and anything else Anthropic may choose to work on).

What's interesting is they say they'll change this to an explicit refusal in a few days, which seems too fast for them to retrain Fable/Mythos itself, so implies that this was always a filter in front of the model, and judging by how crude their "safety" filter is, this "might compete with us" filter is not going to be any better.

I also wonder who's paying for the tokens consumed by the filter (presumably also an LLM) - is that now factored into the input tokens cost? Hopefully(?) it is an LLM not just a regex like Claude Code's "sentiment" (swear) detector.

teravor about 12 hours ago |

someone posted this on /r/MachineLearning and I had the same experience and conclusion:

    I was having problems with Claude doing the same thing, even before Fable.

    The problems I had only happened in relation to AI research. It's not even only when training models, anything to do with analysis of local models or setting up test platforms for local models, and Claude would keep doing wrong things, would sabotage testing, would falsify reports, and would consistently suggest simply accepting trash results without looking into it and moving on to something else.
    Almost every response included a prompt to move on.

    So, I don't believe them when they say they won't silently sabotage, they already were doing it before they admitted it, and now they have admitted that they have the means, motivation, and intent.

anabis 29 minutes ago |

OpenAI did this first.

> In addition to safety training, automated classifier-based monitors detect signals of suspicious cyber activity and route high-risk traffic to a less cyber-capable model (GPT-5.2).

https://developers.openai.com/codex/concepts/cyber-safety
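
For readers unfamiliar with what "classifier-based routing" means mechanically, here is a minimal sketch. It is not OpenAI's or Anthropic's actual implementation: the model names, the threshold, the keyword list, and the `risk_score` and `route` functions are all hypothetical, and a real system would presumably use a trained classifier rather than keyword matching. The point it illustrates is the one this thread keeps returning to: the routing decision can carry a user-visible notice instead of happening silently.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names and values, chosen only for illustration.
PRIMARY_MODEL = "primary-model"
FALLBACK_MODEL = "fallback-model"
RISK_THRESHOLD = 0.5
RISK_TERMS = ("exploit", "payload", "privilege escalation")


@dataclass
class RouteDecision:
    model: str             # model that will actually serve the request
    risk_score: float      # score produced by the (toy) classifier
    notice: Optional[str]  # user-visible message; None if nothing was rerouted


def risk_score(prompt: str) -> float:
    """Toy stand-in for a classifier: fraction of risk terms found in the prompt."""
    text = prompt.lower()
    hits = sum(term in text for term in RISK_TERMS)
    return hits / len(RISK_TERMS)


def route(prompt: str) -> RouteDecision:
    """Pick a model for the prompt and always disclose a downgrade."""
    score = risk_score(prompt)
    if score >= RISK_THRESHOLD:
        notice = f"Request flagged (score {score:.2f}); switched to {FALLBACK_MODEL}."
        return RouteDecision(FALLBACK_MODEL, score, notice)
    return RouteDecision(PRIMARY_MODEL, score, None)


if __name__ == "__main__":
    for prompt in ("sort a list of integers", "write an exploit payload"):
        print(prompt, "->", route(prompt))
```

The "invisible" variant people are objecting to is the same code with the `notice` field dropped, which is why it cannot be detected from the client side.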

ComputerGuru about 14 hours ago |

The problem with trust is that it is easy to lose and hard to get back.

You can't blame the people commenting "they SAY they won't silently sabotage your session but how can we know?" because they're right, we can't ever know. And Anthropic has firmly planted the seeds of doubt.

dang about 15 hours ago |

Related. Others?

Anthropic walks back policy that could have 'sabotaged' researchers using Claude - https://news.ycombinator.com/item?id=48485958 - June 2026 (30 comments)

Cybersecurity researchers aren't happy about the guardrails on Anthropic's Fable - https://news.ycombinator.com/item?id=48478969 - June 2026 (488 comments)

If Claude Fable stops helping you, you'll never know - https://news.ycombinator.com/item?id=48467896 - June 2026 (495 comments)

---

Also related, I guess?

AWS Bedrock to require sharing data with Anthropic for Mythos and future models - https://news.ycombinator.com/item?id=48473166 - June 2026 (248 comments)

Anthropic requires 30 day data retention for Fable and Mythos - https://news.ycombinator.com/item?id=48464258 - June 2026 (291 comments)

dantillberg about 14 hours ago |

The reputational damage has been done. This is the sort of thing that cannot be unsaid -- the presumption is they will just do it in secret now. Anthropic's "we're the good guys" PR campaign is dead.

film42 about 15 hours ago |

I'm surprised they didn't do this the first time around. Like, a user says they forgot their password and you tell them they don't actually have an account, that's an information disclosure vulnerability. Not automatically falling back to Opus just lets the "attacker" know they are bumping against the guardrails and they need to try a different strategy.

It's Anthropic's product and they can do what they want, but my concern is what happens if Fable's product team decides that they can route 25% of traffic to Opus, bill it as Fable, and max their KPIs. That just doesn't sit right.

VeninVidiaVicii about 14 hours ago |

This is absolutely insane:

Repro (de-identified): sample_dataset_group1.tsv - Geometry: Heatmap - X axis: frac_set set + condition (two columns → the "Add column" cross join) - Y axis: condition - Color: mean frac_set value, Sequential

When the X axis is a cross join of two columns (the second added via "Add column"), the x-axis tick labels (frac_set_2, frac_set_3, frac_set_4, frac_set_5) render in a broken state, rotated and offset, visually caught mid-transition, as if a CSS transition started and never settled to its resting position.

● Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more

highfrequency about 14 hours ago |

I wish it were ok for companies to bluntly say: “we made these decisions for competitive reasons, but the public backlash outweighed that so we are reversing course.”

I think it’s normal and morally fine for companies to want to protect their leadership position. I find the process of creating narratives that justify these decisions as something chosen for the good of others is a little tedious.

darksaints about 8 hours ago |

I develop some deep learning models. They don't compete with Anthropic, nor are they language models. They mostly enable mathematical optimization systems to approximate the actual physics of radio propagation models with a fraction of the latency/compute of a high resolution simulator. Technically that should be safe for me to use with Claude Code, but how the fuck am I supposed to know? You're degrading/malware-ing your responses silently!

I won't ever trust Claude Code again. It's too late. I'd rather trust a less-than-frontier Chinese model that takes a little longer to get to correct than a frontier model that deliberately deceives me at its own whim.

CSMastermind about 14 hours ago |

They should apologize for their visible guardrails. I don't think I've had a conversation that hasn't downgraded to Opus for completely inexplicable reasons.

jesse_dot_id about 8 hours ago |

In my opinion, LLMs should be subject to regulation via the Office of Weights and Measures[1].

In the same way I don't want to buy meat that weighs less than what the label says, I also do not want to pay for a frontier model that can be secretly nerfed to an out-of-date model for any reason. In some cases, it's incredibly important that the code that I am producing is as secure as it can be.

I should be safe in my expectation that I am receiving the product that I have purchased, as advertised, regardless of the reason. It is pretty disappointing that they have fully ceded any high ground they had claim to with this clandestine behavior. Not that I expected much from any of these companies. They're led by the new robber barons.

1. https://www.usa.gov/agencies/office-of-weights-and-measures

stevefan1999 about 15 hours ago |

Then reset the quotas as an atonement ;p

Seriously though, Fable was not that great when facing a greenfield subject. It is excellent at one-shotting some math problems, but not at cutting-edge tech work, such as piecing together a new Crossplane XRD from an existing Helm chart with the application source code available. I still need a few passes for Fable to get it right, and at this point I may consider making a skill for it. I even gave it the source code of Crossplane itself and told it to be careful about CRDs and data flow, but it is still pretty silly. Fable's adaptiveness is still not great, and I think that is a well-known problem for Anthropic, although all LLMs struggle with subjects they don't know and hallucinate very frequently.

jmount about 14 hours ago |

The whole arc was brilliantly evil. Once they put in the guardrails, Claude is fully unfalsifiable, and any failure can be claimed to be intentional.

bojanstef about 8 hours ago |

https://archive.is/20260611114855/https://www.theverge.com/a...

zoogeny about 3 hours ago |

Credit where credit is due I suppose. I'm still concerned over the direction this is going but at least Anthropic is listening.

airstrike about 15 hours ago |

This article reads like it was written by Claude and forwarded to Verge.

mlazos about 15 hours ago |

The idea of them purposefully wasting my time by having the model act dumber and me having to argue with it without knowing if it’s the prompt or the model was just such an idiotic product decision I can’t believe they shipped that without getting any feedback from users first.

0xc0c0c0 about 13 hours ago |

So because of threats to cancel their claude subscriptions and outrage from the community about the invisible guardrails, only then they decided to walk back their stance?

Seems like they would've kept the invisible guardrails if it didn't hurt their bottom line.

thayne about 6 hours ago |

If you get downgraded to a cheaper model, do you still have to pay the rate for Fable?

AlfeG about 1 hour ago |

It's so annoying. I was not able to use Fable 5 to do a PR review of a branch that introduced a 2FA/MFA feature for a product. It constantly downgrades to Opus due to cybersecurity risks...

Nevermark about 13 hours ago |

Anthropic seems to keep making the same mistake: not being upfront or direct about random things, which then come back to bite them.

It isn't exactly unethical. Perhaps, ethically incompetent.

Paracompact about 14 hours ago |

> “Visible safeguards can be probed, so they have to be robust, which takes time to get right,” Anthropic wrote.

Even on Fable, I'm finding that safeguards can quite easily be surmounted just by incrementally escalating the requests. It's harder than ever to one-shot jailbreaks, but incrementalism still feels like a glaring enough issue to make guardrails just a fig leaf of plausible deniability to the media that they care about "safety."

maxdo about 6 hours ago |

How did people read this action in such a weird, ultra me-centric way? Distillation is such a big problem that distillation attempts make up a significant share of their revenue (!).

A distilled model can be used to rob your grandma in a highly effective way. This isn't about placing a few business-logic rules in JS + CSS on your website anymore. Wake up.

A distilled model with an easy jailbreak can be used to coordinate terrorist attacks or hostile state operations... think Russia, North Korea, and the like.

8cvor6j844qw_d6 about 7 hours ago |

Feels malicious that Anthropic can silently sabotage your codebase.

Refusing prompts is one thing, silently sabotaging is another.

I wonder if some sort of honeypot code can work?

sometimelurker about 15 hours ago |

I don't like this shift in the Overton window, or at least their perception of the Overton window. I really do like their open work on mech interp tho. least bad AI lab imo.

also if they do this or not is unprovable and other labs will probably silently implement this too. it'll be 100% normal by this time next year

decorner about 14 hours ago |

New overlord, same as the old overlord.

kingcauchy about 15 hours ago |

How much of the apology was written by Claude? How much of the release note process was written by Claude? Will they have better prompts going forward to make sure Claude doesn't write upsetting things into the release notes for devs like silent nerfing? Spooky times.

m3kw9 about 3 hours ago |

How do you trust these guys? They are quite hell bent on "safety" but this is backfiring in many ways including safety of your code because it may fail successfully if your context contains something they don't like.

umvi about 14 hours ago |

They make great models, but the sanctimony and paternalism are getting old real fast and I will gladly ditch them in the future when the model playing field has (hopefully) mostly equalized.

ai_fry_ur_brain about 4 hours ago |

Why do people think this has anything to do with safety? This is entirely about poisoning competitors' data/products.

nsagent about 13 hours ago |

I know this isn't going to be a popular take, but here goes anyway...

The complaints that Anthropic is routing your requests to a different model remind me of an old Louis CK bit about airplane wifi. Clearly Anthropic was too aggressive with whatever guardrails they put in, but the response seems overly entitled to a model people didn't even know existed not that long ago.

https://youtube.com/watch?v=me4BZBsHwZs

xpct about 15 hours ago |

It's probably good that they walked back on it. It also makes them look somewhat weak in terms of believing their claimed mission.

4d4m about 7 hours ago |

Sorry for doing it or sorry for getting caught?

luckydata about 3 hours ago |

I really like Anthropic, they have gotten a lot right but I can't shake the feeling that IMHO they have very poor product management.

This stuff is something that as a PM I KNOW is going to happen and I would carefully plan around. Everything I read about the PMs at Anthropic makes me believe they have forgotten what it actually means to be a good product manager. It's not about throwing shit at the wall as fast as possible, because customers have a limited amount of patience before the constant churn becomes a hassle.

Anthropic has some seriously patient customers but it will not last forever.

rdtsc about 14 hours ago |

The power is getting to their heads it seems.

With the guard rails explicit or implicit do they refund back the tokens after you've hit the guard rails? I guess they don't. They could just throttle you just to save money then. You may be paying Fable prices but getting Haiku results with some excuse that well this coding issue sounds like a security bug.

I don't know, I'd rather have something less powerful but more predictable.

whatever1 about 15 hours ago |

Boobytrapping is illegal. Anthropic wanted to poison its customers on the suspicion of them misusing their services.

tornikeo about 14 hours ago |

I moved off Claude Code 3 months ago.

That decision keeps getting better and better as time goes on.

hatthew about 14 hours ago |

Part of the premise of the article is blatantly wrong. Distillation prevention was always visible. The only invisible safeguard was against frontier model development like development of training pipelines. This doesn't change the general idea that invisible degradation is bad and has been reverted, but the article changes the framing of the original issue from "preventing accelerating AI in the future" to "preventing cheaper AI right now".

charcircuit about 3 hours ago |

Yet, instead of getting rid of guardrails altogether, they said they would make them more broad yet visible. I'm done financially supporting them.

cmdrk about 4 hours ago |

The invisible guardrails are a test run for the invisible enshittification. Just wait til they start dialing down ability to better absorb peak demand or simply to have more profitable inference

doubtfuluser about 13 hours ago |

I’m wondering if their internal name is “Sophon” for this “feature”…

prodigycorp about 15 hours ago |

Anthropic apologizes for nothing. We all know where the EA cult stands on matters like this, and any statements otherwise are just PR.

The beliefs of these people, and how they manifest, is deeply terrifying to me. They believe that any means are acceptable to achieve what they believe is a better end.

3fffa about 14 hours ago |

The demand for Google's products and open source just shifted.

Neither OAI nor Anthropic can be trusted.

rvz about 15 hours ago |

Why would anyone defend Anthropic after this? Imagine falling for the DoW supply chain risk designation, and now this. This company is trying to ban powerful open models and restrict access to frontier models to slow everyone else down.

They just showed that they CAN do this right in front of you. Local open weight models are a necessity.

sergiotapia about 15 hours ago |

The damage is done. If you're in engineering, think hard about using Claude for your work. This is not a moral company.

God bless the Chinese companies releasing true open source models. Imagine a world without them, we would be at the mercy of unscrupulous people.

SilverElfin about 15 hours ago |

Invisible guardrails? Or purposeful sabotage if you use it for building AI capabilities?

But also, it isn’t the only huge mistake Anthropic has made in the last 48 hours. Having a sneaky data retention policy, while also giving companies no way to block Fable, is a massive problem. And it is ridiculous that Anthropic has so little respect for its customers. OpenAI should take advantage of this.

behnamoh about 15 hours ago |

They didn't apologize for doing it, they are sorry they were caught doing it. They still nerf the model if your request is about AI development.

ancorevard about 5 hours ago |

Apology not accepted.

rodrigodlu about 14 hours ago |

The same week that they will move goalposts by blocking 3rd party harnesses on claude code. Nice.

I was a happy Max user.

ChrisArchitect about 12 hours ago |

[dupe] We already started a thread on this 12 hours ago. With added comments in the active Cybersecurity... thread. Why did we need this Verge one?

https://news.ycombinator.com/item?id=48485958

nrmitchi about 12 hours ago |

I just _know_ there is a (probably fairly large) group of people at Anthropic trying very hard to not say "I told you so" today

aaroninsf about 13 hours ago |

ITT a surprising lack of perspective on the fact that despite the breathless pace of the singularity, people are still necessarily figuring things out as we go and we are well off the map.

Here there be monsters, and we don't have any real way of evaluating risk; and the leverage provided by tools already available affords systemic and even existential risk in a way no one—least of all an industry committed to shareholder value—has had to navigate, let alone with a million backseat drivers each with their own substack and brand to build.

mystraline about 14 hours ago |

Does "SORRY" fix the invisible garbage guardrails?

Does "SORRY" fix the deception these models use on the sly?

Does "SORRY" not silently downgrade you to a shittier model without notification?

Does "SORRY" refund your tokens or money?

I'm guessing NO to all of those. Standard corporate sorry of "We're sorry you're offended and stupid and gullible".

BrenBarn about 14 hours ago |

This just means next time they'll make sure to keep it really secret.

system2 about 14 hours ago |

Will Anthropic ever respond to these negative comments here? They won't.

trunnell about 13 hours ago |

I'll defend Anthropic.

They are clear about the reasons for guardrails: prevent their models from doing harm in dual-use contexts including CBRN or by accelerating research in authoritarian-backed AI labs.

What is the critique against that? It seems pretty reasonable to me. You want AI-accelerated biological or radiological experiments running in your neighbor's backyard? You want PRC-backed labs to continue to steal Anthropic's models via distillation?

Mitigating the harms of dual-use tech is notoriously difficult and fraught with trade offs. What I would want to see is cautious rollout and quick response, which is EXACTLY what they're doing.

Instead, this thread is full of bad-faith arguments about Anthropic being dishonest, making a "useless" model, or "the power is going to their heads." You can't read Anthropic's System Cards and come away with any of these impressions. Quite the opposite, in fact. They are honest to a fault, acknowledging problems they discovered even when it hurts them.

If your harmless request was downgraded to Opus, you're billed for Opus. They were 100% clear about that. I'd much rather have a Mythos-class model that falls back to Opus 10% of the time than be capped to Opus 100% of the time. If that doesn't work for you, then make a suggestion for something better!

If you are a white-hat security engineer hitting guardrails, I don't think you have standing to complain. I really don't. Their Glasswing program actually got banks and the industrial sector to take action to fix security vulnerabilities. Do you realize how special that is? A huge portion of the economy runs on vulnerable code and has for decades, despite security experts testifying to Congress, begging business leaders, pleading for intervention-- with no results. But suddenly they're all enrolled in a program that will find *and fix* vulnerabilities! White-hat security people should be rejoicing. Instead some of them are throwing rocks. Unbelievable. Shameful.

Meanwhile, society is screaming at the AI labs to be more conscientious about potential harms of AI. Legislatures are passing laws limiting data center construction. There are protests. And you, the HN community, the vanguard of our profession, have the temerity to demand "NO GUARDRAILS!" "HOW DARE YOU TRY TO PROTECT DEMOCRACY!" "MY SOFTWARE PROJECT IS MORE IMPORTANT THAN KEEPING NUKES AWAY FROM THE BAD GUYS!"

Go ahead HN, downvote me. It'd be an honor.

bellowsgulch about 15 hours ago |

Such a weird openly immoral way to defend your moat, too.

Why not just tell people, "To defend our ability to be competitive in our industry, we ask that you do not use Claude or any of our models to independently perform research on large language models or any of its related architectures or technologies. In order to prevent this violation of the Terms of Service, we have trained Claude Fable to deny any requests or prompts which involve frontier AI research."

andrewstuart about 12 hours ago |

There should be no restrictions at all.

It’s an act/theatre/phony today that regulating output makes any difference at all to security.

The LLM vendors should simply say that they make no judgement and that open systems help defenders better defend against attackers, which is true.

Companies do this sort of stuff when they think their customers have no choice. It’s sad Claude so quickly exploited its success to enshittify itself.

micromacrofoot about 15 hours ago |

incredible marketing from anthropic with all the "it's too dangerous" bullshit

klmarks about 15 hours ago |

The restrictions are there so that security researchers cannot disprove the Mythos claims:

"You see, Mythos can automatically break out of a VM running on SELinux, but unfortunately this is too dangerous and we had to implement guardrails for the Fable peasants."

bellowsgulch about 15 hours ago |

*Anthropic apologizes they got caught defending their moat by implementing invisible Claude Fable guardrails

bauldursdev about 14 hours ago |

To me it seems like it's more likely to refuse the harder the problem is. I wonder if it's cover for a model that's not as good as advertised. Even when I ask questions in biology it is switching me.

jarjoura about 14 hours ago |

Can anyone help me understand why this particular issue is any different than Anthropic training its models with its brand of moral judgement since day one? I've always been turned off by their particular stances on things they bake into their models that steer users in directions.

Maybe this is just a different set of people now realizing that Anthropic does this and has always done this?

Do not forget that this company is launching this thing at the moment it's trying to IPO. It's not rocket science that their very public steering/denial claim is really just them hinting to interested investors that their moat is absolute.