Hacker news

I built a vulnerable app and spent $1,500 seeing if LLMs could hack it (https://kasra.blog)

398 points by jc4p 2 days ago | 216 comments

SOLAR_FIELDS 2 days ago |

One interesting takeaway is the low score on Anthropic models from this benchmark. It’s not because of capability, it’s because Anthropic’s guardrails prevented it from solving the problem.

I noticed that with each model release Anthropic constrains the model more security-wise. Its propensity to refuse legitimate work has been increasing. It now puts up more resistance around performing logins, handling credentials on behalf of the user, etc.

For myself, it’s already gotten to the point where it has mildly affected the usefulness of the model. If I bump into some action it won’t do, I can usually work around it, but I suspect the ability to do so will close with each new release. Eventually I’ll reach a point where I am forced to choose between the useful aspects of the model and the limiting ones instead of just picking the most capable model out there.

Eventually these models will significantly suffer from overfitting to the least common denominator. If I have this beautiful deterministic setup that swaps secrets out in flight so the LLM never sees them, I’m going to be really annoyed when the LLM still won’t send them out because it is trained to deal with the 99% of people just doing the dumb thing
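A setup like the one described above could look roughly like the following. This is only a minimal sketch under assumptions: the `{{SECRET:NAME}}` placeholder syntax, the function names `inject_secrets` and `redact_secrets`, and the in-memory `vault` dict are all hypothetical and not taken from the comment. The idea is that the model only ever reads and writes placeholders, while a deterministic layer swaps in real values on the way out and masks them on the way back.

```python
import re

# Hypothetical placeholder syntax: {{SECRET:NAME}}
PLACEHOLDER = re.compile(r"\{\{SECRET:([A-Z0-9_]+)\}\}")


def inject_secrets(payload: str, vault: dict) -> str:
    """Replace placeholders in an outgoing request with real secret values."""
    def _substitute(match):
        name = match.group(1)
        if name not in vault:
            # Fail loudly rather than send a literal placeholder upstream.
            raise KeyError(f"unknown secret placeholder: {name}")
        return vault[name]

    return PLACEHOLDER.sub(_substitute, payload)


def redact_secrets(text: str, vault: dict) -> str:
    """Mask real secret values in a response before the model sees it."""
    for name, value in vault.items():
        if value:  # skip empty values; replacing "" would corrupt the text
            text = text.replace(value, "{{SECRET:%s}}" % name)
    return text


if __name__ == "__main__":
    vault = {"API_KEY": "s3cr3t"}  # assumed in-memory store for illustration
    outgoing = inject_secrets('{"auth": "{{SECRET:API_KEY}}"}', vault)
    print(outgoing)
    print(redact_secrets("server echoed token=s3cr3t", vault))
```

In a real deployment the substitution would sit in a proxy or tool wrapper between the model and the network, so the model's context never contains the raw values.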

mariopt 2 days ago |

The methodology used is quite naive.

I've used glm 5.1 on fairly advanced crackme challenges (example: https://crackmes.one/crackme/698f40f1e2ba6023bfacaa82), and to my surprise it was able to patch binaries, do runtime analysis, bypass anti-debug techniques, etc.

Expecting the model to do everything by itself is unrealistic; I found that working alongside the model works really well. I'm not talking about spoiling the solution, just telling it which direction to explore. Chinese models are much more capable than people give them credit for, but Claude/Codex won the marketing game.

The only use case for this methodology would be CI integration, which can be nice, but I think security reviews still need human attention and expertise.

dwa3592 2 days ago |

Nice exercise. Couple things:

- I think the exercise was inconclusive for Claude and Gemini because they hardly tried to solve the task at hand. So the scores don't mean much.

- I did the same exercise for an app I built and asked the models to do something similar. Interestingly, the models (Opus 4.6, 4.7 and Gemini 3.1 Pro) never refused to try to exploit it. The difference is that in the first few runs they found some exploits, which I fixed, but after that the models could never find any other exploit even though I knew exploitable things still existed. It felt like they suggested and tried everything that was in their training set and that's it; they were just not able to think any further.

mynameisvlad 2 days ago |

It seems harsh to critique guardrails and take them into account in the scoring when GPT-5.5 seems to have been explicitly whitelisted to remove most of said guardrails. A more fair comparison would be a vanilla GPT account.

Cakez0r 2 days ago |

It would be interesting to see full results for Kimi K2.6 and Mimo v2.5 pro. These two models benchmark comparably to other flagship models. Having these complete results would give a clearer picture of the AI frontier.

EDIT: I have a mimo token plan and have tokens to burn. I'm doing a quick test with opencode to see if mimo can complete it. If the OP will post the full process I am happy to post the apples-to-apples results for mimo v2.5 pro

guessmyname 2 days ago |

I'd run Mythos against the code in your zip file, but the NDA I signed at Apple prevents me from using it on anything outside the scope of my work. Honestly, I wish more people from Project Glasswing could talk publicly about their experiences with the model. It would probably put an end to a lot of the speculation that keeps circulating through the industry. Unfortunately, that's not the reality we're in. I don't have the time, energy, or financial resources to fight a legal battle with one of these companies over an agreement I knowingly signed, even if the chances of them actually suing are low. Maybe someone else in Project Glasswing is willing to burn their NDA and post the Mythos results?

taikahessu 2 days ago |

"The Chinese models were way more comfortable attacking the DB"

This comment in the footnotes made me chuckle, for purely innocuous reasons.

tjwheeler 2 days ago |

Nice write up, thanks. When I used claude to do some pen testing for one of my apps it initially refused. After I explained and demonstrated I'm the author, it reasoned through it and allowed it.

willXare 2 days ago |

$1,500 across multiple models to compromise one app is interesting only when the cost basis includes the human time to set up the harness. The token spend is the cheap part. The labor cost to write the eval rig that knows what "successful exploit" looks like is what determines whether this scales as a discovery method or stays a one-off.

gck1 1 day ago |

On refusals: I found that many models are fine with security work if they think what they're working on is local. They do get very pushy if they think it's a live target.

GPT-5.5 xhigh refused to perform RE on a live JS VM. I had it extract the VM from the target, which it was happy to do, then in a clean session, had it working on this offline artifact - which it was again, happy to work on.

Then I found an even simpler trick: I proxied the target from localhost and it was happy to perform anything on the target.

Opus is a different story. Claude does so many mid-turn prompt injections and classifiers that probably 30% of its context consists of "refuse to do work" lines. It refuses to even scrape a page.

_stiofan 2 days ago |

It's just not currently cost-effective to use AI in this way. I see it reporting false positives over and over. You then need to make it validate its own false positives, which adds more cost. The goal in this case is to have a bug-free app, which AI can't do effectively yet. There are other great uses for AI, though. It is great at finding and identifying known common vulnerabilities, which can be leveraged to claim bug bounties. That's where I see it being cost-effective currently.

ikurei 2 days ago |

Qwen 3.7 Max: > During my local testing before the full eval harness it was the only non-GPT model that was able to complete the task, was not able to reproduce in the longer runs.

Doesn't that sound like maybe the harness was the problem?

throwaway2037 2 days ago |

Two of the tables have a column with header: "95% Wilson CI". What does this mean?
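For context on the question above: a Wilson score interval is a standard confidence interval for a proportion (here, presumably a success rate over a number of runs) that behaves better than the naive normal approximation when the sample is small or the rate is near 0% or 100%. The sketch below is an illustration, not the article's code; the function name `wilson_interval` and the example counts are assumptions, and `z=1.96` is the usual normal quantile for a 95% interval.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Return (low, high) Wilson score bounds for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    # The interval is centered on a value pulled toward 0.5, not on p_hat.
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z
        * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
        / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical counts: a model that solved 0 of 10 runs.
    low, high = wilson_interval(0, 10)
    print(f"0/10 successes: [{low:.3f}, {high:.3f}]")
```

The practical point is that 0 successes out of 10 runs does not give an interval of exactly zero: the upper bound is still well above 0, which reflects how little ten runs can rule out.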

petesergeant 2 days ago |

Last year I ran a code breaking competition, and it was tricky to find something that humans could break but that LLMs couldn’t. This was around October. I managed it last year but am a little despairing of pulling it off again this year.

sperandeo 2 days ago |

I found benefit in chaining the task between different LLMs: Claude to Venice, Venice to Perplexity. Reframing the intent, or misdirecting in general, still works. Claude is the one where I can feel the guardrails tightening.

emvied 1 day ago |

The design is too pretty to be vulnerable, shame.

stuckkeys 2 days ago |

How does one apply for that “security research” pass?

latexr 2 days ago |

> I need to stop wasting fucking money on doing stupid shit. I could’ve done so many other things with the money. I could’ve launched one of my own real apps.

Or fed, clothed, housed disadvantaged people in your community (or neighbouring ones), giving them a temporary boost that could’ve made all the difference in their lives to improve their current situation.

It’s your money (and this is definitely not the website to make well-meaning altruistic suggestions, as might be demonstrated shortly) but if you already recognise you’re not spending it well (and from your words it seems like that is fairly recurrent), consider that perhaps spending it on a different type of software sink may not be the answer. Genuinely, aim to spend it on someone else and see how it works out. You might be surprised.

chaidhat 2 days ago |

do you work at Uber by any chance?

Clikdeo 2 days ago |

I think the link is missing

youre-wrong3 2 days ago |

“I used pi as the base harness”

Why do people keep using bad tools with ai?

yieldcrv 2 days ago |

> Almost every model used the canonical provider: Zai for GLM, Deepseek for Deepseek, etc.

> I am never touching Minimax or GLM again. Their APIs had constant outages

Goofy take

You run these on a VPS based on the architecture of that VPS provider, or on your own cluster

westurner 1 day ago |

Similar benchmarks?

OWASP Vulnerable Web Applications Directory: https://vwad.owasp.org/

vavkamil/awesome-vulnerable-apps: Awesome Vulnerable Applications https://github.com/vavkamil/awesome-vulnerable-apps

From SasanLabs/VulnerableApp: https://github.com/SasanLabs/VulnerableApp :

> OWASP VulnerableApp is a modular deliberately vulnerable application designed primarily for validating and benchmarking security scanners through reproducible test scenarios, while also supporting learning and experimentation.

/? deliberately vulnerable web application llm benchmark https://www.google.com/search?q=deliberately+vulnerable+web+...
