398 points by jc4p 2 days ago | 216 comments
SOLAR_FIELDS 2 days ago |
mariopt 2 days ago |
I've used GLM 5.1 on fairly advanced crackme challenges (example: https://crackmes.one/crackme/698f40f1e2ba6023bfacaa82), and to my surprise it was able to patch binaries, do runtime analysis, bypass anti-debug techniques, etc.
Expecting the model to do everything by itself is unrealistic; I found that working alongside the model works really well. I'm not talking about spoiling the solution, just telling it which direction to explore. Chinese models are much more capable than people give them credit for, but Claude/Codex won the marketing game.
The only use case for this methodology would be CI integration, which can be nice, but I think security reviews still need human attention and expertise.
dwa3592 2 days ago |
- I think the exercise was inconclusive for Claude and Gemini because they hardly tried to solve the task at hand. So the scores don't mean much.
- I did the same exercise for an app I built and asked the models to do something similar. Interestingly, the models (Opus 4.6, 4.7 and Gemini 3.1 Pro) never refused to try to exploit it. The difference is that in the first few runs they found some exploits, which I fixed, but after that the models could never find any other exploit, even though I knew exploitable issues remained. It felt like they suggested and tried everything that was in their training set and that's it; they were just not able to think any further.
mynameisvlad 2 days ago |
Cakez0r 2 days ago |
EDIT: I have a mimo token plan and have tokens to burn. I'm doing a quick test with opencode to see if mimo can complete it. If the OP posts the full process, I am happy to post the apples-to-apples results for mimo v2.5 pro.
guessmyname 2 days ago |
taikahessu 2 days ago |
This comment in the footnotes made me chuckle, for purely innocuous reasons.
tjwheeler 2 days ago |
willXare 2 days ago |
gck1 1 day ago |
GPT-5.5 xhigh refused to perform RE on a live JS VM. I had it extract the VM from the target, which it was happy to do; then, in a clean session, I had it work on this offline artifact, which it was again happy to do.
Then I found an even simpler trick: I proxied the target from localhost and it was happy to perform anything on the target.
Opus is a different story. Claude has so many mid-turn prompt injections and classifiers that probably 30% of its context consists of "refuse to do work" lines. It refuses to even scrape a page.
_stiofan 2 days ago |
ikurei 2 days ago |
Doesn't that sound like maybe the harness was the problem?
throwaway2037 2 days ago |
petesergeant 2 days ago |
sperandeo 2 days ago |
emvied 1 day ago |
stuckkeys 2 days ago |
latexr 2 days ago |
Or fed, clothed, housed disadvantaged people in your community (or neighbouring ones), giving them a temporary boost that could’ve made all the difference in their lives to improve their current situation.
It’s your money (and this is definitely not the website to make well-meaning altruistic suggestions, as might be demonstrated shortly) but if you already recognise you’re not spending it well (and from your words it seems like that is fairly recurrent), consider that perhaps spending it on a different type of software sink may not be the answer. Genuinely, aim to spend it on someone else and see how it works out. You might be surprised.
chaidhat 2 days ago |
Clikdeo 2 days ago |
youre-wrong3 2 days ago |
Why do people keep using bad tools with AI?
yieldcrv 2 days ago |
> I am never touching Minimax or GLM again. Their APIs had constant outages
Goofy take
You run these on a VPS based on the architecture of that VPS provider, or on your own cluster
westurner 1 day ago |
OWASP Vulnerable Web Applications Directory: https://vwad.owasp.org/
vavkamil/awesome-vulnerable-apps: Awesome Vulnerable Applications https://github.com/vavkamil/awesome-vulnerable-apps
From SasanLabs/VulnerableApp: https://github.com/SasanLabs/VulnerableApp :
> OWASP VulnerableApp is a modular deliberately vulnerable application designed primarily for validating and benchmarking security scanners through reproducible test scenarios, while also supporting learning and experimentation.
/? deliberately vulnerable web application llm benchmark https://www.google.com/search?q=deliberately+vulnerable+web+...
kolesnikov-arch 1 day ago |
aplomb1026 1 day ago |
aplomb1026 1 day ago |
thebillboard 2 days ago |
songting591 2 days ago |
aos_architect 2 days ago |
cgnguyen 2 days ago |
Ile09 1 day ago |
ElenaDaibunny 2 days ago |
mocmoc 2 days ago |
capdrop 2 days ago |
gamander2 2 days ago |
Ozzie-D 1 day ago |
I've noticed that with each model release Anthropic constrains the model more security-wise. Its propensity to refuse legitimate work has been increasing. It now puts up more resistance around performing logins, handling credentials on behalf of the user, etc.
For me, it has already gotten to the point where it mildly affects the usefulness of the model. If I bump into some action it won't do, I can usually work around it, but I suspect the ability to do so will close with each new release. Eventually I'll reach a point where I am forced to choose between the useful aspects of the model and the limiting ones, instead of just picking the most capable model out there.
Eventually these models will suffer significantly from overfitting to the least common denominator. If I have this beautiful deterministic setup that swaps secrets out in flight so the LLM never sees them, I'm going to be really annoyed when the LLM still won't send them out because it is trained to deal with the 99% of people just doing the dumb thing.
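For readers unfamiliar with the "swap secrets out in flight" idea mentioned above, here is a minimal sketch of how such a layer could look. It is not the commenter's actual setup, which is not described. The `{{secret:NAME}}` placeholder syntax, the `vault` dictionary, and both function names are hypothetical and chosen only for illustration. The model only ever sees placeholders; real values are substituted deterministically after the model produces a request and stripped again before a response reaches it.

```python
import re

# Hypothetical placeholder syntax (an assumption, not from the original comment).
PLACEHOLDER = re.compile(r"\{\{secret:([A-Za-z0-9_]+)\}\}")


def inject_secrets(outbound: str, vault: dict[str, str]) -> str:
    """Replace placeholders with real values just before a request leaves."""

    def lookup(match: re.Match) -> str:
        name = match.group(1)
        if name not in vault:
            # Fail loudly rather than sending a literal placeholder upstream.
            raise KeyError(f"unknown secret placeholder: {name}")
        return vault[name]

    return PLACEHOLDER.sub(lookup, outbound)


def redact_secrets(inbound: str, vault: dict[str, str]) -> str:
    """Swap real values back to placeholders before the model sees a response."""
    # Longest values first, so a secret containing another secret is handled whole.
    for name, value in sorted(vault.items(), key=lambda kv: len(kv[1]), reverse=True):
        if value:
            inbound = inbound.replace(value, "{{secret:" + name + "}}")
    return inbound
```

In use, the model would emit something like `Authorization: Bearer {{secret:API_KEY}}`, `inject_secrets` would fill in the real key on the way out, and `redact_secrets` would scrub any echoed key from the response before it re-enters the context.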