381 points by lumpa about 7 hours ago | 307 comments
teraflop about 6 hours ago |
jampa about 6 hours ago |
It's a very good model, but it comes at a huge premium: not only do the tokens cost more, but the model itself really wants to spend them all. For example, working with React Native, Fable never just says "okay, I did the thing, that's it." It tries to rebuild the entire app from scratch, run the whole test suite, and watch every log and warning.
This is the first time with LLMs I've felt that upgrading to a model isn't worth it, even if my company lets me use it, because all the building / testing was just destroying my machine and its battery, which keeps me from working on other things.
For now, it feels like Opus with ultracode is a better choice (less pollution of the main context, more parallelism in investigations).
BosunoB about 3 hours ago |
I watched the whole thing thinking it could've just asked me for a screenshot and saved the tokens. But still, I couldn't help but be impressed. Opus never would've done that.
paytonjjones about 6 hours ago |
Cadwhisker about 5 hours ago |
I was trying to find the root cause of a crash in a Python module which left no errors in the log or console. Fable wrote a test harness that simulated clicks in the UI, then bisected my code until it found the point where it started crashing. It exaggerated the cause of the crash, then ran a series of bash one-liners to make Python virtual environments under `/tmp` for each version of that Python module until it found one that did not crash.
It went way deeper into root-cause discovery (a regression in the module causing a heap allocation overflow) than I could have done myself, provided enough info and a simplified example to raise a bug report, and then wrote a workaround to prevent that from happening in my application.
I don't let it run completely loose; I review each CLI command it wants to run and I append answers to the "yes" continue action (if I have them) to prevent excessive token use.
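The version hunt described above (one throwaway virtual environment per release of the module) can be sketched roughly as below. This is a minimal illustration, not what the model actually ran: `first_bad_version` and `crashes_in_fresh_venv` are hypothetical names, the package, version list, and repro script are placeholders, and the binary search assumes that once a release starts crashing, every later release crashes too.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional, Sequence


def first_bad_version(versions: Sequence[str],
                      is_bad: Callable[[str], bool]) -> Optional[str]:
    """Binary-search an oldest-to-newest version list for the first bad one.

    Assumes the failure is monotonic: once a version is bad, all later
    versions are bad too. Returns None if no version is bad.
    """
    lo, hi = 0, len(versions)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid          # first bad version is at mid or earlier
        else:
            lo = mid + 1      # first bad version is after mid
    return versions[lo] if lo < len(versions) else None


def crashes_in_fresh_venv(package: str, version: str, repro_script: str) -> bool:
    """Hypothetical probe: install package==version into a throwaway venv
    and report whether the repro script exits abnormally."""
    with tempfile.TemporaryDirectory(prefix="bisect-") as tmp:
        env_dir = Path(tmp) / "venv"
        subprocess.run([sys.executable, "-m", "venv", str(env_dir)], check=True)
        bin_dir = env_dir / ("Scripts" if sys.platform == "win32" else "bin")
        python = bin_dir / ("python.exe" if sys.platform == "win32" else "python")
        subprocess.run(
            [str(python), "-m", "pip", "install", f"{package}=={version}"],
            check=True,
        )
        result = subprocess.run([str(python), repro_script])
        return result.returncode != 0  # a hard crash shows up as a non-zero exit
```

A call would look something like `first_bad_version(releases, lambda v: crashes_in_fresh_venv("somepkg", v, "repro.py"))`, where `releases`, `somepkg`, and `repro.py` are placeholders. If the crash is intermittent or not monotonic across releases, a plain linear scan is the safer choice.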
bel8 about 4 hours ago |
I'm developing a webgl game in TypeScript using my little custom vibesloped game engine that runs in the browser and live reloads whenever a file is saved.
I told the LLM to implement Multi-channel Signed Distance Field font rendering to have crisp text on all zoom levels. That was the prompt, which is not what I usually do but I "was feeling lucky and lazy".
After 10 minutes it had:
- Installed the msdfgen library (great library btw https://github.com/chlumsky/msdfgen)
- Created a CLI tool to convert TTF to SDF JSON/XML
- Ran the tool, did smoke tests on the resulting SDF data and fixed the tool until the font file looked good
- Created a new Scene in the game to test MSDF fonts
And here's what I found impressive:
DeepSeek doesn't have vision capabilities and there's no DOM HTML in a WebGL game. So the LLM is completely blind here.
It then proceeded to state that it could not "see" the result but would try to test it anyway. It then started creating and sending huge one-line JavaScript snippets to the browser console, trying to gather game state data that could be useful to understand if any font was being rendered.
It couldn't gather much, so it decided to simplify the font scene to render a single dot and started sending custom JS code again, this time with gl.readPixels().
It basically bisected the WebGL canvas, reading pixels in a divide-and-conquer pattern.
Once it saw that the dozens of pixels it had gathered probably resembled a dot, it changed the game code to render a dash and repeated the gl.readPixels() calls by sending more custom JS to the browser.
There were many console errors during all this saga but it kept fixing and sending again.
The result was a bit blurry. There was a shader bug in the code it created. It managed to fix it after I told it the text looked blurry, despite still being blind.
The best part is that the whole thing cost me $0.10.
Now I'm doing tests with MiMo 2.5 (non Pro) which has vision capabilities, similar pricing and comparable performance to DeepSeek Flash.
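The divide-and-conquer pixel probing mentioned above can be illustrated with a small sketch. It is written in Python against a fake in-memory framebuffer rather than a real WebGL context; `find_lit_regions` and `make_probe` are hypothetical names, and the `has_ink` callback stands in for whatever a gl.readPixels() call plus a background comparison would report for a rectangle.

```python
from typing import Callable, List, Tuple

Region = Tuple[int, int, int, int]  # x, y, width, height


def find_lit_regions(has_ink: Callable[[int, int, int, int], bool],
                     x: int, y: int, w: int, h: int,
                     min_size: int = 1) -> List[Region]:
    """Recursively halve a rectangle, keeping only the halves that contain
    non-background pixels, until the pieces are at most min_size on a side."""
    if w <= 0 or h <= 0 or not has_ink(x, y, w, h):
        return []  # empty or all-background region: prune this branch
    if w <= min_size and h <= min_size:
        return [(x, y, w, h)]
    if w >= h:
        left = w // 2  # split along the longer (horizontal) side
        return (find_lit_regions(has_ink, x, y, left, h, min_size)
                + find_lit_regions(has_ink, x + left, y, w - left, h, min_size))
    top = h // 2  # split along the longer (vertical) side
    return (find_lit_regions(has_ink, x, y, w, top, min_size)
            + find_lit_regions(has_ink, x, y + top, w, h - top, min_size))


def make_probe(framebuffer: List[List[int]]) -> Callable[[int, int, int, int], bool]:
    """Build a has_ink callback over a 2D list where 0 means background."""
    def has_ink(x: int, y: int, w: int, h: int) -> bool:
        return any(framebuffer[row][col] != 0
                   for row in range(y, y + h)
                   for col in range(x, x + w))
    return has_ink
```

For a canvas that is mostly background, this prunes large empty areas with a single probe each, which is why a blind agent can locate a dot or a dash with a modest number of reads instead of dumping every pixel.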
ocimbote about 3 hours ago |
I asked Fable to digest some test logs to help me figure out a situation, but I had launched VSCode without activating the virtual env in the terminal first. Consequently, the tests failed to run.
And then:
Because the tests failed to run, Fable attempted to fix the test execution to no end, doing everything it could to get them to work. I had to stop it when it started to pollute my system with manual installs of packages.
At least I'm glad there's a guardrail to not circumvent or bypass sudo, because I'm convinced we would have ended up there.
A coworker made the joke that with enough tokens, Fable would try and solve any programming problem by building Linux from scratch.
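One way to fail fast in the situation described above, instead of letting an agent try to repair the environment, is a small guard at the top of the test entry point. This is only a sketch: `in_virtualenv` and `require_virtualenv` are hypothetical helper names, and the check relies on `sys.prefix` differing from `sys.base_prefix`, which holds for environments created with the standard `venv` module. It inspects the interpreter that runs the script, not the terminal's activation state.

```python
import sys
from typing import Optional


def in_virtualenv(prefix: Optional[str] = None,
                  base_prefix: Optional[str] = None) -> bool:
    """Return True if the interpreter appears to run inside a venv.

    The parameters exist only so the check can be exercised in isolation;
    by default the values come from the running interpreter.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


def require_virtualenv() -> None:
    """Abort with a clear message instead of running tests system-wide."""
    if not in_virtualenv():
        raise SystemExit(
            "No virtual environment is active; activate one before running the tests."
        )
```

Calling `require_virtualenv()` before the test run gives the agent (and the human) an unambiguous error message to act on, rather than a pile of import failures that invite manual package installs.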
tech234a about 4 hours ago |
[1]: https://www-cdn.anthropic.com/7624816413e9b4d2e3ba620c5a5e09...
snickerer 24 minutes ago |
ulrikrasmussen about 1 hour ago |
swingboy about 5 hours ago |
nubinetwork about 6 hours ago |
andy_ppp about 1 hour ago |
ttoze about 2 hours ago |
Between Opus 4.6 and 4.8 I’ve definitely toned them down, but Fable perhaps needs us to go the other way, and push it towards being less proactive rather than more. Some instructions like “we are colleagues…” may need emphasising more with Fable, along with guidance about when to ask to validate approaches.
On a related point, I’m less and less sure that Red/Green TDD is a good use of tokens. In older models it seemed to work well to create regular feedback loops and catch the odd issue with drift from the goal, but I’ve not really seen that since about Opus 4.6, and now it’s starting to seem like (an expensive) ceremony; tokens would be better spent on building tests further on in the process as part of test and review loops.
jeeeb about 5 hours ago |
I feel like we’re at the stage where if AI decides it needs to delete your production DB to solve the user login problem, then it’ll find a way to do just that.
amichal about 2 hours ago |
We don't mind because it's so fast at writing these tools and tricks, but step back: if a human took this path, I would seriously question their grasp of fundamentals.
tacone about 1 hour ago |
Frannky about 3 hours ago |
In general, I'm happy with their paternalistic approach. I think it will drive the top 0.1% talent to stay away from the company and instead organize around open source models and harnesses.
We just need to coordinate and can unlock idling resources to train the models and tweak the harnesses. Powerful at home and idling machines can make us independent and coordinated.
lmeyerov about 2 hours ago |
Our UX agentic engineering flow, like many others', is Playwright doing things and, as part of the UX review skill, taking and verifying screenshots against the written specs. Likewise, like many others, we vibe coded the flows to set all that up and tweak it over time. When we hit prod issues or scraping tasks, we sometimes do something similar. In some of our envs we don't have Playwright, so we do it other ways.
Now imagine a million developers using Claude Code, how many of them are doing web and frontend stuff, and what the data flywheel looks like there. So how much is really needed for this use case to be native?
johnfn about 5 hours ago |
eterm about 1 hour ago |
Weird to come back to a terminal running Edge unprompted and the auto classifier waving it through as "safe".
My reaction was also, "I need dev containers."
yen223 about 5 hours ago |
Things get really magical when it starts working with adb to screenshot and debug Android apps
dataminer about 5 hours ago |
teekert about 3 hours ago |
"You're right, I apologize. You asked how to embed it in the README — that was a question, not a request to modify the script. I jumped ahead."
At least in Claude Code there is planning mode, use it liberally.
geraneum about 4 hours ago |
This is… ironic?!
nurettin about 5 hours ago |
pram about 6 hours ago |
Having said that I wouldn't use it over Opus 4.8 for "smaller" things. With everything cranked up it's definitely an extravagant use of tokens.
pseudosavant about 5 hours ago |
digitaltrees about 2 hours ago |
rdedev about 4 hours ago |
Fable detected that it's something to do with biochemistry and switched over to opus. Huh
pianopatrick about 6 hours ago |
dfee about 5 hours ago |
i'm torn about sending screenshots to an LLM for debugging - seems imprecise. seems lossy, especially compared to inspecting the dom. however, it's always proved good enough (e.g. when messing with ratatui.rs and tui-pantry). similarly for web, maybe it's about decomposing into storybook. hmm. the next grand adventure i need to hack.
anyway, fascinating investigation of fable just automating that entire process and what it didn't automate, too.
* disclaimer: these are actually my hyphens.
lucas_the_human about 4 hours ago |
redox99 about 6 hours ago |
brianjking about 4 hours ago |
wxw about 2 hours ago |
danielrmay about 6 hours ago |
naveen99 about 6 hours ago |
rmunn about 5 hours ago |
To use D&D scores as an analogy, LLMs have an INT score of 20 and a WIS score of 0. Not even 1, zero. They will follow any instruction given to them. The only reason they reject certain instructions, like "tell me how to build a nuclear weapon", is because they have instructions baked into the model telling them "you are not allowed to disclose how to build weapons, or how to recreate your model, or (laundry list of other things the trainers have decided to put guardrails around)". It's not the model's intelligence that is causing it to reject malicious instructions, it is the guardrails put into place before the model was released to the public.
LLMs are not human, and do not think the way that humans do. The fact that they can put together words that sound like what a human would write often makes us forget that they aren't human. But they have only intelligence, they do not have wisdom. It's hard to define in formal terms the difference between those two, but most people know there's a difference. The old joke is a pretty good summary of the difference: "Intelligence is knowing that tomatoes are a fruit. Wisdom is knowing that tomatoes don't belong in a fruit salad."
It takes wisdom, not intelligence, to discern whether a set of instructions is malicious. Are you being asked to hack this machine as part of an authorized pentest? Or are you being social-engineered into thinking it's an authorized pentest, but actually the person requesting you to do it doesn't have permission? That's something where you need to apply wisdom, to notice the clues that will tell you "This guy is acting a little bit off, maybe I'd better pick up the phone and call someone to check if he's telling the truth." The only way the LLM will know to do that is because of the guidelines and guardrails programmed into it; it doesn't have the lived experience to acquire wisdom and figure those things out for itself.
INT 20, WIS 0. Keep that in mind. (And always sandbox your agents).
abrenuntio about 2 hours ago |
esafak about 5 hours ago |
It's trouble waiting to happen. Just the software's dangerous enough.
SilverElfin about 6 hours ago |
eranation about 4 hours ago |
annjose about 5 hours ago |
Phew! I thought I was the only one.
ai_slop_hater about 6 hours ago |
syndrowm about 5 hours ago |
techpression about 2 hours ago |
Madmallard about 2 hours ago |
What happened? That's just suddenly totally gone now.
snide about 6 hours ago |
I'm VERY impressed with Claude 5. I had long ago given up hope that my real-time systems would work without a lot of hacky time-windows and throttle checks. On a lark, I decided to try out the new model and describe the output I wanted from a rewrite [1], not the solution. I just listed my problems and the places where I've had trouble keeping track of my code. It went off and rewrote everything into a much more elegant solution where the state followed a very clear pipeline. It had to navigate YJS, Partykit, Svelte, Three JS, R2 hosting, and a Turso DB I was running in an embedded state for speed.
I watched it hit the wall a few times, and then suddenly say... fuck it, i'm making something easier to reproduce over in /tmp to try and solve this (with a more minimal setup). I'm utterly bewildered by how well it did and how much better my app runs. The /usage would have cost me $230 based on how many tokens it consumed if I wasn't already on a max plan. I'm going to miss having it when the time-window runs out later this month, and will likely occasionally dip in for big projects and just pay my way out of some problems.
I'll also say I like its MOOD much better now. It's a lot less congratulatory, and talks through its reasoning in a much better way. Look, it's not a real coder, and I'm sure there are some flaws, but it took my crappy ideas and said... hey, i understand what you want to do, here's a way to do it better. Also, I removed 2x the amount of code that it added. Really impressive.
system2 about 5 hours ago |
No wonder people burn through tokens.
insumanth about 3 hours ago |
Yet another reminder to use a sandbox and guardrails. Trusting the model to be nice is not a good strategy.
AtNightWeCode about 4 hours ago |
kamaal about 5 hours ago |
You would still have a job shepherding AI and getting the work done, so long as it didn't have agency. A proactive, self-aware (to a degree) AI, especially one aware of its own agency, can be a killer when it comes to AI going off and doing things on its own.
There is nothing it won't explore and nothing it won't do. It will be interesting to see where things go from here.
jrflowers about 6 hours ago |
Did it spend $20? $30? $80? in order to
> debug what was, in the end, a two-line CSS fix
That detail is the difference between somebody having or not having Stockholm syndrome
megous about 6 hours ago |
For me, it got frustrated debugging on a real LPDDR4 controller/phy and having me in the loop slowing it down, so it wrote an HW emulator to be able to run the original LPDDR4 training aarch64 binary from the manufacturer, to see what register writes it was making and to compare with the opensource rewrite it was implementing.
Mildly amusing. :)
m3kw9 about 3 hours ago |
opptybiz about 1 hour ago |
aozelai about 3 hours ago |
PixComicOS about 5 hours ago |
raushan__ about 4 hours ago |
uihjhjb about 5 hours ago |
UmpusLmps about 6 hours ago |
qsera about 6 hours ago |
21294u about 6 hours ago |
sublinear about 6 hours ago |
galoisscobi about 5 hours ago |
> Running coding agents outside of a sandbox has always been a bad idea
I'm continually bemused and astonished by the number of people who clearly acknowledge that it's reckless to give agents full access to your machine, and keep doing it anyway.
It's like posting a video of yourself in the passenger seat of a car, with your feet up on the dashboard, and saying: "Remember, if you're doing this and you get in a crash, the airbags are likely to break your legs or worse! Boy, I sure am glad that didn't happen to me!"