97 points by matt_d 2 days ago | 57 comments
robot-wrangler 2 days ago |
orthoxerox 2 days ago |
I would probably score about the same. Does this prove I also rely on training data memorization rather than genuine programming reasoning?
Or does this simply show that esolangs are hard to reason in by design? A more honest approach would use a "real", but relatively unpopular, language. Make them use CoffeeScript or Ada or PL/I or Odin or that other systems programming language that that very opinionated guy is implementing on top of QBE.
bwestergard 2 days ago |
Before looking at the results my guess was that scores would be higher for Unlambda than any of the others, because humans that learn Scheme don't find it all that hard to learn about the lambda calculus and combinatory logic.
But the model that did the best, Qwen-235B, got virtually every problem wrong.
monster_truck 2 days ago |
It's bad enough that I've considered writing some sort of cursed bash->posh translation layer
Yet it has no issues at all implementing and then writing slopjective-c 3.0
chromaton 2 days ago |
Some models were OK at solving very simple problems, but nearly all of them would, for example, hallucinate control structures that did not exist in the target language.
msully4321 about 19 hours ago |
I reported it https://github.com/Lossfunk/EsolangBench/issues/1 but haven't heard back yet
paraschopra 2 days ago |
Esolang-Bench went viral on X, and a lot of discussion ensued. Addressing a few of the common questions that came up about our benchmark. Hope it helps.
a) Why do it? Does it measure anything useful?
It was a curiosity-driven project. We're interested in how humans exhibit sample-efficiency in learning and OOD generalization. So we simply asked: if models can zero/few shot correct answers for simple programming problems in Python, can they do the same in esoteric languages as well?
The benchmark is what it is. Different people can interpret its usefulness differently, and we encourage that.
b) But humans can't write esoteric languages well either. It's an unfair comparison.
Primarily, we're interested in measuring LLM capabilities. With the talk of ASI, it is supposed that their capabilities will soon be super-human. So, our primary motivation wasn't to compare to humans but to check what they can do on this by-construction difficult benchmark.
However, we do believe that humans are able to teach themselves a new domain by transferring their old skills. So this benchmark was to set a starting point to explore how AI systems can do the same as well (which is what we're exploring now)
c) But Claude Code crushes it. You limited models artificially.
Yes, we tested models in zero- and few-shot settings. And in the agentic loop we describe in the paper, we limit the number of iterations. As we wrote above, we wanted to understand their performance from a comparative point of view (say, against highly represented languages like Python), and that's why the benchmark is designed like this.
After the paper was finalized, we experimented with agentic systems where we gave models tools like bash and allowed unlimited iterations (but limited submission attempts). They indeed perform much better.
The question that's relevant is what makes these models perform so well when you give them tools and iterations vs. when you don't. Are they reasoning / learning like humans or is it something else?
d) So, are LLMs hyped? Or is our study clickbait?
The paper, code and benchmark are all open source.
We encourage whoever is interested to read it, and make up their own minds.
(We couldn't help but notice that the same set of results was interpreted wildly differently within the community. A debate between opposing camps of LLMs ensued. Perhaps that's a good thing?)
__alexs 2 days ago |
maximge 1 day ago |
Opus 4.6 Extended solved all of them.
https://claude.ai/public/artifacts/aeb98066-f7a9-455b-9550-6...
https://claude.ai/public/artifacts/b0fcd13f-d222-4b65-bdcf-f...
https://claude.ai/public/artifacts/304650fb-afbf-4a08-9f6b-5...
https://claude.ai/public/artifacts/d00b898c-2265-4a34-a910-9...
There was only one incorrect answer, on Hard: H01: Balanced Parentheses. On the second attempt, it was solved. The Josephus Problem turned out to be really easy (meaning it was solved quickly). Possibly the model learned from the earlier tasks, since I did everything in one chat. As a prompt, I provided the problem statement, except for the first task, where I added this description of the language:
Syntax:
Character   Instruction Performed
>   Increment the data pointer by one (to point to the next cell to the right).
<   Decrement the data pointer by one (to point to the next cell to the left). Undefined if at 0.
+   Increment the byte at the data pointer by one modulo 256.
-   Decrement the byte at the data pointer by one modulo 256.
.   Output the byte at the data pointer.
,   Accept one byte of input, storing its value in the byte at the data pointer.
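The instruction descriptions above map almost directly onto a small interpreter. What follows is only a minimal Python sketch, not anything used in the benchmark or in the linked artifacts. The function name `run_bf`, the 30,000-cell tape, the step limit, and the choice to leave a cell unchanged at end of input are all assumptions. The two loop commands `[` and `]` are not in the description quoted above; they are taken from the standard Brainfuck definition. Moving left of cell 0, which the description calls undefined, is treated as an error here.

```python
def run_bf(program: str, stdin: bytes = b"", tape_len: int = 30_000,
           max_steps: int = 1_000_000) -> bytes:
    """Run a Brainfuck program and return the bytes it outputs."""
    # Pre-compute the matching position for every loop bracket.
    jump = {}
    stack = []
    for pos, ch in enumerate(program):
        if ch == "[":
            stack.append(pos)
        elif ch == "]":
            if not stack:
                raise ValueError(f"unmatched ] at {pos}")
            start = stack.pop()
            jump[start], jump[pos] = pos, start
    if stack:
        raise ValueError(f"unmatched [ at {stack[-1]}")

    tape = bytearray(tape_len)   # every cell starts at 0
    ptr = pc = steps = 0
    inp = iter(stdin)
    out = bytearray()
    while pc < len(program):
        steps += 1
        if steps > max_steps:    # assumption: guard against runaway loops
            raise RuntimeError("step limit exceeded")
        ch = program[pc]
        if ch == ">":
            ptr += 1
            if ptr >= tape_len:
                raise IndexError("pointer moved past the end of the tape")
        elif ch == "<":
            if ptr == 0:         # "undefined" in the description; an error here
                raise IndexError("pointer moved left of cell 0")
            ptr -= 1
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(tape[ptr])
        elif ch == ",":
            nxt = next(inp, None)
            if nxt is not None:  # assumption: cell is left unchanged at EOF
                tape[ptr] = nxt
        elif ch == "[" and tape[ptr] == 0:
            pc = jump[pc]        # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jump[pc]        # jump back to the start of the loop
        pc += 1                  # any other character is ignored as a comment
    return bytes(out)
```

For example, `run_bf("++++++++[>++++++++<-]>+.")` builds 8 * 8 + 1 = 65 in the second cell and returns `b"A"`.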
sinuhe69 1 day ago |
I mostly skip the videos whenever I can.
sathish316 2 days ago |
simianwords 2 days ago |
If the LLM has “skills” for that language, it will definitely increase accuracy.
groar 2 days ago |
rubyn00bie 2 days ago |
Current frontier models are really good at generating boilerplate, and really good at summarizing, but really lack the ability to actually comprehend and reason about what’s going on. I think this sort of test really highlights that. And it is a nice reminder that LLMs are only as good as their training data.
When an LLM or some other kind of model does start to score well on tests like this, I’d expect to see them get better at discovering new results, solutions, and approaches to questions/problems, compared to how they work now, where they generally only seem to uncover answers that have been obfuscated but are present.
22122 1 day ago |
mastermage 1 day ago |
I would even go as far as conjecturing that this means self-improving models, which some AI proponents proclaim are around the corner, are as far away as they have always been.
deklesen 2 days ago |
Would love to see how the benchmark results change if the esoteric languages are changed a bit to make them have 1-token keywords only.
gverrilla 2 days ago |
QubridAI 1 day ago |
shablulman 2 days ago |
Heer_J 2 days ago |
Finally! This is a really obvious test-case that I've wondered about myself, and have seen many casual skeptics and cautiously optimistic people independently raising for several years now. When megacorp is not crowing about such a test, the silence is deafening, and it was practically guaranteed that they tested, didn't like the results, and didn't publish.
I'm still surprised it took this long for academics to try it, and skimming cites, I don't see anything similar. Anyone know if this is the first paper to try this kind of thing, or just the first paper to put together an especially good suite of reusable benchies?
If this benchmark becomes popular, then presumably to avoid such embarrassments synthetic data is eventually added to training sets to make sure even esolangs are somewhat more in-distro, and then we gradually run out of esolangs to do honest testing with. SAT is a whole different animal admittedly, but comparable honest tests might involve just forcing models to use a randomly generated but easily checked EBNF grammar? I don't have a quick link to the relevant papers, but afaik results on benchmarks of strict adherence to non-simple JSON schemas are also still pretty bad, and we're just working around it with lots of retries/tokens. "But look how well it works for 10k lines of kubernetes manifests!" Well yeah, maybe, but it barely needs to really follow a schema since that is more stuff that's in the training set.
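To make the "randomly generated but easily checked" grammar idea concrete, here is a rough Python sketch. It is a deliberately simplified stand-in: only the terminal symbols are randomized, while the rule shapes stay fixed, so it is much weaker than a fully random grammar. The names `make_grammar`, `to_ebnf`, and `accepts` are hypothetical, and nothing here comes from the paper.

```python
import random

SYMBOLS = "!@#$%&*+-/<=>?^|~"  # candidate terminals; contains no digits

def make_grammar(rng: random.Random) -> dict:
    """Pick three distinct random terminals for a fixed two-rule grammar."""
    op, open_, close = rng.sample(SYMBOLS, 3)
    return {"op": op, "open": open_, "close": close}

def to_ebnf(g: dict) -> str:
    """Render the grammar as EBNF text that could be shown to a model."""
    return (
        f'expr = term , {{ "{g["op"]}" , term }} ;\n'
        f'term = digit | "{g["open"]}" , expr , "{g["close"]}" ;\n'
        'digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;'
    )

def accepts(g: dict, text: str) -> bool:
    """Recursive-descent check that the whole of text matches expr."""
    pos = 0

    def term() -> bool:
        nonlocal pos
        if pos < len(text) and text[pos] in "0123456789":
            pos += 1
            return True
        if pos < len(text) and text[pos] == g["open"]:
            pos += 1
            if not expr():
                return False
            if pos < len(text) and text[pos] == g["close"]:
                pos += 1
                return True
        return False

    def expr() -> bool:
        nonlocal pos
        if not term():
            return False
        while pos < len(text) and text[pos] == g["op"]:
            pos += 1
            if not term():
                return False
        return True

    return expr() and pos == len(text)
```

With the hand-picked grammar `{"op": "+", "open": "<", "close": ">"}`, the string `1+<2+3>` is accepted and `1+` is rejected. Because the terminals are distinct and never digits, one character of lookahead is enough and the checker needs no backtracking.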