Lies, Damned Lies and Proofs: Formal Methods Are Not Slopless (https://www.lesswrong.com)

80 points by OgsyedIE 3 days ago | 42 comments

enum about 10 hours ago |

In other words, the post says: in Lean, Rocq, or any other theorem prover, you get a formally verified proof, but you do NOT get a formally verified theorem statement.

So, even if the proof is correct, you need to determine if the theorem is what you want. Making that determination requires expertise. Since you cannot "run the theorem", you cannot vibe-code your way through it. E.g., there is no equivalent of "web app seems to be working!" You have to actually understand what the theorems are saying in a deep way.
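
Concretely (a minimal Lean 4 sketch, with illustrative names, not from the post): the proof below is machine-checked, but the theorem only asks for sortedness, not that the output is a permutation of the input, so a "sort" that throws its input away still satisfies it.

    inductive Sorted : List Nat → Prop where
      | nil : Sorted []
      | single (x : Nat) : Sorted [x]
      | cons (x y : Nat) (ys : List Nat) :
          x ≤ y → Sorted (y :: ys) → Sorted (x :: y :: ys)

    -- Deliberately wrong "sort": it discards the input entirely.
    def mySort (_ : List Nat) : List Nat := []

    -- The proof is machine-checked, yet the theorem does not say what we want.
    theorem mySort_sorted (xs : List Nat) : Sorted (mySort xs) :=
      Sorted.nil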

Paracompact about 11 hours ago |

> Third, you need to decide how far “down the stack” you want to go. That is to say, the software you want to verify operates over some kind of more complex system, for instance, maybe it’s C code which gets compiled down to X86 and runs on a particular chip, or maybe it’s a controller for a nuclear reactor and part of the system is the actual physical dynamics of the reactor. Do you really want your proof to involve specifying the semantics of the C compiler and the chip, or the way that the temperature and other variables fluctuate in the reactor?

I can appreciate what he's getting at, but my utopian vision for the future is that we won't need to reinvent the wheel like this each time we want verified software! E.g. for high-consequence systems, the hard part of compiler correctness is already handled by the efforts of CompCert, and SystemVerilog assertions for the design guarantees of processors are becoming more commonplace.

Rochus 3 days ago |

Interesting article, thanks. There is indeed a "semantic gap". However, there is also a practical solution: bidirectional LLM translation. You can verify the formal specification by back-translating it to natural language with another LLM session, allowing human review at the intent level rather than requiring expertise in e.g. Event-B syntax (see https://rochuskeller.substack.com/p/why-rust-solves-a-proble...). This addresses the concern about "mis-defining concepts" without requiring the human to be a formal methods expert. The human can review intent and invariants in natural language, not proof obligations. The AI handles the mathematical tedium while the human focuses on domain correctness, which is exactly where human expertise belongs.

crvdgc about 11 hours ago |

Some valid points, but I wish the authors had developed them more.

On the semantic gap between the original software and its representation in the ITP, program extraction (as in Rocq) probably deserves some discussion: the software is written natively in the ITP, and you then have to prove the extraction itself sound. For example, MetaRocq did this for Rocq.

For the "how far down the stack" problem, there are some efforts from https://deepspec.org/, but it's an inherently difficult problem and often gets less love than lab-environment projects.

utopiah about 7 hours ago |

Reminds me of Z notation, good times... Sure, it's nice to have a harness, or a series of tests, but they only ensure that what's done within them is done within them. If it's poorly done, the result is still within it, but it's not magically correct.

lmeyerov about 7 hours ago |

I've had an interesting comparative experience so far using Alloy, which is a 'simpler' formal-methods approach that you can think of as smarter exhaustive fuzzing for your properties of interest.

We have been building GFQL, an open source graph query language that can run at the compute tier separate from whatever existing storage system (Splunk, SQL, cs, ...), making graphs easy for interactive notebooks/dashboards/analytics/pipelines/agents without needing graph DBs, even at billion scale. Part of the trick is that it is a bit unusual inside too: we are making Cypher-style queries run with columnar operations (purely vectorized for both CPU + GPU mode via a multi-target dataframe abstraction layer), but that is taking a few innovations in the interpreter algorithm, so we worry about subtle semantic bugs. Most of our bugs can surface with small (graph, query) combos, so this sounded perfect for Alloy's small-scope hypothesis!

So we are running a natural experiment on which does more bug finding:

- asking Claude code to analyze new code or recently found issues and using that for unit test amplification

- asking Claude Code to use the Alloy model checker to prove out our space of (query, graph) for key methods

- asking Claude code to port a large cypher conformance test suite

So far, Claude Code as a test amplifier is doing so much better than the others that it's our default preference. The real gem is when we make it do a '5 whys' root cause analysis on why some bug got through and then it does test amplification around several categories of potential error.

We found the conformance suite to be 'class imbalanced' by focusing on a couple features we are adding next, so jury still out on that one. And finally... Alloy hasn't really found anything. We suspect we need to change how we use Alloy to be more reflective of the kinds of bugs we find, but 0 was disappointing.

If anyone is into this kind of thing, happy to discuss/collab!

paulajohnson about 11 hours ago |

A formal specification language is a programming language that we don't know how to compile.

If we can use AI to automatically implement a formal spec, then that formal specification language has just become a programming language.
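
In theorem-prover terms (a minimal Lean 4 sketch with illustrative names): the spec is a proposition, an implementation is an executable definition, and "compiling" the spec means producing that definition together with a proof that it satisfies the spec.

    -- The spec: a proposition describing the function we want.
    def Spec (f : Nat → Nat) : Prop := ∀ n, f n = n + n

    -- A candidate implementation (what an AI or a human has to supply).
    def double (n : Nat) : Nat := 2 * n

    -- The missing "compiler" step: a machine-checked proof that the
    -- implementation meets the spec.
    theorem double_meets_spec : Spec double := by
      unfold Spec double
      intro n
      omega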

psuedobrain about 11 hours ago |

Slight nitpick, but isn't Agda based on the MLTT family instead of the CoC family of languages?

pron about 5 hours ago |

A fundamental problem is that program verification is intractable in the computational complexity sense. The question of whether a program satisfies a spec or not is called in the literature the model checking problem (not to be confused with model checkers, which are various algorithms for automatically solving some model checking problems). In the worst case, there is no faster way to determine whether a program satisfies a spec or not than explicitly testing each of its reachable states.
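
For intuition, a toy Lean 4 sketch of that worst case (the transition system and all names are illustrative): enumerate every reachable state and test the property in each one.

    -- Toy transition system over states 0..7.
    def step (s : Nat) : List Nat := [(s + 1) % 8, (s * 2) % 8]

    -- Explicit-state exploration: collect every reachable state.
    partial def explore (frontier seen : List Nat) : List Nat :=
      match frontier with
      | [] => seen
      | s :: rest =>
        if seen.contains s then explore rest seen
        else explore (rest ++ step s) (s :: seen)

    -- Brute-force "model checking": does the property hold in every
    -- reachable state?
    def holdsEverywhere (p : Nat → Bool) (init : Nat) : Bool :=
      (explore [init] []).all p

    #eval holdsEverywhere (fun s => decide (s < 8)) 0  -- true

Symbolic techniques and abstraction let real model checkers avoid some of this enumeration, but the blow-up of the reachable state space is exactly the worst case described above.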

The question is, of course, what portion of programs are much easier than the worst case (to the point of making verification tractable). That is not known, but the results are not necessarily encouraging. Programs that have been deductively verified with proof assistants were not only very small, and not only written with proofs in mind, but were also restricted to simple and less efficient algorithms to make the proofs doable. They tend to require between 10x and 1000x lines of proof per line of code.

(an old blog post of mine links to some important results: https://pron.github.io/posts/correctness-and-complexity)

There is a belief that programs that people write and more-or-less work should be tractably provable, as that's what allowed writing them in the first place (i.e. the authors must have had some vague proof of correctness in mind when writing them). I don't think this argument is true (https://pron.github.io/posts/people-dont-write-programs), and we use formal methods precisely when we want to close the gap between working more-or-less and definitely always working.

But even if verifying some program is tractable, it could still take a long time between iterations. Because it's reasonable that it would take an LLM no less than a month to prove a correct program, there's no point in stopping it before. So a "practical" approach could be: write a program, try proving it for a month, and if you haven't finished in a month, try again. That, of course, could mean waiting, say, six months before deciding whether or not it's likely to ultimately succeed. Nevertheless, I expect there will be cases where writing the program and the proof would have taken a team of humans 800 years, and an LLM could do it in 80.

positron26 about 6 hours ago |

All formal methods came from natural ones. Somewhere in the meta-language stack, without natural methods, any system is stuck within an inescapable well. Bad formalisms are stuck and wrong. Natural methods are bad, but not stuck. They can arrive at new formalisms.

rafaelbeirigo about 9 hours ago |

For the case of Propositional Logic, ChatGPT reflects the current epistemological crisis. When asking for help on a question, it could not properly apply the Law of the Excluded Middle [1].

1. https://chatgpt.com/share/696b7f8a-9760-8006-a1b5-89ffd7c5d2...
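
For reference, this is how excluded middle is stated and applied in a proof assistant (Lean 4, using the core Classical.em; the second example is illustrative):

    -- The law of the excluded middle, as provided by core Lean.
    example (p : Prop) : p ∨ ¬p := Classical.em p

    -- A typical application: case-split on p to derive p from ¬¬p.
    example (p : Prop) (h : ¬¬p) : p := by
      cases Classical.em p with
      | inl hp  => exact hp
      | inr hnp => exact absurd hnp h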