Hacker news

NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute (https://qlabs.sh)

168 points by sdpmas 2 days ago | 46 comments

NooneAtAll3 1 day ago |

I thought "data efficiency" meant the same quality with fewer parameters.

Instead it's more parameters with less training data... but I don't really see any quality control?

andai 1 day ago |

What's the human baseline? How many cats does a human need to see to learn what a cat is, vs an AI?

Maybe not quite a fair comparison since my human brain has been "learning" for half a billion years before I was born.

I wonder if there's an equivalent of that for AI. Evolving the architectures?

pastescreenshot 1 day ago |

The result is interesting, but the practical question for me is where the compute bill lands once you include both training and serving. If a fixed-data regime pushes you toward ensembles plus chain distillation, is the endgame “serve the ensemble”, or do you expect most of the gain can be compressed back into a single deployable model later? That seems like the difference between a neat scaling result and a generally usable recipe.

nsnzjznzbx 2 days ago |

We will get to the point where you can quickly bootstrap, i.e. an LLM can train a better LLM in a loop; leave it and it can really learn. Like learn learn.

"Train yourself to solve this problem see OBJECTIVE.md"

QubridAI 1 day ago |

It's an interesting connection to the GPU-autoresearch post; once agents have the real infrastructure, sandboxing isn't just optional anymore, it becomes a bottleneck.

abeppu 1 day ago |

In their little algorithm box on Chain Distillation, they have at step 2b some expression that involves multiplying and dividing by `T`, and then they say "where α = 0.5, T = 1.0".

I think someone during the copy-editing process told them this needed to look more complicated?
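For what it's worth, multiply-and-divide by `T` is the shape of the standard temperature-scaled distillation loss. A minimal sketch, assuming (since the post's exact step 2b isn't quoted here) it is the classic Hinton-style formulation; `distill_loss` and its arguments are hypothetical names, not the paper's:

```python
import math

def softmax(xs, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(x / T for x in xs)
    exps = [math.exp(x / T - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, label, alpha=0.5, T=1.0):
    # Hypothetical reconstruction of a step-2b-like expression:
    # alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE.
    # The T^2 factor conventionally keeps soft-target gradients
    # comparable in magnitude across temperatures.
    p = softmax(teacher_logits, T)   # teacher soft targets
    q = softmax(student_logits, T)   # student soft predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * T * T
    ce = -math.log(softmax(student_logits)[label])
    return alpha * kl + (1 - alpha) * ce
```

With T = 1.0 every T factor is a no-op, so the whole expression collapses to a plain 50/50 mix of KL and cross-entropy, which would explain the "needed to look more complicated" impression.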

phr4ts 1 day ago |

The brain does optimization during sleep. Is that something LLMs can benefit from?

naasking 1 day ago |

Great project. On the matter of data efficiency and regularization, I'd love to see someone try scaling GrokAlign!

littlestymaar 2 days ago |

> Data efficiency matters because compute grows much faster than data [2] (referencing a paper from 2022)

I'm not convinced this is particularly true in today's world: if you have more compute, you can simply generate more, and higher-quality, artificial data. That's what all labs have been doing since at least 2023.

Also, the post references Chinchilla-optimal training as a comparison baseline, but everyone has moved far beyond Chinchilla scaling: small models are routinely trained on 10-400 times more data (1-40T tokens) than the Chinchilla-optimal number, so the entire industry went in the complete opposite direction of what they are proposing.

That doesn't mean the techniques presented here are useless or anything (I'm not qualified to judge) but you should take the introduction with a grain of salt.
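To put numbers on the "far beyond Chinchilla" point: a quick sanity check, assuming the common rule of thumb of roughly 20 training tokens per parameter (the helper name and the 4T-token example run are illustrative, not from the post):

```python
def chinchilla_optimal_tokens(params, tokens_per_param=20):
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return params * tokens_per_param

# A 1B-parameter model "should" see ~20B tokens under Chinchilla...
optimal = chinchilla_optimal_tokens(1e9)   # 2e10 tokens

# ...but small models today are trained on trillions of tokens,
# e.g. a 4T-token run (within the 1-40T range mentioned above):
actual = 4e12
ratio = actual / optimal                   # 200x over Chinchilla-optimal
```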

yorwba 2 days ago |

Related: Discussion on the initial NanoGPT Slowrun announcement: https://news.ycombinator.com/item?id=47251259 (185 points 15 days ago, 39 comments)
