380 points by tamnd 5 days ago | 156 comments | View on ycombinator
tamnd about 2 hours ago |
6thbit about 13 hours ago |
> Except as expressly authorized by Y Combinator, you agree not to modify, copy, frame, scrape, rent, lease, loan, sell, distribute or create derivative works based on the Site or the Site Content, in whole or in part
Not to pretend this isn't widely happening behind the curtains already, but coming from a "Show HN" seems daring.
xnx about 18 hours ago |
0cf8612b2e1e about 19 hours ago |
deleted and dead are integers. They are stored as 0/1 rather than booleans.
Is there a technical reason to do this? You have the type right there.brtkwr about 17 hours ago |
palmotea about 19 hours ago |
Wouldn't that lose deleted/moderated comments?
gkbrk about 19 hours ago |
alstonite about 18 hours ago |
lhoestq about 13 hours ago |
epogrebnyak about 17 hours ago |
undefined about 10 hours ago |
vovavili about 17 hours ago |
kshacker about 17 hours ago |
imhoguy about 17 hours ago |
maxloh about 16 hours ago |
mlhpdx about 18 hours ago |
> The archive currently spans from 2006-10 to 2026-03-16 23:55 UTC, with 47,358,772 items committed.
That’s more than 5 minutes ago by a day or two. No big deal, but a little bit depressing this is still how we do things in 2026.
jiggawatts about 12 hours ago |
I've been evaluating Gemini Embedding 2 using Hacker News comments and I wasted half a day making a wrapper for the HN API to collect some sample data to play with.
In case anyone is curious:
- The ability to simply truncate the provided embedding to a prefix (and then renormalize) is useful because it lets users re-use the same (paid!) embedding API response for multiple indexes at different qualities.
- Traditional enterprise software vendors are struggling to keep up with the pace of AI development. Microsoft SQL Server for example can't store a 3072 element vector with 32-bit floats (because that would be 12 KB and the page size is only 8 KB). It supports bfloat16 but... the SQL client doesn't! Or Entity Framework. Or anything else.
- Holy cow everything is so slow compared to full text search! The model is deployed in only one US region, so from Australia the turnaround time is something like 900 milliseconds. Then the vector search over just a few thousand entries with DiskANN is another 600-800 ms! I guess search-as-you-type is out of the question for... a while.
- Speaking of slow, the first thing I had to do was write an asynchronous parallel bounded queue data processor utility class in C# that supports chunking of the input and rate limit retries. This feels like it ought to be baked into the standard library or at least the AI SDKs because it's pretty much mandatory if working with anything other than "hello world" scenarios.
- Gemini Embedding 2 has the headline feature of multi-modal input, but they forgot to implement anything other than "string" for their IEmbeddingGenerator abstraction when used with Microsoft libraries. I guess the next "Preview v0.0.3-alpha" version or whatever will include it.
robotswantdata about 17 hours ago |
trwhite about 15 hours ago |
politician about 16 hours ago |
lyu07282 about 17 hours ago |
Imustaskforhelp about 17 hours ago |
Your project actually helps me out a ton in making one of the new project ideas that I had about hackernews that I had put into the back-burner.
I had thought of making a ping website where people can just @Username and a service which can detect it and then send mail to said username if the username has signed up to the service (similar to a service run by someone from HN community which mails you everytime someone responds to your thread directly, but this time in a sort of ping)
[The previous idea came as I tried to ping someone to show them something relevant and thought that wait a minute, something like ping which mails might be interesting and then tried to see if I can use algolia or any service to hook things up but not many/any service made much sense back then sadly so I had the idea in back of my mind but this service sort of solves it by having it being updated every 5 minutes]
Your 5 minute updates really make it possible. I will look what I can do with that in some days but I am seeing some discrepancy in the 5 minute update as last seems to be 16 march in the readme so I would love to know more about if its being updated every 5 minutes because it truly feels phenomenal if true and its exciting to think of some new possibilities unlocked with it.
tonymet about 17 hours ago |
Onavo about 19 hours ago |
GeoAtreides about 19 hours ago |
https://www.ycombinator.com/legal/
Mods, enforce your license terms, you're playing fast and loose with the law (GDPR/CPRA)
bstsb about 19 hours ago |
lokimoon about 17 hours ago |
The main reason I built this was to have HN data that is easy to query and always up to date, without needing to run your own pipeline first. There are also some interesting ideas in the pipeline, like what I call "auto-heal". Happy to share more if anyone is interested :)
A lot of the choices are trade-offs, as usual with data pipelines. I chose Parquet because it is columnar and compressed, so tools like DuckDB or Polars can read only the columns they need. This matters a lot as the dataset grows. I went with Hugging Face mainly because it is simple and already handles distribution and versioning. I can just push data as commits and get a built-in history without managing extra infrastructure (and, more conveniently, if you read the README, you can query it directly using Python or DuckDB).
The pipeline is incremental. Instead of rebuilding everything, it appends small batches every few minutes using the API. That keeps it fresh while staying cheap to run. The data is also partitioned by time, so queries do not need to scan the entire dataset (and I use very simple tech, just a Go binary running in a "screen" session, using only a few MB of RAM for the whole pipeline).