Hacker news

Show HN: Streambed – Stream Postgres to Iceberg on S3, Supports Postgres Wire (https://github.com)

129 points by vira28 6 days ago | 40 comments

vira28 5 days ago |

Author here. For context, I was the tech lead for the Postgres team at Cloudflare, and this came directly out of a challenge I kept hitting there: BI and dashboard teams needed to run long-running analytical queries, and the answer was always to spin up another bespoke read replica or stand up an ETL dump into an analytical database and query that.

So the question I started with was: what's the fewest components I could get away with? That led to the architecture here — Streambed connects to Postgres as a logical replication subscriber (same mechanism as a read replica) and streams WAL changes straight into Apache Iceberg on S3, queryable from psql via an embedded DuckDB. There are a lot of edge cases to handle, and it's very much early days.

Welcome any feedback.
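As an illustration of the flow described above, here is a minimal sketch of how a logical replication subscriber might fold an ordered stream of change events into table state before writing it out. It is not Streambed's code: the `Change` type, the `apply_changes` function, and the integer `lsn` field are hypothetical names chosen for the example, and a plain dictionary stands in for the Iceberg table.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Change:
    """One decoded change event (hypothetical shape, for illustration only)."""
    lsn: int                    # position in the WAL; used to order events
    op: str                     # "insert", "update", or "delete"
    key: int                    # primary key of the affected row
    row: Optional[dict] = None  # full new row for insert/update, None for delete


def apply_changes(snapshot: Dict[int, dict], changes: Iterable[Change]) -> Dict[int, dict]:
    """Return a new snapshot with the changes applied in LSN order."""
    result = dict(snapshot)  # copy so the caller's snapshot is left untouched
    for change in sorted(changes, key=lambda c: c.lsn):
        if change.op in ("insert", "update"):
            result[change.key] = change.row
        elif change.op == "delete":
            result.pop(change.key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown op: {change.op}")
    return result
```

A real pipeline would write data and delete files rather than rebuild a dictionary, but the ordering concern is the same: changes have to be applied in WAL order, or a late-arriving delete can resurrect or drop the wrong row.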

buremba 5 days ago |

Looks interesting! It reminds me of pg_lake, which we evaluated for our startup https://lobu.ai but it's missing a lot of pushdown capabilities which made OLAP queries expensive.

I also tried DuckLake, but that required us to move away from a PG-first approach. I was thinking of using Debezium to create Iceberg tables on S3 for our append-only PG tables and querying them with DuckDB. I will try Streambed out as well!

cpard 5 days ago |

Replicating the Postgres WAL to S3 and Iceberg reliably is a hard problem but it’s not accurate to say that no ETL is needed here.

Maybe you can call it more of an ELT pattern, but anyone interested in using this for realistic analytics will have to transform the data at some point.

If an org is early enough to think that it can adopt a solution like this, jump into DuckDB, and start spitting out reports, it is in for a really bad experience.

Please educate people to do the right thing and to realize the scope of the work they are facing. It might feel like it hurts your growth in the short term, but it will benefit you greatly in the mid-to-long term as a vendor.

karakanb 5 days ago |

Hi, this looks interesting, thanks for sharing. I am the builder of ingestr (https://github.com/bruin-data/ingestr), so I am very much in the same space.

I really like that you did this in Go, and I'll definitely dig a bit more into the source code to see how you tackled the CDC stuff, given that there are not many reliable CDC libraries in Go, and there are quite a few gotchas when it comes to doing CDC right. We also hand-rolled ours in ingestr, or I must say clanker-rolled, and we got quite a few things wrong in the first place.

Curious about the Postgres-compatible query option: what's the use case you have in mind there? My perception is that any org that would use Iceberg also has one or a few query engines in place. Is this more for debugging?

Quite cool stuff, keep it up!

viveknathani_ 5 days ago |

Interesting approach. I was exploring a Postgres-to-ClickHouse CDC setup while helping a team some time back. This seems better, as it separates the compute (query server) and storage (S3) layers, which lets us be creative about cost reductions.

ryanshrott 4 days ago |

We ran into issues with CDC when tables had a lot of TOAST columns. The WAL records don't include the full values unless you set REPLICA IDENTITY FULL. Does Streambed handle that, or do you need the extra config?
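For readers unfamiliar with the issue: in the pgoutput protocol, a TOASTed column that an UPDATE did not modify is sent with an "unchanged" marker instead of its value, so the consumer has to recover the value from somewhere else. Below is a minimal sketch of one way a consumer could do that when it has the previous row at hand (for example from its own state, or from the old tuple when REPLICA IDENTITY FULL is set). The `UNCHANGED_TOAST` sentinel and the `merge_update` function are hypothetical names for this example and say nothing about how Streambed handles it.

```python
from typing import Optional

# Hypothetical sentinel standing in for pgoutput's "unchanged TOASTed value" marker.
UNCHANGED_TOAST = object()


def merge_update(previous_row: Optional[dict], new_tuple: dict) -> dict:
    """Fill unchanged TOAST columns in an UPDATE's new tuple from the previous row."""
    merged = {}
    for column, value in new_tuple.items():
        if value is UNCHANGED_TOAST:
            # The value was not sent, so it must come from the prior row state.
            if previous_row is None or column not in previous_row:
                raise KeyError(f"no previous value available for column {column!r}")
            merged[column] = previous_row[column]
        else:
            merged[column] = value
    return merged
```

Raising when no previous value is available is a deliberate choice in this sketch: silently writing a NULL would corrupt the downstream copy, which is the failure mode the question above is about.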

nightfly 5 days ago |

vira28: It looks like nearly all of your responses to comments/questions here are flagged/dead. Probably because they all look AI written. Are you actually responding or do you have an agent answering questions for you?

nitinram 5 days ago |

This is a nice project! We do some exporting of data from Postgres to S3, and it's a little flaky but does the job for now. This feels like a good project to explore.

chrislusf 5 days ago |

If fewer components are desired, use SeaweedFS, which supports S3 table buckets as well as Iceberg catalog and maintenance. It basically stores the Iceberg table data and metadata.

oa335 5 days ago |

Nice work! We have hand-rolled something similar at work.

Do you have any perf metrics (throughput, end-to-end latency, etc.)?

ApiFB-Dev 5 days ago |

hmm wow very interesting idea!
