288 points by tosh 1 day ago | 109 comments
mrtimo 1 day ago |
steve_adams_86 1 day ago |
I work with scientists who research BC's coastal environment, from airborne observation of glaciers to autonomous drones in the deep sea. We've got heaps of data.
A while back I took a leap of faith with DuckDB as the data-processing engine for a new tool we're using to transform and validate biodiversity data. The goal is to take heaps of existing datasets and convert them to valid Darwin Core data. Keyword being valid.
DuckDB is such an incredible tool in this context. Essentially I dynamically build duckdb tables from schemas describing the data, then import it into the tables. If it fails, it explains why on a row-by-row basis (as far as it's able to, at least). Once the raw data is in, transformations can occur. This is accomplished entirely in DuckDB as well. Finally, validations are performed using application-layer logic if the transformation alone isn't assurance enough.
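A minimal sketch of that general pattern (not the actual tool; the table, column, and file names are invented, and the reject-table route is just one way DuckDB can surface row-level import errors):

    import duckdb

    # Hypothetical schema description; the real thing would be derived from Darwin Core terms.
    schema = {
        "occurrence": {
            "occurrenceID": "VARCHAR",
            "eventDate": "DATE",
            "decimalLatitude": "DOUBLE",
            "decimalLongitude": "DOUBLE",
        }
    }

    con = duckdb.connect("biodiversity.duckdb")

    # Build tables dynamically from the schema description.
    for table, columns in schema.items():
        cols = ", ".join(f'"{name}" {ctype}' for name, ctype in columns.items())
        con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')

    # Import raw data; rows that fail to parse or cast are recorded instead of aborting the load.
    col_spec = ", ".join(f"'{n}': '{t}'" for n, t in schema["occurrence"].items())
    con.execute(f"""
        INSERT INTO occurrence
        SELECT * FROM read_csv(
            'raw_occurrences.csv',
            header = true,
            columns = {{{col_spec}}},
            store_rejects = true
        )
    """)

    # DuckDB's reject table explains each failed row (line, column, error message).
    print(con.sql("SELECT * FROM reject_errors").fetchall())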
I've managed to build an application that's way faster, way more capable, and much easier to build than I expected. And it's portable! I think I can get the entire core running in a browser. Field researchers could run this on an iPad in a browser, offline!
This is incredible to me. I've had so much fun learning to use DuckDB better. It's probably my favourite discovery in a couple of years.
And yeah, this totally could have been done any number of different ways. I had prototypes which took much different routes. But the cool part here is I can trust DuckDB to do a ton of heavy lifting. It comes with the cost of some things happening in SQL that I'd prefer it didn't sometimes, but I'm content with that tradeoff. In cases where I'm missing application-layer type safety, I use parsing and tests to ensure my DB abstractions are doing what I expect. It works really well!
edit: For anyone curious, the point of this project is to allow scientists to analyze biodiversity and genomic data more easily using common rather than bespoke tools, as well as publish it to public repositories. Publishing is a major pain point because people in the field typically work very far from the Darwin Core spec :) I'm very excited to polish it a bit and get it in the hands of other organizations.
uwemaurer 1 day ago |
To get fast access to the query results we use the Apache Arrow interface and generate the code directly from DuckDB SQL queries using the SQG tool ( https://sqg.dev/generators/java-duckdb-arrow/)
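For reference, the un-generated route (pulling results straight into Arrow via DuckDB's Python client) looks roughly like this; the query runs against DuckDB's public demo Parquet file:

    import duckdb

    con = duckdb.connect()

    # Fetch the result set as an Arrow table instead of Python tuples.
    tbl = con.sql("""
        SELECT station_name, count(*) AS num_services
        FROM 'http://blobs.duckdb.org/train_services.parquet'
        GROUP BY ALL
        ORDER BY num_services DESC
        LIMIT 3
    """).arrow()

    print(tbl.schema)
    print(tbl.column("num_services").to_pylist())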
owlstuffing 1 day ago |
> Writing SQL code
Language integration is paramount for medium/large projects. There's an experimental Java language project, manifold-sql [1], that does the impossible: inline native DuckDB SQL + type safety.
"""
[.sql/] SELECT station_name, count(*) AS num_services
FROM 'http://blobs.duckdb.org/train_services.parquet'
WHERE monthname(date) = 'May'
GROUP BY ALL
ORDER BY num_services DESC
LIMIT 3
"""
.fetch()
.forEach(row -> out.println(row.stationName + ": " + row.numServices));
1. https://github.com/manifold-systems/manifold/blob/master/doc...
noo_u 1 day ago |
"We're moving towards a simpler world where most tabular data can be processed on a single large machine1 and the era of clusters is coming to an end for all but the largest datasets."
This claim is very debatable. Depending on how you want to pivot/scale/augment your data, even datasets that seemingly "fit" on large boxes will quickly OOM you.
The author also has another article where they claim that:
"SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)
This does not map to my experience at all, outside of the realm of nicely parsed datasets that don't require too much complicated analysis or augmentation.
lz400 1 day ago |
Please sell DuckDB to me. I don't know it very well, but my (possibly wrong) intuition is that even given equal performance, it's going to drop me into the awkwardness of SQL for data processing.
film42 1 day ago |
I'm still waiting for Excel to load the file.
majkinetor 1 day ago |
Doing that in postgres takes some time, and even simple count(*) takes a lot of time (with all columns indexed)
willtemperley about 23 hours ago |
As such, it's not readily usable as a library, or set of libraries. I really prefer Apache's approach to analytics, where it's possible to pick and choose the parts you need and integrate them with standard package managers.
Need GB/s arrays over HTTP? Use Arrow Flight. Want to share self-describing structured arrays via files? Use Arrow IPC. Need to read Parquet? Add that package trait.
Another potential issue with DuckDB is the typing at the SQL interface.
Arrow allows direct access to primitive arrays, but DuckDB uses a slightly different type system at the SQL interface. Even small differences in type systems can lead to combinatoric type explosion. This is more a criticism of SQL interfaces than of DuckDB, however.
Additionally Arrow has native libraries in most mainstream languages.
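A small illustration of that seam (Python for brevity; which Arrow types DuckDB hands back for a given SQL type depends on version and settings):

    import duckdb
    import pyarrow as pa

    # Pure Arrow: you pick the primitive type and can reach the raw buffers directly.
    arr = pa.array([1, 2, 3], type=pa.int32())
    print(arr.type, len(arr.buffers()))  # int32, validity + data buffers

    # Through DuckDB's SQL interface the values pass through DuckDB's own type
    # system first and are mapped to Arrow types on the way out.
    tbl = duckdb.sql("SELECT 1 AS a, 2.5 AS b, DATE '2024-01-01' AS c").arrow()
    print(tbl.schema)  # whatever the engine chose for each column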
DangitBobby 1 day ago |
efromvt 1 day ago |
The web/WASM integration is also fabulous. Looking forward to more "small engines" getting into that space to provide some competition and keep pushing it forward.
netcraft 1 day ago |
smithclay 1 day ago |
Think this opens up a lot of interesting possibilities, like more powerful analytics notebooks such as marimo (https://marimo.io/) … and that’s just one example of many.
tjchear 1 day ago |
yakkomajuri 1 day ago |
But coincidentally today I was exploring memory usage and I believe I'm finding memory leaks. Anybody have similar experiences?
Still debugging more deeply but looking reasonably conclusive atm.
nylonstrung 1 day ago |
If you want its power as a query engine but prefer to write Python instead of SQL, I highly recommend using it as a backend for the Ibis dataframe library.
It lets you write pythonic dataframe syntax (like Pandas and Polars) that 'compiles' down to SQL in the DuckDB dialect.
And you can run those queries interchangeably against postgres, sqlite, polars, spark, etc.
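A rough sketch of what that looks like (the file and column names are invented; the same expression could be pointed at a different backend):

    import ibis

    con = ibis.duckdb.connect()                      # in-memory DuckDB
    services = con.read_parquet("services.parquet")  # hypothetical file

    expr = (
        services.group_by("station_name")
        .aggregate(num_services=services.count())
        .order_by(ibis.desc("num_services"))
        .limit(3)
    )

    print(ibis.to_sql(expr))  # the DuckDB-dialect SQL it compiles to
    print(expr.to_pandas())   # executed by DuckDB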
oulu2006 1 day ago |
I was thinking of using Citus for this, but using DuckDB is possibly a better way to do it. Citus comes with a lot more out of the box, but DuckDB could be a good stepping stone.
microflash about 15 hours ago |
biophysboy 1 day ago |
s-a-p 1 day ago |
Related question, curious as to your experience with DuckLake if you've used it. I'm currently setting up s3 + Iceberg + duckDB for my company (startup) and was wondering what to pick between Iceberg and DuckLake.
davidtm 1 day ago |
PLenz 1 day ago |
countrymile 1 day ago |
wswin 1 day ago |
tobilg 1 day ago |
red2awn 1 day ago |
falconroar about 21 hours ago |
n_u 1 day ago |
fsdfasdsfadfasd 1 day ago |
rustyconover 1 day ago |
vivzkestrel 1 day ago |
- what is wrong with postgresql for doing this?
clumsysmurf 1 day ago |
-- Support for .parquet, .json, .csv (note: Spotify listening history comes in a multiple .json files, something fun to play with).
-- Support for glob reading, like: select * from 'tsa20*.csv' - so you can read hundreds of files (any type of file!) as if they were one file.
-- if the files don't have the same schema, union_by_name is amazing (quick sketch after this list).
-- The .csv parser is amazing. Auto assigns types well.
-- It's small! The Web Assembly version is 2mb! The CLI is 16mb.
-- Because it is small you can add DuckDB directly to your product, like Malloy has done: https://www.malloydata.dev/ - I think of Malloy as a technical person's alternative to PowerBI and Tableau, but it uses a semantic model that helps AI write amazing queries on your data. Edit: Malloy makes SQL 10x easier to write because of its semantic nature. Malloy transpiles to SQL, like TypeScript transpiles to JavaScript.
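For the glob + union_by_name point above, a quick sketch (file pattern borrowed from the comment; the files' schemas can differ):

    import duckdb

    # Read hundreds of files as one table; columns are matched by name,
    # and files missing a column get NULLs for it.
    rows = duckdb.sql("""
        SELECT * FROM read_csv('tsa20*.csv', union_by_name = true)
    """).fetchall()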