288 points by tosh 1 day ago | 109 comments
mrtimo 1 day ago |
steve_adams_86 1 day ago |
I work with scientists who research BC's coastal environment, from airborne observation of glaciers to autonomous drones in the deep sea. We've got heaps of data.
A while back I took a leap of faith with DuckDB as the data-processing engine for a new tool we're using to transform and validate biodiversity data. The goal is to take heaps of existing datasets and convert them to valid Darwin Core data. Keyword being valid.
DuckDB is such an incredible tool in this context. Essentially I dynamically build duckdb tables from schemas describing the data, then import it into the tables. If it fails, it explains why on a row-by-row basis (as far as it's able to, at least). Once the raw data is in, transformations can occur. This is accomplished entirely in DuckDB as well. Finally, validations are performed using application-layer logic if the transformation alone isn't assurance enough.
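A minimal sketch of that general pattern (not the actual tool; the table, column, and file names are invented, and the reject-table route is just one way DuckDB can surface row-level import errors):

    import duckdb

    # Hypothetical schema description; the real thing would be derived from Darwin Core terms.
    schema = {
        "occurrence": {
            "occurrenceID": "VARCHAR",
            "eventDate": "DATE",
            "decimalLatitude": "DOUBLE",
            "decimalLongitude": "DOUBLE",
        }
    }

    con = duckdb.connect("biodiversity.duckdb")

    # Build tables dynamically from the schema description.
    for table, columns in schema.items():
        cols = ", ".join(f'"{name}" {ctype}' for name, ctype in columns.items())
        con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')

    # Import raw data; rows that fail to parse or cast are recorded instead of aborting the load.
    col_spec = ", ".join(f"'{n}': '{t}'" for n, t in schema["occurrence"].items())
    con.execute(f"""
        INSERT INTO occurrence
        SELECT * FROM read_csv(
            'raw_occurrences.csv',
            header = true,
            columns = {{{col_spec}}},
            store_rejects = true
        )
    """)

    # DuckDB's reject table explains each failed row (line, column, error message).
    print(con.sql("SELECT * FROM reject_errors").fetchall())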
I've managed to build an application that's way faster, way more capable, and much easier to build than I expected. And it's portable! I think I can get the entire core running in a browser. Field researchers could run this on an iPad in a browser, offline!
This is incredible to me. I've had so much fun learning to use DuckDB better. It's probably my favourite discovery in a couple of years.
And yeah, this totally could have been done any number of different ways. I had prototypes which took much different routes. But the cool part here is I can trust DuckDB to do a ton of heavy lifting. It comes with the cost of some things happening in SQL that I'd prefer it didn't sometimes, but I'm content with that tradeoff. In cases where I'm missing application-layer type safety, I use parsing and tests to ensure my DB abstractions are doing what I expect. It works really well!
edit: For anyone curious, the point of this project is to allow scientists to analyze biodiversity and genomic data more easily using common rather than bespoke tools, as well as publish it to public repositories. Publishing is a major pain point because people in the field typically work very far from the Darwin Core spec :) I'm very excited to polish it a bit and get it in the hands of other organizations.
uwemaurer 1 day ago |
To get fast access to the query results we use the Apache Arrow interface and generate the code directly from DuckDB SQL queries using the SQG tool ( https://sqg.dev/generators/java-duckdb-arrow/)
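For reference, the un-generated route (pulling results straight into Arrow via DuckDB's Python client) looks roughly like this; the query runs against DuckDB's public demo Parquet file:

    import duckdb

    con = duckdb.connect()

    # Fetch the result set as an Arrow table instead of Python tuples.
    tbl = con.sql("""
        SELECT station_name, count(*) AS num_services
        FROM 'http://blobs.duckdb.org/train_services.parquet'
        GROUP BY ALL
        ORDER BY num_services DESC
        LIMIT 3
    """).arrow()

    print(tbl.schema)
    print(tbl.column("num_services").to_pylist())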
owlstuffing 1 day ago |
> Writing SQL code
Language integration is paramount for medium/large projects. There's an experimental Java language project, manifold-sql [1], that does the impossible: inline native DuckDB SQL + type safety.
"""
[.sql/] SELECT station_name, count(*) AS num_services
FROM 'http://blobs.duckdb.org/train_services.parquet'
WHERE monthname(date) = 'May'
GROUP BY ALL
ORDER BY num_services DESC
LIMIT 3
"""
.fetch()
.forEach(row -> out.println(row.stationName + ": " + row.numServices));
1. https://github.com/manifold-systems/manifold/blob/master/doc...
noo_u 1 day ago |
"We're moving towards a simpler world where most tabular data can be processed on a single large machine1 and the era of clusters is coming to an end for all but the largest datasets."
This claim is very debatable. Depending on how you want to pivot/scale/augment your data, even datasets that seemingly "fit" on large boxes will quickly OOM you.
The author also has another article where they claim that:
"SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)
This does not map to my experience at all, outside of the realm of nicely parsed datasets that don't require too much complicated analysis or augmentation.
lz400 1 day ago |
Please sell DuckDB to me. I don't know it very well, but my (possibly wrong) intuition is that even given equal performance, it's going to drop me into the awkwardness of SQL for data processing.
film42 1 day ago |
I'm still waiting for Excel to load the file.
majkinetor 1 day ago |
Doing that in postgres takes some time, and even simple count(*) takes a lot of time (with all columns indexed)
willtemperley about 23 hours ago |
As such, it's not readily usable as a library, or set of libraries. I really prefer Apache's approach to analytics, where it's possible to pick and choose the parts you need and integrate them with standard package managers.
Need GB/s arrays over HTTP? Use Arrow Flight. Want to share self-describing structured arrays via files? Use Arrow IPC. Need to read Parquet? Add that package trait.
Another potential issue with DuckDB is the typing at the SQL interface.
Arrow allows direct access to primitive arrays, but DuckDB uses a slightly different type system at the SQL interface. Even small differences in type systems can lead to combinatoric type explosion. This is more a criticism of SQL interfaces than of DuckDB, however.
Additionally Arrow has native libraries in most mainstream languages.
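A small illustration of that seam (Python for brevity; which Arrow types DuckDB hands back for a given SQL type depends on version and settings):

    import duckdb
    import pyarrow as pa

    # Pure Arrow: you pick the primitive type and can reach the raw buffers directly.
    arr = pa.array([1, 2, 3], type=pa.int32())
    print(arr.type, len(arr.buffers()))  # int32, validity + data buffers

    # Through DuckDB's SQL interface the values pass through DuckDB's own type
    # system first and are mapped to Arrow types on the way out.
    tbl = duckdb.sql("SELECT 1 AS a, 2.5 AS b, DATE '2024-01-01' AS c").arrow()
    print(tbl.schema)  # whatever the engine chose for each column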
DangitBobby 1 day ago |
efromvt 1 day ago |
The web/WASM integration is also fabulous. Looking forward to more "small engines" getting into that space to provide some competition and keep pushing it forward.
netcraft 1 day ago |
smithclay 1 day ago |
Think this opens up a lot of interesting possibilities, like more powerful analytics notebooks such as marimo (https://marimo.io/) … and that’s just one example of many.
tjchear 1 day ago |
yakkomajuri 1 day ago |
But coincidentally today I was exploring memory usage and I believe I'm finding memory leaks. Anybody have similar experiences?
Still debugging more deeply but looking reasonably conclusive atm.
nylonstrung 1 day ago |
If you want its power as a query engine but prefer to write Python instead of SQL, I highly recommend using it as a backend for the Ibis dataframe library.
It lets you write pythonic dataframe syntax (like Pandas and Polars) that 'compiles' down to SQL in the DuckDB dialect.
And you can run those queries interchangeably against postgres, sqlite, polars, spark, etc.
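A rough sketch of what that looks like (the file and column names are invented; the same expression could be pointed at a different backend):

    import ibis

    con = ibis.duckdb.connect()                      # in-memory DuckDB
    services = con.read_parquet("services.parquet")  # hypothetical file

    expr = (
        services.group_by("station_name")
        .aggregate(num_services=services.count())
        .order_by(ibis.desc("num_services"))
        .limit(3)
    )

    print(ibis.to_sql(expr))  # the DuckDB-dialect SQL it compiles to
    print(expr.to_pandas())   # executed by DuckDB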
oulu2006 1 day ago |
I was thinking of using Citus for this, but using DuckDB is possibly a better way to do it. Citus comes with a lot more out of the box, but DuckDB could be a good stepping stone.
microflash about 15 hours ago |
biophysboy 1 day ago |
s-a-p 1 day ago |
Related question, curious as to your experience with DuckLake if you've used it. I'm currently setting up s3 + Iceberg + duckDB for my company (startup) and was wondering what to pick between Iceberg and DuckLake.
davidtm 1 day ago |
PLenz 1 day ago |
countrymile 1 day ago |
wswin 1 day ago |
tobilg 1 day ago |
red2awn 1 day ago |
falconroar about 21 hours ago |
n_u 1 day ago |
fsdfasdsfadfasd 1 day ago |
rustyconover 1 day ago |
vivzkestrel 1 day ago |
- what is wrong with postgresql for doing this?
clumsysmurf 1 day ago |
-- Support for .parquet, .json, .csv (note: Spotify listening history comes in a multiple .json files, something fun to play with).
-- Support for glob reading, like: select * from 'tsa20*.csv' - so you can read hundreds of files (any type of file!) as if they were one file.
-- if the files don't have the same schema, union_by_name is amazing (quick sketch after this list).
-- The .csv parser is amazing. Auto assigns types well.
-- It's small! The Web Assembly version is 2mb! The CLI is 16mb.
-- Because it is small you can add DuckDB directly to your product, like Malloy has done: https://www.malloydata.dev/ - I think of Malloy as a technical person's alternative to PowerBI and Tableau, but it uses a semantic model that helps AI write amazing queries on your data. Edit: Malloy makes SQL 10x easier to write because of its semantic nature. Malloy transpiles to SQL, like TypeScript transpiles to JavaScript.
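For the glob + union_by_name point above, a quick sketch (file pattern borrowed from the comment; the files' schemas can differ):

    import duckdb

    # Read hundreds of files as one table; columns are matched by name,
    # and files missing a column get NULLs for it.
    rows = duckdb.sql("""
        SELECT * FROM read_csv('tsa20*.csv', union_by_name = true)
    """).fetchall()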