Buckets of Parquet Files are Awful
We have to stop with the "buckets of parquet files" (BOPF). Everything about it sucks.
Nowadays the most common way of sharing data seems to be "dump a terabyte of parquet in an S3 bucket and call it a day."
I understand why: all of the big data tools know how to talk to S3 and dump data in parquet format, so it's pretty easy for a programmer to set up. "Customers can ingest that data however they want!"
I am here to say that this sucks for anyone who doesn't have a team dedicated to writing scripts to babysit file transfers. Please stop.
Here is why it sucks:
- Postgres can't read from S3, load from Parquet, or query parquet files. "Here's a data dump, now go get a new database to read it." Come on.
- Every guide to parquet uses read_parquet(), which loads the entire file into memory and is completely bonkers. Seriously, google "how to read parquet in python" and this is the first hit. (There's a sketch of a saner chunked read after this list.)
- Even if you use DuckDB to read parquet directly from S3, it is slow! Especially if you want to do something that is not specifically SUM() of a single column with filters. (Most of the time your data dump isn't partitioned anyway, so you still have to read each file.)
- Nobody knows what parquet is, or how to use it, or what a row group is, or how to use a library to efficiently parse it.
- Even if you're using ClickHouse, importing data in bulk sucks. How many times have you been hit with an out-of-memory error trying to import a large file? Good luck figuring out which config to update to make it work.
- No libraries let you control the amount of disk or memory you want to use. At the end of the day, disk space and memory are the things you can actually measure, and there's just no way to set them as parameters and have the algorithm do the "right thing." (Will DuckDB ever respect the MEMORY_LIMIT setting?)
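To make the read_parquet() complaint concrete, here's a minimal sketch of the difference using pyarrow (the file path is made up). pandas' read_parquet() materializes the entire table in memory; pyarrow's ParquetFile.iter_batches() streams it in bounded chunks:

```python
import pandas as pd
import pyarrow.parquet as pq

PATH = "dump/part-00000.parquet"  # hypothetical file name

# The pattern every tutorial shows: the whole file becomes one DataFrame in RAM.
df = pd.read_parquet(PATH)

# What you usually want instead: stream the file a few thousand rows at a time
# so memory stays bounded no matter how big the dump is.
total_rows = 0
pf = pq.ParquetFile(PATH)
for batch in pf.iter_batches(batch_size=10_000):
    chunk = batch.to_pandas()   # small DataFrame, ~10k rows
    total_rows += len(chunk)    # stand-in for whatever work you do per chunk
print(total_rows)
```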
Here is what I end up doing:
- Download the file from S3
- Read the parquet file in small chunks (1,000 rows at a time)
- For each chunk, convert the data to CSV since all databases know how to import CSV
- Load chunks, one at a time, into the database
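In code, the whole dance looks roughly like this. It's a sketch, not production code: the bucket, key, table name, and connection string are invented, it assumes the target Postgres table already exists with matching columns, and it leans on boto3, pyarrow, and psycopg2.

```python
import io

import boto3
import psycopg2
import pyarrow.parquet as pq

# All names below (bucket, key, table, DSN) are made up for the sketch.
BUCKET, KEY = "vendor-data-dump", "events/part-00000.parquet"
LOCAL = "part-00000.parquet"
DSN = "postgresql://localhost/mydb"
TABLE = "events"

# 1. Pull the file down: local reads beat ranged S3 reads.
boto3.client("s3").download_file(BUCKET, KEY, LOCAL)

# 2-4. Stream small batches, rewrite each one as CSV, COPY it into Postgres.
conn = psycopg2.connect(DSN)
with conn, conn.cursor() as cur:
    pf = pq.ParquetFile(LOCAL)
    for batch in pf.iter_batches(batch_size=1_000):
        buf = io.StringIO()
        batch.to_pandas().to_csv(buf, index=False, header=False)
        buf.seek(0)
        cur.copy_expert(f"COPY {TABLE} FROM STDIN WITH (FORMAT csv)", buf)
```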
Sometimes my database knows how to speak Parquet but I still need to process it in chunks. Sometimes I can coax the DB into a larger import (there's a DuckDB sketch of that after the list below). There are a hundred optimizations that can be made depending on the tech, but the gist of the solution is the same:
- Work with files locally rather than slow S3 operations
- Divide the data into very small chunks
- Rewrite data in a format my DB can understand
- Slowly load and babysit everything
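When the database does speak Parquet (DuckDB, for example), the detour through CSV goes away, but the babysitting doesn't entirely. A minimal sketch, with invented file paths and table name; the memory cap is the knob you end up fiddling with:

```python
import duckdb

con = duckdb.connect("local.duckdb")
con.execute("SET memory_limit = '4GB'")  # cap it and hope it's respected

# DuckDB reads local Parquet natively, glob and all -- no CSV detour needed.
con.execute("""
    CREATE TABLE events AS
    SELECT * FROM read_parquet('dump/*.parquet')
""")
```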
All of it sucks. This is why I'm building Scratch Data.
Why I am building Scratch Data
I am a builder.
I do not care about data pipelines, airflow, spark, row groups, arrow, kafka, push-down filtering, or any of these other distractions from building.
I want to write software that does stuff, and I want to be able to handle large amounts of data and have the tools just do the right thing.
Today, data processing is treated as an essential complexity of wrangling data, and it doesn't need to be that way.
All of the other solutions are too fucking complicated and just don't work that well. (Listening to the postgres wire protocol for replication? Way too complicated and too much setup. Tools that integrate with 1,000 different products poorly? Come on.)
Let's work together.
If you are dropping buckets of parquet files onto your customers, we're building better tools to help them do what they really want, which is to ingest, query, analyze, and build on top of that data. Let's chat!
If you have had a ton of data dropped on you (like a billion dollars worth of pennies), I will personally help you figure out how to turn it into something usable. Drop me a line: jay@scratchdata.com.