rohankhameshra's comments | Hacker News

Hi HN,

OLake just added Kafka as a source, allowing data from Kafka topics to be written directly into Apache Iceberg tables (open source, no proprietary format).

Why this was added:

Many teams today land Kafka data into warehouses or custom storage layers, then later rewrite it into Iceberg for analytics or AI workloads.

That adds latency, cost, and operational complexity.

With OLake:

- Kafka -> Iceberg is a single step (the hand-rolled consume loop this replaces is sketched after this list)

- Tables are standard Iceberg (queryable by Spark, Trino, Presto, Athena, etc.)

- Supports schema evolution and high-throughput ingestion
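
For a sense of what that single step replaces, here is a minimal sketch of a hand-rolled Kafka consume loop, written in Go with the segmentio/kafka-go client. The broker address, topic, and group ID are made up, and this only shows the read side of the usual two-step pipeline; it is not OLake code:

    package main

    import (
        "context"
        "log"

        "github.com/segmentio/kafka-go"
    )

    func main() {
        ctx := context.Background()

        // Hypothetical broker and topic names, for illustration only.
        r := kafka.NewReader(kafka.ReaderConfig{
            Brokers: []string{"localhost:9092"},
            Topic:   "events",
            GroupID: "iceberg-ingest-demo",
        })
        defer r.Close()

        for {
            msg, err := r.ReadMessage(ctx)
            if err != nil {
                log.Fatalf("kafka read: %v", err)
            }
            // In a hand-rolled pipeline, each record would be parsed,
            // batched, landed in a warehouse or staging store, and later
            // rewritten into Iceberg. A native Kafka source collapses
            // that into a single write into Iceberg tables.
            _ = msg.Value
        }
    }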

This is early and we’re actively looking for feedback from teams running Kafka at scale or experimenting with Iceberg-based lakehouses.

Happy to answer questions or discuss trade-offs.


As an open-source contributor, I agree with most of the points mentioned.


Hi everyone, I’m one of the founders of OLake. We’ve been working on a high-throughput, open-source ingestion path for Apache Iceberg, and I wanted to share the latest benchmark results and the architectural changes behind them. Here are the key numbers from the benchmark run:

- On a 4.01 billion-row dataset, OLake sustained around 319,562 rows/sec (full-load) from Postgres into Iceberg.

- The next-best ingestion tool we tested on the same dataset managed about 46,000 rows/sec, making OLake roughly 6.8× faster for full loads.

- For CDC workloads, OLake ingested change batches at around 41,390 rows/sec, compared to ~26,900 rows/sec for the closest alternative.

- Average memory usage was about 44 GB, peaking at ~59 GB on a 64-vCPU / 128 GB RAM VM.

- Parquet file output stabilized at ~300–400 MB per file (after compression), improving performance downstream and avoiding “small file” fragmentation.

How we got these improvements:

1. Rewrote the writer architecture: Parsing, schema evolution, buffering, and batch management now happen in Go. Only the final Parquet and Iceberg write path uses Java. This cut out a large amount of serialization overhead and JVM churn.

2. Introduced a new batching and buffering model: Instead of producing many small Parquet files, we buffer data in memory per thread and commit large chunks (roughly 4 GB before compression). This keeps throughput high and files uniform (a toy sketch of this model follows after this list).

3. Optimized Iceberg metadata operations: Commits remain atomic even with large batches, and schema evolution happens fully in Go before any write, reducing cross-system coordination.

4. Improved operational stability: CPU, memory, and disk behaviour remained predictable even at multi-billion-row scales.
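
To make point 2 a bit more concrete, here is a toy Go sketch of that per-thread buffer-and-flush model. None of this is OLake's actual writer code: the chunkWriter interface, the buffer type, and the 1 MiB demo threshold are invented for illustration (the real chunks are closer to 4 GB before compression):

    package main

    import "fmt"

    // chunkWriter stands in for the real Parquet/Iceberg write path.
    type chunkWriter interface {
        WriteChunk(rows [][]byte) error
    }

    // buffer accumulates encoded rows for one worker and flushes them as a
    // single large chunk once the byte threshold is crossed.
    type buffer struct {
        rows      [][]byte
        size      int64
        threshold int64
        out       chunkWriter
    }

    func (b *buffer) add(row []byte) error {
        b.rows = append(b.rows, row)
        b.size += int64(len(row))
        if b.size >= b.threshold {
            return b.flush()
        }
        return nil
    }

    func (b *buffer) flush() error {
        if len(b.rows) == 0 {
            return nil
        }
        // One big write per chunk keeps output files large and uniform
        // instead of producing many small files.
        if err := b.out.WriteChunk(b.rows); err != nil {
            return err
        }
        b.rows = b.rows[:0]
        b.size = 0
        return nil
    }

    type logWriter struct{}

    func (logWriter) WriteChunk(rows [][]byte) error {
        fmt.Printf("committing chunk of %d rows\n", len(rows))
        return nil
    }

    func main() {
        b := &buffer{threshold: 1 << 20, out: logWriter{}} // 1 MiB for the demo
        for i := 0; i < 100000; i++ {
            if err := b.add(make([]byte, 128)); err != nil {
                panic(err)
            }
        }
        if err := b.flush(); err != nil { // commit whatever is left
            panic(err)
        }
    }

Flushing on bytes rather than row count is presumably part of how the ~300–400 MB compressed Parquet files mentioned above stay uniform.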

Benchmark setup:

- Dataset: ~4.01 billion rows from the NYC Taxi + FHV trips parquet sets (row width ~120–144 bytes).

- Test machine: Azure Standard D64ls v5 (64 vCPUs, 128 GB RAM).

- Storage: Iceberg stored on local NVMe for the benchmark; the same architecture works with S3/GCS/HDFS.

The full benchmark results, methodology, and configs are here: https://olake.io/docs/benchmarks/

And the deep-dive into how we got the ~7× speedup is here: https://olake.io/blog/how-olake-becomes-7x-faster/

I’d love feedback from the HN community, specifically around tuning (batch sizes, commit frequency, partitioning strategies), Iceberg best practices, and real-world constraints you’ve seen in high-volume pipelines.

Happy to answer questions or share configs. Thanks for taking a look!


Love Go and Rust depending on the use case, but I've yet to check out Zig.


Interesting. That would bring the capabilities of a big production house within Netflix itself.


Unfortunately, Netflix thus far seems to lack the creative vision to fully utilize a production house of any size (barring rare exceptions).


Netflix is already the sole client of a huge studio outside Madrid.

