Differential Dataflow (timelydataflow.github.io)
120 points by timhigins on Oct 20, 2020 | 28 comments


Differential Dataflow is about differential computation, which itself is a generalization of incremental computing, most commonly known from Excel.

A good introduction to the overall topic is Incremental [0], an OCaml library by Jane Street, along with the corresponding blog posts and videos [1]. They even use it for webapps [2].

[0]: https://github.com/janestreet/incremental

[1]: https://www.janestreet.com/tech-talks/seven-implementations-... (Great to understand fundamentals of an incremental computation graph)

[2]: https://www.janestreet.com/tech-talks/intro-to-incr-dom/


I think the underlying algorithms/architecture are pretty different. I believe Incremental is based on http://matthewhammer.org/icfp11/ChenDuHaAc2011-implicit-SAC.... and Differential Dataflow is based on the ideas in https://www.microsoft.com/en-us/research/wp-content/uploads/... .


Interesting. Without me having to read both papers, what would you say are the main differences? Pros and cons of each architecture?


"Incremental Computing" - A new term for me... I figured out how to do it a few days ago, after seeing this reference to MetaMine on Reddit https://www.reddit.com/r/programming/comments/j94lgd/metamin...

In Metamine,

  = signifies an equation, a persistent assignment: any change to a value it depends on results in all dependent values being recomputed.
  := signifies a standard one-shot assignment.
The runtime behavior is easy enough to implement, less than 100 lines of Pascal. I worried that the code for a language that supported it would be unreadable, like FORTH, only worse. Imagine if Excel spreadsheets were plain text source code.
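
For anyone curious what that runtime behavior looks like, here is a rough sketch (in Rust rather than Pascal, pull-based rather than true incremental recomputation, and with all names made up): a "persistent" cell remembers its formula and its dependencies, so reading it after an input changes yields the recomputed value.

    // A toy sketch, not MetaMine: `:=` is an ordinary value cell, `=` is a formula
    // cell that is re-evaluated from its dependencies whenever it is read.
    use std::collections::HashMap;

    enum Cell {
        Value(f64),                              // `:=` one-shot assignment
        Formula(Vec<String>, fn(&[f64]) -> f64), // `=` persistent assignment
    }

    fn eval(name: &str, cells: &HashMap<String, Cell>) -> f64 {
        match &cells[name] {
            Cell::Value(v) => *v,
            Cell::Formula(deps, f) => {
                let args: Vec<f64> = deps.iter().map(|d| eval(d, cells)).collect();
                (*f)(&args)
            }
        }
    }

    fn main() {
        let mut cells = HashMap::new();
        cells.insert("a".to_string(), Cell::Value(2.0)); // a := 2
        cells.insert(
            "b".to_string(),                             // b = a * 10
            Cell::Formula(vec!["a".to_string()], |xs| xs[0] * 10.0),
        );
        println!("b = {}", eval("b", &cells));           // 20
        cells.insert("a".to_string(), Cell::Value(3.0)); // update a
        println!("b = {}", eval("b", &cells));           // 30: the dependent value follows
    }

A real incremental engine would instead track which cells are dirty and push changes through a dependency graph, recomputing only what is affected.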


This should be the top comment. Reading through those docs is a terrible experience; I think the author lacks any didactic experience.


This is also the tech powering Materialize, which is an awesome materialised view maintenance engine with a Postgres-like interface:

https://materialize.io/


I still feel like there's no simple platform for a use case where you're doing stream processing, but you really care about being able to update old data. Obviously you can just rerun a big batch job, and maybe work out which caches need invalidating but that's hard work. Streaming platforms with watermarks make it hard to deal with very late events or long term updates. Something like Kafka Streams does away with watermarks and so can keep aggregate data up to date, but if your logic involves a lot of stateful, sequential logic, it's hard to know what to rollback and replay when updates come in. If and when Materialize supports window functions, a lot of these workloads become much simpler, I think. Obviously differential dataflow on its own can address some of this, but alas, not a Rust shop.


Doesn't Apache Flink already solve these problems? E.g. stateful, fault tolerant, scalable stream processing.


Sort of. Let's say you're counting the number of unique users visiting your site in a 5 minute fixed window. This is pretty simple: you have a set, and each time a request comes in you add it to the set. At the end of the 5 minute window you output the set cardinality to your downstream consumer and throw out the raw data.
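
To make the happy path concrete, a minimal sketch of that fixed-window distinct count (plain Rust, nothing Flink-specific; the times and names are invented):

    use std::collections::{HashMap, HashSet};

    const WINDOW_SECS: u64 = 300; // 5-minute tumbling windows

    fn main() {
        // (event_time_secs, user_id), arriving in order for now.
        let events = vec![(10, "alice"), (20, "bob"), (40, "alice"), (320, "carol")];

        // window start -> set of user ids seen in that window
        let mut windows: HashMap<u64, HashSet<&str>> = HashMap::new();
        for (t, user) in events {
            let window_start = (t / WINDOW_SECS) * WINDOW_SECS;
            windows.entry(window_start).or_default().insert(user);
        }

        // When a window is closed, emit its cardinality and drop the raw set.
        for (start, users) in &windows {
            println!("[{}, {}): {} unique users", start, start + WINDOW_SECS, users.len());
        }
    }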

But what if this is a mobile app, and requests may be delayed by minutes or even hours due to gaps in cellular connectivity? In that case, we need a strategy to handle "late arriving" data.

Flink has several approaches for this, but they're all based around a "high watermark." This is basically an estimate, at time T, that all data up to time T-D has arrived. Once the watermark has passed the end of our window, we consider it closed and can compute the final value for it. (How you compute the watermark is up to you; typically you use a fixed delay D, but this can be made more accurate if you have out-of-band information.)
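
In other words (a conceptual sketch, not Flink's actual API): with a fixed allowed delay D, the watermark just trails the largest event time seen so far by D.

    // Watermark = max event time observed minus a fixed allowed delay D.
    struct BoundedOutOfOrderness {
        max_event_time: u64,
        max_delay: u64, // D
    }

    impl BoundedOutOfOrderness {
        fn observe(&mut self, event_time: u64) {
            self.max_event_time = self.max_event_time.max(event_time);
        }
        fn watermark(&self) -> u64 {
            self.max_event_time.saturating_sub(self.max_delay)
        }
    }

    fn main() {
        let mut wm = BoundedOutOfOrderness { max_event_time: 0, max_delay: 60 };
        for t in [100, 130, 90, 200] {
            wm.observe(t);
        }
        // Any window ending at or before the watermark can be closed and emitted.
        println!("watermark = {}", wm.watermark()); // 140
    }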

In addition to the default behavior (where late-arriving data is dropped), you can customize this by specifying a trigger that runs when late-arriving data comes in, which can be used to, e.g., update an external datastore. However, at that point the reconciliation is outside the scope of Flink.

There are various other ways to organize this within Flink. E.g., you can keep the windows open indefinitely, update the internal state of your Flink job, and serve directly from the Flink state, or you can use a periodic trigger to periodically update your downstream from that state. Obviously this will require a larger state size (on disk or in memory, depending on your state backend), since you won't be able to close out the window and will have to keep all of the data around.
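
A sketch of that last option, keeping per-window state around indefinitely and re-emitting an updated value whenever a late event lands in an already-reported window (again plain Rust, not Flink's state backends):

    use std::collections::{HashMap, HashSet};

    const WINDOW_SECS: u64 = 300;

    // window start -> users; retained forever, which is the state-size cost mentioned above.
    struct UniqueUsers {
        windows: HashMap<u64, HashSet<String>>,
    }

    impl UniqueUsers {
        // Returns the (window, new count) to push downstream, even for old windows.
        fn on_event(&mut self, event_time: u64, user: &str) -> (u64, usize) {
            let start = (event_time / WINDOW_SECS) * WINDOW_SECS;
            let users = self.windows.entry(start).or_default();
            users.insert(user.to_string());
            (start, users.len())
        }
    }

    fn main() {
        let mut agg = UniqueUsers { windows: HashMap::new() };
        // The event at t=40 arrives late, after t=400 has already opened a new window.
        for (t, u) in [(10, "alice"), (20, "bob"), (400, "carol"), (40, "dan")] {
            let (window, count) = agg.on_event(t, u);
            println!("window {} now has {} unique users", window, count);
        }
    }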

Much more about all this in the Flink docs (https://ci.apache.org/projects/flink/flink-docs-release-1.11...).

Also a lot of the theory behind this comes from the Google Dataflow paper (https://research.google/pubs/pub43864/) which is also just a great read.


Yes I'm aware that Flink has these mechanisms. But I'm not sure I understand how Differential Dataflow handles this in a better way. For simple windowing functions there are certainly shortcuts possible, but for more complex ones such as session windows it gets complicated. You might have to merge sessions as late data arrives. This is what Flink and others (Google Dataflow) do.
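
For anyone who hasn't hit this: the reason session windows are the hard case is that a single late event can glue two previously separate sessions together. A toy sketch (made-up gap and timestamps, not any particular engine's merging logic):

    // Sessions are maximal runs of events separated by less than GAP seconds.
    const GAP: u64 = 60;

    #[derive(Debug, Clone, Copy)]
    struct Session { start: u64, end: u64 }

    fn add_event(mut sessions: Vec<Session>, t: u64) -> Vec<Session> {
        // The new event starts as its own tiny session...
        let mut merged = Session { start: t, end: t };
        // ...then absorbs every existing session within GAP of it.
        let (near, mut far): (Vec<_>, Vec<_>) = sessions
            .drain(..)
            .partition(|s| s.end + GAP >= merged.start && merged.end + GAP >= s.start);
        for s in near {
            merged.start = merged.start.min(s.start);
            merged.end = merged.end.max(s.end);
        }
        far.push(merged);
        far
    }

    fn main() {
        let mut sessions = Vec::new();
        for t in [0, 10, 100, 110] {
            sessions = add_event(sessions, t);
        }
        println!("{:?}", sessions); // two sessions: [0,10] and [100,110]

        // A late event at t=50 bridges them into a single session [0,110],
        // so any results already emitted for the two old sessions must be retracted.
        sessions = add_event(sessions, 50);
        println!("{:?}", sessions);
    }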


Flink uses high watermarks like Google's Dataflow and is based on (I think) the original MillWheel paper.

The alternative to all this nonsense is to just throw everything into ClickHouse and build materialized views! The drawback is you can't do complex joins, but for 90% of use cases, ClickHouse materialized views work swimmingly.


As I understand it, ClickHouse still doesn't support window functions either, so it gets hard to do complex sequence-based logic in calculations. Materialize actually supports lateral joins, so you can do some powerful things, but I am sure a better developer experience is possible.


ClickHouse does this using arrays. There's a rich set of functions to pull grouped values into arrays, process them with lambdas (e.g., map, sort, etc.), and explode them back out into a tabular result.

Here's a slide deck with examples from a recent presentation: https://altinity.com/presentations/introduction-to-high-velo...

ClickHouse sneaks a functional programming model with lambda expressions into SQL. It's not standard SQL but has enormous flexibility.


I was kind of caught up on the comment "if (m2, m1) and (m1, p), then output (m1, (m2, p))", where the input [(m2, m1), (m1, p)] has each pair reversed and is then joined with the unreversed input.

If m2 manages m1 and m1 manages p, then m2 manages p via m1, but I could not quite understand what this example output should mean. m1 manages m2 and p? m1 is managed by m2 and p? Neither seems correct.
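
One way to read it (my understanding, not an authoritative gloss on the docs): the pairs are reversed so that both inputs are keyed on the middle node m1; the equi-join on that key produces the record (m1, (m2, p)), i.e. "m1 is managed by m2 and manages p", and dropping the key then gives the transitive edge (m2, p). A plain, non-differential sketch of that join:

    use std::collections::HashMap;

    fn main() {
        // (manager, person) pairs.
        let manages = vec![("m2", "m1"), ("m1", "p")];

        // Reverse (manager, person) into (person, manager), keyed on the person.
        let reversed: Vec<(&str, &str)> = manages.iter().map(|&(m, p)| (p, m)).collect();

        // Index the original input by its first field (the manager).
        let mut by_manager: HashMap<&str, Vec<&str>> = HashMap::new();
        for &(m, p) in &manages {
            by_manager.entry(m).or_default().push(p);
        }

        // Equi-join on the shared key m1: (m1, m2) meets (m1, p) -> (m1, (m2, p)).
        for (key, manager) in reversed {
            if let Some(reports) = by_manager.get(key) {
                for report in reports {
                    println!("({}, ({}, {}))", key, manager, report); // prints (m1, (m2, p))
                }
            }
        }
    }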


I had the same question. More generally, it's a "design smell" for me to see something like this:

        // define a new timely dataflow computation.
        timely::execute_from_args(std::env::args(), move |worker| {
Why would something as fundamental as creating a new dataflow need a comment? In other words: why is that not self-evident from the syntax?

Not to denigrate the technical accomplishments. And elegance/readability is a very personal thing. But it does seem jarring when the language constructs don't obviously map cleanly to the design intent.


Naming things is hard.

In this case, the command does not create a new dataflow; it creates a new timely dataflow computation by executing the closure on multiple workers. The computation can then spin up zero, one, or many dataflows interactively. Dataflows come and go as the computation runs.
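
For a sense of the shape (roughly following timely's own introductory examples; details may vary by version): the closure passed to execute_from_args runs once per worker, and each worker can then build dataflows inside it.

    use timely::dataflow::operators::{ToStream, Inspect};

    fn main() {
        // Runs the closure on each worker (the number of workers comes from the CLI args).
        timely::execute_from_args(std::env::args(), |worker| {
            // Each worker builds one dataflow here; it could build several, or none.
            worker.dataflow::<u64, _, _>(|scope| {
                (0..10)
                    .to_stream(scope)
                    .inspect(|x| println!("seen: {:?}", x));
            });
        })
        .unwrap();
    }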


For what it's worth... someone like me could benefit a lot from this comment. It would allow me to understand some of the intended design from an abstracted point of view, and gives me a bunch of hooks to start googling to understand what's going on.

The syntax here doesn't otherwise communicate that to me at all.


Responding to both children as they're related:

Enginerrrd said:

> The syntax here doesn't otherwise communicate that to me at all.

Assuming I'm understanding you correctly, that's my point. If the syntax doesn't communicate intent, it suggests a usability limitation.

frankmcsherry said:

> Naming things is hard.

Agreed. Hence my point about readability being a personal thing. In this case, though, look again at the comment and the line of code. I'd suggest that very few people, on reading the code without the comment, would guess that it was creating a new dataflow computation.

> In this case, the command does not create a new dataflow, it creates a new timely dataflow computation

Good point, thanks for the correction.


What are people here using differential dataflow for?


(Responding from the perspective of Materialize, the technology and company built on top of Differential Dataflow) Some examples we're seeing interest for:

Building realtime, low-latency (< 10 sec) dashboards. The type of thing where you previously would have had to wait several hours or a day for ETL pipelines to crunch through a lot of numbers.

We're also fielding interest in streaming ML applications. I.e., moving from batch models to streaming models.

Also worth pointing out that Differential Dataflow has been around for a while, while Materialize is fairly young. We're still constantly learning about new applications!


This is useful for financial simulations. Used it to backtest trading strategies on billions of stock quotes in microseconds, about 20 years ago.


How is this different from, for example, rule engines, where you have certain inputs and a calculation/solver engine, and then you get the outputs from it?


This is also one of the main ideas behind Noria [1] (built by @jonhoo, who also recently defended his PhD dissertation on this [2]).

[1] https://github.com/mit-pdos/noria

[2] https://www.youtube.com/watch?v=GctxvSPIfr8


This sounds like streaming and what Kafka Streams has been doing for a while. Everything is eventually consistent, and what's served up is the current state at the time of the request.


It is, but I think you can do more complex queries, and "the current state" is a table built from all events ever received, not a small time window.


Technically Kafka Streams can maintain tables over all events throughout time. But I don't think its logic is quite as optimised as the engine inside differential dataflow.


Can someone explain in simple English what the difference is between this and automatic materialized views as implemented in SQL Server and Oracle?


Materialize is a bit more of a lower-level technology, and less of a single "feature". I've included some example use cases built on top of Materialize in my other comment.

In addition, while Materialize does support connecting to other databases to power "real-time materialized views", a common architecture we see it used in is presenting a SQL view on top of streaming systems (such as Kafka or Kinesis).

Here's an overview of Materialize that explains the relationship between Timely Dataflow, Differential Dataflow, and Materialize, starting at 23:20 - https://materialize.io/blog-cmudb/



