Differential Dataflow is about differential computation, which is itself a generalization of incremental computation, most commonly known from Excel.
A good introduction to the overall topic is incremental [0], an OCaml library by Jane Street, along with the corresponding blog posts and videos [1]. They even use it for web apps [2].
= signifies an equation, a persistent assignment: any change to a value it depends on results in all dependent values being recomputed.
:= signifies a standard one-shot assignment.
The runtime behavior is easy enough to implement, less than 100 lines of Pascal. I worried that the code for a language that supported it would be unreadable, like FORTH, only worse. Imagine if Excel spreadsheets were plain text source code.
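To illustrate, here's a minimal sketch of that runtime idea, reconstructed by me in Rust rather than the original Pascal (and it re-evaluates lazily on read instead of pushing updates eagerly, so treat it as the dependency idea only):

    use std::collections::HashMap;

    // A cell is either a one-shot value (:=) or an equation (=) recomputed
    // from other cells whenever it is read.
    enum Cell {
        Value(i64),
        Equation(Vec<String>, fn(&[i64]) -> i64),
    }

    struct Sheet {
        cells: HashMap<String, Cell>,
    }

    impl Sheet {
        fn new() -> Self {
            Sheet { cells: HashMap::new() }
        }

        fn set(&mut self, name: &str, cell: Cell) {
            self.cells.insert(name.to_string(), cell);
        }

        // Evaluate a cell, recursively re-evaluating everything it depends on.
        fn eval(&self, name: &str) -> i64 {
            match &self.cells[name] {
                Cell::Value(v) => *v,
                Cell::Equation(deps, f) => {
                    let args: Vec<i64> = deps.iter().map(|d| self.eval(d)).collect();
                    f(&args)
                }
            }
        }
    }

    fn main() {
        let mut sheet = Sheet::new();
        sheet.set("a", Cell::Value(1)); // a := 1
        sheet.set("b", Cell::Value(2)); // b := 2
        // sum = a + b: a persistent assignment over its dependencies.
        sheet.set("sum", Cell::Equation(vec!["a".into(), "b".into()], |xs| xs[0] + xs[1]));

        println!("sum = {}", sheet.eval("sum")); // 3

        // Changing an input changes every value derived from it on the next read.
        sheet.set("a", Cell::Value(10));
        println!("sum = {}", sheet.eval("sum")); // 12
    }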
I still feel like there's no simple platform for a use case where you're doing stream processing but you really care about being able to update old data. Obviously you can just rerun a big batch job, and maybe work out which caches need invalidating, but that's hard work. Streaming platforms with watermarks make it hard to deal with very late events or long-term updates. Something like Kafka Streams does away with watermarks and so can keep aggregate data up to date, but if your logic involves a lot of stateful, sequential logic, it's hard to know what to roll back and replay when updates come in. If and when Materialize supports window functions, a lot of these workloads become much simpler, I think. Obviously differential dataflow on its own can address some of this, but alas, we're not a Rust shop.
Sort of. Let's say you're counting the number of unique users visiting your site in a 5-minute fixed window. This is pretty simple: you have a set, and each time a request comes in you add the user to the set. At the end of the 5-minute window you output the set cardinality to your downstream consumer and throw out the raw data.
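As a toy sketch (hypothetical names, plain Rust, and assuming events arrive in timestamp order):

    use std::collections::HashSet;

    const WINDOW_SECS: u64 = 5 * 60; // 5-minute fixed windows

    struct WindowCounter {
        window_start: u64,      // start of the current window (unix seconds)
        users: HashSet<String>, // distinct user ids seen in this window
    }

    impl WindowCounter {
        fn new(now: u64) -> Self {
            WindowCounter { window_start: now - now % WINDOW_SECS, users: HashSet::new() }
        }

        // Returns Some(count) whenever a window closes.
        fn on_request(&mut self, user_id: &str, timestamp: u64) -> Option<usize> {
            let mut emitted = None;
            if timestamp >= self.window_start + WINDOW_SECS {
                // Window is over: emit its cardinality and discard the raw data.
                emitted = Some(self.users.len());
                self.users.clear();
                self.window_start = timestamp - timestamp % WINDOW_SECS;
            }
            self.users.insert(user_id.to_string());
            emitted
        }
    }

    fn main() {
        let mut counter = WindowCounter::new(0);
        counter.on_request("alice", 10);
        counter.on_request("bob", 20);
        counter.on_request("alice", 30);
        // The first event past the 5-minute mark closes the previous window.
        if let Some(count) = counter.on_request("carol", 310) {
            println!("unique users in closed window: {}", count); // 2
        }
    }

The whole thing leans on events showing up roughly in order, which is exactly what breaks next.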
But what if this is a mobile app, and requests may be delayed by minutes or even hours due to gaps in cellular connectivity? In that case, we need a strategy to handle "late arriving" data.
Flink has several approaches for this, but they're all based around a "high-watermark." This is basically an estimate, made at time T, that all data with timestamps up to T-D has already arrived. Once the watermark has passed the end of our window we consider it closed and can compute the final value for it. (How you compute the watermark is up to you; typically you use a fixed delay D, but this can be made more accurate if you have out-of-band information.)
In addition to the default behavior (where late-arriving data is dropped), you can customize this by specifying a trigger that runs when late-arriving data comes in, which can be used to, e.g., update an external datastore. However, at that point the reconciliation is outside the scope of Flink.
There are various other ways to organize this within Flink, e.g., you can keep the windows open indefinitely, update the internal state of your Flink job, and serve directly from the Flink state, or you can use a periodic trigger to update your downstream from that state. Obviously this will require a larger state size (on disk or in memory, depending on your state backend) since you won't be able to close out the window and have to keep all of the data around.
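To make the watermark mechanics concrete, here's a rough generic sketch (plain Rust, not Flink's actual API; the window size and allowed lateness are made-up constants):

    use std::collections::BTreeMap;

    const WINDOW_SECS: u64 = 300;      // 5-minute fixed windows
    const MAX_LATENESS_SECS: u64 = 60; // the fixed "D" in watermark = T - D

    struct WatermarkedWindows {
        watermark: u64,
        open: BTreeMap<u64, u64>, // window_start -> event count, for windows not yet closed
    }

    impl WatermarkedWindows {
        fn new() -> Self {
            WatermarkedWindows { watermark: 0, open: BTreeMap::new() }
        }

        fn on_event(&mut self, event_time: u64) {
            let window = event_time - event_time % WINDOW_SECS;
            if window + WINDOW_SECS <= self.watermark {
                // Late-arriving data: the window already closed. The default is to
                // drop it; a trigger or side output would hook in here instead.
                println!("dropped late event at t={}", event_time);
                return;
            }
            *self.open.entry(window).or_insert(0) += 1;

            // Advance the watermark: assume all data older than T - D has arrived.
            self.watermark = self.watermark.max(event_time.saturating_sub(MAX_LATENESS_SECS));

            // Close (emit and discard) every window the watermark has passed.
            let closed: Vec<u64> = self
                .open
                .keys()
                .filter(|&&w| w + WINDOW_SECS <= self.watermark)
                .copied()
                .collect();
            for w in closed {
                let count = self.open.remove(&w).unwrap();
                println!("window [{}, {}) closed with {} events", w, w + WINDOW_SECS, count);
            }
        }
    }

    fn main() {
        let mut windows = WatermarkedWindows::new();
        // The event at t=150 arrives after the watermark has closed its window.
        for t in [10u64, 120, 290, 400, 150, 700] {
            windows.on_event(t);
        }
    }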
Yes I'm aware that Flink has these mechanisms. But I'm not sure I understand how Differential Dataflow handles this in a better way. For simple windowing functions there are certainly shortcuts possible, but for more complex ones such as session windows it gets complicated. You might have to merge sessions as late data arrives. This is what Flink and others (Google Dataflow) do.
Flink uses high-watermarks like Google's Dataflow and is (I think) based on the original MillWheel paper.
The alternative to all this nonsense is to just throw everything into ClickHouse and build materialized views! The drawback is you can't do complex joins, but for 90% of use cases, ClickHouse materialized views work swimmingly.
As I understand it, ClickHouse still doesn't support window functions either, so it gets hard to do complex sequence-based logic in calculations. Materialize actually supports lateral joins, so you can do some powerful things, but I am sure a better developer experience is possible.
ClickHouse does this using arrays. There's a rich set of functions to pull grouped values into arrays, process them with lambdas (e.g., map, sort, etc.), and explode them back out into a tabular result.
I was kind of caught up on the comment "if (m2, m1) and (m1, p), then output (m1, (m2, p))", where the input [(m2, m1), (m1, p)] is pairwise switched and joined with the unreversed input.
If m2 manages m1 and m1 manages p, then m2 manages p via m1, but I could not quite understand what this example output is supposed to mean. m1 manages m2 and p? m1 is managed by m2 and p? Neither seems correct.
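Working it out for myself, the output seems to be keyed by the shared middle manager. A toy restatement of the "switch and join" in plain Rust (my own sketch, not the article's differential dataflow code) does print (m1, (m2, p)):

    use std::collections::HashMap;

    fn main() {
        // (manager, managee): m2 manages m1, and m1 manages p.
        let manages = vec![("m2", "m1"), ("m1", "p")];

        // Switch each pair so it is keyed by the managee ...
        let by_managee: HashMap<&str, &str> =
            manages.iter().map(|&(mgr, sub)| (sub, mgr)).collect();

        // ... then join with the unreversed input on its manager field.
        // A match on key m1 yields (m1, (m2, p)): "via m1, m2 manages p".
        for &(mgr, sub) in &manages {
            if let Some(&grand_mgr) = by_managee.get(mgr) {
                println!("({}, ({}, {}))", mgr, grand_mgr, sub);
            }
        }
    }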
I had the same question. More generally, it's a "design smell" for me to see something like this:
// define a new timely dataflow computation.
timely::execute_from_args(std::env::args(), move |worker| {
Why would something as fundamental as creating a new dataflow need a comment? In other words: why is that not self-evident from the syntax?
Not to denigrate the technical accomplishments. And elegance/readability is a very personal thing. But it does seem jarring when the language constructs don't obviously map cleanly to the design intent.
In this case, the command does not create a new dataflow; it creates a new timely dataflow computation by executing the closure on multiple workers. The computation can then spin up zero, one, or many dataflows interactively. Dataflows come and go as the computation runs.
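For example (sketching loosely, with API details that may vary by version, and assuming timely is available as a dependency), the same closure can build two independent dataflows within one computation:

    use timely::dataflow::operators::{ToStream, Inspect};

    fn main() {
        // `execute_from_args` starts the computation: the closure runs on each
        // worker, and each worker may build any number of dataflows over time.
        timely::execute_from_args(std::env::args(), move |worker| {
            // One dataflow ...
            worker.dataflow::<u64, _, _>(|scope| {
                (0..5).to_stream(scope)
                      .inspect(|x| println!("first dataflow: {}", x));
            });

            // ... and, later in the same computation, another, unrelated one.
            worker.dataflow::<u64, _, _>(|scope| {
                (5..10).to_stream(scope)
                       .inspect(|x| println!("second dataflow: {}", x));
            });
        }).unwrap();
    }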
For what it's worth... someone like me could benefit a lot from this comment. It would allow me to understand some of the intended design from an abstracted point of view, and gives me a bunch of hooks to start googling to understand what's going on.
The syntax here doesn't otherwise communicate that to me at all.
> The syntax here doesn't otherwise communicate that to me at all.
Assuming I'm understanding you correctly, that's my point. If the syntax doesn't communicate intent, it suggests a usability limitation.
frankmcsherry said:
> Naming things is hard.
Agreed. Hence my point about readability being a personal thing. In this case though, look again at the comment and the line of code. I'd suggest that very few people, on reading the code without the comment, would guess that it was creating a new dataflow computation.
> In this case, the command does not create a new dataflow, it creates a new timely dataflow computation
(Responding from the perspective of Materialize, the technology and company built on top of Differential Dataflow.) Some examples we're seeing interest in:
Building real-time, low-latency (< 10 sec) dashboards. The type of thing where you previously would have had to wait several hours or a day for ETL pipelines to crunch through a lot of numbers.
We're also fielding interest in streaming ML applications, i.e., moving from batch models to streaming models.
Also worth pointing out that Differential Dataflow has been around for a while, while Materialize is fairly young. We're still constantly learning about new applications!
This sounds like streaming, and like what Kafka Streams has been doing for a while. Everything is eventually consistent, and what's served up is the current state at the time of the request.
Technically Kafka Streams can maintain tables over all events throughout time. But I don't think its logic is quite as optimised as the engine inside differential dataflow.
Materialize is more of a lower-level technology and less of a single "feature". I've included some example use cases built on top of Materialize in my other comment.
In addition, while Materialize does support connecting to other databases to power "real-time materialized views", a common architecture we're used for is presenting a SQL view on top of streaming systems (such as Kafka or Kinesis).
Here's an overview of Materialize that explains the relationship between Timely Dataflow, Differential Dataflow, and Materialize, starting at 23:20 -
https://materialize.io/blog-cmudb/
[0]: https://github.com/janestreet/incremental
[1]: https://www.janestreet.com/tech-talks/seven-implementations-... (Great to understand fundamentals of an incremental computation graph)
[2]: https://www.janestreet.com/tech-talks/intro-to-incr-dom/