Data-Oriented Design Principles (data-oriented.design)
102 points by ingve on July 18, 2023 | 51 comments


This might be an unpopular opinion, but I feel like these DOD posts should always mention that the ideas are borne out of experience in game engine development and don't necessarily easily apply in other domains.

In some general sense the principles are applicable everywhere. But trying to actually apply them in some domains is really difficult. For example, I work on developing a large scale ML training framework and I haven't seen an example where avoiding the use of objects and inheritance leads to better code. Having said that, the principles certainly apply lower down the stack when talking about implementing the GPU kernels etc on the hot path.


> ...where avoiding the use of objects and inheritance leads to better code...

DOD isn't about "better code" but strictly about "faster code": it lets real-world hardware restrictions dictate how code accesses data (the way I wrote that, it may sound like a bad thing, but it really isn't). Writing software that doesn't fight the hardware it runs on can lead to simpler and easier-to-maintain code, but that's not guaranteed.


Wouldn't it be nice if people always followed the serious whitepaper structure: abstract, summary, conclusion, measurements...

But this wouldn't work for programming language ideologies. Not for the OOP proponents, anyway.

The DOD crowd can at least demonstrate how CPUs love blasting through data laid out in arrays.

The FP crowd can at least explain how the languages can be formalised.

PS: ML is the most natural application of DOD principles, no?


Not everything that matters can be measured.


The McNamara fallacy is actually such a common one.


It's better to look for numbers that can, at least when done right, represent reality than to accept magical thinking and hand-waving.


True!

But things unmeasurable are typically very ephemeral.


> But this wouldn't work for programming language ideologies. Not for the OOP proponents, anyway.

Why not? Are you saying that OOP isn't solving any problem?

As I understand it, OOP is mainly about allowing behavior to be reconfigured at runtime. As the introduction to "Design Patterns" puts it: "Polymorphism simplifies the definition of clients, decouples objects from each other and lets them vary their relationships to each other at run-time."

Using this indirection absolutely everywhere probably isn't great for performance, but it seems helpful in moderation.
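
For a concrete (toy) example of what "reconfigured at run-time" buys you, a minimal Python sketch with made-up names:

    import json

    # The client (Logger) only knows the Formatter interface; the concrete
    # behavior can be swapped at run time without touching the client.
    class Formatter:
        def render(self, record: dict) -> str:
            raise NotImplementedError

    class PlainFormatter(Formatter):
        def render(self, record: dict) -> str:
            return " ".join(f"{k}={v}" for k, v in record.items())

    class JsonFormatter(Formatter):
        def render(self, record: dict) -> str:
            return json.dumps(record)

    class Logger:
        def __init__(self, formatter: Formatter):
            self.formatter = formatter   # this relationship can vary at run time

        def log(self, record: dict) -> None:
            print(self.formatter.render(record))

    logger = Logger(PlainFormatter())
    logger.log({"event": "start", "ok": True})  # event=start ok=True
    logger.formatter = JsonFormatter()          # reconfigured without changing Logger
    logger.log({"event": "stop", "ok": True})   # {"event": "stop", "ok": true}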


> As I understand it, OOP is mainly about allowing behavior to be reconfigured at runtime. As the introduction to "Design Patterns" puts it: "Polymorphism simplifies the definition of clients, decouples objects from each other and lets them vary their relationships to each other at run-time."

You're talking about polymorphism, not OOP. While polymorphism also exists in OOP, it is not a concept that is exclusive to it. And polymorphism alone doesn't make OOP.


> Are you saying that OOP isn't solving any problem?

I certainly would. More precisely, I don’t see OOP having any comparative advantage. I don’t see it being the most suitable tool for any job. Except maybe GUI. And even then I have my doubts.

The actor model though is pretty cool.


The original idea was about messages and encapsulation. Its recast incarnation now is about lowering the entry-level skill required for programming and system understanding to rock bottom. If you copy-paste reality into objects, you are hired, effectiveness be damned.


OOP is not about "allowing...runtime". The key bit is objects: data paired with functions meant to be used with the data.

That's it, really.


> ML is the most natural application of dod principles, no?

Yes, the principles are helpful and natural for implementing the code that executes the ML model pass with no unnecessary overhead. But for the code that configures and defines a model, the pre- and post-processing, and to some extent the dataloaders, it's natural to use objects + inheritance. Now, I don't actually think using objects and inheritance is antithetical to DOD, but it seems that DOD often gets presented that way.


That sounds totally fine. There's no reason to care about the layout of data that your computer spends almost no time working on or the pointers that it chases dozens of times total.


OOP can also be formalized if you really want to.


I feel that DOD is about performance, not understandability and usability, so it makes sense that the frontend of a framework, where usability is priority, doesn't benefit, and the backend and hot path does.


Usability can only ever be as good as performance allows. I don't care where the buttons are when it takes 5 seconds before they show up, or how pretty the font is if the scrolling is stuttering, and so on.

From the user perspective, it really makes no difference if the same UI was hand coded in assembly or in something much higher level that is still fast because it doesn't work against the CPU, RAM, storage and other things.


There are at least 2 end users for any code, one who uses the product and one who has to work with the code in the future. The second one will care.


Fuck the second guy, their entire job is to serve the first. And having been that second guy a large fraction of my career, fuck the original author’s OO GoF hexagonal Clean Code too.


This comment has a point: there are much fewer developers than there are users, and our moral weight is proportionally tiny.


I think the long tail of user bases makes a big difference here. Most programs have few users. Most devs work on programs with few users.


Most devs work on programs with orders of magnitude more users than devs, even for “few” users. The absolute number does not matter.


The crap performance comes from bloat, not from code that is too simple. I also see precious little maintenance going on, at least compared to jumping ship to something even shinier and with even more eternal tech every few years. Microsoft promised Win10 would be the last one and then instantly changed their tune for pure business reasons.

And some code gets run on so many devices, by so many users. If starting a phone call took 0.1 seconds longer for every human on the planet for the next 20 years, even a huge team that saved 50 years of their own lives by cutting that corner and not reducing the delay would have destroyed more than it gained. And every one of us uses thousands of programs, millions all in all, and if they all started to prioritize the resources of their makers over yours, it would outpace whatever you as an individual can gain by outsourcing your costs in the same way.
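
Rough back-of-envelope, with assumed numbers (~8 billion people, one call per person per day), just to show the scale:

    # Back-of-envelope sketch; the inputs are assumptions, not data.
    people = 8_000_000_000
    calls_per_day = 1
    extra_seconds = 0.1
    years = 20

    wasted_seconds = people * calls_per_day * extra_seconds * 365 * years
    wasted_person_years = wasted_seconds / (3600 * 24 * 365)
    print(f"{wasted_person_years:,.0f}")  # ~185,000 person-years lost vs. 50 saved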

Of course I'm not saying spend twice the time to squeeze out 1% more performance for an installer that's run once. But we are not at risk of that anytime soon; instead we're in a situation where some people ship or advocate things that are 10-100x as big as other programs doing the exact same thing, and are ass-backwards and regressive in so many ways. All these GitHub repos with 20 config files and dozens of folders many levels deep, with the "actual" code fitting on a napkin if you added it up, plus a "one-line" install that kicks off a cascade with a gazillion possible failure points. [1]

Partly because so many people flat out don't care, as if actually knowing what is going on under the hood, how the words in the text editor translate into a process, and the wider world it's embedded in, was somehow beneath them (and then they turn around and correctly complain about their managers who treat them like they treat their machines and users). But I think mostly because of this desire to re-invent something badly with a lot of fanfare and in doing that help keep the better thing that exists obscure (especially if that thing is free). That's the majority of professional software development it seems, so I don't worry about which of the laid off people will maintain the discontinued products too much.

[1] Remember NASA World Wind? They had a perfectly fine desktop app before Google Earth existed. Now they have a nodejs thing, so I thought okay, that is something I'll install node for. Alas: https://i.imgur.com/G4mQxjb.png <-- that's the "installer" segfaulting and leaving everything as is. I haven't seen something that bad in like... ever.

And I don't care that I can just roll back, or should have done it in a VM or docker. This, to me, is a joke, and I am not interested in fixing it or using a VM image or anything. I wish the people who maintain that sort of stuff the very best, I think they'll need it.


I find simple loops over simple data more understandable than inheritance chains and pointer chasing.


If the stuff you're doing allows a design where you can just loop over POD, sure, that's the ideal.

But lots of stuff isn't that simple, and you end up messily reinventing inheritance with a lot of switches and if statements.
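
i.e. you end up with something like this (rough sketch, all names made up), which is just hand-rolled dynamic dispatch:

    # Dispatching on a "kind" tag inside the loop grows a new branch
    # for every case you would otherwise have modeled as a subtype.
    entities = [
        {"kind": "player", "x": 0.0, "vx": 1.0, "health": 100},
        {"kind": "bullet", "x": 5.0, "vx": 20.0, "ttl": 30},
        {"kind": "wall",   "x": 9.0},
    ]

    def update(entity: dict, dt: float) -> None:
        kind = entity["kind"]
        if kind == "player":
            entity["x"] += entity["vx"] * dt
        elif kind == "bullet":
            entity["x"] += entity["vx"] * dt
            entity["ttl"] -= 1
        elif kind == "wall":
            pass  # static
        else:
            raise ValueError(f"unknown kind: {kind}")

    for e in entities:
        update(e, dt=1.0 / 60.0)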


Most of the time you can solve this by knowing shit and thinking harder.


That would be fair if OO/FP posts had always mentioned that the ideas were borne out of academia and don't necessarily easily apply in other domains, but that ship has sailed..


> That would be fair if OO/FP posts had always mentioned that the ideas were borne out of academia and don't necessarily easily apply in other domains, but that ship has sailed..

OO has been widely adopted because it maps almost perfectly to most domains.

FP is starting to be adopted because first class support from programming languages and frameworks lowered the barrier to entry by eliminating the need to implement all primitives and infrastructure. FP is being adopted because it maps almost perfectly to some domains.

What's your point, exactly?


Not sure if you’re being serious here: the whole point of academia is to eventually be useful to the outside world. Something that applies to academia and nowhere else is kind of useless. The very reason we take science so seriously comes from how it helped us completely transform the world.

So the idea that something would only (easily) apply to academia because it was born out of it feels kind of ridiculous to be honest.


Why can't we do both? I.e., give an honest account of the pros and cons of different approaches rather than treating them like dogma?


The "methodology industry-complex" thrives on consultants, authors, certification experts, courseware, etc. By definition it is funded by those who literally do not know and need/looking for something better. So while we can be honest, there is a lot of money preventing that discussion from ever happening.


I feel the biggest weakness of PyTorch is that everything is an object with inheritance. It makes no sense that ReLU is a class, for example. Seems like ML could take a few hints.


Well, PyTorch uses classes more as a declarative way to construct your neural networks. It doesn’t get more data-oriented than pushing thousands of tensors through your GPU!


Declarative vs. procedural is orthogonal to data oriented vs. object oriented.

The functional way to make a declarative interface would be to have a single type, `Expression`, and all operators take expressions and return expressions.

However, I would argue the declarative vs. procedural decision is much more important to the resulting code structure than whether you use a single object type or multiple object types with inheritance.
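
Something like this (rough sketch, hypothetical names):

    # One Expression type; every operator takes expressions and returns
    # expressions, and evaluation is a separate step.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Expression:
        op: str
        args: tuple = ()

    def const(value) -> Expression:
        return Expression("const", (value,))

    def add(a: Expression, b: Expression) -> Expression:
        return Expression("add", (a, b))

    def mul(a: Expression, b: Expression) -> Expression:
        return Expression("mul", (a, b))

    def evaluate(expr: Expression):
        if expr.op == "const":
            return expr.args[0]
        if expr.op == "add":
            return evaluate(expr.args[0]) + evaluate(expr.args[1])
        if expr.op == "mul":
            return evaluate(expr.args[0]) * evaluate(expr.args[1])
        raise ValueError(expr.op)

    tree = mul(add(const(2), const(3)), const(4))  # built declaratively
    print(evaluate(tree))                          # 20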


It's complicated because PyTorch also uses class syntax to allow you to introspect the various blocks in your modules. It's a bit of a kludge, but composing it strictly out of Expressions would remove that introspection capability.


The problem is that with Python OOP, the class is your type. Therefore it comes with all the OOP baggage.

If you squint enough, avoid that baggage, and only leverage the type system (hint), then it's kinda usable. You don't have much choice anyway.


I agree that often people choose to use the object versions unnecessarily, but many operations (including relu) have functional versions: https://pytorch.org/docs/stable/generated/torch.nn.functiona...
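
For example, both of these are stock PyTorch; the functional form skips the object entirely:

    import torch
    import torch.nn.functional as F
    from torch import nn

    x = torch.randn(4, 8)

    # Object/module style: ReLU as a (stateless) class instance.
    relu_module = nn.ReLU()
    y1 = relu_module(x)

    # Functional style: just a function over tensors.
    y2 = F.relu(x)

    assert torch.equal(y1, y2)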


> Different problems require different solutions.

> If you have different data, you have a different problem.

What is unstated but perhaps implicit here is that you might have a different problem even if you have the same data, and this is why it is better not to bind functions and data together as in OOP.

Separating functions from data lets you organize them according to the functionality they provide, instead of grouping them all together with the data structure they operate on. When you bundle data and functions together into a class, it often grows into an oversized assortment of all the various operations you might want to do on that particular bundle of data. The OOP solution for splitting up a bloated kitchen sink class is decorators, which results in using multiple object instances, each with a different interface, to manipulate the same data. Instead of simply importing functions from a different module when you want extra functionality, you have to construct a new object instance to wrap your "basic" object and "enhance" it with extra methods.
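
A toy sketch of that difference (hypothetical names): the data stays a plain structure, functions are grouped by the functionality they provide, and adding reporting later doesn't mean wrapping the object in a decorator.

    # The data is a plain dict; functions live in whatever module matches
    # their functionality and can be imported as needed.
    order = {"items": [("widget", 2, 9.99), ("gadget", 1, 24.50)], "customer": "acme"}

    # pricing-module-style functions
    def total(order: dict) -> float:
        return sum(qty * price for _, qty, price in order["items"])

    # reporting-module-style functions, added later without touching a class or wrapper
    def summary(order: dict) -> str:
        return f"{order['customer']}: {len(order['items'])} items, {total(order):.2f}"

    print(summary(order))  # acme: 2 items, 44.48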


As a Clojure programmer, this is not what I have in mind when I think of data-oriented-programming, but this:

Principle #1: Separating code (behavior) from data.

Principle #2: Representing data with generic data structures.

Principle #3: Treating data as immutable.

Principle #4: Separating data schema from data representation.

Source: https://blog.klipse.tech/dop/2022/06/22/principles-of-dop.ht...

I think using C++ gives a different twist to the meaning of data-oriented, mainly because with lisps code is data. As I read this "manifesto", it seems more focused on the data the program handles than handling the program with data: In Clojure I often use data-oriented programming for programs that barely deal with any data at all. I tend to lay what I call a "plan" that describes the computation that needs to be carried out. In some way this is similar to a DSL except that this "plan" won't run without also writing a "compiler" or "interpreter". If suddenly requirements change and you need to run your "plan" in a distributed way (or any other execution flavor you may think of), you just write another compiler.
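
Not Clojure, but a rough sketch of the "plan plus interpreter" idea (everything here is made up):

    # The "plan" is plain data; different "compilers"/interpreters can
    # execute the same plan in different ways.
    plan = [
        {"step": "fetch",     "url": "https://example.com/data"},
        {"step": "transform", "fn": "normalize"},
        {"step": "store",     "table": "results"},
    ]

    def run_locally(plan: list) -> None:
        for step in plan:
            print("executing", step)              # a real interpreter would do the work

    def run_distributed(plan: list) -> None:
        for step in plan:
            print("scheduling on cluster", step)  # same plan, different execution flavor

    run_locally(plan)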

Code being data, this is an approach you can take on code itself with macros, not just as a way to add behavior but to split different aspects of code: I once wrote a macro specifically for a block of complex code that I wanted to read without the clutter introduced by debug lines, so I moved this code in a macro that would add it back using a highly specific code-walker.

What is gained by introducing interfaces using data rather than an object system must be repaid when writing and maintaining those compilers.


> As a Clojure programmer, this is not what I have in mind when I think of data-oriented-programming, but this:

Yeah, that's more or less how I would have defined DOP, too.

Could it be that there is a slight difference in meaning when people refer to "data-oriented design" (OP) vs. "data-oriented programming"? At least that's been my (anecdotal) impression so far.


I'm not sure what is meant by "data-oriented programming" (I know what "data-driven" means...) but, yes, "Data-Oriented Design" has a distinct (if somewhat nebulous) meaning which comes from game programmer culture. (And my guess is the difference between it and data-oriented programming is not slight.) Data-Oriented Design is, basically, the name given to the bag of techniques listed on the submitted webpage. I don't know if the term was coined by Mike Acton, but it was popularized by the talk he gave that is linked to on the submitted webpage. (It's an inspiring talk! You should watch it, if you have not, yet.)

As far as I can tell, Data-Oriented Design is a reaction to trauma experienced by game programmers trying to undo damage inflicted by the... inapt... application of OO techniques in game codebases by their (probably well-meaning, but ignorant) peers. (Hence the contrast in the names: OBJECT-Oriented Design -> DATA-Oriented Design.)

The keystone idea seems to be, instead of organizing your program's data as objects in an inheritance hierarchy, figure out which data will be accessed together in tight loops, and pack that data together in arrays. (I.e., prefer structs of arrays to arrays of objects.)
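
In code the contrast looks roughly like this (a Python/NumPy sketch of the idea; DOD material usually shows it in C or C++):

    import numpy as np

    n = 100_000

    # "Array of objects": each particle is a separate Python object; the fields
    # are scattered across the heap, so a tight loop over them chases pointers.
    class Particle:
        def __init__(self, x, y, vx, vy):
            self.x, self.y, self.vx, self.vy = x, y, vx, vy

    aos = [Particle(0.0, 0.0, 1.0, 1.0) for _ in range(n)]

    # "Struct of arrays": the fields accessed together in the hot loop live in
    # contiguous arrays, which is what the cache (and SIMD) wants.
    soa = {
        "x":  np.zeros(n), "y":  np.zeros(n),
        "vx": np.ones(n),  "vy": np.ones(n),
    }

    dt = 1.0 / 60.0
    soa["x"] += soa["vx"] * dt   # one contiguous pass per field
    soa["y"] += soa["vy"] * dt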

P.S. There's an interaction during the Q&A section of that presentation by Mike Acton which I love: one questioner asks, somewhat incredulously (and I'm paraphrasing here), "If I were to follow the principles you have laid out in this talk, then, if I ever needed to alter the layout of my data--after having invested (perhaps significant) time already writing my program--I would subsequently be required to rewrite all the code which accesses that data." and Mike Acton answers, basically, with a stone-cold "Yes."


This is also consistent with how it is thought of in GameDev. The trendy ECS (Entity Component System) architecture is usually implemented with a data-oriented mindset to maximize cache utilization, make allocations/deallocations of many small entities easy and fast, and facilitate concurrency.


AKA “Cache is King”.

(edit) Terje Mathisen's "almost all programming can be viewed as an exercise in caching" is often quoted amongst game developers, as it rings true with how we use hardware and pre-calculation of data.


Mike Acton presented a great visualization of this in his 2014 DOD talk -- here's a link to just that clip: https://www.youtube.com/watch?v=rX0ItVEVjHc&t=1831s


In contrast to jebarker's comment, I actually think it's really interesting that a concept coming from game engine development actually seems quite applicable in some very different domains.

We (https://estuary.dev/) ended up arriving at a very similar design for transformations in streaming analytics pipelines: https://docs.estuary.dev/concepts/derivations/

To paraphrase, each derivation produces a collection of data by reading from one or more source collections (DOD calls these "streams"), optionally updating some internal state (sqlite), and emitting 0 or more documents to add to the collection. We've been experimenting with this paradigm for a few years now in various forms, and I've found it surprisingly capable and expressive. One nice property of this system is that every transform becomes testable by just providing an ordered list of inputs and expectations of outputs. Another nice property is that it's relatively easy to apply generic and broadly applicable scale-out strategies. For example, we support horizontal scaling using consistent hashing of some value(s) that's extracted from each input.
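
A stripped-down sketch of that shape (this is not the actual Flow API, just pseudo-Python of the idea):

    import sqlite3

    # Internal state lives in sqlite; the derivation reads source documents,
    # updates state, and emits zero or more documents for the derived collection.
    state = sqlite3.connect(":memory:")
    state.execute("CREATE TABLE counts (key TEXT PRIMARY KEY, n INTEGER)")

    def derive(source_doc: dict) -> list:
        key = source_doc["user"]
        state.execute(
            "INSERT INTO counts VALUES (?, 1) "
            "ON CONFLICT(key) DO UPDATE SET n = n + 1", (key,))
        (n,) = state.execute("SELECT n FROM counts WHERE key = ?", (key,)).fetchone()
        return [{"user": key, "events_seen": n}] if n % 10 == 0 else []

    # Testable as an ordered list of inputs and expected outputs:
    assert derive({"user": "alice"}) == []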

Putting it all together, it's not hard to imagine building real-world web applications using this. Our system is more focused on analytics pipelines, so you probably don't want to build a whole application out of Flow derivations. But it would be really interesting to see a more generic DOD-based web application platform, as I'd bet it could be quite a nice way to build web apps.


While data is always associated with SQL, there is a world of data separate from SQL. SQL is in many cases a standard way to persist and work on data, but it doesn't span the lifetime/journey of data. Software systems are built on classes, functions and structs, caches, interfaces, buffers. It's common to have hierarchical relationships between data objects. SQL doesn't naturally handle hierarchy, despite the fact it has syntax to do so.

As a software engineer, data modeler, and data engineer, I find DOD a weird label that gets applied to anything. I've decades of experience with SQL but don't gravitate towards it. Reality is messy.


Even with SQL we recognize that while aggregate roots and ORMs are great for storing and editing single items, they're terrible for use cases where the data is _used_, and you're better off using different query mechanisms to slice it in a better way. Indexing services and caches and transformation pipelines continue that.

I think CQRS is really just a glimmer of DoD in the enterprise world, the recognition that the system of record is generally a terrible resource for actually using the data, and that you need to rethink everything again if you want a performant system.


I 2000% agree with everything in here except this line:

> If you don’t understand the hardware, you can’t reason about the cost of solving the problem.

For me, "Data-oriented Design" mostly implies "try to use SQL-shaped things". In this context, understanding of the hardware is like seeing another side of the query planner blackbox. You should have a general sense of how many raw bytes you can move in a serialized manner based upon your machines & networks, but I feel trying to directly understand how exactly every H/W resource will be utilized goes against these principles at a certain point.

Most SQL offerings provide exceptionally powerful tools that attempt to answer questions like "how long might this query take to run based upon history and/or projected growth?". Is this not "reasoning about the cost of solving the problem"?
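
For what it's worth, even the smallest engines expose the planner's reasoning; e.g. (SQLite here, purely illustrative):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
    db.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

    plan = db.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = ?", ("acme",)
    ).fetchall()
    print(plan)  # the detail column mentions idx_orders_customer being used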

For non-trivial domains, focusing only on getting the data model logically clean might be a 100% full-time job. You should always worry about performance after correctness unless you are developing a piece of software where performance is the headline aspect of correctness (game engines, DAWs, etc).


The context isn't clear from the page, but data-oriented design mostly comes from the game industry where this level of performance often (but not always) does matter.

And in that context, they don't mean "data-oriented" in the sense of "declarative like SQL". They mean it in the sense of "how the bytes are arranged in memory". One of the primary motivations is being able to use the CPU cache well.


The context is absolutely clear. Data is data. You are simply biased towards a particular domain.


Your comment reads like it's arguing with me, but I don't understand what you're trying to refute.



