Maybe it will. It really depends whether it's "easier" for the network to learn an intuitive physics, versus a laundry list of superficial hacks that let it minimise loss all the same. If the list of hacks grows so long that gradient descent finds it easier to learn the actual physics, then it'll learn the physics.
Hinton argues that the easiest way to minimise loss in next token prediction is to actually understand meaning. An analogous thing may hold true in vision modelling wrt physics.
If your entire existence was constrained to seeing 2d images, not of your choosing, could a perplexity-optimizing process "learn the physics"?
Basic things that are not accessible to such a learning process:
- moving around to get a better view of a 3d object
- seeing actual motion
- measuring the mass of an object participating in an interaction
- setting up an experiment and measuring its outcomes
- choosing to look at a particular sample at a closer resolution (e.g. microscopy)
- seeing what's out of frame of a given image
I think we have a lot of evidence at this point that optimizing models to understand distributions of images is not the same thing as understanding the things in those images. In 2015 that was 'DeepDream' dog worms; in 2018 it was "this person does not exist" portraits where people's garments or hair or jewelry fused together or merged with their background; in 2022 it was diffusion images of people with too many fingers, or whose hands melted together if you asked for people shaking hands. In the Sora announcement earlier this year it was a woman's jacket morphing while the shot zoomed into her face.
In the same way that LLMs do better at some reasoning tasks by generating a program to produce the answer, I suspect that models trained to generate 3D geometry and scenes, then run a simulation -> renderer -> style transfer process, may end up being the better way to get to image models that "know" about physics.
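To make that concrete, here's a rough sketch of what such a pipeline could look like end to end. `simulate_step`, `render` and `stylize` are toy stand-ins I made up, not any particular library; a real version would use a proper physics engine, a real renderer, and a learned img2img model for the last stage.

```python
import numpy as np

def simulate_step(pos, vel, dt=0.05, g=9.81):
    """Advance a ball under gravity with a floor bounce (the 'physics' stage)."""
    vel = vel + np.array([0.0, -g]) * dt
    pos = pos + vel * dt
    if pos[1] < 0:                      # crude collision with the ground plane
        pos[1], vel[1] = 0.0, -0.8 * vel[1]
    return pos, vel

def render(pos, size=64):
    """Rasterise the scene state into a grayscale image (the 'renderer' stage)."""
    img = np.zeros((size, size))
    x = int(np.clip(pos[0] / 5.0 * size, 0, size - 1))
    y = int(np.clip((1 - pos[1] / 5.0) * size, 0, size - 1))
    img[max(y - 2, 0):y + 2, max(x - 2, 0):x + 2] = 1.0
    return img

def stylize(img):
    """Placeholder for a learned style-transfer / img2img model."""
    return np.clip(img + 0.05 * np.random.randn(*img.shape), 0, 1)

# Roll the pipeline forward: geometry -> physics -> pixels -> style.
pos, vel = np.array([0.5, 4.0]), np.array([1.0, 0.0])
frames = []
for _ in range(100):
    pos, vel = simulate_step(pos, vel)
    frames.append(stylize(render(pos)))

print(len(frames), frames[0].shape)     # 100 frames of 64x64 "footage"
```

The point being: the physics lives in the simulation stage by construction, and the generative model only has to make the rendering look natural.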
Indeed. It will be very interesting when we start letting models choose their own training data. Humans and other animals do this simply by interacting with the world around them. If you want to know what is on the back of something, you simply turn it over.
My guess is that the models will come up with much more interesting and fruitful training sets than what a bunch of researchers can come up with.
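In supervised terms that's basically active learning. A minimal pool-based sketch with uncertainty sampling (synthetic data, scikit-learn; the setup is purely illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Pool-based active learning: the model repeatedly picks the unlabeled
# examples it is least sure about, instead of training on a fixed set.
rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

labeled = list(rng.choice(len(X), size=20, replace=False))   # tiny seed set
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for round_ in range(10):
    model.fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[pool])[:, 1]
    uncertainty = -np.abs(probs - 0.5)            # closest to 0.5 = least sure
    picked = {int(i) for i in np.argsort(uncertainty)[-10:]}  # query 10 samples
    labeled += [pool[i] for i in picked]
    pool = [p for j, p in enumerate(pool) if j not in picked]
    print(round_, model.score(X, y))              # accuracy on the full set
```

"Turning the object over" is the embodied version of the same idea: the learner queries exactly the data that resolves its current uncertainty.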
They're being trained on video: 3D patches are fed into the ViT (the third dimension is time) instead of just 2D patches, so they should learn about motion. But they can't interact with the world, so maybe they can't have an intuitive understanding of weight yet. Not until embodiment, at least.
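For anyone unfamiliar, those 3D patches are usually called tubelets: the video tensor gets chopped into small space-time blocks before each block is flattened into a token. Roughly like this (the sizes are just illustrative):

```python
import numpy as np

# Fake video clip: (time, height, width, channels)
video = np.random.rand(16, 224, 224, 3)

# Tubelet size: 2 frames x 16 x 16 pixels (numbers are made up)
t, h, w = 2, 16, 16
T, H, W, C = video.shape

# Reshape into non-overlapping space-time blocks, then flatten each into a token.
patches = video.reshape(T // t, t, H // h, h, W // w, w, C)
patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)   # (nt, nh, nw, t, h, w, C)
tokens = patches.reshape(-1, t * h * w * C)         # one row per tubelet

print(tokens.shape)   # (8 * 14 * 14, 2 * 16 * 16 * 3) = (1568, 1536)
```

So motion is at least visible to the model within each tubelet and across the token sequence, even if nothing forces it to be modeled causally.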
I mean, the original article doesn't say anything about video models (where, frankly, spotting fakes is currently much easier), so I think you're shifting what "they" are.
But still:
- the input doesn't distinguish real motion from constructed, nonphysical motion (e.g. animations, moving title cards)
- the input doesn't distinguish motion of the camera from motion of the portrayed objects
- the input doesn't distinguish unnatural filmic techniques (e.g. a change of shot, a fade-in/out) from changes that are actually in the footage
Some years ago, I saw a series of results on GANs for image completion, and they had an accidental property of trying to add points of interest. If you showed one the left half of a photo of just ocean, horizon and sky, and asked for the right half, it would try to put in a boat or an island, because people generally don't take and publish images of empty ocean, even though most chunks of the horizon probably are quite empty. The distribution over images is not like reality.
> It really depends whether it's "easier" for the network to learn an intuitive physics, versus a laundry list of superficial hacks that let it minimise loss all the same.
Human innate understanding of physics is a laundry list of superficial hacks. People need education and mental effort to go beyond that innate but limited understanding.
When it is said that humans innately understand physics, no one means that people innately understand the equations and can solve physics problems. I think we all know how laughable such a claim would be, given how much people struggle when learning physics and how few even get to a moderate level (not even Goldstein, just calculus-based physics with partial derivatives).
What people mean when they say humans innately understand physics is that we have a working knowledge of many of its implications. Things like: gravity is applied uniformly from a single direction, and that direction is towards the ground; objects move in arcs or "ballistic trajectories"; straight lines are uncommon; wires hang in hyperbolic-cosine (catenary) shapes even if nobody knows those words; snow comes from cold; the sun creates heat; many lighting effects (which is also how many of our illusions form); and so on.
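For concreteness, these are the actual shapes behind two of those intuitions, which nobody needs to know in order to catch a ball or string a wire (a quick numerical sketch; the numbers are arbitrary):

```python
import numpy as np

# A hanging wire: y = a*cosh(x/a), a catenary (i.e. a hyperbolic cosine).
a = 0.5
x = np.linspace(-1.0, 1.0, 201)
wire = a * np.cosh(x / a) - a          # shifted so the lowest point sits at y = 0

# A thrown object: a parabolic arc, y = x*tan(theta) - g*x^2 / (2*(v*cos(theta))^2).
v, theta, g = 5.0, np.radians(60), 9.81
xs = np.linspace(0.0, 2 * v**2 * np.sin(theta) * np.cos(theta) / g, 201)
arc = xs * np.tan(theta) - g * xs**2 / (2 * (v * np.cos(theta)) ** 2)

print(wire.max(), arc.max())           # rise of the wire at its ends, peak of the arc
```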
Essentially, humans know that things do not fall up. One could argue that this is based on a "laundry list of superficial hacks," and they wouldn't be wrong, but they also wouldn't be entirely right. Even when wrong, human formulations are more often than not causal: explainable _and_ rational. Rational does not mean correct, just that it follows some logic; the logic doesn't need to be right, and in fact no logic is, only some are less wrong than others.
> It really depends whether it's "easier" for the network to learn an intuitive physics, versus a laundry list of superficial hacks that let it minimise loss all the same
The latter is always easier. Not to mention that the architectures are fundamentally curve fitters. There are many curves that can fit data, but not all curves are causally related to the data. The history of physics itself is a history of becoming less wrong, and many of the early attempts at problems (which you probably never learned about, fwiw) were pretty hacky approximations.
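A toy version of that point: fit noisy free-fall data with the true quadratic and with a high-degree polynomial. Both fit the training range, but only one keeps working outside it (made-up data, numpy only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy observations of free fall, y = 0.5*g*t^2, on t in [0, 1].
g = 9.81
t_train = np.linspace(0, 1, 20)
y_train = 0.5 * g * t_train**2 + rng.normal(0, 0.2, t_train.shape)

# Two curves that both fit the training data...
quad = np.polyfit(t_train, y_train, 2)    # roughly "the physics"
wiggly = np.polyfit(t_train, y_train, 7)  # a superficial hack that also fits

# ...but only one keeps working outside the training range.
t_test = np.array([1.5, 2.0])
print(np.polyval(quad, t_test))           # stays close to 0.5*g*t^2
print(np.polyval(wiggly, t_test))         # usually drifts well away from it
print(0.5 * g * t_test**2)                # ground truth: ~[11.0, 19.6]
```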
> Hinton argues
Hinton is only partially correct. It entirely depends on the conditions of your optimization. If you're optimizing for generalization and causal understanding, then yes, it is without a doubt true. But models aren't trained like this, and most research is not pursuing those (still unknown) directions. If we aren't conditioning our models on those aspects, then consider how many parameters they have (and effects like superposition): the "superficial hacks" are without a doubt a lot easier and will very likely lead to better predictions on the training data (and likely the test data).
The grokking papers show that after sufficient training, models can transition into a regime where both training and test error get arbitrarily small.
Yes, this is out of reach of how we train most models today. But it demonstrates that even current architectures are capable of building circuits that perfectly predict the data (i.e., capture its actual dynamics) given sufficient exposure.
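For reference, the canonical grokking setup is a tiny algorithmic task like modular addition. A stripped-down sketch of that training loop in PyTorch (hyperparameters are illustrative; the original papers use a small transformer and typically need far more steps before the test accuracy jumps, so this exact run may or may not reproduce the effect):

```python
import torch
import torch.nn as nn

# Modular addition, the standard grokking toy task: predict (a + b) mod p.
p = 97
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p

# Train on half of all pairs, hold out the rest.
perm = torch.randperm(len(pairs))
split = len(pairs) // 2
train_idx, test_idx = perm[:split], perm[split:]

class Net(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.emb = nn.Embedding(p, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(), nn.Linear(256, p))
    def forward(self, ab):
        return self.mlp(self.emb(ab).flatten(1))

model = Net()
# Weight decay matters here: the delayed jump in test accuracy is usually
# attributed to regularization slowly pushing the net from memorization
# toward a more structured solution.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(20000):
    model.train()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 1000 == 0:
        model.eval()
        with torch.no_grad():
            acc = (model(pairs[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
        print(step, loss.item(), acc.item())
```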
I have some serious reservations about the grokking papers, and there's the added complication that test performance is not a great proxy for generalization performance. It is naive to assume the former begets the latter, because there are many underlying assumptions there that I think many would not take to be true once you actually work them out. (Not to mention the common usage of t-SNE-style analysis... but that's a whole other discussion.)
It is important to remember that there are plenty of alternative explanations for why the "sudden increase" in performance happens. I believe that if people had a deeper understanding of how metrics work, the phenomenon would become less surprising and one would be less convinced that scale (of data and/or model) is sufficient to create general intelligence. But this does take quite a bit of advanced education (atypical for an ML PhD), and you're going to struggle to obtain it "in a few weekends".
It really isn't easier at a sufficient complexity threshold.
Truth and reality cluster.
So hyperdimensional data compression organized around truthful modeling, versus a collection of approximations, will be increasingly more efficient as complexity and dimensionality approach uncapped limits.
We've already seen toy models do world modeling far beyond what was expected at the time.
This is a trend likely to continue as people underestimate modeling advantages.
> that gradient descent finds it easier to learn the actual physics, then it'll learn the physics.
I guess it really depends on what "gradient descent learning the physics" means.
Maybe you define it to mean that the actually correct equations appear encoded in the computation of the net. But this would still be tacit knowledge. It would be kind of like a piece of math software being aware of physics, at best.