But we don’t know how much larger the models will have to be, how large the data sets will need to be, or how much training is needed, do we? They could have to be inconceivably large.
If you want to correct for this particular problem you might be better off training a face detector, an eye detector, and a model that takes both eyes as input and fixes the reflections. The process would then be (a rough sketch follows below):
- generate image
- detect faces
- detect eyes in each face
- correct reflections in eyes
That is convoluted, though, and would get even more so once you want to correct for multiple such issues. It might also mishandle faces with glass eyes, but you could try to ‘detect’ those with a model trained on the prompt.
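A minimal sketch of that pipeline, assuming OpenCV's stock Haar cascades for the face and eye detection steps; `fix_reflections` is a placeholder for the hypothetical two-eye correction model:

```python
import cv2

# Stock Haar cascades shipped with opencv-python
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def fix_reflections(eye_patches):
    # Placeholder: a trained model would take both eye crops and
    # return versions with geometrically consistent reflections.
    return eye_patches

def correct_eye_reflections(img):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Step 2: detect faces in the generated image
    for (fx, fy, fw, fh) in face_cascade.detectMultiScale(gray, 1.1, 5):
        # Step 3: detect eyes within each face region
        face_gray = gray[fy:fy + fh, fx:fx + fw]
        eyes = eye_cascade.detectMultiScale(face_gray, 1.1, 5)
        if len(eyes) != 2:
            continue  # need exactly two eyes to cross-check reflections
        patches = [img[fy + ey:fy + ey + eh, fx + ex:fx + ex + ew]
                   for (ex, ey, ew, eh) in eyes]
        # Step 4: correct the reflections and paste the patches back
        for (ex, ey, ew, eh), patch in zip(eyes, fix_reflections(patches)):
            img[fy + ey:fy + ey + eh, fx + ex:fx + ex + ew] = patch
    return img

# Step 1 happens elsewhere (the generator); then:
# corrected = correct_eye_reflections(cv2.imread("generated.png"))
```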
The opposite might also be true: just having better, well-curated data goes a long way. LAION worked for a long time because it's huge, but what if all the garbage images were filtered out and the annotations were better?
The early generations of image and video models used middling data because it was the only data. Since then, literally everyone with data has been working their butts off to get it cleaned up to make the next generation better.
Better data, more intricate models, and improvements to the underlying infrastructure could mean these sorts of "improvements" come mostly "for free".