I know there are murmurs that synthetic data (i.e. footage rendered from 3D models) was used to train some generative models, including OpenAI's Sora. It seems like the only plausible way right now to get the insane amounts of data needed to capture such statistical regularities.