Any insights on Qwen3-Omni yet?


Looks awesome, but a 30B model is too big for most machines. The vast majority of people probably have 32 GB of RAM or less, unfortunately.
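Rough back-of-envelope math on why (weights only; the KV cache, activations, and the OS itself all add more on top):

```python
# Back-of-envelope: memory needed just for the weights of a 30B model.
# Ignores KV cache, activations, and runtime overhead.
params = 30e9

for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = params * bytes_per_param / 2**30
    print(f"{name}: ~{gib:.0f} GiB")

# fp16: ~56 GiB, int8: ~28 GiB, int4: ~14 GiB -- so even int8 is
# uncomfortably tight on a 32 GB machine once you add a context window.
```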


I believe that the corpus of video data to train on far exceeds that of 3D data. It's also much cheaper to produce video data. So I'd expect that this is probably the quickest way forward given where things stand today.

Additionally, video seems like a pretty straightforward output shape to me: a 2D image with a time component. If we were talking 3D assets and animations, I wouldn't even know where to start with modeling that as input data for training. That seems really hard to frame as a fixed-size input problem to me.
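To make that concrete, here's a toy sketch (the shapes and field names are arbitrary, just for illustration):

```python
import numpy as np

# A video clip is one dense tensor: (time, height, width, channels).
# Every training example has the same regular shape, which is easy to batch.
clip = np.zeros((16, 256, 256, 3), dtype=np.float32)

# A 3D scene has no such natural fixed shape: a variable number of meshes,
# each with variable vertex/face counts, plus materials and animation rigs.
scene = {
    "meshes": [
        {"vertices": np.zeros((1203, 3)), "faces": np.zeros((2400, 3), dtype=np.int64)},
        {"vertices": np.zeros((87, 3)),   "faces": np.zeros((170, 3),  dtype=np.int64)},
    ],
    "materials": ["skin", "cloth"],
}
```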

If there were comparable 3D data available for training, I'd guess we'd see different issues with the different approaches.

A couple of examples I could think of quickly: using these to build games might be easier if we could interact with the underlying "assets", while getting photorealistic results with intricate detail (e.g. hair, vegetation) might be easier with video-based solutions.


If the fidelity of the video is high enough, you could use SfM (structure from motion) to build point clouds from the generated video frames and essentially do photogrammetry on the assets from a Genie video.
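A rough sketch of the first step of that pipeline (file paths and the frame stride are placeholders; assumes COLMAP for the SfM stage):

```python
# Dump frames from a generated video so an SfM tool can build a
# point cloud from them. Paths and stride are illustrative.
import os
import cv2

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("genie_clip.mp4")
i = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if i % 5 == 0:  # subsample: SfM wants overlapping views, not every frame
        cv2.imwrite(f"frames/{i:05d}.png", frame)
    i += 1
cap.release()

# Then, e.g., with COLMAP:
#   colmap automatic_reconstructor --image_path frames --workspace_path recon
```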


Things to note: 1) supply a JSON schema in `config.response_schema` 2) set `config.response_mime_type` to `application/json`

That works reliably for me. I've had some issues running into max-token constraints, but that was usually on me: I'd let it process a large list in one inference call, which resulted in very large outputs.

We're using Gemini JSON mode in production applications with both `google-generativeai` and `langchain` without issues.
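For reference, a minimal sketch of what this looks like with the `google-generativeai` SDK (the model name, schema, and prompt are illustrative placeholders):

```python
# Minimal sketch of Gemini JSON mode with the google-generativeai SDK.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Placeholder schema: constrain the output to a title plus a list of tags.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "tags"],
}

model = genai.GenerativeModel(
    "gemini-1.5-flash",
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",  # forces JSON output
        response_schema=schema,                 # constrains its shape
    ),
)

resp = model.generate_content("Extract a title and tags from: ...")
print(resp.text)  # a JSON string conforming to the schema
```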

