
It's probably because it was trained on "professional audio" (ads, movies, audiobooks) rather than on "normal people talking". It's like the effect you got when diffusion models were mostly trained on stock photos.

