
Seems like he thinks RLVR == learning from a single binary reward for the whole chain, completely discounting techniques that provide denser rewards, like process reward supervision?
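To make the distinction concrete, here is a rough toy sketch, not anyone's actual training setup. The function names and the arithmetic "step scorer" are made up for illustration; a real process reward model would be a learned model rather than a string parser.

```python
from typing import Callable, List


def outcome_rewards(steps: List[str], final_ok: bool) -> List[float]:
    """Sparse signal: one binary reward on the last step, zeros elsewhere."""
    if not steps:
        return []
    return [0.0] * (len(steps) - 1) + [1.0 if final_ok else 0.0]


def process_rewards(steps: List[str],
                    step_scorer: Callable[[str], float]) -> List[float]:
    """Dense signal: every intermediate step gets its own score."""
    return [float(step_scorer(s)) for s in steps]


def toy_step_scorer(step: str) -> float:
    """Hypothetical stand-in for a process reward model.

    Only understands steps of the form "a + b = c" or "a * b = c".
    """
    lhs, rhs = step.split("=")
    a, op, b = lhs.split()
    value = int(a) + int(b) if op == "+" else int(a) * int(b)
    return 1.0 if value == int(rhs) else 0.0


# A chain with one wrong step in the middle (5 * 4 is not 21).
chain = ["2 + 3 = 5", "5 * 4 = 21", "21 + 1 = 22"]

# The correct final answer would be 21, so the verifier rejects 22.
sparse = outcome_rewards(chain, final_ok=False)
dense = process_rewards(chain, toy_step_scorer)
```

With the binary outcome reward, every step in the failed chain looks equally bad. The per-step scores single out the second step as the one that went wrong, which is the kind of credit assignment the denser approaches are after.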

