
> you'd need _a lot_ of examples to deduce this with any certainty

Are you saying that "a lot" isn't fairly trivial to obtain, given how much code is available online?



Why would a model trained on English texts see "a lot" of PHP code? What was the prompt used for generating this code?


It was trained on the contents of links found on Reddit, wasn't it? Links to sample code or Stack Overflow posts could be pretty prevalent.


So you're buying the idea that it looked at a bunch of code snippets embedded in various pages, managed to build a sub-model for PHP (separate from all the other languages it should have encountered), and managed to generate a long, nearly syntactically correct program uninterrupted by English text?

And while it makes tons of obvious mistakes in English (which is a much more flexible and forgiving language), its PHP is somehow nearly syntactically perfect?


The sample outputs in the GPT-2 GitHub repo contain a lot of code:

https://raw.githubusercontent.com/openai/gpt-2/master/gpt-2-...

To me, this doesn't seem like an argument in favor of this model "understanding" English (or C, or PHP). It seems more like an indication that it memorizes far more information than the paper implies and then does clever word substitution.


Yes, I do think it learned a model of PHP and JavaScript syntax. 40GB of text is a lot of data, and PHP syntax is much simpler than English grammar, which it also learns quite well.

See also the example in the paper of the model accidentally learning to translate into French, even though they tried to remove French pages from the corpus.




