
> you'd need _a lot_ of examples to deduce this with any certainty

Are you saying that "a lot" isn't fairly trivial to obtain, given how much code is available online?



Why would a model trained on English texts see "a lot" of PHP code? What was the prompt used for generating this code?


It was trained on the contents of links found on Reddit, wasn't it? Links to sample code or Stack Overflow posts could be pretty prevalent.


So you're buying the idea that it looked at a bunch of code snippets embedded in various pages, managed to build a sub-model for PHP (separate from all the other languages it should have encountered), and managed to generate a long, nearly syntactically correct program uninterrupted by English text?

And while it makes tons of obvious mistakes in English (which is a much more flexible and forgiving language), its PHP is somehow nearly syntactically perfect?


The sample outputs in the GPT-2 GitHub repo contain a lot of code:

https://raw.githubusercontent.com/openai/gpt-2/master/gpt-2-...

To me, this doesn't seem like an argument in favor of this model "understanding" English (or C, or PHP). It seems more like an indication that it memorizes far more information than the paper implies and then does clever word substitution.


Yes, I do think it learned a model of PHP and JavaScript syntax. 40GB of text is a lot of data, and PHP syntax is much simpler than English grammar, which it also learns quite well.

See also the example in the paper of the model accidentally learning to translate into French, even though they tried to remove French pages from the corpus.




