>But are coding large language models in the wild trained on proprietary code bases?
There are source-available projects on GitHub, but GPL code cannot be pulled into a BSD-licensed project without putting the result under the GPL, and for sure there is a ton of GPL code in the training data of these LLMs.
The only solution I can think of is to train LLMs on BSD-compatible code for use on BSD projects, on GPL-compatible code for use on GPL projects, etc.
See the link below, where Copilot was caught stealing code, which proves that it is memorizing and reproducing rather than learning concepts and then generating original code.
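A rough sketch of what that per-license split could look like, assuming each repository comes tagged with an SPDX license identifier. The license sets, the pool names and the `training_pools` helper are all made up for illustration and are far from complete. It also assumes that training counts as a use the license governs, which is not settled.

```python
from collections import defaultdict

# Hypothetical, incomplete license sets keyed by SPDX identifier.
PERMISSIVE = {"MIT", "ISC", "BSD-2-Clause", "BSD-3-Clause"}
# Copyleft licenses whose code may be used in a GPL-3.0 project.
GPL3_COMPATIBLE_COPYLEFT = {"GPL-2.0-or-later", "GPL-3.0-only", "GPL-3.0-or-later"}


def training_pools(repos):
    """Group (name, spdx_id) pairs into one training pool per target license."""
    pools = defaultdict(list)
    for name, spdx_id in repos:
        if spdx_id in PERMISSIVE:
            # Permissive code can feed both the BSD model and the GPL model.
            pools["bsd"].append(name)
            pools["gpl3"].append(name)
        elif spdx_id in GPL3_COMPATIBLE_COPYLEFT:
            # Copyleft code only feeds the model meant for GPL projects.
            pools["gpl3"].append(name)
        else:
            # Unknown or proprietary licenses are kept out of every pool.
            pools["excluded"].append(name)
    return dict(pools)


if __name__ == "__main__":
    sample = [("a", "MIT"), ("b", "GPL-3.0-only"), ("c", "Proprietary")]
    print(training_pools(sample))
```

The point of the split is that the model used on a BSD project never sees copyleft code, so it has nothing of that kind to reproduce.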
https://news.ycombinator.com/item?id=27710287