I believe that counterexample only works in the limit where the sample size goes to infinity. Every finite sample will have μ ≠ 0 almost surely. (Of course, μ will still tend to be very close to 0 for large samples, just slightly off.)
So this means the sequence of μₙ will perform a kind of random walk that can stray arbitrarily far from 0 and is almost sure to eventually do so.
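This random-walk behaviour is easy to see in a toy simulation (my own sketch, not from any paper): fit the mean of a Gaussian to a finite sample, then draw the next generation's sample from the fitted model, and repeat.

```python
import random

def iterate_mean(n_samples=100, n_generations=2000, seed=0):
    """Fit the mean of a Gaussian to a finite sample, then sample the
    next generation from the fitted model, and repeat."""
    rng = random.Random(seed)
    mu = 0.0
    trajectory = [mu]
    for _ in range(n_generations):
        sample = [rng.gauss(mu, 1.0) for _ in range(n_samples)]
        mu = sum(sample) / n_samples  # next "model" is the sample mean
        trajectory.append(mu)
    return trajectory

traj = iterate_mean()
# Each generation adds roughly N(0, 1/n_samples) noise to mu, so mu_n is
# a random walk whose spread after k generations grows like sqrt(k/n_samples).
```

With 100 samples per generation each step only moves μ by about 0.1, but the steps accumulate instead of cancelling.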
I agree. The authors generate a dataset of a similar size to the original and then train on that repeatedly (e.g. for multiple epochs). That's not what you need to do to train a new model on the teacher's knowledge. You need to ask the teacher to generate new samples every time, otherwise your generated dataset is not very representative of the totality of the teacher's knowledge. Generating fresh samples every time would (in the infinite limit) solve the collapse problem.
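A toy sketch of the difference (my own illustration, assuming a 1-D Gaussian "teacher"): if every generation's student is fit to fresh samples from the original teacher rather than to the previous student's output, the estimation errors don't accumulate.

```python
import random

def fresh_teacher_estimates(n_samples=100, n_generations=2000, seed=0):
    """Refit the student every generation, but always on FRESH samples
    from the original teacher (true mean 0), never on student output."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_generations):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        estimates.append(sum(xs) / n_samples)
    return estimates

ests = fresh_teacher_estimates()
# Every estimate stays within O(1/sqrt(n_samples)) of the true mean;
# there is no drift, unlike the chained model-on-model setup.
```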
Agreed, that's what I struggle to see as well. It's not really clear why the variance couldn't stay the same or go to infinity instead. Perhaps it does follow from some property of the underlying Gamma/Wishart distributions.
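FWIW, one mechanism that would explain shrinkage (my own reasoning, for a 1-D Gaussian, not taken from the paper): if you refit both mean and variance each generation, the fitted variance is the true variance times a chi-square factor with mean 1, so the variance is a multiplicative process. By Jensen's inequality the log of that factor has negative expectation, so log σₙ drifts downward even though E[σₙ₊₁²] = σₙ². A quick simulation:

```python
import math
import random

def iterate_gaussian_fit(n_samples=50, n_generations=2000, seed=0):
    """Refit mean and (unbiased) variance of a 1-D Gaussian to a finite
    sample drawn from the previous generation's fitted model."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    sigmas = [sigma]
    for _ in range(n_generations):
        xs = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = sum(xs) / n_samples
        var = sum((x - mu) ** 2 for x in xs) / (n_samples - 1)  # unbiased
        sigma = math.sqrt(var)
        sigmas.append(sigma)
    return sigmas

# E[var] equals sigma^2 at every step, yet log(sigma_n) performs a random
# walk with negative drift, so sigma_n typically collapses toward 0.
sigmas = iterate_gaussian_fit()
```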
There was actually a series of language models named after Sesame Street characters back in 2018-2020, starting with ELMo, then BERT, ERNIE (a different model from 2019), Big Bird, ... There are likely some more that I missed.
These calls should be automated in most cases [1]. Still an impressive feat, but there is no way they are paying a large number of people to call every business in the world.
If your data loading pipeline grows even slightly complex, then yes, you absolutely need concurrency in order to deliver your samples to the GPU fast enough.
The current workarounds to make this happen in Python are quite ugly imho. PyTorch, for example, spawns multiple Python worker processes and shuttles data between them through shared memory, which incurs noticeable overhead. TensorFlow instead requires you to stick to its tensor DSL so that your pipeline can run inside its graph engine. If native concurrency were a thing, data loading would be much more straightforward to implement without such hacks.
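For flavour, here's a stripped-down, stdlib-only sketch of that worker-process pattern (illustrative names, nothing like the real PyTorch internals, which add shared-memory tensors, pinned memory, prefetching, and ordering guarantees). It uses the POSIX "fork" start method to keep the example short:

```python
import multiprocessing as mp

# "fork" avoids module re-import in children; POSIX-only simplification.
ctx = mp.get_context("fork")

def _worker(index_queue, data_queue):
    """Worker process: pull sample indices, 'load' them, push results back.
    The pickling/IPC on both queues is exactly the overhead mentioned above."""
    while True:
        idx = index_queue.get()
        if idx is None:  # sentinel: shut down
            break
        sample = [idx] * 4  # stand-in for real decode/augment work
        data_queue.put((idx, sample))

def load_in_parallel(indices, num_workers=2):
    indices = list(indices)
    index_queue, data_queue = ctx.Queue(), ctx.Queue()
    workers = [ctx.Process(target=_worker, args=(index_queue, data_queue))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    for idx in indices:
        index_queue.put(idx)
    for _ in workers:
        index_queue.put(None)  # one sentinel per worker
    # Drain results before joining, so workers' queue buffers can flush.
    results = [data_queue.get() for _ in indices]
    for w in workers:
        w.join()
    return dict(results)

if __name__ == "__main__":
    batches = load_in_parallel(range(8))
```

All the process spawning, sentinel handling, and queue draining here is pure plumbing; with native in-process concurrency most of it would disappear.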