
Mixture of Experts is not just 16 copies of a network; it's a single network where, for the feed-forward layers, tokens are routed to different experts, while the attention layers are still shared. There are also interesting choices around how the routing works, and I believe the exact details of what OpenAI is doing are not public. In fact, I think a good visualization of this would dispel a ton of myths about what MoEs are and how they work.
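
To make the point concrete, here is a minimal sketch of that routing idea in NumPy. It assumes a simple softmax router with top-k selection and ReLU expert FFNs; the numbers and the routing scheme are illustrative, not OpenAI's actual implementation:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff, n_experts, n_tokens = 16, 32, 4, 8

    # Stand-in for the output of the shared attention block: every token
    # passes through the same attention layers regardless of expert choice.
    x = rng.standard_normal((n_tokens, d_model))

    # One feed-forward network per expert: weights W1 (d_model x d_ff) and W2 (d_ff x d_model).
    experts = [
        (rng.standard_normal((d_model, d_ff)) * 0.1,
         rng.standard_normal((d_ff, d_model)) * 0.1)
        for _ in range(n_experts)
    ]

    # Router: a linear layer that scores each token against each expert.
    router_w = rng.standard_normal((d_model, n_experts)) * 0.1

    def moe_ffn(x, top_k=2):
        """Send each token to its top-k experts and combine the expert outputs,
        weighted by the renormalized router probabilities."""
        logits = x @ router_w                                   # (n_tokens, n_experts)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)              # softmax over experts

        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            top = np.argsort(probs[t])[-top_k:]                 # chosen expert indices
            weights = probs[t, top] / probs[t, top].sum()       # renormalize over top-k
            for w, e in zip(weights, top):
                w1, w2 = experts[e]
                out[t] += w * (np.maximum(x[t] @ w1, 0.0) @ w2) # ReLU FFN of expert e
        return out

    y = moe_ffn(x)
    print(y.shape)  # (8, 16): every token gets an output, but only touched top_k experts

The key thing the sketch shows: the experts are only the feed-forward sub-layers, each token's compute touches just a few of them, and everything before the router (the attention) is one shared network.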

