Thing is, you usually do need all the params, so if the model is only partially mapped into RAM, it's the equivalent of an app swapping in and out as it runs. Which is to say, inference gets much slower.
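To make that concrete, here's a rough sketch (Python, with a hypothetical `model.bin` weights file standing in for the real model). The point is just that a forward pass streams through essentially every page of the mapping, so if the mapping doesn't fit in RAM the OS has to evict and re-fault pages on every single pass:

```python
import mmap
import numpy as np

# Hypothetical fp16 weights file; any large binary blob shows the same effect.
WEIGHTS_PATH = "model.bin"

with open(WEIGHTS_PATH, "rb") as f:
    # Map the weights read-only; pages only become resident once touched.
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    weights = np.frombuffer(buf, dtype=np.float16)

    # Stand-in for a forward pass: each generated token touches roughly all
    # parameters. If the mapping is larger than available RAM, every pass
    # pays the page-fault / eviction cost -- the "swapping" slowdown above.
    for _ in range(3):  # three simulated tokens
        _ = weights.sum()  # streams through the whole mapping
```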
Local Apple models are likely in the 2-3B range, but fine-tuned to specific tasks.