P and B frames are compressed versions of a reference image. Frames resulting from DLSS frame generation are predictions of what a reference image might look like even though one does not actually exist.
But MPEG is lossy compression, which means the decoded frames are themselves something of a guess. That is why MPEG uses motion vectors.
"MPEG uses motion vectors to efficiently compress video data by identifying and describing the movement of objects between frames, allowing the encoder to predict pixel values in the current frame based on information from previous frames, significantly reducing the amount of data needed to represent the video sequence"
There's a real difference between a lossy approximation as done by video compression and the "just a guess" done by DLSS frame generation. Video encoders have the real frame to use as a target: when compressing a frame by referencing other frames and motion vectors, the encoder can measure the artifacts it introduces and assess its own accuracy. DLSS fundamentally has less information when generating new frames, and that's why it introduces much worse motion artifacts.
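A toy illustration of that asymmetry (not any real codec's API; `encode_p_frame` and its parameters are made up for this sketch): the encoder subtracts its prediction from the real frame, so it knows exactly how much error its lossy step introduced. A frame generator has no `real` array to subtract from.

```python
import numpy as np

def encode_p_frame(real: np.ndarray, predicted: np.ndarray, quant_step: int = 4):
    """Encoder side: the residual is computed against the REAL frame,
    so the error introduced by quantization is known exactly."""
    residual = real.astype(np.int16) - predicted.astype(np.int16)
    quantized = np.round(residual / quant_step).astype(np.int16)  # the lossy step
    reconstruction = predicted.astype(np.int16) + quantized * quant_step
    error = np.abs(real.astype(np.int16) - reconstruction).mean()  # self-assessed accuracy
    return quantized, error

# A frame generator, by contrast, only has past frames plus motion data;
# there is no `real` to subtract from, so it cannot bound its own error.
```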
It would be VERY interesting to have actual quantitative data on how many possible input video frames map to a specific decoded P or B frame vs. how many possible rasterized frames map to a given predicted DLSS frame. The lower this ratio, the more "accurate" the prediction is.
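The real spaces involved are far too large to enumerate, but the counting idea can be demonstrated at toy scale. A brute-force sketch, assuming a stand-in "codec" that is just uniform quantization of tiny 4-pixel, 4-level images:

```python
# Count how many inputs map to each lossy output. 4^4 = 256 possible
# "frames"; the toy codec quantizes each pixel with step 2.
from collections import Counter
from itertools import product

LEVELS, PIXELS, STEP = 4, 4, 2

def lossy_encode(frame):
    return tuple((p // STEP) * STEP for p in frame)  # quantize each pixel

preimage_sizes = Counter(lossy_encode(f) for f in product(range(LEVELS), repeat=PIXELS))
ratio = sum(preimage_sizes.values()) / len(preimage_sizes)
print(f"{len(preimage_sizes)} outputs, avg {ratio:.1f} inputs per output")
# Prints: 16 outputs, avg 16.0 inputs per output
```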
Compression and prediction are the same. Decompressing a lossy format is guessing what the original image might have looked like. The difference between fake frames and P and B frames is that the error between the predicted fake frame and the real frame depends on the user's input.
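A minimal sketch of the "compression is prediction" point (toy code, not any real format): a predictive coder stores only its prediction errors, and decoding is running the same predictor and adding the errors back. Keeping every residual makes it lossless; dropping small ones makes it lossy, i.e. a guess.

```python
def predictive_encode(samples):
    """Predict each sample as equal to the previous one; store residuals."""
    prev, residuals = 0, []
    for s in samples:
        residuals.append(s - prev)  # only the prediction error is stored
        prev = s
    return residuals

def predictive_decode(residuals):
    """Decoding = running the same predictor and adding back the errors."""
    prev, samples = 0, []
    for r in residuals:
        prev += r
        samples.append(prev)
    return samples

data = [10, 11, 12, 12, 13, 40]
assert predictive_decode(predictive_encode(data)) == data  # lossless here
```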
... now I wonder ... Do DLSS models take mouse movements and keypresses into account?