If your LLM agent makes different decisions from the same prompt, then you have to deal with it:
1) Your benchmarks become stochastic, so you need multiple samples per variant to get any confidence in your A/B testing.
2) If your system assumes at-least-once completion, you have to record each decision once and replay it on retries, so you don't end up with multiple rollouts that take different actions.
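To make both points concrete, here is a rough Python sketch, not a real implementation. `DecisionLog`, `mean_with_stderr`, `flaky_agent`, and the step id are all made-up names for illustration, and the in-memory dict stands in for whatever durable store you would actually need.

```python
import random
import statistics
from typing import Callable, Dict, List, Tuple


def mean_with_stderr(scores: List[float]) -> Tuple[float, float]:
    """Mean and standard error over repeated runs of the same benchmark."""
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, stderr


class DecisionLog:
    """Record each decision once, keyed by step id, and replay it on retries."""

    def __init__(self) -> None:
        # Assumption: in-memory only; a real system would persist this.
        self._log: Dict[str, str] = {}

    def decide(self, step_id: str, call_model: Callable[[], str]) -> str:
        if step_id not in self._log:
            # Only the first attempt actually samples the model.
            self._log[step_id] = call_model()
        return self._log[step_id]


# Hypothetical stand-in for a non-deterministic LLM call.
rng = random.Random(0)


def flaky_agent() -> str:
    return rng.choice(["refund", "escalate", "ignore"])


# Point 2: a retry of the same step replays the recorded action.
log = DecisionLog()
first = log.decide("ticket-42/step-1", flaky_agent)
retry = log.decide("ticket-42/step-1", flaky_agent)
assert first == retry

# Point 1: score a variant over many runs instead of a single one.
scores = [1.0 if flaky_agent() == "refund" else 0.0 for _ in range(200)]
mean, stderr = mean_with_stderr(scores)
print(f"variant A: {mean:.2f} +/- {1.96 * stderr:.2f} (approx. 95% interval)")
```

The important bit is that the key identifies the logical step, not the attempt, so a retried job looks up the earlier decision instead of sampling a new one.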
But at that point I feel like we are getting close to "everything that isn't a perfect Turing machine is somewhat stochastic" ;)
Edit: someone corrected me above; it does seem to matter more than I thought.