Information isnt attached to the users query, the CoT still happens in the output of the model like in the first example that you gave. This can be done without any finetuning on the policy, but reinforcement learning can also be used to encourage the chat output to break the problem down in to “logical” steps. chat models have always passed in the chat history back into the next input while appending the users turn, thats just how they work (I have no idea if o1 passes the CoT into the chat history though, so i cant comment). But it wouldnt solely account for the massive degradation of performance between o1 and o3/o4
From my other comment about o1 and o3/o4 potential issues:
The other big difference between o1 and o3 and o4 that may explain the higher rate of hallucinations is that the o1’s reasoning is not user accessible, and it’s purposefully trained to not have safe guards on reasoning. Where o3 and o4 have public reasoning and reasoning safeguards. I think safeguards may be a significant source of hallucination because they change prompt intent, encoding and output. So on a non-o1 model that safeguard process is happening twice per turn once for reasoning and once for output, then being accumulated into the input. On an o1 model that’s happening once per turn only for output and then being accumulated.
Information isnt attached to the users query, the CoT still happens in the output of the model like in the first example that you gave. This can be done without any finetuning on the policy, but reinforcement learning can also be used to encourage the chat output to break the problem down in to “logical” steps. chat models have always passed in the chat history back into the next input while appending the users turn, thats just how they work (I have no idea if o1 passes the CoT into the chat history though, so i cant comment). But it wouldnt solely account for the massive degradation of performance between o1 and o3/o4
From my other comment about o1 and o3/o4 potential issues: