Hear me out: I think applying RL to #LLMs and LMMs is misguided, and we can do much better.
Those #RL algorithms are a poor fit here: the model never learns how its decisions affect the eventual rewards; it's just optimized to make those decisions via Bellman-style updates.
Instead, we can simply condition the LLMs on the rewards. The rewards become inputs to the model rather than something external to it, so the model learns the actual reward dynamics instead of merely being pushed toward higher reward from the outside. The model can then do the credit assignment itself, optimally, without fancy mathematical heuristics!
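Here's roughly what I mean by reward-as-input, sketched with a made-up <|reward=...|> token (just my illustration of the layout, not a fixed format):

```python
# Tiny illustration of "reward as input" (the <|reward=...|> tokens and the
# example strings are my own, not a standard format).
prompt = "Question: 17 * 24 = ?"
trajectory = " Let's compute: 17 * 24 = 408. Answer: 408"
reward_token = "<|reward=1|>"  # the judged outcome of this trajectory

# Training sequence: the reward leads, so the model learns p(trajectory | reward, prompt).
train_sequence = reward_token + prompt + trajectory

# At inference we pick the reward we *want* and let the model generate a
# trajectory consistent with it:
inference_prefix = "<|reward=1|>" + prompt
print(train_sequence)
print(inference_prefix)
```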
This isn't a new idea; it comes from goal-conditioned RL and decision transformers.
We can simply sample the reasoning trajectories, judge their outcomes, prepend the outcome tokens to each trajectory, and then train the model on those sequences in a batch.
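A minimal sketch of that loop, assuming a Hugging Face-style causal LM; the <|reward=...|> tokens and the judge() grader are my own stand-ins, not an established recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Outcome tokens that will lead every training sequence.
OUTCOME_TOKENS = {1: "<|reward=1|>", 0: "<|reward=0|>"}
tokenizer.add_special_tokens({"additional_special_tokens": list(OUTCOME_TOKENS.values())})
model.resize_token_embeddings(len(tokenizer))

def judge(trajectory: str) -> int:
    """Stand-in outcome judge; a real one would verify the final answer."""
    return 1 if "408" in trajectory else 0

def build_batch(prompts, max_new_tokens=128):
    """Sample trajectories, judge them, and prepend the outcome token."""
    texts = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        out = model.generate(**inputs, do_sample=True, max_new_tokens=max_new_tokens,
                             pad_token_id=tokenizer.pad_token_id)
        trajectory = tokenizer.decode(out[0], skip_special_tokens=True)
        # The judged reward becomes the *first* thing in the training sequence.
        texts.append(OUTCOME_TOKENS[judge(trajectory)] + trajectory)
    return tokenizer(texts, return_tensors="pt", padding=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
batch = build_batch(["Question: 17 * 24 = ?"])
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100  # don't train on padding
loss = model(**batch, labels=labels).loss    # plain next-token cross-entropy
loss.backward()
optimizer.step()
```

No policy gradients, no value function: the reward only enters as a token the model conditions on.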