Shaping Exploration Space
The attached post falls under my general research interest in something akin to “LLM cognitive science.” I am currently researching model personas, and more broadly I think that developing a clearer conception of how models might exhibit something like a “psychology” could be useful to AI safety. I consider Anthropic’s Persona Selection Model a representative overview of the possibilities in this space.
The LessWrong post in particular focuses on the neglectedness of “motivation-space exploration” in RL and highlights it as a potentially safety-critical research area. It proposes natural research directions, but I think we could have done better at formalizing our definition of motivations and what it would look like to identify them. It is not obvious to me that motivations should be detectable as a mechanistic causal trace in a model’s computations, let alone that we can meaningfully experiment with and intervene on them. That said, I do think further work on model psychology, especially during the RL training phase, is important and would have significant impact if successful, hence my contribution to and endorsement of the post’s overall message.