How to Build a Self-Optimizing System
Reading “Human-level Control through Deep Reinforcement Learning” did not inspire anything philosophical in me. But the paper did describe a neat learning architecture that I’d love to share with you.
Mnih and the team at DeepMind built a system that started with zero knowledge about the games it played. Starting from random play and improving by trial and error, the same system reached the level of a professional human games tester across 49 different games.
The system had access to the screen, the score, and the controls. It committed to the weights of the strategy it played with for 10,000 steps at a time. During that commitment period, a separate learner network kept adjusting its weights in the background. At the end of each period, the system adopted the learner’s updated weights, and its play got better.
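A minimal sketch of that commitment pattern, with hypothetical names (`SYNC_EVERY`, `learn_step`) standing in for the paper’s details: the weights the system plays with stay frozen while a learner updates a separate copy every step, and the two are synced only at the end of each period.

```python
SYNC_EVERY = 10_000  # steps per commitment period (the paper uses 10,000)

def train(num_steps, learn_step):
    """Run the commit-then-adopt loop for num_steps steps."""
    online_weights = [0.0]                 # the learner adjusts these every step
    target_weights = list(online_weights)  # frozen copy the system plays with
    syncs = 0
    for step in range(1, num_steps + 1):
        learn_step(online_weights)         # background learning, every step
        if step % SYNC_EVERY == 0:
            # Commitment period ends: adopt the learner's updated weights.
            target_weights = list(online_weights)
            syncs += 1
    return syncs, target_weights
```

The point of the frozen copy is stability: the strategy being played does not shift under the learner’s feet mid-period.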
The commitment was necessary because a single step does not carry enough signal to learn anything meaningful. The learner needed a window of stored experience to sample randomly from, which minimized recency bias. If you only learn from your last step, you never really learn anything. You just live reactively.
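That window can be sketched as a replay memory. This is an assumption-laden toy version, not the paper’s implementation: a fixed-size window where old experiences fall out, and learning draws a uniform random batch from the whole window rather than the latest step.

```python
import random
from collections import deque

def make_memory(capacity):
    # deque with maxlen: once full, the oldest experience is dropped
    return deque(maxlen=capacity)

def sample_batch(memory, batch_size, rng=random):
    # Uniform sampling across the window: old and recent experiences
    # are equally likely to be drawn, which reduces recency bias.
    return rng.sample(list(memory), batch_size)
```

For example, after pushing 100 experiences into a 50-slot memory, only the most recent 50 remain, and a batch of 4 is drawn uniformly from among them.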
I tried shaping this into an essay and a guide, but it came out like robotic slop every time. So I’m moving on to keep getting inspired.
The takeaways from this paper for me are twofold. First, random sampling along a dimension reduces bias along that dimension. Random sampling across time reduces recency bias. Random sampling across topics reduces domain-expertise bias. And second, retrospectives should group enough experience together that you can actually learn something meaningful about the quality of your decisions.
Read “Human-level Control through Deep Reinforcement Learning”