The agent's goal is to learn the best action in each state, maximizing its cumulative reward. It learns with the Q-learning update rule, derived from the Bellman equation:
Q(s,a) ← Q(s,a) + α[R + γ·max_a′ Q(s′,a′) − Q(s,a)]
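The update above can be sketched as a few lines of Python. This is a minimal illustration, not the demo's actual implementation; the state/action counts, learning rate, and discount factor are assumed values chosen for the example.

```python
import numpy as np

# Hypothetical sizes for illustration: 25 states (e.g. a 5x5 grid), 4 actions.
n_states, n_actions = 25, 4
alpha, gamma = 0.1, 0.9                 # learning rate α and discount factor γ
Q = np.zeros((n_states, n_actions))     # one Q-value per (state, action) pair

def update(s, a, reward, s_next):
    """Apply one Q-learning update for the transition (s, a) -> (s_next, reward)."""
    td_target = reward + gamma * Q[s_next].max()   # R + γ·max_a′ Q(s′,a′)
    Q[s, a] += alpha * (td_target - Q[s, a])       # nudge Q(s,a) toward the target

# Example: one transition from state 0 via action 0 to state 1, earning reward 1.
update(0, 0, 1.0, 1)
```

After this single update, Q[0, 0] moves from 0 toward the target 1.0 by a factor of α, i.e. to 0.1; repeating such updates over many transitions makes the Q-values (the arrows in the visualization) converge toward the best action in each state.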
- The arrows show the agent's preference (Q-value) for each action.
- Use the brushes and settings to create custom challenges!
- Use the mouse wheel to zoom and drag to pan the view.