Introduction
In the following notebooks, we'll mainly use the ReinforcementLearning.jl package to demonstrate how to reproduce the figures in the book Reinforcement Learning: An Introduction (2nd edition). If you haven't used this package before, you can visit juliareinforcementlearning.org for a detailed introduction. We'll also explain some basic concepts as we meet them for the first time.
ReinforcementLearning.jl contains a collection of tools to describe and solve the problems we want to handle in the reinforcement learning field. Though we are mostly interested in traditional tabular methods here, it also contains many state-of-the-art algorithms. To use it, simply run the following code:
using ReinforcementLearning
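In case the package (or the Plots and Flux packages used later in this notebook) is not installed yet, you can add it first in the usual way:
```julia
# Run once to install the packages used in this notebook
using Pkg
Pkg.add(["ReinforcementLearning", "Plots", "Flux"])
```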
Note
As you might have noticed, it takes quite a long time to load this package for the first time (the good news is that the load time is greatly reduced after `Julia@v1.6`). Once loaded, things should be very fast.
Tic-Tac-Toe
In Chapter 1.5, a simple game of Tic-Tac-Toe is introduced to illustrate the general idea of reinforcement learning. Before looking into the details of how to implement the Monte Carlo based policy, let's take a look at how the Tic-Tac-Toe environment is defined in ReinforcementLearning.jl.
# TicTacToeEnv
## Traits
| Trait Type | Value |
|:----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------:|
| NumAgentStyle | ReinforcementLearningBase.MultiAgent{2}() |
| DynamicStyle | ReinforcementLearningBase.Sequential() |
| InformationStyle | ReinforcementLearningBase.PerfectInformation() |
| ChanceStyle | ReinforcementLearningBase.Deterministic() |
| RewardStyle | ReinforcementLearningBase.TerminalReward() |
| UtilityStyle | ReinforcementLearningBase.ZeroSum() |
| ActionStyle | ReinforcementLearningBase.FullActionSet() |
| StateStyle | (ReinforcementLearningBase.Observation{String}(), ReinforcementLearningBase.Observation{Int64}(), ReinforcementLearningBase.Observation{BitArray{3}}()) |
| DefaultStateStyle | ReinforcementLearningBase.Observation{String}() |
## Is Environment Terminated?
No
## State Space
`ReinforcementLearningBase.WorldSpace{String}()`
## Action Space
`Base.OneTo(9)`
## Players
* `ReinforcementLearningEnvironments.Cross()`
* `ReinforcementLearningEnvironments.Nought()`
## Current Player
`ReinforcementLearningEnvironments.Cross()`
## Current State
```
...
...
...
```
env = TicTacToeEnv()
A lot of important information is provided above. First, the TicTacToeEnv is a zero-sum, two-player environment. Sequential means the players take actions alternately. Since both players can observe the same information (the board), it is an environment of perfect information. Note that each player only gets a reward at the end of the game (-1, 0, or 1), so we say the RewardStyle of this kind of environment is TerminalReward(). At each step, only some of the actions are valid (the empty positions on the board), so the ActionStyle of this env is FullActionSet().
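If you want to check any of these traits programmatically, each of them is also a plain function that can be called on the env (a quick sketch; the expected results follow the table above):
```julia
# Every trait in the table above can be queried directly on the env
RewardStyle(env)         # TerminalReward()
ActionStyle(env)         # FullActionSet()
UtilityStyle(env)        # ZeroSum()
DefaultStateStyle(env)   # Observation{String}()
legal_action_space(env)  # the actions that are currently valid
```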
All this information is provided by traits. In later chapters, we'll see how to define these traits for new, customized environments. For now, let's get familiar with some basic interfaces defined on the environment.
...
...
...
state(env) |> Text # Here we use Text to show the string properly in Pluto
By default, the state of the TicTacToeEnv is represented as a String, but some other representations are also provided.
First, the number of possible valid boards is finite. You might be interested in this blog to figure out the exact number. So we can use an integer to represent the state.
1
state(env, Observation{Int}())
Another way to express the state is to use a one-hot Bool array. The first and second dimensions correspond to the row and column of the board, and the third dimension indicates what occupies each cell (x, o, or empty). This kind of representation is usually more friendly to neural network based algorithms.
3×3×3 BitArray{3}:
[:, :, 1] =
1 1 1
1 1 1
1 1 1
[:, :, 2] =
0 0 0
0 0 0
0 0 0
[:, :, 3] =
0 0 0
0 0 0
0 0 0
state(env, Observation{BitArray{3}}())
Each environment is a functional object which takes in an action and changes its internal state accordingly.
xox
ox.
xo.
begin
env(1)
env(2)
env(3)
env(4)
env(5)
env(6)
env(7)
state(env) |> Text
end
Now we can check some basic information about the env:
(true, [1, -1])
is_terminated(env), [reward(env, p) for p in players(env)]
As you can see, the game has now terminated, and the rewards of the two players are 1 and -1, respectively.
reset!(env)
...
...
...
state(env) |> Text
RandomPolicy
Among all the policies provided in ReinforcementLearning.jl, the simplest one is the RandomPolicy.
RandomPolicy
├─ action_space => Nothing
└─ rng => Random._GLOBAL_RNG
policy = RandomPolicy()
[5, 3, 1, 7, 2, 8, 7, 5, 9, 7]
[policy(env) for _ in 1:10]
...
...
...
state(env) |> Text
As you can see, each policy is also a functional object which takes in an env and returns an action.
Note
A policy will never change the internal state of an `env`. The `env` will only get updated after executing `env(action)`.
Policy + Environment
Now that we understand the two main concepts in reinforcement learning, we can roll out several simulations and collect the information we are interested in.
run(policy, env, StopAfterEpisode(1))
The run function defined in ReinforcementLearning.jl above is quite straightforward. The policy generates an action at each time step and feeds it into the env. The process continues until the end of an episode. Here StopAfterEpisode(1) is a built-in stop condition; you can find many other stop conditions in the doc.
You are encouraged to read the source code of this function. It's pretty simple (less than 30 lines) and easy to understand. I'll wait here until you've finished.
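In case you'd rather see the gist right away, here is a heavily simplified sketch of the idea (not the actual source, which also handles hooks and dispatches on different policy and environment types):
```julia
# A heavily simplified sketch of the run loop (not the real implementation)
function sketch_run(policy, env, stop_condition)
    while true
        reset!(env)
        while !is_terminated(env)
            action = policy(env)                    # the policy picks an action
            env(action)                             # the env transitions
            stop_condition(policy, env) && return   # e.g. StopAfterEpisode(1)
        end
    end
end
```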
If you have finished reading it, you'll notice that one important argument was missing in the above function call: the hook. Now we'll add this fourth argument to collect the reward of each player in every episode.
MultiAgentManager
└─ agents => Dict
multi_agent_policy = MultiAgentManager(
(
NamedPolicy(p=>RandomPolicy())
for p in players(env)
)...
)
0.0
0.0
multi_agent_hook = MultiAgentHook(
(
p => TotalRewardPerEpisode()
for p in players(env)
)...
)
(hook output: the total reward of each player in every episode — with two random players, one player's rewards are mostly 1.0 and the other's mostly -1.0, with a couple of ties)
run(multi_agent_policy, env, StopAfterEpisode(10), multi_agent_hook)
Besides adding a fourth argument to run, another important difference is that we used a MultiAgentManager policy instead of a simple RandomPolicy. The reason is that we want different players to use separate policies, so that we can collect their information separately.
Now let's take a look at the total reward of each player in the above 10 episodes:
x, o = players(env)
using Plots
begin
plot(multi_agent_hook[x][]; label="x")
plot!(multi_agent_hook[o][]; label="o")
end
Tackling the Tic-Tac-Toe Problem with Monte Carlo Prediction
Actually, the Monte Carlo method mentioned in the first chapter to solve the Tic-Tac-Toe problem is only explained in Chapter 5.1, so it's OK if you don't fully understand it right now. Just keep reading and come back to this chapter once you've finished Chapter 5.
The intuition behind Monte Carlo prediction is that we use a table to record the estimated return of each state. If the estimates are accurate, we can simply look one step ahead, compare which reachable state has the largest estimated return, and take the action that leads to that state to maximize our reward.
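Concretely, the tabular Monte Carlo prediction update from the book has the form below, where G_t is the return observed after visiting state S_t and α is a step size; with α = 1/N(S_t) it is exactly the average of all observed returns, which matches the InvDecay(1.0) optimizer we'll use below.
```math
V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]
```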
TabularVApproximator
In ReinforcementLearning.jl, the fundamental components used to estimate state values (or state-action values, or any other values we are interested in) are called AbstractApproximators. In this simple example, we'll use the TabularVApproximator to estimate the state values.
(output: a TabularVApproximator with a 10-element table of zeros and an InvDecay(1.0) optimizer)
begin
using Flux
A = TabularVApproximator(;n_state=10, init=0.0, opt=InvDecay(1.0))
A
end
A TabularVApproximator uses a Vector under the hood to store the state-value estimates. Here Tabular means that the state is required to be an Integer. The code above creates a TabularVApproximator with a table of 10 states, 1:10, each initialized to 0.0. We can get the estimate of a state with:
0.0
A(1)
And we can update the estimation via:
0-dimensional view(::Array{Float64,1}, 2) with eltype Float64:
2.5
update!(A, 2 => A(2) - 5.)
Let's spend some time on what's happening here. The update! method provided in ReinforcementLearning.jl adopts an interface similar to the one defined in Flux.Optimise. Here 2 => A(2) - 5. plays the role of a gradient: the state paired with the error between the current estimate and the target. The optimizer (A.optimizer) is then applied to this error to compute the actual change applied to the approximator. The default optimizer in TabularApproximator is InvDecay(1.0), whose effect is the same as taking the mean value of all the targets seen so far.
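To convince yourself, here is the same claim in plain Julia with a few hypothetical target values (the approximator-based check follows right below):
```julia
# InvDecay(1.0) scales the n-th error by 1/(n + 1), which turns the updates
# into a running mean over the initial value and all targets seen so far.
let targets = [4.0, 6.0, 11.0]   # hypothetical targets
    v = 0.0                      # same init value as the approximator
    for (n, g) in enumerate(targets)
        v += (g - v) / (n + 1)
    end
    v ≈ (0.0 + sum(targets)) / (1 + length(targets))  # true
end
```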
true
begin
examples = 1:10
for x in examples
update!(A, 1 => A(1) - x)
end
A(1) == (0. #= the init value is 0. =# + sum(examples)) / (1+length(examples))
end
By using the same interface defined in Flux.jl, we get a bunch of different optimizers for free. You can take a look at all the supported optimizers defined in optimisers.jl. In future notebooks, we'll see how to apply optimizers beyond InvDecay.
MonteCarloLearner
In ReinforcementLearning.jl, many different variants of Monte Carlo learners are supported. We'll skip the implementation details for now. But to convince you that implementing such an algorithm is quite simple and straightforward in this package, take a look at the following code snippet and compare it with the pseudocode in the book.
function _update!(
::EveryVisit,
::TabularVApproximator,
::NoSampling,
L::MonteCarloLearner,
t::AbstractTrajectory,
)
S, R = t[:states], t[:rewards]
V, G, γ = L.approximator, 0.0, L.γ
for i in length(R):-1:1
s, r = S[i], R[i]
G = γ * G + r
update!(V, s => V(s) - G)
end
end
MonteCarloLearner
├─ approximator => TabularApproximator
│ ├─ table => 5478-element Array{Float64,1}
│ └─ optimizer => InvDecay
│ ├─ gamma => 1.0
│ └─ state => IdDict
├─ γ => 1.0
├─ kind => ReinforcementLearningZoo.FirstVisit
└─ sampling => ReinforcementLearningZoo.NoSampling
M = MonteCarloLearner(;
approximator = TabularVApproximator(;n_state=5478, init=0., opt=InvDecay(1.0)),
γ = 1.0,
kind = FIRST_VISIT,
sampling = NO_SAMPLING
)
The MonteCarloLearner is just a wrapper around the TabularVApproximator here. We can feed a state (or the env itself) to it and get the state-value estimate.
0.0
M(1)
MethodError: no method matching (::ReinforcementLearningCore.TabularApproximator{1,Array{Float64,1},Flux.Optimise.InvDecay})(::String)
Closest candidates are:
Any(!Matched::Int64) at /mnt/E4E0A9C0E0A998F6/github/ReinforcementLearningCore/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:28
- (::ReinforcementLearningZoo.MonteCarloLearner{ReinforcementLearningCore.TabularApproximator{1,Array{Float64,1},Flux.Optimise.InvDecay},ReinforcementLearningZoo.FirstVisit,ReinforcementLearningZoo.NoSampling})(::String)@monte_carlo_learner.jl:45
- (::ReinforcementLearningZoo.MonteCarloLearner{ReinforcementLearningCore.TabularApproximator{1,Array{Float64,1},Flux.Optimise.InvDecay},ReinforcementLearningZoo.FirstVisit,ReinforcementLearningZoo.NoSampling})(::ReinforcementLearningEnvironments.TicTacToeEnv)@monte_carlo_learner.jl:44
- top-level scope@Local: 1[inlined]
M(env)
Uh-oh! The MethodError above tells us that the MonteCarloLearner expects the input state to be an Int but got a String. The reason is that the default state style of env is Observation{String}(). So we need to set the default state style to Observation{Int}().
# TicTacToeEnv |> DefaultStateStyleEnv
## Traits
| Trait Type | Value |
|:----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------:|
| NumAgentStyle | ReinforcementLearningBase.MultiAgent{2}() |
| DynamicStyle | ReinforcementLearningBase.Sequential() |
| InformationStyle | ReinforcementLearningBase.PerfectInformation() |
| ChanceStyle | ReinforcementLearningBase.Deterministic() |
| RewardStyle | ReinforcementLearningBase.TerminalReward() |
| UtilityStyle | ReinforcementLearningBase.ZeroSum() |
| ActionStyle | ReinforcementLearningBase.FullActionSet() |
| StateStyle | (ReinforcementLearningBase.Observation{String}(), ReinforcementLearningBase.Observation{Int64}(), ReinforcementLearningBase.Observation{BitArray{3}}()) |
| DefaultStateStyle | ReinforcementLearningBase.Observation{Int64}() |
## Is Environment Terminated?
Yes
## State Space
`Base.OneTo(5478)`
## Action Space
`Base.OneTo(9)`
## Players
* `ReinforcementLearningEnvironments.Cross()`
* `ReinforcementLearningEnvironments.Nought()`
## Current Player
`ReinforcementLearningEnvironments.Nought()`
## Current State
```
3938
```
E = DefaultStateStyleEnv{Observation{Int}()}(env)
3938
state(E)
Now that we have the MonteCarloLearner to estimate state values, how can we use it to generate actions? A simple approach is to try each valid action, see which state it leads to, and then select the action that results in the state with the largest estimated value.
Note
Obviously this approach only applies to deterministic environments with a small action set.
However, this approach is usually more suitable for testing than for training, because always selecting the best action may get stuck in a local optimum and hurt performance. You'll see more discussion of exploration vs. exploitation in the next chapter. Here we use the EpsilonGreedy method to select actions.
(output: an EpsilonGreedyExplorer with ϵ = 0.1)
explorer = EpsilonGreedyExplorer(0.1)
begin
values = [1, 2, 3, 1]
N = 1000
actions = [explorer(values) for _ in 1:N]
plot([sum(actions .== i)/N for i in 1:length(values)])
end
The above figure shows that with ϵ = 0.1, the action with the largest value (index 3 here) is selected with probability roughly 0.9 + 0.1/4 ≈ 0.925, while each of the other actions is selected with probability roughly 0.1/4 = 0.025.
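As a quick sanity check on those numbers:
```julia
# Expected selection probabilities for EpsilonGreedyExplorer(0.1) over 4 candidate values
let ϵ = 0.1, n = 4
    p_best  = (1 - ϵ) + ϵ / n   # ≈ 0.925 for the action with the largest value
    p_other = ϵ / n             # ≈ 0.025 for each of the other actions
    (p_best, p_other)
end
```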
Below, we define a function to illustrate how to select an action given an environment.
select_action (generic function with 1 method)
function select_action(env, V)
    # All the currently valid actions
    A = legal_action_space(env)
    # Estimated value of the state each action leads to
    # (child(env, a) creates an independent copy of env with action a applied)
    values = map(A) do a
        V(child(env, a))
    end
    # Let the explorer pick (mostly the best action, occasionally a random one)
    A[explorer(values)]
end
Combining the MonteCarloLearner and the select_action function, we get our policy:
VBasedPolicy
├─ learner => MonteCarloLearner
│ ├─ approximator => TabularApproximator
│ │ ├─ table => 5478-element Array{Float64,1}
│ │ └─ optimizer => InvDecay
│ │ ├─ gamma => 1.0
│ │ └─ state => IdDict
│ ├─ γ => 1.0
│ ├─ kind => ReinforcementLearningZoo.FirstVisit
│ └─ sampling => ReinforcementLearningZoo.NoSampling
└─ mapping => typeof(select_action)
P = VBasedPolicy(;learner=M, mapping=select_action)
reset!(E)
7
P(E)
That's it!
Now we have a policy and we can use it to roll out simulations.
run(P, E, StopAfterEpisode(10))
Training
One main question we haven't answered yet is: how do we train the policy?
Well, the usage is similar to the above; the only difference is that we now wrap the policy in an Agent, which is also an AbstractPolicy. An Agent is policy + trajectory, where the trajectory is what people usually call an experience replay buffer.
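For a single agent, the idea looks roughly like this (a sketch reusing the M and select_action defined above; the multi-agent version we actually run follows below):
```julia
# An Agent = a policy + a trajectory (the experience buffer it fills while acting)
agent = Agent(
    policy = VBasedPolicy(; learner = M, mapping = select_action),
    trajectory = VectorSARTTrajectory(; state = Int, action = Int, reward = Int, terminal = Bool),
)
```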
MultiAgentManager
└─ agents => Dict
policies = MultiAgentManager(
(
Agent(
policy = NamedPolicy(
p => VBasedPolicy(;
learner=MonteCarloLearner(;
approximator = TabularVApproximator(;
n_state=length(state_space(E)),
init=0.,
opt=InvDecay(1.0)
),
γ = 1.0,
kind = FIRST_VISIT,
sampling = NO_SAMPLING
),
mapping = select_action
)
),
            trajectory = VectorSARTTrajectory(;
                state=Int,
                action=Union{Int, NoOp},
                reward=Int,
                terminal=Bool
            )
)
for p in players(E)
)...
)
run(policies, E, StopAfterEpisode(100_000))
The interface is almost the same as the one above. Let me explain what's happening here.
First, policies is a MultiAgentManager and the environment is a Sequential one, so at each step the agent manager forwards the env to the inner policy of the current player. Each inner policy here is an Agent, which collects the necessary information at every step (we used a VectorSARTTrajectory to tell the agent to collect the state, action, reward, and is_terminated info). Finally, at the end of each episode, the agent sends the trajectory to the inner VBasedPolicy, which forwards it to the MonteCarloLearner to update the TabularVApproximator. Thanks to multiple dispatch, each step above is fully customizable.
Testing
Now we have our policy trained. We can happily run several self-play games and see the result. The only necessary change is to drop the Agent wrapper.
MultiAgentManager
└─ agents => Dict
test_policies = MultiAgentManager([agent.policy for (player, agent) in policies.agents]...)
Here we use a hook to collect the reward of each episode for every player:
0.0
0.0
hook = MultiAgentHook(
(
p => TotalRewardPerEpisode()
for p in players(E)
)...
)
(hook output: the total reward of each player in every one of the 100 test episodes — mostly 0.0, i.e. ties, with an occasional 1.0 / -1.0)
run(test_policies, E, StopAfterEpisode(100), hook)
begin
plot(hook[x][]; label="x")
plot!(hook[o][]; label="o")
end
As you can see, in most cases we reach a tie. But why are there still some cases where the players don't reach a tie?
Is it because we haven't trained for long enough?
Maybe. But a more likely reason is that we're still using the EpsilonGreedyExplorer in testing mode.
I leave it as an exercise to change select_action to a greedy version and confirm that our trained policy reaches a tie every time.
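If you get stuck, one possible direction is a greedy variant that reuses the same legal_action_space and child helpers as select_action above:
```julia
# A greedy variant for testing: always pick the action leading to the state
# with the highest estimated value (no exploration at all).
function select_greedy_action(env, V)
    A = legal_action_space(env)
    values = map(A) do a
        V(child(env, a))
    end
    A[argmax(values)]
end
```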