⚡ Pluto.jl ⚡

Introduction

In the following notebooks, we'll mainly use the ReinforcementLearning.jl package to demonstrate how to generate figures in the book Reinforcement Learning: An Introduction(2nd). In case you haven't used this package before, you can visit juliareinforcementlearning.org for a detailed introduction. Besides, we'll also explain some basic concepts gradually when we meet them for the first time.

ReinforcementLearning.jl contains a collection of tools to describe and solve problems we want to handle in the reinforcement leanring field. Though we are mostly interested in traditional tabular methods here, it also contains many state-of-the-art algorithms. To use it, we can simply run the following code here:

12.3 μs

xxxxxxxxxx
 
using ReinforcementLearning

207 s

Note

As you might have noticed, it takes quite a long time to load this package for the first time (the good news is that it will be largedly decreased after `Julia@v1.6`). Once loaded, things should be very fast.

3.8 μs

Tic-Tac-Toe

In chapter 1.5, a simple game of Tic-Tac-Toe is introduced to illustrate the general idea of reinforcement learning. Before looking into the details about how to implement the monte carlo based policy, let's take a look at how the Tic-Tac-Toe environment is defined in ReinforcementLearning.jl.

9.3 μs

env

# TicTacToeEnv

## Traits

| Trait Type        |                                                                                                                                                   Value |
|:----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------:|
| NumAgentStyle     |                                                                                                               ReinforcementLearningBase.MultiAgent{2}() |
| DynamicStyle      |                                                                                                                  ReinforcementLearningBase.Sequential() |
| InformationStyle  |                                                                                                          ReinforcementLearningBase.PerfectInformation() |
| ChanceStyle       |                                                                                                               ReinforcementLearningBase.Deterministic() |
| RewardStyle       |                                                                                                              ReinforcementLearningBase.TerminalReward() |
| UtilityStyle      |                                                                                                                     ReinforcementLearningBase.ZeroSum() |
| ActionStyle       |                                                                                                               ReinforcementLearningBase.FullActionSet() |
| StateStyle        | (ReinforcementLearningBase.Observation{String}(), ReinforcementLearningBase.Observation{Int64}(), ReinforcementLearningBase.Observation{BitArray{3}}()) |
| DefaultStateStyle |                                                                                                         ReinforcementLearningBase.Observation{String}() |

## Is Environment Terminated?

No

## State Space

`ReinforcementLearningBase.WorldSpace{String}()`

## Action Space

`Base.OneTo(9)`

## Players

  * `ReinforcementLearningEnvironments.Cross()`
  * `ReinforcementLearningEnvironments.Nought()`

## Current Player

`ReinforcementLearningEnvironments.Cross()`

## Current State

```
...
...
...

```

xxxxxxxxxx
 
env = TicTacToeEnv()

92.2 ms

There are many important information provided above. First, the TicTacToeEnv is a zero-sum, two player environment. Sequential means each player takes an action alternatively. Since all the players can observe the same information (the board), it is an environment of perfect information. Note that each player only get the reward at the end of the game (-1, 0, or 1). So we call RewardStyle of this kind of environments TerminalReward(). At each step, only part of the actions are valid (the position of the board), so we say the ActionStyle of this env is of FullActionSet().

All these information are provided by traits. In later chapters, we'll see how to define these traits for new customized environments. Now let's get familiar with some basic interfaces defined with the environment first.

12.1 μs

...
...
...

xxxxxxxxxx
 
state(env) |> Text  # Here we use Text to show the string properly in Pluto

2.8 ms

By default, the state of the TicTacToeEnv is represented as a String. But some other types are also provided.

First, the number of possible valid boards is numerable. You might be interested in this blog to figure out the exact number. So we can use an integer to represent the state.

10.4 μs

xxxxxxxxxx
 
state(env, Observation{Int}())

7.7 ms

Another way to express the state is to use a one-hot bool Array to represent the state. The first and the second dimension mean the row and column of the board. The third dimension means the role in each cell (x, o or empty). This kind of representation is usually more friendly to neural network based algorithms.

3.0 μs

3×3×3 BitArray{3}:
[:, :, 1] =
 1  1  1
 1  1  1
 1  1  1

[:, :, 2] =
 0  0  0
 0  0  0
 0  0  0

[:, :, 3] =
 0  0  0
 0  0  0
 0  0  0

xxxxxxxxxx
 
state(env, Observation{BitArray{3}}())

7.0 ms

Each environment is a functional object which takes in an action and changes its internal state correspondingly.

2.6 μs

xox
ox.
xo.

xxxxxxxxxx
 
begin
    env(1)
    env(2)
    env(3)
    env(4)
    env(5)
    env(6)
    env(7)
    state(env) |> Text
end

13.7 μs

Now we can check some basic information about the env:

3.1 μs

true

2Int641

-1

xxxxxxxxxx
 
is_terminated(env), [reward(env, p) for p in players(env)]

27.7 μs

As you can see, the game is now terminated, and the reward of each player is 1, -1.

64.0 ms

ReinforcementLearningEnvironments.Cross

xxxxxxxxxx
 
reset!(env)

4.2 μs

...
...
...

xxxxxxxxxx
 
state(env) |> Text

11.2 μs

RandomPolicy

Among all the policies provided in ReinforcementLearning.jl, the simplest one is RandomPolicy.

4.0 μs

policy

RandomPolicy
├─ action_space => Nothing
└─ rng => Random._GLOBAL_RNG

xxxxxxxxxx
 
policy = RandomPolicy()

3.8 ms

Int641

xxxxxxxxxx
 
[policy(env) for _ in 1:10]

3.6 ms

...
...
...

xxxxxxxxxx
 
state(env) |> Text

9.8 μs

As you can see, each policy is also a functional object which takes in a env and returns an action.

Note

A policy will never change the internal state of an `env`. The `env` will only get updated after executing `env(action)`.

3.4 μs

Policy + Environment

Now that we have understood the two main concepts in reinforcement learning, we can roll out several simulations and collect some information we are interested in.

6.3 μs

EmptyHook

xxxxxxxxxx
 
run(policy, env, StopAfterEpisode(1))

416 ms

The above run function defined in ReinforcementLearning.jl is quite straight forward. The policy generates an action at each time step and feeds it into the env. The process continues until the end of an episode. Here StopAfterEpisode(1) is a built in stop condition. You can also see many other stop conditions in the doc.

You are encouraged to read the source code of this function. It's pretty simple (less than 30 lines) and easy to understand. I'll wait you here until you are finished.

If you have finished reading it, you'll notice that one important argument is missing in the above function call, the hook. Now we'll add the forth argument to collect the reward of each player in every episode.

11.0 μs

multi_agent_policy

MultiAgentManager
└─ agents => Dict

xxxxxxxxxx
 
multi_agent_policy = MultiAgentManager(
    (
        NamedPolicy(p=>RandomPolicy())
        for p in players(env)
    )...
)

127 ms

multi_agent_hook

MultiAgentHookhooksDictCrossTotalRewardPerEpisoderewardsFloat64reward

0.0

NoughtTotalRewardPerEpisoderewardsFloat64reward

0.0

xxxxxxxxxx
 
multi_agent_hook = MultiAgentHook(
    (
        p => TotalRewardPerEpisode()
        for p in players(env)
    )...
)

566 ms

MultiAgentHookhooksDictCrossTotalRewardPerEpisoderewardsFloat641

0.0

1.0

1.0

1.0

1.0

1.0

1.0

1.0

1.0

1.0

reward

0.0

NoughtTotalRewardPerEpisoderewardsFloat641

0.0

-1.0

-1.0

-1.0

-1.0

-1.0

-1.0

-1.0

-1.0

-1.0

reward

0.0

xxxxxxxxxx
 
run(multi_agent_policy, env, StopAfterEpisode(10), multi_agent_hook)

273 ms

Beside that we added a fourth argument in the run, another important difference is that we used a MultiAgentManager policy instead of a simple RandomPolicy. The reason is that we want different players use a separate policy and then we can collect different information of them separately.

Now let's take a look at the total reward of each player in the above 10 episodes:

4.7 μs

1Cross2Nought

xxxxxxxxxx
 
x, o = players(env)

72.0 ns

xxxxxxxxxx
 
using Plots

28.7 s

xxxxxxxxxx
 
begin
    plot(multi_agent_hook[x][]; label="x")
    plot!(multi_agent_hook[o][]; label="o")
end

17.0 s

Tackling the Tic-Tac-Toe Problem with Monte Carlo Prediction

Actually the Monte Carlo method mention in the first chapter to solve the Tic-Tac-Toe problem is explained in Chapter 5.1, so it's ok if you don't fully understand it right now for the first time. Just keep reading and turn back to this chapter once you finished chapter 5.

The intuition behind monte carlo prediction that, we use a table to record the estimated gain at each step. If such estimation is accurate, then we can simply look one step ahead and compare which state is of the largest gain. Then we just take the action which will lead to that state to maximize our reward.

TabularVApproximator

In ReinforcementLearning.jl, the fundamental components used to estimate state values (or state-action values, or any other values we are interested in) are called AbstractApproximator. In this simple example, we'll use the TabularVApproximator to estimate the state value.

2.3 ms

TabularApproximator{1,Array{Float64,1},InvDecay}tableFloat641

0.0

0.0

0.0

0.0

0.0

0.0

0.0

0.0

0.0

0.0

optimizerInvDecaygamma

1.0

stateIdDict

xxxxxxxxxx
 
begin
    using Flux
    A = TabularVApproximator(;n_state=10, init=0.0, opt=InvDecay(1.0))
    A
end

77.6 ms

A TabularVApproximator uses a Vector underlying to estimate the state value. Here the Tabular means that the state is required to be an Integer. The code above created a TabularVApproximator which has a table of 10 states, 1:10, and each state value is initialized to 0.0. We can get the estimation by:

10.8 μs

0.0

xxxxxxxxxx
 
A(1)

413 ns

And we can update the estimation via:

7.6 μs

0-dimensional view(::Array{Float64,1}, 2) with eltype Float64:
2.5

xxxxxxxxxx
 
update!(A, 2 => A(2) - 5.)

37.0 ms

Let's spend some time on what's happening here. The update! method provided in ReinforcementLearning.jl adopts the similar interface defined in Flux.Optimise. Here the 2 => A(2) - 5.) is something similar to Grad. And then we apply the optimiser (A.optimizer) to calculate the error we want to apply on the approximator. The default optimizer in TabularApproximator is InvDecay(1.0). The effect is the same with calculating the mean value of all the examples.

5.8 μs

true

xxxxxxxxxx
 
begin
    examples = 1:10
    for x in examples
        update!(A, 1 => A(1) - x)
    end
    A(1) == (0. #= the init value is 0. =# + sum(examples)) / (1+length(examples))
end

40.2 μs

By using the same interfaces defined in Flux.jl, we get a bunch of different optimizers free. You can take a look at all the supported optimizers defined in optimisers.jl. In future notebooks, we'll see how to apply some other optimizers beyond InvDecay.

4.5 μs

MonteCarloLearner

In ReinforcementLearning.jl, many different variants of monte carlo learners are supported. We'll skip the implementation detail for now. But to convince you that implementing such an algorithm is quite simple and straightforward in this package, we can take a look at the code snippet and compare it with the pseudocode on the book.

function _update!(
    ::EveryVisit,
    ::TabularVApproximator,
    ::NoSampling,
    L::MonteCarloLearner,
    t::AbstractTrajectory,
)
    S, R = t[:states], t[:rewards]
    V, G, γ = L.approximator, 0.0, L.γ
    for i in length(R):-1:1
        s, r = S[i], R[i]
        G = γ * G + r
        update!(V, s => V(s) - G)
    end
end

4.0 μs

MonteCarloLearner
├─ approximator => TabularApproximator
│  ├─ table => 5478-element Array{Float64,1}
│  └─ optimizer => InvDecay
│     ├─ gamma => 1.0
│     └─ state => IdDict
├─ γ => 1.0
├─ kind => ReinforcementLearningZoo.FirstVisit
└─ sampling => ReinforcementLearningZoo.NoSampling

xxxxxxxxxx
 
M = MonteCarloLearner(;
    approximator = TabularVApproximator(;n_state=5478, init=0., opt=InvDecay(1.0)),
    γ = 1.0,
    kind = FIRST_VISIT,
    sampling = NO_SAMPLING
)

15.4 ms

The MonteCarloLearner is just a wrapper of the TabularVApproximator here. We can feed the env to it and get the the state estimation.

10.6 μs

0.0

xxxxxxxxxx
 
M(1)

470 ns

MethodError: no method matching (::ReinforcementLearningCore.TabularApproximator{1,Array{Float64,1},Flux.Optimise.InvDecay})(::String)

Closest candidates are:

Any(!Matched::Int64) at /mnt/E4E0A9C0E0A998F6/github/ReinforcementLearningCore/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:28

(::ReinforcementLearningZoo.MonteCarloLearner{ReinforcementLearningCore.TabularApproximator{1,Array{Float64,1},Flux.Optimise.InvDecay},ReinforcementLearningZoo.FirstVisit,ReinforcementLearningZoo.NoSampling})(::String)@monte_carlo_learner.jl:45
(::ReinforcementLearningZoo.MonteCarloLearner{ReinforcementLearningCore.TabularApproximator{1,Array{Float64,1},Flux.Optimise.InvDecay},ReinforcementLearningZoo.FirstVisit,ReinforcementLearningZoo.NoSampling})(::ReinforcementLearningEnvironments.TicTacToeEnv)@monte_carlo_learner.jl:44
top-level scope@Local: 1[inlined]

xxxxxxxxxx
 
M(env)

---

Uh-oh! The MethodError above tells us that the MonteCarloLearner expects the input to be an Int but gets a String. The reason is that the default state style of env is Observation{String}(). So we need to set the default state style to Observation{Int}().

6.1 μs

# TicTacToeEnv |> DefaultStateStyleEnv

## Traits

| Trait Type        |                                                                                                                                                   Value |
|:----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------:|
| NumAgentStyle     |                                                                                                               ReinforcementLearningBase.MultiAgent{2}() |
| DynamicStyle      |                                                                                                                  ReinforcementLearningBase.Sequential() |
| InformationStyle  |                                                                                                          ReinforcementLearningBase.PerfectInformation() |
| ChanceStyle       |                                                                                                               ReinforcementLearningBase.Deterministic() |
| RewardStyle       |                                                                                                              ReinforcementLearningBase.TerminalReward() |
| UtilityStyle      |                                                                                                                     ReinforcementLearningBase.ZeroSum() |
| ActionStyle       |                                                                                                               ReinforcementLearningBase.FullActionSet() |
| StateStyle        | (ReinforcementLearningBase.Observation{String}(), ReinforcementLearningBase.Observation{Int64}(), ReinforcementLearningBase.Observation{BitArray{3}}()) |
| DefaultStateStyle |                                                                                                          ReinforcementLearningBase.Observation{Int64}() |

## Is Environment Terminated?

Yes

## State Space

`Base.OneTo(5478)`

## Action Space

`Base.OneTo(9)`

## Players

  * `ReinforcementLearningEnvironments.Cross()`
  * `ReinforcementLearningEnvironments.Nought()`

## Current Player

`ReinforcementLearningEnvironments.Nought()`

## Current State

```
3938
```

xxxxxxxxxx
 
E = DefaultStateStyleEnv{Observation{Int}()}(env)

11.3 ms

xxxxxxxxxx
 
state(E)

6.5 μs

Now that we have the MonteCarloLearner to estimate state values, how can we use it to generate actions? A simple approach is that, we try each valid action and see what state it leads to. Then we simply select the one which results in the state of the largest state value.

Note

Obviously this approach only applies to deterministic environments with a small action set.

But usually this approach is more suitable for testing instead of training. Because always selecting the best action may fall into local minima and hurt the performance. You'll see more discussions on exploration and exploitation in the next chapter. Here we use the EpsilonGreedy method to select action.

11.0 μs

explorer

EpsilonGreedyExplorer{:linear,false,Random._GLOBAL_RNG}ϵ_stable

0.1

ϵ_init

1.0

warmup_steps

decay_steps

step

rngRandom._GLOBAL_RNGis_training

true

xxxxxxxxxx
 
explorer = EpsilonGreedyExplorer(0.1)

61.2 ms

xxxxxxxxxx
 
begin
    values = [1, 2, 3, 1]
    N = 1000
    actions = [explorer(values) for _ in 1:N]
    plot([sum(actions .== i)/N for i in 1:length(values)])
end

90.5 ms

The above figure shows that with $ϵ$ set to 0.1, we have the probability of $ϵ$ to select an action randomly and $1 - ϵ$ to select the action of the highest value.

Below, we define a function to illustrate how to select an action given an environment.

5.4 μs

select_action (generic function with 1 method)

xxxxxxxxxx
 
function select_action(env, V)
    A = legal_action_space(env)
    values = map(A) do a
        V(child(env, a))
    end
    A[explorer(values)]
end

40.4 μs

Combining the MonteCarloLearner and the select_action function, we get our policy:

10.9 μs

VBasedPolicy
├─ learner => MonteCarloLearner
│  ├─ approximator => TabularApproximator
│  │  ├─ table => 5478-element Array{Float64,1}
│  │  └─ optimizer => InvDecay
│  │     ├─ gamma => 1.0
│  │     └─ state => IdDict
│  ├─ γ => 1.0
│  ├─ kind => ReinforcementLearningZoo.FirstVisit
│  └─ sampling => ReinforcementLearningZoo.NoSampling
└─ mapping => typeof(select_action)

xxxxxxxxxx
 
P = VBasedPolicy(;learner=M, mapping=select_action)

14.5 ms

ReinforcementLearningEnvironments.Cross

xxxxxxxxxx
 
reset!(E)

4.0 μs

xxxxxxxxxx
 
P(E)

27.1 ms

That's it!

Now we have a policy and we can use it to roll our simulations.

6.3 μs

EmptyHook

xxxxxxxxxx
 
run(P, E, StopAfterEpisode(10))

24.3 ms

Training

One main question we haven't answer is, how to train the policy?

Well, the usage is similar to the above one, the only difference is now we wrap the policy in an Agent, which is also an AbstractPolicy. An Agent is policy + trajectory, or people usually call it experience replay buffer.

7.8 μs

policies

MultiAgentManager
└─ agents => Dict

xxxxxxxxxx
 
policies = MultiAgentManager(
    (
        Agent(
            policy = NamedPolicy(
                p => VBasedPolicy(;
                    learner=MonteCarloLearner(;
                        approximator = TabularVApproximator(;
                            n_state=length(state_space(E)),
                            init=0.,
                            opt=InvDecay(1.0)
                        ),
                        γ = 1.0,
                        kind = FIRST_VISIT,
                        sampling = NO_SAMPLING
                    ),
                    mapping = select_action
                )
            ),
            trajectory =VectorSARTTrajectory(
                    ;state=Int,
                    action=Union{Int, NoOp},
                    reward=Int,
                    terminal=Bool
            )
        )
        for p in players(E)
    )...
)

285 ms

EmptyHook

xxxxxxxxxx
 
run(policies, E, StopAfterEpisode(100_000))

37.0 s

The interface is almost the same with the above one. Let me explain what's happening here.

First, the policies is a MultiAgentManager and the environment is a Sequential one. So at each step, the agent manager forward the env to its inner policies. Here each inner policy is an Agent. Then it will collect necessary information at each step (here we used a VectorSARTTrajectory to tell the agent to collect state, action, reward, and is_terminated info). Finally, after each episode, the agent send the trajectory to the inner VBasedPolicy, and then forwarded to the MonteCarloLearner to update the TabularVApproximator. Thanks to multiple dispatch each step above is fully customizable.

6.1 μs

Testing

Now we have our policy trained. We can happily run several self-plays and see the result. The only necessary change is to drop out the Agent wrapper.

3.5 μs

test_policies

MultiAgentManager
└─ agents => Dict

xxxxxxxxxx
 
test_policies = MultiAgentManager([p.policy for (x, p) in (policies.agents)]...)

226 ms

Here we use a hook to collect the reward of each episode for every player:

2.4 μs

hook

MultiAgentHookhooksDictCrossTotalRewardPerEpisoderewardsFloat64reward

0.0

NoughtTotalRewardPerEpisoderewardsFloat64reward

0.0

xxxxxxxxxx
 
hook = MultiAgentHook(
    (
        p => TotalRewardPerEpisode()
        for p in players(E)
    )...
)   

28.8 ms

MultiAgentHookhooksDictCrossTotalRewardPerEpisoderewardsFloat641

0.0

0.0

0.0

0.0

1.0

0.0

0.0

0.0

1.0

0.0

0.0

0.0

0.0

0.0

0.0

0.0

0.0

0.0

0.0

0.0

more91

1.0

0.0

0.0

1.0

-1.0

1.0

0.0

0.0

0.0

100

0.0

reward

0.0

NoughtTotalRewardPerEpisoderewardsFloat641

0.0

0.0

0.0

0.0

-1.0

0.0

0.0

0.0

-1.0

0.0

0.0

0.0

0.0

0.0

0.0

0.0

0.0

0.0

0.0

0.0

more91

-1.0

0.0

0.0

-1.0

1.0

-1.0

0.0

0.0

0.0

100

0.0

reward

0.0

xxxxxxxxxx
 
run(test_policies, E, StopAfterEpisode(100), hook)

217 ms

xxxxxxxxxx
 
begin
    plot(hook[x][]; label="x")
    plot!(hook[o][]; label="o")
end

2.1 ms

As you can see, in most cases, we reach a tie. But why there're still some cases that they don't reach a tie?

Is it because we're not training for enough time?

Maybe. But a more possible reason is that we're still using the EpsilonGreedyExplorer in the testing mode.

I leave it as an exercise to change the select_action to a greedy version and confirm that our trained policy reaches a tie everytime.

7.7 μs