A Draft Design of Distributed Reinforcement Learning in Julia
I've been thinking for a while about how to design a distributed reinforcement learning package in Julia. Recently I read through the source code of some packages again, together with some other resources included here by Joel. Although I still don't have a very clear design, I would like to write down my thoughts here in case they are useful to someone else.
The abstractions for reinforcement learning in rllib are quite straightforward. You may refer to RLlib: Abstractions for Distributed Reinforcement Learning for the details of what I'm discussing next.
- BaseEnv
- Policy Graph
- Policy Evaluator
- Policy Optimizer
- Agent
It has been demonstrated that, by using the concepts above, most of the popular reinforcement learning algorithms can be implemented in rllib. However, it's not that easy to port those concepts directly into Julia. One of the most important reasons is that we don't have an existing foundational package like Ray, and the infrastructure for parallel programming in Julia is quite different. In the next section, I will try to adapt those concepts to Julia and describe, at a very high level, how to implement some typical algorithms.
Actors Actors Actors
Environment
Let's start with the environment part first. Environments in RL are relatively independent. By treating all environments asynchronously, rllib shows that it is very convenient to introduce new environments. So here we also treat environments as actors running asynchronously.
First, we introduce the concept of AbstractEnv.
abstract type AbstractEnv end

# Apply the action(s) of the agent(s) to the environment
function interact!(env, actions...) end
# Get the current observation of the environment from the view of `role`
function observe(env, role) end
# Reset the environment to its initial state
function reset!(env) end
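To make the interface concrete, here is a toy ExampleEnv (a purely hypothetical single-agent environment, used only for illustration; the observation is returned as a (reward, done, state) tuple to match the usage later on):

# A toy environment whose state is a single integer (illustration only)
mutable struct ExampleEnv <: AbstractEnv
    state::Int
    reward::Float64
    done::Bool
end

# `init_configs` is ignored in this toy example
ExampleEnv(init_configs) = ExampleEnv(0, 0.0, false)

function interact!(env::ExampleEnv, action)
    env.state += action
    env.reward = -abs(env.state)      # reward for staying close to 0
    env.done = abs(env.state) > 10
    return env
end

observe(env::ExampleEnv, role = :default) = (env.reward, env.done, env.state)

function reset!(env::ExampleEnv)
    env.state, env.reward, env.done = 0, 0.0, false
    return env
end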
Then we can wrap it into an actor:
env_actor = @actor begin
    env = ExampleEnv(init_configs)
    while true
        sender, msg = receive()
        @match msg begin
            (:interact!, actions) => interact!(env, actions...)
            (:observe, role) => tell(sender, observe(env, role))
            (:reset!,) => reset!(env)
            # do something else
            (:ping,) => tell(sender, :pong)
        end
    end
end
# The code above can be further simplified by introducing an `@wrap_actor` macro
env_actor = @wrap_actor ExampleEnv(init_configs)
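As a rough illustration of how such an actor abstraction could be built with nothing but Julia's built-in Channel and @async, here is a minimal sketch (Actor, tell and spawn_actor are hypothetical names of mine, not an existing API; @actor/@wrap_actor could lower to something along these lines):

# A minimal actor sketch on top of Channel/Task (assumption, not an existing package)
struct Actor
    mailbox::Channel{Any}
end

# Sending a message just puts it into the actor's mailbox
tell(actor::Actor, msg) = put!(actor.mailbox, msg)

# Spawn a task that keeps handling incoming messages with `handler`
function spawn_actor(handler; buffer = 32)
    mailbox = Channel{Any}(buffer)
    @async for msg in mailbox
        handler(msg)
    end
    return Actor(mailbox)
end

With something like this, the receive()/tell calls inside an @actor block would be rewritten into reads from and writes to the mailbox, and @wrap_actor would generate the message-dispatching handler automatically.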
Policy
Next, we can have a PolicyGraph object like the one in rllib:
abstract type AbstractPolicy end

# Select an action given an observation
function act(pg, obs) end
# Update the policy from a batch of experience
function learn(pg, batch) end
# Get/set the parameters, used to synchronize weights between actors
function set_weights(pg, weights) end
function get_weights(pg) end
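To make the interface concrete, a trivial random policy (hypothetical, for illustration only) could implement it like this:

# A toy stateless policy implementing the interface above (illustration only)
struct RandomPolicy{A} <: AbstractPolicy
    actions::A    # the set of legal actions
end

act(pg::RandomPolicy, obs) = rand(pg.actions)
learn(pg::RandomPolicy, batch) = nothing       # nothing to learn
get_weights(pg::RandomPolicy) = ()             # no parameters
set_weights(pg::RandomPolicy, weights) = nothing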
Evaluator
An evaluator combines a Policy and an Environment.
abstract type AbstractEvaluator end

struct ExampleEvaluator <: AbstractEvaluator
    env_actor
    policy
    #...
    ExampleEvaluator(env, policy, params...) = new(@wrap_actor(env), policy, params...)
end

# Generate a batch of experience by running the policy in the (wrapped) environment
function sample(ev::AbstractEvaluator) end
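As a sketch of what a concrete sample method might do (using the @await macro that appears later in the Ape-X pseudocode, which is assumed to turn a call on an actor into a send/receive round trip), one episode of experience could be collected like this:

# Collect one episode of transitions (sketch only)
function sample(ev::ExampleEvaluator)
    trajectory = []
    reset!(ev.env_actor)                             # fire-and-forget message
    while true
        r, d, s = @await observe(ev.env_actor, :default)
        d && return trajectory
        a = act(ev.policy, s)
        interact!(ev.env_actor, a)                   # another message to the env actor
        push!(trajectory, (s, a, r))
    end
end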
Again, we can wrap it into an actor.
ev_actor = @wrap_actor ExampleEvaluator(env, policy, params)
When the ev_actor is created, an environment actor will also be created (on the same process by default).
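To place an actor on a different process instead, one option (a sketch assuming only the standard Distributed library; spawn_remote_actor is a hypothetical helper of mine, not an existing API) is to back the mailbox with a RemoteChannel:

using Distributed

# Spawn an actor-like message loop on worker `pid` (sketch only).
# The mailbox is a RemoteChannel, so any process can send messages to it.
# Note: `handler` must be defined on the target worker as well.
function spawn_remote_actor(handler, pid)
    mailbox = RemoteChannel(() -> Channel{Any}(32), pid)
    remote_do(pid, mailbox) do mb
        while true
            handler(take!(mb))
        end
    end
    return mailbox
end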
Optimizer
An optimizer interacts with evaluators and takes care of things like parameter updating and distributed sampling.
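Analogously to the interfaces above, a minimal optimizer interface might look like the following sketch (the step function is the one used in the Ape-X pseudocode below; the exact names are illustrative assumptions):

abstract type AbstractOptimizer end

# Run one round of optimization: pull samples from the evaluators,
# update the policy parameters, and broadcast the new weights back
function step(opt::AbstractOptimizer) end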
Demo
Putting all the components together, we have the following graph showing how each component works in the Ape-X algorithm.
TODO: Add figure
And the pseudocode is:
# 1. create environments
env_actors = [@wrap_actor CartPoleEnv(configs) for _ in 1:n_workers]  # n_workers is left unspecified
# 2. create policies
policy = DQNPolicy(configs)
# 3. define evaluators
mutable struct ApeXEvaluator
    env_actor
    policy
    batch_size
    n_samples
    replay_buffer
end

function sample(ev::ApeXEvaluator)
    while true
        if ev.n_samples >= ev.batch_size
            return sample(ev.replay_buffer, ev.batch_size)
        else
            r, d, s = @await observe(ev.env_actor)  # it will be translated into send/receive
            a = act(ev.policy, s)
            interact!(ev.env_actor, a)              # also translated into send/receive
            # update replay_buffer (and ev.n_samples)
            # calc loss
            # update grad
        end
    end
end
# one evaluator actor per environment actor, each holding a copy of the policy
ev_actors = @smart_actors ApeXEvaluator env_actors policy
# 4. optimizer
mutable struct ApeXOptimizer
    local_ev
    remote_evs
end

function step(optimizer::ApeXOptimizer)
    samples = @await get_high_priority_samples(optimizer.remote_evs)
    # evaluate optimizer.local_ev on the samples (compute gradients, update weights)
    # broadcast the updated local weights to the remote evaluators
    # update the priorities in the replay buffers
end