Noisy evaluation
Meta-learning with a noisy evaluation function:
Reinforcement learning is prone to noise, which makes it hard to reproduce an experiment that trains a reinforcement learning agent. An agent can produce different results even when the same seed is used for the random number generator. The primary reasons are the following (a short illustration follows the list):
- The environment is stochastic: for the same action, the state of the world and the reward can differ
- Agents with different weight initializations produce different results
- Parallel execution
- Distributed training
- Non-determinism from the ML framework
- Stochastic optimizer: the same input produces different outputs
- Stochastic environment and agent (RL): the same set of actions produces different outputs
- Sensitivity to hyperparameters
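As a minimal illustration of the stochastic-agent point above (the policy and variable names below are illustrative toys, not part of the CartPole code later in this section), sampling actions from a fixed set of probabilities already yields a different action sequence on every run for the exact same input, unless every random number generator in the stack is seeded and execution is fully deterministic.
import numpy as np

def stochastic_policy(observation, rng):
    # Toy policy: fixed action probabilities, sampled action
    action_probs = np.array([0.6, 0.4])
    return rng.choice(len(action_probs), p=action_probs)

obs = np.zeros(4)  # identical input for both runs
run_a = [stochastic_policy(obs, np.random.default_rng()) for _ in range(10)]
run_b = [stochastic_policy(obs, np.random.default_rng()) for _ in range(10)]
print(run_a)  # e.g. [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
print(run_b)  # almost certainly a different sequence for the same input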
Noise in the agent's output affects any meta-learning built on top of it. I will demonstrate one example of this noise by tuning several parameters of a CartPole reinforcement learning agent.
# Install dependencies
!pip install 'ray[rllib]' > /dev/null 2>&1
!pip install pyglet~=1.3.2 > /dev/null 2>&1
!pip install 'gym[atari]' > /dev/null 2>&1
#!apt-get install python-opengl -y > /dev/null 2>&1
#!apt install xvfb -y > /dev/null 2>&1
!pip install pyvirtualdisplay > /dev/null 2>&1
!pip install tensorflow==2.0.0-beta0 > /dev/null 2>&1
I use the Ray Tune library to find optimal values for the learning rate (lr) and the vf_share_layers option of the CartPole agent.
import ray
from ray import tune
import pandas as pd

config = {
    "env": 'CartPole-v0',
    "num_workers": 2,
    "vf_share_layers": tune.grid_search([True, False]),
    "lr": tune.grid_search([1e-4, 1e-5, 1e-6]),
    'seed': 91371
}

results = tune.run(
    'PPO',
    stop={
        'timesteps_total': 50000
    },
    config=config)

df = results.dataframe()
To demonstrate the noise, each trial is evaluated k=3 times. Results from a subset of the trials are plotted above. Note that the agent's output differs across repeats of the same configuration, lr = 1e-5 and vf_share_layers = False.
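One way to repeat each configuration is Ray Tune's num_samples argument, which re-runs every grid-search combination. The sketch below reuses the config defined above; the column names (config/lr, config/vf_share_layers, episode_reward_mean) are assumptions based on RLlib's default reporting.
# Repeat every grid-search configuration k times to expose the noise
k = 3
repeated = tune.run(
    'PPO',
    stop={'timesteps_total': 50000},
    config=config,
    num_samples=k)
df_repeated = repeated.dataframe()
# Spread of the reward across the k repeats of each configuration
print(df_repeated.groupby(['config/lr', 'config/vf_share_layers'])
      ['episode_reward_mean'].agg(['mean', 'std']))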
A custom agent is defined next using a small neural network; it is optimized with the policy gradient method.
import tensorflow as tf
from tensorflow.keras.layers import Dense, Flatten, Conv2D
from tensorflow.keras import Model
import numpy as np
import gym

class AgentMLPTF(Model):
    def __init__(self):
        super(AgentMLPTF, self).__init__()
        self.d1 = Dense(15, activation='tanh')
        self.d2 = Dense(2)

    def call(self, x):
        # 1. Define the policy
        batch = True
        if x.ndim == 1:
            batch = False
            x = np.expand_dims(x, axis=0)
        x = self.d1(x)
        action_logits = self.d2(x)
        # 2. Sample the policy to get an action
        action = tf.random.categorical(action_logits, 1)
        action = action.numpy().flatten()
        if not batch:
            action = action.item()  # np.asscalar is deprecated; .item() is equivalent
        return {"Action": action, "LogProbability": action_logits}
def get_episode_trajectory(env, agent, max_steps=1000):
    observation_list = []
    reward_list = []
    action_list = []
    value_list = []
    done = False
    obs = env.reset()
    for _ in range(max_steps):
        observation_list.append(obs)
        out = agent(obs)
        assert ("Action" in out), "The key 'Action' was missing from the agent's output."
        action = out["Action"]
        obs, rew, done, _ = env.step(action)
        reward_list.append(rew)
        action_list.append(action)
        if "Value" in out:
            value_list.append(out["Value"])
        if done:
            break
    ret = {
        "Observations": observation_list,
        "Actions": action_list,
        "Rewards": np.array(reward_list, dtype=np.float32)
    }
    if len(value_list) > 0:
        ret["Values"] = value_list
    return ret
def reward_to_go(rewards):
    return np.flip(np.cumsum(np.flip(rewards)))
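reward_to_go replaces each reward with the sum of rewards from that timestep to the end of the episode; a quick sanity check:
print(reward_to_go(np.array([1.0, 1.0, 1.0])))  # -> [3. 2. 1.]
print(reward_to_go(np.array([0.0, 2.0, 3.0])))  # -> [5. 5. 3.]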
def train_policy_grad(env, agent, num_epochs=300):
    optimizer = tf.keras.optimizers.Adam(lr=1e-2)
    log_reward = 0
    log_reward_list = []
    logging_period = 20
    for epoch in range(num_epochs):
        # get the training data
        traj = get_episode_trajectory(env, agent)
        obs = np.stack(traj["Observations"])
        rew = traj["Rewards"]
        actions = traj["Actions"]
        # compute 'reward-to-go'
        rew_2_go = reward_to_go(rew)
        # compute gradients + update weights
        with tf.GradientTape() as tape:
            logits = agent(obs)["LogProbability"]
            loss = loss_pg(actions, logits, rew_2_go)
        gradients = tape.gradient(loss, agent.trainable_variables)
        optimizer.apply_gradients(zip(gradients, agent.trainable_variables))
        # log the reward
        log_reward += np.sum(rew)
        if (epoch % logging_period) == 0:
            template = 'Training Epoch {}, Averaged Return: {}'
            print(template.format(epoch, log_reward / logging_period))
            log_reward_list.append(log_reward / logging_period)
            log_reward = 0
    return (range(0, num_epochs, logging_period), log_reward_list)
def loss_pg(actions, log_probs, returns):
    # Keep only the log-probability of the action that was actually taken
    action_masks = tf.one_hot(actions, 2, dtype=np.float64)
    log_probs = tf.reduce_sum(action_masks * tf.nn.log_softmax(log_probs), axis=1)
    return -tf.reduce_sum(returns * log_probs)

# Note: up to a constant scale factor (sum vs. mean), this is equivalent to:
def loss_pg2(actions, log_probs, returns):
    loss = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
    return tf.reduce_mean(returns * loss(actions, log_probs))
import gym
env_cartpole = gym.make('CartPole-v1')
agent_mlp_tf = AgentMLPTF()
(episodes, rewards) = train_policy_grad(env_cartpole, agent_mlp_tf)
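A minimal sketch for visualizing the learning curve, assuming matplotlib is available; episodes and rewards come from the training call above.
import matplotlib.pyplot as plt

plt.plot(list(episodes), rewards)
plt.xlabel('Training epoch')
plt.ylabel('Average return per logging period')
plt.title('CartPole policy gradient training')
plt.show()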
There are various ways we can deal with noise in meta-learning:
- Penalize parameters that produce output with high uncertainty or variability.
- Resample parameters with higher uncertainty to reduce noise and increase the quality of the meta-learner.
Algorithm (a sketch in code follows the list):
- Set $N$ = $NUM\_SAMPLES$
- Run every individual $N$ times
- Collect the noisy feedback for the $i$th individual: $f_{i}$
- Select the most promising individual via $\operatorname*{argmax}_i \bar{f_{i}} + \frac{\sigma_{i}}{\sqrt{N_{i}}}$, i.e., the mean feedback plus its standard error, so individuals with higher uncertainty are favored for resampling
- Resample the selected individual and update its mean and standard error
- Repeat until no new individuals need to be resampled; a threshold on the standard error can serve as an additional stopping criterion
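A minimal sketch of this procedure in code; select_fittest, evaluate, se_threshold, and the synthetic noisy fitness function are illustrative assumptions rather than part of the notebook.
import numpy as np

def select_fittest(individuals, evaluate, num_samples=3, max_rounds=20, se_threshold=0.5):
    # Run every individual num_samples times to get noisy feedback f_i
    scores = {i: [evaluate(ind) for _ in range(num_samples)]
              for i, ind in enumerate(individuals)}
    for _ in range(max_rounds):
        means = {i: np.mean(s) for i, s in scores.items()}
        std_errs = {i: np.std(s, ddof=1) / np.sqrt(len(s)) for i, s in scores.items()}
        # Optimistic score: mean feedback plus standard error, so uncertain
        # individuals are favored for resampling
        best = max(scores, key=lambda i: means[i] + std_errs[i])
        # Stop once the most promising individual's standard error is small enough
        if std_errs[best] <= se_threshold:
            return individuals[best]
        # Resample the selected individual to reduce its uncertainty
        scores[best].append(evaluate(individuals[best]))
    # Fallback after max_rounds: return the individual with the best mean feedback
    return individuals[max(scores, key=lambda i: np.mean(scores[i]))]

# Illustrative usage with a synthetic noisy fitness function
rng = np.random.default_rng(0)
candidates = [1e-4, 1e-5, 1e-6]  # e.g. candidate learning rates
noisy_fitness = lambda lr: -abs(np.log10(lr) + 5) + rng.normal(0, 0.2)
print(select_fittest(candidates, noisy_fitness))  # likely selects 1e-5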