Noisy evaluation
Meta-learning with a noisy evaluation function:
Reinforcement learning is prone to noise, which makes it hard to reproduce an experiment that trains a reinforcement learning agent. An agent can produce different results even when the same seed is used for the random number generator. The primary reasons are the following (a short illustration follows the list):
- The environment is stochastic: for the same action, the state of the world and the reward can differ
- Agents with different weight initializations produce different results
- Parallel execution
- Distributed training
- Non-determinism from the ML framework
- Stochastic optimizer: the same input produces different outputs
- Stochastic environment and agent (RL): the same set of actions produces different outputs
- Sensitivity to hyperparameters
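As a minimal illustration of the stochastic-agent point above (the policy and variable names below are illustrative toys, not part of the CartPole code later in this section), sampling actions from a fixed set of probabilities already yields a different action sequence on every run for the exact same input, unless every random number generator in the stack is seeded and execution is fully deterministic.
import numpy as np

def stochastic_policy(observation, rng):
    # Toy policy: fixed action probabilities, sampled action
    action_probs = np.array([0.6, 0.4])
    return rng.choice(len(action_probs), p=action_probs)

obs = np.zeros(4)  # identical input for both runs
run_a = [stochastic_policy(obs, np.random.default_rng()) for _ in range(10)]
run_b = [stochastic_policy(obs, np.random.default_rng()) for _ in range(10)]
print(run_a)  # e.g. [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
print(run_b)  # almost certainly a different sequence for the same input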
Noise in the agent's output affects any meta-learning built on top of it. I will demonstrate one example of this noise by tuning several parameters of a CartPole reinforcement learning agent.
# Install dependencies
!pip install 'ray[rllib]' > /dev/null 2>&1
!pip install pyglet~=1.3.2 > /dev/null 2>&1
!pip install 'gym[atari]' > /dev/null 2>&1
#!apt-get install python-opengl -y > /dev/null 2>&1
#!apt install xvfb -y > /dev/null 2>&1
!pip install pyvirtualdisplay > /dev/null 2>&1
!pip install tensorflow==2.0.0-beta0 > /dev/null 2>&1
I use the Ray Tune library to find optimal values for the learning rate (lr) and the vf_share_layers option of the CartPole agent.
import ray
from ray import tune
import pandas as pd

config = {
    "env": 'CartPole-v0',
    "num_workers": 2,
    "vf_share_layers": tune.grid_search([True, False]),
    "lr": tune.grid_search([1e-4, 1e-5, 1e-6]),
    'seed': 91371
}

results = tune.run(
    'PPO',
    stop={
        'timesteps_total': 50000
    },
    config=config)

df = results.dataframe()
To demonstrate the noise, each trial is evaluated k=3 times. Results from a subset of the trials are plotted above. Note that the agent's output differs across repeats of the same configuration, lr = 1e-5 and vf_share_layers = False.
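One way to repeat each configuration is Ray Tune's num_samples argument, which re-runs every grid-search combination. The sketch below reuses the config defined above; the column names (config/lr, config/vf_share_layers, episode_reward_mean) are assumptions based on RLlib's default reporting.
# Repeat every grid-search configuration k times to expose the noise
k = 3
repeated = tune.run(
    'PPO',
    stop={'timesteps_total': 50000},
    config=config,
    num_samples=k)
df_repeated = repeated.dataframe()
# Spread of the reward across the k repeats of each configuration
print(df_repeated.groupby(['config/lr', 'config/vf_share_layers'])
      ['episode_reward_mean'].agg(['mean', 'std']))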
A custom agent is defined next using a small neural network; it is optimized with the policy gradient method.
import tensorflow as tf
from tensorflow.keras.layers import Dense, Flatten, Conv2D
from tensorflow.keras import Model
import numpy as np
import gym

class AgentMLPTF(Model):
    def __init__(self):
        super(AgentMLPTF, self).__init__()
        self.d1 = Dense(15, activation='tanh')
        self.d2 = Dense(2)

    def call(self, x):
        # 1. Define the policy
        batch = True
        if x.ndim == 1:
            batch = False
            x = np.expand_dims(x, axis=0)
        x = self.d1(x)
        action_logits = self.d2(x)
        # 2. Sample the policy to get an action
        action = tf.random.categorical(action_logits, 1)
        action = action.numpy().flatten()
        if not batch:
            action = action.item()  # np.asscalar is deprecated; .item() is equivalent
        return {"Action": action, "LogProbability": action_logits}
def get_episode_trajectory(env, agent, max_steps=1000):
    observation_list = []
    reward_list = []
    action_list = []
    value_list = []
    done = False
    obs = env.reset()
    for _ in range(max_steps):
        observation_list.append(obs)
        out = agent(obs)
        assert ("Action" in out), "The key 'Action' was missing from the agent's output."
        action = out["Action"]
        obs, rew, done, _ = env.step(action)
        reward_list.append(rew)
        action_list.append(action)
        if "Value" in out:
            value_list.append(out["Value"])
        if done:
            break
    ret = {
        "Observations": observation_list,
        "Actions": action_list,
        "Rewards": np.array(reward_list, dtype=np.float32)
    }
    if len(value_list) > 0:
        ret["Values"] = value_list
    return ret
def reward_to_go(rewards):
    return np.flip(np.cumsum(np.flip(rewards)))
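reward_to_go replaces each reward with the sum of rewards from that timestep to the end of the episode; a quick sanity check:
print(reward_to_go(np.array([1.0, 1.0, 1.0])))  # -> [3. 2. 1.]
print(reward_to_go(np.array([0.0, 2.0, 3.0])))  # -> [5. 5. 3.]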
def train_policy_grad(env, agent, num_epochs=300):
    optimizer = tf.keras.optimizers.Adam(lr=1e-2)
    log_reward = 0
    log_reward_list = []
    logging_period = 20
    for epoch in range(num_epochs):
        # get the training data
        traj = get_episode_trajectory(env, agent)
        obs = np.stack(traj["Observations"])
        rew = traj["Rewards"]
        actions = traj["Actions"]
        # compute 'reward-to-go'
        rew_2_go = reward_to_go(rew)
        # compute gradients + update weights
        with tf.GradientTape() as tape:
            logits = agent(obs)["LogProbability"]
            loss = loss_pg(actions, logits, rew_2_go)
        gradients = tape.gradient(loss, agent.trainable_variables)
        optimizer.apply_gradients(zip(gradients, agent.trainable_variables))
        # log the reward
        log_reward += np.sum(rew)
        if (epoch % logging_period) == 0:
            template = 'Training Epoch {}, Averaged Return: {}'
            print(template.format(epoch, log_reward / logging_period))
            log_reward_list.append(log_reward / logging_period)
            log_reward = 0
    return (range(0, num_epochs, logging_period), log_reward_list)
def loss_pg(actions, log_probs, returns):
    # Keep only the log-probability of the action that was actually taken
    action_masks = tf.one_hot(actions, 2, dtype=np.float64)
    log_probs = tf.reduce_sum(action_masks * tf.nn.log_softmax(log_probs), axis=1)
    return -tf.reduce_sum(returns * log_probs)

# Note: up to a constant scale factor (sum vs. mean), this is equivalent to:
def loss_pg2(actions, log_probs, returns):
    loss = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
    return tf.reduce_mean(returns * loss(actions, log_probs))
import gym
env_cartpole = gym.make('CartPole-v1')
agent_mlp_tf = AgentMLPTF()
(episodes, rewards) = train_policy_grad(env_cartpole, agent_mlp_tf)
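A minimal sketch for visualizing the learning curve, assuming matplotlib is available; episodes and rewards come from the training call above.
import matplotlib.pyplot as plt

plt.plot(list(episodes), rewards)
plt.xlabel('Training epoch')
plt.ylabel('Average return per logging period')
plt.title('CartPole policy gradient training')
plt.show()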
There are various ways we can deal with noise in meta-learning:
- Penalize parameters that produce output with high uncertainty or variability.
- Resample parameters with higher uncertainty to reduce noise and increase the quality of the meta-learner.
Algorithm (a sketch in code follows the list):
- Set $N$ = $NUM\_SAMPLES$
- Run every individual $N$ times
- Collect the noisy feedback for the $i$th individual: $f_{i}$
- Select the most promising individual via $\operatorname*{argmax}_i \bar{f_{i}} + \frac{\sigma_{i}}{\sqrt{N_{i}}}$, i.e., the mean feedback plus its standard error, so individuals with higher uncertainty are favored for resampling
- Resample the selected individual and update its mean and standard error
- Repeat until no new individuals need to be resampled; a threshold on the standard error can serve as an additional stopping criterion
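A minimal sketch of this procedure in code; select_fittest, evaluate, se_threshold, and the synthetic noisy fitness function are illustrative assumptions rather than part of the notebook.
import numpy as np

def select_fittest(individuals, evaluate, num_samples=3, max_rounds=20, se_threshold=0.5):
    # Run every individual num_samples times to get noisy feedback f_i
    scores = {i: [evaluate(ind) for _ in range(num_samples)]
              for i, ind in enumerate(individuals)}
    for _ in range(max_rounds):
        means = {i: np.mean(s) for i, s in scores.items()}
        std_errs = {i: np.std(s, ddof=1) / np.sqrt(len(s)) for i, s in scores.items()}
        # Optimistic score: mean feedback plus standard error, so uncertain
        # individuals are favored for resampling
        best = max(scores, key=lambda i: means[i] + std_errs[i])
        # Stop once the most promising individual's standard error is small enough
        if std_errs[best] <= se_threshold:
            return individuals[best]
        # Resample the selected individual to reduce its uncertainty
        scores[best].append(evaluate(individuals[best]))
    # Fallback after max_rounds: return the individual with the best mean feedback
    return individuals[max(scores, key=lambda i: np.mean(scores[i]))]

# Illustrative usage with a synthetic noisy fitness function
rng = np.random.default_rng(0)
candidates = [1e-4, 1e-5, 1e-6]  # e.g. candidate learning rates
noisy_fitness = lambda lr: -abs(np.log10(lr) + 5) + rng.normal(0, 0.2)
print(select_fittest(candidates, noisy_fitness))  # likely selects 1e-5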