today we’re going to implement a popular reinforcement learning technique called Policy Gradients to help us choose which slot machine to play that will give us the highest winnings. We’ve talked a good bit about supervised learning. It’s where we build an algorithm, based on both input and output data, which is great for classification tasks.

**Unsupervised learning is when we don’t have a history of outputs.** We have to build an algorithm based on input data alone. Reinforcement learning is similar in that it is also building an algorithm based on input data alone, but we framed the problem in a different way. The algorithm presents a state that depends on the input data and is either rewarded or punished via an action that it takes, and this continues over time.

It learns from the reward or punishment and continually updates itself over the long term to maximize the reward. Reinforcement learning doesn’t get as much love these days, from the AI community, as the other two do. Because some of the most interesting problems right now, like those in speech recognition, NLP, and computer vision are areas where it’s hard to define the notion of a long-term reward. Don’t hate– reinforce.

Currently, the problems that are best solved by reinforcement learning are either [INAUDIBLE] problems, like getting an NPC to get through a level without dying, or really complex problems, like self-driving cars or folding laundry or understanding my ex-girlfriend. Things that you could do in a simulation, human level tasks. So a lot of the work in reinforcement learning is theoretical instead of application-based, which is just as important, don’t get me wrong. However, there are some real world use cases today.

**Finnair, the flight-booking company, for example, uses RL to decide what action to take for what customers to increase their lifetime value.** And DeepMind reduced Google’s cooling center bill by 40% by using RL to decide the most efficient energy routing strategy for the lowest costs. So let’s take a look at our problem, which comes from probability theory. The problem is that a gambler, which we’ll call our agent, has to decide which slot machine to play, how many times to play each machine, and in which order to play them.

Every time that a machine is played, it outputs a reward value, which is randomly generated from a probability distribution specific to that machine. The goal of the gambler is to maximize the sum total of rewards earned through a sequence of lever pulls. He iteratively plays one lever per round and observes the associated reward to do this. This is called the Multi-armed Bandit Problem. The Bandit being the name of an old style slot machine with an arm, or several, on the side that you pull down.

The agent has to make a choice between using machines that are known to produce good results, exploitation, and trying out new machines that have unknown results but could give better results than the others, exploration. Exploitation is optimizing decisions based on existing knowledge, and exploration is attempting to acquire new knowledge. It’s a trade off that all reinforcement learning agents make when optimizing for a reward value, and it’s particularly relevant in this problem. So let’s start initializing some values. After importing TensorFlow and Numpy, the only two libraries we’ll need to use, we can define our bandits. We’ll be using a four-armed bandit.

That is one slot machine with four levers, and we can refer to each arm as a bandit. So we’ll define our bandits as a list, and each of these values will help decide if a reward is given when pulled. The lower the bandit number, the more likely we’ll get a positive reward. The higher the bandit number, the more likely we’ll get a negative reward. We want our agent to choose the bandit that will give that reward, and will initialize a variable for storing the total number.

Well, next it find a pull bandit function, which given a bandit value, will first generate a random number from a normal distribution with a mean of zero. Then compare the parameter value to the generated number. Depending on the result, it will either return a positive or negative reward.

In practice, this model is used any time you have a project with a fixed budget. It can be used to help best allocate resources to maximize success, since it’s specifically designed to deal with the uncertainty about the difficulty and payoff of each possibility. I wish I had this thing in college. Our agent needs to learn which kind of reward it gets for each possible action, so that it can choose the optimal ones. It’s learning a policy.

So we’re going to use a popular method called Policy Gradients to solve this. We’ll use a simple neural net that learns a policy for picking the best actions and adjusting its weights through gradient descent, using real time feedback from the environment. Since we’re only using a single bandit, our agent ignores the state of the environment, just like the US government. There’s only ever a single unchanging state. If we were introducing multiple bandits, then our agent would need to take state into account when deciding an action. We would learn a value function instead, but let’s keep it simple with a single state.

**Reinforce! Learn the policy. So actions are quality.** Our policy gradient network consists of a set of weights, and each weight corresponds to each of the possible arms to pull, and represents how beneficial our agent thinks it is to pull each arm.

We’ll initialize the weights to one, which means our agent will be optimistic about each arms potential reward. When we update our network, we’ll use what’s called an Epsilon Greedy Policy. This is a way of selecting random actions with uniform distributions from a set of available actions. Using this policy, either we can select random actions with epsilon probability, or we can select an action with one minus epsilon probability that gives maximum reward in a given state.

We’ll define epsilon as 0.1. It’s the chance of taking a random action. Basically, most of the time, our agent will choose the action that corresponds to the largest expected value, but sometimes, with e-probability, it will choose randomly. So this way, it can try out different arms to continue learning about them.

Our agent is the neural network. It’s feed forward and only has one set of weights. We’ll initialize them as a tensor, where each is a set of one for the number of bandits.

Then, we’ll use the argmax function to choose the weight with the highest value and store that as our chosen action. We now need to establish what this training process looks like. Since we want to feed the reward and chosen action into the network to compute the loss, and then use that to update the network, we’ll initialize tensorflow placeholders for both the reward and the action values. Next, we’ll define the responsible weights. It corresponds to the units in the output layer, which corresponded to the chosen action.

When updating the policy, we want to update the likelihood of the actions we actually took, as opposed to all possible actions. So this will be a slice of our weights, and we can define the size of it as well. So this is what our policy loss equation looks like. This is what we want to minimize.

This character is the policy, which we take a log of, and A is the advantage. This is a critical part of RL. It’s a measure of how much better an action was than some baseline.

There are different ways of deciding what that baseline is, and it can get pretty interesting. But right now, we’ll just set it to zero, so we can just think of it as just the reward we receive for each action. This loss function lets us increase the weight for actions that give a positive reward value, and decrease them for actions that give a negative reward value. When we define our loss function programmatically, we can see that it corresponds to the equation where reward holder is the advantage. We can then optimize it with gradient descent and a given learning rate.

When we minimize our loss, it will return a gradient update. So for our training step, we’ll initialize a tf graph, and then for a given number of episodes, we’ll either try a random action or choose one from our network, exploitation versus exploration. We’ll receive a reward from our action of picking one of their bandits. Then, we’ll update the network using our gradients, weight values relevant to the action, and all weight values. We can use a feed dict to fit in both the action and the reward.

We’ll want to see which reward and which bandit we are on during each iteration, so we’ll print them out. When we compile this, we can see that bandit four’s values increase way faster than the other ones. It decides one is best, and then goes all in on it. We can extend this code later on, so that both state and action effect reward, which would be considered the contextual bandit problem. So let’s go over what we’ve learned. Reinforcement learning is usually applied to toy problems or really complex problems.

Policy Gradient methods are a type of RL technique that optimizes a policy with respect to the expected long-term return using gradient descent. And we can apply this strategy to the popular Multi-arm Bandit Problem, which asks how to best allocate resources to maximize success. The coding challenge winner for this video is Mike McDermott. Mike improved the bleeding edge memory network model from my last video to create a Q&A chatbot by adding a bi-directional LSTM and time distributed dense layer to it. This is seriously amazing stuff.

**He could publish his results to a journal, Wizard of the Week.** And the runner up is Vishal Batchu, who also had publishable results, and you can run his code right from the command line. You guys blew my mind, and I bow to you. The coding challenge for this week is to use policy gradients to solve the contextual bandit problem. So state is taken into account.

Details are in the read me, github links go in the comments, and winners are going to be announced next week. Please subscribe, and for now, I’ve got to maximize my arm size, so thanks for watching.