Coding

Deep Reinforcement Learning in Python Tutorial – A Course on How to Implement Deep Learning Papers

  • 00:00:00 what is up everybody today you're gonna
  • 00:00:03 learn how to go from a paper to a fully
  • 00:00:05 functional implementation of deep
  • 00:00:06 deterministic policy gradients if you're
  • 00:00:09 not familiar with deep deterministic
  • 00:00:10 policy gradients or DDPG for short it
  • 00:00:13 is a type of deep reinforcement learning
  • 00:00:15 that is used in environments from the
  • 00:00:16 continuous action spaces
  • 00:00:18 you see most environments have discrete
  • 00:00:19 action spaces this is the case with say
  • 00:00:22 the Atari library say like breakout or
  • 00:00:25 space invaders where the agent can
  • 00:00:27 move left right it can shoot but it can
  • 00:00:29 move left right and shoot by fixed
  • 00:00:31 discrete intervals fixed amounts right
  • 00:00:33 in other environments like say robotics
  • 00:00:35 the robots can move a continuous amount
  • 00:00:38 so it can move in anywhere from you know
  • 00:00:41 a zero to one minus one to plus one
  • 00:00:42 anything along a continuous number
  • 00:00:44 interval and this poses a problem for
  • 00:00:47 most deep reinforcement learning methods
  • 00:00:49 like say q-learning which works
  • 00:00:50 spectacularly well in discrete
  • 00:00:53 environments but cannot tackle
  • 00:00:54 continuous action spaces now if you
  • 00:00:56 don't know what any of this means don't
  • 00:00:58 worry I'm going to give you the rundown
  • 00:00:59 here in a second but for this set of
  • 00:01:01 tutorials you're gonna need to have
  • 00:01:03 installed the OpenAI Gym you'll need
  • 00:01:06 Python 3.6 and you also need TensorFlow
  • 00:01:09 and PyTorch other packages you'll need
  • 00:01:11 include matplotlib to handle the plotting of
  • 00:01:15 the learning curve which will allow us
  • 00:01:16 to see the actual learning of the agent
  • 00:01:18 as well as numpy to handle your typical
  • 00:01:21 vector operations
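As a quick aside, here is the kind of plotting utility being described: a minimal sketch assuming you just have a list of per-episode scores. The function name and arguments are illustrative, not the exact code used in the videos.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_learning_curve(scores, filename, window=100):
    # running average over the previous `window` episodes, a common way to
    # visualize whether the agent is actually learning over time
    running_avg = [np.mean(scores[max(0, i - window + 1):i + 1])
                   for i in range(len(scores))]
    plt.plot(running_avg)
    plt.xlabel('Episode')
    plt.ylabel('Average score (previous %d games)' % window)
    plt.savefig(filename)
```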
  • 00:01:25 now here I'll give you a quick little rundown of reinforcement
  • 00:01:27 learning so the basic idea is that we
  • 00:01:30 have an agent that interacts with some
  • 00:01:31 environment and receives a reward the
  • 00:01:33 rewards kind of take the place of labels
  • 00:01:35 and supervised learning in that they
  • 00:01:36 tell the agent what is good what is it
  • 00:01:39 that it is shooting for in the
  • 00:01:41 environment and so the agent will
  • 00:01:43 attempt to maximize the total rewards
  • 00:01:45 over time by solving something known as
  • 00:01:48 the bellman equation we don't have to
  • 00:01:49 worry about the actual mathematics of it
  • 00:01:51 but just so you know for your future
  • 00:01:52 research the algorithms are typically
  • 00:01:55 concerned with solving the bellman
  • 00:01:56 equation which tells the agent the
  • 00:01:58 expected future returns assuming it
  • 00:02:00 follows something called its policy so
  • 00:02:02 the policy is the probability that the
  • 00:02:04 agent will take a set of actions given
  • 00:02:07 it's in some state s it's basically a
  • 00:02:08 probability distribution
  • 00:02:10 now many types of algorithms such
  • 00:02:13 q-learning will attempt to solve the
  • 00:02:14 bellman equation by finding what's
  • 00:02:17 called the value function the value
  • 00:02:18 function or the action value function in
  • 00:02:20 this case maps the current state and
  • 00:02:22 set of possible actions to the expected
  • 00:02:25 future returns the agent expects to
  • 00:02:27 receive so in other words the agent says
  • 00:02:29 hey I'm in some state meaning some
  • 00:02:31 configuration of pixels on the screen in
  • 00:02:33 the case of the Atari gym Atari library
  • 00:02:36 for instance and says okay if I take one
  • 00:02:38 or another action what is the expected
  • 00:02:41 future return assuming that I follow my
  • 00:02:42 policy actor critic methods are slightly
  • 00:02:45 different in that they attempt to learn
  • 00:02:47 the policy directly and recall the
  • 00:02:50 policy is a probability distribution
  • 00:02:52 that tells the agent what the
  • 00:02:54 probability of selecting an action is
  • 00:02:55 given it's in some state s so these two
  • 00:03:00 algorithms have a number of strengths
  • 00:03:02 between them and deterministic policy
  • 00:03:04 gradients is a way to marry the
  • 00:03:06 strengths of these two algorithms into
  • 00:03:08 something that does really well for
  • 00:03:10 discrete actions sorry continuous action
  • 00:03:12 spaces you don't need to know too much
  • 00:03:13 more than that everything else you need
  • 00:03:15 to know I'll explain in their respective
  • 00:03:16 videos so in the first video you're
  • 00:03:18 gonna get to see how I go ahead and read
  • 00:03:21 papers and then implement them on the
  • 00:03:24 fly and in the second video you're gonna
  • 00:03:27 see the implementation of deep
  • 00:03:30 deterministic policy gradients in PyTorch
  • 00:03:32 in a separate environment both of these
  • 00:03:36 environments are continuous and so they
  • 00:03:39 will demonstrate the power of the
  • 00:03:40 algorithm quite nicely you don't need a
  • 00:03:43 particularly powerful GPU but you do
  • 00:03:45 need some kind of GPU to run these as it
  • 00:03:48 does take a considerably long time even
  • 00:03:50 on a GPU so you will need at least a
  • 00:03:54 like say a Maxwell class GPU or above so
  • 00:03:57 something from the 700 series on NVIDIA
  • 00:03:59 side unfortunately neither of these
  • 00:04:02 frameworks really work well with AMD
  • 00:04:04 cards so if you have those you'd have to
  • 00:04:06 figure out some sort of kludge to get the
  • 00:04:08 OpenCL implementation to transcompile
  • 00:04:11 to CUDA that's just a technical detail I
  • 00:04:13 don't have any information on that so
  • 00:04:15 you're on your own sorry so this is a
  • 00:04:17 few hours of content grab a snack drink
  • 00:04:20 and watch this at your leisure it's best
  • 00:04:23 to watch it in order I actually did
  • 00:04:25 the videos in
  • 00:04:27 reverse order on my channel just so I
  • 00:04:28 could get it out so I did the
  • 00:04:30 implementation in PyTorch first and
  • 00:04:31 then the video on implementing the paper
  • 00:04:33 in TensorFlow second but it really is
  • 00:04:36 best for a new audience to go from the
  • 00:04:38 paper video to the PyTorch video
  • 00:04:42 so I hope you like it leave your
  • 00:04:45 questions suggestions issues in the comments
  • 00:04:47 down below I'll try to address as many
  • 00:04:48 as possible you can check out the code
  • 00:04:51 for this on my github and you can find
  • 00:04:53 many more videos like this on my
  • 00:04:54 youtube channel machine learning with
  • 00:04:56 Phil I hope you all enjoy it let's get
  • 00:04:58 to it what is up everybody in today's
  • 00:05:01 video we're gonna go from the paper on
  • 00:05:03 deep deterministic policy gradients all
  • 00:05:05 the way into a functional implementation
  • 00:05:07 in TensorFlow so you're gonna see how
  • 00:05:09 to go from a paper to a real-world
  • 00:05:11 implementation all in one video grab a
  • 00:05:13 snack a drink cuz this is gonna take a
  • 00:05:14 while
  • 00:05:15 let's get started
  • 00:05:22 so the first step in my process really
  • 00:05:25 isn't anything special I just read the
  • 00:05:27 entirety of the paper of course starting
  • 00:05:30 with the abstract the abstract tells you
  • 00:05:31 what the paper is about at a high level
  • 00:05:34 it's just kind of an executive summary
  • 00:05:36 introduction is where the authors will
  • 00:05:38 pay homage to other work in the field
  • 00:05:40 kind of set the stage for what is going
  • 00:05:43 to be presented in the paper as well as
  • 00:05:44 the need for it the background kind of
  • 00:05:47 expands on that and you can see here it
  • 00:05:49 gives us a little bit of mathematical
  • 00:05:51 equations and you will get a lot of
  • 00:05:54 useful information here this won't talk
  • 00:05:57 too much about useful nuggets on
  • 00:05:58 implementation but it does set the stage
  • 00:06:00 for the mathematics you're going to be
  • 00:06:01 implementing which is of course critical
  • 00:06:03 for any deep learning or in this case
  • 00:06:06 deep reinforcement learning paper
  • 00:06:08 implementation the algorithm is really
  • 00:06:11 where all the meat of the problem is it
  • 00:06:13 is in here that they lay out the
  • 00:06:15 exact steps you need to take to
  • 00:06:17 implement the algorithm right that's why
  • 00:06:20 it's titled that way so this is the
  • 00:06:22 section we want to read most carefully
  • 00:06:24 and then of course they will typically
  • 00:06:26 give a table where they outline the
  • 00:06:28 actual algorithm and oftentimes if I'm
  • 00:06:32 in a hurry I will just jump to this
  • 00:06:34 because I've done this enough times that
  • 00:06:36 I can read what's called
  • 00:06:39 pseudocode if you're not familiar with
  • 00:06:40 pseudocode it's just an English-language
  • 00:06:41 representation of computer code so we
  • 00:06:45 will typically use that when we outline
  • 00:06:47 a problem and it's often used in papers
  • 00:06:49 of course so typically I'll start here
  • 00:06:52 reading it and then work backward by
  • 00:06:54 reading through the paper to see what I
  • 00:06:55 missed but of course it talks about the
  • 00:06:58 performance across a whole host of
  • 00:07:00 environments and of course all of these
  • 00:07:03 have in common that they are continuous
  • 00:07:05 control so what that means is that the
  • 00:07:09 action space is a vector whose elements
  • 00:07:13 can vary on a continuous real number
  • 00:07:16 line instead of having discrete actions
  • 00:07:18 of zero one two three four or five so
  • 00:07:21 that is really the motivation behind
  • 00:07:23 deep deterministic policy gradients is
  • 00:07:25 that it allows us to use deep reinforcement
  • 00:07:27 learning to tackle these types of
  • 00:07:28 problems and in today's video we're
  • 00:07:30 gonna go ahead and tackle the I guess
  • 00:07:33 pendulum swing up also called
  • 00:07:34 pendulum problem reason being is that
  • 00:07:37 while it would be awesome to start out
  • 00:07:40 with something like the bipedal Walker
  • 00:07:42 you never want to start out with maximum
  • 00:07:44 complexity you always want to start out
  • 00:07:46 with something very very small and then
  • 00:07:47 scale your way up and the reason is that
  • 00:07:48 you're going to make mistakes and it's
  • 00:07:51 most easy to debug most quick to debug
  • 00:07:54 very simple environments that execute
  • 00:07:56 very quickly so the pendulum problem
  • 00:07:58 only has I think three elements in its
  • 00:08:00 state vector and only a single action so
  • 00:08:03 or maybe it's two actions I forget but
  • 00:08:04 either way it's very small problem
  • 00:08:06 relative to something like the bipedal
  • 00:08:08 Walker or many of the other environments
  • 00:08:10 you could also use the continuous
  • 00:08:13 version of the cart pole or something
  • 00:08:14 like that and that would be perfectly
  • 00:08:15 fine I've just chosen the pendulum for
  • 00:08:17 this particular video because we haven't
  • 00:08:18 done it before so it's in here that they
  • 00:08:21 give a bunch of plots of all of the
  • 00:08:23 performance of their algorithm of the
  • 00:08:25 various sets of constraints placed upon
  • 00:08:28 it and different implementations so you
  • 00:08:30 can get an idea and one thing you notice
  • 00:08:32 right away it's always important to look
  • 00:08:34 at plots because they give you a lot of
  • 00:08:35 information visually right it's much
  • 00:08:37 easier to gather information from plots
  • 00:08:39 than it is text you see that right away
  • 00:08:41 they have a scale of 1 so that's telling
  • 00:08:44 you it's relative performance and you
  • 00:08:45 have to read the paper to know relative
  • 00:08:47 to what I don't like that particular
  • 00:08:49 approach they have similar data in a
  • 00:08:53 table form and here you see a whole
  • 00:08:56 bunch of environments they used and
  • 00:08:58 there's a broad broad variety they
  • 00:09:00 wanted to show that the algorithm has a
  • 00:09:02 wide arena of applicability which is a
  • 00:09:05 typical technique in papers they want
  • 00:09:07 to show that this is relevant right if
  • 00:09:09 they only showed a single environment
  • 00:09:10 people reading it would say well that's
  • 00:09:12 all well and good you can solve one
  • 00:09:13 environment but what about these dozen
  • 00:09:15 other environments right and part of the
  • 00:09:18 motivation behind reinforcement learning
  • 00:09:19 is generality can we model real
  • 00:09:23 learning in biological systems such
  • 00:09:25 that it mimics the generality of
  • 00:09:27 biological learning one thing you notice
  • 00:09:29 right away is that these numbers are not
  • 00:09:30 actual scores so that's one thing I kind
  • 00:09:34 of take note of and causes me to raise
  • 00:09:37 an eyebrow so you have to wonder the
  • 00:09:40 motivation behind that why would the
  • 00:09:42 authors express scores in a ratio
  • 00:09:45 there's a couple different reasons one
  • 00:09:46 is because they want
  • 00:09:47 to just make all the numbers look
  • 00:09:50 uniform maybe the people reading the
  • 00:09:52 paper wouldn't be familiar with each of
  • 00:09:55 these environment so they don't know
  • 00:09:56 what a good score is and that's a
  • 00:09:58 perfectly valid reason another
  • 00:10:00 possibility is they want to hide poor
  • 00:10:01 performance I don't think that's going
  • 00:10:02 on here but it does make me raise my
  • 00:10:04 eyebrow whenever I see it the one
  • 00:10:07 exception is TORCS which is The Open
  • 00:10:09 Racing Car Simulator a racing
  • 00:10:11 environment I don't know if we'll get to
  • 00:10:13 that on this channel that would be a
  • 00:10:14 pretty cool project but that would take
  • 00:10:16 me a few weeks to get through but right
  • 00:10:19 away you notice that they have a whole
  • 00:10:21 bunch of environments these scores are
  • 00:10:22 all relative to one and one is the score
  • 00:10:24 that the agent gets using a planning
  • 00:10:28 algorithm which they also detail later
  • 00:10:29 on so those are the results and they
  • 00:10:33 talk more about I don't think we saw the
  • 00:10:36 headline but they talk about related
  • 00:10:38 work which talks about other algorithms
  • 00:10:40 that are similar and their shortcomings
  • 00:10:42 right they don't ever want to talk up
  • 00:10:44 other algorithms you always want to talk
  • 00:10:46 up your own algorithm to make yourself
  • 00:10:47 sound good you know why else would you be
  • 00:10:49 writing a paper in the first place
  • 00:10:51 and of course the conclusion that ties
  • 00:10:52 everything together then references I don't
  • 00:10:55 usually go deep into references if there
  • 00:10:58 is something that I feel I really really
  • 00:10:59 need to know I may look at a reference
  • 00:11:01 but I don't typically bother with them
  • 00:11:02 if you were a PhD student then it would
  • 00:11:06 behoove you to go into the references
  • 00:11:07 because you must be an absolute expert
  • 00:11:09 on the topic and for us we're just you
  • 00:11:12 know hobbyists I'm a youtuber so I don't
  • 00:11:15 go into too much depth with the
  • 00:11:17 background information and the next most
  • 00:11:21 important bit of the paper are the
  • 00:11:23 experimental details and it is in here
  • 00:11:25 that it gives us the parameters and
  • 00:11:28 architectures for the networks
  • 00:11:30 so this is where if you saw my previous
  • 00:11:32 video where I did the implementation of
  • 00:11:33 DDPG in PyTorch in the continuous
  • 00:11:36 lunar lander environment this is where I
  • 00:11:38 got most of this stuff it was almost
  • 00:11:40 identical with a little bit of tweaking
  • 00:11:41 I left out some stuff from this paper
  • 00:11:44 but pretty much all of it came from here
  • 00:11:47 and particularly the hidden layer sizes
  • 00:11:50 400 and 300 units as well as the
  • 00:11:54 initialization of the parameters from
  • 00:11:57 a uniform distribution of the given
  • 00:11:59 ranges so just to recap this was a
  • 00:12:04 really quick overview of the paper just
  • 00:12:07 showing my process of what I look at the
  • 00:12:10 most important parts are the details of
  • 00:12:13 the algorithm as well as the
  • 00:12:15 experimental details so as you read the
  • 00:12:19 paper like I said I gloss over the
  • 00:12:23 introduction because I don't really I
  • 00:12:25 kind of already understand the
  • 00:12:26 motivation behind it I get the idea
  • 00:12:29 it says it basically tells us that you
  • 00:12:32 can't really handle
  • 00:12:34 continuous action spaces with deep Q
  • 00:12:36 networks we already know that and it
  • 00:12:39 says you know you can discretize the
  • 00:12:40 state space but then you end up with
  • 00:12:42 really really huge actions sorry you can
  • 00:12:44 discretize the action space but then you
  • 00:12:45 end up with the you know whole boatload
  • 00:12:46 of actions you know what is it 2187
  • 00:12:49 actions so it's intractable anyway and
  • 00:12:51 they say what we present you know a
  • 00:12:53 model-free off-policy algorithm and then
  • 00:12:58 it comes down to this section where it
  • 00:12:59 says the network is trained off policy
  • 00:13:02 with samples from a replay buffer to
  • 00:13:03 minimize correlations very good and
  • 00:13:05 trained with a target Q network to give
  • 00:13:08 consistent targets during temporal
  • 00:13:10 difference backups in this work we make
  • 00:13:12 use of the same ideas along with batch
  • 00:13:15 normalization so this is a key chunk of
  • 00:13:18 text and this is why you want to read
  • 00:13:19 the whole paper because sometimes
  • 00:13:21 they'll embed stuff in there that you
  • 00:13:22 may not otherwise catch so as I'm
  • 00:13:25 reading the paper what I do is I take
  • 00:13:27 notes and you can do this on paper you
  • 00:13:29 can do it in you know a text document in
  • 00:13:32 this case we're using the editor so
  • 00:13:33 that way I can show you what's going on
  • 00:13:34 and it's a natural place to put this
  • 00:13:36 stuff because that's where you can
  • 00:13:37 implement the code anyway let's hop over
  • 00:13:38 to the editor and you'll see what I take
  • 00:13:40 notes so right off the bat we always
  • 00:13:43 want to be thinking in terms of what
  • 00:13:44 sort of classes and functions will we
  • 00:13:46 need to implement this algorithm so the
  • 00:13:48 paper mentioned a replay buffer as well
  • 00:13:51 as a target Q network so the target
  • 00:13:55 Q network for now we don't really
  • 00:13:58 know what it's going to be but we can
  • 00:13:59 write it down so we'll say we'll need a
  • 00:14:02 replay buffer class
  • 00:14:05 and we'll need a class for a target Q
  • 00:14:09 Network now I would assume that if you
  • 00:14:12 were going to be implementing a paper of
  • 00:14:13 this advanced difficulty you'd already
  • 00:14:17 be familiar with Q-learning where you
  • 00:14:18 know that the target network is just
  • 00:14:20 another instance of a generalized
  • 00:14:22 network the difference between the
  • 00:14:24 target and evaluation networks are you
  • 00:14:26 know the way in which you update their
  • 00:14:28 weights so right off the bat we know
  • 00:14:31 that we're gonna have a single class at
  • 00:14:33 least one if you know something about
  • 00:14:35 actor critic methods you'll know that
  • 00:14:36 you'll probably have two different
  • 00:14:37 classes one for an actor one for a
  • 00:14:38 critic because those two architectures
  • 00:14:40 are generally a little bit different but
  • 00:14:43 what do we know about Q networks we
  • 00:14:44 know that Q networks are state action
  • 00:14:47 value functions right they're not just
  • 00:14:49 value functions so the critic in the
  • 00:14:51 actor critic methods is just a state
  • 00:14:53 value function in general whereas here
  • 00:14:55 we have a Q network which is going to
  • 00:14:57 be a function of the state in action so
  • 00:14:59 we know that it's a function of s and a
  • 00:15:02 so we know right off the bat it's not
  • 00:15:04 the same as a critic it's a little bit
  • 00:15:06 different
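To make that concrete, here is a minimal PyTorch sketch of a critic that takes both the state and the action as input. The class name and layer choices are illustrative, not the final implementation; the 400 and 300 unit layers come from the paper's experimental details, where the action is fed in at the second hidden layer.

```python
import torch
import torch.nn as nn

class CriticSketch(nn.Module):
    # the critic is a state-action value function: forward() takes BOTH s and a,
    # unlike a plain state-value critic that only takes the state
    def __init__(self, input_dims, n_actions):
        super().__init__()
        self.fc1 = nn.Linear(input_dims, 400)
        self.fc2 = nn.Linear(400 + n_actions, 300)  # action joins at the second layer
        self.q = nn.Linear(300, 1)

    def forward(self, state, action):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.q(x)
```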
  • 00:15:09 it also said we will use batch norm so batch
  • 00:15:14 normalization is just a way of
  • 00:15:16 normalizing inputs to prevent divergence
  • 00:15:18 in a model I think it was discovered in
  • 00:15:21 2015 2014 something like that
  • 00:15:25 so we'll use that so we'll need that
  • 00:15:26 in our network so we have at least
  • 00:15:28 right off the bat a little bit of an
  • 00:15:29 idea of what the network is going to
  • 00:15:30 look like so let's go back to the paper
  • 00:15:31 and see what other little bits of
  • 00:15:32 information we can glean from the text
  • 00:15:34 before we take a look at the algorithm
  • 00:15:35 reading along we can say blah blah they
  • 00:15:41 say a key feature of simplicity it
  • 00:15:44 requires only a straightforward actor
  • 00:15:46 critic architecture and very few moving
  • 00:15:49 parts and then they talk it up and say
  • 00:15:51 can learn policies that exceed the
  • 00:15:53 performance of the planner you know the
  • 00:15:55 planning algorithm even learning from
  • 00:15:57 pixels which we won't get to in this
  • 00:15:59 particular implementation so then okay
  • 00:16:01 no real other Nuggets there
  • 00:16:06 the background talks about the
  • 00:16:08 mathematical structure of the algorithm
  • 00:16:10 so this is really important if you want
  • 00:16:13 to have a really deep in-depth knowledge
  • 00:16:16 of the topic if you already know enough
  • 00:16:20 about the background you would know that
  • 00:16:21 you know the formula for discounted
  • 00:16:23 future rewards you should know that if
  • 00:16:25 you've done a whole bunch of
  • 00:16:26 reinforcement learning algorithms if you
  • 00:16:28 haven't then definitely read through
  • 00:16:29 this section to get the full idea of the
  • 00:16:34 background and the motivation behind the
  • 00:16:36 mathematics other thing to note is it
  • 00:16:38 says the action value function is used
  • 00:16:40 in many algorithms we know that from
  • 00:16:42 deep Q learning and then it talks about
  • 00:16:45 the recursive relationship known as the
  • 00:16:47 bellman equation that is known as well
  • 00:16:49 other thing to note what's interesting
  • 00:16:51 here and this is the next nugget is if
  • 00:16:53 the target policy is deterministic we
  • 00:16:56 can describe as a function mu and so you
  • 00:16:58 see that in the remainder of the paper
  • 00:17:00 like in the algorithm they do indeed
  • 00:17:03 make use of this parameter mu so that
  • 00:17:06 tells us right off the bat that our
  • 00:17:08 policy is going to be deterministic now
  • 00:17:10 if you have you could probably guess
  • 00:17:13 that from the title right DDPG is
  • 00:17:14 deep deterministic policy gradients
  • 00:17:17 right so you would guess from the name
  • 00:17:18 that the policy is deterministic but
  • 00:17:21 what does that mean exactly so a
  • 00:17:22 stochastic policy is one in which the
  • 00:17:25 software maps a given state to the
  • 00:17:29 probability of taking an action so you input
  • 00:17:31 a state and out comes a
  • 00:17:32 probability of selecting an action and
  • 00:17:34 you select an action according to that
  • 00:17:37 probability distribution so that right
  • 00:17:39 away bakes in a solution to the explore
  • 00:17:42 exploit dilemma so long as all
  • 00:17:43 probabilities are finite right so so as
  • 00:17:46 long as a probability of taking an
  • 00:17:47 action for all states doesn't go to zero
  • 00:17:50 there is some element of exploration
  • 00:17:52 involved in that algorithm Q-learning
  • 00:17:56 handles the explore exploit dilemma by
  • 00:17:58 using Epsilon greedy action selection
  • 00:18:01 where you have a random parameter that
  • 00:18:02 tells you how often to select a random
  • 00:18:04 number
  • 00:18:05 sorry a random action and then you select
  • 00:18:07 a greedy action the remainder of the
  • 00:18:09 time of course policy gradients don't
  • 00:18:11 work that way they typically use a
  • 00:18:13 stochastic policy but in this case we
  • 00:18:15 have a deterministic policy so you've
  • 00:18:16 got to wonder right away
  • 00:18:18 okay we have a deterministic policy how
  • 00:18:20 are we gonna handle the explore-exploit
  • 00:18:21 dilemma so let's go back to our text
  • 00:18:23 editor and make a note of that so we
  • 00:18:28 just want to say that the policy is
  • 00:18:31 deterministic how to handle the
  • 00:18:36 explore-exploit dilemma and that's a critical question
  • 00:18:40 right because if you only take what are
  • 00:18:42 perceived as the greedy actions you
  • 00:18:44 never get a really good coverage of the
  • 00:18:46 parameter space of the problem and
  • 00:18:48 you're going to converge on a suboptimal
  • 00:18:49 strategy so this is a critical question
  • 00:18:51 we have to answer in the paper so let's
  • 00:18:53 head back to the paper and see how they
  • 00:18:55 handle it so we're back in the paper and
  • 00:18:58 you can see the reason they introduced
  • 00:19:00 that deterministic policy is to avoid an
  • 00:19:01 inner expectation or maybe that's just a
  • 00:19:03 byproduct I guess it's not accurate to
  • 00:19:05 say that's the reason they do it but
  • 00:19:07 what's neat is says the expectation
  • 00:19:09 depends only on the environment means
  • 00:19:10 it's possible to learn Q super mu
  • 00:19:12 meaning Q as a function of mu off policy
  • 00:19:14 using transitions which are generated
  • 00:19:16 from a different stochastic policy beta
  • 00:19:19 so right there we have off policy
  • 00:19:22 learning which they say explicitly with
  • 00:19:25 a stochastic policy so we are actually
  • 00:19:27 going to have two different policies in
  • 00:19:28 this case so then this already answers
  • 00:19:31 the question of how we go from a
  • 00:19:33 deterministic policy to solving the
  • 00:19:35 explore-exploit dilemma and the reason is
  • 00:19:37 that we're using a stochastic policy to
  • 00:19:39 learn the greedy policy or a sorry a
  • 00:19:41 purely deterministic policy and of
  • 00:19:45 course they talk about the parallels
  • 00:19:47 with Q learning because there are many
  • 00:19:49 between the two algorithms and you get
  • 00:19:52 to the loss function which is of course
  • 00:19:54 critical to the algorithm and this Y of
  • 00:19:57 T parameter then of course they talk
  • 00:19:59 about what Q learning has been used for
  • 00:20:02 they use they make mention of deep
  • 00:20:05 neural networks which is of course what
  • 00:20:06 we're going to be using that's where the
  • 00:20:08 deep comes from and talks about the
  • 00:20:10 Atari games which we've talked about on
  • 00:20:12 this channel as well and importantly
  • 00:20:16 they say the two changes that they
  • 00:20:20 introduce in Q learning which is the
  • 00:20:21 concept of the replay buffer and
  • 00:20:22 the target network which of course they
  • 00:20:24 already mentioned before they're just
  • 00:20:25 reiterating and reinforcing what they
  • 00:20:27 said that's why we want to read the
  • 00:20:29 introduction
  • 00:20:30 and background material to get a
  • 00:20:31 solid idea what's gonna happen so now we
  • 00:20:34 get to the algorithmic portion and this
  • 00:20:37 is where all of the magic happens so
  • 00:20:39 they again reiterate that it's not
  • 00:20:41 possible to apply q-learning to
  • 00:20:43 continuous action spaces because you
  • 00:20:46 know reasons right it's pretty obvious
  • 00:20:48 you have an infinite number of actions
  • 00:20:49 that's a problem and then they talk
  • 00:20:54 about the deterministic policy gradient
  • 00:20:56 algorithm which we're not going to go
  • 00:20:57 too deep into right it for this video we
  • 00:21:00 don't want to do the full thesis we
  • 00:21:01 don't want to do a full doctoral
  • 00:21:03 dissertation on the field we just want
  • 00:21:05 to know how to implement it and get
  • 00:21:06 moving
  • 00:21:06 so this goes through and gives you an
  • 00:21:11 update for the gradient of this
  • 00:21:13 parameter J and gives it in terms of the
  • 00:21:16 gradient of Q which is the state action
  • 00:21:19 value function and the gradient of the
  • 00:21:23 policy the deterministic policy mu other
  • 00:21:26 thing to note here is that this
  • 00:21:27 gradients these gradings are over two
  • 00:21:29 different parameters so the gradient of
  • 00:21:32 Q is with respect to the actions such
  • 00:21:35 that the action a equals mu of s of T so
  • 00:21:39 what this tells you is that Q is
  • 00:21:43 actually a function not just of the
  • 00:21:46 state but is intimately related to that
  • 00:21:50 policy mu so it's not an
  • 00:21:54 action chosen according to an Arg max
  • 00:21:56 for instance it's an action chosen
  • 00:21:58 according to the output of the other
  • 00:22:00 Network and for the update of mu it's
  • 00:22:04 just the gradient with respect to the
  • 00:22:06 weights which you would kind of expect
  • 00:22:08 so they talk about another algorithm
  • 00:22:14 NFQCA I don't know what that is honestly
  • 00:22:16 mini-batch version blah-de-blah
  • 00:22:18 our contribution here is to provide
  • 00:22:20 modifications to DPG inspired by the
  • 00:22:22 success of DQN which allow it to use
  • 00:22:24 neural network function approximators
  • 00:22:25 to learn in large state and action spaces
  • 00:22:28 online we call it DDPG very creative as
  • 00:22:32 they say again we use a replay buffer to
  • 00:22:34 address the issues of correlations
  • 00:22:36 between samples generated on subsequent
  • 00:22:38 steps within an episode
  • 00:22:40 a finite sized cache R transitions
  • 00:22:45 sampled from the environment so we know
  • 00:22:47 all of this so if you don't know all of
  • 00:22:48 it what you need to know here is that
  • 00:22:50 you have state action reward and then
  • 00:22:53 new state transitions so what this tells
  • 00:22:55 the agent is started in some state s
  • 00:22:58 took some action receive some reward and
  • 00:23:00 ended up in some new state why is it
  • 00:23:03 important it's important because in in
  • 00:23:06 anything that isn't dynamic programming
  • 00:23:08 you're really trying to learn the state
  • 00:23:11 probability distributions you're trying
  • 00:23:13 to learn the probability of going from
  • 00:23:15 one state to another and receiving some
  • 00:23:16 reward along the way if you knew all of
  • 00:23:19 those beforehand then you could just
  • 00:23:21 simply solve a set a very very large set
  • 00:23:23 of equations for that matter to arrive
  • 00:23:26 at the optimal solution right if you
  • 00:23:27 knew all those transitions you say ok if
  • 00:23:29 I start in this state and take some
  • 00:23:31 action I'm gonna end up in some other
  • 00:23:33 state with certainty then you'd say well
  • 00:23:35 what's the most advantageous state what
  • 00:23:37 state is going to give me the largest
  • 00:23:38 reward and so you could kind of
  • 00:23:39 construct some sort of algorithm for
  • 00:23:41 traversing that set of equations to
  • 00:23:43 maximize your reward over time now of
  • 00:23:45 course you often don't know that and
  • 00:23:47 that's the point of the replay buffer is
  • 00:23:49 to learn that through experience and
  • 00:23:51 interacting with the environment and it
  • 00:23:54 says when the replay buffer was full the
  • 00:23:56 oldest samples were discarded ok that makes
  • 00:23:57 sense it's finite size it doesn't grow
  • 00:23:59 indefinitely at each time step actor and
  • 00:24:02 critic are updated by sampling a mini
  • 00:24:04 batch uniformly from the buffer so it
  • 00:24:07 operates exactly like Q
  • 00:24:08 learning it does a uniform sampling a
  • 00:24:11 random sampling of the buffer and uses
  • 00:24:14 that to update the actor and critic
  • 00:24:16 networks what's critical here is that
  • 00:24:19 combining this statement with the topic
  • 00:24:23 of the previous paragraph is that when
  • 00:24:25 we write our replay buffer class it must
  • 00:24:27 sample states at random so what that
  • 00:24:31 means is you don't want to sample a
  • 00:24:33 sequence of subsequent steps and the
  • 00:24:35 reason is that there are large
  • 00:24:36 correlations between those steps right
  • 00:24:38 as you might imagine and those
  • 00:24:40 correlations can cause you to get
  • 00:24:41 trapped in little nooks and
  • 00:24:43 crannies of parameter space and really
  • 00:24:45 cause your algorithm to go wonky so you
  • 00:24:47 want to sample that uniformly so that
  • 00:24:48 way you're sampling across many many
  • 00:24:50 different episodes to get a really good
  • 00:24:52 idea of
  • 00:24:53 the I guess the breadth of the parameter
  • 00:24:56 space to use kind of loose language
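Here is a rough sketch of the kind of replay buffer class this implies, assuming numpy arrays for storage; the names are illustrative. The key point is that sample() draws indices uniformly at random across everything stored, rather than returning consecutive steps.

```python
import numpy as np

class ReplayBufferSketch:
    def __init__(self, max_size, input_dims, n_actions):
        self.mem_size = max_size
        self.mem_cntr = 0
        self.states = np.zeros((max_size, input_dims), dtype=np.float32)
        self.actions = np.zeros((max_size, n_actions), dtype=np.float32)
        self.rewards = np.zeros(max_size, dtype=np.float32)
        self.new_states = np.zeros((max_size, input_dims), dtype=np.float32)
        self.dones = np.zeros(max_size, dtype=bool)

    def store(self, state, action, reward, new_state, done):
        idx = self.mem_cntr % self.mem_size   # overwrite the oldest memory when full
        self.states[idx], self.actions[idx] = state, action
        self.rewards[idx], self.new_states[idx], self.dones[idx] = reward, new_state, done
        self.mem_cntr += 1

    def sample(self, batch_size):
        # uniform random sampling across many episodes to break up correlations;
        # only call this once at least batch_size memories have been stored
        max_mem = min(self.mem_cntr, self.mem_size)
        batch = np.random.choice(max_mem, batch_size)
        return (self.states[batch], self.actions[batch], self.rewards[batch],
                self.new_states[batch], self.dones[batch])
```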
  • 00:25:01 then it says directly implementing
  • 00:25:02 q-learning with neural networks proved to be
  • 00:25:04 unstable in many environments and
  • 00:25:05 they're gonna talk about using
  • 00:25:07 the target network okay but modified
  • 00:25:11 for actor critic using soft target
  • 00:25:14 updates rather than directly copying the
  • 00:25:16 weight so in q-learning we directly copy
  • 00:25:18 the weights from the evaluation to the
  • 00:25:20 target Network here it says we create a
  • 00:25:23 copy of the actor and critic networks Q
  • 00:25:25 prime and mu prime respectively that are
  • 00:25:28 used for calculating the target values
  • 00:25:30 the weights of these target networks are
  • 00:25:32 then updated by having them slowly track
  • 00:25:34 the learned networks theta prime goes to
  • 00:25:36 tau times theta plus one minus tau
  • 00:25:40 times theta prime with tau much much
  • 00:25:43 less than one this means that the target
  • 00:25:45 values are constrained to change slowly
  • 00:25:47 greatly improving the stability of
  • 00:25:49 learning okay so this is our next little
  • 00:25:52 nugget so let's head over from the
  • 00:25:53 paper to our text editor and
  • 00:25:55 make note of that what we read was that
  • 00:25:59 we have, not in caps we don't
  • 00:26:02 want to shout, we have two networks
  • 00:26:06 target networks sorry we have two actor
  • 00:26:11 and two critic networks a target for
  • 00:26:16 each updates are soft
  • 00:26:20 according to theta equals tau times
  • 00:26:24 theta plus one minus tau times theta
  • 00:26:33 Prime
  • 00:26:35 so I'm sorry there should be theta Prime
  • 00:26:38 so this is the update rule for the
  • 00:26:40 parameters of our target networks and we
  • 00:26:44 have two target networks one for the
  • 00:26:45 actor and one for the critic so we have
  • 00:26:47 a total of four deep neural networks and
  • 00:26:51 so this is why the algorithm runs so
  • 00:26:53 slowly even on my beastly rig it runs
  • 00:26:57 quite slowly even in the lunar lander in
  • 00:26:59 a continuous lunar lander environment
  • 00:27:02 I've done the bipedal Walker and it took
  • 00:27:05 about 20,000 games to get something that
  • 00:27:07 approximates a decent score so this is a
  • 00:27:09 very very slow algorithm and that 20,000
  • 00:27:12 games took I think about a day to run so
  • 00:27:14 quite slow but nonetheless quite
  • 00:27:18 powerful it's the only method we have so far
  • 00:27:19 for implementing deep reinforcement
  • 00:27:22 learning in continuous control
  • 00:27:23 environments so hey you know beggars
  • 00:27:25 can't be choosers right but we know just
  • 00:27:29 to recap that we're gonna use four
  • 00:27:31 networks two of them are on policy and two
  • 00:27:35 off policy and the updates are gonna be
  • 00:27:38 soft with tau much less than one
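To make that note concrete, here is what the soft update theta_prime <- tau*theta + (1 - tau)*theta_prime looks like for two PyTorch networks; the function and variable names are illustrative, not the code written later.

```python
def soft_update(target_net, online_net, tau=0.001):
    # theta_prime <- tau * theta + (1 - tau) * theta_prime, applied parameter by parameter
    for target_param, online_param in zip(target_net.parameters(),
                                          online_net.parameters()):
        target_param.data.copy_(tau * online_param.data
                                + (1.0 - tau) * target_param.data)
```

Calling it with tau=1.0 gives the hard copy used to initialize the target networks at the very start of training.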
  • 00:27:45 if you're not familiar with mathematics
  • 00:27:46 this double less than or double
  • 00:27:49 greater-than sign means much less than
  • 00:27:50 or much greater than respectively so
  • 00:27:53 what that means is that tau is gonna
  • 00:27:55 be of order point zero one or smaller
  • 00:27:57 right point one isn't much smaller
  • 00:27:59 that's kind of smaller point zero one I
  • 00:28:01 would consider much smaller
  • 00:28:03 we'll see in the details
  • 00:28:07 what value they use but you should
  • 00:28:09 know that it's of order point zero one or
  • 00:28:10 smaller and the reason they do this is
  • 00:28:13 to allow the updates to happen very
  • 00:28:16 slowly to get good convergence as they
  • 00:28:18 said in the paper so let's head back to
  • 00:28:19 the paper and see what other nuggets we
  • 00:28:20 can glean before getting to the outline
  • 00:28:22 of the algorithm and then in the very
  • 00:28:24 next sentence they say this simple
  • 00:28:26 change moves the relatively unstable
  • 00:28:27 problem of learning the action-value
  • 00:28:29 function closer to the case of
  • 00:28:30 supervised learning a problem for which
  • 00:28:32 a robust solution exists we found that
  • 00:28:35 having both the target mu prime and Q
  • 00:28:36 prime was required to have stable
  • 00:28:38 targets y sub i in order to consistently
  • 00:28:40 train the critic without divergence
  • 00:28:42 this may slow learning since the target
  • 00:28:44 networks delay the propagation of value
  • 00:28:46 estimates however in practice we found
  • 00:28:47 this was always
  • 00:28:48 greatly outweighed by the stability of
  • 00:28:49 learning and I found that as well you
  • 00:28:51 don't get a whole lot of divergence but
  • 00:28:52 it does take a while to train then they
  • 00:28:55 talk about learning in low dimensional
  • 00:28:57 and higher dimensional environments and
  • 00:29:00 they do that to talk about the need for
  • 00:29:03 feature scaling so one approach to the
  • 00:29:06 problem which is the ranges of
  • 00:29:10 variations in parameters right so in
  • 00:29:12 different environments like in the
  • 00:29:13 mountain car the position can go from
  • 00:29:16 minus one point two
  • 00:29:18 to zero point six something like that
  • 00:29:20 and the velocities are plus and minus
  • 00:29:21 point zero seven so you have a two order
  • 00:29:24 magnitude variation there and the
  • 00:29:25 parameters that's kind of large even in
  • 00:29:27 that environment and then when you
  • 00:29:28 compare that to other environments where
  • 00:29:30 you can have parameters that are much
  • 00:29:32 larger on the order of hundreds you can
  • 00:29:34 see that there's a pretty big issue with
  • 00:29:35 the scaling of the inputs to the neural
  • 00:29:39 network which we know from our
  • 00:29:40 experience that neural networks are
  • 00:29:42 highly sensitive to the scaling between
  • 00:29:44 inputs so their solution to that problem
  • 00:29:48 is to manually scale the features so
  • 00:29:50 they're similar across environments
  • 00:29:52 and units and they do that by using
  • 00:29:55 batch normalization and it says this
  • 00:29:58 technique normalizes each dimension
  • 00:30:00 across the samples in a mini batch to have
  • 00:30:02 unit mean and variance and it also
  • 00:30:05 maintains a running average of the mean
  • 00:30:06 and variance used for normalization
  • 00:30:07 during testing during exploration and
  • 00:30:09 evaluation so in our case training and
  • 00:30:13 testing are slightly different than in
  • 00:30:15 the case of supervised learning so in
  • 00:30:17 supervised learning you maintain
  • 00:30:18 different data sets or shuffled subsets
  • 00:30:22 of a single data set to do training and
  • 00:30:24 evaluation and of course in the
  • 00:30:26 evaluation phase you perform no weight
  • 00:30:29 updates of the network you just see how
  • 00:30:31 it does based on the training and
  • 00:30:34 reinforcement learning you can do
  • 00:30:35 something similar where you have a set
  • 00:30:37 number of games where you train the
  • 00:30:38 agent to achieve some set of results and
  • 00:30:40 then you turn off the learning and allow
  • 00:30:43 it to just choose actions based upon
  • 00:30:46 whatever policy it learns and if you're
  • 00:30:49 using batch normalization in PyTorch in
  • 00:30:50 particular there are significant
  • 00:30:53 differences in how batch normalization
  • 00:30:55 is used in the two different phases so
  • 00:30:57 you have to be explicit in
  • 00:31:00 setting training or evaluation mode in
  • 00:31:03 particular in PyTorch they don't update
  • 00:31:05 the running statistics in evaluation mode which is
  • 00:31:07 why when we wrote the DDPG algorithm in
  • 00:31:11 PyTorch we had to call the eval and
  • 00:31:13 train functions so often
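The pattern being described looks roughly like this; the toy network below is just for illustration and the exact call sites depend on how the agent class is organized.

```python
import torch.nn as nn

actor = nn.Sequential(nn.Linear(8, 400), nn.BatchNorm1d(400), nn.ReLU(),
                      nn.Linear(400, 2), nn.Tanh())  # toy actor with batch norm

actor.eval()   # eval mode: use the running statistics and don't update them
               # (e.g. when choosing actions or computing target values)
actor.train()  # train mode: use and update batch statistics during learning steps
```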
  • 00:31:17 okay so we've already established we'll need batch
  • 00:31:18 normalization so everything's kind of
  • 00:31:19 starting to come together we need a
  • 00:31:20 replay buffer batch normalization we
  • 00:31:23 need four networks right we need two
  • 00:31:26 each a target and an evaluation of an actor
  • 00:31:30 and of a critic so half of
  • 00:31:33 those are gonna be used for on policy
  • 00:31:34 and two of them are going to be used for off
  • 00:31:35 policy for the targets and then it says
  • 00:31:39 we'll scroll down a major challenge
  • 00:31:42 of learning in continuous action spaces
  • 00:31:45 is exploration an advantage of off
  • 00:31:48 policy algorithms such as DDPG is that
  • 00:31:50 we can treat the problem of exploration
  • 00:31:51 independently from the learning
  • 00:31:53 algorithm we constructed an exploration
  • 00:31:55 policy mu prime by adding noise sampled
  • 00:31:58 from a noise process N to our actor
  • 00:32:01 policy okay so right here is telling us
  • 00:32:04 what the exploration policy
  • 00:32:08 is it's mu prime is basically
  • 00:32:12 mu plus some noise and N can be chosen
  • 00:32:17 to suit the environment as detailed in
  • 00:32:19 the supplementary materials we used an
  • 00:32:21 Ornstein-Uhlenbeck process to generate
  • 00:32:24 temporally correlated exploration for
  • 00:32:26 exploration efficiency in physical
  • 00:32:28 control problems with inertia if you're
  • 00:32:30 not familiar inertia just means the
  • 00:32:33 tendency of stuff to stay in motion it
  • 00:32:36 has to do with like environments that
  • 00:32:38 move like the walkers the Cheetahs stuff
  • 00:32:40 like that the ants
  • 00:32:43 okay so we've kind of got another
  • 00:32:46 nugget to add to our text editor let's
  • 00:32:49 head back over there and write that down
  • 00:32:51 okay so the exploration policy is just the
  • 00:33:01 evaluation we'll call it that for lack
  • 00:33:02 of a better word evaluation actor plus
  • 00:33:04 some noise process they used Ornstein
  • 00:33:10 Uhlenbeck I don't think I spelled that
  • 00:33:12 correctly
  • 00:33:14 we'll need to look that up well I've
  • 00:33:19 already looked it up my background is in
  • 00:33:22 physics so it made sense to me it's
  • 00:33:23 basically a noise process that models
  • 00:33:27 the motion of Brownian particles which
  • 00:33:29 are just particles that move around
  • 00:33:31 under the influence of their interaction
  • 00:33:33 with other particles in some type of
  • 00:33:35 medium like a lossy medium like a
  • 00:33:37 perfect fluid or something like that and
  • 00:33:40 in the Ornstein-Uhlenbeck case they are
  • 00:33:42 temporally correlated meaning at each
  • 00:33:43 time step is related to the time step
  • 00:33:45 prior to it and I hadn't thought about
  • 00:33:47 it before but that's probably important
  • 00:33:49 for the case of Markov decision
  • 00:33:51 processes right so in MDPs the current
  • 00:33:55 state is only related to the prior state
  • 00:33:57 and the action taken you don't need to
  • 00:33:58 know the full history of the environment
  • 00:33:59 so I wonder if that was chosen that way
  • 00:34:02 if there's some underlying physical
  • 00:34:03 reason for that just kind of a question
  • 00:34:05 off the top of my head I don't
  • 00:34:07 know the answer to that if someone knows
  • 00:34:08 please drop the answer in the comments I
  • 00:34:11 would be very curious to see the answer
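For reference, a common discretization of the Ornstein-Uhlenbeck process looks like the sketch below; the class name is illustrative, and theta = 0.15, sigma = 0.2 are the values given in the paper's supplementary details. Each new sample is pulled back toward mu and nudged by Gaussian noise, which is what makes successive samples temporally correlated.

```python
import numpy as np

class OUNoiseSketch:
    def __init__(self, mu, sigma=0.2, theta=0.15, dt=1e-2):
        self.mu, self.sigma, self.theta, self.dt = mu, sigma, theta, dt
        self.reset()

    def reset(self):
        # called at the top of each episode, per the algorithm box in the paper
        self.x_prev = np.zeros_like(self.mu)

    def __call__(self):
        x = (self.x_prev
             + self.theta * (self.mu - self.x_prev) * self.dt
             + self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape))
        self.x_prev = x
        return x

# usage: noise = OUNoiseSketch(mu=np.zeros(n_actions)); draw a sample with noise()
```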
  • 00:34:12 so we have enough Nuggets here so just
  • 00:34:16 to summarize we need a replay buffer
  • 00:34:18 class we'll also need a class for the
  • 00:34:22 noise right so we'll need a class for
  • 00:34:27 noise a class for the replay buffer
  • 00:34:30 we'll need a class for the target Q
  • 00:34:32 Network and we're going to use batch
  • 00:34:34 normalization the policy will be
  • 00:34:36 deterministic so what that means in
  • 00:34:38 practice is that the policy will output
  • 00:34:41 the actual actions instead of the
  • 00:34:42 probability of selecting the actions so
  • 00:34:44 the policy will be limited by whatever
  • 00:34:48 the action space of the environment is
  • 00:34:50 so we need some way of taking that into
  • 00:34:52 account so deterministic policy means
  • 00:34:57 outputs the actual action instead
  • 00:35:02 of a probability we'll need a way to
  • 00:35:07 bound the actions to the
  • 00:35:11 environment limits and of course these
  • 00:35:14 notes don't make it into the final code
  • 00:35:16 these are just kind of things you think
  • 00:35:17 of as you are reading the paper you
  • 00:35:20 would want to put all your questions
  • 00:35:20 here I don't have questions since I've
  • 00:35:23 already implemented it but this is kind
  • 00:35:24 of my thought process as I went through
  • 00:35:26 it the first time as best as I can model
  • 00:35:29 it after having finished the problem and
  • 00:35:32 you can also use a sheet of paper
  • 00:35:33 there's some kind of magic about writing
  • 00:35:35 stuff down on paper but we're gonna use
  • 00:35:36 the code editor because I don't want to
  • 00:35:37 use an overhead projector to show you
  • 00:35:38 guys a frigging sheet of paper this
  • 00:35:40 isn't grade school here so let's head
  • 00:35:42 back to the paper and take a look at
  • 00:35:45 the actual algorithm to get some real
  • 00:35:47 sense of what we're going to be
  • 00:35:48 implementing the results really
  • 00:35:51 aren't super important to us yet we will
  • 00:35:54 use that later on if we want to debug
  • 00:35:57 the model performance but the fact that
  • 00:35:58 they express it relative to a
  • 00:35:59 planner that makes it difficult
  • 00:36:01 right so scroll down to the data really
  • 00:36:04 quick so they give another thing to note
  • 00:36:09 I didn't talk about this earlier but I
  • 00:36:11 guess now is a good time is the
  • 00:36:14 stipulations on this
  • 00:36:16 performance data it says performance after
  • 00:36:18 training across all environments for at
  • 00:36:20 most 2.5 million steps so I said earlier
  • 00:36:24 I had to train the bipedal walker for
  • 00:36:25 around 20,000 games
  • 00:36:30 I think that's around about two
  • 00:36:33 and a half million steps or so
  • 00:36:35 or maybe more like
  • 00:36:37 around three million steps
  • 00:36:38 something like that we report both the
  • 00:36:41 average and best observed across five
  • 00:36:44 runs so why would they use five runs so
  • 00:36:47 if this was a superduper algorithm
  • 00:36:49 which none of them are this isn't a
  • 00:36:51 slight on their algorithm it isn't
  • 00:36:52 meant to be a criticism or anything what
  • 00:36:54 it tells us is that they had to use five
  • 00:36:56 runs because there is some element of
  • 00:36:58 chance involved so you know in one
  • 00:37:01 problem with deep learning is the
  • 00:37:03 problem of replicability right it's hard
  • 00:37:07 to replicate other people's results
  • 00:37:08 particularly if you use system clocks as
  • 00:37:11 seeds for random number generators right
  • 00:37:13 using the system clock to seed the
  • 00:37:16 random number generator guarantees that
  • 00:37:17 if you run the simulation at even a
  • 00:37:20 millisecond later right that you're
  • 00:37:23 gonna get different results because
  • 00:37:24 you're gonna be starting with different sets of
  • 00:37:26 parameters now you will get
  • 00:37:27 qualitatively similar results right
  • 00:37:29 you'll be able to repeat the the general
  • 00:37:33 idea of the experiments but you won't
  • 00:37:34 get the exact same results it's kind of
  • 00:37:37 an objection to the whole deep
  • 00:37:39 learning phenomenon and it makes it kind
  • 00:37:41 of not scientific but whatever it works
  • 00:37:43 has an enormous success so we won't
  • 00:37:45 quibble about semantics or you know
  • 00:37:47 philosophical problems but we just need
  • 00:37:49 to know for our purposes that even these
  • 00:37:53 people that invented the algorithm had
  • 00:37:54 to run it several times to get some idea
  • 00:37:57 of what was going to happen because the
  • 00:37:58 algorithm is inherently probabilistic
  • 00:38:00 and so they report averages and
  • 00:38:03 best-case scenarios so that's another
  • 00:38:05 little tidbit and they included results
  • 00:38:09 for both the low dimensional cases where
  • 00:38:11 you receive just a state vector from the
  • 00:38:13 environment as well as the pixel inputs
  • 00:38:15 we won't be doing the pixel inputs for
  • 00:38:17 this particular video but maybe we'll
  • 00:38:19 get to them later I'm trying to work on
  • 00:38:20 that as well
  • 00:38:21 so these are the results and the
  • 00:38:24 interesting tidbit here is that it's
  • 00:38:25 probabilistic it's gonna take five runs
  • 00:38:26 so okay fine other than that we don't
  • 00:38:30 really care about results for now we'll
  • 00:38:32 take a look later but that's not really
  • 00:38:35 our concern at the moment so now we have
  • 00:38:38 a series of questions we have answers to
  • 00:38:40 all those questions we know how we're
  • 00:38:41 gonna handle the explore-exploit
  • 00:38:42 dilemma we know the purpose of the
  • 00:38:44 target networks we know how we're gonna
  • 00:38:47 handle the noise we know how we're gonna
  • 00:38:49 handle the replay buffer and we know
  • 00:38:52 what the policy actually is going to be
  • 00:38:54 it's the actual output it's the
  • 00:38:55 actual actions the agent is going to
  • 00:38:57 take so we know a whole bunch of stuff
  • 00:38:58 so it's time to look at the algorithm
  • 00:39:00 and see how we fill in all the details
  • 00:39:03 so randomly initialize a critic Network
  • 00:39:06 an actor Network with weights theta
  • 00:39:09 Super Q theta super mu so this is
  • 00:39:13 handled by whatever library you use you
  • 00:39:15 don't have to manually initialize
  • 00:39:16 weights but we do know from the
  • 00:39:19 Supplemental materials that they do
  • 00:39:22 constrain these updates to be within
  • 00:39:24 sorry these initializations to be within
  • 00:39:26 some range so
  • 00:39:27 put a note in the back of your mind that
  • 00:39:29 you're gonna have to constrain these a
  • 00:39:30 little bit and then it says initialize
  • 00:39:33 target network Q prime and mu
  • 00:39:36 prime with weights that are equal to the
  • 00:39:41 original networks so theta super Q prime
  • 00:39:44 gets initialized to theta super Q and
  • 00:39:46 theta mu prime gets initialized to theta
  • 00:39:49 super mu so we will be updating the
  • 00:39:53 weights right off the bat for the target
  • 00:39:55 networks with the evaluation networks
  • 00:39:58 and initialize a replay buffer R now
  • 00:40:01 this is an interesting question how do
  • 00:40:03 you initialize that replay buffer so
  • 00:40:04 I've used a couple different methods you
  • 00:40:07 can just initialize it with all zeros
  • 00:40:09 and then if you do that when you perform
  • 00:40:13 the learning you want to make sure that
  • 00:40:14 you have a number of memories that are
  • 00:40:17 greater than or equal to the mini batch
  • 00:40:19 size of your training so that way you're
  • 00:40:21 not sampling the same states more than
  • 00:40:23 once right if you have 64 memories in a
  • 00:40:26 batch that you want to sample but you
  • 00:40:28 only have say 16 memories in your replay
  • 00:40:30 buffer then you're gonna sample those
  • 00:40:32 16 memories and you're gonna sample
  • 00:40:34 each of those memories four times right
  • 00:40:35 so then that's no good so the question
  • 00:40:39 becomes if you initialize
  • 00:40:41 your replay buffer with zeros then you
  • 00:40:43 have to make sure that you don't learn
  • 00:40:44 until you exit the warm-up period where
  • 00:40:47 the warm-up period is just a number of
  • 00:40:49 steps equal to your
  • 00:40:51 batch sample size or you can initialize
  • 00:40:55 it with actual environmental play
  • 00:40:58 now this takes quite a long time you
  • 00:41:00 know the replay buffers are on the order of a
  • 00:41:01 million so if you let the
  • 00:41:03 algorithm take a million steps at random
  • 00:41:05 then it's gonna take a long time I
  • 00:41:07 always use zeros and then you know just
  • 00:41:09 wait until the agent fills up the mini
  • 00:41:11 batch size of memories just a minor
  • 00:41:15 detail there then it says for some
  • 00:41:19 number of episodes do so a for loop
  • 00:41:21 initialize a random process N for
  • 00:41:24 action exploration so this is something
  • 00:41:26 now reading it I actually made a little
  • 00:41:28 bit of a mistake so in my previous
  • 00:41:32 implementation I didn't reset the noise
  • 00:41:35 process at the top of every episode
  • 00:41:38 so that's explicit here I must have
  • 00:41:40 missed that line and I've looked at
  • 00:41:43 other people's code some do some don't
  • 00:41:45 but it worked within how many episodes
  • 00:41:48 was it within a 1000 episodes the agent
  • 00:41:53 managed to beat the continuous lunar
  • 00:41:54 lander environment so is that critical
  • 00:41:56 maybe not and I think I mentioned that
  • 00:41:58 in the video receive initial state
  • 00:42:01 observation s1 so for each step of the
  • 00:42:04 episode T equals one to capital T do
  • 00:42:06 select the action a sub T equals mu the
  • 00:42:10 policy plus n sub T according to the
  • 00:42:14 current policy and exploration noise
  • 00:42:15 okay so that's straightforward just
  • 00:42:19 feed the state forward what does
  • 00:42:21 that mean it means feed the state
  • 00:42:22 forward through the network receive the
  • 00:42:25 vector output of the action and add some
  • 00:42:28 noise to it
  • 00:42:28 okay execute that action a sub t and
  • 00:42:32 observe reward and new state simple
  • 00:42:35 store the transition you know the old
  • 00:42:38 state action reward and new state in your
  • 00:42:41 replay buffer R
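Put together, the action-selection step reads roughly like this sketch (names are illustrative); note the clip to the environment's action bounds, which is the "bound the actions to the environment limits" note from earlier.

```python
import numpy as np

def choose_action(actor_forward, state, noise, env):
    mu = actor_forward(state)          # deterministic action from the (online) actor
    mu_prime = mu + noise()            # add temporally correlated exploration noise
    # keep the noisy action inside the environment's valid action range
    return np.clip(mu_prime, env.action_space.low, env.action_space.high)
```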
  • 00:42:43 okay that's straightforward at each time step sample a
  • 00:42:47 random mini batch of n transitions from
  • 00:42:50 the replay buffer and then you want to
  • 00:42:52 use that set of transitions to set y sub
  • 00:42:55 I equals so I is sorry having
  • 00:42:59 difficulties here so I is each step of
  • 00:43:01 that is each element of that mini batch
  • 00:43:06 of transitions so you want to basically
  • 00:43:08 loop over that set or do a vectorized
  • 00:43:10 implementation looping is more
  • 00:43:12 straightforward that's what I do I
  • 00:43:14 always opt for the most straightforward
  • 00:43:16 and not necessarily most efficient way
  • 00:43:19 of doing things the first time through
  • 00:43:21 because you want to get it working first
  • 00:43:23 and worry about efficiency later. So set y sub i equal to r sub i plus gamma — gamma is your discount factor — times Q prime of the new state s sub i plus one, where the action is chosen according to mu prime, given the weights theta super mu prime and theta super Q prime. So what's
  • 00:43:49 important here is
  • 00:43:50 that — and this isn't immediately clear if you're reading it for the first time — this is a very important detail: the action must be chosen according to the target actor network. So you actually have Q as a function of the state as well as the output of another network. That's very important; a sketch of this target computation is below.
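To make that concrete, here is a minimal sketch of the target and the critic update (hypothetical attribute names; the dones array stores 1 - int(done), as we'll set up in the replay buffer later, so the bootstrap term is zeroed out at terminal states):

```python
import numpy as np

# y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
target_actions = self.target_actor.predict(new_states)
q_next = self.target_critic.predict(new_states, target_actions)

targets = np.array([rewards[j] + self.gamma * q_next[j] * dones[j]
                    for j in range(self.batch_size)])
targets = np.reshape(targets, (self.batch_size, 1))

# critic update: minimize the MSE between y_i and Q(s_i, a_i), where the
# a_i are the actions actually stored in the replay buffer
self.critic.train(states, actions, targets)
```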
  • 00:44:15 Next, update the critic by minimizing the loss: basically the mean of the squared difference between that y sub i and the output from the evaluation Q network, where the a sub i's are the actions you actually took during the course of the episode. So this
  • 00:44:34 a sub I is from the replay buffer and
  • 00:44:38 these actions right are chosen according
  • 00:44:41 to the target actor network so for each
  • 00:44:46 learning step you're going to have to do
  • 00:44:47 a feed-forward pass of not just this
  • 00:44:50 target Q network but also the target
  • 00:44:52 actor network as well as the evaluation
  • 00:44:56 critic network I hope I said that right
  • 00:44:58 so the feed forward pass of the target
  • 00:45:01 critic network as well as the target
  • 00:45:02 actor Network and the evaluation critic
  • 00:45:05 network as well and then it says update
  • 00:45:09 the actor policy using the sampled policy
  • 00:45:11 gradient this is the hardest step in the
  • 00:45:12 whole thing this is the most confusing
  • 00:45:13 part so this is the gradient is equal to
  • 00:45:16 1 over N times the sum so a mean
  • 00:45:19 basically whenever you see 1 over N
  • 00:45:20 times the sum that's a mean the gradient
  • 00:45:23 with respect to actions of Q where the
  • 00:45:27 actions are chosen according to the
  • 00:45:28 policy mu of the current states s times
  • 00:45:31 a gradient with respect to the weights
  • 00:45:33 of mu, where you just plug in the set of
  • 00:45:37 states ok so that'll be a little bit
  • 00:45:40 tricky to implement so and this is part
  • 00:45:42 of the reason I chose TensorFlow for this particular video: TensorFlow allows us to calculate gradients explicitly. In PyTorch, you may have noticed that all I did was set Q to be a function of the current state as well as the actor network, and so I allowed PyTorch to handle
  • 00:46:02 the chain rule this is effectively a
  • 00:46:04 chain rule so let's let's scroll up a
  • 00:46:06 little bit to look at that because this
  • 00:46:08 kind of gave me pause the first 10 times
  • 00:46:10 I read it so this is the hardest part to
  • 00:46:13 implement if you scroll up you see that
  • 00:46:16 this exact same expression appears here
  • 00:46:19 right and this is in reference to this
  • 00:46:21 so it's a gradient with respect to the
  • 00:46:24 weights theta super mu of Q of s and a, such that you're choosing an action a according to the policy mu. So really what this is, is the chain rule: this gradient is proportional to the gradient of this quantity times the gradient of the other quantity; it's just
  • 00:46:42 a chain rule from calculus. So in the PyTorch implementation we used this version, and the two forms are equivalent; it's perfectly valid to do one or the other. In PyTorch we did this version; today we're gonna do this particular version, so that's good to know.
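Written out, the sampled policy gradient the paper is describing (and that the chain-rule argument above reconstructs) is:

```latex
% Sampled policy gradient from the DDPG paper: the gradient of Q with
% respect to the action, evaluated at a = mu(s_i), times the gradient of mu
% with respect to the actor weights, averaged over the minibatch.
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i}
    \nabla_{a} Q(s, a \mid \theta^{Q}) \Big|_{s = s_i,\, a = \mu(s_i)}
    \; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \Big|_{s = s_i}
```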
  • 00:47:01 all right so next step on each time step
  • 00:47:05 you want to update the target networks
  • 00:47:06 according to this soft update rule so
  • 00:47:09 theta super Q prime gets updated as tau times theta super Q plus one minus tau times theta super Q prime, and likewise for theta super mu prime; then you just end the two loops.
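For reference, the soft target-update rule written out is:

```latex
% Soft target updates with tau << 1 (the paper uses tau = 0.001):
\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1 - \tau)\,\theta^{Q'},
\qquad
\theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1 - \tau)\,\theta^{\mu'}
```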
  • 00:47:21 In practice this looks very simple, but what do we know off the bat? We need a class for our replay buffer, we need a class for our noise process, and we need a class for the actor and a class for the critic. Now, you could think that perhaps those two could be the same, but when you look at the
  • 00:47:36 details which we're gonna get to in a
  • 00:47:38 minute you realize you need two separate
  • 00:47:39 classes so you need at least one class
  • 00:47:41 to handle the deep neural networks so
  • 00:47:44 you have at least three classes and I
  • 00:47:46 always add in an agent class on top it's
  • 00:47:48 kind of an interface between the
  • 00:47:49 environment and the deep neural networks
  • 00:47:52 so that's four and we're gonna go with
  • 00:47:54 five but that's for right off the bat
  • 00:47:57 so now that we know the algorithm let's
  • 00:47:58 take a look at the Supplemental details
  • 00:48:01 supplemental information to see
  • 00:48:03 precisely the architectures and
  • 00:48:05 parameters used so we scroll down so
  • 00:48:08 here are the experimental details: they used Adam for learning the neural network parameters, with a learning rate of ten to the minus four and ten to the minus three for the actor and critic respectively. So they tell us the learning rates: 10 to the minus 4 and 10 to the minus 3 for the actor and the critic.
  • 00:48:24 For Q, the critic, "we included L2 weight decay of 10 to the minus 2 and used a discount factor of gamma = 0.99." That gamma is pretty typical, but the important thing is that for Q, and only Q, not mu, we included L2 weight decay of 10 to the minus 2 and use a
  • 00:48:40 discount factor of gamma of 0.99 that's
  • 00:48:44 an important detail for the soft target
  • 00:48:46 updates we use tau equals 0.001 so one
  • 00:48:50 part in a thousand that is indeed very
  • 00:48:52 very small okay fine the neural networks
  • 00:48:55 used the rectified non-linearity for all
  • 00:48:57 hidden layers. Okay, the final output layer of the actor was a tangent hyperbolic layer to bound the actions. Now, tanh goes from minus 1 to plus 1, so in environments in which you have bounds of, let's say, plus or minus 2, you're gonna need a multiplicative factor; that's just something to keep in mind.
  • 00:49:24 That doesn't impact the tangent hyperbolic itself, it just means there's a multiplicative factor related to your environment. The low-dimensional networks had two hidden
  • 00:49:35 layers with 400 and 300 units
  • 00:49:37 respectively about a hundred thirty
  • 00:49:38 thousand parameters actions were not
  • 00:49:41 included until the second hidden layer
  • 00:49:44 of Q so when you're calculating the
  • 00:49:47 critic function Q you aren't actually
  • 00:49:50 passing forward the action from the very
  • 00:49:53 beginning you're including it as a
  • 00:49:54 separate input at the second hidden
  • 00:49:56 layer of Q that's very important that's
  • 00:49:58 a very important implementation detail
  • 00:50:01 and this is when learning from pixels we
  • 00:50:03 use three convolution layers which we
  • 00:50:05 don't need to know right now we're not
  • 00:50:06 using pixels yet and followed by two
  • 00:50:09 fully connected layers
  • 00:50:11 The final layer weights and biases of both the actor and critic were initialized from a uniform distribution of plus or minus 3 times 10 to the minus 3 for the low-dimensional case; this was to ensure that the initial outputs for the policy and value estimates were near zero.
  • 00:50:31 The other layers were initialized from a uniform distribution of plus or minus 1 over the square root of f, where f is the fan-in of the layer; the fan-in is just the number of input units. And then it says the actions were not included until the fully connected layers — that's for the
  • 00:50:50 convolutional case so here right now I'm
  • 00:50:54 experiencing some confusion reading this
  • 00:50:55 So it says the other layers were initialized from uniform distributions related to the fan-in, and the actions were not included
  • 00:51:05 into the fully connected layers so I'm
  • 00:51:08 guessing since we're talking about fully
  • 00:51:10 connected layers they're talking about
  • 00:51:11 the pixel case right because otherwise
  • 00:51:14 they're all fully connected you know
  • 00:51:16 wouldn't make sense to say specify fully
  • 00:51:18 connected layers so this gives me a
  • 00:51:21 little bit of confusion is this
  • 00:51:22 statement referring to the way I
  • 00:51:24 initially interpreted it it is referring
  • 00:51:27 to both cases for the state vector and
  • 00:51:29 pixel case but whatever I'm gonna
  • 00:51:32 interpret it that way because it seemed
  • 00:51:34 to work
  • 00:51:34 but there's ambiguity there and this is
  • 00:51:37 kind of an example of how reading papers
  • 00:51:39 can be a little bit confusing at times
  • 00:51:40 because the wording isn't always clear
  • 00:51:42 maybe I'm just tired maybe I've been
  • 00:51:43 rambling for about 50 minutes and my
  • 00:51:47 brains turning to mush that's quite
  • 00:51:49 probable actually but anyway we trained
  • 00:51:52 with mini-batch sizes of 64 for the low-dimensional problems and 16 on pixels, with a replay buffer of 10 to the 6. For the exploration noise process we used temporally correlated noise in order to explore well in physical environments that have momentum: an Ornstein-Uhlenbeck process with theta equals 0.15 and sigma
  • 00:52:08 equals 0.2 and it tells you what it does
  • 00:52:10 all well and good. Okay, so these are the implementation details: we need 400 units and 300 units for the hidden layers, the Adam optimizer at 10 to the minus 4 for the actor and 10 to the minus 3 for the critic, and for the critic we need an L2 weight decay of 10 to the minus 2, a discount factor of gamma 0.99, and for the soft update factor tau we need 0.001.
  • 00:52:32 And we need initializations that are proportional to 1 over the square root of the fan-in for the lower layers, and plus or minus 0.003 for the final output layers of our fully connected networks; a summary is sketched below.
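Here is that summary gathered into one place as a minimal Python sketch (the dictionary and its key names are just illustrative, not from the paper's code):

```python
# DDPG hyperparameters from the paper, collected for reference
DDPG_CONFIG = {
    'actor_lr': 1e-4,          # Adam learning rate for the actor
    'critic_lr': 1e-3,         # Adam learning rate for the critic
    'critic_l2_decay': 1e-2,   # L2 weight decay on the critic only
    'gamma': 0.99,             # discount factor
    'tau': 0.001,              # soft target-update rate
    'fc1_units': 400,          # first hidden layer
    'fc2_units': 300,          # second hidden layer (actions enter Q here)
    'batch_size': 64,          # low-dimensional (state-vector) case
    'buffer_size': 10**6,      # replay buffer capacity
    'ou_theta': 0.15,          # Ornstein-Uhlenbeck noise parameters
    'ou_sigma': 0.2,
    'final_layer_init': 3e-3,  # uniform init bound for the output layers
}
```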
  • 00:52:46 Okay, so that's a lot of details, but we have everything we need to start implementing the paper, and that only took us about 50 minutes — quite a bit quicker than when I first read it; that took me quite a while. So let's
  • 00:52:58 head back up to the algorithm here and
  • 00:53:01 we'll keep that up as reference for the
  • 00:53:04 remainder of the video because that's
  • 00:53:06 quite critical so let's go ahead and
  • 00:53:09 head back to our code editor and start
  • 00:53:13 coding this up we'll start with the easy
  • 00:53:14 stuff first so let's start coding and we
  • 00:53:18 will start with probably one of the most confusing aspects of the problem: the Ornstein-Uhlenbeck action noise. Now,
  • 00:53:25 you can go ahead and do a google search
  • 00:53:27 for it and you'll find a Wikipedia
  • 00:53:29 article that talks a lot about the
  • 00:53:31 physical processes behind a lot of
  • 00:53:32 mathematical derivations and that's not
  • 00:53:34 particularly helpful so if you want to
  • 00:53:36 be a physicist I invite you to read that
  • 00:53:38 and check it out it's got some pretty
  • 00:53:39 cool stuff it took me back to my grad
  • 00:53:41 school days but we have a different
  • 00:53:42 mission in mind for the moment the
  • 00:53:44 mission now is to find a code
  • 00:53:46 implementation of this that we can use
  • 00:53:48 in our problem so if you then do a
  • 00:53:50 Google search for Ornstein-Uhlenbeck GitHub, because you want to find someone's
  • 00:53:55 github example for it you end up with a
  • 00:53:57 nice example from the open AI baseline
  • 00:54:01 library that shows you the precise form
  • 00:54:04 of it so let me show you that one second
  • 00:54:06 so you can see it here in the github
  • 00:54:10 there is a whole class for this right
  • 00:54:13 here that I've highlighted and this
  • 00:54:15 looks to do precisely what we want it
  • 00:54:16 has previous plus a delta term and a DT
  • 00:54:21 term so it looks like it's going to
  • 00:54:22 create correlations through this X
  • 00:54:24 previous term there's a reset function
  • 00:54:27 to reset the noise which we may want to
  • 00:54:29 use and there's this representation
  • 00:54:32 which we'll probably skip because we
  • 00:54:33 know what we're doing and it's not
  • 00:54:35 really critical for this particular application. It would be nice if you were writing a
  • 00:54:39 library as these guys were so they
  • 00:54:41 included the representation method so
  • 00:54:44 let's go ahead and code that up in the
  • 00:54:46 editor and that will tackle our first
  • 00:54:49 class so I'm gonna leave the notes up
  • 00:54:52 there for now they do no harm they're
  • 00:54:54 just comments at the top of the file so
  • 00:54:56 the first thing we'll need is to import
  • 00:54:58 numpy as NP we know we can go ahead and
  • 00:55:02 start and say import tensor flow as TF
  • 00:55:04 we're going to need tensor flow we may
  • 00:55:07 need something like OS to handle model
  • 00:55:10 savings so we can go ahead and import
  • 00:55:12 that as well just a fun fact it is
  • 00:55:16 considered good practice to import your
  • 00:55:18 system level packages first and followed
  • 00:55:22 by your library packages second in
  • 00:55:25 numerical order followed by your own
  • 00:55:27 personal code in numerical order so
  • 00:55:31 that's the imports that we need to start
  • 00:55:33 let's go ahead and code up our class
  • 00:55:36 we'll call it OUActionNoise, and
  • 00:55:38 that'll just be derived from the base
  • 00:55:41 object so the initializer will take a mu
  • 00:55:45 a sigma — they said they used a default value of, I believe, 0.15 — and a theta of 0.2; a dt term that is something like 1 times 10 to the minus 2; and our x0, which
  • 00:56:02 will save as none and again if you have
  • 00:56:06 any doubts on that just go check out the
  • 00:56:08 open AI baselines for it their
  • 00:56:10 implementation is probably correct right
  • 00:56:12 I'll give them the benefit of the doubt
  • 00:56:15 so go ahead and save your parameters as
  • 00:56:19 usual
  • 00:56:24 so we have a mu a theta a DT Sigma and X
  • 00:56:32 0 and we'll go ahead and call the reset
  • 00:56:36 function at the top now they override
  • 00:56:39 the call method and what this does is it
  • 00:56:42 enables you to say noise = OUActionNoise(...), and then when you want to get the noise you just say our_noise = noise() — you can use the parentheses; that's what overriding __call__ does.
  • 00:56:57 that's a good little tidbit to know so
  • 00:57:01 we want to implement the equation that they gave us: x equals self.x_prev, plus self.theta times (self.mu minus x_prev) times self.dt, plus sigma times the numpy square root of self.dt times a numpy random normal with the size of mu. Then set x_prev to the current value — that's how you create the temporal correlations — and return the value of the noise.
  • 00:57:42 We don't have a value for x_prev yet, so we set that in the reset function, which takes no parameters: self.x_prev equals x0 if self.x0 is not None, else numpy zeros_like of self.mu. And that's it for the noise class; a cleaned-up version is sketched below.
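Here is that class written out cleanly (NumPy only; the parameter defaults follow what was dictated above, which puts sigma at 0.15 and theta at 0.2):

```python
import numpy as np

class OUActionNoise(object):
    def __init__(self, mu, sigma=0.15, theta=0.2, dt=1e-2, x0=None):
        self.mu = mu
        self.sigma = sigma
        self.theta = theta
        self.dt = dt
        self.x0 = x0
        self.reset()

    def __call__(self):
        # x_t = x_{t-1} + theta*(mu - x_{t-1})*dt + sigma*sqrt(dt)*N(0, 1)
        x = self.x_prev + self.theta * (self.mu - self.x_prev) * self.dt + \
            self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape)
        self.x_prev = x  # keeping x_prev is what creates the temporal correlation
        return x

    def reset(self):
        self.x_prev = self.x0 if self.x0 is not None else np.zeros_like(self.mu)
```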
  • 00:58:07 That's pretty straightforward, so that's all well and good. We have one class down — we've taken care of the noise — and now we can move on to the replay buffer. This will be something similar to what I've implemented in the past. There are many
  • 00:58:25 different implementations and ways of
  • 00:58:26 implementing this. Many people use a built-in Python data structure called a deque (pronounced "deck"); basically it's a queue
  • 00:58:37 that you fill up over time and that's
  • 00:58:39 perfectly valid you can do that there's
  • 00:58:42 no reason not to I prefer using a set of
  • 00:58:45 arrays and using numpy to facilitate
  • 00:58:50 that the reason being that we can
  • 00:58:52 tightly control the data types of the
  • 00:58:55 stuff that we're saving for this
  • 00:58:57 pendulum environment it doesn't really
  • 00:58:59 matter but as you get more involved in
  • 00:59:00 this field you will see that you are
  • 00:59:03 saving stuff that has varying sizes so
  • 00:59:06 if you're trying to save images let's
  • 00:59:08 say from the one of the Atari libraries
  • 00:59:11 or a MuJoCo environment or something
  • 00:59:13 like that hope I pronounced that
  • 00:59:14 correctly you'll see that this memory
  • 00:59:18 can explode quite quickly you can eat
  • 00:59:20 into your RAM so the ability to
  • 00:59:21 manipulate the underlying data type
  • 00:59:24 representation whether you want to use
  • 00:59:26 single or double precision for either
  • 00:59:29 your floating-point numbers or integers
  • 00:59:30 is critical to memory management as well
  • 00:59:33 as taking advantage of some other
  • 00:59:35 optimizations for NVIDIA GPUs in the Turing class and above. So I always use
  • 00:59:41 the numpy arrays because it's a clean
  • 00:59:43 implementation that allows manipulation
  • 00:59:44 of data types you can use the DQ if you
  • 00:59:47 want it's perfectly valid so our
  • 00:59:52 separate class has its own initializer
  • 00:59:54 of course so we're going to want to pass
  • 00:59:57 in a maximum size the input shape and
  • 01:00:01 the number of actions right because we
  • 01:00:04 have to store the state action reward
  • 01:00:06 and new state tuples we're also going to
  • 01:00:09 want to facilitate the use of the done
  • 01:00:12 flag so we'll have an extra parameter in
  • 01:00:14 here and the reason behind that is
  • 01:00:16 intimately related to how the how the
  • 01:00:19 bellman equation is calculated it took
  • 01:00:21 me a second to think of that at the end
  • 01:00:24 of the episode the agent receives no
  • 01:00:25 further rewards and so the expected
  • 01:00:28 future reward, the discounted future
  • 01:00:29 reward if you will is identically zero
  • 01:00:31 so you have to multiply the reward for
  • 01:00:35 the next state by zero if that next
  • 01:00:38 state follows the terminal state if your
  • 01:00:39 current state is terminal in the next
  • 01:00:41 state is following the terminal state so
  • 01:00:42 you don't want to take into account
  • 01:00:43 anything from the expected future
  • 01:00:45 rewards because they're identically zero
  • 01:00:47 so we need a done flag this is all I
  • 01:00:49 wanted to say
  • 01:00:51 So we save our parameters. Next we can say self.mem_cntr equals zero, and that will keep track of the position of our most recently saved memory. The state memory is an np.zeros of mem_size by input shape; the new state memory is the same deal — just totally the same, good grief.
  • 01:01:26 Okay, so now we have the action memory, and that's np.zeros of self.mem_size by self.n_actions — we didn't save n_actions, did we? So we'll save that and call it n_actions. We have the reward memory, and that's just a scalar value, so that only gets shape self.mem_size. We also need the terminal memory for the done flags, and that'll be shaped self.mem_size, and I have it as numpy float32; if
  • 01:02:04 I recall correctly that is due to the
  • 01:02:06 data types in the PI torch
  • 01:02:08 implementation is probably not necessary
  • 01:02:10 here in the tensorflow implementation
  • 01:02:12 but I left it the same way just to be
  • 01:02:14 consistent it doesn't hurt anything so
  • 01:02:16 next we need a function to store a
  • 01:02:18 current transition so that'll take a
  • 01:02:21 state action reward new state and a done
  • 01:02:25 flag as input so the index of where we
  • 01:02:29 want to store that memory is the memory
  • 01:02:32 counter modulus the mem_size. So for any mem_cntr less than mem_size this just returns mem_cntr; for anything larger than mem_size it just wraps around. So if
  • 01:02:45 you have a million memories all the way
  • 01:02:48 up from zero to nine hundred nine
  • 01:02:49 thousand nine hundred ninety nine it
  • 01:02:51 will be that number and then once it's a
  • 01:02:54 million it wraps back round to 0 and
  • 01:02:56 then 1 and 2 and so on and so forth and
  • 01:02:58 so that way your overwriting memories at
  • 01:03:00 the earliest part of the array with the
  • 01:03:02 newest memories
  • 01:03:03 precisely as they described in the paper
  • 01:03:05 If you're using the deque method, then I believe you would just pop it off of the left — I believe; don't quote me, because I haven't really used that implementation, but from what I've read,
  • 01:03:16 that's how it operates. Then the new_state_memory at that index gets state_, the reward_memory at that index gets the reward, and likewise for the action memory. Now keep in mind that the actions
  • 01:03:39 in this case are arrays themselves so
  • 01:03:41 it's an array of arrays just keep that
  • 01:03:43 in the back your mind so that you can
  • 01:03:45 visualize the problem we're trying to
  • 01:03:46 solve here next up we have the terminal
  • 01:03:49 memory now a little twist here is that
  • 01:03:52 we want this to be 1 minus int of done
  • 01:03:56 so done is either true or false so you
  • 01:03:58 don't want to count the rewards after
  • 01:04:01 the episode has ended so when done is
  • 01:04:04 true you want to multiply by 0 so 1
  • 01:04:07 minus in to true is 1 minus 1 which is 0
  • 01:04:11 and when it's not over it's 1 minus 0
  • 01:04:14 which is just 1 so that's precisely the
  • 01:04:16 behavior we want and finally you want to
  • 01:04:18 increment men counter by 1 every time
  • 01:04:20 you store em new memory next we need a
  • 01:04:25 function to sample our buffer and we
  • 01:04:28 want to pass in a batch size
  • 01:04:30 alternatively you could make batch size
  • 01:04:32 a member variable of this class it's not
  • 01:04:35 really a big deal I just did it this way
  • 01:04:38 for whatever reason I chose to do it
  • 01:04:40 that way so what we want to do is we
  • 01:04:42 want to find the minimum either so let's
  • 01:04:47 back up for a second so what we want is
  • 01:04:48 to sample the memories from either the
  • 01:04:51 0th position all the way up to the most
  • 01:04:53 filled memory the last filled memory so
  • 01:04:56 if you have less than the max memory you
  • 01:05:00 want to go from 0 to mem counter
  • 01:05:03 otherwise you want to sample anywhere in
  • 01:05:05 that whole interval. So max_mem equals the minimum of either mem_cntr or mem_size, and the reason you don't want to just use mem_cntr is that mem_cntr grows larger than mem_size. So if you try to tell it to select something from the range mem_cntr when mem_cntr is
  • 01:05:26 greater than mem size you'll end up
  • 01:05:28 trying to access elements of the array
  • 01:05:30 that aren't there and it'll throw an
  • 01:05:31 error so that's why you need this step
  • 01:05:34 next you want to take a random choice of
  • 01:05:37 from zero to maximum of a size batch
  • 01:05:40 size so then you just want to gather
  • 01:05:43 those states from the respective arrays
  • 01:05:48 like so — I forgot the self, of course: new states, actions, rewards, and what we'll call the terminal batch. I believe that's everything, and you want to return the states, actions, rewards, new states, and terminal flags; a cleaned-up sketch of the whole class is below.
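Here is the replay buffer written out cleanly (NumPy arrays; names follow the transcript's conventions, and input_shape is assumed to be a tuple):

```python
import numpy as np

class ReplayBuffer(object):
    def __init__(self, max_size, input_shape, n_actions):
        self.mem_size = max_size
        self.mem_cntr = 0
        self.state_memory = np.zeros((self.mem_size, *input_shape))
        self.new_state_memory = np.zeros((self.mem_size, *input_shape))
        self.action_memory = np.zeros((self.mem_size, n_actions))
        self.reward_memory = np.zeros(self.mem_size)
        # stores 1 - int(done) so terminal states zero out future rewards
        self.terminal_memory = np.zeros(self.mem_size, dtype=np.float32)

    def store_transition(self, state, action, reward, state_, done):
        index = self.mem_cntr % self.mem_size  # wrap around when full
        self.state_memory[index] = state
        self.new_state_memory[index] = state_
        self.action_memory[index] = action
        self.reward_memory[index] = reward
        self.terminal_memory[index] = 1 - int(done)
        self.mem_cntr += 1

    def sample_buffer(self, batch_size):
        max_mem = min(self.mem_cntr, self.mem_size)
        batch = np.random.choice(max_mem, batch_size)
        states = self.state_memory[batch]
        actions = self.action_memory[batch]
        rewards = self.reward_memory[batch]
        new_states = self.new_state_memory[batch]
        terminal = self.terminal_memory[batch]
        return states, actions, rewards, new_states, terminal
```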
  • 01:06:29 Okay, so now we are done with the replay buffer class, and that's actually pretty
  • 01:06:34 straightforward and if you've seen some
  • 01:06:36 of my other videos on deep Q-learning, then you've seen pretty much the same implementation; I just typically kept it in the agent class. I'm kind of refining my approach, getting more sophisticated over time, so it makes sense to stick it in its
  • 01:06:47 own class so we're already like 40
  • 01:06:50 percent of the way there we've got five
  • 01:06:52 classes in total so that's good news so
  • 01:06:54 next up we have to contend with the
  • 01:06:56 actor and critic networks so we'll go
  • 01:06:59 ahead and start with the actor and from
  • 01:07:03 there keep in mind that we have two
  • 01:07:05 actor networks and we're gonna have to
  • 01:07:07 contend with some of the peculiarities
  • 01:07:08 in the way TensorFlow likes to do stuff
  • 01:07:12 so let's get started with the Actor class. Of course, in TensorFlow you don't derive from any particular class, whereas in PyTorch you would derive from nn.Module. So we'll need a learning rate, a number of actions, and a name — and the name
  • 01:07:34 is there to distinguish the regular
  • 01:07:36 actor network from the target actor
  • 01:07:38 network input dims we're gonna want to
  • 01:07:43 pass it a session so tensorflow has the
  • 01:07:46 construct of the session which houses
  • 01:07:48 the graph and all the variables and
  • 01:07:50 parameters and stuff like that you can
  • 01:07:52 have each class having its own session
  • 01:07:54 but it's more tidy to pass in a single
  • 01:07:57 session to each of the classes a number
  • 01:08:01 of dimensions for the first fully
  • 01:08:04 connected layer, which of course should be 400 if we're going to implement the paper precisely; fc2_dims is 300; an
  • 01:08:10 action bound a batch size that defaults
  • 01:08:13 to 64 a checkpoint directory and the
  • 01:08:18 purpose of that is to save our model
  • 01:08:20 and in the case of the pendulum it
  • 01:08:24 doesn't really matter because it's so
  • 01:08:25 quick to run but in general you want a
  • 01:08:27 way of saving these models because it
  • 01:08:28 can take a long time to run so we'll
  • 01:08:32 save our learning rate number of actions
  • 01:08:36 and all of the parameters we passed in
  • 01:09:03 dot fc1_dims — and the purpose of this
  • 01:09:03 action bound is to accommodate
  • 01:09:08 environments where the action bounds are greater than plus or minus one. So if your environment goes from minus two to plus two, the tangent hyperbolic is only gonna cover like half your range, right — from minus one to plus one — and so the action bound is a multiplicative factor to make sure that
  • 01:09:20 you can sample the full range of the
  • 01:09:21 actions available to your agent and we
  • 01:09:26 need to save the checkpoint dir.
  • 01:09:30 and finally we want to call a build
  • 01:09:32 network function that's not final but
  • 01:09:35 we'll call the build Network function
  • 01:09:36 next up since we have to do the soft
  • 01:09:41 cloning the soft update rule for the
  • 01:09:44 target actor class and target critic
  • 01:09:46 then we know that we have to find a way
  • 01:09:49 of keeping track of the parameters in
  • 01:09:51 each Network so we're gonna keep track
  • 01:09:54 of them here in the variable params and
  • 01:09:58 it's tensorflow dot trainable variables
  • 01:10:00 with a scope of a self dot name so the
  • 01:10:04 we have a single session in a single
  • 01:10:06 graph and you're gonna have multiple
  • 01:10:08 deep neural networks within that graph
  • 01:10:09 we don't want to update the critic network when we're trying to update the actor network, and vice versa, right?
  • 01:10:16 we want those to be independent and so
  • 01:10:17 we scoped them with their own name so
  • 01:10:19 that tensorflow knows hey this is a
  • 01:10:21 totally different set of parameters from
  • 01:10:22 this
  • 01:10:22 that will aid in copying stuff later and
  • 01:10:25 also make sure that everything is nice
  • 01:10:27 and tidy and the scope is what
  • 01:10:31 facilitates that will also need a saver
  • 01:10:37 object to save the model let's make a
  • 01:10:41 checkpoint file, and that's where we use os: it's os.path.join of the checkpoint dir and the name plus '_ddpg.ckpt'. So this will automatically
  • 01:10:55 scope the save files for us so that way
  • 01:10:58 we don't confuse the parameters for the
  • 01:11:00 target actor and actor or critic and
  • 01:11:03 target critic or even actor and critic
  • 01:11:05 for that matter very important so we
  • 01:11:09 were gonna have to calculate some
  • 01:11:11 gradients and we're gonna do that by
  • 01:11:12 hand so we're gonna need a series of
  • 01:11:15 functions that will facilitate that and
  • 01:11:18 the first of which is the unnormalized
  • 01:11:23 actor gradients, and that is given by tf.gradients of self.mu, self.params, and minus self.action_gradient. So, to unpack that: mu will be the mu from the paper, the
  • 01:11:43 actual actions of the agent params are
  • 01:11:46 our network parameters and this action
  • 01:11:49 gradients so let's go back to the paper
  • 01:11:51 for a second so we can get some idea
  • 01:11:52 what i'm talking about here so if we
  • 01:11:54 look at the algorithm we can see that
  • 01:11:56 we're going to need these gradients of
  • 01:11:59 the critic function with respect to the
  • 01:12:01 actions taken we're also going to need
  • 01:12:04 the gradient of the actual mu with
  • 01:12:07 respect to the weights of the network so
  • 01:12:08 we're passing in into tensor flow
  • 01:12:10 gradients this function mu the
  • 01:12:13 parameters to get the gradient with
  • 01:12:15 respect to those parameters and then
  • 01:12:17 we're gonna have to calculate the
  • 01:12:18 gradient of the critic with respect to
  • 01:12:21 the actions taken so we'll have to
  • 01:12:23 calculate this later that's a
  • 01:12:25 placeholder — that's the minus self.action_gradient — and we're gonna
  • 01:12:28 calculate that in the learning function
  • 01:12:30 but that's where all of that comes from
  • 01:12:32 so now let's go back to the code editor
  • 01:12:34 and continue
  • 01:12:35 So that's our unnormalized actor gradients — unnormalized just because we're going to take 1 over N times the sum — so we need a function for performing that normalization. This actor_gradients parameter has to be a list, so we cast it as a list and just map a lambda function, x: tf.div(x, batch_size). No big
  • 01:13:10 deal there so optimize is our
  • 01:13:14 optimization step and of course that's
  • 01:13:19 the Adam optimizer. We want to optimize with the learning rate of self.lr, and we want to apply gradients. Typically we'd use .minimize(loss), but in this case we're calculating our gradients manually, so we need to apply those gradients, and what we want to apply is the actor gradients to the params; a sketch of the whole setup is below.
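As a minimal sketch of that whole setup (assuming TensorFlow 1.x and the attribute names used above — self.mu, self.params, self.action_gradient, self.lr, self.batch_size — inside the Actor's initializer):

```python
import tensorflow as tf

# tf.gradients(ys, xs, grad_ys) gives the chain rule: the gradient of mu with
# respect to the actor weights, weighted by dQ/da fed in through the
# action_gradient placeholder. The minus sign turns gradient ascent on Q
# into the descent that apply_gradients performs.
self.unnormalized_actor_gradients = tf.gradients(
    self.mu, self.params, -self.action_gradient)

# divide by N so the summed gradient becomes the 1/N average from the paper
self.actor_gradients = list(map(lambda x: tf.div(x, self.batch_size),
                                self.unnormalized_actor_gradients))

self.optimize = tf.train.AdamOptimizer(self.lr).apply_gradients(
    zip(self.actor_gradients, self.params))
```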
  • 01:13:43 I would encourage you to go look at the TensorFlow documentation for all of these; the video is getting a little bit long — we're already up to like an hour and 10 or 20 minutes, something like that — so I'm not gonna go through all of the TensorFlow documentation. Feel free to look that up; I had to when I was building this. So next up we need
  • 01:14:00 to build our network
  • 01:14:06 so this is how we're gonna handle the
  • 01:14:09 scoping: tf.variable_scope(self.name), so that every network gets its own scope. We need a placeholder for the input; that'd be a 32-bit floating point number with a shape of None — which is the batch size — by input_dims. That's getting long, so let's put
  • 01:14:33 this on its own line and we're gonna
  • 01:14:36 give it a name it's not critical
  • 01:14:38 the name parameter is just for debugging
  • 01:14:42 if something goes wrong then you can
  • 01:14:43 kind of trace where it went wrong makes
  • 01:14:45 life a little bit easier the action
  • 01:14:47 gradient is also a placeholder that is
  • 01:14:49 what we're gonna calculate in the learn
  • 01:14:51 function for the agent and that gets a
  • 01:14:58 shape of None by n_actions, so it'll be the gradient of Q with respect to each action — it has one dimension per action. So those are our
  • 01:15:10 two placeholder variables now we get to
  • 01:15:13 construct the actual network so let's
  • 01:15:15 handle the initialization first so f1 is
  • 01:15:18 the fan-in it's 1 divided by numpy
  • 01:15:20 square root of fc1_dims, and our dense1 is tf.layers.dense; that takes self.input as input, with self.fc1_dims as units. Our kernel_initializer equals random_uniform — I forgot an import, one second — from minus f1 to f1, and the bias_initializer is also random_uniform from minus f1 to f1; that gets two sets of parentheses.
  • 01:16:06 So we have to come back up to our imports and say from tensorflow.initializers import random_uniform, and now
  • 01:16:29 we're good to go let's come back down
  • 01:16:32 here we have dense one now we want to do
  • 01:16:37 the batch normalization so batch 1
  • 01:16:40 equals TF layers badge normalization
  • 01:16:44 dense one and that doesn't get
  • 01:16:47 initialized so now let's activate our
  • 01:16:52 first layer, and that's just the ReLU activation of the batch-normed output. Now, it is an open debate, from
  • 01:17:02 what I've read online about whether or
  • 01:17:04 not you should do the activation before
  • 01:17:07 or after the batch normalization I'm in
  • 01:17:11 the camp of doing it after the
  • 01:17:14 activation after the batch norm that's
  • 01:17:16 because the ReLU activation function — at least in the case of ReLU — lops off everything
  • 01:17:23 lower than zero so your statistics might
  • 01:17:26 get skewed to be positive instead of
  • 01:17:27 maybe they're zero maybe they're
  • 01:17:29 negative who knows so I think the batch
  • 01:17:31 norm is probably best before the
  • 01:17:33 activation and indeed this works out
  • 01:17:35 this is something you can play with so
  • 01:17:37 go ahead and fork this repo and play
  • 01:17:39 around with it and see how much of a
  • 01:17:41 difference it makes for you maybe I
  • 01:17:42 missed something when I try to do it the
  • 01:17:43 other way it's entirely possible I miss
  • 01:17:45 stuff all the time so you know it's
  • 01:17:48 something you can improve upon that's
  • 01:17:49 how I chose to do it and it seems to
  • 01:17:51 work and who knows maybe other
  • 01:17:53 implementations work as well so now
  • 01:17:56 let's add some space so f2 is 1 over
  • 01:18:01 square root
  • 01:18:03 of fc2_dims. dense2 is similar: that takes layer1_activation as input, with fc2_dims as the units. I'm gonna come up here and copy this — there we go — except we have to change f1 to f2. Perfect. Then we have batch2, which takes dense2 as input, and the layer-2 activation is the ReLU of batch2. Now finally we have
  • 01:18:52 the output layer which is the actual
  • 01:18:55 policy of our agent the deterministic
  • 01:18:56 policy of course and from the paper that
  • 01:18:59 gets initialized with the value of 0.003, and we're going to call this mu. That gets layer2_activation as input, and it needs n_actions as the number of output units. What's our activation? That is tangent hyperbolic, tanh. And I will go ahead and copy the initializers here; of course that gets f3, not f2.
  • 01:19:36 Perfect. So that is mu, and then we want to take into account the fact that our environment may very well require actions that have values greater than plus or minus one. So self.mu is tf.multiply of mu and the action bound; the action bound will be, you know, something like two — it needs to be positive so that you don't flip the actions. That's pretty straightforward, so now we've built our network; a cleaned-up sketch of the whole method is below.
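Here is the actor's build_network gathered into one place (assuming TensorFlow 1.x, with random_uniform imported from tensorflow.initializers as above):

```python
def build_network(self):
    with tf.variable_scope(self.name):
        # placeholders: states in, and dQ/da coming from the critic
        self.input = tf.placeholder(tf.float32,
                                    shape=[None, *self.input_dims],
                                    name='inputs')
        self.action_gradient = tf.placeholder(tf.float32,
                                              shape=[None, self.n_actions],
                                              name='action_gradient')

        # first hidden layer: fan-in initialization, batch norm, then ReLU
        f1 = 1. / np.sqrt(self.fc1_dims)
        dense1 = tf.layers.dense(self.input, units=self.fc1_dims,
                                 kernel_initializer=random_uniform(-f1, f1),
                                 bias_initializer=random_uniform(-f1, f1))
        batch1 = tf.layers.batch_normalization(dense1)
        layer1_activation = tf.nn.relu(batch1)

        # second hidden layer, same pattern
        f2 = 1. / np.sqrt(self.fc2_dims)
        dense2 = tf.layers.dense(layer1_activation, units=self.fc2_dims,
                                 kernel_initializer=random_uniform(-f2, f2),
                                 bias_initializer=random_uniform(-f2, f2))
        batch2 = tf.layers.batch_normalization(dense2)
        layer2_activation = tf.nn.relu(batch2)

        # output layer: tanh bounds the action, then scale by the action bound
        f3 = 0.003
        mu = tf.layers.dense(layer2_activation, units=self.n_actions,
                             activation=tf.nn.tanh,
                             kernel_initializer=random_uniform(-f3, f3),
                             bias_initializer=random_uniform(-f3, f3))
        self.mu = tf.multiply(mu, self.action_bound)
```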
  • 01:20:11 The next thing we need is a way of getting the
  • 01:20:13 actual actions out of the network so we
  • 01:20:15 have a prediction function that takes
  • 01:20:17 some inputs, and you want to return self.sess.run of self.mu with a feed dictionary mapping self.input to the inputs you pass in. That's all there is to doing the feed-forward pass — kind of an interesting contrast to how PyTorch does it, with the explicit
  • 01:20:39 construction of the feed-forward
  • 01:20:41 function this just runs the session on
  • 01:20:43 this and then goes back and finds all
  • 01:20:45 the associations between the respective
  • 01:20:48 variables nice and simple
  • 01:20:50 now we need a function to Train and
  • 01:20:53 that'll take inputs and gradients this
  • 01:20:57 is what will perform the actual back
  • 01:20:59 propagation through the network and you
  • 01:21:01 want to run self.optimize — that's our operation that accommodates learning — with a feed dictionary of self.input: inputs and self.action_gradient: gradients. So that is also reasonably
  • 01:21:22 straightforward so you know let's format
  • 01:21:26 that a little bit better alright so that
  • 01:21:29 is our training function next we need
  • 01:21:32 two functions to accommodate the loading
  • 01:21:34 and the saving of the model so define
  • 01:21:37 save_checkpoint; print that we're saving, and then you want to say self.saver.save — very creative — which saves the current session to the checkpoint file. The load_checkpoint function is the same thing just in reverse: here we want to print "loading checkpoint" and call self.saver.restore with the session and the checkpoint
  • 01:22:19 file. You can only call this after instantiating the agent, so it will have a default session with some initialized values, and you want to load the variables from the checkpoint file into that session. And that is it for
  • 01:22:34 the actor class this is reasonably
  • 01:22:37 straightforward the only real mojo here
  • 01:22:39 is the actor gradients and this is just
  • 01:22:42 two functions that accommodate the fact
  • 01:22:44 that we are going to manually calculate
  • 01:22:45 the gradient of the critic with respect
  • 01:22:48 to the actions taken so let's go ahead
  • 01:22:51 and do the critic class next that is
  • 01:22:54 also very similar so that again derives
  • 01:23:00 from the base object gets an initializer
  • 01:23:03 and it's pretty much the same number of
  • 01:23:07 actions, a name, input dims, a session, fc1_dims, fc2_dims, a batch size that will default to 64, and a checkpoint dir that will default to this. Just a note: you have to do a mkdir on that tmp/ddpg directory first, otherwise it'll bark at you — not a
  • 01:23:31 big deal just something to be aware of
  • 01:23:33 since it's identical let's go ahead and
  • 01:23:36 copy a good chunk of this stuff here
  • 01:23:39 what exactly is the same all of this now
  • 01:23:44 we need the checkpoint file let's grab
  • 01:23:45 that ctrl C come down here control V and
  • 01:23:51 voila it is all the same so
  • 01:23:53 very straightforward nothing too magical
  • 01:23:57 about that so now let's handle the
  • 01:23:59 optimizer. Since we've already called the function to build our network, we can define our optimizer: self.optimize is tf.train.AdamOptimizer, and we're gonna minimize the loss, which we will calculate in the build_network function.
  • 01:24:21 We also need an operation to actually calculate the gradients of Q with respect to a, so we have self.action_gradients equals tf.gradients of self.q and self.actions. Let's build our network, with tf.variable_scope of self.name. So now we need our placeholders
  • 01:24:49 again: we need self.input, a placeholder of float32, and we need a shape of None by input
  • 01:25:03 dims now if you're not too familiar with
  • 01:25:07 tensorflow
  • 01:25:07 specifying none in the first dimension
  • 01:25:09 tell sensor flow you're gonna have some
  • 01:25:11 type of batch of inputs and you don't
  • 01:25:13 know what that batch size will be
  • 01:25:14 beforehand so just expect anything and
  • 01:25:20 let's delete that say i need a comma for
  • 01:25:23 sure and say name equals inputs here we
  • 01:25:31 go
  • 01:25:31 did i forget a comma up here let's just
  • 01:25:33 make sure i did not okay so we also need
  • 01:25:41 the actions because remember we only
  • 01:25:44 take into account the actions at the
  • 01:25:47 second hidden layer of the critic neural
  • 01:25:50 network
  • 01:25:51 That's a float32, with shape None by n_actions and a name of 'actions'. Now, much like
  • 01:26:07 with q-learning we have a target value
  • 01:26:09 let me just go back to the paper really
  • 01:26:12 quickly show you what that precisely
  • 01:26:13 will be so that target value will be
  • 01:26:17 this quantity here and we will calculate
  • 01:26:18 that in the learning function as we as
  • 01:26:22 we get to the agent class so let's go
  • 01:26:25 back to the code editor and finish this
  • 01:26:26 up so Q target is just another
  • 01:26:32 number, None by one; it's a scalar, so it
  • 01:26:32 number done by one it's a scalar so it
  • 01:26:38 is shaped batch size by one and we will
  • 01:26:41 call it targets okay so now we have a
  • 01:26:48 pretty similar setup to the actor
  • 01:26:51 Network so let's go ahead and come up
  • 01:26:53 here and copy this no not all that just
  • 01:26:58 this and come back down here so f1 we
  • 01:27:03 recognize that's just the fan-in we have
  • 01:27:06 a dense layer for the inputs and we want
  • 01:27:10 to initialize that with a random number
  • 01:27:11 pretty straightforward then we can come
  • 01:27:14 up to f2 and copy that as well — sorry, the second layer; it'll be a little bit
  • 01:27:22 different but we'll handle that
  • 01:27:23 momentarily so now f2 the layer 2 is
  • 01:27:29 pretty similar the only real difference
  • 01:27:32 is that we want to get rid of that
  • 01:27:34 activation and the reason we want to get
  • 01:27:36 rid of that activation is because after
  • 01:27:38 I do the batch norm we have to take into
  • 01:27:39 account the actions so we need another
  • 01:27:41 layer. So action_in is tf.layers.dense: it's going to take in self.actions, which we'll pass in from the learning function, and it's going to output fc2_dims units with a ReLU activation. Then our state_actions will be the addition of batch2 and action_in, and
  • 01:28:13 then we want to go ahead and activate
  • 01:28:16 that okay so this is something else I
  • 01:28:22 pointed out in my PI torch video where
  • 01:28:24 this is a point of debate with the way
  • 01:28:25 I've implemented this I've done it a
  • 01:28:27 couple different ways they've done the
  • 01:28:28 different variations and this is what I
  • 01:28:29 found to work I'm doing a double
  • 01:28:31 activation here: I'm applying the ReLU activation on action_in — on the output of that dense layer — and then I'm activating the sum. Now, the ReLU function is not commutative with respect to addition, so the ReLU of the sum is different from the sum of the ReLUs, and
  • 01:28:50 you can prove that to yourself on a
  • 01:28:52 sheet of paper but it's debatable on
  • 01:28:54 whether or not the way I've done it is
  • 01:28:56 correct it seems to work so I'm gonna
  • 01:28:58 stick with it for now. Again, feel free to fork or clone it, change it up, see how it works,
  • 01:29:02 and improve it for me that would be
  • 01:29:04 fantastic
  • 01:29:06 then make a pull request and I'll
  • 01:29:08 disseminate that to the community so we
  • 01:29:10 have our state actions now we need to
  • 01:29:14 calculate the actual output of the layer
  • 01:29:15 and an f3 is our uniform initializer for
  • 01:29:20 our final layer, which is self.q: tf.layers.dense, which takes
  • 01:29:26 state actions as input outputs a single
  • 01:29:29 unit and we have the kernel initializers
  • 01:29:34 and bias initializers similar to up here
  • 01:29:41 let's paste that OOP there we go that's
  • 01:29:46 a little bit better still got a whole
  • 01:29:48 bunch of white space there all right so
  • 01:29:50 we are missing one thing and that one
  • 01:29:53 thing is the regularizer so as they said
  • 01:29:56 thing is the regularizer. As they said in the paper, they have L2 regularization on the critic, so we do that with kernel_regularizer equals tf.keras.regularizers.l2 with the value 0.01. So that is that for the Q, and notice that
  • 01:30:22 single unit because this is a scalar
  • 01:30:23 value you want the value of the
  • 01:30:26 particular state action pair finally we
  • 01:30:29 have the loss function and that's just a
  • 01:30:34 mean squared error of q_target and q. So q_target is the placeholder up here — that's what we'll pass in from the learn function of the agent — and self.q is the output of the deep neural network; a cleaned-up sketch of the critic's build_network is below.
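Here is the critic's build_network gathered into one place (TensorFlow 1.x again; note that the action enters as a separate dense layer at the second hidden layer, and the scalar Q output carries the paper's L2 weight decay):

```python
def build_network(self):
    with tf.variable_scope(self.name):
        self.input = tf.placeholder(tf.float32,
                                    shape=[None, *self.input_dims],
                                    name='inputs')
        self.actions = tf.placeholder(tf.float32,
                                      shape=[None, self.n_actions],
                                      name='actions')
        self.q_target = tf.placeholder(tf.float32,
                                       shape=[None, 1],
                                       name='targets')

        f1 = 1. / np.sqrt(self.fc1_dims)
        dense1 = tf.layers.dense(self.input, units=self.fc1_dims,
                                 kernel_initializer=random_uniform(-f1, f1),
                                 bias_initializer=random_uniform(-f1, f1))
        batch1 = tf.layers.batch_normalization(dense1)
        layer1_activation = tf.nn.relu(batch1)

        f2 = 1. / np.sqrt(self.fc2_dims)
        dense2 = tf.layers.dense(layer1_activation, units=self.fc2_dims,
                                 kernel_initializer=random_uniform(-f2, f2),
                                 bias_initializer=random_uniform(-f2, f2))
        batch2 = tf.layers.batch_normalization(dense2)

        # actions are not included until the second hidden layer of Q
        action_in = tf.layers.dense(self.actions, units=self.fc2_dims,
                                    activation=tf.nn.relu)
        state_actions = tf.nn.relu(tf.add(batch2, action_in))

        # scalar Q value, with the paper's L2 weight decay on the critic only
        f3 = 0.003
        self.q = tf.layers.dense(state_actions, units=1,
                                 kernel_initializer=random_uniform(-f3, f3),
                                 bias_initializer=random_uniform(-f3, f3),
                                 kernel_regularizer=tf.keras.regularizers.l2(0.01))

        self.loss = tf.losses.mean_squared_error(self.q_target, self.q)
```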
  • 01:30:58 Okay, so now, similar to the actor, we have a prediction function. It
  • 01:30:58 takes inputs and actions and you want to
  • 01:31:01 return self that's a stun self dot Q
  • 01:31:07 with a feed dictionary of sub that input
  • 01:31:13 inputs and actions oops
  • 01:31:19 next you need a training function and
  • 01:31:24 that's slightly more complicated than
  • 01:31:25 the actor's; it takes in the inputs,
  • 01:31:28 actions and a target and you want to
  • 01:31:32 return self.sess.run of self.optimize with a feed dictionary of self.input: inputs, self.actions: actions, and self.q_target: q_target. The spacing came out a little wonky, but
  • 01:32:01 whatever we'll leave it that's a
  • 01:32:03 training function next we need a
  • 01:32:05 function to get the action gradients and
  • 01:32:07 that'll run that action gradients
  • 01:32:08 operation up above so let's get that
  • 01:32:14 again takes inputs and actions as input
  • 01:32:17 and you want to return self.sess.run of self.action_gradients, with its own feed dictionary of self.input: inputs and self.actions: actions.
  • 01:32:38 and then we also have the save and load
  • 01:32:42 checkpoint functions which are identical
  • 01:32:44 to the actor's, so let's just copy and paste those. There we go.
  • 01:32:51 so that is it for the critic class so
  • 01:32:57 now we have most of what we need we have
  • 01:32:59 our noise, our replay buffer, our actor, and our critic; now all we need is our agent. The agent is what ties everything together: it handles the learning functionality, and it holds the noise, the memory of the replay buffer, as well as the four different deep neural networks. It derives from the base object, and its
  • 01:33:25 initializer is a little bit long and
  • 01:33:28 that takes an alpha and a beta these are
  • 01:33:31 the learning rates for the actor and
  • 01:33:35 critic respectively; recall from what we read in the paper, they use 0.0001 and 0.001 for the two networks. We need input dims, tau, the environment — that's how we're going to get the action bounds — a gamma of 0.99 as in the paper, a number of actions that defaults to 2, a mem_size for our memory of one million, a layer 1 size of 400, a layer 2 size of 300,
  • 01:34:06 and a batch size of 64 so of course we
  • 01:34:11 want to save all of our parameters
  • 01:34:17 we'll need the memory and that's just a
  • 01:34:20 ReplayBuffer with max_size, input dims, and the number of actions. We will need a batch_size. Here's where we're gonna
  • 01:34:34 store the session and this is so that we
  • 01:34:37 have a single session for all four
  • 01:34:40 networks and I believe and I believe
  • 01:34:45 don't quote me on this but I tried it
  • 01:34:48 with an individual Network for sorry an
  • 01:34:51 individual session for each network and
  • 01:34:52 it was very unhappy when I was
  • 01:34:55 attempting to copy over parameters from
  • 01:34:56 one network to another I figured there
  • 01:34:58 was some scoping issues so I just
  • 01:35:00 simplified it by having a single session
  • 01:35:02 there's no real reason that I can think
  • 01:35:04 up to have more than one and that is an
  • 01:35:08 actor that gets alpha a number of
  • 01:35:11 actions the name is just actor input
  • 01:35:14 dims, the session, layer 1 size, layer 2 size, and env.action_space.high — that's our action bound, the action space high. Next we have a critic that gets beta, n_actions, the name 'critic', the input dims, self.sess, layer 1 size, and layer 2 size, and we don't pass in
  • 01:35:46 anything about the environment there so
  • 01:35:48 now we can just copy these and instead
  • 01:35:52 of actor it is target actor
  • 01:35:58 and let's go ahead and clean up and be
  • 01:36:02 consistent with our PEP 8 style guide — always important, and that
  • 01:36:08 actually makes you stand out I've worked
  • 01:36:10 on projects where the manager was quite
  • 01:36:13 happy to see that I had a somewhat
  • 01:36:15 strict adherence to it just something to
  • 01:36:18 take note of so then we have a target
  • 01:36:22 critic as well and we will clean up this
  • 01:36:29 and so that is all four of our deep
  • 01:36:34 neural networks that's pretty
  • 01:36:35 straightforward so now we need noise
  • 01:36:40 that's an OUActionNoise with mu equal to numpy zeros in the shape of n_actions. So now we need operations to
  • 01:36:56 perform the soft updates so the first
  • 01:37:00 time I tried it I defined it as its own
  • 01:37:02 separate function and that was a
  • 01:37:06 disaster it was a disaster because it
  • 01:37:09 would get progressively slower at each
  • 01:37:11 soft update and I don't quite know the
  • 01:37:14 reason for that I just know that that's
  • 01:37:17 what happened and so when I moved it
  • 01:37:19 into the initializer and defined one
  • 01:37:21 operation
  • 01:37:21 I'm guessing it's because it adds every
  • 01:37:24 time you call it it probably adds
  • 01:37:25 something to the graph so that adds
  • 01:37:26 overhead to the calculation that's my
  • 01:37:28 guess I don't know that's accurate it's
  • 01:37:30 just kind of how I reasoned it but
  • 01:37:31 anyway let's go ahead and define our
  • 01:37:35 update operations here so what we want
  • 01:37:39 to do is iterate over our target critic
  • 01:37:43 parameters and call the assignment
  • 01:37:47 operation. What do we want to assign? We want
  • 01:37:50 to assign the product of the critic params[i]
  • 01:37:57 and self.tau, plus tf.multiply of
  • 01:38:03 self.target_critic.params[i]
  • 01:38:13 times 1 minus self.tau,
  • 01:38:29 and that is a list comprehension, for i in
  • 01:38:32 range(len(self.target_critic.params)).
  • 01:38:45 So then we have a similar
  • 01:38:52 operation for the actor; we just want to
  • 01:38:58 swap critic for actor and target critic for
  • 01:39:13 target actor everywhere. So now
  • 01:39:25 we have our soft update
  • 01:39:28 operations according to the paper; a minimal sketch of these ops is shown below.
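A rough sketch of what those TF1-style soft-update ops might look like, assuming tensorflow is imported as tf and each network object exposes its trainable variables as a `params` list (that attribute name is an assumption, not necessarily the exact code from the video):

```python
# Soft-update ops built once in __init__ (sketch, not the exact code):
# theta_target <- tau * theta_online + (1 - tau) * theta_target
self.update_critic = [
    self.target_critic.params[i].assign(
        tf.multiply(self.critic.params[i], self.tau) +
        tf.multiply(self.target_critic.params[i], 1. - self.tau))
    for i in range(len(self.target_critic.params))
]
self.update_actor = [
    self.target_actor.params[i].assign(
        tf.multiply(self.actor.params[i], self.tau) +
        tf.multiply(self.target_actor.params[i], 1. - self.tau))
    for i in range(len(self.target_actor.params))
]
```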
  • 01:39:32 Finally, we have constructed all the
  • 01:39:35 graphs for all four
  • 01:39:39 networks, so we have to initialize our
  • 01:39:41 variables: self.sess.run(
  • 01:39:45 tf.global_variables_initializer()) — you can't
  • 01:39:50 really run anything without initializing
  • 01:39:51 it. And as per the paper, at the very
  • 01:39:54 beginning we want to update the network
  • 01:39:57 parameters; at the beginning we want
  • 01:40:06 the target
  • 01:40:09 networks to get updated with the full
  • 01:40:13 values of the
  • 01:40:16 evaluation networks, and so I'm passing
  • 01:40:19 in a parameter of first=True.
  • 01:40:22 Since it's confusing, let's write that
  • 01:40:24 function first: update_network_parameters,
  • 01:40:26 where first will default to False.
  • 01:40:32 So if first, we need to say old_tau
  • 01:40:36 equals self.tau — I need to save the
  • 01:40:39 old value of tau so that
  • 01:40:41 I can reset it — then set self.tau to 1,
  • 01:40:44 and say self.target_critic.sess.
  • 01:40:48 run(self.update_critic) and self.target_actor.
  • 01:40:54 sess.run(self.update_actor).
  • 01:40:57 As I recall, it matters
  • 01:41:01 which session is used to run the update,
  • 01:41:06 although maybe not, since I'm using only
  • 01:41:09 one session; if you want to play around
  • 01:41:11 with it, go ahead.
  • 01:41:14 Then go ahead and reset tau to
  • 01:41:16 the old value, because you only want to
  • 01:41:19 do this particular update — where you
  • 01:41:21 update the target networks with the
  • 01:41:24 original networks' full values — on the
  • 01:41:26 first go-through;
  • 01:41:29 otherwise just go ahead and run the
  • 01:41:35 soft update ops. Boom.
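A sketch of that first-call behavior; because the soft-update ops above bake tau in at graph-construction time, one unambiguous way to get the "copy everything on the first call" effect is a separate list of hard-copy ops rather than temporarily swapping tau — so this is a variant of what is described in the video, not the exact code:

```python
# In __init__: explicit hard-copy ops for the very first update.
self.hard_copy_critic = [t.assign(e) for t, e in
                         zip(self.target_critic.params, self.critic.params)]
self.hard_copy_actor = [t.assign(e) for t, e in
                        zip(self.target_actor.params, self.actor.params)]

# As a method of the Agent class:
def update_network_parameters(self, first=False):
    if first:
        # targets start as exact copies of the online networks
        self.sess.run(self.hard_copy_critic)
        self.sess.run(self.hard_copy_actor)
    else:
        # Polyak averaging with rate tau, using the ops built above
        self.sess.run(self.update_critic)
        self.sess.run(self.update_actor)
```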
  • 01:41:47 So next we need a way of storing transitions: remember(self, state,
  • 01:41:51 action, reward, new_state, done), which just calls
  • 01:41:58 self.memory.store_transition
  • 01:42:03 with all this stuff. This is just an
  • 01:42:06 interface from one class to another; you
  • 01:42:09 know, this may or may not be great
  • 01:42:10 computer science practice, but it works.
  • 01:42:13 Next we want a way of choosing an action,
  • 01:42:15 and that should take a state as input.
  • 01:42:19 Since we have defined the input variable
  • 01:42:24 to be shaped batch size by input dims, you
  • 01:42:27 want to reshape the state to be 1 by
  • 01:42:33 the observation
  • 01:42:36 space: state[np.newaxis, :]. That's because,
  • 01:42:47 if we come up here to the actor
  • 01:42:50 network — just so we're clear — it is
  • 01:42:53 because this has shape None by input
  • 01:42:56 dims, so if you just pass in the
  • 01:42:58 observation vector, which has shape input
  • 01:43:01 dims, it's gonna
  • 01:43:02 get uppity with you,
  • 01:43:02 so you just have to reshape it, because
  • 01:43:04 you're only passing in a single
  • 01:43:05 observation to determine what action to
  • 01:43:07 take. So mu = self.actor.predict(state); noise
  • 01:43:13 = self.noise(); mu_prime
  • 01:43:18 = mu + noise; and we return mu_prime[0] —
  • 01:43:23 predict returns a batch, so you want the
  • 01:43:27 0th element. A sketch of the method is below.
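Putting those pieces together, a minimal sketch of choose_action (assuming numpy is imported as np and self.actor.predict wraps a session.run on the actor's output, as set up earlier):

```python
def choose_action(self, state):
    state = state[np.newaxis, :]     # add a batch dimension: shape (1, *input_dims)
    mu = self.actor.predict(state)   # deterministic action, shape (1, n_actions)
    noise = self.noise()             # temporally correlated exploration noise
    mu_prime = mu + noise
    return mu_prime[0]               # strip the batch dimension
```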
  • 01:43:30 So now we have the learn function, and of course this is where all
  • 01:43:31 the magic happens. If you have not
  • 01:43:37 filled up the memory then you want to go
  • 01:43:40 ahead and bail out otherwise you want to
  • 01:43:47 go ahead and sample your memory:
  • 01:43:56 self.memory.sample_buffer(self.batch_size). So
  • 01:44:05 next we need to do the update from the
  • 01:44:08 paper so let's go back to the paper and
  • 01:44:10 make sure we are clear on that so we
  • 01:44:14 need to we already sampled this so we
  • 01:44:18 need to calculate this and to do that
  • 01:44:20 we're gonna need the q prime the target
  • 01:44:23 critic network as output as well as the
  • 01:44:28 output from the target actor Network and
  • 01:44:30 then we use that to update the loss for
  • 01:44:33 the critic and then we're gonna need the
  • 01:44:35 output from the critic as well as from
  • 01:44:38 the actor network so we need to
  • 01:44:41 basically pass States and actions
  • 01:44:44 through all four networks to get the
  • 01:44:46 training function the learning function
  • 01:44:48 so let's go ahead and head back to the
  • 01:44:50 code editor and do that. So, our critic value
  • 01:44:53 for Q prime: critic_value_
  • 01:44:56 equals self.target_critic.
  • 01:44:59 predict, and you want to pass in the new
  • 01:45:02 states, and you will also want to pass in
  • 01:45:04 the actions that come from the target
  • 01:45:07 actor for the
  • 01:45:10 new states — I need one extra parenthesis
  • 01:45:16 there. So then we want to calculate the y
  • 01:45:18 sub i's for the targets: start with an
  • 01:45:20 empty list, and for j in range(self.batch_
  • 01:45:24 size), target.append(reward[j] + self.gamma * critic_
  • 01:45:33 value_[j] * done[j]), and
  • 01:45:38 that's where the point I kept harping on — getting
  • 01:45:41 no rewards after the terminal state —
  • 01:45:42 comes from:
  • 01:45:45 when done is true then it's 1 minus true
  • 01:45:49 which is 0 so you're multiplying this
  • 01:45:51 quantity by 0 so you don't take into
  • 01:45:53 account the value of the next state
  • 01:45:55 right which is calculated up here you
  • 01:45:57 only take into account the most recent
  • 01:45:59 reward as you would want so then we just
  • 01:46:02 want to reshape that target into
  • 01:46:08 something that is batch size by 1 that's
  • 01:46:11 to be consistent with our placeholders
  • 01:46:12 and now we want to call the critic train
  • 01:46:16 function right because we have
  • 01:46:20 everything we need we have the state's
  • 01:46:21 actions and targets with States and
  • 01:46:23 actions from the replay buffer and the
  • 01:46:25 target from this calculation here very
  • 01:46:28 easy now we need to do the actor update
  • 01:46:32 So: a_outs = self.actor.
  • 01:46:36 predict(state) — we get the predictions from the
  • 01:46:39 actor for these states — then grads = self.critic.
  • 01:46:43 get_action_gradients(state,
  • 01:46:48 a_outs). Remember, that'll
  • 01:46:51 do a feed forward and get the gradient
  • 01:46:54 of the critic with respect to the
  • 01:46:57 actions taken, and then you want to train
  • 01:47:01 the actor with the states and the gradients —
  • 01:47:06 they come back as a tuple, so you want to dereference
  • 01:47:08 it and get the 0th element — and then
  • 01:47:10 finally you want to update the network
  • 01:47:12 parameters. Whew, okay, so that is it for
  • 01:47:18 the learn function; a condensed sketch of it is below.
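A condensed sketch of learn() as described above; numpy as np is assumed, the predict/train/get_action_gradients helpers are assumed wrappers around session.run from the network classes built earlier, and mem_cntr is an assumed name for the memory counter:

```python
def learn(self):
    if self.memory.mem_cntr < self.batch_size:
        return                                    # wait until we can fill a batch
    state, action, reward, new_state, done = \
        self.memory.sample_buffer(self.batch_size)

    # y_j = r_j + gamma * Q'(s'_j, mu'(s'_j)), zeroed out for terminal states
    critic_value_ = self.target_critic.predict(
        new_state, self.target_actor.predict(new_state))
    target = np.array([reward[j] + self.gamma * critic_value_[j] * done[j]
                       for j in range(self.batch_size)])
    target = np.reshape(target, (self.batch_size, 1))
    self.critic.train(state, action, target)

    # actor update: gradient of the critic with respect to the actor's actions
    a_outs = self.actor.predict(state)
    grads = self.critic.get_action_gradients(state, a_outs)
    self.actor.train(state, grads[0])

    self.update_network_parameters()
```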
  • 01:47:21 Now we have two other bookkeeping functions to handle. The first is
  • 01:47:23 save_models, and this will save all of
  • 01:47:29 our models: self.actor.
  • 01:47:31 save_checkpoint(),
  • 01:47:33 self.target_actor.save_checkpoint(),
  • 01:47:38 and the same for the critic and target critic; and a function to load
  • 01:47:49 models (sounds like a dog barking out there) —
  • 01:47:55 there we want to load checkpoints instead
  • 01:47:58 of saving them,
  • 01:48:02 so it's load_checkpoint for each network. So that
  • 01:48:10 is it for the Agent class; that only took
  • 01:48:13 about an hour, so we're up to about two
  • 01:48:15 hours for the video, the longest one yet.
  • 01:48:17 so this is an enormous amount of work
  • 01:48:20 this is already three hundred ten lines
  • 01:48:22 of code if you've made it this far
  • 01:48:23 congratulations this is no mean feat
  • 01:48:25 this took me you know a couple weeks to
  • 01:48:27 hammer through but we've gotten through
  • 01:48:29 it in a couple hours so this is all
  • 01:48:32 there is to the implementation now we
  • 01:48:35 have to actually test it so let's open
  • 01:48:36 up a new file and save that as main_
  • 01:48:43 tensorflow.py, and what we want to do now
  • 01:48:48 is go ahead and test this out in the
  • 01:48:51 pendulum environment.
  • 01:48:56 From the tensorflow file we import
  • 01:49:01 our Agent;
  • 01:49:02 we need gym; do we need numpy for
  • 01:49:07 this? Yes we do.
  • 01:49:11 We don't need tensorflow; from
  • 01:49:14 utils import plot_learning; and we don't
  • 01:49:20 need os. So we will have a score_history —
  • 01:49:24 no, I want to do that inside a
  • 01:49:29 if __name__ == '__main__'. Now we want to say
  • 01:49:36 env = gym.make('Pendulum-v0'),
  • 01:49:46 agent = Agent, and it gets a learning
  • 01:49:54 rate alpha of 0.0001, a beta of
  • 01:50:00 0.001, input_dims=[3], tau=0.001;
  • 01:50:09 pass in our environment, batch_size=64 —
  • 01:50:14 being verbose here — and it looks like
  • 01:50:18 when I ran this I actually used different
  • 01:50:20 layer sizes; well, that's an issue for
  • 01:50:25 hyperparameter tuning, I just want to
  • 01:50:28 demonstrate that this works — and n_actions
  • 01:50:31 equals 1. So,
  • 01:50:34 when I got good results for this I
  • 01:50:36 actually used 800 by 600 and I cut these
  • 01:50:40 learning rates in half but we'll go
  • 01:50:42 ahead and use the values from the paper
  • 01:50:45 sometimes I do a bit of hyperparameter tuning. The
  • 01:50:48 other thing we need to do is set our
  • 01:50:50 random seed; you know, we can just put it
  • 01:50:55 here.
  • 01:50:55 The reason we want to do this
  • 01:50:57 is for replicability, and I have yet to
  • 01:51:01 see any implementation of this where they
  • 01:51:02 don't set the random seed. I've tried
  • 01:51:05 it without it and you don't get very
  • 01:51:06 good sampling of the replay buffer, so this
  • 01:51:09 seems to be a critical step and most
  • 01:51:10 implementations I've seen on github do
  • 01:51:12 the same thing. So let's go ahead and
  • 01:51:16 play our episodes, say a thousand games:
  • 01:51:20 obs = env.reset() — we forgot our
  • 01:51:23 score_history of course; that's to keep
  • 01:51:26 track of the scores over the course of
  • 01:51:27 our games — done is False, and the score
  • 01:51:35 for the episode is zero. Let's play the
  • 01:51:37 episode: act = agent.choose_action(obs),
  • 01:51:41 which takes obs as input; new_state, reward,
  • 01:51:46 done, info = env.step(act);
  • 01:51:53 agent.remember(obs, act, reward, new_
  • 01:51:57 state, int(done)). I guess the int(done)
  • 01:52:01 really isn't necessary — we take care of
  • 01:52:02 that in the ReplayBuffer
  • 01:52:04 class — but you know, it doesn't hurt to be
  • 01:52:08 explicit. And then we
  • 01:52:11 want to learn on each time step, because
  • 01:52:13 this is a temporal difference method
  • 01:52:14 keep track of your score and set your
  • 01:52:17 old state to be the new state so that
  • 01:52:19 way when you choose an action on the
  • 01:52:20 next step you are using the most recent
  • 01:52:22 information finally at the end of every
  • 01:52:25 episode you want to append that score to
  • 01:52:28 the score history and print the episode i,
  • 01:52:34 'score %.2f' % score, and the 100 game
  • 01:52:42 average, '%.2f' % np.
  • 01:52:48 mean(score_history[-100:]).
  • 01:52:55 One more thing: at the end, filename equals
  • 01:53:04 'pendulum.png' and we call plot_
  • 01:53:11 learning(score_history, filename) with a
  • 01:53:17 window of 100. The reason I choose a
  • 01:53:20 window of 100 is because many
  • 01:53:23 environments define solved as the trailing
  • 01:53:28 100-game average over some amount; the pendulum
  • 01:53:31 doesn't actually have a solved amount, so
  • 01:53:35 what we get is actually on par with some
  • 01:53:37 of the best results people have on the
  • 01:53:39 leaderboard, so it looks like it does
  • 01:53:40 pretty well. So that is it for the main
  • 01:53:44 function — a condensed sketch of it is below — let's go ahead and head to the
  • 01:53:46 terminal and see how many typos I made.
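A condensed sketch of the main file as described above; the Agent import and the plot_learning helper come from this project's other files, and the module name in the import is an assumption:

```python
import gym
import numpy as np
from utils import plot_learning
from ddpg_tensorflow import Agent   # module name is an assumption

if __name__ == '__main__':
    env = gym.make('Pendulum-v0')
    agent = Agent(alpha=0.0001, beta=0.001, input_dims=[3], tau=0.001,
                  env=env, batch_size=64, layer1_size=400, layer2_size=300,
                  n_actions=1)
    np.random.seed(0)                 # for replicability of the sampling
    score_history = []
    for i in range(1000):
        obs = env.reset()
        done = False
        score = 0
        while not done:
            act = agent.choose_action(obs)
            new_state, reward, done, info = env.step(act)
            agent.remember(obs, act, reward, new_state, int(done))
            agent.learn()             # temporal-difference method: learn every step
            score += reward
            obs = new_state
        score_history.append(score)
        print('episode', i, 'score %.2f' % score,
              '100 game average %.2f' % np.mean(score_history[-100:]))
    plot_learning(score_history, 'pendulum.png', window=100)
```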
  • 01:53:49 all right here we are let's go ahead and
  • 01:53:52 run the main file invalid syntax I have
  • 01:53:57 an extra print to see I'm just going to
  • 01:53:58 delete that really quick run that I have
  • 01:54:04 the star out of place let's go back to
  • 01:54:06 the code editor and handle that so it
  • 01:54:09 looks like it is on line 95 which is
  • 01:54:15 here
  • 01:54:19 all right back to the terminal let's try
  • 01:54:21 it again
  • 01:54:23 and it says line 198 invalid syntax oh
  • 01:54:29 it's because it's a comma instead of a
  • 01:54:31 colon
  • 01:54:32 all right I'll fix that once more all
  • 01:54:38 right so that was close so actor object
  • 01:54:42 has no attribute input dims line 95 no
  • 01:54:46 ok that's easy
  • 01:54:48 let's head back so it's in line 95 just
  • 01:54:53 means I forgot to save input dims and
  • 01:55:03 that probably means I forgot it and then
  • 01:55:05 critic as well since I did it cut and
  • 01:55:07 paste yes it does
  • 01:55:16 alright now we'll go back to the
  • 01:55:17 terminal okay that's the last one
  • 01:55:19 all right moment of truth
  • 01:55:24 critic takes one positional argument but
  • 01:55:26 seven were given good grief okay one
  • 01:55:29 second so that is of course in the agent
  • 01:55:33 function so I have line two thirty four
  • 01:55:40 one two three four five six seven
  • 01:55:44 parameters indeed. So Critic takes
  • 01:55:47 learning rate, number of actions, name,
  • 01:55:49 input dims, session — interesting — one two
  • 01:55:52 three four five six seven...
  • 01:55:56 what have I done? Oh, that's why — it's
  • 01:56:03 class Critic, of course. All right, let's
  • 01:56:06 go back to the terminal and see what
  • 01:56:06 happens named action bound is not
  • 01:56:09 defined and that is in the line 148 okay
  • 01:56:17 line 148 that is in the critic ah that's
  • 01:56:23 because I don't need it there let's
  • 01:56:24 delete it that was just for the actor
  • 01:56:26 it's because I cut and pasted always
  • 01:56:27 dangerous I tried to save a few
  • 01:56:29 keystrokes too
  • 01:56:30 my hands and ended up wasting time
  • 01:56:32 instead all right it's go back to the
  • 01:56:33 terminal
  • 01:56:34 All right, Actor has no attribute 'inputs';
  • 01:56:44 that is on line 126, self.inputs — it's
  • 01:56:51 probably self.input. Yes, and I
  • 01:56:56 do the same thing here, okay,
  • 01:57:05 perfect and it runs so I'll let that run
  • 01:57:09 for a second but I let it run for a
  • 01:57:12 thousand games earlier and this is the
  • 01:57:14 output I got now keep in mind it's a
  • 01:57:16 slightly different parameters the point
  • 01:57:18 here isn't that whether or not we can
  • 01:57:21 replicate the results because we don't
  • 01:57:23 even know what the results really were
  • 01:57:24 because they express it as a fraction of
  • 01:57:26 a planning a fraction of the performance
  • 01:57:29 of a planning agent so who knows what
  • 01:57:31 that really means I did a little bit of
  • 01:57:33 hybrid parameter tuning all I did was
  • 01:57:35 double the number of input units and
  • 01:57:37 have the learning rate and I ended up
  • 01:57:39 with something that looks like this. So
  • 01:57:41 you can see it gets to around 150 or so,
  • 01:57:43 after a hundred or so episodes,
  • 01:57:46 and if you check the
  • 01:57:48 leaderboards, that shows that that's
  • 01:57:50 actually a reasonable number; some of the
  • 01:57:52 best agents only get around 152,
  • 01:57:54 some of them do a little bit better —
  • 01:57:55 sorry, the best agents solve it in around 152
  • 01:57:58 episodes, or achieve a best score around
  • 01:58:00 that — but it's pretty reasonable.
  • 01:58:02 So the default implementation looks like
  • 01:58:05 it's very, very slow to learn; you can see
  • 01:58:08 how it kind of starts out bad, gets
  • 01:58:10 worse, and then starts to get a
  • 01:58:12 little bit better. That's pretty
  • 01:58:13 typical — you see this
  • 01:58:15 oscillation in performance over time
  • 01:58:16 pretty frequently. But that
  • 01:58:19 is how you go from a paper to an implementation in
  • 01:58:23 about two hours; of course it took me
  • 01:58:26 many times
  • 01:58:28 that to get this set up for you guys, but
  • 01:58:31 I hope this is helpful
  • 01:58:33 I'm going to milk this for all it's
  • 01:58:34 worth this has been a tough project so
  • 01:58:36 I'm gonna present many many more
  • 01:58:37 environments in the future. I may even do
  • 01:58:40 a video like this for PyTorch; I have
  • 01:58:42 yet to work on a Keras
  • 01:58:43 implementation for this, but there are many
  • 01:58:46 more
  • 01:58:46 DDP G videos to come so subscribe to
  • 01:58:49 make sure you don't miss that leave a
  • 01:58:51 comment share this if you found it
  • 01:58:52 helpful that helps me immensely I would
  • 01:58:54 really appreciate it and I'll see you
  • 01:58:56 all in the next video welcome back
  • 01:59:02 everybody in this tutorial you are gonna
  • 01:59:03 code a deterministic policy gradient
  • 01:59:06 agent to beat the continuous lunar
  • 01:59:07 lander environment in pi torch no prior
  • 01:59:10 experience needed you don't need to know
  • 01:59:11 anything about deep reinforcement
  • 01:59:12 learning you just have to follow along
  • 01:59:14 let's get started
  • 01:59:20 so we start as usual with our imports
  • 01:59:23 will need OS to handle file operations
  • 01:59:25 and all of this stuff from torch that
  • 01:59:27 we've come to expect as well as numpy
  • 01:59:29 I'm not going to do a full overview of
  • 01:59:31 the paper that will be in a future video
  • 01:59:33 where I will show you how to go from the
  • 01:59:35 paper to an actual implementation of
  • 01:59:37 deep deterministic policy gradients so
  • 01:59:39 make sure you subscribe so you don't
  • 01:59:40 miss that but in this video we're just
  • 01:59:42 going to get at the very high-level
  • 01:59:43 overview the 50,000 foot view if you
  • 01:59:47 will that will be sufficient to get an
  • 01:59:49 agent to beat the continuous lunar
  • 01:59:50 lander environment so that's good enough
  • 01:59:51 so the gist of this is we're gonna need
  • 01:59:54 several different classes so we'll need
  • 01:59:56 a class to encourage exploration in
  • 01:59:59 other words the type of noise and you
  • 02:00:01 might have guessed that from the word
  • 02:00:02 deterministic it means that the policy
  • 02:00:03 is deterministic as in it chooses some
  • 02:00:06 action with certainty and so if it is
  • 02:00:08 purely deterministic you can't really
  • 02:00:10 explore so we'll need a class to handle
  • 02:00:12 that we'll also need a class to handle
  • 02:00:14 the replay memory because deep
  • 02:00:16 deterministic policy gradients works by
  • 02:00:17 combining the magic of actor critic
  • 02:00:20 methods with the magic of deep Q
  • 02:00:21 learning which of course has a replay
  • 02:00:23 buffer and we'll also need classes for
  • 02:00:27 our critic and our actor as well as the
  • 02:00:30 agent so that's kind of a mouthful we'll
  • 02:00:33 handle them one bit at a time and we
  • 02:00:36 will start with the noise so this class
  • 02:00:39 is called OUActionNoise, and the
  • 02:00:42 OU stands for Ornstein-Uhlenbeck;
  • 02:00:45 that is a type of noise from physics
  • 02:00:47 that models the motion of a Brownian
  • 02:00:49 particle meaning this particle subject
  • 02:00:51 to a random walk based on interactions
  • 02:00:53 with other nearby particles it gives you
  • 02:00:55 a temporally correlated meaning
  • 02:00:57 correlated in time set type of noise
  • 02:01:00 that centers around a mean of zero so
  • 02:01:04 we're gonna have a number of parameters
  • 02:01:06 mu a Sigma a theta if I could type that
  • 02:01:11 would be fantastic as well as a DT as in
  • 02:01:15 the differential with respect to time
  • 02:01:16 and an initial value that will get an
  • 02:01:18 original value of none so if you want to
  • 02:01:21 know more about it then just go ahead
  • 02:01:22 and check out the Wikipedia article but
  • 02:01:24 the overview I gave you is sufficient
  • 02:01:26 for this tutorial so of course we want
  • 02:01:30 to save all of our values
  • 02:01:36 X zero and we want to call a reset
  • 02:01:39 function so the reset function will
  • 02:01:41 reset the temporal correlation which you
  • 02:01:43 may want to do that from time to time
  • 02:01:44 turns out it's not necessary for our
  • 02:01:46 particular implementation but it is a
  • 02:01:49 good function to have nonetheless so
  • 02:01:52 next we're gonna override the call
  • 02:01:53 function if you aren't familiar with
  • 02:01:55 this this allows you to say noise equals
  • 02:01:59 oh you action noise and then call noise
  • 02:02:03 so that allows you to instead of saying
  • 02:02:05 noise get noise just say noise with
  • 02:02:07 parenthesis or whatever the name of the
  • 02:02:09 object is so that's overriding the call
  • 02:02:12 function. So we'll have an equation for
  • 02:02:14 that: x = self.x_prev + self.theta * the
  • 02:02:19 quantity (self.mu - self.x_prev)
  • 02:02:23 * self.dt, plus self.sigma *
  • 02:02:31 np.sqrt(self.dt) * np.
  • 02:02:35 random.normal(size=self.mu.shape). So
  • 02:02:41 it's a type of random normal noise that
  • 02:02:44 is correlated in time through this mu
  • 02:02:46 minus x_prev term, and every time you
  • 02:02:51 calculate a new value you want to set
  • 02:02:53 the previous value to the
  • 02:02:55 new one and go ahead and return the
  • 02:02:57 value. The reset function, all it
  • 02:03:02 does is check whether x0 exists; if
  • 02:03:05 it doesn't, it sets the previous value to zero:
  • 02:03:08 self.x_prev = self.x0
  • 02:03:12 if self.x0 is not None else np.zeros_
  • 02:03:17 like(self.mu).
  • 02:03:20 That's it for the action noise; a full sketch of the class is below.
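Putting the constructor, __call__, and reset together, a sketch of the whole class; the default sigma/theta/dt values here are common choices for this kind of noise, not necessarily the exact ones typed in the video:

```python
import numpy as np

class OUActionNoise(object):
    def __init__(self, mu, sigma=0.15, theta=0.2, dt=1e-2, x0=None):
        self.mu = mu
        self.sigma = sigma
        self.theta = theta
        self.dt = dt
        self.x0 = x0
        self.reset()

    def __call__(self):
        # temporally correlated noise centered on mu (Ornstein-Uhlenbeck process)
        x = self.x_prev + self.theta * (self.mu - self.x_prev) * self.dt + \
            self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape)
        self.x_prev = x
        return x

    def reset(self):
        self.x_prev = self.x0 if self.x0 is not None else np.zeros_like(self.mu)
```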
  • 02:03:23 this will be used in our actor class to
  • 02:03:26 add in some exploration noise to the
  • 02:03:28 action selection next we need the replay
  • 02:03:31 buffer class and this is pretty
  • 02:03:34 straightforward it's just going to be
  • 02:03:35 set of numpy arrays in the shape of the
  • 02:03:38 action space number the observation
  • 02:03:42 space and rewards so that way we can
  • 02:03:44 have a memory of events that have
  • 02:03:46 happened
  • 02:03:46 so we can sample them during the
  • 02:03:47 learning step if you haven't seen my
  • 02:03:50 videos on deep q-learning
  • 02:03:51 please check those out they will make
  • 02:03:53 all this much more clear as well as
  • 02:03:55 checking out the videos on actor critic
  • 02:03:56 methods because this again D DPG kind of
  • 02:03:59 combines actor critic with deep Q
  • 02:04:01 learning, so that will really be helpful
  • 02:04:02 for you I'll go ahead and link those
  • 02:04:05 here as well so that way you can get
  • 02:04:09 educated. So, scroll down a bit: we want to
  • 02:04:13 save our mem_size as max_size, and the
  • 02:04:18 mem_cntr will start out at zero.
  • 02:04:20 Again, this is just going to be a set of
  • 02:04:22 arrays — or matrices in
  • 02:04:24 this case — that keep track of the state,
  • 02:04:27 reward, and action transitions, and that will
  • 02:04:31 be in shape mem_size — so however many
  • 02:04:32 memories we want to store — and
  • 02:04:34 input shape so if you are relatively new
  • 02:04:38 to Python this star variable idiom it
  • 02:04:42 isn't a pointer if you're coming from C
  • 02:04:43 or C++ it is an idiom that means to
  • 02:04:47 unpack a tuple so this makes our class
  • 02:04:49 extensible so we can pass in a list of a
  • 02:04:51 single element as in the case of the
  • 02:04:53 lunar lander and continuous winter and
  • 02:04:55 land environment that we'll use today or
  • 02:04:57 later on when we get to the continuous
  • 02:04:59 car racing environment we'll have images
  • 02:05:01 from the screen so this will accommodate
  • 02:05:03 both types of observation vectors it's a
  • 02:05:05 way of making stuff extensible and the
  • 02:05:09 new State memory is of course the same
  • 02:05:11 shape, so we just copy it. It looks like I
  • 02:05:16 am missing a parenthesis somewhere; I
  • 02:05:20 guess we'll find it when I go ahead and
  • 02:05:23 run the program — oh, it's not 'self
  • 02:05:26 dot' there, there we go,
  • 02:05:29 perfect. So we'll also need an
  • 02:05:33 action memory, and that of course will
  • 02:05:38 also be an array of zeros in the shape
  • 02:05:40 of mem_size by number of actions; I believe
  • 02:05:44 that means I need an extra parenthesis
  • 02:05:47 in there, yes. And we'll also have a
  • 02:05:50 reward memory,
  • 02:05:52 and that will just be in shape mem_size.
  • 02:05:55 we also need a terminal memory so in
  • 02:06:00 reinforcement learning we have the
  • 02:06:03 concept of the terminal state so when
  • 02:06:05 the episode is over the agent enters the
  • 02:06:07 terminal state from which it receives no
  • 02:06:09 future rewards so the value of that
  • 02:06:11 terminal State is identically zero and
  • 02:06:14 so the way we're going to keep track of
  • 02:06:16 when we transition into terminal States
  • 02:06:18 is by saving the done flags from the
  • 02:06:20 OpenAI gym environment, and that'll be
  • 02:06:24 np.zeros in shape mem_size; I've
  • 02:06:31 made this float32, probably
  • 02:06:33 because torch is a little bit particular
  • 02:06:35 with data types so we have to be
  • 02:06:38 cognizant of that we need a function to
  • 02:06:42 store transitions which is just a state
  • 02:06:45 action reward new state and done flag so
  • 02:06:51 the index is going to be the first
  • 02:06:54 available position so mem counter just
  • 02:06:58 keeps track of the last memory you
  • 02:07:00 stored it's just an integer quantity
  • 02:07:01 from 0 up to mem sighs and so when mem
  • 02:07:05 counter becomes greater than men's size
  • 02:07:07 it just wraps around from zero so when
  • 02:07:08 they're equal to zero and when it's
  • 02:07:10 equal to mmm sighs plus one it becomes
  • 02:07:12 one and so on and so forth so state
  • 02:07:15 memory sub index equals state action
  • 02:07:20 memory index
  • 02:07:26 reward equals state underscore and the
  • 02:07:36 terminal memory good grief
  • 02:07:39 terminal memory index doesn't equal done
  • 02:07:42 but it equals one – done
  • 02:07:44 so the reason is that when we get to the
  • 02:07:47 update equation the bellman equation for
  • 02:07:49 our learning function you'll see we want
  • 02:07:51 to multiply by whether or not the
  • 02:07:53 episode is over and that gets is
  • 02:07:55 facilitated by 1 minus the quantity done
  • 02:07:59 Then just increment the counter. Next we need to
  • 02:08:04 sample that buffer so sample buffer and
  • 02:08:07 that will take in a batch size as input
  • 02:08:11 So the max memory is going to be the
  • 02:08:15 minimum of either mem_cntr or mem_
  • 02:08:19 size — not the max, but the minimum. Then
  • 02:08:24 batch is just going to be a random
  • 02:08:25 choice with a maximum index of max_mem and a
  • 02:08:31 number of elements equal to
  • 02:08:34 batch_size. Scroll down a bit, and then we
  • 02:08:38 want to get hold of the respective
  • 02:08:41 states, actions, rewards, new states,
  • 02:08:42 and terminal flags, and pass them back to
  • 02:08:44 the learning function: states = self.state_memory[batch], new
  • 02:08:50 states,
  • 02:08:57 rewards — you know, it's not easy to type
  • 02:09:05 and talk at the same time, apparently —
  • 02:09:08 and let's get actions, self.action_memory[
  • 02:09:13 batch], and terminal.
  • 02:09:23 Good grief. So we want to return the states,
  • 02:09:26 actions,
  • 02:09:27 rewards, new states, and the terminal
  • 02:09:31 flags. Perfect, so that is it for our
  • 02:09:33 replay memory (a condensed sketch is below); you're going to see this
  • 02:09:36 a lot in the other videos on deep
  • 02:09:38 deterministic policy gradients, because
  • 02:09:41 we're going to need it for basically
  • 02:09:43 anything that uses a memory.
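A condensed sketch of the replay buffer just described:

```python
import numpy as np

class ReplayBuffer(object):
    def __init__(self, max_size, input_shape, n_actions):
        self.mem_size = max_size
        self.mem_cntr = 0
        self.state_memory = np.zeros((self.mem_size, *input_shape))
        self.new_state_memory = np.zeros((self.mem_size, *input_shape))
        self.action_memory = np.zeros((self.mem_size, n_actions))
        self.reward_memory = np.zeros(self.mem_size)
        self.terminal_memory = np.zeros(self.mem_size, dtype=np.float32)

    def store_transition(self, state, action, reward, state_, done):
        index = self.mem_cntr % self.mem_size    # wrap around when full
        self.state_memory[index] = state
        self.action_memory[index] = action
        self.reward_memory[index] = reward
        self.new_state_memory[index] = state_
        self.terminal_memory[index] = 1 - done   # 0 for terminal states
        self.mem_cntr += 1

    def sample_buffer(self, batch_size):
        max_mem = min(self.mem_cntr, self.mem_size)
        batch = np.random.choice(max_mem, batch_size)
        states = self.state_memory[batch]
        actions = self.action_memory[batch]
        rewards = self.reward_memory[batch]
        states_ = self.new_state_memory[batch]
        terminal = self.terminal_memory[batch]
        return states, actions, rewards, states_, terminal
```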
  • 02:09:46 Next, let's go ahead and get to the meat of the
  • 02:09:48 problem with our critic network, and as
  • 02:09:51 is often the case when you're dealing
  • 02:09:53 with PI torch you want to derive your
  • 02:09:55 neural network classes from n n dot
  • 02:09:57 module that gives you access to
  • 02:09:59 important stuff like the Train and eval
  • 02:10:02 function which will set us in train or
  • 02:10:04 evaluation mode very important later I
  • 02:10:06 couldn't get it to work until I figure
  • 02:10:08 that out so a little tidbit for you you
  • 02:10:10 also need access to the parameters for
  • 02:10:12 updating the weights of the neural
  • 02:10:14 network so let's define our initialize
  • 02:10:18 function the beta is our learning rate
  • 02:10:22 we'll need input dims number of
  • 02:10:25 dimensions for the first fully connected
  • 02:10:26 layer as well as second connected layer
  • 02:10:28 number of actions a name the name is
  • 02:10:32 important for saving the network you'll
  • 02:10:35 see that we have many different networks
  • 02:10:37 so we'll want to keep track of which one
  • 02:10:40 is which very important as well as I
  • 02:10:45 checkpoint directory for saving the
  • 02:10:46 model they're also very important as
  • 02:10:47 this model runs very very slowly so
  • 02:10:49 you'll want to us save it periodically
  • 02:10:54 and you want to call the super
  • 02:10:56 constructor for critic Network and that
  • 02:11:00 will call the constructor for nn.
  • 02:11:01 Module, I believe. input_dims = input_
  • 02:11:04 dims, fc1_dims, and so on — these would just
  • 02:11:11 be the parameters for our deep neural
  • 02:11:13 network that approximates the value
  • 02:11:15 function the number of actions a
  • 02:11:19 checkpoint file and that is OS path join
  • 02:11:24 checkpoint der with name plus underscore
  • 02:11:29 DD PG and if you check my github repo I
  • 02:11:32 will upload the train model because this
  • 02:11:35 model takes a long time to train so I
  • 02:11:38 want you to be able to take advantage of
  • 02:11:40 the fully trained model that I've spent
  • 02:11:41 the resources and time training up, so you
  • 02:11:43 may as well benefit from that so next up
  • 02:11:46 we need to define the first layer of our
  • 02:11:47 neural network just a linear layer and
  • 02:11:50 that'll take input dims and output F c1
  • 02:11:55 dims we're also going to need a number
  • 02:11:58 for initializing the weights and biases
  • 02:12:01 of that layer of the neural network
  • 02:12:04 we're going to call that f1, and it's
  • 02:12:06 1 over the square root of the
  • 02:12:09 number of dimensions of the layer:
  • 02:12:13 self.fc1.weight.data.size() —
  • 02:12:20 that returns a tuple, so I need the
  • 02:12:23 zeroth element of that. And then we want
  • 02:12:26 to initialize that layer by using
  • 02:12:30 T.nn.init.uniform_:
  • 02:12:33 the tensor you want to initialize, which
  • 02:12:36 is fc1.weight.data, from minus f1 to
  • 02:12:41 positive f1. This will be a small
  • 02:12:44 number, of order 0.1 or so — not
  • 02:12:47 exactly 0.1, but of order 0.1 — and you
  • 02:12:51 also want to initialize the biases, the
  • 02:12:57 bias data, and that gets the same range.
  • 02:12:59 And again, in the future video when we go
  • 02:13:02 over the derivation of the paper I'll
  • 02:13:04 explain all of this, but for now just
  • 02:13:06 know that this is to
  • 02:13:07 constrain the initial weights of the
  • 02:13:10 network to a very narrow region of
  • 02:13:11 parameter space, to help you get better
  • 02:13:13 convergence (a small sketch of this initialization is below). Perfect. Oh, and make sure
  • 02:13:18 to subscribe so you don't miss that
  • 02:13:19 video, because it's gonna be lit.
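A small sketch of that initialization for the first layer, assuming numpy as np, torch as T, and torch.nn as nn are imported as in the rest of this file:

```python
self.fc1 = nn.Linear(*self.input_dims, self.fc1_dims)
# narrow uniform initialization, +/- 1/sqrt(fc1.weight.data.size()[0]),
# as described above
f1 = 1. / np.sqrt(self.fc1.weight.data.size()[0])
T.nn.init.uniform_(self.fc1.weight.data, -f1, f1)
T.nn.init.uniform_(self.fc1.bias.data, -f1, f1)
```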
  • 02:13:23 So bn1 is a LayerNorm and takes fc1_dims
  • 02:13:27 as input. The batch normalization helps
  • 02:13:30 with convergence of your model — you don't
  • 02:13:34 get good convergence if you don't have
  • 02:13:35 it, so leave it in. fc2 is a second
  • 02:13:38 layer another linear it takes FC 1 dims
  • 02:13:41 as input and outputs FC 2 dims good
  • 02:13:48 grief
  • 02:13:49 and we want to do the same thing with the
  • 02:13:51 initialization, so f2 is 1 over the square
  • 02:13:53 root of self.fc2.weight.data.size()[0],
  • 02:13:59 and you want to do T.nn.init.
  • 02:14:03 uniform_(self.fc2.weight.data, minus
  • 02:14:09 f2, up to f2) — make sure that's right so
  • 02:14:13 we don't screw it up, because that's
  • 02:14:14 important; that looks correct. The syntax
  • 02:14:18 here is: the first parameter is the tensor
  • 02:14:20 you want to initialize, and then the
  • 02:14:21 lower and upper boundaries so next we
  • 02:14:25 need bn2, which is our second batch-
  • 02:14:27 norm-style layer, with fc2_dims. Just a note:
  • 02:14:36 the fact that we have a normalization
  • 02:14:38 layer — a batch-norm-type layer — means that
  • 02:14:40 we have to use the eval and train
  • 02:14:42 functions later; kind of a nuisance, it
  • 02:14:45 took me a while to figure that out.
  • 02:14:48 The critic network is also going to get
  • 02:14:51 an action_value layer, because the action-value
  • 02:14:55 function takes the states and actions
  • 02:14:57 as input, but we're going to add it in near
  • 02:15:00 the very end of the network: nn.Linear(
  • 02:15:05 n_actions, fc2_dims). And this gets a
  • 02:15:10 constant initialization of 0.003 —
  • 02:15:12 sorry, the next one, the
  • 02:15:19 output layer q, gets an initialization of
  • 02:15:23 0.003 — and since this is a scalar
  • 02:15:26 value it just has one output, and you
  • 02:15:29 want to initialize it again uniformly:
  • 02:15:32 the q weight data gets
  • 02:15:36 minus f3 up to f3, and likewise for the
  • 02:15:42 bias data. Okay, so now we have our
  • 02:15:50 optimizer and that will be the atom
  • 02:15:53 optimizer and what are we going to
  • 02:15:55 optimize the parameters and the learning
  • 02:15:58 rate will be beta so you notice that we
  • 02:16:00 did not define parameters right here
  • 02:16:04 right we're just calling it and this
  • 02:16:05 comes from the inheritance from n n dot
  • 02:16:08 module and that's why we do that so we
  • 02:16:09 get access to the network parameters
  • 02:16:12 next you certainly want to run this on a
  • 02:16:15 GPU because it is an incredibly
  • 02:16:17 expensive algorithm so you want to call
  • 02:16:20 the device: T.device('cuda:0' if T.
  • 02:16:25 cuda.is_available() else 'cuda:1') — I have
  • 02:16:32 two GPUs; if you only have a single GPU
  • 02:16:36 it will be else 'cpu', but I don't
  • 02:16:39 recommend running this on a CPU. So next
  • 02:16:42 you want to send the whole network to
  • 02:16:43 your device by self dot to self dot
  • 02:16:46 device we are almost there for the
  • 02:16:49 critic class next thing we have to worry
  • 02:16:51 about is the forward function and that
  • 02:16:54 takes a state and an action
  • 02:16:55 as input keep in mind the actions are
  • 02:16:58 continuous so it's a vector in this case
  • 02:17:00 length two for the continuous lunar
  • 02:17:02 lander environment it's two real numbers
  • 02:17:04 in a list or numpy array format. So state_
  • 02:17:08 value = self.fc1(state), then you want
  • 02:17:13 to pass it through self.bn1(state_value),
  • 02:17:18 and finally you want to activate it:
  • 02:17:23 F.relu(state_value). Now, it is an open
  • 02:17:26 debate whether you want to do the
  • 02:17:27 relu before or after the batch
  • 02:17:30 normalization. In my mind it makes more
  • 02:17:33 sense to do the batch normalization
  • 02:17:35 first, because when you are calculating
  • 02:17:38 batch statistics, if you apply the relu
  • 02:17:41 first then you're lopping off everything
  • 02:17:43 below zero, so that means your
  • 02:17:46 statistics are going to be skewed toward the
  • 02:17:47 positive end, when perhaps the real
  • 02:17:50 distribution has a mean of zero instead
  • 02:17:52 of a positive mean — or maybe it even has
  • 02:17:53 a negative mean, which you wouldn't see
  • 02:17:55 if you applied the relu before
  • 02:18:00 the normalization. So just
  • 02:18:02 something to keep in mind you can play
  • 02:18:04 around with it feel free to clone this
  • 02:18:05 and see what you get but that's how I've
  • 02:18:07 done it
  • 02:18:08 I did try both ways and this seemed to
  • 02:18:11 work the best. So next we want to feed it
  • 02:18:15 into the second fully connected layer, then
  • 02:18:21 self.bn2(state_value) — sorry, one
  • 02:18:27 second, let the cat out — so we've already
  • 02:18:30 done the batch normalization; we don't
  • 02:18:33 want to activate it yet. What we want to
  • 02:18:35 do first is take account of the action
  • 02:18:39 value: what we're going to do is
  • 02:18:45 pass the action through the action_
  • 02:18:47 value layer and perform a relu
  • 02:18:49 activation on it right away —
  • 02:18:50 we're not going to calculate batch
  • 02:18:51 statistics on this, so we don't need to
  • 02:18:53 worry about that — and then what we want to do
  • 02:18:56 is add the two values together: state_
  • 02:18:59 action_value = F.relu(T.add(
  • 02:19:02 state_value,
  • 02:19:04 action_value)). The other thing that's a little
  • 02:19:06 bit wonky here — and I invite you to clone
  • 02:19:08 this and play with it yourself — is that I am
  • 02:19:11 double-relu-ing the action value
  • 02:19:13 quantity: I
  • 02:19:15 do a relu here and then I do a relu
  • 02:19:18 on the add. Now, this is a little bit
  • 02:19:21 sketchy; I've played around with it and
  • 02:19:23 this is the way that works for me, so
  • 02:19:26 feel free to clone it and get it to work a
  • 02:19:28 different way. The other possibility is
  • 02:19:29 that you don't do this relu here, but
  • 02:19:32 you do the relu after the add. The relu
  • 02:19:35 doesn't commute with the add:
  • 02:19:37 what that means is that if you do the
  • 02:19:39 addition first and then a relu, that's
  • 02:19:42 different than doing the sum of the two
  • 02:19:43 relu'd values. So if you take relu of
  • 02:19:45 minus ten plus relu of five, you get a
  • 02:19:49 relu of minus ten — which is zero — plus relu of
  • 02:19:52 five, which is five, so you get five; if you take
  • 02:19:54 the relu of minus ten plus five, then
  • 02:19:56 you get a relu of minus five, which is zero. So
  • 02:19:59 the order does
  • 02:20:01 matter, but this is the
  • 02:20:03 way I found it to work. I've seen other
  • 02:20:04 implementations that do it differently;
  • 02:20:06 feel free to clone this and do your own
  • 02:20:08 thing with it — I welcome any additions,
  • 02:20:11 improvements or comments so then we want
  • 02:20:15 to get the actual state action value by
  • 02:20:18 passing that's additive quantity through
  • 02:20:22 our final layer of the network and go
  • 02:20:26 ahead and return that. A little bit of
  • 02:20:29 bookkeeping: we have a save_check
  • 02:20:32 point function that just prints 'saving checkpoint',
  • 02:20:38 and then you want to call T.save(
  • 02:20:41 self.state_dict()). What this does is it
  • 02:20:44 creates a state dictionary where the
  • 02:20:47 keys are the names of the parameters and
  • 02:20:48 the values are the parameters themselves
  • 02:20:50 And where do you want to save that? You
  • 02:20:53 want to save it in the checkpoint file.
  • 02:20:55 Then you also have the load_checkpoint —
  • 02:20:59 good grief — checkpoint function, and that
  • 02:21:03 does the same thing, just in reverse:
  • 02:21:06 print 'loading checkpoint', and you want self.
  • 02:21:12 load_state_dict(T.load(self.
  • 02:21:16 checkpoint_file)). So that is it for our
  • 02:21:20 critic network; a condensed sketch of its forward pass is below.
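Putting that together, a condensed sketch of the critic's forward pass, assuming torch as T and torch.nn.functional as F are imported, with fc1/fc2/bn1/bn2/action_value/q defined in the constructor as above:

```python
def forward(self, state, action):
    state_value = self.fc1(state)
    state_value = self.bn1(state_value)
    state_value = F.relu(state_value)       # activate after the norm layer
    state_value = self.fc2(state_value)
    state_value = self.bn2(state_value)     # no activation yet

    action_value = F.relu(self.action_value(action))
    state_action_value = F.relu(T.add(state_value, action_value))
    state_action_value = self.q(state_action_value)
    return state_action_value
```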
  • 02:21:22 Now we move on to the actor network, which of course derives
  • 02:21:27 from nn.Module. We have an init
  • 02:21:31 function that takes alpha — if we can get it spelled
  • 02:21:35 correctly — input_dims, fc1_dims, fc2_
  • 02:21:39 dims this is pretty similar to the
  • 02:21:43 critic network it will just have a
  • 02:21:46 different structure in particular we
  • 02:21:48 don't have the actions as an input.
  • 02:21:56 I can't type and talk at the same time,
  • 02:22:00 but it's pretty similar nonetheless. So
  • 02:22:03 we want to save input_dims,
  • 02:22:08 n_actions, fc1_dims, fc2_dims — same
  • 02:22:17 deal let me go copy the checkpoint file
  • 02:22:22 function just to make life easy perfect
  • 02:22:28 I like to make life easy so we have our
  • 02:22:30 first fully connected layer and in doubt
  • 02:22:33 linear take self dot input dims as input
  • 02:22:37 and FC one dims and of course it
  • 02:22:39 operates in the same way as I discussed
  • 02:22:42 in the replay buffer where it will just
  • 02:22:44 unpack the tuple next we have the
  • 02:22:48 initialization, f1, which is very similar: 1
  • 02:22:50 over np.sqrt of self.fc1.
  • 02:22:54 weight.data.size()'s zeroth element, and
  • 02:22:57 we want to initialize the first layer
  • 02:23:01 uniformly within that interval: init.
  • 02:23:05 uniform_ on the fc1 weight
  • 02:23:09 data, minus f1 and f1; copy that, and that
  • 02:23:15 will be the fc1 bias data. fc2 is another
  • 02:23:23 linear layer takes FC 1 dims as inputs
  • 02:23:26 and outputs FC 2 dims as you might
  • 02:23:29 expect and the initialization for that
  • 02:23:33 will be basically the same thing except
  • 02:23:36 for layer two: fc2.weight.data. And
  • 02:23:41 then, you know what, let's just copy this,
  • 02:23:45 paste, and make sure we don't mess it up:
  • 02:23:48 fc2, f2 — whew, good grief — and plus/minus
  • 02:23:56 f2, and that is all well and good. The other
  • 02:24:02 thing we forgot is the norm layer, and
  • 02:24:05 that is an nn.LayerNorm that takes
  • 02:24:09 fc1_dims as input;
  • 02:24:12 likewise for layer two, that is another
  • 02:24:17 LayerNorm that takes fc2_dims as its input
  • 02:24:20 shape, and that doesn't get initialized.
  • 02:24:23 But we do have f3, and that is
  • 02:24:27 0.003; this comes from the paper —
  • 02:24:29 we'll go over this in a future video but
  • 02:24:32 don't worry about it self thought mu mu
  • 02:24:34 is the representation of the policy in
  • 02:24:37 this case it is a real vector of shape
  • 02:24:41 and actions it's the actual action not
  • 02:24:44 the probability right because this is
  • 02:24:45 deterministic so it's just a linear
  • 02:24:47 layer takes FC two dims as input and
  • 02:24:50 outputs the number of actions and we
  • 02:24:54 want to do the same thing where we
  • 02:24:56 initialize the weights; let's copy-paste,
  • 02:24:59 and instead of fc2 it will be mu, and
  • 02:25:05 instead of f2 it is f3, as you might expect.
  • 02:25:12 am I forgetting anything I don't believe
  • 02:25:14 so
  • 02:25:15 so finally we have an optimizer and
  • 02:25:17 that's again optimum self dot parameters
  • 02:25:22 and learning rate equals alpha we also
  • 02:25:26 want to do the same thing with the
  • 02:25:27 device t dot device CUDA 0 if T dot CUDA
  • 02:25:35 is available else CPU and finally send
  • 02:25:42 it to the device that is that so next we
  • 02:25:49 have the feed forward so that takes the
  • 02:25:53 state as input, so I'm just gonna call it
  • 02:25:57 x — this is bad naming, don't ever do this;
  • 02:25:59 do as I say, not as I do. x = self.fc1(state),
  • 02:26:04 then self.bn1 of state —
  • 02:26:09 of x; state is a mistake, it should be x —
  • 02:26:16 then x = self.fc2(x), x = self.bn2(x), and
  • 02:26:24 then x = T.tanh(self.
  • 02:26:29 mu(x)), and then return x. So what this
  • 02:26:33 will do is pass the current state or
  • 02:26:35 whatever state or set of states batch in
  • 02:26:37 this case you want to look at perform
  • 02:26:40 the first feed-forward pass, batch
  • 02:26:43 norm it, and relu it,
  • 02:26:44 send it through the second layer and
  • 02:26:46 batch norm but not activate send it
  • 02:26:48 through to the final layer — now, I take
  • 02:26:51 that back, I do want to activate that,
  • 02:26:53 silly me: F.relu(x) — and
  • 02:26:56 then it will pass it through the final
  • 02:26:58 layer, mu, and perform a tangent
  • 02:26:59 hyperbolic activation so what that will
  • 02:27:02 do is it'll bound it between minus 1 and
  • 02:27:04 plus 1 and that's important for many
  • 02:27:07 environments later on we can multiply it
  • 02:27:09 by the actual action bounds so some
  • 02:27:11 environments have a max action of plus
  • 02:27:13 minus 2 so if you're bounding it by plus
  • 02:27:15 or minus 1 that's not going to be very
  • 02:27:16 effective so you just have a
  • 02:27:18 multiplicative factor later on. And then
  • 02:27:21 I'm gonna go copy the save and load
  • 02:27:24 checkpoint functions, because those are
  • 02:27:26 precisely the same. That's it for our actor; a condensed sketch of its forward pass is below.
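A condensed sketch of the actor's forward pass as just described; the tanh bounds each action component to (-1, 1), to be rescaled by the environment's action bound later:

```python
def forward(self, state):
    x = self.fc1(state)
    x = self.bn1(x)
    x = F.relu(x)
    x = self.fc2(x)
    x = self.bn2(x)
    x = F.relu(x)
    x = T.tanh(self.mu(x))   # deterministic action, each component in (-1, 1)
    return x
```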
  • 02:27:29 Next we come to our final class,
  • 02:27:32 the meat of the problem: the agent. That
  • 02:27:38 just gets derived from the base
  • 02:27:40 object, and it gets a whole bunch of
  • 02:27:44 parameters alpha and beta of course you
  • 02:27:46 need to pass in learning rates for the
  • 02:27:47 actor and critic networks input dims a
  • 02:27:51 quantity called tau I haven't introduced
  • 02:27:53 that yet but we'll get to it in a few
  • 02:27:54 minutes we're gonna pass in the
  • 02:27:56 environment that's to get the action
  • 02:27:57 space that I talked about just a second
  • 02:27:59 ago; the gamma, which is the agent's
  • 02:28:02 discount factor — if you're not
  • 02:28:03 familiar with reinforcement learning, an
  • 02:28:04 agent values a reward now more than it
  • 02:28:07 values a reward in the future, because
  • 02:28:09 there's uncertainty around future
  • 02:28:10 rewards, so it makes no sense to value it
  • 02:28:12 as much as a current reward. So what's
  • 02:28:14 the discount factor? You know, how much
  • 02:28:16 less does it value a future reward — 1%,
  • 02:28:18 and that's where we get a gamma of 0.99;
  • 02:28:21 it's a hyperparameter you can play
  • 02:28:22 around with, and values from 0.95 all the
  • 02:28:25 way up to 0.99 are typical. The number of
  • 02:28:30 actions will default to 2 — a lot of
  • 02:28:33 environments only have two actions; the
  • 02:28:36 max size of our memory, and that gets 1
  • 02:28:39 million — a one followed by two
  • 02:28:41 groups of three zeros; layer1_size defaults to
  • 02:28:47 400, layer2_size to 300 — again,
  • 02:28:50 that comes from the
  • 02:28:53 paper — and a batch size for our batch
  • 02:28:57 learning from our replay memory. So you
  • 02:29:02 want to go ahead and save the parameters:
  • 02:29:04 self.gamma = gamma, self.tau = tau, and you want to instantiate
  • 02:29:07 a memory — that's a ReplayBuffer of size
  • 02:29:11 max_size, with input_dims and n_actions. We
  • 02:29:19 also want to store the batch size for
  • 02:29:24 our learning function. We want to
  • 02:29:26 instantiate our first actor — yes, there
  • 02:29:29 is more than one — and that gets alpha,
  • 02:29:32 input_dims, layer1_size, layer2_size,
  • 02:29:38 n_actions=n_actions, and name
  • 02:29:42 = 'Actor'. So let's copy that. So next
  • 02:29:51 we have our target actor so much like
  • 02:29:54 the deep queue Network algorithm this
  • 02:29:56 uses target networks as well as the base
  • 02:30:00 network so it's an off policy method and
  • 02:30:03 the difference here is this going to be
  • 02:30:05 called target actor
  • 02:30:06 it'll be otherwise identical this will
  • 02:30:10 allow us to have multiple different
  • 02:30:12 agents with similar names and you'll see
  • 02:30:16 how that plays into it momentarily we
  • 02:30:19 also need a critic — that's a CriticNetwork,
  • 02:30:21 which takes beta, input_dims,
  • 02:30:26 layer1_size, layer2_size, n_
  • 02:30:31 actions=n_actions, and name=
  • 02:30:35 'Critic' — so let's be nice and tidy there,
  • 02:30:41 and we also have a target critic as well
  • 02:30:46 and that is otherwise identical it just
  • 02:30:50 gets a different name and this is very
  • 02:30:52 similar to q-learning where you have Q
  • 02:30:54 eval and Q next or Q target whatever you
  • 02:30:57 want to call it same concept
  • 02:31:00 Okay, so those are all of our networks.
  • 02:31:02 What else do we need? We need noise, and
  • 02:31:05 that's our OUActionNoise, where
  • 02:31:08 mu is just going to be numpy zeros of
  • 02:31:11 shape n_actions — so it'll give you an
  • 02:31:14 array of zeros; this is the mean of the
  • 02:31:17 noise process. A condensed sketch of the constructor so far is below.
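A condensed sketch of the Agent constructor up to this point; the target-network names like 'TargetActor' are assumptions for illustration (they just need to be distinct so the checkpoint files don't collide):

```python
class Agent(object):
    def __init__(self, alpha, beta, input_dims, tau, env, gamma=0.99,
                 n_actions=2, max_size=1000000, layer1_size=400,
                 layer2_size=300, batch_size=64):
        self.gamma = gamma
        self.tau = tau
        self.batch_size = batch_size
        self.memory = ReplayBuffer(max_size, input_dims, n_actions)

        # four networks: online and target copies of the actor and the critic
        self.actor = ActorNetwork(alpha, input_dims, layer1_size, layer2_size,
                                  n_actions=n_actions, name='Actor')
        self.target_actor = ActorNetwork(alpha, input_dims, layer1_size,
                                         layer2_size, n_actions=n_actions,
                                         name='TargetActor')
        self.critic = CriticNetwork(beta, input_dims, layer1_size,
                                    layer2_size, n_actions=n_actions,
                                    name='Critic')
        self.target_critic = CriticNetwork(beta, input_dims, layer1_size,
                                           layer2_size, n_actions=n_actions,
                                           name='TargetCritic')

        self.noise = OUActionNoise(mu=np.zeros(n_actions))
        # start the targets as exact copies of the online networks
        self.update_network_parameters(tau=1)
```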
  • 02:31:20 Next we need another function, which you may be able to
  • 02:31:22 predict if you've seen my videos on Q
  • 02:31:24 learning (which you should check out): the
  • 02:31:25 update_network_parameters function, and we'll call
  • 02:31:31 it here with an initial value of tau = 1. So
  • 02:31:33 what this does is it solves a problem of
  • 02:31:36 a moving target so in Q learning if you
  • 02:31:40 use one network to calculate both the
  • 02:31:43 action as well as the value of that
  • 02:31:46 action then you're really chasing a
  • 02:31:47 moving target because you're updating
  • 02:31:48 that estimate every turn right so you
  • 02:31:51 are end up using the same parameter for
  • 02:31:53 both and it can lead to divergence so
  • 02:31:56 the solution to that is to use a target
  • 02:31:57 network that learns the values of these
  • 02:31:59 states and action combinations and then
  • 02:32:02 the other network is what learns the
  • 02:32:03 policy and then of course periodically
  • 02:32:07 you have to overwrite the target
  • 02:32:09 parameter target networks parameters
  • 02:32:11 with the evaluation that where
  • 02:32:12 parameters and this function will do
  • 02:32:13 precisely that except that we have four
  • 02:32:16 networks instead of two so next we want
  • 02:32:20 to choose an action
  • 02:32:22 and that takes whatever the current
  • 02:32:25 observation of the environment is now
  • 02:32:27 very very important you have to put the
  • 02:32:30 actor into evaluation mode now this
  • 02:32:32 doesn't perform an evaluation step this
  • 02:32:34 just tells PI torch that you don't want
  • 02:32:37 to calculate statistics for the batch
  • 02:32:40 normalization and this is very critical
  • 02:32:42 if you don't do this the agent will not
  • 02:32:44 learn. And it doesn't do what you think
  • 02:32:47 the name implies it would do, right? The
  • 02:32:49 corresponding, complementary function
  • 02:32:51 is train(); it doesn't perform a training
  • 02:32:53 step, it puts it in training mode, where
  • 02:32:55 it does store those statistics in the
  • 02:32:57 graph for the batch normalization if you
  • 02:32:59 don't do batch norm then you don't need
  • 02:33:01 to do this but if you do what's the
  • 02:33:04 other function drop out drop out does
  • 02:33:07 the same thing or has the same tick
  • 02:33:08 where you have to call the eval and
  • 02:33:10 train functions so let's start by
  • 02:33:13 putting our observation into a tensor, with
  • 02:33:19 dtype=T.float, to self.
  • 02:33:21 actor.device — so that'll turn it into a
  • 02:33:25 CUDA float tensor. Now you want to get
  • 02:33:27 the actual action from the actor
  • 02:33:32 network, so feed that forward, to self.
  • 02:33:36 actor.device — this makes sure
  • 02:33:38 that you send it to the device, so it's a
  • 02:33:40 CUDA tensor. So mu_prime is going to be
  • 02:33:43 mu plus T.tensor of — what are we going
  • 02:33:46 to use? — self.noise(), which will give us
  • 02:33:48 our exploration noise, and that is going
  • 02:33:51 to be dtype float, and we will send
  • 02:33:55 that to the actor device. And then you want
  • 02:34:00 to say — should it be actor.train()?
  • 02:34:06 Yes, self.actor.train() —
  • 02:34:09 and then you want to return mu_prime.
  • 02:34:13 cpu().detach().numpy(). This is an idiom
  • 02:34:19 within PyTorch where you have to
  • 02:34:22 basically do this, otherwise it
  • 02:34:25 doesn't give you the actual
  • 02:34:28 numbers, right — it's gonna try to pass
  • 02:34:30 out a tensor, which doesn't work, because
  • 02:34:32 you can't pass a tensor into the OpenAI
  • 02:34:34 gym.
  • 02:34:35 Kind of a funny little quirk, but it is
  • 02:34:38 necessary. A sketch of the method is below.
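A sketch of choose_action in the PyTorch version; the eval() and train() calls here only switch the normalization layers' mode, as described above, they don't perform any learning:

```python
def choose_action(self, observation):
    self.actor.eval()                          # eval mode for the normalization layers
    observation = T.tensor(observation, dtype=T.float).to(self.actor.device)
    mu = self.actor.forward(observation).to(self.actor.device)
    mu_prime = mu + T.tensor(self.noise(),
                             dtype=T.float).to(self.actor.device)
    self.actor.train()                         # back to training mode
    return mu_prime.cpu().detach().numpy()
```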
  • 02:34:41 So now we need a function to store state transitions, and this is just
  • 02:34:45 kind of an interface for our replay
  • 02:34:48 memory class: self.memory.store_transition(
  • 02:34:54 state, action, reward, new_state, done).
  • 02:34:59 Simple. So now we come to the meat of
  • 02:35:05 the problem, where the learning actually happens. So
  • 02:35:07 you don't want to learn if you haven't
  • 02:35:10 filled up at least batch size of your
  • 02:35:12 memory buffers: if self.memory.mem_
  • 02:35:15 cntr is less than self.batch_
  • 02:35:18 size, then you just want to return.
  • 02:35:21 Otherwise, state, action, reward, new_state, done —
  • 02:35:26 you want to sample your memory: self.memory.
  • 02:35:31 sample_buffer(self.batch_size).
  • 02:35:39 Then you want to go ahead and turn all
  • 02:35:41 of those into tensors; that's because
  • 02:35:46 they come back as numpy arrays. In
  • 02:35:50 this case we'll put them on the critic's
  • 02:35:51 device — as long as they're on the same device
  • 02:35:53 it doesn't matter, I do this for
  • 02:35:54 consistency because these values will be
  • 02:35:56 used in the critic network. So done =
  • 02:36:01 T.tensor(done).to(self.critic.
  • 02:36:07 device); you need the new state, T.
  • 02:36:11 tensor(new_state, dtype=T.float).to(self.
  • 02:36:18 critic.device); you also need the actions,
  • 02:36:30 sent to the same device, and you need the states
  • 02:36:37 as tensors on that device. And now we come to
  • 02:36:46 another quirk of Pi torch where we're
  • 02:36:47 going to have to send everything to eval
  • 02:36:49 mode for the targets it may not be that
  • 02:36:53 important I did it for consistency so we
  • 02:37:07 want to calculate the target actions,
  • 02:37:08 much like you do in the Bellman equation
  • 02:37:10 for deep Q learning: target_actions = self.target_
  • 02:37:13 actor.forward(new_state). Then you want the
  • 02:37:20 critic_value_ for the new
  • 02:37:22 states: self.target_critic.forward(new_state,
  • 02:37:27 target_actions).
  • 02:37:32 So what we're doing is getting the
  • 02:37:34 target actions from the target actor
  • 02:37:37 network — in other words, what actions
  • 02:37:39 it should take based on the target
  • 02:37:40 actor's estimates — and then plugging that
  • 02:37:43 into the state-value function for the
  • 02:37:46 target critic network. You also want the
  • 02:37:50 critic_value, which is self.critic.
  • 02:37:53 forward for state and action — in
  • 02:37:56 other words, what is your
  • 02:37:58 estimate of the values of the states and
  • 02:37:59 actions we actually encountered in our
  • 02:38:01 subset of the replay buffer so now we
  • 02:38:04 have to calculate the targets that we're
  • 02:38:07 going to move towards or J in range self
  • 02:38:10 that batch size and I use a loop instead
  • 02:38:14 of a vectorized implementation because
  • 02:38:17 the vectorizing implementation is a
  • 02:38:19 little bit tricky if you don't do it
  • 02:38:20 properly you can end up with something
  • 02:38:21 of shape batch size by batch size which
  • 02:38:23 won't flag an error but it definitely
  • 02:38:27 gives you the wrong answer and you don't
  • 02:38:28 get learning. So: target.append(
  • 02:38:33 reward[j] + self.gamma * critic_
  • 02:38:37 value_[j] * done[j]).
  • 02:38:41 so this is where I was talking about the
  • 02:38:42 done flags
  • 02:38:43 if the episode is over then the value of
  • 02:38:46 the resulting state is multiplied by
  • 02:38:49 zero and so you don't take it into
  • 02:38:50 account you only take into account the
  • 02:38:52 reward from the current state precisely
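Continuing the learn() sketch, here is the target computation just described. One assumption worth flagging: the done values sampled from the buffer are taken to be 0 for terminal states and 1 otherwise (i.e. the buffer stores 1 - done), which is what lets the multiplication zero out the bootstrap term as the narration says:

```python
    # Targets are computed with the networks in eval mode, as described.
    self.target_actor.eval()
    self.target_critic.eval()
    self.critic.eval()

    target_actions = self.target_actor.forward(new_state)
    critic_value_ = self.target_critic.forward(new_state, target_actions)
    critic_value = self.critic.forward(state, action)

    # Explicit loop to avoid accidental (batch_size, batch_size) broadcasting.
    target = []
    for j in range(self.batch_size):
        # done[j] is assumed 0 at episode ends, so only the immediate reward counts.
        target.append(reward[j] + self.gamma * critic_value_[j] * done[j])
    target = T.tensor(target).to(self.critic.device)
    target = target.view(self.batch_size, 1)
```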
  • 02:38:55 So now let's go ahead and turn that target into a tensor:
  • 02:39:01 target = T.tensor(target).to(self.critic.device), and we want to reshape it:
  • 02:39:10 target equals target.view(self.batch_size, 1). Now we can come
  • 02:39:20 to the calculation of the loss functions.
  • 02:39:23 So we want to set the critic back into training mode, because we have already
  • 02:39:30 performed the evaluation and now we want to actually calculate the values for batch
  • 02:39:34 normalization, with self.critic.train().
  • 02:39:38 So self.critic.optimizer.zero_grad();
  • 02:39:42 in PyTorch, whenever you calculate the loss function you want to zero your
  • 02:39:46 gradients, so that gradients from previous steps don't accumulate and
  • 02:39:49 interfere with the calculation; it can slow stuff down, and you don't want that. So
  • 02:39:54 critic_loss is just F.mse_loss, the mean squared error, between the
  • 02:40:02 target and the critic value. Then you want to backpropagate
  • 02:40:08 that with backward() and step your optimizer.
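And the critic update itself, still inside learn(), assuming torch.nn.functional is imported as F:

```python
    # Back into train mode so the normalization layers use batch statistics again.
    self.critic.train()
    self.critic.optimizer.zero_grad()          # clear stale gradients
    critic_loss = F.mse_loss(target, critic_value)
    critic_loss.backward()
    self.critic.optimizer.step()
```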
  • 02:40:15 Boom, so that's it for the critic. So now we want to set the critic into
  • 02:40:21 evaluation mode for the calculation of the loss for our actor network. So how
  • 02:40:28 about self.actor.optimizer.zero_grad(), and I apologize that this is confusing; it was
  • 02:40:34 confusing to me, it took me a while to figure it out. This is one of the ways in
  • 02:40:38 which TensorFlow is superior to PyTorch: you don't have this quirk. I tend to like
  • 02:40:43 TensorFlow a little bit better, but you know, whatever,
  • 02:40:46 we'll just figure it out and get it going. So mu equals the forward
  • 02:40:52 propagation of the state, I'm going to put the actor into training mode, and you
  • 02:40:57 want to calculate your actor loss; that's just minus self.critic.forward(state, mu),
  • 02:41:03 and actor_loss equals T.mean(actor_loss). Again, stay tuned
  • 02:41:11 for the derivation from the paper; this is all outlined there, otherwise it seems
  • 02:41:16 mysterious, but this video is already 45 minutes long, so you know, that ought to
  • 02:41:21 wait for a future video. Then actor_loss.backward() and self.actor.optimizer.step(),
  • 02:41:28 and then we're done
  • 02:41:32 learning. So now, after you finish learning, you want to update the network
  • 02:41:36 parameters for your target actor and target critic networks:
  • 02:41:39 self.update_network_parameters(). Whew, man, okay, so
  • 02:41:47 we're almost there, I promise.
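A sketch of the actor update and the final target-network call, finishing the learn() method under the same assumptions; the minus sign is what turns gradient descent on the loss into gradient ascent on the critic's estimate of Q(s, mu(s)):

```python
    self.critic.eval()                      # critic is only evaluated here
    self.actor.optimizer.zero_grad()
    mu = self.actor.forward(state)          # actions the current policy picks
    self.actor.train()
    actor_loss = -self.critic.forward(state, mu)   # maximize Q via its negative
    actor_loss = T.mean(actor_loss)
    actor_loss.backward()
    self.actor.optimizer.step()

    # Softly move the target networks toward the online networks.
    self.update_network_parameters()
```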
  • 02:41:52 Let's go ahead and do that: def update_network_parameters(self, tau=None),
  • 02:42:01 with None by default. So tau is a parameter
  • 02:42:06 that allows the update of the target network to gradually approach the
  • 02:42:12 evaluation networks, and this is important for a nice slow convergence;
  • 02:42:18 you don't want to take too large a step in between updates, so tau is a small
  • 02:42:22 number, much, much less than one. So
  • 02:42:28 if tau is None then you want to say
  • 02:42:32 tau = self.tau. Now this may
  • 02:42:35 seem mysterious; the reason I'm doing this is because at the very beginning,
  • 02:42:39 when we call the initializer, we say
  • 02:42:43 update_network_parameters(tau=1).
  • 02:42:46 This is because in the very beginning we want all the networks to start with the
  • 02:42:51 same weights, and so we call it with a
  • 02:42:53 tau of 1, and in that case tau is not
  • 02:42:57 None, so tau is just 1, and you will get
  • 02:43:00 the update rule here in a second. So this
  • 02:43:03 is more hocus-pocus with PyTorch: self.actor.named_parameters(). What this will do is
  • 02:43:13 get all the names of the parameters from these networks, and we
  • 02:43:22 want to do the same thing for the critic params, target actor params, and target critic params.
  • 02:43:40 Okay, now that we have the parameters,
  • 02:43:42 let's turn them into dictionaries; that makes iterating over them much easier, because
  • 02:43:47 named_parameters() is actually a generator, I believe, don't quote me on that. So
  • 02:43:53 critic_state_dict = dict(critic_params),
  • 02:43:57 actor_state_dict = dict(actor_params),
  • 02:44:07 and likewise target_critic_dict and target_actor_dict
  • 02:44:11 equal dict of their respective params. Boom, okay,
  • 02:44:25 almost there. So now we want to iterate
  • 02:44:27 over these dictionaries and copy parameters. So for name in critic_state_dict:
  • 02:44:34 critic_state_dict[name] =
  • 02:44:39 tau * critic_state_dict[name].clone() +
  • 02:44:55 (1 - tau) * target_critic_dict[name].clone(),
  • 02:45:00 and then self.target_critic.load_state_dict(critic_state_dict). What
  • 02:45:11 this does is it iterates over this dictionary, looks at each key in
  • 02:45:15 the dictionary, and updates the values from this particular network, and
  • 02:45:21 you can see that when tau is 1 you get 1
  • 02:45:23 minus 1 is 0, so it's just this equals
  • 02:45:26 1 times that, so it's the
  • 02:45:28 identity, and then it loads the target critic with those parameters. So at the
  • 02:45:33 very beginning it'll load it with the parameters from the initial critic
  • 02:45:40 network, and likewise for the actor
  • 02:45:43 network. So let's go ahead and copy this and just go ahead and change critic to
  • 02:45:49 actor, and then we'll be done with that function and we'll only have one other
  • 02:45:55 thing to take care of before we get to
  • 02:45:58 the main program: target_actor,
  • 02:46:07 actor_state_dict, I believe that is it,
  • 02:46:14 yes indeed it is; now that should be
  • 02:46:21 target_actor, yes, perfect.
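Putting those pieces together, a rough sketch of the whole soft-update method under the same assumptions about the agent's attributes; note the dictionaries only cover named parameters, so if the networks carried BatchNorm running-stat buffers those would need separate handling:

```python
def update_network_parameters(self, tau=None):
    # tau=1 is passed once from the constructor so the target networks
    # start as exact copies; afterwards the small self.tau is used.
    if tau is None:
        tau = self.tau

    critic_state_dict = dict(self.critic.named_parameters())
    actor_state_dict = dict(self.actor.named_parameters())
    target_critic_dict = dict(self.target_critic.named_parameters())
    target_actor_dict = dict(self.target_actor.named_parameters())

    # Soft update: new_target = tau * online + (1 - tau) * old_target
    for name in critic_state_dict:
        critic_state_dict[name] = tau * critic_state_dict[name].clone() + \
            (1 - tau) * target_critic_dict[name].clone()
    self.target_critic.load_state_dict(critic_state_dict)

    for name in actor_state_dict:
        actor_state_dict[name] = tau * actor_state_dict[name].clone() + \
            (1 - tau) * target_actor_dict[name].clone()
    self.target_actor.load_state_dict(actor_state_dict)
```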
  • 02:46:27 Okay, now it's right. So next up we have
  • 02:46:31 two other bookkeeping functions to save the models. So def save_models, and you
  • 02:46:38 definitely want this because this thing takes forever to train: save the actor, the
  • 02:46:48 critic, the target actor, and the target
  • 02:46:59 critic, and load_models does the inverse
  • 02:47:05 operation, so just copy all this and change save to load,
  • 02:47:15 keep things simple, right? And again,
  • 02:47:20 since this takes so long, I'm going to upload my saved model
  • 02:47:24 parameters to the GitHub for you.
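The narration doesn't spell out the method names on the network classes, so here is a sketch assuming each network exposes save_checkpoint()/load_checkpoint() helpers (hypothetical names) that write to and read from a checkpoint directory:

```python
def save_models(self):
    # Each network is assumed to know how to checkpoint itself.
    self.actor.save_checkpoint()
    self.critic.save_checkpoint()
    self.target_actor.save_checkpoint()
    self.target_critic.save_checkpoint()

def load_models(self):
    # The inverse operation: restore all four networks from disk.
    self.actor.load_checkpoint()
    self.critic.load_checkpoint()
    self.target_actor.load_checkpoint()
    self.target_critic.load_checkpoint()
```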
  • 02:47:26 But this is it; this is 275 lines, so this
  • 02:47:30 is probably the longest project we have worked on here at Machine Learning with
  • 02:47:34 Phil. If you made it this far,
  • 02:47:35 congratulations; we're already 50 minutes
  • 02:47:37 in and we're almost done, I promise. So
  • 02:47:39 let's come over to our main function, and
  • 02:47:44 we want to import our agent:
  • 02:47:49 from ddpg_torch import Agent. We want to
  • 02:47:55 import gym,
  • 02:48:01 we want numpy, we want my super-duper
  • 02:48:06 awesome plot_learning function,
  • 02:48:12 and that is it. So env = gym.make('LunarLanderContinuous-v2'),
  • 02:48:25 and agent equals Agent with alpha equals 0.000025,
  • 02:48:29 so 2.5 by 10 to the minus 5, beta equals 0.00025,
  • 02:48:34 so 2.5 by 10 to the minus 4, input_dims
  • 02:48:39 equals a list with the single element eight, tau equals
  • 02:48:43 0.001, and env equals
  • 02:48:46 env. Well, that reminds me, I didn't
  • 02:48:48 multiply by the action space high in the
  • 02:48:51 choose action function;
  • 02:48:54 don't worry, that'll be in the final
  • 02:48:56 implementation, or I can leave it as an
  • 02:48:58 exercise to the reader. It doesn't matter
  • 02:48:59 for this environment; when we get to
  • 02:49:01 other ones where it does matter, I'll
  • 02:49:03 be a little bit more diligent about that.
  • 02:49:05 batch_size is 64, layer1_size 400, layer2_size 300, and n_actions equals 2.
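Here's a rough sketch of that setup, assuming the agent class lives in ddpg_torch.py, the plotting helper lives in a local utils module, and that the constructor keyword names (batch_size, layer1_size, layer2_size, n_actions, and so on) match the ones used when the agent was written; treat those exact names as assumptions:

```python
import gym
import numpy as np
from ddpg_torch import Agent
from utils import plot_learning   # plotting helper; module and name assumed

env = gym.make('LunarLanderContinuous-v2')
agent = Agent(alpha=0.000025, beta=0.00025, input_dims=[8], tau=0.001,
              env=env, batch_size=64, layer1_size=400, layer2_size=300,
              n_actions=2)
```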
  • 02:49:10 Now another interesting tidbit is that we have to
  • 02:49:17 set the random seed. This is not something I've done before, but this is a highly
  • 02:49:23 sensitive learning method; if you read the original paper, they do averages over
  • 02:49:29 five runs, and that's because every run is a little bit different, and I suspect
  • 02:49:33 that's why they had to initialize the weights and biases within such a narrow
  • 02:49:37 range, right? You don't want to go all the way out to plus and minus one when you can
  • 02:49:39 constrain to something much more narrow; it gives you a little bit more repeatability.
  • 02:49:43 So we have to set the numpy random seed to some value instead of None; in this
  • 02:49:49 case I've used zero. I've seen other values used; please clone this and see
  • 02:49:54 what happens if you input other seed values. So next we need a score history
  • 02:50:00 to keep track of the scores over time, and we need to
  • 02:50:09 iterate over a thousand games: done equals False, score equals zero,
  • 02:50:16 observation equals env.reset() to get a new observation. Then, while not done:
  • 02:50:23 act = agent.choose_action(obs), then new_state, reward, done, info equals
  • 02:50:34 env.step(act), then agent.remember, because we want to
  • 02:50:39 keep track of that transition: obs, act, reward, new_state, int(done), and then agent.learn().
  • 02:50:47 We learn on every step because this is a temporal difference learning method,
  • 02:50:50 instead of a Monte Carlo type method where we would learn at the end of every
  • 02:50:53 episode. Keep track of the score and set your old state to the new state. So at
  • 02:51:04 the end of every episode I want to print a place marker: we'll say score_history
  • 02:51:10 .append(score), and print the episode number i, the score
  • 02:51:18 to two decimal places, and the 100 game average
  • 02:51:27 to two decimal places. What this will do is take the
  • 02:51:34 last 100 games and compute the mean, so that way you can get an idea of the
  • 02:51:37 learning. Remember, with the lunar lander environment, solved means that it has
  • 02:51:41 gotten an average score of 200
  • 02:51:44 over the last 100 games. Then every 25
  • 02:51:49 games we want to save the models: agent.save_models(). And at the end, filename
  • 02:51:59 equals 'LunarLander.png'; that's not in
  • 02:52:05 the loop, you want to do it at the end of all the games:
  • 02:52:09 plot_learning(score_history,
  • 02:52:10 filename,
  • 02:52:12 window=100).
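Continuing the setup sketch above, the training loop would look roughly like this; the plot_learning call at the end is a guess at that helper's signature:

```python
np.random.seed(0)        # this method is seed-sensitive, hence the fixed seed
score_history = []

for i in range(1000):
    done = False
    score = 0
    obs = env.reset()
    while not done:
        act = agent.choose_action(obs)
        new_state, reward, done, info = env.step(act)
        agent.remember(obs, act, reward, new_state, int(done))
        agent.learn()            # TD method: learn on every single step
        score += reward
        obs = new_state          # the old state becomes the new state
    score_history.append(score)
    print('episode', i, 'score %.2f' % score,
          '100 game average %.2f' % np.mean(score_history[-100:]))
    if i % 25 == 0:
        agent.save_models()      # checkpoint every 25 games

filename = 'LunarLander.png'
plot_learning(score_history, filename, window=100)   # assumed signature
```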
  • 02:52:17 Wow, so an hour in, we finally finished this.
  • 02:52:20 Now we get to go to the terminal and see how many typos I made; I'm sure there's
  • 02:52:24 probably 50, so let's get to it. Alright,
  • 02:52:27 so here we are, let's see what we get. We
  • 02:52:30 want to run the torch lunar lander script, fingers
  • 02:52:32 crossed.
  • 02:52:33 Okay, so that's a silly one: on line
  • 02:52:37 30 we forgot an equals sign, so let's go back there and fix that. Alright, here we are,
  • 02:52:42 so it says line 30, yes, right here. All
  • 02:52:51 right, did we do that again anywhere else? Not that I can see, but that's no
  • 02:52:57 guarantee. Alright, so let's go back to
  • 02:52:58 the terminal.
  • 02:53:00 Alright, let's try it again. Huh, line 119,
  • 02:53:07 okay, typical. Alright, so 119, right
  • 02:53:17 there, so that's in the actor; let's just
  • 02:53:22 scroll down.
  • 02:53:26 That's the agent class; I don't think I
  • 02:53:30 did it there. Alright, I'm going back to
  • 02:53:31 the terminal.
  • 02:53:32 Alright, so I started it and it ran, so
  • 02:53:35 let's see: "built-in
  • 02:53:45 function or method has no attribute
  • 02:53:48 numpy". Alright, that's an interesting
  • 02:53:50 bug, let's fix that. So that is on line
  • 02:53:53 192, in our choose action function, on mu
  • 02:53:59 prime. Oh, that's why: detach is a
  • 02:54:03 function, not an attribute, so it needs the parentheses. Let me fix it
  • 02:54:06 and head back to the terminal.
  • 02:54:13 "rewards is not defined", so that is on
  • 02:54:19 line 55. Okay, ah, it's just called reward;
  • 02:54:29 there we go, I had added an s. Back to the
  • 02:54:32 terminal.
  • 02:54:33 Ah, perfect, that's easy to fix:
  • 02:54:45 it can't find the checkpoint directory for DDPG, huh, because I didn't
  • 02:54:55 make the directory first. That's easy.
  • 02:55:01 Perfect, now it's running. So I'm not
  • 02:55:04 going to let this run all 1000 games,
  • 02:55:05 because it takes about a couple of hours;
  • 02:55:07 instead let's take a look here. So I
  • 02:55:11 was running this earlier when I was
  • 02:55:15 recording the agent's gameplay
  • 02:55:18 for this video, and you can see that
  • 02:55:20 within under 650 games it went ahead and
  • 02:55:24 solved it; when you print out the trailing average for the last hundred
  • 02:55:27 games we get a reward of well over 200.
  • 02:55:29 Now keep in mind one interesting thing
  • 02:55:32 is that this is still actually in
  • 02:55:34 training mode; it's
  • 02:55:37 not in full evaluation mode, because we
  • 02:55:39 still have some noise, right? If you
  • 02:55:41 wanted to do a pure evaluation of the agent you would set the noise to zero;
  • 02:55:44 we'll do that in a set of future videos,
  • 02:55:46 there's a whole bunch of stuff I can
  • 02:55:47 do on this topic. But just keep in mind
  • 02:55:50 that this is an agent that is still
  • 02:55:51 taking some random actions; the noise is
  • 02:55:53 nonzero, and so it is still taking
  • 02:55:56 suboptimal actions, yet it's getting a score of
  • 02:55:58 260 and still beating the environment
  • 02:56:00 even though that noise is present. And
  • 02:56:02 you can see it in, like, episode 626, where
  • 02:56:06 it gets a score of 26, and then in
  • 02:56:09 episode 624, where it gets, you know, 8.5
  • 02:56:11 points. So that is
  • 02:56:14 pretty cool stuff. So this is a very
  • 02:56:16 powerful algorithm, and keep in mind this
  • 02:56:17 was a continuous action space, totally
  • 02:56:19 intractable for Q-learning, right? That is
  • 02:56:21 simply not possible;
  • 02:56:23 it's an infinitude of actions, so you
  • 02:56:25 need something like DDPG to handle this,
  • 02:56:27 and it handles it quite well. In future
  • 02:56:29 videos we're going to get to the
  • 02:56:30 bipedal walker, we're going to get to
  • 02:56:32 learning from pixels, where we do the
  • 02:56:34 continuous car racing environment, and we'll
  • 02:56:37 probably get into other stuff from
  • 02:56:39 Roboschool from OpenAI, so make
  • 02:56:41 sure to subscribe so that you can see
  • 02:56:43 that in future videos. Go ahead and check
  • 02:56:46 out the GitHub for this so you can get
  • 02:56:47 the weights and play around with
  • 02:56:49 this, so you don't have to spend a couple
  • 02:56:50 hours training it on your GPU. Make sure
  • 02:56:53 to leave a like, share this if you found
  • 02:56:55 it helpful, that is incredibly helpful to
  • 02:56:56 me, and leave a comment down below; I
  • 02:56:59 answer all questions. I look forward
  • 02:57:01 to seeing you all in the next video.