Coding

Real-World Python Machine Learning Tutorial w/ Scikit Learn (sklearn basics, NLP, classifiers, etc)

  • 00:00:00 hey what's up guys and girls and welcome
  • 00:00:01 back to another video very excited for
  • 00:00:03 this one today we're gonna be going
  • 00:00:05 through the scikit-learn library of
  • 00:00:07 Python which is a very important library
  • 00:00:09 for machine learning in Python and
  • 00:00:12 definitely this video was pretty highly
  • 00:00:14 requested so I'm excited to finally get
  • 00:00:16 to it so what will specifically be doing
  • 00:00:18 in this video is really kind of slowly
  • 00:00:21 walking our way through the scikit-
  • 00:00:22 learn library and like kind of
  • 00:00:24 showing all the different avenues you
  • 00:00:25 can go but our ultimate task that we'll
  • 00:00:28 be building in this video is going to be
  • 00:00:30 two different machine learning models
  • 00:00:32 the first model is going to
  • 00:00:33 automatically classify text that we put
  • 00:00:36 in as positive or negative so for some
  • 00:00:39 examples if I ran this line of code
  • 00:00:42 right here
  • 00:00:42 you know I thoroughly enjoyed this five
  • 00:00:44 stars that's positive bad book do not
  • 00:00:47 buy that be a negative example very
  • 00:00:49 interesting stuff thank you that's
  • 00:00:52 positive and as you see the machine
  • 00:00:53 learning model outputs corresponding
  • 00:00:55 values I could change this to something
  • 00:00:57 like horrible do not buy... heh I
  • 00:01:02 already said do not buy... horrible waste
  • 00:01:06 of time
  • 00:01:07 and that will now we'll see this switch
  • 00:01:09 to negative to cool so it like
  • 00:01:12 automatically knows you know if these
  • 00:01:14 are positive or negative and this could
  • 00:01:16 be applied to all sorts of cool things
  • 00:01:18 like I could you know incorporate this
  • 00:01:20 model into my YouTube comments and see
  • 00:01:22 how much positive how much negative
  • 00:01:24 stuff I'm getting so it's really fun
  • 00:01:26 kind of playing with actual text data
  • 00:01:28 and creating a scikit-learn machine
  • 00:01:30 learning model and I guess getting this
  • 00:01:33 as the output then we'll do one other
  • 00:01:36 task which is very similar the model
  • 00:01:39 doesn't change too much but it's just
  • 00:01:41 kind of cool to see a little bit more
  • 00:01:42 what you can do with the same type of
  • 00:01:45 sklearn stuff so this is also an NLP
  • 00:01:47 natural language processing model and
  • 00:01:49 this one instead of positive or negative
  • 00:01:51 we have these several different
  • 00:01:54 categories that were kind of grouping
  • 00:01:55 our comments in and kind of the way you
  • 00:01:58 can think about this one is imagine
  • 00:02:00 you're like in charge of Twitter like HR
  • 00:02:03 and you know you're getting all these
  • 00:02:05 positive and negative feedback to your
  • 00:02:08 Twitter account but you don't know what
  • 00:02:10 products they're talking about this
  • 00:02:12 machine learning model
  • 00:02:13 automatically classifies text as a certain
  • 00:02:17 category so like great for my wedding
  • 00:02:18 would map to a clothing category I loved
  • 00:02:21 it in my garden would be patio category
  • 00:02:24 and good computer would be electronics
  • 00:02:27 and as you see that Maps correctly to
  • 00:02:29 these things so those are the two models
  • 00:02:30 will be building will be walking through
  • 00:02:32 all sorts of cool stuff with
  • 00:02:34 scikit-learn quickly let me just cover
  • 00:02:36 the timeline on this video and this will
  • 00:02:38 be in the comments so make sure to check
  • 00:02:40 that out we're gonna start out with kind
  • 00:02:41 of just a brief overview of why you use
  • 00:02:43 sklearn what's its purpose when you
  • 00:02:46 should use it when you shouldn't use it
  • 00:02:47 then we'll jump into loading in data
  • 00:02:50 into SK learn and what SK learn can do
  • 00:02:53 to help us with that and we'll be
  • 00:02:54 choosing our classifier once we choose
  • 00:02:57 our classifier will be you know
  • 00:02:59 evaluating the performance of the
  • 00:03:01 different classifiers we'll be doing
  • 00:03:02 some fine-tuning of the classifier to
  • 00:03:04 make it even better try to think if
  • 00:03:07 there's anything else we will also save
  • 00:03:10 our model so you don't have to retrain
  • 00:03:12 it every time if you wanted to use it in
  • 00:03:13 production and also I just want to say
  • 00:03:16 real quick this is a fairly long video
  • 00:03:18 so if you want to break this up into
  • 00:03:20 like 15 minutes I'm gonna try to
  • 00:03:23 structure it so it's pretty easy to do
  • 00:03:24 that so don't feel like you have to
  • 00:03:25 watch this whole video at one time watch
  • 00:03:27 one chunk come back to the second part
  • 00:03:29 later but I kinda want to keep it all
  • 00:03:31 together because I think it's good to
  • 00:03:32 kind of see the full avenue from start from
  • 00:03:35 data to final model ok so a
  • 00:03:40 little background information on SK
  • 00:03:42 learn and machine learning more
  • 00:03:44 generally is that I would say that
  • 00:03:45 pretty much every machine learning task
  • 00:03:47 has several steps associated with it so
  • 00:03:51 I mean at its core any sort of machine
  • 00:03:53 learning thing you want to do you're
  • 00:03:54 gonna have some sort of question that
  • 00:03:55 you want to answer so that's always like
  • 00:03:57 the first thing you know you have this
  • 00:03:58 question that you want to answer and
  • 00:04:00 then you need to find some data that
  • 00:04:02 will help you answer that question you
  • 00:04:03 can build model around that data once
  • 00:04:06 you have that data found and you apply
  • 00:04:07 have to do some prep on the data some
  • 00:04:09 sort of processing some sort of
  • 00:04:11 filtering once you have your data all
  • 00:04:14 prepped you're going to want to build
  • 00:04:15 some sort of model around the data once
  • 00:04:18 you have a model you're going to next
  • 00:04:20 move on to doing some evaluation how
  • 00:04:22 well is your model performing
  • 00:04:23 and kind of after that you'll make
  • 00:04:27 improvements you'll tune the model you'll
  • 00:04:28 find different parameters you'll you
  • 00:04:30 know maybe tune your data a little bit
  • 00:04:31 more and just try to improve your model
  • 00:04:33 as much as possible SK learn helps
  • 00:04:37 simplify that entire pipeline they're
  • 00:04:40 there to help with a lot of common
  • 00:04:42 things you'll want to do to improve your
  • 00:04:44 model and to get your model ready to be
  • 00:04:46 built or to actually use a specific
  • 00:04:49 algorithm for the model what sklearn
  • 00:04:52 does is package all that up
  • 00:04:54 into a nice library for you so a little
  • 00:04:58 bit more detail like I'm here looking at
  • 00:04:59 the SK learn site there's all sorts of
  • 00:05:02 classification algorithms that are out
  • 00:05:04 there and you know here is some code for
  • 00:05:07 a certain classification algorithm that's
  • 00:05:11 long and convoluted but with sklearn
  • 00:05:15 you can use that same algorithm in you
  • 00:05:18 know a couple lines same thing with your
  • 00:05:21 regression clustering everything that
  • 00:05:25 might take a lot of your own code to do
  • 00:05:28 you know someone's already... sklearn
  • 00:05:30 has already packaged that algorithm up
  • 00:05:31 for you and it's easy to just use
  • 00:05:33 yourself another separation I want to
  • 00:05:36 make real quick is that I would say
  • 00:05:38 there's kind of two types of models that
  • 00:05:41 you can build in machine learning so
  • 00:05:43 this is very general you know on the one
  • 00:05:45 hand you have the neural network the
  • 00:05:48 deep learning models in and on the other
  • 00:05:50 hand you have these traditional
  • 00:05:52 algorithmic type machine learning models
  • 00:05:55 SK learn is really I think helpful in
  • 00:05:59 this traditional algorithmic models they
  • 00:06:04 you know have packaged up all of this
  • 00:06:06 stuff for you to use from the algorithm
  • 00:06:09 side for neural net type stuff we're not
  • 00:06:11 going to be covering it in this video
  • 00:06:12 and I don't you know I saw I looked at
  • 00:06:15 some stuff on the documentation they
  • 00:06:17 have a little bit of neural net stuff it
  • 00:06:19 looks like an SK learn but I would
  • 00:06:20 really recommend if you want to get
  • 00:06:22 neural network machine learning
  • 00:06:23 experience check out either tensorflow I
  • 00:06:26 use pi torch personally a lot there's
  • 00:06:30 other libraries that are really built
  • 00:06:32 around neural networks so we're focusing
  • 00:06:34 on that traditional algorithmic
  • 00:06:36 approach side here okay so now that
  • 00:06:38 we've gone through a little bit about
  • 00:06:40 scikit-learn let's go back to the question
  • 00:06:42 we're originally trying to answer and
  • 00:06:43 that is automatically classifying
  • 00:06:45 comments as positive or negative and to
  • 00:06:48 do this we first need some data to train
  • 00:06:50 our model the easiest approach like the
  • 00:06:53 initial stupidest approach you could do
  • 00:06:54 is imagine me just going through manually
  • 00:06:57 writing like stuff like and that would
  • 00:07:05 be negative or stuff like he is he is
  • 00:07:13 the best that would be positive I could
  • 00:07:17 do this but I'd be here for years
  • 00:07:19 basically creating enough data for the
  • 00:07:23 needs of this video and to make a good
  • 00:07:24 model so yeah manually creating data is
  • 00:07:26 probably not the best approach you could
  • 00:07:28 crowdsource and like get a lot of people
  • 00:07:31 to manually cut you data that could work
  • 00:07:32 potentially a little better but still
  • 00:07:34 time-consuming probably costly not the
  • 00:07:37 best so the best approach to getting
  • 00:07:40 data you need is try to be creative how
  • 00:07:42 can we find data that will help us do
  • 00:07:43 what we need that already exists in for
  • 00:07:46 this positive negative feedback model
  • 00:07:50 the place that I decided to go was
  • 00:07:56 amazon.com so if we click on
  • 00:07:58 anything like this cute little oh my
  • 00:08:00 gosh this is so adorable this cute
  • 00:08:04 little costume and we can go to the
  • 00:08:07 reviews and the reviews basically here
  • 00:08:09 are reviews of any product which basically
  • 00:08:13 already have labeled data for us if
  • 00:08:15 something is five stars this material in
  • 00:08:18 here is gonna be very positive on the
  • 00:08:21 other hand let's see if we can find a
  • 00:08:23 one-star I don't know if I'll find it
  • 00:08:26 with how adorable this costume is okay
  • 00:08:28 yeah we have a one star review right
  • 00:08:30 here maybe make this a little bigger
  • 00:08:35 like this is not quality at all like in
  • 00:08:39 this one star review we have a negative
  • 00:08:41 feedback we have negative things that
  • 00:08:42 people are saying so if we can take a
  • 00:08:45 lot of Amazon
  • 00:08:46 views we can use that as our training
  • 00:08:49 data so that's exactly what we're gonna
  • 00:08:51 do so what I ended up doing and I made
  • 00:08:54 it a little bit simpler for you is that
  • 00:08:55 this guy Julian McAuley at UCSD did a lot of
  • 00:09:00 work for us where you don't actually have to
  • 00:09:02 scrape through Amazon and from 1996 to
  • 00:09:05 July 2014 he collected all sorts of
  • 00:09:09 Amazon review data so I went ahead and
  • 00:09:12 took some of that and broke it down I
  • 00:09:15 have a script that shows exactly how I
  • 00:09:17 did it
  • 00:09:18 that'll show you but I just kind of
  • 00:09:21 broke that down into a more manageable
  • 00:09:23 amount of reviews for several of these
  • 00:09:26 categories here from the year 2014 so
  • 00:09:29 that's what we'll be using as our data
  • 00:09:31 as our training data okay to get that
  • 00:09:33 data and begin actually writing our code
  • 00:09:35 I need you to go to my github page Keith
  • 00:09:39 galley this will also be linked to in
  • 00:09:41 the description and yeah KeithGalli
  • 00:09:43 slash the sklearn repo and while you're
  • 00:09:44 at it
  • 00:09:45 follow my github because the more
  • 00:09:47 milestones I hit the more fun pictures like
  • 00:09:49 this I get to post okay and so once
  • 00:09:53 you're on the github page you're going
  • 00:09:54 to want to go to data and to begin I
  • 00:09:57 just go I recommend just going to
  • 00:09:59 sentiment and downloading this file I
  • 00:10:01 called books small so this is 1,000
  • 00:10:04 Amazon reviews from 2014 specifically on
  • 00:10:07 ebooks so go ahead you can click on that
  • 00:10:11 and if you click on that there's
  • 00:10:13 somewhere – yeah just download the raw
  • 00:10:15 file here or you click raw and then you
  • 00:10:18 can do save as so books small and text
  • 00:10:22 document should be fine or maybe just
  • 00:10:26 even do all files then just do JSON
  • 00:10:32 don't mind where I'm saving it yeah so
  • 00:10:36 I'm just saving that I already have it
  • 00:10:38 saved but yeah save it somewhere that's
  • 00:10:40 close to where you're writing your code
  • 00:10:41 okay open up your favorite code editor I
  • 00:10:44 personally like Jupiter notebooks for
  • 00:10:46 anything data science related and now
  • 00:10:48 let's actually start diving into the
  • 00:10:50 machine learning tasks so to begin
  • 00:10:53 before I write even any code that's just
  • 00:10:53 before I even write any code let's just
  • 00:11:00 the first time this is what our data
  • 00:11:02 looks like this book's small that I told
  • 00:11:03 you to download it looks like this so
  • 00:11:06 it's a JSON file I guess and each line
  • 00:11:10 is a another JSON object so we're gonna
  • 00:11:13 have to if we want to like load this in
  • 00:11:15 in a logical way we're gonna have to
  • 00:11:16 load in this file line-by-line and then
  • 00:11:19 what are the important fields here in
  • 00:11:21 this JSON so I see we have this review
  • 00:11:25 text and that's actually what is the the
  • 00:11:28 text content in the review so that's
  • 00:11:29 important so we're going to want that
  • 00:11:30 and then the other thing that is going
  • 00:11:32 to be important oh my gosh this is a
  • 00:11:34 long one is going to be this overall
  • 00:11:36 that's the out of 5 star rating so those
  • 00:11:40 are the two fields that I really care
  • 00:11:41 about and then we can kind of do some
  • 00:11:44 additional stuff with that so that's a
  • 00:11:46 file and let's start processing that so
  • 00:11:50 one thing I recommend is let's first
  • 00:11:54 import the JSON library that will just
  • 00:11:57 allow us to process that file and then
  • 00:12:01 if we're going to want to load this in
  • 00:12:03 line by line this is what we can do so
  • 00:12:06 we're gonna go well first off we need to
  • 00:12:09 know our file name and our file is
  • 00:12:13 called... the path that I have saved mine
  • 00:12:16 in and this is relative to the code that
  • 00:12:20 I'm writing it's within the data folder
  • 00:12:22 and then it was within the sentiment
  • 00:12:25 thing and it was called books small
  • 00:12:29 dot JSON so that was my file name you
  • 00:12:31 might have something different
  • 00:12:33 alright now we want to open up that file
  • 00:12:35 so we can type in with open file name
  • 00:12:41 as... that guy is f... so we're opening this
  • 00:12:46 file name as f the file is called f
  • 00:12:49 and then what we can do is for line and
  • 00:12:52 F and just to see what we have right now
  • 00:12:54 I'm going to just do a print line and
  • 00:12:57 this is Python 3 just a heads up so I
  • 00:12:59 have those surrounding it with the
  • 00:13:00 parentheses and I'm gonna break out a
  • 00:13:01 little because I don't wanna run
  • 00:13:02 everything right now okay cool so I did
  • 00:13:04 get something out of that and now what I
  • 00:13:09 want to do is I want to quickly I want
  • 00:13:13 to quickly what do I want to quickly do
  • 00:13:15 I want to get the review text so we want
  • 00:13:20 to get in the overall
  • 00:13:29 we want to get the review text here and
  • 00:13:33 like this is definitely a positive thing
  • 00:13:35 it's four star overall like divine by
  • 00:13:38 storm with his unique new novel she
  • 00:13:41 develops a world like unlike any others
  • 00:13:43 so this all sounds good and this is
  • 00:13:45 about books so let's try to get that
  • 00:13:47 review text so printed line normally if
  • 00:13:52 this is like loaded in a dictionary we
  • 00:13:54 should just be able to do line and then
  • 00:13:57 review texts like this so let's see if
  • 00:14:00 that works
  • 00:14:01 ah we're getting there and so the reason
  • 00:14:04 we get an error is because right now
  • 00:14:05 this is just raw text so we need to use
  • 00:14:08 this JSON library to actually load it in
  • 00:14:10 as a review like dictionary basically so
  • 00:14:16 we could say something like review
  • 00:14:18 equals json dot loads line and now if I
  • 00:14:36 print out review of review text we
  • 00:14:41 should get what we're looking for here
  • 00:14:44 there we go, cool now we just
  • 00:14:49 get that text and if I just wanted that
  • 00:14:52 overall score I could do review overall
  • 00:14:55 that was the JSON key that will produce
  • 00:15:00 the right value... wait what did I just do...
  • 00:15:03 4.0 cool so that was the four
  • 00:15:06 stars so we got both the things we
  • 00:15:08 wanted so now what we really need to do
  • 00:15:09 is just gather this all in a nice way so
  • 00:15:13 I'm gonna get rid of this
  • 00:15:21 break statement and I'm going to say
  • 00:15:23 reviews that's going to start out as an
  • 00:15:25 empty list and what we're going to do is
  • 00:15:27 we're going to just do reviews dot
  • 00:15:32 append and maybe we'll append a tuple
  • 00:15:35 object of review review text and a score
  • 00:15:45 and the square
  • 00:15:51 I can go ahead and delete this because I
  • 00:15:54 don't want all of it to print for all of
  • 00:15:56 the thousand lines and it just check to
  • 00:15:59 make sure it works let's print out like
  • 00:16:01 a random object in that so I'm going to
  • 00:16:03 just go with like review number five
  • 00:16:08 love the book great storyline keeps you
  • 00:16:11 entertained for a first novel from this
  • 00:16:13 author she did a great job would
  • 00:16:14 definitely recommend so this is a very
  • 00:16:16 positive review I'm surprised that they
  • 00:16:18 gave it just four stars because they
  • 00:16:20 could have given it five but yeah
  • 00:16:22 whatever but yeah as it looks like it
  • 00:16:24 loaded in properly and now if we just
  • 00:16:27 wanted to access like the review we would
  • 00:16:32 just do zero and if we wanted the
  • 00:16:35 text or the score we'd do one well while this
  • 00:16:39 does work with this whole indexing to
  • 00:16:41 get like the score and the text it's not
  • 00:16:46 the neatest way and I think one issue I
  • 00:16:47 see with a lot of like data scientists
  • 00:16:49 and machine learning engineers is that
  • 00:16:52 it gets messy their data gets messy it
  • 00:16:55 gets hard to parse through someone's
  • 00:16:58 code so what we're gonna do real quick
  • 00:17:00 is make a data class for all of the
  • 00:17:05 data that we're loading in so I'll do
  • 00:17:06 this above so we're to call this class
  • 00:17:08 review and we're going to initialize it
  • 00:17:16 with you always have the initialize it
  • 00:17:20 with self self
  • 00:17:30 texts and score and basically self dot
  • 00:17:37 text will equal text and self dot score
  • 00:17:43 the score is equal to, stars, like the
  • 00:17:45 number of stars is what I'm saying with
  • 00:17:46 that, is equal to score and then we do
  • 00:17:50 some additional stuff within this class
  • 00:17:51 to like
  • 00:17:53 convert this score to a sentiment and
  • 00:17:55 we'll just ultimately having this class
  • 00:17:57 will make things neater so now what
  • 00:18:01 we're going to do instead of appending
  • 00:18:03 this tuple we're going to go ahead and
  • 00:18:07 create a review object so review and
  • 00:18:13 then we're gonna pass in the text and
  • 00:18:16 the score so we already actually have
  • 00:18:18 that here text and score so now instead
  • 00:18:23 of doing one index to get the the score
  • 00:18:28 I can do reviews five dot score and as
  • 00:18:30 you see it stays the same and if I did
  • 00:18:32 the text text now I can easily get the
  • 00:18:35 text just a little bit neater and helps
  • 00:18:38 you kind of keep track of things and
  • 00:18:39 we're going to do a little bit more
  • 00:18:41 within this class so one thing I want to
  • 00:18:43 do is initialize some sort of sentiment
  • 00:18:46 so a new self-taught sentiment and for
  • 00:18:46 so a new self dot sentiment and for
  • 00:18:51 positive one or two stars means negative
  • 00:18:54 and then I guess we can use three as
  • 00:18:56 like neutral so I'll create a function
  • 00:18:59 within the review class called get
  • 00:19:01 sentiment all functions within a class
  • 00:19:05 pass in itself and so what we'll do is
  • 00:19:09 and I'll set this to self get sentiment
  • 00:19:16 so once once we fill out this answer it
  • 00:19:19 will return whatever this function does
  • 00:19:21 and okay so if it's four or
  • 00:19:27 five stars it should be positive if it's
  • 00:19:29 one or two stars it should be negative
  • 00:19:30 so I'll start with the negative if self
  • 00:19:34 dot score is less than or equal to two
  • 00:19:37 and we want to set it or we want to
  • 00:19:40 return negative and I'll just use
  • 00:19:46 strings for now elif self dot score
  • 00:19:51 equals equals three this is going to be
  • 00:19:53 a neutral case I don't know if we'll
  • 00:19:54 actually use this at all
  • 00:19:55 but I just want all of our possible
  • 00:19:57 scores to be covered so that's neutral
  • 00:19:59 and finally else this is going to be
  • 00:20:04 score of four or five that is going to
  • 00:20:08 be returned positive and another small
  • 00:20:13 thing that I like to do whenever
  • 00:20:15 possible is I don't like having just
  • 00:20:17 strings floating around I like to be
  • 00:20:18 very consistent with those strings so I
  • 00:20:21 don't actually have accidentally like
  • 00:20:22 take the wrong thing so I'm gonna
  • 00:20:24 actually create an enum class which is
  • 00:20:26 just a regular class but you kind of
  • 00:20:29 call them enums and programming speak
  • 00:20:32 and I'm going to call this sentiment and
  • 00:20:34 that's gonna have a couple different
  • 00:20:36 properties it's going to have a negative
  • 00:20:39 equals negative
  • 00:20:42 and you'll see why I'm doing this in one
  • 00:20:44 sec neutral equals neutral and positive
  • 00:20:52 equals positive so the reason I did this
  • 00:20:57 is the reason I did this is because now
  • 00:21:05 let's I'm gonna run this again and I'm
  • 00:21:11 gonna run this again I can now do
  • 00:21:13 something like reviews dot sentiment so
  • 00:21:16 remember this is a four star review so
  • 00:21:19 it should be positive and it says
  • 00:21:21 positive right that's fine we said that
  • 00:21:23 here but now basically instead of always
  • 00:21:24 typing out the strings negative and
  • 00:21:26 neutral and maybe forgetting that we
  • 00:21:28 capitalized the whole thing or spelling
  • 00:21:30 it wrong we refer to these things as the
  • 00:21:32 sentiment class and we do sentiment
  • 00:21:36 negative or sentiment dot positive...
  • 00:21:43 this other one is neutral, sentiment dot neutral and
  • 00:21:46 sentiment dot positive
  • 00:21:54 it's just really to ensure that we're
  • 00:21:57 being consistent it's also kind of nice
  • 00:21:59 because a lot of our IDEs, code editors,
  • 00:22:03 can auto recommend this to us but now if
  • 00:22:09 we go run this and go back down here
  • 00:22:12 you'll still it's still says positive
  • 00:22:13 it's just a little bit neater having
  • 00:22:15 this type of thing you don't have to do
  • 00:22:17 that though but just something I like to
  • 00:22:19 do okay so we now have this review class
  • 00:22:21 that automatically fills out the
  • 00:22:22 positive or negative sentiment so we're
  • 00:22:24 getting there
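
(For reference, here is a minimal sketch of the loading code described up to this point, assuming the file was saved under data/sentiment/Books_small.json; adjust the path to wherever you put yours.)

```python
import json

# plain string constants so the labels are always spelled the same way
class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()

    def get_sentiment(self):
        if self.score <= 2:          # 1 or 2 stars
            return Sentiment.NEGATIVE
        elif self.score == 3:        # 3 stars
            return Sentiment.NEUTRAL
        else:                        # 4 or 5 stars
            return Sentiment.POSITIVE

file_name = './data/sentiment/Books_small.json'  # path assumed from the video

reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)  # each line of the file is its own JSON object
        reviews.append(Review(review['reviewText'], review['overall']))

print(reviews[5].text)
print(reviews[5].sentiment)
```
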
  • 00:22:24 all right next let's go ahead and do
  • 00:22:26 some further prep of our data and
  • 00:22:27 basically what we're gonna be doing next
  • 00:22:29 is let's just take this text again
  • 00:22:33 basically the issue is when we're
  • 00:22:36 dealing with text data it's really hard
  • 00:22:39 to build you know machine learning
  • 00:22:41 models around text data machine learning
  • 00:22:43 models love matrices and you know
  • 00:22:46 numerical data as input numerical
  • 00:22:49 vectors so we're gonna talk about some
  • 00:22:52 ways to convert text into into a
  • 00:22:56 quantitative vector and we're using
  • 00:22:59 bag-of-words to start real quick to
  • 00:23:01 understand how bag-of-words works if you
  • 00:23:03 don't know already imagine we have these
  • 00:23:05 two phrases this book is great and this
  • 00:23:09 book was so bad so the way we do
  • 00:23:12 bag-of-words is we first break this up
  • 00:23:14 into like a dictionary of tokens or
  • 00:23:16 words in this case we'll do unigram so
  • 00:23:19 just single words as our dictionary so
  • 00:23:23 we'll break that up so we have this book
  • 00:23:26 is great sometimes you might optionally
  • 00:23:31 include this exclamation
  • 00:23:33 mark but it kind of depends on how you're
  • 00:23:35 vectorizing this then we also have from
  • 00:23:38 the second one: was so bad
  • 00:23:42 so if we were training a bag of words
  • 00:23:44 model we'd use all of these words to
  • 00:23:47 create this sort of dictionary and then
  • 00:23:51 to actually convert this word into a
  • 00:23:55 numerical vector all we have to do is
  • 00:23:58 map ones and zeroes to the terms over
  • 00:24:02 here so this book is great would have
  • 00:24:05 ones map
  • 00:24:07 to the first four words but it doesn't
  • 00:24:09 have was doesn't have so doesn't have
  • 00:24:11 bad in it so those would be zeros ones
  • 00:24:15 were the word is zeros were not this
  • 00:24:18 book was so bad that would look
  • 00:24:20 something like this you'd have this book
  • 00:24:22 they both exist is is not in this
  • 00:24:26 sentence great is not in this sentence
  • 00:24:28 was so bad those are all in the sentence
  • 00:24:31 so this would have been like a fit
  • 00:24:33 transform process for bag of words the
  • 00:24:37 count vectorizer on all of this one last
  • 00:24:42 thing is if you wanted to now transform
  • 00:24:46 a word or a sentence you hadn't seen
  • 00:24:51 before so imagine we have was a very
  • 00:24:54 great book one small detail about this
  • 00:24:57 is a and very weren't in the original
  • 00:25:00 training set so we actually end up like
  • 00:25:03 dropping those terms out we we don't
  • 00:25:05 know what to do with them because we
  • 00:25:06 didn't see them when we were fitting our
  • 00:25:07 vectorizer but with bag of words for
  • 00:25:10 this if we saw this in like testing time
  • 00:25:12 that it'd be converted into 0 for this
  • 00:25:15 one for book great is a one was is a one
  • 00:25:21 and everything else is a zero that isn't
  • 00:25:23 existing and then we can't handle a and
  • 00:25:25 very because we didn't see them at
  • 00:25:26 training time so that's at a high level
  • 00:25:29 how bag of words works
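
(As a toy sketch of that fit/transform idea on the two example phrases, using the same CountVectorizer that comes up shortly; note sklearn lowercases and tokenizes for you, and on older versions the feature-name method is get_feature_names rather than get_feature_names_out.)

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This book is great !", "This book was so bad"]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(corpus)   # fit the vocabulary, then transform

print(vectorizer.get_feature_names_out())    # the learned "dictionary" of tokens
print(vectors.toarray())                     # one row of counts per phrase

# unseen words at transform time ("a", "very") are simply dropped
print(vectorizer.transform(["was a very great book"]).toarray())
```
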
  • 00:25:33 one last thing before we start actually using bag of
  • 00:25:35 words, in pretty much any machine
  • 00:25:38 learning task you want to do you have
  • 00:25:39 some set of data and so right now we
  • 00:25:41 have all of these reviews and we have a
  • 00:25:47 thousand reviews total right now if I'm
  • 00:25:50 not mistaken I can double check that
  • 00:25:54 yes oh we have a thousand reviews right
  • 00:25:56 now but whenever we're building machine
  • 00:25:58 learning models you know we want some
  • 00:25:59 subset of that to be training data and
  • 00:26:01 some subset to be test and basically
  • 00:26:05 the common pattern with sklearn is
  • 00:26:06 they have nice methods to do pretty much
  • 00:26:08 most things you would think people would
  • 00:26:10 want to do frequently with machine
  • 00:26:12 learning so in this case we want to
  • 00:26:14 split that a thousand into a training
  • 00:26:17 set and a test set so what I would do
  • 00:26:20 when I'm like doing this on my own is I
  • 00:26:22 just would literally look up SK learn
  • 00:26:24 train test split or something like
  • 00:26:28 that and you know there's a couple
  • 00:26:32 different options that pop up right here
  • 00:26:33 but usually your first result if you
  • 00:26:36 google something pretty straightforward
  • 00:26:38 will be what you're looking for so we
  • 00:26:41 have this train test split it has all
  • 00:26:45 sorts of information about it one thing
  • 00:26:47 that's really actually kind of cool this
  • 00:26:49 is someone commented to this the other
  • 00:26:50 day is okay so how do we import this
  • 00:26:53 let's see some usage okay you can do
  • 00:26:57 this okay so we're trying to split our
  • 00:27:02 reviews a thousand reviews into a
  • 00:27:04 training set and the test set so I was
  • 00:27:08 just showing you the documentation but
  • 00:27:10 what's actually pretty cool is if you
  • 00:27:12 are using a Jupiter notebook like I am
  • 00:27:14 if you do Shift + tab on that function
  • 00:27:20 it gives me the same exact
  • 00:27:27 documentation basically right here in my
  • 00:27:29 Jupyter notebook which is pretty neat
  • 00:27:31 so split arrays or matrices into
  • 00:27:34 random train and test subsets so that
  • 00:27:35 sounds exactly what I want I can go down
  • 00:27:38 even farther and look more about it one
  • 00:27:42 thing I'm noting is test size if float
  • 00:27:46 should be between 0.0 and 1.0
  • 00:27:48 and represents proportion of data set to
  • 00:27:51 include in the test split so that's the
  • 00:27:53 first thing that I am noticing is
  • 00:27:55 important is test size also that we can
  • 00:27:58 just pass in any sort of arrays and it
  • 00:28:01 will take care of those let's see you
  • 00:28:03 can also specify train size
  • 00:28:06 random state is another important one
  • 00:28:08 this basically allows you to seed your
  • 00:28:11 random split so if you wanted to repeat
  • 00:28:13 the same exact split in multiple
  • 00:28:15 instances if you just set this to some
  • 00:28:18 value any time you set it to that same
  • 00:28:21 value you'll get the same exact split
  • 00:28:24 stratify also could be important
  • 00:28:26 basically it would keep the proportion
  • 00:28:30 of class labels so in our case sentiment
  • 00:28:32 dot negative sentiment a positive the
  • 00:28:35 same in both splits or relatively equal
  • 00:28:37 so it wouldn't just like take all the
  • 00:28:39 positives in one set and all the
  • 00:28:41 negatives and the other by accident okay
  • 00:28:47 so let's pass them an N so we want to
  • 00:28:48 pass in our reviews we want to give it a
  • 00:28:54 test size so let's say our test size is
  • 00:28:57 going to be 0.33 33% of the reviews will
  • 00:29:02 be what we can use in test that means
  • 00:29:04 that 66% is training and also give it a
  • 00:29:09 random state so that we get the same
  • 00:29:12 thing every time and what is this return
  • 00:29:16 that's the last thing I want to look at
  • 00:29:17 so I'm going to just highlight this
  • 00:29:19 again shift-tab
  • 00:29:20 where's the return returns lists and
  • 00:29:33 okay so it returns an X&Y looks like
  • 00:29:39 okay no matter how it sorry I was just
  • 00:29:43 reading that and I got a little bit
  • 00:29:45 confused so how many ever no matter how
  • 00:29:47 many lists you passed it and how many
  • 00:29:48 arrays you passed in its gonna output
  • 00:29:51 two times that so that our case we're
  • 00:29:52 just passing a single reviews list so
  • 00:29:56 we'll get an x and a y back versi that's
  • 00:30:04 not actually that I guess this would be
  • 00:30:05 more appropriately training and test
  • 00:30:10 let's run that so if we look at the
  • 00:30:14 length of training we we took sixty-six
  • 00:30:16 percent of things so this should be 666
  • 00:30:20 about I don't know how it's gonna round
  • 00:30:23 but let's see 670 okay very close yeah I
  • 00:30:30 guess 33 percent exactly we didn't
  • 00:30:32 specify a third exactly and then test
  • 00:30:36 should be 330 so you see how we nicely
  • 00:30:39 split that up into a training and test
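
(In code, the split described here looks roughly like this; 42 is just an arbitrary seed for random_state, any fixed value gives a repeatable split.)

```python
from sklearn.model_selection import train_test_split

# 33% of the reviews become the test set; random_state makes the split repeatable
training, test = train_test_split(reviews, test_size=0.33, random_state=42)

print(len(training))  # roughly 670
print(len(test))      # roughly 330
```
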
  • 00:30:42 so now what we're gonna do is fit our
  • 00:30:44 bag of words model to the training set
  • 00:30:49 we'll build a classifier on that
  • 00:30:51 training set and then we
  • 00:30:54 will test everything on our test data
  • 00:30:55 all right so we want to pass our
  • 00:30:57 training set into our bag of words
  • 00:30:58 vectorizer
  • 00:30:59 let's just look at that's like print the
  • 00:31:04 first row of our training our first
  • 00:31:08 training review these are all the
  • 00:31:10 reviews okay and this is still the
  • 00:31:13 object so if I wanted to print the first
  • 00:31:15 text remember we can do this cool cool
  • 00:31:23 okay so this is a positive review let's
  • 00:31:27 just check so remember we can do
  • 00:31:28 sentiment just remind yourself the data
  • 00:31:31 okay cool
  • 00:31:32 so what we're going to want to pass into
  • 00:31:35 the bag of words vectorizer
  • 00:31:38 and or maybe more broadly thinking
  • 00:31:43 we want to take text and be able to
  • 00:31:47 predict whether or not it is positive or
  • 00:31:50 negative so our X the thing that we're
  • 00:31:53 passing into our model is going to be
  • 00:31:55 the text and our Y is the category or
  • 00:31:59 the sentiment so that's positive or
  • 00:32:00 negative so it's probably worthwhile
  • 00:32:02 splitting our training data into like
  • 00:32:04 training X or maybe able to call this
  • 00:32:06 train X and train Y so to get the just
  • 00:32:12 the text for X we can do a little list
  • 00:32:15 comprehension so we could do X dot text
  • 00:32:19 for X in training and for train Y we can
  • 00:32:25 pretty easily do X dot sentiment for X
  • 00:32:29 and training so now going back to the
  • 00:32:35 what we were just doing with the text if
  • 00:32:38 I did train X 0 now I don't have to do
  • 00:32:41 dot txt and you see that that text we
  • 00:32:44 already saw is there
  • 00:32:46 similarly train Y should tell us that
  • 00:32:49 positive sentiment as you see it's
  • 00:32:52 positive so now we've split it up to
  • 00:32:56 actually be our X the X is what we pass
  • 00:32:58 in and then the Y is what we're trying
  • 00:33:00 to predict so for training we use x and
  • 00:33:03 y together to like know how to build our
  • 00:33:07 model and then when we're testing we
  • 00:33:10 test on just test X which is going to be
  • 00:33:15 X dot text
  • 00:33:20 for x in test and test y would be the same
  • 00:33:25 thing as train Y X dot sentiment for X
  • 00:33:30 and test so yeah so when we're trying to
  • 00:33:35 test our model we pass it just test
  • 00:33:38 X and see if it matches up closely with
  • 00:33:40 test y but that'll be in a second okay
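
(The list comprehensions just described, gathered in one place.)

```python
# inputs (text) and targets (sentiment) for training...
train_x = [x.text for x in training]
train_y = [x.sentiment for x in training]

# ...and the same thing for the held-out test set
test_x = [x.text for x in test]
test_y = [x.sentiment for x in test]

print(train_x[0])
print(train_y[0])
```
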
  • 00:33:43 so sorry I got sidetracked but
  • 00:33:45 bag-of-words vectorization on this so
  • 00:33:48 once again this is the common theme you
  • 00:33:51 know we have this bag of words method
  • 00:33:53 that we want to do but how do we
  • 00:33:55 actually do it with SK learn well let's
  • 00:33:57 just do another quick Google search SK
  • 00:33:59 learn bag of words it's probably a safe
  • 00:34:01 bet what happens so we get a bunch of
  • 00:34:04 these responses and once again first to
  • 00:34:07 probably either oh these ones are going
  • 00:34:09 to be good options I'm gonna just click
  • 00:34:11 on working with text data here it looks
  • 00:34:14 like there's a little bit of mumbo
  • 00:34:16 jumbo but let's look up the bag of
  • 00:34:20 words in this bag of words Oh bags of
  • 00:34:26 words okay
  • 00:34:27 bags of words okay in order to perform
  • 00:34:31 machine learning on text documents we
  • 00:34:33 need to first turn text content into
  • 00:34:35 numerical feature vectors so that's what
  • 00:34:36 we've been talking about in this
  • 00:34:38 tutorial so far most intuitive way to do
  • 00:34:42 this is the bag of words representation so
  • 00:34:42 they have a little description in here
  • 00:34:44 and then as we see right here it looks
  • 00:34:47 like they're actually doing it with SK
  • 00:34:49 learn so this looks super easy we can
  • 00:34:51 kind of just copy these lines so I'll
  • 00:34:53 copy this line right here and then count
  • 00:34:57 copy this line right here and the count
  • 00:35:00 vectorizer okay cool and maybe what's
  • 00:35:01 actually useful is to actually look up
  • 00:35:03 because that's really what we're gonna
  • 00:35:05 want to do that the use because it just
  • 00:35:11 showed it a nice example and if we go
  • 00:35:14 down here what often is even most
  • 00:35:17 helpful and I'm like trying to get
  • 00:35:21 familiar with a certain sklearn
  • 00:35:21 library or tool or method is to find the
  • 00:35:24 example they always provide examples in
  • 00:35:26 the documentation
  • 00:35:26 and so if I look at this example it has
  • 00:35:29 these four documents in its corpus and
  • 00:35:33 as you can see it extracts out these
  • 00:35:37 features here and if you remember bag of
  • 00:35:38 words that we just went over this is the
  • 00:35:41 last feature so for the first document
  • 00:35:43 the word this appears so there should be
  • 00:35:46 a one in the last spot for the first
  • 00:35:48 document it is does the word third
  • 00:35:50 appear in the first document no it does
  • 00:35:53 not so zero just cool this is what we
  • 00:35:56 want and the one thing to note that is
  • 00:35:57 slightly different is it's not a binary
  • 00:35:59 thing with this count vectorizer by
  • 00:36:01 default like for the second document has
  • 00:36:05 two occurrences of document and that's
  • 00:36:08 represented with the two here and I
  • 00:36:12 think there should be a way to make this
  • 00:36:13 binary so you could just do one or zero
  • 00:36:17 I'd have to double check that... oh
  • 00:36:19 actually binary is right here so non zero
  • 00:36:22 counts would be just set to one as
  • 00:36:23 opposed to two so you can play around
  • 00:36:25 and see what works better for your model
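
(If you do want the strictly 1/0 behaviour mentioned here instead of raw counts, that's the binary flag on the vectorizer; something to experiment with rather than a recommendation.)

```python
from sklearn.feature_extraction.text import CountVectorizer

# binary=True clips every non-zero count down to 1 instead of keeping raw counts
binary_vectorizer = CountVectorizer(binary=True)
```
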
  • 00:36:27 but let's go ahead and actually do this
  • 00:36:29 now with our thing so I copy that in one
  • 00:36:33 thing to note is if you want to use that
  • 00:36:34 shift-tab method method to see the docs
  • 00:36:38 you have to run your cell first it won't
  • 00:36:40 know what it's looking for unless you do
  • 00:36:42 that I tripped myself up a little while
  • 00:36:44 ago by not doing that okay so what's in
  • 00:36:46 that example vectorizer equals count
  • 00:36:49 vectorizer we'll just copy that into our
  • 00:36:51 cell as well yeah I was copying it from
  • 00:36:56 here
  • 00:37:03 okay vectorizer equals count vectorizer and
  • 00:37:06 then it does fit transform corpus okay
  • 00:37:17 and vectorizer dot fit transform and now
  • 00:37:25 our corpus instead of being the little
  • 00:37:27 review corpus they showed us here it's
  • 00:37:30 actually just all of our training
  • 00:37:32 reviews so that'd be train x and so what
  • 00:37:36 do we get when we run that ah a really
  • 00:37:39 big matrix so it's a six hundred and
  • 00:37:42 seventy so that's every one of our
  • 00:37:44 training values so all of those have
  • 00:37:48 their own row and each one of those rows
  • 00:37:50 has seven thousand three hundred and
  • 00:37:52 seventy two columns so this is a really
  • 00:37:54 really massive matrix it's a lot bigger
  • 00:37:56 than this little example matrix but then
  • 00:37:58 again if we think about this it makes
  • 00:38:00 sense why it's bigger because we have
  • 00:38:04 now six hundred and seventy documents
  • 00:38:06 that are all longer than these pieces of
  • 00:38:09 text so like our matrix is pretty dang
  • 00:38:11 big but to our computer it's not that
  • 00:38:14 big of a deal so we're gonna be totally
  • 00:38:15 fine with this vectorizer
  • 00:38:18 okay now that we have this vectorizer we
  • 00:38:22 can go ahead and start getting ready to
  • 00:38:23 actually build a model around it so
  • 00:38:26 basically what this outputted and what
  • 00:38:30 we see right here is what we're actually
  • 00:38:33 gonna want to use as our training input
  • 00:38:36 so like we had before we had train X
  • 00:38:42 which equal let's just like print out
  • 00:38:45 the first train X and I'll just ignore
  • 00:38:50 this real quick so that was just like
  • 00:38:53 piece of text what this now outputs is
  • 00:38:56 like train X vectors equals that and now
  • 00:39:02 if we did train X vectors 0 we get a
  • 00:39:09 matrix that represented I'm gonna just
  • 00:39:13 print both of these things
  • 00:39:19 basically so we have this text and now
  • 00:39:22 this train X vectors here is a matrix
  • 00:39:25 that represents this text so it has ones
  • 00:39:28 as you can see here and all of the
  • 00:39:31 positions were actually it's counting so
  • 00:39:33 two for a specific word all the
  • 00:39:36 positions that are nonzero in our matrix
  • 00:39:39 and if you wanted to see that a more
  • 00:39:42 traditional way I think you can do to
  • 00:39:42 array and this is the entire matrix but
  • 00:39:46 because it was seven thousand three
  • 00:39:47 hundred seventy-two we don't get to see
  • 00:39:49 the whole thing but just know that
  • 00:39:50 inside of this there's a one wherever
  • 00:39:53 this piece of text triggered something
  • 00:39:56 in our vectorizer
  • 00:39:58 if that hopefully makes sense all right
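
(Pulling the vectorization step together as it's used from here on: fit the vocabulary on the training text only, then transform both sets.)

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# learn the vocabulary from the training text and transform it in one call
train_x_vectors = vectorizer.fit_transform(train_x)

# only transform the test text; we don't want to re-fit on data we evaluate with
test_x_vectors = vectorizer.transform(test_x)

print(train_x_vectors.shape)  # (number of training reviews, vocabulary size)
```
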
  • 00:40:02 and one thing that also I just find
  • 00:40:04 useful to know is that in this step
  • 00:40:08 right here we kind of did two steps at
  • 00:40:10 once we fit a vectorizer to this
  • 00:40:14 training data and then we also
  • 00:40:16 transformed our training data to give us
  • 00:40:18 these trained X vectors there's a you
  • 00:40:21 can do this the same way in two steps
  • 00:40:24 there's basically two separate functions
  • 00:40:27 you could use so you could do first
  • 00:40:29 vectorizer dot fit train X and that will
  • 00:40:34 do anything that will just fit the
  • 00:40:37 training data and then if you did train
  • 00:40:43 X vectors equals vectorizer dot
  • 00:40:47 transform train X then you will get the
  • 00:40:53 final same result so what I'm saying is
  • 00:40:57 that these two steps fit and transform
  • 00:40:59 in this function fit transform they're
  • 00:41:03 the exact same thing they just lumped it
  • 00:41:06 in because you so often want to fit a
  • 00:41:07 model then transform things they also
  • 00:41:09 have this kind of helpful to in one
  • 00:41:11 function but doing that in two steps
  • 00:41:15 would have been the same thing just
  • 00:41:17 useful to know because most every like
  • 00:41:20 sklearn model I think has like a fit transform
  • 00:41:22 fit and transform and usually you'll
  • 00:41:25 want to fit
  • 00:41:26 and transform it first but now actually
  • 00:41:28 when we like played around with getting
  • 00:41:29 our test X vectors so numerical
  • 00:41:33 representations alright of our test set
  • 00:41:36 text we just do we could just do
  • 00:41:39 vectorizer dot transform because we
  • 00:41:43 don't want to like fit a new model so
  • 00:41:45 you just transform it on test x okay so
  • 00:41:49 we have with this train X vectors now
  • 00:41:54 our final data basically is train X
  • 00:41:57 vectors and train Y and basically we'll
  • 00:42:02 want to fit a model around these so
  • 00:42:04 let's start looking at choosing the
  • 00:42:05 right model okay we have our X's and our
  • 00:42:10 y's now let's actually do some
  • 00:42:10 classification and SK learn offers a ton
  • 00:42:13 of different classifiers so I'm going to
  • 00:42:16 just kind of go through some of the
  • 00:42:18 options and this is a big part of like
  • 00:42:20 there's all these options for
  • 00:42:22 classifiers but unless you kind of study
  • 00:42:24 up on these classifiers and get a little
  • 00:42:26 bit of like a more theoretical
  • 00:42:28 understanding of them it's hard to
  • 00:42:30 sometimes make a decision so like here's
  • 00:42:34 an example classifier comparison and you
  • 00:42:37 can see all these different types of
  • 00:42:40 classifiers that are all built super
  • 00:42:42 easily through SK learn and what I would
  • 00:42:45 recommend is if you're trying to like
  • 00:42:46 figure out what classifier is right you
  • 00:42:49 know do some additional google searching
  • 00:42:52 watch some youtube videos on these I
  • 00:42:54 know like MIT OpenCourseWare the actual
  • 00:42:57 professor that I had when I was taking
  • 00:42:59 in AI class at MIT who unfortunately
  • 00:43:01 passed away but like if I looked up like
  • 00:43:04 linear SVM MIT you know you'd get all of
  • 00:43:09 these lectures that good information you
  • 00:43:11 get a more theoretical understanding so
  • 00:43:14 you can kind of go through and like
  • 00:43:16 figure out what you know get a feel for
  • 00:43:18 these different models and maybe have a
  • 00:43:20 feel for which one's gonna be better the
  • 00:43:23 other part of that though is you know
  • 00:43:26 part of it is just testing just train
  • 00:43:29 your model running it on the test data
  • 00:43:31 seeing what performs better so we'll
  • 00:43:32 take a couple of these fit a couple
  • 00:43:34 different models and see kind of how
  • 00:43:35 they
  • 00:43:36 perform so we're gonna take an SVM if I
  • 00:43:39 take both of these the linear kernel all
  • 00:43:43 that so that will be one probably do
  • 00:43:48 some sort of decision tree I'll throw in
  • 00:43:51 some sort of naive Bayes maybe we'll
  • 00:43:54 also do I think logistic regression
  • 00:43:56 regression which is not here but okay so
  • 00:44:00 we'll import SVM to start
  • 00:44:03 okay so classification
  • 00:44:12 start with a linear SVM and it's very
  • 00:44:21 straightforward to use this we can just
  • 00:44:25 do classifier and I'm just gonna say
  • 00:44:28 classifier SVM SVC support vector
  • 00:44:32 classifier I think it stands for
  • 00:44:34 oftentimes you also see support vector
  • 00:44:36 machine SVM so we're actually I think
  • 00:44:42 more generally you can do from SK learn
  • 00:44:44 import SVM and we can do sv m dot SVC
  • 00:44:55 and if I look run this and just look up
  • 00:44:58 the documentation here there's couple
  • 00:45:01 things we pass in kernel C value and
  • 00:45:10 yeah reading the documentation the
  • 00:45:13 penalty parameter C of the error term if
  • 00:45:15 you you know read up on the theory
  • 00:45:17 you'll have a little bit more idea of
  • 00:45:18 these different parameters it's good to
  • 00:45:20 know what you can play around with so
  • 00:45:23 we're gonna make this linear to start so
  • 00:45:26 we're going to say kernel equals linear
  • 00:45:30 but we could look at all the different
  • 00:45:32 options like kernels one option C is
  • 00:45:36 another parameter you can play with
  • 00:45:39 gammas another parameter all sorts of
  • 00:45:42 stuff so we have kernel linear and that
  • 00:45:47 would give us a classifier and all we
  • 00:45:51 have to do to fit this classifier to our
  • 00:45:53 X's and Y's that we defined right here
  • 00:45:55 there's we can do Co left SVM and this
  • 00:46:01 should be a familiar command to you fit
  • 00:46:03 train X vectors
  • 00:46:07 and train why so we pass in an X & Y to
  • 00:46:11 fit this classifier to our data there we
  • 00:46:15 go and then what's cool is we could go
  • 00:46:18 ahead and I predict something so based
  • 00:46:20 on all of the training reviews we had we
  • 00:46:23 can go ahead and do CL f sv m dot
  • 00:46:27 predict and real quick we need to get
  • 00:46:31 something to predict so maybe we just
  • 00:46:33 look up I'm going to copy the I'm going
  • 00:46:38 to print out real quick our test X 0
  • 00:46:42 let's see so this is a positive reveal
  • 00:46:47 looks like we could also get the vector
  • 00:46:51 for that oops just as vectors 0 I think
  • 00:47:02 I defined that maybe not train – vectors
  • 00:47:07 so to get test X factors we can do just
  • 00:47:12 sort of probably to find out the error
  • 00:47:14 test X vectors equals vectorizer and we
  • 00:47:22 don't want to fit again we just want to
  • 00:47:24 transform because this is our test data
  • 00:47:26 just X
  • 00:47:34 okay now we have these test x vectors
  • 00:47:37 defined so we have something to pass in
  • 00:47:39 and now what we can do with this trained
  • 00:47:41 SVM classifier we can go ahead and
  • 00:47:45 predict whether the words this review
  • 00:47:51 every new McCole better is better than
  • 00:47:54 last this is no exception we can predict
  • 00:47:56 if that's positive or negative we can do
  • 00:47:59 test X but our CLF sv m dot predict so
  • 00:48:06 pretty much all of these classifiers I
  • 00:48:08 think all of these classifiers have a
  • 00:48:09 predict method and then you'd pass in
  • 00:48:12 that same type of vector that you used
  • 00:48:15 to train it but we haven't actually seen
  • 00:48:18 this test vector so we don't know what
  • 00:48:20 the proper output should be but I'm
  • 00:48:22 saying it should be positive if our
  • 00:48:24 classifier was trained properly and as
  • 00:48:27 you see we get positive okay
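
(A consolidated sketch of the SVM piece: linear kernel, fit on the vectorized training data, then predict on the first unseen test vector.)

```python
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)

# predict the sentiment of the first test review (should come back POSITIVE here)
print(clf_svm.predict(test_x_vectors[0]))
```
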
  • 00:48:29 now let's go ahead and create a couple other
  • 00:48:30 classifiers so I mentioned a couple that
  • 00:48:34 I'll try I mean there's so many options
  • 00:48:36 to choose from so I'll do decision tree
  • 00:48:42 will have a naive Bayes option we'll do
  • 00:48:50 a how about a logistic regression
  • 00:48:57 and we're just do these rapid-fire it's
  • 00:48:59 pretty cool how easy it is to do this
  • 00:49:04 okay so let's do decision tree first if
  • 00:49:07 we go back to that comparison thing we
  • 00:49:11 see decision tree classifier here I can
  • 00:49:14 click on that and get the full
  • 00:49:15 documentation for that okay just look at
  • 00:49:18 the example so it has a fit and a
  • 00:49:23 predict call okay so from sk
  • 00:49:30 learn import that and then basically we're
  • 00:49:40 copying the same code as before ah no I
  • 00:49:43 didn't want to do it there I wanted to
  • 00:49:44 do it here, insert cell below, copy, so
  • 00:49:50 doing decision tree real quick we're
  • 00:49:54 gonna want to fit that so classifier
  • 00:49:56 will just call it classifier decision
  • 00:49:58 equals decision tree classifier and then
  • 00:50:08 once again like you can look up the the
  • 00:50:11 docs of this and see what options you
  • 00:50:15 have for things to do and you kind of
  • 00:50:18 play around with that to do whatever is
  • 00:50:20 best we can do clf decision dot
  • 00:50:23 fit we're going to do this train X
  • 00:50:26 vectors again same as I can fitting it
  • 00:50:27 to the same exact data and the Train Y
  • 00:50:31 and let's see what happens so we have a
  • 00:50:35 another classifier we've just trained to
  • 00:50:38 decision tree classifier and we could do
  • 00:50:40 the same type of thing let's see if it
  • 00:50:42 predicts it properly
  • 00:50:48 so as you see it's like we just built
  • 00:50:51 two different classifiers super quickly
  • 00:50:55 and this is positive too
  • 00:50:58 we could go naive Bayes and I'm going to
  • 00:51:00 keep saying this like check out read
  • 00:51:03 into these different classifiers
  • 00:51:06 because without that it's like
  • 00:51:10 you're just kind of randomly, there's no
  • 00:51:12 like thought process in choosing one
  • 00:51:14 over the other so you kind of, if
  • 00:51:18 you have some understanding behind the
  • 00:51:19 scenes of how these work you'll have
  • 00:51:21 better ability to do model selection so
  • 00:51:25 naive Bayes and this time we'll just
  • 00:51:27 like look up something out already from
  • 00:51:29 some different classifiers a scalar
  • 00:51:43 there's all sorts of documentation all
  • 00:51:47 sorts of different types of classifiers
  • 00:51:48 here so I want a naive Bayes so we'll
  • 00:51:53 try to just this Gaussian naive Bayes
  • 00:52:03 okay
  • 00:52:05 now kind of basically just repeat the
  • 00:52:08 same thing paste in this code this will
  • 00:52:12 be clf gnb I guess, jeez, Gaussian
  • 00:52:19 naive Bayes that's like a tongue twister
  • 00:52:21 Gaussian naive Bayes so what happens
  • 00:52:24 here same type of deal
  • 00:52:27 cool logistic regression would be the
  • 00:52:31 last one
  • 00:52:32 I'm just rapid-fire doing these logistic
  • 00:53:37 regression sklearn
  • 00:52:46 how do I get you okay cool
  • 00:52:59 I'll call this clf log because
  • 00:53:03 logistic and then we'll just copy this
  • 00:53:26 what happened okay so there's some
  • 00:53:33 future warnings but everything's fine
  • 00:53:34 everything's predicting positive that
  • 00:53:36 all seems good okay
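
(The other three classifiers, rapid fire, following the same fit and predict pattern; note that GaussianNB does not accept sparse input, so the vectors are densified for it, a detail worth checking against your own run.)

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)

# GaussianNB wants a dense array, so convert the sparse vectors first
clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors.toarray(), train_y)

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y)

print(clf_dec.predict(test_x_vectors[0]))
print(clf_gnb.predict(test_x_vectors[0].toarray()))
print(clf_log.predict(test_x_vectors[0]))
```
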
  • 00:53:38 now that we have some trained models let's actually
  • 00:53:39 evaluate these a little bit more
  • 00:53:40 comprehensively so let's look at the
  • 00:53:42 entire test set and see how accurately
  • 00:53:44 we predict every one of those test
  • 00:53:46 examples
  • 00:53:47 it's like above we like you know seem to
  • 00:53:50 predict positive correctly for every one
  • 00:53:51 of these models that looks good at least
  • 00:53:53 initially so how can we do this a little
  • 00:53:56 bit more comprehensively we can go ahead
  • 00:53:58 and do... let's just take a quick look just to
  • 00:54:04 show you where you would find it, naive
  • 00:54:08 Bayes would work, so what are the
  • 00:54:10 functions you can do with that
  • 00:54:11 classifier that we trained or maybe I'll
  • 00:54:14 just look up like SVM sklearn there's
  • 00:54:21 a specific function I'm looking for I
  • 00:54:22 just want to show you where you'd find
  • 00:54:23 in the documentation okay score this is
  • 00:54:28 what it is so we pass in an X and the y
  • 00:54:31 and it will predict all the X's and see
  • 00:54:35 how it compares to Y so score is what
  • 00:54:36 we're looking for and if I clicked on
  • 00:54:38 score you would see probably how to use
  • 00:54:41 it in an example maybe not but you could
  • 00:54:43 look up score example but here's how we
  • 00:54:45 use it let's take our SVM classifier
  • 00:54:49 pass in the test X vectors that's what
  • 00:54:53 we want to convert and see how this
  • 00:54:56 predictions for this compares to our
  • 00:54:58 actual test y
  • 00:55:06 sorry, svm dot score, now it should work 82%
  • 00:55:11 not too bad let's see how the other ones
  • 00:55:13 do what was the second one I did I
  • 00:55:19 think decision, dot score and passing
  • 00:55:22 the same exact vectors so we're just
  • 00:55:26 quickly seeing the accuracy this score
  • 00:55:29 function if we look at it I could look
  • 00:55:33 at it here or right here it returns the
  • 00:55:38 mean accuracy on the given test data and
  • 00:55:40 labels so it sounds good
  • 00:55:42 so we could print all of
  • 00:55:44 these so that's the first one the SVM
  • 00:55:50 printed out the second one let's just
  • 00:55:54 copy these two lines to get our last two
  • 00:55:58 okay we had the Gaussian naive
  • 00:56:01 Bayes and we had a logistic regression
  • 00:56:04 okay let's see how all these do 82 78 78
  • 00:56:09 84 okay all very good which is pretty
  • 00:56:14 nice we didn't do much at all
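A sketch of that mean-accuracy check, assuming the four classifiers trained above (clf_svm and clf_dec are the SVM and decision tree names used earlier; adjust to whatever you called yours):

```python
# mean accuracy on the full test set for each trained classifier
print(clf_svm.score(test_x_vectors, test_y))             # linear SVM
print(clf_dec.score(test_x_vectors, test_y))             # decision tree
print(clf_gnb.score(test_x_vectors.toarray(), test_y))   # GaussianNB wants dense input
print(clf_log.score(test_x_vectors, test_y))             # logistic regression
```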
  • 00:56:16 but is there a catch and I'm gonna tell
  • 00:56:18 you right now
  • 00:56:19 there is a catch so we just looked up
  • 00:56:21 the accuracy the other metric like
  • 00:56:24 accuracy is one thing like how many
  • 00:56:26 labels did you actually predict
  • 00:56:27 correctly the more important metric that
  • 00:56:30 we usually care about as a data
  • 00:56:31 scientist I would say when we're doing
  • 00:56:33 classification tasks is the f1 score so
  • 00:56:37 I'm just gonna shorten this line up so
  • 00:56:40 this right here was the mean accuracy on
  • 00:56:48 all of our test labels
  • 00:56:49 let's look up how our f1 scores are for
  • 00:56:54 these same exact classifiers so to use
  • 00:56:58 f1 score in SK learn you can import it
  • 00:57:01 like this and then if I run that we can
  • 00:57:05 do a little look at the signature by
  • 00:57:08 doing shift tab on your jupiter notebook
  • 00:57:10 and so what does it say compute the f1
  • 00:57:14 score also known as the balanced F-score
  • 00:57:17 and gives you a little information if
  • 00:57:19 you don't know too much about the f1 score
  • 00:57:21 on how it's actually calculated it's
  • 00:57:23 worth just looking into that because it
  • 00:57:25 is a good measure so we need to pass in
  • 00:57:29 y true and Y predicted and optionally we
  • 00:57:33 can pass in some labels so labels might
  • 00:57:35 be helpful for us okay
  • 00:57:37 so we first need to pass in the y true
  • 00:57:41 so for all of our models this will be
  • 00:57:43 the same it'll just be test y and then
  • 00:57:46 we need to pass in the predicted y so
  • 00:57:50 the predicted y would be clf_svm dot
  • 00:57:53 predict of all of the things in our
  • 00:57:57 test x vectors so we do it like that
  • 00:58:00 so let's see what our f1 score is for
  • 00:58:03 the linear SVM ah what happened target
  • 00:58:09 is multiclass but average equals binary
  • 00:58:12 okay we'll try no averaging I just
  • 00:58:16 want to see for each individual class how
  • 00:58:18 good its f1 score is so average equals none
  • 00:58:21 okay cool so what do
  • 00:58:25 these numbers mean well it's a little
  • 00:58:27 bit clearer if I pass in labels so
  • 00:58:29 labels are going to be equal to
  • 00:58:31 sentiment dot positive sentiment
  • 00:58:40 and with these three labels I noticed that we
  • 00:58:43 actually forgot to filter out our
  • 00:58:45 neutral labels so now we were building a
  • 00:58:48 classifier to do positive neutral and
  • 00:58:53 negative instead of just positive or
  • 00:58:55 negative we might switch that I think we
  • 00:58:56 will probably switch that so we'll go
  • 00:58:58 through how to do that so Sentiment.NEUTRAL and
  • 00:59:02 Sentiment.NEGATIVE which
  • 00:59:08 might just make things clearer oh and one
  • 00:59:12 thing to note is if I like switch the
  • 00:59:14 order here we now know which one is
  • 00:59:16 associated with what if I switch the
  • 00:59:18 order of these labels because it might
  • 00:59:24 not be clear otherwise if I put like
  • 00:59:27 positive at the back you'd see the 0.9
  • 00:59:31 switch to a different location as you
  • 00:59:34 see so I'll go back though because it was
  • 00:59:38 more clear okay so we have an f1 score of
  • 00:59:42 90% for positive so it's very good for
  • 00:59:44 positive but it is trash for this this
  • 00:59:49 model is pretty trash for the neutral
  • 00:59:53 and negative labels so we want to
  • 00:59:56 improve that because yes we can predict
  • 00:59:59 positive but we want to equally be able
  • 01:00:01 to predict negative too
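Here is roughly what that per-class F1 call looks like, assuming the Sentiment constants and the clf_svm classifier from earlier; average=None returns one score per label, in the order you list them:

```python
from sklearn.metrics import f1_score

# one F1 value per class, ordered the same way as the labels list
print(f1_score(test_y,
               clf_svm.predict(test_x_vectors),
               average=None,
               labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
# per the run above, the first (POSITIVE) value is around 0.9 and the other two are very low
```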
  • 01:00:03 basically right now what our model is
  • 01:00:05 doing is always predicting positive and
  • 01:00:07 often times I guess it's right based on
  • 01:00:09 our test data split okay let's just see
  • 01:00:12 does this hold true for the other models
  • 01:00:15 too
  • 01:00:16 so let's try the decision tree oh
  • 01:00:20 wow okay that did very good on positive
  • 01:00:26 again awful on neutral and negative so
  • 01:00:29 this might be the common trend Gaussian
  • 01:00:32 naive Bayes that is the same thing good on
  • 01:00:37 positive bad on negative and neutral and
  • 01:00:40 finally logistic regression yeah it
  • 01:00:44 almost looks like it's good on this
  • 01:00:46 but that's a 0.093 not
  • 01:00:49 0.93 okay so very good on positive they all
  • 01:00:52 are very good on positive
  • 01:00:53 so now we've we've discovered something
  • 01:00:56 how can we make our model better on not
  • 01:00:59 just positive examples so that's what
  • 01:01:01 we're gonna work on next okay when all
  • 01:01:03 these models perform like equally as bad
  • 01:01:05 on neutral and negative you know I'm not
  • 01:01:09 really thinking that it's a model issue
  • 01:01:11 right now I'm thinking it's more of a
  • 01:01:13 data issue so we're gonna do a little
  • 01:01:16 bit investigating into our data and just
  • 01:01:18 see if we can find anything so let's
  • 01:01:21 look at our training data so train_x I
  • 01:01:23 guess yeah train_x so remember what
  • 01:01:26 train_x was that was all the review text that
  • 01:01:29 we used and we ended up actually
  • 01:01:31 converting to vectors but what I'm
  • 01:01:34 more curious about is train_y so let's
  • 01:01:39 just look at the first like five
  • 01:01:41 elements okay and this already is
  • 01:01:44 telling in our training data we
  • 01:01:47 have five positive things in a row we
  • 01:01:50 could honestly probably let's look at
  • 01:01:55 the entire list okay five right in the
  • 01:02:03 first five okay that's kind of telling
  • 01:02:05 we can do train Y dot count of sentiment
  • 01:02:10 dot positive and let's see how many
  • 01:02:12 actual positive labels we have 552 okay
  • 01:02:18 so remember we had 670 train labels
  • 01:02:21 total and five hundred fifty-two of them
  • 01:02:23 are positive so right away our models
  • 01:02:26 are gonna be like heavily biased towards
  • 01:02:29 these positive labels because there's
  • 01:02:30 way more of them so we do the same thing
  • 01:02:34 for negative forty seven negative labels
  • 01:02:36 so and I guess the rest would be neutral
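A quick sketch of that label count, assuming train_y is the plain Python list of sentiment labels from the split (the numbers in the comments are the ones from this run):

```python
# how imbalanced is the training data?
print(train_y.count(Sentiment.POSITIVE))   # 552 in this run
print(train_y.count(Sentiment.NEGATIVE))   # 47 in this run
print(train_y.count(Sentiment.NEUTRAL))    # the remainder of the 670 training labels
```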
  • 01:02:39 so here's our issue we need to let's
  • 01:02:44 balance our negative and positive data
  • 01:02:48 so let's do that real quick because it
  • 01:02:50 wouldn't make sense to balance the 47
  • 01:02:54 like if we balanced it exactly with 47
  • 01:02:57 then we'd have 47 positive that seems
  • 01:02:59 like a little bit too little amount of
  • 01:03:01 training data so what I want you to do
  • 01:03:03 real quick is well
  • 01:03:04 you download a bigger data set if you go
  • 01:03:08 to my github go to data sentiment
  • 01:03:12 download this 10,000 file that's
  • 01:03:15 actually ten thousand reviews instead of
  • 01:03:16 the thousand that were originally in
  • 01:03:18 this one and if you're curious on how I made
  • 01:03:21 this ten thousand file the way was I
  • 01:03:25 used this data process file I wrote and
  • 01:03:28 basically looked at all the reviews in
  • 01:03:30 this massive file that I downloaded from
  • 01:03:32 the site I brought up here so right
  • 01:03:35 here I downloaded that unzipped it then
  • 01:03:38 ran this file on it changed this final
  • 01:03:41 count to ten thousand took ten thousand
  • 01:03:44 samples randomly and then just basically
  • 01:03:47 changed the name here so it's ten
  • 01:03:49 thousand examples so we should be able
  • 01:03:50 to get a little bit more negative
  • 01:03:52 examples out of that so alright so you
  • 01:03:56 download the download the data sentiment
  • 01:04:03 book small click on it view raw and I
  • 01:04:07 guess you could even just click and then
  • 01:04:09 you would save as and save it just like
  • 01:04:12 you saved the first file okay and now
  • 01:04:15 what we need to do is go ahead and
  • 01:04:18 instead of loading in up here books
  • 01:04:21 small you need to load in whatever you
  • 01:04:26 named it I called mine
  • 01:04:29 books_small_10000.json so I'm
  • 01:04:29 gonna run that I'm gonna do all the same
  • 01:04:33 stuff with this I'll do the same sort of
  • 01:04:36 vectorization now it's going to be a
  • 01:04:38 bigger vectorization and really when we
  • 01:04:44 go down here to the count of negatives
  • 01:04:47 we should see a higher number 436 cool
  • 01:04:50 that's a lot better in my opinion you
  • 01:04:53 know about 10x for our training data so
  • 01:04:57 we have more negatives but that also
  • 01:05:00 means that we have more positives 5611
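If it helps, here is a hedged sketch of swapping in the larger file. It assumes the Review class from earlier in the tutorial and the reviewText/overall fields that this Amazon-review-style dataset uses (double-check against your copy); the path and filename are whatever you saved the bigger file as.

```python
import json

# adjust to wherever you saved the larger download
file_name = './data/sentiment/books_small_10000.json'

reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        # Review is the class defined earlier in the tutorial (text plus star score)
        reviews.append(Review(review['reviewText'], review['overall']))

print(len(reviews))  # should now be 10000
```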
  • 01:05:05 so what we're going to do is actually
  • 01:05:07 create a little container class for
  • 01:05:10 reviews so we can call an evenly
  • 01:05:12 distribute method that will even out our
  • 01:05:15 training data and our test data
  • 01:05:17 so we'll do that in the same spot that
  • 01:05:20 we created this original data class so
  • 01:05:24 instead of we have this review class and
  • 01:05:26 now what we're gonna do is add one more
  • 01:05:28 class and that's going to be review
  • 01:05:33 container and to initialize that we will
  • 01:05:40 want self as always and then reviews and
  • 01:05:44 the reviews we already know what our
  • 01:05:46 self dot reviews I equals reviews and we
  • 01:05:51 can create a method called and I'm
  • 01:05:55 making this a class just because it
  • 01:05:57 makes things neater we'll create a
  • 01:06:00 method called evenly distribute it takes
  • 01:06:08 self and what that will do is count
  • 01:06:11 the number
  • 01:06:15 of negative sentiment reviews and
  • 01:06:22 okay so we'll write this and just bear
  • 01:06:24 with me so we want to get all of our
  • 01:06:28 negative examples from the reviews so
  • 01:06:31 what we'll do to do that is we're gonna
  • 01:06:34 filter our reviews list based on what is
  • 01:06:38 equal to negative so this is a little
  • 01:06:41 bit of a filter lambda bear with me
  • 01:06:47 x dot sentiment we want all of our
  • 01:06:50 reviews the sentiment of our reviews to
  • 01:06:53 be equal to Sentiment.NEGATIVE and
  • 01:06:57 that will be done on the entire reviews
  • 01:07:02 list so self dot reviews I think that's good
  • 01:07:09 and I'll just print negative real quick
  • 01:07:13 so we can see what this is doing it
  • 01:07:16 basically what it's doing and I'm not
  • 01:07:18 gonna actually print negative it's
  • 01:07:20 looking at all these reviews mapping
  • 01:07:23 every sentiment and it's basically just
  • 01:07:26 filtering based on every sentiment
  • 01:07:28 is negative so whenever the sentiment was
  • 01:07:31 negative we're keeping track of that in
  • 01:07:32 the negative list here and we can do the
  • 01:07:35 same thing for positive equals filter
  • 01:07:42 I'll just copy this line and we want
  • 01:07:50 this to be positive I'm gonna ignore at
  • 01:07:53 this point we're gonna ignore the
  • 01:07:54 neutral examples you could factor those
  • 01:07:55 in if you wanted to but I'm just going
  • 01:07:57 to make sure that my negative values is
  • 01:08:00 equal to my positive values I think this
  • 01:08:02 should build our model make it a little
  • 01:08:03 bit better okay so now if I printed out
  • 01:08:10 the lengths of negative and positive and
  • 01:08:12 I could do that if I rerun this review
  • 01:08:16 container I could pass in all my reviews
  • 01:08:21 somewhere down here I pass in like a
  • 01:08:27 review container of training and we'll
  • 01:08:34 just run container dot evenly
  • 01:08:37 distribute
  • 01:08:43 okay so we have positive and negative in
  • 01:08:47 here and what I'm going to just do to
  • 01:08:49 just see what's happening is we'll do
  • 01:08:54 print negative zero dot text print
  • 01:09:04 negative or I guess print the length of
  • 01:09:09 negative and we'll print the length of
  • 01:09:13 positive okay so what happens when we
  • 01:09:19 run this down here filter object is not
  • 01:09:26 subscriptable that is because we now
  • 01:09:29 need to convert this to a list when we
  • 01:09:32 filter something it doesn't
  • 01:09:33 automatically convert it back to a
  • 01:09:35 list so you need to surround it with a
  • 01:09:36 list call and now what happens here
  • 01:09:42 cool so it gives us our texts from the
  • 01:09:46 negative shows us that we have 436
  • 01:09:48 pieces of text and there's 5611 positive
  • 01:09:53 so what we're going to do is just
  • 01:09:55 basically shrink the amount of positive
  • 01:09:56 examples to be equal to the length of
  • 01:10:02 the negative so positive shrunk is equal to
  • 01:10:04 positive sliced up to the length of negative
  • 01:10:13 and I guess yeah okay
  • 01:10:16 so that slice uses the length of
  • 01:10:19 negative and now what we're gonna do is
  • 01:10:20 our final reviews is going to be equal
  • 01:10:27 to negative plus positive shrunk so now
  • 01:10:32 what we're basically doing is shrinking
  • 01:10:34 the amount of reviews we're actually
  • 01:10:35 storing and just only containing the
  • 01:10:37 negative ones and the positive ones that
  • 01:10:39 are equal to the amount of the negative
  • 01:10:41 and just for a good measure
  • 01:10:43 we're gonna also import random I think
  • 01:10:52 this might be overkill it
  • 01:10:53 might be doing this twice but we're
  • 01:10:54 gonna shuffle those reviews just to make
  • 01:10:57 sure that our data is kind of evenly
  • 01:10:59 random in when the negatives and the
  • 01:11:01 positives come actually yeah we do need
  • 01:11:03 to do this because otherwise
  • 01:11:06 we'd have all the negatives and then all
  • 01:11:08 the positives so we're just going to
  • 01:11:10 shuffle up the order so that you don't
  • 01:11:12 know if a negative or positive is coming
  • 01:11:14 next for some of the algorithmic models
  • 01:11:19 this will be important okay cool
  • 01:11:23 so we're gonna rerun that hopefully
  • 01:11:24 everything works and remember we
  • 01:11:26 imported random here too okay now if we
  • 01:11:29 get on to our evenly distribute and if
  • 01:11:32 we run that are actually that's
  • 01:11:35 container dot reviews and we'll just get
  • 01:11:40 the lengths that we'll just see if it
  • 01:11:41 the total number of reviews now that
  • 01:11:44 we're looking at are smaller and then
  • 01:11:47 I'll do a couple additional things okay
  • 01:11:49 872 so that sounds like exactly
  • 01:11:51 you know how many negatives
  • 01:11:53 there were times two okay so now instead
  • 01:11:58 of doing this we you know kind of bake a
  • 01:11:59 lot of this stuff into our container
  • 01:12:01 type now so I'm gonna just call this
  • 01:12:04 like
  • 01:12:08 trained container and we'll make a test
  • 01:12:14 container too and you'll see all of what
  • 01:12:18 I'm doing in a sec and I want to note too
  • 01:12:22 real quick part of the reason I'm doing
  • 01:12:24 this and breaking things out into
  • 01:12:25 classes it's not necessary but like real
  • 01:12:28 world this is a like a real world
  • 01:12:29 tutorial and it's always important to
  • 01:12:31 try to like keep your stuff neat and
  • 01:12:33 even though this is easy enough to do if
  • 01:12:36 we have to do this a lot of times it
  • 01:12:38 might get annoying so we'll bake in some
  • 01:12:41 of this break down into our review
  • 01:12:43 container class so what we can do is add
  • 01:12:48 a couple of new methods so we could do
  • 01:12:50 def get text self and all that would do
  • 01:12:56 is return this basically
  • 01:13:06 x dot text for x in self
  • 01:13:10 dot reviews and then we could get y or
  • 01:13:20 get our labels so get sentiment self
  • 01:13:22 and that would return
  • 01:13:26 x dot sentiment for x in self
  • 01:13:31 dot reviews and now what I can do is
  • 01:13:38 instead of doing this train x now
  • 01:13:42 just becomes train container dot get
  • 01:13:47 text and train y becomes train
  • 01:13:52 container dot get sentiment same thing
  • 01:13:58 for this this would be test container
  • 01:14:01 dot get text and test container dot get
  • 01:14:10 sentiment
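Putting those pieces together, here is a sketch of what the ReviewContainer class described above could look like, assuming the Review objects and Sentiment constants from earlier in the tutorial and the training/test split variables used in this walkthrough:

```python
import random

class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews

    def get_text(self):
        # the raw review strings, which is what the vectorizer consumes
        return [x.text for x in self.reviews]

    def get_sentiment(self):
        # the labels (POSITIVE / NEUTRAL / NEGATIVE)
        return [x.sentiment for x in self.reviews]

    def evenly_distribute(self):
        # filter() returns an iterator, so wrap it in list() before slicing or len()
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        # drop neutral and shrink the positives to match the negative count
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        # shuffle so the model doesn't see all negatives followed by all positives
        random.shuffle(self.reviews)

# usage with the train/test splits from earlier in the tutorial
train_container = ReviewContainer(training)
test_container = ReviewContainer(test)
train_container.evenly_distribute()

train_x = train_container.get_text()
train_y = train_container.get_sentiment()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()
```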
  • 01:14:14 okay and now what we can do is we could
  • 01:14:19 do something like actually check in
  • 01:14:24 our train y the count of self dot
  • 01:14:29 sentiment or sorry train y the count
  • 01:14:32 of Sentiment.POSITIVE and the count
  • 01:14:39 of Sentiment.NEGATIVE review
  • 01:14:49 container has no attribute I don't think
  • 01:14:51 I ran this cell again run and run and
  • 01:14:57 run 436 is negative let's see about
  • 01:15:01 positive 5611 okay that didn't quite do what
  • 01:15:06 we wanted to do and that's because we
  • 01:15:09 also just need to real quick run that
  • 01:15:14 function that we wrote called
  • 01:15:16 evenly distribute
  • 01:15:20 now we should be good okay now positives
  • 01:15:24 is 436 and negative should be the same
  • 01:15:32 that was a fairly long aside but yeah we
  • 01:15:35 got to what we're doing and now does
  • 01:15:37 that help us out and I don't think it's
  • 01:15:40 important to evenly distribute test X
  • 01:15:43 but you could if you wanted to it really
  • 01:15:46 depends on what you think your incoming
  • 01:15:48 data will be in practice all right
  • 01:15:51 so we have this now equal so we're going
  • 01:15:56 to vectorize it the same way using our
  • 01:16:01 slimmed-down train x and train y
  • 01:16:04 oh actually train x vectors okay that's
  • 01:16:07 good test X vectors I could we could
  • 01:16:09 also bake this into our review container
  • 01:16:13 class if we wanted to but it's fine like
  • 01:16:16 this for now okay now we're fitting to
  • 01:16:19 the new data and we can fit all these to
  • 01:16:23 the new data and let's see what happens
  • 01:16:26 to our scores come on baby
  • 01:16:30 okay our scores decreased is it a good
  • 01:16:35 thing or bad thing I mean normally you'd
  • 01:16:37 say that's a bad thing but let's see our
  • 01:16:38 f1 scores how do they look and our f1
  • 01:16:45 scores I mean overall they seem to get a
  • 01:16:53 bit better I'm not gonna look at
  • 01:16:55 logistic regression let's look at the
  • 01:16:57 SVM for now okay I mean it got
  • 01:17:01 definitely better it's still not great
  • 01:17:04 but it got better so what can we do next
  • 01:17:08 let's like keep getting better and
  • 01:17:09 better so the first thing I'm going to
  • 01:17:11 say we should do and you know again you
  • 01:17:15 as the data scientist get to kind of
  • 01:17:17 control the data I'm gonna say that I
  • 01:17:20 want a model that kind of does a good
  • 01:17:21 job predicting when about half and half
  • 01:17:24 are positive and negative so right now
  • 01:17:26 in our test set we have an overwhelming
  • 01:17:29 amount
  • 01:17:30 of positive I can show you that the test y
  • 01:17:35 count is 2767 for positive and now we
  • 01:17:39 have 208 for negative so that's why this f1
  • 01:17:43 score for negative is low and also we dropped
  • 01:17:46 neutral out of this completely so I
  • 01:17:47 could actually just kind of delete that
  • 01:17:50 for my labels and also for the time
  • 01:17:53 being let's just focus on this SVM
  • 01:17:55 because it's hard to go through every
  • 01:17:57 one of these I'd say in general for this
  • 01:18:00 model probably logistic regression or
  • 01:18:02 the SVM will be your best performing
  • 01:18:05 okay so the reason that this is low is
  • 01:18:09 because we trained it equally but in our
  • 01:18:12 test set we only have 208 negatives so
  • 01:18:15 like our model is a little bit more
  • 01:18:16 accustomed to like predicting 50/50 and you know
  • 01:18:22 it really depends on what you're looking
  • 01:18:23 for and there's also ways to probably
  • 01:18:25 make it a little bit more robust where
  • 01:18:26 even though you train with about 50/50
  • 01:18:28 you still could handle that uncertainty
  • 01:18:31 but I'm going to just go ahead and start
  • 01:18:33 out by doing the evenly distribute for
  • 01:18:40 the test container as well
  • 01:18:46 so now we'll get about 200 for both
  • 01:18:52 positive and negative and everything
  • 01:19:01 else should stay the same here nothing
  • 01:19:03 changed nothing changed here all this
  • 01:19:07 stays the same now we just go back and
  • 01:19:09 let's see what happens to our f1 score
  • 01:19:11 now that we made it about half and half
  • 01:19:13 so we basically what was happening
  • 01:19:15 before is we probably were getting a lot
  • 01:19:17 of false positives if there is something
  • 01:19:20 that was actually like a positive
  • 01:19:21 comment we were predicting it negative
  • 01:19:24 because we're about equal like pretty
  • 01:19:28 good positive and negative based on it
  • 01:19:29 so let's see what happens now that we
  • 01:19:31 make this we even do that so I mean
  • 01:19:34 sorry this is not the model I'm running
  • 01:19:36 but as you can see 208 208 okay let's
  • 01:19:41 see what happens to our f1 score for
  • 01:19:46 negative I mean it
  • 01:19:47 should predict negative at a higher
  • 01:19:51 accuracy rate because its test set is balanced now
  • 01:19:55 oh wait what happened I think I didn't
  • 01:19:55 like reset things so I was like doing
  • 01:19:59 some weird stuff
  • 01:20:10 okay what's gonna happen now come on
  • 01:20:15 work okay cool and look at that
  • 01:20:19 yeah shot way up because it wasn't
  • 01:20:22 predicting negative so frequently or I
  • 01:20:24 mean it was predicting negative as
  • 01:20:26 equally but our actual test data was
  • 01:20:29 more equally distributed and I'm saying
  • 01:20:32 that we wanted that all right so we got
  • 01:20:39 like our base model working and one
  • 01:20:41 thing that's probably good to do is
  • 01:20:42 let's do a little bit of a qualitative
  • 01:20:44 analysis as well so we could take a test
  • 01:20:51 set such as the one I showed before we
  • 01:20:57 could transform that using our
  • 01:20:58 vectorizer into new test vectors and we
  • 01:21:05 could use the SVM to predict what our
  • 01:21:08 labels should be and as we see look at
  • 01:21:14 that positive negative negative and we
  • 01:21:17 get it's really kind of fun to play
  • 01:21:19 around with this great positive still if
  • 01:21:27 I said like not great I'm actually
  • 01:21:29 curious what this will say I think the
  • 01:21:31 great will outweigh it because we didn't
  • 01:21:32 actually use diagrams that which would
  • 01:21:32 actually use bigrams which would
  • 01:21:38 well that's pretty impressive very good
  • 01:21:44 book positive very brilliant all sorts
  • 01:21:49 of words okay I guess it didn't know the
  • 01:21:51 word brilliant and that could probably
  • 01:21:54 be accounted for with more training data
  • 01:21:57 very fun that's good
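That qualitative check is just a transform plus a predict; a small sketch, assuming the fitted vectorizer and clf_svm from above (the phrases are made up):

```python
# quick qualitative check on a few made-up review phrases
test_set = ['great', 'not great', 'very fun']

new_test = vectorizer.transform(test_set)   # reuse the vectorizer fitted on the training text
print(clf_svm.predict(new_test))
```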
  • 01:22:00 so looks pretty good to me but let's
  • 01:22:03 make this even better sorry I'm going
  • 01:22:06 insane okay and also just really quick
  • 01:22:10 gonna refresh all these scores since we
  • 01:22:12 changed up our test data and as you see
  • 01:22:15 they increase because of their
  • 01:22:17 predicting positive and negative both
  • 01:22:20 more accurate
  • 01:22:22 okay so let's drive up these scores any
  • 01:22:24 higher even higher and the first way
  • 01:22:27 we're gonna do that
  • 01:22:30 and then you'll get a feel for
  • 01:22:31 vectorizers but let's scroll back up
  • 01:22:31 to our vectorizer with the count
  • 01:22:33 vectorizer let's think of our examples
  • 01:22:36 that I showed when I was initially
  • 01:22:37 introducing vectorizers this book is great
  • 01:22:43 was one I think the other was this book
  • 01:22:46 was so bad okay
  • 01:22:48 so with the count vectorizer the main
  • 01:22:51 issue is that it weights each word
  • 01:22:54 equally even though if certain words
  • 01:22:57 don't have nearly as much meaning to a
  • 01:23:00 sentence like in this case this and
  • 01:23:04 great would be weighted equally as like
  • 01:23:06 one count while the great is the one
  • 01:23:10 that defines the sentiment and the this
  • 01:23:13 has no meaning really so instead of
  • 01:23:17 using count vectorizer we can do
  • 01:23:18 something smarter and that is use a
  • 01:23:20 tf-idf vectorizer and that stands for
  • 01:23:24 term frequency inverse document
  • 01:23:26 frequency so basically a term is
  • 01:23:30 important if it occurs a lot throughout
  • 01:23:35 a review just like great only appears
  • 01:23:35 once but it's like yes it's as important
  • 01:23:37 as any of these words and it's inverse
  • 01:23:40 document frequency means a word is less
  • 01:23:42 important if it occurs in a lot of
  • 01:23:44 documents so for example this is was so
  • 01:23:49 if we have a big corpus of documents
  • 01:23:50 those words would appear a ton and as a
  • 01:23:53 result their weight would be less than
  • 01:23:55 great which occurs less
  • 01:23:58 frequently and ultimately plays more of
  • 01:24:02 a role into the meaning of the the
  • 01:24:05 document or the the review or whatever
  • 01:24:07 you want to call these so tf-idf allows
  • 01:24:10 us to weight this great and bad higher
  • 01:24:14 than stuff like this and book and that
  • 01:24:19 ultimately will help our performance so
  • 01:24:20 all we have to do to make this change is
  • 01:24:23 change this to TfidfVectorizer
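The swap itself is a one-line change; a sketch, assuming train_x and test_x are the text lists from the containers above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# swap the CountVectorizer for TfidfVectorizer; fit on the train text, only transform the test text
vectorizer = TfidfVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)
test_x_vectors = vectorizer.transform(test_x)

# (if you wanted the bigram idea mentioned earlier, TfidfVectorizer(ngram_range=(1, 2))
#  would also count pairs like "not great" as single features)
```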
  • 01:24:26 and we should be able to run this we
  • 01:24:31 should be able to rebuild our models and
  • 01:24:40 oh no I updated it really quickly but
  • 01:24:44 I'm not mistaken by doing that we're
  • 01:24:50 focusing just on SVM that went up by
  • 01:24:53 about a percentage point it looked like
  • 01:24:55 so it did some damage and you probably
  • 01:24:58 could tweak around with that make it
  • 01:25:00 even better
  • 01:25:00 I think logistic regression actually
  • 01:25:02 went down but so I mean this is I mean
  • 01:25:08 machine learning in general like you do
  • 01:25:09 something maybe it increases it maybe it
  • 01:25:11 doesn't try to be smart about how you do
  • 01:25:13 things and you know you always kind of
  • 01:25:16 just have to play around and try things
  • 01:25:18 but our SVM went up okay that's cool so
  • 01:25:21 that's one thing we can do to increase
  • 01:25:22 our performance let's see what happens
  • 01:25:26 to our f1 score is for SVM yeah they
  • 01:25:31 seem to both go up for both sentiment
  • 01:25:35 positive and negative that's cool yeah
  • 01:25:38 I'm guessing these will stay the same
  • 01:25:39 hopefully yeah all right so we're gonna
  • 01:25:41 try to increase our accuracy even
  • 01:25:43 further and we'll be doing that through
  • 01:25:45 a method called grid search and to
  • 01:25:48 access grid search we can do from
  • 01:25:51 sklearn.model_selection import GridSearchCV and if
  • 01:25:58 I was like uncertain what this was
  • 01:26:01 called I would probably be googling like
  • 01:26:04 parameter tuning SK learn and you'd
  • 01:26:07 probably find this by doing that okay so
  • 01:26:12 GridSearchCV and what does that
  • 01:26:15 do well let's go back to our SVM that
  • 01:26:19 SVC model that we used and I'm gonna
  • 01:26:25 real quick
  • 01:26:28 look up the docs for this so as you see
  • 01:26:33 there's a bunch of different things
  • 01:26:35 you can pass to SVC there's the C
  • 01:26:38 parameter there's kernel degree gamma
  • 01:26:42 all this stuff and to be honest like
  • 01:26:46 especially for like stuff like the C
  • 01:26:49 value maybe kernel you have a better
  • 01:26:49 feel I don't know what values to that
  • 01:26:53 are going to be best for my data so is
  • 01:26:55 there a way I can programmatically test
  • 01:26:58 a lot of different options and like
  • 01:27:00 choose the best one for me and that's
  • 01:27:02 exactly where this GridSearchCV comes
  • 01:27:05 in
  • 01:27:05 so we'll do something called like tuned
  • 01:27:10 SVM equals and how do we use grid search
  • 01:27:17 well and I'll actually get back to that
  • 01:27:20 in a second comment this out real quick
  • 01:27:21 well what are the parameters we were
  • 01:27:24 looking at
  • 01:27:31 we saw C kernel gamma let's just focus
  • 01:27:36 on the C value and kernel value right
  • 01:27:40 now and remember for kernel we were
  • 01:27:42 using a linear but you also use one of
  • 01:27:44 these so we're gonna pass in some
  • 01:27:46 parameters so what our parameters gonna
  • 01:27:49 be you're gonna be first off we'll have
  • 01:27:53 kernel and for now I'm just gonna list
  • 01:27:56 two kernel options so kernel and I'm
  • 01:28:00 gonna pass in this is a dictionary
  • 01:28:03 mapping kernel to our options so we
  • 01:28:05 want to look at linear so that's what
  • 01:28:07 we're currently using and there's this
  • 01:28:08 other one that's the default
  • 01:28:10 that's radial basis I believe kernel
  • 01:28:15 will be one of our parameters we can also
  • 01:28:18 pass in a C value right now the default
  • 01:28:21 is I think one as you see here but maybe
  • 01:28:24 one isn't the best value so we can pass
  • 01:28:26 in a bunch of different values you'll be
  • 01:28:27 passing 1 4 8 16 32 etc and basically
  • 01:28:36 what's gonna happen is when we use this
  • 01:28:39 grid search when we use this grid search
  • 01:28:44 so I'll say svc equals svm.SVC and then a
  • 01:28:48 classifier equals GridSearchCV of the
  • 01:28:54 classifier that we want to pass in so
  • 01:28:55 that's this right here
  • 01:28:57 and then we're going to pass in the
  • 01:29:00 parameters and then we can optionally
  • 01:29:02 pass in this CV value which is basically
  • 01:29:06 how many times it wants us to split the
  • 01:29:09 data up to like cross validate and make
  • 01:29:12 sure that things are working well with
  • 01:29:15 the specific parameter on our training
  • 01:29:18 set so now we have our classifier
  • 01:29:20 defined you can do classifier dot fit train
  • 01:29:23 x train y so we're fitting it just
  • 01:29:27 like we fit all these other models up
  • 01:29:29 here or up here train X vectors I guess
  • 01:29:33 I have to pass a train X
  • 01:29:35 vectors train why okay and so now what's
  • 01:29:40 gonna happen is it will check the linear
  • 01:29:43 kernel it will look at for this linear
  • 01:29:46 kernel it will check all of these C
  • 01:29:48 values figure out what's the best option
  • 01:29:51 and then it will also check this RBF
  • 01:29:53 kernel and check all of these C values
  • 01:29:56 and then it will choose a kernel and a C
  • 01:29:59 value that it predicts will do the best
  • 01:30:02 on unseen data so let's fit that this
  • 01:30:07 sometimes will take a little while
  • 01:30:08 depending on how many parameters you
  • 01:30:10 pass in I don't actually know how long
  • 01:30:15 this will take we'll see all right
  • 01:30:18 worked okay cool
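For reference, a sketch of that grid search, assuming the train_x_vectors and train_y from above; cv=5 means 5-fold cross-validation on the training set:

```python
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# candidate settings to try for the SVC kernel and C parameters
parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 4, 8, 16, 32)}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)   # 5-fold cross-validation on the training data
clf.fit(train_x_vectors, train_y)

print(clf.best_params_)   # the kernel / C combination the search settled on
```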
  • 01:30:20 so it found after doing this that the C
  • 01:30:28 value that we
  • 01:30:30 put in of one was fine but it actually
  • 01:30:32 changed our kernel value it recommended
  • 01:30:34 the radial basis kernel so now we have a
  • 01:30:36 new classifier and if we went ahead and
  • 01:30:39 did the same scoring method that we've
  • 01:30:42 done on the other values and I'll do
  • 01:30:48 this in the next cell so we don't have
  • 01:30:50 to rerun this okay so remember our SVM
  • 01:30:59 before was 80 point seven percent
  • 01:31:04 accurate and now let's see what happens
  • 01:31:06 after we do a little bit of fine tuning
  • 01:31:07 oh it's about the same yes that RBF
  • 01:31:13 kernel didn't do much kind of coming
  • 01:31:15 down to the end of this end of this
  • 01:31:20 first model I mean there's room for
  • 01:31:23 improvement here a couple things you
  • 01:31:25 could do if you wanted to improve any of
  • 01:31:29 these models further specifically and
  • 01:31:32 just focus on this SVM that we just fit
  • 01:31:35 I guess this is really pretty general
  • 01:31:38 for all of them think about how our
  • 01:31:40 texts what our texts look like like one
  • 01:31:43 thing you could potentially do is just
  • 01:31:44 strip out any of the common
  • 01:31:45 stop words maybe that would help a bit
  • 01:31:47 another thing I noticed that might
  • 01:31:50 be a problem is interestingly
  • 01:31:56 that words like good! and good
  • 01:32:02 I think by
  • 01:32:05 default you know you're not stripping
  • 01:32:06 out punctuation so like good with an exclamation point
  • 01:32:10 and good would be treated as different
  • 01:32:11 words so one thing you could do
  • 01:32:14 to make your model better is to strip out
  • 01:32:16 all of the exclamation points and punctuation marks
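That punctuation idea is not something done in this video, but here is one hedged way you might try it, using the standard-library string.punctuation set before re-fitting the vectorizer:

```python
import string

def strip_punctuation(text):
    # drop every ASCII punctuation character so "good!" and "good" become the same token
    return text.translate(str.maketrans('', '', string.punctuation))

train_x_clean = [strip_punctuation(t) for t in train_x]
test_x_clean = [strip_punctuation(t) for t in test_x]
# then re-fit the vectorizer on train_x_clean instead of train_x and compare scores
```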
  • 01:32:19 what else can you do you could look at
  • 01:32:22 exploring more complex things than just
  • 01:32:24 doing bag of words and tf-idf
  • 01:32:26 vectorization there's a lot of cool
  • 01:32:29 state-of-the-art
  • 01:32:30 language models that are out there
  • 01:32:31 language models that are out there there's OpenAI's GPT that maybe you
  • 01:32:35 could do something funny cool with
  • 01:32:37 there's Google's BERT that's also good
  • 01:32:41 those are both topics for maybe future
  • 01:32:43 videos but this also isn't a natural
  • 01:32:46 language processing video so I'm not
  • 01:32:47 gonna cover those further areas so let's
  • 01:32:51 talk about saving the model okay so we
  • 01:32:56 have this classifier right here that we
  • 01:32:57 want to save so that we don't have to
  • 01:32:59 retrain it the next time we want to use
  • 01:33:01 it well we can do that very easily with
  • 01:33:07 this library called pickle you
  • 01:33:08 should already have pickle since it's
  • 01:33:11 part of the Python standard library
  • 01:33:13 so no install is needed and
  • 01:33:15 basically what we can do with this
  • 01:33:15 pickle library is I can do with open
  • 01:33:23 I've created some directories let me
  • 01:33:27 see where I am one sec and you can
  • 01:33:27 actually so in my SK learn directory you
  • 01:33:34 have this models and I'm gonna save a
  • 01:33:37 model here so models slash sentiment
  • 01:33:44 classifier dot pickle that's what we're
  • 01:33:47 going to save our classifier as
  • 01:33:50 so with open sentiment classifier dot pickle and we
  • 01:33:54 need to write bytes so we're gonna open
  • 01:33:56 a write-binary buffer as f we can do pickle
  • 01:34:00 dot dump clf to f so this is taking our
  • 01:34:05 classifier that we were using up here
  • 01:34:08 basically the SVM but with some tune
  • 01:34:11 parameters that ended up being basically
  • 01:34:13 the same parameters giving us the same
  • 01:34:14 result we have this and we're dumping it
  • 01:34:24 into all the parameters in here into
  • 01:34:27 this file so I can run that now what's
  • 01:34:32 cool is that I can go ahead and even if
  • 01:34:36 the CLF wasn't defined or if I wanted to
  • 01:34:38 find out something else I can just
  • 01:34:39 simply load in the model so I could go
  • 01:34:44 ahead and do something like so I could
  • 01:34:48 do I need to open that file so I can
  • 01:34:57 just copy this
  • 01:35:01 open and now this time I'm reading it so
  • 01:35:04 I use a read-binary buffer as f we want the
  • 01:35:08 classifier so I'm just gonna call it loaded
  • 01:35:15 clf and just so you're clear if I was
  • 01:35:19 trying to do loaded clf dot predict now
  • 01:35:31 like this would not be defined so it
  • 01:35:34 wouldn't work
  • 01:35:36 loaded clf equals pickle dot load and I
  • 01:35:41 just need to load that file do that now
  • 01:35:44 I can do loaded clf dot predict test
  • 01:35:50 x vectors so just the first one what
  • 01:35:57 is the first one test x 0
  • 01:36:04 as you see it did output something so
  • 01:36:07 like we have a review that looks like
  • 01:36:11 this and without doing anything before
  • 01:36:13 we did enough of a train a model we were
  • 01:36:15 able to just load this pickled file and
  • 01:36:18 use it just as we were using the
  • 01:36:20 classifier before after the training so
  • 01:36:23 that's pretty cool it's very very useful
  • 01:36:25 because if you're training these models
  • 01:36:27 you want to be able to save them if you want
  • 01:36:28 to be able to use them in production and
  • 01:36:30 using pickle by dumping your
  • 01:36:34 trained models and then reloading them
  • 01:36:38 is how you're gonna do that
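A compact sketch of that save-and-reload round trip, assuming clf is the tuned classifier from the grid search; the path and filename are whatever you chose:

```python
import pickle

# save the trained classifier to disk
with open('./models/sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)

# later, even in a fresh session, load it back and use it like before
with open('./models/sentiment_classifier.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)

print(test_x[0])
print(loaded_clf.predict(test_x_vectors[0]))
```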
  • 01:36:40 because this tutorial is getting pretty long now
  • 01:36:42 that we're through everything I'm
  • 01:36:44 gonna kind of skip over the category
  • 01:36:45 classifier I'll go through it really
  • 01:36:48 quickly it is on my github so if you
  • 01:36:50 want to see the exact details check
  • 01:36:53 there so if you go to the sklearn repo
  • 01:36:56 KeithGalli/sklearn and you go into the
  • 01:36:56 category classifier you could download
  • 01:37:00 this too you'll see that it's basically
  • 01:37:02 the same thing overall is what we did
  • 01:37:05 for a couple differences like I added an
  • 01:37:08 enum class for categories basically then
  • 01:37:11 we go through and load in data from all
  • 01:37:16 these different files and those all that
  • 01:37:18 data is can easily be found through data
  • 01:37:22 slash category you download all of these
  • 01:37:24 files and then so it goes through
  • 01:37:33 each one of these files loads in all
  • 01:37:35 the data associated with that file and
  • 01:37:38 then based on which file it loaded from
  • 01:37:41 it sets a specific category and that can
  • 01:37:44 be seen here too that it has the reviews
  • 01:37:48 the same way preps the data pretty
  • 01:37:51 similarly one thing you'll notice is
  • 01:37:52 that evenly distribute is crossed out
  • 01:37:55 because for this one because we're
  • 01:37:58 looking at just which category so
  • 01:38:01 electronics or clothing or grocery that
  • 01:38:05 some review comes from we don't care
  • 01:38:07 whether the positive and negative stuff
  • 01:38:10 is evenly distributed at least I don't
  • 01:38:12 think it would be that important then we
  • 01:38:16 use the tf-idf vectorizer
  • 01:38:17 just like before here's some
  • 01:38:19 classification stuff pretty similar f1
  • 01:38:27 score does actually very well as you can
  • 01:38:29 see with here all these are like hitting
  • 01:38:31 pretty accurately across the board so
  • 01:38:34 that's pretty impressive to see I do the
  • 01:38:36 grid search again that honestly just
  • 01:38:38 gets you back to even though you had all
  • 01:38:41 these options to choose from it gets you
  • 01:38:42 back to C equals one and kernel equals
  • 01:38:45 RBF but it did check all these
  • 01:38:48 finally like save the pickle file and
  • 01:38:54 also save the vectorizer in this case
  • 01:38:57 because it's nice basically when you
  • 01:38:59 want to do tests like this like quick
  • 01:39:02 kind of qualitative tests you want to
  • 01:39:04 use the vectorizer as well so figured
  • 01:39:06 I'd save the vectorizer as well those
  • 01:39:09 pickle files you can download from my
  • 01:39:11 github as well they're in the models
  • 01:39:12 directory then the last thing I did here
  • 01:39:14 which was kind of cool just because it
  • 01:39:16 fit well was I did a little bit of a
  • 01:39:18 confusion matrix so you can look at this
  • 01:39:21 confusion matrix so you can look at this
  • 01:39:25 code basically as you see stuff is concentrated along
  • 01:39:32 this diagonal I'd have to double check which axis is which but
  • 01:39:35 one of these is the y predictions and
  • 01:39:38 one of these is the y actuals so as you
  • 01:39:40 see where y prediction and y actual
  • 01:39:43 agree the counts are high but maybe sometimes you
  • 01:39:45 know you mistake clothing for
  • 01:39:48 electronics or mistake books for
  • 01:39:54 grocery like there are mistakes and
  • 01:39:58 but you could use this code if you want
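A sketch of that confusion matrix, assuming the tuned classifier and test vectors from the category notebook; the Category constants here are illustrative placeholders for whatever you named them in your enum. In sklearn's convention the rows are the true labels and the columns are the predictions, which settles the which-axis-is-which question above:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

y_pred = clf.predict(test_x_vectors)

# placeholder label names; use the constants from your Category enum
labels = [Category.ELECTRONICS, Category.BOOKS, Category.CLOTHING, Category.GROCERY, Category.PATIO]
cm = confusion_matrix(test_y, y_pred, labels=labels)

# rows are the actual categories, columns are the predicted categories
print(pd.DataFrame(cm, index=labels, columns=labels))
```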
  • 01:39:59 to replicate some confusion matrix type
  • 01:40:02 of stuff all right that's all we're
  • 01:40:03 going to do in this video thank you guys
  • 01:40:05 for watching if you enjoyed this video
  • 01:40:07 make sure to throw it a big thumbs up
  • 01:40:08 also mean a lot to me if you don't mind
  • 01:40:11 subscribing
  • 01:40:12 really just seeing more subscribers
  • 01:40:14 motivates me to make more videos also
  • 01:40:17 check out my socials my Instagram
  • 01:40:19 Twitter and then final thing I wanted to
  • 01:40:21 say is if you like this dope shirt that
  • 01:40:24 I'm wearing you should definitely check
  • 01:40:25 out my buddy Sebastian his Instagram
  • 01:40:28 page where he makes all this cool
  • 01:40:29 content
  • 01:40:30 is right here and I'll also link it in
  • 01:40:33 the description thanks again guys peace
  • 01:40:36 out