Coding

Complete Python Pandas Data Science Tutorial! (Reading CSV/Excel files, Sorting, Filtering, Groupby)

  • 00:00:00 what's up guys and welcome back to
  • 00:00:01 another video in this video we are not
  • 00:00:03 talking about the fluffy animal or the
  • 00:00:05 song by Desiigner but instead we're gonna
  • 00:00:07 dive into the pandas library of Python
  • 00:00:10 which I find to be one of the most
  • 00:00:12 useful libraries when you're doing
  • 00:00:14 anything data science related in Python
  • 00:00:16 so this video will be a good standalone
  • 00:00:18 video if you've never done anything with
  • 00:00:20 pandas kind of going from zero to like
  • 00:00:22 fairly comfortable in one sitting it
  • 00:00:25 also is a good video if you have you
  • 00:00:28 know some Python pandas experience but
  • 00:00:31 you're looking to try to figure out how
  • 00:00:32 to do something specific if you're in
  • 00:00:35 that second case look down in the
  • 00:00:36 comments and I'll pin a timeline of what
  • 00:00:41 we're doing this video so you can find
  • 00:00:42 exactly what you're looking for quickly
  • 00:00:44 so one question you might have as you
  • 00:00:46 watch this video is: why pandas? A
  • 00:00:48 lot of the stuff you'll see me doing you
  • 00:00:51 probably could replicate in Excel but
  • 00:00:53 there are specific benefits to using
  • 00:00:55 pandas and the first benefit is you have
  • 00:00:59 a lot more flexibility with pandas and
  • 00:01:01 using Python in general like what you
  • 00:01:03 can do in Excel I think is very limited
  • 00:01:06 compared to what you can do using the
  • 00:01:07 whole Python programming language so
  • 00:01:11 like the flexibility that Python offers
  • 00:01:13 is reason one to use pandas and then the
  • 00:01:16 second reason this is also very
  • 00:01:17 important reason is pandas allows you to
  • 00:01:20 work with a lot larger data sets than
  • 00:01:24 Excel does Excel really kind of
  • 00:01:26 struggles once you start loading in
  • 00:01:27 really large files so the second reason
  • 00:01:29 for why pandas is that you can work with big
  • 00:01:32 big data if you're finding this video
  • 00:01:36 useful don't forget to click that
  • 00:01:37 subscribe button because I'll be making
  • 00:01:39 a lot more tutorials building on this
  • 00:01:41 type of stuff in the future ok to begin
  • 00:01:44 this video you should already have
  • 00:01:45 Python 3 installed and then you need to
  • 00:01:47 open up a terminal window and type in
  • 00:01:50 pip install pandas if you don't already
  • 00:01:52 have the library as you can see I
  • 00:01:57 already have it once you have the
  • 00:01:59 library we can actually begin loading in
  • 00:02:03 data super quickly so I just want to
  • 00:02:04 dive into the data right away and we'll
  • 00:02:06 use that data to kind of learn
  • 00:02:08 everything we need to know regarding
  • 00:02:10 this library so I have a link in the
  • 00:02:12 description
  • 00:02:13 to my github page where I have a CSV
  • 00:02:16 of data that we're going to be using for
  • 00:02:18 this video so go to my github page and
  • 00:02:21 then this data is going to be on Pokemon
  • 00:02:24 I found this data on kaggle it's like a
  • 00:02:27 good open source machine learning
  • 00:02:29 website that you can kind of like do all
  • 00:02:31 sorts of challenges and I thought it was
  • 00:02:33 perfect for an introductory video on
  • 00:02:35 pandas so you don't have to be a huge
  • 00:02:38 fan of Pokemon but it's a great data set
  • 00:02:40 to get started so click on the CSV
  • 00:02:42 version and that's kind of the most
  • 00:02:44 important one as you can see you can
  • 00:02:46 kind of get a feel for what's in this
  • 00:02:47 data so we have all the different
  • 00:02:49 Pokemon and then all of their kind of
  • 00:02:52 stats in we'll be doing all sorts of
  • 00:02:55 manipulations and doing all sorts of
  • 00:02:57 analysis on this data throughout the
  • 00:02:59 video but I want you to click on raw and
  • 00:03:02 then once you have the raw file you can
  • 00:03:05 just save as I called my Pokemon data
  • 00:03:09 and I save it as a CSV CSV is important
  • 00:03:12 for loading it in properly but you can
  • 00:03:16 name it whatever — so, pokemon_data.csv —
  • 00:03:18 and one thing to note is
  • 00:03:20 wherever you're writing your code you
  • 00:03:21 should save this data in the same exact
  • 00:03:23 directory just so it's easy to load in
  • 00:03:27 these files okay once you have the data
  • 00:03:32 saved locally open up your favorite text
  • 00:03:34 editor for the purposes of this video
  • 00:03:37 I'm going to be using Jupyter notebooks
  • 00:03:38 because I like using that for data
  • 00:03:41 science related stuff but you can use
  • 00:03:42 sublime text pycharm whatever you like
  • 00:03:45 to write your code in and I'm going to
  • 00:03:47 just clear all this that I have on the
  • 00:03:48 screen right now okay so the first thing
  • 00:03:51 we're gonna do is load data into
  • 00:03:54 pandas so we have this CSV and you can
  • 00:03:57 open up the CSV and look but exactly
  • 00:04:00 what you saw on this page so this is
  • 00:04:06 what we're gonna load into the pandas
  • 00:04:08 library and we're gonna load it in and
  • 00:04:09 what is called a data frame so it's
  • 00:04:12 super important you know everything
  • 00:04:15 about a data frame but that's kind of
  • 00:04:16 what the object type is that
  • 00:04:18 pandas allows you to manipulate
  • 00:04:20 everything with okay so the first thing
  • 00:04:24 we need to do is type in import pandas
  • 00:04:28 to get the library and usually what
  • 00:04:30 you'll see is it's kind of annoying to
  • 00:04:32 have to reference pandas every time you
  • 00:04:34 type anything in that uses it so we
  • 00:04:37 usually import it as pandas as PD so
  • 00:04:40 just do that and then to quickly get our
  • 00:04:43 data loaded in we're going to say pokey
  • 00:04:47 meaning Pokemon or maybe I can just call
  • 00:04:49 this like DF for data frame equals PD
  • 00:04:54 and then there's this really useful
  • 00:04:55 function called read CSV and then you
  • 00:04:59 have to pass in the path to that CSV and
  • 00:05:01 if you put the CSV in the same
  • 00:05:05 folder or in the same location that you're
  • 00:05:07 writing your code you can just do the
  • 00:05:11 name of the file dot CSV if I run this
  • 00:05:16 it loaded it in and you can't see that
  • 00:05:19 it loaded it in but if I went ahead and
  • 00:05:21 did print DF you can see that all that
  • 00:05:26 data is there in that DF variable and if
  • 00:05:30 you don't want to load in all of the
  • 00:05:31 data you can use the there's these two
  • 00:05:34 useful functions to look at just the top
  • 00:05:36 of the data and just the bottom so I
  • 00:05:37 could do DF head and then I could
  • 00:05:40 specify a number of rows so I'm going to
  • 00:05:41 just say three for now think the default
  • 00:05:44 if you didn't put that three in there is
  • 00:05:45 five so you see I just now can see the
  • 00:05:48 top three rows and it's a little bit
  • 00:05:50 easier to read my data using that and I
  • 00:05:53 also could do if I wanted to see the
  • 00:05:55 bottom three rows I could do tail three
  • 00:05:58 and get as you can see the index is
  • 00:06:00 changed to the bottom got those bottom
  • 00:06:03 rows okay I'm going to just comment this
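The loading and peeking steps just described can be sketched like this — a minimal runnable version, using an inline string as a stand-in for the pokemon_data.csv file (the rows here are made-up samples, not the real data):

```python
import io
import pandas as pd

# A few made-up rows standing in for pokemon_data.csv
csv_text = """#,Name,Type 1,Type 2,HP,Attack,Defense,Speed
1,Bulbasaur,Grass,Poison,45,49,49,45
2,Ivysaur,Grass,Poison,60,62,63,60
3,Venusaur,Grass,Poison,80,82,83,80
4,Charmander,Fire,,39,52,43,65
"""

# In the video this is simply: df = pd.read_csv('pokemon_data.csv')
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(3))  # top 3 rows (head() defaults to 5)
print(df.tail(3))  # bottom 3 rows
```

With a real file saved next to your code, you would pass the filename instead of the StringIO object.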
  • 00:06:05 out real quick I also want to show you
  • 00:06:07 that if you don't have your data in
  • 00:06:10 a CSV format that's fine we can also
  • 00:06:13 very easily load in Excel files or tab
  • 00:06:18 separated files so on that github page I
  • 00:06:22 also just for the sake of practice
  • 00:06:25 included this same exact file in txt
  • 00:06:29 format which is a tab
  • 00:06:31 separated format as you can see in the
  • 00:06:34 Excel format so if you want to try this
  • 00:06:36 or you have a set of data that you're
  • 00:06:38 trying to manipulate you can also do
  • 00:06:41 I'll load those files in so I can do PD
  • 00:06:44 dot read Excel that's another built-in
  • 00:06:49 function to pandas and my excel file was
  • 00:06:51 like Pokemon data
  • 00:06:54 xlsx I believe just check yeah I don't
  • 00:07:00 know yeah I think that's the extension
  • 00:07:01 we'll get an error if not I'll comment
  • 00:07:04 this line out too and I can do a print
  • 00:07:07 of df_xlsx dot head three
  • 00:07:14 as you can see that same data is read in
  • 00:07:17 from that excel file and then the last
  • 00:07:21 thing we can try to do I'll move this
  • 00:07:24 just so it's a little bit cleaner just
  • 00:07:28 down here comment it out real quick
  • 00:07:31 I can also load in that tab separated
  • 00:07:33 file so this one's a little bit
  • 00:07:36 different I can do PD read CSV Pokemon
  • 00:07:42 data dot txt and watch what happens when
  • 00:07:46 I run this it's probably not giving me
  • 00:07:48 an error let's see oh it didn't print
  • 00:07:50 yeah let's see what happens I think it's
  • 00:07:53 gonna yeah it loaded it all in is like
  • 00:07:55 this one single column so the difference
  • 00:07:59 with this tab separated file and just to
  • 00:08:02 remind you what this looks like just
  • 00:08:04 instead of having commas that's
  • 00:08:06 separating the different columns, it's
  • 00:08:08 tabs we need to in our read CSV function
  • 00:08:13 specify a delimiter it's actually
  • 00:08:16 separating them in this case it's a tab
  • 00:08:18 which is specified by '\t' (a backslash t), I believe — I
  • 00:08:22 don't remember the differences between
  • 00:08:23 forward slash and backslash — and yeah
  • 00:08:27 look at that we have the columns in the
  • 00:08:30 way they were looking when we were just
  • 00:08:32 doing the CSV also note for this TSV the
  • 00:08:36 tab separated file you could change this
  • 00:08:38 to anything that was actually separating
  • 00:08:41 your column so if like let's say for
  • 00:08:43 whatever reason you had three exes
  • 00:08:45 separating your columns you would set
  • 00:08:48 delimiter equals xxx all right let's
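Here is a minimal sketch of the tab-separated loading just described, with an inline string standing in for pokemon_data.txt (the rows are made up):

```python
import io
import pandas as pd

# Tab-separated text standing in for pokemon_data.txt
tsv_text = "Name\tType 1\tHP\nBulbasaur\tGrass\t45\nCharmander\tFire\t39\n"

# Without a delimiter, read_csv assumes commas, so everything lands
# in one single mushed-together column, like in the video:
mushed = pd.read_csv(io.StringIO(tsv_text))
print(mushed.shape)

# Telling read_csv what's actually separating the columns fixes it:
df = pd.read_csv(io.StringIO(tsv_text), delimiter='\t')
print(df.shape)
```

If your columns really were separated by 'xxx', passing `delimiter='xxx'` works the same way.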
  • 00:08:50 move on to the next thing and that's
  • 00:08:51 going to be actually reading our data
  • 00:08:53 easily within the pandas framework so
  • 00:08:57 the first thing is reading the headers
  • 00:08:59 so in our data we have several headers
  • 00:09:01 and we can figure out what those are by
  • 00:09:04 doing a print df dot columns so if we
  • 00:09:08 want the headers we just do DF columns
  • 00:09:11 as you can see there's the Pokemon
  • 00:09:15 number, or the Pokedex number I think —
  • 00:09:18 it's been a little while since I've
  • 00:09:21 refreshed my Pokemon skills — then the name
  • 00:09:23 of the Pokemon, the two types,
  • 00:09:26 all of the stats information, and
  • 00:09:29 whether or not they're legendary so
  • 00:09:30 these are all the columns we can work
  • 00:09:31 with. I'll just not print that for
  • 00:09:36 now. Also, this is a Jupyter notebook — I'll
  • 00:09:39 save this on the github page once I'm
  • 00:09:41 finished with the video so you can also
  • 00:09:43 look at this if you use Jupyter
  • 00:09:45 notebooks follow along with this if you
  • 00:09:48 just download it from the github page or
  • 00:09:51 clone it alright so now that we know our
  • 00:09:54 columns let's read a specific column so
  • 00:09:57 to do that we have our data frame still
  • 00:09:59 that we loaded up here and I can do DF
  • 00:10:03 dot let's say I wanted to get the name
  • 00:10:08 of the Pokemon so if I just did print
  • 00:10:13 df name and ran that, as you can see I
  • 00:10:17 get all the Pokemon and it does actually
  • 00:10:20 abbreviate it just so I'm not printing
  • 00:10:22 out like 800 different things so that
  • 00:10:25 gives me that and I could also specify
  • 00:10:27 that I only wanted like 0-5 probably by
  • 00:10:32 doing this — yes, now I just get
  • 00:10:34 the top five names. One thing that's
  • 00:10:39 interesting you could also do DF name
  • 00:10:42 like this this doesn't really work for
  • 00:10:44 two-word names but you could also get the
  • 00:10:48 names like that I usually just do it
  • 00:10:51 in the using the brackets and if you
  • 00:10:57 want to get multiple columns at the same
  • 00:11:00 time you can change this just single
  • 00:11:02 word to a list of column names — so I'm
  • 00:11:05 putting a list here and then separating
  • 00:11:08 it by commas — so Name, say Type 1, and
  • 00:11:12 let's say HP — we're getting 3
  • 00:11:15 different columns and they aren't even
  • 00:11:16 all in order so it's kind of nice so
  • 00:11:19 that you can get so if you want to look
  • 00:11:20 at specific things and not be cluttered
  • 00:11:23 with so much extra stuff you can do that
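The column-reading calls above, as a small runnable sketch (toy values standing in for the real data set):

```python
import pandas as pd

# Toy frame standing in for the Pokemon data
df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Charmander'],
    'Type 1': ['Grass', 'Grass', 'Grass', 'Fire'],
    'HP': [45, 60, 80, 39],
    'Attack': [49, 62, 82, 52],
})

print(df.columns)        # the headers
print(df['Name'][0:2])   # one column, sliced to the top rows
print(df.Name)           # dot access works too, but not for two-word names like 'Type 1'

# several columns at once, in whatever order you like
subset = df[['Name', 'Type 1', 'HP']]
print(subset)
```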
  • 00:11:26 moving on to printing each row that's
  • 00:11:29 let me just bring this up real quick —
  • 00:11:31 probably the easiest way to print out
  • 00:11:32 each row so I'm going to just show you
  • 00:11:35 remind you what's in our actual data set
  • 00:11:38 again so let's print out the first four
  • 00:11:39 rows and let's say we wanted to print
  • 00:11:42 out this first row index of one row so I
  • 00:11:46 guess this is actually this zeroth row
  • 00:11:48 so that it has Ivysaur grass poison etc
  • 00:11:51 and if we want to access that row we
  • 00:11:53 can use this iloc function on the
  • 00:11:57 data frame which stands for integer
  • 00:11:59 location so if I passed an iloc of 1
  • 00:12:03 that will give me everything that was in
  • 00:12:05 that first row could also use this to
  • 00:12:08 get myself multiple rows I could do 1 to
  • 00:12:10 4 and that would get me all of these
  • 00:12:14 rows so another way to get rows wherever
  • 00:12:18 you want in the data frame and the same
  • 00:12:22 iloc function can be used to grab a
  • 00:12:24 specific location so let's say I wanted
  • 00:12:27 to I'm gonna just change this to 0 real
  • 00:12:30 quick I wanted to get the venusaur name
  • 00:12:34 here so if we did the indexing of that
  • 00:12:37 it's on the second row and it's the 0
  • 00:12:41 first column if we're counting with
  • 00:12:44 numbers so if I wanted to just get that
  • 00:12:46 specific position we could do print df dot
  • 00:12:50 iloc and then the second row and then do
  • 00:12:55 comma 1 — the actual position — so we want
  • 00:12:57 the first position, the first column, as
  • 00:12:59 you see that gives us Venusaur building
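Those iloc calls, sketched runnably with a toy frame (note the column positions differ from the real CSV, which has the Pokedex number first):

```python
import pandas as pd

# Toy frame standing in for the Pokemon data
df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur'],
    'Type 1': ['Grass', 'Grass', 'Grass'],
    'HP': [45, 60, 80],
})

row = df.iloc[1]       # one row by integer position
rows = df.iloc[1:3]    # a range of rows (the end is exclusive)
cell = df.iloc[2, 0]   # row 2, column 0 -- here that's the Venusaur name
print(row)
print(rows)
print(cell)
```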
  • 00:13:03 up on this, one thing I often find
  • 00:13:04 myself trying to do is iterate through
  • 00:13:08 each row in my dataset as I'm reading
  • 00:13:11 yet and so to do that what I would
  • 00:13:13 recommend you do there's probably other
  • 00:13:15 ways, I'm sure there are — is do for index
  • 00:13:18 comma row in df dot iterrows —
  • 00:13:22 that is, iterate through rows —
  • 00:13:27 probably the easiest way to just go row
  • 00:13:30 by row and just access any sort of data
  • 00:13:32 you might want so I could do print index
  • 00:13:36 comma row and run that as you can see it
  • 00:13:42 didn't format it nicely for me but it's
  • 00:13:45 going first row then getting the data
  • 00:13:48 for the second row etc and one thing
  • 00:13:50 that's pretty nice about this is I could
  • 00:13:52 if I just wanted the name and the index
  • 00:13:56 for each row,
  • 00:13:57 I could iterate through and just get
  • 00:13:59 that information I don't know I find
  • 00:14:03 this pretty useful for all sorts of
  • 00:14:05 different tasks that I'm doing while
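The iterrows loop just described, as a minimal sketch (toy values):

```python
import pandas as pd

# Toy frame standing in for the Pokemon data
df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Squirtle'],
    'HP': [45, 39, 44],
})

# iterrows yields (index, row) pairs, so you can walk the data
# row by row and grab just what you need
pairs = []
for index, row in df.iterrows():
    print(index, row['Name'])
    pairs.append((index, row['Name']))
```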
  • 00:14:07 working with my data and then one
  • 00:14:10 additional function I want to get to
  • 00:14:13 right now — I'm gonna go into this in
  • 00:14:15 more depth a little bit — but in addition
  • 00:14:18 to having the iloc we also have df dot
  • 00:14:21 loc, and this is used for I guess
  • 00:14:26 finding specific data in our data set
  • 00:14:30 that isn't just integer based isn't just
  • 00:14:33 like the specific rows it's based on
  • 00:14:36 more textual information numerical
  • 00:14:38 information so one thing that's really
  • 00:14:40 cool you can do with this is I can do df dot
  • 00:14:43 loc and then I can access only the
  • 00:14:46 rows that have DF name or let's say type
  • 00:14:52 one equal to let's say fire so this
  • 00:14:57 should give us, like, Charizard
  • 00:14:58 immediately — oh gosh: Charmander, Charizard,
  • 00:15:02 the middle one — so let's run it and
  • 00:15:06 hopefully this works
  • 00:15:07 oh yeah — errors are pretty nice
  • 00:15:10 when they don't print — wow, I should
  • 00:15:12 have known this — yeah, so
  • 00:15:16 as you can see it's only giving me the
  • 00:15:18 type one that's equal to fire and I
  • 00:15:20 could do the same thing if I only wanted
  • 00:15:22 to look at the grass pokemon as you can
  • 00:15:28 see now we get Bulbasaur Ivysaur
  • 00:15:29 venusaur etc you can just keep doing
  • 00:15:33 this and you can use multiple conditions
  • 00:15:34 so this is super super powerful to do
  • 00:15:39 all sorts of conditional statements and
  • 00:15:40 I'm going to get into the more
  • 00:15:42 fancy, advanced stuff regarding this
  • 00:15:44 later on in the video
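A runnable sketch of the df.loc filtering above (toy values):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Charmeleon', 'Squirtle'],
    'Type 1': ['Grass', 'Fire', 'Fire', 'Water'],
    'HP': [45, 39, 58, 44],
})

# loc with a boolean condition keeps only the matching rows
fire = df.loc[df['Type 1'] == 'Fire']
grass = df.loc[df['Type 1'] == 'Grass']
print(fire)
print(grass)
```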
  • 00:15:46 while we're on this topic another useful
  • 00:15:47 thing we can do with our data frame is
  • 00:15:49 we can use this dot describe method
  • 00:15:52 which gives us like all the high level
  • 00:15:54 like mean standard deviation type stats
  • 00:15:57 from that so as you can see some of
  • 00:16:01 these categories it's not super super
  • 00:16:03 useful — like Pokedex number, we
  • 00:16:06 don't really care about the mean — but for
  • 00:16:07 like HP attack defense special attack
  • 00:16:11 etc it's pretty cool little method to
  • 00:16:15 use because you have all these metrics
  • 00:16:17 you can quickly just look at your data
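The describe call, sketched with toy values (on the real data you'd see count, mean, std and so on for every numeric stat column):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Squirtle'],
    'HP': [45, 39, 44],
    'Attack': [49, 52, 48],
})

# describe() summarizes the numeric columns: count, mean, std,
# min, the quartiles, and max
stats = df.describe()
print(stats)
```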
  • 00:16:18 another useful thing we can do is I just
  • 00:16:21 print the data frame again we can do
  • 00:16:26 some sorting of the values so let's say
  • 00:16:27 instead of going from first pokedex
  • 00:16:31 downwards, we could sort by, let's
  • 00:16:35 say, alphabetical name — so I could do sort
  • 00:16:37 values and then i have to pass in the
  • 00:16:40 column i want to sort so if I sorted
  • 00:16:43 values by name now I have it
  • 00:16:46 alphabetical if I wanted to make it the
  • 00:16:49 other way I could do ascending — so set
  • 00:16:52 that equal to false — so now it's
  • 00:16:54 gonna be descending, as you can see. You
  • 00:17:00 also can combine multiple columns in
  • 00:17:03 this so let's say we had sorting by type
  • 00:17:07 one and then we wanted to have our
  • 00:17:09 second sort parameter be HP — so this
  • 00:17:14 would give us all like I guess probably
  • 00:17:16 the bug pokemon because that would be
  • 00:17:17 the first alphabetical one and then it
  • 00:17:21 would give us the lowest or highest HP
  • 00:17:24 from that let's see what happens
  • 00:17:27 yeah as you can see bug and this is the
  • 00:17:30 lowest so what we could do is also pass
  • 00:17:33 it in descending and this time because
  • 00:17:35 we have two columns when you specify
  • 00:17:37 true or false for both it might you
  • 00:17:40 might be able to do this yeah we can do
  • 00:17:42 this but if you want to separate — if one
  • 00:17:47 is ascending, one's descending — we can do
  • 00:17:50 see now I got it descending but we got
  • 00:17:53 the farthest down type one so I can do
  • 00:17:56 something like this so we want the first
  • 00:18:00 one to be ascending and the second one
  • 00:18:03 to be descending so now type one will be
  • 00:18:06 going A through Z and HP will go
  • 00:18:09 from high to low as you can see so
  • 00:18:15 sorting the values is very useful as
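The sorting shown above, sketched runnably (toy values):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Squirtle', 'Bulbasaur', 'Charmander', 'Charmeleon'],
    'Type 1': ['Water', 'Grass', 'Fire', 'Fire'],
    'HP': [44, 45, 39, 58],
})

by_name = df.sort_values('Name')                        # A to Z
by_name_desc = df.sort_values('Name', ascending=False)  # Z to A

# Two sort keys: Type 1 ascending, HP descending within each type
combo = df.sort_values(['Type 1', 'HP'], ascending=[True, False])
print(combo)
```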
  • 00:18:17 well okay now that we know how to read
  • 00:18:21 our data, let's start making
  • 00:18:25 some changes to it — so let's look at our
  • 00:18:32 data again. Okay, so now that we've got this data,
  • 00:18:32 one change that I think would be cool to
  • 00:18:35 make is we have all these stat
  • 00:18:37 categories I think would be pretty cool
  • 00:18:40 to add all these stats together and
  • 00:18:43 create like a total category which kind
  • 00:18:45 of potentially could help us rank which
  • 00:18:48 Pokemon are the best so let's go ahead
  • 00:18:51 and do that and one thing that's cool
  • 00:18:53 and I guess true about most things
  • 00:18:56 programming is there's multiple ways to
  • 00:18:58 do this so we're adding a column that is
  • 00:19:00 the total of those stats so one way we
  • 00:19:04 could do it is we just go ahead and
  • 00:19:06 access our data frame and then just call
  • 00:19:11 reference it like this right now
  • 00:19:11 reference it it like this right now
  • 00:19:13 and we will say that that equals this is
  • 00:19:16 probably the easiest way to read but not
  • 00:19:20 the I guess fastest way to do it but you
  • 00:19:22 could do DF of HP plus DF of attack I'll
  • 00:19:32 just probably speed this up when I'm
  • 00:19:33 actually editing this
  • 00:19:43 okay so now we have defined this — we
  • 00:19:47 have the dataframe: total is gonna
  • 00:19:51 equal all the other columns run that I
  • 00:19:57 guess we don't see anything but if I go
  • 00:19:59 ahead and do data frame dot head five as
  • 00:20:06 we can see over here on the right side
  • 00:20:10 we have this new column name total and I
  • 00:20:15 would say I recommend always when you do
  • 00:20:18 something like this just making sure
  • 00:20:20 that what you did actually is the
  • 00:20:22 total that you're trying to get 49 plus
  • 00:20:25 65 plus 65 plus 45 because you could
  • 00:20:30 easily see that this total is a valid
  • 00:20:32 number but if you don't actually
  • 00:20:34 double-check that it's the right number
  • 00:20:36 you kind of run into a dangerous
  • 00:20:37 territory and as we can see perfect 318
  • 00:20:42 is exactly what we are looking for so
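That first, easiest-to-read way of building the Total column can be sketched like this (one toy row; the real frame has a few more columns, but the idea is identical):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bulbasaur'],
    'HP': [45], 'Attack': [49], 'Defense': [49],
    'Sp. Atk': [65], 'Sp. Def': [65], 'Speed': [45],
})

# Spell out every stat column -- verbose, but easy to read and verify
df['Total'] = (df['HP'] + df['Attack'] + df['Defense']
               + df['Sp. Atk'] + df['Sp. Def'] + df['Speed'])

# Sanity-check one row by hand: 45+49+49+65+65+45 = 318
print(df[['Name', 'Total']])
```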
  • 00:20:44 that's one way to do it another way we
  • 00:20:47 could go about doing this and actually
  • 00:20:49 because I'm using a Jupyter
  • 00:20:51 notebook actually what we might want to
  • 00:20:53 do first is drop some columns so one
  • 00:20:58 thing about Jupyter notebooks is, like, if
  • 00:21:00 I run this again, even though I've
  • 00:21:01 commented it out, it still
  • 00:21:06 has that data frame in memory — so even
  • 00:21:10 though this is commented out, it still
  • 00:21:13 has that data frame in memory, it
  • 00:21:16 doesn't remove the total even after I
  • 00:21:18 comment this out — it just stays in memory
  • 00:21:21 but so one thing we might want to do is
  • 00:21:23 drop a specific column so if I wanted to
  • 00:21:26 go ahead and drop the total column and
  • 00:21:29 just show how to do it in another way I
  • 00:21:32 could do data frame drop and then I can
  • 00:21:35 specify the columns and I'm gonna
  • 00:21:41 specify total — I'm gonna run that — hey,
  • 00:21:44 why did you not disappear? And so the
  • 00:21:47 reason this did not disappear is because
  • 00:21:49 drop doesn't actually
  • 00:21:50 directly modify the frame — it doesn't directly
  • 00:21:53 remove that column — I believe you have to
  • 00:21:55 just reset it to dataframe so I can go
  • 00:21:58 ahead and do this and we should see this
  • 00:22:00 total column here on the right side, which
  • 00:22:03 my face is blocking — we'll see this
  • 00:22:07 disappear run that yay
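The drop behavior that just tripped us up, in a sketch (toy frame):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Bulbasaur'], 'HP': [45], 'Total': [318]})

# drop returns a NEW frame rather than modifying df in place,
# which is why the column only disappears once you reassign:
df = df.drop(columns=['Total'])
print(df.columns)
```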
  • 00:22:10 so that was dropping a column so now if
  • 00:22:14 I wanted to go ahead and do the add a
  • 00:22:16 column in a different way maybe a little
  • 00:22:18 bit more succinct of a way — I can go
  • 00:22:20 ahead and do DF total that stays the
  • 00:22:23 same and then what I'm going to do this
  • 00:22:29 time is I'm going to use that iloc
  • 00:22:34 function that we learned so integer
  • 00:22:34 location I want all the rows so the
  • 00:22:37 first input is going to be the colon
  • 00:22:39 which just means all rows everything and
  • 00:22:42 then the columns I actually want to add
  • 00:22:45 together will be HP through speed so
  • 00:22:53 that will be — this is 0, 1, 2, 3, 4 — so this
  • 00:23:02 will be the fourth column, to the 5th, 6th,
  • 00:23:08 7th, 8th, 9th — to the ninth column
  • 00:23:08 and I'll run and then I can there's a
  • 00:23:10 dot sum function you can use and you
  • 00:23:13 want to specify if you're adding
  • 00:23:15 horizontally you want to specify axis
  • 00:23:17 equals 1 — if you set axis equals
  • 00:23:20 0 that would be adding vertically ok
  • 00:23:24 and we have our totals again and one
  • 00:23:28 thing you might have noticed I don't
  • 00:23:29 know if you caught this but because I
  • 00:23:33 have this 318 down here I realized that
  • 00:23:35 this 273 is actually wrong so that's why
  • 00:23:38 it's good to make that check the error I
  • 00:23:40 made was that it shouldn't end at 9 if
  • 00:23:44 we want to include this speed it
  • 00:23:45 actually has to go to the next one
  • 00:23:47 because the end parameter in lists —
  • 00:23:50 and, well, basically everything — the end
  • 00:23:53 parameter in lists is exclusive, so 10
  • 00:23:56 means the tenth column is the first one
  • 00:23:58 we don't include so every run that now
  • 00:24:00 you see that the totals are actually
  • 00:24:02 correct as we did the math down here
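The iloc-plus-sum version, sketched with a smaller toy frame (here the stats sit in columns 1 through 4, so the slice is 1:5 — remember the end is exclusive):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander'],
    'HP': [45, 39], 'Attack': [49, 52],
    'Defense': [49, 43], 'Speed': [45, 65],
})

# All rows (:), columns 1 up to but NOT including 5.
# axis=1 sums across each row; axis=0 would sum down each column.
df['Total'] = df.iloc[:, 1:5].sum(axis=1)
print(df)
```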
  • 00:24:05 the last change we'll make before we
  • 00:24:07 resave this
  • 00:24:16 as an updated CSV will be — let's say we
  • 00:24:19 didn't want this total column all the
  • 00:24:21 way over here on the right side it makes
  • 00:24:23 a little bit more sense I would say to
  • 00:24:26 be either to the left of HP — or, I
  • 00:24:30 don't know, maybe to the right of
  • 00:24:33 speed so we can do this in a few
  • 00:24:36 different ways and the way I'm gonna
  • 00:24:37 choose is probably not the most
  • 00:24:38 efficient, but it makes sense given
  • 00:24:42 what we've done already so remember that
  • 00:24:44 we could grab specific columns like this
  • 00:24:47 so if I wanted total I wanted HP and
  • 00:24:51 like defense let's say and note I can
  • 00:24:56 order these however I want so if I
  • 00:24:59 wanted to reorder my columns and then
  • 00:25:01 save it afterwards I could just do DF
  • 00:25:03 equals whatever order I choose and
  • 00:25:07 because it's a little bit annoying to
  • 00:25:09 type out all these things I'm going to
  • 00:25:11 get the columns as a list to do that I
  • 00:25:14 will do cols equals df dot columns — and you don't
  • 00:25:17 have to know why what I'm
  • 00:25:19 typing works exactly — I'm looking at the
  • 00:25:22 documentation as I do this and I
  • 00:25:28 recommend you guys do the same always
  • 00:25:29 look at the documentation there's great
  • 00:25:31 stuff here I can't get everything out
  • 00:25:33 here in this single video doing the best
  • 00:25:35 I can but definitely check the
  • 00:25:37 documentation out I'll link to that in
  • 00:25:39 the description ok so I'm getting the
  • 00:25:41 columns and instead of ordering it like
  • 00:25:44 this I'm going to do ranges so if I want
  • 00:25:47 these first four columns and then total
  • 00:25:51 and then the remaining columns
  • 00:25:53 I could do something like this — cols of
  • 00:25:57 0 to 4 —
  • 00:25:59 they'll get me the first four in the
  • 00:26:02 same order I want it, plus cols of
  • 00:26:04 negative one — that's just reverse
  • 00:26:06 indexing, getting the total here (I might
  • 00:26:09 be blocking that again)
  • 00:26:11 — and then finally the remaining
  • 00:26:17 stuff we would need to add to that would
  • 00:26:19 be four to five six seven eight nine ten
  • 00:26:24 eleven, twelve — and we include twelve
  • 00:26:26 because that would be the first
  • 00:26:28 one we actually don't include in the
  • 00:26:30 final data frame so let's see what
  • 00:26:32 happens when we do that we want to see
  • 00:26:33 this total go over here — no! — what
  • 00:26:39 happened? Okay, so why do we get this
  • 00:26:41 error can only concatenate lists not
  • 00:26:45 string to list so that's telling me
  • 00:26:48 something probably in here is messed up
  • 00:26:50 and what I'm seeing is that because this
  • 00:26:54 is a single column it's not gonna be
  • 00:26:57 is a single column it's not gonna be —
  • 00:27:00 it's just gonna be a string, so I have to
  • 00:27:02 actually wrap that in
  • 00:27:04 can go ahead and run this again and we
  • 00:27:06 wanted to see the total switched over to
  • 00:27:09 the left side and there we go it is
  • 00:27:12 there cool and one comment I want to
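The column reordering above, including the list-wrapping fix for the concatenation error, sketched with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    '#': [1], 'Name': ['Bulbasaur'], 'Type 1': ['Grass'],
    'HP': [45], 'Attack': [49], 'Total': [94],
})

cols = list(df.columns)  # column names as a plain Python list

# First few columns, then Total, then the rest. cols[-1] is a single
# STRING, so it must be wrapped in brackets before concatenating --
# that's the "can only concatenate list (not str) to list" error.
df = df[cols[0:3] + [cols[-1]] + cols[3:5]]
print(df.columns)
```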
  • 00:27:16 make as I said before this type of
  • 00:27:19 change doesn't actually really modify
  • 00:27:21 our data at all it's just kind of a
  • 00:27:24 visual thing so I didn't really care too
  • 00:27:26 too much about how I went about and did
  • 00:27:28 it but one thing I really want to note
  • 00:27:31 here is be careful when you're hard
  • 00:27:32 coding numbers in like this if your data
  • 00:27:36 is changing and you have these hard-coded
  • 00:27:40 numbers, it can trip you up — it's kind of safer just using
  • 00:27:43 actual names
  • 00:27:45 so even calculating the total like this
  • 00:27:47 is a bit dangerous so maybe instead of
  • 00:27:49 using four to ten one thing you could
  • 00:27:51 potentially do is get the index of your
  • 00:27:54 start so that would when we were doing
  • 00:27:56 this it was the index of HP and then go
  • 00:27:59 to the index of speed that would be one
  • 00:28:02 way to do it's a little bit safer I
  • 00:28:04 would say all right now that we're done
  • 00:28:06 with all of this let's move on to saving
  • 00:28:09 our new CSV so just a reminder of what
  • 00:28:12 we have in our data frame — we have all this
  • 00:28:14 information, and in these previous
  • 00:28:17 cells is where I actually defined the
  • 00:28:19 data frame — just as a reminder, this data
  • 00:28:23 frame's not coming out of
  • 00:28:24 midair — so I have this data frame and now
  • 00:28:28 I want to save this updated and let's
  • 00:28:30 start by saving it to a CSV so just like
  • 00:28:33 we had the dot read CSV we also have a
  • 00:28:37 built-in function in pandas called to
  • 00:28:39 CSV so I could just call this something
  • 00:28:42 like modified.csv — and now
  • 00:28:48 it will take whatever is in this data
  • 00:28:49 frame and output it to nice comma
  • 00:28:51 separated values format so because I got
  • 00:28:55 to this next cell we know it did that I
  • 00:28:56 can check my directory and as you can
  • 00:29:00 see there's this modified CSV and I'll
  • 00:29:03 just open that up real quick just so you
  • 00:29:04 can see it all the information is there
  • 00:29:07 load okay so we see we have all the
  • 00:29:12 stuff we wanted and this total column
  • 00:29:16 there which is cool the one thing that
  • 00:29:18 is annoying about the current state of
  • 00:29:20 this stop texting me I'm making a video
  • 00:29:24 who has the nerve okay sorry
  • 00:29:31 so the one thing that might be annoying
  • 00:29:33 is that you have all these indexes over
  • 00:29:35 here and I don't really care to have
  • 00:29:38 those so the quick fix to not save all
  • 00:29:42 these indexes with your data you can if
  • 00:29:44 you want to but you can go ahead and
  • 00:29:46 pass in the variable index equals false
  • 00:29:52 run that again and then I reopen my
  • 00:29:55 modified CSV you will see that that
  • 00:30:01 stuff is all gone so yeah now we just
  • 00:30:04 have the Pokedex number for this column to
  • 00:30:06 the left which is perfect you can also
  • 00:30:09 go ahead and there's also a built-in to
  • 00:30:14 excel function so I could if I wanted to
  • 00:30:17 save this as a excel even though right
  • 00:30:19 now we're just working with the data
  • 00:30:20 frame it's easy to output it to that
  • 00:30:22 format so to_excel we'll call this
  • 00:30:25 modified.xlsx and we can
  • 00:30:31 also make the index false here
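The save calls being demonstrated can be sketched like this. It's a minimal sketch on a tiny stand-in frame, since the full Pokémon CSV isn't bundled with this transcript; the file names match the ones used on screen, and `to_excel` needs the openpyxl package installed so it's left commented out:

```python
import pandas as pd

# tiny stand-in frame (the video uses the full Pokémon dataset)
df = pd.DataFrame({'Name': ['Bulbasaur', 'Charmander'],
                   'Type 1': ['Grass', 'Fire'],
                   'HP': [45, 39]})

# comma-separated output; index=False leaves out the 0, 1, 2, ... index column
df.to_csv('modified.csv', index=False)

# the tab-separated save shown a bit later swaps the delimiter via sep
df.to_csv('modified.txt', sep='\t', index=False)

# Excel output works the same way (requires the openpyxl package):
# df.to_excel('modified.xlsx', index=False)
```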
  • 00:30:34 run and so that well now we have these
  • 00:30:38 two modified this is the actual excel
  • 00:30:40 file I could load that but for the sake
  • 00:30:42 of time I'm not going to and then
  • 00:30:44 finally the last way we load it in three
  • 00:30:46 formats I might as well save three
  • 00:30:48 formats so the last one is what if we
  • 00:30:50 wanted to save that tab-separated file
  • 00:30:53 so we can do to_csv again I'm
  • 00:30:57 going to call this modified.txt and
  • 00:31:00 index equals false and then the one
  • 00:31:04 thing on this is there's no delimiter
  • 00:31:07 parameter for when we're doing to CSV
  • 00:31:10 which is kind of annoying but there is a
  • 00:31:14 separator parameter you can pass in and
  • 00:31:17 look at the documentation if you need to
  • 00:31:20 remember this I'm looking at the
  • 00:31:21 documentation as I speak and so I can
  • 00:31:23 specify that I want to separate it with
  • 00:31:25 tabs instead of commas since commas are
  • 00:31:28 what happens by default so run that and
  • 00:31:32 I will actually open this one up just so
  • 00:31:34 you can see modified here and if I drag
  • 00:31:38 that and you can see that all the data
  • 00:31:39 is there no indexes on the left and it's
  • 00:31:42 all separated by tabs so that looks
  • 00:31:45 pretty good alright now that we've done
  • 00:31:48 all of that let's move into some more
  • 00:31:49 advanced Panda stuff and we'll start out
  • 00:31:52 with some more advanced filtering of our
  • 00:31:54 data so just a reminder this is our data
  • 00:31:57 frame so as a first example I showed
  • 00:32:02 before was that we could specify a
  • 00:32:04 specific type for example that we want
  • 00:32:06 so I type df.loc and then we said
  • 00:32:09 df of Type 1 equals equals
  • 00:32:17 let's say grass we're only going to get
  • 00:32:20 the rows that actually have grass as
  • 00:32:23 their Type 1 so as you can see all
  • 00:32:26 these Type 1s are grass in addition
  • 00:32:30 and we can do just more so than just one
  • 00:32:34 location condition we can pass in
  • 00:32:36 multiple so I can do something like DF
  • 00:32:39 type 1 equals grass and let's say we
  • 00:32:41 wanted DF
  • 00:32:44 of Type 2 to equal poison so I can
  • 00:32:50 type it in like this run it oh no we got
  • 00:32:53 an error so the thing you got to do here
  • 00:32:57 is we have to separate our conditions
  • 00:32:59 with parentheses for whatever reason not
  • 00:33:01 quite sure why that is so here I have
  • 00:33:07 two conditions separate them by
  • 00:33:09 parentheses now as you can see we only
  • 00:33:11 have grass and poison now and one thing
  • 00:33:14 to note is usually we're typing out and
  • 00:33:17 like this but inside of our pandas
  • 00:33:20 dataframe when we're filtering we just
  • 00:33:22 do the actual ampersand sign let's say if we
  • 00:33:26 wanted type 1 equals grass or type 2
  • 00:33:30 equals poison then we could do the or
  • 00:33:33 sign like this it's a little bit
  • 00:33:35 different than you're normally used to
  • 00:33:37 just the convention of the Python pandas
  • 00:33:40 library and just look this up if you
  • 00:33:42 forget so I run that now we should have
  • 00:33:46 ones where either Type 1 is grass
  • 00:33:50 or Type 2 is poison and as you can see
  • 00:33:52 this is a bug type this is poison so I
  • 00:33:55 was able to separate those two
  • 00:33:56 conditions by an or instead of an and and
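The and/or filtering just described, as a minimal sketch on stand-in rows. The parentheses around each condition are required because & and | bind more tightly than ==, which is why the earlier attempt without them errored out:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Bulbasaur', 'Venusaur', 'Charmander', 'Weedle'],
                   'Type 1': ['Grass', 'Grass', 'Fire', 'Bug'],
                   'Type 2': ['Poison', 'Poison', None, 'Poison']})

# & means "and": both conditions must hold, each wrapped in parentheses
grass_poison = df.loc[(df['Type 1'] == 'Grass') & (df['Type 2'] == 'Poison')]

# | means "or": either condition may hold
either = df.loc[(df['Type 1'] == 'Grass') | (df['Type 2'] == 'Poison')]
```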
  • 00:34:03 we don't have to just use text
  • 00:34:04 conditions I could also add in let's say
  • 00:34:06 we wanted type 1 is equal to grass type
  • 00:34:11 2 is equal to poison and let's say we
  • 00:34:15 wanted the HP to be a fairly high value
  • 00:34:19 so just looking at these I feel like 70 is
  • 00:34:23 a good cutoff value so HP has to be
  • 00:34:26 greater than 70 I can also specify
  • 00:34:28 conditions like this and run that now
  • 00:34:35 you see we only have five rows that it
  • 00:34:38 actually filters out and you could go
  • 00:34:40 ahead if you wanted to there's a couple
  • 00:34:43 different things you can do with this so
  • 00:34:46 first if you let's say you wanted to
  • 00:34:48 make a new data frame that was just the
  • 00:34:49 filter data I could just do something
  • 00:34:51 like new DF equals this and now if I
  • 00:34:56 print out
  • 00:34:57 new df we get just those five rows but
  • 00:35:02 I could go ahead and just print out df
  • 00:35:06 and we still have everything also worth
  • 00:35:09 mentioning real quick I can easily save
  • 00:35:12 this new data frame as a new CSV kind of
  • 00:35:15 to checkpoint my work and maybe if I
  • 00:35:17 wanted to do this on many different
  • 00:35:19 filters kind of have this more specific
  • 00:35:21 CSV files that I could dive in and look
  • 00:35:24 at in more depth so like I'll call this
  • 00:35:26 something like filtered.csv if I ran
  • 00:35:30 this you'd see in here that I have this
  • 00:35:35 filtered and it contains the data that I
  • 00:35:38 just grabbed out one thing to note when
  • 00:35:41 you are filtering your data and you
  • 00:35:43 shrink down the data size is when you
  • 00:35:45 print out that data frame so I'll
  • 00:35:47 comment this out okay I can just print
  • 00:35:50 out new D F as you can see one thing
  • 00:35:54 that's weird is this is the index here
  • 00:35:56 so it goes to 350 77 652 even though
  • 00:36:00 we've filtered out our data the old
  • 00:36:02 index stayed there and that can get annoying
  • 00:36:05 if you're trying to do some additional
  • 00:36:06 processing with this new data frame so
  • 00:36:09 if you want to reset your index you can
  • 00:36:11 go new_df.reset_index and you can
  • 00:36:19 start off by just setting new_df equal
  • 00:36:24 to new_df.reset_index() now if I
  • 00:36:26 print out new DF you see that we have 0
  • 00:36:31 1 2 3 4 and by default it saves that old
  • 00:36:35 index there as a new column if you don't
  • 00:36:38 want that to happen we can modify it
  • 00:36:40 further we can do drop equals
  • 00:36:47 true so this will get rid of the old
  • 00:36:49 indices as you can see now we
  • 00:36:52 don't have that then the last thing is
  • 00:36:54 if you don't want to have to reset it to
  • 00:36:55 a new data frame you can actually do
  • 00:36:57 this in place as well which just
  • 00:37:00 probably conserves a little bit of
  • 00:37:01 memory and if I run this I don't even
  • 00:37:06 set it to a new variable it just will
  • 00:37:09 change the values within
  • 00:37:11 the given new_df and as you can see we
  • 00:37:13 got the new indexes for our filtered out
  • 00:37:17 data so that's something useful too to
  • 00:37:20 be aware of because if you're running
  • 00:37:22 through your new data frame like row by
  • 00:37:25 row and you're trying to get a specific
  • 00:37:26 spot even though it's like the fourth
  • 00:37:29 row that you see you might
  • 00:37:31 need to index like you know the
  • 00:37:33 seventy-first position and that would get really
  • 00:37:35 annoying so resetting indexes is helpful
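The reset steps just walked through, sketched on stand-in data; drop=True throws the old index away instead of keeping it as a new column, and reset_index also accepts inplace=True as shown in the video:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Bulbasaur', 'Charmander', 'Venusaur'],
                   'Type 1': ['Grass', 'Fire', 'Grass']})

new_df = df.loc[df['Type 1'] == 'Grass']   # keeps the old index: 0, 2

# drop=True discards the old index instead of saving it as a column
new_df = new_df.reset_index(drop=True)     # index is now 0, 1
```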
  • 00:37:38 in this case in addition to I guess
  • 00:37:43 equals conditions greater than less than
  • 00:37:45 etc not equals we also have other types
  • 00:37:48 of conditions we can use basically
  • 00:37:50 anything you can think of so one thing
  • 00:37:52 that I see that is kind of annoying me
  • 00:37:54 with this data is if you look in here
  • 00:37:57 maybe this is because I'm like a little
  • 00:37:59 bit outdated on my Pokemon knowledge but
  • 00:38:01 I've seen these like in mega versions of
  • 00:38:03 Pokemon and I'm not quite sure what that
  • 00:38:05 really means so let's say I wanted to
  • 00:38:08 filter out all the names that contained
  • 00:38:11 mega and it's tough to do with equal
  • 00:38:15 signs you know because contains is not
  • 00:38:17 quite equal to because we want to allow
  • 00:38:19 a lot of different things there so I
  • 00:38:21 could not allow the name to include mega
  • 00:38:24 by doing the following so I'm going to
  • 00:38:27 delete the stuff that's inside of here
  • 00:38:29 maybe I'll just comment it out so you
  • 00:38:31 can still see it but I'm going to do
  • 00:38:34 df.loc and then I'm gonna pass in df Name
  • 00:38:40 then I need to get the string accessor
  • 00:38:43 of the name this is something you should
  • 00:38:45 just kind of I guess remember about the
  • 00:38:49 contains function .str and then dot
  • 00:38:52 contains mega so if I run this you'll
  • 00:39:00 see that all of these ones are just the
  • 00:39:03 rows that include the word mega and
  • 00:39:07 then if we want to get the reverse of
  • 00:39:09 this this is another good symbol to
  • 00:39:11 remember because it's not quite what you
  • 00:39:13 would think it would be but within the
  • 00:39:15 loc function if we want to do not
  • 00:39:17 instead of what you might think would be
  • 00:39:19 the exclamation point it's actually this
  • 00:39:22 squiggly line the tilde
  • 00:39:23 so if I run this now we drop all those
  • 00:39:29 ones that had the mega so as you can see
  • 00:39:31 there's no Megas anymore in our data so
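The contains filter and its tilde negation, as a minimal sketch on stand-in names:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Venusaur', 'Mega Venusaur',
                            'Charizard', 'Mega Charizard X']})

# .str exposes string methods; contains matches a substring of the name
megas = df.loc[df['Name'].str.contains('Mega')]

# ~ (the tilde) negates the condition, not the exclamation point
no_megas = df.loc[~df['Name'].str.contains('Mega')]
```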
  • 00:39:35 that's pretty useful and taking this
  • 00:39:38 even a step farther this contains
  • 00:39:41 function I find to be very very powerful
  • 00:39:44 because in addition to just doing exact
  • 00:39:46 words we can also pass in regex
  • 00:39:51 expressions and do all sorts of like
  • 00:39:52 complicated filtering with this so let's
  • 00:39:56 say as the first example let's say
  • 00:40:00 we wanted a
  • 00:40:04 simple way to check if the Type 1 was
  • 00:40:06 either grass or fire so to do that first
  • 00:40:10 have to just import regular expressions
  • 00:40:13 and I would recommend looking into
  • 00:40:16 regular expressions if you don't know
  • 00:40:18 what they are super super powerful and
  • 00:40:19 filtering data based on certain textual
  • 00:40:22 patterns so I can do regex equals true
  • 00:40:31 and right now I'm trying to find if type
  • 00:40:33 1 is equal to fire or let's say grass so
  • 00:40:42 in a regex expression this pipe
  • 00:40:44 means or so I want it to either match
  • 00:40:46 fire or grass run that shoot it did not
  • 00:40:51 give me anything and the reason it
  • 00:40:53 didn't give me anything is because the
  • 00:40:55 capitalisation was off so this is gonna
  • 00:40:57 be another good point so see that did
  • 00:41:00 work type 1 grass type 1 fire etc but a
  • 00:41:04 probably nicer way to do this because
  • 00:41:06 you might have all sorts of funky
  • 00:41:07 capitalization is I could go ahead and
  • 00:41:10 change it back to this way but there's a
  • 00:41:12 flag that you can use so I can say Flags
  • 00:41:16 equals re dot I and that's going to be
  • 00:41:21 ignore case so I run that again as you
  • 00:41:24 can see grass and fire is grabbed even
  • 00:41:27 though I specified it without the
  • 00:41:30 capital letters one more example let's
  • 00:41:34 say I wanted to get all Pokemon names
  • 00:41:37 that started with pi so
  • 00:41:40 probably the first example you might
  • 00:41:42 think of as Pikachu but he also would
  • 00:41:44 have like Pidgeotto and probably a bunch
  • 00:41:48 of new ones that I don't know so if I
  • 00:41:50 wanted to just get data in the name
  • 00:41:53 category that started with pi I could
  • 00:41:56 use regexes to do the following I could
  • 00:41:58 do pi and then specify that I need it
  • 00:42:04 to start with pi but the next set of
  • 00:42:06 letters can be a through z and let's say
  • 00:42:12 like this star means zero or more and
  • 00:42:16 yeah this is all just regex information if
  • 00:42:19 it seems super super foreign to you look
  • 00:42:22 into regexes and if I do this we
  • 00:42:27 didn't get anything what happened that's
  • 00:42:29 because I said type 1 so if I actually
  • 00:42:31 change this to name run it as you can
  • 00:42:35 see oh we got Caterpie so I did
  • 00:42:37 something messed up with my regex but
  • 00:42:39 as you can see there's all these pi
  • 00:42:41 names in it and if I wanted to eliminate
  • 00:42:45 this from happening the pi letters being
  • 00:42:50 in the middle I can specify a start of
  • 00:42:52 line with this carrot run that now we've
  • 00:42:55 got only our names that begin with pi
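Both regex filters from this section in one minimal sketch on stand-in rows: the pipe means or, re.I ignores capitalization, and the caret anchors the pattern to the start of the name so Caterpie no longer sneaks in:

```python
import re
import pandas as pd

df = pd.DataFrame({'Name': ['Pikachu', 'Pidgeotto', 'Caterpie', 'Bulbasaur'],
                   'Type 1': ['Electric', 'Flying', 'Bug', 'Grass']})

# 'fire|grass' matches either word; flags=re.I ignores case
fire_or_grass = df.loc[df['Type 1'].str.contains('fire|grass',
                                                 flags=re.I, regex=True)]

# ^ anchors the match to the start; [a-z]* allows any letters after "pi"
pi_names = df.loc[df['Name'].str.contains('^pi[a-z]*',
                                          flags=re.I, regex=True)]
```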
  • 00:42:58 and you might find this you know there's
  • 00:43:01 many different use cases where you might
  • 00:43:02 find something like this useful to do to
  • 00:43:05 filter out your data in a kind of
  • 00:43:07 complex manner building off the
  • 00:43:08 filtering we did in the last examples we
  • 00:43:11 can actually change our data frame based
  • 00:43:14 on the conditions that we filter out by
  • 00:43:16 so let's imagine I didn't
  • 00:43:20 like the name fire for Type 1 I thought
  • 00:43:24 that you know it'd be better if it was
  • 00:43:26 named like flame flamie flamer if you're a
  • 00:43:31 fire type you're actually a flamer so
  • 00:43:33 let's make that change and I know this
  • 00:43:35 is going against Pokemon tradition but
  • 00:43:37 just to show you df.loc and we want to
  • 00:43:43 have df of Type 1
  • 00:43:50 equal equal fire and if that is the case
  • 00:43:54 well I can do if I specify with a comma
  • 00:43:58 I can specify a parameter so I'm going
  • 00:44:02 to say type 1 so this is the column I
  • 00:44:04 want and I can do equals like flamer it
  • 00:44:13 looks like something is off why does it
  • 00:44:14 look like something is off that's
  • 00:44:16 because I have an extra bracket there
  • 00:44:19 now it should be good run that don't see
  • 00:44:22 anything but if I do DF oh shoot you can
  • 00:44:27 see that now type 1 is flamer as opposed
  • 00:44:30 to fire if I wanted to change it back
  • 00:44:32 I could go fire and this is flamer now I
  • 00:44:39 have fire again we also can do like
  • 00:44:43 specify this to be some different
  • 00:44:48 different calm it doesn't have to be the
  • 00:44:49 same column we're editing so maybe you
  • 00:44:52 decided that legendary pokémon are all
  • 00:44:54 Pokemon that are of type fire and you
  • 00:44:59 can make this true in that case and as
  • 00:45:03 you can see now all the fire pokemon are
  • 00:45:05 legendary which obviously isn't true but
  • 00:45:09 it's kind of cool that we can use one
  • 00:45:10 condition to set the parameter of
  • 00:45:13 another column and I've kind of screwed
  • 00:45:18 up this data frame in general now
  • 00:45:21 because I did that so what I could do is
  • 00:45:23 use kind of my check point that was the
  • 00:45:25 modified CSV so I'm gonna just say df
  • 00:45:27 equals pd.read_csv modified dot
  • 00:45:36 CSV now I'm just kind of loading my
  • 00:45:38 check point that I had a while back yeah
  • 00:45:42 so now I fixed up the false the
  • 00:45:44 legendary by just reloading my data
  • 00:45:47 frame you can also change multiple
  • 00:45:50 parameters at the time so I'm gonna just
  • 00:45:52 do this as a demonstration but imagine
  • 00:45:54 we wanted to say like something like if
  • 00:45:57 the total
  • 00:46:00 is greater than 500 it's a pretty damn
  • 00:46:08 good Pokemon I was gonna say these
  • 00:46:13 changes that I'm gonna make here
  • 00:46:13 don't really matter but just to show you
  • 00:46:17 that certain conditions can be modified
  • 00:46:22 multiple conditions can be modified at a
  • 00:46:25 single time so if I want to modify
  • 00:46:28 multiple columns at a single time I can
  • 00:46:30 pass in a list and if I set like this to
  • 00:46:35 the test value don't worry about this
  • 00:46:38 that's what I'm saying test value so if
  • 00:46:40 the total is greater than 500 these two
  • 00:46:43 columns should instead of having their
  • 00:46:44 normal values should have test value
  • 00:46:47 let's see if that's true
  • 00:46:48 oh and I just need to print out data
  • 00:46:51 frame as you can see this total is
  • 00:46:56 greater than 500 this is greater than
  • 00:47:01 500 this is greater than 500 all of them
  • 00:47:04 modified as we wanted and another neat
  • 00:47:07 thing to know is that you can modify
  • 00:47:09 them individually as well so this could
  • 00:47:11 be tests that are like we'll say that
  • 00:47:15 this is test one and this is test two so
  • 00:47:22 now we are specifying what generation
  • 00:47:26 becomes and what legendary becomes if
  • 00:47:28 this specific condition is met as you
  • 00:47:32 can see that updated appropriately
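Conditional assignment as just demonstrated, in one minimal sketch with stand-in rows; the condition column and the edited column(s) don't have to match, and passing a list edits several columns at once:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Charmander', 'Bulbasaur', 'Mewtwo'],
                   'Type 1': ['Fire', 'Grass', 'Psychic'],
                   'Total': [309, 318, 680],
                   'Legendary': [False, False, False]})

# rows matching the condition get the new value in the named column
df.loc[df['Type 1'] == 'Fire', 'Type 1'] = 'Flamer'

# one condition can drive several columns at a single time
df.loc[df['Total'] > 500, ['Legendary', 'Type 1']] = [True, 'Test']
```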
  • 00:47:34 comment all this out real quick and just
  • 00:47:36 reload the data frame as it was
  • 00:47:42 initially
  • 00:47:49 okay so that's just I'm just resetting
  • 00:47:51 the changes cause I don't want them to stay
  • 00:47:53 but showing you that you can do these
  • 00:47:55 things then they become super super
  • 00:47:57 useful all right we're gonna end this
  • 00:47:58 video by doing something that I find
  • 00:48:00 very useful and that's using the group
  • 00:48:02 by function to help you do some
  • 00:48:05 aggregate statistics so let's start by
  • 00:48:08 loading in that check pointed CSV we
  • 00:48:11 kind of created so I modified dot CSV is
  • 00:48:14 what I'm gonna load it in and just
  • 00:48:16 reminder this is what it's looking like
  • 00:48:18 all right so with this group by function
  • 00:48:23 we can start doing some really kind of
  • 00:48:26 cool analysis on things so for example
  • 00:48:30 one thing I could do is if I wanted to
  • 00:48:32 see like the average HP and attack of
  • 00:48:36 all the Pokemon grouped by which type
  • 00:48:40 they are so like maybe trying to figure
  • 00:48:41 out like Oh our specific types like have
  • 00:48:45 better skills so maybe a rock pokémon
  • 00:48:47 would have like high defense I think
  • 00:48:50 there's a rock pokémon steel Pokemon
  • 00:48:52 would have high defense you know maybe a
  • 00:48:55 poison Pokemon would have high attack so
  • 00:48:58 we can start seeing that as a kind of
  • 00:49:00 holistic measure by using this group by
  • 00:49:02 function so I can go df.groupby and
  • 00:49:06 let's say I wanted to group by type one
  • 00:49:08 and we're gonna look for the averages of
  • 00:49:13 all the type one Pokemon so I can run
  • 00:49:16 that and here we get all the stats
  • 00:49:20 broken down by their mean sorted by what
  • 00:49:23 type one is so if I look at a bug it has
  • 00:49:29 like an average attack of 70 and make
  • 00:49:33 this even more useful I can use
  • 00:49:35 the sort function we learned
  • 00:49:36 so sort values and we'll sort on let's
  • 00:49:40 say defense and I'm gonna make this
  • 00:49:47 ascending
  • 00:49:50 equals false so I can see the
  • 00:49:51 highest defense type one and as I
  • 00:49:54 mentioned before we did kind of
  • 00:49:57 see what I was expecting to see that if
  • 00:50:00 we took the average of all steel
  • 00:50:01 pokemons they have the highest defense
  • 00:50:05 which is kind of cool it's cool to see
  • 00:50:09 that we could also instead of doing
  • 00:50:11 defense we could look at of all
  • 00:50:14 Pokemon that have a Type 1
  • 00:50:17 that's the same who has the highest
  • 00:50:19 attack and I was thinking maybe it would
  • 00:50:21 be poison but I'm not positive okay so
  • 00:50:24 we got this does make sense
  • 00:50:25 we got dragon with the highest average
  • 00:50:29 attack and a lot of the dragon pokémon
  • 00:50:31 are legendary so there's no surprise
  • 00:50:34 there that they're super powerful and
  • 00:50:36 like holy crap like a dragon would be
  • 00:50:38 very scary
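The groupby averages being explored here, as a minimal sketch on stand-in numbers:

```python
import pandas as pd

df = pd.DataFrame({'Type 1': ['Grass', 'Grass', 'Fire', 'Steel', 'Steel'],
                   'Attack': [49, 82, 52, 80, 90],
                   'Defense': [49, 83, 43, 100, 140]})

# average stats per Type 1, sorted with the highest average defense first
means = df.groupby('Type 1').mean().sort_values('Defense', ascending=False)
```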
  • 00:50:39 fighting obviously that makes sense that
  • 00:50:42 they should have good attack let's see
  • 00:50:44 who has the best HP this might be
  • 00:50:49 dragon also has the best HP but it's kind
  • 00:50:52 of cool that we can group by type 1 and
  • 00:50:54 see all these like useful stats about
  • 00:50:57 them just on that group it's a useful
  • 00:51:00 little analysis tool and I could
  • 00:51:01 additionally do something like dot sum
  • 00:51:05 so the three that
  • 00:51:07 come to my mind immediately
  • 00:51:09 there might be others but the
  • 00:51:11 aggregate statistics you can do with
  • 00:51:13 groupby are mean sum and count
  • 00:51:18 so if I did this and summed everything
  • 00:51:22 here I have like all of the HP's added
  • 00:51:25 up and you know you got to be thinking
  • 00:51:27 about why you're doing something when
  • 00:51:29 you're doing anything data science
  • 00:51:30 related in this case it doesn't really
  • 00:51:31 make sense for me to sum up these
  • 00:51:33 properties because like I could sum up
  • 00:51:36 and see that one's like way higher than
  • 00:51:38 another but because you don't know how
  • 00:51:40 many Type 1s are bug or Type 1s
  • 00:51:44 are dark
  • 00:51:45 etc this aggregate sum doesn't make
  • 00:51:49 sense in this context but you can do it
  • 00:51:52 then you also then you also have
  • 00:51:57 count so if I run this we have all the
  • 00:52:06 counts of the different Pokemon
  • 00:52:10 that are each Type 1 so 69 are bug 31 are
  • 00:52:14 dark 32 or dragon type 1 and if you want
  • 00:52:19 to clean this up a little bit like you
  • 00:52:20 have a lot of the same values everywhere
  • 00:52:23 basically it's any time it's a non zero
  • 00:52:27 non false number or I think just yeah
  • 00:52:30 any time the row is filled in so like
  • 00:52:34 the reason this is 52 right here is
  • 00:52:36 because the type 2 is just blank so it
  • 00:52:38 didn't count those so if you
  • 00:52:43 wanted to just have like a clean this is
  • 00:52:45 the count you could add df
  • 00:52:49 of count equals 1 so basically what I'm
  • 00:52:54 doing is I'm filling in a column added
  • 00:52:56 to the data frame and I'll show you this
  • 00:52:59 that's just a one for every row as you
  • 00:53:03 can see on the right side where my face
  • 00:53:07 is normally blocking there's this one
  • 00:53:10 here so now what I could do is I could
  • 00:53:16 do that same group by count right oh
  • 00:53:21 shoot it's counting every column and I get all
  • 00:53:26 of this back but if I want to just make
  • 00:53:28 my life easier I can do just get the
  • 00:53:31 count column and now I have this useful
  • 00:53:38 little set where there's 69 bug 31 dark
  • 00:53:44 32 dragon etc this is a little bit
  • 00:53:47 easier to read now
  • 00:53:48 and now that I have it in an easier to
  • 00:53:50 read format
  • 00:53:51 I could also group by multiple
  • 00:53:53 parameters at the same time so I could
  • 00:53:55 do type 1 and type 2 so looking at all
  • 00:53:59 the subsets of Type 1 bug that have a
  • 00:54:04 Type 2 of electric that have a Type 2 of
  • 00:54:07 fighting etc I can do all sorts of counts
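The count-column trick and the multi-column groupby, sketched on stand-in rows; note that groupby drops rows whose key is blank, which is also why the raw counts differed from column to column:

```python
import pandas as pd

df = pd.DataFrame({'Type 1': ['Bug', 'Bug', 'Bug', 'Dark'],
                   'Type 2': ['Poison', 'Poison', None, None]})

# count() only tallies non-blank cells, so a column that is always
# filled in gives one clean number per group
df['count'] = 1
per_type = df.groupby('Type 1').count()['count']

# grouping by a list of columns counts every Type 1 / Type 2 combination
per_combo = df.groupby(['Type 1', 'Type 2']).count()['count']
```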
  • 00:54:10 and this gets really useful if you're
  • 00:54:12 working with a really really massive
  • 00:54:15 data set and I'll get into that quickly
  • 00:54:18 I don't know if I'll go through the full
  • 00:54:19 example but imagine you had a really
  • 00:54:21 really big data set and you couldn't
  • 00:54:25 even load it all into one data frame
  • 00:54:26 this groupby and like count and sum is
  • 00:54:29 super useful because you can take your
  • 00:54:33 data frame and kind of squish it so if
  • 00:54:35 you're like tracking some number of like
  • 00:54:37 how many times an event occurs in your
  • 00:54:39 data frame you can use this groupby and
  • 00:54:42 like count things and then kind of
  • 00:54:45 squeeze your data frame make it smaller
  • 00:54:47 based on this group by function if that
  • 00:54:51 made any sense so I'm not gonna show you
  • 00:54:54 guys exactly just because we've been
  • 00:54:55 working with the data set that's not
  • 00:54:56 that small I don't really feel like
  • 00:54:57 bringing in a new data set right at the
  • 00:54:59 end but imagine you're working with a
  • 00:55:01 data set a file that's on the order of
  • 00:55:04 like 20 gigabytes it's pretty dang big
  • 00:55:06 and you don't really know how to best
  • 00:55:09 process it one thing that's really
  • 00:55:11 useful about the Python pandas library
  • 00:55:13 is it allows you to read in like a file
  • 00:55:16 like that you can read it in chunks at a
  • 00:55:19 time so instead of reading it into all
  • 00:55:21 20 gigabytes because now unless you have
  • 00:55:23 a very very powerful machine you're not
  • 00:55:25 going to be able to load all of that
  • 00:55:26 into memory you can load it in let's say
  • 00:55:29 100 megabytes at a time and so normally
  • 00:55:33 when we are reading the CSV we would do
  • 00:55:37 like pd.read_csv modified.csv the file
  • 00:55:43 I'm using right now and that would load
  • 00:55:46 everything in so instead of doing that
  • 00:55:49 what we can do is we can pass in this
  • 00:55:52 chunk size parameter so I'm going to
  • 00:55:56 just say for now chunk size equals 5
  • 00:55:58 just for the example and that means 5
  • 00:56:01 rows are being passed in at a time so if
  • 00:56:04 I did for DF in PD read CSV that means
  • 00:56:09 that my DF right here would be 5 rows of
  • 00:56:13 my total data set modified CSV and
  • 00:56:17 because this is rows and you might
  • 00:56:19 rather think a bit in like terms of
  • 00:56:21 memory size you can do a little bit of
  • 00:56:23 math with the rows to figure out how
  • 00:56:25 much memory that will actually be taking
  • 00:56:27 if you think every row is probably like
  • 00:56:30 maybe 10 or 20 bytes shouldn't be much
  • 00:56:34 more than that you can do some math on
  • 00:56:36 how big this is and if you run into an
  • 00:56:38 error like that you don't
  • 00:56:40 have enough memory you can always shrink
  • 00:56:41 this chunk size so if we're working with a
  • 00:56:44 really big data set we might set you
  • 00:56:48 know our chunk size to a hundred
  • 00:56:49 thousand rows at a time which is a lot
  • 00:56:51 of rows but nowhere near how much that
  • 00:56:53 full 20 gigabytes would be but for our
  • 00:56:55 example we're just loading again five
  • 00:56:58 rows at a time and I can show you that
  • 00:57:01 that is happening so print like chunk
  • 00:57:06 data frame and then the data frame just
  • 00:57:09 to see how it's working so we have the
  • 00:57:14 first data frame and as you can see it's
  • 00:57:16 five rows second data frame the next
  • 00:57:20 five rows third data frame the third set
  • 00:57:23 of five rows etc so this loaded in the
  • 00:57:27 data frame but in chunks of five so
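Chunked reading as just shown; a small stand-in file is written first, since the real modified.csv isn't bundled with this transcript:

```python
import pandas as pd

# write a 12-row stand-in file to read back in pieces
pd.DataFrame({'Name': [f'poke{i}' for i in range(12)]}).to_csv(
    'chunk_demo.csv', index=False)

# chunksize=5 yields the file five rows at a time instead of all at once
sizes = []
for chunk in pd.read_csv('chunk_demo.csv', chunksize=5):
    sizes.append(len(chunk))   # 5, 5, then the leftover 2
```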
  • 00:57:31 what's useful with like the aggregate
  • 00:57:33 stuff we were just going through is you
  • 00:57:35 could also like define some new data
  • 00:57:37 frame equals let's say pd.DataFrame and
  • 00:57:43 you can give it like the same columns as
  • 00:57:46 you had in your original data frame this
  • 00:57:50 would just create a new data frame
  • 00:57:51 that's empty
  • 00:57:52 with the same column names basically
  • 00:57:55 what you can do is you could let's say
  • 00:57:58 like do df.groupby Type 1 let's say
  • 00:58:08 and get like the count of that stored in
  • 00:58:13 results and what you can do here is with
  • 00:58:17 that new data frame you defined you
  • 00:58:19 could do something like you can use the
  • 00:58:21 concat function of pandas which just
  • 00:58:24 appends two data frames together and you
  • 00:58:26 could do something like new data
  • 00:58:29 frame equals pd.concat
  • 00:58:32 of the new data frame and results so
  • 00:58:37 basically what this would do is
  • 00:58:38 always take your new data frame as you
  • 00:58:40 go through chunks append on results and
  • 00:58:43 store it back to new data frame so as
  • 00:58:46 you did this as you did more iterations
  • 00:58:48 you be building this new data frame of
  • 00:58:51 all the information in your original
  • 00:58:53 really really large data set but because
  • 00:58:57 each chunk you're like aggregating
  • 00:58:59 doing some sort of groupby and count
  • 00:59:01 you're shrinking that data size down so
  • 00:59:03 that this final new data frame has the
  • 00:59:06 meaning that comes out of that big
  • 00:59:08 original data frame but it's a lot
  • 00:59:11 smaller you can actually do more
  • 00:59:12 analysis now on this shrunken-down
  • 00:59:15 new data frame
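The squeeze-while-you-stream pattern just described, as a minimal sketch; a tiny stand-in file plays the role of the 20-gigabyte one, and a final groupby merges the partial counts since a group can appear in more than one chunk:

```python
import pandas as pd

# stand-in for the huge file: 4 Bug rows and 2 Dark rows
pd.DataFrame({'Type 1': ['Bug'] * 4 + ['Dark'] * 2,
              'count': [1] * 6}).to_csv('big_demo.csv', index=False)

# empty frame with the same columns, to accumulate per-chunk results into
new_df = pd.DataFrame(columns=['Type 1', 'count'])

for chunk in pd.read_csv('big_demo.csv', chunksize=3):
    results = chunk.groupby('Type 1').count().reset_index()
    new_df = pd.concat([new_df, results])

# chunks can repeat a group, so one last groupby merges the partial counts
totals = new_df.groupby('Type 1').sum()['count']
```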
  • 00:59:18 hopefully that makes sense if you need
  • 00:59:19 me to clarify this just leave me a
  • 00:59:21 comment down below and I'll try to clear
  • 00:59:24 things up regarding that all right
  • 00:59:26 that's all I'm gonna do in this video
  • 00:59:28 hopefully you kind of
  • 00:59:30 feel like you have control of the pandas
  • 00:59:32 library now if you felt like you learned
  • 00:59:34 something make sure to hit that
  • 00:59:35 subscribe button it would mean a lot to
  • 00:59:36 me
  • 00:59:37 I'm gonna build off of this video in
  • 00:59:39 future videos such as like plotting
  • 00:59:41 stuff in our data frames and you know
  • 00:59:44 kind of doing some advanced stuff using
  • 00:59:46 like regular expressions I don't know if
  • 00:59:48 it will all be specifically pandas
  • 00:59:50 but a lot of useful information that you
  • 00:59:52 can take your panda skills and build off
  • 00:59:54 of so subscribe for all of that if you
  • 00:59:57 have any questions about anything I
  • 00:59:58 covered in the video you leave a comment
  • 00:59:59 down below and I'll try to help you out
  • 01:00:01 and clarify and also if there's any like
  • 01:00:04 additional features you would love to
  • 01:00:06 see in pandas that I've missed leave a
  • 01:00:08 comment down below what that is and
  • 01:00:10 maybe I'll make a follow-up part two to
  • 01:00:12 this video all right that's all I got
  • 01:00:14 thank you guys again for watching and
  • 01:00:16 peace out