- 00:00:00 what's up guys and welcome back to
- 00:00:01 another video in this video we are not
- 00:00:03 talking about the fluffy animal or the
- 00:00:05 song by Desiigner but instead we're gonna
- 00:00:07 dive into the pandas library of Python
- 00:00:10 which I find to be one of the most
- 00:00:12 useful libraries when you're doing
- 00:00:14 anything data science related in Python
- 00:00:16 so this video will be a good standalone
- 00:00:18 video if you've never done anything with
- 00:00:20 pandas kind of going from zero to like
- 00:00:22 fairly comfortable in one sitting it
- 00:00:25 also is a good video if you have you
- 00:00:28 know some Python pandas experience but
- 00:00:31 you're looking to try to figure out how
- 00:00:32 to do something specific if you're in
- 00:00:35 that second case look down in the
- 00:00:36 comments and I'll pin a timeline of what
- 00:00:41 we're doing this video so you can find
- 00:00:42 exactly what you're looking for quickly
- 00:00:44 so one question you might have as you
- 00:00:46 watch this video is like why pandas a
- 00:00:48 lot of the stuff you'll see me doing you
- 00:00:51 probably could replicate in Excel but
- 00:00:53 there are specific benefits to using
- 00:00:55 pandas and the first benefit is you have
- 00:00:59 a lot more flexibility with pandas and
- 00:01:01 using Python in general like what you
- 00:01:03 can do in Excel I think is very limited
- 00:01:06 compared to what you can do using the
- 00:01:07 whole Python programming language so
- 00:01:11 like the flexibility that Python offers
- 00:01:13 is reason one to use pandas and then the
- 00:01:16 second reason this is also very
- 00:01:17 important reason is pandas allows you to
- 00:01:20 work with a lot larger data sets than
- 00:01:24 Excel does Excel really kind of
- 00:01:26 struggles once you start loading in
- 00:01:27 really large files so the second reason
- 00:01:29 of why pandas is you can work with big
- 00:01:32 big data if you're finding this video
- 00:01:36 useful don't forget to click that
- 00:01:37 subscribe button because I'll be making
- 00:01:39 a lot more tutorials building on this
- 00:01:41 type of stuff in the future ok to begin
- 00:01:44 this video you should already have
- 00:01:45 Python 3 installed and then you need to
- 00:01:47 open up a terminal window and type in
- 00:01:50 pip install pandas if you don't already
- 00:01:52 have the library as you can see I
- 00:01:57 already have it once you have the
- 00:01:59 library we can actually begin loading in
- 00:02:03 data super quickly so I just want to
- 00:02:04 dive into the data right away and we'll
- 00:02:06 use that data to kind of learn
- 00:02:08 everything we need to know regarding
- 00:02:10 this library so I have a link in the
- 00:02:12 description
- 00:02:13 to my github page where I have a CSV
- 00:02:16 of data that we're going to be using for
- 00:02:18 this video so go to my github page and
- 00:02:21 then this data is going to be on Pokemon
- 00:02:24 I found this data on kaggle it's like a
- 00:02:27 good open source machine learning
- 00:02:29 website that you can kind of like do all
- 00:02:31 sorts of challenges and I thought it was
- 00:02:33 perfect for an introductory video on
- 00:02:35 pandas so you don't have to be a huge
- 00:02:38 fan of Pokemon but it's a great data set
- 00:02:40 to get started so click on the CSV
- 00:02:42 version and that's kind of the most
- 00:02:44 important one as you can see you can
- 00:02:46 kind of get a feel for what's in this
- 00:02:47 data so we have all the different
- 00:02:49 Pokemon and then all of their kind of
- 00:02:52 stats and we'll be doing all sorts of
- 00:02:55 manipulations and doing all sorts of
- 00:02:57 analysis on this data throughout the
- 00:02:59 video but I want you to click on raw and
- 00:03:02 then once you have the raw file you can
- 00:03:05 just save as I called my Pokemon data
- 00:03:09 and I save it as a CSV CSV is important
- 00:03:12 for loading it in properly but you can
- 00:03:16 name it whatever so pokemon data dot
- 00:03:18 csv and one thing to note is
- 00:03:20 wherever you're writing your code you
- 00:03:21 should save this data in the same exact
- 00:03:23 directory just so it's easy to load in
- 00:03:27 these files okay once you have the data
- 00:03:32 saved locally open up your favorite text
- 00:03:34 editor for the purposes of this video
- 00:03:37 I'm going to be using Jupyter notebooks
- 00:03:38 because I like using that for data
- 00:03:41 science related stuff but you can use
- 00:03:42 sublime text pycharm whatever you like
- 00:03:45 to write your code in and I'm going to
- 00:03:47 just clear all this that I have on the
- 00:03:48 screen right now okay so the first thing
- 00:03:51 we're gonna do is load data into
- 00:03:54 pandas so we have this CSV and you can
- 00:03:57 open up the CSV and look it's exactly
- 00:04:00 what you saw on this page so this is
- 00:04:06 what we're gonna load into the pandas
- 00:04:08 library and we're gonna load it in and
- 00:04:09 what is called a data frame so it's
- 00:04:12 super important you know everything
- 00:04:15 about a data frame but that's kind of
- 00:04:16 what the object type is that
- 00:04:18 pandas allows you to manipulate
- 00:04:20 everything with okay so the first thing
- 00:04:24 we need to do is type in import pandas
- 00:04:28 to get the library and usually what
- 00:04:30 you'll see is it's kind of annoying to
- 00:04:32 have to reference pandas every time you
- 00:04:34 type anything in that uses it so we
- 00:04:37 usually import it as pandas as PD so
- 00:04:40 just do that and then to quickly get our
- 00:04:43 data loaded in we're going to say pokey
- 00:04:47 meaning Pokemon or maybe I can just call
- 00:04:49 this like DF for data frame equals PD
- 00:04:54 and then there's this really useful
- 00:04:55 function called read CSV and then you
- 00:04:59 have to pass in the path to that CSV and
- 00:05:01 if you put the CSV in the same
- 00:05:05 folder or in the same location that you're
- 00:05:07 writing your code you can just do the
- 00:05:11 name of the file dot CSV if I run this
- 00:05:16 it loaded it in and you can't see that
- 00:05:19 it loaded it in but if I went ahead and
- 00:05:21 did print DF you can see that all that
- 00:05:26 data is there in that DF variable and if
- 00:05:30 you don't want to load in all of the
- 00:05:31 data you can use the there's these two
- 00:05:34 useful functions to look at just the top
- 00:05:36 of the data and just the bottom so I
- 00:05:37 could do DF head and then I could
- 00:05:40 specify a number of rows so I'm going to
- 00:05:41 just say three for now think the default
- 00:05:44 if you didn't put that three in there is
- 00:05:45 five so you see I just now can see the
- 00:05:48 top three rows and it's a little bit
- 00:05:50 easier to read my data using that and I
- 00:05:53 also could do if I wanted to see the
- 00:05:55 bottom three rows I could do tail three
- 00:05:58 and as you can see the index has
- 00:06:00 changed to the bottom we got those bottom
- 00:06:03 rows okay I'm going to just comment this
- 00:06:05 out real quick I also want to show you
- 00:06:07 that if you don't have your your data in
- 00:06:10 a CSV format that's fine we can also
- 00:06:13 very easily load in Excel files or tab
- 00:06:18 separated files so on that github page I
- 00:06:22 also just for the sake of practice
- 00:06:25 included this same exact file in txt
- 00:06:29 format which is a tab
- 00:06:31 separated format as you can see in the
- 00:06:34 Excel format so if you want to try this
- 00:06:36 or you have a set of data that you're
- 00:06:38 trying to manipulate you can also do
- 00:06:41 I'll load those files in so I can do PD
- 00:06:44 dot read Excel that's another built-in
- 00:06:49 function to pandas and my excel file was
- 00:06:51 like Pokemon data
- 00:06:54 xlsx I believe just check yeah I don't
- 00:07:00 know yeah I think that's the extension
- 00:07:01 we'll get an error if not I'll comment
- 00:07:04 this line out too and I can do a print
- 00:07:07 of DF xlsx dot head three
- 00:07:14 as you can see that same data is read in
- 00:07:17 from that excel file and then the last
- 00:07:21 thing we can try to do I'll move this
- 00:07:24 just so it's a little bit cleaner just
- 00:07:28 down here comment it out real quick
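The loading and peeking steps so far can be sketched together like this — a minimal sketch that uses an in-memory CSV (via io.StringIO) as a stand-in for the real pokemon data files, with made-up rows and illustrative column names:

```python
import io
import pandas as pd

# Stand-in for pokemon_data.csv; with the real files you would write
# df = pd.read_csv('pokemon_data.csv') or pd.read_excel('pokemon_data.xlsx').
csv_text = """#,Name,Type 1,HP
1,Bulbasaur,Grass,45
2,Ivysaur,Grass,60
3,Venusaur,Grass,80
4,Charmander,Fire,39
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(3))  # top three rows (head and tail default to five)
print(df.tail(3))  # bottom three rows
```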
- 00:07:31 I can also load in that tab separated
- 00:07:33 file so this one's a little bit
- 00:07:36 different I can do PD read CSV Pokemon
- 00:07:42 data dot txt and watch what happens when
- 00:07:46 I run this it's probably not giving me
- 00:07:48 an error let's see oh it didn't print
- 00:07:50 yeah let's see what happens I think it's
- 00:07:53 gonna yeah it loaded it all in is like
- 00:07:55 this one single column so the difference
- 00:07:59 with this tab separated file and just to
- 00:08:02 remind you what this looks like just
- 00:08:04 instead of having commas that's
- 00:08:06 separating the different columns its
- 00:08:08 tabs we need to in our read CSV function
- 00:08:13 specify a delimiter it's actually
- 00:08:16 separating them in this case it's a tab
- 00:08:18 which is specified by backslash t and yeah
- 00:08:27 look at that we have the columns in the
- 00:08:30 way they were looking when we were just
- 00:08:32 doing the CSV also note for this TSV the
- 00:08:36 tab separated file you could change this
- 00:08:38 to anything that was actually separating
- 00:08:41 your column so if like let's say for
- 00:08:43 whatever reason you had three exes
- 00:08:45 separating your columns you would set
- 00:08:48 delimiter equals xxx all right let's
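A minimal sketch of reading a tab-separated file; pd.read_csv takes the separator as the sep parameter (delimiter is an accepted alias), and the same in-memory stand-in trick is used here in place of pokemon_data.txt:

```python
import io
import pandas as pd

# Stand-in for a tab-separated pokemon_data.txt file.
tsv_text = "Name\tType 1\tHP\nBulbasaur\tGrass\t45\nCharmander\tFire\t39\n"

# Without sep='\t' everything would land in one single column.
df_tab = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(df_tab.columns.tolist())
```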
- 00:08:50 move on to the next thing and that's
- 00:08:51 going to be actually reading our data
- 00:08:53 easily within the pandas framework so
- 00:08:57 the first thing is reading the headers
- 00:08:59 so in our data we have several headers
- 00:09:01 and we can figure out what those are by
- 00:09:04 doing a print D F dot columns so if we
- 00:09:08 want the headers we just do DF columns
- 00:09:11 as you can see there's the Pokemon
- 00:09:15 number or the Pokedex number I think
- 00:09:18 it's been a little while since I've
- 00:09:21 refreshed my Pokemon scale is the name
- 00:09:23 of the Pokemon but typed the two types
- 00:09:26 and all of the stats information is
- 00:09:29 whether or not they're legendary so
- 00:09:30 these are all the columns we can work
- 00:09:31 with I'll just not print that for
- 00:09:36 now also this is a Jupyter notebook I'll
- 00:09:39 save this on the github page once I'm
- 00:09:41 finished with the video so you can also
- 00:09:43 look at this if you use Jupyter
- 00:09:45 notebooks follow along with this if you
- 00:09:48 just download it from the github page or
- 00:09:51 clone it alright so now that we know our
- 00:09:54 columns let's read a specific column so
- 00:09:57 to do that we have our data frame still
- 00:09:59 that we loaded up here and I can do DF
- 00:10:03 dot let's say I wanted to get the name
- 00:10:08 of the Pokemon so if I just did print
- 00:10:13 DF of name and ran that as you can see I
- 00:10:17 get all the Pokemon and it does actually
- 00:10:20 abbreviate it just so I'm not printing
- 00:10:22 out like 800 different things so that
- 00:10:25 gives me that and I could also specify
- 00:10:27 that I only wanted like 0-5 probably by
- 00:10:32 doing this yes now I just get
- 00:10:34 the top five names one thing that's
- 00:10:39 interesting you could also do DF name
- 00:10:42 like this this doesn't really work for
- 00:10:44 to word names but you could also get the
- 00:10:48 names like that I usually just do it
- 00:10:51 in the using the brackets and if you
- 00:10:57 want to get multiple columns at the same
- 00:11:00 time you can change this just single
- 00:11:02 word to a list of column names so I'm
- 00:11:05 inserting a list here and then separating
- 00:11:08 it by commas so name say type 1 and
- 00:11:12 let's say HP we're getting 3
- 00:11:15 different columns and they aren't even
- 00:11:16 all in order so it's kind of nice so
- 00:11:19 that you can get so if you want to look
- 00:11:20 at specific things and not be cluttered
- 00:11:23 with so much extra stuff you can do that
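The column-reading steps above can be sketched like this, on a tiny stand-in DataFrame (the real column names come from the Pokemon CSV):

```python
import pandas as pd

# Tiny stand-in DataFrame with a few illustrative rows.
df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur'],
    'Type 1': ['Grass', 'Grass', 'Grass'],
    'HP': [45, 60, 80],
})

print(df['Name'])                    # one column by name
print(df['Name'][0:2])               # just the first couple of entries
print(df[['Name', 'Type 1', 'HP']])  # several columns, in any order you like
# df.Name also works, but not for names with a space like 'Type 1'
```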
- 00:11:26 moving on to printing each row that's
- 00:11:29 just comment this out real quick
- 00:11:31 probably the easiest way to print out
- 00:11:32 each row so I'm going to just show you
- 00:11:35 remind you what's in our actual data set
- 00:11:38 again so let's print out the first four
- 00:11:39 rows and let's say we wanted to print
- 00:11:42 out this first row index of one row so I
- 00:11:46 guess this is actually this zeroth row
- 00:11:48 so that it has Ivysaur grass poison etc
- 00:11:51 and it if we want to adjust that row we
- 00:11:53 can use this iloc function on the
- 00:11:57 data frame which stands for integer
- 00:11:59 location so if I passed an iloc of 1
- 00:12:03 that will give me everything that was in
- 00:12:05 that first row could also use this to
- 00:12:08 get myself multiple rows I could do 1 to
- 00:12:10 4 and that would get me all of these
- 00:12:14 rows so another way to get rows wherever
- 00:12:18 you want in the data frame and the same
- 00:12:22 iloc function can be used to grab a
- 00:12:24 specific location so let's say I wanted
- 00:12:27 to I'm gonna just change this to 0 real
- 00:12:30 quick I wanted to get the venusaur name
- 00:12:34 here so if we did the indexing of that
- 00:12:37 it's on the second row and it's the 0
- 00:12:41 first column if we're counting with
- 00:12:44 numbers so if I wanted to just get that
- 00:12:46 specific position we could do print DF
- 00:12:50 dot iloc and then the second row and then do
- 00:12:55 comma the actual position 1 so we want
- 00:12:57 the first position the first column as
- 00:12:59 you see that gives us Venusaur building
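The iloc usage described above can be sketched like this, again on a small stand-in DataFrame:

```python
import pandas as pd

# Stand-in DataFrame; rows and columns are illustrative.
df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur'],
    'Type 1': ['Grass', 'Grass', 'Grass'],
})

print(df.iloc[1])     # everything in the second row (integer location 1)
print(df.iloc[1:3])   # rows 1 and 2; the end of the slice is exclusive
print(df.iloc[2, 0])  # row 2, column 0 -> a single cell
```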
- 00:13:03 up on this one thing I often find
- 00:13:04 myself trying to do is iterate through
- 00:13:08 each row in my dataset as I'm reading
- 00:13:11 it and so to do that what I would
- 00:13:13 recommend you do there's probably other
- 00:13:15 ways I'm sure there are is do for index
- 00:13:18 comma row in DF dot iterrows
- 00:13:22 iterrows that is iterate through rows
- 00:13:27 probably the easiest way to just go row
- 00:13:30 by row and just access any sort of data
- 00:13:32 you might want so I could do print index
- 00:13:36 comma row and run that as you can see it
- 00:13:42 didn't format it nicely for me but it's
- 00:13:45 going first row then getting the data
- 00:13:48 for the second row etc and one thing
- 00:13:50 that's pretty nice about this is I could
- 00:13:52 if I just wanted the name and the index
- 00:13:56 for each row
- 00:13:57 I could iterate through and just get
- 00:13:59 that information I don't know I find
- 00:14:03 this pretty useful for all sorts of
- 00:14:05 different tasks that I'm doing while
- 00:14:07 working with my data and then one
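The iterrows pattern can be sketched like this, on a two-row stand-in:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Bulbasaur', 'Charmander'], 'HP': [45, 39]})

# iterrows yields (index, row) pairs; each row is a Series indexed by column name.
for index, row in df.iterrows():
    print(index, row['Name'])
```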
- 00:14:10 additional function I want to get to
- 00:14:13 right now I'm gonna go into this and
- 00:14:15 more depth a little bit but in addition
- 00:14:18 to having the I look we also have DF
- 00:14:21 dot loc and this is used for I guess
- 00:14:26 finding specific data in our data set
- 00:14:30 that isn't just integer based isn't just
- 00:14:33 like the specific rows it's based on
- 00:14:36 more textual information numerical
- 00:14:38 information so one thing that's really
- 00:14:40 cool you can do with this is I can do DF
- 00:14:43 dot loc and then I can access only the
- 00:14:46 rows that have DF name or let's say type
- 00:14:52 one equal to let's say fire so this
- 00:14:57 should give us Charmander Charizard and
- 00:14:58 the middle one Charmeleon
- 00:15:02 so let's run it and
- 00:15:06 hopefully this works
- 00:15:07 oh yeah there they are pretty nice
- 00:15:16 yeah so
- 00:15:16 as you can see it's only giving me the
- 00:15:18 type one that's equal to fire and I
- 00:15:20 could do the same thing if I only wanted
- 00:15:22 to look at the grass pokemon as you can
- 00:15:28 see now we get Bulbasaur Ivysaur
- 00:15:29 venusaur etc you can just keep doing
- 00:15:33 this and you can use multiple conditions
- 00:15:34 so this is super super powerful to do
- 00:15:39 all sorts of conditional statements and
- 00:15:40 I'm going to get into this the more
- 00:15:42 fancy advanced stuff with regarding this
- 00:15:44 later on in the video
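The basic loc filter can be sketched like this, with a stand-in DataFrame in place of the Pokemon data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Charizard'],
    'Type 1': ['Grass', 'Fire', 'Fire'],
})

# loc with a boolean condition keeps only the matching rows.
fire = df.loc[df['Type 1'] == 'Fire']
print(fire)
```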
- 00:15:46 while we're on this topic another useful
- 00:15:47 thing we can do with our data frame is
- 00:15:49 we can use this dot describe method
- 00:15:52 which gives us like all the high level
- 00:15:54 like mean standard deviation type stats
- 00:15:57 from that so as you can see some of
- 00:16:01 these categories it's not super super
- 00:16:03 useful like pokedex number it doesn't we
- 00:16:06 don't really care about the mean but for
- 00:16:07 like HP attack defense special attack
- 00:16:11 etc it's pretty cool little method to
- 00:16:15 use because you have all these metrics
- 00:16:17 you can quickly just look at your data
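A quick sketch of describe on a couple of made-up numeric columns:

```python
import pandas as pd

df = pd.DataFrame({'HP': [45, 39, 78], 'Attack': [49, 52, 84]})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
print(df.describe())
```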
- 00:16:18 another useful thing we can do is I just
- 00:16:21 print the data frame again we can do
- 00:16:26 some sorting of the values so let's say
- 00:16:27 instead of going from first pokedex
- 00:16:31 downwards we could sort by let's
- 00:16:35 say alphabetical name so i could do sort
- 00:16:37 values and then i have to pass in the
- 00:16:40 column i want to sort so if I sorted
- 00:16:43 values by name now I have it
- 00:16:46 alphabetical if I wanted to make it the
- 00:16:49 other way I could do ascending and set
- 00:16:52 that equal to false so now it's
- 00:16:54 gonna be descending as you can see you
- 00:17:00 also can combine multiple columns in
- 00:17:03 this so let's say we had sorting by type
- 00:17:07 one and then we wanted to have our
- 00:17:09 second sort parameter be by HP so this
- 00:17:14 would give us all like I guess probably
- 00:17:16 the bug pokemon because that would be
- 00:17:17 the first alphabetical one and then it
- 00:17:21 would give us the lowest or highest HP
- 00:17:24 from that let's see what happens
- 00:17:27 yeah as you can see bug and this is the
- 00:17:30 lowest so what we could do is also pass
- 00:17:33 it in descending and this time because
- 00:17:35 we have two columns when you specify
- 00:17:37 true or false for both it might you
- 00:17:40 might be able to do this yeah we can do
- 00:17:42 this but if you want to separate if one
- 00:17:47 is extending one's descending we can do
- 00:17:50 see now I got it descending but we got
- 00:17:53 the farthest down type one so I can do
- 00:17:56 something like this so we want the first
- 00:18:00 one to be ascending and the second one
- 00:18:03 to be descending so now type one will be
- 00:18:06 going a through Z and HP will go
- 00:18:09 from high to low as you can see so
- 00:18:15 sorting the values is very useful as
- 00:18:17 well okay now that we know how to read
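The sorting steps can be sketched together like this on a stand-in DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Caterpie', 'Bulbasaur', 'Weedle'],
    'Type 1': ['Bug', 'Grass', 'Bug'],
    'HP': [45, 45, 40],
})

print(df.sort_values('Name'))                   # alphabetical by one column
print(df.sort_values('Name', ascending=False))  # reverse alphabetical
# Two sort keys with a per-key direction: Type 1 A-Z, then HP high to low.
print(df.sort_values(['Type 1', 'HP'], ascending=[True, False]))
```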
- 00:18:19 our data let's start making
- 00:18:21 some changes to it so let's look at our
- 00:18:25 data again okay so looking at this data
- 00:18:32 one change that I think would be cool to
- 00:18:35 make is we have all these stat
- 00:18:37 categories I think would be pretty cool
- 00:18:40 to add all these stats together and
- 00:18:43 create like a total category which kind
- 00:18:45 of potentially could help us rank which
- 00:18:48 Pokemon are the best so let's go ahead
- 00:18:51 and do that and one thing that's cool
- 00:18:53 and I guess true about most things
- 00:18:56 programming is there's multiple ways to
- 00:18:58 do this so we're adding a column that is
- 00:19:00 the total of those stats so one way we
- 00:19:04 could do it is we just go ahead and
- 00:19:06 access our data frame and then just call
- 00:19:09 this new column total and we can just
- 00:19:11 reference it like this right now
- 00:19:13 and we will say that that equals this is
- 00:19:16 probably the easiest way to read but not
- 00:19:20 the I guess fastest way to do it but you
- 00:19:22 could do DF of HP plus DF of attack I'll
- 00:19:32 just probably speed this up when I'm
- 00:19:33 actually editing this
- 00:19:43 okay so now we've defined this one where
- 00:19:47 you have the dataframe total is gonna
- 00:19:51 equal all the other stat columns run that I
- 00:19:57 guess we don't see anything but if I go
- 00:19:59 ahead and do data frame dot head five as
- 00:20:06 we can see over here on the right side
- 00:20:10 we have this new column named total and I
- 00:20:15 would say I recommend always when you do
- 00:20:18 something like this just making sure
- 00:20:20 that what you did actually is the
- 00:20:22 total that you're trying to get 45 plus 49
- 00:20:25 plus 49 plus 65 plus 65 plus 45 because you could
- 00:20:30 easily see that this total is a valid
- 00:20:32 number but if you don't actually
- 00:20:34 double-check that it's the right number
- 00:20:36 you kind of run into a dangerous
- 00:20:37 territory and as we can see perfect 318
- 00:20:42 is exactly what we are looking for so
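The column-by-column way of building the total can be sketched like this; the stat column names here follow the Kaggle set and the one stand-in row is Bulbasaur's stats:

```python
import pandas as pd

# One-row stand-in with the six stat columns (names as in the Kaggle set).
df = pd.DataFrame({
    'Name': ['Bulbasaur'], 'HP': [45], 'Attack': [49], 'Defense': [49],
    'Sp. Atk': [65], 'Sp. Def': [65], 'Speed': [45],
})

# Readable but verbose: add the stat columns one by one.
df['Total'] = (df['HP'] + df['Attack'] + df['Defense']
               + df['Sp. Atk'] + df['Sp. Def'] + df['Speed'])
print(df['Total'][0])  # 318, matching the hand check
```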
- 00:20:44 that's one way to do it another way we
- 00:20:47 could go about doing this and actually
- 00:20:49 because I'm using a Jupyter
- 00:20:51 notebook actually what we might want to
- 00:20:53 do first is drop some columns so one
- 00:20:58 thing about Jupyter notebooks is like if
- 00:21:00 I run this again even though I've
- 00:21:01 commented it out it still
- 00:21:06 has that data frame in memory
- 00:21:16 so it doesn't remove the total even after I
- 00:21:18 comment this out it just stays in memory
- 00:21:23 drop a specific column so if I wanted to
- 00:21:26 go ahead and drop the total column and
- 00:21:29 just show how to do it in another way I
- 00:21:32 could do data frame drop and then I can
- 00:21:35 specify the columns and I'm gonna
- 00:21:41 specify total I'm gonna run that yeah
- 00:21:44 why did you not disappear and so the
- 00:21:47 reason this did not disappear is because
- 00:21:50 drop doesn't
- 00:21:53 directly modify or remove that column
- 00:21:55 you have to just reassign the result
- 00:21:58 to the dataframe so I can go
- 00:21:58 ahead and do this and we should see this
- 00:22:00 total column here the right side which
- 00:22:03 my face is blocking will see this
- 00:22:07 disappear run that yay
- 00:22:10 so that was dropping a column so now if
- 00:22:14 I wanted to go ahead and do the add a
- 00:22:16 column in a different way maybe a little
- 00:22:18 bit more succinct of my way I can go
- 00:22:20 ahead and do DF total that stays the
- 00:22:23 same and then what I'm going to do this
- 00:22:26 time is I'm going to use that iloc
- 00:22:29 function that we learned so integer
- 00:22:34 location I want all the rows so the
- 00:22:37 first input is going to be the colon
- 00:22:39 which just means all rows everything and
- 00:22:42 then the columns I actually want to add
- 00:22:45 together will be HP through speed so
- 00:22:49 that will be this is 0 1 2 3 4 so this
- 00:22:53 will be the fourth column through
- 00:23:02 the ninth column
- 00:23:08 and I'll run and then I can there's a
- 00:23:10 dot sum function you can use and you
- 00:23:13 want to specify if you're adding
- 00:23:15 horizontally you want to specify axis
- 00:23:17 equals 1 if you set axis equals
- 00:23:20 0 that would be adding vertically ok
- 00:23:24 and we have our totals again and one
- 00:23:28 thing you might have noticed I don't
- 00:23:29 know if you caught this but because I
- 00:23:33 have this 318 down here I realized that
- 00:23:35 this 273 is actually wrong so that's why
- 00:23:38 it's good to make that check the error I
- 00:23:40 made was that it shouldn't end at 9 if
- 00:23:44 we want to include this speed it
- 00:23:45 actually has to go to the next one
- 00:23:47 because the end parameter in slices
- 00:23:50 like in lists
- 00:23:53 is exclusive so 10
- 00:23:56 means the tenth column is the first one
- 00:23:58 we don't include so if we run that now
- 00:24:00 you see that the totals are actually
- 00:24:02 correct as we did the math down here
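The iloc-based sum, including the exclusive-end fix, can be sketched like this on the same one-row stand-in:

```python
import pandas as pd

# One-row stand-in; stat column names as in the Kaggle set.
df = pd.DataFrame({
    'Name': ['Bulbasaur'], 'HP': [45], 'Attack': [49], 'Defense': [49],
    'Sp. Atk': [65], 'Sp. Def': [65], 'Speed': [45],
})

# In this stand-in the six stats sit in columns 1..6; the slice end is
# exclusive, so including 'Speed' (column 6) means stopping at 7.
df['Total'] = df.iloc[:, 1:7].sum(axis=1)  # axis=1 sums across each row
print(df['Total'][0])  # 318
```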
- 00:24:05 the last change we'll make before we
- 00:24:16 resave this as an updated CSV will be let's say we
- 00:24:19 didn't want this total column all the
- 00:24:21 way over here on the right side it makes
- 00:24:23 a little bit more sense I would say to
- 00:24:26 be either to the left of HP or
- 00:24:30 the right of
- 00:24:33 speed so we can do this in a few
- 00:24:36 different ways and the way I'm gonna
- 00:24:37 choose is probably not the most
- 00:24:38 efficient but it it makes sense given
- 00:24:42 what we've done already so remember that
- 00:24:44 we could grab specific columns like this
- 00:24:47 so if I wanted total I wanted HP and
- 00:24:51 like defense let's say and note I can
- 00:24:56 order these however I want so if I
- 00:24:59 wanted to reorder my columns and then
- 00:25:01 save it afterwards I could just do DF
- 00:25:03 equals whatever order I choose and
- 00:25:07 because it's a little bit annoying to
- 00:25:09 type out all these things I'm going to
- 00:25:11 get the columns as a list to do that I
- 00:25:14 will do cols equals a list of the DF
- 00:25:17 columns and you don't have to know exactly
- 00:25:19 why what I'm typing works I'm looking at the
- 00:25:22 documentation as I do this and I
- 00:25:28 recommend you guys do the same always
- 00:25:29 look at the documentation there's great
- 00:25:31 stuff here I can't get everything out
- 00:25:33 here in this single video doing the best
- 00:25:35 I can but definitely check the
- 00:25:37 documentation out I'll link to that in
- 00:25:39 the description ok so I'm getting the
- 00:25:41 columns and instead of ordering it like
- 00:25:44 this I'm going to do ranges so if I want
- 00:25:47 these first four columns and then total
- 00:25:51 and then the last the remaining columns
- 00:25:53 I could do something like this cols of
- 00:25:57 0 to 4
- 00:25:59 that'll get me the first four in the
- 00:26:02 same order I want plus cols of
- 00:26:04 negative one that's just reverse
- 00:26:06 indexing getting the total here I might
- 00:26:09 be blocking that again the
- 00:26:11 here and then finally the remaining
- 00:26:17 stuff we would need to add to that would
- 00:26:19 be four to five six seven eight nine ten
- 00:26:24 eleven twelve and we include twelve
- 00:26:26 because that would be the first
- 00:26:28 one we actually don't include in the
- 00:26:30 final data frame so let's see what
- 00:26:32 happens when we do that we want to see
- 00:26:33 this total go over here know what
- 00:26:39 happened okay so why do we get this
- 00:26:41 error can only concatenate lists not
- 00:26:45 string to list so that's telling me
- 00:26:48 something probably in here is messed up
- 00:26:50 and what I'm seeing is that because this
- 00:26:54 is a single column it's not gonna be
- 00:26:57 it's just gonna be a string so I have to
- 00:27:02 actually wrap that in
- 00:27:02 brackets to make it a list and then I
- 00:27:04 can go ahead and run this again and we
- 00:27:06 wanted to see the total switched over to
- 00:27:09 the left side and there we go it is
- 00:27:12 there cool and one comment I want to
- 00:27:16 make as I said before this type of
- 00:27:19 change doesn't actually really modify
- 00:27:21 our data at all it's just kind of a
- 00:27:24 visual thing so I didn't really care too
- 00:27:26 too much about how I went about and did
- 00:27:28 it but one thing I really want to note
- 00:27:31 here is be careful when you're hard
- 00:27:32 coding numbers in like this if your data
- 00:27:36 is changing and you have these hard
- 00:27:40 coded numbers it can break so it's safer
- 00:27:43 kind of to just use actual names
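The reorder done by name rather than by hardcoded positions can be sketched like this; the column layout mirrors the Kaggle set and the cols slicing shown earlier, and the index lookups are plain Python list methods:

```python
import pandas as pd

# Stand-in with a Kaggle-style column layout (values are illustrative).
df = pd.DataFrame({
    '#': [1], 'Name': ['Bulbasaur'], 'Type 1': ['Grass'], 'Type 2': ['Poison'],
    'HP': [45], 'Attack': [49], 'Defense': [49],
    'Sp. Atk': [65], 'Sp. Def': [65], 'Speed': [45], 'Total': [318],
})

cols = list(df.columns)

# Safer than hardcoding 4 and 10: look the positions up by name.
# The slice end stays exclusive, hence the + 1 after 'Speed'.
start, end = cols.index('HP'), cols.index('Speed') + 1
# cols[-1] is a plain string, so wrap it in brackets before concatenating lists.
df = df[cols[0:start] + [cols[-1]] + cols[start:end]]
print(df.columns.tolist())
```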
- 00:27:45 so even calculating the total like this
- 00:27:47 is a bit dangerous so maybe instead of
- 00:27:49 using four to ten one thing you could
- 00:27:51 potentially do is get the index of your
- 00:27:54 start so that would when we were doing
- 00:27:56 this it was the index of HP and then go
- 00:27:59 to the index of speed that would be one
- 00:28:02 way to do it a little bit safer I
- 00:28:04 would say all right now that we're done
- 00:28:06 with all of this let's move on to saving
- 00:28:09 our new CSV so just a reminder of what
- 00:28:12 we have in our data frame we follow this
- 00:28:14 information and in these previous
- 00:28:17 cells is where I actually defined the
- 00:28:19 data frame just as a reminder this data
- 00:28:23 frame's not coming out of
- 00:28:24 thin air so I have this data frame and now
- 00:28:28 I want to save this updated and let's
- 00:28:30 start by saving it to a CSV so just like
- 00:28:33 we had the dot read CSV we also have a
- 00:28:37 built-in function in pandas called to
- 00:28:39 CSV so I could just call this something
- 00:28:42 like modified dot csv and now
- 00:28:48 it will take whatever is in this data
- 00:28:49 frame and output it to nice comma
- 00:28:51 separated values format so because I got
- 00:28:55 to this next cell we know it did that I
- 00:28:56 can check my directory and as you can
- 00:29:00 see there's this modified CSV and I'll
- 00:29:03 just open that up real quick just so you
- 00:29:04 can see it all the information is there
- 00:29:07 loaded okay so we see we have all the
- 00:29:12 stuff we wanted and this total column
- 00:29:16 there which is cool the one thing that
- 00:29:20 might be annoying about the current
- 00:29:24 state of this
- 00:29:33 is that you have all these indexes over
- 00:29:35 here and I don't really care to have
- 00:29:38 those so the quick fix to not save all
- 00:29:42 these indexes with your data you can if
- 00:29:44 you want to but you can go ahead and
- 00:29:46 pass in the variable index equals false
- 00:29:52 run that again and then I reopen my
- 00:29:55 modified CSV you will see that that
- 00:30:01 stuff is all gone so yeah now we just
- 00:30:04 have the Pokedex number for this column to
- 00:30:06 the left which is perfect you can also
- 00:30:09 go ahead and there's also a built-in to
- 00:30:14 excel function so I could if I wanted to
- 00:30:17 save this as an excel even though right
- 00:30:19 now we're just working with the data
- 00:30:20 frame it's easy to output it to that
- 00:30:22 format so to excel we'll call this
- 00:30:25 modified dot xlsx and we can
- 00:30:31 also make the index false here
- 00:30:34 run that and so now we have these
- 00:30:38 two modified this is the actual excel
- 00:30:40 file I could load that but for the sake
- 00:30:42 of time I'm not going to and then
- 00:30:44 finally the last way we loaded in three
- 00:30:46 formats I might as well save three
- 00:30:48 formats so the last one is what if we
- 00:30:50 wanted to save that tab-separated file
- 00:30:53 so we can do to CSV again modified I'm
- 00:30:57 going to call this modified txt and
- 00:31:00 index equals false and then the one
- 00:31:04 thing on this is there's no delimiter
- 00:31:07 parameter for when we're doing to_csv
- 00:31:10 which is kind of annoying but there is a
- 00:31:14 sep (separator) parameter you can pass in and
- 00:31:17 look at the documentation if you need to
- 00:31:20 remember this I'm looking at the
- 00:31:21 documentation as I speak and so I can
- 00:31:23 specify that I want to separate it with
- 00:31:25 tabs instead of commas, since commas are
- 00:31:28 what happens by default so run that and
- 00:31:32 I will actually open this one up just so
- 00:31:34 you can see modified here and if I drag
- 00:31:38 that and you can see that all the data
- 00:31:39 is there no indexes on the left and it's
- 00:31:42 all separated by tabs so that looks
- 00:31:45 pretty good alright now that we've done
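The three save calls just walked through can be sketched end to end like this; the tiny DataFrame is a hypothetical stand-in, since the Pokemon file isn't bundled with this transcript:

```python
import pandas as pd

# a tiny stand-in for the Pokemon data frame (hypothetical rows)
df = pd.DataFrame({'Name': ['Bulbasaur', 'Charmander'],
                   'Type 1': ['Grass', 'Fire'],
                   'HP': [45, 39]})

df.to_csv('modified.csv', index=False)            # plain CSV, no index column
df.to_csv('modified.txt', index=False, sep='\t')  # tab-separated text
# df.to_excel('modified.xlsx', index=False)       # Excel output (needs openpyxl)
```

`index=False` is what keeps the row numbers out of all three output files.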
- 00:31:48 all of that let's move into some more
- 00:31:49 advanced Panda stuff and we'll start out
- 00:31:52 with some more advanced filtering of our
- 00:31:54 data so just a reminder this is our data
- 00:31:57 frame so as a first example I showed
- 00:32:02 before was that we could specify a
- 00:32:04 specific type, for example, that we want,
- 00:32:06 okay, so df.loc and then we said
- 00:32:09 df of Type 1 equals equals,
- 00:32:17 let's say grass, and we're only going to get
- 00:32:20 the rows that actually have grass as
- 00:32:23 their type one, so as you can see all
- 00:32:26 these type ones are grass in addition
- 00:32:30 and we can do just more so than just one
- 00:32:34 location condition we can pass in
- 00:32:36 multiple so I can do something like DF
- 00:32:39 type 1 equals grass and let's say we
- 00:32:41 wanted DF
- 00:32:44 of Type 2 to equal poison so I can
- 00:32:50 type it in like this run it oh no we got
- 00:32:53 an error so the thing you got to do here
- 00:32:57 is we have to separate our conditions
- 00:32:59 with parentheses for whatever reason not
- 00:33:01 quite sure why that is so here I have
- 00:33:07 two conditions separate them by
- 00:33:09 parentheses now as you can see we only
- 00:33:11 have grass and poison now and one thing
- 00:33:14 to note is usually we're typing out and
- 00:33:17 like this but inside of our pandas
- 00:33:20 dataframe when we're filtering we just
- 00:33:22 do the actual and sign (&) let's say if we
- 00:33:26 wanted type 1 equals grass or type 2
- 00:33:30 equals poison then we could do the or
- 00:33:33 sign (|) like this it's a little bit
- 00:33:35 different than you're normally used to
- 00:33:37 just the convention of the Python pandas
- 00:33:40 library and just look this up if you
- 00:33:42 forget so I run that now we should have
- 00:33:46 ones where either type one is grass
- 00:33:50 or type two is poison and as you can see
- 00:33:52 this is a bug type this is poison so I
- 00:33:55 was able to separate those two
- 00:33:56 conditions by an or instead of an and, and
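The and/or filtering just described, as a minimal runnable sketch (the mini DataFrame is hypothetical):

```python
import pandas as pd

# a tiny stand-in for the Pokemon data (hypothetical rows)
df = pd.DataFrame({'Name': ['Bulbasaur', 'Oddish', 'Charmander'],
                   'Type 1': ['Grass', 'Grass', 'Fire'],
                   'Type 2': ['Poison', 'Poison', None],
                   'HP': [45, 45, 39]})

# & is "and", | is "or"; each condition needs its own parentheses
grass_poison = df.loc[(df['Type 1'] == 'Grass') & (df['Type 2'] == 'Poison')]
grass_or_fire = df.loc[(df['Type 1'] == 'Grass') | (df['Type 1'] == 'Fire')]
```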
- 00:34:03 we don't have to just use text
- 00:34:04 conditions I could also add in let's say
- 00:34:06 we wanted type 1 is equal to grass type
- 00:34:11 2 is equal to poison and let's say we
- 00:34:15 wanted the HP to be a fairly high value
- 00:34:19 so just looking at these feel like 70 is
- 00:34:23 a good cutoff value so HP has to be
- 00:34:26 greater than 70 I can also specify
- 00:34:28 conditions like this and around that now
- 00:34:35 you see we only have five rows that it
- 00:34:38 actually filters out and you could go
- 00:34:40 ahead if you wanted to there's a couple
- 00:34:43 different things you can do with this so
- 00:34:46 first if you let's say you wanted to
- 00:34:48 make a new data frame that was just the
- 00:34:49 filter data I could just do something
- 00:34:51 like new DF equals this and now if I
- 00:34:56 print out
- 00:34:57 new_df we get just those five rows but
- 00:35:02 I could go ahead and just print out df
- 00:35:06 and we still have everything also worth
- 00:35:09 mentioning real quick I can easily save
- 00:35:12 this new data frame as a new CSV kind of
- 00:35:15 to checkpoint my work and maybe if I
- 00:35:17 wanted to do this on many different
- 00:35:19 filters kind of have this more specific
- 00:35:21 CSV files that I could dive in and look
- 00:35:24 at in more depth so I'll call this
- 00:35:26 something like filtered.csv if I ran
- 00:35:30 this you'd see in here that I have this
- 00:35:35 filtered and it contains the data that I
- 00:35:38 just grabbed out one thing to note when
- 00:35:41 you are filtering your data and you
- 00:35:43 shrink down the data size is when you
- 00:35:45 print out that data frame so I'll
- 00:35:47 comment this out okay I can just print
- 00:35:50 out new_df and as you can see one thing
- 00:35:54 that's weird is the index here,
- 00:35:56 it goes 3, 50, 77, 652, even though
- 00:36:00 we've filtered out our data the old
- 00:36:02 index stayed there and that can get annoying
- 00:36:05 if you're trying to do some additional
- 00:36:06 processing with this new data frame so
- 00:36:09 if you want to reset your index you can
- 00:36:11 go new_df.reset_index() and you can
- 00:36:19 start off by just setting new_df equal
- 00:36:24 to new_df.reset_index(), now if I
- 00:36:26 print out new_df you see that we have 0
- 00:36:31 1 2 3 4 and by default it saves that old
- 00:36:35 index there as a new column if you don't
- 00:36:38 want that to happen we can modify it
- 00:36:40 further, we can do drop equals
- 00:36:47 true, so this will get rid of the old
- 00:36:49 indices, as you can see now we
- 00:36:52 don't have that then the last thing is
- 00:36:54 if you don't want to have to reset it to
- 00:36:55 a new data frame you can actually do
- 00:36:57 this in place as well which just
- 00:37:00 probably conserves a little bit of
- 00:37:01 memory and if I run this I don't even
- 00:37:06 set it to a new variable it just will
- 00:37:09 change the values within
- 00:37:11 the given new_df and as you can see we
- 00:37:13 got the new indexes for our filtered out
- 00:37:17 data so that's something useful too to
- 00:37:20 be aware of because if you're running
- 00:37:22 through your new data frame like row by
- 00:37:25 row and you're trying to get a specific
- 00:37:26 spot even though it's like the fourth
- 00:37:29 row that you see it might be you might
- 00:37:31 need to index like you know the
- 00:37:33 seventy-first position and that would get really
- 00:37:35 annoying so resetting indexes is helpful
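The reset_index steps above, sketched with a small stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['a', 'b', 'c', 'd'],
                   'HP': [80, 20, 90, 75]})

new_df = df.loc[df['HP'] > 70]          # filtered frame keeps the old index (0, 2, 3)
new_df = new_df.reset_index(drop=True)  # drop=True discards the old index column

# the same thing without reassigning, saving a copy:
# new_df.reset_index(drop=True, inplace=True)
```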
- 00:37:38 in this case. in addition to, I guess,
- 00:37:43 equals conditions, greater than, less than,
- 00:37:45 not equals, etc., we also have other types
- 00:37:48 of conditions we can use, basically
- 00:37:50 anything you can think of so one thing
- 00:37:52 that I see that is kind of annoying me
- 00:37:54 with this data is if you look in here
- 00:37:57 maybe this is because I'm like a little
- 00:37:59 bit outdated on my Pokemon knowledge but
- 00:38:01 I've seen these like in mega versions of
- 00:38:03 Pokemon and I'm not quite sure what that
- 00:38:05 really means so let's say I wanted to
- 00:38:08 filter out all the names that contained
- 00:38:11 mega and it's tough to do with equals
- 00:38:15 signs, you know, because contains is not
- 00:38:17 quite equal to because we want to allow
- 00:38:19 a lot of different things there so I
- 00:38:21 could not allow the name to include mega
- 00:38:24 by doing the following so I'm going to
- 00:38:27 delete the stuff that's inside of here
- 00:38:29 maybe I'll just comment it out so you
- 00:38:31 can still see it but I'm going to do
- 00:38:34 df.loc and then I'm gonna pass in the name,
- 00:38:40 then I need to get the str attribute
- 00:38:43 of the name, this is something you should
- 00:38:45 just kind of I guess remember about the
- 00:38:49 contains function: .str and then dot
- 00:38:52 contains('Mega'), so if I run this you'll
- 00:39:00 see that all of these are just the
- 00:39:03 rows that include the word mega and
- 00:39:07 then if we want to get the reverse of
- 00:39:09 this this is another good symbol to
- 00:39:11 remember because it's not quite what you
- 00:39:13 would think it would be, but within the
- 00:39:15 loc function if we want to do not,
- 00:39:17 instead of what you might think it would be,
- 00:39:19 the exclamation point, it's actually this
- 00:39:22 squiggly line, the tilde (~),
- 00:39:23 so if I run this now we drop all those
- 00:39:29 ones that had the mega so as you can see
- 00:39:31 there's no Megas anymore in our data so
- 00:39:35 that's pretty useful and taking this
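The contains-and-negate filtering just shown, as a short sketch (hypothetical names):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Venusaur', 'Mega Venusaur', 'Charizard']})

megas = df.loc[df['Name'].str.contains('Mega')]      # rows whose name has "Mega"
no_megas = df.loc[~df['Name'].str.contains('Mega')]  # ~ negates the condition
```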
- 00:39:38 even a step farther this contains
- 00:39:41 function I find to be very very powerful
- 00:39:44 because in addition to just doing exact
- 00:39:46 words we can also pass in regex
- 00:39:51 expressions and do all sorts of like
- 00:39:52 complicated filtering with this so let's
- 00:39:56 say, as a first example, that we
- 00:40:00 wanted a
- 00:40:04 simple way to check if the type one was
- 00:40:06 either grass or fire so to do that first
- 00:40:10 have to just import regular expressions
- 00:40:13 and I would recommend looking into
- 00:40:16 regular expressions if you don't know
- 00:40:18 what they are super super powerful and
- 00:40:19 filtering data based on certain textual
- 00:40:22 patterns so I can do regex equals true
- 00:40:31 and right now I'm trying to find if type
- 00:40:33 1 is equal to fire or let's say grass so
- 00:40:42 and in a regex expression the pipe (|)
- 00:40:44 means or so I want it to either match
- 00:40:46 fire or grass run that shoot it did not
- 00:40:51 give me anything and the reason it
- 00:40:53 didn't give me anything is because the
- 00:40:55 capitalisation was off so this is gonna
- 00:40:57 be another good point so see that did
- 00:41:00 work type 1 grass type 1 fire etc but a
- 00:41:04 probably nicer way to do this because
- 00:41:06 you might have all sorts of funky
- 00:41:07 capitalization is I could go ahead and
- 00:41:10 change it back to this way but there's a
- 00:41:12 flag that you can use so I can say Flags
- 00:41:16 equals re dot I and that's going to be
- 00:41:21 ignore case so I run that again as you
- 00:41:24 can see grass and fire is grabbed even
- 00:41:27 though I specified it without the
- 00:41:30 capital letters one more example let's
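The case-insensitive regex filter just shown, sketched on a hypothetical mini frame:

```python
import re
import pandas as pd

df = pd.DataFrame({'Type 1': ['Grass', 'Fire', 'Water']})

# '|' means "or" in a regex; flags=re.I ignores case, so 'fire' matches 'Fire'
matches = df.loc[df['Type 1'].str.contains('fire|grass', flags=re.I, regex=True)]
```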
- 00:41:34 say I wanted to get all Pokemon names
- 00:41:37 that started with 'pi' so
- 00:41:40 probably the first example you might
- 00:41:42 think of is Pikachu but you also would
- 00:41:44 have like Pidgeotto and probably a bunch
- 00:41:48 of new ones that I don't know so if I
- 00:41:50 wanted to just get data in the name
- 00:41:53 category that started with 'pi' I could
- 00:41:56 use regexes to do the following: I could
- 00:41:58 do 'pi' and then specify that I need it
- 00:42:04 to start with 'pi' but the next set of
- 00:42:06 letters can be a through z and let's say
- 00:42:12 like this, star means zero or more, and
- 00:42:16 yeah this is all just regex information if
- 00:42:19 it seems super super foreign to you look
- 00:42:22 into regexes and if I do this we
- 00:42:27 didn't get anything what happened that's
- 00:42:29 because I said type 1 so if I actually
- 00:42:31 change this to name run it as you can
- 00:42:35 see oh we got Caterpie so I did
- 00:42:37 something messed up with my reg ex but
- 00:42:39 as you can see there's all these 'pi'
- 00:42:41 names in it, and if I wanted to keep
- 00:42:45 the 'pi' letters from matching
- 00:42:50 in the middle, I can specify a start of
- 00:42:52 line with this caret (^), run that, now we've
- 00:42:55 got only our names that begin with 'pi'
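The caret-anchored pattern just described, sketched (hypothetical names):

```python
import re
import pandas as pd

df = pd.DataFrame({'Name': ['Pikachu', 'Pidgeotto', 'Caterpie']})

# without '^', Caterpie matches too, since 'pi' appears mid-word;
# '^' anchors the match to the start of the name
pi_names = df.loc[df['Name'].str.contains('^pi[a-z]*', flags=re.I, regex=True)]
```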
- 00:42:58 and you might find this you know there's
- 00:43:01 many different use cases where you might
- 00:43:02 find something like this useful to do to
- 00:43:05 filter out your data in a kind of
- 00:43:07 complex manner building off the
- 00:43:08 filtering we did in the last examples we
- 00:43:11 can actually change our data frame based
- 00:43:14 on the conditions that we filter out by
- 00:43:16 so let's imagine I didn't
- 00:43:20 like the name fire for type 1, I thought
- 00:43:24 that you know it'd be better if it was named
- 00:43:26 something like flamer, if you're a
- 00:43:31 fire type you're actually a flamer so
- 00:43:33 let's make that change and I know this
- 00:43:35 is going against Pokemon tradition but
- 00:43:37 just to show you: df.loc, and we want to
- 00:43:43 have df of Type 1
- 00:43:50 equal equal fire and if that is the case
- 00:43:54 well I can do if I specify with a comma
- 00:43:58 I can specify a parameter so I'm going
- 00:44:02 to say type 1 so this is the column I
- 00:44:04 want and I can do equals like flamer it
- 00:44:13 looks like something is off why does it
- 00:44:14 look like something is off that's
- 00:44:16 because I have an extra bracket there
- 00:44:19 now it should be good run that don't see
- 00:44:22 anything but if I do DF oh shoot you can
- 00:44:27 see that now type 1 is flamer as opposed
- 00:44:30 to fire if I wanted to change it back
- 00:44:32 I could go fire and this is flamer now I
- 00:44:39 have fire again we also can do like
- 00:44:43 specify this to be some
- 00:44:48 different column, it doesn't have to be the
- 00:44:49 same column we're editing so maybe you
- 00:44:52 decided that legendary pokémon are all
- 00:44:54 Pokemon that are of type fire and you
- 00:44:59 can make this true in that case and as
- 00:45:03 you can see now all the fire pokemon are
- 00:45:05 legendary which obviously isn't true but
- 00:45:09 it's kind of cool that we can use one
- 00:45:10 condition to set the parameter of
- 00:45:13 another column and I'm kind of screwed
- 00:45:18 up this data frame in general now
- 00:45:21 because I did that so what I could do is
- 00:45:23 use kind of my checkpoint that was the
- 00:45:25 modified CSV, so I'm gonna just say df
- 00:45:27 equals pd.read_csv('modified.csv'),
- 00:45:36 now I'm just kind of loading my
- 00:45:38 check point that I had a while back yeah
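The conditional assignments described above, sketched on a hypothetical stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Charmander', 'Squirtle'],
                   'Type 1': ['Fire', 'Water'],
                   'Legendary': [False, False]})

# the column named after the comma is the one that gets set
df.loc[df['Type 1'] == 'Fire', 'Type 1'] = 'Flamer'
df.loc[df['Type 1'] == 'Flamer', 'Type 1'] = 'Fire'  # and change it back

# the condition and the target column don't have to match
df.loc[df['Type 1'] == 'Fire', 'Legendary'] = True
```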
- 00:45:42 so now I fixed up the false the
- 00:45:44 legendary by just reloading my data
- 00:45:47 frame you can also change multiple
- 00:45:50 parameters at the time so I'm gonna just
- 00:45:52 do this as a demonstration but imagine
- 00:45:54 we wanted to say like something like if
- 00:45:57 the total
- 00:46:00 is greater than 500 it's a pretty damn
- 00:46:08 good Pokemon. I was gonna say, these
- 00:46:12 changes that I'm gonna make here
- 00:46:13 don't really matter but just to show you
- 00:46:17 that certain conditions can be modified
- 00:46:22 multiple conditions can be modified at a
- 00:46:25 single time so if I want to modify
- 00:46:28 multiple columns at a single time I can
- 00:46:30 pass in a list and if I set like this to
- 00:46:35 the test value don't worry about this
- 00:46:38 that's what I'm saying test value so if
- 00:46:40 the total is greater than 500 these two
- 00:46:43 columns should instead of having their
- 00:46:44 normal values should have test value
- 00:46:47 let's see if that's true
- 00:46:48 oh and I just need to print out data
- 00:46:51 frame as you can see this total is
- 00:46:56 greater than 500 this is greater than
- 00:47:01 500 this is greater than 500 all of them
- 00:47:04 modified as we wanted and another neat
- 00:47:07 thing to know is that you can modify
- 00:47:09 them individually as well so this could
- 00:47:11 be tests, like we'll say that
- 00:47:15 this is test one and this is test two so
- 00:47:22 now we are specifying what generation
- 00:47:26 becomes and what legendary becomes if
- 00:47:28 this specific condition is met as you
- 00:47:32 can see that updated appropriately
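The multi-column assignment just shown, sketched; one assumption here is casting the two columns to object dtype first, so they can hold mixed value types (newer pandas versions otherwise complain about incompatible dtypes):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['a', 'b'],
                   'Total': [600, 300],
                   'Generation': [1, 1],
                   'Legendary': [False, False]})

# cast to object so the columns can hold the string test values below
df[['Generation', 'Legendary']] = df[['Generation', 'Legendary']].astype(object)

# pass a list of columns to set several at once...
df.loc[df['Total'] > 500, ['Generation', 'Legendary']] = 'TEST VALUE'
# ...or give each column its own value
df.loc[df['Total'] > 500, ['Generation', 'Legendary']] = ['Test 1', 'Test 2']
```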
- 00:47:34 comment all this out real quick and just
- 00:47:36 reload the data frame as it was
- 00:47:42 initially
- 00:47:49 okay so that's just I'm just resetting
- 00:47:51 the changes cause I don't want to stay
- 00:47:53 but showing you that you can do these
- 00:47:55 things then they become super super
- 00:47:57 useful all right we're gonna end this
- 00:47:58 video by doing something that I find
- 00:48:00 very useful and that's using the group
- 00:48:02 by function to help you do some
- 00:48:05 aggregate statistics so let's start by
- 00:48:08 loading in that check pointed CSV we
- 00:48:11 kind of created so I modified dot CSV is
- 00:48:14 what I'm gonna load it in and just
- 00:48:16 reminder this is what it's looking like
- 00:48:18 all right so with this group by function
- 00:48:23 we can start doing some really kind of
- 00:48:26 cool analysis on things so for example
- 00:48:30 one thing I could do is if I wanted to
- 00:48:32 see like the average HP and attack of
- 00:48:36 all the Pokemon grouped by which type
- 00:48:40 they are so like maybe trying to figure
- 00:48:41 out like Oh our specific types like have
- 00:48:45 better skills so maybe a rock pokémon
- 00:48:47 would have like hi defense I think
- 00:48:50 there's a rock pokémon steel Pokemon
- 00:48:52 would have hi defense you know maybe a
- 00:48:55 poison Pokemon would have high attack so
- 00:48:58 we can start seeing that as a kind of
- 00:49:00 holistic measure by using this group by
- 00:49:02 function, so I can go df.groupby and
- 00:49:06 let's say I wanted to group by type one
- 00:49:08 and we're gonna look for the averages of
- 00:49:13 all the type one Pokemon so I can run
- 00:49:16 that and here we get all the stats
- 00:49:20 broken down by their mean sorted by what
- 00:49:23 type one is so if I look at a bug it has
- 00:49:29 like an average attack of 70, and to make
- 00:49:33 this even more useful I can use
- 00:49:35 the sort_values function we learned
- 00:49:36 so sort values and we'll sort on let's
- 00:49:40 say defense and I'm gonna make this
- 00:49:47 ascending
- 00:49:50 equals false so I can see the
- 00:49:51 highest defense type one and as I
- 00:49:54 mentioned before this is we did kind of
- 00:49:57 see what I was expecting to see that if
- 00:50:00 we took the average of all steel
- 00:50:01 pokemons they have the highest defense
- 00:50:05 which is kind of cool it's cool to see
- 00:50:09 that we could also instead of doing
- 00:50:11 defense we could look at what is of all
- 00:50:14 Pokemon that are in the have a type 1
- 00:50:17 that's the same who has the highest
- 00:50:19 attack and I was thinking maybe it would
- 00:50:21 be poison but I'm not positive okay so
- 00:50:24 we got this does make sense
- 00:50:25 we got dragon with the highest average
- 00:50:29 attack and a lot of the dragon pokémon
- 00:50:31 are legendary so there's no surprise
- 00:50:34 there that they're super powerful and
- 00:50:36 like holy crap like a dragon would be
- 00:50:38 very scary
- 00:50:39 fighting obviously that makes sense that
- 00:50:42 they should have good attack let's see
- 00:50:44 who has the best HP, and huh, the
- 00:50:49 dragon also has the best HP, but it's kind
- 00:50:52 of cool that we can group by type 1 and
- 00:50:54 see all these like useful stats about
- 00:50:57 them just on that group it's a useful
- 00:51:00 little analysis tool and I could
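The groupby-mean-and-sort just shown, sketched; `numeric_only=True` is an assumption for newer pandas versions, which refuse to average text columns otherwise:

```python
import pandas as pd

# hypothetical mini stand-in for the Pokemon stats
df = pd.DataFrame({'Type 1': ['Rock', 'Rock', 'Steel', 'Steel'],
                   'Defense': [100, 120, 140, 160],
                   'Attack': [80, 90, 70, 75]})

# average stats per Type 1, highest defense first
means = (df.groupby('Type 1')
           .mean(numeric_only=True)
           .sort_values('Defense', ascending=False))
```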
- 00:51:01 additionally do something like .sum(),
- 00:51:05 so the three that
- 00:51:07 come to my mind immediately,
- 00:51:09 there might be others, but the
- 00:51:11 aggregate statistics you can do with
- 00:51:13 groupby are mean, sum, and count,
- 00:51:18 so if I did this sum on everything
- 00:51:22 here I have like all of the HP's added
- 00:51:25 up and you know you got to be thinking
- 00:51:27 about why you're doing something when
- 00:51:29 you're doing anything data science
- 00:51:30 related in this case it doesn't really
- 00:51:31 make sense for me to sum up these
- 00:51:33 properties because like I could sum up
- 00:51:36 and see that one's like way higher than
- 00:51:38 another but because you don't know how
- 00:51:40 many type ones are bug or type ones are
- 00:51:44 dark
- 00:51:45 etc this aggregate sum doesn't make
- 00:51:49 sense in this context but you can do it
- 00:51:52 then you also then you also have
- 00:51:57 count so if I run this we have all the
- 00:52:06 counts of the different out of pokémon
- 00:52:10 that are type 1, so 69 are bug, 31 are
- 00:52:14 dark, 32 are dragon type 1, and if you want
- 00:52:19 to clean this up a little bit like you
- 00:52:20 have a lot of the same values everywhere
- 00:52:23 basically it's any time it's a non zero
- 00:52:27 non false number or I think just yeah
- 00:52:30 any time the row is filled in so like
- 00:52:34 the reason this is 52 right here is
- 00:52:36 because the type 2 is just blank so it
- 00:52:38 didn't count those rows so if you
- 00:52:43 wanted to just have like a clean this is
- 00:52:45 the count, you could add df of
- 00:52:49 'count' equals 1, so basically what I'm
- 00:52:54 doing is I'm filling in a column added
- 00:52:56 to the data frame and I'll show you this
- 00:52:59 that's just a one for every row as you
- 00:53:03 can see on the right side where my face
- 00:53:07 is normally blocking there's this one
- 00:53:10 here so now what I could do is I could
- 00:53:16 do that same groupby count, right, oh
- 00:53:21 shoot, it's counting everything up and I get all
- 00:53:26 of this back but if I want to just make
- 00:53:28 my life easier I can do just get the
- 00:53:31 count column and now I have this useful
- 00:53:38 little set where there's 69 bug 31 dark
- 00:53:44 32 dragon etc this is a little bit
- 00:53:47 easier to read now
- 00:53:48 and now that I have an easier to
- 00:53:50 read format
- 00:53:51 I could also group by multiple
- 00:53:53 parameters at the same time so I could
- 00:53:55 do type 1 and type 2 so looking at all
- 00:53:59 the subsets, so of type 1 bug, that have a
- 00:54:04 type 2 of electric, that have a type 2 of
- 00:54:07 fighting, etc., I can do all sorts of counts
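The count-column trick and both groupbys, sketched (hypothetical mini data):

```python
import pandas as pd

df = pd.DataFrame({'Type 1': ['Bug', 'Bug', 'Dark'],
                   'Type 2': ['Poison', None, None]})

df['count'] = 1  # a column of ones, so count() has no blanks to skip

per_type = df.groupby('Type 1').count()['count']           # one count per Type 1
per_pair = df.groupby(['Type 1', 'Type 2']).count()['count']  # per (Type 1, Type 2)
```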
- 00:54:10 and this gets really useful if you're
- 00:54:12 working with a really really massive
- 00:54:15 data set and I'll get into that quickly
- 00:54:18 I don't know if I'll go through the full
- 00:54:19 example but imagine you had a really
- 00:54:21 really big data set and you couldn't
- 00:54:25 even load it all into one data frame
- 00:54:26 this groupby and like count and sum is
- 00:54:29 super useful because you can take your
- 00:54:33 data frame and kind of squish it so if
- 00:54:35 you're like tracking some number of like
- 00:54:37 how many times an event occurs in your
- 00:54:39 data frame you can use this groupby and
- 00:54:42 like count things and then kind of
- 00:54:45 squeeze your data frame make it smaller
- 00:54:47 based on this group by function if that
- 00:54:51 made any sense so I'm not gonna show you
- 00:54:54 guys exactly just because we've been
- 00:54:55 working with the data set that's not
- 00:54:56 that small I don't really feel like
- 00:54:57 bringing in a new data set right at the
- 00:54:59 end but imagine you're working with a
- 00:55:01 data set a file that's on the order of
- 00:55:04 like 20 gigabytes it's pretty dang big
- 00:55:06 and you don't really know how to best
- 00:55:09 process it one thing that's really
- 00:55:11 useful about the Python pandas library
- 00:55:13 is it allows you to read in like a file
- 00:55:16 like that you can read it in chunks at a
- 00:55:19 time so instead of reading it into all
- 00:55:21 20 gigabytes because now unless you have
- 00:55:23 a very very powerful machine you're not
- 00:55:25 going to be able to load all of that
- 00:55:26 into memory you can load it in let's say
- 00:55:29 100 megabytes at a time and so normally
- 00:55:33 when we are reading the CSV we would do
- 00:55:37 like pd.read_csv('modified.csv'), the file
- 00:55:43 I'm using right now and that would load
- 00:55:46 everything in so instead of doing that
- 00:55:49 what we can do is we can pass in this
- 00:55:52 chunk size parameter so I'm going to
- 00:55:56 just say for now chunk size equals 5
- 00:55:58 just for the example and that means 5
- 00:56:01 rows are being passed in at a time so if
- 00:56:04 I did for df in pd.read_csv, that means
- 00:56:09 that my DF right here would be 5 rows of
- 00:56:13 my total data set modified CSV and
- 00:56:17 because this is rows, and you might
- 00:56:19 rather think a bit in like terms of
- 00:56:21 memory size, you can do a little bit of
- 00:56:23 math with the rows to figure out how
- 00:56:25 much memory that will actually be taking
- 00:56:27 if you think every row is probably like
- 00:56:30 maybe 10 or 20 bytes shouldn't be much
- 00:56:34 more than that you can do some math on
- 00:56:36 how big this is and if you run into an
- 00:56:38 error where you don't
- 00:56:40 have enough memory you can always shrink
- 00:56:41 this chunk size. so if we were working with a
- 00:56:44 really big data set we might set, you
- 00:56:48 know, our chunk size to a hundred
- 00:56:49 thousand rows at a time which is a lot
- 00:56:51 of rows but nowhere near how much that
- 00:56:53 full 20 gigabytes would be but for our
- 00:56:55 example we're just loading again five
- 00:56:58 rows at a time and I can show you that
- 00:57:01 that is happening, so print out like 'chunk
- 00:57:06 data frame' and then the data frame just
- 00:57:09 to see how it's working so we have the
- 00:57:14 first data frame and as you can see it's
- 00:57:16 five rows second data frame another next
- 00:57:20 five rows third data frame the third set
- 00:57:23 of five rows etc so this loaded in the
- 00:57:27 data frame but in chunks of five so
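The chunked reading loop just described, sketched; the file written at the top is a small hypothetical stand-in for modified.csv, so the example is self-contained:

```python
import pandas as pd

# write a small stand-in file first so the example runs on its own
pd.DataFrame({'Name': ['p%d' % i for i in range(12)],
              'HP': list(range(12))}).to_csv('modified.csv', index=False)

# chunksize=5 hands the file back five rows at a time
sizes = []
for chunk in pd.read_csv('modified.csv', chunksize=5):
    print('CHUNK DF')
    print(chunk)              # each chunk is itself a small DataFrame
    sizes.append(len(chunk))

print(sizes)  # chunks of 5, 5, then the remaining 2 rows
```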
- 00:57:31 what's useful with like the aggregate
- 00:57:33 stuff we were just going through is you
- 00:57:35 could also like define some new data
- 00:57:37 frame equals, let's say, pd.DataFrame, and
- 00:57:43 you can give it like the same columns as
- 00:57:46 you had in your original data frame this
- 00:57:50 would just create a new data frame
- 00:57:51 that's empty
- 00:57:52 with the same column names basically
- 00:57:55 what you can do is you could let's say
- 00:57:58 like do df.groupby type 1, let's say,
- 00:58:08 and get like the count of that stored in
- 00:58:13 results and what you can do here is with
- 00:58:17 that new data frame you defined you
- 00:58:19 could do something like you can use the
- 00:58:21 concat function of pandas which just
- 00:58:24 appends two data frames together and you
- 00:58:26 could do something like, new data
- 00:58:29 frame equals pd.concat
- 00:58:32 of the new data frame and results so
- 00:58:37 basically what this would do is
- 00:58:38 always take your new data frame as you
- 00:58:40 go through chunks append on results and
- 00:58:43 store it back to new data frame so as
- 00:58:46 you did this as you did more iterations
- 00:58:48 you'd be building this new data frame of
- 00:58:51 all the information in your original
- 00:58:53 really really large data set but because
- 00:58:57 for each chunk you're like aggregating
- 00:58:59 doing some sort of group buy and count
- 00:59:01 you're shrinking that data size down so
- 00:59:03 that this final new data frame has the
- 00:59:06 meaning that comes out of that big
- 00:59:08 original data frame but it's a lot
- 00:59:11 smaller you can actually do more
- 00:59:12 analysis now on this shrunken down size
- 00:59:15 new data frame.
- 00:59:18 hopefully that makes sense, if you need
- 00:59:19 me to clarify this just leave me a
- 00:59:21 comment down below and I'll try to clear
- 00:59:24 things up regarding that all right
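The chunk-and-concat aggregation described above, as a runnable sketch on a tiny hypothetical stand-in file:

```python
import pandas as pd

# small stand-in file (hypothetical data) so the sketch is self-contained
pd.DataFrame({'Name': ['a', 'b', 'c', 'd', 'e', 'f'],
              'Type 1': ['Grass', 'Fire', 'Grass', 'Water', 'Fire', 'Grass']}
             ).to_csv('modified.csv', index=False)

new_df = pd.DataFrame()  # accumulator, starts empty

for chunk in pd.read_csv('modified.csv', chunksize=2):
    chunk['count'] = 1
    results = chunk.groupby('Type 1').count()[['count']]  # aggregate this chunk
    new_df = pd.concat([new_df, results])                 # append its summary

# collapse the per-chunk counts into one total per type
totals = new_df.groupby(new_df.index).sum()
print(totals)
```

Each iteration shrinks a chunk down to a few summary rows before appending, so the accumulator stays small even if the source file is huge.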
- 00:59:26 that's all I'm gonna do in this video
- 00:59:28 hopefully you kind of
- 00:59:30 feel like you have control of the pandas
- 00:59:32 library now if you felt like you learned
- 00:59:34 something make sure to hit that
- 00:59:35 subscribe button it would mean a lot to
- 00:59:36 me
- 00:59:37 I'm gonna build off of this video in
- 00:59:39 future videos such as like plotting
- 00:59:41 stuff in our data frames and you know
- 00:59:44 kind of doing some advanced stuff using
- 00:59:46 like regular expressions I don't know if
- 00:59:48 it will be specifically about pandas
- 00:59:50 but a lot of useful information that you
- 00:59:52 can take your panda skills and build off
- 00:59:54 of so subscribe for all of that if you
- 00:59:57 have any questions about anything I
- 00:59:58 covered in the video you leave a comment
- 00:59:59 down below and I'll try to help you out
- 01:00:01 and clarify and also if there's any like
- 01:00:04 additional features you would love to
- 01:00:06 see in pandas that I've missed leave a
- 01:00:08 comment down below what that is and
- 01:00:10 maybe I'll make a follow-up part two to
- 01:00:12 this video all right that's all I got
- 01:00:14 thank you guys again for watching and
- 01:00:16 peace out
- 01:00:20 [Music]