Coding

Generating Mock Data with Python! (NumPy, Pandas, & Datetime Libraries)

  • 00:00:00 [Music]
  • 00:00:01 hey what is up everyone and welcome back
  • 00:00:03 to another video I want to start this
  • 00:00:05 video off by saying thank you for all
  • 00:00:07 the love on the last video I had a lot
  • 00:00:08 of fun making that video so it's really
  • 00:00:10 exciting to see that you guys enjoyed it
  • 00:00:12 as well but that leads me to the topic
  • 00:00:13 of this video so in the comment section
  • 00:00:15 of the last video I was left with the
  • 00:00:17 question where did you find the data as
  • 00:00:19 background information I looked all over
  • 00:00:21 the place for a good data set for the
  • 00:00:22 tutorial I did Google searches I looked
  • 00:00:24 through Kaggle pretty much anywhere you
  • 00:00:27 could find a dataset I looked at to try
  • 00:00:28 to find data I was really happy with I
  • 00:00:30 was looking for something that was
  • 00:00:31 simple enough to kind of quickly grasp
  • 00:00:33 and follow along with but complex enough
  • 00:00:35 that we could perform really substantial
  • 00:00:38 analysis on it and I just wasn't happy
  • 00:00:41 with what options I was finding so that
  • 00:00:44 ultimately led me to the spot where I
  • 00:00:46 was like okay I'm not finding anything
  • 00:00:48 I'm happy with how about we just create
  • 00:00:50 our own data and so for the last video I
  • 00:00:52 generated mock data that was very
  • 00:00:55 realistic I was using statistics and
  • 00:00:58 math principles to kind of generate that
  • 00:00:59 data and that was ultimately the data
  • 00:01:01 that was used in the analysis and so in
  • 00:01:04 this video we're gonna walk through how
  • 00:01:06 we can generate mock data with Python
  • 00:01:08 the Python libraries that you'll learn
  • 00:01:09 more about in this video will be numpy
  • 00:01:12 pandas and the datetime library so as a
  • 00:01:15 reminder of what we're working towards
  • 00:01:17 in this video I'm going to load in the
  • 00:01:18 data that we'll ultimately be creating
  • 00:01:20 using a Python script so we have a 1 2 3
  • 00:01:25 4 5 6 column CSV that contains the
  • 00:01:31 columns order ID product quantity
  • 00:01:35 ordered price each order date and
  • 00:01:37 purchase address so we're gonna be
  • 00:01:39 talking about how we can generate each
  • 00:01:41 one of these fields and ultimately make
  • 00:01:43 this pretty big data frame and I think I
  • 00:01:47 learned this from a comment I can probably
  • 00:01:49 get all sorts of information on this
  • 00:01:51 data here so we're gonna be
  • 00:01:53 generating roughly it will fluctuate
  • 00:01:56 but roughly like 200,000 rows of
  • 00:01:58 products prices addresses with the
  • 00:02:02 Python script in this video I will start
  • 00:02:04 very simple so let's start out by
  • 00:02:06 importing the libraries that we'll need
  • 00:02:07 we might import a couple more as we go
  • 00:02:10 but to start off we'll want to
  • 00:02:11 import pandas as pd and we'll want to
  • 00:02:15 import numpy and then the third library
  • 00:02:19 that we'll need, which is very
  • 00:02:20 helpful, is going to be the datetime
  • 00:02:22 library for both numpy and pandas
  • 00:02:26 you might have to do a pip install
  • 00:02:27 before you can get going with these (datetime ships with Python) so
  • 00:02:29 the data I showed you had six columns so
  • 00:02:31 let's write those out real quick
  • 00:02:33 it had the order ID the product name the
  • 00:02:39 quantity ordered the price of each the
  • 00:02:48 order date and the purchase address so
  • 00:02:55 these are the six columns that we're
  • 00:02:57 ultimately going to be generating data
  • 00:02:59 for but it's good to start out with them
  • 00:03:00 as our base okay now we have our columns
  • 00:03:03 let's start writing out the products
  • 00:03:05 that we want to fill in as our data and
  • 00:03:09 so when I was doing this I gave the kind
  • 00:03:11 of theme of all of the data as if it was
  • 00:03:14 coming from an electronics store so let's
  • 00:03:16 start out by typing some electronic
  • 00:03:19 products so I started with like the
  • 00:03:21 iPhone and basically what I'm doing here
  • 00:03:23 is I'm doing a product-to-price
  • 00:03:27 dictionary mapping so we got our product
  • 00:03:30 and we're gonna map that to a price so
  • 00:03:33 we have our iPhone that's roughly seven
  • 00:03:35 hundred dollars and we just can do
  • 00:03:38 something like Google phone this is
  • 00:03:41 let's say $600 and I didn't have like a
  • 00:03:44 set list I was just kind of throwing
  • 00:03:46 random electronics into my products that
  • 00:03:49 I ultimately wanted to be in my data and
  • 00:03:51 I'm not going to type all this out but
  • 00:03:53 basically I just brainstormed several
  • 00:03:56 products that you would have in an
  • 00:03:59 electronics store and I had a little fun
  • 00:04:03 like this next item that I put in there
  • 00:04:06 I called it the very bad phone I didn't
  • 00:04:08 want to you know give any hard time to
  • 00:04:11 Google or iPhone so for the worst phone
  • 00:04:13 I just named one very bad phone and I
  • 00:04:15 said that was 400 and then let's just
  • 00:04:19 paste in the rest of this data and if
  • 00:04:22 you want access to the completed file
  • 00:04:23 to get all these products easily
  • 00:04:25 I'll have a link in the description of
  • 00:04:28 this video that will go to my github
  • 00:04:30 page and you can get this very easily
  • 00:04:32 but here's all the products that we have
  • 00:04:35 with what we have here let's generate a
  • 00:04:37 simple CSV of data so we can start out
  • 00:04:41 by saying our data frame is equal to pd.DataFrame
  • 00:04:44 and this is going to be an empty
  • 00:04:47 data frame but we will set the
  • 00:04:49 columns equal to the column values we
  • 00:04:53 define so it'll be an empty data frame
  • 00:04:55 with these columns next we're going to
  • 00:05:01 kind of just randomly select products
  • 00:05:04 here and fill in rows with these randomly
  • 00:05:08 selected products we'll be smarter about
  • 00:05:10 this in a little bit but let's just get
  • 00:05:11 something working first so for i in
  • 00:05:14 range let's say let's do a thousand
  • 00:05:17 entries to start we're going to we'll
  • 00:05:22 get a random product and to do this
  • 00:05:24 to get the random product
  • 00:05:26 usually whenever you hear random you
  • 00:05:27 also have to import the random library
  • 00:05:29 so I'll do that real quick but we want a
  • 00:05:31 random product so I'm just going to do
  • 00:05:33 products.keys() to get just the
  • 00:05:39 names of the products and then I'll
  • 00:05:41 surround that with a random.choice
  • 00:05:47 okay so that will give us a product
  • 00:05:49 let's say we want the price of that
  • 00:05:51 product well we can just do price equals
  • 00:05:55 products and then we'll key on whatever
  • 00:05:58 product we just selected
  • 00:06:01 so if we selected like Bose SoundSport
  • 00:06:03 Headphones if we did products with that
  • 00:06:06 key then we would get 99 so we have the
  • 00:06:09 price we'll just say our order ID we'll
  • 00:06:12 fill in our order ID later but we'll
  • 00:06:14 just say it's I for now order date we'll
  • 00:06:17 just leave blank and purchase address
  • 00:06:19 we'll leave blank for now I'll fill them
  • 00:06:21 with like NA so what would this data frame
  • 00:06:23 look like so we could do
  • 00:06:25 df.loc we're going to put it in
  • 00:06:27 position I so that will start with 0 go
  • 00:06:30 up to 999 and we're going to say the row
  • 00:06:34 there is well we can start building out
  • 00:06:37 a CSV or
  • 00:06:38 dataframe by just filling in a list that
  • 00:06:41 has the right dimension here so our
  • 00:06:43 order ID we said was I then we're going
  • 00:06:46 to fill out the product we randomly
  • 00:06:48 selected the price or I guess the
  • 00:06:50 quantity ordered we'll say is 1 to start
  • 00:06:52 the price of that randomly selected item
  • 00:06:55 was price order date I'm going to say
  • 00:06:58 is NA and purchase address is NA to
  • 00:07:02 start and so that will add a thousand
  • 00:07:06 rows and then to see if it works we can
  • 00:07:09 do a df.to_csv test data dot CSV we'll
  • 00:07:18 say so let's see what happens
  • 00:07:27 dict_keys is not subscriptable so I
  • 00:07:33 should probably surround this with a list
  • 00:07:34 because it works on a list type object
  • 00:07:37 first then I can do random.choice cool
  • 00:07:39 finished in 5.1 seconds so I can open up
  • 00:07:42 the folder that I have this file and as
  • 00:07:44 you can see this file is called generate
  • 00:07:47 data and as you see we have test data
  • 00:07:50 CSV let's see what's in that test data
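
A minimal sketch of the starter script described up to this point. The column labels and the short product list are assumptions for illustration (the full list lives in the author's GitHub file), and only three products are shown. Note the `list(...)` around `products.keys()`, which avoids the "dict_keys is not subscriptable" error.

```python
import random

import pandas as pd

# Column labels are assumed spellings of the six columns named in the video.
columns = ["Order ID", "Product", "Quantity Ordered",
           "Price Each", "Order Date", "Purchase Address"]

# Product -> price mapping (illustrative subset only).
products = {
    "iPhone": 700,
    "Google Phone": 600,
    "Very Bad Phone": 400,
}

df = pd.DataFrame(columns=columns)  # empty frame with the six columns

for i in range(1000):
    # random.choice needs a sequence, so wrap the dict keys in a list.
    product = random.choice(list(products.keys()))
    price = products[product]
    # Order date and address are placeholders for now.
    df.loc[i] = [i, product, 1, price, "NA", "NA"]

# df.to_csv("test_data.csv", index=False) would write the file to disk;
# calling to_csv without a path returns the CSV text instead.
csv_text = df.to_csv(index=False)
```

Opening the resulting CSV should show a random product on every row, with the price matching the dictionary.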
  • 00:07:56 cool looks like we have our products
  • 00:07:58 that seem kinda random right and do they
  • 00:08:02 match what we said in our dictionary
  • 00:08:03 should be the price of those huh where'd
  • 00:08:05 you go let's say let's look at some of
  • 00:08:11 them like MacBook Pro laptop 1700 and as
  • 00:08:15 you can see it's 1700 in this document
  • 00:08:18 as well so that looks good so that's a
  • 00:08:20 very simple starter data but we want to
  • 00:08:23 make our data more realistic so how do
  • 00:08:24 we do that the first thing to make it
  • 00:08:26 more realistic will be allowing some of
  • 00:08:29 these products to show up more on our
  • 00:08:32 data than others so like it shouldn't
  • 00:08:34 just be a random process you should
  • 00:08:36 expect like the iPhone and the Google
  • 00:08:38 phone should be purchased more than this
  • 00:08:41 very bad phone and similarly like
  • 00:08:45 AA and AAA batteries they have
  • 00:08:47 such a low cost
  • 00:08:48 they should be purchased more probably
  • 00:08:50 than like a ThinkPad
  • 00:08:51 laptop so the question I have for
  • 00:08:54 you guys is how can we make certain
  • 00:08:58 products show up more than others
  • 00:09:00 feel free to pause the video and I'll
  • 00:09:01 walk through my solution in one second
  • 00:09:03 okay my solution for this was to in
  • 00:09:06 addition to having the price here in
  • 00:09:10 this dictionary I also mapped a weight
  • 00:09:15 value the exact weights that we will
  • 00:09:19 ultimately give these so I could say
  • 00:09:20 let's say so the iPhone had a weight of
  • 00:09:23 10 this exact value of 10 doesn't really
  • 00:09:26 matter but you know it should always be
  • 00:09:30 kind of relative to the other product so
  • 00:09:31 maybe if I said the iPhone was a 10 you
  • 00:09:35 know maybe I expect the iPhone to be
  • 00:09:36 sold more than the Google phone so maybe
  • 00:09:39 this is an 8 and then if we were giving
  • 00:09:42 a weight to the very bad phone that's
  • 00:09:45 gonna be a lot less than both the iPhone
  • 00:09:48 or the Google phone so maybe we give
  • 00:09:49 that a weighting of 3 so we can keep
  • 00:09:52 doing this for each of the products in
  • 00:09:55 here and as I did the last time I don't
  • 00:09:56 want to do all this but you should be
  • 00:09:59 able to kind of see what I ultimately
  • 00:10:01 use as my relative weighting scheme and
  • 00:10:04 kind of see why maybe certain values are
  • 00:10:07 what they are so I'll paste
  • 00:10:09 that in real quick ok so they all have
  • 00:10:15 weights now given that we have the
  • 00:10:17 weights now let's actually use those
  • 00:10:19 values to generate our data again so
  • 00:10:23 before we did random dot choice and we
  • 00:10:24 selected a product from the list now
  • 00:10:28 we're going to do the same exact thing
  • 00:10:29 but we will factor in also this relative
  • 00:10:32 weight and to do this I'm going to first
  • 00:10:35 bring in the documentation for the
  • 00:10:37 random library and we're going to go to
  • 00:10:39 random dot choice and so I can't type ok
  • 00:10:46 random dot choice so it doesn't tell us
  • 00:10:49 much right here but right below random
  • 00:10:51 dot choice notice that there's this
  • 00:10:53 random dot choices
  • 00:10:54 function and the first or the second
  • 00:10:58 parameter of that is weights and it's
  • 00:11:00 currently set to none by default and so
  • 00:11:02 let's see what weights does
  • 00:11:04 so if a weights sequence is specified
  • 00:11:09 selections are made according to the
  • 00:11:10 relative weights cool this is exactly
  • 00:11:13 what we want so now we need to use
  • 00:11:16 random.choices and we want to pass
  • 00:11:18 in the weight so the weights equal and
  • 00:11:24 it's going to be this parameter of the
  • 00:11:27 the dictionary so how can we get that I
  • 00:11:30 think the easiest solution is to use a
  • 00:11:33 list comprehension here and I'm actually
  • 00:11:35 going to rewrite how we grab this too so
  • 00:11:37 we're gonna do two things at once it's
  • 00:11:39 going to be the same exact value here
  • 00:11:41 but I'm going to just make it a little
  • 00:11:43 bit neater so we have first our product
  • 00:11:46 list which is just going to be product
  • 00:11:50 for product in products so we did a
  • 00:11:56 list comprehension on these products
  • 00:11:58 here to just get the key that's the
  • 00:12:01 first thing I just defined and now we
  • 00:12:03 want to define our weights well we can
  • 00:12:05 do a similar list comprehension we can
  • 00:12:06 do products[product] and then
  • 00:12:12 that will give us this list of
  • 00:12:15 values we want to grab the second value
  • 00:12:17 or the first index so then we do one
  • 00:12:21 here and that will be for product in the
  • 00:12:24 products so two list comprehensions I'll
  • 00:12:28 pass in the product list and I'll pass
  • 00:12:31 in the weights as weights and just to
  • 00:12:35 make this neat we'll regenerate our data
  • 00:12:37 again
  • 00:12:38 but just because it might be helpful
  • 00:12:41 I'll also print out before I save the
  • 00:12:43 data the product list and the weights
  • 00:12:47 just to show that we still have the same
  • 00:12:49 thing or that they're right and
  • 00:12:53 we're passing the right thing into this
  • 00:12:54 random choices method okay so cool let's
  • 00:12:59 run this we got an error what is there
  • 00:13:03 oh no okay Google phone invalid
  • 00:13:07 syntax okay why do we have invalid
  • 00:13:09 syntax by Google phone
  • 00:13:11 and okay I see it I actually
  • 00:13:13 forgot to add my commas back in let's
  • 00:13:18 see it should work now no another error
  • 00:13:25 unhashable type list okay and I think
  • 00:13:30 the reason for that is let's look at the
  • 00:13:33 return type of random choices does it
  • 00:13:38 give us the return type return a k
  • 00:13:41 sized list so k is 1 but it's still a
  • 00:13:44 list so what we actually have to do now
  • 00:13:47 to get the product is pass in a 0 index now
  • 00:13:49 that we've used random choices instead
  • 00:13:51 of random choice let's run that again
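
A small sketch of the weighted selection just described, assuming each product maps to a `[price, weight]` list (the three entries are the ones quoted in the video; the rest of the dictionary is omitted). `random.choices` returns a list even when `k` is 1, which is why the `[0]` is needed; using the list itself as a dictionary key is what produces the "unhashable type: list" error.

```python
import random
from collections import Counter

# product -> [price, relative weight]; only the weights' ratios matter.
products = {
    "iPhone": [700, 10],
    "Google Phone": [600, 8],
    "Very Bad Phone": [400, 3],
}

product_list = [product for product in products]
weights = [products[product][1] for product in products]

random.seed(42)  # fixed seed only so the illustration is repeatable

# random.choices returns a k-sized list, so index [0] to get the item.
picks = [random.choices(product_list, weights=weights)[0]
         for _ in range(10_000)]

counts = Counter(picks)
print(counts.most_common())
```

With these weights the iPhone should come up roughly three times as often as the Very Bad Phone.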
  • 00:13:53 cool we have all of our products that
  • 00:13:56 looks good here and do the weights match
  • 00:13:58 up they seem to match up you could
  • 00:14:00 double check if you wanted to but yeah
  • 00:14:04 that looks good and we can also if you
  • 00:14:06 want load up the test data but I'll
  • 00:14:09 pass on that for now it will
  • 00:14:10 you know reflect these weights you can
  • 00:14:14 trust that random choices works so the
  • 00:14:16 next question I have for you guys is
  • 00:14:18 that in the pandas tutorial we didn't
  • 00:14:20 just have a test data CSV we had 12
  • 00:14:23 different CSVs each representing a
  • 00:14:26 different month so you know you would
  • 00:14:27 have like our January data CSV or
  • 00:14:30 February data CSV so how can we go about
  • 00:14:34 making 12 CSVs instead of 1 so that's
  • 00:14:37 the first problem
  • 00:14:38 and then the second problem is how can
  • 00:14:41 we weight those months and the number of
  • 00:14:43 items that are generated each month
  • 00:14:46 differently and one of the key things I
  • 00:14:48 want to include here is I want
  • 00:14:50 December to have the most items
  • 00:14:52 generated maybe November to be the
  • 00:14:54 second most but then the other months
  • 00:14:56 maybe just kind of fluctuate randomly
  • 00:14:58 around a certain value so we want to
  • 00:15:01 generate data for each month and also
  • 00:15:04 have that data kind of fluctuate based
  • 00:15:05 on the month so let's try to do that
  • 00:15:07 feel free to pause and try that on your
  • 00:15:10 own okay to start I think the first
  • 00:15:13 thing we'll do is we'll indent all of
  • 00:15:15 this and we'll basically surround it
  • 00:15:18 with another for loop so we'll do for i
  • 00:15:20 in range
  • 00:15:21 and I guess technically maybe this
  • 00:15:24 should be we'll just say for month for
  • 00:15:28 month_value in range 1 to 13 because we
  • 00:15:32 want to start at 1 and we want to go to
  • 00:15:35 12 so 13 is the first thing that we
  • 00:15:38 exclude all right so for that we'll
  • 00:15:41 generate we'll do the same process as
  • 00:15:44 before we'll generate the thousand rows I
  • 00:15:47 don't need to print this anymore
  • 00:15:51 yeah we'll generate the thousand rows in
  • 00:15:53 our CSV and also I'm gonna have to now
  • 00:15:55 pull this empty data frame into
  • 00:16:00 happening each month okay we're getting
  • 00:16:04 there okay so here we are now we have 13
  • 00:16:08 times or I guess 12 times that we should
  • 00:16:10 be generating a CSV but now we want to
  • 00:16:13 populate this name to actually say the
  • 00:16:15 month name and to do that we're gonna
  • 00:16:17 import another library you probably
  • 00:16:18 could do it with this date time library
  • 00:16:20 but it wasn't immediately clear to
  • 00:16:22 me so I used this calendar library here
  • 00:16:26 and you might have to pip install this
  • 00:16:27 I'm not positive it might be built-in
  • 00:16:29 but what we can do is we can grab the
  • 00:16:33 month name by doing calendar.month_name
  • 00:16:40 and then we pass in the month value
  • 00:16:44 so that will give us the month name and
  • 00:16:47 if I want to be creative and use Python
  • 00:16:51 3 to its full capabilities when I'm
  • 00:16:54 saving this data I could save something
  • 00:16:56 like we'll use an f-string here so
  • 00:16:59 f-strings are pretty cool if you're
  • 00:17:00 unfamiliar but I'm gonna pass in month
  • 00:17:02 name directly into the string but
  • 00:17:05 because I used this f right here I can
  • 00:17:07 do this so now it'll save as month
  • 00:17:09 name data CSV okay I think that's all we
  • 00:17:14 need to do to generate the different CSVs
  • 00:17:17 and then the second step of this
  • 00:17:19 problem was generating different values
  • 00:17:22 for different months but we'll do that
  • 00:17:23 in a sec run this might take a little
  • 00:17:26 bit now because we are generating you
  • 00:17:29 know CSVs for each
  • 00:17:32 month one thing to notice as this is
  • 00:17:35 going and kind of check that it's
  • 00:17:38 working January February March April May
  • 00:17:41 June so it is generating these CSVs
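
As a quick sketch of the month loop and file naming: `calendar.month_name` is part of the standard library and is indexed by month number, and the f-string drops the name into the file name. The exact `_data.csv` pattern here is an assumption based on how the name is read out loud.

```python
import calendar

filenames = []
for month_value in range(1, 13):  # 1..12; 13 is excluded
    month_name = calendar.month_name[month_value]
    # Curly braces inside an f-string are replaced by the variable's value.
    filename = f"{month_name}_data.csv"
    filenames.append(filename)

print(filenames[:3])
```

In the full script, the `df.to_csv(...)` call would take `filename` instead of a fixed name.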
  • 00:17:45 and using this calendar method we were
  • 00:17:48 able to get the month name for each of
  • 00:17:51 those and the f-string worked well where we
  • 00:17:53 could just use these brackets and pass
  • 00:17:57 in the month name like that so any
  • 00:18:00 variable you can use the whatever you
  • 00:18:04 call those brackets and pass in
  • 00:18:07 variables and you can even pass in code
  • 00:18:09 into this which is cool all right so how
  • 00:18:12 do we take what we just did and now you
  • 00:18:14 know have more products be included in
  • 00:18:17 the CSV for certain months so this
  • 00:18:20 thousand will no longer be static and
  • 00:18:22 okay let's let's start by breaking it
  • 00:18:25 off based on the month I mentioned that
  • 00:18:28 I want December to be the highest value
  • 00:18:30 so I'll make an if statement for if month
  • 00:18:33 value equals equals 12 so the month of
  • 00:18:36 December make high value just as a
  • 00:18:42 reminder to myself we also I think want
  • 00:18:46 to make the value slightly higher in
  • 00:18:49 November because you know things are
  • 00:18:51 starting to ramp up for the holiday
  • 00:18:53 season and for the other ones we'll just
  • 00:18:58 kind of randomly choose so if month
  • 00:19:01 value equals equals let's say less than
  • 00:19:05 equal to ten
  • 00:19:09 just randomly select and the way we will
  • 00:19:12 randomly select we want some sort of
  • 00:19:15 like average value maybe that we're
  • 00:19:17 selecting around and we want the values
  • 00:19:20 to kind of appear around that average
  • 00:19:21 value so the way I thought about doing
  • 00:19:25 this was using a normal distribution so
  • 00:19:27 a normal distribution is something you
  • 00:19:30 often find in data sets and you know in
  • 00:19:32 the wild kind of with all sorts of
  • 00:19:34 different types of data and especially
  • 00:19:35 where you find in your data set there's
  • 00:19:38 a mean value that things are kind of
  • 00:19:40 circling around
  • 00:19:43 and
  • 00:19:44 then that's like the peak so most
  • 00:19:46 values are around this peak value and then
  • 00:19:49 the distribution kind of
  • 00:19:50 trails off and curves off around that
  • 00:19:55 peak so we'll select the values of
  • 00:19:58 products that will generate for each CSV
  • 00:20:00 using a normal distribution and we can
  • 00:20:03 do that in code by doing this I'm going
  • 00:20:04 to say orders amount equals numpy dot
  • 00:20:09 random dot normal and you know numpy
  • 00:20:15 offers a lot of good statistics
  • 00:20:18 functions and capabilities so we want
  • 00:20:21 the loc to be twelve thousand I'm going
  • 00:20:25 to say so I want on average twelve
  • 00:20:27 thousand items and the scale so this is
  • 00:20:32 like our mean the scale will be how much
  • 00:20:36 spread there is so the scale we'll say is
  • 00:20:40 four thousand the
  • 00:20:45 standard deviation I believe is what the scale
  • 00:20:48 is representing in this library so four
  • 00:20:51 thousand so we want to generate values
  • 00:20:54 that are centered around twelve thousand
  • 00:20:56 that have a standard deviation and trail
  • 00:20:58 off roughly like four thousand so and I
  • 00:21:03 think that this can give us a float
  • 00:21:06 value in addition to integers and I want
  • 00:21:09 to just keep it an integer so I'm going
  • 00:21:10 to just surround this with int so for
  • 00:21:15 the November month value you could set this
  • 00:21:17 order's amount to a concrete value but
  • 00:21:19 I'm going to just set it also to a
  • 00:21:21 normal distribution
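
A short sketch of the draw for a regular month. In `numpy.random.normal`, `loc` is the mean and `scale` is the standard deviation (not the variance); the result is a float, so it is wrapped in `int`. The `max(0, ...)` guard is an assumption added here, since a normal draw can in principle come out negative.

```python
import numpy as np

np.random.seed(0)  # fixed seed only so the illustration is repeatable

# One draw: centered on 12,000 orders with a standard deviation of 4,000.
orders_amount = int(np.random.normal(loc=12000, scale=4000))
orders_amount = max(0, orders_amount)  # assumed guard against negatives

# Many draws, just to show that they cluster around the mean.
samples = np.random.normal(loc=12000, scale=4000, size=100_000)
print(orders_amount, round(samples.mean()), round(samples.std()))
```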
  • 00:21:28 we'll give the mean this time as about
  • 00:21:32 20,000 with a standard deviation of 3000 and
  • 00:21:39 finally for the highest value and the
  • 00:21:42 month of December we will say that this
  • 00:21:44 is set to numpy random dot normal and we'll
  • 00:21:50 give this one a mean of 26,000 and a
  • 00:21:55 scale of 3000 as well and this is just
  • 00:21:59 giving us a really good probability that
  • 00:22:02 December will have the most sales a very
  • 00:22:04 good probability that November will
  • 00:22:07 have the second most sales and then for
  • 00:22:09 the remaining months the remaining 10
  • 00:22:11 months kind of just randomly select a
  • 00:22:13 value that fluctuates based on this
  • 00:22:16 normal distribution so now what we will
  • 00:22:19 do is replace this 1000 with our orders
  • 00:22:22 amount that we select and going into
  • 00:22:27 this we might want to actually change this
  • 00:22:30 to not be i anymore but actually be like
  • 00:22:32 some sort of order ID because I
  • 00:22:35 ultimately don't want to have collisions
  • 00:22:37 between order IDs between months and
  • 00:22:39 right now this i will reset each time we
  • 00:22:43 get to a new month so let's make this
  • 00:22:46 some order ID value and we'll have to
  • 00:22:49 define that up here so I'm going to say
  • 00:22:52 order ID equals I'm just gonna assign it
  • 00:22:56 some random value to start so 143253
  • 00:23:02 will be our starting value we'll add
  • 00:23:06 that value the first time we are here
  • 00:23:09 and then each time we want to add a new
  • 00:23:14 row each time we basically get through
  • 00:23:17 this and add a row we'll update this
  • 00:23:19 order ID to be one more and that will
  • 00:23:22 happen for all of the orders in our
  • 00:23:24 orders amount and I guess this is orders
  • 00:23:26 amount cool so I think that should do
  • 00:23:33 what we want to do
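
Putting the month-dependent counts and the running order ID together could look roughly like this. The helper name `orders_for_month` and the `max(0, ...)` guard are assumptions for illustration; the means, scales, and the starting ID 143253 are the ones quoted above. Row creation is left out so only the counting logic is visible.

```python
import numpy as np

def orders_for_month(month_value):
    """Draw how many orders to generate for the given month (1-12)."""
    if month_value <= 10:
        amount = np.random.normal(loc=12000, scale=4000)
    elif month_value == 11:
        amount = np.random.normal(loc=20000, scale=3000)
    else:  # December: the high value
        amount = np.random.normal(loc=26000, scale=3000)
    return max(0, int(amount))  # assumed guard against negative draws

np.random.seed(1)  # fixed seed only so the illustration is repeatable

order_id = 143253  # arbitrary starting value, never reset between months
ids_by_month = {}

for month_value in range(1, 13):
    orders_amount = orders_for_month(month_value)
    month_ids = []
    for _ in range(orders_amount):
        month_ids.append(order_id)  # a full row would be added here
        order_id += 1               # so the next row gets a fresh ID
    ids_by_month[month_value] = month_ids
```

Because `order_id` is defined outside the month loop, IDs keep counting up across files instead of colliding.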
  • 00:23:37 and you can run the code again to check
  • 00:23:40 I've been running this forever and it
  • 00:23:42 still hasn't finished so if you ever run
  • 00:23:45 into this case try to ask yourself like
  • 00:23:47 why is this happening and I realized
  • 00:23:50 that I left this code in here which probably
  • 00:23:52 slows things down because I'm running
  • 00:23:54 this iteration a lot and this does not
  • 00:23:57 need to be done each time I can move
  • 00:24:00 this completely out of the for
  • 00:24:04 loops so I can put this stuff up here
  • 00:24:08 because it's static it only needs to be
  • 00:24:10 done once so that is one quick change I
  • 00:24:14 could make to make the code run quicker
  • 00:24:17 if it's still running too slow I can try
  • 00:24:22 to do more but the nature of it is it's
  • 00:24:25 gonna probably take a little bit of time
  • 00:24:27 because we are generating a lot of rows
  • 00:24:29 but you know if it's unreasonable amount
  • 00:24:32 of time you know look at what you're
  • 00:24:34 doing and see if there's anything you
  • 00:24:35 can improve and another thing you can do
  • 00:24:41 you know maybe I cancel this build again
  • 00:24:43 you can kind of just keep track of
  • 00:24:47 things and maybe each time you save a
  • 00:24:50 CSV like print month name I'm gonna do
  • 00:25:00 this as a different f-string anyway
  • 00:25:02 month name finished and this will kind
  • 00:25:06 of just help you know how far you're
  • 00:25:08 progressing along
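
A rough sketch of the two speed-related ideas above: build the static product list and weights once, outside both loops, and print a progress line per month. As a further variation that is not from the video, this version collects rows in a plain Python list and builds the DataFrame once per month instead of assigning with `df.loc` row by row; `generate_month` is a hypothetical helper name, and the products are an illustrative subset.

```python
import calendar
import random

import pandas as pd

columns = ["Order ID", "Product", "Quantity Ordered",
           "Price Each", "Order Date", "Purchase Address"]
products = {"iPhone": [700, 10], "Google Phone": [600, 8],
            "Very Bad Phone": [400, 3]}

# Static values: computed once, not on every row.
product_list = [product for product in products]
weights = [products[product][1] for product in products]

def generate_month(month_value, orders_amount, first_order_id):
    """Return (DataFrame for the month, next unused order ID)."""
    rows = []
    order_id = first_order_id
    for _ in range(orders_amount):
        product = random.choices(product_list, weights=weights)[0]
        price = products[product][0]
        rows.append([order_id, product, 1, price, "NA", "NA"])
        order_id += 1
    df = pd.DataFrame(rows, columns=columns)  # built once per month
    print(f"{calendar.month_name[month_value]} finished")  # progress
    return df, order_id

january, next_order_id = generate_month(1, 1000, 143253)
```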
  • 00:25:11 and in January finished that wasn't too
  • 00:25:14 too bad I think it definitely was
  • 00:25:16 quicker than before
  • 00:25:16 and this is gonna keep happening all
  • 00:25:22 right
  • 00:25:22 it finally finished and it only took
  • 00:25:24 eight hundred and twenty eight seconds
  • 00:25:27 so obviously this is not something you
  • 00:25:29 want to do every time you run this mock
  • 00:25:32 data generation so there's a couple
  • 00:25:35 things we can do to like kind of keep
  • 00:25:38 our test period small like we don't need
  • 00:25:40 to look at every individual CSV file to
  • 00:25:42 like make sure that things are
  • 00:25:44 working properly so what I recommend
  • 00:25:46 you do as you're testing is to break
  • 00:25:48 out of the month loop while testing so
  • 00:25:54 that will make you only have to do
  • 00:25:55 January each time you run it you also
  • 00:25:58 could decrease like this value or these
  • 00:26:00 values as you are testing it to make
  • 00:26:03 sure things work but yeah you're not
  • 00:26:05 gonna want to spend 800 seconds each
  • 00:26:07 time you make a change but anyway we
  • 00:26:10 have something working pretty well where
  • 00:26:12 we're generating a CSV for all the
  • 00:26:14 months and I can show all the months
  • 00:26:17 over here that are all like these files
  • 00:26:21 you know bigger than a megabyte that we
  • 00:26:23 just generated but yeah we're doing and
  • 00:26:27 that was December too which is kind of
  • 00:26:28 nice you can see the relative sizes just
  • 00:26:32 by looking at the kilobyte values but
  • 00:26:35 looks like it's working well all right
  • 00:26:37 the next task I have for you guys is
  • 00:26:39 let's try to enter a random address for
  • 00:26:42 each row in the CSV that we're
  • 00:26:44 generating so I want a random address
  • 00:26:46 for each row in this CSV so how would we
  • 00:26:49 generate random addresses feel free to
  • 00:26:52 pause and then start up the video okay
  • 00:26:56 the way I thought about doing this was I
  • 00:26:57 added a function to my code called
  • 00:27:00 generate random address and so what do
  • 00:27:06 we need to have in this function well I
  • 00:27:08 was trying to think about okay I want a
  • 00:27:10 random address I want you know like 123
  • 00:27:12 Main Street Boston Massachusetts so
  • 00:27:16 we're gonna be able to generate stuff
  • 00:27:17 like that so to start out I just did a
  • 00:27:19 Google search of the most common street
  • 00:27:22 names
  • 00:27:22 and I ended up actually getting a good
  • 00:27:25 amount back so you see right here like
  • 00:27:27 these would be good streets to include
  • 00:27:29 if you're randomly generating addresses
  • 00:27:31 but I ultimately found
  • 00:27:33 I think this FiveThirtyEight post that had all of
  • 00:27:37 these street names listed and I just
  • 00:27:41 ultimately wrote down these names and
  • 00:27:44 knew that when I was generating random
  • 00:27:45 addresses I would select one of these
  • 00:27:47 completely randomly and that would be
  • 00:27:49 like the base of my street so I'll be
  • 00:27:52 pasting the code to do that so the first
  • 00:27:54 thing we have in our generate random
  • 00:27:57 address will be some street names so
  • 00:28:01 everything you just saw there was
  • 00:28:03 included and you can as I said before
  • 00:28:05 you can find this file on my github page
  • 00:28:08 so if you want to just copy in this too
  • 00:28:10 then I decided I wanted to also have a
  • 00:28:13 few different choices for cities so you
  • 00:28:18 know the cities I thought to include
  • 00:28:20 were San Francisco these are all US
  • 00:28:24 cities Boston my hometown New York City
  • 00:28:34 and then I'll paste in some more one
  • 00:28:40 thing to note is I purposely included
  • 00:28:42 for a challenge two cities of Portland
  • 00:28:46 now one being in Maine and one being in
  • 00:28:48 Oregon and that was just kind of to
  • 00:28:51 throw people off when we were doing the
  • 00:28:52 analysis so we have our cities with the
  • 00:28:55 cities we needed the corresponding US
  • 00:28:58 states so I'll paste in a list of states
  • 00:29:05 so we got our states corresponding to
  • 00:29:07 these cities and then I did get slightly
  • 00:29:11 lazy and I don't know if you notice this
  • 00:29:13 in the data of the last video but I also
  • 00:29:16 found some zip codes associated with
  • 00:29:19 each of these cities but I used the same
  • 00:29:21 zip code each time I use a specific city
  • 00:29:23 so I kind of wasn't super super precise
  • 00:29:27 with how random these were but it was
  • 00:29:30 good enough it seemed very realistic if
  • 00:29:33 I really wanted to include multiple zip
  • 00:29:35 codes
  • 00:29:36 I would have probably mapped city as the
  • 00:29:39 key and a list of ZIP codes as the value of
  • 00:29:44 a dictionary but I was just going to use
  • 00:29:47 one value and note that the indexes of
  • 00:29:51 these different lists all match
  • 00:29:55 so San Francisco goes with California
  • 00:29:57 and 94016 Boston with
  • 00:29:59 Massachusetts 02215 etc
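A minimal sketch of the index-matched lists described above. Only the San Francisco and Boston ZIP codes come from the video; the other values, the variable names, and the dictionary layout are assumptions for illustration. ZIP codes are kept as strings so that a leading zero such as 02215 survives.

```python
# Parallel lists: the same index refers to the same place in every list.
# Only "94016" and "02215" are taken from the video; the rest are
# illustrative placeholders.
cities = ["San Francisco", "Boston", "New York City", "Portland", "Portland"]
states = ["CA", "MA", "NY", "OR", "ME"]
zips = ["94016", "02215", "10001", "97035", "04101"]

# The approach only works while the lists stay the same length.
assert len(cities) == len(states) == len(zips)

for city, state, zip_code in zip(cities, states, zips):
    print(f"{city}, {state} {zip_code}")

# Alternative mentioned in the video: several ZIP codes per city.
# Keyed on (city, state) here because "Portland" alone is ambiguous;
# the second ZIP in each list is made up for the example.
zips_by_city = {
    ("San Francisco", "CA"): ["94016", "94102"],
    ("Boston", "MA"): ["02215", "02108"],
}
```

With the dictionary version, picking a ZIP code would become a second random choice from the list stored under the selected city.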
  • 00:30:02 okay so we have all the stuff we need to
  • 00:30:04 generate our address and now basically
  • 00:30:08 we just need to put it all together so
  • 00:30:09 what I did there was I also used some
  • 00:30:13 weights to pick my city so we'll add in
  • 00:30:17 those weights this was kind of just
  • 00:30:21 thinking about generally you know what
  • 00:30:23 cities are most likely to purchase a lot
  • 00:30:25 of electronics so i weighted it
  • 00:30:27 according to that so like San Francisco
  • 00:30:29 I thought would probably purchase a lot I
  • 00:30:31 thought New York City would purchase a
  • 00:30:32 lot so these weights are relative to you
  • 00:30:37 know how big the city is you know how
  • 00:30:38 much money maybe is in the city so we
  • 00:30:42 get these weights and now it's
  • 00:30:45 packaging it all together so we want to
  • 00:30:48 get a random street first off so all we
  • 00:30:51 have to do for that is use our
  • 00:30:53 random.choice function you've seen before
  • 00:30:54 on street names and then if we want to get
  • 00:30:57 a random city we can do random.choices
  • 00:31:03 just like we did for the products
  • 00:31:09 earlier in the video and we can pass in
  • 00:31:12 and basically what we're going to do is
  • 00:31:15 actually not pass in the cities list but
  • 00:31:18 we're gonna pass in the range of the
  • 00:31:22 length of the cities list just so we can get an index
  • 00:31:24 value and then use that in the
  • 00:31:25 corresponding cities, states, and zips
  • 00:31:31 lists
  • 00:31:35 okay random.choices so we're gonna pick
  • 00:31:38 that index with weights equal to our
  • 00:31:42 weights that we put here so as you can
  • 00:31:45 see 0.5 is the least likely to be
  • 00:31:48 selected and that's associated with
  • 00:31:50 Portland Maine 9 is the most likely and
  • 00:31:54 that's associated with San Francisco
  • 00:31:56 California and as we did before we need
  • 00:32:00 to get the 0th index of that now finally
  • 00:32:03 we can return an f-string that has an
  • 00:32:06 address formatted so I want to return
  • 00:32:09 the f-string of random.randint I'm
  • 00:32:15 going to give the street a random
  • 00:32:17 number between 1 and 999 and then we
  • 00:32:23 will add in the street name which we
  • 00:32:26 also named a variable and then in actual
  • 00:32:29 string text I'm going to write Street
  • 00:32:32 comma then we wanted to pass in because
  • 00:32:38 it goes address City comma state comma
  • 00:32:42 like zip code so what we want to do to get
  • 00:32:46 our city now I guess this
  • 00:32:51 is not really a city this is our index
  • 00:32:52 to get our city we would pass in
  • 00:32:56 the variable cities at that index comma
  • 00:33:05 now we want to pass in the state
  • 00:33:08 abbreviation so that'd be States index
  • 00:33:13 and let's see what else do we need to do
  • 00:33:17 and then we just need to pass in I'm
  • 00:33:19 just going to do a space and then pass
  • 00:33:20 it in the zip code so zips index and
  • 00:33:24 this right here all of this logic right
  • 00:33:27 here now we're randomly generating
  • 00:33:29 addresses and this is what I did for the
  • 00:33:31 data so we can go ahead if we want to
  • 00:33:35 and you know each time we're generating
  • 00:33:38 one of these values we can add an
  • 00:33:43 address equals generate random address
  • 00:33:47 and then fill in this spot with our
  • 00:33:50 newly added in address and just to make
  • 00:33:56 things simple I'm gonna make this number
  • 00:33:58 very small to just start out just so it
  • 00:34:03 doesn't take long to run oh here we go
  • 00:34:12 oopsies that's annoying that it did that
  • 00:34:15 there we go
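The address generator walked through above could look roughly like this. It is a sketch under assumptions: the street names, most list values, and all weights except 9 (San Francisco) and 0.5 (Portland, Maine) are placeholders, and "St" is used as the literal street suffix.

```python
import random

# Placeholder data for illustration; see the note above about which
# values actually appear in the video.
street_names = ["Main", "2nd", "Park", "Maple", "Elm"]
cities = ["San Francisco", "Boston", "New York City", "Portland", "Portland"]
states = ["CA", "MA", "NY", "OR", "ME"]
zips = ["94016", "02215", "10001", "97035", "04101"]
weights = [9, 4, 5, 2, 0.5]  # relative likelihood of each city being picked


def generate_random_address():
    street = random.choice(street_names)
    # Pick an index instead of a city so the same position can be used
    # in all three lists. random.choices returns a list, hence the [0].
    index = random.choices(range(len(cities)), weights=weights)[0]
    number = random.randint(1, 999)  # both endpoints are included
    return f"{number} {street} St, {cities[index]}, {states[index]} {zips[index]}"


print(generate_random_address())
```

Because the weights are relative, they do not need to sum to 1; `random.choices` normalizes them internally.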
  • 00:34:20 cool so we finished the month of January
  • 00:34:24 and if I load up that data you can see
  • 00:34:25 what the output is and look at that look
  • 00:34:30 at these purchase addresses they seem
  • 00:34:31 very realistic but we just randomly
  • 00:34:34 generated them and I think you would not
  • 00:34:36 really notice the only real
  • 00:34:38 sign that you'd notice is all these
  • 00:34:41 matching ZIP codes but they look very
  • 00:34:44 realistic so that's pretty cool
  • 00:34:48 looks like we did something wrong with
  • 00:34:49 the Price Each here so we need to fix
  • 00:34:51 that let's fix the price that should
  • 00:34:56 actually be the 0th index now that we've
  • 00:34:58 switched it to using this list so now
  • 00:35:01 we're gonna actually grab the price okay
  • 00:35:03 as the next task let's fill in this last
  • 00:35:05 n/a and generate random order dates for
  • 00:35:10 each of our rows of data and to make
  • 00:35:12 this a little bit more challenging of a
  • 00:35:14 problem I don't want just any date here
  • 00:35:17 I want the times for these purchases to
  • 00:35:20 peak around noon and around 8:00 p.m.
  • 00:35:23 and then all other times we'll kind of
  • 00:35:25 circle around those average times so
  • 00:35:28 they'll kind of trail off like this on
  • 00:35:30 both sides so noon and 8:00 p.m. those
  • 00:35:34 are the peak times I want the times to
  • 00:35:36 be generated around but then to kind of
  • 00:35:38 like be spread around that try to do it
  • 00:35:41 on your own if you want if not I'll just
  • 00:35:42 dive into how I would go about doing it
  • 00:35:46 whenever I'm going to work with dates
  • 00:35:48 and times in Python I'm always thinking
  • 00:35:50 about using the datetime library so
  • 00:35:54 we're going to look at the documentation
  • 00:35:55 for that in a second first let's start
  • 00:35:58 writing out the base of a
  • 00:36:00 a function that we'll use to generate
  • 00:36:03 these times so just like we had generate
  • 00:36:05 random address
  • 00:36:06 let's do generate random time and our
  • 00:36:11 ultimate goal is going to be to generate
  • 00:36:13 a date in the format let's say our month
  • 00:36:19 slash day so we can even say month month
  • 00:36:23 day day and then year and then followed
  • 00:36:27 by an hour and a minute so this is what
  • 00:36:29 we're trying to generate and we want
  • 00:36:31 this to average around either 8:00 p.m.
  • 00:36:34 or noon so that's also what we're trying
  • 00:36:37 to do so let's load in the date/time
  • 00:36:40 library documentation and see how that
  • 00:36:42 can help us in this task
  • 00:36:44 okay so date/time library Python you can
  • 00:36:48 type into Google and find this I'll
  • 00:36:50 make it slightly bigger okay so when you
  • 00:36:55 kind of start on this
  • 00:36:57 documentation page one thing to notice
  • 00:37:00 is all this stuff on the left it's super
  • 00:37:02 helpful to navigate so what I see here
  • 00:37:05 is I see you know some of the
  • 00:37:07 preliminary stuff then I see timedelta
  • 00:37:08 objects we'll get to that in a sec
  • 00:37:11 I see date objects and datetime objects
  • 00:37:14 well given that we want to generate both
  • 00:37:16 a date and a time what we're going to
  • 00:37:19 want to click on is this date time
  • 00:37:20 object so we can construct a date time
  • 00:37:25 by passing in a year month day and then
  • 00:37:28 optionally hour minute second etc so
  • 00:37:32 that's a good starting point so let's
  • 00:37:33 start with date which is going to be
  • 00:37:42 datetime.datetime I'm going to pass in
  • 00:37:45 let's look at that one more time
  • 00:37:50 year-month-day so I mentioned that this
  • 00:37:53 data is all gonna be for 2019 now we
  • 00:37:57 need to get the month
  • 00:37:58 well when we're iterating through this
  • 00:38:01 loop we have the month already assigned
  • 00:38:03 so what we'll have to do is pass in the
  • 00:38:05 month to this function so we got month
  • 00:38:09 here we'll say we can pass in the month
  • 00:38:12 now
  • 00:38:12 and now we need to figure out the day
  • 00:38:14 well the day should be kind of random
  • 00:38:16 within the month so now we need to
  • 00:38:18 figure out how many days are in the
  • 00:38:21 given month we have and we actually use
  • 00:38:23 that calendar library again to do this
  • 00:38:27 so to get the number of days in a month
  • 00:38:30 we can do this function and if you try
  • 00:38:35 to Google search how many days are in a
  • 00:38:37 month programmatically Python you'll probably
  • 00:38:39 find the same solution so given a month
  • 00:38:43 value this is the number of days in that
  • 00:38:47 month so if we want to select a random
  • 00:38:50 day we can just do equals random.randint
  • 00:38:54 from 1 because every month starts
  • 00:38:59 on the first to the day range and pass
  • 00:39:03 in random day here now we need the hour
  • 00:39:09 and the minute so what I said is I
  • 00:39:14 wanted to peak at noon and 8:00 p.m. so
  • 00:39:18 we can start off and say that this is
  • 00:39:20 going to start at noon so 12 here and
  • 00:39:27 the minutes are going to be 00 or 0
  • 00:39:30 it will parse both but we want 0 minutes
  • 00:39:34 here so 12 exactly and now starting with
  • 00:39:41 just this 12 o'clock peak how do we add
  • 00:39:44 random values around that so like
  • 00:39:47 ideally we could have some sort of
  • 00:39:49 distribution of values around this noon
  • 00:39:52 time but it gets tricky because we're
  • 00:39:55 working with dates not just integers
  • 00:39:58 luckily going back to the documentation
  • 00:40:00 that's exactly what the let's go to the
  • 00:40:04 top the timedelta object is for so we can
  • 00:40:07 utilize this timedelta and add it to
  • 00:40:10 our date objects and that will
  • 00:40:12 ultimately allow us to fluctuate up and
  • 00:40:14 down around that noon on a specific day
  • 00:40:17 we specify so we have all these
  • 00:40:21 parameters so
  • 00:40:23 I'm going to say I'm going to go back to
  • 00:40:25 using our numpy library I want to say
  • 00:40:28 that our time offset is going to be a
  • 00:40:31 normal distribution again it's going to
  • 00:40:33 be a normal distribution so numpy random
  • 00:40:36 normal you should remember this from
  • 00:40:39 earlier in the video with a mean of zero
  • 00:40:41 so on average we don't add any time to
  • 00:40:45 this date time object that we just
  • 00:40:46 created but we want to have a standard
  • 00:40:49 deviation of let's say three hours so if
  • 00:40:53 we pass in scale equals you
  • 00:40:59 know you could say three here three
  • 00:41:01 hours that would work or I think what I
  • 00:41:04 did when I was originally doing
  • 00:41:06 this was I said 180 and I was just
  • 00:41:08 specifying three hours in minutes so
  • 00:41:10 this is our time offset so now when we
  • 00:41:12 create a time delta to add to this date
  • 00:41:15 we can do final date equals date plus
  • 00:41:25 datetime.timedelta and we'll pass
  • 00:41:30 in minutes equals the time offset so
  • 00:41:35 because I'm using minutes I'm going to
  • 00:41:36 specify minutes here if I made this zero
  • 00:41:39 to three I could specify hours the
  • 00:41:43 reason I ultimately said minutes was
  • 00:41:45 given that this might return a integer
  • 00:41:48 value I forget if it does or not I just
  • 00:41:51 wanted to have a little bit more options
  • 00:41:53 of how much it could fluctuate so I used
  • 00:41:57 minutes just to be safe okay so that now
  • 00:42:02 gives us a random date around whatever
  • 00:42:05 our initial time was I said I wanted to
  • 00:42:07 peak it around noon and 8:00 p.m. though so
  • 00:42:10 we're gonna do one additional little
  • 00:42:12 thing I'm going to say if
  • 00:42:15 random.random() is less than 0.5 we will
  • 00:42:21 select the noon time else we
  • 00:42:26 will select 8:00 p.m.
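Putting the pieces described so far together, a rough sketch of the two-peak time generator might look like the following. The function name, the `year` default, and the use of `calendar.monthrange` for the days-in-month lookup are assumptions; the video only says the calendar library is used.

```python
import calendar
import datetime
import random

import numpy as np


def generate_random_datetime(month, year=2019):
    # monthrange returns (weekday of the 1st, number of days in the month).
    days_in_month = calendar.monthrange(year, month)[1]
    day = random.randint(1, days_in_month)

    # Coin flip between the two peaks: noon (12) or 8 p.m. (20 on a 24-hour clock).
    peak_hour = 12 if random.random() < 0.5 else 20
    base = datetime.datetime(year, month, day, peak_hour, 0)

    # Offset in minutes: mean 0, standard deviation 180 (three hours).
    offset_minutes = float(np.random.normal(loc=0, scale=180))
    return base + datetime.timedelta(minutes=offset_minutes)


print(generate_random_datetime(1))
```

`np.random.normal` returns a float rather than an integer, so `scale=3` with `timedelta(hours=...)` would behave the same way. One side effect worth knowing about: a large offset late on the last day of a month can push the result into the next month.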
  • 00:42:30 so we'll select our date to be 8 p.m.
  • 00:42:33 which in terms of a 24-hour clock is 20
  • 00:42:38 and then the stuff let's say this stuff
  • 00:42:41 stays the same so we either pick the
  • 00:42:43 peak at noon or the peak at 8:00 p.m.
  • 00:42:46 and then we add that time Delta to
  • 00:42:49 whichever peak we selected and that's
  • 00:42:50 ultimately what we're generating for
  • 00:42:52 times and this type of stuff especially
  • 00:42:56 with date times is super useful if you
  • 00:42:58 ever have to generate mock data for a
  • 00:43:01 job or you know for an interview or
  • 00:43:03 something it's this is like a very
  • 00:43:06 realistic task that you would have to do
  • 00:43:08 to like you know generate data according
  • 00:43:11 to times and try to make it pretty
  • 00:43:13 realistic cool so this is generating a
  • 00:43:18 random time and now we just need to
  • 00:43:20 format it so to do that there's
  • 00:43:23 also some nice methods within the
  • 00:43:27 datetime library so if I pull up the
  • 00:43:30 documentation I'll just look up format
  • 00:43:37 pasted before that's not what we want
  • 00:43:39 strftime this is what we're looking
  • 00:43:41 for so basically we can pass in specific
  • 00:43:46 ways we want this to be outputted in
  • 00:43:48 this and going here we get some
  • 00:43:54 information on it so we can pass in
  • 00:43:56 these variables to get how we want to
  • 00:43:58 format the date so there's so many
  • 00:44:00 options to how we can do this but as I
  • 00:44:05 mentioned we wanted to get it in month
  • 00:44:07 month day day year hour a minute so to
  • 00:44:11 do that we can pass in this value so we
  • 00:44:19 can do final date
  • 00:44:22 .strftime and then we'll
  • 00:44:29 pass it in I'd copied the parentheses
  • 00:44:34 twice just like that so these all
  • 00:44:38 represent something in the datetime
  • 00:44:40 library and this will now if we pass in
  • 00:44:45 if we create a date each time we do this
  • 00:44:50 so generate random time pass in the
  • 00:44:54 month now if we generate our data and
  • 00:44:57 pass in this date we should get what we
  • 00:44:59 want
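As a small illustration of the formatting step, here is `strftime` applied to a fixed datetime. The sample date is made up; the format codes are standard ones from the datetime documentation.

```python
import datetime

final_date = datetime.datetime(2019, 1, 22, 21, 25)

# %m = zero-padded month, %d = zero-padded day, %y = two-digit year,
# %H = hour on a 24-hour clock, %M = minute.
print(final_date.strftime("%m/%d/%y %H:%M"))  # 01/22/19 21:25

# Swapping %y for %Y gives a four-digit year instead.
print(final_date.strftime("%m/%d/%Y %H:%M"))  # 01/22/2019 21:25
```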
  • 00:45:02 no month is not defined okay month in
  • 00:45:08 our case is equal to I for I guess sorry
  • 00:45:12 month value cool we can load the data
  • 00:45:23 again and as you see it looks pretty
  • 00:45:27 good and I guess the format I used was
  • 00:45:29 just two letters here instead of four
  • 00:45:32 but in the documentation if you wanted
  • 00:45:34 to make it a four-digit year you could
  • 00:45:37 have very easily done that but this
  • 00:45:41 looks pretty cool so for the month of
  • 00:45:44 January it's you know generating all
  • 00:45:46 those times and you could probably do a
  • 00:45:48 little bit of manual inspection to see
  • 00:45:50 if they're around noon and 8 p.m. but
  • 00:45:53 also when we do our full analysis in
  • 00:45:55 the previous video you also can see that
  • 00:45:58 next up what we're going to do is
  • 00:46:00 generate a more realistic quantity
  • 00:46:03 ordered so when I think about how much
  • 00:46:05 you're gonna order a product if we go
  • 00:46:06 back to our products you know if an
  • 00:46:09 iPhone cost $700 you're not very likely
  • 00:46:11 to purchase two of them or even less
  • 00:46:14 likely you know to purchase three
  • 00:46:16 however though if batteries are less
  • 00:46:19 than four dollars you have a much higher
  • 00:46:21 probability of purchasing maybe a few
  • 00:46:23 packs of those same thing with triple-a
  • 00:46:25 batteries and you know a little bit less
  • 00:46:28 likely but you might buy two chargers
  • 00:46:31 what I'm trying to get at is it's really
  • 00:46:34 dependent on
  • 00:46:35 the quantity ordered of an item is
  • 00:46:37 really dependent on the price so when
  • 00:46:40 we're filling out a good quantity
  • 00:46:42 ordered for our mock data we want to be
  • 00:46:46 very like mathematical about it and make
  • 00:46:48 sure that the number ordered varies
  • 00:46:51 inversely with the price so lower price
  • 00:46:54 means a higher quantity ordered and a
  • 00:46:58 higher price means a lot less quantity
  • 00:47:03 ordered usually just one so to do that I
  • 00:47:06 used a geometric distribution and
  • 00:47:12 basically the way you can think about
  • 00:47:13 this is imagine you're flipping a coin
  • 00:47:16 and what this geometric distribution
  • 00:47:19 does is counts the number of times you
  • 00:47:24 flip that coin until you get your first
  • 00:47:26 heads so you know if I was flipping a
  • 00:47:28 coin maybe it lands on tails and then
  • 00:47:30 the second time it lands on heads so
  • 00:47:32 that'd be two so what we do to make this
  • 00:47:34 vary with the price is that instead of
  • 00:47:38 having a 50/50 heads or tails odds we
  • 00:47:41 think of the chances of heads being 1
  • 00:47:45 minus 1 over the price
  • 00:47:48 so if our item was $2 you could think of
  • 00:47:52 the probability of heads being 0.5 so
  • 00:47:58 maybe the first flip is tails and
  • 00:48:01 then the second flip is heads it's pretty
  • 00:48:02 likely that that might be a 2 however if
  • 00:48:05 we increase that price so imagine the
  • 00:48:07 price is now 500
  • 00:48:09 so if this was 1 minus 1 over 500 then
  • 00:48:12 we have 499 over 500 and this is the
  • 00:48:16 probability of heads so the chance that
  • 00:48:19 you're going to not flip heads on the
  • 00:48:22 first go is very small but it's not like
  • 00:48:26 negligible like it does happen every 500
  • 00:48:28 times you expect to see you know 1 tails
  • 00:48:31 and then a heads so this is exactly what
  • 00:48:33 we're doing with our prices is the
  • 00:48:35 higher prices have a higher chance of
  • 00:48:38 hitting heads on the first go and the
  • 00:48:40 lower prices are more likely to flip a
  • 00:48:42 few times before you get a head so
  • 00:48:45 that's how we're going to
  • 00:48:47 generate this quantity ordered so
  • 00:48:49 what that looks like in code is
  • 00:48:53 numpy.random.geometric so
  • 00:48:57 geometric and then we pass in the
  • 00:49:00 probability p and that's going to be
  • 00:49:02 equal to one minus I'm going to just do it
  • 00:49:07 with floats just in case some weird
  • 00:49:09 rounding errors happen 1.0 divided by
  • 00:49:12 the price and then what we want to
  • 00:49:15 specify just so this only does one
  • 00:49:17 iteration is size of one and now we get
  • 00:49:23 the 0th index of that and you can look
  • 00:49:26 into the documentation if you want to
  • 00:49:28 see exactly what's going on there but
  • 00:49:29 that's a good way to do our quantity
  • 00:49:32 ordered okay to finish this video
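The price-dependent quantity described above can be sketched as follows. The function name and the sample prices are illustrative; the formula p = 1 - 1/price is the one from the video. Note that it assumes every price is greater than 1, since a price of exactly 1 would give p = 0, which `np.random.geometric` rejects.

```python
import numpy as np


def random_quantity(price):
    # p is the chance of "heads" on each flip. The geometric draw counts
    # flips up to and including the first heads, so the result is at least 1.
    p = 1.0 - (1.0 / price)
    return int(np.random.geometric(p=p, size=1)[0])


# Compare a cheap item with an expensive one (prices are made up).
for price in (3.5, 700.0):
    draws = [random_quantity(price) for _ in range(10_000)]
    expected = price / (price - 1.0)  # mean of a geometric distribution is 1/p
    print(f"price={price}: average quantity {sum(draws) / len(draws):.3f}, "
          f"theoretical {expected:.3f}")
```

The theoretical mean works out to price / (price - 1), so a $3.50 item averages about 1.4 units per order while a $700 item almost always comes out as exactly 1.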
  • 00:49:36 oftentimes when you're shopping you know
  • 00:49:38 you're not just buying one item you're
  • 00:49:40 buying you know multiple items at a time
  • 00:49:42 so let's add that to our data set in
  • 00:49:44 a realistic fashion so for example if
  • 00:49:47 you order an iPhone you'd likely maybe
  • 00:49:50 also pick up a extra lightning charging
  • 00:49:53 cable or maybe some air pods
  • 00:49:55 how can we add this to our code like
  • 00:49:58 that the way I went about it was I just
  • 00:50:00 kind of selected a couple items that I
  • 00:50:02 knew would probably have some items that
  • 00:50:04 would be commonly paired with so let's
  • 00:50:06 take the iPhone as our first item we're
  • 00:50:08 gonna add some code after our initial
  • 00:50:10 product purchase but before we increase
  • 00:50:12 our order ID that adds a couple more
  • 00:50:15 that could potentially add a couple more
  • 00:50:16 products to our purchase so first case
  • 00:50:21 is if product equals equals iPhone then
  • 00:50:27 we want to basically have some chance so
  • 00:50:31 let's say a 15% chance of getting a
  • 00:50:33 lightning charging cable so we could
  • 00:50:35 do if random.random() this gives us a
  • 00:50:38 random value between zero and one so if
  • 00:50:40 we want a 15% chance of something
  • 00:50:42 happening we can do less than 0.15 we'll
  • 00:50:46 say we generate a new product that is a
  • 00:50:51 lightning charging cable
  • 00:50:55 and we want to make sure that this is
  • 00:50:58 copied in the same way it was written up
  • 00:51:00 here so okay and the order ID stays the
  • 00:51:10 same because this is part of the same
  • 00:51:11 purchase we're saying the quantity
  • 00:51:13 ordered might change and the price
  • 00:51:15 definitely changes but the date and
  • 00:51:17 address should also be the same so what
  • 00:51:20 I'm gonna do is abstract some of this
  • 00:51:22 logic out into a function so we're going
  • 00:51:24 to just have a function up here
  • 00:51:32 called write_row and that's going to
  • 00:51:35 take in an order ID a product an order
  • 00:51:40 date and an address and it's going to
  • 00:51:44 generate the list that we'll ultimately
  • 00:51:46 add to our data frame so basically we
  • 00:51:49 just need to copy some of our logic down
  • 00:51:50 here up there so we want to take the
  • 00:51:53 price and now move that in here because
  • 00:51:57 the price depends on the product so move
  • 00:52:00 that here we can delete it here I'm
  • 00:52:02 going to move the quantity ordered into
  • 00:52:03 that function so quantity ordered into
  • 00:52:10 the function and now we just want to
  • 00:52:17 return our output and that will just be
  • 00:52:19 the list that we normally had so we can
  • 00:52:22 pass in this
  • 00:52:29 cool so quantity ordered price
  • 00:52:34 everything is there
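A sketch of what the extracted row-building function might look like. The name `write_row`, the column order, and the `products` dictionary (product name mapped to a `[price, weight]` list, with the price at index 0) are assumptions based on the description; the prices and weights shown are placeholders.

```python
import numpy as np

# Hypothetical product table: name -> [price, weight].
products = {
    "iPhone": [700.0, 4],
    "Lightning Charging Cable": [14.95, 30],
    "AAA Batteries (4-pack)": [2.99, 30],
}


def write_row(order_id, product, order_date, address):
    # Price and quantity both depend on the product, so they are
    # worked out here instead of in the main loop.
    price = products[product][0]
    quantity_ordered = int(np.random.geometric(p=1.0 - (1.0 / price), size=1)[0])
    return [order_id, product, quantity_ordered, price, order_date, address]


print(write_row(1001, "iPhone", "01/22/19 21:25", "12 Main St, Boston, MA 02215"))
```

Keeping the order ID, date, and address as arguments means an add-on product can reuse the exact values of the original purchase.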
  • 00:52:36 so now instead of doing this we do
  • 00:52:42 write_row order ID product quantity ordered
  • 00:52:46 and price are figured out in this
  • 00:52:49 function and we should be good and we
  • 00:52:53 can also now add this same code down
  • 00:52:57 here but we want to make our product now
  • 00:53:01 the Lightning charging cable cool and
  • 00:53:11 because we're adding another row here we
  • 00:53:14 also need to increase our I here so I
  • 00:53:17 plus equals 1 mmm this now gets us into
  • 00:53:22 a weird scenario where we're increasing
  • 00:53:25 I additionally in the loop so honestly
  • 00:53:29 what we'll want to do is not make this a
  • 00:53:32 for loop but make it a while loop so
  • 00:53:35 we're going to change this to while
  • 00:53:39 orders amount is greater than 0 we'll
  • 00:53:47 set an i variable outside of that every
  • 00:53:51 time we make a purchase we'll increase i
  • 00:53:54 by 1 so in here we increase i by 1 and
  • 00:53:58 every time we finish with an order we
  • 00:54:03 want to decrease orders amount by 1 okay
  • 00:54:11 so I just handled the case of adding
  • 00:54:14 multiple rows within a single iteration
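The switch from a for loop to a while loop can be sketched like this. It is a simplified stand-in for the real generator: rows are stored in a dictionary keyed by the row index `i` instead of a DataFrame, and the product names and starting order ID are placeholders.

```python
import random

PRODUCTS = ["iPhone", "AA Batteries (4-pack)", "USB-C Charging Cable"]  # illustrative


def generate_rows(orders_amount, first_order_id=1000):
    rows = {}
    i = 0  # row index, incremented once per row written
    order_id = first_order_id
    while orders_amount > 0:
        product = random.choice(PRODUCTS)
        rows[i] = (order_id, product)
        i += 1

        # Add-on item: same order ID, but it takes up its own row.
        if product == "iPhone" and random.random() < 0.15:
            rows[i] = (order_id, "Lightning Charging Cable")
            i += 1

        order_id += 1       # the next purchase gets a new ID
        orders_amount -= 1  # one order finished, however many rows it used
    return rows


rows = generate_rows(20)
print(len(rows), "rows for 20 orders")
```

Counting down `orders_amount` separately from `i` is what lets one order produce more than one row without skipping or overwriting indexes.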
  • 00:54:17 ok so what other products might be
  • 00:54:19 ordered with a iPhone I would say air
  • 00:54:23 pods are kind of likely so let's say we
  • 00:54:27 had a 5% chance of getting AirPods with our
  • 00:54:32 iPhone so that'd be right there maybe we
  • 00:54:35 had a
  • 00:54:38 the other thing I saw there was just
  • 00:54:41 normal wired headphones so maybe there
  • 00:54:43 was a 7% chance of getting wired
  • 00:54:48 headphones just simple common man
  • 00:54:50 headphones no expensive AirPods wired
  • 00:54:54 headphones with our order so we can keep
  • 00:54:56 adding things like that and trying to make it
  • 00:54:58 somewhat realistic but this is how you
  • 00:55:01 would add a couple different products to
  • 00:55:05 a single order so we could do the same
  • 00:55:08 exact thing with the Google Phone or
  • 00:55:10 the Vareebadd Phone and add some
  • 00:55:12 probabilities so that would look
  • 00:55:13 something like this I'm not going to go
  • 00:55:15 through all of it to save some time and
  • 00:55:17 then maybe we you know it's not always
  • 00:55:19 specific things you're buying with
  • 00:55:21 specific other things and actually
  • 00:55:23 indented this one too many times you
  • 00:55:27 know any product might have a
  • 00:55:28 possibility of being ordered with any
  • 00:55:30 other product so let's add one more spot
  • 00:55:33 where with just a 2% chance
  • 00:55:37 approximately we just get any old item
  • 00:55:40 so we basically are just putting in you
  • 00:55:47 know any old item with our initial
  • 00:55:49 product that we got up here and that
  • 00:55:52 would look like this you could say
  • 00:55:56 product 2 equals random.choices of the
  • 00:56:01 product list and with our weights
  • 00:56:04 equaling our weights then we would have to
  • 00:56:08 take the 0th index of that so exactly
  • 00:56:11 what we did up here and then we would
  • 00:56:14 just write the row just like we did
  • 00:56:16 before
  • 00:56:21 cool so this gives us the possibility of
  • 00:56:23 having any items purchased together that
  • 00:56:26 looks good to me
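One way to collect the add-on logic in a single place is sketched below. The percentages (15%, 5%, 7%, and roughly 2%) come from the video; the helper name `extra_products`, the product names, and the weights are assumptions.

```python
import random

# Illustrative product names and relative weights.
product_list = ["iPhone", "Lightning Charging Cable", "AirPods",
                "Wired Headphones", "AA Batteries (4-pack)"]
weights = [4, 30, 7, 25, 30]


def extra_products(product):
    """Return the extra products (possibly none) bought with `product`."""
    extras = []
    if product == "iPhone":
        # Each add-on is an independent roll, so several can occur together.
        if random.random() < 0.15:
            extras.append("Lightning Charging Cable")
        if random.random() < 0.05:
            extras.append("AirPods")
        if random.random() < 0.07:
            extras.append("Wired Headphones")
    # About 2% of the time, any product at all is added, using the same
    # weighted pick as the main product. choices returns a list, hence [0].
    if random.random() < 0.02:
        extras.append(random.choices(product_list, weights=weights)[0])
    return extras


print(extra_products("iPhone"))
```

Each returned product would then be written as its own row with the original order ID, date, and address.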
  • 00:56:28 okay let's test everything we've done
  • 00:56:30 now and so I'm just going to run this
  • 00:56:31 again with the hundred and we got an
  • 00:56:33 error product choice is not defined okay
  • 00:56:38 this was just supposed to be product
  • 00:56:40 this was just supposed to be product
  • 00:56:47 okay
  • 00:56:48 so I just want to double-check that it
  • 00:56:50 seems like everything's working cool so
  • 00:56:56 do we have any order IDs that stay the
  • 00:56:58 same
  • 00:56:58 I'm curious you might not find them here
  • 00:57:06 yep quick look at that you see iPhone
  • 00:57:09 with the Lightning charging cable here
  • 00:57:11 two orders in the same one that's nice
  • 00:57:14 and you also see like these lower priced
  • 00:57:17 items like triple-a batteries or
  • 00:57:19 double-a batteries they do have the
  • 00:57:20 quantity order that's up a bit so
  • 00:57:22 everything looks good this is awesome so
  • 00:57:24 the final thing we want to do is just
  • 00:57:26 clean up the code a little bit we've
  • 00:57:28 kind of confirmed things seem to be
  • 00:57:30 working so I'm going to remove this
  • 00:57:32 break because we're done testing I'm
  • 00:57:34 going to up this value or restore what
  • 00:57:38 this value is normally and delete this
  • 00:57:40 other value one thing that I recommend
  • 00:57:44 doing is putting all of this code here
  • 00:57:49 indenting it once and putting it inside
  • 00:57:53 of a main function so that if maybe you
  • 00:57:56 ever wanted to use any of these
  • 00:57:57 functions you could use these methods
  • 00:58:00 without running all the other code so
  • 00:58:03 we're going to say if __name__ equals equals
  • 00:58:10 "__main__" so this means that this code is
  • 00:58:15 only executed if you run generate data
  • 00:58:17 py as the main file that looks good and
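The main guard mentioned here follows a standard Python pattern. A minimal sketch, with a placeholder `main` body standing in for the generation loop:

```python
def main():
    # The data-generation loop would live here.
    print("generating mock data...")


# True only when this file is run directly, not when it is imported.
if __name__ == "__main__":
    main()
```

With this in place, another script can import the helper functions from the file without triggering a full data-generation run.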
  • 00:58:21 then the last touch would be to actually
  • 00:58:24 go ahead and run the code and oh no
  • 00:58:28 what happened this should not be
  • 00:58:33 order number it should be order
  • 00:58:34 screwed up there so oops
  • 00:58:48 and this should just be date not order
  • 00:58:51 date everything else looks good though
  • 00:58:57 that's the curse of copy and pasting all
  • 00:59:01 right run do we get an error so see if
  • 00:59:04 you get an error but I think this is
  • 00:59:06 good this is generating the mock data
  • 00:59:07 just like we did for the video the last
  • 00:59:10 video I posted so nice alright with that
  • 00:59:15 we're gonna end the video here hope you
  • 00:59:17 guys enjoyed this one I do want to say
  • 00:59:19 real quick that generating mock data is
  • 00:59:20 a very real thing that I've done
  • 00:59:22 multiple times in my you know career as
  • 00:59:24 a professional software developer often
  • 00:59:27 times you might be working on building a
  • 00:59:29 dashboard you know showing off pretty
  • 00:59:31 graphs and whatnot and not actually have
  • 00:59:33 the connection to the data that you're
  • 00:59:35 working with yet so ultimately to make
  • 00:59:38 that proof of concept dashboard you need
  • 00:59:40 to make your own data you need to
  • 00:59:42 generate mock data so this is a very
  • 00:59:44 realistic thing that you do I'll be
  • 00:59:46 back with some new videos soon I'm
  • 00:59:48 trying to make two videos a month so try
  • 00:59:49 to hold me accountable to that I think
  • 00:59:51 the next two are gonna be everything you
  • 00:59:53 need to know about classes in Python and
  • 00:59:55 probably a neural networks tutorial in
  • 00:59:58 Python with maybe TensorFlow or
  • 01:00:00 PyTorch if you did enjoy this video make
  • 01:00:01 sure to throw a big thumbs up and also
  • 01:00:03 subscribe if you haven't already also if
  • 01:00:06 you're curious to kind of see what my
  • 01:00:07 day-to-day life looks like make sure to
  • 01:00:09 check out my Instagram and Twitter
  • 01:00:12 alright that's all I have guys thanks
  • 01:00:14 again for watching peace out
  • 01:00:22 [Music]