Python Data Science Project Ideas! (for all skill levels)

  • 00:00:00 hey what's up everyone and welcome back
  • 00:00:02 to another video today I thought we'd
  • 00:00:04 switch things up a bit for my normal
  • 00:00:06 kind of long tutorial and I thought we'd
  • 00:00:07 sit down and just walk through some data
  • 00:00:09 science project ideas as I'm making this
  • 00:00:12 video we're in the middle of a pretty
  • 00:00:13 crazy time with all that's going on with
  • 00:00:13 COVID-19 around the world first and
  • 00:00:18 foremost I hope that everyone's staying
  • 00:00:19 safe and healthy definitely follow
  • 00:00:22 whatever your local government's telling
  • 00:00:24 you to do I know I've been at home a lot
  • 00:00:26 recently and it's been tough at times
  • 00:00:28 but I've also been trying to make the most of
  • 00:00:31 being here and now for me it gives me an
  • 00:00:34 opportunity to make more videos and
  • 00:00:35 experiment with different types of
  • 00:00:37 videos a lot of us have time to kind of
  • 00:00:39 take on maybe an extra project that
  • 00:00:41 we've been meaning to but have kind of
  • 00:00:42 put off you know we're kind of forced to
  • 00:00:44 be at home right now I think it's a good
  • 00:00:46 time to walk through some cool data
  • 00:00:48 science projects so how this is gonna
  • 00:00:51 work is that we're gonna walk through
  • 00:00:52 about eight different project ideas for
  • 00:00:54 each idea I'm gonna introduce the
  • 00:00:56 problem pretty quickly and then give you
  • 00:00:58 all sorts of resources videos blogs code
  • 00:01:02 snippets that are useful in solving that
  • 00:01:05 project and building that project all
  • 00:01:07 these resources will be found in the
  • 00:01:09 description so for any task that you're
  • 00:01:11 specifically interested in check the
  • 00:01:12 description and I'll have all sorts of
  • 00:01:14 resources listed there that will help
  • 00:01:17 you out with the project before we get
  • 00:01:19 started I want to just list out two
  • 00:01:20 rules that I really think you should
  • 00:01:22 follow if you wanna get the most out of
  • 00:01:23 a data science project and the first
  • 00:01:26 rule is don't do a project just because
  • 00:01:28 you think it will look good on your
  • 00:01:30 resume if you want to get a lot out of a
  • 00:01:32 data science project you're gonna need
  • 00:01:33 to be genuinely interested in the topic
  • 00:01:35 and you know each person has
  • 00:01:37 their own preferences you know something
  • 00:01:39 that I'm interested in might not be very
  • 00:01:42 interesting to you if you're not
  • 00:01:43 interested in the project you're
  • 00:01:45 probably not going to work too much on
  • 00:01:46 it you're probably not going to build up
  • 00:01:47 your skills too much and the second
  • 00:01:49 rule is that a good data science project
  • 00:01:51 should make you work on various skills
  • 00:01:53 in data science a relatively small
  • 00:01:56 percentage of your time is actually
  • 00:01:58 spent building models you're gonna spend a
  • 00:02:01 much higher percentage of the time
  • 00:02:02 scraping data from the web
  • 00:02:05 processing it building an architecture
  • 00:02:07 around it you know deploying it
  • 00:02:10 honestly the more marketable skills
  • 00:02:12 often are those skills surrounding the
  • 00:02:14 actual model all right let's get started
  • 00:02:16 okay first idea is very appropriate for
  • 00:02:19 the time and it was also requested by a
  • 00:02:21 couple of my Twitter followers and that
  • 00:02:23 is to play around with COVID-19 data I
  • 00:02:26 think the best place to start for this
  • 00:02:27 is to go to the Johns Hopkins GitHub
  • 00:02:30 page and play around with the data that
  • 00:02:31 they have made available there I did
  • 00:02:33 this and I made a little script that
  • 00:02:35 allows you to graph the number of
  • 00:02:37 confirmed cases in different places so
  • 00:02:40 I thought this was a good starting point
  • 00:02:41 next if you want to get a little bit
  • 00:02:43 more advanced I recommend going to
  • 00:02:44 Kaggle and they have several different
  • 00:02:46 coronavirus related challenges available
  • 00:02:49 right now the first challenge is an NLP
  • 00:02:52 and text mining challenge this can get
  • 00:02:55 pretty advanced but what I found really
  • 00:02:56 useful when trying to attack this is
  • 00:02:59 that sentdex has already made a video
  • 00:03:01 walking through his first hour doing
  • 00:03:04 this challenge and then if you want some
  • 00:03:06 inspiration for what you can kind of do
  • 00:03:09 and some cool things to aspire to I
  • 00:03:11 recommend checking out 3Blue1Brown's
  • 00:03:13 simulations that he did ok the
  • 00:03:16 next idea is also probably pretty
  • 00:03:18 relevant for a lot of people if you're
  • 00:03:20 like me you've been maybe killing a
  • 00:03:21 decent amount of your time playing
  • 00:03:23 different types of board games and
  • 00:03:25 sometimes maybe you don't have another
  • 00:03:26 person to play against so what I suggest
  • 00:03:28 as another fun data science project idea
  • 00:03:31 is to build your own board game AI and
  • 00:03:33 we can even extend this further to be
  • 00:03:35 you know any sort of game AI and this
  • 00:03:37 kind of hits on more of the machine
  • 00:03:39 learning part of data science but I
  • 00:03:40 think it's a very fun project to try out
  • 00:03:43 because you can play in real time against
  • 00:03:45 your AI and see if it's working or not I
  • 00:03:48 have several different resources for
  • 00:03:49 getting started with this I did a video
  • 00:03:51 overviewing different types of board
  • 00:03:53 game AIs a while back which I'll link
  • 00:03:55 to I actually also made a video where we
  • 00:03:58 implemented
  • 00:03:59 the minimax algorithm for Connect 4 and
  • 00:04:03 then if you want to kind of veer more
  • 00:04:04 into the state-of-the-art neural network
  • 00:04:07 based AIs I recommend a couple
  • 00:04:10 different tutorials the first I saw was
  • 00:04:13 on reinforcement learning using the game
  • 00:04:15 of Snake as an
  • 00:04:16 example I thought this was a cool little
  • 00:04:18 blog post and I think you can follow it
  • 00:04:20 to get some ideas on how you could maybe
  • 00:04:22 incorporate reinforcement learning into
  • 00:04:24 a project and then the other article
  • 00:04:27 that I found cool was on AlphaZero for
  • 00:04:30 chess and that was a you know more
  • 00:04:31 sophisticated AI and kind of more modern
  • 00:04:34 approach also for fun and a good way to
  • 00:04:37 kill an hour and a half recently AlphaGo
  • 00:04:40 by DeepMind they posted a full
  • 00:04:41 documentary on the AI that beat
  • 00:04:44 the world's top Go players that's on
  • 00:04:46 YouTube now and completely free to watch
  • 00:04:48 and definitely check that out another
  • 00:04:50 project idea is to draw some inspiration
  • 00:04:53 from reddit so if you go to reddit.com
  • 00:04:56 and then you go to the dataisbeautiful
  • 00:05:00 subreddit this is one of my favorite
  • 00:05:01 subreddits on the site you can kind of
  • 00:05:03 scroll through this and see all sorts of
  • 00:05:04 cool visualizations that people have
  • 00:05:06 created so if you you know see one you
  • 00:05:09 particularly like so I think this one's
  • 00:05:10 pretty cool you can click on it and one
  • 00:05:13 thing that's super useful is that on all
  • 00:05:15 these posts you get this sticky post by
  • 00:05:18 the dataisbeautiful bot that
  • 00:05:21 lists the author's citations so as you can
  • 00:05:24 see the source of their data I guess was
  • 00:05:26 right here so it's from the US Census
  • 00:05:30 Bureau so all sorts of cool data you
  • 00:05:31 could use for your own projects here and
  • 00:05:34 then they also tell you that they use
  • 00:05:37 mapchart.net to create that
  • 00:05:38 visualization so if you wanted to create
  • 00:05:40 something similar
  • 00:05:40 you could go to mapchart.net and play
  • 00:05:44 around with the interface here and I
  • 00:05:47 recommend that you kind of just scroll
  • 00:05:49 through dataisbeautiful and find
  • 00:05:50 stuff that you're particularly
  • 00:05:51 interested in so like oh this graph
  • 00:05:54 right here on the S&P 500 recoveries looks
  • 00:05:57 kind of cool you know maybe you look
  • 00:06:00 into how that was created and once again
  • 00:06:03 we can click the author citations and as
  • 00:06:06 we can see here this is actually kind of
  • 00:06:08 cool
  • 00:06:08 this person actually used matplotlib and
  • 00:06:10 python to create this graph and they
  • 00:06:13 actually listed their repository right
  • 00:06:15 here so the takeaway here is that
  • 00:06:18 there's a lot of cool kind of
  • 00:06:20 visualizations you can draw inspiration
  • 00:06:22 from what I recommend is scroll through
  • 00:06:25 this find something you like try to
  • 00:06:26 build a project that's similar
  • 00:06:28 maybe with a similar dataset using similar
  • 00:06:30 skills and one real benefit of something
  • 00:06:33 like this is these types of
  • 00:06:34 visualizations make really cool
  • 00:06:36 portfolio type projects this is very
  • 00:06:39 easy to show on a you know personal web
  • 00:06:42 page and it looks pretty impressive and
  • 00:06:44 it's clear that you know how to work
  • 00:06:46 with data and visualize it the next
  • 00:06:49 project idea is to build a text
  • 00:06:50 sentiment analysis tool so basically
  • 00:06:52 what this means is to build a model that
  • 00:06:54 can identify whether or not text has a
  • 00:06:57 positive or negative connotation so for
  • 00:07:00 example if I looked at all the comments
  • 00:07:02 on one of my YouTube videos if someone
  • 00:07:04 said this video sucks that would return
  • 00:07:06 negative from the model and if someone
  • 00:07:08 said oh wow thanks Keith this video was
  • 00:07:10 super helpful that would return
  • 00:07:11 positive from the model I actually did a
  • 00:07:14 full tutorial on this in my real world
  • 00:07:16 machine learning tutorial with
  • 00:07:17 scikit-learn so if you want to see the
  • 00:07:19 entire process to build this tool you
  • 00:07:21 can check out that video and then I
  • 00:07:24 recommend building that out and making
  • 00:07:25 it more powerful some ways to do that
  • 00:07:27 are to maybe use more data I also didn't
  • 00:07:30 do a lot of the kind of basic text
  • 00:07:32 processing that you could do so no
  • 00:07:34 stripping out punctuation you could also use
  • 00:07:36 bigrams instead of unigram terms
  • 00:07:39 so basically do your modeling on pairs
  • 00:07:41 of words instead of just a single word
  • 00:07:43 you can also utilize kind of more
  • 00:07:45 state-of-the-art NLP techniques so I
  • 00:07:48 recommend learning about transformers if
  • 00:07:50 you're interested in kind of being at
  • 00:07:52 the top of the NLP world specifically
  • 00:07:54 the BERT paper this is a super super
  • 00:07:57 powerful model and probably the
  • 00:07:59 most cited NLP model right now and so I
  • 00:08:03 recommend reading into that and then if
  • 00:08:04 you want to try to apply that to this
  • 00:08:06 text sentiment analysis tool you can use
  • 00:08:08 the spaCy library
  • 00:08:10 which makes extracting those features
  • 00:08:12 very accessible once you're confident
  • 00:08:14 with your model one thing you could do
  • 00:08:16 is create some sort of dashboard with
  • 00:08:18 your tool so maybe you use the YouTube
  • 00:08:20 API to extract all of the comments from
  • 00:08:23 a YouTuber's channel and you feed them
  • 00:08:26 into your model to see whether or not on
  • 00:08:28 average the comments are positive or
  • 00:08:30 negative and this might be fun to
  • 00:08:32 compare different YouTubers to one
  • 00:08:33 another so the next project idea is
  • 00:08:35 doing something related to sports
  • 00:08:37 analysis and there's a lot of different
  • 00:08:39 ways you can go with a
  • 00:08:40 project like this one concrete task I
  • 00:08:43 think is fun is trying to
  • 00:08:44 programmatically build the optimal
  • 00:08:47 fantasy sports team so I think this is a
  • 00:08:50 good task because it kind of makes you
  • 00:08:51 go full pipeline of data science you
  • 00:08:54 start with nothing and you have to
  • 00:08:56 scrape some data off the web and then
  • 00:08:58 once you have the data you're trying to
  • 00:09:00 find meaning you're trying to find
  • 00:09:02 different ways you can analyze players
  • 00:09:04 to figure out who's the best for their
  • 00:09:06 team and then ultimately once you've
  • 00:09:08 done that analysis you extract the
  • 00:09:10 meaning and ultimately make your
  • 00:09:11 decision of who you're gonna draft one
  • 00:09:13 resource that would be helpful for
  • 00:09:15 me to include in this is that a lot of
  • 00:09:17 the sports analysis projects require you
  • 00:09:19 to scrape data so I figured I would
  • 00:09:22 include some code to do that so I went
  • 00:09:24 to basketball-reference.com and I
  • 00:09:26 scraped a table on the 2019-2020 stats
  • 00:09:30 and whenever you're scraping a table the
  • 00:09:33 way you're usually going to go
  • 00:09:34 about it is to right-click on an element
  • 00:09:37 in the table and click inspect this will
  • 00:09:40 open up the HTML source code and once
  • 00:09:42 you're in the HTML source code view
  • 00:09:44 you're going to want to find the entire
  • 00:09:46 table so here we have the table and you
  • 00:09:49 see it's all highlighted on the screen
  • 00:09:50 once you have that all highlighted
  • 00:09:52 you're going to want to identify the
  • 00:09:53 properties here so we see that this is a
  • 00:09:55 table with a class here and also an ID
  • 00:09:58 that is found here and we're gonna use
  • 00:10:00 that in our Python script that
  • 00:10:02 ultimately web scrapes this and I think
  • 00:10:05 the library that's easiest to do this
  • 00:10:06 with is Beautiful Soup so I use
  • 00:10:09 Beautiful Soup to scrape all this
  • 00:10:11 information the next project idea I have
  • 00:10:13 is to build a stock trading bot so to
  • 00:10:16 build a bot that automatically buys and
  • 00:10:18 sells stocks I think this is a fun
  • 00:10:20 project because it kind of combines two
  • 00:10:22 different skill sets you know you have
  • 00:10:24 to build up knowledge about financial
  • 00:10:26 markets and how all that works and you
  • 00:10:28 also have to have Python and data
  • 00:10:31 science skills and you can kind of
  • 00:10:33 leverage those two skills together and
  • 00:10:35 develop strategies to hopefully buy low
  • 00:10:38 and sell high one caveat I will say
  • 00:10:40 regarding this project is whenever
  • 00:10:42 you're putting money into a project it's
  • 00:10:44 kind of a dangerous territory and
  • 00:10:46 the thing that I think is important to
  • 00:10:48 say is do not put more money than you're
  • 00:10:50 willing to lose entirely into a stock
  • 00:10:53 trading bot I'd also say that even though
  • 00:10:55 we're putting money into something that's
  • 00:10:56 risky I think the positives outweigh the
  • 00:10:59 negatives in this case because when
  • 00:11:01 you're putting your money into it and
  • 00:11:02 maybe you're just putting 10 or 15 dollars
  • 00:11:04 in it you're invested now in the project
  • 00:11:07 and you're going to kind of force
  • 00:11:08 yourself to want to learn more about
  • 00:11:09 finance more about Python to build the
  • 00:11:12 best trading strategy possible and I
  • 00:11:14 think that's what outweighs the risk
  • 00:11:16 of losing some money to get you started
  • 00:11:19 on this project you should probably
  • 00:11:20 check out this alpaca.markets site
  • 00:11:23 they provide a Web API to buy and sell
  • 00:11:26 stocks as well as get all sorts of
  • 00:11:28 market information I've tried it out a
  • 00:11:30 bit and I was very happy with what I've
  • 00:11:33 seen so far the signup process is pretty
  • 00:11:35 easy they do make you sign a couple
  • 00:11:37 forms and you also have to put a small
  • 00:11:39 deposit in before you can start trading
  • 00:11:40 but the trades are commission-free
  • 00:11:42 the API docs are well written and in
  • 00:11:45 the worst case scenario that you've
  • 00:11:46 maybe bought some stocks using the API
  • 00:11:48 but you have no idea how to sell them
  • 00:11:50 and maybe they're going down the
  • 00:11:52 dashboard of Alpaca gives you an option
  • 00:11:55 to manually buy or sell stocks so you
  • 00:11:58 can always worst-case scenario sell
  • 00:12:01 stocks in the dashboard for the seventh
  • 00:12:03 data science project idea we're going to
  • 00:12:04 go to Kaggle and here you can find a
  • 00:12:08 cool challenge on house pricing
  • 00:12:11 predictions so basically the idea behind
  • 00:12:13 this Kaggle competition is you have all
  • 00:12:16 sorts of data on houses you know you
  • 00:12:19 have their square footage you have if
  • 00:12:21 they have like a porch or not the square
  • 00:12:23 footage of the porch whether they
  • 00:12:24 have a driveway what type of
  • 00:12:25 neighborhood they're in you have all
  • 00:12:27 this information and with it you're
  • 00:12:29 supposed to predict the price of the
  • 00:12:31 house and this is a great challenge to
  • 00:12:33 learn more about regression and to
  • 00:12:35 develop your regression techniques and
  • 00:12:37 also get more comfortable working
  • 00:12:38 with Python pandas and CSV data one
  • 00:12:41 thing that's nice with this competition
  • 00:12:42 is that there's all sorts of people that
  • 00:12:44 have already done it so you can look at
  • 00:12:46 some notebooks that have already been
  • 00:12:47 submitted or the other great option is
  • 00:12:50 they actually have tutorials for this
  • 00:12:51 and you can click on like this
  • 00:12:54 comprehensive data exploration with
  • 00:12:56 Python link and they'll walk you through
  • 00:12:59 getting started with this challenge and
  • 00:13:01 all sorts of things that you need to
  • 00:13:03 know to start predicting these prices on
  • 00:13:05 your own for the eighth and final data
  • 00:13:07 science project idea I'm gonna make kind
  • 00:13:09 of a bit of a stretch and I'm gonna just
  • 00:13:11 call it miscellaneous Kaggle challenges
  • 00:13:13 and I just thought for this final
  • 00:13:15 project idea I would walk through how I
  • 00:13:17 would go about finding cool things to
  • 00:13:19 work on so I would get to the Kaggle
  • 00:13:21 homepage and then I would go to data
  • 00:13:24 over here in the left and then with this
  • 00:13:27 data I might look at most votes over all
  • 00:13:30 time and look through all of these
  • 00:13:33 different data sets to find something
  • 00:13:35 that I think is particularly interesting
  • 00:13:38 you know one thing that I think is
  • 00:13:39 pretty cool is you know New York City
  • 00:13:41 Airbnb open data so like this data set
  • 00:13:44 gives you all sorts of information on
  • 00:13:45 Airbnb listings one cool project you
  • 00:13:48 might start with is it gives you the
  • 00:13:50 latitude and longitude of these Airbnb
  • 00:13:52 listings maybe you try to visualize all
  • 00:13:55 of the listings in New York
  • 00:13:57 using this data set that would be one
  • 00:13:59 fun task to play around with if you ever
  • 00:14:01 need inspiration one thing that's great
  • 00:14:03 about Kaggle is that you can click on
  • 00:14:05 kernels and you can find all sorts of
  • 00:14:08 people that have already done analysis
  • 00:14:10 on this data and so that can help you
  • 00:14:13 kind of learn how to work on the data
  • 00:14:15 yourself another suggestion if you're
  • 00:14:17 having trouble getting started with
  • 00:14:19 Kaggle challenges is to check out my
  • 00:14:21 real world data science video which I
  • 00:14:24 will link to in the description
  • 00:14:26 I don't directly do a Kaggle challenge
  • 00:14:28 in this video but I take a very
  • 00:14:29 realistic approach to taking a data set
  • 00:14:32 and asking yourself all sorts of
  • 00:14:34 businesslike questions about the data
  • 00:14:36 set and actually using the data to
  • 00:14:38 answer those questions so I think it's a
  • 00:14:41 good tutorial to get you to the point
  • 00:14:43 where you probably can go to Kaggle and
  • 00:14:45 play around with this data on your own
  • 00:14:47 and answer your own questions alright
  • 00:14:49 with that I think we're gonna end this
  • 00:14:50 video here hope you guys enjoyed this
  • 00:14:52 one hopefully you found it informative
  • 00:14:54 and also feel a little bit inspired to
  • 00:14:57 work on a data science project now if
  • 00:14:59 you did enjoy this video it'd mean a lot
  • 00:15:01 if you threw it a thumbs up and also let
  • 00:15:03 me know in the comments if any of these
  • 00:15:06 ideas were particularly interesting to
  • 00:15:08 you I want to make a full walkthrough
  • 00:15:09 video on one of them and I definitely
  • 00:15:11 will give preference to the most
  • 00:15:13 requested most commented ideas my current
  • 00:15:16 plan is to make a neural network
  • 00:15:17 tutorial next and then shortly after
  • 00:15:19 that I'll probably make the full data
  • 00:15:21 science walkthrough video make sure to
  • 00:15:22 subscribe to not miss either of those
  • 00:15:24 also if you want to stay up to date on
  • 00:15:26 everything that I'm doing make sure to
  • 00:15:27 check out my Instagram and Twitter I
  • 00:15:29 think that's all I have until next time
  • 00:15:32 everyone
  • 00:15:33 [Music]
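
The table-scraping steps described around 09:30 to 10:09 (inspect the table, note its ID, then pull the rows out in a Python script) can be sketched roughly as below. This is only an illustration, not the script from the video: the video uses Beautiful Soup, while this sketch swaps in the standard library's html.parser so it runs without any installs. The table ID "per_game_stats", the sample HTML, and the names TableScraper and scrape_table are all assumptions made up for the example. With Beautiful Soup, the lookup step would typically be BeautifulSoup(html, "html.parser").find("table", id="per_game_stats").

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a downloaded page; the id and values are invented.
SAMPLE_HTML = """
<html><body>
<table id="other"><tr><td>ignore me</td></tr></table>
<table class="stats_table" id="per_game_stats">
  <thead><tr><th>Player</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><td>Player A</td><td>25.1</td></tr>
    <tr><td>Player B</td><td>18.4</td></tr>
  </tbody>
</table>
</body></html>
"""


class TableScraper(HTMLParser):
    """Collect the rows of the table whose id attribute equals table_id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False   # True while inside the target table
        self.in_cell = False    # True while inside a <td> or <th>
        self.rows = []          # completed rows, each a list of cell strings
        self.current_row = []   # cells of the row currently being read
        self.current_cell = []  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.current_row = []
        elif self.in_table and tag in ("td", "th"):
            self.in_cell = True
            self.current_cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell of the target table.
        if self.in_cell:
            self.current_cell.append(data)

    def handle_endtag(self, tag):
        if not self.in_table:
            return
        if tag in ("td", "th"):
            self.current_row.append("".join(self.current_cell).strip())
            self.in_cell = False
        elif tag == "tr":
            self.rows.append(self.current_row)
        elif tag == "table":
            self.in_table = False


def scrape_table(html, table_id):
    """Return the target table as a list of rows (header row first)."""
    parser = TableScraper(table_id)
    parser.feed(html)
    return parser.rows


rows = scrape_table(SAMPLE_HTML, "per_game_stats")
header, players = rows[0], rows[1:]
print(header)
print(players)
```

For a real page, the HTML string would come from a download step first, and the rows could then be loaded into pandas for the analysis part of the project. The sketch does not handle tables nested inside the target table.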