Coding

Comprehensive Python Beautiful Soup Web Scraping Tutorial! (find/find_all, CSS select, scrape table)

  • 00:00:00 hey how's it going everyone and welcome
  • 00:00:01 back to another video we've got a fun
  • 00:00:03 one in store today we're gonna go
  • 00:00:04 through all sorts of things related to
  • 00:00:06 web scraping and Python and specifically
  • 00:00:08 be looking at the beautiful soup library
  • 00:00:11 couple logistical things before we begin
  • 00:00:13 if you haven't already and you've
  • 00:00:15 enjoyed any of my videos it'd mean a lot
  • 00:00:16 to me if you subscribe then the second
  • 00:00:18 thing is I want to just thank everyone
  • 00:00:20 who responded to my post when I asked
  • 00:00:23 what should be included in this video I
  • 00:00:24 wish I thought of this before I actually
  • 00:00:27 made this post but thank you to the
  • 00:00:29 person that responded everything we'll
  • 00:00:32 just do everything in this video no I
  • 00:00:34 mean I thought about it a bit more with
  • 00:00:35 all your feedback and I think what we're
  • 00:00:38 gonna do is probably make a couple
  • 00:00:39 videos on the topic of web scraping but
  • 00:00:41 this first one will really be about the
  • 00:00:43 basics HTML and CSS you know what is web
  • 00:00:47 scraping then we're gonna dive into the
  • 00:00:49 beautiful soup library and kind of learn
  • 00:00:51 the building blocks that we need from
  • 00:00:53 that library and then the final part of
  • 00:00:54 this video will be a section of
  • 00:00:56 exercises where you can kind of test out
  • 00:00:58 your skills I think that should be a
  • 00:00:59 pretty fun section a timeline in the
  • 00:01:01 description or attached to the video
  • 00:01:03 there's all sorts of places you can find
  • 00:01:05 the timeline these days but let's just
  • 00:01:08 jump into it okay before we get into web
  • 00:01:10 scraping it's important to know how web
  • 00:01:13 pages on the Internet actually work so
  • 00:01:15 any site that we go to whether that be
  • 00:01:17 YouTube Amazon Wikipedia they're all
  • 00:01:21 composed of some combination of HTML and
  • 00:01:24 CSS so HTML is a language really to
  • 00:01:27 build web pages so here we have you know
  • 00:01:31 my YouTube page and the first thing to
  • 00:01:33 understand is that we can actually see
  • 00:01:35 the HTML source code and this is
  • 00:01:37 possible on pretty much any browser you
  • 00:01:39 are using by right clicking and clicking
  • 00:01:42 view page source so this YouTube page
  • 00:01:45 we're looking at right now it's styled
  • 00:01:47 by all this code that you see here and
  • 00:01:53 don't worry if this looks a little bit
  • 00:01:55 intimidating to start we'll start with a
  • 00:01:58 lot simpler examples of HTML but another
  • 00:02:01 important thing to know is that you know
  • 00:02:02 that was a lot to look at but if we're
  • 00:02:04 looking for a specific HTML that
  • 00:02:06 represents specific elements on the page
  • 00:02:08 so let's say my subscriber count here
  • 00:02:11 we can use not view page source but
  • 00:02:14 inspect and so this is kind of a
  • 00:02:18 separate view of the HTML if I expand
  • 00:02:22 this a little bit you can see a bit but
  • 00:02:24 as I kind of navigate over these items
  • 00:02:26 we see we got my name and the subscriber
  • 00:02:30 count
  • 00:02:31 I could just if I wanted to grab the
  • 00:02:34 subscriber count and I can actually
  • 00:02:36 within the web browser I can edit this
  • 00:02:38 so let's say I wanted to have you know 100
  • 00:02:44 million subscribers I'm coming for
  • 00:02:46 PewDiePie now as you can see I've
  • 00:02:50 edited that edited the code and we can
  • 00:02:52 see it says a hundred million
  • 00:02:53 subscribers here so that's another way
  • 00:02:56 you can look for specific HTML on a page
  • 00:02:58 and note that I didn't actually change
  • 00:03:00 the code on YouTube servers if we
  • 00:03:02 refresh the page unfortunately it will
  • 00:03:06 go back to this number but these are two
  • 00:03:08 good things to know about as we get into
  • 00:03:11 our web scraping so the core of what web
  • 00:03:13 scraping is is using Python or another
  • 00:03:16 language to programmatically look
  • 00:03:19 through HTML source code and pull out
  • 00:03:21 only the things that we want to scrape
  • 00:03:23 the web pages for elements that we want
  • 00:03:26 and information that we want to collect
  • 00:03:27 so an example here would be maybe I
  • 00:03:30 wanted to scrape my YouTube home page
  • 00:03:33 and just grab all the titles of all of
  • 00:03:37 my YouTube videos that is one example
  • 00:03:39 task of why we use web scraping where
  • 00:03:42 I didn't want to manually go through
  • 00:03:44 and write down all these video names I
  • 00:03:46 wanted Python to do that for me
  • 00:03:48 alright let's start moving towards
  • 00:03:50 getting into the code okay I want
  • 00:03:52 everyone to navigate to the page
  • 00:03:54 keithgalli.github.io/web-scraping
  • 00:03:58 /example.html so here once you
  • 00:04:03 load that up it's a very very simple
  • 00:04:07 example of an HTML web page we really
  • 00:04:11 just have like six or so elements on
  • 00:04:14 this page so once again we can right
  • 00:04:16 click click view page source actually
  • 00:04:20 what will probably be easier is instead
  • 00:04:21 we'll do the inspect and this is small
  • 00:04:24 enough where we can
  • 00:04:25 see pretty much all of the code so we
  • 00:04:28 have a head so this head is what we
  • 00:04:30 see in the top left the HTML example
  • 00:04:32 that's the title and then we have the
  • 00:04:34 body and that's ultimately all this
  • 00:04:36 stuff so I'm going to just kind of
  • 00:04:38 unfold all this so we can see all of it
  • 00:04:42 in its entirety but here we go so we
  • 00:04:46 start with the body that's what is
  • 00:04:47 actually on our web page we have a h1
  • 00:04:50 tag that means header we have a
  • 00:04:52 paragraph which is going to be linked to
  • 00:04:55 more interesting example and then the
  • 00:04:56 next thing is that a tag is a link which
  • 00:04:59 is pointing me to another web page then
  • 00:05:03 down here we have a smaller header
  • 00:05:05 that's denoted by h2 some more paragraph
  • 00:05:08 which this time is using italics which
  • 00:05:10 is an eye tag we have another header
  • 00:05:12 here same size as the other header and
  • 00:05:15 some more text and one thing to note
  • 00:05:16 here is as you can see this has an
  • 00:05:19 additional property and this is
  • 00:05:20 something we're going to look for as we
  • 00:05:21 start to scrape it has an ID equals
  • 00:05:23 paragraph ID but this is a basic page so
  • 00:05:27 we're gonna load this into Python open
  • 00:05:30 up your preferred editor and let's start
  • 00:05:32 writing some code I'm going to be using
  • 00:05:33 a Jupyter notebook through Google Colab
  • 00:05:35 so the first thing we'll want to do is
  • 00:05:37 load in the necessary libraries and so
  • 00:05:41 with Google Colab I already have these
  • 00:05:43 provided for me but you might need to pip
  • 00:05:44 install these so the first one we're
  • 00:05:46 gonna import is the requests library and
  • 00:05:50 this is going to be so we can load those
  • 00:05:52 web pages that we were just looking at
  • 00:05:54 so if you need to you might have to do a
  • 00:05:56 pip install requests but I already have
  • 00:05:59 it so I can just import requests and
  • 00:06:00 then the second thing we'll want to
  • 00:06:01 import is actually the beautifulsoup
  • 00:06:03 library so this is a little bit more
  • 00:06:06 complex of a line but I like to import
  • 00:06:08 it like this
  • 00:06:08 so from bs4 import
  • 00:06:11 BeautifulSoup as bs and so if you don't
  • 00:06:17 have this one you're probably going to do
  • 00:06:18 a pip install of beautifulsoup4 or
  • 00:06:24 maybe a pip3 install
  • 00:06:26 beautifulsoup4 but we can run that
  • 00:06:29 and then let's load our first page so
  • 00:06:31 we're going to load that page we were
  • 00:06:33 just looking at and so we can do this by
  • 00:06:36 doing the following so we're going to first
  • 00:06:39 load the webpage content and we'll do
  • 00:06:42 this with the requests library so we can do
  • 00:06:45 r equals requests dot get now we're
  • 00:06:50 gonna pass in the URL so the URL was
  • 00:06:53 https://keithgalli.github.io
  • 00:07:00 /web-scraping/example.html so
  • 00:07:05 that's the web page we're looking for
  • 00:07:06 and so once we have that though we want
  • 00:07:10 to convert it to a beautiful soup object
  • 00:07:12 convert to a beautiful soup object and
  • 00:07:17 so how you're going to do this is the
  • 00:07:18 following you're gonna do soup equals bs
  • 00:07:22 and then surround it with r dot
  • 00:07:25 content that's going to be actually the
  • 00:07:28 HTML on the web page that we get back
  • 00:07:30 from this request and finally we will
  • 00:07:35 want to print it out so print out our
  • 00:07:37 HTML and we can just do print soup let's
  • 00:07:42 see what happens
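Pulled together, the fetch-parse-print steps described above look roughly like this. Since a live fetch needs network access, a small inline HTML string approximating example.html stands in for the response here — the requests call is shown commented out, and the stand-in markup is an assumption, not a verbatim copy of the page:

```python
from bs4 import BeautifulSoup as bs

# With network access you would fetch the real page instead:
#   import requests
#   r = requests.get("https://keithgalli.github.io/web-scraping/example.html")
#   html = r.content

# Inline stand-in approximating the page's structure (assumed markup)
html = """
<html>
<head><title>HTML Example</title></head>
<body>
  <div align="middle">
    <h1>HTML Webpage</h1>
    <p>Link to more interesting example:
       <a href="https://keithgalli.github.io/web-scraping/webpage.html">webpage.html</a></p>
  </div>
  <h2>A Header</h2>
  <p><i>Some italicized text</i></p>
  <h2>Another header</h2>
  <p id="paragraph-id"><b>Some bold text</b></p>
</body>
</html>
"""

# Convert the raw HTML into a BeautifulSoup object and pretty-print it
soup = bs(html, "html.parser")
print(soup.prettify())
```

Passing an explicit parser name like `"html.parser"` avoids the "no parser specified" warning that bs4 otherwise emits.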
  • 00:07:44 okay cool so this is our page and one
  • 00:07:48 thing to note in addition to just
  • 00:07:49 printing out all this we can actually
  • 00:07:51 add an additional little snippet where
  • 00:07:54 we do soup dot prettify and this will
  • 00:07:59 just format it in a little bit more of a
  • 00:08:01 readable way so you can see exactly
  • 00:08:03 indentations and what level certain
  • 00:08:05 elements are compared to other elements
  • 00:08:07 and what elements are nested inside
  • 00:08:09 other elements okay so that's our page
  • 00:08:12 and as we can see everything that we
  • 00:08:13 were looking at before is all here all
  • 00:08:16 right let's now start scraping with the
  • 00:08:18 beautiful soup library and so I think
  • 00:08:21 what will be useful to start here is to
  • 00:08:24 look at some of the documentation for
  • 00:08:27 the beautiful soup library and as I go
  • 00:08:30 through some of the elements in this I
  • 00:08:31 think will be easier and easier to look
  • 00:08:34 through this documentation yourself
  • 00:08:35 there's a lot of useful stuff here and
  • 00:08:37 it's actually not that crazy large of a
  • 00:08:40 library
  • 00:08:41 but you know you have some initial stuff
  • 00:08:44 kind of initial navigation here at the
  • 00:08:46 start of this page has the installation
  • 00:08:49 here and I'll link this in the
  • 00:08:51 description this documentation and you
  • 00:08:56 know kind of how we import it but the
  • 00:08:59 first thing we're going to look at is
  • 00:09:01 going to be find and find all so we see
  • 00:09:06 it just kind of here with navigating
  • 00:09:09 using tag names so let's look at find
  • 00:09:11 and find all I think this is what I use
  • 00:09:13 most frequently when I'm using
  • 00:09:14 beautifulsoup
  • 00:09:15 so it's useful as kind of a starting
  • 00:09:18 point all right so we have soup if you
  • 00:09:21 remember that's all of this HTML here so
  • 00:09:24 what we're going to want to do is let's
  • 00:09:27 say we wanted to grab just these h2
  • 00:09:31 elements so very easily with beautiful
  • 00:09:33 soup we can go ahead and do I'm going to
  • 00:09:36 say first header equals soup dot find h2
  • 00:09:43 so we pass into this find command the
  • 00:09:46 tag that we're looking for in this case
  • 00:09:47 it is the h2 tag and so if I run this
  • 00:09:51 and then actually print out the first
  • 00:09:53 header you see we have a header and note
  • 00:10:00 that when we just use this find command
  • 00:10:03 it's going to find the first element
  • 00:10:05 that matches the description that we
  • 00:10:07 passed in the other command that's
  • 00:10:09 useful to use and I honestly use this
  • 00:10:11 way more frequently than I use just
  • 00:10:13 find is to use find all so headers we'll
  • 00:10:19 say is equal to soup dot find all h2 so
  • 00:10:24 the syntax here is exactly the same but
  • 00:10:27 now we're not going to stop on the first
  • 00:10:29 element we're going to create a list of
  • 00:10:30 all the h2s and so even if there's just
  • 00:10:32 one element it's still going to be a
  • 00:10:35 list of that one element so now let's
  • 00:10:39 print out the headers and because I'm
  • 00:10:42 using a Jupiter notebook I can either do
  • 00:10:43 print headers or just type headers as
  • 00:10:45 the last line and it will show this but
  • 00:10:47 now you see we have a list of both these
  • 00:10:50 headers so that's a very simple example
  • 00:10:53 of grabbing
  • 00:10:53 something from our page but we get a lot
  • 00:10:55 more complicated as we go so the next
  • 00:10:59 thing is that we can actually pass in a
  • 00:11:02 list of elements to look for so let's
  • 00:11:09 say in addition to the h2 tags we
  • 00:11:12 actually also wanted the h1 tags so
  • 00:11:15 any kind of header element so what I can
  • 00:11:17 do here is I can do same as last time
  • 00:11:21 first header equals soup dot find and
  • 00:11:26 then pass in a list here instead of just
  • 00:11:28 a single object and I'm going to do h1
  • 00:11:30 and h2 now let's print out first header
  • 00:11:37 and see we have HTML web page which
  • 00:11:40 matches what the first header is because
  • 00:11:43 these h2s come after okay and note that
  • 00:11:46 the order here doesn't matter whatever
  • 00:11:48 you put in this list is going to find
  • 00:11:50 the first occurrence of one of those
  • 00:11:51 items so I did the order the other way
  • 00:11:54 around and ran it we still get that same
  • 00:11:56 result okay moving on yet again as we
  • 00:12:04 can see with find all we do find all and
  • 00:12:08 pass in that list we will get both h2
  • 00:12:14 tags and the h1 tag so headers we see we
  • 00:12:20 have now three tags because we're
  • 00:12:22 including h1 and h2s so that's another
  • 00:12:24 useful thing to know and we're going to
  • 00:12:26 just keep building up our intuition for
  • 00:12:27 find and find all this is I would say the
  • 00:12:30 most important function within the
  • 00:12:32 beautiful soup library and we can get
  • 00:12:34 more and more complicated with how we
  • 00:12:35 use it and I'll show that so the next
  • 00:12:38 thing that we should look at is that we
  • 00:12:39 can actually pass in attributes to the
  • 00:12:44 find slash find all function so an example
  • 00:12:49 would be paragraphs if I did paragraphs equals
  • 00:12:53 soup dot find all let's say p and then
  • 00:12:58 print it out paragraph
  • 00:13:02 we see we have all these different
  • 00:13:05 elements there's three listed well let's
  • 00:13:08 say I just wanted the paragraph with the
  • 00:13:11 ID paragraph ID well we can pass in a
  • 00:13:15 second argument to this and you see
  • 00:13:18 right here attributes so this would be
  • 00:13:20 what you'd find in the documentation if
  • 00:13:22 you looked up find all so if I pass in
  • 00:13:24 attributes I can use a dictionary
  • 00:13:27 mapping of the property that I'm looking
  • 00:13:30 for in this case it's the ID and we want
  • 00:13:33 that to be equal to paragraph ID and I
  • 00:13:36 run that and now we just have a list
  • 00:13:38 containing the single paragraph item and
  • 00:13:41 note that if this wasn't a valid ID we'd
  • 00:13:45 get nothing here so that's another
  • 00:13:47 useful thing to know and we're going to
  • 00:13:49 keep just building up these building
  • 00:13:50 blocks so what's another useful thing we
  • 00:13:53 can do well I think something that's
  • 00:13:55 really useful as you're trying to get
  • 00:13:57 specific elements on a page is to note
  • 00:14:01 that you can nest find and find all
  • 00:14:05 calls so what I mean by this is that we
  • 00:14:08 could say something like body equals
  • 00:14:11 soup dot find of the body so if we look
  • 00:14:16 at our HTML up here let's say we only
  • 00:14:20 want to you know we have the head here
  • 00:14:22 but we only wanted the stuff in the
  • 00:14:24 body so we can start with body equals
  • 00:14:28 soup dot find body print out the body
  • 00:14:30 as you see we have that here and now
  • 00:14:34 let's say we wanted just this div div is
  • 00:14:39 basically a container within HTML so now
  • 00:14:41 I'm going to say div equals body dot
  • 00:14:45 find div so now just within this body we
  • 00:14:49 are specifically looking for a div and
  • 00:14:52 so this is very helpful as you have a
  • 00:14:54 really really big page like you saw with
  • 00:14:57 that YouTube page in narrowing down
  • 00:14:59 where you're scraping from so if I go
  • 00:15:03 ahead and print out the div now we have
  • 00:15:05 just this stuff within the body and
  • 00:15:09 finally let's say we wanted to just get
  • 00:15:11 the header from that
  • 00:15:13 well I can say header equals div dot
  • 00:15:16 find of h1 and print out the
  • 00:15:20 header and there we go
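The find/find_all building blocks covered so far — a single find, find_all, a list of tags, an attrs filter, and nested calls — can be sketched like this. The inline markup is my stand-in approximating example.html, not the real page:

```python
from bs4 import BeautifulSoup as bs

# Assumed stand-in markup approximating example.html
html = """<body>
<div align="middle"><h1>HTML Webpage</h1>
<p>Link to more interesting example:
<a href="https://keithgalli.github.io/web-scraping/webpage.html">webpage.html</a></p></div>
<h2>A Header</h2><p><i>Some italicized text</i></p>
<h2>Another header</h2><p id="paragraph-id"><b>Some bold text</b></p>
</body>"""
soup = bs(html, "html.parser")

first_header = soup.find("h2")               # first matching tag only
headers = soup.find_all("h2")                # list of every match
any_header = soup.find(["h1", "h2"])         # first h1 OR h2, in document order
tagged = soup.find_all("p", attrs={"id": "paragraph-id"})  # filter by attribute

# find/find_all calls can be nested to narrow the search step by step
body = soup.find("body")
div = body.find("div")
header = div.find("h1")
```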
  • 00:15:23 I guess one additional thing that I
  • 00:15:25 think is useful with this before we move
  • 00:15:27 on to the next function would be that we
  • 00:15:30 can search for specific strings in our
  • 00:15:35 find slash find all calls so let's say
  • 00:15:40 we wanted to find all I'm gonna real
  • 00:15:47 quick just print out our soup oh no
  • 00:15:55 what happened okay so let's say we
  • 00:16:04 wanted to just find any paragraph with
  • 00:16:08 the text Some so really Some italicized
  • 00:16:12 text and Some bold text well we can do
  • 00:16:14 that by doing soup dot find all
  • 00:16:19 paragraph tag and then one of these
  • 00:16:22 arguments is text actually this
  • 00:16:25 documentation that Google Colab is using
  • 00:16:27 is a little bit outdated it's actually
  • 00:16:29 now known as string in the beautiful
  • 00:16:32 soup 4
  • 00:16:33 so string equals let's say Some and we're
  • 00:16:37 going to find all so we'll say paragraph
  • 00:16:41 equal this print out paragraphs and it
  • 00:16:48 is blank so we have an issue here why is
  • 00:16:52 it blank well if we think about how this
  • 00:16:56 paragraph text actually is it doesn't
  • 00:16:58 include just Some it's either Some bold
  • 00:17:01 text or some italicized text so if I did
  • 00:17:03 some bold text and did the full string
  • 00:17:06 now we see it that's not ideal in my
  • 00:17:08 opinion you usually don't want to look
  • 00:17:10 for an exact string you might want to
  • 00:17:11 find a specific word or two so this
  • 00:17:14 becomes really useful if we leverage it
  • 00:17:17 with the regex library so if I import
  • 00:17:20 re which is regex and then I do
  • 00:17:23 re dot compile and then I do Some now it
  • 00:17:29 will just look for if I do it right
  • 00:17:32 what did I do wrong oh I had an extra
  • 00:17:36 thing in there by accident now it will just look
  • 00:17:39 for some somewhere in the string and
  • 00:17:41 this becomes particularly useful to
  • 00:17:44 another example we could do with the
  • 00:17:46 regex is find all headers that have the
  • 00:17:50 word header in it and note that these
  • 00:17:53 headers have different capitalization so
  • 00:17:56 if I wanted to find those headers I
  • 00:17:57 could do headers equals soup dot find all
  • 00:18:01 those are both h2 elements and I could
  • 00:18:04 be looking for a string equals re dot
  • 00:18:07 compile and then pass in header if we
  • 00:18:13 run this we just get one because that
  • 00:18:19 just gets the lowercase one but because
  • 00:18:21 this is a reg X we could do something
  • 00:18:23 like (H|h) so that's now looking for a
  • 00:18:27 capital H or a lowercase h using regex
  • 00:18:30 syntax and there we go we get both so
  • 00:18:34 that's useful too I think that's all we
  • 00:18:36 need to know about find and find all
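The string-filter behavior just described can be sketched as follows — an exact string must match a tag's entire text, while re.compile relaxes it to a "contains the pattern" search. The markup is again my stand-in approximating the example page:

```python
import re
from bs4 import BeautifulSoup as bs

# Assumed stand-in markup approximating example.html
html = """<body>
<h2>A Header</h2><p><i>Some italicized text</i></p>
<h2>Another header</h2><p id="paragraph-id"><b>Some bold text</b></p>
</body>"""
soup = bs(html, "html.parser")

# A plain string only matches if it equals the tag's full text
exact = soup.find_all("b", string="Some")            # no match: the text is longer
full = soup.find_all("b", string="Some bold text")   # matches exactly

# re.compile turns it into a substring search;
# (H|h) matches either capitalization of "header"
headers = soup.find_all("h2", string=re.compile("(H|h)eader"))
```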
  • 00:18:37 okay the next functionality we're gonna
  • 00:18:40 get into it's pretty similar to find and
  • 00:18:43 find all but it's going to be the Select
  • 00:18:45 method within beautiful soup and
  • 00:18:48 this is really kind of like selecting
  • 00:18:51 elements based on kind of how you would
  • 00:18:53 select elements and CSS so I haven't
  • 00:18:55 talked too much about CSS yet but kind
  • 00:18:59 of has a quick introduction if we go to
  • 00:19:03 this page that I have linked to this is
  • 00:19:06 the more advanced example we'll do a lot
  • 00:19:07 of the exercises on you know this is a
  • 00:19:10 little about me page if we view the page
  • 00:19:13 source here this top stuff that we see
  • 00:19:17 up here is the CSS and it's basically
  • 00:19:23 how we can style specific elements in
  • 00:19:26 the HTML so yeah that's just a little bit
  • 00:19:28 about CSS we'll see that a little bit
  • 00:19:30 more in a bit but it's cool
  • 00:19:33 we're gonna use that the way that you
  • 00:19:35 style specific elements with CSS
  • 00:19:37 beautifulsoup kind of also mimics the
  • 00:19:40 ability to select elements like that so
  • 00:19:42 I think the best place to go to start
  • 00:19:44 seeing what you can do is this CSS
  • 00:19:48 selectors reference page and I'll link
  • 00:19:51 this in the description but basically it
  • 00:19:54 shows us kind of different ways we can
  • 00:19:56 select elements in our HTML just like if
  • 00:19:59 we do dot class so for example dot intro
  • 00:20:02 selects all the elements with class
  • 00:20:04 intro if we do a hash pound sign ID a
  • 00:20:09 pound sign firstname that selects the
  • 00:20:11 element with ID equals firstname we
  • 00:20:14 can also just pass in an element so like
  • 00:20:17 select all p's you can nest things so
  • 00:20:20 like all paragraphs within a specific
  • 00:20:23 div that's how you can select it you can
  • 00:20:26 do div plus p that selects all p's that
  • 00:20:29 are placed immediately after div
  • 00:20:31 elements so there's a lot of useful
  • 00:20:33 stuff here you can also you know grab
  • 00:20:35 specific attributes so like if there's a
  • 00:20:37 certain URL you were looking for you
  • 00:20:39 could use it to do that and you'll see
  • 00:20:42 how this page will be very useful as
  • 00:20:44 you kind of see me using select in
  • 00:20:47 action but let's go ahead and start by
  • 00:20:51 just selecting all the paragraph tags in
  • 00:20:57 our page so soup dot select p and then
  • 00:21:02 I can print out content and as you can
  • 00:21:06 see this is very similar to doing find
  • 00:21:08 all of P one thing that's really useful
  • 00:21:11 with this type of method is yeah is
  • 00:21:13 using those paths so if we look back at
  • 00:21:16 our HTML and maybe it'll be useful for
  • 00:21:18 me to just kind of like print out some
  • 00:21:20 HTML again so I'm going to add another
  • 00:21:21 code cell I'll just print out soup dot
  • 00:21:26 body and this is kind of a nice little
  • 00:21:27 shorthand to get just the body but we
  • 00:21:31 have the body here and maybe I'll
  • 00:21:33 prettify that and print it
  • 00:21:41 cool so that's the body let's say we
  • 00:21:43 wanted to just grab paragraphs that were
  • 00:21:47 inside of divs so I could do soup dot
  • 00:21:50 select div and then paragraph and so we
  • 00:21:57 see now we have that only this one right
  • 00:22:00 here
  • 00:22:01 other stuff we can do with this we could
  • 00:22:03 let's say we wanted to grab all the
  • 00:22:05 paragraphs that were preceded by a
  • 00:22:07 header too so I could do paragraphs
  • 00:22:11 equals soup dot select we will want h2
  • 00:22:16 and then I believe this squiggly (~) and
  • 00:22:19 then p so that's going to be getting
  • 00:22:22 the paragraphs directly after each h2
  • 00:22:24 and when it says directly after that
  • 00:22:26 means on the same level so you see that
  • 00:22:28 the nested is right there and let's see
  • 00:22:30 if that gets it as we hope some
  • 00:22:36 italicized text and then we get some
  • 00:22:38 bold text so that did it exactly how we
  • 00:22:40 are hoping that's awesome let's do some
  • 00:22:43 more of this
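The select calls so far — all paragraphs, paragraphs inside a div, and paragraphs that follow an h2 sibling — sketched on the same assumed stand-in markup:

```python
from bs4 import BeautifulSoup as bs

# Assumed stand-in markup approximating example.html
html = """<body>
<div align="middle"><h1>HTML Webpage</h1>
<p>Link to more interesting example:
<a href="https://keithgalli.github.io/web-scraping/webpage.html">webpage.html</a></p></div>
<h2>A Header</h2><p><i>Some italicized text</i></p>
<h2>Another header</h2><p id="paragraph-id"><b>Some bold text</b></p>
</body>"""
soup = bs(html, "html.parser")

all_paragraphs = soup.select("p")       # every <p>, like find_all("p")
div_paragraphs = soup.select("div p")   # only <p> tags nested inside a <div>
after_headers = soup.select("h2 ~ p")   # <p> siblings that come after an <h2>
```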
  • 00:22:44 what else is useful well it's also
  • 00:22:47 useful to grab specific elements with
  • 00:22:49 IDs so let's say we wanted to grab the
  • 00:22:51 bold text the bold element after a
  • 00:22:55 paragraph with ID paragraph ID so I
  • 00:22:58 could do it this way you could say bold
  • 00:23:03 text equals soup dot select well I want
  • 00:23:07 to grab the paragraph with pound sign
  • 00:23:11 paragraph-id and then I want inside of
  • 00:23:15 that a b element the bold text
  • 00:23:19 element now if we print that we get bold
  • 00:23:25 text so you know you have a lot of
  • 00:23:28 options with this kind of thing I would say
  • 00:23:31 if you're trying to navigate through a
  • 00:23:32 specific path using select is very
  • 00:23:34 helpful and you're going to get as you
  • 00:23:37 get more and more practice with
  • 00:23:39 beautiful soup you kind of get a feeling
  • 00:23:40 of when you want to use select when you
  • 00:23:42 want to use find and find all and always
  • 00:23:45 you can go to that reference page that I
  • 00:23:47 showed to see how you can use it
  • 00:23:50 and one thing that's a little bit of a
  • 00:23:51 bummer is some of the things down here
  • 00:23:55 don't actually have support in
  • 00:23:57 beautifulsoup but a lot of this top
  • 00:23:59 stuff I think is all supported I guess
  • 00:24:03 real quick one final thing that's worth
  • 00:24:04 mentioning is you can kind of run nested
  • 00:24:07 calls so if I said you know paragraphs
  • 00:24:10 equals soup dot find er dot select you
  • 00:24:17 know the body tag followed by some
  • 00:24:21 paragraph element and then I wanted to
  • 00:24:26 maybe I wanted it directly the direct
  • 00:24:29 child to be a paragraph element so that
  • 00:24:31 would take out this as an option because
  • 00:24:36 it's this paragraphs inside the div it's
  • 00:24:39 not a direct descendant so direct
  • 00:24:42 descendants of the body paragraphs and
  • 00:24:45 that would give me these two things and
  • 00:24:50 one thing we can also do is I could say
  • 00:24:52 like for paragraph in paragraphs I could
  • 00:24:59 do I could do nested kind of select
  • 00:25:01 calls on these so I could do paragraph
  • 00:25:03 dot selects let's say I want it in I tag
  • 00:25:09 and I think that should print things I
  • 00:25:13 guess I'll do print paragraphs
  • 00:25:21 and then I'll print paragraphs select ok
  • 00:25:32 so as we can see we get the paragraphs
  • 00:25:34 and then we iterate through these two
  • 00:25:36 items in the list and first time we do
  • 00:25:39 have an eye element so we can select
  • 00:25:41 that and print something out the second
  • 00:25:43 time
  • 00:25:43 there's no italics in this element so
  • 00:25:46 it's just an empty list I'm gonna
  • 00:25:47 quickly paste in one more thing to show
  • 00:25:49 how I could grab a div with an
  • 00:25:52 align equal middle by doing the
  • 00:25:55 following but let's move on to getting
  • 00:25:59 different properties of the HTML so
  • 00:26:07 let's say like as a first example one
  • 00:26:10 thing that we might want to get is a
  • 00:26:14 string within an element so I don't want
  • 00:26:17 just the header I want the text A Header
  • 00:26:21 not the full tagged element so if I did
  • 00:26:23 soup dot find all or maybe I'll just
  • 00:26:26 do find h2 and that's equal to header
  • 00:26:34 note if I print out header it's going to
  • 00:26:39 give me this but if I did header dot
  • 00:26:42 string it will give me just that text
  • 00:26:45 string as a nice thing to know however
  • 00:26:47 if we do it with the div so if I do div
  • 00:26:50 equals soup dot find div and
  • 00:26:57 if you print out the div notice we
  • 00:27:01 have all of this and if I I think it'll
  • 00:27:04 be a little bit more clear if I do print
  • 00:27:05 div dot prettify but we have this div if
  • 00:27:10 I try to call print div dot string on
  • 00:27:14 this see what happens so it says none
  • 00:27:18 and the issue here with the div why it
  • 00:27:20 can't print out all the text in this tag
  • 00:27:23 is because it's not clear if it should
  • 00:27:26 print out HTML web page or if it should
  • 00:27:27 print out everything in the paragraph
  • 00:27:29 here so because this has two kind of
  • 00:27:33 elements at the same level as children
  • 00:27:35 it doesn't
  • 00:27:35 know what to look at with that div so if
  • 00:27:39 you ever run into this type of problem
  • 00:27:40 where string is none there's another
  • 00:27:42 built-in method of beautiful soup called
  • 00:27:45 get text which is very useful and we can
  • 00:27:48 use that for bigger objects and getting
  • 00:27:51 all the text inside kind of in a
  • 00:27:53 recursive manner so now you see we have
  • 00:27:56 HTML Webpage, Link to more
  • 00:27:58 interesting example it gives that link
  • 00:28:01 so if there are multiple child elements use get
  • 00:28:06 text otherwise we can go ahead and
  • 00:28:10 use dot string so that's getting the
  • 00:28:14 string what else can we do here I think
  • 00:28:17 it is useful to actually get like this
  • 00:28:20 link and know how to get that href here
  • 00:28:23 so let's now go ahead and get a specific
  • 00:28:28 property from an element so to do this
  • 00:28:33 we could do soup dot find we could find
  • 00:28:35 the links and note you see that this
  • 00:28:38 link tag if we print that out link we
  • 00:28:46 get this link tag we just want this href
  • 00:28:49 because that's the actual link that we
  • 00:28:51 would use so if I do link href it
  • 00:28:56 doesn't work but what we can do is I can
  • 00:28:59 go ahead and with that link tag I can
  • 00:29:01 pass in in brackets href like this and
  • 00:29:07 as you can see we just get that link
  • 00:29:09 here and you can use this in other ways
  • 00:29:11 where if I grab paragraph equals soup
  • 00:29:14 dot find or do soup dot select paragraph
  • 00:29:18 with the ID paragraph ID and we printed
  • 00:29:23 that out
  • 00:29:26 if we just wanted to get this ID from
  • 00:29:29 the element we could do paragraph zero
  • 00:29:32 because this is in a list and then we
  • 00:29:33 could do ID so you can kind of pass in
  • 00:29:36 anything in the bracket syntax that's
  • 00:29:38 one of those properties that's useful
  • 00:29:40 too all right the final thing we'll do
  • 00:29:42 before we get into the examples is some
  • 00:29:45 code navigation okay I'm going to try to
  • 00:29:48 go through this section pretty quickly
  • 00:29:50 first thing I think it's good to know
  • 00:29:51 about is path syntax so basically we
  • 00:29:55 have you know our soup object as we've
  • 00:29:57 seen before what you can do is there's
  • 00:30:00 shorthands I could do like soup dot body to
  • 00:30:03 just get the body and I could just keep
  • 00:30:04 doing this I could do dot div to get
  • 00:30:08 just the div inside of the body then
  • 00:30:11 maybe like dot h1 to get just that
  • 00:30:14 header and then I really wanted to dot
  • 00:30:17 string to just get the string off that
  • 00:30:20 header so path syntax is good to know
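The dotted path shorthand can be sketched like this on stand-in markup (each dotted name means "the first tag with that name inside the current one"):

```python
from bs4 import BeautifulSoup as bs

# Assumed stand-in markup approximating the example page
html = """<body><div align="middle"><h1>HTML Webpage</h1>
<p>A paragraph</p></div></body>"""
soup = bs(html, "html.parser")

# soup.body -> first <body>, .div -> first <div> inside it, and so on,
# ending with .string to pull out just the text
text = soup.body.div.h1.string
print(text)
```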
  • 00:30:22 another thing that's good to know about
  • 00:30:25 and it really comes down to three terms
  • 00:30:28 that you really need to know so I'm
  • 00:30:31 going to say know the terms and the
  • 00:30:34 terms are parent sibling and child
  • 00:30:39 parent sibling and child okay and this
  • 00:30:46 will be more clear if I do a pretty
  • 00:30:48 print of the body okay so what does
  • 00:30:56 parent sibling and child mean well when
  • 00:31:01 we look at our body here we have this
  • 00:31:03 nested structure and so these terms all
  • 00:31:06 kind of relate to that nested structure
  • 00:31:09 so this div, its parent is the
  • 00:31:13 body because it is nested within the
  • 00:31:15 body likewise this body's child is the div
  • 00:31:20 so the body is the parent of the div and
  • 00:31:24 the div is the child of the body if
  • 00:31:26 we look now at this div we look at the
  • 00:31:28 elements that are on the same level of
  • 00:31:30 it and the next element that would be on
  • 00:31:32 the same level is this H two so if
  • 00:31:35 elements are on the same level we
  • 00:31:37 consider them siblings
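The three terms can be checked directly in code. This uses a made-up nested structure with the same shape as what's being described (a div inside the body, followed by an h2):

```python
from bs4 import BeautifulSoup

# Stand-in markup: body > div > h1, with an h2 on the div's level.
soup = BeautifulSoup(
    "<body><div><h1>Header</h1></div><h2>Sibling</h2></body>",
    "html.parser")

div = soup.find("div")
print(div.parent.name)               # body: the div is nested inside the body
print(soup.body.contents[0].name)    # div: the body's first child
print(div.find_next_sibling().name)  # h2: the next element on the same level
```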
  • 00:31:39 and just to see what we can do with
  • 00:31:41 those terms in beautiful soup there's
  • 00:31:43 several things that beautiful soup
  • 00:31:45 offers so if you look at the left side
  • 00:31:49 of the documentation here you see all
  • 00:31:52 sorts of things navigating the tree and
  • 00:31:54 it talks about contents and children
  • 00:31:56 descendants it talks about parents the
  • 00:32:00 really useful things that I think are
  • 00:32:02 kind of right around here with these
  • 00:32:04 function calls
  • 00:32:05 so find find parents find next siblings
  • 00:32:08 find previous siblings find all next and
  • 00:32:11 find all previous these commands can be
  • 00:32:14 pretty useful if you want to just get a
  • 00:32:16 subset of elements so just to do one
  • 00:32:20 quick example let's say I did find
  • 00:32:22 next siblings so find next sibling would
  • 00:32:26 get, it's kind of like find, and find
  • 00:32:28 next siblings with an s is kind of like
  • 00:32:31 find all so let's just say we grab the
  • 00:32:35 soup dot body dot find the h1, er, no,
  • 00:32:42 let's just find the div so we have this
  • 00:32:46 div and as we saw before this the
  • 00:32:50 siblings of that div are the h2
  • 00:32:52 I guess the paragraph this other h2 this
  • 00:32:55 paragraph etc so there's I think four
  • 00:32:58 total elements that are
  • 00:33:00 siblings to the div so I did find next
  • 00:33:03 siblings we should get a list of four
  • 00:33:07 elements so we get a header paragraph
  • 00:33:10 first paragraph tag second paragraph tag
  • 00:33:13 our second header and the second
  • 00:33:15 paragraph so we could accordingly you
  • 00:33:19 know do some additional processing on
  • 00:33:21 those siblings and with all those other
  • 00:33:23 terms I just mentioned that
  • 00:33:25 documentation also has ways to access
  • 00:33:29 those so you know you could find all
  • 00:33:34 parents you could find all next etc so
  • 00:33:39 useful things to know about and you can
  • 00:33:41 kind of look into the documentation if
  • 00:33:43 you need a specific thing among there I
  • 00:33:46 honestly don't find myself using these
  • 00:33:49 types of functions nearly as much as
  • 00:33:50 just using
  • 00:33:51 find and select but good to know about
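That sibling example can be sketched end to end with stand-in HTML shaped like the example page (a div followed by two headers and two paragraphs):

```python
from bs4 import BeautifulSoup

# Stand-in body: a div followed by four sibling elements.
html = ("<body><div>intro</div>"
        "<h2>Header</h2><p>one</p><h2>Header 2</h2><p>two</p></body>")
soup = BeautifulSoup(html, "html.parser")

div = soup.body.find("div")
# find_next_sibling is to find as find_next_siblings (with an s) is to find_all
siblings = div.find_next_siblings()
print([tag.name for tag in siblings])  # ['h2', 'p', 'h2', 'p']
```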
  • 00:33:54 all right let's get into exercises okay
  • 00:33:57 for the exercises we're going to be
  • 00:33:59 using this page keithgalli dot github dot
  • 00:34:02 io slash web-scraping slash webpage dot
  • 00:34:04 HTML and I'll link this in the
  • 00:34:06 description but just a reminder what
  • 00:34:08 that looks like
  • 00:34:09 it looks like this it's kind of an about
  • 00:34:11 me page kind of a little fun thing I put
  • 00:34:13 together but we're gonna be getting
  • 00:34:15 specific elements from this page and to
  • 00:34:18 help you do that I recommend right
  • 00:34:20 clicking and clicking inspect and note
  • 00:34:23 that if you open up the body you can
  • 00:34:25 kind of see exactly how every one of
  • 00:34:28 these elements is styled in the HTML and
  • 00:34:32 that's gonna help us scrape specific
  • 00:34:34 things out alright so how this is gonna
  • 00:34:36 go is I'm going to present a task on
  • 00:34:39 that web page and the way that I think
  • 00:34:42 you're going to get the most out of this
  • 00:34:43 section is that every time I present a
  • 00:34:45 task if you pause the video try to solve
  • 00:34:49 the task on your own and then resume it
  • 00:34:51 when you're ready to see the answer or
  • 00:34:53 see how I solve it there's gonna be
  • 00:34:55 multiple answers to all these tasks it
  • 00:34:58 will really allow you to you know
  • 00:34:59 practice your skills and and drill down
  • 00:35:01 the library all right so let's start
  • 00:35:04 with the first task and I guess before
  • 00:35:07 we actually I present a first task let's
  • 00:35:09 just load the web page as a
  • 00:35:14 reminder how to do that we can just
  • 00:35:15 load it the same way basically that
  • 00:35:17 we loaded that other example page and
  • 00:35:20 make sure if you haven't already make
  • 00:35:22 sure you import it requests and imported
  • 00:35:24 the beautiful soup library let's load
  • 00:35:27 the web page and the one thing I want to
  • 00:35:31 do here this is now web page dot HTML
  • 00:35:35 and the one thing I'm gonna just give
  • 00:35:37 this a different name I'm gonna give it
  • 00:35:38 the name web page and we can print out
  • 00:35:41 web page dot prettify now so this is
  • 00:35:44 loading the page and as you can see it's
  • 00:35:50 a lot more text than it was before
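The loading step just described, sketched as a small helper (the URL is assumed from the video; the fetch itself is left commented out so nothing hits the network when you paste this in):

```python
import requests
from bs4 import BeautifulSoup

def to_soup(content):
    """Parse HTML bytes or text into a BeautifulSoup object."""
    return BeautifulSoup(content, "html.parser")

def load_page(url):
    """Fetch a URL with requests and parse the response."""
    r = requests.get(url)
    r.raise_for_status()
    return to_soup(r.content)

# webpage = load_page("https://keithgalli.github.io/web-scraping/webpage.html")
# print(webpage.prettify()[:300])  # peek at the start of the parsed HTML
```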
  • 00:35:53 so first task is going to be to grab all
  • 00:35:59 all of the social links from the web page
  • 00:36:06 and to make this a little bit more
  • 00:36:09 interesting I'm gonna say that you have
  • 00:36:12 to do this in at least three different
  • 00:36:14 ways because ultimately we can select
  • 00:36:20 items in many different ways using beautiful soup
  • 00:36:22 and not only does it have to be done in
  • 00:36:24 three different ways but one has to use
  • 00:36:26 find / find all and at least one has to
  • 00:36:28 use the select method so let's
  • 00:36:31 go back to our web page so what we're
  • 00:36:34 trying to grab is all these links right
  • 00:36:37 here feel free to pause the video try
  • 00:36:40 this on your own and then resume it when
  • 00:36:42 you're ready to see how I would go about
  • 00:36:44 solving this alright so what I would do
  • 00:36:48 to start out, and it's probably kind
  • 00:36:53 of simple right from the get-go,
  • 00:36:55 is I would just try to see what happens
  • 00:36:57 when I select all of the a elements
  • 00:36:59 because a elements are the links on the
  • 00:37:02 page so what happens when I do that and
  • 00:37:06 as we can see we get some stuff there
  • 00:37:12 but it's you know our socials are here
  • 00:37:15 we get all this other stuff in addition
  • 00:37:16 to our socials so this is not the best
  • 00:37:19 way to go about this so what else can we
  • 00:37:22 do well let's go back to the web page
  • 00:37:24 and remember we can do inspect so if we
  • 00:37:30 go ahead and start inspecting these
  • 00:37:32 elements we see that once we get to
  • 00:37:34 these socials
  • 00:37:37 they are all in this class this
  • 00:37:40 unordered list class UL with a class
  • 00:37:45 name of socials so if we grab this then
  • 00:37:48 it's pretty easy to get the links from
  • 00:37:50 there alright so how can we do that well
  • 00:37:52 we wanted to get that unordered list and
  • 00:37:56 we wanted it to have the class name of
  • 00:37:59 socials so remember if we did pound that
  • 00:38:03 would be the ID but dot is for class
  • 00:38:05 names so ul dot socials and what does
  • 00:38:09 that just that give us ok cool that
  • 00:38:12 gives us what we're kind of looking for
  • 00:38:14 but now we just want these a elements
  • 00:38:17 within
  • 00:38:18 that so we can do ul dot socials a
  • 00:38:20 and run that line and cool we have a
  • 00:38:27 list of the a tags and now if we wanted
  • 00:38:30 to just get the actual links I'm gonna
  • 00:38:32 say actual links that's going to be done
  • 00:38:35 if we just do a list comprehension we
  • 00:38:38 could say link href cuz that's where the
  • 00:38:42 actual link is stored in the href for
  • 00:38:47 link in links and then let's go ahead
  • 00:38:51 and print out the actual links and look
  • 00:38:55 at that we got it cool so that's one way
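That first, select-based approach, sketched against stand-in markup shaped like the page's socials list (the class names mirror the transcript; the hrefs are made up):

```python
from bs4 import BeautifulSoup

# Stand-in markup shaped like the example page's socials list.
html = """
<body>
  <a href="https://example.com/not-a-social">some other link</a>
  <ul class="socials">
    <li class="social"><a href="https://example.com/instagram">Instagram</a></li>
    <li class="social"><a href="https://example.com/twitter">Twitter</a></li>
  </ul>
</body>
"""
webpage = BeautifulSoup(html, "html.parser")

links = webpage.select("ul.socials a")           # only a tags inside the socials list
actual_links = [link["href"] for link in links]  # pull out just the hrefs
print(actual_links)
```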
  • 00:38:57 let's move on to the second this time
  • 00:38:59 let's try to use find so once again a
  • 00:39:03 nice starting point might be to do find
  • 00:39:06 let's see what happens if we find just
  • 00:39:08 first a tag print that out okay we get
  • 00:39:17 this tag all right that's my YouTube
  • 00:39:20 channel that's not what we're looking
  • 00:39:21 for
  • 00:39:22 well we can do the same kind of thing
  • 00:39:24 that we did before we ultimately if we
  • 00:39:26 just find this element so if I said ul
  • 00:39:30 whoops, nope, that gives us the fun facts
  • 00:39:33 but if we passed in also the attributes
  • 00:39:36 dictionary with class equaling socials
  • 00:39:42 and then printed out links we see that
  • 00:39:45 we get the tag that we're kind of
  • 00:39:47 looking for here so we can go ahead and
  • 00:39:51 copy this code in and ultimately
  • 00:39:55 navigate over all those links again, run
  • 00:39:58 that and we get an error string indices
  • 00:40:06 must be integers okay
  • 00:40:09 let's print out links again okay I see
  • 00:40:14 the issue here we did find so this now
  • 00:40:17 is not in a list it's just a single tag
  • 00:40:20 element so what we could do is basically
  • 00:40:23 do links dot find all of the a tag with
  • 00:40:30 get a list just like we had before so
  • 00:40:33 I'm gonna say that this is called u
  • 00:40:34 list, unordered list, and ultimately the
  • 00:40:38 links is going to be equal to u list
  • 00:40:40 dot find all of the a tag so this is
  • 00:40:45 our u list now we're finding all the
  • 00:40:47 links within that print out links cool
  • 00:40:55 we get this and then finally now I think
  • 00:40:58 we can copy this in and ultimately get
  • 00:41:01 our new number two way of grabbing these
  • 00:41:05 links cool we got it so now we just have
  • 00:41:07 to do one more let's go back to the web
  • 00:41:09 page to try to figure out a nice way to
  • 00:41:10 do this just looking at the inspect tool
  • 00:41:13 again and what I see here is that these
  • 00:41:19 have a class tag, the individual list
  • 00:41:21 elements have a class tag of social so
  • 00:41:25 if I did something like let's say links
  • 00:41:32 equals web page dot select of li dot
  • 00:41:40 social within the a tag within that we
  • 00:41:46 should I think get the same links as
  • 00:41:48 before and look at that we do so we just
  • 00:41:52 really grabbed the individual list
  • 00:41:54 elements instead of the entire unordered
  • 00:41:58 list of links and as a result our third
  • 00:42:03 and final way we copy that and we get
  • 00:42:09 everything alright the next exercise is
  • 00:42:12 going to be to scrape the table that is
  • 00:42:14 included on that web page so we go back
  • 00:42:17 to the page and you scroll down I
  • 00:42:19 actually included a table of my MIT
  • 00:42:22 hockey stats very fun stuff figured this
  • 00:42:25 was a simple fairly straightforward
  • 00:42:27 table that'd be fun to scrape I
  • 00:42:29 initially took this table from a site
  • 00:42:31 called elite prospects but I simplified
  • 00:42:34 it a little bit so if we want to scrape
  • 00:42:36 this I think the first thing we should
  • 00:42:38 do and actually feel free to pause the
  • 00:42:40 video before I dive
  • 00:42:41 into it and then resume when you're
  • 00:42:43 ready for the solution alright so I
  • 00:42:45 think the first thing that we want to do
  • 00:42:47 is to inspect this table and just see
  • 00:42:51 what we're working with and to get the
  • 00:42:54 entire table we see that we can grab
  • 00:42:57 this table with class hockey stats so
  • 00:43:00 let's do that in code so a web page or
  • 00:43:04 let's say table equals web page dot
  • 00:43:09 select and we're grabbing the table with
  • 00:43:14 class hockey stats and let's see what we
  • 00:43:19 have for table and as it looks like we
  • 00:43:24 have everything there so I'm seeing and
  • 00:43:28 just because we don't want this in the
  • 00:43:29 list we can just do select and just grab
  • 00:43:32 the first element which is the only
  • 00:43:33 element and then we get that just as the
  • 00:43:36 tag form not the entire table all right
  • 00:43:40 and next it's really a matter of I think
  • 00:43:44 the best way to scrape a table is to load
  • 00:43:47 it into a data frame in pandas so let's
  • 00:43:49 import pandas so import pandas as PD and
  • 00:43:55 now how do we get this table to
  • 00:43:58 actually go into that pandas dataframe
  • 00:44:00 so for something like this you might be
  • 00:44:03 able to do it off the top of your head
  • 00:44:05 but this is something that I would
  • 00:44:06 usually you know do a Google or
  • 00:44:08 Stack Overflow search for so let's do that so
  • 00:44:11 how to scrape a table using
  • 00:44:16 beautifulsoup
  • 00:44:17 we could look up something like that
  • 00:44:19 take the first stack overflow post and
  • 00:44:22 kind of look through it see if it's what
  • 00:44:24 you're looking for so this person's
  • 00:44:26 trying to scrape a table and this person
  • 00:44:30 responded with kind of how you can do
  • 00:44:32 that and the one thing that I kind of
  • 00:44:34 see with this response is that it's
  • 00:44:37 scraping the table but it's it's not
  • 00:44:39 it's you know it's printing it out as a
  • 00:44:40 string it's not putting it into a like
  • 00:44:42 pandas dataframe so what I think we
  • 00:44:44 should actually look up is scrape a
  • 00:44:47 table into pandas dataframe and I'll
  • 00:44:52 also look up beautiful
  • 00:44:54 soup as another keyword in this search
  • 00:44:57 scrape tables into data frame with
  • 00:44:59 beautiful soup that looks good it's got
  • 00:45:02 a decent number of up votes and let's
  • 00:45:04 see what the answer is here try this
  • 00:45:07 that looks pretty straightforward so I'm
  • 00:45:09 gonna copy this code and just utilize it
  • 00:45:12 within our code so if I go ahead and
  • 00:45:16 insert a new code so I'm just going to
  • 00:45:18 paste this kind of as reference up here
  • 00:45:20 and try to mimic the behavior with our
  • 00:45:24 table so first thing is that our columns
  • 00:45:28 will be ultimately this table header
  • 00:45:32 stuff so I think the first thing we
  • 00:45:34 should do is try to grab all of these
  • 00:45:35 table heads so we could do that with
  • 00:45:38 I'll say column columns equal table dot
  • 00:45:47 find all th and maybe just to be careful
  • 00:45:54 let's because we can kind of see the
  • 00:45:56 scope of this let's do first table dot
  • 00:45:59 find table head and then do dot find all
  • 00:46:07 table heads here and let's see what we
  • 00:46:11 have for columns
  • 00:46:14 cool we get a list of what we're
  • 00:46:16 expecting and now if we wanted just the
  • 00:46:18 column names that would be we could do a
  • 00:46:23 list comprehension which would be c
  • 00:46:26 dot string we'll say for c in columns
  • 00:46:32 let's print out col names look at that
  • 00:46:38 so we get all the column names this one
  • 00:46:41 looks a little bit weird so we might
  • 00:46:42 obviously ultimately get rid of that and
  • 00:46:44 also because we have these duplicate so
  • 00:46:48 over here that might cause problems in
  • 00:46:49 pandas but we'll cross that bridge when
  • 00:46:51 we get there we have the column names
  • 00:46:53 next really we need to copy this code
  • 00:46:57 here so table rows how do we get the
  • 00:47:00 table rows well we're going to go into
  • 00:47:03 the table body so rows
  • 00:47:05 are going to be equal to table dot find
  • 00:47:08 we want table body and then we want to
  • 00:47:14 find all table rows and just to look at
  • 00:47:19 the table again just to see how its laid
  • 00:47:21 out I think is helpful
  • 00:47:23 we have table heads here and that's all
  • 00:47:28 within the table head in the table body
  • 00:47:31 you see we have these table rows and
  • 00:47:33 inside of that you have all these table
  • 00:47:36 datas so we're gonna need to find all of
  • 00:47:39 the table rows and then we're going to
  • 00:47:41 basically do a bunch of processing
  • 00:47:43 within the table data of those table
  • 00:47:45 rows so we're going to find all table
  • 00:47:48 rows and then now we can basically copy
  • 00:47:52 this code so let's paste that in here
  • 00:47:56 for tr in table rows so I'll just call
  • 00:48:00 this table row just to mirror the
  • 00:48:01 syntax, dot find all table data,
  • 00:48:07 td dot string for each td, I do dot
  • 00:48:10 string just because that's the most
  • 00:48:12 up-to-date syntax may be actually text
  • 00:48:14 is totally fine too but you can use
  • 00:48:15 either dot text or dot string, the row
  • 00:48:18 equals td dot string, l dot append
  • 00:48:22 row, I'll also include that l that I
  • 00:48:26 didn't include up here so that's the empty
  • 00:48:28 list that we're basically adding all the
  • 00:48:30 row details to so after running that let's
  • 00:48:32 see what happens if we print out L and
  • 00:48:37 look at that it looks pretty good
  • 00:48:41 let's just do l zero, the first row, it
  • 00:48:46 looks pretty good except for the fact
  • 00:48:49 that a bunch of them are just like
  • 00:48:52 newline characters so how do we strip
  • 00:48:55 out newline characters I'd be kind of
  • 00:48:57 another Google search strip out white
  • 00:49:03 space, or white space, and newline
  • 00:49:06 characters Python
  • 00:49:16 so it looks like str dot strip
  • 00:49:21 okay cool it looks like it'll strip any
  • 00:49:25 white space with the dot strip method so
  • 00:49:28 I want to try doing td dot string and
  • 00:49:31 then dot strip as well and then see what
  • 00:49:34 now our whole array looks like, hmm, what
  • 00:49:43 happened there, str of that string dot
  • 00:49:56 strip, what happened, oh look at that
  • 00:50:00 we did it I guess you couldn't call the
  • 00:50:04 strip on just the string but once we
  • 00:50:05 converted it into an actual Python
  • 00:50:07 string object it was a lot more friendly
  • 00:50:10 and that looks like a pretty clean row
  • 00:50:12 so now what we'll do is we'll just merge
  • 00:50:16 this into the data frame so df equals pd
  • 00:50:23 dot data frame of l and the columns now
  • 00:50:26 are going to be the column
  • 00:50:28 names and now let's print out our data
  • 00:50:34 frame df dot head come on
  • 00:50:37 oh that looks good I love it oh no it
  • 00:50:42 looks like some things are missing what
  • 00:50:43 is missing here I guess because some of
  • 00:50:45 this had nested elements the dot string
  • 00:50:47 didn't work we might have to do get text
  • 00:50:51 let's see if that fixes things
  • 00:50:54 oh look at that it looks pretty good
  • 00:50:57 yeah some of those taggers have nested
  • 00:50:59 elements so this string did none but
  • 00:51:02 that looks pretty good
  • 00:51:04 and we could go ahead and do pandas type
  • 00:51:08 stuff on this so I could do like DF team
  • 00:51:14 and we see we get which of those I could
  • 00:51:18 do df dot loc of df where team is, nah,
  • 00:51:27 let's do df where
  • 00:51:33 team is I guess not equal to did not
  • 00:51:39 play what happens if I do that and look
  • 00:51:46 at that we filtered by that and then
  • 00:51:48 maybe we would want to do like dot sum
  • 00:51:52 or something like that just get the
  • 00:51:54 totals, oh, that doesn't look right, but we
  • 00:51:59 scraped the table I'm not going to go
  • 00:52:00 into the details you might have to like
  • 00:52:01 convert some of these columns to
  • 00:52:03 different things it also might be worth
  • 00:52:06 not including these last ones or just
  • 00:52:09 changing them to have slightly different
  • 00:52:13 names because if I
  • 00:52:16 did GP I don't know what would happen
  • 00:52:19 here yeah I guess it gives me both those
  • 00:52:22 columns but if you wanted it kind of
  • 00:52:24 makes things weird because we have
  • 00:52:25 duplicates so you might want to rename
  • 00:52:27 them so you can get each
  • 00:52:29 column name only
  • 00:52:33 corresponding to a single column
  • 00:52:37 in the data frame but that's more of a
  • 00:52:40 pandas question than a beautiful soup
  • 00:52:41 question and just so you guys can give
  • 00:52:45 me a hard time in the comments if we
  • 00:52:48 look at the table again you see how all
  • 00:52:50 of this postseason stuff is empty yeah
  • 00:52:54 unfortunately in my four years of
  • 00:52:56 playing I guess I played five years I
  • 00:52:57 don't know why 2013-2014 is missing but
  • 00:53:01 in my five years of playing I never made
  • 00:53:02 the postseason so yeah a bunch of blank
  • 00:53:05 spots in that that side of the table but
  • 00:53:08 yeah that's scraping the table alright
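The table-scraping steps just walked through, condensed into one runnable sketch. The markup is a trimmed stand-in for the hockey-stats table (fewer columns and rows, invented values), and get_text(strip=True) stands in for the string/strip/get_text dance in the transcript:

```python
from bs4 import BeautifulSoup
import pandas as pd

# A trimmed stand-in for the hockey-stats table on the example page.
html = """
<table class="hockey-stats">
  <thead><tr><th>S</th><th>Team</th><th>GP</th></tr></thead>
  <tbody>
    <tr><td>2014-15</td><td>
      MIT
    </td><td>17</td></tr>
    <tr><td>2015-16</td><td>MIT</td><td>9</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.select("table.hockey-stats")[0]

# Column names come from the th tags in the table head.
col_names = [th.get_text(strip=True) for th in table.thead.find_all("th")]

# One list per table row; get_text handles nested tags, and strip=True
# drops the stray newlines and whitespace around cell text.
rows = []
for tr in table.tbody.find_all("tr"):
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

df = pd.DataFrame(rows, columns=col_names)
print(df)
```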
  • 00:53:11 next exercise we're gonna grab all the
  • 00:53:13 fun facts that have the word is in them so
  • 00:53:18 going back to page we got up near the
  • 00:53:22 top we have these fun facts let's read
  • 00:53:25 through them you know I owned my dream
  • 00:53:27 car in high school, I know, kind of a
  • 00:53:30 baller if you click on this footer
  • 00:53:32 though you get some details about that
  • 00:53:33 and, well, this might not be
  • 00:53:36 everyone's idea of a dream car because
  • 00:53:40 it was actually a minivan but it was an
  • 00:53:43 awesome minivan middle name is Ronald
  • 00:53:46 very fun never had been on a plane until
  • 00:53:49 college the first time I was ever on a
  • 00:53:50 plane was for my freshman year at MIT
  • 00:53:54 's lacrosse trip and I was given a
  • 00:53:58 fun haircut before that trip next fun
  • 00:54:02 fact Dunkin Donuts better than Starbucks
  • 00:54:05 very very important you need I need
  • 00:54:08 everyone to know that you got a support
  • 00:54:10 Duncans some other things so we're
  • 00:54:13 grabbing all of the fun facts here that
  • 00:54:15 have the word is in it let's do that
  • 00:54:19 so we have webpage I think we're gonna
  • 00:54:22 have to do is find fun facts we see it's
  • 00:54:26 a class this is gonna be very similar to
  • 00:54:28 the social media links so let's grab the
  • 00:54:30 unordered list with class fun facts I'm
  • 00:54:37 gonna say facts equals webpage dot
  • 00:54:39 select ul dot fun facts and then I'm
  • 00:54:46 gonna grab all the list elements from
  • 00:54:49 that and that should give me something
  • 00:54:51 pretty good look at that we got all the
  • 00:54:54 list elements here now we just need to
  • 00:54:57 figure it out of those list elements so
  • 00:55:00 I'm going to do find all that contain
  • 00:55:04 the string equaling now we're gonna have
  • 00:55:10 to import re again, it should be already
  • 00:55:13 imported from before but in case it isn't
  • 00:55:16 you can import re again and do re
  • 00:55:21 compile is and let's see what happens
  • 00:55:27 when I do facts with is equals that, result set object
  • 00:55:37 has no attribute find all so we're gonna
  • 00:55:40 make this list comprehension so we did a
  • 00:55:42 fact dot find all for,
  • 00:55:47 er, fact dot find, we don't need find
  • 00:55:50 all because there's only a
  • 00:55:51 single string in these, for fact in
  • 00:55:54 facts, it worked, now let's see what
  • 00:55:58 happens if we print out facts with is
  • 00:56:03 with some nones, cool, this looks pretty good, I
  • 00:56:08 think it looks like only the first and
  • 00:56:10 third didn't have is in it and that just
  • 00:56:14 confirmed that that is right
  • 00:56:16 so first doesn't have is, third doesn't
  • 00:56:19 have is, all the other ones have is as
  • 00:56:22 we can see here so the last step of this
  • 00:56:25 would be you know maybe just getting rid
  • 00:56:27 of the nones so you can just do like
  • 00:56:34 another list comprehension if you wanted
  • 00:56:36 to do I'll just say facts with is equals
  • 00:56:41 fact for fact in facts with is, if not
  • 00:56:48 fact, er, if fact, because none is a falsy
  • 00:56:55 value so if they are none this
  • 00:56:58 wouldn't be true so let's see what now
  • 00:57:00 happens look at that I think we got it
  • 00:57:06 so note that we're really close except
  • 00:57:09 for the fact that these had some like
  • 00:57:12 italicized elements in it and right now
  • 00:57:15 the way we're doing this it's stripping
  • 00:57:18 out the rest of this text so what we're
  • 00:57:23 going to need to do is in the string
  • 00:57:25 element here that we're grabbing we can
  • 00:57:27 do fact dot find parent and just get the
  • 00:57:32 element that's directly above it so
  • 00:57:35 ultimately if we run this we see now we
  • 00:57:37 get everything in it and we could even
  • 00:57:41 then go ahead and do dot get text on the
  • 00:57:44 find parent and that should just give us
  • 00:57:47 what we're looking for for the fun facts
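The whole fun-facts exercise, including the find_parent nuance at the end, sketched with stand-in markup (the list items here are invented, and one nests an italic tag the way the real page does):

```python
import re
from bs4 import BeautifulSoup

# Stand-in fun-facts list; one item has a nested <i> tag like the real page.
html = """
<ul class="fun-facts">
  <li>Owned my dream car in high school</li>
  <li>Middle name is <i>Ronald</i></li>
  <li>Never had been on a plane until college</li>
  <li>Dunkin Donuts coffee is better than Starbucks</li>
</ul>
"""
webpage = BeautifulSoup(html, "html.parser")

facts = webpage.select("ul.fun-facts li")
# find(string=...) returns the matching string itself (or None), so climb
# back up with find_parent() to keep the whole fact, nested tags and all.
facts_with_is = [fact.find(string=re.compile("is")) for fact in facts]
facts_with_is = [fact.find_parent().get_text() for fact in facts_with_is if fact]
print(facts_with_is)
```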
  • 00:57:50 look at that so that was actually fairly
  • 00:57:53 tricky with this nuance here at the end
  • 00:57:56 so this was kind of a fun little
  • 00:57:58 exercise alright next exercise is how
  • 00:58:01 can we go to this web page
  • 00:58:03 and download one of these images so we
  • 00:58:06 have you know the image of me and then
  • 00:58:08 we have some little images of Italy
  • 00:58:11 that I took last year when I made a trip
  • 00:58:13 there so this is like Como this is
  • 00:58:16 Florence and this is a sunset over
  • 00:58:20 Riomaggiore, ah, I can't say it, I'm gonna
  • 00:58:24 botch anything that I say here but it's
  • 00:58:26 in the Cinque Terre, Cinque
  • 00:58:30 Terre, man
  • 00:58:32 Italians watching this video are gonna
  • 00:58:34 be real pissed but I had a great time
  • 00:58:38 at all these places but let's try to
  • 00:58:41 download one of these images using web
  • 00:58:43 scraping and some other library so
  • 00:58:45 that's the task I try to do that pause
  • 00:58:48 the video and then resume it when you're
  • 00:58:51 ready alright because I am using Google
  • 00:58:56 collab right now instead of running this
  • 00:58:59 code here in my Google collab notebook
  • 00:59:01 I'm gonna actually use a local sublime
  • 00:59:03 text file to do this downloading so the
  • 00:59:10 start code here is just really getting
  • 00:59:11 that same webpage as before just now I
  • 00:59:16 wanted to do it again because this is
  • 00:59:19 sublime text but as you can see I ran
  • 00:59:23 this code all the stuff that was there
  • 00:59:25 before is still available but now let's
  • 00:59:27 go ahead and and grab an image and
  • 00:59:30 ultimately get the source for an image
  • 00:59:32 so that we can download the image so if
  • 00:59:38 I inspect these pictures we see we have
  • 00:59:42 images slash Italy slash Lake Como so
  • 00:59:45 this is a local path so really we need
  • 00:59:49 to get our current path which is the
  • 00:59:53 webpage that we're looking at and then
  • 00:59:55 add this on to download the image so
  • 00:59:57 that's good to know and this is inside
  • 01:00:00 of a div with class row and a div with
  • 01:00:04 class column so I could probably do something
  • 01:00:07 like we're going to do webpage dot
  • 01:00:12 select div dot
  • 01:00:16 row div dot column and then we want
  • 01:00:20 images within that so let's see what
  • 01:00:22 happens if we print that out and look we
  • 01:00:26 get just the images that we're looking
  • 01:00:27 for now we need to basically append on
  • 01:00:31 this to our URL so I'm going to pull out
  • 01:00:35 our URL and say URL is equal to
  • 01:00:39 just this directory this is kind of our
  • 01:00:41 base path so now if we change things up
  • 01:00:48 a bit this would be equal to a URL plus
  • 01:00:52 web page dot HTML and now what I want to
  • 01:00:58 do is for any of these images so I think
  • 01:01:01 for simplicity sake we'll just grab the
  • 01:01:04 image of Lake Como so we have I'm going
  • 01:01:07 to say our images are equal to web page
  • 01:01:12 let's select we're going to just grab
  • 01:01:15 the first image and we will want to
  • 01:01:20 download that so we need to get the URL
  • 01:01:22 well we'll say image URL equals images 0
  • 01:01:28 then we'll get the source for that and
  • 01:01:31 let's just print out the image URL, oops,
  • 01:01:38 image is not defined, it's images 0, so we
  • 01:01:44 have this that's what we just printed
  • 01:01:46 out okay so we need to append this on to
  • 01:01:53 our so our full URL is now going to be
  • 01:01:56 equal to full URL equals URL plus image
  • 01:02:02 URL and now we just need to download
  • 01:02:04 that well I think this is something that
  • 01:02:08 will be helpful to Google so I'm gonna
  • 01:02:12 just say Python download image using
  • 01:02:20 URL
  • 01:02:22 and then we get a Stack Overflow post
  • 01:02:25 right here save image from URL let's see
  • 01:02:29 what we got a sample code that works for
  • 01:02:31 me on Windows this looks pretty
  • 01:02:33 straightforward I want to see what other
  • 01:02:34 replies there are and this one's even
  • 01:02:37 shorter and I like shorter so we're
  • 01:02:39 gonna try doing this one in our code he
  • 01:02:42 uses requests we already have imported
  • 01:02:50 requests so that's basically just making
  • 01:02:56 another request now we're going to need
  • 01:02:58 to use the full URL because when we
  • 01:03:01 pasted this code in we
  • 01:03:03 didn't have it as we wanted, it needs the full
  • 01:03:05 URL, we're going to get the content so
  • 01:03:07 this is just like getting the webpage
  • 01:03:08 content but now we're getting the image
  • 01:03:10 content on that page with open and then
  • 01:03:14 we can name this whatever we want so I
  • 01:03:16 know that this is going to be Lake Como
  • 01:03:18 we will open that in write binary mode
  • 01:03:22 let's just confirm yeah this is a JPEG
  • 01:03:25 image so we can use this extension
  • 01:03:27 handler dot write image data that should
  • 01:03:32 be good and now that's also we're going
  • 01:03:34 to save it wherever we have this code
  • 01:03:36 locally so I'm gonna run this I think it
  • 01:03:42 ran I'm gonna confirm so over here I
  • 01:03:46 opened up the folder that I had this
  • 01:03:48 download image file in and as you can see I
  • 01:03:51 can open up the image of Lake Como
  • 01:03:53 locally that's pretty cool so we just
  • 01:03:55 downloaded that so we get the full
  • 01:03:56 quality here and Wow looking at this
  • 01:03:59 again this was just such a beautiful
  • 01:04:00 spot I definitely recommend traveling
  • 01:04:03 not only to Italy but checking out Lake
  • 01:04:05 Como, it was so relaxing, so pretty
  • 01:04:09 and then I'm missing this right now
  • 01:04:12 being in the middle of the pandemic but
  • 01:04:16 okay that was downloading an image so
  • 01:04:17 we're done with that exercise all right
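The image-download step we just walked through can be sketched as a small helper; the URL and filename below are placeholders for whatever page you're actually scraping:

```python
import requests

def save_image(url, filename):
    """Download the image at `url` and write its raw bytes to `filename`."""
    img_data = requests.get(url).content   # .content gives raw bytes, not decoded text
    with open(filename, "wb") as handler:  # "wb": write binary, since a JPEG isn't text
        handler.write(img_data)

# Hypothetical usage -- substitute the real full image URL scraped from the page:
# save_image("https://example.com/images/lake_como.jpg", "lake_como.jpg")
```

Note the file lands wherever the script runs from, which is why the downloaded image showed up next to the code.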
  • 01:04:20 final exercise before we conclude this
  • 01:04:22 video it's going to be solving the
  • 01:04:24 mystery challenge so if we go one more
  • 01:04:27 time to this web page and we look at the
  • 01:04:31 bottom there's a bunch of
  • 01:04:33 links here and if you scrape just the
  • 01:04:38 paragraph tags with the ID secret-word
  • 01:04:41 from all of these links and you got to
  • 01:04:44 do this in order each one of these
  • 01:04:46 files is going to have exactly one of
  • 01:04:48 these secret-word IDs and
  • 01:04:52 just to show what the file looks like it
  • 01:04:55 looks like this but if you scrape those
  • 01:04:57 all and just grab the correct ID
  • 01:05:00 you'll ultimately get a fun secret
  • 01:05:05 message alright so how are we going to
  • 01:05:07 do that well let's look at what these
  • 01:05:09 links look like inspect all right again
  • 01:05:14 they're similar to the image just from
  • 01:05:16 the last exercise where it's a relative
  • 01:05:19 path so we can probably utilize some of
  • 01:05:22 that previous code we did to open
  • 01:05:25 those files and we'll have to use
  • 01:05:26 requests again to dive into them and
  • 01:05:30 then if we look at the actual file and
  • 01:05:33 we look at it we see that most of these
  • 01:05:37 have decoy IDs secret word spelled with two c's not
  • 01:05:41 the secret-word we're looking for but one
  • 01:05:43 of them has the specific ID we're
  • 01:05:45 looking for so we're going to scrape for
  • 01:05:47 that alright let's do this so I think
  • 01:05:54 first off let's grab our elements that
  • 01:05:58 we need and see what they are let's just
  • 01:06:01 inspect this so we
  • 01:06:07 have a paragraph we have a div so these
  • 01:06:11 class block divs this looks like what we
  • 01:06:15 need to find and we need to get out the
  • 01:06:18 links from there so what I'm going to do
  • 01:06:21 is a select here I'm going to do web
  • 01:06:24 page or let's say files equals web page
  • 01:06:31 dot select it was a div with
  • 01:06:36 class block so we can just
  • 01:06:39 select those and hopefully nothing else
  • 01:06:40 on the page has those and then we want
  • 01:06:42 to grab the link from
  • 01:06:46 those block divs let's see what now
  • 01:06:51 files gives us look at that looks like
  • 01:06:54 it's all the files we want one through
  • 01:06:56 ten in order
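What we've done so far with select can be sketched like this; the toy markup below just mimics the div.block structure we saw in the inspector, with made-up file names:

```python
from bs4 import BeautifulSoup

# Toy HTML standing in for the real page: each file link lives inside a div.block
html = """
<div class="block"><a href="challenge/file_1.html">File 1</a></div>
<div class="block"><a href="challenge/file_2.html">File 2</a></div>
"""

webpage = BeautifulSoup(html, "html.parser")
files = webpage.select("div.block a")        # CSS select: <a> tags inside div.block
relative_files = [f["href"] for f in files]  # pull out just the relative paths
print(relative_files)  # ['challenge/file_1.html', 'challenge/file_2.html']
```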
  • 01:06:57 so now we want to just get the
  • 01:07:00 relative paths
  • 01:07:03 I guess relative_files we'll say equals
  • 01:07:08 and we're gonna get the href
  • 01:07:12 attribute I think file might be a
  • 01:07:16 special word in Python so I want to just
  • 01:07:20 say f href for each f in files
  • 01:07:28 print out relative_files and remember you
  • 01:07:38 can pause the video if you want at any
  • 01:07:39 point if you don't want to watch me
  • 01:07:41 solve this but look that just gets us
  • 01:07:43 the relative paths and then we need to
  • 01:07:46 kind of from our previous example we
  • 01:07:49 should go ahead and you know kind of
  • 01:07:54 copy some of this code so when I say URL
  • 01:07:57 equals this so our URL equals this then
  • 01:08:06 I say for f in relative
  • 01:08:12 files we want to construct the full URL
  • 01:08:16 so full URL equals URL plus the relative
  • 01:08:22 path so the F here so that would be like
  • 01:08:26 this path this URL plus this that's
  • 01:08:29 going to be our full URL then we're
  • 01:08:31 going to want to load that page so we're
  • 01:08:33 going to do requests dot get the full
  • 01:08:36 URL and then ultimately page equals that
  • 01:08:44 and then we will want to load that into
  • 01:08:47 beautiful soup so
  • 01:08:50 bs_page equals BeautifulSoup passing in
  • 01:08:57 the page and then within the
  • 01:08:59 beautifulsoup page i guess let's just
  • 01:09:01 look at one of these pages so we'll just
  • 01:09:04 do bs_page dot body print that
  • 01:09:08 out maybe prettify it and then break out
  • 01:09:16 of this so this is only gonna run once
  • 01:09:20 TypeError object of type Response has no length is that
  • 01:09:24 an issue oh okay we have to do page
  • 01:09:28 dot content here okay and now we have the
  • 01:09:33 page cool so now in that page it's just
  • 01:09:38 a bunch of paragraph tags so if we just
  • 01:09:41 go ahead and do print bs_page dot find
  • 01:09:53 paragraph or I guess we can do let's do
  • 01:09:57 find we'll do find the paragraph tag we
  • 01:10:01 want to pass in attrs equals ID
  • 01:10:07 secret-word and that should get us for
  • 01:10:13 that file the secret word so let's run
  • 01:10:16 this again because it looked like the
  • 01:10:19 page was loaded in properly look at that
  • 01:10:23 'Make' okay so that's the secret word for
  • 01:10:27 the first file within the tag so if we
  • 01:10:37 wanted just the string so
  • 01:10:41 secret_word_element equals bs_page dot
  • 01:10:47 find that and then if we wanted just the
  • 01:10:53 secret word that's going to be secret
  • 01:10:57 word
  • 01:10:58 element dot string then let's print
  • 01:11:03 secret_word just make sure it works for
  • 01:11:05 one file 'Make'
  • 01:11:09 cool so now we're gonna remove this
  • 01:11:11 break and we'll see what it prints out
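The full loop we've built up can be sketched end to end like this; the base URL is a placeholder, and I'm assuming the correct id is spelled secret-word while the decoys use a misspelled id, as we saw in the inspector:

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/"  # placeholder for the real site's base URL

def secret_word(html):
    """Return the string inside the <p> whose id is exactly 'secret-word'."""
    bs_page = BeautifulSoup(html, "html.parser")
    return bs_page.find("p", attrs={"id": "secret-word"}).string

def solve(relative_files):
    words = []
    for f in relative_files:
        full_url = BASE_URL + f            # relative path -> full URL
        page = requests.get(full_url)      # load the linked file
        words.append(secret_word(page.content))
    return " ".join(words)                 # the secret message, in order
```

The one-page check we just did is the `secret_word` step; `solve` simply repeats it over every relative path.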
  • 01:11:15 so now it's gonna iterate over all the
  • 01:11:18 relative file paths add it to the URL to
  • 01:11:22 get our full URL and ultimately
  • 01:11:24 hopefully this will give us our secret
  • 01:11:26 message and we can be done with the
  • 01:11:28 video let's run it what is it gonna say
  • 01:11:30 oh wow look at that look at this secret
  • 01:11:36 message make sure to smash that like
  • 01:11:38 button and subscribe that's all we're
  • 01:11:41 gonna do in this video everyone
  • 01:11:42 hopefully enjoyed this hopefully likes
  • 01:11:44 learning kind of a little bit about what
  • 01:11:47 web scraping is then you learned about
  • 01:11:50 the building blocks and then we did a
  • 01:11:51 bunch of exercises to really drill down
  • 01:11:53 those skills if you did enjoy this video
  • 01:11:55 yeah it would mean a lot to me if you smash
  • 01:11:57 those like buttons and subscribe also
  • 01:12:00 feel free to check me out on the other
  • 01:12:01 socials Instagram and Twitter
  • 01:12:04 I do appreciate when people follow me
  • 01:12:07 there and I think it's a good way for me
  • 01:12:09 to kind of show my personality a bit on
  • 01:12:11 those other platforms so I post some
  • 01:12:13 cool stuff in those places do I have
  • 01:12:16 anything else yeah I guess the only
  • 01:12:18 other thing I want to mention is that
  • 01:12:19 I'm going to try to do some follow-up
  • 01:12:20 more kind of complex examples of web
  • 01:12:23 scraping in the future maybe like a real
  • 01:12:26 world web scraping video and I also want
  • 01:12:31 to dive into not just beautiful soup but
  • 01:12:32 I would like to look at selenium and
  • 01:12:36 Scrapy so I might do that in future
  • 01:12:38 videos too but feel free to let me know
  • 01:12:40 in the comments if there's anything else
  • 01:12:42 that you'd like to see alright once
  • 01:12:44 again that's all we're doing in this
  • 01:12:46 video thank you everyone for watching
  • 01:12:48 this has been a fun one for me peace out
  • 01:12:55 [Music]