If you want to learn scraping, there is one thing that is very important:
you need to know how pages are built, and you need to be able to read them.
When you open your webpage in Chrome and press [Ctrl]+[Shift]+I, the window splits into two parts:
on the left you have your page, and on the right the way it is built.
The first thing to do is to learn to read that right part.
The webpage is built up in levels.
The first one is html; you see it at the top, and it is like the root directory.
The second one is head, which is like a subdirectory of html.
A subdirectory is called a child, and html is the parent of head.
Below head you see the subdirectory body.
body is a sibling of head and also a child of html.
From there you go down as if it were an ancestry tree.
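That parent/child/sibling idea can be tried out directly with BeautifulSoup on a tiny hand-written page (the html string below is just an example, not the real nfl page):

```python
from bs4 import BeautifulSoup

# a minimal page: html is the root, head and body are its children
html = "<html><head><title>demo</title></head><body><p>hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.head.parent.name)        # html is the parent of head
print(soup.body.parent.name)        # body is also a child of html
print(soup.head.next_sibling.name)  # body is the sibling right after head
```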
When you start scraping, you need to keep looking at your logging until you only see the value you want to save in a sensor.
Let's look at the code:
###########################################################################################
# an app that creates a sensor out of data collected from                                 #
# https://www.nfl.com/standings                                                           #
#                                                                                         #
###########################################################################################
import appdaemon.plugins.hass.hassapi as hass
import datetime
import time
import requests
from bs4 import BeautifulSoup


class standings(hass.Hass):

    def initialize(self):
        #################################################################
        # when initialising, the sensor needs to be created,            #
        # but we need to run the same code again to get the next values #
        # thats why i only start the callback from here                 #
        #################################################################
        self.get_values(None)

    def get_values(self, kwargs):
        #################################################################
        # first we set some values; this could be done in the yaml,     #
        # but this app is specialized and will only work for this       #
        # webpage, so why bother                                        #
        #################################################################
        self.url = "https://www.nfl.com/standings"
        self.sensorname = "sensor.standings"
        self.friendly_name = "AFC North Standings"
        afc_standings = None
        #################################################################
        # now we read the webpage                                       #
        #################################################################
        try:
            response = requests.get(self.url, timeout=10)
        except requests.exceptions.RequestException:
            self.log("i couldnt read the nfl standings page")
            return
        page = response.content
This part you never need to change. What happens here is that you fetch the webpage and save all its content (what you see in Chrome on the right side) in the variable called "page".
Then we use this command:
soup = BeautifulSoup(page, "html.parser")
to get everything from the page into the variable called "soup", but in a way we can work with.
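As a small runnable sketch of those two lines (outside AppDaemon, with a hand-made byte string standing in for response.content):

```python
from bs4 import BeautifulSoup

# stands in for response.content, which is the raw bytes of the page
page = b"<html><body><p>hello world</p></body></html>"

# parse the raw page into a tree we can navigate
soup = BeautifulSoup(page, "html.parser")
print(soup.body.p.string)  # hello world
```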
Now you get to the hard part.
You can use a line like
self.log(soup)
to show the content of the variable in your logs.
In this case it shows everything from the webpage.
In Chrome you find the next level you want to reach.
Of course, all the information you want is inside body, so you can use
self.log(soup.body)
Now you already get a bit less information in your log.
Let's take another look at our Chrome page.
We already knew we needed soup.body.
As you can see in the page, everything from the page is inside a sublevel (child) called div with id="content".
So we change our log to
self.log(soup.body.div)
Inside that div is another div (data-radium="true"),
and inside that another div, which has no attributes.
So we are already at
self.log(soup.body.div.div.div)
Now we get to a difficult level. When we log that, or look in Chrome, we see a few divs as children of the div we have chosen in our log.
If we wanted the first one, we could just use div and go on.
But we want the next one, the one with class="application-shell".
So instead of just div, we use div.next_sibling.
div means we choose the first child named div, but we don't want number one, we want its next_sibling.
So our log now looks like:
self.log(soup.body.div.div.div.div.next_sibling)
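To see what next_sibling does, here is a hypothetical snippet (the class name application-shell is borrowed from the nfl page, the rest is made up):

```python
from bs4 import BeautifulSoup

html = '<div><div id="first">one</div><div class="application-shell">two</div></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.div.div          # .div always picks the FIRST child div
print(first["id"])            # first
wanted = first.next_sibling   # next_sibling steps to the element after it
print(wanted["class"])        # ['application-shell']
```

One thing to watch out for: on real pages there is often whitespace between tags, and next_sibling then lands on that text first; here the tags touch, so it goes straight to the second div.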
Now you go deeper and deeper. Every time you save the app, look at the log and check that the first line of the log is the same as what you see in Chrome, so you know where you are.
Keep going until you see the value you are looking for.
When we are on the level where our info is and we don't need any more divs (or other html elements),
we can use .string to view the text inside that child.
Now we can put everything we had between the ( ) of the self.log into a variable:
my_value = soup.body.div.div.div.div.next_sibling # and everything else that you need to put behind here
and then all you need is set_state.
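Putting it together, the end of get_values might look roughly like this (the html string is a stand-in for the real page content, and the set_state line is shown as a comment because it only works inside the AppDaemon app; the sensor name and friendly name come from the code above):

```python
from bs4 import BeautifulSoup

# stand-in for the real response.content
page = "<html><body><div><div>skip me</div><div>Browns 11-5</div></div></body></html>"
soup = BeautifulSoup(page, "html.parser")

# walk down the tree, exactly like in the log lines
my_value = soup.body.div.div.next_sibling.string
print(my_value)  # Browns 11-5

# inside the app, the last step would be something like:
# self.set_state(self.sensorname, state=my_value,
#                attributes={"friendly_name": self.friendly_name})
```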
I hope this made clear how you need to work to get your data.
Of course there are a lot of other ways.
If the way I explain it is not clear enough, or the way I chose is too difficult for you, you can look at one or more of the many online tutorials.
To find those, google for "scraping with beautifulsoup tutorial" or something like that,
because the library you use is BeautifulSoup.
I wish you success, and I hope to hear soon that you understood it and that it worked.