How to Scrape Multiple Values

Hi there - I am trying to use the scrape component to pull a tag from my local council website (garbage collection) and have this successfully pulling the first date. I think I have enough understanding to pull the second tag, but my question is:

Q: How do you run the scrape platform and pull multiple tags into different variables? And ideally do this in one go, i.e. hit the website once and pull multiple tags, as opposed to running a separate scrape for each variable.

# Add a web scraper to get garbage collection days from Auckland Council website
  - platform: scrape
    resource: 'https://www.aucklandcouncil.govt.nz/rubbish-recycling/rubbish-recycling-collections/Pages/collection-day-detail.aspx?an=12341658037'
    select: ".m-r-1"
    name: "Garbage Collection Next"
    value_template: > 
        {% set strtext = value + " 2020" %}
        {{ strptime(strtext.split(' ', 1)[1], '%d %B %Y') }}
    unit_of_measurement: date
    scan_interval: 86400

Any solution for this??
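As far as I can tell, the scrape platform runs one request per sensor, but outside of it the idea is straightforward: fetch the page once, then run several selectors against the same parsed tree. A minimal sketch - the HTML and the `.m-r-2` class are made up for illustration (only `.m-r-1` comes from the config above):

```python
from bs4 import BeautifulSoup

# Stand-in for the page fetched once with requests.get(url).text
html = """
<div class="m-r-1">Friday 24 July</div>
<div class="m-r-2">Wednesday 29 July</div>
"""

# One parse of the page...
soup = BeautifulSoup(html, "html.parser")

# ...then as many selects as you like against the same soup,
# each feeding a different variable/sensor.
rubbish = soup.select_one(".m-r-1").get_text(strip=True)
recycling = soup.select_one(".m-r-2").get_text(strip=True)
```

The AppDaemon example further down this thread does exactly this against a real council site.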

Hello,
I think I am trying the same thing. Did you find a solution?

I have a select like this: ".exampleclass p"
but this only scrapes the first paragraph tag, and I would like to have all of them listed or concatenated.

<div class="exampleclass">
  <p>1</p>
  <p>2</p>
  ...etc
</div>

Only the first one is appearing, even though the Beautiful Soup documentation linked from the scrape docs says select returns all the tags that fulfil the select condition:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
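For what it's worth, `select()` really does return every match; it seems to be the scrape platform that only keeps the first element of the list. In your own script you can join them - a small sketch using the `.exampleclass` markup above:

```python
from bs4 import BeautifulSoup

html = '<div class="exampleclass"><p>1</p><p>2</p><p>3</p></div>'
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every matching tag...
matches = soup.select(".exampleclass p")   # [<p>1</p>, <p>2</p>, <p>3</p>]

# ...so to get them all into one value, concatenate them yourself:
combined = ", ".join(p.get_text() for p in matches)  # "1, 2, 3"
```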

I resorted to AppDaemon to get my bin sensors in:

import hassapi as hass
from bs4 import BeautifulSoup
import datetime
from dateutil import parser
import requests


class BinCollection(hass.Hass):
    """
    Looks up the LDC website for the specified uprn (address) and
    returns a dictionary of {'$bin_colour': '$date'} scraped from
    the webpage. Likely to break if the site design changes at all.
    """

    def initialize(self):
        """Sets up the uprn and the first run"""

        self.log("Bin collection app running")
        self.uprn = '[MY_UPRN]'
        self.run_in(self.refresh, 5)

    def refresh(self, kwargs):
        """Looks up the data, sets next run at 2am tomorrow"""

        url = "https://www.lichfielddc.gov.uk/homepage/6/bin-collection-dates" \
              f"?uprn={self.uprn}"
        page = requests.get(url)

        if page.status_code != 200:
            return None

        soup = BeautifulSoup(page.text, 'html.parser')

        bh3s = soup.find_all('h3', class_="bin-collection-tasks__heading")
        bpds = soup.find_all('p', class_="bin-collection-tasks__date")

        for heading, date_p in zip(bh3s, bpds):
            bin_colour = heading.contents[1].split(' ')[0].lower()
            bin_date = parser.parse(date_p.contents[0]).strftime('%Y-%m-%d')
            self.log(f"{bin_date}: {bin_colour} bin")
            self.set_state(f"sensor.{bin_colour}_bin",
                           state=bin_date,
                           attributes={'device_class': 'timestamp'})

        tomorrow = datetime.datetime.today() + datetime.timedelta(days=1)
        next_run = tomorrow.replace(hour=2, minute=0, second=0, microsecond=0)

        self.log(f"Scheduling next run for {next_run.isoformat()}")

        self.run_at(self.refresh, next_run)

Thank you for the information! I think I will try your solution too.
I have tried the custom component, but it behaves the same as the original one. Its advantage is that it gets all the values in a single HTTP request instead of one request per sensor.
I was looking into improving the select option so that one sensor could hold a list of values.
One good workaround is using Chrome DevTools' "Copy selector" tool:


At least, with this method, I can select a parent div that contains every tag. Then, because of the 255-character limit on a single sensor state, truncate it with the following template or a similar one:

value_template: "{{ value | truncate(255) }}"

I've moved to using Python scripts because of the 255-character limit (with the added bonus of not having to restart HA to test a change). Below is what I followed to get it up and running, except that I output the result as JSON. I run the Python script in a command_line sensor and use json_attributes to push all the data into attributes; that gets around the 255 limit, and I don't need multiple template sensors either.

Here's a scrape I set up yesterday to monitor when Giant Food's COVID signup page changes.

from bs4 import BeautifulSoup
import requests
import json

# Change these two things:
URL = "https://covidinfo.reportsonline.com/covidinfo/GiantFood.html"
# This is the select line you will use in the config
SELECT = "#divApptTypeInfo0 > h2 > span > span"
# You may need to use a template after the fact...

r = requests.get(URL)
data = r.text
# (print data here to see what the request even returns)
soup = BeautifulSoup(data, 'lxml')

# See what the select returns
val = soup.select(SELECT)
value = str(val)

# Convert to JSON for the command_line sensor to pick up
data = {
    'output': {
        'giant_covid': value
    }
}
print(json.dumps(data))
And the matching sensor configuration:

sensor:
  - platform: command_line
    name: Covid Giant Scrape
    command: "python /config/python_scripts/covid_giant.py"
    value_template: "Covid Giant Scrape"
    json_attributes: 
      - output
    scan_interval: 300
    command_timeout: 30
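To get the scraped text back out of the attributes, a template sensor can read them with state_attr - a sketch only, with the entity id and attribute names assumed from the example config above:

```yaml
# Hypothetical template sensor reading the command_line sensor's
# 'output' attribute (entity name derived from "Covid Giant Scrape")
template:
  - sensor:
      - name: "Giant Covid Text"
        state: "{{ state_attr('sensor.covid_giant_scrape', 'output')['giant_covid'] | truncate(255) }}"
```

The state itself still has to fit in 255 characters, hence the truncate; the full value stays available in the attribute.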

Works really well!
Could you please explain how to clean the final data? Mine is full of HTML tags.

In this example it doesn't need cleaning, as @popboxgun is just looking for changes, not content, and probably has an automation triggered by that sensor updating. You could adapt the script to use Beautiful Soup filtering (as in my example) to pull out the bits you need and return them in the JSON as attributes.

Well, I have found that adding the following line to the Python script cleans it well, so you can also do it that way:

jstr = BeautifulSoup(jstr, "lxml").text
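To spell out what that one-liner does: it re-parses the whole JSON string as if it were HTML and keeps only the text, stripping any tags embedded in the value. A small sketch with a made-up sample string (using the stdlib html.parser instead of lxml, same effect):

```python
from bs4 import BeautifulSoup

# Stand-in for the json.dumps() output with HTML tags inside the value
raw = '{"output": {"giant_covid": "[<span>Appointments full</span>]"}}'

# Parsing the string as markup and taking .text drops the tags,
# leaving the surrounding JSON punctuation intact.
clean = BeautifulSoup(raw, "html.parser").text
```

One caveat: if the stripped text ever contained characters that break JSON quoting, this could produce an invalid JSON string, so cleaning `value` before the json.dumps() call is arguably the safer spot.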

Regards,

Thanks for the formatting, I’m sure I’ll need it in the future