Scrape sensor improved - scraping multiple values

20210608 - IMPORTANT UPDATE: release 4.0.0 upgraded this component to an integration. You can now also create binary sensors, still based on a single HTTP request.
It does come with a backward-incompatible change in the configuration though. Please check the upgrade notes! I also updated the example config below.

20210520 - IMPORTANT UPDATE: I have created a new repository named ha-multiscrape for this sensor as the original development setup wasn’t ideal. The new sensor (which is backward compatible) is now available in the default HACS store!
The new Github project can be found here: GitHub - danieldotnl/ha-multiscrape: Home Assistant custom component for scraping multiple values (from a single HTTP request) with a separate sensor for each value.
The old project will not be maintained anymore, so please switch!


I think this didn’t yet exist (or at least I couldn’t find it), so I created an improved scrape component which is able to scrape multiple values with a single HTTP request. The values become available as attributes on the sensor.
You can find the project here: GitHub - danieldotnl/hass-multiscrape: Home Assistant custom component for scraping multiple values (from a single HTTP request) with a separate sensor for each value. Or install it from HACS (add it as a custom repository).

multiscrape:
  - resource: https://www.home-assistant.io
    scan_interval: 3600
    sensor:
      - name: Latest version
        select: ".current-version > h1:nth-child(1)"
        value_template: '{{ (value.split(":")[1]) }}'
      - name: Release date
        select: ".release-date"
    binary_sensor:
      - name: Latest version == 2021.6.0
        select: ".current-version > h1:nth-child(1)"
        value_template: '{{ (value.split(":")[1]) | trim == "2021.6.0" }}'
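For anyone wondering what the value_template filters in the example above actually compute, here is the same logic in plain Python (the sample string is made up for illustration; in reality `value` is whatever text the CSS selector extracted):

```python
# "value" is the text the CSS selector extracted from the page.
# The sample string below is invented for illustration only.
value = "Current Version: 2021.6.0"

# value_template: '{{ (value.split(":")[1]) }}'
latest_version = value.split(":")[1]  # " 2021.6.0" (note the leading space)

# value_template: '{{ (value.split(":")[1]) | trim == "2021.6.0" }}'
# Jinja's "trim" filter corresponds to Python's str.strip().
is_expected = latest_version.strip() == "2021.6.0"

print(latest_version.strip())  # 2021.6.0
print(is_expected)             # True
```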

Nice one! Have you seen anyone try to make a scrape sensor that could handle easy login prompts?


Not sure what you consider ‘easy login prompts’, but it does support HTTP authentication like the RESTful sensor does, and you can send bearer tokens in an authorization header.
See: https://www.home-assistant.io/integrations/rest/
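To make concrete what those two auth options boil down to on the wire (a sketch only, not the component’s actual code), both just set an Authorization header on the request:

```python
import base64

def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header HTTP basic auth sends.

    Sketch only: the integration builds this for you when you configure
    username/password, just like the RESTful sensor.
    """
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

def bearer_auth_header(token: str) -> dict:
    """A bearer token uses the same header, just a different scheme."""
    return {"Authorization": f"Bearer {token}"}

print(basic_auth_header("user", "pass"))
# {'Authorization': 'Basic dXNlcjpwYXNz'}
```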

I have no need for it, and by the time I do, I will probably have forgotten about this and will just set up two separate scrapers.
But nice work!
Would be nice if this could be merged into core.

Nice one, have converted all my multi-scrape sensors.

Would be nice to have multiple sensors instead of attributes though.
EDIT: noticed this is on your roadmap.

What is the scrape interval?

Thanks

You can configure that with scan_interval (https://www.home-assistant.io/docs/configuration/platform_options/). I believe it defaults to 30 seconds.
Not sure though how that will work once I upgrade it to an async component.

OK, I’ve added that, thanks. How often would it scrape once you move to async?
Won’t some websites block you after a lot of scrapes?

With ‘easy login prompts’ I meant logging in to a site that requires a username and password. So it’s somewhat harder but still doable. I have working code, but it’s only for one specific website and I haven’t gotten around to translating it into a HA custom component. I was thinking that maybe this component could be extended (maybe a bit of wishful thinking). Maybe I have to take your code and extend it :smiley:

Looking forward to your pull request! :slight_smile:

Update: I created a new release which now uses the lxml parser (which is better). For backward compatibility, the old ‘html.parser’ parser can still be configured.

Thanks, no more messages in the log for me.

Hah, went and did something about it. It’s very crude and doesn’t have good error handling, but it does log in and scrape values!

What does the config for the login look like?

This is only my current development setup and may change. But I would value any feedback. I’ve never done this before.

Current setup for configuration.yaml in addition to normal multiscrape settings:

    prelogin:
      preloginpage: https://url.of.that.site.com/login.html
      preloginform: 'loginForm'
      username_field: 'username'
      password_field: 'password'
      username: 'yourusername'
      password: 'yourpassword'

There are things that this script cannot find automatically, so some HTML-reading skills are needed on your side. You need to tell it the URL of the login page. This can be the same page you are scraping or a different one. preloginform is the name attribute of the <form> tag in the HTML, so in my example it’s <form name="loginForm">.
username_field and password_field are the names of the <input> elements, so in my example those are <input type="text" name="username"> and <input type="password" name="password">.
username and password are what you expect: the real credentials to be filled into the form and submitted to the site.
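To make the mechanics concrete, here is a rough stdlib-only sketch of what such a prelogin step amounts to (hypothetical code, not the actual component: in reality the code would first download the login page, locate the <form> by its name, and read the field names and the form’s action URL from the HTML). The request is only built here, never sent:

```python
from urllib import parse, request

def build_login_request(login_url: str, username_field: str,
                        password_field: str, username: str,
                        password: str) -> request.Request:
    """Build (but do not send) the POST a form login would submit.

    Hypothetical sketch: the real code first fetches login_url, finds
    the <form name="loginForm"> in the HTML, and takes the input names
    and action URL from there.
    """
    form_data = {
        username_field: username,   # e.g. <input type="text" name="username">
        password_field: password,   # e.g. <input type="password" name="password">
    }
    body = parse.urlencode(form_data).encode()
    return request.Request(login_url, data=body, method="POST")

req = build_login_request("https://url.of.that.site.com/login.html",
                          "username", "password",
                          "yourusername", "yourpassword")
print(req.data.decode())  # username=yourusername&password=yourpassword
```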

Should this be moved to another discussion, as this is not about the current version of hass-multiscrape and it’s not even certain this will ever be part of it? @danieldotnl what do you think?


Are you attempting to log in to HTML forms? I was just about to look into that. I’d like to pull the overnight usage data from my CPAP machine, but it is behind an HTML form.

Yes, that is exactly what my code is doing. I have it working in my dev system now, but I have only tested it with one site so far. Maybe I should just share it for others to test as well.

Yes please. Share your code and I can help test.

The first test version is now available at https://github.com/drogfild/hass-multiscrape/blob/dd32b0eaa6bd1a53181b48c2a4b66db033f25ce8/sensor.py

Just noticed that @danieldotnl has been busy as well and updated his code. So this doesn’t include any of his new improvements (each item as a separate sensor).

Let me know if this worked with your site.


OK, so my login form has neither a name nor an id attribute. But it does have a unique class and a unique action. I just submitted a pull request to your fork which checks all 4 form attributes.

Thanks for your input @BrianHanifin! It’s now merged to my repo.
Next I should update my version to match the original, so we get multiple sensors instead of multiple attributes. Then maybe do a pull request to the original.
