Scrape sensor improved - scraping multiple values

danieldotnl · August 28, 2020, 1:35pm

Good news!! I have now published a beta 2.0 release that features a sensor for each selector in your configuration. Finally you can get rid of all those template sensors in your configuration.
It is now also fully async, which improves the overall performance of Home Assistant.

I’m looking for some testers (select ‘show beta’ in HACS) and will publish a final release after my upcoming holiday if no important issues are reported.

danieldotnl · August 28, 2020, 1:41pm

I’m very interested @drogfild but won’t have time to look into it in the coming 2 weeks.
I’m afraid it will be a major job for you to merge this with my latest changes, as they are almost a full rewrite. Hope you will enjoy those new features though.

Would it be possible for you to prepare a sample configuration for your change that logs in to this forum?
I’d just like to understand a bit more about the (im)possibilities of your new feature.

drogfild · August 28, 2020, 7:24pm

Thanks for your kind words. Now that I have the code working, rewriting it won’t be such a big task. At the same time I have to learn how to use git properly.

The idea of an example that logs in to this forum is a good one!

Galazar · September 3, 2020, 2:47pm

Firstly, awesome work!

Secondly, a quick query: after rebooting I see this event in the HA logs:

Logger: custom_components.multiscrape.sensor
Source: custom_components/multiscrape/sensor.py:245
Integration: multiscrape
First occurred: 15:39:23 (3 occurrences)
Last logged: 15:39:36

Sensor State was unable to extract data from HTML

I’m guessing this is probably the website I’m scraping applying some sort of rate control? - Is there any way to know from the logs if this is the case?

I’ve got 6 scraping sensors set up for one site, where each sensor has 5 selectors, so I wouldn’t be surprised if they weren’t liking that.

Appreciate this is probably more of a question for the original scrape component too

Thanks
Lawrence

Holdestmade · September 3, 2020, 2:55pm

You can add scan_interval to reduce the scrapes (default is 30s)

Also, I think it only does 1 HTTP request per sensor and does the multiple fields separately

This Home Assistant custom component can scrape multiple fields (using CSS selectors) from a single HTTP request (the existing scrape sensor can scrape a single field only).
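
To make the "single HTTP request, multiple fields" idea concrete, here is a rough Python sketch. It is not the component's code: `IdTextCollector`, `scrape_fields` and the sample page are made up for illustration, element ids stand in for real CSS selectors, and nested elements inside a field are not handled. The point is only that the page is fetched and parsed once, and every field is read from that one parse.

```python
from html.parser import HTMLParser


class IdTextCollector(HTMLParser):
    """Collect the text of every element whose id is in wanted_ids."""

    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted_ids = set(wanted_ids)
        self.values = {}
        self._current = None  # id of the wanted element we are inside, if any

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in self.wanted_ids:
            self._current = element_id
            self.values[element_id] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.values[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field.
        self._current = None


def scrape_fields(html, wanted_ids):
    """Parse the already-downloaded page once and return all requested fields."""
    collector = IdTextCollector(wanted_ids)
    collector.feed(html)
    return {key: value.strip() for key, value in collector.values.items()}


# Hypothetical page body, as if it came back from one HTTP request.
sample_page = '<div><span id="price">12.34</span><span id="change">-0.5</span></div>'
print(scrape_fields(sample_page, ["price", "change"]))
```

Fields that are not found are simply missing from the result, which is roughly the situation behind an "unable to extract data" message.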

Galazar · September 3, 2020, 3:14pm

~~Thanks, I already had the scan_interval set to hourly (3600). I think it just craps out after a reboot because it does all of them simultaneously.~~

I might try adjusting the intervals to be sequential instead of asynchronous per scraping sensor - they’ll still conflict and crap out at each reboot, but if they sort themselves out as they refresh at individual intervals (5-10-15 minutes, etc) then that might be fine.

Ignore all that: it turns out the stock website sub-categorises 3 of the 6 pages I’m scraping with a different tag for the same field. No idea why (it seems to be fund related), but that explains the error.

danieldotnl · October 7, 2020, 7:54pm

That’s correct!

swifty · October 8, 2020, 8:13am

Thanks @danieldotnl for this component, I’ve been attempting to use the modified version by @drogfild to scrape tank telemetry data from the LPG supplier, but I’m having real issues getting it past the login page.

The site I’m trying to scrape from is https://my.flogas.co.uk/ - I have a valid login, but using

    preloginform: 'login_form'
    username_field: 'email_address'
    password_field: 'password'

doesn’t appear to work… I enabled the commented out debug lines in sensor.py and it seems that it just returns the same login page after ‘prelogin’ finishes…
Am I missing something?

Roemer · October 8, 2020, 8:26am

Great component, thanks for that!
I am having trouble using a dynamic select. I want to use something like:

    select: >
      {% set wn = now().isocalendar()[1] %}
      {% set wd = now().weekday() %}
      tr[data-weeknumber="{{ wn }}"][data-dayofweek="{{ wd }}"] td:nth-child(2)

But I guess because the select is only a string, it does not support templates. Do you know if something like that is possible or could be made possible?
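
For reference, this is roughly what I would expect the template to render to, written as plain Python instead of Jinja. `build_selector` is just an illustrative name and not part of the component. Note that `weekday()` counts Monday as 0, so this assumes the page numbers its `data-dayofweek` attribute the same way; otherwise an offset is needed.

```python
from datetime import date


def build_selector(day: date) -> str:
    """Return the CSS selector the template would produce for the given day."""
    week_number = day.isocalendar()[1]  # ISO week number
    day_of_week = day.weekday()         # Monday = 0 ... Sunday = 6
    return (
        f'tr[data-weeknumber="{week_number}"]'
        f'[data-dayofweek="{day_of_week}"] td:nth-child(2)'
    )


print(build_selector(date(2020, 10, 8)))
# → tr[data-weeknumber="41"][data-dayofweek="3"] td:nth-child(2)
```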

danieldotnl · October 8, 2020, 3:00pm

Support for templates would be a nice feature! I’ll try to find time to look into it.
I also still want to publish the latest pre-release as an official release.

drogfild · October 9, 2020, 5:59pm

Hi all! Sorry for not being able to update my code lately. I’ll try to find time for this project as well.

@swifty Your config looks valid to me. Can you test whether it is possible to log in to that page in your browser with JavaScript disabled? Would I be able to create an account on that site?

danieldotnl · October 10, 2020, 7:18am

I just released the latest pre-release as an official release and created a new pre-release which supports templates in the select! @Roemer: Please give it a try!

swifty · October 12, 2020, 7:20am

Sorry for the late reply, it was a busy weekend!
I think the site probably needs JavaScript, but I got it going in the end.
I use Node-RED for all my automations, so I used a Selenium Docker container controlled by Node-RED to scrape the information I needed from the site.

Roemer · October 15, 2020, 9:13am

Wow, the template seems to work great! Now my only issue is that tags are removed and just replaced by spaces, so I cannot format the text nicely.

erik7 · November 19, 2020, 3:33pm

The latest Multiscrape component is working fine for me. It logs in (username/password) and scrapes multiple values, which are available as attributes on the sensor.

My question is: is it possible to get the value of ‘totaal:’ as the entity state value? (The current entity state is none.)

    selectors:
      levering_dag:
        name: Levering dag
        select: "div.col-lg-4:nth-child(4) > div:nth-child(1) > div:nth-child(2) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(2) > a:nth-child(1)"
        unit_of_measurement: "kWh"
      levering_nacht:
        name: Levering nacht
        select: "div.col-lg-4:nth-child(4) > div:nth-child(1) > div:nth-child(2) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(2) > td:nth-child(2) > a:nth-child(1)"
        unit_of_measurement: "kWh"
      totaal:
        name: Totaal (levering – teruglevering)
        select: "div.col-lg-4:nth-child(4) > div:nth-child(1) > div:nth-child(2) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(5) > td:nth-child(2) > a:nth-child(1)"
        unit_of_measurement: "kWh"

majkers · December 17, 2020, 7:43am

@drogfild how can I point to the login form, since it does not have a name? My page looks like this:
https://ebok.mpwik.lublin.pl/login

drogfild · December 20, 2020, 7:44pm

Hi @majkers! Good question, and thanks for giving an example. After @BrianHanifin’s commit, the “prelogin” script also matches the form on its name, id, class or action attribute. So in your case you should be able to use its class

      preloginform: 'form-horizontal'

or even its action

      preloginform: '/login'

Unfortunately my script isn’t updated with the latest multiscrape improvements and is based on quite an old version of it. I haven’t got my head around async requests yet.
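
To illustrate the matching described above, here is a minimal standalone sketch in Python. It is not the code from my fork: `FormFinder`, `find_form` and the sample HTML are hypothetical, and it only shows the idea of comparing the `preloginform` value against a form's name, id, class or action.

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Remember the attributes of the first <form> that matches `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.match = None

    def handle_starttag(self, tag, attrs):
        if tag != "form" or self.match is not None:
            return
        attributes = dict(attrs)
        # A class attribute can hold several names, so compare each one.
        classes = (attributes.get("class") or "").split()
        candidates = [
            attributes.get("name"),
            attributes.get("id"),
            attributes.get("action"),
            *classes,
        ]
        if self.wanted in candidates:
            self.match = attributes


def find_form(html, wanted):
    """Return the matching form's attributes, or None if nothing matches."""
    finder = FormFinder(wanted)
    finder.feed(html)
    return finder.match


# Made-up login page fragment: a form with no name, only class and action.
sample_page = '<form class="form-horizontal" action="/login" method="post"></form>'
print(find_form(sample_page, "form-horizontal"))
print(find_form(sample_page, "/login"))
print(find_form(sample_page, "login_form"))
```

The first two lookups find the form through its class and its action; the last one finds nothing because no attribute carries that value.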

Adrian_Stanciu · December 29, 2020, 1:03am

Hi,
I’m trying to get today’s exchange rates from the national bank’s page, but it seems I don’t have the syntax right.
Once someone can enlighten me, I’d like to get the exchange rates for 4 currencies with a single request. (I tried openexchange, but with the free API the result is only for USD and I want the base currency to be “RON”.)
Thank you

    - platform: scrape
      resource: https://bnr.ro/Cursul-de-schimb-524.aspx
      select: "chf"
      name: leutu
      value_template: '{{ (value | int) / 10 }}'

tony124 · February 13, 2021, 9:59am

I am trying to use the fork by @drogfild https://github.com/drogfild/hass-multiscrape to scrape data from my heat pump (thread https://community.home-assistant.io/t/is-there-any-interest-in-a-stiebel-eltron-climate-platform), but could not get login working. A GET request by curl, a browser, or Python’s requests module returns the expected content of the login page, but this module always gets a 400 Bad Request page. I also tried adding a user-agent to the headers, with no luck.

Any idea what I am missing?

drogfild · February 14, 2021, 7:25pm

That’s weird. Haven’t experienced that error myself. Have you been able to verify if you get that error from the first page load or is it after login attempt?

This most probably doesn’t affect the problem, but I have a beta of a new version of my fork. It’s fairly up to date with the original. You can find it on the dev branch. The config should be identical.