I’m trying to scrape a number of weather values from http://www.prestwoodweather.co.uk. I can scrape the temperature, but can’t get any of the other values. It’s just standard HTML, but despite checking it numerous times I can’t see what I’m doing wrong. Here’s my config:
multiscrape:
  - resource: http://www.prestwoodweather.co.uk
    scan_interval: 300
    sensor:
      - unique_id: outside_temperature
        name: Outside Temperature
        select: "table > tr:nth-child(3) > td:nth-child(2) > font > strong > small > font"
        value_template: '{{ value | regex_findall_index(find="\d+\.\d+", index=0, ignorecase=true) | float }}'
        device_class: temperature
        unit_of_measurement: "°C"
      - unique_id: outside_humidity
        name: Outside Humidity
        select: "table > tr:nth-child(6) > td:nth-child(2) > font > strong > small > font"
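For reference, the humidity sensor would presumably need a value_template and unit as well, along the same lines as the temperature sensor. Something like this; the regex is an assumption based on the cell containing a value like “87%”:

```yaml
# Hypothetical completion of the humidity sensor, following the same
# pattern as the temperature sensor above. The regex assumes the cell
# text contains a plain integer percentage, e.g. "87%".
- unique_id: outside_humidity
  name: Outside Humidity
  select: "table > tr:nth-child(6) > td:nth-child(2) > font > strong > small > font"
  value_template: '{{ value | regex_findall_index(find="\d+", index=0) | float }}'
  device_class: humidity
  unit_of_measurement: "%"
```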
So I’m having an interesting issue. I’m trying to scrape my gas tank level data from my supplier’s website. I currently use Selenium and Node-RED, which works but seems like overkill. The issue is that the supplier’s site is a bit “glitchy” and seems to need the login form filled in and submitted twice before it logs in (I see this in a normal browser too, and handle it in my current Selenium setup).
I’d like to move to this component, but I don’t think I can currently get a sequence of actions (i.e. two form submits, then scrape)?
Looking at the docs, I think it just supports a single login-form submit and then the scrape. Is that right?
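For what it’s worth, the sequence I need is simple enough outside the integration. A minimal Python sketch (the URLs and field names are placeholders, and `session` would be e.g. a `requests.Session`):

```python
# Sketch of the sequence needed: submit the login form twice, then
# fetch the page to scrape. URLs and form fields are hypothetical.
LOGIN_URL = "https://supplier.example/login"   # placeholder
DATA_URL = "https://supplier.example/account"  # placeholder

def login_and_fetch(session, credentials):
    # The site only authenticates after the second identical submit,
    # so post the form twice on purpose.
    for _ in range(2):
        session.post(LOGIN_URL, data=credentials)
    return session.get(DATA_URL).text
```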
Not sure if it helps, but if a request fails for some reason, the form will always be submitted again on the next run.
Alternatively, you could set submit_once to false, which will submit the form on every run.
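Roughly like this; check the multiscrape docs for the exact option names, as the selector and input names here are only examples:

```yaml
# Example form_submit configuration (field names and URLs are
# illustrative, not taken from the supplier's site).
multiscrape:
  - resource: https://supplier.example/account   # page to scrape
    form_submit:
      resource: https://supplier.example/login   # page containing the form
      select: "#login_form"
      input:
        username: your_username
        password: your_password
      submit_once: false   # re-submit the form on every run
```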
1. Copy page_soup.txt from the logging folder (/config/multiscrape/yourconfigname) to your local computer.
2. Rename the extension to .html.
3. Load it in Firefox (you now have exactly what the BeautifulSoup scrape library parsed).
4. Play around with the selectors in the Firefox console (F12); this step is just trial and error.
5. Copy the end result into the config and test.
In the console you can test selectors like this: $$('table tr:nth-of-type(7) > td:nth-child(2) > font > strong > font > small')
and it will immediately show you the result.
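If you’d rather test without a browser, you can run the same check with BeautifulSoup itself, which is what multiscrape parses the page with. This is just a sketch; the inline HTML stands in for the content of your saved page_soup file:

```python
# Test a CSS selector offline against the same parse the integration
# sees. The inline HTML is a stand-in for the saved page_soup content.
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>Outside Temp</td>
      <td><font><strong><font><small>21.4 C</small></font></strong></font></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
matches = soup.select("tr:nth-of-type(1) > td:nth-child(2) > font > strong > font > small")
print(matches[0].get_text())  # the raw text the sensor would see
```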
So this works for wind chill:
select: "tr:nth-of-type(6) > td:nth-child(2) > font > strong > font > small"
That sounds good. I had been using Chrome, but I’ll give Firefox a go.
One bit I’m confused about: when you say to copy the page_soup.txt, what is that? I assume you don’t mean simply saving the web page. Is it a Firefox feature?
Thanks, but unfortunately it didn’t seem to help; the login still failed on the second run.
I tried Chrome in incognito mode and attempted to log in to the site manually, and sure enough the first attempt fails, then a second login works. If I log out of the site and back in again, it only takes one attempt, but if I close the browser and start a new incognito session I have to log in twice again.
Does the multiscrape sensor retain the same ‘session’ between runs? My only thought is that if it doesn’t, every run is effectively that failing first attempt.
Yes, the sessions are managed and retained by Home Assistant. They are even shared between, e.g., the multiscrape and REST integrations.
I’m not sure how I can help you further, unless you are willing to share your credentials via PM.
Thanks Daniel; unfortunately I can’t really share the account details, as it has all of my payment information etc. saved within it.
I have been digging into it a little this morning, though. I turned on debug logging and log_response and noticed that some fields were not named as I expected. I have set these but still had no luck logging in. I also tried the Chrome dev tools, filling in my username and password and then running:
form = document.getElementById("login_form")
form.submit()
And I get the same behaviour: no matter how many times I fill in the login details and submit, it does not log in, yet if I click the sign-in button instead of calling form.submit(), it logs in OK…
If you check the traffic in the browser when you submit the form, you’ll see that a lot of extra data is sent. Some of it seems to be dynamically generated; I don’t know what is and is not relevant. (That may also be why a bare form.submit() behaves differently: calling it programmatically doesn’t fire the form’s submit event, so any script that adds extra fields on submission would never run.)
Amazing, thank you! I’ve added everything except ajax_hash and hc_timing (as it looks like those values may change), and it’s logging in now.
Can I ask where you see that extra info in the browser’s dev tools? I’ve looked but couldn’t find it.
My values still aren’t scraping into the sensors, but I have opened page_soup.txt and can see they are present. I even double-checked that the element’s selector matches by changing the extension to .html and opening it in Chrome; when I right-click the items I’m interested in, the element selector matches.
I also tried $$('#flogas-content > div:nth-child(2) > div > div > section:nth-child(2) > div > div.activity-blocks__secondary > p:nth-child(2)') in the dev console (with the page_soup open) and it finds it OK.
I just found (after restarting HA) that the ajax_hash and hc_timing options are actually needed. I had previously added everything and then removed them to see if it still worked, which misled me a bit, as it did work until HA was restarted.
Can you point me to where you saw those values? I get the feeling they might change at some point, so it would be good to know where to check.
I’m using Firefox. You can find it in the developer tools on the Network tab: open it first, clear everything, then submit the form. Select the POST request and check the ‘Request’ tab on the right.
Amazing, thanks so much for the help; that was the tip I needed. I have installed Firefox and can see the request data.
I also have the scrape working now. Out of interest, I used the Firefox dev tools to get the CSS selector and it gave a totally different one to Chrome’s, and the Firefox one works perfectly with the multiscrape component!
But the CSS selector from Chrome or Firefox doesn’t work. Does scrape support tables, nth-child, etc.?
CSS selector from Chrome: table.table:nth-child(9) > tbody:nth-child(2) > tr:nth-child(2) > td:nth-child(2)
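A likely culprit is the tbody. Browsers insert a tbody element into tables when rendering, so a selector copied from dev tools includes it, but the raw HTML the scraper parses often has no tbody at all. A quick BeautifulSoup sketch of the difference, using made-up table content:

```python
# Browsers auto-insert <tbody> when rendering, but BeautifulSoup
# parses the raw HTML as-is, so a dev-tools selector containing
# tbody can match nothing. Dropping the tbody step usually fixes it.
from bs4 import BeautifulSoup

raw = "<table class='table'><tr><td>a</td></tr><tr><td>42.0</td><td>b</td></tr></table>"
soup = BeautifulSoup(raw, "html.parser")

print(soup.select("table.table > tbody > tr:nth-child(2) > td:nth-child(1)"))  # []
print(soup.select("table.table > tr:nth-child(2) > td:nth-child(1)")[0].get_text())  # 42.0
```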