I’m trying to scrape a number of weather values from http://www.prestwoodweather.co.uk. I can scrape the temperature, but can’t get any of the other values. It’s just standard HTML, but despite checking it numerous times I can’t see what I’m doing wrong. Here’s my config:
multiscrape:
  - resource: http://www.prestwoodweather.co.uk
    scan_interval: 300
    sensor:
      - unique_id: outside_temperature
        name: Outside Temperature
        select: "table > tr:nth-child(3) > td:nth-child(2) > font > strong > small > font"
        value_template: '{{ value | regex_findall_index(find="\d+\.\d+", index=0, ignorecase=true) | float }}'
        device_class: temperature
        unit_of_measurement: "°C"
      - unique_id: outside_humidity
        name: Outside Humidity
        select: "table > tr:nth-child(6) > td:nth-child(2) > font > strong > small > font"
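For reference, the humidity sensor would presumably need a value_template and unit as well, along the same lines as the temperature sensor. Something like this; the regex is an assumption based on the cell containing a value like “87%”:

```yaml
# Hypothetical completion of the humidity sensor, following the same
# pattern as the temperature sensor above. The regex assumes the cell
# text contains a plain integer percentage, e.g. "87%".
- unique_id: outside_humidity
  name: Outside Humidity
  select: "table > tr:nth-child(6) > td:nth-child(2) > font > strong > small > font"
  value_template: '{{ value | regex_findall_index(find="\d+", index=0) | float }}'
  device_class: humidity
  unit_of_measurement: "%"
```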
So I’m having an interesting issue. I’m trying to scrape my gas tank level data from my supplier’s website. I currently use Selenium and Node-RED, which works but seems like overkill. The issue is that the supplier’s site is a bit “glitchy” and seems to need the login form filled in and submitted twice before it logs in (I see this in a normal browser too, and handle it in my current Selenium setup).
I’d like to move to this component, but I don’t think I can currently get a sequence of actions (i.e. two form submits, then scrape)?
Looking at the docs, I think it just supports a single login-form submit and then the scrape. Is that right?
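For what it’s worth, the sequence I need is simple enough outside the integration. A minimal Python sketch (the URLs and field names are placeholders, and `session` would be e.g. a `requests.Session`):

```python
# Sketch of the sequence needed: submit the login form twice, then
# fetch the page to scrape. URLs and form fields are hypothetical.
LOGIN_URL = "https://supplier.example/login"   # placeholder
DATA_URL = "https://supplier.example/account"  # placeholder

def login_and_fetch(session, credentials):
    # The site only authenticates after the second identical submit,
    # so post the form twice on purpose.
    for _ in range(2):
        session.post(LOGIN_URL, data=credentials)
    return session.get(DATA_URL).text
```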
Not sure if it helps, but if a request fails for some reason, the form will always be submitted again on the next run.
Alternatively, you could set submit_once to false, which will submit the form on every run.
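Roughly like this; check the multiscrape docs for the exact option names, as the selector and input names here are only examples:

```yaml
# Example form_submit configuration (field names and URLs are
# illustrative, not taken from the supplier's site).
multiscrape:
  - resource: https://supplier.example/account   # page to scrape
    form_submit:
      resource: https://supplier.example/login   # page containing the form
      select: "#login_form"
      input:
        username: your_username
        password: your_password
      submit_once: false   # re-submit the form on every run
```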
1. Copy page_soup.txt from the logging folder (/config/multiscrape/yourconfigname) to your local computer.
2. Rename the extension to .html.
3. Load it in Firefox (you now have exactly what the BeautifulSoup scrape library parsed).
4. Play around with the selectors in the Firefox console (F12); this step is just trial and error.
5. Copy the end result into the config and test.
In the console you can test selectors like this: $$('table tr:nth-of-type(7) > td:nth-child(2) > font > strong > font > small')
and it will immediately show you the result.
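If you’d rather test without a browser, you can run the same check with BeautifulSoup itself, which is what multiscrape parses the page with. This is just a sketch; the inline HTML stands in for the content of your saved page_soup file:

```python
# Test a CSS selector offline against the same parse the integration
# sees. The inline HTML is a stand-in for the saved page_soup content.
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>Outside Temp</td>
      <td><font><strong><font><small>21.4 C</small></font></strong></font></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
matches = soup.select("tr:nth-of-type(1) > td:nth-child(2) > font > strong > font > small")
print(matches[0].get_text())  # the raw text the sensor would see
```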
So this works for wind chill:
select: "tr:nth-of-type(6) > td:nth-child(2) > font > strong > font > small"
That sounds good. I had been using Chrome, but I’ll give Firefox a go.
One bit I’m confused about: when you say to copy the page_soup.txt, what is that? I assume you don’t mean simply saving the web page. Is it a Firefox feature?
Thanks, but unfortunately it didn’t seem to help; the login still failed on the second run.
I tried Chrome in incognito mode and attempted to log in to the site manually, and sure enough the first attempt fails, then a second login works. If I log out of the site and back in again, it only takes one attempt, but if I close the browser and start a new incognito session I have to log in twice again.
Does the multiscrape sensor retain the same ‘session’ between runs? My only thought is that if it doesn’t, every run is effectively that failing first attempt.
Yes, the sessions are managed and retained by Home Assistant. They are even shared between, e.g., the multiscrape and REST integrations.
I’m not sure how I can help you further, unless you are willing to share your credentials via PM.
Thanks Daniel; unfortunately I can’t really share the account details, as it has all of my payment information etc. saved within it.
I have been digging into it a little this morning, though. I turned on debug logging and log_response and noticed that some fields were not named as I expected. I have set these but still had no luck logging in. I also tried the Chrome dev tools, filling in my username and password and then running:
form = document.getElementById("login_form")
form.submit()
And I get the same behaviour: no matter how many times I fill in the login details and submit, it does not log in, yet if I click the sign-in button instead of calling form.submit(), it logs in OK…
If you check the traffic in the browser when you submit the form, you’ll see that a lot of extra data is sent. Some of it seems to be dynamically generated; I don’t know what is and is not relevant. (That may also be why a bare form.submit() behaves differently: calling it programmatically doesn’t fire the form’s submit event, so any script that adds extra fields on submission would never run.)
Amazing, thank you! I’ve added everything except ajax_hash and hc_timing (as it looks like those values may change), and it’s logging in now.
Can I ask where you see that extra info in the browser’s dev tools? I’ve looked but couldn’t find it.
My values still aren’t scraping into the sensors, but I have opened page_soup.txt and can see they are present. I even double-checked that the element’s selector matches by changing the extension to .html and opening it in Chrome; when I right-click the items I’m interested in, the element selector matches.
I also tried $$('#flogas-content > div:nth-child(2) > div > div > section:nth-child(2) > div > div.activity-blocks__secondary > p:nth-child(2)') in the dev console (with the page_soup open) and it finds it OK.
I just found (after restarting HA) that the ajax_hash and hc_timing options are actually needed. I had previously added everything and then removed them to see if it still worked, which misled me a bit, as it did work until HA was restarted.
Can you point me to where you saw those values? I get the feeling they might change at some point, so it would be good to know where to check.
I’m using Firefox. You can find it in the developer tools on the Network tab: open it first, clear everything, then submit the form. Select the POST request and check the ‘Request’ tab on the right.
Amazing, thanks so much for the help; that was the tip I needed. I have installed Firefox and can see the request data.
I also have the scrape working now. Out of interest, I used the Firefox dev tools to get the CSS selector and it gave a totally different one to Chrome’s, and the Firefox one works perfectly with the multiscrape component!
But the CSS selector from Chrome or Firefox doesn’t work. Does scrape support tables, nth-child, etc.?
CSS selector from Chrome: table.table:nth-child(9) > tbody:nth-child(2) > tr:nth-child(2) > td:nth-child(2)
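A likely culprit is the tbody. Browsers insert a tbody element into tables when rendering, so a selector copied from dev tools includes it, but the raw HTML the scraper parses often has no tbody at all. A quick BeautifulSoup sketch of the difference, using made-up table content:

```python
# Browsers auto-insert <tbody> when rendering, but BeautifulSoup
# parses the raw HTML as-is, so a dev-tools selector containing
# tbody can match nothing. Dropping the tbody step usually fixes it.
from bs4 import BeautifulSoup

raw = "<table class='table'><tr><td>a</td></tr><tr><td>42.0</td><td>b</td></tr></table>"
soup = BeautifulSoup(raw, "html.parser")

print(soup.select("table.table > tbody > tr:nth-child(2) > td:nth-child(1)"))  # []
print(soup.select("table.table > tr:nth-child(2) > td:nth-child(1)")[0].get_text())  # 42.0
```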