Scrape sensor improved - scraping multiple values

Super helpful, thanks. That’s produced the output below, so now I’m searching around the forums for a split approach to try in the template editor.

{{ states("sensor.next").split()[4] }}

produces:

July,Rubbish,Recycle

but I don’t really know what I’m doing, as I’ve never used split before.

Oh, that did it, thank you!

I’ve included my working code below in case anyone tries the same.

 - resource: http://10.1.1.241/status.html
   authentication: basic
   username: admin
   password: admin
   scan_interval: 30
   sensor:
     - name: Current Solar Generation
       select: "body > script:nth-child(4)"
       value_template: "{{ (value.split(';')[5])|replace('var webdata_now_p = ','')|replace('\"', '')|float }}"
       unit_of_measurement: "W"
       device_class: "power"
       state_class: "measurement"
     - name: Today Solar Generation
       select: "body > script:nth-child(4)"
       value_template: "{{ (value.split(';')[6])|replace('var webdata_today_e = ','')|replace('\"', '')|float }}"
       unit_of_measurement: "kWh"
       device_class: "Energy"
       state_class: "measurement"
     - name: Total Solar Generation
       select: "body > script:nth-child(4)"
       value_template: "{{ (value.split(';')[7])|replace('var webdata_total_e = ','')|replace('\"', '')|float }}"
       unit_of_measurement: "kWh"
       device_class: "energy"
       state_class: "total_increasing"

Cracked it.

{{ states("sensor.next").split(",")[0] }}, {{ states("sensor.next").split(",")[1] }}, {{ states("sensor.next").split(",")[2] }}

Your example of using attributes helped me a lot. I messed with your scenario and learned a ton more. The Jinja documentation linked from the HA Developer Tools page explains why you’re getting back a single character with each index reference: the list filter converts the value into a list, and if the value was a string, the returned list will be a list of characters.
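A quick template-editor check with a literal string shows the difference: indexing the raw string returns one character, while splitting it first returns a whole item.

{{ "06:00,07:00,08:00"[0] }}
{{ "06:00,07:00,08:00".split(",")[0] }}

The first renders just 0; the second renders 06:00.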

For some reason your attributes’ value_template is being overridden and returning a string rather than a list created by the split method. Possible bug?

I figured out a workaround. Remove your attributes’ value_template and let it be returned as a CSV string. Then you can use a template with the split function to convert it to a list. By the way, Jinja’s to_json didn’t work.

Here’s the template I used to verify that this worked.

{{ '========= RAW ==============' }}
{{
state_attr('sensor.inpo_praha_temperature', 'forecast_time')
}}
{{ '========= SPLIT TO LIST ==============' }}
{% set items = 
state_attr('sensor.inpo_praha_temperature', 'forecast_time').split(",")
%}
{{ items[0] }}
{{ items[1] }}
{{ items[2] }}
{{ items[3] }}
{{ items | length }}
{{ '==========================' }}

Output:

========= RAW ==============
06:00,07:00,08:00,09:00,10:00,11:00,12:00,13:00,14:00,15:00,16:00,17:00,18:00,19:00,20:00,21:00,22:00,23:00,00:00,01:00,02:00,03:00,04:00,05:00,06:00
========= SPLIT TO LIST ==============

06:00
07:00
08:00
09:00
25
==========================

Something else interesting I learned is that HA templates can call Python string methods with dotted notation on template variables, in addition to the piped Jinja filters. That will come in handy for me.
Example Templates:

{{ 'raw: ' + states('sun.sun') }}
{{ 'Jinja filter: ' + states('sun.sun') | upper }}
{{ 'Python method: ' + states('sun.sun').upper() }}

Output:

raw: below_horizon
Jinja filter: BELOW_HORIZON
Python method: BELOW_HORIZON

I’m trying to scrape a number of weather values from http://www.prestwoodweather.co.uk. I’m able to scrape the temperature, but can’t get other values. It’s just standard HTML, but despite checking it numerous times I just can’t see what I might be doing wrong. Here’s my config:

multiscrape:
  - resource: http://www.prestwoodweather.co.uk
    scan_interval: 300
    sensor:
      - unique_id: outside_temperature
        name: Outside Temperature
        select: "table > tr:nth-child(3) > td:nth-child(2) > font > strong > small > font"
        value_template: '{{ value | regex_findall_index(find="\d+\.\d+", index=0, ignorecase=true) | float}}'
        device_class: temperature
        unit_of_measurement: "°C"
      - unique_id: outside_humidity
        name: Outside Humidity
        select: "table > tr:nth-child(6) > td:nth-child(2) > font > strong > small > font"

Any ideas?

The HTML is broken: several <td> elements are not part of a <tr>. That must be the problem. I can’t exactly explain why, but this works:

select: "table > td:nth-of-type(4) > font > strong > small > font"

So I’m having an interesting issue… I am trying to scrape my gas tank level data from my supplier. I currently use Selenium and Node-RED, which works but just seems overkill. The issue is that the supplier website is a bit “glitchy” and seems to need the login form to be filled in and submitted twice before it logs in (I see this in a normal browser too, and handle it in the Selenium setup currently).

I’d like to move to this component, but I don’t think I’d currently be able to get a sequence of actions (i.e. two form submits, then scrape)?

Looking at the docs, I think it just supports a single login form then scrape, is that right?

Not sure if it helps, but if a request fails for some reason, the form is always going to be submitted again in the next run.
Alternatively, you could set submit_once to false, which will submit the form on every run.
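In config terms that is just one extra flag under form_submit. A minimal sketch (the URLs, selector and field names below are placeholders, not your supplier’s real ones):

multiscrape:
  - resource: https://example.com/account
    scan_interval: 300
    form_submit:
      submit_once: false          # re-submit the login form on every scan
      resource: https://example.com/login
      select: "#login_form"       # CSS selector for the login form
      input:
        username: my_username     # field names must match the site's form
        password: my_password
    sensor:
      - name: Example Value
        select: "#some-element"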

@danieldotnl I don’t know what witchcraft you used to work that out but it worked great!

By creating ten variations of each select format I’ve found most of the values, but I’m not able to find the following:

  • Wind Chill
  • Wind - Anemometer Status - OK
  • Today’s Rain (Since 00H)
  • Rain Rate

Any chance of seeing how the selector for the first of those looks? I’m hoping that from that I’ll be able to work out the rest.

Here’s how I do it:

  • enable log_response in the multiscrape config
  • copy the page_soup.txt from the logging folder (/config/multiscrape/yourconfigname) to your local computer
  • rename the extension to .html
  • load it in Firefox (we now have exactly what the scraping library BeautifulSoup parsed)
  • start playing around with the selectors in the Firefox console (F12) (this step is just trial and error)
  • copy the end result into the config and test

In the console you can test selectors like this:
$$('table tr:nth-of-type(7) > td:nth-child(2) > font > strong > font > small')
and it will immediately show you the result.

So this works for wind chill:

select: "tr:nth-of-type(6) > td:nth-child(2) > font > strong > font > small"

Good luck with the other fields :wink:


That sounds good. I had been using Chrome but I’ll give Firefox a go.

One bit I’m confused about - when you say to copy the page_soup.txt, what is that? I assume you don’t mean to simply save the web page. Is this a Firefox feature?

I updated my previous answer, hope that makes it more clear.

PS: I’m sure Chrome has similar functionality, I just happen to use Firefox.

Perfect - thanks again!

Thanks, unfortunately it didn’t seem to help - the login still seemed to fail on the second run.
I tried Chrome in incognito mode and attempted to log in manually to the site, and sure enough the first attempt fails, then a second login works… if I log out of the site and back in again, it only takes one attempt… but close the browser and start a new incognito session and I have to log in twice again.

Does the multiscrape sensor retain the same ‘session’ between runs? My only thought is that, if not, it’s only ever submitting that first attempt :thinking:

Yes, the sessions are managed and retained by Home Assistant. They are even shared between e.g. the multiscrape and rest integrations.
I’m not sure how I can help you further, unless you are willing to share your credentials via PM.

Thanks Daniel, unfortunately I can’t really share the account details as the account has all of my payment information etc. saved in it.

I have, however, been digging into it a little this morning. I turned on debug logging and log_response, and noticed that some fields were not named as I expected - I have set these but still no luck logging in… I also tried using the Chrome dev tools, filling in my username and password and then running:

form = document.getElementById("login_form")
form.submit()

And I get the same behaviour - no matter how many times I fill in the login details and submit, it does not log in, yet if I click the sign-in button instead of using the form submit, it logs in OK…
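For what it’s worth, that difference is consistent with standard browser behaviour: form.submit() does not fire the form’s JavaScript submit handlers, while clicking the button (or calling requestSubmit()) does, so if the site’s own script adds hidden fields or posts the login via AJAX, that step never runs with a plain submit(). A rough console equivalent of the button click, reusing the form element from above:

// requestSubmit() behaves like clicking the submit button: it fires the
// form's submit event handlers, unlike form.submit() which bypasses them
form = document.getElementById("login_form")
form.requestSubmit()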

For clarity, I’m using the following:

multiscrape:
  - resource: 'https://my.flogas.co.uk/account/telemetry'
    scan_interval: 60
    method: 'GET'
    log_response: true
    form_submit:
      submit_once: false
      resource: 'https://my.flogas.co.uk'
      select: "#login_form"
      input:
        email_address: <my_email>
        password: <my_password>
    sensor:
      - select: '#flogas-content > div:nth-child(3) > div > div > section:nth-child(2) > div > div.activity-blocks__secondary > p:nth-child(2)'
        name: LPG_Level_Percent
      - select: '#flogas-content > div:nth-child(3) > div > div > section:nth-child(2) > div > div.activity-blocks__secondary > p:nth-child(3)'
        name: LPG_Level_Litres

Are you able to double-check the login page to confirm I am trying to set the right fields?

If you check the traffic in the browser when you submit the form, you see that a lot of extra data is sent. Some of it seems to be dynamically generated. I don’t know what is and what is not relevant.

{
	"ajax_hash": "1ed3133e89ddf3cca1414d81cec88be967c0921fc1da54dcd96f1d36200ae4579db43545c52610e7125856ed9d67920ae41b6b920919225448326e8e2e88f416",
	"hc_timing": "94409",
	"email_address": "fdsdf",
	"password": "sdfdf",
	"remember_me": "0",
	"redirect": "/account/telemetry",
	"ajax_origin": "login_form",
	"module": "flogas\\portal\\login_form",
	"ajax_act": "do_ajax_submit",
	"rp": "",
	"window_location": "/"
}

You can try to add those values in form_submit/input
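For example, extending the form_submit section from your config with the captured fields (a sketch; ajax_hash and hc_timing are left as placeholders since they look dynamically generated, so hard-coding the captured values may not keep working):

    form_submit:
      submit_once: false
      resource: 'https://my.flogas.co.uk'
      select: "#login_form"
      input:
        email_address: <my_email>
        password: <my_password>
        remember_me: "0"
        redirect: "/account/telemetry"
        ajax_origin: "login_form"
        module: "flogas\\portal\\login_form"
        ajax_act: "do_ajax_submit"
        rp: ""
        window_location: "/"
        # ajax_hash: <captured value>   # appears to be generated per session
        # hc_timing: <captured value>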


Amazing, thank you! I’ve added all but ajax_hash and hc_timing (as it looks like they are values which may change?) and it’s logging in now :smiley:

Can I ask where you see that extra info in the dev tools of the browser? - I’ve looked but couldn’t find it.

My values still aren’t scraping into the sensors, but I have opened the page_soup.txt and can see they are present. I even double-checked that the element selector matches by changing the extension to .html and opening it in Chrome: when I right-click the items I’m interested in, the element selector matches.
I also tried $$('#flogas-content > div:nth-child(2) > div > div > section:nth-child(2) > div > div.activity-blocks__secondary > p:nth-child(2)') in the dev console (with the page_soup open) and it finds it OK :confused:

Could you post the page_soup file? (replacing sensitive data)

Sure, here you go: [page_soup.txt shared via Pastebin]

I just found (after restarting HA) that the ajax_hash and hc_timing values are actually needed - I previously added everything and then removed them to see if it still worked, which misled me a bit as it did keep working until HA was restarted.
Can you point me to where you saw those values? I get the feeling they might change at some point, so it would be good to know where to check.