Scrape sensor improved - scraping multiple values

I can get a scrape working with the regular scrape, but want to be able to get multiple values.

The config below works with scrape, but nothing I've tried works when converting it to multiscrape.

sensor:
  - platform: scrape
    resource: http://10.1.1.241/status.html
    name: webdata_today
    authentication: basic
    username: admin
    password: admin
    select: "script"
    index: 1
    value_template: "{{ (value.split(';')[6])|replace('var webdata_today_e = ','')|replace('\"', '')|float }}"
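
For anyone following the template above, here is a rough sketch of what it does step by step, written in plain Python rather than Jinja. The sample text is a shortened copy of the second script block from the page_soup.txt log posted further down in this thread; the function name extract_by_index is made up for illustration.

```python
# Shortened copy of the second <script> block from the device status page.
SCRIPT_TEXT = '''
var webdata_sn = "133AF121A1900430";
var webdata_msvn = "0050";
var webdata_ssvn = "002D";
var webdata_pv_type = "00AF";
var webdata_rate_p = "";
var webdata_now_p = "5040";
var webdata_today_e = "9.80";
var webdata_total_e = "156.0";
'''


def extract_by_index(script_text: str, index: int, name: str) -> float:
    """Mirror the value_template: split on ';', strip the declaration, cast."""
    # Each JavaScript statement ends with ';', so item N is the Nth statement.
    statement = script_text.split(";")[index]
    # Remove the 'var name = ' prefix and the surrounding double quotes.
    cleaned = statement.replace(f"var {name} = ", "").replace('"', "")
    # float() ignores the leading newline left over from the split.
    return float(cleaned)


today = extract_by_index(SCRIPT_TEXT, 6, "webdata_today_e")
print(today)
```

The index only works as long as the device keeps the variables in this order, so it is worth re-checking after a firmware update.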

This gives me rubbish (pun intended :yum:), hope it helps:

multiscrape:
  - resource: "https://www.aucklandcouncil.govt.nz/rubbish-recycling/rubbish-recycling-collections/Pages/collection-day-detail.aspx?an=12340870253"
    name: Auckland Council
    scan_interval: 30
    log_response: true
    sensor:
      - select: ".card-block > div:nth-child(2) > span:nth-child(2) > span:nth-child(1)"
        name: NZrubbish
        unique_id: nzrubbish

Could you enable log_response and post the content of page_soup.txt?

I ended up getting 3 separate scrapes to work, though multiscrape would still be better.

I can’t remember which multiscrape attempt this came from, but I have this log.

<html><body><p><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

</p>
<style type="text/css">
.in_body
{
	margin-top:0px;
	margin-left:0px;
	margin-right:0px;
	margin-bottom:0px;
	background-color:transparent;
}
.div_c
{
	margin-left:50px;
	margin-right:50px;
	margin-top:50px;
	margin-bottom:50px;
}
.cu
{
	cursor:pointer;
}
.b
{
	font-weight:bold;
}
.lab_5
{
	font-size:16px;
	color:#666666;
	margin-left:-20px;
}
.lab_l2
{
	float:left;
	width:32%;
	color:#666666;
	margin-bottom:-2px;
	font-size:14px;
}
.lab_r2
{
	float:left;
	width:68%;
	color:#666666;
	text-align:right;
	font-size:14px;
}
.cl
{
	clear:left;
}
.line
{
	height:1px;
	background-color:#666666;
	width:100%;
	margin-top:5px;
	margin-bottom:5px;
}
.sp_5
{
	height:5px;
	width:500px;
}
.sp_20
{
	height:20px;
	width:500px;
}
.label
{
	float:left;
	width:50%;
	color:#666666;
	margin-bottom:-2px;
	font-size:14px;
}
.lab_r
{
	float:left;
	width:50%;
	color:#666666;
	text-align:right;
	font-size:14px;
}
.lab_l
{
	float:left;
	width:40%;
	color:#666666;
	margin-bottom:-2px;
	margin-left:10%;
	font-size:14px;
}
.line_l
{
	height:1px;
	background-color:#666666;
	width:450px;
	margin-top:5px;
	margin-bottom:5px;
	margin-left:50px;
}
.sub
{
    display:inline-block;
    width:16px;
    text-align:center;
}
</style>
<script type="text/javascript">
var height=0;function fileText(id,value){if(document.getElementById(id)){document.getElementById(id).innerHTML=value}}function changeFont(){reCon("main_div").style.fontFamily=window.parent.reFont()}function child_getH(){var nh=document.body.offsetHeight+100;if(nh<500||nh==null){nh=500}if(height!=nh){height=nh;window.parent.child_height(height)}}function reCon(id){return document.getElementById(id)}function ready(){try{window.parent.show_ifr()}catch(e){}child_getH()}function show(v){var c=document.getElementById(v);if(c!=null){c.style.display=""}}function hide(v){var c=document.getElementById(v);if(c!=null){c.style.display="none"}};
</script>
<script type="text/javascript">
var webdata_sn = "133AF121A1900430";
var webdata_msvn = "0050";
var webdata_ssvn = "002D";
var webdata_pv_type = "00AF";
var webdata_rate_p = "";
var webdata_now_p = "5040";
var webdata_today_e = "9.80";
var webdata_total_e = "156.0";
var webdata_alarm = "";
var webdata_utime = "0";
var cover_mid = "4077258446";
var cover_ver = "MW3_15_0501_1.18";
var cover_wmode = "STA";
var cover_ap_ssid = "AP_4077258446";
var cover_ap_ip = "10.10.100.254";
var cover_ap_mac = "2C:9C:6E:5F:9F:BA";
var cover_sta_ssid = "NickAndie";
var cover_sta_rssi = "80%";
var cover_sta_ip = "10.1.1.241";
var cover_sta_mac = "28:9C:6E:5F:9F:BA";
var status_a = "1";
var status_b = "0";
var status_c = "0";

function initPageText(){var list=window.parent.reList("status");fileText("st1",list["t1"]);fileText("st2",list["t2"]);fileText("st3",list["t3"]);for(var i=1;i<=27;i++){if(i!=14){fileText("tx"+i,list[i])}}changeFont();child_getH()}function upfold(v){if(document.getElementById("up_"+v+"_div").style.display=="none"){show("up_"+v+"_div");reCon("p_"+v).innerHTML="-"}else{hide("up_"+v+"_div");reCon("p_"+v).innerHTML="+"}}function init_main_page(){var on=window.parent.reTip("1");var off=window.parent.reTip("2");document.getElementById("cover_mid").innerHTML=cover_mid;document.getElementById("cover_ver").innerHTML=cover_ver;document.getElementById("cover_ap_status").innerHTML=off;document.getElementById("cover_sta_status").innerHTML=off;if(cover_wmode!="STA"){document.getElementById("cover_ap_status").innerHTML=on;document.getElementById("cover_ap_ssid").innerHTML=cover_ap_ssid;document.getElementById("cover_ap_ip").innerHTML=cover_ap_ip;document.getElementById("cover_ap_mac").innerHTML=cover_ap_mac}if(cover_wmode!="AP"){document.getElementById("cover_sta_status").innerHTML=on;document.getElementById("cover_sta_ssid").innerHTML=cover_sta_ssid;document.getElementById("cover_sta_rssi").innerHTML=cover_sta_rssi;document.getElementById("cover_sta_ip").innerHTML=cover_sta_ip;document.getElementById("cover_sta_mac").innerHTML=cover_sta_mac}if(webdata_sn==""){webdata_sn="---"}fileText("webdata_sn",webdata_sn);if(webdata_msvn==""){webdata_msvn="---"}fileText("webdata_msvn",webdata_msvn);if(webdata_ssvn==""){webdata_ssvn="---"}fileText("webdata_ssvn",webdata_ssvn);if(webdata_pv_type==""){webdata_pv_type="---"}fileText("webdata_pv_type",webdata_pv_type);if(webdata_rate_p==""){webdata_rate_p="---"}fileText("webdata_rate_p",webdata_rate_p+" W");if(webdata_now_p==""||webdata_now_p==0){webdata_now_p="---"}fileText("webdata_now_p",webdata_now_p+" W");if(webdata_today_e==""){webdata_today_e="---"}fileText("webdata_today_e",webdata_today_e+" 
kWh");if(webdata_total_e==""){webdata_total_e="---"}fileText("webdata_total_e",webdata_total_e+" kWh");if(webdata_alarm==""){webdata_alarm="---"}fileText("webdata_alarm",webdata_alarm);if(webdata_utime==""){if(document.getElementById("webdata_sn").innerHTML=="---"){webdata_utime="---"}else{webdata_utime=value+window.parent.reTip("5")}}fileText("webdata_utime",webdata_utime);var st_en=window.parent.reTip("3");var st_dis=window.parent.reTip("4");var st_un=window.parent.reTip("41");if(status_a=="1"){document.getElementById("cover_remote_status_a").innerHTML=st_en}else{if(status_a=="0"){document.getElementById("cover_remote_status_a").innerHTML=st_dis}else{document.getElementById("cover_remote_status_a").innerHTML=st_un}}if(status_b=="1"){document.getElementById("cover_remote_status_b").innerHTML=st_en}else{if(status_b=="0"){document.getElementById("cover_remote_status_b").innerHTML=st_dis}else{document.getElementById("cover_remote_status_b").innerHTML=st_un}}};

</script>
<div class="div_c" id="main_div">
<div class="lab_5 cu b" onclick="upfold(1);child_getH();"><span class="sub" id="p_1">-</span><span id="st1" style="margin-left:3px"></span></div>
<div class="sp_5"></div>
<div id="up_1_div">
<div class="lab_l2" id="tx1"></div>
<div class="lab_r2" id="webdata_sn"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx2"></div>
<div class="lab_r2" id="webdata_msvn"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx3"></div>
<div class="lab_r2" id="webdata_ssvn"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx4"></div>
<div class="lab_r2" id="webdata_pv_type"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx5"></div>
<div class="lab_r2" id="webdata_rate_p"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx6" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_now_p" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx7" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_today_e" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx8" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_total_e" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx9" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_alarm" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx10" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_utime" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
</div>
<div class="sp_20"></div>
<div class="lab_5 cu b" onclick="upfold(2);child_getH();"><span class="sub" id="p_2">+</span><span id="st2" style="margin-left:3px"></span></div>
<div class="sp_5"></div>
<div id="up_2_div" style="display:none">
<div class="label" id="tx11"></div>
<div class="lab_r" id="cover_mid"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="label" id="tx12"></div>
<div class="lab_r" id="cover_ver"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="label" id="tx13"></div>
<div class="lab_r" id="cover_ap_status" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l" id="ap_ssid">SSID</div>
<div class="lab_r" id="cover_ap_ssid"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="lab_l" id="tx15"></div>
<div class="lab_r" id="cover_ap_ip"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="lab_l" id="tx16"></div>
<div class="lab_r" id="cover_ap_mac"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="label" id="tx17"></div>
<div class="lab_r" id="cover_sta_status" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l" id="tx18"></div>
<div class="lab_r" id="cover_sta_ssid"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="lab_l" id="tx19"></div>
<div class="lab_r" id="cover_sta_rssi"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="lab_l" id="tx20"></div>
<div class="lab_r" id="cover_sta_ip"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="lab_l" id="tx21"></div>
<div class="lab_r" id="cover_sta_mac"></div>
<div class="cl"></div>
<div class="line_l"></div>
</div>
<div class="sp_20"></div>
<div class="lab_5 cu b" onclick="upfold(3);child_getH();"><span class="sub" id="p_3">+</span><span id="st3" style="margin-left:3px"></span></div>
<div class="sp_5"></div>
<div id="up_3_div" style="display:none">
<div class="label" id="tx25"></div>
<div class="lab_r" id="cover_remote_status_a"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="label" id="tx26"></div>
<div class="lab_r" id="cover_remote_status_b"></div>
<div class="cl"></div>
<div class="line"></div>
</div>
</div>
<script type="text/javascript">
	    initPageText();
	    ready();
	</script>
</body></html>

Thanks for the reply @danieldotnl, very much appreciated. I’ve added your code but I’m a bit confused with the - select section as I just end up with a sensor called Rubbish that seemingly does nothing. I was ideally looking to extract the date of next collection for both Rubbish and Recycle. Is that possible?

Ah I see. It’s a bit tricky as CSS doesn’t support parent/sibling selectors. You could use select_list to get all the <span> elements and then try to split and make sense of it in a value_template?

  - resource: "https://www.aucklandcouncil.govt.nz/rubbish-recycling/rubbish-recycling-collections/Pages/collection-day-detail.aspx?an=12340870253"
    name: Auckland Council
    scan_interval: 30
    log_response: true
    sensor:
      - select_list: ".card-block > div:nth-child(2) > span"
        name: Next collection
        unique_id: next

I can’t properly test it from a file, but loading this in firefox and copying the css selector gives me this: body > script:nth-child(4). Did you try this?

Super helpful, thanks. That’s produced the output below, so now I’m searching the forums for a way to split it that I can try in the template editor.

{{ states("sensor.next").split()[4] }}

produces:

July,Rubbish,Recycle

but, I don’t really know what I’m doing as I’ve never used split before.

Oh, that did it, thank you!

Have included my working code below if anyone tries the same.

 - resource: http://10.1.1.241/status.html
   authentication: basic
   username: admin
   password: admin
   scan_interval: 30
   sensor:
     - name: Current Solar Generation
       select: "body > script:nth-child(4)"
       value_template: "{{ (value.split(';')[5])|replace('var webdata_now_p = ','')|replace('\"', '')|float }}"
       unit_of_measurement: "W"
       device_class: "power"
       state_class: "measurement"
     - name: Today Solar Generation
       select: "body > script:nth-child(4)"
       value_template: "{{ (value.split(';')[6])|replace('var webdata_today_e = ','')|replace('\"', '')|float }}"
       unit_of_measurement: "kWh"
       device_class: "energy"
       state_class: "measurement"
     - name: Total Solar Generation
       select: "body > script:nth-child(4)"
       value_template: "{{ (value.split(';')[7])|replace('var webdata_total_e = ','')|replace('\"', '')|float }}"
       unit_of_measurement: "kWh"
       device_class: "energy"
       state_class: "total_increasing"
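
The config above picks values by position (items 5, 6 and 7 after splitting on ';'), which depends on the order of the variables in the script block. As a possible alternative, here is a minimal sketch in plain Python that looks the values up by name instead. This is not a multiscrape option, just an illustration of the idea; parse_js_vars and to_float are hypothetical helper names, and the sample text is copied from the log earlier in the thread.

```python
import re

# Matches statements of the form: var some_name = "some value";
VAR_PATTERN = re.compile(r'var\s+(\w+)\s*=\s*"([^"]*)"\s*;')


def parse_js_vars(script_text: str) -> dict[str, str]:
    """Return a {name: value} mapping for every quoted 'var' assignment."""
    return dict(VAR_PATTERN.findall(script_text))


def to_float(raw: str, default: float = 0.0) -> float:
    """Convert a value to float, falling back to a default for empty strings."""
    try:
        return float(raw)
    except ValueError:
        return default


SAMPLE = '''
var webdata_rate_p = "";
var webdata_now_p = "5040";
var webdata_today_e = "9.80";
var webdata_total_e = "156.0";
'''

values = parse_js_vars(SAMPLE)
print(to_float(values["webdata_now_p"]))
print(to_float(values["webdata_rate_p"]))  # empty on this device, so the default is used
```

Inside Home Assistant, the regex_findall_index filter that appears later in this thread might allow a similar name-based match in a value_template, but that is untested here.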

Cracked it.

{{ states("sensor.next").split(",")[0] }}, {{ states("sensor.next").split(",")[1] }}, {{ states("sensor.next").split(",")[2] }}
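
A small sketch of why split() and split(",") behave so differently here, in plain Python (Jinja templates call the same string method). The state string below is hypothetical: it is only one example of a comma-joined value that would reproduce the "July,Rubbish,Recycle" result seen earlier, not the real sensor state.

```python
# Hypothetical sensor state: list items joined with commas, spaces inside the dates.
state = "Monday 5 July,Rubbish,Monday 12 July,Rubbish,Recycle"

# With no argument, split() breaks on whitespace, so commas stay inside the pieces.
by_whitespace = state.split()

# With ",", each scraped element comes back as its own list item.
by_comma = state.split(",")

print(by_whitespace[4])
print(by_comma[0], "|", by_comma[1], "|", by_comma[2])
```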

Your example of using attributes helped me a lot. I experimented with your scenario and learned a lot more. The Jinja documentation linked from the HA Developer Tools page explains why each index reference returns a single character: the list filter converts the value into a list, and if the value was a string, the returned list is a list of characters.

For some reason your attributes’ value_template is being overridden and returning a string rather than a list created by the split method. Possible bug?

I figured out a workaround. Remove your attributes’ value_template and let it be returned as a csv string. Then you can use a template with the split function to convert it to a list. By the way, Jinja to_json didn’t work.

Here’s the template I used to verify that this worked.

{{ '========= RAW ==============' }}
{{
state_attr('sensor.inpo_praha_temperature', 'forecast_time')
}}
{{ '========= SPLIT TO LIST ==============' }}
{% set items = 
state_attr('sensor.inpo_praha_temperature', 'forecast_time').split(",")
%}
{{ items[0] }}
{{ items[1] }}
{{ items[2] }}
{{ items[3] }}
{{ items | length }}
{{ '==========================' }}

Output:

========= RAW ==============
06:00,07:00,08:00,09:00,10:00,11:00,12:00,13:00,14:00,15:00,16:00,17:00,18:00,19:00,20:00,21:00,22:00,23:00,00:00,01:00,02:00,03:00,04:00,05:00,06:00
========= SPLIT TO LIST ==============

06:00
07:00
08:00
09:00
25
==========================
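
The difference between indexing the raw string and indexing the split list can be reproduced in plain Python, which is a rough stand-in for what the template engine does with these values. The string below is a shortened copy of the RAW output above.

```python
raw = "06:00,07:00,08:00,09:00"

# Indexing a string returns single characters.
first_char = raw[0]

# Converting a string to a list also yields characters, as the Jinja docs describe.
as_chars = list(raw)

# Splitting on the comma yields the individual time values.
items = raw.split(",")

print(first_char)
print(as_chars[:5])
print(items[0], len(items))
```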

Something else interesting I learned is that HA templates can call Python string methods on template variables using dotted notation, in addition to the piped Jinja filters. That will come in handy for me.
Example Templates:

{{ 'raw: ' + states('sun.sun') }}
{{ 'Jinja filter: ' + states('sun.sun') | upper}}
{{ 'Python method: ' + states('sun.sun').upper() }}

Output:

raw: below_horizon
Jinja filter: BELOW_HORIZON
Python method: BELOW_HORIZON

I’m trying to scrape a number of weather values from http://www.prestwoodweather.co.uk. I’m able to scrape the temperature, but can’t get other values. It’s just standard HTML, but despite checking it numerous times I just can’t see what I might be doing wrong. Here’s my config:

multiscrape:
  - resource: http://www.prestwoodweather.co.uk
    scan_interval: 300
    sensor:
      - unique_id: outside_temperature
        name: Outside Temperature
        select: "table > tr:nth-child(3) > td:nth-child(2) > font > strong > small > font"
        value_template: '{{ value | regex_findall_index(find="\d+\.\d+", index=0, ignorecase=true) | float}}'
        device_class: temperature
        unit_of_measurement: "°C"
      - unique_id: outside_humidity
        name: Outside Humidity
        select: "table > tr:nth-child(6) > td:nth-child(2) > font > strong > small > font"

Any ideas?

The HTML is broken. Several <td> are not part of a <tr>. That must be the problem. And I can’t exactly explain it, but this works:

select: "table > td:nth-of-type(4) > font > strong > small > font"
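
If you want to check a saved page for this kind of breakage yourself, here is a minimal sketch using Python's built-in html.parser. It only flags td tags that open while no tr is open, and it assumes every row has a closing tag and that tables are not nested, so treat the result as a hint rather than a full validator. The class name and sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class OrphanCellFinder(HTMLParser):
    """Collect the positions of <td> tags that appear outside any <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.in_row = False
        self.orphans: list[tuple[int, int]] = []  # (line, column) of each orphan

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.in_row = True
        elif tag == "td" and not self.in_row:
            self.orphans.append(self.getpos())

    def handle_endtag(self, tag):
        # Leaving a row (or the whole table) means later cells are orphans.
        if tag in ("tr", "table"):
            self.in_row = False


def find_orphan_cells(html: str) -> list[tuple[int, int]]:
    finder = OrphanCellFinder()
    finder.feed(html)
    finder.close()
    return finder.orphans


SAMPLE = """<table>
<tr><td>Temperature</td><td>12.3</td></tr>
<td>Wind Chill</td><td>10.1</td>
<tr><td>Humidity</td><td>80</td></tr>
</table>"""

orphans = find_orphan_cells(SAMPLE)
print(len(orphans), [line for line, _ in orphans])
```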

So I’m having an interesting issue… I am trying to scrape my gas tank level data from my supplier. I currently use selenium and node red, which works but seems like overkill. The issue is that the supplier website is a bit “glitchy” and seems to need the login form to be filled in and submitted twice before it logs in (I see this in a normal browser too, and currently handle it in the selenium setup).

I’d like to move to this component, but I don’t think it can currently run a sequence of actions (i.e. two form submits, then a scrape)?

Looking at the docs, I think it just supports a single login form then scrape, is that right?

Not sure if it helps, but if a request fails for some reason, the form is always going to be submitted again in the next run.
Alternatively, you could set submit_once to false, which will submit the form on every run.

@danieldotnl I don’t know what witchcraft you used to work that out but it worked great!

By creating 10 of each select format I’ve found most of the values, but I’m not able to find the following:

  • Wind Chill
  • Wind - Anemometer Status - OK
  • Today’s Rain (Since 00H)
  • Rain Rate

Any chance of seeing how the first of those looks and I’m hoping that from that I’ll find the rest?

Here’s how I do it:

  • enable log_response in multiscrape config
  • copy the page_soup.txt from the logging (/config/multiscrape/yourconfigname) to local computer
  • rename the extension to .html
  • load in firefox (we now have exactly what the scrape library beautifulsoup parsed)
  • start playing around with the selectors in the firefox console (F12) (this step is just trial and error)
  • copy the end result in the config and test

In the console you can test selectors like this:
$$('table tr:nth-of-type(7) > td:nth-child(2) > font > strong > font > small')
and it will immediately show you the result.

So this works for wind chill:

select: "tr:nth-of-type(6) > td:nth-child(2) > font > strong > font > small"

Good luck with the other fields :wink:


That sounds good. I had been using Chrome but I’ll give Firefox a go.

One bit I’m confused about - when you say to copy the page_soup.txt, what is that? I assume you don’t mean to simply save the web page. Is this a Firefox feature?

I updated my previous answer, hope that makes it more clear.

PS: I’m sure Chrome has similar functionality, I just happen to use Firefox.

Perfect - thanks again!