Scrape sensor improved - scraping multiple values

If someone could help that would be great. Thank you!!

Trying to get the current outdoor temp on this site. I am unable to get this to work.

Website is ambientweather.net/dashboard/kbck

Here is the select:

#root > div > div.page-container > div > div > div > div.device-device-realtime-dashboard > div > div > div.device-widget.square.temp > div.device-temp-widget.center-aligned > div > div.top > span > span.fdp-val

Not sure how to fix. Any ideas?

[lists]

It's still working as it ever did, but I've realised that it behaves differently from attributes that are lists. For example, if you take the OpenWeatherMap weather entity, the forecast attribute is a list:

{{state_attr('weather.openweathermap','forecast')}}

gives you a list like

[
  {
    "datetime": "2022-04-14T11:00:00+00:00",
    "precipitation": 0.14,
    "precipitation_probability": 34,
    "pressure": 1017,
    "wind_speed": 3.94,
    "wind_bearing": 349,
    "condition": "rainy",
    "clouds": 61,
    "temperature": 20.5,
    "templow": 8.1
  },
...

Now if I try to add an index like [1] into the template, I get the whole element in {}.

I am trying to achieve something similar with multiscrape: select a list that I could reference by index in a template. It doesn't seem to work that way, though. When I use the following configuration in multiscrape:

     attributes:
        - name: "Forecast Time"
          select_list: ".day-hour-ext.flex-column .day-hour-time"
          value_template: "{{(value.split(','))|list}}"

The attribute I get is a string, with carriage returns but not a list

{{state_attr('sensor.inpo_praha_temperature','forecast_time')}}

Looks as if it were a list

[
  "18:00",
  "19:00",
  "20:00",
  "21:00",
...
]

If I try to reference an element by an index, e.g. [1], I get a single character rather than the whole element "19:00".
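
Getting a single character back suggests the attribute is stored as one string that only looks like a list. Here is a minimal Python sketch of the same effect; the sample value is an assumption for illustration, not actual multiscrape output:

```python
import json

# Assumed sample: the attribute as it might be stored, i.e. one string
# that only looks like a list.
raw = '["18:00", "19:00", "20:00", "21:00"]'

# Indexing a string returns a single character.
char = raw[1]

# Parsing the string first yields a real list, so the index returns an element.
times = json.loads(raw)
second = times[1]

print(repr(char), repr(second))
```

In a template, the rough equivalent might be piping the attribute through the `from_json` filter before indexing. This only works if the stored string is valid JSON, which I have not verified here.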

Hope that anyone could help me out with this one.

Multiscrape works: I get data from the requested website and the sensor is created. The only problem is that the scraped value includes a lot of spaces, which end up in the entity state.

As a result, when I want to use the entity state in, for example, an if template, I have to copy the exact result from the scraper with all the spaces, not just the text.

Any idea?

Did you try trimming it in the value_template?

Not until now. Implemented this code and it works! Thank you very much.
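
For anyone landing here later, this is a small Python sketch of what trimming does to a padded value. The sample string is made up. In a value_template the Jinja counterpart would be something like `{{ value | trim }}`:

```python
# Made-up sample of a scraped value padded with whitespace and line breaks.
raw = "\n      Open     today   \n  "

# Remove leading and trailing whitespace only (what Jinja's trim filter does).
trimmed = raw.strip()

# Additionally collapse runs of inner whitespace into single spaces.
collapsed = " ".join(raw.split())

print(repr(trimmed))
print(repr(collapsed))
```

Trimming alone is usually enough for comparisons in an if template. Collapsing inner whitespace only matters if the page also pads the text between words.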

Is there a simpler version?

value_template: "{{ value.split(' :')[0].split(': ')[0] }}"

Has anyone tried working with the resource_template option?
What I am trying to use is a dynamic URL with an entity-state in it (dynamic number generated by another part of my HA)

For example:

multiscrape:
  - resource_template: https://domain.com/{{sensor.dynamicnumber}}/subfolder/xyz
    scan_interval: 3600
    sensor:
      - unique_id: testscraper_value
        name: Test Scraper
        select: ".section img"
        attribute: data-src

But I can't get it to work. How do I generate the dynamic URL for the resource?

Please help me out :slight_smile:

Try: https://domain.com/{{ states('sensor.dynamicnumber') }}/subfolder/xyz

Fixed it.
Had to use:

  - resource_template: https://domain.com/{{states.sensor.dynamicnumber.state}}/subfolder/xyz

AND important: lower the scan_interval

Does anyone know if ha-multiscrape supports click actions before a scrape? The resource that I'm trying to pull from unfortunately requires a "Check for new data" button to be clicked before it refreshes the information that I'd like to scrape.

Can anyone help me?

On the page "https://www.bgv-nachwuchshelden.de/projekte/609af01508681a0e70544c73", I would like to extract the value "34".

Preferably with multiscrape or Node-RED.

I've been sitting in front of it for hours and have tried every conceivable possibility... the only thing I get is: empty

Sure, though "34" isn't helpful when the number is changing :roll_eyes: It is at 72 for me ...

Long story short, there is a CAPTCHA preventing scraping, which is why you can't get your result.

(run this)

curl "https://bgvnachwuchshelden2022.projektbigfoot.de/api/v1/projects/609af01508681a0e70544c73"

Response

{..., "voteCount":72, ...}

How I would do it without HA-MultiScrape (sorry, but I haven't used it enough):

You only need a REST Sensor, something like:

sensor:
  - platform: rest
    resource: https://bgvnachwuchshelden2022.projektbigfoot.de/api/v1/projects/609af01508681a0e70544c73
    name: Stimmen Count
    value_template: "{{ value_json.voteCount }}"

Note: I don't know if that URL will change. I don't think so, but it may.
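
To show what the `value_json.voteCount` lookup is doing, here is a minimal Python sketch. The body below is an assumed stand-in that contains only the one field shown in the response above; the real response has more fields, which are left out here:

```python
import json

# Assumed stand-in for the API response: only the field shown above.
body = '{"voteCount": 72}'

# Parse the JSON text and read the counter, like value_json.voteCount does.
data = json.loads(body)
vote_count = data["voteCount"]

print(vote_count)
```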

Wow! Thanks!
Yesterday evening I found the corresponding API and was then able to at least filter the value via a GET request.

Does anyone know how I'd create two entities from this site? N.B. The address used is a random one.

<h4 class="h6 m-t-1" style="margin-left: 10px;"><strong>Your next collection dates:</strong></h4>

        <div class="links"><span class="m-r-1">Thursday 7 July</span><span class="icon-rubbish"><span class="sr-only">Rubbish</span></span> <span class="icon-recycle"><span class="sr-only">Recycle</span></span> </div>
    
        <div class="links"><span class="m-r-1">Thursday 14 July</span><span class="icon-rubbish"><span class="sr-only">Rubbish</span></span> </div>

Basically whether the rubbish (weekly) or the recycling (fortnightly) is scheduled is determined by an icon shown on the webpage defined by Rubbish and Recycle in the code above.

I'm using Google Calendar now, which works pretty well, but it does not account for public holidays, so I'd like to pull the info straight from the council's website if possible.

I can get a scrape working with the regular scrape, but want to be able to get multiple values.

The config below works in scrape, but nothing I try works when I convert it to multiscrape.

sensor:
  - platform: scrape
    resource: http://10.1.1.241/status.html
    name: webdata_today
    authentication: basic
    username: admin
    password: admin
    select: "script"
    index: 1
    value_template: "{{ (value.split(';')[6])|replace('var webdata_today_e = ','')|replace('\"', '')|float }}"

This gives me rubbish (pun intended :yum:), hope it helps:

multiscrape:
  - resource: "https://www.aucklandcouncil.govt.nz/rubbish-recycling/rubbish-recycling-collections/Pages/collection-day-detail.aspx?an=12340870253"
    name: Auckland Council
    scan_interval: 30
    log_response: true
    sensor:
      - select: ".card-block > div:nth-child(2) > span:nth-child(2) > span:nth-child(1)"
        name: NZrubbish
        unique_id: nzrubbish

Could you enable log_response and post the content of page_soup.txt?

I ended up getting 3 separate scrapes to work, though multiscrape would still be better.

I can't remember which multiscrape attempt this was from, but I have this log.

<html><body><p><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

</p>
<style type="text/css">
.in_body
{
	margin-top:0px;
	margin-left:0px;
	margin-right:0px;
	margin-bottom:0px;
	background-color:transparent;
}
.div_c
{
	margin-left:50px;
	margin-right:50px;
	margin-top:50px;
	margin-bottom:50px;
}
.cu
{
	cursor:pointer;
}
.b
{
	font-weight:bold;
}
.lab_5
{
	font-size:16px;
	color:#666666;
	margin-left:-20px;
}
.lab_l2
{
	float:left;
	width:32%;
	color:#666666;
	margin-bottom:-2px;
	font-size:14px;
}
.lab_r2
{
	float:left;
	width:68%;
	color:#666666;
	text-align:right;
	font-size:14px;
}
.cl
{
	clear:left;
}
.line
{
	height:1px;
	background-color:#666666;
	width:100%;
	margin-top:5px;
	margin-bottom:5px;
}
.sp_5
{
	height:5px;
	width:500px;
}
.sp_20
{
	height:20px;
	width:500px;
}
.label
{
	float:left;
	width:50%;
	color:#666666;
	margin-bottom:-2px;
	font-size:14px;
}
.lab_r
{
	float:left;
	width:50%;
	color:#666666;
	text-align:right;
	font-size:14px;
}
.lab_l
{
	float:left;
	width:40%;
	color:#666666;
	margin-bottom:-2px;
	margin-left:10%;
	font-size:14px;
}
.line_l
{
	height:1px;
	background-color:#666666;
	width:450px;
	margin-top:5px;
	margin-bottom:5px;
	margin-left:50px;
}
.sub
{
    display:inline-block;
    width:16px;
    text-align:center;
}
</style>
<script type="text/javascript">
var height=0;function fileText(id,value){if(document.getElementById(id)){document.getElementById(id).innerHTML=value}}function changeFont(){reCon("main_div").style.fontFamily=window.parent.reFont()}function child_getH(){var nh=document.body.offsetHeight+100;if(nh<500||nh==null){nh=500}if(height!=nh){height=nh;window.parent.child_height(height)}}function reCon(id){return document.getElementById(id)}function ready(){try{window.parent.show_ifr()}catch(e){}child_getH()}function show(v){var c=document.getElementById(v);if(c!=null){c.style.display=""}}function hide(v){var c=document.getElementById(v);if(c!=null){c.style.display="none"}};
</script>
<script type="text/javascript">
var webdata_sn = "133AF121A1900430";
var webdata_msvn = "0050";
var webdata_ssvn = "002D";
var webdata_pv_type = "00AF";
var webdata_rate_p = "";
var webdata_now_p = "5040";
var webdata_today_e = "9.80";
var webdata_total_e = "156.0";
var webdata_alarm = "";
var webdata_utime = "0";
var cover_mid = "4077258446";
var cover_ver = "MW3_15_0501_1.18";
var cover_wmode = "STA";
var cover_ap_ssid = "AP_4077258446";
var cover_ap_ip = "10.10.100.254";
var cover_ap_mac = "2C:9C:6E:5F:9F:BA";
var cover_sta_ssid = "NickAndie";
var cover_sta_rssi = "80%";
var cover_sta_ip = "10.1.1.241";
var cover_sta_mac = "28:9C:6E:5F:9F:BA";
var status_a = "1";
var status_b = "0";
var status_c = "0";

function initPageText(){var list=window.parent.reList("status");fileText("st1",list["t1"]);fileText("st2",list["t2"]);fileText("st3",list["t3"]);for(var i=1;i<=27;i++){if(i!=14){fileText("tx"+i,list[i])}}changeFont();child_getH()}function upfold(v){if(document.getElementById("up_"+v+"_div").style.display=="none"){show("up_"+v+"_div");reCon("p_"+v).innerHTML="-"}else{hide("up_"+v+"_div");reCon("p_"+v).innerHTML="+"}}function init_main_page(){var on=window.parent.reTip("1");var off=window.parent.reTip("2");document.getElementById("cover_mid").innerHTML=cover_mid;document.getElementById("cover_ver").innerHTML=cover_ver;document.getElementById("cover_ap_status").innerHTML=off;document.getElementById("cover_sta_status").innerHTML=off;if(cover_wmode!="STA"){document.getElementById("cover_ap_status").innerHTML=on;document.getElementById("cover_ap_ssid").innerHTML=cover_ap_ssid;document.getElementById("cover_ap_ip").innerHTML=cover_ap_ip;document.getElementById("cover_ap_mac").innerHTML=cover_ap_mac}if(cover_wmode!="AP"){document.getElementById("cover_sta_status").innerHTML=on;document.getElementById("cover_sta_ssid").innerHTML=cover_sta_ssid;document.getElementById("cover_sta_rssi").innerHTML=cover_sta_rssi;document.getElementById("cover_sta_ip").innerHTML=cover_sta_ip;document.getElementById("cover_sta_mac").innerHTML=cover_sta_mac}if(webdata_sn==""){webdata_sn="---"}fileText("webdata_sn",webdata_sn);if(webdata_msvn==""){webdata_msvn="---"}fileText("webdata_msvn",webdata_msvn);if(webdata_ssvn==""){webdata_ssvn="---"}fileText("webdata_ssvn",webdata_ssvn);if(webdata_pv_type==""){webdata_pv_type="---"}fileText("webdata_pv_type",webdata_pv_type);if(webdata_rate_p==""){webdata_rate_p="---"}fileText("webdata_rate_p",webdata_rate_p+" W");if(webdata_now_p==""||webdata_now_p==0){webdata_now_p="---"}fileText("webdata_now_p",webdata_now_p+" W");if(webdata_today_e==""){webdata_today_e="---"}fileText("webdata_today_e",webdata_today_e+" 
kWh");if(webdata_total_e==""){webdata_total_e="---"}fileText("webdata_total_e",webdata_total_e+" kWh");if(webdata_alarm==""){webdata_alarm="---"}fileText("webdata_alarm",webdata_alarm);if(webdata_utime==""){if(document.getElementById("webdata_sn").innerHTML=="---"){webdata_utime="---"}else{webdata_utime=value+window.parent.reTip("5")}}fileText("webdata_utime",webdata_utime);var st_en=window.parent.reTip("3");var st_dis=window.parent.reTip("4");var st_un=window.parent.reTip("41");if(status_a=="1"){document.getElementById("cover_remote_status_a").innerHTML=st_en}else{if(status_a=="0"){document.getElementById("cover_remote_status_a").innerHTML=st_dis}else{document.getElementById("cover_remote_status_a").innerHTML=st_un}}if(status_b=="1"){document.getElementById("cover_remote_status_b").innerHTML=st_en}else{if(status_b=="0"){document.getElementById("cover_remote_status_b").innerHTML=st_dis}else{document.getElementById("cover_remote_status_b").innerHTML=st_un}}};

</script>
<div class="div_c" id="main_div">
<div class="lab_5 cu b" onclick="upfold(1);child_getH();"><span class="sub" id="p_1">-</span><span id="st1" style="margin-left:3px"></span></div>
<div class="sp_5"></div>
<div id="up_1_div">
<div class="lab_l2" id="tx1"></div>
<div class="lab_r2" id="webdata_sn"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx2"></div>
<div class="lab_r2" id="webdata_msvn"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx3"></div>
<div class="lab_r2" id="webdata_ssvn"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx4"></div>
<div class="lab_r2" id="webdata_pv_type"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx5"></div>
<div class="lab_r2" id="webdata_rate_p"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx6" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_now_p" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx7" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_today_e" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx8" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_total_e" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx9" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_alarm" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx10" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_utime" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
</div>
<div class="sp_20"></div>
<div class="lab_5 cu b" onclick="upfold(2);child_getH();"><span class="sub" id="p_2">+</span><span id="st2" style="margin-left:3px"></span></div>
<div class="sp_5"></div>
<div id="up_2_div" style="display:none">
<div class="label" id="tx11"></div>
<div class="lab_r" id="cover_mid"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="label" id="tx12"></div>
<div class="lab_r" id="cover_ver"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="label" id="tx13"></div>
<div class="lab_r" id="cover_ap_status" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l" id="ap_ssid">SSID</div>
<div class="lab_r" id="cover_ap_ssid"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="lab_l" id="tx15"></div>
<div class="lab_r" id="cover_ap_ip"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="lab_l" id="tx16"></div>
<div class="lab_r" id="cover_ap_mac"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="label" id="tx17"></div>
<div class="lab_r" id="cover_sta_status" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l" id="tx18"></div>
<div class="lab_r" id="cover_sta_ssid"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="lab_l" id="tx19"></div>
<div class="lab_r" id="cover_sta_rssi"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="lab_l" id="tx20"></div>
<div class="lab_r" id="cover_sta_ip"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="lab_l" id="tx21"></div>
<div class="lab_r" id="cover_sta_mac"></div>
<div class="cl"></div>
<div class="line_l"></div>
</div>
<div class="sp_20"></div>
<div class="lab_5 cu b" onclick="upfold(3);child_getH();"><span class="sub" id="p_3">+</span><span id="st3" style="margin-left:3px"></span></div>
<div class="sp_5"></div>
<div id="up_3_div" style="display:none">
<div class="label" id="tx25"></div>
<div class="lab_r" id="cover_remote_status_a"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="label" id="tx26"></div>
<div class="lab_r" id="cover_remote_status_b"></div>
<div class="cl"></div>
<div class="line"></div>
</div>
</div>
<script type="text/javascript">
	    initPageText();
	    ready();
	</script>
</body></html>

Thanks for the reply @danieldotnl, very much appreciated. I've added your code but I'm a bit confused by the - select section, as I just end up with a sensor called Rubbish that seemingly does nothing. Ideally I was looking to extract the date of the next collection for both Rubbish and Recycle. Is that possible?

Ah, I see. It's a bit tricky as CSS doesn't support parent/sibling selectors. You could use select_list to get all the <span> elements and then try to split and make sense of it in a value_template?

  - resource: "https://www.aucklandcouncil.govt.nz/rubbish-recycling/rubbish-recycling-collections/Pages/collection-day-detail.aspx?an=12340870253"
    name: Auckland Council
    scan_interval: 30
    log_response: true
    sensor:
      - select_list: ".card-block > div:nth-child(2) > span"
        name: Next collection
        unique_id: next
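
To illustrate what "making sense of" the spans could look like, here is a rough Python sketch run against the HTML snippet posted earlier. It uses the standard library parser, not multiscrape. The names `CollectionParser` and `next_dates` are made up for this example, and I am assuming each `div.links` row holds one date followed by its icons:

```python
from html.parser import HTMLParser

# The two rows from the snippet posted earlier in the thread.
HTML = """
<div class="links"><span class="m-r-1">Thursday 7 July</span><span class="icon-rubbish"><span class="sr-only">Rubbish</span></span> <span class="icon-recycle"><span class="sr-only">Recycle</span></span> </div>
<div class="links"><span class="m-r-1">Thursday 14 July</span><span class="icon-rubbish"><span class="sr-only">Rubbish</span></span> </div>
"""


class CollectionParser(HTMLParser):
    """Collect [date, kinds] pairs, one per div.links row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one [date, [kinds]] entry per row
        self._field = None  # which kind of span text we are inside

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "links" in classes:
            self.rows.append(["", []])
        elif tag == "span" and self.rows:
            if "m-r-1" in classes:
                self._field = "date"
            elif "sr-only" in classes:
                self._field = "kind"

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.rows:
            return
        if self._field == "date":
            self.rows[-1][0] = text
        elif self._field == "kind":
            self.rows[-1][1].append(text)


def next_dates(rows):
    """Return the first listed date for each collection kind."""
    result = {}
    for date, kinds in rows:
        for kind in kinds:
            result.setdefault(kind, date)
    return result


parser = CollectionParser()
parser.feed(HTML)
upcoming = next_dates(parser.rows)
print(upcoming)
```

The idea is to pair each date with the kinds that appear in the same row, then keep the first date per kind. A value_template working on the select_list output would need to do the same pairing.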

I can't properly test it from a file, but loading this in Firefox and copying the CSS selector gives me this: body > script:nth-child(4). Did you try this?
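
Once the right script element is selected, pulling values out by variable name may be less fragile than counting semicolons with split(';')[6]. Here is a small Python sketch using a few lines copied from the posted page_soup. The function name `parse_webdata` is made up for this example:

```python
import re

# A few lines copied from the script block in the posted log.
SCRIPT = '''
var webdata_now_p = "5040";
var webdata_today_e = "9.80";
var webdata_total_e = "156.0";
var webdata_alarm = "";
'''

# Match lines of the form: var webdata_<name> = "<value>";
VAR_PATTERN = re.compile(r'var\s+(webdata_\w+)\s*=\s*"([^"]*)"\s*;')


def parse_webdata(script_text):
    """Return a dict of all webdata_* variables found in the script text."""
    return dict(VAR_PATTERN.findall(script_text))


values = parse_webdata(SCRIPT)
today_kwh = float(values["webdata_today_e"])
now_w = float(values["webdata_now_p"])

print(values)
print(today_kwh, now_w)
```

In a value_template the same idea could probably be written with Home Assistant's `regex_findall_index` filter, one sensor per variable. I have not tested that against this inverter page.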