Web scraper Sensor question

Then you probably did not copy the value correctly, because in your example there is a space between the A and the .5…
If the A directly follows the number, replace
value_template: '{{value.split(" ")[0] | float}}'
with
value_template: '{{value.split("A")[0] | float}}'

If there does appear to be a space but the value template still doesn’t work, try experimenting with this:
value_template: '{{value[:-4] | float}}'
(you may need to adjust the 4)

It works perfectly with this line:
value_template: '{{value[:-4] | float}}'

Thank you so much. I will have to investigate this further to understand what to do if I have to configure another one in the future. It looks a bit complex to me at the moment.

For scrape sensors, it is best to look for classes (e.g. <div class="classname">).
Hopefully the value you’re after sits inside an element with a class, and it’s the only element with that class name.
As for the actual value, value[:-4] means “remove the last 4 characters from value”.
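The split and slice expressions discussed above behave like ordinary Python string operations, so they can be tried outside Home Assistant first. A minimal sketch, assuming made-up readings such as "12.5 A" (the helper names are hypothetical, and Python's float() raises an error on bad input, which the template filter may handle differently):

```python
def number_before(text, separator):
    """Keep only the part before the first separator and convert it to a float."""
    return float(text.split(separator)[0])


def drop_last(text, count):
    """Remove the last `count` characters and convert the rest to a float."""
    return float(text[:-count])


# Hypothetical scraped strings, not taken from the device in this thread.
print(number_before("12.5 A", " "))   # unit separated by a space
print(number_before("12.5A", "A"))    # unit glued to the number
print(drop_last("12.5 Amp", 4))       # strip a fixed-length suffix
```

If the number of trailing characters varies, splitting on a known separator is usually less fragile than a fixed slice.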

OK, thanks for clarifying. This is at least a good start, as I understood your explanation.

:+1:

Now I have a new challenge, which seems a bit more complicated. I have tried to use the select option, but in this case I have no clue which word to look for, because the word I am trying to use appears several times. Could you please look at it and suggest something? I need the three figures marked in yellow in the screenshot.
The weblink for the site is http://77.119.243.51:86

Here are the instructions I use to scrape data.

	• Open up developer mode in Chrome (F12)
	• Click on Elements
	• Expand until you find location you want to get information
	• Right click and choose Copy then Copy selector
	• Paste into HA sensor under select:

I am trying to create a sensor with my next bus time to integrate into home assistant and this is my first attempt at using the scrape component. I’ve tried a number of ‘selects’ but continue to get Unable to extract data from HTML. What method of testing and debugging could be used to find out where it is failing? The link is below, and the data appears in
div.nextride-pattern__container > div.nextride-pattern__times.data-test-pattern-times-block

https://www.rtd-denver.com/app/nextride/stop/11844

sensor:
  - platform: scrape
    resource: https://www.rtd-denver.com/app/nextride/stop/11844
    name: RTD Downtown Bus Arrival 1
    select: 'div.nextride-pattern-stoptime__prediction'

Can anyone help me get this working, or point out where my configuration is failing?

Try without the “div”:

sensor:
  - platform: scrape
    resource: https://www.rtd-denver.com/app/nextride/stop/11844
    name: RTD Downtown Bus Arrival 1
    select: '.nextride-pattern-stoptime__prediction'

Still unable to extract data from HTML. Because this appears multiple times, would it help if I added an index: 0?

sensor:
  - platform: scrape
    resource: https://www.rtd-denver.com/app/nextride/stop/11844
    name: RTD Downtown Bus Arrival 1
    select: '.nextride-pattern-stoptime__prediction'
    index: 0

Or would it be index: 1? This index appears 3 times.

Also, would the headers element need to be added? Would ssl affect how it scrapes?
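On the index question: the scrape documentation describes index as picking one of the elements returned by the selector, counted from zero, so the first match is index 0. A minimal Python sketch of that idea, using only the standard library and a made-up HTML fragment (the real page may not serve these elements in its static HTML at all, and this is not the integration's own code):

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given class, in document order."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.matches = []
        self._depth = 0  # greater than 0 while inside a matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1          # nested tag inside a match
        elif self.class_name in classes:
            self._depth = 1           # a new match starts here
            self.matches.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.matches[-1] += data


# Hypothetical fragment: three elements sharing the class from the posts above.
PAGE = """
<div class="nextride-pattern-stoptime__prediction">5 min</div>
<div class="nextride-pattern-stoptime__prediction">17 min</div>
<div class="nextride-pattern-stoptime__prediction">32 min</div>
"""

collector = ClassTextCollector("nextride-pattern-stoptime__prediction")
collector.feed(PAGE)
print(collector.matches[0])  # first match, the equivalent of index: 0
print(collector.matches[1])  # second match, the equivalent of index: 1
```

If the list of matches is empty, no index value will help, which is consistent with the "Unable to extract data from HTML" error.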

You would be better off parsing the API https://www.rtd-denver.com/api/nextride/stops/11844 than using a web scraper.


I need some help with the scrape function too.
I want to scrape a power value from the internal web interface of my Solis datalogger. If I get it to work for one value, it should be easier for me to figure out the rest.

I copied the CSS selector for the required value (1480 W) and got this from Chrome:
#webdata_now_p

This is in my config right now:

sensor:
  - platform: scrape
    resource: http://192.168.*.*/
    authentication: basic
    username: <username>
    password: <password>
    select: "#webdata_now_p"
    value_template: '{{value.split(" ")[0] | float}}'
    unit_of_measurement: 'W'
    scan_interval: 300
    name: SolisPower

But all I got is an empty value in HA.

In the Core log from HA I get:
2020-05-09 08:29:47 ERROR (SyncWorker_10) [homeassistant.components.scrape.sensor] Unable to extract data from HTML

The internal HTML interface does use authentication, but (I assume) it is of the basic type. In Domoticz I had this scraping working with a PHP script that used the username:[email protected] form, and this worked.

For now I don’t know whether the issue is the authentication part or something wrong in my YAML scrape configuration. I’ve been trying for hours, even days, and can’t figure it out. Can someone point me in the right direction?

EDIT:
After I figured out how to turn on debug logging:

2020-05-09 09:42:54 DEBUG (SyncWorker_19) [homeassistant.components.scrape.sensor] 401 Unauthorized

401 Unauthorized

Authorization required.

It’s just basic authentication on the website, and according to the scrape options page this should be supported.
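For reference, this is roughly what HTTP basic authentication puts on the wire: an Authorization header carrying the base64 of username:password. A minimal Python sketch with placeholder credentials (the function name is hypothetical, and the sketch does not explain why the sensor above received a 401):

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header used by HTTP basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials, not the ones from the datalogger.
print(basic_auth_header("user", "pass"))
```

Comparing the header a client actually sends with this shape (for example via debug logging) can help separate an authentication problem from a selector problem.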

OK, a reply to myself :slight_smile:

After adjusting the configuration to:

sensor:
  - platform: scrape
    resource: http://username:[email protected]/status.html
    select: "webdata_now_p" 
    unit_of_measurement: 'W'
    scan_interval: 10
    name : SolisPower4

It seems it reads the page correctly. In the home-assistant.log I do find:

</style>
<script type="text/javascript">
var height=0;function fileText(id,value){if(document.getElementById(id)){document.getElementById(id).innerHTML=value}}function changeFont(){reCon("main_div").style.fontFamily=window.parent.reFont()}function child_getH(){var nh=document.body.offsetHeight+100;if(nh<500||nh==null){nh=500}if(height!=nh){height=nh;window.parent.child_height(height)}}function reCon(id){return document.getElementById(id)}function ready(){try{window.parent.show_ifr()}catch(e){}child_getH()}function show(v){var c=document.getElementById(v);if(c!=null){c.style.display=""}}function hide(v){var c=document.getElementById(v);if(c!=null){c.style.display="none"}};
</script>
<script type="text/javascript">
var webdata_sn = "1234567890 ";
var webdata_msvn = "000C";
var webdata_ssvn = "001E";
var webdata_pv_type = "0078";
var webdata_rate_p = "";
var webdata_now_p = "4880";
var webdata_today_e = "11.0";
var webdata_total_e = "2233.0";
var webdata_alarm = "";
var webdata_utime = "0";
var cover_mid = "4013752125";
var cover_ver = "MW_08_0501_1.57";
var cover_wmode = "STA";
var cover_ap_ssid = "AP_111111111";
var cover_ap_ip = "10.1.1.1";
var cover_ap_mac = "11111111111";
var cover_sta_ssid = "my ssid";
var cover_sta_rssi = "100%";
var cover_sta_ip = "192.168.x.y";
var cover_sta_mac = "123456789";
var status_a = "1";
var status_b = "0";
var status_c = "0";

And somewhat further in the logs:

<div class="lab_l2" id="tx6" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_now_p" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>

I expected the select “webdata_now_p” to return 4880.

However:
2020-05-09 10:47:45 ERROR (SyncWorker_19) [homeassistant.components.scrape.sensor] Unable to extract data from HTML


One step further…

select: ".lab_l2"
This no longer throws an error, but the value is now empty. There are multiple elements with this class and different id= attributes, so how do I select the right one? And is this the right approach? Or is it possible to directly scrape the value behind var webdata_now_p =

Anybody?

Hi, Frank
Thank you for the replies. I will test it today. Sorry I didn’t reply yesterday. I hope I get this working. It is really important for me. Thanks a lot for working this out.

Hi Frank, I think you have the same problem as I’m having with my solar panel. As lolouk44 replied on this post in April 2019, I think your values are also loaded after the page itself has loaded, which makes it impossible to read any of the values from the page. However, since they appear in your Home Assistant log, there has to be a way. You are already one step ahead of me.
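The point about values being filled in after load can be checked against the HTML quoted from the log. A minimal Python sketch, using only the standard library and a trimmed copy of the lines shown above (it is not the scrape integration's own parser):

```python
from html.parser import HTMLParser


class IdTextFinder(HTMLParser):
    """Find the element with a given id and collect the text it contains."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.found = False
        self.text = ""
        self._depth = 0  # greater than 0 while inside the target element

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.found = True
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.text += data


# Trimmed from the log excerpt above: the value lives in a script variable,
# while the div that displays it is empty until the browser runs the script.
PAGE = '''<script type="text/javascript">
var webdata_now_p = "4880";
</script>
<div class="lab_r2" id="webdata_now_p" style="color:#666666;font-weight:bold;"></div>
'''

finder = IdTextFinder("webdata_now_p")
finder.feed(PAGE)
print(finder.found, repr(finder.text))  # True ''
```

The element is found, but it has no text in the served HTML, so a selector that matches it can only return an empty value.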

I’ve read all the related topics I could find and believe that I need to do this with a value template.
Is there a way to select “all HTML output” and then use a value template to filter out the exact value? Or use something like a regex?

The entire HTML, including the desired values, is in the Home Assistant log.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- saved from url=(0031)http://192.168.2.15/status.html -->
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style type="text/css">
.in_body
{
	margin-top:0px;
	margin-left:0px;
	margin-right:0px;
	margin-bottom:0px;
	background-color:transparent;
}
.div_c
{
	margin-left:50px;
	margin-right:50px;
	margin-top:50px;
	margin-bottom:50px;
}
.cu
{
	cursor:pointer;
}
.b
{
	font-weight:bold;
}
.lab_5
{
	font-size:16px;
	color:#666666;
	margin-left:-20px;
}
.lab_l2
{
	float:left;
	width:32%;
	color:#666666;
	margin-bottom:-2px;
	font-size:14px;
}
.lab_r2
{
	float:left;
	width:68%;
	color:#666666;
	text-align:right;
	font-size:14px;
}
.cl
{
	clear:left;
}
.line
{
	height:1px;
	background-color:#666666;
	width:100%;
	margin-top:5px;
	margin-bottom:5px;
}
.sp_5
{
	height:5px;
	width:500px;
}
.sp_20
{
	height:20px;
	width:500px;
}
.label
{
	float:left;
	width:50%;
	color:#666666;
	margin-bottom:-2px;
	font-size:14px;
}
.lab_r
{
	float:left;
	width:50%;
	color:#666666;
	text-align:right;
	font-size:14px;
}
.lab_l
{
	float:left;
	width:40%;
	color:#666666;
	margin-bottom:-2px;
	margin-left:10%;
	font-size:14px;
}
.line_l
{
	height:1px;
	background-color:#666666;
	width:450px;
	margin-top:5px;
	margin-bottom:5px;
	margin-left:50px;
}
.sub
{
    display:inline-block;
    width:16px;
    text-align:center;
}
</style>
<script type="text/javascript">
var height=0;function fileText(id,value){if(document.getElementById(id)){document.getElementById(id).innerHTML=value}}function changeFont(){reCon("main_div").style.fontFamily=window.parent.reFont()}function child_getH(){var nh=document.body.offsetHeight+100;if(nh<500||nh==null){nh=500}if(height!=nh){height=nh;window.parent.child_height(height)}}function reCon(id){return document.getElementById(id)}function ready(){try{window.parent.show_ifr()}catch(e){}child_getH()}function show(v){var c=document.getElementById(v);if(c!=null){c.style.display=""}}function hide(v){var c=document.getElementById(v);if(c!=null){c.style.display="none"}};
</script>
<script type="text/javascript">
var webdata_sn = "123456789 ";
var webdata_msvn = "000C";
var webdata_ssvn = "001E";
var webdata_pv_type = "0078";
var webdata_rate_p = "";
var webdata_now_p = "1420";
var webdata_today_e = "24.60";
var webdata_total_e = "2290.0";
var webdata_alarm = "";
var webdata_utime = "0";
var cover_mid = "123456789";
var cover_ver = "MW_08_0501_1.57";
var cover_wmode = "STA";
var cover_ap_ssid = "AP_123456789";
var cover_ap_ip = "10.10.10.10";
var cover_ap_mac = "123456789";
var cover_sta_ssid = "123456789";
var cover_sta_rssi = "100%";
var cover_sta_ip = "10.10.10.10";
var cover_sta_mac = "123456789";
var status_a = "1";
var status_b = "0";
var status_c = "0";

function initPageText(){var list=window.parent.reList("status");fileText("st1",list["t1"]);fileText("st2",list["t2"]);fileText("st3",list["t3"]);for(var i=1;i<=27;i++){if(i!=14){fileText("tx"+i,list[i])}}changeFont();child_getH()}function upfold(v){if(document.getElementById("up_"+v+"_div").style.display=="none"){show("up_"+v+"_div");reCon("p_"+v).innerHTML="-"}else{hide("up_"+v+"_div");reCon("p_"+v).innerHTML="+"}}function init_main_page(){var on=window.parent.reTip("1");var off=window.parent.reTip("2");document.getElementById("cover_mid").innerHTML=cover_mid;document.getElementById("cover_ver").innerHTML=cover_ver;document.getElementById("cover_ap_status").innerHTML=off;document.getElementById("cover_sta_status").innerHTML=off;if(cover_wmode!="STA"){document.getElementById("cover_ap_status").innerHTML=on;document.getElementById("cover_ap_ssid").innerHTML=cover_ap_ssid;document.getElementById("cover_ap_ip").innerHTML=cover_ap_ip;document.getElementById("cover_ap_mac").innerHTML=cover_ap_mac}if(cover_wmode!="AP"){document.getElementById("cover_sta_status").innerHTML=on;document.getElementById("cover_sta_ssid").innerHTML=cover_sta_ssid;document.getElementById("cover_sta_rssi").innerHTML=cover_sta_rssi;document.getElementById("cover_sta_ip").innerHTML=cover_sta_ip;document.getElementById("cover_sta_mac").innerHTML=cover_sta_mac}if(webdata_sn==""){webdata_sn="---"}fileText("webdata_sn",webdata_sn);if(webdata_msvn==""){webdata_msvn="---"}fileText("webdata_msvn",webdata_msvn);if(webdata_ssvn==""){webdata_ssvn="---"}fileText("webdata_ssvn",webdata_ssvn);if(webdata_pv_type==""){webdata_pv_type="---"}fileText("webdata_pv_type",webdata_pv_type);if(webdata_rate_p==""){webdata_rate_p="---"}fileText("webdata_rate_p",webdata_rate_p+" W");if(webdata_now_p==""||webdata_now_p==0){webdata_now_p="---"}fileText("webdata_now_p",webdata_now_p+" W");if(webdata_today_e==""){webdata_today_e="---"}fileText("webdata_today_e",webdata_today_e+" 
kWh");if(webdata_total_e==""){webdata_total_e="---"}fileText("webdata_total_e",webdata_total_e+" kWh");if(webdata_alarm==""){webdata_alarm="---"}fileText("webdata_alarm",webdata_alarm);if(webdata_utime==""){if(document.getElementById("webdata_sn").innerHTML=="---"){webdata_utime="---"}else{webdata_utime=value+window.parent.reTip("5")}}fileText("webdata_utime",webdata_utime);var st_en=window.parent.reTip("3");var st_dis=window.parent.reTip("4");var st_un=window.parent.reTip("41");if(status_a=="1"){document.getElementById("cover_remote_status_a").innerHTML=st_en}else{if(status_a=="0"){document.getElementById("cover_remote_status_a").innerHTML=st_dis}else{document.getElementById("cover_remote_status_a").innerHTML=st_un}}if(status_b=="1"){document.getElementById("cover_remote_status_b").innerHTML=st_en}else{if(status_b=="0"){document.getElementById("cover_remote_status_b").innerHTML=st_dis}else{document.getElementById("cover_remote_status_b").innerHTML=st_un}}};

</script>
</head>
<body class="in_body" onload="init_main_page();">
	<div class="div_c" id="main_div">
        <div class="lab_5 cu b" onclick="upfold(1);child_getH();"><span class="sub" id="p_1">-</span><span id="st1" style="margin-left:3px"></span></div>
        <div class="sp_5"></div>
        <div id="up_1_div">
        <div class="lab_l2" id="tx1"></div>
                <div class="lab_r2" id="webdata_sn"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" id="tx2"></div>
                <div class="lab_r2" id="webdata_msvn"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" id="tx3"></div>
                <div class="lab_r2" id="webdata_ssvn"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" id="tx4"></div>
                <div class="lab_r2" id="webdata_pv_type"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" id="tx5"></div>
                <div class="lab_r2" id="webdata_rate_p"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" style="color:#666666;font-weight:bold;" id="tx6"></div>
                <div class="lab_r2" id="webdata_now_p" style="color:#666666;font-weight:bold;"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" style="color:#666666;font-weight:bold;" id="tx7"></div>
                <div class="lab_r2" id="webdata_today_e" style="color:#666666;font-weight:bold;"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" style="color:#666666;font-weight:bold;" id="tx8"></div>
                <div class="lab_r2" id="webdata_total_e" style="color:#666666;font-weight:bold;"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" style="color:#666666;font-weight:bold;" id="tx9"></div>
                <div class="lab_r2" id="webdata_alarm" style="color:#666666;font-weight:bold;"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" style="color:#666666;font-weight:bold;" id="tx10"></div>
                <div class="lab_r2" id="webdata_utime" style="color:#666666;font-weight:bold;"></div>
        <div class="cl"></div>
        <div class="line"></div>
        </div>
        <div class="sp_20"></div>
        <div class="lab_5 cu b" onclick="upfold(2);child_getH();"><span class="sub" id="p_2">+</span><span id="st2" style="margin-left:3px"></span></div>
                <div class="sp_5"></div>
                <div id="up_2_div" style="display:none">
                <div class="label" id="tx11"></div>
                <div class="lab_r" id="cover_mid"></div>
                <div class="cl"></div>
                <div class="line"></div>
                <div class="label" id="tx12"></div>
                <div class="lab_r" id="cover_ver"></div>
                <div class="cl"></div>
                <div class="line"></div>
                <div class="label" id="tx13"></div>
                <div class="lab_r" id="cover_ap_status" style="color:#666666;font-weight:bold;"></div>
                <div class="cl"></div>
                <div class="line"></div>
                <div class="lab_l" id="ap_ssid">SSID</div>
                <div class="lab_r" id="cover_ap_ssid"></div>
                <div class="cl"></div>
                <div class="line_l"></div>
                <div class="lab_l" id="tx15"></div>
                <div class="lab_r" id="cover_ap_ip"></div>
                <div class="cl"></div>
                <div class="line_l"></div>
                <div class="lab_l" id="tx16"></div>
                <div class="lab_r" id="cover_ap_mac"></div>
                <div class="cl"></div>
                <div class="line_l"></div>
                <div class="label" id="tx17"></div>
                <div class="lab_r" id="cover_sta_status" style="color:#666666;font-weight:bold;"></div>
                <div class="cl"></div>
                <div class="line"></div>
                <div class="lab_l" id="tx18"></div>
                <div class="lab_r" id="cover_sta_ssid"></div>
                <div class="cl"></div>
                <div class="line_l"></div>
                <div class="lab_l" id="tx19"></div>
                <div class="lab_r" id="cover_sta_rssi"></div>
                <div class="cl"></div>
                <div class="line_l"></div>
                <div class="lab_l" id="tx20"></div>
                <div class="lab_r" id="cover_sta_ip"></div>
                <div class="cl"></div>
                <div class="line_l"></div>
                <div class="lab_l" id="tx21"></div>
                <div class="lab_r" id="cover_sta_mac"></div>
                <div class="cl"></div>
                <div class="line_l"></div>
                </div>
                
                <div class="sp_20"></div>
                <div class="lab_5 cu b" onclick="upfold(3);child_getH();"><span class="sub" id="p_3">+</span><span id="st3" style="margin-left:3px"></span></div>
                <div class="sp_5"></div>
                <div id="up_3_div" style="display:none">
                <div class="label" id="tx25"></div>
                <div class="lab_r" id="cover_remote_status_a"></div>
                <div class="cl"></div>
                <div class="line"></div>
                <div class="label" id="tx26"></div>
                <div class="lab_r" id="cover_remote_status_b"></div>
                <div class="cl"></div>
                <div class="line"></div>
                </div>
    </div>
	<script type="text/javascript">
	    initPageText();
	    ready();
	</script>


</body></html>

Can anyone point me in the right direction with a select / template syntax?

Is it even possible to scrape just string values from this source?
I tried everything but can’t grab these values. Is it possible to select everything and then use a regex?

Or is it possible to run a PHP script (as I have a working script)?
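On the regex idea: because the values sit in plain var name = "value"; lines inside the script block, a regular expression can pull them out of the raw page text. A minimal Python sketch of that pattern, run against a trimmed copy of the dump above (the function name is hypothetical, and whether the scrape sensor itself can be pointed at the script text is not settled in this thread, so this is the kind of logic a helper script would contain):

```python
import re


def js_string_var(page, name):
    """Return the string assigned in `var <name> = "...";`, or None if absent."""
    pattern = r'var\s+' + re.escape(name) + r'\s*=\s*"([^"]*)"'
    match = re.search(pattern, page)
    return match.group(1) if match else None


# Trimmed from the HTML dump above.
PAGE = '''var webdata_rate_p = "";
var webdata_now_p = "1420";
var webdata_today_e = "24.60";
var webdata_total_e = "2290.0";
'''

power_w = float(js_string_var(PAGE, "webdata_now_p"))
print(power_w)
print(js_string_var(PAGE, "webdata_today_e"))
```

Some variables, such as webdata_rate_p, are empty strings in the dump, so a caller should check for an empty or missing value before converting to a number.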

If you have the option to run php scripts then that is your solution.
Just host the PHP file somewhere and make it scrape and output the value you want, then in Home Assistant it’s a simple scrape.

I did that, I uploaded a scrape php page to my webpage that my Home Assistant contacts to get the values from a third party page.
Much easier to scrape with full programming tools.

Hi,

I am trying to scrape some data from the AccuWeather web page (in my opinion they have a nicely working MinuteCast, a minute-by-minute forecast for the next two hours).
https://www.accuweather.com/en/cu/havana/122438/minute-weather-forecast/122438

Unfortunately I cannot find the selector. I tried:
title, .title, .minutecast-dial .title, .minutecast-dial.title, and all of these options with p at the end.
No luck. I always get this error:
ERROR (SyncWorker_18) [homeassistant.components.scrape.sensor] Unable to extract data from HTML
Any hints?

I managed to get this working from the Ginlong monitoring website after setting it to public but it doesn’t seem to be “recording” the information only displaying it on a card. Does anyone happen to know how I would get this to work please? I’m using InfluxDB atm for thermostats so would like to add it here if possible.

Here is my sensor in configuration.yaml:-

this returns :- image