Web scraper Sensor question

Now I have a new challenge, which seems a bit more complicated: I have tried to use the select option, but in this case I have no clue which word I could look for, as the word I am trying to use appears several times. Could you please look at it and come up with a suggestion? I need the three figures marked in yellow in the screenshot.
The weblink for the site is http://77.119.243.51:86

Here are the instructions I use to scrape data.

	• Open up developer mode in Chrome (F12)
	• Click on Elements
	• Expand until you find location you want to get information
	• Right click and choose Copy then Copy selector
	• Paste into HA sensor under select:

I am trying to create a sensor with my next bus time to integrate into Home Assistant, and this is my first attempt at using the scrape component. I’ve tried a number of ‘selects’ but continue to get “Unable to extract data from HTML”. What method of testing and debugging could be used to find out where it is failing? The link is below, and the data appears in:
div.nextride-pattern__container > div.nextride-pattern__times.data-test-pattern-times-block

https://www.rtd-denver.com/app/nextride/stop/11844

sensor:
  - platform: scrape
    resource: https://www.rtd-denver.com/app/nextride/stop/11844
    name: RTD Downtown Bus Arrival 1
    select: 'div.nextride-pattern-stoptime__prediction'

Can anyone help me work out how to achieve this, or where my configuration is failing?

Try without the “div”:

sensor:
  - platform: scrape
    resource: https://www.rtd-denver.com/app/nextride/stop/11844
    name: RTD Downtown Bus Arrival 1
    select: '.nextride-pattern-stoptime__prediction'

Still unable to extract data from HTML. Because this appears multiple times, would it help if I added an index: 0?

sensor:
  - platform: scrape
    resource: https://www.rtd-denver.com/app/nextride/stop/11844
    name: RTD Downtown Bus Arrival 1
    select: '.nextride-pattern-stoptime__prediction'
    index: 0

Or would it be index: 1? The element appears 3 times.

Also, would the headers element need to be added? Would SSL affect how it scrapes?
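
One way to test where this fails is to check whether the class is present at all in the HTML the server sends, as opposed to what the browser shows after JavaScript has run. Below is a minimal sketch in Python using only the standard library. The two HTML strings are made-up stand-ins, not the real RTD page, and `ClassCounter` and `count_class` are hypothetical names chosen for this illustration. In practice you would save the raw page to a file and feed that text in.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in the raw HTML carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical stand-ins for what a server might return.
static_page = '<div class="nextride-pattern-stoptime__prediction">5 min</div>'
js_shell_page = '<div id="root"></div><script src="/app.js"></script>'

target = "nextride-pattern-stoptime__prediction"
print(count_class(static_page, target))    # element is in the raw HTML
print(count_class(js_shell_page, target))  # element only appears after JavaScript runs
```

If the count is zero for the raw page, no selector or index will help, because the element is not in the HTML that the scrape sensor receives.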

You would be better off parsing the API https://www.rtd-denver.com/api/nextride/stops/11844 than using a web scraper.

I need some help too using the scrape function.
I want to scrape a power value from the internal web interface of my Solis datalogger. If I get it to work for one value, it should be easier for me to figure out the rest.

I copied the CSS selector for the required value (1480 W) from Chrome and got:
#webdata_now_p

This is my config right now:

sensor:
  - platform: scrape
    resource: http://192.168.*.*/
    authentication: basic
    username: <username>
    password: <password>
    select: "#webdata_now_p"
    value_template: '{{ value.split(" ")[0] | float }}'
    unit_of_measurement: 'W'
    scan_interval: 300
    name: SolisPower

But all I got is an empty value in HA.

In the Core log from HA I get:
2020-05-09 08:29:47 ERROR (SyncWorker_10) [homeassistant.components.scrape.sensor] Unable to extract data from HTML

The internal HTML interface does use authentication, but I assume it is of the basic type. In Domoticz I had this scraping working with a PHP script that used username:[email protected], and that worked.

For now I don’t know whether the issue is the authentication part or something wrong in my YAML scrape configuration. I have been trying for hours, even days, and can’t figure it out. Can someone point me in the right direction?

EDIT:
After I figured out how to turn on debug logging:

2020-05-09 09:42:54 DEBUG (SyncWorker_19) [homeassistant.components.scrape.sensor] 401 Unauthorized

401 Unauthorized

Authorization required.

The website just uses basic authentication, and according to the scrape options page this should be supported.
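
For context, a 401 response means the server did not receive credentials it accepts. With HTTP basic authentication the client sends an Authorization header containing the Base64 encoding of username:password. Here is a small Python sketch of how that header is built; the credentials are placeholders and `basic_auth_header` is a hypothetical helper, not part of Home Assistant.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header value used by HTTP basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


# Placeholder credentials, for illustration only.
print(basic_auth_header("user", "secret"))
```

Comparing this value with the header that a working client sends (for example in the browser's network tab) can show whether the device really expects basic authentication.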

OK, a reply to myself.

After adjusting the configuration to:

sensor:
  - platform: scrape
    resource: http://username:[email protected]/status.html
    select: "webdata_now_p" 
    unit_of_measurement: 'W'
    scan_interval: 10
    name : SolisPower4

It seems it reads the page correctly. In the home-assistant.log I do find:

</style>
<script type="text/javascript">
var height=0;function fileText(id,value){if(document.getElementById(id)){document.getElementById(id).innerHTML=value}}function changeFont(){reCon("main_div").style.fontFamily=window.parent.reFont()}function child_getH(){var nh=document.body.offsetHeight+100;if(nh<500||nh==null){nh=500}if(height!=nh){height=nh;window.parent.child_height(height)}}function reCon(id){return document.getElementById(id)}function ready(){try{window.parent.show_ifr()}catch(e){}child_getH()}function show(v){var c=document.getElementById(v);if(c!=null){c.style.display=""}}function hide(v){var c=document.getElementById(v);if(c!=null){c.style.display="none"}};
</script>
<script type="text/javascript">
var webdata_sn = "1234567890 ";
var webdata_msvn = "000C";
var webdata_ssvn = "001E";
var webdata_pv_type = "0078";
var webdata_rate_p = "";
var webdata_now_p = "4880";
var webdata_today_e = "11.0";
var webdata_total_e = "2233.0";
var webdata_alarm = "";
var webdata_utime = "0";
var cover_mid = "4013752125";
var cover_ver = "MW_08_0501_1.57";
var cover_wmode = "STA";
var cover_ap_ssid = "AP_111111111";
var cover_ap_ip = "10.1.1.1";
var cover_ap_mac = "11111111111";
var cover_sta_ssid = "my ssid";
var cover_sta_rssi = "100%";
var cover_sta_ip = "192.168.x.y";
var cover_sta_mac = "123456789";
var status_a = "1";
var status_b = "0";
var status_c = "0";

And somewhat further in the logs:

<div class="lab_l2" id="tx6" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_now_p" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>

I expected the select “webdata_now_p” to return 4880.

However:
2020-05-09 10:47:45 ERROR (SyncWorker_19) [homeassistant.components.scrape.sensor] Unable to extract data from HTML

One step further…

select: ".lab_l2"
This no longer throws an error, but the value is now empty. There are multiple elements with different id= attributes, so how do I select the right one? And is this the right approach? Or is it possible to directly scrape the value after var webdata_now_p = ?
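
A likely explanation for the empty value, sketched in Python with only the standard library: in the HTML that appears in the log, the div with id webdata_now_p has no text of its own. The number exists only in the script block and is written into the div by JavaScript when a browser loads the page. `TextById` and `text_by_id` are hypothetical names for this illustration, and the HTML is a shortened copy of the lines from the log above.

```python
from html.parser import HTMLParser


class TextById(HTMLParser):
    """Collect the text inside the first element with a given id.

    Simplified: assumes the target element contains no void tags such as <br>.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0   # greater than 0 while inside the target element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def text_by_id(html, target_id):
    parser = TextById(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


# Shortened copy of the HTML from the log.
page = '''<script type="text/javascript">
var webdata_now_p = "4880";
</script>
<div class="lab_l2" id="tx6" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_now_p" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>'''

print(repr(text_by_id(page, "webdata_now_p")))  # the div itself holds no text
```

So a selector that matches the div can succeed and still return nothing, because the scrape sensor does not run the page's JavaScript.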

Anybody?

Hi, Frank
Thank you for the replies. I will test it today. Sorry I didn’t reply yesterday. I hope I get this working. It is really important for me. Thanks a lot for working this out.

Hi Frank, I think you have the same problem I’m having with my solar panel. As lolouk44 replied in this thread in April 2019, I think your values are also loaded after the page itself has loaded, which makes it impossible to read any of the values from the page. However, since they appear in your Home Assistant log, there has to be a way. You are already one step ahead of me.

I’ve read all the related topics I could find and believe that I need to do this with a value template.
Is there a way to select “all HTML output” and then use a value template to filter out the exact value? Or use something like a regex?

The entire HTML, including the desired values, is in the Home Assistant log.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- saved from url=(0031)http://192.168.2.15/status.html -->
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style type="text/css">
.in_body
{
	margin-top:0px;
	margin-left:0px;
	margin-right:0px;
	margin-bottom:0px;
	background-color:transparent;
}
.div_c
{
	margin-left:50px;
	margin-right:50px;
	margin-top:50px;
	margin-bottom:50px;
}
.cu
{
	cursor:pointer;
}
.b
{
	font-weight:bold;
}
.lab_5
{
	font-size:16px;
	color:#666666;
	margin-left:-20px;
}
.lab_l2
{
	float:left;
	width:32%;
	color:#666666;
	margin-bottom:-2px;
	font-size:14px;
}
.lab_r2
{
	float:left;
	width:68%;
	color:#666666;
	text-align:right;
	font-size:14px;
}
.cl
{
	clear:left;
}
.line
{
	height:1px;
	background-color:#666666;
	width:100%;
	margin-top:5px;
	margin-bottom:5px;
}
.sp_5
{
	height:5px;
	width:500px;
}
.sp_20
{
	height:20px;
	width:500px;
}
.label
{
	float:left;
	width:50%;
	color:#666666;
	margin-bottom:-2px;
	font-size:14px;
}
.lab_r
{
	float:left;
	width:50%;
	color:#666666;
	text-align:right;
	font-size:14px;
}
.lab_l
{
	float:left;
	width:40%;
	color:#666666;
	margin-bottom:-2px;
	margin-left:10%;
	font-size:14px;
}
.line_l
{
	height:1px;
	background-color:#666666;
	width:450px;
	margin-top:5px;
	margin-bottom:5px;
	margin-left:50px;
}
.sub
{
    display:inline-block;
    width:16px;
    text-align:center;
}
</style>
<script type="text/javascript">
var height=0;function fileText(id,value){if(document.getElementById(id)){document.getElementById(id).innerHTML=value}}function changeFont(){reCon("main_div").style.fontFamily=window.parent.reFont()}function child_getH(){var nh=document.body.offsetHeight+100;if(nh<500||nh==null){nh=500}if(height!=nh){height=nh;window.parent.child_height(height)}}function reCon(id){return document.getElementById(id)}function ready(){try{window.parent.show_ifr()}catch(e){}child_getH()}function show(v){var c=document.getElementById(v);if(c!=null){c.style.display=""}}function hide(v){var c=document.getElementById(v);if(c!=null){c.style.display="none"}};
</script>
<script type="text/javascript">
var webdata_sn = "123456789 ";
var webdata_msvn = "000C";
var webdata_ssvn = "001E";
var webdata_pv_type = "0078";
var webdata_rate_p = "";
var webdata_now_p = "1420";
var webdata_today_e = "24.60";
var webdata_total_e = "2290.0";
var webdata_alarm = "";
var webdata_utime = "0";
var cover_mid = "123456789";
var cover_ver = "MW_08_0501_1.57";
var cover_wmode = "STA";
var cover_ap_ssid = "AP_123456789";
var cover_ap_ip = "10.10.10.10";
var cover_ap_mac = "123456789";
var cover_sta_ssid = "123456789";
var cover_sta_rssi = "100%";
var cover_sta_ip = "10.10.10.10";
var cover_sta_mac = "123456789";
var status_a = "1";
var status_b = "0";
var status_c = "0";

function initPageText(){var list=window.parent.reList("status");fileText("st1",list["t1"]);fileText("st2",list["t2"]);fileText("st3",list["t3"]);for(var i=1;i<=27;i++){if(i!=14){fileText("tx"+i,list[i])}}changeFont();child_getH()}function upfold(v){if(document.getElementById("up_"+v+"_div").style.display=="none"){show("up_"+v+"_div");reCon("p_"+v).innerHTML="-"}else{hide("up_"+v+"_div");reCon("p_"+v).innerHTML="+"}}function init_main_page(){var on=window.parent.reTip("1");var off=window.parent.reTip("2");document.getElementById("cover_mid").innerHTML=cover_mid;document.getElementById("cover_ver").innerHTML=cover_ver;document.getElementById("cover_ap_status").innerHTML=off;document.getElementById("cover_sta_status").innerHTML=off;if(cover_wmode!="STA"){document.getElementById("cover_ap_status").innerHTML=on;document.getElementById("cover_ap_ssid").innerHTML=cover_ap_ssid;document.getElementById("cover_ap_ip").innerHTML=cover_ap_ip;document.getElementById("cover_ap_mac").innerHTML=cover_ap_mac}if(cover_wmode!="AP"){document.getElementById("cover_sta_status").innerHTML=on;document.getElementById("cover_sta_ssid").innerHTML=cover_sta_ssid;document.getElementById("cover_sta_rssi").innerHTML=cover_sta_rssi;document.getElementById("cover_sta_ip").innerHTML=cover_sta_ip;document.getElementById("cover_sta_mac").innerHTML=cover_sta_mac}if(webdata_sn==""){webdata_sn="---"}fileText("webdata_sn",webdata_sn);if(webdata_msvn==""){webdata_msvn="---"}fileText("webdata_msvn",webdata_msvn);if(webdata_ssvn==""){webdata_ssvn="---"}fileText("webdata_ssvn",webdata_ssvn);if(webdata_pv_type==""){webdata_pv_type="---"}fileText("webdata_pv_type",webdata_pv_type);if(webdata_rate_p==""){webdata_rate_p="---"}fileText("webdata_rate_p",webdata_rate_p+" W");if(webdata_now_p==""||webdata_now_p==0){webdata_now_p="---"}fileText("webdata_now_p",webdata_now_p+" W");if(webdata_today_e==""){webdata_today_e="---"}fileText("webdata_today_e",webdata_today_e+" 
kWh");if(webdata_total_e==""){webdata_total_e="---"}fileText("webdata_total_e",webdata_total_e+" kWh");if(webdata_alarm==""){webdata_alarm="---"}fileText("webdata_alarm",webdata_alarm);if(webdata_utime==""){if(document.getElementById("webdata_sn").innerHTML=="---"){webdata_utime="---"}else{webdata_utime=value+window.parent.reTip("5")}}fileText("webdata_utime",webdata_utime);var st_en=window.parent.reTip("3");var st_dis=window.parent.reTip("4");var st_un=window.parent.reTip("41");if(status_a=="1"){document.getElementById("cover_remote_status_a").innerHTML=st_en}else{if(status_a=="0"){document.getElementById("cover_remote_status_a").innerHTML=st_dis}else{document.getElementById("cover_remote_status_a").innerHTML=st_un}}if(status_b=="1"){document.getElementById("cover_remote_status_b").innerHTML=st_en}else{if(status_b=="0"){document.getElementById("cover_remote_status_b").innerHTML=st_dis}else{document.getElementById("cover_remote_status_b").innerHTML=st_un}}};

</script>
</head>
<body class="in_body" onload="init_main_page();">
	<div class="div_c" id="main_div">
        <div class="lab_5 cu b" onclick="upfold(1);child_getH();"><span class="sub" id="p_1">-</span><span id="st1" style="margin-left:3px"></span></div>
        <div class="sp_5"></div>
        <div id="up_1_div">
        <div class="lab_l2" id="tx1"></div>
                <div class="lab_r2" id="webdata_sn"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" id="tx2"></div>
                <div class="lab_r2" id="webdata_msvn"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" id="tx3"></div>
                <div class="lab_r2" id="webdata_ssvn"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" id="tx4"></div>
                <div class="lab_r2" id="webdata_pv_type"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" id="tx5"></div>
                <div class="lab_r2" id="webdata_rate_p"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" style="color:#666666;font-weight:bold;" id="tx6"></div>
                <div class="lab_r2" id="webdata_now_p" style="color:#666666;font-weight:bold;"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" style="color:#666666;font-weight:bold;" id="tx7"></div>
                <div class="lab_r2" id="webdata_today_e" style="color:#666666;font-weight:bold;"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" style="color:#666666;font-weight:bold;" id="tx8"></div>
                <div class="lab_r2" id="webdata_total_e" style="color:#666666;font-weight:bold;"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" style="color:#666666;font-weight:bold;" id="tx9"></div>
                <div class="lab_r2" id="webdata_alarm" style="color:#666666;font-weight:bold;"></div>
        <div class="cl"></div>
        <div class="line"></div>
        <div class="lab_l2" style="color:#666666;font-weight:bold;" id="tx10"></div>
                <div class="lab_r2" id="webdata_utime" style="color:#666666;font-weight:bold;"></div>
        <div class="cl"></div>
        <div class="line"></div>
        </div>
        <div class="sp_20"></div>
        <div class="lab_5 cu b" onclick="upfold(2);child_getH();"><span class="sub" id="p_2">+</span><span id="st2" style="margin-left:3px"></span></div>
                <div class="sp_5"></div>
                <div id="up_2_div" style="display:none">
                <div class="label" id="tx11"></div>
                <div class="lab_r" id="cover_mid"></div>
                <div class="cl"></div>
                <div class="line"></div>
                <div class="label" id="tx12"></div>
                <div class="lab_r" id="cover_ver"></div>
                <div class="cl"></div>
                <div class="line"></div>
                <div class="label" id="tx13"></div>
                <div class="lab_r" id="cover_ap_status" style="color:#666666;font-weight:bold;"></div>
                <div class="cl"></div>
                <div class="line"></div>
                <div class="lab_l" id="ap_ssid">SSID</div>
                <div class="lab_r" id="cover_ap_ssid"></div>
                <div class="cl"></div>
                <div class="line_l"></div>
                <div class="lab_l" id="tx15"></div>
                <div class="lab_r" id="cover_ap_ip"></div>
                <div class="cl"></div>
                <div class="line_l"></div>
                <div class="lab_l" id="tx16"></div>
                <div class="lab_r" id="cover_ap_mac"></div>
                <div class="cl"></div>
                <div class="line_l"></div>
                <div class="label" id="tx17"></div>
                <div class="lab_r" id="cover_sta_status" style="color:#666666;font-weight:bold;"></div>
                <div class="cl"></div>
                <div class="line"></div>
                <div class="lab_l" id="tx18"></div>
                <div class="lab_r" id="cover_sta_ssid"></div>
                <div class="cl"></div>
                <div class="line_l"></div>
                <div class="lab_l" id="tx19"></div>
                <div class="lab_r" id="cover_sta_rssi"></div>
                <div class="cl"></div>
                <div class="line_l"></div>
                <div class="lab_l" id="tx20"></div>
                <div class="lab_r" id="cover_sta_ip"></div>
                <div class="cl"></div>
                <div class="line_l"></div>
                <div class="lab_l" id="tx21"></div>
                <div class="lab_r" id="cover_sta_mac"></div>
                <div class="cl"></div>
                <div class="line_l"></div>
                </div>
                
                <div class="sp_20"></div>
                <div class="lab_5 cu b" onclick="upfold(3);child_getH();"><span class="sub" id="p_3">+</span><span id="st3" style="margin-left:3px"></span></div>
                <div class="sp_5"></div>
                <div id="up_3_div" style="display:none">
                <div class="label" id="tx25"></div>
                <div class="lab_r" id="cover_remote_status_a"></div>
                <div class="cl"></div>
                <div class="line"></div>
                <div class="label" id="tx26"></div>
                <div class="lab_r" id="cover_remote_status_b"></div>
                <div class="cl"></div>
                <div class="line"></div>
                </div>
    </div>
	<script type="text/javascript">
	    initPageText();
	    ready();
	</script>


</body></html>

Can anyone point me in the right direction with a select / template syntax?

Is it even possible to scrape just string values from this source?
I tried everything but can’t grab these values. Is it possible to select everything and then use a regex?
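
As a rough illustration of the regex idea, here is a Python sketch that pulls a value out of the var lines in the HTML posted above. It is not a Home Assistant configuration, only a way to check that a pattern matches before using it elsewhere. `js_var` is a hypothetical helper name, and the sample text is copied from the script block in the log.

```python
import re


def js_var(html, name):
    """Return the string assigned to `var <name> = "..."`, or None if absent."""
    pattern = r'var\s+' + re.escape(name) + r'\s*=\s*"([^"]*)"'
    match = re.search(pattern, html)
    return match.group(1) if match else None


# A few lines copied from the script block in the log.
page = '''var webdata_rate_p = "";
var webdata_now_p = "1420";
var webdata_today_e = "24.60";
var webdata_total_e = "2290.0";'''

power = js_var(page, "webdata_now_p")
print(power)         # the raw string between the quotes
print(float(power))  # converted to a number
```

The same kind of pattern could be used in a PHP script or any other tool that fetches the page, since the values are plain text in the page source.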

Or is it possible to run a PHP script (as I have a working script)?

If you have the option to run PHP scripts, then that is your solution.
Just host the PHP file somewhere and have it scrape and output the value you want; then in Home Assistant it’s a simple scrape.

I did that: I uploaded a PHP scrape page to my website, which my Home Assistant contacts to get the values from a third-party page.
It is much easier to scrape with full programming tools.

Hi,

I am trying to scrape some data from the AccuWeather webpage (they have a MinuteCast that, in my opinion, works nicely: a minute-by-minute forecast for the next two hours).
https://www.accuweather.com/en/cu/havana/122438/minute-weather-forecast/122438

Unfortunately I cannot find the selector. I tried:
title, .title, .minutecast-dial .title, .minutecast-dial.title, and all of these options with p at the end.
No luck. I get this error every time:
ERROR (SyncWorker_18) [homeassistant.components.scrape.sensor] Unable to extract data from HTML
Any hint?

I managed to get this working from the Ginlong monitoring website after setting it to public, but it doesn’t seem to be “recording” the information, only displaying it on a card. Does anyone happen to know how I would get this to work, please? I’m using InfluxDB at the moment for thermostats, so I would like to add it there if possible.

Here is my sensor in configuration.yaml:

This returns: (image)

Sorted it. The system wouldn’t record it because it was seeing it as a text value, not a number, as per below.

Hi Dietlman,
I’m trying to grab the temperature from the same HWg-STE box to my HA with no luck.
My code looks like this:
sensor:
  - platform: scrape
    resource: MY_HWG-STE_IP_XML_WEB_SITE
    name: Tamir_Boiler
    select: "#s215"

And it shows me the same A sign in the temperature result.
Can you please share your code that works?
Where should I add the value_template ?

Thanks a lot,
Tamir

Hi Tamir,

This is how it works for me:

  - platform: scrape
    resource: http://my_sensor_http
    name: WW-Boiler
    select: '.value'
    value_template: '{{ value[:-4] | float }}'
    unit_of_measurement: "°C"
    scan_interval: 360
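
To show what the value_template above does: value[:-4] drops the last four characters of the scraped text and float turns the rest into a number. Template slicing works like Python slicing, so the effect can be sketched in Python. The sample string is an assumption about what the sensor page returns (a number followed by a space, a stray character and the unit); `strip_unit` is a hypothetical helper name.

```python
def strip_unit(value, suffix_len=4):
    """Drop the trailing unit characters and convert the rest to a float."""
    return float(value[:-suffix_len])


# Assumed example of the scraped text; the real device output may differ.
raw = "61.3 Â°C"
print(strip_unit(raw))
```

If the unit suffix on your device has a different length, the number in the slice needs to change to match.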

Hi,
I’m trying to use this component but I get the following error during HA initialization:
FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

Any idea how to solve the problem?
I’m using Home Assistant 0.113.1

arch | armv7l
dev | false
docker | false
hassio | false
installation_type | Home Assistant Core
os_name | Linux
os_version | 4.19.50-v7+
python_version | 3.7.3
timezone | UTC
version | 0.113.1
virtualenv | true
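
That message comes from the HTML parsing library when it is asked for the lxml parser and cannot load it, which usually means lxml is not installed in the Python environment that runs Home Assistant. A quick check, as a Python sketch to run with the interpreter of the same virtualenv; `has_module` is a hypothetical helper name.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None


print("lxml available:", has_module("lxml"))
```

If this prints False inside the virtualenv, installing lxml there is the likely next step; on ARM boards that may require build tools or a prebuilt package.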