Scrape sensor improved - scraping multiple values

The file link is https://aquarea-service.panasonic.com/installer/api/function/status
I added this code to the YAML configuration, but it does not work:

rest:
  - authentication: basic
    username: 'dimare.gabriele'
    password: 'xxxx'
    scan_interval: 30
    resource: 'https://aquarea-service.panasonic.com/installer/api/function/status'
    sensor:
      - name: Panasonic Compressor
        value_template: "{{ value_json['statusDataInfo']['function-status-text-060']['value'] }}"

OK: what’s shown in the logs? If there’s nothing obvious there, replace the value_template line with:

value_template: "{{ value[:100] }}"

That will give you the first 100 characters of the response, which might help with debugging.

Did you reload? Developer Tools / YAML / REST ENTITIES AND NOTIFY SERVICES.

The sensor returns this error:

{"errorCode":4194816,"message":[{"errorMessage":"System error occurred","errorCode":"9999-9999"}]}

This is the same error you get if you open the link directly: https://aquarea-service.panasonic.com/installer/api/function/status

I think the login is not working
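
To see why the template has nothing to return in this situation, the same lookup can be tried outside Home Assistant. The following is only a minimal Python sketch: the helper name `dig` and the sample success body are made up for illustration (the real response layout has not been confirmed in this thread), while the error body is the one quoted above.

```python
import json

# Error body quoted above; the success body is a made-up sample that only
# mirrors the path used in the value_template.
ERROR_BODY = '{"errorCode":4194816,"message":[{"errorMessage":"System error occurred","errorCode":"9999-9999"}]}'
SAMPLE_OK_BODY = '{"statusDataInfo":{"function-status-text-060":{"value":"ON"}}}'

PATH = ["statusDataInfo", "function-status-text-060", "value"]


def dig(body, path):
    """Follow `path` through parsed JSON; return None if any key is missing."""
    node = json.loads(body)
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


print(dig(SAMPLE_OK_BODY, PATH))  # value found when the expected layout is present
print(dig(ERROR_BODY, PATH))      # None: the error body has no statusDataInfo key
```

If the second case is what the sensor receives, the template itself is probably fine and the request is being rejected before any status data is returned.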

Hi web scrapers

Does anyone know if I can scrape the data from here:

https://oxontime.com/departure-single-page/1636

I can see the selectors in my browser’s developer tools but multi-scrape records errors in the log saying the selectors are missing.

Is there another way to get this data into HA?

Thanks in advance

The page is making a request to this URL which is returning JSON. Use a rest sensor to pull in and manipulate that:

https://oxontime.com/pwi/departureBoard/340000093BP3

Thanks @Troon, I’ll give it a go.

Hi Scrapers

I’m pretty new to HA, to YAML, and to anything to do with the smart home community.

I’ve now tried for multiple hours to get this to work, but I just can’t.

I have a problem with authentication in the scraper.
After many hours of trial and error, I’ve found that the website doesn’t give me a session ID cookie, but instead puts the session ID into the URL as a query parameter, like this:

http://192.168.1.74/cgi-bin/overview.tcl?sid=xxxxxx7129387

Is there any way to bypass this or work around it?
If I skip the form_submit and just enter the full URL with the session ID, it scrapes without any problem.

- resource: http://192.168.1.74/cgi-bin/overview.tcl
  scan_interval: 60
  log_response: true
  form_submit:
    submit_once: True
    resource: http://192.168.1.74/cgi-bin/login_page.tcl
    select: "#loginForm"
    input:
        user: !secret solar_user
        pw: !secret solar_password
  sensor:
  - unique_id: solceller_prod_live
    name: Solcelle produktion Live
    select: "#curr_power"
  - unique_id: solceller_prod_idag
    name: Solcelle produktion i dag
    select: "#prod_today"
    on_error:
        log: warning

Thanks in advance

You’d be partly out of luck, because:

The site will probably set a cookie in the session. The sessions are reused between all scraping requests.

From Form submit functionality · danieldotnl/ha-multiscrape Wiki · GitHub.

What you can do, though, is use RESTful Sensor - Home Assistant or RESTful - Home Assistant, with the former being slightly the better option.

First, you’ll need to figure out the details of the form submit. If it’s a POST to a URL, then that’s what you’ll use to configure the sensor, but you also need to check what response you receive. Hopefully, the session token is easy to extract from it; that token then becomes your sensor’s value.

If it’s more complicated, then write a Bash script using curl and use it in a Command Line - Home Assistant sensor.

At this point you should have the token in a sensor, which means you’ll drop the auth stuff from the scraper’s config and use the token sensor in a template in the request URL.

resource_template: "http://192.168.1.74/cgi-bin/overview.tcl?sid={{ states('sensor.my_token_sensor') }}"

Note that anybody who has access to your HA will also have access to that token.

Lastly, you need to figure out how long a session token is valid and set the sensor’s scan interval accordingly, or write an automation using homeassistant.update_entity to update it on a specific schedule.

EDIT: Unless you’re using it for other cases where you want to retrieve multiple values in one go, you can actually just stick with the built-in Scrape - Home Assistant.
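
The point about token lifetime can be illustrated outside Home Assistant. This is a rough Python sketch of the idea only, not how HA schedules updates: the class `TokenCache`, the `fake_login` function, and the 600-second lifetime are all hypothetical stand-ins, and the real validity window has to be measured on the device.

```python
import time


class TokenCache:
    """Re-fetch a session token only after its assumed lifetime has passed."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch        # hypothetical callable that logs in and returns a token
        self._ttl = ttl_seconds    # assumed validity window; measure it for your device
        self._clock = clock
        self._token = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            self._token = self._fetch()
            self._fetched_at = now
        return self._token


# Demo with a fake login function and a fake clock, so nothing touches the network.
calls = []
fake_now = [0.0]


def fake_login():
    calls.append(1)
    return "sid-%d" % len(calls)


cache = TokenCache(fake_login, ttl_seconds=600, clock=lambda: fake_now[0])
first = cache.get()     # triggers a login
fake_now[0] = 599.0
second = cache.get()    # still inside the window: reuses the token
fake_now[0] = 600.0
third = cache.get()     # window elapsed: logs in again
print(first, second, third)
```

In Home Assistant the same effect comes from the token sensor's scan interval: the token is only requested again once the interval has elapsed.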


Thank you so much for the detailed answer!

I will try it as soon as possible!

EDIT:
I have now tried a POST request in Postman and got it working: I get the data I want. However, the sid comes back inside an HTML body, and I can’t get that to work in the RESTful sensor.

Is there any way to get this into RESTful, multiscrape, or any other integration so that I can extract that value, put it into a URL, and scrape the data afterwards?

Here is the return value from Postman:

<html>

<head></head>

<body onLoad="top.window.location='/cgi-bin/frameset.tcl?sid=8351692669043402850'">Logger ind...</body>

</html>

Hello, I am looking to retrieve an energy value from an EDF ENR panel.

Link to the login page: https://espaceclient.edfenr.com/
Source code of the text I want to retrieve:

<div class="truncate text-sm font-bold leading-4 mb-px">--</div>

(It shows "--" because it’s 00:00 and therefore there is no sun, hihi.)

For information, when I log in I am redirected straight to the right page to see the consumption :smiley:

What does

http://192.168.1.74/cgi-bin/frameset.tcl?sid=8351692669043402850

return? Does that sid change each time or is it constant?


Hi Troon

Thanks for your response!

The frameset.tcl returns the default entry page. Inside the entry page is an iframe view of the overview.tcl page that I want to scrape.

I can only get a sid once. If I try again, the request is denied unless the sid hasn’t been used for 10 minutes or 24 hours have passed since it was issued.

It’s a bit hard to simulate this since I don’t have access to that endpoint of yours, but I’ll try to describe a few ideas.

Option 1:
I’ve saved your HTML snippet to a file called test.html. If you want to go the command line sensor route, you can create a curl command and process the output, e.g. like this:

cat test.html | sed -n "s/.*sid=\([0-9]*\).*/\1/p"

The part before the pipe | will be the curl command.

Option 2:
This should in principle also work with a REST sensor: Instead of using value_json in your template, just use value, which is the raw output. You can then use HA’s regex functions on it.

Something like this:

{% set value = "<html>

<head></head>

<body onLoad=\"top.window.location='/cgi-bin/frameset.tcl?sid=8351692669043402850'\">Logger ind...</body>

</html>" %}

{{ (value | regex_findall_index(find="sid=[0-9]+", index=0, ignorecase=True))[4:] }}

Option 3:
You might indeed be able to do this with scrape or multiscrape. You’ll need to use your browser’s dev tools to determine the CSS selector and then set the attribute (onLoad). You will again have access to a value in your template (for the scrape sensor’s value_template) and should be able to do something similar with a regex as above.

I would personally go for option 2.
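
For anyone who wants to check the extraction logic before putting it into a sensor, here is the same idea as a small Python sketch run outside Home Assistant. The function name `extract_sid` is made up for this example; the HTML is the snippet posted earlier. It uses a capture group where the template above slices off the first four characters, which should give the same result.

```python
import re

# HTML body returned by the login request, as quoted earlier in the thread.
html = (
    "<html>\n\n<head></head>\n\n"
    "<body onLoad=\"top.window.location='/cgi-bin/frameset.tcl?sid=8351692669043402850'\">"
    "Logger ind...</body>\n\n</html>"
)


def extract_sid(text):
    """Return the digits following 'sid=' or None when no sid is present."""
    match = re.search(r"sid=([0-9]+)", text, flags=re.IGNORECASE)
    return match.group(1) if match else None


sid = extract_sid(html)
print(sid)
```

Returning None for a missing sid makes a failed login easy to spot, instead of raising an error in the middle of the extraction.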

So after talks back and forth with @parautenbach we finally got the solution.

First of all, we tried to do it with curl, but for some odd reason it just wouldn’t give me any output. After about an hour of ripping my hair out, I decided to go back to REST and try a GET method. Somehow it now responded; I think that was because scan_interval was now set (earlier it had just come up with an error, probably because it was already overwriting the sid).

The code was now:

sensor:
  - platform: rest
    name: SID
    resource: http://192.168.1.74/cgi-bin/handle_login.tcl?user=HA&pw=XXXXXXXX&submit=Log+p%C3%A5&sid=
    method: GET
    value_template: '{{ (value | regex_findall_index(find="sid=[0-9]+", index=0, ignorecase=True))[4:] }}'
    scan_interval: 86400  # 24 hours = 60 * 60 * 24

And it responded with only the SID (session ID).

Now on to Multiscrape.
It needed the variable in the resource template, and after @parautenbach helped me out yet again, that worked as well.

- resource_template: "http://192.168.1.74/cgi-bin/overview.tcl?sid={{ states('sensor.sid') }}"
  scan_interval: 60
  name: Standard
  log_response: true
  sensor:
  - unique_id: solceller_prod_live2
    name: Solcelle produktion Live2
    select: "#curr_power"
  - unique_id: solceller_prod_idag2
    name: Solcelle produktion i dag2
    select: "#prod_today"
    on_error:
        log: warning

A huge thanks to @parautenbach who was just amazing and dedicated time to help me solve my problem.
What an amazing community to enter, for a newcomer who just installed HA 10 days ago.


I lost all my values and am totally stuck. My old SMA inverter is only reachable via scrape; no other SMA integration works. Alas, the scrape has now stopped working too. What am I missing, and how do I convert this old scrape configuration to the new format?

# Web scraping for SMA inverter values
- name: pv_einheit
  platform: scrape
  resource: "https://www.sunnyportal.com/Templates/PublicPage.aspx?page=my page"
  select: "#ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldUnit"

- name: pv_periode
  platform: scrape
  resource: "https://www.sunnyportal.com/Templates/PublicPage.aspx?page=my page"
  select: "#ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldPeriodTitle"

- name: pv_wert
  platform: scrape
  resource: "https://www.sunnyportal.com/Templates/PublicPage.aspx?page=my page"
  select: "#ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldValue"
  value_template: '{% if is_state("sensor.pv_einheit", "Wh") %}{{ value | float / 1000 }}{% else %}{{ value | float }}{% endif %}'
unit_of_measurement: 'kW/h'

Assuming the web page hasn’t changed and the old selects worked, then according to the docs this should work:

scrape:
  - resource: "https://www.sunnyportal.com/Templates/PublicPage.aspx?page=my page"
    sensor:
      - name: pv_einheit
        select: "#ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldUnit"
      - name: pv_periode
        select: "#ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldPeriodTitle"
      - name: pv_wert
        select: "#ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldValue"
        value_template: '{% if is_state("sensor.pv_einheit", "Wh") %}{{ value|float(0)/1000 }}{% else %}{{ value|float(0) }}{% endif %}'
        unit_of_measurement: 'kWh'

I’ve corrected your (incorrectly-indented) unit_of_measurement to 'kWh' as 'kW/h' is meaningless.

There’s a potential race condition here: the pv_wert template assumes that the pv_einheit scrape is evaluated first. That assumption might need testing.
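
To make the unit handling and the race condition concrete, here is a minimal Python sketch of what the pv_wert template is doing. It runs outside Home Assistant, the function names are made up for this example, and `to_float` only roughly imitates the template's float(0) filter.

```python
def to_float(raw, default=0.0):
    """Roughly imitate float(0): fall back to `default` on bad input."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return default


def energy_kwh(value, unit):
    """Return the yield in kWh, dividing by 1000 only when the unit is Wh."""
    number = to_float(value)
    return number / 1000 if unit == "Wh" else number


print(energy_kwh("850", "Wh"))       # unit known: converted to kWh
print(energy_kwh("12.5", "kWh"))     # already in kWh: passed through
print(energy_kwh("n/a", "Wh"))       # unparsable value: falls back to 0
print(energy_kwh("850", "unknown"))  # unit not scraped yet: left unconverted
```

The last call is the race in miniature: if the unit sensor has not been updated when the value is evaluated, a Wh reading is reported as if it were kWh.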

Thank you, I’ll give it a try!

Guys, do you know if it is possible to get all div siblings, or at least the first 10, from a site using a selector like this?

div.container-table.hlasenie-block > .table-row

I have several similar websites that show warnings, and the content is dynamic, so I don’t know how many entries a page will show. I really don’t know how to handle this type of scraping. Should I add the first 10 selectors as attributes of a sensor, or do I need to do it with a Python script?

I have the exact same page as you for my solar panels. This is my page_soup.txt:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<style type="text/css">
.in_body
{
	margin-top:0px;
	margin-left:0px;
	margin-right:0px;
	margin-bottom:0px;
	background-color:transparent;
}
.div_c
{
	margin-left:50px;
	margin-right:50px;
	margin-top:50px;
	margin-bottom:50px;
}
.cu
{
	cursor:pointer;
}
.b
{
	font-weight:bold;
}
.lab_5
{
	font-size:16px;
	color:#666666;
	margin-left:-20px;
}
.lab_l2
{
	float:left;
	width:32%;
	color:#666666;
	margin-bottom:-2px;
	font-size:14px;
}
.lab_r2
{
	float:left;
	width:68%;
	color:#666666;
	text-align:right;
	font-size:14px;
}
.cl
{
	clear:left;
}
.line
{
	height:1px;
	background-color:#666666;
	width:100%;
	margin-top:5px;
	margin-bottom:5px;
}
.sp_5
{
	height:5px;
	width:500px;
}
.sp_20
{
	height:20px;
	width:500px;
}
.label
{
	float:left;
	width:50%;
	color:#666666;
	margin-bottom:-2px;
	font-size:14px;
}
.lab_r
{
	float:left;
	width:50%;
	color:#666666;
	text-align:right;
	font-size:14px;
}
.lab_l
{
	float:left;
	width:40%;
	color:#666666;
	margin-bottom:-2px;
	margin-left:10%;
	font-size:14px;
}
.line_l
{
	height:1px;
	background-color:#666666;
	width:450px;
	margin-top:5px;
	margin-bottom:5px;
	margin-left:50px;
}
.sub
{
    display:inline-block;
    width:16px;
    text-align:center;
}
</style>
<script type="text/javascript">
var height=0;function fileText(id,value){if(document.getElementById(id)){document.getElementById(id).innerHTML=value}}function changeFont(){reCon("main_div").style.fontFamily=window.parent.reFont()}function child_getH(){var nh=document.body.offsetHeight+100;if(nh<500||nh==null){nh=500}if(height!=nh){height=nh;window.parent.child_height(height)}}function reCon(id){return document.getElementById(id)}function ready(){try{window.parent.show_ifr()}catch(e){}child_getH()}function show(v){var c=document.getElementById(v);if(c!=null){c.style.display=""}}function hide(v){var c=document.getElementById(v);if(c!=null){c.style.display="none"}};
</script>
<script type="text/javascript">
var webdata_sn = "";
var webdata_msvn = "V1.18.00";
var webdata_ssvn = "V1.18.00";
var webdata_pv_type = "";
var webdata_rate_p = "";
var webdata_now_p = "1441";
var webdata_today_e = "6.72";
var webdata_total_e = "10851.4";
var webdata_alarm = "";
var webdata_utime = "0";
var cover_mid = "";
var cover_ver = "";
var cover_wmode = "APSTA";
var cover_ap_ssid = "";
var cover_ap_ip = "";
var cover_ap_mac = "";
var cover_sta_ssid = "";
var cover_sta_rssi = "70%";
var cover_sta_ip = "";
var cover_sta_mac = "";
var status_a = "1";
var status_b = "0";
var status_c = "0";

function initPageText(){var list=window.parent.reList("status");fileText("st1",list["t1"]);fileText("st2",list["t2"]);fileText("st3",list["t3"]);for(var i=1;i<=27;i++){if(i!=14){fileText("tx"+i,list[i])}}changeFont();child_getH()}function upfold(v){if(document.getElementById("up_"+v+"_div").style.display=="none"){show("up_"+v+"_div");reCon("p_"+v).innerHTML="-"}else{hide("up_"+v+"_div");reCon("p_"+v).innerHTML="+"}}function init_main_page(){var on=window.parent.reTip("1");var off=window.parent.reTip("2");document.getElementById("cover_mid").innerHTML=cover_mid;document.getElementById("cover_ver").innerHTML=cover_ver;document.getElementById("cover_ap_status").innerHTML=off;document.getElementById("cover_sta_status").innerHTML=off;if(cover_wmode!="STA"){document.getElementById("cover_ap_status").innerHTML=on;document.getElementById("cover_ap_ssid").innerHTML=cover_ap_ssid;document.getElementById("cover_ap_ip").innerHTML=cover_ap_ip;document.getElementById("cover_ap_mac").innerHTML=cover_ap_mac}if(cover_wmode!="AP"){document.getElementById("cover_sta_status").innerHTML=on;document.getElementById("cover_sta_ssid").innerHTML=cover_sta_ssid;document.getElementById("cover_sta_rssi").innerHTML=cover_sta_rssi;document.getElementById("cover_sta_ip").innerHTML=cover_sta_ip;document.getElementById("cover_sta_mac").innerHTML=cover_sta_mac}if(webdata_sn==""){webdata_sn="---"}fileText("webdata_sn",webdata_sn);if(webdata_msvn==""){webdata_msvn="---"}fileText("webdata_msvn",webdata_msvn);if(webdata_ssvn==""){webdata_ssvn="---"}fileText("webdata_ssvn",webdata_ssvn);if(webdata_pv_type==""){webdata_pv_type="---"}fileText("webdata_pv_type",webdata_pv_type);if(webdata_rate_p==""){webdata_rate_p="---"}fileText("webdata_rate_p",webdata_rate_p+" W");if(webdata_now_p==""||webdata_now_p==0){webdata_now_p="---"}fileText("webdata_now_p",webdata_now_p+" W");if(webdata_today_e==""){webdata_today_e="---"}fileText("webdata_today_e",webdata_today_e+" 
kWh");if(webdata_total_e==""){webdata_total_e="---"}fileText("webdata_total_e",webdata_total_e+" kWh");if(webdata_alarm==""){webdata_alarm="---"}fileText("webdata_alarm",webdata_alarm);if(webdata_utime==""){if(document.getElementById("webdata_sn").innerHTML=="---"){webdata_utime="---"}else{webdata_utime=value+window.parent.reTip("5")}}fileText("webdata_utime",webdata_utime);var st_en=window.parent.reTip("3");var st_dis=window.parent.reTip("4");var st_un=window.parent.reTip("41");if(status_a=="1"){document.getElementById("cover_remote_status_a").innerHTML=st_en}else{if(status_a=="0"){document.getElementById("cover_remote_status_a").innerHTML=st_dis}else{document.getElementById("cover_remote_status_a").innerHTML=st_un}}if(status_b=="1"){document.getElementById("cover_remote_status_b").innerHTML=st_en}else{if(status_b=="0"){document.getElementById("cover_remote_status_b").innerHTML=st_dis}else{document.getElementById("cover_remote_status_b").innerHTML=st_un}}};

</script>
</head>
<body class="in_body" onload="init_main_page();">
<div class="div_c" id="main_div">
<div class="lab_5 cu b" onclick="upfold(1);child_getH();"><span class="sub" id="p_1">-</span><span id="st1" style="margin-left:3px"></span></div>
<div class="sp_5"></div>
<div id="up_1_div">
<div class="lab_l2" id="tx1"></div>
<div class="lab_r2" id="webdata_sn"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx2"></div>
<div class="lab_r2" id="webdata_msvn"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx3"></div>
<div class="lab_r2" id="webdata_ssvn"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx4"></div>
<div class="lab_r2" id="webdata_pv_type"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx5"></div>
<div class="lab_r2" id="webdata_rate_p"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx6" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_now_p" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx7" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_today_e" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx8" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_total_e" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx9" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_alarm" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l2" id="tx10" style="color:#666666;font-weight:bold;"></div>
<div class="lab_r2" id="webdata_utime" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
</div>
<div class="sp_20"></div>
<div class="lab_5 cu b" onclick="upfold(2);child_getH();"><span class="sub" id="p_2">+</span><span id="st2" style="margin-left:3px"></span></div>
<div class="sp_5"></div>
<div id="up_2_div" style="display:none">
<div class="label" id="tx11"></div>
<div class="lab_r" id="cover_mid"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="label" id="tx12"></div>
<div class="lab_r" id="cover_ver"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="label" id="tx13"></div>
<div class="lab_r" id="cover_ap_status" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l" id="ap_ssid">SSID</div>
<div class="lab_r" id="cover_ap_ssid"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="lab_l" id="tx15"></div>
<div class="lab_r" id="cover_ap_ip"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="lab_l" id="tx16"></div>
<div class="lab_r" id="cover_ap_mac"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="label" id="tx17"></div>
<div class="lab_r" id="cover_sta_status" style="color:#666666;font-weight:bold;"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="lab_l" id="tx18"></div>
<div class="lab_r" id="cover_sta_ssid"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="lab_l" id="tx19"></div>
<div class="lab_r" id="cover_sta_rssi"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="lab_l" id="tx20"></div>
<div class="lab_r" id="cover_sta_ip"></div>
<div class="cl"></div>
<div class="line_l"></div>
<div class="lab_l" id="tx21"></div>
<div class="lab_r" id="cover_sta_mac"></div>
<div class="cl"></div>
<div class="line_l"></div>
</div>
<div class="sp_20"></div>
<div class="lab_5 cu b" onclick="upfold(3);child_getH();"><span class="sub" id="p_3">+</span><span id="st3" style="margin-left:3px"></span></div>
<div class="sp_5"></div>
<div id="up_3_div" style="display:none">
<div class="label" id="tx25"></div>
<div class="lab_r" id="cover_remote_status_a"></div>
<div class="cl"></div>
<div class="line"></div>
<div class="label" id="tx26"></div>
<div class="lab_r" id="cover_remote_status_b"></div>
<div class="cl"></div>
<div class="line"></div>
</div>
</div>
<script type="text/javascript">
	    initPageText();
	    ready();
	</script>
</body>
</html>

And I’m using your sensor config, but I’m still getting:

Scraper_noname_3 # Today Solar Generation # Unable to scrape data: Could not find a tag for given selector Consider using debug logging and log_response for further investigation.

Any pointers on what is going on here, or which selector I should be using?

You only have one <script> element ahead of the data, so use:

select: "body > script:nth-child(2)"

instead.
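
Whichever selector ends up returning the script text, the readings still have to be pulled out of the JavaScript variable assignments. Here is a minimal Python sketch of that step, run outside Home Assistant; the function name `read_js_vars` is made up for this example, and the sample lines are copied from the page dump above. The same pattern could presumably be used with a regex filter in a value_template, but that has not been tested here.

```python
import re

# A few lines copied from the page's second <script> element shown above.
script_text = '''
var webdata_now_p = "1441";
var webdata_today_e = "6.72";
var webdata_total_e = "10851.4";
var webdata_alarm = "";
'''


def read_js_vars(text):
    """Map each `var name = "value";` assignment to a name -> value dict."""
    return dict(re.findall(r'var\s+(\w+)\s*=\s*"([^"]*)";', text))


values = read_js_vars(script_text)
print(values["webdata_now_p"])    # current power (the page appends " W")
print(values["webdata_today_e"])  # today's energy (the page appends " kWh")
```

Empty strings such as webdata_alarm are kept as empty values, so a sensor built on this would still need to decide what to report in that case.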