Scrape sensor improved - scraping multiple values

SirBacon · October 6, 2024, 2:26pm

SirBacon:

multiscrape:
  - name: Multiscrape Data
    resource_template: 'http://192.168.2.31/get.json?f=$.status.*'
    scan_interval: 10
    log_response: true
    sensor:
      - unique_id: multiscrape_VVW_mode_test
        name: Vloerverwarming Modus
        value_template: '{{ value_json.status.outputs.heater.mode }}'

Hi willembuys,

Thanks for your reply, and apologies for the late response; I have been very busy. I just added the multiscrape: block to my configuration.yaml (I had used the template before) and checked the response. (I’m running HA in Docker on my Synology, by the way.)

The file page_response_body.txt is there, but its contents suggest an error. It is an HTML file that contains:
(function(){var a=new XMLHttpRequest();a.open("get","/missing",true);a.send();a.onreadystatechange=function(){if(a.readyState==4&&(a.status==200||a.status==304)){var c=String(a.responseText);var e=document.open("text/html","replace");e.write(c);e.close()}else{var d={en:"The page you are looking for cannot be found.",
(followed by the same message in a lot of other languages)

This seems to indicate the page cannot be found. Which is weird, I just looked it up with no issue. Any pointers as to what I need to change?

danieldotnl · October 6, 2024, 2:52pm

Maybe the * in the url is a problem. Could you try a page without it or replace it by %2A?

SirBacon · October 6, 2024, 3:09pm

I removed the .* at the end of the url, but no change in the output. Changing it to %2A also did not work. Same output in both cases as the original error.

Edit:
Intriguingly, when I enter ‘http://192.168.2.31/get.json?f=$.status.*’ directly into Firefox, I do get a 404 response, which suggests the page cannot be found. In my Python program (UMR2toMQTT/UMR2toJSON.py at 9bb3acffb76fbe929a0e9de4c04c2a0e847d32b0 · Sir-Bacon/UMR2toMQTT · GitHub) I do get a nice JSON response. Is that due to the requests module being used?

Edit2:
Should I just completely forget about the json approach and use html instead? The website of the UMR2 controller is tabbed. If I request http://192.168.2.138/#Status I get directly the correct tab in view. The variable status is then defined as ‘stcv’ (status CV), see also screenshot of the Inspect window below.

danieldotnl · October 6, 2024, 4:18pm

Could it be that you are mixing up IPs? Probably there is a good reason but just want to be sure. The IP address in your code is different from the one in the multiscrape config.

SirBacon · October 6, 2024, 4:27pm

Dang, good catch! Yup, I mixed up the IPs.

With the correct IP of the UMR I do get a good response. With the ‘.*’ included the file page_response_body.txt now only has 4 lines of contents:

{
	"type": "WTH_Regulator",
	"id": "D602020C1A1E0794"
}

This seems to be a json response, albeit with only 1 variable. Now how to get the rest of the variables?

Edit:
When I enter http://192.168.2.138/get.json?f=$.status.* in Firefox, I get the complete JSON with all the elements I want to scrape.

Edit2:
With %2A I also get only the 4-line JSON response. How do I get Multiscrape to return the complete JSON?

danieldotnl · October 7, 2024, 8:17am

That’s really strange… But since it’s local to you, I’m afraid I cannot reproduce it. Maybe paste your config in the multiscrape.get_content action (service) and play around a bit more with the params (specifying them separately in the config).

action: multiscrape.get_content
data:
  name: Multiscrape Data
  resource_template: 'http://192.168.2.31/get.json'
  params:
    f: '$.status.*'
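
For anyone wondering what the %2A suggestion and the separate params amount to, here is a minimal Python sketch using only the standard library. It is not how Multiscrape builds its request (its HTTP client may encode characters differently); it only shows that `$` and `*` are reserved characters that a client may percent-encode. The address is the UMR2 IP confirmed earlier in the thread.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://192.168.2.138/get.json"  # UMR2 address from this thread
JSON_PATH = "$.status.*"                    # value of the f parameter

# Percent-encode every reserved character in the path expression.
encoded_path = quote(JSON_PATH, safe="")

# Build the full URL from the base resource plus a separate parameter dict.
full_url = f"{BASE_URL}?{urlencode({'f': JSON_PATH})}"

print(encoded_path)
print(full_url)
```

If the device only accepts the literal characters, comparing this encoded URL against the unencoded one in a browser is a quick way to find out.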

willembuys · October 15, 2024, 6:37pm

It works like a charm.
However, a different sensor, which I scrape every 5 minutes and which uses the same login through form-submit, throws an error every 5 minutes, but otherwise scrapes just fine.
This is the error:

Exception in form-submit feature. Will continue trying to scrape target page.

cannot unpack non-iterable NoneType object

Any suggestion as to what is causing this? Which information can I provide to help investigate?
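
As background on the message itself: it is a generic Python error raised when code tries to unpack a value into several names but the value is None. The sketch below reproduces it in isolation. It does not show where this happens inside Multiscrape; the function name is hypothetical and only stands in for "something that found nothing and returned None".

```python
def find_form():
    """Hypothetical helper that finds nothing and therefore returns None."""
    return None


try:
    # Unpacking expects an iterable with two items, but receives None.
    action, fields = find_form()
except TypeError as err:
    message = str(err)

print(message)
```

A likely place to look is therefore any step of the form-submit that can come back empty, for example a form selector that no longer matches the login page.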

SirBacon · November 5, 2024, 8:42pm

OK, this is a reply to myself, just to document the solution. In the end Daniel assisted me and we found one: Multiscrape turned out not to be necessary, because a normal HA sensor can do this. A single request returns the JSON, from which multiple sensors can be read. That was my misunderstanding.

Now I use the following sensor configuration (note that it sits under the rest: key rather than scrape:), really easy:

rest:
  - resource: 'http://192.168.2.138/get.json?f=$.status.*'
    scan_interval: 10
    sensor:
      - name: Vloerverwarming Modus
        value_template: '{{ value_json.status.outputs.heater.mode }}'
      - name: Vloerverwarming Pomp
        unit_of_measurement: "%"
        value_template: '{{ value_json.status.outputs.pump.speed }}'
      - name: "Vloerverwarming Klep"
        unit_of_measurement: "%"
        value_template: '{{ value_json.status.outputs.valves.8.state }}'
      - name: "Vloerverwarming Temperatuur In"
        unit_of_measurement: "°C"
        value_template: '{{ value_json.status.inputs.max.temperature }}'
      - name: "Vloerverwarming Temperatuur Uit"
        unit_of_measurement: "°C"
        value_template: '{{ value_json.status.inputs.return.temperature }}'
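
The idea behind this config, one fetch feeding several sensors, can be sketched in plain Python. The response body below is hypothetical (invented values; only the key paths come from the config above), and `dig` is an illustrative helper, not part of Home Assistant.

```python
import json

# Hypothetical response body; only the key paths are taken from the config.
SAMPLE_BODY = """
{"status": {
  "outputs": {"heater": {"mode": "heating"}, "pump": {"speed": 40}},
  "inputs": {"max": {"temperature": 31.5}, "return": {"temperature": 27.0}}
}}
"""

# Sensor name -> dotted path, mirroring the value_template entries.
PATHS = {
    "Vloerverwarming Modus": "status.outputs.heater.mode",
    "Vloerverwarming Pomp": "status.outputs.pump.speed",
    "Vloerverwarming Temperatuur In": "status.inputs.max.temperature",
    "Vloerverwarming Temperatuur Uit": "status.inputs.return.temperature",
}


def dig(data, dotted_path):
    """Walk a nested dict following a dotted path such as 'a.b.c'."""
    for key in dotted_path.split("."):
        data = data[key]
    return data


payload = json.loads(SAMPLE_BODY)  # parsed once, like a single fetch
readings = {name: dig(payload, path) for name, path in PATHS.items()}
print(readings)
```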

Thanks again to Daniel for helping me out.

Just to be clear, something strange did happen in the reply from the UMR2 and the processing by Multiscrape. But since Daniel does not have that controller, he cannot reproduce it.

danieldotnl · November 11, 2024, 9:37pm

I fixed this in release v8.0.3!

sri4iot · November 18, 2024, 3:59am

Thanks for this wonderful utility, Daniel. It comes in handy when the data is dynamic and lives on an online website, but we only want one specific value.
I am trying to scrape a site with daily changing data about moon phases and other astrological information. I had trouble with one detail that sits in a table whose rows are dynamic: rows are added or removed depending on the day's values, so index-based scraping did not return the correct data. I searched this thread for examples of identifying a row by a specific value and, after some investigation, came up with the configuration below. Sharing it here for anyone with similar needs.

  resource: https://websitelinkthatyouwanttoscrape
  scan_interval: 86400 #scrape once a day
  headers:
    User-Agent: Mozilla/5.0
  sensor:
    - unique_id: sensor_id
      name: Sensor Name as per your need
      select: table:nth-child(2) > tr:nth-child(2) > td > table > tr:contains("rowheadingyouwanttosearchfor") # For my use case, YMMV
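
To illustrate why matching on the row heading is more robust than a fixed row index, here is a small standard-library Python sketch. It is not what the integration does internally; the HTML fragment, the class and the function names are all made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the row order may change from day to day.
SAMPLE_HTML = """
<table>
  <tr><td>Sunrise</td><td>06:12</td></tr>
  <tr><td>Moon phase</td><td>Waxing gibbous</td></tr>
  <tr><td>Sunset</td><td>18:47</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def find_row(html, heading):
    """Return the cells of the first row containing heading, or None."""
    parser = RowCollector()
    parser.feed(html)
    for cells in parser.rows:
        if any(heading in cell for cell in cells):
            return [cell.strip() for cell in cells]
    return None


row = find_row(SAMPLE_HTML, "Moon phase")
print(row)
```

Because the lookup keys on the heading text, adding or removing rows above the target does not change the result.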

dw1562 · January 27, 2025, 2:41am

OMG - What a thread! I’m with Powershop Australia and I would like to try to get my TOU rates, my daily usage and daily solar feed in into HA. A once a day update will suffice as they only read my smart meter once per day so there is no point in hammering their web server for no reason.

I will try to read all of the above and I think there is another smaller thread too but if anybody can put together a bullet point list of the various steps required to achieve this I would be really grateful.

willembuys · February 6, 2025, 10:11am

I started getting errors in Multiscrape after updating to HA Core 2025.2.0. I have downgraded back to HA Core 2025.1.4 and the sensors are available again. I haven’t found the time to look into this, but wonder whether other people encountered the same issue.

parautenbach · February 6, 2025, 4:30pm

For example?

willembuys · February 9, 2025, 6:50pm

It is fixed by release 8.0.5 of Multiscrape. The issue was caused by a change in the httpx library that was included in HA Core 2025.2.0.

mchk · March 8, 2025, 3:44am

Could you clarify one detail, please? It’s probably something very obvious, but I haven’t found an answer.
Is it possible to add several child elements (with sequential numbers) to one sensor?
When I test selectors on https://try.jsoup.org/, I can replace the item number with the character “n” and get all the strings in the output. Is there a similar mechanism in the scraping integration, or does it require a Python script?

willembuys · June 20, 2025, 9:31am

I have run into an issue where scrapes that worked for a long time no longer return the expected body.
My guess is that the server I log in to now checks for this header:

X-Requested-With: 'XMLHttpRequest'

When I add this header to the request, nothing changes. Is my understanding correct that this cannot be resolved?