Scrape sensor improved - scraping multiple values

Hi all, I am trying to integrate sensors using Multiscrape, but can’t get it to work.

Situation:
HA running in Docker on Synology, Multiscrape installed from HACS

I’m trying to read sensors from my floor heating controller, for which I used to run a Python script on my Synology: GitHub - Sir-Bacon/UMR2toMQTT: Python routine to read out UMR2 floorheating controller and publish values to MQTT
That script would read out the JSON from the controller’s web page and send it to MQTT for HA to pick up. It stopped working recently, and when I saw Multiscrape, my thought was that I could pull the scraping inside HA and stop using the script altogether. Much easier.

Now I’m trying to get this working, but I’m failing so far. When I drop my value_template into the Developer Tools template editor I get the error ‘UndefinedError: ‘value_json’ is undefined’. I believe this means the JSON is not valid. But I know it is, it has always worked like that.
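(To test the template in the Developer Tools editor at all, value_json apparently has to be defined by hand, since it only exists when an integration renders the template against a real response; something like this, with the JSON shape below being just a guess:)

{% set value_json = {
  "status": {
    "outputs": {
      "heater": {"mode": "auto"}
    }
  }
} %}
{{ value_json.status.outputs.heater.mode }}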

The following code does not work:

multiscrape:
  - name: Multiscrape Data
    resource_template: 'http://192.168.2.31/get.json?f=$.status.*'
    scan_interval: 10
    log_response: true
    sensor:
      - unique_id: multiscrape_VVW_mode_test
        name: Vloerverwarming Modus
        value_template: '{{ value_json.status.outputs.heater.mode }}'

sensor:
  - platform: rest
    resource: 'http://192.168.2.31/get.json?f=$.status.*'
    name:  Rest VVW mode test
    value_template: '{{ value_json.status.outputs.heater.mode }}'

This is to test with 1 value, I want to extract at least 5 values/states from the json.

Any pointers on where I need to look for a solution? Much appreciated.

Since you have enabled debugging, did you check that Multiscrape is indeed getting a JSON as response? You can check page_response_body.txt in homeassistant/multiscrape/‘name of sensor’.
I have had issues with square brackets at the start and end of JSON, so I removed them with a replace.
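The idea is simply to strip the brackets before parsing, e.g. (a toy example you can paste into the template editor; the key name is made up):

{{ ('[{"mode": 2}]' | replace('[', '') | replace(']', '') | from_json).mode }}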

I wonder if this ever got resolved? I am currently using Multiscrape successfully to log in to a website and retrieve some values, but I can also perform actions on the website (after logging in) by requesting a URL. I can’t have this URL scraped on a schedule, because that would mean the action is performed e.g. every time I restart Home Assistant. But without the form-submit functionality of Multiscrape, there’s no login.

Any ideas on how to only call the URL ad hoc, or how to pass the headers on to e.g. a rest command?
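The rest_command part itself seems easy enough; it’s the authenticated session that I don’t know how to carry over (everything below is just a placeholder sketch):

rest_command:
  website_action:
    url: 'https://example.com/do-something'
    method: get
    headers:
      Cookie: 'session=???'  # this is the part I can't get out of Multiscrape's login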

No it is not. I don’t understand how the integration is able to write a file but not to create a sensor.

Good to know I’m not the only one struggling with that. Would you mind sharing how you did the workaround of opening the debug files with the responses and extracting the values from them?

I haven’t been able to look at this yet, but in v8.0.1, which is currently available as a pre-release, you can set scan_interval to 0 and the action will never be performed (not even on startup) unless manually triggered. Does that help, or is there anything else in a rest command that you cannot achieve with Multiscrape?
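To illustrate, roughly like this (all names below are placeholders, and the form_submit block is just the usual login config):

multiscrape:
  - name: website_action
    resource: 'https://example.com/do-something'
    scan_interval: 0          # never scraped automatically, not even on startup
    form_submit:
      resource: 'https://example.com/login'
      select: 'form'
      input:
        username: !secret site_user
        password: !secret site_pass
    sensor:
      - name: Website action result
        select: 'body'

You would then only trigger it manually whenever the action should actually happen.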


I don’t understand how the integration is able to write a file but not to create a sensor.

I’m looking forward to your pull request :wink:

That looks like something I can work with. I will test when I find the time!

Has anyone had issues with Powershop NZ? Something changed around the 30th of September and now I have no sensors. The logs show the following:

2024-10-01 11:20:09.418 ERROR (MainThread) [custom_components.multiscrape.coordinator] Powershop # Exception in form-submit feature. Will continue trying to scrape target page.
[custom_components.multiscrape.coordinator] Powershop # Updating failed with exception: 
2024-10-01 11:20:19.433 ERROR (MainThread) [custom_components.multiscrape.sensor] Powershop # Powershop Off Peak # Unable to scrape data: Skipped scraping because data couldn't be updated 

I added verify_ssl: false to the form_submit section and rebooted, but alas, today the sensors are unavailable again.

Partial config below:

- resource: 'https://secure.powershop.co.nz/rates'
  name: Powershop
  log_response: true
  scan_interval: 43200
  form_submit:
    submit_once: true
    verify_ssl: false
    resource: 'https://secure.powershop.co.nz'
    select: ".content > form"
    input:
      email: !secret powershop_user
      password: !secret powershop_pass

v8.0.2 :sunny: Startup performance & restoring values

:warning: This release contains some important changes! :warning:

:tada: I’m super excited about this new release! It contains some changes that are not breaking, but that do make Multiscrape work slightly differently than before. :rocket: Read on!

:zap: Performance Boosts!

This release brings huge improvements to startup performance. While it’s a big leap forward, you might not feel the impact as much because of the great improvements in Home Assistant over the past few months. But trust me, it’s fast! :zap:
On my production system:

:arrows_counterclockwise: Restoring Previous Values

Both sensor states and attribute values are now restored after a reboot, meaning you’ll be right back where you left off! No more waiting for new scrapes to kick in to get your data back after restarting. :arrows_counterclockwise: (Values are also restored when you set ‘last’ in ‘on_error’.)
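For reference, the per-sensor bit looks something like this (simplified sketch; double-check the exact keys in the README):

sensor:
  - name: Some scraped value
    select: '.price'     # hypothetical selector
    on_error:
      value: last        # fall back to the previously scraped value when scraping fails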

:pause_button: No More Scrapes with scan_interval: 0

Here’s a subtle but important change: if you set scan_interval to 0, Multiscrape will no longer scrape on startup. Previously, it would still initiate a scrape on startup, but now it stays idle just like you’d expect. :stop_sign:

As always, thanks for using Multiscrape! Let me know how it’s working for you. Happy scraping! :spider::computer:

Other notes

  • I want to dedicate this release to @madelena, Product Manager @ Nabu Casa, as she (unknowingly) was one of the main triggers for this release. Seeing Multiscrape pop up here was painful :sweat_smile:: Release party 2024.4
  • Happy with Multiscrape? Consider buying me a coffee or sponsoring me on GitHub!
  • Check out my other custom integration: MeasureIt!

Hi willembuys,

Thanks for your reply, and apologies for the late response; things have been very busy. I just included the multiscrape: config in my configuration.yaml (I had only used the template editor before) and checked the response. (I’m running HA in Docker on my Synology, btw.)

The file page_response_body.txt is there, and when I check its contents it seems there is an error. It is an HTML file which shows:
(function(){var a=new XMLHttpRequest();a.open("get","/missing",true);a.send();a.onreadystatechange=function(){if(a.readyState==4&&(a.status==200||a.status==304)){var c=String(a.responseText);var e=document.open("text/html","replace");e.write(c);e.close()}else{var d={en:"The page you are looking for cannot be found.", - followed by the same message in a lot of other languages.

This seems to indicate the page cannot be found, which is weird; I just looked it up with no issue. Any pointers as to what I need to change?

Maybe the * in the URL is a problem. Could you try the page without it, or replace it with %2A?

I removed the .* at the end of the URL, but there was no change in the output. Changing it to %2A also did not work. In both cases the output is the same as with the original error.

Edit:
Intriguingly, when I enter 'http://192.168.2.31/get.json?f=$.status.*' directly into Firefox, I do get a 404 response, which seems to say the page cannot be found. In my Python program (UMR2toMQTT/UMR2toJSON.py at 9bb3acffb76fbe929a0e9de4c04c2a0e847d32b0 · Sir-Bacon/UMR2toMQTT · GitHub) I do get a nice JSON response. Is that due to the requests module being used?

Edit2:
Should I just completely forget about the JSON approach and use HTML instead? The website of the UMR2 controller is tabbed. If I request http://192.168.2.138/#Status, the correct tab comes directly into view. The status variable is then defined as ‘stcv’ (status CV); see also the screenshot of the Inspect window below.
[screenshot of the Inspect window]
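Something like this is what I had in mind for the HTML route (untested; the selector is just my guess based on the Inspect window, and I realise the #Status fragment is handled by the browser, so the scraper would simply fetch the base page):

multiscrape:
  - resource: 'http://192.168.2.138/'
    scan_interval: 10
    sensor:
      - name: Vloerverwarming Status (HTML)
        select: '#stcv'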

Could it be that you are mixing up IPs? There is probably a good reason, but I just want to be sure. The IP address in your code is different from the one in the multiscrape config.

Dang, good catch! Yup, I mixed up the IPs.

With the correct IP of the UMR I do get a good response. With the ‘.*’ included, the file page_response_body.txt now only has 4 lines of content:

{
	"type": "WTH_Regulator",
	"id": "D602020C1A1E0794"
}

This seems to be a JSON response, albeit a very small one. Now how do I get the rest of the variables?

Edit:
When I enter http://192.168.2.138/get.json?f=$.status.* in Firefox I get the complete JSON with all the elements I want to scrape.

Edit2:
With %2A I also get only the 4-line JSON response. How do I get Multiscrape to return the complete JSON?

That’s really strange… But since it’s local to you, I’m afraid I cannot reproduce it. Maybe paste your config in the multiscrape.get_content action (service) and play around a bit more with the params (specifying them separately in the config).

action: multiscrape.get_content
data:
  name: Multiscrape Data
  resource_template: 'http://192.168.2.31/get.json'
  params:
    f: '$.status.*'
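If that works, the same split between URL and query parameters should also be possible directly in the config, something like this (untested sketch, assuming the config accepts params the same way the action does):

multiscrape:
  - name: Multiscrape Data
    resource: 'http://192.168.2.31/get.json'
    params:
      f: '$.status.*'
    scan_interval: 10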

It works like a charm.
However, a different sensor, which I scrape every 5 minutes and which uses the same login through form-submit, throws an error every 5 minutes, but otherwise scrapes just fine.
This is the error:

Exception in form-submit feature. Will continue trying to scrape target page.

cannot unpack non-iterable NoneType object

Any suggestions as to what is causing this? What information can I provide to help investigate?

OK, a reply to myself, just to document the solution. In the end Daniel assisted me and we found one: using the standard HA REST integration, it turned out Multiscrape was not necessary at all. A single REST call returns the JSON, from which multiple sensors can be read out. That was my misunderstanding.

Now I use the following REST configuration, really easy:

rest:
  - resource: 'http://192.168.2.138/get.json?f=$.status.*'
    scan_interval: 10
    sensor:
      - name: Vloerverwarming Modus
        value_template: '{{ value_json.status.outputs.heater.mode }}'
      - name: Vloerverwarming Pomp
        unit_of_measurement: "%"
        value_template: '{{ value_json.status.outputs.pump.speed }}'
      - name: "Vloerverwarming Klep"
        unit_of_measurement: "%"
        value_template: '{{ value_json.status.outputs.valves.8.state }}'
      - name: "Vloerverwarming Temperatuur In"
        unit_of_measurement: "°C"
        value_template: '{{ value_json.status.inputs.max.temperature }}'
      - name: "Vloerverwarming Temperatuur Uit"
        unit_of_measurement: "°C"
        value_template: '{{ value_json.status.inputs.return.temperature }}'

Thanks again to Daniel for helping me out.

Just to be clear, something strange did happen in the response from the UMR2 and the way Multiscrape processed it. But since Daniel does not have that controller, he cannot reproduce it.

I fixed this in release v8.0.3!


Thanks for this wonderful utility, Daniel. It comes in handy when the data is dynamic and only available from an online website, but we only want a specific value.
I am trying to scrape a site with daily changing data that provides information about moon phases and other astrological details. I was having trouble with one detail that is placed in a table with dynamic rows: rows are added or removed depending on the specific day’s values, so index-based scraping was not returning the correct data. I searched this thread for examples of identifying a row by a specific value and came up with the config below. Sharing it here for anyone with similar needs.

multiscrape:
  - resource: https://websitelinkthatyouwanttoscrape
    scan_interval: 86400  # scrape once a day
    headers:
      User-Agent: Mozilla/5.0
    sensor:
      - unique_id: sensor_id
        name: Sensor Name as per your need
        select: 'table:nth-child(2) > tr:nth-child(2) > td > table > tr:contains("rowheadingyouwanttosearchfor")'  # for my use case, YMMV
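One note: depending on the soupsieve version bundled with your HA install, the :contains() pseudo-class may be reported as deprecated; as far as I know, the supported spelling is :-soup-contains(), i.e.:

        select: 'table:nth-child(2) > tr:nth-child(2) > td > table > tr:-soup-contains("rowheadingyouwanttosearchfor")'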