Scrape sensor improved - scraping multiple values

loocd · August 26, 2024, 3:11pm

Hi all, just wanted to follow up: does anybody have an idea on how to solve this?

danieldotnl · August 26, 2024, 7:51pm

My library uses the same kb platform, so I gave it a try. It seems to be some kind of OAuth implementation, though, which is quite complex and unfortunately not supported by Multiscrape.

danieldotnl · August 26, 2024, 7:52pm

Not possible yet, but a nice feature! Please create a feature request on GitHub.

danieldotnl · August 26, 2024, 7:56pm

I’m excited to share some new features and improvements in v7.1.2 which I just released. Here’s what’s new:

New Feature: Form Variables

A big shoutout to @jeremicmilan for his incredible dedication to this feature! I’ve added Form Variables to Multiscrape, allowing you to scrape the (token of a) page returned after logging in on some sites (specifically PHP). This token can then be sent in a header for authentication or other purposes. For all the details, make sure to check out the README!
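As an illustration of the idea, here is a minimal sketch of what a form-variables setup might look like. The URLs, selectors, header name, and the variable name `token` are placeholders, and the `variables` block under `form_submit` is written from memory of the README, so check the option names there before relying on it.

```yaml
multiscrape:
  - name: Form variables example             # hypothetical name
    resource: "https://example.com/data"     # placeholder: page to scrape after login
    scan_interval: 3600
    form_submit:
      submit_once: true
      resource: "https://example.com/login"  # placeholder: page containing the login form
      select: "form"                         # placeholder: selector for the login form
      input:
        username: !secret example_user
        password: !secret example_pass
      variables:
        # Scraped from the page returned after the form is submitted.
        - name: token
          select: "meta[name='csrf-token']"  # placeholder selector
          attribute: content
    headers:
      # Assumption: form variables are available to header templates by name.
      X-CSRF-Token: "{{ token }}"
    sensor:
      - unique_id: form_variables_example
        name: Form variables example
        select: ".value"                     # placeholder selector
```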

New Feature: Cookies support

You asked, I delivered! The long-awaited support for cookies is finally here! Now, all cookies returned in HTTP sessions are automatically transferred to the next request. Plus, I’ve added logging so you can easily see which cookies are set. Sweet, right?

Automated Tests!!

I’m taking stability to the next level with the newly set up automated testing infrastructure! The first 2 automated tests have been added to Multiscrape, ensuring even more reliability in the future. Continuous improvements are on the way!

As always, a huge thank you to the amazing community for your continued support and feedback. Happy scraping!

PS: If you enjoy Multiscrape, please consider supporting me by buying me a coffee.

loocd · August 27, 2024, 8:18am

done, thank you!

NNenya · August 27, 2024, 9:39pm

I took note of the form variables with this release. Is it limited to capturing form response header values only? I am seeking to capture the csrftoken value from the form response page for use in the subsequent resource url.

danieldotnl · August 28, 2024, 12:58pm

That’s exactly what it’s meant for!

danieldotnl · September 4, 2024, 8:20pm

Now available in v7.2.0!

danieldotnl · September 4, 2024, 8:21pm

New Feature: Raw HTML Scraping with Multiscrape

I’m excited to announce that it is now possible to scrape raw HTML in Multiscrape! This feature has been a recurring request over the years, and I’m happy I could finally implement it.
It could, for example, be used to display rich content on a Markdown card.

A new configuration option for selectors has been added called extract. It is optional and can have these values:
- Text (default): Extracts plain text, as you are used to.
- Content: Returns the content of the selected tag.
- Tag: Returns both the content and the tag itself.

With this feature, your sensors (or attributes) can now have a state/value like:

<p>This is an <b>example</b> of what can be scraped with the <i>extract</i> feature.</p>
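A minimal sketch of how the `extract` option might be used; the name, URL, and selectors are placeholders. Keep in mind that Home Assistant limits a state to 255 characters, so longer HTML fragments are probably better stored in an attribute.

```yaml
multiscrape:
  - name: Raw HTML example                 # hypothetical name
    resource: "https://example.com/news"   # placeholder URL
    scan_interval: 3600
    sensor:
      - unique_id: raw_html_example
        name: Raw HTML example
        select: "h1.title"                 # placeholder selector
        extract: content                   # one of: text (default), content, tag
        attributes:
          - name: html
            select: "div.article > p"      # placeholder selector
            extract: tag                   # keeps the <p> tag itself as well
```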

Thank you for your continued support, and happy scraping!

willembuys · September 16, 2024, 5:52pm

Would love to see OAuth support, as this will probably solve the issue of scraping Water-link for water meter data.

SirBacon · September 21, 2024, 5:32pm

Hi all, I am trying to integrate sensors using Multiscrape, but can’t get it to work.

Situation:
HA running in Docker on Synology, Multiscrape installed from HACS

I’m trying to read sensors from my floor heating controller, for which I used to run a python script on my syno: GitHub - Sir-Bacon/UMR2toMQTT: Python routine to read out UMR2 floorheating controller and publish values to MQTT
That would read out the JSON from the website and send it to MQTT for HA to pick up. That has stopped working recently, and when I saw Multiscrape, my thought was that I could pull the scraping inside HA and then stop using the script, which would be much easier.

Now I’m trying to get this working, but failing so far. When I drop my code in the Developer tools template I get the error ‘UndefinedError: ‘value_json’ is undefined’. I believe this means the JSON is not valid. But I know it is, it has always worked like that.

The following code does not work:

multiscrape:
  - name: Multiscrape Data
    resource_template: 'http://192.168.2.31/get.json?f=$.status.*'
    scan_interval: 10
    log_response: true
    sensor:
      - unique_id: multiscrape_VVW_mode_test
        name: Vloerverwarming Modus
        value_template: '{{ value_json.status.outputs.heater.mode }}'

sensor:
  - platform: rest
    resource: 'http://192.168.2.31/get.json?f=$.status.*'
    name: Rest VVW mode test
    value_template: '{{ value_json.status.outputs.heater.mode }}'

This is to test with 1 value, I want to extract at least 5 values/states from the json.

Any pointers on where I should look for a solution? Much appreciated.

willembuys · September 23, 2024, 8:36am

Since you have enabled debugging, did you check that Multiscrape is indeed getting a JSON as response? You can check page_response_body.txt in homeassistant/multiscrape/‘name of sensor’.
I have had issues with square brackets at the start and end of JSON, so I removed them with a replace.
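For the JSON case discussed here, a sketch of how several values could be read from one response. Note that the Developer tools template editor has no scraped response to work with, so `value_json` is always undefined there regardless of whether the JSON is valid. The second path below is hypothetical and must be replaced with a key that actually exists in the response, and the `value_json[0]` hint is an untested assumption.

```yaml
multiscrape:
  - name: Floor heating                    # hypothetical name
    resource: "http://192.168.2.31/get.json?f=$.status.*"
    scan_interval: 10
    log_response: true                     # writes the response files used for debugging
    sensor:
      - unique_id: floor_heating_mode
        name: Floor heating mode
        value_template: "{{ value_json.status.outputs.heater.mode }}"
      - unique_id: floor_heating_second_value
        name: Floor heating second value
        # Hypothetical path: replace with a key present in the response.
        value_template: "{{ value_json.status.some.other.key }}"
        # If the response is wrapped in [ ... ], indexing the first element
        # (value_json[0].status...) may be an alternative to stripping the brackets.
```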

willembuys · September 26, 2024, 6:55pm

I wonder if this ever got resolved? I am currently using Multiscrape successfully to log in to a website to retrieve some values, but I can also perform actions on the website (after logging in) by getting a URL. I can’t have this URL scraped, because that means that the action is performed e.g. when I restart Home Assistant. But without the form submit functionality of Multiscrape, there’s no login.

Any ideas on how to call the URL only ad hoc, or how to pass the headers on to e.g. a rest command?

Joao-Sousa-71 · September 29, 2024, 4:03pm

No it is not. I don’t understand how the integration is able to write a file but not to create a sensor.

github.com/danieldotnl/ha-multiscrape

How to store cookies in a HA sensor

opened 09:21PM - 27 Aug 24 UTC

Joao-Sousa-71

## Description:

This is not a bug or issue, but within the documentation I was not able to find a way to store the cookies received after authentication in an HA sensor so they can be used in some rest commands. I'm using this component to manage my pellet stove, and after authentication two cookies are received: "XSRF-TOKEN" and "myceza_session". This is the content of the _page_response_cookies.txt_ file:

```
<Cookies[<Cookie XSRF-TOKEN=eyJpdiI6IlNjYXVwZFJOR1Zlam13UDlZYW5wSnc9PSIsInZhbHVlIjoiSFdzN0V6QWdUbkxaU2txaXh5WWEreWo4c1YzczFTa0xneU9ZdFdzbjREUWYrK0tLNDl4bVp3clhsVkg5UE1YayIsIm1hYyI6ImEwYWY5ODJkMTgwNzE5ODI1NTIyYzFlNjZmOTBlMDM1MGFkYzAxMGUyYzJkODEzYzA1YjBjOWU0YmU0NTgzNGQifQ%3D%3D for myceza.it />, <Cookie myceza_session=eyJpdiI6Ild5bVlDS21qaFhkUVwvK3AxQ2h1RmtBPT0iLCJ2YWx1ZSI6IjZNSXlWckJrcFBXSVlueWZOZEx1ZTZpdVpieW8xd2RIVEVJR1djVVhvcDk5NEJ4TVRqenQrYjgzVUVMTmNkTE8iLCJtYWMiOiI5ZjA4MmM4MjdjZThlMGFkMDI1MGQ0MzViZDAxNmE0Njc1MjhiODc1OTUwNDYxY2FkM2RkYzNhYmQ4Mzg3NWVjIn0%3D for myceza.it />]>
```

What I need is to have these two cookies available in two HA sensors so that I can manage the stove using rest commands. Can this be done with the latest version? If not, I can create a feature request.

## Version of the custom_component

7.1.2 (latest)

## Multiscrape configuration

```
multiscrape:
  - name: solzaima ha integration
    resource: 'https://myceza.it/en'
    scan_interval: 86400
    log_response: True
    form_submit:
      submit_once: False
      resubmit_on_error: True
      resource: 'https://myceza.it/en/login'
      select: '#main > div > div.panel-body > form'
      input:
        username: [email protected]
        password: 'mypassword'
    sensor:
      - unique_id: mycezadatatoken
        select: 'div#app-meta'
        name: Myceza Data Token
        attribute: 'data-token'
        value_template: '{{ value }}'
        on_error:
          log: error
      - unique_id: mycezacsrftoken
        select: 'head > meta:nth-child(3)'
        name: Myceza CSRF Token
        attribute: 'content'
        value_template: '{{ value }}'
        on_error:
          log: error
```

Thank you.

willembuys · September 30, 2024, 9:25am

Good to know I’m not the only one struggling with that. Would you mind sharing how you did the workaround by opening the debug files with the responses and extract them?

danieldotnl · September 30, 2024, 11:39am

I haven’t been able to look at this yet, but in v8.01 which is currently available as a pre-release, you can set the scan_interval to 0 and the action will never be performed (not even on startup), unless manually triggered. Does that help or is there anything else in a rest command that you cannot achieve with multiscrape?

danieldotnl · September 30, 2024, 11:39am

“I don’t understand how the integration is able to write a file but not to create a sensor.”

I’m looking forward to your pull request

willembuys · September 30, 2024, 11:58am

That looks like something I can work with. I will test when I find the time!

xbmcnut · October 2, 2024, 1:58am

Anyone got issues with Powershop NZ? Something changed around the 30th September and now I have no sensors. Logs show the following:

2024-10-01 11:20:09.418 ERROR (MainThread) [custom_components.multiscrape.coordinator] Powershop # Exception in form-submit feature. Will continue trying to scrape target page.
[custom_components.multiscrape.coordinator] Powershop # Updating failed with exception: 
2024-10-01 11:20:19.433 ERROR (MainThread) [custom_components.multiscrape.sensor] Powershop # Powershop Off Peak # Unable to scrape data: Skipped scraping because data couldn't be updated

I had added verify_ssl: false to the form section and rebooted, but alas, today the sensors are unavailable again.

Partial config below:

- resource: 'https://secure.powershop.co.nz/rates'
  name: Powershop
  log_response: true
  scan_interval: 43200
  form_submit:
    submit_once: true
    verify_ssl: false
    resource: 'https://secure.powershop.co.nz'
    select: ".content > form"
    input:
      email: !secret powershop_user
      password: !secret powershop_pass

danieldotnl · October 4, 2024, 2:44pm

v8.0.2 Startup performance & restoring values

This release contains some important changes!

I’m super excited about this new release! It contains some changes that are not breaking but do make Multiscrape work slightly differently than before. Read on!

Performance Boosts!

This release brings huge improvements to startup performance. While it’s a big leap forward, you might not feel the impact as much because of the great improvements in Home Assistant over the past few months. But trust me, it’s fast!
On my production system:

Restoring Previous Values

Both sensor states and attribute values are now restored after a reboot, meaning you’ll be right back where you left off! No more waiting for new scrapes to kick in to get your data back after restarting. (They are also restored when you set ‘last’ in ‘on_error’.)
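The ‘last’ setting mentioned here could look roughly like the sketch below; the name, URL, and selector are placeholders, and the `value: last` key under `on_error` is written from memory of the README.

```yaml
multiscrape:
  - name: Restore example              # hypothetical name
    resource: "https://example.com"    # placeholder URL
    scan_interval: 3600
    sensor:
      - unique_id: restore_example
        name: Restore example
        select: ".price"               # placeholder selector
        on_error:
          log: warning                 # log level used when scraping fails
          value: last                  # keep the previous value on a scraping error
```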

No More Scrapes with scan_interval: 0

Here’s a subtle but important change: if you set scan_interval to 0, Multiscrape will no longer scrape on startup. Previously, it would still initiate a scrape on startup, but now it stays idle just like you’d expect.
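A sketch of how this could be combined with a manual trigger. It assumes that Multiscrape registers a trigger service per scraper whose name is derived from the scraper's `name` (so `multiscrape.trigger_manual_scrape` here); the scraper name, URL, selector, and script name are all hypothetical, and the actual service name should be verified under Developer tools.

```yaml
multiscrape:
  - name: Manual scrape                       # hypothetical name
    resource: "https://example.com/status"    # placeholder URL
    scan_interval: 0                          # never scrape on a schedule or on startup
    sensor:
      - unique_id: manual_scrape_result
        name: Manual scrape result
        select: ".status"                     # placeholder selector

script:
  run_manual_scrape:
    alias: Run manual scrape
    sequence:
      # Assumed service name: "trigger_" plus the slugified scraper name.
      - service: multiscrape.trigger_manual_scrape
```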

As always, thanks for using Multiscrape! Let me know how it’s working for you. Happy scraping!

Other notes

I want to dedicate this release to @madelena, Product Manager @ Nabu Casa, as she (unknowingly) was one of the main triggers for this release. Seeing Multiscrape pop up here was painful: Release party 2024.4
Happy with Multiscrape? Consider buying me a coffee or sponsoring me on GitHub!
Check out my other custom integration: MeasureIt!