The new way to SCRAPE

In the past I used scrape quite heavily to get data from the web into HA.
Here is an example from my configuration.yaml where I retrieve an exchange rate string from an open webpage:

Depot Euro to CHF Exchange Rate

sensor:
  - platform: scrape
    resource: https://www.comdirect.de/inf/waehrungen/euro-schweizer_franken-kurs
    name: chf_eur_string
    select: "#keyelement_kurs_update > div.realtime-indicator > span"
    scan_interval: 600

When I installed the December update I found out that this is no longer the way to do it.
So I tried to muddle my way through the new scrape integration and, frankly, have no idea what to do.

Is there a tutorial, or is there anybody who could lead me through this new UI so I can generate the same sensor (sensor.chf_eur_string) I am using now?

Comment out the scrape sensor settings in configuration.yaml (see the snippet after this list).
Restart Home Assistant.
Add the Scrape integration.
A form will open. Enter the resource URL.
Click Next.
Add a sensor by filling in the name and select expression fields.
Submit.
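
For step 1, something like this - put a # in front of the whole old platform block (example borrowed from the first post; your URL and selector will differ):

# sensor:
#   - platform: scrape
#     resource: https://www.comdirect.de/inf/waehrungen/euro-schweizer_franken-kurs
#     name: chf_eur_string
#     select: "#keyelement_kurs_update > div.realtime-indicator > span"
#     scan_interval: 600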

Thank you pedjas for your help - I did most of that, but when it comes to:

…
Click Next.
Add a sensor by filling in the name and select expression fields.
…

That is where there are a number of cryptic fields that need to be filled in.
I cannot seem to correlate them with the information I have - what goes where?

In the Select part you need to fill in the CSS selector WITHOUT quotes.
So use
#keyelement_kurs_update > div.realtime-indicator > span
and not
“#keyelement_kurs_update > div.realtime-indicator > span”

That is a good start :slight_smile: - thank you, sammyke007!

So now the sensor entity is being created - the value, however, remains “unknown”!?!

Anybody have a clue what is still missing?

So I re-read the documentation and realized that the GUI is not the exclusive new way to scrape - thank god for that, because in its current form it is not really usable in my opinion, especially if you have a number of different resources (how is that even done in the GUI?).

So I went back to the configuration.yaml approach, and it would appear that the new format to achieve what I wanted is now something like this:
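
(A sketch of my result - I placed scan_interval at the resource level, since that is where it was accepted:)

scrape:
  - resource: https://www.comdirect.de/inf/waehrungen/euro-schweizer_franken-kurs
    # scan_interval placement was trial and error (see note below)
    scan_interval: 600
    sensor:
      - name: chf_eur_string
        select: "#keyelement_kurs_update > div.realtime-indicator > span"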

The indentation was a bit of trial and error - it does accept the scan_interval where it is at the moment, but I have no idea if it is actually being applied.


I added my scrape sensors using the user interface but cannot find them in YAML to edit. I also prefer editing such stuff in a text editor. It is much simpler.

I don’t think the sensors you create with the GUI will show up in configuration.yaml.
You have to edit the sensors via the GUI: go to the integrations page, find the Scrape integration, click on “Configure”, and then you can select and edit the sensor from that sub-menu.

Super complicated if you ask me, and I never figured out how to assign a certain resource to a specific sensor. That GUI needs some rework; at the moment it is not really user friendly.

I ended up deleting all sensors I tried to create with the GUI.

Then I went to the configuration.yaml and edited my existing scrape sensors as shown above.
Seems to work - for now I do not get a notice under “Repairs” that I have to do anything else.


You can use either YAML or the GUI to set up the scrape sensor, but the two ways don’t “sync” with each other. It’s whatever you personally prefer.
What is being deprecated, however, is configuring scrape as a platform in your configuration.yaml.

In the past, scrape was a platform under sensor in your configuration.yaml.
It looked like this:

sensor:
  - platform: scrape
    name: Benzine 95 Vandaag
    resource: https://carbu.com//belgie/super95
    select: "#news > div > div:nth-child(1) > table > tbody > tr:nth-child(1) > td.price"

The above (platform) way is being deprecated. If you prefer configuration.yaml over the GUI, then you have to use scrape as an integration, not as a platform anymore. Example:

scrape:
  - resource: https://carbu.com//belgie/super95
    sensor:
      - name: Benzine 95 Vandaag
        select: "#news > div > div:nth-child(1) > table > tbody > tr:nth-child(1) > td.price"

Do you see the difference? The same happened to MQTT, REST and template in the past: the platform got deprecated and each became a top-level integration.
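
Template, for example, made the same move. A minimal sketch with placeholder names (the referenced entity is just the scrape sensor from above):

Old platform style:

sensor:
  - platform: template
    sensors:
      fuel_price_copy:
        value_template: "{{ states('sensor.benzine_95_vandaag') }}"

New integration style:

template:
  - sensor:
      - name: "Fuel price copy"
        state: "{{ states('sensor.benzine_95_vandaag') }}"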
I hope I explained it well enough :sweat_smile:


Perfect - thank you Sammyke007 :+1:


In the meantime I figured out how to assign a source to a sensor in the GUI :slight_smile:

For each different resource you have to click on the “add integration” button at the bottom right.

Not sure if that is intuitive either :slight_smile: :slight_smile: :slight_smile:

And if you have 20 different sources with one sensor each, you end up with 20 different integrations on that page?

Correct - for me, with 2 resources (2 sensors each), it looks like this:

[screenshot: two Scrape integration entries, each containing two sensors]


OK - while the GUI takes a bit of getting used to, I am starting to like the functionality.

It seems as if the sensors you set in the GUI are auto-updating every 10 minutes.
If you need more/less frequent updates, you can disable the auto-update in the system options for each sensor. Then you have to add an automation to update them at your chosen time interval (or any other trigger).
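
For example, something like this (a sketch - the entity and trigger are placeholders, adjust them to your own sensor and interval):

automation:
  - alias: "Update scrape sensor hourly"
    trigger:
      - platform: time_pattern
        minutes: 0  # fires at the top of every hour
    action:
      - service: homeassistant.update_entity
        target:
          entity_id: sensor.chf_eur_string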

Another irritating thing was the sensor name that appears by default in the little integration window.

Inspired by sammyke007 (who obviously is not a Tesla user :slight_smile: and had nice names for his sensors), I figured out how to change the names: click on the 3 dots at the bottom right after selecting the sensor in the integration.

So my apologies to the devs (should they read my negative comments above :slight_smile:) - once you understand the GUI a bit more, it actually works quite well.


Thanks @sammyke007, you are my hero for today :wink:

And for anyone who might be affected:

I had two sensors from the same website, and here is how I got it working:

Old situation:

sensor:
  - platform: scrape
    resource: https://www.knmi.nl/nederland-nu/weer/waarschuwingen/noord-holland
    select: "div.alert__heading"
    name: knmi_alert_heading
    scan_interval: 300

  - platform: scrape
    resource: https://www.knmi.nl/nederland-nu/weer/waarschuwingen/noord-holland
    select: "a.alert__description"
    name: knmi_alert_description
    scan_interval: 300

New situation:

scrape:
  - resource: https://www.knmi.nl/nederland-nu/weer/waarschuwingen/noord-holland
    sensor:
      - name: knmi_alert_heading
        select: "div.alert__heading"
      - name: knmi_alert_description
        select: "a.alert__description"

I really hope this method survives.
I found the UI way just too complicated, and I see a potential problem with it when it comes to sharing.
Without YAML it is impossible to just copy a sensor and paste it on a forum or in a messaging app so someone else can use it too. :frowning: It would take a long explanation of where to put what in the UI. Not a way of easing things, in my mind…


Yes, I realized that, but too late. I had already created everything all over again using the GUI.

The docs should be clearer on that matter. At first I actually thought the docs had not been updated, as they still explained entering the config in configuration.yaml.

And of course, having two totally separate ways to create the configuration, with the configuration not even shared between them, is confusing by definition.


To make it even more interesting, the docs were updated yesterday, describing the new way of using YAML :slight_smile:

It still does not work as it should: the docs list a scan_interval option, but it has no effect whatsoever - the scan is always done every 10 minutes…

Did you disable the automatic update (in the GUI) for the affected sensor?

Well - thinking about my comment - that would only apply to a sensor configured via the GUI.

So I take it you configured a sensor in the configuration.yaml with a scan_interval of less than 10 min and it doesn’t do it?

I do not use this scrape thing myself, yet.

But I will say that for a scrape feature, where the user consumes bandwidth and server CPU time from a 3rd party, a configurable scan interval is essential.

Either you make it configurable, like it was in YAML.

Or you remove the scan interval completely and document that people should always use an automation to fetch the data.

Except for HTML resources that belong to you, a 10 minute scan interval is really bad behaviour.

I run a website and I have banned many IP addresses because some punk suddenly starts auto-scraping some detail every 10 minutes. You only need a few handfuls of those and your network link is overloaded. Please friends, do not scrape at 10 minute intervals anywhere unless it is your own internal resource. And if it is your own resource, then you should have enough control to use something far better than HTML scraping.

A 10 minute default scan interval is luring inexperienced HA users into being bad internet citizens.


No, I’d like to have much less frequent updates than every 10 minutes - that is why I usually set 3600 seconds in YAML - but the sensors are still updated every 10 minutes.

Seems to be a bug, so I’ve created a ticket about it.

2 Likes