The new way to SCRAPE

It might be a stupid question, but where do I find the YAML code of the scrape entity?
So far I have just used the UI and it works fine. The only thing I want to change is the scan interval, which should be reduced to 1 minute instead of 10. It should be an easy thing, if only I knew where the YAML is…
I checked the sensor.yaml, the configuration.yaml and many others… I can’t find it.
Please provide me with a hint…
Thank you!

Ah, just found the answer: it does not exist. Strange to me, but it seems I would have to use YAML from the beginning. Unfortunately that makes the UI useless.
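For anyone else landing here: when the scrape entity is set up via the UI there is no YAML to edit; the YAML-only route goes into configuration.yaml. A minimal sketch of what I believe that looks like (the URL, selector, and sensor name are placeholders; scan_interval is in seconds):

```yaml
# Sketch of a YAML-configured scrape sensor in configuration.yaml.
# Resource URL and CSS selector below are placeholders, not a real site.
scrape:
  - resource: https://www.example.com/prices
    scan_interval: 60   # poll every minute instead of the 10-minute default
    sensor:
      - name: Example price
        select: ".price"
```

After adding this, the Scrape YAML needs a reload (or a restart) before the sensor shows up.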

Great hint! Works perfectly!
Thank you very much, @Frank_Beetz!

Hi,

I want to get the “Diesel” price from 2 Google searches, so I can compare and choose the cheapest one :wink:

gulf hansbeke - Google Search.
and
Tankstation in de buurt
How can I do that?

Please help, how can I run the scrape manually?
I only need it in certain cases.

The entity’s polling toggle is off/disabled:

Enable polling for updates.
Whether Home Assistant should automatically query scrape entities for updates.

My script:

alias: Zeit
sequence:
  - service: homeassistant.update_entity
    data: {}
    target:
      entity_id:
        - sensor.web_scrape_zeit
mode: single

Or is it only possible in an automation? But the service “homeassistant.update_entity” is always the same, right?
Or is there a dedicated “scrape” update service?

Does this only work in YAML mode?
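For what it’s worth, the same homeassistant.update_entity call can also be wrapped in an automation; a minimal sketch, where the time-pattern trigger is an assumption (use whatever trigger fits your “certain cases”):

```yaml
# Sketch: refresh the scrape sensor on a schedule via an automation.
# The 5-minute time_pattern trigger is an assumption; swap in your own trigger.
alias: Update web scrape zeit
trigger:
  - platform: time_pattern
    minutes: "/5"
action:
  - service: homeassistant.update_entity
    target:
      entity_id:
        - sensor.web_scrape_zeit
mode: single
```

The service call itself is identical to the script version; only the trigger differs.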

Thanks!

Hey, I’m trying to scrape times from a bus service timetable, but I’m not getting any value.

I’ve tried both the UI and YAML without luck. Here is the YAML version.

scrape:
  - resource: https://app.ambmobilitat.cat/stops/busamb:102736
    sensor:
      - name: E70
        select: "#root > div > div.desktop-page > div.content > div > div.route-list-container > div > div > a:nth-child(1) > div.text > div.subtitle > div > div > div.times > div:nth-child(1) > div:nth-child(1)"

Any suggestion or advice would be highly appreciated.

Cheers

Hey, I’m trying to scrape “Gasolio(self)” price from this site https://carburanti.mise.gov.it/ospzSearch/dettaglio/50294

in the “select” field I put #main-content > div > div > app-dettaglio-page > div > div:nth-child(2) > div > div > app-dettaglio > div > div > section.w-100.col-lg.col-sm-12.mt-sm-4.mt-lg-0 > div > table > tbody > tr:nth-child(2) > td:nth-child(2) > span

but the integration doesn’t show any value. What can I do to solve my issue?
Thanks

Hello,
being quite new to the Scrape integration, I already ran into some issues today.
While trying to understand how it works, I am currently stuck reading out a temperature from the local web interface of my heating.
I think the issue is that the value is, for example, “50,0°C”, while Scrape expects just a number like “50,0”.
I am very new to this and would like to understand how the “°C” string can be removed.
Can anyone please help here?
Can anyone please help here?

Thank you in advance!
Ron

Hi @goetzpil, you can use my add-on, which sends the data over MQTT.

If you are missing data, create a feature request.

Hi Ron,
It took me some time too, but I started by playing around in Python with BeautifulSoup to understand how to do those conversions. Once I got that, I used the standard Scrape component.

Specifically for your case: in the value template section, you can use {{ value | replace(",", ".") | replace("°C", "") }} to convert the value you receive into what you need.
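Put together in a scrape sensor, it might look roughly like this (the resource URL and CSS selector are placeholders for your heating’s web interface, not real values):

```yaml
# Sketch: strip the "°C" suffix and normalize the decimal separator.
# URL and selector are placeholders for the heating's local web interface.
scrape:
  - resource: http://192.168.1.50/status
    sensor:
      - name: Heating temperature
        select: ".temperature"
        value_template: "{{ value | replace(',', '.') | replace('°C', '') }}"
        unit_of_measurement: "°C"
        device_class: temperature
```

With the unit stripped in the template, unit_of_measurement adds it back for display and lets Home Assistant treat the state as numeric.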

There’s a few beers in it for anyone that can get the Version number of MS Edge from this page:
https://www.microsoft.com/en-us/edge/business/download?form=MA13FJ

That is to say, I need the text under ‘Windows 64-bit’, currently reading ‘125.0.2535.67’.

I’ve tried all sorts and concluded that the CSS is a little malformed. Not so much that the page fails to load, but enough that it’s hard to scrape… though I’m only a relative newbie to scraping, so I can’t be sure.

Hi there,

I also struggle to scrape the podcast episode of Was jetzt? (that is “https://zeitonline.simplecastaudio.com/b4b9795f-4b37-4d4f-bbfe-62b735703af8/episodes/497f2191-08eb-46e8-a735-99019b7a107a/audio/128/default.mp3?aid=rss_feed&awCollectionId=b4b9795f-4b37-4d4f-bbfe-62b735703af8&awEpisodeId=497f2191-08eb-46e8-a735-99019b7a107a&feed=Xtqjn37O”)

The CSS selector should be "#folder4 > div.opened > div:nth-child(16) > span > span:nth-child(3) > span.html-attribute-value", but the sensor returns “unknown”.

Any help is appreciated :wink:

Reading this comment sparked the memory of a convo I had in a slack channel a while back.

We came up with two kinda-sorta-okay solutions.

The first used curl to download a file, but it ended up being a binary file, even though the only content it really included was the version number of the latest release:

https://msedgedriver.azureedge.net/LATEST_STABLE

The other method involved less grappling with odd characters in a binary file and instead leaned on jq to process the full JSON object of latest versions:

curl -s 'https://edgeupdates.microsoft.com/api/products?view=enterprise' | jq '[.[] | select(.Product == "Stable") | .Releases[] | select(.Platform == "Windows") | select(.Architecture == "x64")] | max_by(.ReleaseId) | .ProductVersion'

Quite a mouthful, but it returns the latest version via the release with the “max”/highest ReleaseId.
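If you wanted that inside Home Assistant rather than a shell, one option would be a command_line sensor wrapping the same pipeline; a sketch only (the sensor name and interval are assumptions, and curl and jq must be available on the host):

```yaml
# Sketch: expose the latest stable Edge version as a Home Assistant sensor.
# Assumes curl and jq exist on the host; name and scan_interval are made up.
command_line:
  - sensor:
      name: Edge stable version
      scan_interval: 86400   # check once a day
      command: >-
        curl -s 'https://edgeupdates.microsoft.com/api/products?view=enterprise'
        | jq -r '[.[] | select(.Product == "Stable")
          | .Releases[] | select(.Platform == "Windows")
          | select(.Architecture == "x64")]
          | max_by(.ReleaseId) | .ProductVersion'
```

The -r flag makes jq emit the bare version string instead of a JSON-quoted one, which is what you want for a sensor state.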

[[ Edit: we ended up going with the jq approach because we could take the output of the whole release and use the extra info, like URLs, to better automate some of the pre-downloading of the installer.
You can peel back each of the pipes off the end of this command to get a better feel for how it’s working. ]]

Neither my coworker nor I was super stoked about either approach, but we agreed that both were more likely to keep working longer than trying to scrape the version off the enterprise page.

Not sure if this qualifies for the beers since it’s not web scraping, but hey, hope this helps!

Cheers either way! :beers:

Hi AniseAridity. Wow, thank you very much for the interesting info. I’ll probably give that a crack at some point and let you know how I get on (though I’m not strong on coding). What I’m actually going with at the moment is a self-hosted scraping engine: changedetection.io.
You’ll see they also have a cloud-hosted version which costs a few ££ but is well worth it and far cheaper than the alternatives. The notifications available are plentiful, and there’s even one that creates Persistent Notifications in HA… which is what I’m using. So far so good… though the templating is fiddly. :slight_smile:

Hi to all,

I struggle to scrape values on this page:

http://www.centrometeolombardo.com/Moduli/Cartina/data/tt.js?reload

If I inspect the page source, the selector should be:

body > table > tbody > tr:nth-child(6) > td.line-content

but the sensor returns “not available”.

Any help is appreciated,
Thanks in advance.