Scrape sensor improved - scraping multiple values

danieldotnl · December 20, 2023, 9:29pm

Yes @danieldotnl is trying to keep up, although severely limited in time. I really appreciate all the support @parautenbach is providing to the scrape community! I simply cannot reply to each (private) message myself, and try to focus on providing more value in multiscrape instead.

Anyway, I looked into this tonight and realized that I fixed this some time ago but never merged it into the master branch.
I believe this release fixes your issue:

kbrown01 · December 21, 2023, 7:56pm

Thank you! I can confirm this works using:

  - name: SOS scraper2
    resource: https://www.dailyfaceoff.com/nhl-weekly-schedule
    scan_interval: 360000
    sensor:
      - unique_id: hockey_strength_of_schedule_test
        name: Hockey Strength of Schedule Test
        select: '#__NEXT_DATA__'
        value_template: '{{ now() }}'
        attributes:
          - name: props
            select: '#__NEXT_DATA__'
            value_template: >
                {{ value | from_json }}

Note @parautenbach: I also tried value_json, but that did not work. I assume this is because the response is not a JSON file; the selected element just contains a string of JSON.
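
For anyone wondering what the from_json step amounts to, here is a rough sketch in plain Python, outside Home Assistant. It assumes the selected element is a script tag whose text is JSON. The HTML snippet and the keys pageProps and games are made up for illustration, and the regular expression only stands in for the CSS selector.

```python
import json
import re

# Hypothetical, heavily shortened page; the real page is far larger.
html = (
    '<html><body>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"games": 3}}}'
    '</script></body></html>'
)


def extract_next_data(page: str) -> dict:
    """Return the JSON held in the <script id="__NEXT_DATA__"> element."""
    match = re.search(
        r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', page, re.DOTALL
    )
    if match is None:
        raise ValueError("no __NEXT_DATA__ script tag found")
    # The tag's text is only a string that contains JSON, so parse it explicitly.
    return json.loads(match.group(1))


data = extract_next_data(html)
print(data["props"]["pageProps"]["games"])
```

The point is the explicit json.loads call: the scraped value starts out as text, and only becomes a nested structure after parsing.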

kbrown01 · December 21, 2023, 8:07pm

As you can see in the following messages, the fix was only in some unpublished code. I downloaded that version, tested it, and got it working in one step. Thanks for trying to help; I am glad it was not just some stupid mistake I was making.

Onward to new challenges!

parautenbach · December 22, 2023, 6:19am

You’re welcome! And I saw, thank you. Great of the contributor to have published that change.

albamatti · January 1, 2024, 1:23pm

Hello all,
I am trying to scrape: https://gpsgadget.buienradar.nl/data/raintext?lat=52&lon=5
I use this code:

- resource: https://gpsgadget.buienradar.nl/data/raintext?lat=52&lon=5
  scan_interval: 900
  sensor:
  - unique_id: Regenklok
    name: Regenklok
    select: "body > pre"
    value_template: '{{ value.split("|")[1]}}'
    unit_of_measurement: "%"

From the debug log I get this response: "Unable to scrape data: Could not find a tag for given selector"
I also tried select: "pre", but that didn’t solve the issue.
Maybe someone can help me with the right tags?

==== Solution ====
I found the solution. I fixed it with:

    select: "p"
    value_template: "{{ value.split('|')[1] }}"

wigster · January 3, 2024, 12:31pm

This may be useful to some of you. I’ve figured out how to scrape dynamic (JavaScript-generated) websites using Browserless and multiscrape, and have written this up here:

PskNorz · January 4, 2024, 1:42am

Hi guys, I’m very new to scrape.
I am trying to get the electricity price, but I can’t make it work.

The site is: https://www.dei.gr/en/home/electricity/g1-g1n/

I am trying to get those two values, but I can’t.

Igor01-Tech · January 11, 2024, 4:46pm

Hey guys!

Could you please help me scrape a specific file on GitHub? I want to get notified when it changes: https://github.com/paperless-ngx/paperless-ngx/blob/c2c9a953d3f4dfb2b7d5eb8f4e055aad8339aae2/docker/compose/docker-compose.portainer.yml

So I want to check and get “4 days ago”.

I got this as the selector:
#repo-content-pjax-container > react-app > div > div > div.Box-sc-g0xbh4-0.fSWWem > div > div > div.Box-sc-g0xbh4-0.emFMJu > div.Box-sc-g0xbh4-0.hlUAHL > div > div:nth-child(3) > div.Box-sc-g0xbh4-0.brFBoI > div > div.Box-sc-g0xbh4-0.jGfYmh > div.Box-sc-g0xbh4-0.lhFvfi > span.Text-sc-17v1xeu-0.kKFNhh.react-last-commit-oid-timestamp > relative-time

But it is not working… here is what it looks like in HA:

What am I doing wrong here?

Can anyone test this and help? Thanks a bunch!!

parautenbach · January 11, 2024, 8:04pm

Monitoring the commits with this seems a lot simpler: GitHub - Home Assistant.

homebrew · January 11, 2024, 9:36pm

Seems simple enough, but I’m having no luck.
I’ve used the browser console to copy the selector path.
Aaaand nothing.
The page: Park City Weather | Park City Mountain Resort

The first bit of data I’m trying to grab is the 24hr snow fall, so console gave me this:
#snow_report_1 > div.snow_report__content.row > ul > li:nth-child(2) > div > h5
It seems to make sense, but doesn’t work.

Any ideas?

JeroenB · January 22, 2024, 10:17am

I’m trying to scrape a temperature measurement from a website. Measurements are added every hour to a string. So far I can retrieve the entire string of measurements after ‘var query_temp’, but I’m not experienced enough to extract the last measurement (it is always in positions -5 to -1 from the end of the string, as indicated in the figure below). Could anyone point me in the right direction?

Troon · January 22, 2024, 10:29am

In future, please help us by posting relevant data as text: I’ve had to re-type all this for testing.

value_template: >
  {{ value|regex_findall("\s(\-?[0-9\.]*)\s")|last }}

regex_findall is returning a list of all numbers that are surrounded by whitespace:

\s — whitespace character before
( — start remembering
\-? — optional minus sign
[0-9\.]* — any sequence of digits and points
) — stop remembering
\s — whitespace character after
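
As far as I know, the regex_findall filter is backed by Python’s re.findall, so the pattern can be tried outside Home Assistant. Below is a minimal sketch under that assumption; the sample string is made up and only mimics the shape of the data (numbers separated by two spaces).

```python
import re

# Hypothetical sample: numbers separated by two spaces, padded with one space.
value = " -1.5  0.0  12.25 "

# Same pattern as in the template above.
pattern = r"\s(\-?[0-9\.]*)\s"

numbers = re.findall(pattern, value)  # list of the captured groups, as strings
last = numbers[-1]                    # equivalent of the |last filter
print(numbers, last)

# Matches cannot overlap, so with single-space separators the whitespace
# after one number is already consumed and every other number is skipped.
single_spaced = re.findall(pattern, " 1 2 3 ")
print(single_spaced)
```

The second call shows a limitation worth knowing about: the pattern relies on each number having its own whitespace on both sides, which holds for the double-spaced data here.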

JeroenB · January 22, 2024, 2:46pm

Thank you! I’m still having trouble, though. The data are here:

<script type="text/javascript">
      
      var query_labels = " 01  02  03  04  05  06  07  08  09  10  11  12  13  14  15 ";
      var query_temp = " 44.8  44.9  44.9  44.8  44.7  44.5  49.8  60.8  60.6  60.4  60.2  59.9  59.6  59.3  58.4 ";
      var query_elec = " 279.45808708333334  400.80427425  0.0  0.0  0.0  0.0  2158.4836078611106  29.757760666666666  251.44402029444447  0.0  0.0  0.0  353.20271759999997  0.0  552.133760033333 ";
      var query_heat = " 946.9248917628065  1501.3182562778238  0.0  0.0  0.0  0.0  2366.505316186263  29.757760666666666  808.0376479104308  0.0  0.0  0.0  1304.7204898726695  0.0  2045.6582218857018 "
      var total_heat = "9.0";
      var total_electricity = "4.0";
      var month = "";

And I have used the following code:

value_template: "{{(value.split('var')[2])| replace('query_temp = \"', '')| replace('\";','')| regex_findall("\s(\-?[0-9\.]*)\s")| last| float}}"

But I’m getting a new error if I include this line:

Error loading /config/configuration.yaml: while parsing a block mapping
  in "/config/configuration.yaml", line 795, column 9
expected <block end>, but found '<scalar>'
  in "/config/configuration.yaml", line 798, column 119

Troon · January 22, 2024, 2:52pm

Nested quotes (" within "). Use this instead:

value_template: >
  {{ value|regex_findall("query_temp[^;]*")|first|regex_findall("\s(\-?[0-9\.]+)[\"\s]")|last }}

EDIT: updated again: this version looks for query_temp rather than assuming it’s going to be on the third line.

First section returns the first instance of query_temp up to the next semicolon; second section pulls out the final number from it.
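
The two stages can be checked in plain Python, again assuming regex_findall behaves like re.findall. The input below is a shortened copy of the script text posted earlier, with only a few of the values kept.

```python
import re

# Shortened copy of the posted script text (most values removed).
value = '''
var query_labels = " 01  02  03 ";
var query_temp = " 44.8  44.9  58.4 ";
var query_elec = " 279.45808708333334  0.0  552.133760033333 ";
'''

# Stage 1: everything from "query_temp" up to, but not including, the next semicolon.
temp_assignment = re.findall(r"query_temp[^;]*", value)[0]

# Stage 2: numbers preceded by whitespace and followed by whitespace or a double quote.
readings = re.findall(r'\s(\-?[0-9\.]+)["\s]', temp_assignment)

latest = float(readings[-1])
print(temp_assignment)
print(latest)
```

Because stage 1 stops at the next semicolon, it depends on the query_temp line ending with one, which is the case in the posted data.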

JeroenB · January 22, 2024, 3:18pm

That did it! Thanks so much!

fadudba · January 31, 2024, 12:31am

Hi, can someone help me? I’m trying to get that value from a website.

thanks in advance!

wigster · February 7, 2024, 11:07pm

Hi. I’d like to scrape a status indicator. The problem is that the element has no data in it; the only thing that changes is the colour defined in the style attribute:

<div _ngcontent-pdh-c159="" style="display: flex; margin-left: auto;" title="AVAILABLE"><span _ngcontent-pdh-c159="" class="charger-status-dot" style="background-color: rgb(93, 199, 22); height: 34px; width: 34px;"></span></div>

Alternatively, there is another element which has the title “AVAILABLE”.

Is it possible to create a binary sensor which would be true if the colour/title match the above? I guess this is called extracting tag attributes?

Troon · February 8, 2024, 8:24am

We’d need the URL or the full HTML (pastebin?), and confirmation that the data you’re after is in the HTML as originally downloaded (View Source rather than F12 DevTools).

Could be as simple as select: div.buy-value.

If that colour definition is in the original HTML as fetched (i.e. not dynamically loaded afterwards), you could use a RESTful binary sensor. This works if that colour isn’t used anywhere else in the document and the page isn’t too long:

binary_sensor:
  - platform: rest
    resource: URL
    value_template: "{{ 'rgb(93, 199, 22);' in value }}"

Lots of "if"s there, but without a URL or the HTML to go off, I have to make assumptions.
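
To show what the two possible checks (title or colour) look like once the attributes are available, here is a small sketch in plain Python using only the standard library. It is not how multiscrape or the REST sensor work internally; the snippet is the posted element with the Angular-specific attributes trimmed, and StatusParser is a name invented for this example.

```python
from html.parser import HTMLParser

# The posted element, with the _ngcontent attributes trimmed for brevity.
snippet = (
    '<div style="display: flex; margin-left: auto;" title="AVAILABLE">'
    '<span class="charger-status-dot" '
    'style="background-color: rgb(93, 199, 22); height: 34px; width: 34px;"></span>'
    '</div>'
)


class StatusParser(HTMLParser):
    """Collect the title of the wrapping div and the style of the status dot."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.dot_style = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "div" and "title" in attributes:
            self.title = attributes["title"]
        if tag == "span" and attributes.get("class") == "charger-status-dot":
            self.dot_style = attributes.get("style")


parser = StatusParser()
parser.feed(snippet)

# Two independent ways to decide whether the charger is available.
is_available_by_title = parser.title == "AVAILABLE"
is_available_by_colour = "rgb(93, 199, 22)" in (parser.dot_style or "")
print(is_available_by_title, is_available_by_colour)
```

Both checks read tag attributes rather than element text, which is why an empty span is not a problem in itself here.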

wigster · February 8, 2024, 9:18am

I’ve put the page_soup.txt generated by multiscrape here:

https://pastebin.com/4EuREMjh

Since posting I’ve realised that multiscrape has an attribute key which should be able to return the tag attributes but somehow it does not work for this particular element. I’m experimenting with something like:

    - name: O-Life Home Charger status
      unique_id: o_life_home_charger_status
      select: ".spot-list-item div:nth-child(1) div div .charger-status-dot"
      attribute: "class"
      value_template: "{{value}}"

which I believe should return “charger-status-dot”, but it fails. It seems to work fine with other selectors that I am already getting from this page. Is it because the div is actually empty?

cvester · March 7, 2024, 10:09am

Can anyone help me? I keep getting this error:

Unable to scrape data: Could not find a tag for given selector Consider using debug logging and log_response for further investigation.

I’m just trying to extract a simple weather description from:

multiscrape:
  - name: DMI Weather
    resource: "https://www.dmi.dk/lokation/show/DK/2624652/Aarhus/"
    scan_interval: 3600
    sensor:
      - unique_id: dmi_weather_text
        name: DMI Weather Text
        select: "#textForecast > div:nth-child(2) > div > div.weather-forecast"

I have tried adding:

logger:
  default: info
  logs:
    custom_components.multiscrape: debug

But I keep getting the same message:

Consider using debug logging and log_response