Scrape sensor improved - scraping multiple values

parautenbach · December 19, 2023, 6:49pm

I’m not 100% sure, but if I follow you, you want the parsed JSON stored as a dict (key-value pairs) under that attribute, right?

You can get a dict object from JSON like this:

{% set json = '{"foo": "bar"}' %}
{{ (json | from_json)['foo'] }}

kbrown01 · December 19, 2023, 7:19pm

The difference lies in your (correct) creation of the JSON. Note that you used double quotes for the keys. The data on the website also uses double quotes, but somewhere along the way they are changed to single quotes and something is choking on that.

I took your example and tried:

{{ (value.replace("'", '"') | from_json)['props'] }}

Result:

friendly_name: Hockey Strength of Schedule JSON
props: >-
  {'pageProps': {'teams': [{'slug': 'anaheim-ducks', 'name': 'Anaheim Ducks',
  'logo': 'https://api.dailyfaceoff.com/uploads/team/logo/1/anaheim-ducks.png',
  'wins': 12, 'losses': 19, 'overtimeLosses': 0}, {'slug': 'arizona-coyotes',
  'name': 'Arizona Coyotes', 'logo':
  'https://api.dailyfaceoff.com/uploads/team/logo/23/airzona-coyotes.png',
  'wins': 15, 'losses': 13, 'overtimeLosses': 2}, {'slug': 'boston-bruins',
  'name': 'Boston Bruins', 'logo':

I also tried:

{{ (value | from_json)['props'] }}

Result:

friendly_name: Hockey Strength of Schedule JSON
props: >-
  {'pageProps': {'teams': [{'slug': 'anaheim-ducks', 'name': 'Anaheim Ducks',
  'logo': 'https://api.dailyfaceoff.com/uploads/team/logo/1/anaheim-ducks.png',
  'wins': 12, 'losses': 19, 'overtimeLosses': 0}, {'slug': 'arizona-coyotes',
  'name': 'Arizona Coyotes', 'logo':
  'https://api.dailyfaceoff.com/uploads/team/logo/23/airzona-coyotes.png',
  'wins': 15, 'losses': 13, 'overtimeLosses': 2}, {'slug': 'boston-bruins',

So neither works, but both DID grab only the data under [‘props’]. So it knows the data is JSON keys and values, but if you try to write a structure into the attribute, that last part is basically written as a string. It cannot output a dict object inside the attribute.

Then I just output “value” and behold:

props: >-
  {"props":{"pageProps":{"teams":[{"slug":"anaheim-ducks","name":"Anaheim
  Ducks","logo":"https://api.dailyfaceoff.com/uploads/team/logo/1/anaheim-ducks.png","wins":12,"losses":19,"overtimeLosses":0},{"slug":"arizona-coyotes","name":"Arizona
  Coyotes","logo":"https://api.dailyfaceoff.com/uploads/team/logo/23/airzona-coyotes.png","wins":15,"losses":13,"overtimeLosses":2},{"slug":"boston-bruins","name":"Boston
  Bruins","logo":"https://api.dailyfaceoff.com/uploads/team/logo/3/boston-bruins.png","wins":19,"losses":6,"overtimeLosses":5},{"slug":"buffalo-sabres","name":"Buffalo
  Sabres","logo":"https://api.dailyfaceoff.com/uploads/team/logo/4/buffalo-sabres.png","wins":13,"losses":16,"overtimeLosses":3},{"slug":"calgary-flames","name":"Calgary
  Flames","logo":"https://api.dailyfaceoff.com/uploads/team/logo/5/calgary-flames.png","wins":13,"losses":14,"overtimeLosses":5},{"slug":"carolina-hurricanes","name":"Carolina
  Hurricanes","logo":"https://api.dailyfaceoff.com/uploads/team/logo/6/carolina-hurricanes.png","wins":16,"losses":13,"overtimeLosses":3},{"slug":"chicago-blackhawks","name":"Chicago
  Blackhawks","logo":"https://api.dailyfaceoff.com/uploads/team/logo/7/chicago-blackhawks.png","wins":9,"losses":20,"overtimeLosses":1},{"slug":"colorado-avalanche","name":"Colorado
  Avalanche","logo":"https://api.dailyfaceoff.com/uploads/team/logo/8/colorado-avalanche.png","wins":19,"losses":10,"overtimeLosses":2},{"slug":"columbus-blue-jackets","name":"Columbus
  Blue

You can immediately see the difference … the value itself, as a string, still has double quotes! It is perfectly ready for from_json to convert it. But multiscrape is just not allowing that to happen.
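
A likely explanation for the single quotes, sketched in Python (the language Home Assistant and multiscrape are written in): once the text has been parsed into a dict, anything that stores it as plain text falls back to Python's own dict formatting, which uses single quotes and is no longer valid JSON. That the attribute is converted to text before being stored is an assumption about multiscrape, not something confirmed from its source. The sample data is a shortened stand-in for the output above, and the "St. John's" entry is made up.

```python
import json

# Shortened sample of the scraped text (valid JSON, double quotes).
raw = '{"props": {"pageProps": {"teams": [{"slug": "anaheim-ducks", "wins": 12}]}}}'

parsed = json.loads(raw)   # roughly what the from_json filter does
props = parsed["props"]    # a real dict at this point

# Assumption: the attribute value is turned into text before it is stored.
as_text = str(props)
print(as_text)

# Single-quoted text is Python dict notation, not JSON, so parsing it again fails.
try:
    json.loads(as_text)
    reparse_ok = True
except json.JSONDecodeError:
    reparse_ok = False
print(reparse_ok)

# Swapping the quotes back only helps while no value contains an apostrophe.
tricky = str({"name": "St. John's"}).replace("'", '"')
try:
    json.loads(tricky)
    tricky_ok = True
except json.JSONDecodeError:
    tricky_ok = False
print(tricky_ok)
```

If that assumption holds, it would explain why both templates above show the same single-quoted text, and why the replace("'", '"') approach is fragile even when it appears to work.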

The final result should be like this:

  props:
    pageProps:
      teams:
        - slug: anaheim-ducks
          name: Anaheim Ducks
          logo: https://api.dailyfaceoff.com/uploads/team/logo/1/anaheim-ducks.png
          wins: 12
          losses: 19
          overtimeLosses: 0
        - slug: arizona-coyotes
          name: Arizona Coyotes
          logo: >-
            https://api.dailyfaceoff.com/uploads/team/logo/23/airzona-coyotes.png
          wins: 15
          losses: 13
          overtimeLosses: 2
        - slug: boston-bruins
          name: Boston Bruins
          logo: https://api.dailyfaceoff.com/uploads/team/logo/3/boston-bruins.png
          wins: 19
          losses: 6
          overtimeLosses: 5

Hence, my workaround was simply to store the “string” as an attribute of one sensor and then add a second sensor that parses that string with from_json, and that works perfectly. My question (or statement) is … to me this is a bug: it should work but doesn’t in multiscrape. It appears to me that it can only write strings to attributes and nothing else.

Oddly enough, a code update from three months ago is titled “Support dictionaries in attributes”, so I would think either I am missing something simple or that code does not work. Not sure.

parautenbach · December 19, 2023, 8:06pm

That’s very odd, because HA will definitely preserve the data type in attributes (states are always strings, but not attributes). So, yes, it must be something multiscrape does.

Have you compared with the normal built-in scrape integration, just as a test?

I checked my own implementations, and nowhere do I use attributes with complex data types.

That commit is a peculiar one, because the docs say this:

kbrown01 · December 19, 2023, 8:29pm

First … the attribute selected in your screenshot is not the same thing. I believe that is an attribute of the HTML element to select, not the content.

Second … I can’t do this with the regular old scrape integration because AFAIK it only supports putting data into the state. The data itself is way, way longer than what fits in a state, and HA Scrape has no ability to create “attributes”.

If the JSON file itself was available through REST, then this would not be an issue because I could just use a REST sensor. However, the JSON data is buried in a script element on the page thus I needed to scrape. Then since scrape cannot do attributes, I needed multiscrape.

But alas, none works.

parautenbach · December 19, 2023, 8:39pm

Yes, you’re right. Sorry.

Correct again. I made a similar mistake looking at the scrape docs.

Have you checked in your browser’s dev tools whether there isn’t a separate HTTP request made for getting that JSON data?

kbrown01 · December 19, 2023, 8:47pm

Yes. Nowhere to be found. As far as I could see, they build that page and send it as a whole, and use JS in the page to parse the JSON that is included in it.

Don’t I wish!

Maybe @danieldotnl will read and respond but as I said, I have a completely functional workaround which is multiscrape + template sensor which does work.

Just to note … I have built a large integration for Sports Standings and Scores that is followed here:

There are many components, some drawn from REST APIs at ESPN, some from other sites, so I am pretty well versed in this. But this is the first time I have encountered this, and while I solved it, it ain’t perfect. I hate having sensors that gather whole chunks of stuff and other sensors that parse them, as the “gathering” ones are just a waste of space and never used.

parautenbach · December 20, 2023, 4:44am

You have your workaround, which is good, even if it’s not ideal.

Keen to hear if that is indeed a bug in multiscrape.

The only other real option I can think of is to write a Python script.

danieldotnl · December 20, 2023, 9:29pm

Yes @danieldotnl is trying to keep up, although severely limited in time. I really appreciate all the support @parautenbach is providing to the scrape community! I simply cannot reply to each (private) message myself, and try to focus on providing more value in multiscrape instead.

Anyway, I looked into this tonight and realized that I fixed this some time ago but never merged it into the master branch.
I believe this release fixes your issue:

kbrown01 · December 21, 2023, 7:56pm

Thank you! I can confirm this works using:

  - name: SOS scraper2
    resource: https://www.dailyfaceoff.com/nhl-weekly-schedule
    scan_interval: 360000
    sensor:
      - unique_id: hockey_strength_of_schedule_test
        name: Hockey Strength of Schedule Test
        select: '#__NEXT_DATA__'
        value_template: '{{ now() }}'
        attributes:
          - name: props
            select: '#__NEXT_DATA__'
            value_template: >
                {{ value | from_json }}

Note @parautenbach … I also tried with value_json but that did not work. I would assume this is because it really is not a JSON file, it is a string of JSON.

kbrown01 · December 21, 2023, 8:07pm

You can see in the following messages that it was only some unpublished code. I downloaded that version, tested it, and got it working in one step. Thanks for trying to help, and I am glad it was not just some stupid mistake I was making.

Onward to new challenges!

parautenbach · December 22, 2023, 6:19am

You’re welcome! And I saw, thank you. Great of the contributor to have published that change.

albamatti · January 1, 2024, 1:23pm

Hello all,
I am trying to scrape from: https://gpsgadget.buienradar.nl/data/raintext?lat=52&lon=5
I use this code:

- resource: https://gpsgadget.buienradar.nl/data/raintext?lat=52&lon=5
  scan_interval: 900
  sensor:
  - unique_id: Regenklok
    name: Regenklok
    select: "body > pre"
    value_template: '{{ value.split("|")[1]}}'
    unit_of_measurement: "%"

From debug I get this response: “Unable to scrape data: Could not find a tag for given selector”
I tried to use: select: “pre” but that didn’t solve the issue
Maybe someone can help me with the right tags?

==== Solution ====
I found the solution. I fixed it with:

    select: "p"
    value_template: "{{ value.split('|')[1] }}"

wigster · January 3, 2024, 12:31pm

This may be useful to some of you. I’ve figured out how to scrape dynamic (JavaScript-generated) websites using Browserless and multiscrape and have written this up here:

PskNorz · January 4, 2024, 1:42am

Hi guys, I’m very new to scrape.
I am trying to get electricity price info but I can’t make it work.

The site is this: https://www.dei.gr/en/home/electricity/g1-g1n/

I am trying to get those two values but I can’t.

Igor01-Tech · January 11, 2024, 4:46pm

Hey guys!

Could you please help me scrape a specific file on GitHub? I want to get notified, when it changed: https://github.com/paperless-ngx/paperless-ngx/blob/c2c9a953d3f4dfb2b7d5eb8f4e055aad8339aae2/docker/compose/docker-compose.portainer.yml

So I want to check and get “4 days ago”.

So I got this as the selector:
#repo-content-pjax-container > react-app > div > div > div.Box-sc-g0xbh4-0.fSWWem > div > div > div.Box-sc-g0xbh4-0.emFMJu > div.Box-sc-g0xbh4-0.hlUAHL > div > div:nth-child(3) > div.Box-sc-g0xbh4-0.brFBoI > div > div.Box-sc-g0xbh4-0.jGfYmh > div.Box-sc-g0xbh4-0.lhFvfi > span.Text-sc-17v1xeu-0.kKFNhh.react-last-commit-oid-timestamp > relative-time

But it is not working… here is what it looks like in HA:

What am I doing wrong here?

Can anyone test this and help? Thanks a bunch!!

parautenbach · January 11, 2024, 8:04pm

Monitoring the commits with this seems a lot simpler: GitHub - Home Assistant.

homebrew · January 11, 2024, 9:36pm

Seems simple enough, but I’m having no luck.
I’ve used console to copy the selector path.
Aaaand nothing.
The page: Park City Weather | Park City Mountain Resort

The first bit of data I’m trying to grab is the 24hr snow fall, so console gave me this:
#snow_report_1 > div.snow_report__content.row > ul > li:nth-child(2) > div > h5
It seems to make sense, but doesn’t work.

Any ideas?

JeroenB · January 22, 2024, 10:17am

I’m trying to scrape a temperature measurement from a website - measurements are added every hour to a string - so far I can retrieve the entire string with measurements after ‘var query_temp’ - but I’m not experienced enough with this to obtain the last measurement (these are always in the positions -5 to -1 from the end of the string - indicated in the figure below). Could anyone point me in the right direction?

Troon · January 22, 2024, 10:29am

In future, please help us by posting relevant data as text: I’ve had to re-type all this for testing.

value_template: >
  {{ value|regex_findall("\s(\-?[0-9\.]*)\s")|last }}

regex_findall is returning a list of all numbers that are surrounded by whitespace:

\s — whitespace character before
( — start remembering
\-? — optional minus sign
[0-9\.]* — any sequence of digits and points
) — stop remembering
\s — whitespace character after
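
For anyone who wants to try the pattern outside Home Assistant, here is a minimal Python sketch. To my knowledge the regex_findall filter behaves like Python's re.findall, so the results should carry over, but treat that as an assumption. Both sample strings are made up for illustration.

```python
import re

pattern = r"\s(\-?[0-9\.]*)\s"

# Made-up sample: numbers separated by two spaces, with a space at each end.
sample = " 44.8  44.9  -3.5  58.4 "
matches = re.findall(pattern, sample)
print(matches)       # every number that has whitespace on both sides
print(matches[-1])   # the equivalent of the |last filter

# Because * also accepts "no digits at all", whitespace that follows the
# last number can produce empty matches, so the last item may be empty.
trailing = re.findall(pattern, " 58.4 \n   ")
print(trailing)
```

Two things to keep in mind: each match consumes the whitespace on both sides, so the numbers need at least two whitespace characters between them, and it may be worth trimming the value before matching so that trailing whitespace does not leave an empty string as the last item.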

JeroenB · January 22, 2024, 2:46pm

Thank you! Still I’m having trouble → The data are here:

<script type="text/javascript">
      
      var query_labels = " 01  02  03  04  05  06  07  08  09  10  11  12  13  14  15 ";
      var query_temp = " 44.8  44.9  44.9  44.8  44.7  44.5  49.8  60.8  60.6  60.4  60.2  59.9  59.6  59.3  58.4 ";
      var query_elec = " 279.45808708333334  400.80427425  0.0  0.0  0.0  0.0  2158.4836078611106  29.757760666666666  251.44402029444447  0.0  0.0  0.0  353.20271759999997  0.0  552.133760033333 ";
      var query_heat = " 946.9248917628065  1501.3182562778238  0.0  0.0  0.0  0.0  2366.505316186263  29.757760666666666  808.0376479104308  0.0  0.0  0.0  1304.7204898726695  0.0  2045.6582218857018 "
      var total_heat = "9.0";
      var total_electricity = "4.0";
      var month = "";

And I have used the following code:

value_template: "{{(value.split('var')[2])| replace('query_temp = \"', '')| replace('\";','')| regex_findall("\s(\-?[0-9\.]*)\s")| last| float}}"

But I’m getting a new error if I include this line:

Error loading /config/configuration.yaml: while parsing a block mapping
  in "/config/configuration.yaml", line 795, column 9
expected <block end>, but found '<scalar>'
  in "/config/configuration.yaml", line 798, column 119