Scrape sensor improved - scraping multiple values

I have a situation where the page I wish to scrape has embedded JSON in a script tag. The script tag has an id, so I can grab it. It is hockey statistics, so it is very large and hence needs to go into an attribute rather than the state. I would have thought I could do this in one step, but so far I have not been able to figure out the “secret”.

What I have (note I use includes so this YAML is in a file called multiscrape.yaml):

  - name: SOS scraper
    resource: https://www.dailyfaceoff.com/nhl-weekly-schedule
    scan_interval: 36000
    sensor:
      - unique_id: hockey_strength_of_schedule
        name: Hockey Strength of Schedule
        select: '#__NEXT_DATA__'
        value_template: '{{ now() }}'
        attributes:
          - name: props
            select: '#__NEXT_DATA__'
            value_template: >
                {{ value.replace("'", '"') }}
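
For anyone curious what the one-step idea looks like outside of Home Assistant, here is a minimal Python sketch of the same approach: find the script tag with id `__NEXT_DATA__` and parse its text as JSON. The HTML below is a made-up stand-in for the real page, and the parser class is just an illustration, not anything multiscrape uses internally.

```python
# Sketch: extract embedded JSON from a <script id="__NEXT_DATA__"> tag
# using only the standard library. SAMPLE_HTML stands in for the page.
import json
from html.parser import HTMLParser

SAMPLE_HTML = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"teams": [{"slug": "anaheim-ducks", "wins": 12}]}}}
</script>
</body></html>
"""

class NextDataExtractor(HTMLParser):
    """Collects the text inside the script tag whose id is __NEXT_DATA__."""
    def __init__(self):
        super().__init__()
        self._in_target = False
        self.payload = ""

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._in_target = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_target = False

    def handle_data(self, data):
        if self._in_target:
            self.payload += data

parser = NextDataExtractor()
parser.feed(SAMPLE_HTML)
data = json.loads(parser.payload)
print(data["props"]["pageProps"]["teams"][0]["wins"])  # 12
```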

Now I have tried everything to parse “value” using “from_json”, but the result never looks or behaves correctly. So, as shown above, I just store the JSON string in an attribute on this sensor and then use a template sensor to parse it, which works perfectly (again, I use includes, so this is in my template.yaml):

###
### Hockey Weekly Schedule
###
  - name: Hockey Weekly schedule
    unique_id: hockey_weekly_schedule
    state: "{{ now() }}"
    attributes:
        sos: "{{ (state_attr('sensor.hockey_strength_of_schedule','props') | from_json) }}"

This works perfectly, but I would have thought I could just use the “from_json” filter directly in multiscrape. When I do, even with the replace of ' to ", it just returns the string with the single quotes back in place. I am also confused about why “select” is needed twice, or maybe I just don't know how it all works.
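As an aside, the replace of ' to " is at best a stopgap, because it corrupts any value that legitimately contains an apostrophe or embedded quote. A quick Python sketch of that failure mode (the team name is made up for illustration, and `ast.literal_eval` is not available in HA's template sandbox, so this is only to show why the swap is fragile):

```python
# Sketch: why blindly swapping single quotes for double quotes breaks
# on apostrophes. The dict repr below is hypothetical sample data.
import ast
import json

# Single-quoted text like what was coming back (a Python dict repr, not JSON):
raw = "{'name': \"St. John's IceCaps\", 'wins': 12}"

# The naive replacement also hits the apostrophe inside the name:
broken = raw.replace("'", '"')
try:
    json.loads(broken)
    parsed_naively = True
except json.JSONDecodeError:
    parsed_naively = False

# ast.literal_eval understands Python literal syntax directly:
parsed = ast.literal_eval(raw)
print(parsed_naively)   # False: the quote swap produced invalid JSON
print(parsed["wins"])   # 12
```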

What I mean is if I use this:

  - name: SOS scraper test
    resource: https://www.dailyfaceoff.com/nhl-weekly-schedule
    scan_interval: 360000
    sensor:
      - unique_id: hockey_strength_of_schedule_test
        name: Hockey Strength of Schedule Test
        select: '#__NEXT_DATA__'
        value_template: '{{ now() }}'
        attributes:
          - name: sos
            select: '#__NEXT_DATA__'
            value_template: >
                {{ value.replace("'", '"') | from_json }}

I get this:

But for my workaround, I get this:

Any thoughts on what is wrong? The workaround is fine, but why have two sensors when one would do?

https://jinja.palletsprojects.com/en/3.0.x/

Without seeing your config or example data it’s hard to help, but try splitting on the currency symbol — or use a regex to extract the dish.

I don’t understand what commands you’re referring to. Do you mean templating functions like the ones you’re already using?

It’s not clear from the docs, but use “value_json.” instead of “value.”.

Thanks for the input; however, that does not work at all. This:

  - name: SOS scraper2
    resource: https://www.dailyfaceoff.com/nhl-weekly-schedule
    scan_interval: 360000
    sensor:
      - unique_id: hockey_strength_of_schedule_json
        name: Hockey Strength of Schedule JSON
        select: '#__NEXT_DATA__'
        value_template: '{{ now() }}'
        attributes:
          - name: props
            select: '#__NEXT_DATA__'
            value_template: >
                {{ value_json }}

Yields nothing in “props”:

I have read through a lot here, and nowhere do I see someone returning a large JSON structure to an attribute on a sensor. I see many examples where people “pluck off” a single value, which I can confirm works. It seems to me that the last step writes everything to the attribute as a string.

I’m not 100% sure, but if I follow you, you want the parsed JSON as a dict (key–value pairs) under that attribute, right?

You can get a dict object from JSON like this:

{% set json = '{"foo": "bar"}' %}
{{ (json | from_json)['foo'] }}

The issue lies in how the JSON is quoted. You will note you (correctly) used double quotes for keys. In fact, the data on the website itself does use double quotes, but somewhere along the way they are changed to single quotes, and something is choking on that.

I took your example and tried:

{{ (value.replace("'", '"') | from_json)['props'] }}

Result:

friendly_name: Hockey Strength of Schedule JSON
props: >-
  {'pageProps': {'teams': [{'slug': 'anaheim-ducks', 'name': 'Anaheim Ducks',
  'logo': 'https://api.dailyfaceoff.com/uploads/team/logo/1/anaheim-ducks.png',
  'wins': 12, 'losses': 19, 'overtimeLosses': 0}, {'slug': 'arizona-coyotes',
  'name': 'Arizona Coyotes', 'logo':
  'https://api.dailyfaceoff.com/uploads/team/logo/23/airzona-coyotes.png',
  'wins': 15, 'losses': 13, 'overtimeLosses': 2}, {'slug': 'boston-bruins',
  'name': 'Boston Bruins', 'logo':

I also tried:

{{ (value | from_json)['props'] }}

result:

friendly_name: Hockey Strength of Schedule JSON
props: >-
  {'pageProps': {'teams': [{'slug': 'anaheim-ducks', 'name': 'Anaheim Ducks',
  'logo': 'https://api.dailyfaceoff.com/uploads/team/logo/1/anaheim-ducks.png',
  'wins': 12, 'losses': 19, 'overtimeLosses': 0}, {'slug': 'arizona-coyotes',
  'name': 'Arizona Coyotes', 'logo':
  'https://api.dailyfaceoff.com/uploads/team/logo/23/airzona-coyotes.png',
  'wins': 15, 'losses': 13, 'overtimeLosses': 2}, {'slug': 'boston-bruins',

So neither works, but both DID grab only the data under ['props']. So it knows the JSON keys and values, but if you try to write a structure into the attribute, that last step basically serializes it as a string. It cannot output a dict object into the attribute.

Then I just output “value” and behold:

props: >-
  {"props":{"pageProps":{"teams":[{"slug":"anaheim-ducks","name":"Anaheim
  Ducks","logo":"https://api.dailyfaceoff.com/uploads/team/logo/1/anaheim-ducks.png","wins":12,"losses":19,"overtimeLosses":0},{"slug":"arizona-coyotes","name":"Arizona
  Coyotes","logo":"https://api.dailyfaceoff.com/uploads/team/logo/23/airzona-coyotes.png","wins":15,"losses":13,"overtimeLosses":2},{"slug":"boston-bruins","name":"Boston
  Bruins","logo":"https://api.dailyfaceoff.com/uploads/team/logo/3/boston-bruins.png","wins":19,"losses":6,"overtimeLosses":5},{"slug":"buffalo-sabres","name":"Buffalo
  Sabres","logo":"https://api.dailyfaceoff.com/uploads/team/logo/4/buffalo-sabres.png","wins":13,"losses":16,"overtimeLosses":3},{"slug":"calgary-flames","name":"Calgary
  Flames","logo":"https://api.dailyfaceoff.com/uploads/team/logo/5/calgary-flames.png","wins":13,"losses":14,"overtimeLosses":5},{"slug":"carolina-hurricanes","name":"Carolina
  Hurricanes","logo":"https://api.dailyfaceoff.com/uploads/team/logo/6/carolina-hurricanes.png","wins":16,"losses":13,"overtimeLosses":3},{"slug":"chicago-blackhawks","name":"Chicago
  Blackhawks","logo":"https://api.dailyfaceoff.com/uploads/team/logo/7/chicago-blackhawks.png","wins":9,"losses":20,"overtimeLosses":1},{"slug":"colorado-avalanche","name":"Colorado
  Avalanche","logo":"https://api.dailyfaceoff.com/uploads/team/logo/8/colorado-avalanche.png","wins":19,"losses":10,"overtimeLosses":2},{"slug":"columbus-blue-jackets","name":"Columbus
  Blue

You can immediately see the difference … the value itself still has double quotes! It is perfectly ready for from_json to convert. But multiscrape is just not allowing that to happen.

The final result should be like this:

  props:
    pageProps:
      teams:
        - slug: anaheim-ducks
          name: Anaheim Ducks
          logo: https://api.dailyfaceoff.com/uploads/team/logo/1/anaheim-ducks.png
          wins: 12
          losses: 19
          overtimeLosses: 0
        - slug: arizona-coyotes
          name: Arizona Coyotes
          logo: >-
            https://api.dailyfaceoff.com/uploads/team/logo/23/airzona-coyotes.png
          wins: 15
          losses: 13
          overtimeLosses: 2
        - slug: boston-bruins
          name: Boston Bruins
          logo: https://api.dailyfaceoff.com/uploads/team/logo/3/boston-bruins.png
          wins: 19
          losses: 6
          overtimeLosses: 5

Hence, my workaround: simply store the “string” in an attribute of one sensor, and then have a second sensor parse that string with from_json, which works perfectly. My question (or statement) is that to me this is a bug; it should work in multiscrape but doesn’t. It appears it can only write strings to attributes, nothing else.

Oddly enough, three months ago an update to the code was titled “Support dictionaries in attributes”, so I would think either I am missing something simple or that code does not work. Not sure.

That’s very odd, because HA will definitely preserve the data type in attributes (states are always strings, but not attributes). So, yes, it must be something multiscrape does.

Have you compared with the normal built-in scrape integration, just as a test?

I checked my own implementations, and nowhere do I use attributes with complex data types.

That commit is a peculiar one, because the docs say this:

First … “attribute”, as selected in your screenshot, is not the same thing. I believe that selects an attribute on the HTML element, not its content.

Second … I can’t do this with the regular old Scrape integration because, AFAIK, it only supports putting data into the state. The data itself is way, way longer than the 255 characters a state can hold, and HA Scrape has no ability to create “attributes”.

If the JSON file itself was available through REST, then this would not be an issue because I could just use a REST sensor. However, the JSON data is buried in a script element on the page thus I needed to scrape. Then since scrape cannot do attributes, I needed multiscrape.

But alas, neither works.

Yes, you’re right. Sorry.

Correct again. I made a similar mistake looking at the scrape docs.

Have you checked in your browser’s dev tools whether there isn’t a separate HTTP request made for getting that JSON data?

Yes. Nowhere to be found. As far as I could see, they build that page server-side, send it as a whole, and use JS on the page to parse the JSON included in it.

Don’t I wish!

Maybe @danieldotnl will read and respond but as I said, I have a completely functional workaround which is multiscrape + template sensor which does work.

Just to note … I have built a large integration for Sports Standings and Scores that is followed here:

There are many components, some drawn from REST APIs at ESPN, some from other sites, so I am fairly well versed in this. But this is the first time I encountered this, and while I solved it, it ain’t perfect. I hate having sensors that gather whole chunks of data and other sensors that parse it, as the “gathering” ones are just a waste of space and never used directly.


You have your workaround, which is good, even if it’s not ideal.

Keen to hear if that is indeed a bug in multiscrape.

The only other real option I can think of is to write a Python script.

Yes @danieldotnl is trying to keep up, although severely limited in time. I really appreciate all the support @parautenbach is providing to the scrape community! I simply cannot reply to each (private) message myself, and try to focus on providing more value in multiscrape instead.

Anyway, I looked into this tonight and realized that I fixed this some time ago but never merged it into the master branch.
I believe this release fixes your issue:


Thank you! I can confirm this works using:

  - name: SOS scraper2
    resource: https://www.dailyfaceoff.com/nhl-weekly-schedule
    scan_interval: 360000
    sensor:
      - unique_id: hockey_strength_of_schedule_test
        name: Hockey Strength of Schedule Test
        select: '#__NEXT_DATA__'
        value_template: '{{ now() }}'
        attributes:
          - name: props
            select: '#__NEXT_DATA__'
            value_template: >
                {{ value | from_json }}

Note @parautenbach … I also tried value_json, but that did not work. I would assume this is because the response is not itself JSON; it is an HTML page with a string of JSON inside it.
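To sketch that distinction in plain Python (the HTML below is a trimmed, made-up stand-in, and the crude string slicing just simulates the scrape step): a value_json-style parse assumes the whole HTTP body is JSON, while from_json operates on the already-scraped string.

```python
# Sketch: why parsing the whole response body fails here, while parsing
# the scraped tag text succeeds.
import json

html_body = '<script id="__NEXT_DATA__">{"props": {"pageProps": {}}}</script>'

# A "value_json"-style parse of the whole body fails: it is HTML, not JSON.
try:
    json.loads(html_body)
    body_is_json = True
except json.JSONDecodeError:
    body_is_json = False

# The scraped tag text is the string that actually needs parsing:
scraped = html_body.split(">", 1)[1].rsplit("<", 1)[0]
parsed = json.loads(scraped)
print(body_is_json)       # False
print("props" in parsed)  # True
```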


You can see in the following messages that it was just some unpublished code. I downloaded that version, tested it, and got it working in one step. Thanks for trying to help, and I am glad it was not just some stupid mistake on my part.

Onward to new challenges!


You’re welcome! And I saw, thank you. Great of the contributor to have published that change. 🙂

Hello all,
I am trying to scrape https://gpsgadget.buienradar.nl/data/raintext?lat=52&lon=5
using this code:

- resource: https://gpsgadget.buienradar.nl/data/raintext?lat=52&lon=5
  scan_interval: 900
  sensor:
  - unique_id: Regenklok
    name: Regenklok
    select: "body > pre"
    value_template: '{{ value.split("|")[1]}}'
    unit_of_measurement: "%"

In the debug log I get this response: “Unable to scrape data: Could not find a tag for given selector”.
I tried select: "pre", but that didn’t solve the issue.
Maybe someone can help me with the right selector?

==== Solution ====
I found the solution. I fixed it with:

    select: "p"
    value_template: "{{ value.split('|')[1] }}"
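
For reference, my understanding of the raintext payload is that each line is “intensity|HH:MM”, with intensity in the 000–255 range. A Python sketch of parsing the whole thing (the sample string stands in for the live response):

```python
# Sketch: parse the buienradar raintext format into (time, intensity) pairs.
raintext = "000|21:30\n077|21:35\n113|21:40"

forecast = []
for line in raintext.splitlines():
    intensity, clock = line.split("|")
    forecast.append((clock, int(intensity)))

# Note that value.split("|")[1] in the template above grabs everything
# between the first and second pipe (the first time stamp plus the next
# line's intensity); splitting per line first is more precise.
print(forecast[1])  # ('21:35', 77)
```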

This may be useful to some of you. I’ve figured out how to scrape dynamic (JavaScript-generated) websites using Browserless and multiscrape and have written it up here:


Hi guys, I’m very new to scraping.
I am trying to get info on electricity prices but I can’t make it work.

The site is: https://www.dei.gr/en/home/electricity/g1-g1n/

And I am trying to get those two values, but I can’t.

Hey guys!

Could you please help me scrape a specific file on GitHub? I want to get notified when it changes: https://github.com/paperless-ngx/paperless-ngx/blob/c2c9a953d3f4dfb2b7d5eb8f4e055aad8339aae2/docker/compose/docker-compose.portainer.yml

So I want to check it and get the “4 days ago” text.

So I got this as the selector:
#repo-content-pjax-container > react-app > div > div > div.Box-sc-g0xbh4-0.fSWWem > div > div > div.Box-sc-g0xbh4-0.emFMJu > div.Box-sc-g0xbh4-0.hlUAHL > div > div:nth-child(3) > div.Box-sc-g0xbh4-0.brFBoI > div > div.Box-sc-g0xbh4-0.jGfYmh > div.Box-sc-g0xbh4-0.lhFvfi > span.Text-sc-17v1xeu-0.kKFNhh.react-last-commit-oid-timestamp > relative-time

But it is not working… here is what it looks like in HA:

What am I doing wrong here?

Can anyone test this and help? Thanks a bunch!!

Monitoring the commits with this seems a lot simpler: GitHub - Home Assistant.
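
Along the same lines, a REST-style query against GitHub’s commits API avoids the brittle CSS selector entirely; the “last commit touching a path” can be fetched with something like GET https://api.github.com/repos/paperless-ngx/paperless-ngx/commits?path=docker/compose/docker-compose.portainer.yml&per_page=1. A Python sketch of pulling the date out of that response (the sample below is made-up data in what I understand the response shape to be, not a live reply):

```python
# Sketch: extract the last-changed date from a GitHub "List commits"
# API response. sample_response is fabricated illustration data.
import json

sample_response = json.dumps([
    {"sha": "c2c9a95",
     "commit": {"committer": {"date": "2023-12-01T10:15:00Z"}}}
])

commits = json.loads(sample_response)
last_changed = commits[0]["commit"]["committer"]["date"]
print(last_changed)  # 2023-12-01T10:15:00Z
```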