Problem with scraping a text value

tommygun · May 10, 2023, 4:02pm

Hi guys

I recently switched from ioBroker to Home Assistant and I have to say it’s really a shame I didn’t do this sooner.

Now I came across the integration “Scrape” which could solve a problem I have had for a long time. Namely, whether HDR or SDR content is currently being played on my TV. This is displayed in text form on the admin interface of my HDFury Diva.

I have created a sensor following the example, but I just can’t get the text to be picked up as the value. In this case that would be “4K50 444 BT709 8b SDR 594MHz 2.2”.

I just don’t manage to find the right selector, the value is either empty or unknown.

It would be super nice if someone could help me with this!

Here is the screenshot:

kbrown01 · May 10, 2023, 4:14pm

The selector should be the id prepended by #, so #infoline__rx

tommygun · May 10, 2023, 4:26pm

That’s the selector I tried first, as it was automatically copied by Chrome’s developer tools. Unfortunately, it doesn’t work.

kbrown01 · May 10, 2023, 4:32pm

I would suggest posting a screenshot of your scrape setup, or pasting in the YAML if you are using YAML mode. Also, some people create bad HTML, so check that the ID is not used in another location in the full HTML.

tommygun · May 10, 2023, 9:37pm

Okay, so here is the YAML code inside my configuration.yaml:

# Scrape
scrape:
  - resource: https://www.home-assistant.io
    sensor:
      - name: "Current version"
        select: ".current-version h1"
        
  - resource: http://192.168.1.45/
    sensor:
      - name: "diva_in_tx0"
        select: "#infoline__rx"

And the index.html file: https://pastebin.com/1EUcz85G

kbrown01 · May 11, 2023, 3:14am

In your posted HTML I see this:

<div id="infoline__rx" class="actvideoinfo--border"></div>

This shows that it has no content. I would assume that you are “scraping” before the content is resolved, for example if the content comes from some JS or requires a form submit or something like that. But certainly, if that is the HTML you are scraping, there is no value in that element.

tommygun · May 12, 2023, 2:38pm

First of all, thank you for your help and explanations so far!

I understand. The way I read the code, the page generates an output depending on which ID was selected in the backend, but that output is apparently not present as plain text in the HTML code. It is visible in the developer tools, though, presumably because it has already been generated by then.

So does that mean I have no chance to implement the whole thing like this? Or is there another way?

Troon · May 12, 2023, 2:43pm

The initial HTML will be a “skeleton” with blank data fields, and then some JavaScript (probably) will take over, make further network requests using a technique known as XMLHttpRequest (XHR) or Ajax, and plug the received values into the HTML. If you use View Source, you’ll see the initial state, which is all you get when scraping. DevTools will show you the final modified state.

Best next step is to look at the Network tab in DevTools as you reload the page, and look for requests / responses with a type of xhr — if you’re lucky, one of those will be a JSON response containing the data you need, and you can use a REST sensor to pull that in.

tommygun · May 12, 2023, 3:41pm

Oh man, thank you so much for the explanation - now I understand the process!

Indeed there is a request with this type (see screenshot), and when I open its URL in the browser I actually get text that contains the information I need!

(http:// MY-IP /ssi/infopage.ssi) Here is the output: https://pastebin.com/8c8zS4dS

What I need is the value of “TX0”: " 4K50 444 BT709 8b SDR 594MHz 2.2", or rather just whether it says SDR or HDR. So ideally an entity that outputs either HDR or SDR.

What would be the best way to do this?

Troon · May 12, 2023, 4:25pm

Rest sensor with that URL as the resource, and:

value_template: "{{ (value_json['TX0']|regex_findall('[HS]DR'))[0]|default('') }}"

That says: pull out the TX0 value, find the first occurrence of HDR or SDR, and if that fails, return blank.
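
For reference, a minimal sketch of how this might look in configuration.yaml, using the rest integration laid out like the scrape block posted earlier. The IP address is assumed to be the same one used in that scrape config, and the sensor name and scan_interval are made up for illustration, so adjust them to your setup.

```yaml
# Sketch only: the IP address, sensor name, and scan_interval are assumptions.
rest:
  - resource: http://192.168.1.45/ssi/infopage.ssi
    scan_interval: 30  # seconds between polls (assumed value)
    sensor:
      - name: "diva_tx0_dynamic_range"  # hypothetical name
        # Take the TX0 string, pick the first "HDR" or "SDR" found in it,
        # and fall back to an empty string if neither is present.
        value_template: "{{ (value_json['TX0']|regex_findall('[HS]DR'))[0]|default('') }}"
```

After a restart (or a reload of the REST entities), the sensor state should read either HDR or SDR, which can then be used directly in automation conditions.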

tommygun · May 12, 2023, 4:51pm

This actually worked!!!

Many, many thanks for the great support, it really brightened my day!