That JavaScript builds a single-page app that uses websockets for its communication. When you use DevTools to inspect an element, you're looking at what the JavaScript has built. Scrape only sees the initial HTML (what View Source shows), as pasted above.
Working out a good select takes some understanding of HTML elements and attributes, BeautifulSoup selectors, how HA treats them, and how website connections work. The Kerkstraat one won't work because if you access it from a terminal (requesting the site via curl, for example) it returns an error saying you need JavaScript and cookies enabled.
Try it: from a terminal, enter curl https://huispedia.nl/amsterdam/1017gl/kerkstraat/8 and note that what comes back isn’t the source you were expecting.
For the Yahoo! one, I looked at the source (pretty-printed via DevTools, as the original is a mess):
and saw that we want the first fin-streamer element under the div with id="Lead-5-QuoteHeaderProxy". We could also use the next div down as our “anchor”, with id="quote-header-info".
Note that ids should be unique within the document, whereas the same class can appear on many elements. That’s important for understanding whether your select is going to return one item or many.
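To make the id-versus-class point concrete, here’s a small BeautifulSoup sketch. The HTML is a simplified stand-in I made up to mimic the structure described above, not the real Yahoo! page:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the quote-header markup (not the real page).
html = """
<div id="quote-header-info">
  <fin-streamer data-field="regularMarketPrice">191.45</fin-streamer>
  <fin-streamer data-field="regularMarketChange">+1.82</fin-streamer>
</div>
<fin-streamer data-field="regularMarketPrice">elsewhere on the page</fin-streamer>
"""
soup = BeautifulSoup(html, "html.parser")

# An id is unique, so anchoring on it narrows the search to one subtree.
# select_one() returns the first match in document order.
first = soup.select_one("div#quote-header-info fin-streamer")
print(first.get_text())  # the price inside the div, not the stray element

# select() returns a list, which shows how many elements a tag-based
# match can pick up; worth checking before trusting "the first one".
all_matches = soup.select("div#quote-header-info fin-streamer")
print(len(all_matches))
```

The same CSS selector string works in Scrape’s “select” field, since Scrape uses BeautifulSoup underneath.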
Despite being a strong advocate of YAML, I use the UI for Scrape configuration as it allows for changes without restarting. Also check your logs if the entities aren’t returning what you expect.
It’s important to realise that Scrape should be treated as a last-resort way of accessing information. It is messy, unreliable and dependent on the site owners not changing their site structure. Always look for another source of the data, ideally something machine-readable like XML or JSON.
I dug deep into it again.
Is it always looking at the first element (fin-streamer in this case)? What if I also want to scrape the second value (regularMarketChangePercent)? Can I use data-field instead of fin-streamer, since fin-streamer appears more than once?
Note that I’m still anchoring off the <div> with id="quote-header-info" as there are lots of other <fin-streamer>s including many with data-field="regularMarketChange" in the document.
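An attribute selector lets you pick the field by name rather than by position, while the id keeps the match inside the header block. A sketch, again using invented markup that mirrors the structure described:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the page structure discussed above.
html = """
<div id="quote-header-info">
  <fin-streamer data-field="regularMarketPrice">191.45</fin-streamer>
  <fin-streamer data-field="regularMarketChangePercent">(+0.96%)</fin-streamer>
</div>
<fin-streamer data-field="regularMarketChangePercent">(elsewhere)</fin-streamer>
"""
soup = BeautifulSoup(html, "html.parser")

# [data-field="..."] matches on the attribute value; combined with the
# id anchor, only the element inside the header div is returned.
pct = soup.select_one(
    'div#quote-header-info fin-streamer[data-field="regularMarketChangePercent"]'
)
print(pct.get_text())
```

That selector string should drop straight into the Scrape sensor’s “select” field unchanged.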
It’s really easy to accidentally select the wrong thing, which is why I’d recommend using the UI rather than YAML for scrape — a YAML change to a scrape sensor currently requires a full restart.
The Quick Reload is identical to the “ALL YAML CONFIGURATION” action, which is the same as clicking all the individual items. Scrape isn’t one of them — multiscrape may be, but that’s not what’s being discussed here.
hm I thought I was almost there but it doesn’t recognise the % as a numeric value. That’s true because it adds an ( and ) before and after the numeric value I’m trying to scrape…
Can I somehow remove the ( ) ?
thats my last question haha
ValueError: Sensor sensor.test has device class None, state class None unit % and suggested precision None thus indicating it has a numeric value; however, it has the non-numeric value: (+0.96%) (<class 'str'>)
Traceback (most recent call last):
File "/usr/src/homeassistant/homeassistant/components/sensor/__init__.py", line 581, in state
numerical_value = float(value) # type:ignore[arg-type]
ValueError: could not convert string to float: '(+0.96%)'
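In the Scrape config you’d do this with a value_template that strips the wrapping characters before the state is stored. A quick Python sketch of the cleaning that template needs to perform on the scraped string:

```python
# The scraped state is "(+0.96%)"; the sensor needs a plain number.
raw = "(+0.96%)"

# str.strip() with a set of characters removes any of them from both
# ends of the string, leaving only the signed number in the middle.
cleaned = raw.strip("()%")
print(cleaned)         # "+0.96"
print(float(cleaned))  # now parses as a float without a ValueError
```

In Jinja the equivalent would chain replace filters (or a regex) over `value` inside the template; the goal is the same, to hand Home Assistant something `float()` can parse.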