Scrape sensor improved - scraping multiple values

I just copied what a fellow Kiwi had done, as I had no clue: Scrape sensor improved - scraping multiple values - #282 by joem

I’m glad that you’ve got it working @xbmcnut
Question: did you try to trim the select content to something like this dd:nth-child(4) > div.compound-read-low > ul? Just to make the code shorter.

@KevinE try using fieldset in the select

Thanks for your comments. Would be great if you can help to improve the documentation! (either by a PR on github or by sending me your suggested changes)

Here are my answers to your questions:

  1. To find the form on the HTML page, multiscrape needs a CSS selector. CSS selectors that refer to the id of an element always require a hash sign (#).
     For retrieving the input fields of the form, the name is used, as this is also what is submitted.
     So in your case it means:

     form_submit:
       submit_once: true
       resource: "https://eyeonwater.ca/signon"
       select: "#signin_account"
       input:
         username: !secret eyeonwater_username
         password: !secret eyeonwater_password

  2. https://stackoverflow.com/questions/19109912/yaml-do-i-need-quotes-for-strings-in-yaml
  3. Try to find out whether it loads a new page after the mouse click (then use that one for scraping), or check (in your browser developer tools) whether the values you want to scrape are already retrieved from the server. In that case, multiscrape is not bothered by the mouse click and just continues scraping.

Hi @danieldotnl

Thanks for a great Home Assistant add-on.
Can you please assist? I have been battling to get the right selector for my scraping, but it keeps giving me the same error at the same position, the ID=4. It doesn’t like the “4”, and if I add “#4” it doesn’t like that either. What is the right selector for this scrape, please? The page is shown below; the yellow highlight marks what I am looking for at ID=4, which was “86.0%”.


Here is one of the many selectors I tried, in yellow:
Battery Percent 3
And here is one of the logs pointing at the “4” as the problem, which I have had with the selector parts in plenty of different orders.

Hope this is enough information to assist me.
Thanks

Thanks for the response Daniel.

Don’t know if you looked at the logon page for eyeonwater.ca but there is no “username” field (just email address) so how are you mapping my email address to the correct field?

Still can’t get it to work. I get one error and one warning in my system log. I think I am not getting authenticated, but I am not sure (a debug message saying whether authentication passed or failed would be good to log). The same error below is logged when I use an incorrect password.

Logger: custom_components.multiscrape.coordinator
Source: custom_components/multiscrape/coordinator.py:62
Integration: Multiscrape scraping component ([documentation](https://github.com/danieldotnl/ha-multiscrape), [issues](https://github.com/danieldotnl/ha-multiscrape/issues))
First occurred: 9:19:51 AM (1 occurrences)
Last logged: 9:19:51 AM

Scraper_noname_0 # Exception in form-submit feature. Will continue trying to scrape target page. Could not find form

And this warning:

Logger: custom_components.multiscrape.sensor
Source: custom_components/multiscrape/sensor.py:163
Integration: Multiscrape scraping component ([documentation](https://github.com/danieldotnl/ha-multiscrape), [issues](https://github.com/danieldotnl/ha-multiscrape/issues))
First occurred: 9:29:00 AM (1 occurrences)
Last logged: 9:29:00 AM

Scraper_noname_0 # Daily Water Consumption # Unable to scrape data: Could not find a tag for given selector. Consider using debug logging and log_response for further investigation.

I tried commenting out the sensor portion of my code to try and isolate an authentication issue and no errors showed in the log. Is this expected behaviour?

Cheers

Perhaps the ‘4’ is interpreted as a number instead of a string. I don’t know how to get around this, however. You might try writing a basic Python script with bs4 to test the selector outside Home Assistant.
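A minimal sketch of such a bs4 test, assuming the value sits in an element with id="4". The HTML snippet and the `select_text` helper are made up for illustration, since the real page is not shown here. An id selector written as `#4` is not valid CSS because an identifier cannot start with a digit, so the sketch matches the id through the attribute form `[id="4"]` instead.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the real page, which is not shown in the thread.
SAMPLE_HTML = """
<table>
  <tr><td>Battery Percent</td><td id="4">86.0%</td></tr>
</table>
"""


def select_text(html: str, selector: str) -> str:
    """Return the stripped text of the first element matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one(selector)
    if tag is None:
        raise LookupError(f"no element matches {selector!r}")
    return tag.get_text(strip=True)


# '#4' is not a valid CSS id selector (identifiers cannot start with a digit),
# so the id is matched with an attribute selector instead.
battery = select_text(SAMPLE_HTML, '[id="4"]')
print(battery)
```

If this prints the expected value against a saved copy of the real page, the same `[id="4"]` string is worth trying as the `select` value in the multiscrape config.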

Something like I did:

And @KevinE. Thanks for the tips but both of those suggestions are above my pay grade :grimacing:

Do you get the page_response_body.txt or page_soup.txt generated?

Yes, not sure what I do with these. I have scanned them both but there is no rendered data in them.

Well, an id attribute of “4” is not valid in HTML 4 or XML, where an ID must be a name token that begins with a letter. See Basic HTML data types.

Guys, hoping someone can help me out here.

I am trying to get the realtime electricity price from here: Price Information

The table moves every half an hour and shows the past (real price), present and future (forecasted price). The price that I want is the immediate past, which is always second row, fifth column (USEP ($/MWh)).

From Chrome Inspect, I get

#realtimeWindow > div > div.tabberlive > div:nth-child(2) > div > div > div.realtimeTableContainer > table > tbody > tr:nth-child(2) > td:nth-child(5)

Here is my configuration.yaml

multiscrape:
  - resource: https://www.emcsg.com/marketdata/priceinformation
    scan_interval: 30
    sensor:
      - unique_id: electricity_usep_price
        name: Electricity USEP Price
        select: "#realtimeWindow > div > div.tabberlive > div:nth-child(2) > div > div > div.realtimeTableContainer > table > tbody > tr:nth-child(2) > td:nth-child(5)"
        #value_template: '{{ (value.split(":")[1]) }}'

I tried both WITH and WITHOUT tbody but got the same error in the log. My log is already set to DEBUG mode.

This error originated from a custom integration.

Logger: custom_components.multiscrape.sensor
Source: custom_components/multiscrape/sensor.py:163
Integration: Multiscrape scraping component (documentation, issues)
First occurred: 17:01:39 (9 occurrences)
Last logged: 17:05:41

Scraper_noname_0 # Electricity USEP Price # Unable to scrape data: Could not find a tag for given selector. Consider using debug logging and log_response for further investigation.

What am I missing?

@Dex
Try this:
select: ".realtimePriceTable tr:nth-child(2) td:nth-child(5)"

There is a username:

There should be something in your logs like: Form seems to be submitted succesfully

Don’t think I can help further without credentials.

Hi everyone,
I hope someone can help me!
I’m trying to read some energy values, but I have a problem.
The website requires me to log in with a password and then enter a date.
I’m able to log in with my credentials, but when I read the date the value is 01/01/2015!
This is very strange because I see the correct date when I use Chrome:

This is my yaml file:

[image]

How can I insert the actual date before scraping?
Thank you

Hi Daniel,
I tried using both username: and email: with no success. There are no messages in the log that say I passed or failed authentication, even if I put an incorrect password in. I think it’s worth cleaning this part of the code up to help in debugging authentication issues.

Cheers,

Hi, I have a question.
I’m not sure if I’m properly authenticated.
How can I check this using the logger?
My yaml:
[image]

This folder has been created automatically:
[image]

Which file should I check?
I have read in this community that I should see a message like this in my log: The form appears to have been submitted successfully.
Where is it?

Kevin, did you enable logging in your configuration.yaml like this?

logger:
  default: info
  logs:
    custom_components.multiscrape: debug

When I run this with your config, it tells me pretty clearly why your config fails:

The form is hidden within a <script> tag and shown with JavaScript. This makes it a complicated case; I’ll try to take a better look later.

I’m not sure if I can make this more clear, as in the end, it is a form-submit feature, and not just for authentication. So I cannot assume that authentication failed, it can be any kind of form. E.g. your address for retrieving a garbage collection schedule.

See answer above to Kevin:
Add this to your configuration.yaml:

logger:
  default: info
  logs:
    custom_components.multiscrape: debug

Hi all, my problem is configuring a weather station in HA with an XML file and multiscrape. This is the page:

This XML file does not appear to have any style information associated with it. The document tree is shown below.

<maintag>

<script/>

<misc>

<data misc="refresh_time">2022.10.12. 230831</data>

</misc>

<data realtime="temp">22.11111111111111</data>

</realtime>

and here is config in multiscrape.yaml

multiscrape:
  -resource: https://192.168.1.39/realtime.xml
  scan_interval: 30
  sensor:
  -unique_id: temp_out_weather
  name: "TEMP"
  select: realtime > data realtime="temp":nth-child(1)" 
  value_template: '{{ (value.split("")[1]) }}'

Developer Tools / YAML shows me this:

Invalid config for [multiscrape]: [multiscrape] is an invalid option for [multiscrape]. Check: multiscrape->multiscrape->0->multiscrape. (See /config/configuration.yaml, line 11).

Line 11 in configuration.yaml is:


# Text to speech
tts:
  - platform: google_translate

automation: !include automations.yaml
script: !include scripts.yaml
scene: !include scenes.yaml  <------------------line--11------------
multiscrape: !include multiscrape.yaml

Thanks

You should not repeat the integration name when you include files. Try this in multiscrape.yaml:

- resource: https://192.168.1.39/realtime.xml
  scan_interval: 30
  sensor:
    - unique_id: temp_out_weather
      name: "TEMP"
      select: 'realtime > data realtime="temp":nth-child(1)'
      value_template: '{{ (value.split("")[1]) }}'
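To check outside Home Assistant which value the XML actually holds, a small standard-library sketch can help. It assumes the file has the shape below: the opening `<realtime>` tag and the closing `</maintag>` do not appear in the paste and are guesses, and `read_temperature` is a made-up helper name. The XPath predicate `[@realtime='temp']` used here corresponds to the CSS attribute form `data[realtime="temp"]`, which may be worth trying as the `select` value.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of realtime.xml; the opening <realtime> tag and the
# closing </maintag> are missing from the paste in the thread and are guesses.
SAMPLE_XML = """<maintag>
  <misc>
    <data misc="refresh_time">2022.10.12. 230831</data>
  </misc>
  <realtime>
    <data realtime="temp">22.11111111111111</data>
  </realtime>
</maintag>"""


def read_temperature(xml_text: str) -> float:
    """Find <data realtime="temp"> under <realtime> and return it rounded to 0.1."""
    root = ET.fromstring(xml_text)
    node = root.find(".//realtime/data[@realtime='temp']")
    if node is None or node.text is None:
        raise ValueError("no <data realtime='temp'> element found")
    return round(float(node.text), 1)


temperature = read_temperature(SAMPLE_XML)
print(temperature)
```

Matching on the attribute value rather than on position (`:nth-child(1)`) should keep the lookup stable if the station adds or reorders `<data>` elements.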