Scrape sensor improved - scraping multiple values

I used the normal scrape integration, not multiscrape.

select: "table > tr > td > a"
index: 5

Without meaning to be rude, if you can’t do that yourself, perhaps you’d be better off with the UI anyway? I’d do any new scrape sensors in the UI, and I’m quite good at YAML…

If you want to use multiscrape, you can verify individual results with scrape then convert those into multiscrape sensors.

I want to read a wind warning from a weather forecast:
Wetter und Klima - Deutscher Wetterdienst - Bayern (dwd.de)

In the Chrome console I get the text with the selector
document.querySelectorAll('#Ammersee + table tbody tr td:nth-child(1)')[0]
or as document.querySelectorAll('#Ammersee + table > tbody > tr > td:nth-child(1)')[0]

<td>Amtliche WARNUNG vor STARKWIND</td>

But I cannot find the right conversion to multiscrape syntax.

- resource: https://www.dwd.de/DE/wetter/warnungen_gemeinden/warntabellen/warntab_bay_node.html
  sensor:
    - name: wetter_warnung_herrsching
      select: "#Ammersee + table > tbody > tr > td:nth-child(1)"
      unique_id: wetter_warnung_herrsching
      on_error:
        value: "default"
        default: "Ammersee not found"
        log: "info"
  scan_interval: 600

the result is always “Ammersee not found”

Where is my mistake??

BTW: we don’t always have Sturm or Starkwind warnings, so you probably can’t test a proposal for the right selector - but right now it would work!


You should remove the tbody from the selector: it is added by your browser when it builds the DOM and is not part of the original HTML the server sends.
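To see why, here is a small BeautifulSoup sketch (the library multiscrape uses under the hood) against a simplified stand-in for the DWD page. The served HTML has no tbody, so the DevTools-copied selector finds nothing while the same selector without tbody matches:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the served page: note there is no <tbody>.
# Browsers insert one automatically, which is why selectors copied
# from DevTools often include it.
html = """
<h2 id="Ammersee">Ammersee</h2>
<table>
  <tr><td>Amtliche WARNUNG vor STARKWIND</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Selector copied from the browser (with tbody) matches nothing:
with_tbody = soup.select("#Ammersee + table > tbody > tr > td:nth-child(1)")

# Dropping tbody matches the cell:
without_tbody = soup.select("#Ammersee + table > tr > td:nth-child(1)")
```

Here `with_tbody` is an empty list and `without_tbody[0].get_text()` is the warning text, which is exactly the difference between the browser DOM and the raw HTML multiscrape parses.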

Release v0.7.0: New services!

This major release contains 2 brand new services that should make figuring out your configuration and css selectors much easier!
It makes use of the “new” functionality in Home Assistant that services can now provide a response. To make this possible, significant refactoring was required.

multiscrape.get_content
This service retrieves the content of the website you want to scrape. It shows the same data for which you had to enable log_response and open the page_soup.txt file.

multiscrape.scrape
This does what it says. It scrapes based on a configuration you can provide in the service data. It is ideal for quickly trying out multiple css selectors, or to scrape data in an automation that you only need when running that automation.

A nice detail is that both services accept exactly the same configuration as you provide in your configuration yaml. Even the form_submit features are supported! However, there is a small but important caveat. Read more about it in the readme.


Thanks for your reply! I have to wait for new warnings before I can test again.
I will then try the new services…

Hi everyone,

I want to scrape the content of the following website, but I am not quite sure how to get past the login.

https://stw-muenster.tcpos.com/dist/#/Login

Looking through this thread I found that I need the names of the input fields to log in using form_submit. But looking at the website’s HTML, there is no name defined for the email and password fields:

I am quite new to scraping, so I don’t know whether this is even possible with the multiscrape integration or if I am just the limiting factor here.

I would be grateful for any help. Thanks!

OK, that was it - the tbody was the problem.
Your new scrape service is very useful! Thanks!


You can see in the browser debugging tools that the following is sent when you hit ‘Anmelden’:

{"json":"true","loginUsername":"[email protected]","password":"sdfadfsdfdsf33"}

Try to set that as input for form_submit.
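If you want to sanity-check that payload outside Home Assistant first, you can build the same JSON POST in a few lines of Python. This is only a sketch: the endpoint path and credentials below are placeholders I made up - take the real ones from the browser’s network tab:

```python
import json
import urllib.request

# Payload shape copied from the browser capture above; the credentials
# and the endpoint URL are placeholders, not real values.
payload = json.dumps({
    "json": "true",
    "loginUsername": "user@example.com",
    "password": "correct-horse-battery-staple",
}).encode()

req = urllib.request.Request(
    "https://stw-muenster.tcpos.com/api/login",  # hypothetical endpoint
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would actually submit the login.
# Note: form_submit sends form-encoded data, so a site that only
# accepts a JSON body may need a RESTful sensor instead.
```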

Can anyone help me? I need to filter this value:


I would like to get the number 28. I don’t know if I need to use a value_template or the selector.

Assuming positive integers only, and no digits in the preamble text:

value_template: "{{ value|select('in','0123456789')|join }}"

Thanks for your answer, I will try that as soon as I find time for it.

Just a quick follow-up question. If I submit this json as the input for the form_submit, will I still use the selector corresponding to the form itself or the selector for the “Anmelden” button?

I now got it to work using a RESTful Sensor.

When posting the login data in the JSON format you found, I receive a response with the information I need.
Thanks again for your help!

Hi fellow-scrapers,
sorry to bother you.
It seems that the multiscrape component is unable to use the login form of the following website: Hanna Cloud
The form itself has no ‘id’, only the root. When I use this as selector:

multiscrape:
  - resource: 'https://www.hannacloud.com/dashboard'
    scan_interval: 60
    log_response: true
    form_submit:
      submit_once: True
      resource: 'https://www.hannacloud.com/login'
      select: '#root'
      input:
        email: ***
        password: ***
#      userLanguage: English
#      source: web

    sensor:
      - select: '#BL122_7A8336-orp-value'
        name: Poolchloor1
        unit_of_measurement: 'mV'      

Then, Home Assistant produces the following logs:

2024-04-17 09:15:14.651 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Requesting page with form from: https://www.hannacloud.com/login
2024-04-17 09:15:14.651 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Executing form_page-request with a GET to url: https://www.hannacloud.com/login with headers: {}.
2024-04-17 09:15:14.653 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_headers written to file: form_page_request_headers.txt
2024-04-17 09:15:14.654 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_body written to file: form_page_request_body.txt
2024-04-17 09:15:14.769 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Response status code received: 200
2024-04-17 09:15:14.770 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_headers written to file: form_page_response_headers.txt
2024-04-17 09:15:14.771 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_body written to file: form_page_response_body.txt
2024-04-17 09:15:14.771 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Parse page with form with BeautifulSoup parser lxml
2024-04-17 09:15:14.774 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # The page with the form parsed by BeautifulSoup has been written to file: form_page_soup.txt
2024-04-17 09:15:14.774 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Try to find form with selector #root
2024-04-17 09:15:14.775 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Form looks like this:
<div id="root"></div>
2024-04-17 09:15:14.775 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Finding all input fields in form
2024-04-17 09:15:14.775 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Found the following input fields: {}
2024-04-17 09:15:14.775 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Found form action None and method None
2024-04-17 09:15:14.775 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Merged input fields with input data in config. Result: {'email': '***', 'password': '***', 'userLanguage': 'English', 'source': 'web'}
2024-04-17 09:15:14.775 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Determined the url to submit the form to: https://www.hannacloud.com/login
2024-04-17 09:15:14.775 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Submitting the form
2024-04-17 09:15:14.775 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Executing form_submit-request with a POST to url: https://www.hannacloud.com/login with headers: {}.
2024-04-17 09:15:14.777 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_headers written to file: form_submit_request_headers.txt
2024-04-17 09:15:14.778 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_body written to file: form_submit_request_body.txt
2024-04-17 09:15:14.803 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Response status code received: 404
2024-04-17 09:15:14.805 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_headers written to file: form_submit_response_headers.txt
2024-04-17 09:15:14.806 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_body written to file: form_submit_response_body.txt
2024-04-17 09:15:14.806 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Error executing POST request to url: https://www.hannacloud.com/login.

Error message:

HTTPStatusError("Client error '404 Not Found' for url 'https://www.hannacloud.com/login'\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404")

I have been trying to solve this for days; I hope you can give me a hint.
Thanks in advance. I do believe it has to do with the fact that the page dynamically loads the form using JavaScript.

Hi Guys!

I’m trying to fetch data from a BKT power strip. The problem I have is that the login “form” is built from a table, not a normal form.

Is there any possibility to log in on that website to fetch data?

I’m trying to separate the <p> and <br> tags but I just can’t figure it out.

Here is what the html page looks like.

13 maj 06.47, Sammanfattning natt, Jämtlands län

23:19 Trafikkontroll, Östersund
Under natten har polisen kontrollerat trafiken på Rådhusgatan, Söder i Östersund. 25 förare fick blåsa i polisens sållningsinstrument och alla var nyktra.

23:50 Viltolycka, Ragunda
Polisen kontaktas gällande en renpåkörning på riksväg 87/Finneråvägen, Stugun, Ragunda. En personbil har kolliderat med en ren. Inga personskador uppstod. Berörd sameby kontaktas.

<div class="event-page editorial-content">
        <h1>
            13 maj 06.47, Sammanfattning natt, Jämtlands län
        </h1>
        <div class="event-content">
            <div class="text-body editorial-html">
                <p>23:19 Trafikkontroll, Östersund
                  <br>Under natten har polisen kontrollerat trafiken på Rådhusgatan, Söder i Östersund. 25 förare fick blåsa i polisens sållningsinstrument och alla var nyktra.
                </p>
                <p>23:50 Viltolycka, Ragunda
                  <br>Polisen kontaktas gällande en renpåkörning på riksväg 87/Finneråvägen, Stugun, Ragunda. En personbil har kolliderat med en ren. Inga personskador uppstod. Berörd sameby kontaktas.
                </p>
            </div>

yaml:

multiscrape:
  - name: polisen_sammanfattning_jamtland_scrape
    resource: "https://www.polisen.se{{ state_attr('sensor.polisen_url_sammanfattning_jamtland', 'url') }}"
    scan_interval: 60
    sensor:
      - name: "Polisen Sammanfattning datum"
        unique_id: "polisen_sammanfattning_datum"
        select: ".event-page h1"
        value_template: "{{ value }}"
      - name: "Polisen Sammanfattning Text"
        unique_id: "polisen_sammanfattning_text"
        value_template: "{{ now().date() }}"  # Set the state to today's date
        attributes:
          - name: "text"
            select: ".event-page .text-body"
            value_template: "{{ value }}"
        on_error:
          value: "default"
          default: "Failed to Scrape"
          log: "info"

Here is what the attributes look like.
As you can see, <br> is not treated as a new line.
It would be nice to make every paragraph a separate attribute.

friendly_name: Polisen Sammanfattning Text
text: 23:19 Trafikkontroll, ÖstersundUnder natten har polisen kontrollerat trafiken på Rådhusgatan, Söder i Östersund. 25 förare fick blåsa i polisens sållningsinstrument och alla var nyktra.
23:50 Viltolycka, RagundaPolisen kontaktas gällande en renpåkörning på riksväg 87/Finneråvägen, Stugun, Ragunda. En personbil har kolliderat med en ren. Inga personskador uppstod. Berörd sameby kontaktas.
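Outside the integration, the splitting itself is straightforward with BeautifulSoup (which multiscrape uses under the hood). A sketch on a trimmed copy of the quoted HTML, turning each <p> into its own string and each <br> into a newline - this is not the multiscrape config syntax, just the underlying idea:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the quoted page markup.
html = """
<div class="text-body editorial-html">
  <p>23:19 Trafikkontroll, Östersund
    <br>Under natten har polisen kontrollerat trafiken.
  </p>
  <p>23:50 Viltolycka, Ragunda
    <br>Polisen kontaktas.
  </p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = []
for p in soup.select(".text-body p"):
    # Replace each <br> with a newline so headline and body stay separated.
    for br in p.find_all("br"):
        br.replace_with("\n")
    # Strip indentation and drop empty lines.
    lines = [ln.strip() for ln in p.get_text().splitlines() if ln.strip()]
    paragraphs.append("\n".join(lines))
```

Each entry in `paragraphs` is then one event with its headline and body on separate lines, which maps naturally onto one attribute per paragraph.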

I want to scrape the contents of the red marked field with this integration, but I don’t know what to use as select statement in configuration.yaml:

“Copy selector” on this item delivers
#root > div > header > div > div > div:nth-child(2) > div > div:nth-child(3) > div > div > div:nth-child(2)

What do I enter in configuration.yaml? This one does not work:

scrape:
  - resource: http://192.168.198.27/admin/dashboard
    sensor:
      - name: RSK450Ni_temp
        select: "#root > div > header > div > div > div:nth-child(2) > div > div:nth-child(3) > div > div > div:nth-child(2)"


Thanks for any help!

On https://try.jsoup.org/ the selection works properly.


The error returned by HA is Index '0' not found in sensor.rsk450ni_temp

First try the multiscrape.get_content service and check if the value you are looking for is present in the response.
Then use the multiscrape.scrape service to try out whatever you want until it works.
Maybe something like: .css-1ejrq70 > div:nth-child(1)?

I’m busy getting my borrowed library books into HA, inspired by this reddit post.

The website of the library does something weird: the login page is on a different domain than the actual page it redirects to, where the results need to be scraped. So far, no luck.

This is the YAML I got stuck with; the sensors are not showing anything:

- resource: "https://rijnbrink.hostedwise.nl/wise-apps/opac/3801/my-account/checkouts"
  scan_interval: 86400
  log_response: true
  form_submit:
    submit_once: True
    resubmit_on_error: False
    resource: "https://login.kb.nl/si/login/?sessionOnly=true&goto=https%3A%2F%2Flogin.kb.nl%2Fsi%2Fauth%2Foauth2.0%2Fv1%2Fauthorize%3Fscope%3Dopenid%2Bprofile%26state%3D-6xKNtQI0ijfe3SLTGATy5_6otsy1Z_RZDTklCIjTFE.JT1mw-_qLPA.5F2k6OLrQoGN__h0UTEMqw%26response_type%3Dcode%26client_id%3Drijnbrink_prd%26redirect_uri%3Dhttps%253A%252F%252Fiam-emea.wise.oclc.org%252Frealms%252Frijnbrink%252Fbroker%252Froyal-library-oidc%252Fendpoint%26nonce%3D8Nnyx8kAp-RuN04ADpDClA"
    select: "#username"
    input:
      username: 1234
      password: 1337
  sensor:
    - unique_id: bibliotheek_titel
      name: Titel
      select: "li.list_items:nth-child(1) > div:nth-child(2) > a:nth-child(1)"
    - unique_id: bibliotheek_inleverdatum
      name: Inleverdatum
      select: "li.list_items:nth-child(1) > div:nth-child(2) > ul:nth-child(4) > li:nth-child(1) > span:nth-child(2)"
    - unique_id: bibliotheek_boekomslag
      name: Omslag
      select: "li.list_items:nth-child(1) > div:nth-child(2) > ul:nth-child(4) > li:nth-child(1) > span:nth-child(2)"

Is there any way to retain the HTML tags from the scraped page? I’d like to use the result in a markdown card.
@Roemer, have you ever found a solution to this?

Meaning: if I scrape <section> with multiple <p> inside, then I’d like my scraped content to include those <p> so that the markdown card will show them appropriately.
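For reference, in BeautifulSoup terms the difference is whether you take the element’s text or the element itself: get_text() drops the tags, while str(tag) or tag.decode_contents() keeps them. A minimal sketch (multiscrape itself returns the text form; whether its config exposes the tag form is a separate question):

```python
from bs4 import BeautifulSoup

html = "<section><p>First paragraph.</p><p>Second paragraph.</p></section>"
soup = BeautifulSoup(html, "html.parser")
section = soup.select_one("section")

# .get_text() strips the markup entirely:
plain = section.get_text()            # "First paragraph.Second paragraph."

# decode_contents() keeps the inner HTML, including the <p> tags:
with_tags = section.decode_contents()
```

The `with_tags` string still contains the <p> elements, so a markdown card could render the paragraphs separately.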