I used the normal scrape integration, not multiscrape.
select: "table > tr > td > a"
index: 5
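For reference, a minimal sketch of how those two options sit in a full scrape configuration (the resource URL is a placeholder):

scrape:
  - resource: https://example.com/page  # placeholder URL
    sensor:
      - name: Example link text
        select: "table > tr > td > a"
        index: 5  # picks the sixth match, since index is 0-based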
Without meaning to be rude, if you can’t do that yourself, perhaps you’d be better off with the UI anyway? I’d do any new scrape sensors in the UI, and I’m quite good at YAML…
If you want to use multiscrape, you can verify individual results with scrape, then convert those into multiscrape sensors.
In the Chrome console I get the text with the selector
document.querySelectorAll('#Ammersee + table tbody tr td:nth-child(1)')[0]
or as document.querySelectorAll("#Ammersee + table > tbody > tr > td:nth-child(1)")[0]
<td>Amtliche WARNUNG vor STARKWIND</td>
But I cannot find the right way to convert this into multiscrape syntax.
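A sketch of what that could look like in multiscrape (untested; the resource URL is a placeholder). One caveat: BeautifulSoup only sees a <tbody> if it is present in the raw HTML, so if the browser inserted it, drop tbody from the selector:

multiscrape:
  - resource: https://example.com/warnings  # placeholder URL
    scan_interval: 3600
    sensor:
      - name: Ammersee Warnung
        select: "#Ammersee + table tr td:nth-child(1)"
        value_template: "{{ value }}"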
This major release contains 2 brand-new services that should make figuring out your configuration and CSS selectors much easier!
It makes use of the “new” functionality in Home Assistant that services can now provide a response. To make this possible, significant refactoring was required.
multiscrape.get_content
This service retrieves the content of the website you want to scrape. It shows the same data you previously had to enable log_response and open the page_soup.txt file to see.
multiscrape.scrape
This does what it says: it scrapes based on a configuration you provide in the service data. It is ideal for quickly trying out multiple CSS selectors, or for scraping data that you only need when running a particular automation.
A nice detail is that both services accept exactly the same configuration as you provide in your configuration yaml. Even the form_submit features are supported! However, there is a small but important caveat. Read more about it in the readme.
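For example, a quick check from Developer Tools → Services could look like this (a sketch; per the notes above, the service data mirrors the normal YAML configuration):

service: multiscrape.get_content
data:
  resource: https://example.com/page  # placeholder URL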
Looking through this thread I found that I need the name of the input fields to log in using form_submit. But looking at the website's HTML, there is no name defined for the email and password fields:
Thanks for your answer, I will try that as soon as I find time for it.
Just a quick follow-up question: if I submit this JSON as the input for form_submit, will I still use the selector corresponding to the form itself, or the selector for the “Anmelden” button?
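In case it helps to see it spelled out, my understanding (a sketch, not verified against this site) is that select should point at the form element itself, while the JSON values go into input:

form_submit:
  submit_once: True
  resource: https://example.com/login  # placeholder URL
  select: "form"  # the form element, not the submit button
  input:
    email: !secret login_email  # hypothetical secrets
    password: !secret login_password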
Hi fellow-scrapers,
sorry to bother you.
It seems that the multiscrape component is unable to use the login form of the following website: Hanna Cloud
The form itself has no ‘id’, only the root. When I use #root as the selector, this is what I get:
2024-04-17 09:15:14.651 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Requesting page with form from: https://www.hannacloud.com/login
2024-04-17 09:15:14.651 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Executing form_page-request with a GET to url: https://www.hannacloud.com/login with headers: {}.
2024-04-17 09:15:14.653 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_headers written to file: form_page_request_headers.txt
2024-04-17 09:15:14.654 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_body written to file: form_page_request_body.txt
2024-04-17 09:15:14.769 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Response status code received: 200
2024-04-17 09:15:14.770 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_headers written to file: form_page_response_headers.txt
2024-04-17 09:15:14.771 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_body written to file: form_page_response_body.txt
2024-04-17 09:15:14.771 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Parse page with form with BeautifulSoup parser lxml
2024-04-17 09:15:14.774 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # The page with the form parsed by BeautifulSoup has been written to file: form_page_soup.txt
2024-04-17 09:15:14.774 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Try to find form with selector #root
2024-04-17 09:15:14.775 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Form looks like this:
<div id="root"></div>
2024-04-17 09:15:14.775 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Finding all input fields in form
2024-04-17 09:15:14.775 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Found the following input fields: {}
2024-04-17 09:15:14.775 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Found form action None and method None
2024-04-17 09:15:14.775 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Merged input fields with input data in config. Result: {'email': '***', 'password': '***', 'userLanguage': 'English', 'source': 'web'}
2024-04-17 09:15:14.775 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Determined the url to submit the form to: https://www.hannacloud.com/login
2024-04-17 09:15:14.775 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Submitting the form
2024-04-17 09:15:14.775 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Executing form_submit-request with a POST to url: https://www.hannacloud.com/login with headers: {}.
2024-04-17 09:15:14.777 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_headers written to file: form_submit_request_headers.txt
2024-04-17 09:15:14.778 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_body written to file: form_submit_request_body.txt
2024-04-17 09:15:14.803 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Response status code received: 404
2024-04-17 09:15:14.805 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_headers written to file: form_submit_response_headers.txt
2024-04-17 09:15:14.806 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_body written to file: form_submit_response_body.txt
2024-04-17 09:15:14.806 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Error executing POST request to url: https://www.hannacloud.com/login.
Error message:
HTTPStatusError("Client error '404 Not Found' for url 'https://www.hannacloud.com/login'\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404")
I have been trying to solve this for days; I hope you can give me a hint.
Thanks in advance. I believe it has to do with the fact that the page loads the form dynamically using JavaScript.
I’m trying to separate the <p> and <br> tags but I just can’t figure it out.
Here is what the HTML page looks like.
13 maj 06.47, Sammanfattning natt, Jämtlands län
23:19 Trafikkontroll, Östersund Under natten har polisen kontrollerat trafiken på Rådhusgatan, Söder i Östersund. 25 förare fick blåsa i polisens sållningsinstrument och alla var nyktra.
23:50 Viltolycka, Ragunda Polisen kontaktas gällande en renpåkörning på riksväg 87/Finneråvägen, Stugun, Ragunda. En personbil har kolliderat med en ren. Inga personskador uppstod. Berörd sameby kontaktas.
<div class="event-page editorial-content">
<h1>
13 maj 06.47, Sammanfattning natt, Jämtlands län
</h1>
<div class="event-content">
<div class="text-body editorial-html">
<p>23:19 Trafikkontroll, Östersund
<br>Under natten har polisen kontrollerat trafiken på Rådhusgatan, Söder i Östersund. 25 förare fick blåsa i polisens sållningsinstrument och alla var nyktra.
</p>
<p>23:50 Viltolycka, Ragunda
<br>Polisen kontaktas gällande en renpåkörning på riksväg 87/Finneråvägen, Stugun, Ragunda. En personbil har kolliderat med en ren. Inga personskador uppstod. Berörd sameby kontaktas.
</p>
</div>
yaml:

multiscrape:
  - name: polisen_sammanfattning_jamtland_scrape
    resource: "https://www.polisen.se{{ state_attr('sensor.polisen_url_sammanfattning_jamtland', 'url') }}"
    scan_interval: 60
    sensor:
      - name: "Polisen Sammanfattning datum"
        unique_id: "polisen_sammanfattning_datum"
        select: ".event-page h1"
        value_template: "{{ value }}"
      - name: "Polisen Sammanfattning Text"
        unique_id: "polisen_sammanfattning_text"
        value_template: "{{ now().date() }}" # Set the state to today's date
        attributes:
          - name: "text"
            select: ".event-page .text-body"
            value_template: "{{ value }}"
        on_error:
          value: "default"
          default: "Failed to Scrape"
          log: "info"
Here is what the attributes look like. As you can see, it won’t treat <br> as a new line.
It would be nice to make every new paragraph a separate attribute.
friendly_name: Polisen Sammanfattning Text
text: 23:19 Trafikkontroll, ÖstersundUnder natten har polisen kontrollerat trafiken på Rådhusgatan, Söder i Östersund. 25 förare fick blåsa i polisens sållningsinstrument och alla var nyktra.
23:50 Viltolycka, RagundaPolisen kontaktas gällande en renpåkörning på riksväg 87/Finneråvägen, Stugun, Ragunda. En personbil har kolliderat med en ren. Inga personskador uppstod. Berörd sameby kontaktas.
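One possible direction (a sketch, not from the thread, assuming multiscrape’s select_list option, which matches all elements instead of only the first; as I understand it, the matches come back joined into one comma-separated string, which is awkward here since the text itself contains commas):

sensor:
  - name: "Polisen Sammanfattning Text"
    unique_id: "polisen_sammanfattning_text"
    value_template: "{{ now().date() }}"
    attributes:
      - name: "paragraphs"
        # select_list matches every <p>, not just the first
        select_list: ".event-page .text-body p"
        value_template: "{{ value }}"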
“Copy selector” on this item delivers #root > div > header > div > div > div:nth-child(2) > div > div:nth-child(3) > div > div > div:nth-child(2)
What do I enter in configuration.yaml? This one does not work:
scrape:
  - resource: http://192.168.198.27/admin/dashboard
    sensor:
      - name: RSK450Ni_temp
        select: "#root > div > header > div > div > div:nth-child(2) > div > div:nth-child(3) > div > div > div:nth-child(2)"
First try the multiscrape.get_content service and check if the value you are looking for is present in the response.
Then use the multiscrape.scrape service to try out whatever you want until it works.
Maybe something like: .css-1ejrq70 > div:nth-child(1)?
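As a concrete sketch (untested, and the selector is only the guess above):

service: multiscrape.scrape
data:
  resource: http://192.168.198.27/admin/dashboard
  sensor:
    - name: RSK450Ni_temp
      select: ".css-1ejrq70 > div:nth-child(1)"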
I’m busy getting the library books I have on loan into HA, inspired by this reddit post.
The library’s website does something weird: the login page is on a different domain than the actual page it redirects to, where the results need to be scraped. So far no luck.
This is the YAML I got stuck with; the sensors are not showing anything:
Is there any way to retain the HTML tags from the scraped page? I’d like to use the result in a markdown card. @Roemer, have you ever found a solution to this? (Similar question from @Roemer.)
Meaning: if I scrape <section> with multiple <p> inside, then I’d like my scraped content to include those <p> so that the markdown card will show them appropriately.