Scrape sensor improved - scraping multiple values

I’m trying to scrape the top airing anime from MyAnimeList: [Top Anime - Top Airing - MyAnimeList.net]
I got lost trying to pick a selector from the inspected elements. I only have this so far:

- resource: https://myanimelist.net/topanime.php?type=airing
  scan_interval: 860400
  sensor:
    - unique_id: top_airing_anime_1
      name: Top Airing Anime 1
      select:

Hi there,

This should do the trick:

multiscrape:
  - name: test
    resource: https://myanimelist.net/topanime.php?type=airing
    sensor:
    - unique_id: top_airing_anime_1
      name: Top Airing Anime 1
      select: ".di-ib.clearfix"

Hi there,

Now I need a little help to get my data on the screen :wink:

There is a jellyfish warning for Malta ( https://www.malteseislandsweather.com/ ). The list is dynamic, depending on sightings, and looks like this:

<div class="safebays-list">
	  <div class="safebay-island">Malta</div>
	  <div class="location-list">
	    <div class="location-name-list">Marfa</div>
	  </div>
	  <div class="location-list">
		<div class="location-name-list">Ramla l-Bir</div>
	  </div>
	  <div class="location-list">
	    <div class="location-name-list">Little Armier</div>
	  </div>
	  <div class="location-list">
	    <div class="location-name-list">Armier Bay</div>
	  </div>
	  <div class="location-list">
	    <div class="location-name-list">Mellieħa</div>
	  </div>
...

At the moment I am getting the first beach shown with:

multiscrape:
  - resource: https://www.malteseislandsweather.com/
    log_response: true
    sensor:
      - name: Safe Beaches
        select_list: "div.location-list:nth-child(2) > div:nth-child(1)"

Changing nth-child(2) to nth-child(3) gives me the next beach, but I have no idea how to get all beaches in a list. I only get one at a time, depending on the index. As the list shows only “safe” beaches, the number of beaches in it changes from day to day.

Anybody who can point me in the right direction?

thx :slight_smile:

Don’t go down to the lowest child. Stop the selector at the parent element. Have you tried that?

Hi,

I tried

select_list: "div.location-list"
and
select_list: "div.location-list > div:nth-child(1)"

Both times the entity is empty.

Using the browser tools, what is the selector path it gives you? Also, you need to check the actual soup file to see the HTML that the scraper sees. It will be in your HA config directory under multiscrape. I don’t think location-list is what you want, since that is on the individual elements. You want something with safebays-list.
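To make the goal concrete, here is a minimal sketch of what a working list selector needs to return: the text of every location-name-list element, not one element picked by index. It uses only Python's standard library and is not how multiscrape works internally (multiscrape uses BeautifulSoup). The class `BeachNameCollector`, the function `collect_names` and the trimmed sample are made up for illustration.

```python
from html.parser import HTMLParser

# Trimmed sample of the list structure quoted earlier in the thread.
SAMPLE_HTML = """
<div class="safebays-list">
  <div class="safebay-island">Malta</div>
  <div class="location-list">
    <div class="location-name-list">Marfa</div>
  </div>
  <div class="location-list">
    <div class="location-name-list">Ramla l-Bir</div>
  </div>
  <div class="location-list">
    <div class="location-name-list">Little Armier</div>
  </div>
</div>
"""


class BeachNameCollector(HTMLParser):
    """Hypothetical helper: collects the text of leaf elements with a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.names = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        # Capture text only while inside an element carrying the wanted class.
        classes = (dict(attrs).get("class") or "").split()
        self._capturing = self.wanted_class in classes

    def handle_endtag(self, tag):
        self._capturing = False

    def handle_data(self, data):
        if self._capturing and data.strip():
            self.names.append(data.strip())


def collect_names(html, wanted_class="location-name-list"):
    parser = BeachNameCollector(wanted_class)
    parser.feed(html)
    return parser.names


print(collect_names(SAMPLE_HTML))
```

The number of results follows the number of matching elements, so the list grows and shrinks with the page. Run against the real soup file, the same check shows whether the class names you are targeting exist in what the scraper received.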

I got the soup file and extracted the important part:

<div class="wpb_text_column wpb_content_element vc_custom_1655198915150">
	<div class="wpb_wrapper">
		<div class="safebays-home-subtitle">This list shows bays that are likely to be free from jellyfish, rough sea, debris etc... <br/> This does not mean that only these beaches are safe.</div>
		<div class="safebays-list">
			<div class="safebay-island">Malta</div>
				<div class="location-list">
					<div class="location-name-list">Marfa</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Ramla l-Bir</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Little Armier</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Armier Bay</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Mellieħa</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Imġiebaħ</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Mistra Bay</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">St. Paul's Bay</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Qawra</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Baħar iċ-Ċagħaq</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">St. George's Bay</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Sliema</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Rinella Bay</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Marsaskala</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">St. Thomas Bay</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Ħofriet</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Marsaxlokk</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Pretty Bay</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Wied iż-Żurrieq</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Għar Lapsi</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Fomm ir-Riħ</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Ġnejna Bay</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Għajn Tuffieħa</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Golden Bay</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Anchor Bay</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Paradise Bay</div>
				</div>
				<div class="clr"></div>
				<div class="safebay-island">Gozo</div>
				<div class="location-list">
					<div class="location-name-list">Wied il-Għasri</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Xwejni Bay</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Qbajjar</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Marsalforn</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Ramla l-Ħamra</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">San Blas Bay</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Daħlet Qorrot</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Ħondoq ir-Rummien</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Mġarr ix-Xini</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Xlendi</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Dwejra</div>
				</div>
				<div class="location-list">
					<div class="location-name-list">Dwejra Inland Sea</div>
			</div>
		</div>
	</div>
</div>

I tried
div.wpb_text_column:nth-child(2) > div:nth-child(1) and
.safebays-list

as the selector (select_list). Both give me this error in the HA log:

homeassistant.exceptions.InvalidStateError: Invalid state encountered for entity ID: sensor.malta_beaches. State max length is 255 characters.

Oh, yeah, it doesn’t take much to reach that limit. The workaround is to put the data in an attribute which doesn’t have that limit.
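A quick way to see why the limit is hit, and what the attribute workaround amounts to, is to do the arithmetic outside Home Assistant. This is only a sketch: the helper `state_and_attributes` and the attribute key `beaches` are hypothetical names, and in practice the split is made in the sensor configuration rather than in Python.

```python
MAX_STATE_LENGTH = 255  # limit quoted in the Home Assistant error above

# Malta names copied from the soup extract earlier in the thread.
MALTA_BEACHES = [
    "Marfa", "Ramla l-Bir", "Little Armier", "Armier Bay", "Mellieħa",
    "Imġiebaħ", "Mistra Bay", "St. Paul's Bay", "Qawra", "Baħar iċ-Ċagħaq",
    "St. George's Bay", "Sliema", "Rinella Bay", "Marsaskala", "St. Thomas Bay",
    "Ħofriet", "Marsaxlokk", "Pretty Bay", "Wied iż-Żurrieq", "Għar Lapsi",
    "Fomm ir-Riħ", "Ġnejna Bay", "Għajn Tuffieħa", "Golden Bay", "Anchor Bay",
    "Paradise Bay",
]


def state_and_attributes(names, limit=MAX_STATE_LENGTH):
    """Hypothetical helper: keep the state short, move the full list to an attribute."""
    joined = ", ".join(names)
    if len(joined) <= limit:
        return joined, {}
    # Too long for a state: report the count and carry the list as an attribute.
    return str(len(names)), {"beaches": names}


state, attributes = state_and_attributes(MALTA_BEACHES)
print(len(", ".join(MALTA_BEACHES)), "characters when joined")
print("state:", state)
print("attribute entries:", len(attributes["beaches"]))
```

Using the count as the state is just one option; any short value works, as long as the long list lives in the attribute.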

Hello, I’m wondering if somebody can help or share their experience.
I’m trying to use the multiscrape integration to get energy cost values (PUN values).
The document to parse is XML, but while there are a few examples of how to parse values from an HTML document (also on the custom integration’s wiki), I could find nothing for XML.
How do I identify the fields? I tried the node path but had no luck.
In particular, I’m interested in using the select_list feature to get all the hourly costs at once (this is what I tried: select_list: 'NewDataSet > Prezzi > PUN')

Below is an XML extract (you can have a look at the actual file here; you will get a landing page asking you to check a couple of boxes to accept the terms of use, and then you will be redirected to the file).

Thanks to anybody who can help!

<NewDataSet>
<xs:schema id="NewDataSet">
<xs:element name="NewDataSet" msdata:IsDataSet="true" msdata:UseCurrentLocale="true">
<xs:complexType>
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element name="Prezzi">
<xs:complexType>
<xs:sequence>
<xs:element name="Data" type="xs:string" minOccurs="0"/>
<xs:element name="Mercato" type="xs:string" minOccurs="0"/>
<xs:element name="Ora" type="xs:string" minOccurs="0"/>
<xs:element name="PUN" type="xs:string" minOccurs="0"/>
<xs:element name="NAT" type="xs:string" minOccurs="0"/>
<xs:element name="CALA" type="xs:string" minOccurs="0"/>
<xs:element name="CNOR" type="xs:string" minOccurs="0"/>
<xs:element name="CSUD" type="xs:string" minOccurs="0"/>
<xs:element name="NORD" type="xs:string" minOccurs="0"/>
<xs:element name="SARD" type="xs:string" minOccurs="0"/>
<xs:element name="SICI" type="xs:string" minOccurs="0"/>
<xs:element name="SUD" type="xs:string" minOccurs="0"/>
<xs:element name="AUST" type="xs:string" minOccurs="0"/>
<xs:element name="COAC" type="xs:string" minOccurs="0"/>
<xs:element name="COUP" type="xs:string" minOccurs="0"/>
<xs:element name="CORS" type="xs:string" minOccurs="0"/>
<xs:element name="FRAN" type="xs:string" minOccurs="0"/>
<xs:element name="GREC" type="xs:string" minOccurs="0"/>
<xs:element name="SLOV" type="xs:string" minOccurs="0"/>
<xs:element name="SVIZ" type="xs:string" minOccurs="0"/>
<xs:element name="BSP" type="xs:string" minOccurs="0"/>
<xs:element name="MALT" type="xs:string" minOccurs="0"/>
<xs:element name="XAUS" type="xs:string" minOccurs="0"/>
<xs:element name="XFRA" type="xs:string" minOccurs="0"/>
<xs:element name="MONT" type="xs:string" minOccurs="0"/>
<xs:element name="XGRE" type="xs:string" minOccurs="0"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:choice>
</xs:complexType>
</xs:element>
</xs:schema>
<Prezzi>
<Data>20230822</Data>
<Mercato>MGP</Mercato>
<Ora>1</Ora>
<PUN>119,631120</PUN>
<NAT>121,230000</NAT>
<CALA>124,000000</CALA>
<CNOR>119,190000</CNOR>
<CSUD>119,190000</CSUD>
<NORD>119,190000</NORD>
<SARD>119,190000</SARD>
<SICI>124,000000</SICI>
<SUD>119,190000</SUD>
<AUST>119,190000</AUST>
<COAC>119,190000</COAC>
<COUP>119,190000</COUP>
<CORS>119,190000</CORS>
<FRAN>119,190000</FRAN>
<GREC>119,190000</GREC>
<SLOV>119,190000</SLOV>
<SVIZ>119,190000</SVIZ>
<BSP>119,190000</BSP>
<MALT>124,000000</MALT>
<XAUS>119,190000</XAUS>
<XFRA>119,190000</XFRA>
<MONT>119,190000</MONT>
<XGRE>119,190000</XGRE>
</Prezzi>
<Prezzi>
<Data>20230822</Data>
<Mercato>MGP</Mercato>
<Ora>2</Ora>
<PUN>124,000000</PUN>
<NAT>124,000000</NAT>
<CALA>124,000000</CALA>
<CNOR>124,000000</CNOR>
<CSUD>124,000000</CSUD>
<NORD>124,000000</NORD>
<SARD>124,000000</SARD>
<SICI>124,000000</SICI>
<SUD>124,000000</SUD>
<AUST>124,000000</AUST>
<COAC>124,000000</COAC>
<COUP>124,000000</COUP>
<CORS>124,000000</CORS>
<FRAN>124,000000</FRAN>
<GREC>124,000000</GREC>
<SLOV>124,000000</SLOV>
<SVIZ>124,000000</SVIZ>
<BSP>124,000000</BSP>
<MALT>124,000000</MALT>
<XAUS>124,000000</XAUS>
<XFRA>124,000000</XFRA>
<MONT>124,000000</MONT>
<XGRE>124,000000</XGRE>
</Prezzi>
<Prezzi>
<Data>20230822</Data>
<Mercato>MGP</Mercato>
<Ora>3</Ora>
<PUN>119,000000</PUN>
<NAT>116,270000</NAT>
<CALA>149,270000</CALA>
<CNOR>115,996050</CNOR>
<CSUD>115,996050</CSUD>
<NORD>115,996050</NORD>
<SARD>115,996050</SARD>
<SICI>149,270000</SICI>
<SUD>115,996050</SUD>
<AUST>115,996050</AUST>
<COAC>115,996050</COAC>
<COUP>115,996050</COUP>
<CORS>115,996050</CORS>
<FRAN>115,996050</FRAN>
<GREC>115,996050</GREC>
<SLOV>115,996050</SLOV>
<SVIZ>115,996050</SVIZ>
<BSP>115,996050</BSP>
<MALT>149,270000</MALT>
<XAUS>115,996050</XAUS>
<XFRA>115,996050</XFRA>
<MONT>115,996050</MONT>
<XGRE>115,996050</XGRE>
</Prezzi>
</NewDataSet>

If you have XML, use a REST sensor. This config:

rest:
 - resource: https://www.mercatoelettrico.org/It/WebServerDataStore/MGP_Prezzi/20230822MGPPrezzi.xml
   headers:
     Cookie: GmeItaliano=EE0F6794E46EDE29034439E15E1D231663A8B5A0CFBB7C7C736C056E1B9056C85552B35A81F83E92B373C3270FA4A70764B342F08066058F2B8AB53566B699465CD2E05DD90304ADD3309764F208E4A63BBF8E773D0EAAC4E4D0EB8BCB82382A6A51C93F01C7E1D44360E19DE2A93EA859CB9C56EDCD14FDAFEFA0664020B4E07152CF7361EE9E239E3E88B0828FF90AA0A964BDD6436ACF092716DEA58D300E
   sensor:
     - name: "PUN values"
       value_template: "{{ (value_json['NewDataSet']['Prezzi']|map(attribute='PUN')|map('replace',',','.')|map('round',2)|list)[:12] }}"

gives you the next 12 values as a string, rounded to 2 decimals, and with decimal points instead of commas.

Notes:

  • That cookie header might expire. You’ll have to work out a way around that.
  • The sensor state is a string and cannot exceed 255 characters, hence my rounding and limiting to 12 values.

You could create a separate sensor for each value, with date and hour as attributes.
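For readers who find the filter chain in the template hard to follow, here is a rough Python equivalent of its steps. It assumes the parsed data is a list of rows with a `PUN` key, as the template implies; the function name `pun_values` and the sample rows (values copied from the XML extract) are for illustration only.

```python
# Sample rows shaped like the parsed XML (values copied from the extract above).
SAMPLE_PREZZI = [
    {"Ora": "1", "PUN": "119,631120"},
    {"Ora": "2", "PUN": "124,000000"},
    {"Ora": "3", "PUN": "119,000000"},
]


def pun_values(prezzi, limit=12, decimals=2):
    """Mirror the template: pick PUN, swap the comma, round, keep the first `limit`."""
    values = [round(float(row["PUN"].replace(",", ".")), decimals) for row in prezzi]
    return values[:limit]


values = pun_values(SAMPLE_PREZZI)
print(values)
print(len(str(values)))  # length of the string such a list would occupy as a state
```

Printing the length of the resulting string is a cheap way to judge how many values fit under the 255-character state limit before changing the slice.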

Thank you for your time and your suggestions.
I had a look and played with it a bit, but mainly because of the potential need to deal with the cookie header, I decided to continue using multiscrape, which can manage this automatically.
Also, thank you for pointing out the limit on the state length. I followed your suggestion and created one sensor for each value (I will concatenate them into a list later for my needs).
I got everything working while still relying on CSS selectors. I’m not sure this is the right approach for XML, but it works. Suggestions about this, or about how to improve the setup overall, are always welcome; for example, see the warning below, which appears even though I made the use of lxml explicit and it should be the default.

I’m pasting my solution below in case anybody else from Italy lands on this page and needs it.

PS:
I’m using the XML approach because it lets me build the resource from a template and potentially query any day with little effort. The other options available on the website rely on HTML tables and are not as straightforward.

multiscrape:
  - resource_template: 'https://www.mercatoelettrico.org/It/WebServerDataStore/MGP_Prezzi/{{ as_timestamp(now()) | timestamp_custom("%Y%m%d", True) }}MGPPrezzi.xml'
    scan_interval: 3600
    parser: 'lxml'
    name: 'PUN oggi'
    button:
      unique_id: 'aggiorna_misure_pun_oggi'
      name: 'Aggiorna misure PUN oggi'
    log_response: true
    form_submit:
      submit_once: false
      resource: 'https://www.mercatoelettrico.org/It/Tools/Accessodati.aspx'
      select: '#form1'
      input:
        'ctl00$ContentPlaceHolder1$CBAccetto1': 'on'
        'ctl00$ContentPlaceHolder1$CBAccetto2': 'on'
        'ctl00$ContentPlaceHolder1$Button1': 'Accetto'
      input_filter:
        - 'ctl00$ContentPlaceHolder1$Button2'
        - 'ctl00$vai'
        - 'ctl00$LinkButton2'
        - 'ctl00$LoginButton'
    sensor:
      - select: 'NewDataSet:nth-child(1) > Prezzi:nth-child(2) > PUN:nth-child(4)'
        name: 'PUN oggi 00'
        unique_id: 'pun_oggi_00'
        icon: 'mdi:currency-eur'
        unit_of_measurement: '€/kWh'
        value_template: '{{ value | replace (",", ".") |float | int / 1000}}'
      - select: 'NewDataSet:nth-child(1) > Prezzi:nth-child(3) > PUN:nth-child(4)'
        name: 'PUN oggi 01'
        unique_id: 'pun_oggi_01'
        icon: 'mdi:currency-eur'
        unit_of_measurement: '€/kWh'
        value_template: '{{ value | replace (",", ".") |float | int / 1000}}'
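To sanity-check the values outside Home Assistant, the same rows can be read with Python's standard XML parser. This is a sketch under assumptions, not part of the configuration above: the sample keeps only two Prezzi rows with a few fields and omits the xs:schema block, and the function names are made up. It also reproduces the arithmetic of the value_template, where the int filter is applied before the division and therefore drops the decimals of the source value.

```python
import xml.etree.ElementTree as ET

# Two <Prezzi> rows copied from the extract; the xs:schema block is left out
# because its namespace prefix is not declared in the pasted fragment.
SAMPLE_XML = """
<NewDataSet>
  <Prezzi><Data>20230822</Data><Mercato>MGP</Mercato><Ora>1</Ora><PUN>119,631120</PUN></Prezzi>
  <Prezzi><Data>20230822</Data><Mercato>MGP</Mercato><Ora>2</Ora><PUN>124,000000</PUN></Prezzi>
</NewDataSet>
"""


def pun_by_hour(xml_text):
    """Return {hour: raw PUN string} for every <Prezzi> row."""
    root = ET.fromstring(xml_text.strip())
    return {int(row.findtext("Ora")): row.findtext("PUN") for row in root.iter("Prezzi")}


def template_value(raw):
    """Same arithmetic as the value_template: comma to dot, float, int, then / 1000."""
    return int(float(raw.replace(",", "."))) / 1000


def untruncated_value(raw, decimals=5):
    """Alternative that divides before rounding, so the decimals survive."""
    return round(float(raw.replace(",", ".")) / 1000, decimals)


prices = pun_by_hour(SAMPLE_XML)
for hour, raw in prices.items():
    print(hour, template_value(raw), untruncated_value(raw))
```

Whether the truncation matters depends on what the sensors are used for; if it does, moving the rounding after the division in the template is the corresponding change.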

Hi,
I am having a problem retrieving data from a web page.
Specifically, I want to get the data at this selector:
body > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(9) > td:nth-child(2) > font:nth-child(1) > strong:nth-child(1) > font:nth-child(1) > small:nth-child(1)
on http://perachora-davis.meteoclub.gr/
I’ve tried everything, but it doesn’t return any data.
I have tried other fields and the result is the same.

Can you help me? What could be going wrong?

Thanks

Your selector doesn’t match anything on the page: there is no <tbody> in the HTML. The tool you’re using to work out the selector is not giving a valid answer. The HTML on that page is also not valid, so some tools will struggle.
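Browsers add an implied tbody to tables when they build the DOM, so a selector copied from the inspector can name a tag that never appears in the raw source the scraper receives. A small check along these lines can flag that. It is a sketch using only the standard library; `TagCounter`, `missing_tags` and the sample HTML are made up, and it only handles simple `a > b > c` selectors.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical helper: counts how often each tag appears in the raw source."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def missing_tags(html, selector):
    """Return tag names used in a simple `a > b > c` selector that the HTML lacks."""
    counter = TagCounter()
    counter.feed(html)
    # Strip pseudo-classes, classes and ids to get the bare tag name of each step.
    steps = [step.strip() for step in selector.split(">")]
    tags = [step.split(":")[0].split(".")[0].split("#")[0] for step in steps]
    return [tag for tag in tags if tag and tag not in counter.counts]


# Made-up sample: raw source without <tbody>, as described for the page above.
RAW_HTML = "<html><body><table><tr><td><font>1013.2 hPa</font></td></tr></table></body></html>"
SELECTOR = "body > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1)"

print(missing_tags(RAW_HTML, SELECTOR))
```

An empty result does not prove the selector matches, since positions can still be wrong, but a non-empty one means it cannot match as written.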

What value are you trying to retrieve? It’s much better to work these out manually and minimise the list. I think you’re trying to access the barometric pressure (9th tr with complications due to invalid code), which is:

tr:nth-child(11) > td:nth-child(2) > font:nth-child(1)

If you then want to extract the value, set the value_template to:

{{ value|select('in','.0123456789')|join }}

…which works so long as the description doesn’t contain a digit or a dot.
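The caveat is easy to see with a Python equivalent of that filter chain. The sample strings below are invented, not taken from the page, and `keep_numeric_chars` is just an illustrative name.

```python
NUMERIC_CHARS = ".0123456789"


def keep_numeric_chars(value):
    """Python equivalent of `value | select('in', '.0123456789') | join`."""
    return "".join(ch for ch in value if ch in NUMERIC_CHARS)


# Made-up sample strings, not taken from the page.
print(keep_numeric_chars("Barometer 1013.2 hPa"))
print(keep_numeric_chars("approx. 1013.2 hPa"))       # stray dot from the description
print(keep_numeric_chars("Pressure at 9 am: 1013.2"))  # stray digit from the description
```

Only the first string yields a clean number. In the other two, the dot and the digit from the surrounding text end up glued to the value, which is exactly the failure mode described above.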


I’d like to share a quick and simple recipe for those that often struggle to figure out the CSS selector path.

I have done a fair amount of web development, and it still catches me out from time to time. CSS remains tricky business for me.

Where people often get stuck is when some parts of a page are loaded dynamically, so you can’t get access to the elements that you want. This is sometimes related to timing, because it’s not always trivial (if ever) to know when a page has fully loaded.

If you want a much better chance of finding your selector, simply do the following:

  1. Make a multiscrape sensor that just uses value_template: "{{ value }}" and select: "foo" (foo can be anything, as long as it’s an ASCII word; don’t make it just a dot/full stop, for example). The goal is to have no meaningful output, just a valid config that will produce a page_soup.txt file on your HA server.

  2. Copy the page_soup.txt to your local computer.

  3. Open this file in your browser. It will probably render badly, but that doesn’t matter: this is what the scraper sees. Find the element you’d like to scrape and right-click on it to bring up the code inspector for that piece of HTML. The HTML elements will typically be visible in another window. Highlight the element in that view and right-click again to choose “Copy > Copy selector”. Chances are that what you have at this point can be used as the value of your select: option. Some of you might want to clean this up a bit for aesthetic and readability reasons. Read on…

  4. You can test your selector with a bit of Python using the soup file and without having to iterate on your HA config. You must have BeautifulSoup installed, obviously, which is what this custom component uses (I’m not going to detail the installation of Python or any libraries here).

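from bs4 import BeautifulSoup

# Parse the soup file that multiscrape wrote on the HA server.
with open('page_soup.txt', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
# Prints the first matching element, or None if the selector matches nothing.
print(soup.select_one("<YOUR_SELECTOR_PATH>"))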
  5. If you want to inspect prettier code in step 3, you can prettify the soup first. That can make it easier to debug and navigate, especially if you want to adjust the CSS selector path to be only as specific as it needs to be. A lot of tool-generated web code has tons of ugly IDs and many classes, of which you only need one to uniquely identify an element for the purposes of scraping.
pretty_html = soup.prettify()
with open('pretty.html', 'w', encoding='utf-8') as o:
    o.write(pretty_html)
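A related quick check, tying back to the point about dynamically loaded content: before hunting for a selector, confirm that the value you want is in the soup at all. If it is not, no selector will find it. The sketch below uses only the standard library; the two fragments are made-up stand-ins for the contents of page_soup.txt, and the function names are hypothetical.

```python
import re


def visible_text(html):
    """Crude tag stripper: drops tags and collapses whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", without_tags).strip()


def soup_contains(html, expected):
    """True if the expected text shows up in the text the scraper received."""
    return expected in visible_text(html)


# Made-up fragments standing in for the contents of page_soup.txt.
static_page = '<div class="reading"><span>21.5 °C</span></div>'
dynamic_page = '<div id="app"></div><script src="bundle.js"></script>'

print(soup_contains(static_page, "21.5"))
print(soup_contains(dynamic_page, "21.5"))
```

With a real file, read it with open('page_soup.txt', encoding='utf-8').read() and pass the result as the first argument.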

I hope this helps!


I’d like to scrape the price from here: https://www.ditur.fi/garmin-epix-gen2-pro-47mm-sapphire-010-02803-11
but the inspector gives me the selector #product-price-109517 > span > span, which doesn’t work. Could someone point me in the right direction?

Did you read the post before yours?

Yes, I tried that, but maybe I’m doing something wrong, because I cannot get proper selectors from any page that way. I got body > pre as the selector.
My sensor is:

- resource: https://www.ditur.fi/garmin-epix-gen2-pro-47mm-sapphire-010-02803-11
  log_response: true
  sensor:
    select: "hhh"
    value_template: "{{ value }}"

I guess I should be able to scrape the price 989 from here, if it’s possible to get it from the span class?

</script>
<div class="price-box price-final_price flex items-center gap-2 lg:gap-5 flex-wrap lg:flex-nowrap" x-data="initPrice109517()" x-spread="eventListeners">
<template x-if="!activeProductsPriceData &amp;&amp; !isPriceHidden()">
<div class="price-container flex items-center gap-2 lg:gap-5 flex-wrap lg:flex-nowrap">
<div class="old-price mr-2 flex order-3 lg:order-2">
<span class="price-wrapper title-font font-light text-base lg:text-2xl line-through text-gray-900" id="product-price-109517">
<span class="price">
<span class="price">1 099 €</span> </span>
</span>
</div>
<div class="discount flex order-1 lg:order-3 w-full lg:w-auto">
<span class="bg-hot_sale-lighter py-1 px-2 leading-none">
<span class="text-white text-xs uppercase font-semibold">10%</span>
</span>
</div>
<div class="final-price inline-block order-2 text-hot_sale-lighter lg:order-1" itemprop="offers" itemscope="" itemtype="http://schema.org/Offer">
<span class="price-label hidden">
</span>
<span class="price-wrapper title-font text-xl lg:text-2xl lg:font-semibold" id="product-price-109517">
<span class="price">
<span class="price">989 €</span> </span>
</span>
<meta content="989" itemprop="price"/>
<meta content="EUR" itemprop="priceCurrency"/>
</div>
</div>

Is this HTML from the soup file?

Yes, from page_soup.txt

Describe your steps in more detail.

What did you do with that file?

How did you try to determine a selector?