Help with scrape sensor

Makis · April 8, 2020, 2:38pm

Hi
I need to scrape one value from this Windguru page,
more specifically the number 23.4 in the photo below.
The problem is I don’t know where and what to look for.
Can someone give me something to start with please?

AhmadK · April 8, 2020, 3:48pm

looks like that website does not like robots

jocnnor · April 8, 2020, 3:51pm

I’ll start with a high-level overview of how I use scrape sensors. First thing I do is…try not to develop one inside of Home Assistant. The sheer number of times you’ll have to restart to get it perfect will crush your soul.

Step 1) Make sure you have a working python3 terminal you can use.

Step 2) install BeautifulSoup and requests modules (these are what the scrape sensor uses)

pip install bs4
pip install requests

Step 3) Try to get it working with a python script or in the python shell. Here’s the simple shell I use.

from bs4 import BeautifulSoup
import requests

# Change these 2 things
URL="https://www.windguru.cz/23082"
# This is the select line you will use in the config
SELECT=""
# You may need to use a template after the fact...
# Which match to take from the select results (0 = first match)
INDEX=0

r = requests.get(URL)
data=r.text

# Print the output of the request command to see what we even get. 
print(data)

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(data, "html.parser")

# See what the select returns
val = soup.select(SELECT)
print(val)

# Try to get to the lowest thing we can...
value = val[INDEX].text
print(value)

# From here on out, we have to do template code to pare it down more.

Consult the documentation to find the right select value:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

Looking at that website you linked…I feel like this point is important…

BeautifulSoup does not execute JavaScript, so any data delivered or rendered via JS will not be available to you if you scrape with BeautifulSoup. To access JavaScript-rendered pages you will need to use a full-fledged rendering engine.

Often, the data you want is populated by JavaScript calls after the page loads. The response you get from the original request is just a bunch of JavaScript with no real data. That site is 99% JavaScript.
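
To make that concrete, here is a minimal sketch. The HTML is made up for illustration (it is not taken from Windguru): a div that a script would fill in inside a browser. BeautifulSoup only parses the markup, so the div stays empty.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the value is written into the div by JavaScript after load.
HTML = """
<html><body>
  <div id="wind"></div>
  <script>
    document.getElementById("wind").textContent = "23.4";
  </script>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")
wind = soup.select("#wind")[0]

# The script is never executed, so the div has no text at all.
print(repr(wind.text))
```

If the value you are after only shows up this way, no select string will find it in the raw response.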

There is an API available…but I don’t know any station_ids (especially ones without a password), so I’m stuck there too.

http://stations.windguru.cz/json_api_stations.html

AhmadK · April 8, 2020, 3:56pm

that’s a lot of info!
I just created a scrape sensor and enabled its debug logging just to see it’s a bit too complicated

Makis · April 8, 2020, 5:26pm

I read some of the documents @jocnnor suggested and I understand 3% of them.
I thought it would be something easier.
Right now it seems easier to make an anemometer with an ESP8266
(after the lessons you gave me about MQTT)

tom_l · April 8, 2020, 5:42pm

I did warn you.

Looking at the source of the web page, it is really not written to be scraped easily either.

Installing an anemometer on your house would be easier.

Makis · April 8, 2020, 6:06pm

Please give me the configuration of the scrape sensor you made, just so I can see how you wrote it.

Makis · April 8, 2020, 6:07pm

Yes, way too difficult. I try to read but I am not getting any wiser.

Makis · April 8, 2020, 6:08pm

Last attempt?
How about the info from this site?

[Screenshot: Annotation 2020-04-08 210424]

AhmadK · April 8, 2020, 8:31pm

Well, it’s similar to the example here; you just need to change the resource and select variables (use the browser’s Inspector; I put div.live-td.live-current).
As I said, after seeing “Unable to extract data from HTML” in the HA logs, I enabled debug logging by adding this to the logger configuration:

homeassistant.components.scrape.sensor: debug

Then I opened the log and searched for live-current. There was no match, and the whole document looked dodgy.

lol. I can only say it’s in ancient Greek

jocnnor · April 8, 2020, 9:19pm

Did you try your own python script yet?

from bs4 import BeautifulSoup
import requests

# Change these 2 things
URL="http://stravon.gr/pinakas/meteostation/kokkinia"
# This is the select line you will use in the config
SELECT=".pricetab > .infos > h3"
# You may need to use a template after the fact...
INDEX=4

r = requests.get(URL)
data=r.text

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(data, "html.parser")

#print(soup)

val = soup.select(SELECT)

print("Output of SELECT: **************")
print(val)
print("**************")

value = val[INDEX].text
print(value)

On the webpage, just right-click > Inspect (in Chrome). From there, you can see the HTML.

When I tried it, the webpage reloaded and stopped updating on me, so all of the values were blank. But not important.

If you read the CSS selector doc, you can see what it is trying to do. It’s going to build a list of all outputs that match your select.

At first, I just tried ".infos > h3" as the select attribute. This puts into a list all elements where <h3> is directly under <div class="infos">. But there were a lot of those, so I tried to pare it down more with ".pricetab > .infos > h3", as there were only a few .pricetab classes.

That created the following list…

Output of SELECT: **************
[<h3> Μέγιστη :
 °C</h3>, <h3> Ελάχιστη :  °C</h3>, <h3> Μέγιστη :  % </h3>, <h3> Ελάχιστη :  % </h3>, <h3> Μέγιστη Ριπή :  km/h</h3>, <h3>.</h3>, <h3> Μηνιαίος :  mm</h3>, <h3> Ετήσιος :  mm</h3>, <h3> Μέγιστη :  mm/h</h3>, <h3> - </h3>]
**************
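
The difference between the two selectors can be sketched on a tiny piece of made-up HTML. The class names pricetab and infos come from the page; the sidebar block and the text inside the headings are invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML that only imitates the structure described above.
HTML = """
<div class="pricetab">
  <div class="infos"><h3>Max gust : 35 km/h</h3></div>
</div>
<div class="sidebar">
  <div class="infos"><h3>Unrelated heading</h3></div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Every <h3> that is a direct child of an element with class "infos"
broad = soup.select(".infos > h3")
# Same, but the "infos" element must itself sit directly under "pricetab"
narrow = soup.select(".pricetab > .infos > h3")

print(len(broad))      # 2
print(len(narrow))     # 1
print(narrow[0].text)  # Max gust : 35 km/h
```

The shorter the list the select returns, the less likely the index is to shift when the page changes.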

From there, I just counted how many there were until I found the one I wanted. The one we want is the 5th in the list, which is index 4 because counting starts at 0.

So now I know the select, and index, and the output. Time to build the sensor.

sensor:
  - platform: scrape
    resource: "http://stravon.gr/pinakas/meteostation/kokkinia"
    select: ".pricetab > .infos > h3"
    index: 4
    # output is Μέγιστη Ριπή : XXX km/h. Split at the ":" and take the 2nd part. 
    # If you want to remove the km/h, you'll have to strip that too....
    value_template: "{{ value.split(":")[1].strip() }}"
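
If you want to try the string handling before touching the config, the same calls work in plain Python, since the template is using ordinary string methods. This is only a sketch: the sample reading is made up, and number_only is a hypothetical helper that assumes the unit is separated from the number by a space.

```python
def after_colon(value: str) -> str:
    """Mimic value.split(":")[1].strip() from the template."""
    return value.split(":")[1].strip()


def number_only(value: str) -> str:
    """Also drop the trailing unit, assuming a space separates number and unit."""
    return after_colon(value).split(" ")[0]


sample = " Μέγιστη Ριπή : 35.4 km/h"  # made-up reading in the page's format

print(after_colon(sample))  # 35.4 km/h
print(number_only(sample))  # 35.4
```

When this goes back into YAML, wrapping the whole template in single quotes keeps the inner ":" quotes from ending the string early.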

Makis · April 9, 2020, 6:42am

Thanks! It is working. However, I would like the number 9.7, as this is the actual live wind data (the one you gave me is the highest of the day).

But first of all I would like to try to get the result you gave me by myself.
I have a Windows 10 laptop. I installed Python and beautifulsoup4 (as you can tell, I have no idea of the language or how to use it).

Please see the picture and tell me what is wrong.
There is a problem with the output of the select.
I am sure it is something obvious, but with zero knowledge I am stuck.

Output of SELECT: **************
[<h3> Μέγιστη :
 °C</h3>, <h3> Ελάχιστη :  °C</h3>, <h3> Μέγιστη :  % </h3>, <h3> Ελάχιστη :  % </h3>, <h3> Μέγιστη Ριπή :  km/h</h3>, <h3>.</h3>, <h3> Μηνιαίος :  mm</h3>, <h3> Ετήσιος :  mm</h3>, <h3> Μέγιστη :  mm/h</h3>, <h3> - </h3>]
**************

Makis · April 9, 2020, 10:36am

OK. I managed to capture the value I need!
I had to install the “resources” in Python (I am not sure what it is. A library?)
[Screenshot: Annotation 2020-04-09 133254]

AhmadK · April 9, 2020, 10:42am

I presume you just need to call it using a shell command.

Makis · April 9, 2020, 1:52pm

For anyone with time to spare,
could you please help with the following?

I need to scrape data from this site
http://chalandri.meteoclub.gr/
and more specifically the wind data = 23 (again).

I have tried the following:

#!/usr/bin/python3

from bs4 import BeautifulSoup
import requests

# Change these 2 things
URL="http://chalandri.meteoclub.gr/"
# This is the select line you will use in the config
SELECT=".td_wind_data>td"
# You may need to use a template after the fact...
INDEX=4

r = requests.get(URL)
data=r.text


# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(data, "html.parser")

#print(soup)

val = soup.select(SELECT)

print("Output of SELECT: **************")
print(val)
print("**************")

value = val
print(value)

and I am getting the following

Output of SELECT: **************
[<td>Ταχύτητα ριπής ανέμου - Wind Speed (gust)</td>, <td>21 km/h</td>, <td>Μέση ταχύτητα ανέμου - Wind Speed (avg)</td>, <td>17 km/h</td>, <td>Διεύθυνση ανέμου - Wind Bearing</td>, <td>350° N</td>, <td>Μποφόρ - Beaufort F3</td>, <td>Gentle breeze</td>, <td>Wind Run</td>, <td>166,6 km</td>, <td> </td>, <td> </td>]
**************

How can I extract the value 21? (It was 23 when I took the screenshot.)

jocnnor · April 9, 2020, 2:14pm

Hey, I’m glad you’re kind of figuring it all out yourself!

I really should have that last print do each value in the list on its own line so it’s way easier to see. Change that part of the script to this instead.

print("********** Output of SELECT: **********")
# enumerate gives the index and the matched tag together
for i, tag in enumerate(val):
    print(" index[{}]:  {}".format(i, tag.text))
print("***************************************")

Here is what the output of the select actually is.

[
<td>Ταχύτητα ριπής ανέμου - Wind Speed (gust)</td>, 
<td>21 km/h</td>, 
<td>Μέση ταχύτητα ανέμου - Wind Speed (avg)</td>, 
<td>17 km/h</td>, 
<td>Διεύθυνση ανέμου - Wind Bearing</td>, 
<td>350° N</td>, 
<td>Μποφόρ - Beaufort F3</td>, 
<td>Gentle breeze</td>, 
<td>Wind Run</td>, 
<td>166,6 km</td>, 
<td> </td>, 
<td> </td>
]

As you can see, it’s a Python list [] with a bunch of things. The way I originally had it, it prints the <td> </td> tags too. The new way prints only the text inside the tags, which is what the scrape sensor returns.

So we want the 2nd entry in the list (or, list index 1).

So, change INDEX in the python script to 1. Likewise, you’d change the index part in the config to 1 as well.

When you do that, the final output will be:

21 km/h

If you remove the value_template, your sensor will have this string as the value.
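
Here is the whole select-then-index step as a small offline sketch. The table markup is an assumption: the thread only shows the matched <td> elements, so the <tr class="td_wind_data"> wrapper is a guess at what the page looks like, and the labels are shortened.

```python
from bs4 import BeautifulSoup

# Assumed markup: only the <td> contents are known from the output above.
HTML = """
<table>
  <tr class="td_wind_data">
    <td>Wind Speed (gust)</td><td>21 km/h</td>
    <td>Wind Speed (avg)</td><td>17 km/h</td>
  </tr>
</table>
"""

SELECT = ".td_wind_data > td"
INDEX = 1  # 2nd entry in the list

soup = BeautifulSoup(HTML, "html.parser")
val = soup.select(SELECT)

# .text drops the <td> tags and keeps only what is inside them
value = val[INDEX].text
print(value)  # 21 km/h
```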

If you would rather remove the km/h and put that as a unit_of_measurement or something, then you’ll have to use a value_template to strip the extra stuff. In this case, we’ll just use a regex filter to extract the number.

# value is the output of the select. Do your favorite string manipulation to extract only what you care about.
value_template: "{{ value | regex_findall_index(find='\d+') }}"
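
You can check the pattern in plain Python first. This is a rough stand-in for the filter, assuming it behaves like re.findall followed by taking the first match; first_number and first_decimal are hypothetical helper names.

```python
import re


def first_number(value: str) -> str:
    """First run of digits in the string."""
    return re.findall(r"\d+", value)[0]


def first_decimal(value: str) -> str:
    """First number, keeping an optional '.' or ',' decimal part."""
    return re.findall(r"\d+(?:[.,]\d+)?", value)[0]


print(first_number("21 km/h"))    # 21
print(first_number("166,6 km"))   # 166
print(first_decimal("166,6 km"))  # 166,6
```

Note that \d+ stops at the decimal separator, so a reading like 166,6 would be cut to 166 unless the pattern allows for it.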

Makis · April 9, 2020, 2:45pm

Thanks again. Very useful!

I have a problem in the HA configuration though.
Both value_template: formulas are giving the message below.
Do you know why?

unknown escape sequence at line 7, column 60:
     ... ue | regex_findall_index(find='\d+') }}"

I got the same this morning with the previous setup.

jocnnor · April 9, 2020, 2:57pm

Yeah, try this instead.

value_template: '{{ value | regex_findall_index(find="\d+") }}'

The problem is that inside a double-quoted YAML string, a backslash starts an escape sequence, and \d is not one that YAML recognizes. So the parser rejects the line with the “unknown escape sequence” error before the template is ever rendered. Inside single quotes, YAML leaves backslashes alone, so our regex “\d+” reaches the template intact.

So we’ll use single quotes instead.
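
If you want to see the quoting difference outside of Home Assistant, here is a small sketch using PyYAML (pip install pyyaml). It is an assumption that this matches Home Assistant's own loader closely enough; the exact error text will differ, and try_load is a hypothetical helper.

```python
import yaml  # PyYAML

# Raw strings so Python itself leaves the backslashes alone.
DOUBLE = r'''value_template: "{{ value | regex_findall_index(find='\d+') }}"'''
SINGLE = r"""value_template: '{{ value | regex_findall_index(find="\d+") }}'"""


def try_load(text: str) -> str:
    """Return the parsed template, or a short note if YAML rejects the line."""
    try:
        return yaml.safe_load(text)["value_template"]
    except yaml.YAMLError as err:
        return "YAML error: {}".format(type(err).__name__)


print(try_load(DOUBLE))  # rejected because of the \d escape
print(try_load(SINGLE))  # template comes through with \d+ intact
```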

yoni · April 9, 2020, 3:15pm

This one is a good idea. I will set up an automation for when the conditions are great for kitesurfing here in St. Pete.

123 · April 9, 2020, 3:28pm

You can also just slice off the last 5 characters from the string value:

value_template: "{{ value[:-5] }}"
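
The slice works the same way in plain Python, so it is easy to try out. A quick sketch (drop_last_five is a hypothetical helper name) that also shows the catch: it only gives a clean number while the unit is exactly " km/h".

```python
def drop_last_five(value: str) -> str:
    """Same slice as the template: everything except the last 5 characters."""
    return value[:-5]


print(drop_last_five("21 km/h"))   # 21
print(drop_last_five("166,6 km"))  # 166 (shorter unit, so the decimals are cut off)
```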