Multiscrape or: How I learned to stop worrying and love the BOM
I wanted to find a way to publish historical rain records into HA.
The BOM publish various rain gauges on the web, e.g. http://www.bom.gov.au/cgi-bin/wrap_fwo.pl?IDV60178.html, and I wanted to incorporate some of these using the multiscrape component.
Testing my selectors at the very useful https://try.jsoup.org/, I learned that the BOM blocks HTTP scrapers and that the data must be accessed through FTP.
I needed a way to FTP the file to somewhere I could scrape it over HTTP. If you have your own domain with cPanel access, you can set up a cron job to fetch the file (you can specify the frequency), copy it to your website, and allow scraper access.
Here is the cron job I set up in cPanel:

```shell
curl ftp://ftp.bom.gov.au//anon/gen/fwo/IDV60178.html -o ~/public_html/h-a/IDV60178.html >/dev/null 2>&1
```

`~/public_html/h-a/IDV60178.html` is where I want the file copied to.
`>/dev/null 2>&1` stops cPanel from sending me an alert every 30 minutes after the job completes.
After much frustration I found the selector `#container > table:nth-child(9) > tbody > tr:nth-child(9) > td:nth-child(2)` must be edited down to `table:nth-child(9) > tr:nth-child(9) > td:nth-child(2)` for multiscrape to cope.
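Putting the pieces together, a minimal multiscrape config sketch might look like this. The resource URL is a placeholder for wherever your cron job mirrors the file, and the sensor name, unique_id and unit are just examples:

```yaml
# Sketch: scrape the mirrored BOM page with multiscrape.
# The resource URL and sensor names are placeholders; the selector
# is the trimmed one from above.
multiscrape:
  - resource: https://your-domain.example/h-a/IDV60178.html
    scan_interval: 1800  # the cron job refreshes the file every 30 minutes
    sensor:
      - unique_id: bom_rain_day_4
        name: "Rain 4"
        select: "table:nth-child(9) > tr:nth-child(9) > td:nth-child(2)"
        unit_of_measurement: "mm"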
I can now use these sensors in Home Assistant (including the platinum weather card), but I need some help to display the last 7 days as a card. At the moment it looks like this:
I’m pretty sure that Lovelace hates all humans (not just me), but if anyone could help me with display names for the values it would be great.
I don’t understand how to make the descriptor (e.g. Rain 4) for the mm value change with the day/date.
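One way to approach this is a template sensor that carries a rolling day label as a templated attribute, which a markdown card can then show next to the value. This is a sketch; `sensor.bom_rain_day_4` is a placeholder for whatever your scraped sensor is called:

```yaml
# Sketch: wrap the scraped value in a template sensor whose
# attribute holds a date label that rolls over automatically.
# sensor.bom_rain_day_4 is a placeholder entity id.
template:
  - sensor:
      - name: "Rain 4 days ago"
        unit_of_measurement: "mm"
        state: "{{ states('sensor.bom_rain_day_4') }}"
        attributes:
          day_label: "{{ (now() - timedelta(days=4)).strftime('%a %d %b') }}"
```

A markdown card could then render something like `{{ state_attr('sensor.rain_4_days_ago', 'day_label') }}` alongside the state.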
Thanks for that, that’s a great start.
I’ve spent a couple of hours trying and I can get the list of days looking nice, but I can’t work out how to get the string into bastard Lovelace.
I’ll have another go tomorrow
This looks similar to another post, where they solved it by turning the sensors into a template weather entity - maybe this will allow you to use Lovelace weather cards to display it properly:
It is named after Weather Underground, but I don’t think it relies on it - by the looks of it you can put in your own sensors.
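If you go that route, the template weather platform would look roughly like this. All the `sensor.bom_*` entity ids below are placeholders, and the condition/temperature/humidity templates are the required fields:

```yaml
# Sketch of a template weather entity fed by scraped sensors.
# All sensor.bom_* entity ids are placeholders for your own sensors.
weather:
  - platform: template
    name: "BOM Scraped"
    condition_template: "sunny"  # or template this from another entity
    temperature_template: "{{ states('sensor.bom_temperature') | float(0) }}"
    humidity_template: "{{ states('sensor.bom_humidity') | float(0) }}"
```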
That’s not “edited down for multiscrape to cope” — it’s a fault of whatever tool you were using to find the selector. The HTML of your source page doesn’t include an (optional) <tbody>, so you were asking it to find something that isn’t there.
Can’t believe that government site is still not using HTTPS, and doesn’t provide the raw data via JSON or XML which would make things easier and more robust than scraping.
Thanks, I’ll look into that and see if there is anything helpful there
@Troon Yes, I’m out of my depth with this; the selector came from Safari, Chrome and Firefox, which all gave me options that worked with https://try.jsoup.org/ but wouldn’t with multiscrape. I included this information to help anyone searching for help, as all my searches found people with similar problems but no solutions offered.
And agreed, I’m astounded the site is not HTTPS and we have to go about it in this roundabout way. The people working on the BOM weather plugin have, I believe, found a way around this with an undocumented API.
Then again, this is Australia, which until recently had a conservative government that decided we didn’t need a minister for science, massively reduced funding to the government science research body, believed the left was trying to introduce electric vehicles to ruin our weekends, insisted global warming was not happening, and said Australians don’t need internet faster than 30 Mbps. It’s bloody lucky the BOM even have a website as opposed to a weather fax service.
Hi, multiscrape dev here. Nice write-up! I will check if I can make the name of the sensor “templatable” in the config. That will make it a lot easier I guess.
Not sure though what will happen if one doesn’t provide the (optional) unique_id.
And to make it easier next time: enable log_response and then paste the content of the page_soup.txt file into jsoup.org, as that file contains the exact content that’s being scraped, without interference from the browser decorating your HTML with tbody etc.
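For reference, turning that on is just one extra key in the config. The resource URL here is a placeholder:

```yaml
# log_response dumps the fetched/parsed page (including page_soup.txt)
# so you can test selectors against exactly what multiscrape sees.
multiscrape:
  - resource: https://your-domain.example/h-a/IDV60178.html
    log_response: true
```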