Multiscrape or: How I learned to stop and love the BOM

worldvirus · January 31, 2023, 4:31am

I wanted to find a way to publish historical rain records into HA.
The BOM have various rain gauges published on the web, e.g. http://www.bom.gov.au/cgi-bin/wrap_fwo.pl?IDV60178.html, and I wanted to incorporate some of these using the multiscrape component.

While I was testing my selectors at the very useful https://try.jsoup.org/, it informed me that the BOM blocks HTTP scrapers and that access to the data must be through FTP.
I needed a way to fetch the file over FTP and put it somewhere I could scrape it over HTTP.

If you have your own domain with cPanel access, you can set up a cron job to fetch the file (you can specify the frequency), copy it to your website and allow scraper access.
Here is the cron job I set up on my cPanel:
curl ftp://ftp.bom.gov.au//anon/gen/fwo/IDV60178.html -o ~/public_html/h-a/IDV60178.html >/dev/null 2>&1

“~/public_html/h-a/IDV60178.html” is where I want the file copied to
“>/dev/null 2>&1” stops cPanel from sending me an alert every 30 mins after the job completes

After much frustration I found that the selector
#container > table:nth-child(9) > tbody > tr:nth-child(9) > td:nth-child(2)
must be edited down to:
table:nth-child(9) > tr:nth-child(9) > td:nth-child(2)
for multiscrape to cope.

The YAML now worked:

multiscrape:
- name: BOM scraper
  resource: http://www.”mywebsite”.net/h-a/IDV60178.html
  scan_interval: 1800
  sensor:
    - unique_id: Ricketts_Marsh_Rain_Since_9am
      name: Rain Since Yesterday
      select: "table:nth-child(9) > tr:nth-child(9) > td:nth-child(9)"
      unit_of_measurement: mm
    - unique_id: Ricketts_Marsh_Rain_History_1
      name: Rain Yesterday
      select: "table:nth-child(9) > tr:nth-child(9) > td:nth-child(8)"
      unit_of_measurement: mm
    - unique_id: Ricketts_Marsh_Rain_History_2
      name: Rain 2
      select: "table:nth-child(9) > tr:nth-child(9) > td:nth-child(7)"
      unit_of_measurement: mm
    - unique_id: Ricketts_Marsh_Rain_History_3
      name: Rain 3
      select: "table:nth-child(9) > tr:nth-child(9) > td:nth-child(6)"
      unit_of_measurement: mm
    - unique_id: Ricketts_Marsh_Rain_History_4
      name: Rain 4
      select: "table:nth-child(9) > tr:nth-child(9) > td:nth-child(5)"
      unit_of_measurement: mm
    - unique_id: Ricketts_Marsh_Rain_History_5
      name: Rain 5
      select: "table:nth-child(9) > tr:nth-child(9) > td:nth-child(4)"
      unit_of_measurement: mm
    - unique_id: Ricketts_Marsh_Rain_History_6
      name: Rain 6
      select: "table:nth-child(9) > tr:nth-child(9) > td:nth-child(3)"
      unit_of_measurement: mm
    - unique_id: Ricketts_Marsh_Rain_History_7
      name: Rain 7
      select: "table:nth-child(9) > tr:nth-child(9) > td:nth-child(2)"
      unit_of_measurement: mm

I can now use these sensors in Home Assistant (including the platinum weather card), but I need some help to display the last 7 days as a card. At the moment it looks like this:
[Image: HA Rain]
I’m pretty sure that Lovelace hates all humans (not just me), but if anyone could help me with display names for the values it would be great.
I don’t understand how to make the descriptor (e.g. Rain 4) for the mm value change with the day/date.

justone · January 31, 2023, 5:48pm

Perhaps you should fiddle with the Developer Tools > Template and for example enter something such as

{{ as_timestamp(now()) | timestamp_custom}}
{{ as_timestamp(now()) | timestamp_custom("%A")}}
{{ (as_timestamp(now()) - 86400) | timestamp_custom("%A")}}

and then read on about timestamps and such like. Small hint … 86400 is a day by … you’ll surely find out yourself.

worldvirus · February 1, 2023, 8:41am

Thanks for that, that’s a great start.
I’ve spent a couple of hours trying and I can get the list of days all nice, but I can’t work out how to get the string into bastard Lovelace.
I’ll have another go tomorrow.

Edwin_D · February 1, 2023, 8:53am

This looks similar to another post. They solved it by turning the sensors into a template weather entity. Maybe this will allow you to use Lovelace weather cards to display it properly:

It is named after Weather Underground, but I don’t think it relies on it. By the looks of it you can put in your own sensors.

Troon · February 1, 2023, 9:04am

That’s not “edited down for multiscrape to cope” — it’s a fault of whatever tool you were using to find the selector. The HTML of your source page doesn’t include an (optional) <tbody>, so you were asking it to find something that isn’t there.

Can’t believe that government site is still not using HTTPS, and doesn’t provide the raw data via JSON or XML which would make things easier and more robust than scraping.

worldvirus · February 2, 2023, 12:35am

This looks similar to another post. They solved it by turning the sensors into a template weather entity. Maybe this will allow you to use Lovelace weather cards to display it properly:

Thanks, I’ll look into that and see if there is anything helpful there.

@Troon Yes, I’m out of my depth with this. The selector came from Safari, Chrome and Firefox, all giving me options that would work with https://try.jsoup.org/ but wouldn’t with multiscrape. I included this information to help anyone searching for help, as all my searches found people with similar problems but no solutions offered.
And agreed, I’m astounded the site is not HTTPS and that we have to go about this in such a roundabout way. The guys working on the BOM weather plugin have, I believe, found a way around this with an undocumented API.

Then again this is Australia, which until recently had a conservative government that decided we didn’t need a minister for science, massively reduced funding to the government science research body, believed the left was trying to introduce electric vehicles to ruin our weekends, insisted global warming was not happening and said Australians don’t need internet faster than 30 Mbps. It’s bloody lucky the BOM even have a web site as opposed to a weather fax service.

worldvirus · February 2, 2023, 1:47am

justone was on the money. With the Lovelace Card Templater and the following code, all is working beautifully:

type: custom:card-templater
card:
  type: entities
  title: Rickets Marsh Rainfall
  entities:
    - entity: sensor.rickets_marsh_rain_since_9am
      name: Rain Since 9:00am
    - entity: sensor.rickets_marsh_rain_history_1
      name: Yesterday
    - entity: sensor.rickets_marsh_rain_history_2
      name_template: '{{ (as_timestamp(now()) - 172800) | timestamp_custom("%A")}}'
    - entity: sensor.rickets_marsh_rain_history_3
      name_template: '{{ (as_timestamp(now()) - 259200) | timestamp_custom("%A")}}'
    - entity: sensor.rickets_marsh_rain_history_4
      name_template: '{{ (as_timestamp(now()) - 345600) | timestamp_custom("%A")}}'
    - entity: sensor.rickets_marsh_rain_history_5
      name_template: '{{ (as_timestamp(now()) - 432000) | timestamp_custom("%A")}}'
    - entity: sensor.rickets_marsh_rain_history_6
      name_template: '{{ (as_timestamp(now()) - 518400) | timestamp_custom("%A")}}'
    - entity: sensor.rickets_marsh_rain_history_7
      name_template: '{{ (as_timestamp(now()) - 604800) | timestamp_custom("%A")}}'
entities:
  - sensor.time
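
The offsets in the card above are multiples of 86400 seconds, one per day back. As a possible alternative, here is a minimal sketch of the same labels written with timedelta, which Home Assistant templates provide, so the number of days is visible at a glance. It assumes card-templater renders name_template with the normal Home Assistant template engine, as the working card above suggests. Only two entries are shown and the title is omitted; the entity IDs are copied from the card above.

```yaml
type: custom:card-templater
card:
  type: entities
  entities:
    # Two days ago: same result as subtracting 172800 seconds
    - entity: sensor.rickets_marsh_rain_history_2
      name_template: '{{ (now() - timedelta(days=2)).strftime("%A") }}'
    # Three days ago: same result as subtracting 259200 seconds
    - entity: sensor.rickets_marsh_rain_history_3
      name_template: '{{ (now() - timedelta(days=3)).strftime("%A") }}'
entities:
  # Kept from the original card so the names are re-rendered as time passes
  - sensor.time
```

The expression can be pasted into Developer Tools > Template first to confirm it prints the expected day name.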

danieldotnl · February 3, 2023, 8:27am

Hi, multiscrape dev here. Nice write-up! I will check if I can make the name of the sensor “templatable” in the config. That will make it a lot easier I guess.
Not sure though what will happen if one doesn’t provide the (optional) unique_id.

And to make it easier next time: enable log_response and then paste the content of the page_soup.txt file in jsoup.org as that file contains the exact content that’s being scraped, without interference of the browser decorating your html with tbody etc.
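
For anyone wanting to try that tip, a minimal sketch of the scraper from the first post with response logging switched on might look like this. The resource URL is a placeholder (example.net is not a real host for this file), and only one sensor is shown. The folder the log files end up in is not stated in this thread, so check the multiscrape documentation for where to find page_soup.txt.

```yaml
multiscrape:
  - name: BOM scraper
    # Placeholder URL: replace with wherever your cron job copies the file
    resource: http://www.example.net/h-a/IDV60178.html
    scan_interval: 1800
    # Writes the fetched response and the parsed page to files for debugging
    log_response: true
    sensor:
      - unique_id: Ricketts_Marsh_Rain_History_7
        name: Rain 7
        select: "table:nth-child(9) > tr:nth-child(9) > td:nth-child(2)"
        unit_of_measurement: mm
```

Once the selectors work it is probably worth turning log_response off again so the files are not rewritten on every scan.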

worldvirus · February 4, 2023, 4:23am

Thanks @danieldotnl, and big thanks for the plugin and all your work.