Hass.io Dynamic Web Scraping?

I’ve converted over to Hass.io and it’s been smooth so far. I want to scrape two values from the page http://www.bcferries.com/current_conditions/arrivals-departures.html?dept=HSB&route=08 to tell me when my ferry left the dock and whether it was late. The problem: the table gets filled in row by row as the day goes by, so there is no fixed reference for the Scrape component to pull. For example, as the day goes by I need the 14th and 15th “td” tags, then the 20th and 21st, then the 26th and 27th, etc.

-Can’t use Scrape alone as the reference changes, unless there is some way to pass a variable to it.
-Can’t use python-script as it won’t import libraries like Requests or BeautifulSoup.
-AppDaemon might work, but the addon for Hass.io is a beta, and I haven’t played with it enough to figure out if libraries can be installed and how to connect to HA.

Workaround 1 - Create a Google Sheet that pulls the range I need and does the calculation to a fixed cell with a trigger to update every 15 min. Publish the sheet to the web and use a Scrape sensor in HA to pull the value.
- this actually works really well and is a great solution for other uses, but the lag in publishing to the web makes it too unpredictable for an hourly ferry.

Workaround 2 - create 30 Scrape sensors - one for each tag used during the day - then create a template sensor that loops to find the first one that isn’t blank (ie the current ferry).
- this seems really cumbersome to implement and maintain

Does anyone have any suggestions? Thanks

I didn’t find a way to add external libraries to AppDaemon without creating my own addon and editing the Dockerfile, and I’d prefer to use the Community AppDaemon addon.

I was however able to create a script using the built in Python 3 libraries such as urllib, re, etc that does what I want by looking for the “td” tags in the webpage. I haven’t turned it into an App yet and still need to read up on how to return a sensor value to HA rather than ‘Print’.

# python3 script to find out the latest ferry status
import urllib.request
import re
import datetime
import math

# create a list of all the <td tags that hold time in the ferry table
url = "http://orca.bcferries.com:8080/cc/marqui/arrivals-departures.asp?dept=HSB&route=08"
req = urllib.request.Request(url)
resp = urllib.request.urlopen(req)
respData = resp.read()
tds = re.findall(r'<td width=100 align=\\\'right\\\'*>(.*?)</td>',str(respData))
    
#find the latest ferry times and calculate the time difference
n= 57
i=1
while i <= n:
    i += 4
    if len(tds[i])<=1:
        i -= 4
        break
        
schedule_depart=tds[i-1].strip(' ')
actual_depart=tds[i].strip(' ')
# use %I (12-hour clock) so that %p (AM/PM) is applied; with %H it is ignored
schedule_time= datetime.datetime.strptime(schedule_depart, "%I:%M  %p")
depart_time= datetime.datetime.strptime(actual_depart, "%I:%M  %p")
minutes_late=math.ceil((depart_time-schedule_time).total_seconds()/60)
        
print("The latest ferry was to leave Horseshoe Bay at",schedule_depart,",left at",actual_depart,"and was",minutes_late,"minutes late." )
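
For anyone who would rather not run a regex over str(respData), here is a rough alternative that still uses only the standard library (html.parser). It is a sketch, not a drop-in replacement: the sample markup is made up, the header labels are placeholders, and it assumes four right-aligned cells per row with the scheduled departure first and the actual departure second, which is what the script above implies. It picks the last row that has an actual departure filled in.

```python
from html.parser import HTMLParser


class RightAlignedCells(HTMLParser):
    """Collect the text of every <td> whose align attribute is 'right'."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("align") == "right":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def latest_departed(cells, header=4, width=4):
    """Return (scheduled, actual) for the last row with an actual departure."""
    rows = [cells[i:i + width] for i in range(header, len(cells), width)]
    departed = [r for r in rows if len(r) > 1 and r[1].strip()]
    if not departed:
        return None
    last = departed[-1]
    return last[0].strip(), last[1].strip()


def row(*values):
    """Build one made-up table row in the shape the thread describes."""
    cells = "".join(f"<td width=100 align='right'>{v}</td>" for v in values)
    return f"<tr>{cells}</tr>"


# Made-up sample page: a header row, two departed ferries, one still to come
sample = "<table>" + "".join([
    row("Scheduled", "Actual", "Arrival", "Status"),
    row("6:20 AM", "6:22 AM", "6:40 AM", "On Time"),
    row("7:30 AM", "7:41 AM", " ", " "),
    row("8:35 AM", " ", " ", " "),
]) + "</table>"

parser = RightAlignedCells()
parser.feed(sample)
print(latest_departed(parser.cells))
```

With the real page you would feed it resp.read().decode() instead of the sample string; the column layout would need checking against the live table.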

I used:

It took me some time to figure it out but the answer is all the way at the end.

Jhh

Thanks for that, it looks like a really good option. I was already down the AppDaemon path a bit, and it looks like it’s working so far. Every 5 minutes (300 seconds) it polls the website and writes a value to 3 different text files that are read by 3 HA command line sensors. This was suggested in a post by Rene Tode: https://community.home-assistant.io/t/custom-sensor-via-appdaemon/12431. If anyone can suggest a direct way to update a sensor whose only purpose is to receive a value from AppDaemon, I’d be curious to try it to save the writes to the SD card.

One of the HA sensors:

sensor 9:
  - platform: command_line
    name: Ferry Scheduled
    command: "cat /config/appdaemon/apps/schedule_depart.txt"

and the AppDaemon code:

import appdaemon.appapi as appapi
import urllib.request
import re
import datetime
import math

class FerryStatus(appapi.AppDaemon):

  def initialize(self):
    self.log("Ferry Status loaded")
    starttime = datetime.datetime.now() + datetime.timedelta(seconds=5)      
    self.run_every(self.checkup,starttime,300) 

  def checkup(self, kwargs):
    url = "http://orca.bcferries.com:8080/cc/marqui/arrivals-departures.asp?dept=HSB&route=08"
    req = urllib.request.Request(url)
    resp = urllib.request.urlopen(req)
    respData = resp.read()
    tds = re.findall(r'<td width=100 align=\\\'right\\\'*>(.*?)</td>',str(respData))
    n= 53
    i=1
    while i <= n:
        i += 4
        if len(tds[i])<=1:
            i -= 4
            break
    schedule_depart=tds[i-1].strip(' ')
    actual_depart=tds[i].strip(' ')
    # use %I (12-hour clock) so that %p (AM/PM) is applied; with %H it is ignored
    schedule_time= datetime.datetime.strptime(schedule_depart, "%I:%M  %p")
    depart_time= datetime.datetime.strptime(actual_depart, "%I:%M  %p")
    minutes_late=math.ceil((depart_time-schedule_time).total_seconds()/60)
    # write each value to its own file for the HA command_line sensors to read
    with open("/config/appdaemon/apps/schedule_depart.txt", 'w') as sensorfile1:
        sensorfile1.write(schedule_depart)
    with open("/config/appdaemon/apps/actual_depart.txt", 'w') as sensorfile2:
        sensorfile2.write(actual_depart)
    with open("/config/appdaemon/apps/minutes_late.txt", 'w') as sensorfile3:
        sensorfile3.write(str(minutes_late))
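
On the question of skipping the text files: one possible route is to push the value straight into Home Assistant through its REST API (POST to /api/states/<entity_id>), again with only the standard library. The sketch below just builds the request so it can be inspected without a running HA instance. The host, token and entity name are placeholders, and it assumes a long-lived access token for the Authorization header (older installs used an API password instead). AppDaemon also has a set_state() call on the app object that may be simpler if it is available in your version; I have not tested it here.

```python
import json
import urllib.request


def build_state_request(base_url, token, entity_id, state, attributes=None):
    """Build (but do not send) a POST to Home Assistant's /api/states endpoint."""
    body = {"state": str(state), "attributes": attributes or {}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/states/{entity_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical host, token and entity name, for illustration only
req = build_state_request(
    "http://hassio.local:8123",
    "YOUR_LONG_LIVED_TOKEN",
    "sensor.ferry_minutes_late",
    7,
    {"unit_of_measurement": "min"},
)
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req, timeout=10)
```

As far as I know, a state set this way is not restored after an HA restart, so the app would need to write it again on its next poll.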

There is a module which could help if you want a closer integration.

Great suggestion, but unfortunately with Hass.io I would currently have to clone the AppDaemon addon and edit the Dockerfile in order to install the module. I’ll definitely keep it on my radar.

Do you do anything with BC Hydro? I would love to scrape their website for last hour energy use. I even looked at buying one of the zigbee devices but I would need BC Hydro approval and couldn’t get the info into HA. Thanks

I think this app works on the iPhone, but I can’t get the info into HA.

https://itunes.apple.com/ca/app/stumply/id1169804248?mt=8

That’s not something I’ve looked into yet, but a quick scan of the pages shows the table view in Detailed Consumption -> Custom -> Hourly could be scraped by the method above, or even more simply sending a periodic request to Export Data-CSV or XML and reading the file with HA. That would be useful for having a graph and seeing the trend, but I think the problem is that the data does not seem to get posted for the current day until midnight, so I don’t see an easy answer for real time consumption values like last hour use.
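
If the CSV export route pans out, reading the most recent hourly value could look something like the sketch below. I have not looked at the real export, so the column names ("timestamp", "kwh") and the sample rows are purely hypothetical and would need to be replaced with whatever the file actually contains.

```python
import csv
import io


def last_hour_kwh(csv_text, time_col="timestamp", value_col="kwh"):
    """Return (timestamp, kWh) from the last row that has a reading."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    filled = [r for r in rows if (r.get(value_col) or "").strip()]
    if not filled:
        return None
    last = filled[-1]
    return last[time_col], float(last[value_col])


# Made-up export with hypothetical column names; the last hour is still empty
sample_csv = (
    "timestamp,kwh\n"
    "2018-03-01 21:00,1.42\n"
    "2018-03-01 22:00,0.97\n"
    "2018-03-01 23:00,\n"
)
print(last_hour_kwh(sample_csv))
```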

Thanks for taking a look.

FYI for anyone who happens to be trying to integrate BC Ferries Bowen Island route status, I recently found someone had created a Twitter bot that gives the latest status (likely using AIS data for the position of the ship) such as:

“The Queen of Capilano has arrived at Snug Cove, Bowen Island at 3:44 pm.”

Since the latest information is at the top, that makes using the Scrape sensor straightforward. This setup seems to be working:

Create a Scrape sensor

sensor 17:

ferry_status:
  alias: Return the ferry status to Google Assistant
  sequence:
    - service: tts.google_say
      entity_id: media_player.living_room_speaker
      data_template:
        message: '{{states(''sensor.twitter_bowen_ferry'')}} '
        cache: false

In Google Home I created a shortcut so that if I ask “Where is the ferry” or “Ferry Status” it triggers the script and reads back the value from the Twitter page.

I found that the Scrape sensors were taking a fair bit of bandwidth and an on-demand system would be better. I’ve recently started converting to the Node Red addon (steep, steep learning curve (for me) but very powerful once it starts to click). Here are my steps: Ask Google Assistant for status -> triggers a script that toggles an input boolean -> Node Red watches for the toggle action and then runs a flow to figure out if the ferry has left, if it is on time, and when the next one is -> Node Red passes a message to be read by Home Assistant.

The flow can be imported from below (enter your own password):

[{"id":"bd44ba05.c18518","type":"server-state-changed","z":"3754c020.afda5","name":"Trigger the ferry status","server":"829bef4.0e4f71","entityidfilter":"input_boolean.ferry_trigger","entityidfiltertype":"substring","haltifstate":"","x":100,"y":460,"wires":[["f77162ac.630e2"]]},{"id":"f77162ac.630e2","type":"www-request","z":"3754c020.afda5","name":"Request table from BC Ferries","method":"GET","ret":"txt","url":"https://orca.bcferries.com/cc/marqui/arrivals-departures.asp?dept=HSB&route=08","follow-redirects":true,"tls":"","x":410,"y":460,"wires":[["f3ca36ee.418778"]]},{"id":"f3ca36ee.418778","type":"html","z":"3754c020.afda5","name":"CSS Filter","property":"payload","outproperty":"payload","tag":"td[width=100][align='right']","ret":"text","as":"single","x":670,"y":460,"wires":[["8ac990c4.a621c"]]},{"id":"8ac990c4.a621c","type":"function","z":"3754c020.afda5","name":"Function to extract times","func":"//cut off the header of the ferry table in the array\nmsg.items=msg.payload.slice(4);\n\n//create sub array that finds the position of the first null in Actual Departure\nmsg.actualarr = [];\nfor (var i = 1; i < msg.items.length; i += 4) {\n msg.actualarr.push(msg.items[i]);\n}\n\n// Extract the scheduled, actual and next ferry times from the first array\n\n msg.i = (msg.actualarr.indexOf(\"\")-1)*4;\n msg.lastdepart = msg.items[(msg.i+1)].split(/[\\s:]+/);\n msg.lastsched = msg.items[msg.i].split(/[\\s:]+/);\n msg.nextsched = msg.items[(msg.i+4)];\n\n// calculate the minutes late\n\nmsg.lastdeparttime= new Date(2018,01,01,msg.lastdepart[0],msg.lastdepart[1],0);\nmsg.lastschedtime= new Date(2018,01,01,msg.lastsched[0],msg.lastsched[1],0);\n\n\n// calculate the minutes late - add 12 hours if AM\nif (msg.lastdepart[2]==\"AM\") {\n msg.lastdeparttimestamp=msg.lastdeparttime.getTime()/1000/60;\n} else {\n msg.lastdeparttimestamp=(msg.lastdeparttime.getTime()+43200)/1000/60; \n}\n\nif (msg.lastsched[2]==\"AM\") {\n msg.lastschedtimestamp=msg.lastschedtime.getTime()/1000/60;\n} else {\n msg.lastschedtimestamp=(msg.lastschedtime.getTime()+43200)/1000/60; \n}\n\nmsg.latetime= msg.lastdeparttimestamp-msg.lastschedtimestamp\n\nreturn msg;","outputs":1,"noerr":0,"x":510,"y":540,"wires":[["ad1c6e84.86a73"]]},{"id":"ad1c6e84.86a73","type":"template","z":"3754c020.afda5","name":"Create message for Google Assistant","field":"payload","fieldType":"msg","format":"handlebars","syntax":"mustache","template":"The ferry scheduled at: {{lastsched}} left Horseshoe Bay at: {{lastdepart}} and was: {{latetime}} minutes late. The next ferry home will be at {{nextsched}}.","output":"str","x":830,"y":540,"wires":[["272db2e0.3c2b2e"]]},{"id":"272db2e0.3c2b2e","type":"googlehome-notify","z":"3754c020.afda5","server":"987b051a.83dce8","name":"Google Assistant Messages","x":1200,"y":260,"wires":[]},{"id":"829bef4.0e4f71","type":"server","z":"","name":"Home Assistant","url":"http://hassio/homeassistant","pass":"xxxxxxxxxxxx"},{"id":"987b051a.83dce8","type":"googlehome-config-node","z":"","ipaddress":"192.168.0.38","language":"en"}]
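
For cross-checking the AM/PM arithmetic in the function node, here is the same calculation as a small Python sketch. It assumes the times arrive as 12-hour strings like "1:05 PM" (which is what the split on spaces and colons in the flow suggests), lets strptime's %I and %p handle noon correctly, and does not try to handle a sailing that crosses midnight. The function names are my own, not part of the flow.

```python
from datetime import datetime


def parse_clock(text):
    """Parse a 12-hour time such as '1:05 PM'; the date part is a placeholder."""
    return datetime.strptime(text.strip(), "%I:%M %p")


def minutes_late(scheduled, actual):
    """Whole minutes between the scheduled and the actual departure."""
    delta = parse_clock(actual) - parse_clock(scheduled)
    return round(delta.total_seconds() / 60)


def status_message(scheduled, actual, next_scheduled):
    """Compose the sentence that is read out, mirroring the template node."""
    late = minutes_late(scheduled, actual)
    return (
        f"The ferry scheduled at {scheduled} left Horseshoe Bay at {actual} "
        f"and was {late} minutes late. "
        f"The next ferry home will be at {next_scheduled}."
    )


# Made-up times that straddle the 12 PM / 1 PM boundary
print(status_message("12:55 PM", "1:05 PM", "2:00 PM"))
```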

There has been a recent revamp in this area, including an iOS app, and the dev also put together an API for it:

https://www.bcferriesapi.ca/