Scraping COVID-19 stats from Canadian gov't website

Hey everyone. I’ve written a Python script to scrape the COVID-19 data for my local region(s) from the Canadian government’s website (because I have no clue how to use the scrape sensor).

It all works great when I run it as my HA user from within my python virtual environment. However, when I try to set it up as a command line sensor, it doesn’t work and I’m not sure why.

Here is the script itself. If you’re Canadian, this might be useful to you as well, if we can get it working. Just keep in mind that I’m no python developer, most of it has been cobbled together with bits from StackOverflow :slight_smile:

import json
from bs4 import BeautifulSoup
import urllib.request

webrequest = urllib.request.urlopen("https://www.canada.ca/en/public-health/services/diseases/2019-novel-coronavirus-infection.html")
html_bytes = webrequest.read()
html_data = html_bytes.decode("utf8")
webrequest.close()

header = ">Current situation</h2>"
start_table = "<table>"
end_table = "</table>"
header_start = html_data.find(header) + len(header)
# Search for the table only in the text that follows the header
table_start = html_data[header_start:].find(start_table) + len(start_table) + header_start
table_end = html_data[table_start:].find(end_table) + len(end_table)

table = html_data[table_start:table_start+table_end]
rows = BeautifulSoup(table, "html.parser")("tr")
datarows = rows[1:]
data = [[cell.text for cell in row("td")] for row in rows[1:]]

dumpdata = dict()
for row in data:
    dumpdata[row[0].lower().replace(" ", "_")] = {
        "confirmed": row[1].replace("\n", "").replace(" ", "").replace(",", ""),
        "probable": row[2].replace("\n", "").replace(" ", "").replace(",", ""),
        "deaths": row[3].replace("\n", "").replace(" ", "").replace(",", "")
    }

json_data = json.dumps(dumpdata)
print(json_data)

When run on the command line, it outputs exactly what I want; basically just a JSON dictionary of all the provinces and their confirmed, probable, and death numbers:

(hass-3.8) hass@orion ~/config $ python ./python_scripts/corona_parser.py
{"british_columbia": {"confirmed": "617", "probable": "0", "deaths": "13"}, "alberta": {"confirmed": "358", "probable": "0", "deaths": "2"}, "saskatchewan": {"confirmed": "72", "probable": "0", "deaths": "0"}, "manitoba": {"confirmed": "11", "probable": "10", "deaths": "0"}, "ontario": {"confirmed": "588", "probable": "0", "deaths": "8"}, "quebec": {"confirmed": "221", "probable": "792", "deaths": "4"}, "new_brunswick": {"confirmed": "18", "probable": "0", "deaths": "0"}, "nova_scotia": {"confirmed": "51", "probable": "0", "deaths": "0"}, "prince_edward_island": {"confirmed": "3", "probable": "0", "deaths": "0"}, "newfoundland_and_labrador": {"confirmed": "4", "probable": "31", "deaths": "0"}, "yukon": {"confirmed": "2", "probable": "0", "deaths": "0"}, "northwest_territories": {"confirmed": "1", "probable": "0", "deaths": "0"}, "nunavut": {"confirmed": "0", "probable": "0", "deaths": "0"}, "repatriated_travellers": {"confirmed": "13", "probable": "0", "deaths": "0"}, "total": {"confirmed": "1959", "probable": "833", "deaths": "27"}}
(hass-3.8) hass@orion ~/config $

And here’s the YAML for the sensor:

sensor:
  - name: coronavirus_canada_ca
    platform: command_line
    command: python /home/hass/config/python_scripts/corona_parser.py
    scan_interval: 3600
    value_template: >
      {{ value_json.nova_scotia.confirmed | int }}
    json_attributes:
      - british_columbia
      - alberta
      - saskatchewan
      - manitoba
      - ontario
      - quebec
      - new_brunswick
      - prince_edward_island
      - nova_scotia
      - newfoundland_and_labrador
      - yukon
      - northwest_territories
      - nunavut
      - repatriated_travellers
      - total

Finally, here’s the error I get from HA’s logs. I tried upping the log level for the command line sensor to debug, but I only get one more entry that tells me that it’s about to run the command, so that was useless.

2020-03-25 11:19:25 ERROR (SyncWorker_9) [homeassistant.components.command_line.sensor] Command failed: python /home/hass/config/python_scripts/corona_parser.py
2020-03-25 11:19:25 WARNING (SyncWorker_9) [homeassistant.components.command_line.sensor] Empty reply found when expecting JSON data

I have no idea why the command is failing since it appears to work just fine under the same conditions that HA would call it. Any help is appreciated.

Everything looks correct. Try checking the path of the python interpreter from your venv with

which python

And then use the full path to the python interpreter in your sensor, e.g.

command: /usr/local/bin/python /home/hass/config/python_scripts/corona_parser.py

Yeah, I had thought of that, but I have other python scripts that work in a similar manner. I have one that captures the status of my UPS and converts it to JSON for a sensor, almost exactly like this one does.

Ok, if other scripts work, then this is likely not the issue.

Just tried some experiments. I can have a script that is nothing but this, but it fails in the same manner:

import json
import urllib.request
print("{}")

But commenting out the second import lets it run. The JSON import is fine, but either of the other imports in my original script causes it to fail through HA.
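
One way to narrow this down is to let the sensor run a small diagnostic script in place of the real one. The sketch below is only an illustration and not something posted in the thread: it records which interpreter and environment variables the process sees, and captures the traceback of every import that fails. The function names and the list of environment variables are assumptions made for this example.

```python
import importlib
import json
import os
import sys
import traceback


def try_import(module_name):
    """Return None if the import works, otherwise the traceback text."""
    try:
        importlib.import_module(module_name)
        return None
    except Exception:
        return traceback.format_exc()


def build_report(module_names):
    """Collect interpreter details and per-module import results."""
    env_keys = ("PATH", "PYTHONPATH", "PYTHONHOME", "VIRTUAL_ENV")
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "sys_path": sys.path,
        "env": {key: os.environ.get(key) for key in env_keys},
        "import_errors": {name: try_import(name) for name in module_names},
    }


if __name__ == "__main__":
    # Modules taken from the failing script; adjust the list as needed.
    report = build_report(["json", "urllib.request", "bs4"])
    print(json.dumps(report, indent=2))
```

Running it once from the shell and once through HA (for example with its output redirected to a file), then comparing the two reports, should show whether the interpreter, the module search path, or the environment differs between the two cases.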

Ok, so it’s some Python version issue. Any difference if you use python3 to invoke the script?

I can try, but it’s ultimately the same binary:

(hass-3.8) hass@orion ~/config $ which python
/srv/hass-3.8/bin/python
(hass-3.8) hass@orion ~/config $ dir /srv/hass-3.8/bin/python
lrwxrwxrwx 1 hass hass 9 Oct 30 22:07 /srv/hass-3.8/bin/python -> python3.8

(hass-3.8) hass@orion ~/config $ which python3
/srv/hass-3.8/bin/python3
(hass-3.8) hass@orion ~/config $ dir /srv/hass-3.8/bin/python3
lrwxrwxrwx 1 hass hass 9 Oct 30 22:07 /srv/hass-3.8/bin/python3 -> python3.8

The hass-3.8 folder is HA installed in a python 3.8 virtual environment.

Ok. I have 30+ scripts that collect data from various servers and devices. Instead of having HA poll for the data, I run a REST command from the Python scripts to send the data and update sensors in HA directly. If you think that’s an option, I can post a simple Python module I include for this purpose.

Another, easier solution is to send the data through MQTT from your script, which is pretty straightforward.

Yet another option is to use appdaemon, https://appdaemon.readthedocs.io/en/latest/

All of these options get rid of polling, which should lower the load on HA.
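
For the REST option mentioned above, a minimal sketch could look like the following. It is not the module the poster refers to. It assumes Home Assistant's REST API is enabled, that a long-lived access token has been created, and that HA is reachable at the base URL you pass in; the function names and the example entity ID are hypothetical.

```python
import json
import urllib.request


def build_state_request(base_url, token, entity_id, state, attributes=None):
    """Build a POST request for Home Assistant's /api/states/<entity_id> endpoint."""
    body = {"state": str(state), "attributes": attributes or {}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/states/{entity_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def push_state(base_url, token, entity_id, state, attributes=None, timeout=10):
    """Send the request and return the HTTP status code."""
    request = build_state_request(base_url, token, entity_id, state, attributes)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Example call (placeholder URL, token and entity ID):
# push_state("http://127.0.0.1:8123", "YOUR_TOKEN",
#            "sensor.coronavirus_canada_ca", 51, {"deaths": "0"})
```

Keeping the request construction separate from the network call makes it possible to inspect the URL, headers, and body without a running HA instance.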

Is it easy to post the data to MQTT directly from a python script? I’ve never done that before.

That’s really easy:

import paho.mqtt.publish as publish
import json

HOSTNAME = "127.0.0.1"
MQTT_PORT = 1883

def send(topic, payload, do_retain=False):
    publish.single(topic, payload, hostname=HOSTNAME, port=MQTT_PORT, retain=do_retain)
    print("sending... " + topic + ": " + payload)

The payload needs to be a string, so you need to convert the dict to a string with json.dumps before sending:

my_data = {"name": "Tomas"}
json_data = json.dumps(my_data)
send("mqtt_test", json_data)

I ended up using the following, because I couldn’t figure out how to authenticate using publish.

import paho.mqtt.client as mqtt
client = mqtt.Client()
client.username_pw_set(mqtt_user, mqtt_pass)
client.connect(mqtt_host, mqtt_port)
client.publish(topic, json_data)

That part worked like a charm… Now I just have to reconfigure the sensor for MQTT instead of command line.
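
For reference, paho's publish.single also accepts an auth argument, a dict with username and password keys, so the one-shot helper can be used with credentials too. The sketch below assumes the paho-mqtt package is installed; the helper names build_publish_kwargs and send are made up for this example.

```python
import json


def build_publish_kwargs(host, port=1883, username=None, password=None, retain=False):
    """Build the keyword arguments passed on to paho.mqtt.publish.single()."""
    kwargs = {"hostname": host, "port": port, "retain": retain}
    if username is not None:
        # publish.single() takes credentials as a dict under the "auth" keyword.
        kwargs["auth"] = {"username": username, "password": password}
    return kwargs


def send(topic, data, host, port=1883, username=None, password=None, retain=False):
    """Serialize data to JSON and publish it as a single MQTT message."""
    # Imported here so the helper above can be used without paho installed.
    import paho.mqtt.publish as publish

    publish.single(
        topic,
        json.dumps(data),
        **build_publish_kwargs(host, port, username, password, retain),
    )
```

With this, a call such as send("coronavirus/canada_ca", dumpdata, mqtt_host, mqtt_port, mqtt_user, mqtt_pass) would connect, publish, and disconnect in one step.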


Ok! Check out the MQTT auto discovery. That way there is no need to configure the sensors in HA.
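
As a rough illustration of what auto discovery involves: the script publishes a JSON config message to a topic of the form homeassistant/sensor/<object_id>/config, and HA creates the sensor from it. The sketch below only builds that topic and payload. It assumes the default discovery prefix "homeassistant"; the function name and the choice of object_id as unique_id are assumptions for this example.

```python
import json


def build_discovery_message(object_id, name, state_topic, value_template,
                            unit=None, discovery_prefix="homeassistant"):
    """Return (topic, payload) for an MQTT discovery config message."""
    topic = f"{discovery_prefix}/sensor/{object_id}/config"
    config = {
        "name": name,
        "unique_id": object_id,
        "state_topic": state_topic,
        "value_template": value_template,
        # Reuse the state topic so the whole JSON document becomes attributes.
        "json_attributes_topic": state_topic,
    }
    if unit is not None:
        config["unit_of_measurement"] = unit
    return topic, json.dumps(config)
```

The resulting topic and payload would be published once (typically with retain enabled) before the regular data messages, using whichever MQTT publishing code the script already has.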

Too late :slight_smile: I’ve already done it manually. I set the script up as a cron job and the sensor is working great, but I wish I could figure out why it wouldn’t run from HA as a sensor.

I didn’t specifically want to mark anything here as an “answer” because we’ve really just worked around the entire issue. The problem is that the exact same command that’s working as a cron job couldn’t run as a command line sensor.


The problem is most likely the included library as you already noted.

Any updates for a total newbie on how to implement this?

Here’s my working solution:

Script:

import json
import urllib.request
from bs4 import BeautifulSoup
import paho.mqtt.client as mqtt

mqtt_host = "10.1.1.4"
mqtt_port = 1883
mqtt_user = 'homeassistant'
mqtt_pass = 'Aut0mate.'

webrequest = urllib.request.urlopen("https://www.canada.ca/en/public-health/services/diseases/2019-novel-coronavirus-infection.html")
html_bytes = webrequest.read()
html_data = html_bytes.decode("utf8")
webrequest.close()

header = ">Current situation</h2>"
start_table = "<table>"
end_table = "</table>"
header_start = html_data.find(header) + len(header)
# Search for the table only in the text that follows the header
table_start = html_data[header_start:].find(start_table) + len(start_table) + header_start
table_end = html_data[table_start:].find(end_table) + len(end_table)

table = html_data[table_start:table_start+table_end]
rows = BeautifulSoup(table, "html.parser")("tr")
datarows = rows[1:]
data = [[cell.text for cell in row("td")] for row in rows[1:]]

dumpdata = dict()
for row in data:
    dumpdata[row[0].lower().replace(" ", "_")] = {
        "confirmed": row[1].replace("\n", "").replace(" ", "").replace(",", ""),
        "probable": row[2].replace("\n", "").replace(" ", "").replace(",", ""),
        "deaths": row[3].replace("\n", "").replace(" ", "").replace(",", "")
    }

json_data = json.dumps(dumpdata)
# print(json_data)

client = mqtt.Client()
client.username_pw_set(mqtt_user, mqtt_pass)
client.connect(mqtt_host, mqtt_port)
client.publish("coronavirus/canada_ca", json_data)

Sensor YAML:

sensor:
  - name: coronavirus_canada_ca
    platform: mqtt
    force_update: true
    unit_of_measurement: people
    state_topic: coronavirus/canada_ca
    value_template: >
      {{ value_json.nova_scotia.confirmed | int }}
    json_attributes_topic: coronavirus/canada_ca
    json_attributes:
      - british_columbia
      - alberta
      - saskatchewan
      - manitoba
      - ontario
      - quebec
      - new_brunswick
      - prince_edward_island
      - nova_scotia
      - newfoundland_and_labrador
      - yukon
      - northwest_territories
      - nunavut
      - repatriated_travellers
      - total

Finally, I set up a cron job to run the script every 15 minutes. This is outside of HA. Normally, I would have used the command line sensor, but that doesn’t work with the imports in the script. The other thing I would have tried is just automating the running of the script with a command line switch, but I feared that it would give me the same error, so I just circumvented HA in this case. Here’s the cron line to do this:

*/15 * * * * /srv/hass-3.8/bin/python /home/hass/config/python_scripts/corona_canada.py

Hope that helps!


Looks great! As an alternative to a cron job, you could put a simple loop in the Python script:

import time

UPDATE_INTERVAL = 900  # seconds between updates (900 matches the 15-minute cron job)

while True:
    # ... the code in the script goes here ...
    time.sleep(UPDATE_INTERVAL)

and then set up the script to run as a service. The service can then be set up to start only after HA has started.

whew…way beyond me for now!! good work though!!

They’ve completely changed the Canadian stats page. It now uses an iframe that links to an HTML document that gets populated by some JavaScript. I haven’t been able to scrape it successfully yet.

I would love to hear the advice of anyone having more experience in this matter.
