Scraping COVID-19 stats from Canadian gov't website

I can try, but it’s ultimately the same binary:

(hass-3.8) hass@orion ~/config $ which python
/srv/hass-3.8/bin/python
(hass-3.8) hass@orion ~/config $ dir /srv/hass-3.8/bin/python
lrwxrwxrwx 1 hass hass 9 Oct 30 22:07 /srv/hass-3.8/bin/python -> python3.8

(hass-3.8) hass@orion ~/config $ which python3
/srv/hass-3.8/bin/python3
(hass-3.8) hass@orion ~/config $ dir /srv/hass-3.8/bin/python3
lrwxrwxrwx 1 hass hass 9 Oct 30 22:07 /srv/hass-3.8/bin/python3 -> python3.8

The hass-3.8 folder is HA installed in a python 3.8 virtual environment.

Ok. I have 30+ scripts that collect data from various servers and devices. Instead of polling data from HA, I run a REST command from the python scripts to send data and update sensors in HA directly from the script. If you think that's an option, I can post a simple python module I include for this purpose.

Another, easier solution is to send the data through MQTT from your script, which is pretty straightforward.

Yet another option is to use appdaemon, https://appdaemon.readthedocs.io/en/latest/

All options get rid of any polling, which should lower the load on HA.
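As a sketch of the first option (pushing states straight into HA over its REST API): the URL, token, and entity id below are placeholders you'd replace with your own, and the long-lived access token is created from your HA user profile page.

```python
import json
import urllib.request

HA_URL = "http://127.0.0.1:8123"        # assumption: local HA instance
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"  # placeholder

def build_state_request(entity_id, state, attributes=None):
    """Build a POST request for HA's /api/states/<entity_id> endpoint."""
    body = json.dumps({"state": state, "attributes": attributes or {}}).encode("utf8")
    return urllib.request.Request(
        f"{HA_URL}/api/states/{entity_id}",
        data=body,
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_state_request("sensor.my_script_value", "42",
                          {"unit_of_measurement": "people"})
# Sending it is then just: urllib.request.urlopen(req)
print(req.full_url)  # → http://127.0.0.1:8123/api/states/sensor.my_script_value
```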

Is it easy to post the data to MQTT directly from a python script? I’ve never done that before.

That's really easy:

import paho.mqtt.publish as publish
import json

HOSTNAME = "127.0.0.1"
MQTT_PORT = 1883

def send(topic, payload, do_retain=False):
    publish.single(topic, payload, hostname=HOSTNAME, port=MQTT_PORT, retain=do_retain)
    print("sending... " + topic + ": " + payload)

And the payload needs to be a string, so you would need to convert the dict to a string with json.dumps before sending:

my_data = {"name": "Tomas"}
json_data = json.dumps(my_data)
send("mqtt_test", json_data)

I ended up using the following, because I couldn’t figure out how to authenticate using publish.

import paho.mqtt.client as mqtt
client = mqtt.Client()
client.username_pw_set(mqtt_user, mqtt_pass)
client.connect(mqtt_host, mqtt_port)
client.publish(topic, json_data)

That part worked like a charm… Now I just have to reconfigure the sensor for MQTT instead of command line.
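For what it's worth, publish.single does accept credentials through its auth parameter (a dict with "username" and "password" keys). A sketch, assuming paho-mqtt is installed, with placeholder broker details:

```python
MQTT_HOST = "127.0.0.1"   # placeholder broker address
MQTT_PORT = 1883
MQTT_AUTH = {"username": "mqtt_user", "password": "mqtt_pass"}  # placeholder creds

def send(topic, payload, retain=False):
    # Deferred import: paho-mqtt must be installed, and a broker must be
    # reachable at MQTT_HOST, for this call to succeed.
    import paho.mqtt.publish as publish
    publish.single(topic, payload, hostname=MQTT_HOST, port=MQTT_PORT,
                   retain=retain, auth=MQTT_AUTH)
```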


Ok! Check out MQTT auto discovery. That way there is no need to configure the sensors in HA.
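A discovery sketch: publishing a retained config payload to homeassistant/sensor/&lt;object_id&gt;/config lets HA create the sensor on its own. The topic below assumes HA's default discovery prefix of "homeassistant"; the topic and template values are taken from the scripts in this thread.

```python
import json

# Discovery config for the sensor; HA reads this once and creates the entity.
discovery_topic = "homeassistant/sensor/coronavirus_canada_ca/config"
config = {
    "name": "coronavirus_canada_ca",
    "state_topic": "coronavirus/canada_ca",
    "unit_of_measurement": "people",
    "value_template": "{{ value_json.nova_scotia.confirmed | int }}",
    "json_attributes_topic": "coronavirus/canada_ca",
}
payload = json.dumps(config)

# Then publish it retained, e.g.:
#   client.publish(discovery_topic, payload, retain=True)
print(payload)
```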

Too late 🙂 I've already done it manually. I set the script up as a cron job and the sensor is working great, but I wish I could figure out why it wouldn't run from HA as a sensor.

I didn’t specifically want to mark anything here as an “answer” because we’ve really just worked around the entire issue. The problem is that the exact same command that works as a cron job couldn’t run as a command line sensor.


The problem is most likely the included library as you already noted.

Any updates for a total newbie on how to implement this?

Here’s my working solution:

Script:

import json
import urllib.request
from bs4 import BeautifulSoup
import paho.mqtt.client as mqtt

mqtt_host = "10.1.1.4"
mqtt_port = 1883
mqtt_user = 'homeassistant'
mqtt_pass = 'Aut0mate.'

webrequest = urllib.request.urlopen("https://www.canada.ca/en/public-health/services/diseases/2019-novel-coronavirus-infection.html")
html_bytes = webrequest.read()
html_data = html_bytes.decode("utf8")
webrequest.close()

header = ">Current situation</h2>"
start_table = "<table>"
end_table = "</table>"
header_start = html_data.find(header) + len(header)
table_start = html_data[header_start:].find(start_table) + len(start_table) + header_start
table_end = html_data[table_start:].find(end_table) + len(end_table)

table = html_data[table_start:table_start+table_end]
rows = BeautifulSoup(table, "html.parser")("tr")
# Skip the header row; each remaining row holds one jurisdiction's numbers
data = [[cell.text for cell in row("td")] for row in rows[1:]]

dumpdata = dict()
for row in data:
    dumpdata[row[0].lower().replace(" ", "_")] = {
        "confirmed": row[1].replace("\n", "").replace(" ", "").replace(",", ""),
        "probable": row[2].replace("\n", "").replace(" ", "").replace(",", ""),
        "deaths": row[3].replace("\n", "").replace(" ", "").replace(",", "")
    }

json_data = json.dumps(dumpdata)
# print(json_data)

client = mqtt.Client()
client.username_pw_set(mqtt_user, mqtt_pass)
client.connect(mqtt_host, mqtt_port)
client.publish("coronavirus/canada_ca", json_data)

Sensor YAML:

sensor:
  - name: coronavirus_canada_ca
    platform: mqtt
    force_update: true
    unit_of_measurement: people
    state_topic: coronavirus/canada_ca
    value_template: >
      {{ value_json.nova_scotia.confirmed | int }}
    json_attributes_topic: coronavirus/canada_ca
    json_attributes:
      - british_columbia
      - alberta
      - saskatchewan
      - manitoba
      - ontario
      - quebec
      - new_brunswick
      - prince_edward_island
      - nova_scotia
      - newfoundland_and_labrador
      - yukon
      - northwest_territories
      - nunavut
      - repatriated_travellers
      - total

Finally, I set up a cron job to run the script every 15 minutes. This is outside of HA. Normally I would have used the command line sensor, but that doesn’t work with the imports in the script. The other thing I would have tried is automating the running of the script with a command line switch, but I feared that it would give me the same error, so I just circumvented HA in this case. Here’s the cron line to do this:

*/15 * * * * /srv/hass-3.8/bin/python /home/hass/config/python_scripts/corona_canada.py

Hope that helps!


Looks great! As an alternative to a cron job, you could just establish a simple loop in the python script:

while True:
    # the code in the script
    ...
    time.sleep(number_of_seconds_until_next_update)

and then set up the script to run as a service. The service can then be set up to start only after HA has started.
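A sketch of the service approach with systemd; the unit name, paths, and the HA service name below are assumptions based on the cron line earlier in the thread, so adjust them to your install:

```ini
[Unit]
Description=COVID-19 stats to MQTT
After=network-online.target home-assistant.service

[Service]
User=hass
ExecStart=/srv/hass-3.8/bin/python /home/hass/config/python_scripts/corona_canada.py
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target
```

Saved as e.g. /etc/systemd/system/corona_canada.service and enabled with systemctl, it will be started after HA on boot and restarted if it crashes.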

whew…way beyond me for now!! good work though!!

They’ve completely changed the Canadian stats page. It now uses an iframe linking to an HTML document that gets populated by JavaScript. I haven’t been able to scrape it successfully yet.

I would love to hear the advice of anyone having more experience in this matter.


I found the source of the data after making my way through lots of HTML and JavaScript files. It’s a comma-separated text file located here:
https://health-infobase.canada.ca/src/data/covidLive/covid19.csv

I’ve also updated my python script. The YAML for the sensors remains the same:

import json
import urllib.request
import paho.mqtt.client as mqtt
from datetime import datetime

mqtt_host = "10.1.1.4"
mqtt_port = 1883
mqtt_user = 'homeassistant'
mqtt_pass = 'Pub1i5h.'

#webrequest = urllib.request.urlopen("https://www.canada.ca/en/public-health/services/diseases/2019-novel-coronavirus-infection.html")
webrequest = urllib.request.urlopen("https://health-infobase.canada.ca/src/data/covidLive/covid19.csv")
csv_lines = webrequest.read().decode("utf8").splitlines()
webrequest.close()

data = dict()

# The CSV contains comma-separated lines for every stat update since the government started
# posting statistics updates. This loop creates a dictionary that only keeps the latest
# updates. Somebody may want the history for something though...
#
# Also re-format the date, since it is not directly comparable as dd-mm-yy.
#
# Skip the first line, since that is just the labels
for line in csv_lines[1:]:
    # ID,EnglishName,FrenchName,Date,Confirmed,Presumptive,Deaths,Total,NumToday,PercentToday,NumTested
    parts = line.split(",")
    updated = {
        'id': int(parts[0]),
        'name': parts[1],
        'date': datetime.strptime(parts[3], '%d-%m-%Y').strftime('%Y-%m-%d'),
        'confirmed': int(parts[4]),
        'presumptive': int(parts[5]),
        'deaths': int(parts[6]),
        'current': int(parts[7])
    }
    
    id = updated['id']
    if (id not in data.keys()) or (data[id]['date'] <= updated['date']):
        data[id] = updated

# Re-key the dictionary on the province names to make the JSON for the sensor
# look a little nicer.
provinceData = dict()
for key in data.keys():
    datum = data[key]
    provinceData[datum['name'].lower().replace(' ', '_')] = datum
    print(datum)

# Dump the dictionary out to JSON and publish it to MQTT
jsonData = json.dumps(provinceData)
client = mqtt.Client()
client.username_pw_set(mqtt_user, mqtt_pass)
client.connect(mqtt_host, mqtt_port)
client.publish("coronavirus/canada_ca", jsonData)

Thank you for all the work you’ve done. I’ve adapted it to work with a command_line sensor.

The command_line sensor calls a python script every 6 hours.

  - platform: command_line
    name: c19
    command: 'python3 /config/c19.py'
    value_template: '{{ value_json.quebec.confirmed }}'
    scan_interval: 21600

Here’s the adapted python script:

import json
from requests import get
from datetime import datetime

response = get("https://health-infobase.canada.ca/src/data/covidLive/covid19.csv")
csv_lines = response.text.splitlines()
response.close()

data = dict()

# Create a dictionary containing just the latest updates
# Skip the first line in csv_lines which contains headings
for line in csv_lines[1:]:
    # ID,EnglishName,FrenchName,Date,Confirmed,Presumptive,Deaths,Total,NumToday,PercentToday,NumTested
    parts = line.split(",")
    updated = {
        'id': int(parts[0]),
        'name': parts[1],
        'date': datetime.strptime(parts[3], '%d-%m-%Y').strftime('%Y-%m-%d'),
        'confirmed': int(parts[4]),
        'presumptive': int(parts[5]),
        'deaths': int(parts[6]),
        'current': int(parts[7])
    }
    
    id = updated['id']
    if (id not in data.keys()) or (data[id]['date'] <= updated['date']):
        data[id] = updated

# Re-key the dictionary on the province names
provinceData = dict()
for key in data.keys():
    datum = data[key]
    provinceData[datum['name'].lower().replace(' ', '_')] = datum

print(json.dumps(provinceData))

Here’s the resulting sensor:
(screenshot of the resulting sensor, taken 2020-03-30)


NOTE
A limitation of the Command Line Sensor is its json_attributes option. An MQTT Sensor offers json_attributes_template, which is more flexible when it comes to extracting attributes.
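For comparison, a sketch of the MQTT sensor using json_attributes_template to pull a single province's figures into the sensor's attributes; the topic and province key are taken from the scripts above:

```yaml
sensor:
  - platform: mqtt
    name: c19_quebec
    state_topic: coronavirus/canada_ca
    value_template: '{{ value_json.quebec.confirmed }}'
    json_attributes_topic: coronavirus/canada_ca
    json_attributes_template: '{{ value_json.quebec | tojson }}'
```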


Some of the “numbers” in this CSV file have started containing commas. I’ve adjusted the pertinent part of the script to strip the quotes and remove the commas:

    updated = {
        'id': int(parts[0]),
        'name': parts[1],
        'date': datetime.strptime(parts[3], '%d-%m-%Y').strftime('%Y-%m-%d'),
        'confirmed': int(parts[4].strip('"').replace(',','')),
        'presumptive': int(parts[5].strip('"').replace(',','')),
        'deaths': int(parts[6].strip('"').replace(',','')),
        'total': int(parts[7].strip('"').replace(',',''))
    }
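A more robust alternative, should the split-and-strip approach ever misfire, is the standard library's csv module, which understands quoted fields that contain commas. The sample line below is made up for illustration, in the same column layout as the government CSV:

```python
import csv
import io

def parse_count(field):
    """Turn a CSV numeric field like '1,234' (quotes already removed
    by the csv module) into an int; empty fields become 0."""
    return 0 if field == "" else int(field.replace(",", ""))

# Made-up sample data mimicking the government CSV's layout
sample = ('ID,EnglishName,FrenchName,Date,Confirmed\n'
          '35,Ontario,Ontario,30-03-2020,"1,234"\n')
reader = csv.reader(io.StringIO(sample))
next(reader)                 # skip the header row
row = next(reader)           # ['35', 'Ontario', 'Ontario', '30-03-2020', '1,234']
print(parse_count(row[4]))   # → 1234
```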

For anyone in Nova Scotia, the provincial government has a page with the positive vs. negative test numbers that updates faster than the Canadian government one. I’ve created a similar script to parse the numbers from that one as well. It doesn’t contain the same fields, but it’s still useful.

import json
from requests import get

def string_to_int(s):
    return 0 if s == "" else int(s.strip('"').replace(",", ""))

response = get("https://novascotia.ca/coronavirus/data/COVID-19-data.csv", verify=False)
data_lines = response.text.splitlines()
response.close()

data = dict()
data["total"] = 0
data["deaths"] = 0

for row in data_lines[2:]:
    col = row.split(",")
    data["date"] = col[0]
    data["total"] += string_to_int(col[1])
    data["new"] = string_to_int(col[1])
    data["recovered"] = string_to_int(col[3])
    data["hospitalized"] = string_to_int(col[4])
    data["deaths"] += string_to_int(col[6])
    data["new_deaths"] = string_to_int(col[6])

data["current"] = data["total"] - data["recovered"] - data["deaths"]

json_data = json.dumps(data)
print(json_data)

Edit: Fixed bug. Added “new_deaths” column.


Good work on this, everyone. Thanks!


Nova Scotia here!