Scrape CSV

manjotsc · June 16, 2021, 6:14am

Hi,

I need help with scraping this csv file, Thanks

https://cdn-contenu.quebec.ca/cdn-contenu/sante/documents/Problemes_de_sante/covid-19/csv/situation-au-quebec.csv

Troon · June 16, 2021, 7:16am

File looks like this:

<feff>Date;2021-06-16;
Total de cas confirmés;373 370;(+ 153)
Total de personnes rétablies;360 587;
Total des hospitalisations;192;(- 17)
Total des décès;11 177;(+ 0)

Here’s how to pull that data into a single sensor, with the date as the sensor state, and the other data as attributes. Create an external Python script — not an HA Script nor a python_script. You’ll have to work out how to do that on your system: I’m on Docker which makes it easier. I’ve called it /config/scripts/quebec.py:

import requests
import json
r = requests.get('https://cdn-contenu.quebec.ca/cdn-contenu/sante/documents/Problemes_de_sante/covid-19/csv/situation-au-quebec.csv').text
r = r.replace('\u00ef\u00bb\u00bf', '')
d = {}
for i in r.split('\n')[:-1]:
   l = i.split(';')
   d[l[0].replace('\u00c3\u00a9', 'e').replace('\u00c3\u00a8', 'e')] = l[1].replace(' ', '')
print(json.dumps(d))

Then create a command line sensor:

- platform: command_line
  name: Quebec test
  command: "python3 /config/scripts/quebec.py"
  value_template: "{{ value_json['Date'] }}"
  json_attributes:
    - Total de cas confirmes
    - Total de personnes retablies
    - Total des hospitalisations
    - Total des deces

I’m sorry I had to remove the accented characters (é and è) — there’s probably a way to get them to work but you can figure that out on your own time.

Line 4 of the script removes the byte order mark.

The loop then reads each line in turn (except the blank last line), splitting it at each semicolon. The first element of that list is the dictionary key (with accents removed), and the second element is the value (with spaces in the middle of the numbers removed).

I haven’t captured the “change since last value” third column.

Troon · June 16, 2021, 2:59pm

I’ve now managed this with no external scripts, just a command_line sensor and template sensors. First, the command_line sensor:

- platform: command_line
  command: 'python3 -c "import requests; print(requests.get(\"https://cdn-contenu.quebec.ca/cdn-contenu/sante/documents/Problemes_de_sante/covid-19/csv/situation-au-quebec.csv\").text)"'
  name: Quebec COVID status
  value_template: >-
    {% set r = '["' ~ value.replace('ï»¿', '').replace('\n','","').replace('\r','') ~ '"]' %}
    {% set ns = namespace(s="{") %}
    {% for i in (r|from_json) %}
      {% set k = i.split(';')[0].replace('Ã©','e').replace('Ã¨','e') %}
      {% set v = i.split(';')[1].replace(' ','') %}
      {% set ns.s = ns.s ~ "\"" ~ k ~ "\":\"" ~ v ~ "\"," %}
    {% endfor %}
    {{ ns.s[:-1]~"}" }}

First line of the template removes the BOM, and turns it into a string that looks like a list.
The rest of the template makes it into a string that looks like a dictionary.

Next, the template sensors, using the new template integration (docs):

template:
  - sensor:
      - name: QCS Date
        state: "{{ (states('sensor.quebec_covid_status')|from_json)['Date'] }}"
      - name: QCS Total de cas confirmés
        state: "{{ (states('sensor.quebec_covid_status')|from_json)['Total de cas confirmes'] }}"
      - name: QCS Total de personnes rétablies
        state: "{{ (states('sensor.quebec_covid_status')|from_json)['Total de personnes retablies'] }}"
      - name: QCS Total des hospitalisations
        state: "{{ (states('sensor.quebec_covid_status')|from_json)['Total des hospitalisations'] }}"
      - name: QCS Total des décès
        state: "{{ (states('sensor.quebec_covid_status')|from_json)['Total des deces'] }}"

manjotsc · June 16, 2021, 4:59pm

Thanks, I’ll give a try when I am back home

manjotsc · June 17, 2021, 5:41am

I can’t get command_line part working, I using hassio with home assistant.

Troon · June 17, 2021, 6:22am

Urgh, right. Here it is as a rest sensor instead then (docs). template sensors unchanged from prior post:

rest:
  - resource: https://cdn-contenu.quebec.ca/cdn-contenu/sante/documents/Problemes_de_sante/covid-19/csv/situation-au-quebec.csv
    scan_interval: 3600
    sensor:
      - name: Quebec COVID status
        value_template: >-
          {% set r = value.replace('\ufeff', '').split('\n') %}
          {% set ns = namespace(s="{") %}
          {% for i in r %}
            {% if i|length > 1 %}
              {% set k = i.split(';')[0].replace('é','e').replace('è','e') %}
              {% set v = i.split(';')[1].replace(' ','') %}
              {% set ns.s = ns.s ~ "\"" ~ k ~ "\":\"" ~ v ~ "\"," %}
            {% endif %}
          {% endfor %}
          {{ ns.s[:-1]~"}" }}

Good to see no new deaths since yesterday’s data. Hopefully this sensor will be pointless and unneeded soon, despite being one of my more complex templates!

koying · June 17, 2021, 7:27am

Pretty sure a CSV-to-Sensor solution will useful in other situations.
Great job

manjotsc · June 17, 2021, 7:33am

Thanks, works great

Is there a possibility to remove this dot/comma from this sensor.

chrome_r0zzl4CjMU

- resource: https://santemontreal.qc.ca/en/public/coronavirus-covid-19/situation-of-the-coronavirus-covid-19-in-montreal/
  scan_interval: 60
  sensor:
    - name: Sante Montreal Confirmed Cases
      select: "tr:nth-child(1) > td:nth-child(1) > h3"

    - name: Sante Montreal Confirmed Cases in Past 24 Hours
      select: "tr:nth-child(1) > td:nth-child(2) > h3"

    - name: Sante Montreal Confirmed Cases Last Updated
      select: "#c43669 b"

    - name: Sante Montreal Recovered Cases
      select: "td:nth-child(3) > h3"
      
    - name: Sante Montreal Number of Deaths
      select: "tr:nth-child(2) > td:nth-child(1) > h3"

Troon · June 17, 2021, 7:37am

- resource: https://santemontreal.qc.ca/en/public/coronavirus-covid-19/situation-of-the-coronavirus-covid-19-in-montreal/
  scan_interval: 60
  sensor:
    - name: Sante Montreal Confirmed Cases
      select: "tr:nth-child(1) > td:nth-child(1) > h3"
      value_template: "{{ value.replace(',', '') }}"

Troon · June 17, 2021, 7:38am

Thanks — although this was pretty non-standard CSV with no commas and no column headers!

manjotsc · June 17, 2021, 7:41am

Is there a documentation where I can learn creating these value templates?

Troon · June 17, 2021, 7:58am

https://jinja.palletsprojects.com/en/latest/templates/

Both are linked from Developer Tools / Templates, which is a great tool for developing templates. It also helps knowing a bit of Python, which is where the .replace method (as opposed to the Jinja |replace filter, which does the same thing) I’m using comes from.

All the value_template does is process the incoming data to provide the state. In this case, you wanted to remove the commas, so I used value.replace(',', '') to replace the comma with nothing. That looks a bit complex because of the “functional comma” in the declaration, but if you want to replace “a” with “b”, then it’s .replace('a', 'b').

I thought it might help those trying to learn this stuff to see what’s going on in my seemingly rather complicated value_template for the Quebec rest sensor. Paste this into Developer Tools / Templates and you’ll see the steps to get from non-standard CSV to a usable dictionary (string):

{% set value = "\ufeffDate;2021-06-16;
Total de cas confirmés;373 370;(+ 153)
Total de personnes rétablies;360 587;
Total des hospitalisations;192;(- 17)
Total des décès;11 177;(+ 0)" %}
Start:
{{ value }}
{# Turn it into a list. -#}
{% set r = value.replace('\ufeff', '').split('\n') %}
List: {{ r }}
{# Create a namespace, needed for referring to a variable -#}
{# inside and outside the scope of a loop -#}
{% set ns = namespace(s="{") -%}
{# Iterate through the elements of our list, converted from a string -#}
{% for i in r %}
Line {{ loop.index }}:     {{ i }}
{# There seems to be a blank line in the actual response, filter that out -#}
{% if i|length > 1 -%}
{# Pull out the key from the line: first element when split by semicolons, accents removed -#}
{% set k = i.split(';')[0].replace('é','e').replace('è','e') -%}
Key {{ loop.index }}:      {{ k -}}
{# and the value, second element, commas removed -#}
{% set v = i.split(';')[1].replace(' ','') %}
Value {{ loop.index }}:    {{ v }}
{# Add the key:value pair to our dict string -#}
{% set ns.s = ns.s ~ "\"" ~ k ~ "\":\"" ~ v ~ "\"," -%}
Dict state: {{ ns.s }}
{% endif -%}
{% endfor %}
{# Knock off the final comma and close the the dict -#}
Final: {{ ns.s[:-1]~"}" }}

manjotsc · June 23, 2021, 6:07am

Can you help with extracting just this value below, Thanks

EXCEL_2XYm7iEGZi

Troon · June 23, 2021, 6:29am

Did you try yourself? With the explanation I gave earlier, I would have thought it wouldn’t take much to get that value out. Also, if that was the only value you cared about from the start, it would have been useful to state that in the initial request.

Here’s my minimal work solution, in keeping with your apparent approach:

rest:
  - resource: https://cdn-contenu.quebec.ca/cdn-contenu/sante/documents/Problemes_de_sante/covid-19/csv/situation-au-quebec.csv
    scan_interval: 3600
    sensor:
      - name: Quebec case delta
        value_template: >-
          {% set r = value.replace('\ufeff', '').split('\n') %}
          {% set ns = namespace(s="") %}
          {% for i in r %}
            {% if " cas " in i %}
              {% set ns.s = i.split(';')[2].replace(' ','').replace('(','').replace(')','').replace('+','') %}
            {% endif %}
          {% endfor %}
          {{ ns.s }}

manjotsc · June 23, 2021, 6:58am

Thanks, I tried myself but couldn’t get around, with the example above, it bit more clear. Let me try again with different value.

Troon · June 23, 2021, 7:21am

See point 9 here: How to help us help you - or How to ask a good question

hung2k · May 24, 2022, 12:56pm

Hi, @Troon

Would you mind to tell me how to scrape a specific line only in a CSV?

Let’s say i just want line 5 HK Observatory only

Date time,Automatic Weather Station,Air Temperature(degree Celsius)
202205242040,Chek Lap Kok,26.4
202205242040,Cheung Chau,23.7
202205242040,Clear Water Bay,23.9
202205242040,Happy Valley,25.0
202205242040,HK Observatory,24.8
202205242040,HK Park,24.3
202205242040,Kai Tak Runway Park,24.8
202205242040,Kau Sai Chau,24.1
202205242040,King's Park,24.1

https://data.weather.gov.hk/weatherAPI/hko_data/regional-weather/latest_1min_temperature.csv

Troon · May 25, 2022, 6:20am

There’s probably a more efficient way to do it — this is just an edited version of the code above:

rest:
  - resource: https://data.weather.gov.hk/weatherAPI/hko_data/regional-weather/latest_1min_temperature.csv
    scan_interval: 3600
    sensor:
      - name: HK Observatory
        unit_of_measurement: '°C'
        value_template: >-
          {% set r = value.replace('\ufeff', '').split('\n') %}
          {% set ns = namespace(s="") %}
          {% for i in r %}
            {% if "HK Observatory" in i %}
              {% set ns.s = i.split(',')[2] %}
            {% endif %}
          {% endfor %}
          {{ ns.s }}

Adjust scan_interval (refresh time in seconds) to suit your needs, but if you check it too often they might block you…

hung2k · May 25, 2022, 11:44am

Sorry, i am not familiar with python and just overlook your latest minimal solution.
Thank you for your kindly help.