Hi,
I need help with scraping this csv file, Thanks
Hi,
I need help with scraping this csv file, Thanks
File looks like this:
<feff>Date;2021-06-16;
Total de cas confirmés;373 370;(+ 153)
Total de personnes rétablies;360 587;
Total des hospitalisations;192;(- 17)
Total des décès;11 177;(+ 0)
Here’s how to pull that data into a single sensor, with the date as the sensor state, and the other data as attributes. Create an external Python script — not an HA Script nor a python_script
. You’ll have to work out how to do that on your system: I’m on Docker which makes it easier. I’ve called it /config/scripts/quebec.py
:
import requests
import json
r = requests.get('https://cdn-contenu.quebec.ca/cdn-contenu/sante/documents/Problemes_de_sante/covid-19/csv/situation-au-quebec.csv').text
r = r.replace('\u00ef\u00bb\u00bf', '')
d = {}
for i in r.split('\n')[:-1]:
l = i.split(';')
d[l[0].replace('\u00c3\u00a9', 'e').replace('\u00c3\u00a8', 'e')] = l[1].replace(' ', '')
print(json.dumps(d))
Then create a command line sensor:
- platform: command_line
name: Quebec test
command: "python3 /config/scripts/quebec.py"
value_template: "{{ value_json['Date'] }}"
json_attributes:
- Total de cas confirmes
- Total de personnes retablies
- Total des hospitalisations
- Total des deces
Line 4 of the script removes the byte order mark.
The loop then reads each line in turn (except the blank last line), splitting it at each semicolon. The first element of that list is the dictionary key (with accents removed), and the second element is the value (with spaces in the middle of the numbers removed).
I haven’t captured the “change since last value” third column.
I’ve now managed this with no external scripts, just a command_line
sensor and template
sensors. First, the command_line
sensor:
- platform: command_line
command: 'python3 -c "import requests; print(requests.get(\"https://cdn-contenu.quebec.ca/cdn-contenu/sante/documents/Problemes_de_sante/covid-19/csv/situation-au-quebec.csv\").text)"'
name: Quebec COVID status
value_template: >-
{% set r = '["' ~ value.replace('', '').replace('\n','","').replace('\r','') ~ '"]' %}
{% set ns = namespace(s="{") %}
{% for i in (r|from_json) %}
{% set k = i.split(';')[0].replace('é','e').replace('è','e') %}
{% set v = i.split(';')[1].replace(' ','') %}
{% set ns.s = ns.s ~ "\"" ~ k ~ "\":\"" ~ v ~ "\"," %}
{% endfor %}
{{ ns.s[:-1]~"}" }}
First line of the template removes the BOM, and turns it into a string that looks like a list.
The rest of the template makes it into a string that looks like a dictionary.
Next, the template sensors, using the new template
integration (docs):
template:
- sensor:
- name: QCS Date
state: "{{ (states('sensor.quebec_covid_status')|from_json)['Date'] }}"
- name: QCS Total de cas confirmés
state: "{{ (states('sensor.quebec_covid_status')|from_json)['Total de cas confirmes'] }}"
- name: QCS Total de personnes rétablies
state: "{{ (states('sensor.quebec_covid_status')|from_json)['Total de personnes retablies'] }}"
- name: QCS Total des hospitalisations
state: "{{ (states('sensor.quebec_covid_status')|from_json)['Total des hospitalisations'] }}"
- name: QCS Total des décès
state: "{{ (states('sensor.quebec_covid_status')|from_json)['Total des deces'] }}"
Thanks, I’ll give a try when I am back home
Urgh, right. Here it is as a rest
sensor instead then (docs). template
sensors unchanged from prior post:
rest:
- resource: https://cdn-contenu.quebec.ca/cdn-contenu/sante/documents/Problemes_de_sante/covid-19/csv/situation-au-quebec.csv
scan_interval: 3600
sensor:
- name: Quebec COVID status
value_template: >-
{% set r = value.replace('\ufeff', '').split('\n') %}
{% set ns = namespace(s="{") %}
{% for i in r %}
{% if i|length > 1 %}
{% set k = i.split(';')[0].replace('é','e').replace('è','e') %}
{% set v = i.split(';')[1].replace(' ','') %}
{% set ns.s = ns.s ~ "\"" ~ k ~ "\":\"" ~ v ~ "\"," %}
{% endif %}
{% endfor %}
{{ ns.s[:-1]~"}" }}
Good to see no new deaths since yesterday’s data. Hopefully this sensor will be pointless and unneeded soon, despite being one of my more complex templates!
Pretty sure a CSV-to-Sensor solution will useful in other situations.
Great job
Thanks, works great
Is there a possibility to remove this dot/comma from this sensor.
- resource: https://santemontreal.qc.ca/en/public/coronavirus-covid-19/situation-of-the-coronavirus-covid-19-in-montreal/
scan_interval: 60
sensor:
- name: Sante Montreal Confirmed Cases
select: "tr:nth-child(1) > td:nth-child(1) > h3"
- name: Sante Montreal Confirmed Cases in Past 24 Hours
select: "tr:nth-child(1) > td:nth-child(2) > h3"
- name: Sante Montreal Confirmed Cases Last Updated
select: "#c43669 b"
- name: Sante Montreal Recovered Cases
select: "td:nth-child(3) > h3"
- name: Sante Montreal Number of Deaths
select: "tr:nth-child(2) > td:nth-child(1) > h3"
- resource: https://santemontreal.qc.ca/en/public/coronavirus-covid-19/situation-of-the-coronavirus-covid-19-in-montreal/
scan_interval: 60
sensor:
- name: Sante Montreal Confirmed Cases
select: "tr:nth-child(1) > td:nth-child(1) > h3"
value_template: "{{ value.replace(',', '') }}"
Thanks — although this was pretty non-standard CSV with no commas and no column headers!
Is there a documentation where I can learn creating these value templates?
https://jinja.palletsprojects.com/en/latest/templates/
Both are linked from Developer Tools / Templates, which is a great tool for developing templates. It also helps knowing a bit of Python, which is where the .replace
method (as opposed to the Jinja |replace
filter, which does the same thing) I’m using comes from.
All the value_template
does is process the incoming data to provide the state. In this case, you wanted to remove the commas, so I used value.replace(',', '')
to replace the comma with nothing. That looks a bit complex because of the “functional comma” in the declaration, but if you want to replace “a” with “b”, then it’s .replace('a', 'b')
.
I thought it might help those trying to learn this stuff to see what’s going on in my seemingly rather complicated value_template
for the Quebec rest
sensor. Paste this into Developer Tools / Templates and you’ll see the steps to get from non-standard CSV to a usable dictionary (string):
{% set value = "\ufeffDate;2021-06-16;
Total de cas confirmés;373 370;(+ 153)
Total de personnes rétablies;360 587;
Total des hospitalisations;192;(- 17)
Total des décès;11 177;(+ 0)" %}
Start:
{{ value }}
{# Turn it into a list. -#}
{% set r = value.replace('\ufeff', '').split('\n') %}
List: {{ r }}
{# Create a namespace, needed for referring to a variable -#}
{# inside and outside the scope of a loop -#}
{% set ns = namespace(s="{") -%}
{# Iterate through the elements of our list, converted from a string -#}
{% for i in r %}
Line {{ loop.index }}: {{ i }}
{# There seems to be a blank line in the actual response, filter that out -#}
{% if i|length > 1 -%}
{# Pull out the key from the line: first element when split by semicolons, accents removed -#}
{% set k = i.split(';')[0].replace('é','e').replace('è','e') -%}
Key {{ loop.index }}: {{ k -}}
{# and the value, second element, commas removed -#}
{% set v = i.split(';')[1].replace(' ','') %}
Value {{ loop.index }}: {{ v }}
{# Add the key:value pair to our dict string -#}
{% set ns.s = ns.s ~ "\"" ~ k ~ "\":\"" ~ v ~ "\"," -%}
Dict state: {{ ns.s }}
{% endif -%}
{% endfor %}
{# Knock off the final comma and close the the dict -#}
Final: {{ ns.s[:-1]~"}" }}
Can you help with extracting just this value below, Thanks
Did you try yourself? With the explanation I gave earlier, I would have thought it wouldn’t take much to get that value out. Also, if that was the only value you cared about from the start, it would have been useful to state that in the initial request.
Here’s my minimal work solution, in keeping with your apparent approach:
rest:
- resource: https://cdn-contenu.quebec.ca/cdn-contenu/sante/documents/Problemes_de_sante/covid-19/csv/situation-au-quebec.csv
scan_interval: 3600
sensor:
- name: Quebec case delta
value_template: >-
{% set r = value.replace('\ufeff', '').split('\n') %}
{% set ns = namespace(s="") %}
{% for i in r %}
{% if " cas " in i %}
{% set ns.s = i.split(';')[2].replace(' ','').replace('(','').replace(')','').replace('+','') %}
{% endif %}
{% endfor %}
{{ ns.s }}
Thanks, I tried myself but couldn’t get around, with the example above, it bit more clear. Let me try again with different value.
Hi, @Troon
Would you mind to tell me how to scrape a specific line only in a CSV?
Let’s say i just want line 5 HK Observatory only
Date time,Automatic Weather Station,Air Temperature(degree Celsius)
202205242040,Chek Lap Kok,26.4
202205242040,Cheung Chau,23.7
202205242040,Clear Water Bay,23.9
202205242040,Happy Valley,25.0
202205242040,HK Observatory,24.8
202205242040,HK Park,24.3
202205242040,Kai Tak Runway Park,24.8
202205242040,Kau Sai Chau,24.1
202205242040,King's Park,24.1
https://data.weather.gov.hk/weatherAPI/hko_data/regional-weather/latest_1min_temperature.csv
There’s probably a more efficient way to do it — this is just an edited version of the code above:
rest:
- resource: https://data.weather.gov.hk/weatherAPI/hko_data/regional-weather/latest_1min_temperature.csv
scan_interval: 3600
sensor:
- name: HK Observatory
unit_of_measurement: '°C'
value_template: >-
{% set r = value.replace('\ufeff', '').split('\n') %}
{% set ns = namespace(s="") %}
{% for i in r %}
{% if "HK Observatory" in i %}
{% set ns.s = i.split(',')[2] %}
{% endif %}
{% endfor %}
{{ ns.s }}
Adjust scan_interval
(refresh time in seconds) to suit your needs, but if you check it too often they might block you…
Sorry, i am not familiar with python and just overlook your latest minimal solution.
Thank you for your kindly help.