Hi,
I need help with scraping this csv file, Thanks
Hi,
I need help with scraping this csv file, Thanks
File looks like this:
<feff>Date;2021-06-16;
Total de cas confirmés;373 370;(+ 153)
Total de personnes rétablies;360 587;
Total des hospitalisations;192;(- 17)
Total des décÚs;11 177;(+ 0)
Hereâs how to pull that data into a single sensor, with the date as the sensor state, and the other data as attributes. Create an external Python script â not an HA Script nor a python_script
. Youâll have to work out how to do that on your system: Iâm on Docker which makes it easier. Iâve called it /config/scripts/quebec.py
:
import requests
import json
r = requests.get('https://cdn-contenu.quebec.ca/cdn-contenu/sante/documents/Problemes_de_sante/covid-19/csv/situation-au-quebec.csv').text
r = r.replace('\u00ef\u00bb\u00bf', '')
d = {}
for i in r.split('\n')[:-1]:
l = i.split(';')
d[l[0].replace('\u00c3\u00a9', 'e').replace('\u00c3\u00a8', 'e')] = l[1].replace(' ', '')
print(json.dumps(d))
Then create a command line sensor:
- platform: command_line
name: Quebec test
command: "python3 /config/scripts/quebec.py"
value_template: "{{ value_json['Date'] }}"
json_attributes:
- Total de cas confirmes
- Total de personnes retablies
- Total des hospitalisations
- Total des deces
Line 4 of the script removes the byte order mark.
The loop then reads each line in turn (except the blank last line), splitting it at each semicolon. The first element of that list is the dictionary key (with accents removed), and the second element is the value (with spaces in the middle of the numbers removed).
I havenât captured the âchange since last valueâ third column.
Iâve now managed this with no external scripts, just a command_line
sensor and template
sensors. First, the command_line
sensor:
- platform: command_line
command: 'python3 -c "import requests; print(requests.get(\"https://cdn-contenu.quebec.ca/cdn-contenu/sante/documents/Problemes_de_sante/covid-19/csv/situation-au-quebec.csv\").text)"'
name: Quebec COVID status
value_template: >-
{% set r = '["' ~ value.replace('', '').replace('\n','","').replace('\r','') ~ '"]' %}
{% set ns = namespace(s="{") %}
{% for i in (r|from_json) %}
{% set k = i.split(';')[0].replace('Ă©','e').replace('Ăš','e') %}
{% set v = i.split(';')[1].replace(' ','') %}
{% set ns.s = ns.s ~ "\"" ~ k ~ "\":\"" ~ v ~ "\"," %}
{% endfor %}
{{ ns.s[:-1]~"}" }}
First line of the template removes the BOM, and turns it into a string that looks like a list.
The rest of the template makes it into a string that looks like a dictionary.
Next, the template sensors, using the new template
integration (docs):
template:
- sensor:
- name: QCS Date
state: "{{ (states('sensor.quebec_covid_status')|from_json)['Date'] }}"
- name: QCS Total de cas confirmés
state: "{{ (states('sensor.quebec_covid_status')|from_json)['Total de cas confirmes'] }}"
- name: QCS Total de personnes rétablies
state: "{{ (states('sensor.quebec_covid_status')|from_json)['Total de personnes retablies'] }}"
- name: QCS Total des hospitalisations
state: "{{ (states('sensor.quebec_covid_status')|from_json)['Total des hospitalisations'] }}"
- name: QCS Total des décÚs
state: "{{ (states('sensor.quebec_covid_status')|from_json)['Total des deces'] }}"
Thanks, Iâll give a try when I am back home
Urgh, right. Here it is as a rest
sensor instead then (docs). template
sensors unchanged from prior post:
rest:
- resource: https://cdn-contenu.quebec.ca/cdn-contenu/sante/documents/Problemes_de_sante/covid-19/csv/situation-au-quebec.csv
scan_interval: 3600
sensor:
- name: Quebec COVID status
value_template: >-
{% set r = value.replace('\ufeff', '').split('\n') %}
{% set ns = namespace(s="{") %}
{% for i in r %}
{% if i|length > 1 %}
{% set k = i.split(';')[0].replace('Ă©','e').replace('Ăš','e') %}
{% set v = i.split(';')[1].replace(' ','') %}
{% set ns.s = ns.s ~ "\"" ~ k ~ "\":\"" ~ v ~ "\"," %}
{% endif %}
{% endfor %}
{{ ns.s[:-1]~"}" }}
Good to see no new deaths since yesterdayâs data. Hopefully this sensor will be pointless and unneeded soon, despite being one of my more complex templates!
Pretty sure a CSV-to-Sensor solution will useful in other situations.
Great job
Thanks, works great
Is there a possibility to remove this dot/comma from this sensor.
- resource: https://santemontreal.qc.ca/en/public/coronavirus-covid-19/situation-of-the-coronavirus-covid-19-in-montreal/
scan_interval: 60
sensor:
- name: Sante Montreal Confirmed Cases
select: "tr:nth-child(1) > td:nth-child(1) > h3"
- name: Sante Montreal Confirmed Cases in Past 24 Hours
select: "tr:nth-child(1) > td:nth-child(2) > h3"
- name: Sante Montreal Confirmed Cases Last Updated
select: "#c43669 b"
- name: Sante Montreal Recovered Cases
select: "td:nth-child(3) > h3"
- name: Sante Montreal Number of Deaths
select: "tr:nth-child(2) > td:nth-child(1) > h3"
- resource: https://santemontreal.qc.ca/en/public/coronavirus-covid-19/situation-of-the-coronavirus-covid-19-in-montreal/
scan_interval: 60
sensor:
- name: Sante Montreal Confirmed Cases
select: "tr:nth-child(1) > td:nth-child(1) > h3"
value_template: "{{ value.replace(',', '') }}"
Thanks â although this was pretty non-standard CSV with no commas and no column headers!
Is there a documentation where I can learn creating these value templates?
https://jinja.palletsprojects.com/en/latest/templates/
Both are linked from Developer Tools / Templates, which is a great tool for developing templates. It also helps knowing a bit of Python, which is where the .replace
method (as opposed to the Jinja |replace
filter, which does the same thing) Iâm using comes from.
All the value_template
does is process the incoming data to provide the state. In this case, you wanted to remove the commas, so I used value.replace(',', '')
to replace the comma with nothing. That looks a bit complex because of the âfunctional commaâ in the declaration, but if you want to replace âaâ with âbâ, then itâs .replace('a', 'b')
.
I thought it might help those trying to learn this stuff to see whatâs going on in my seemingly rather complicated value_template
for the Quebec rest
sensor. Paste this into Developer Tools / Templates and youâll see the steps to get from non-standard CSV to a usable dictionary (string):
{% set value = "\ufeffDate;2021-06-16;
Total de cas confirmés;373 370;(+ 153)
Total de personnes rétablies;360 587;
Total des hospitalisations;192;(- 17)
Total des décÚs;11 177;(+ 0)" %}
Start:
{{ value }}
{# Turn it into a list. -#}
{% set r = value.replace('\ufeff', '').split('\n') %}
List: {{ r }}
{# Create a namespace, needed for referring to a variable -#}
{# inside and outside the scope of a loop -#}
{% set ns = namespace(s="{") -%}
{# Iterate through the elements of our list, converted from a string -#}
{% for i in r %}
Line {{ loop.index }}: {{ i }}
{# There seems to be a blank line in the actual response, filter that out -#}
{% if i|length > 1 -%}
{# Pull out the key from the line: first element when split by semicolons, accents removed -#}
{% set k = i.split(';')[0].replace('Ă©','e').replace('Ăš','e') -%}
Key {{ loop.index }}: {{ k -}}
{# and the value, second element, commas removed -#}
{% set v = i.split(';')[1].replace(' ','') %}
Value {{ loop.index }}: {{ v }}
{# Add the key:value pair to our dict string -#}
{% set ns.s = ns.s ~ "\"" ~ k ~ "\":\"" ~ v ~ "\"," -%}
Dict state: {{ ns.s }}
{% endif -%}
{% endfor %}
{# Knock off the final comma and close the the dict -#}
Final: {{ ns.s[:-1]~"}" }}
Can you help with extracting just this value below, Thanks
Did you try yourself? With the explanation I gave earlier, I would have thought it wouldnât take much to get that value out. Also, if that was the only value you cared about from the start, it would have been useful to state that in the initial request.
Hereâs my minimal work solution, in keeping with your apparent approach:
rest:
- resource: https://cdn-contenu.quebec.ca/cdn-contenu/sante/documents/Problemes_de_sante/covid-19/csv/situation-au-quebec.csv
scan_interval: 3600
sensor:
- name: Quebec case delta
value_template: >-
{% set r = value.replace('\ufeff', '').split('\n') %}
{% set ns = namespace(s="") %}
{% for i in r %}
{% if " cas " in i %}
{% set ns.s = i.split(';')[2].replace(' ','').replace('(','').replace(')','').replace('+','') %}
{% endif %}
{% endfor %}
{{ ns.s }}
Thanks, I tried myself but couldnât get around, with the example above, it bit more clear. Let me try again with different value.
Hi, @Troon
Would you mind to tell me how to scrape a specific line only in a CSV?
Letâs say i just want line 5 HK Observatory only
Date time,Automatic Weather Station,Air Temperature(degree Celsius)
202205242040,Chek Lap Kok,26.4
202205242040,Cheung Chau,23.7
202205242040,Clear Water Bay,23.9
202205242040,Happy Valley,25.0
202205242040,HK Observatory,24.8
202205242040,HK Park,24.3
202205242040,Kai Tak Runway Park,24.8
202205242040,Kau Sai Chau,24.1
202205242040,King's Park,24.1
https://data.weather.gov.hk/weatherAPI/hko_data/regional-weather/latest_1min_temperature.csv
Thereâs probably a more efficient way to do it â this is just an edited version of the code above:
rest:
- resource: https://data.weather.gov.hk/weatherAPI/hko_data/regional-weather/latest_1min_temperature.csv
scan_interval: 3600
sensor:
- name: HK Observatory
unit_of_measurement: '°C'
value_template: >-
{% set r = value.replace('\ufeff', '').split('\n') %}
{% set ns = namespace(s="") %}
{% for i in r %}
{% if "HK Observatory" in i %}
{% set ns.s = i.split(',')[2] %}
{% endif %}
{% endfor %}
{{ ns.s }}
Adjust scan_interval
(refresh time in seconds) to suit your needs, but if you check it too often they might block youâŠ
Sorry, i am not familiar with python and just overlook your latest minimal solution.
Thank you for your kindly help.