I’m trying to use the built-in scrape sensor to grab weather alerts from Environment Canada. Ideally, I want to grab two tags and join them without having to create two separate sensors. I’ve read through the Beautiful Soup documentation and tried a few things, but no luck so far.
So I’m trying to grab both the “title” and “summary” from under the “entry” tag. I can easily get either, but not both. This current setup only gives me the title.
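For reference, joining the two tags in plain Beautiful Soup is straightforward; here’s a minimal sketch using a trimmed-down stand-in for the feed (the tag structure is my approximation of the real Atom entries, not actual feed data):

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the Environment Canada alert feed; the real
# feed has more structure, but the entry/title/summary shape is the same.
feed = """
<feed>
  <entry>
    <title>SNOW SQUALL WARNING in effect</title>
    <summary>Local snow squalls expected tonight.</summary>
  </entry>
</feed>
"""

soup = BeautifulSoup(feed, "html.parser")

# Scope both CSS selectors to the entry tag, take the first match of
# each, and join their text into a single value.
value = (soup.select("entry title")[0].text + " " +
         soup.select("entry summary")[0].text)
print(value)  # -> SNOW SQUALL WARNING in effect Local snow squalls expected tonight.
```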
Solved it by hacking a custom version of the scrape sensor.
If anyone cares, I added an optional second “select” field that, if it exists, just gets joined to the data gathered by the first “select” field.
config\custom_components\sensor\scrape.py
"""
Support for getting data from websites with scraping.

For more details about this platform, please refer to the documentation at
https://home-assistant.io/components/sensor.scrape/
"""
import logging

import voluptuous as vol
from requests.auth import HTTPBasicAuth, HTTPDigestAuth

from homeassistant.components.sensor import PLATFORM_SCHEMA
from homeassistant.components.sensor.rest import RestData
from homeassistant.const import (
    CONF_NAME, CONF_RESOURCE, CONF_UNIT_OF_MEASUREMENT, STATE_UNKNOWN,
    CONF_VALUE_TEMPLATE, CONF_VERIFY_SSL, CONF_USERNAME, CONF_HEADERS,
    CONF_PASSWORD, CONF_AUTHENTICATION, HTTP_BASIC_AUTHENTICATION,
    HTTP_DIGEST_AUTHENTICATION)
from homeassistant.helpers.entity import Entity
import homeassistant.helpers.config_validation as cv

REQUIREMENTS = ['beautifulsoup4==4.6.3']

_LOGGER = logging.getLogger(__name__)

CONF_ATTR = 'attribute'
CONF_SELECT = 'select'
CONF_SELECT2 = 'select2'

DEFAULT_NAME = 'Web scrape'
DEFAULT_VERIFY_SSL = True

PLATFORM_SCHEMA = PLATFORM_SCHEMA.extend({
    vol.Required(CONF_RESOURCE): cv.string,
    vol.Required(CONF_SELECT): cv.string,
    vol.Optional(CONF_SELECT2): cv.string,
    vol.Optional(CONF_ATTR): cv.string,
    vol.Optional(CONF_AUTHENTICATION):
        vol.In([HTTP_BASIC_AUTHENTICATION, HTTP_DIGEST_AUTHENTICATION]),
    vol.Optional(CONF_HEADERS): vol.Schema({cv.string: cv.string}),
    vol.Optional(CONF_NAME, default=DEFAULT_NAME): cv.string,
    vol.Optional(CONF_PASSWORD): cv.string,
    vol.Optional(CONF_UNIT_OF_MEASUREMENT): cv.string,
    vol.Optional(CONF_USERNAME): cv.string,
    vol.Optional(CONF_VALUE_TEMPLATE): cv.template,
    vol.Optional(CONF_VERIFY_SSL, default=DEFAULT_VERIFY_SSL): cv.boolean,
})


def setup_platform(hass, config, add_entities, discovery_info=None):
    """Set up the Web scrape sensor."""
    name = config.get(CONF_NAME)
    resource = config.get(CONF_RESOURCE)
    method = 'GET'
    payload = None
    headers = config.get(CONF_HEADERS)
    verify_ssl = config.get(CONF_VERIFY_SSL)
    select = config.get(CONF_SELECT)
    select2 = config.get(CONF_SELECT2)
    attr = config.get(CONF_ATTR)
    unit = config.get(CONF_UNIT_OF_MEASUREMENT)
    username = config.get(CONF_USERNAME)
    password = config.get(CONF_PASSWORD)
    value_template = config.get(CONF_VALUE_TEMPLATE)
    if value_template is not None:
        value_template.hass = hass

    if username and password:
        if config.get(CONF_AUTHENTICATION) == HTTP_DIGEST_AUTHENTICATION:
            auth = HTTPDigestAuth(username, password)
        else:
            auth = HTTPBasicAuth(username, password)
    else:
        auth = None

    rest = RestData(method, resource, auth, headers, payload, verify_ssl)
    rest.update()

    if rest.data is None:
        _LOGGER.error("Unable to fetch data from %s", resource)
        return False

    add_entities([ScrapeSensor(
        rest, name, select, select2, attr, value_template, unit)], True)


class ScrapeSensor(Entity):
    """Representation of a web scrape sensor."""

    def __init__(self, rest, name, select, select2, attr, value_template,
                 unit):
        """Initialize a web scrape sensor."""
        self.rest = rest
        self._name = name
        self._state = STATE_UNKNOWN
        self._select = select
        self._select2 = select2
        self._attr = attr
        self._value_template = value_template
        self._unit_of_measurement = unit

    @property
    def name(self):
        """Return the name of the sensor."""
        return self._name

    @property
    def unit_of_measurement(self):
        """Return the unit the value is expressed in."""
        return self._unit_of_measurement

    @property
    def state(self):
        """Return the state of the device."""
        return self._state

    def update(self):
        """Get the latest data from the source and updates the state."""
        self.rest.update()

        from bs4 import BeautifulSoup
        raw_data = BeautifulSoup(self.rest.data, 'html.parser')
        _LOGGER.debug(raw_data)

        try:
            if self._attr is not None:
                value = raw_data.select(self._select)[0][self._attr]
            else:
                if self._select2 is not None:
                    value = (raw_data.select(self._select)[0].text + " " +
                             raw_data.select(self._select2)[0].text)
                else:
                    value = raw_data.select(self._select)[0].text
            _LOGGER.debug(value)
        except IndexError:
            _LOGGER.error("Unable to extract data from HTML")
            return

        if self._value_template is not None:
            self._state = self._value_template.render_with_possible_json_value(
                value, STATE_UNKNOWN)
        else:
            self._state = value
I implemented the scrape sensor a few days ago, but didn’t know what the alert actually showed up as (and didn’t care to look for a place with an alert at the time). I set a notification to appear when there was an alert, and went to work formatting. And here we are!
The style of that screenshot looks great; I might steal it for displaying the alert in an HA card.
FWIW, I created an Environment Canada driver for my HA system about nine years ago. It gets its data from EnviroCan’s XML feeds (not RSS). For example, here’s the URL for your neck of the woods:
The XML contains a warnings node. Here’s what it says for Collingwood right now. The important stuff is in the first two lines (high priority warning, snow squall warning):
My driver polls their site every 30 minutes and gets the latest weather data. It also checks if warnings contains data and then announces it in my home:
“Attention! There is a high priority weather warning in effect. Snow Squall Warning.”
It also displays the warning on the thermostat and changes the thermostat’s backlight color to red (warning=red, watch=yellow, end of warning/watch=green).
The one thing EnviroCan has neglected to include in the XML data (for more than nine years) is the weather warning’s descriptive text! My driver follows the URL in the warnings node and then scrapes that web page for the text. Over the years they’ve changed the HTML formatting, so every few years I have to tweak the code to adapt to their modifications.
Whereas the RSS link provides a summary of the weather warning (i.e. the image in my previous post), EnviroCan often has a lot more to say on its warnings web page: https://weather.gc.ca/warnings/report_e.html?on18
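The warnings-node check and announcement described above can be sketched like this (the element and attribute names here are my approximations of the citypage XML, not the real schema):

```python
import xml.etree.ElementTree as ET

# Approximation of the warnings node in EnviroCan's city XML; the real
# citypage schema differs, this just illustrates the announcement logic.
citypage = """
<siteData>
  <warnings url="https://weather.gc.ca/warnings/report_e.html?on18">
    <event type="warning" priority="high" description="SNOW SQUALL WARNING"/>
  </warnings>
</siteData>
"""

root = ET.fromstring(citypage)
warnings_node = root.find("warnings")
events = warnings_node.findall("event")

announcement = None
if events:
    # Build the spoken announcement from the first event in the node.
    announcement = (
        "Attention! There is a {} priority weather warning in effect. {}."
        .format(events[0].get("priority"),
                events[0].get("description").title()))
    # The descriptive text isn't in the XML; it has to be scraped from
    # the page at warnings_node.get("url").
```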