Scrape Sensor - Local file - Hass.io

e1miran · April 4, 2018, 8:39pm

I’m trying to use the scrape sensor on an HTML resource that is local. The reason is that I’d like to scrape text from a page that is behind a login. I’m able to log into that page, grab the source code, and save it as an HTML file. I’ve placed it in my www/ folder within config/ and was hoping to scrape it from there.

In the browser I can bring up the page using the url https://[redacted].duckdns.org:8123/local/file.html - no problem. But it does not work in HA. The log shows “Error fetching data from https://…”. It’s not an issue with an incorrect CSS selector, because I then created a test sensor to scrape the current version # from the HA website and it worked. But when I took that source code and housed it locally, I again got the error. So how can I scrape a local webpage?

lolouk44 · April 5, 2018, 10:57am

Can you share some of your code?

e1miran · April 5, 2018, 1:06pm

This works:

  - platform: scrape
    resource: https://www.home-assistant.io
    name: Release
    select: ".current-version h1"
    value_template: '{{ value.split(":")[1] }}'

If I take the source code HTML from https://www.home-assistant.io and save it to a file named hass.html in the /www folder in my hassio /config directory, I can successfully browse the following url in a web browser:

https://[REDACTED].duckdns.org/local/hass.html

However, if I try to use that same url with the scrape sensor component (like below), I get an error saying that the data cannot be fetched.

  - platform: scrape
    resource: https://[REDACTED].duckdns.org/local/hass.html
    name: Release
    select: ".current-version h1"
    value_template: '{{ value.split(":")[1] }}'

The above is only an example; for my actual use-case I will not be scraping the home-assistant.io website. This is just to illustrate that it works with the page served externally, but not internally. If I can’t get the scrape component to work, I may try using a python script instead.
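A minimal sketch of what that Python fallback might look like, reading the saved file straight from disk instead of fetching it over HTTP. It uses only the standard library. The function names, the `/config/www/hass.html` path, and the assumption that the target is an `<h1>` nested inside an element with class `current-version` are illustrative guesses based on the example above, not part of any Home Assistant API.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassH1Text(HTMLParser):
    """Collect the text of the first <h1> inside an element with a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0        # nesting depth inside the container, 0 = outside
        self.in_h1 = False
        self.text = None      # set once the first matching <h1> is closed
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.text is not None:
            return
        if self.depth:
            self.depth += 1
            if tag == "h1":
                self.in_h1 = True
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.text is not None or not self.depth:
            return
        if tag == "h1" and self.in_h1:
            self.text = "".join(self._parts).strip()
            self.in_h1 = False
        self.depth -= 1

    def handle_data(self, data):
        if self.in_h1:
            self._parts.append(data)


def extract_h1_text(html, class_name):
    """Return the text of the first matching <h1>, or None if there is none."""
    parser = ClassH1Text(class_name)
    parser.feed(html)
    return parser.text


def value_from_file(path, class_name="current-version"):
    """Read a saved page and mimic value.split(":")[1] from the sensor config."""
    with open(path, encoding="utf-8") as handle:
        text = extract_h1_text(handle.read(), class_name)
    if text is None or ":" not in text:
        return None
    return text.split(":")[1].strip()
```

Calling `value_from_file("/config/www/hass.html")` (hypothetical path) would return the part after the first colon in the heading, or `None` if the element is missing. Because the file is read from disk, this sidesteps whatever is stopping the HTTP fetch, but it does not explain that failure.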

lolouk44 · April 5, 2018, 1:28pm

Hmm, I wonder if this is because you effectively have the page renderer trying to read what it’s rendering, so there may be a “safety” function to avoid infinite loops…

What if you host the page outside of HA’s config directory, does that work? (Not sure which OS, but if on Linux you’ll need something like Apache).

cryptelli · October 3, 2018, 8:02am

I’m also experiencing difficulties when trying to scrape data from a local resource. The partial error which @e1miran included in his OP was probably identical to the one I receive below:

Error fetching data: <PreparedRequest [GET]> from http://xxx.xxx.xx.xxx:xxxxx/ failed with HTTPConnectionPool(host='xxx.xxx.xx.xxx', port=xxxxx): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0f60bd5b38>: Failed to establish a new connection: [Errno 111] Connection refused',))

The port seems to be causing the issue, since it works fine without one. Any suggestions?
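For context, `[Errno 111] Connection refused` generally means the TCP connection was actively rejected, i.e. nothing was accepting connections on that address and port as seen from the machine (or container) making the request. A rough way to check this independently of the scrape sensor is a plain socket test run from the same environment as Home Assistant. The sketch below assumes a hypothetical helper name, `can_connect`, and placeholder host and port values.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `can_connect("192.0.2.10", 8123)` (placeholder address and port) returning `False` from inside the Home Assistant environment would suggest the problem lies with reachability of that port rather than with the CSS selector or the scrape sensor configuration.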