Update browserless v2
This is an update to the original guide following the rather breaking changes introduced by browserless v2. At the time of writing [3/2024], browserless v2 is still not quite as fully featured as v1 was, but it is ultimately simpler to deal with, and it is the only version available as an add-on for HA.
Thanks to @atlflyer for pointing out that browserless can accept JS scripts as binary data, which gets around the requirement to minify the JS that I had in the previous version of this…
Intro
This rough initial stab at a guide will describe how to scrape data from websites which are dynamic: pages which are generated through user interaction using Javascript, and which, when scraped normally, do not contain the data you want. I warn you that I do not know Javascript and have managed to put this together by trial and error. Improvements are welcome. I thought it worthwhile to write up, since this is a question that comes up in the Forums frequently, with no solutions.
I am going to assume you already know how to scrape static websites using the scrape or multiscrape integrations, finding the selectors to extract the data you want.
As you may imagine, executing Javascript requires a full browser rather than just an HTML parser. Such a browser is provided by the Browserless Chrome add-on, which you can install from the AlexBelgium repositories. Add the following repository to the add-on store and install the add-on:
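https://github.com/alexbelgium/hassio-addons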
Browserless Chrome is two things: a headless Chromium instance which is running Puppeteer, an engine allowing you to script and automate user actions. Possibly it's Puppeteer running Chromium. In any case, we are going to use this to log into a dynamic website, render it, and send the static page to multiscrape. You can also use Browserless to e.g. take screenshots of the rendered website for display on your ESPHome device. You can read the docs at [browserless.io](https://www.browserless.io/).
Launch the browserless add-on. You should be able to access it at port 3000 of your Home Assistant:
http://homeassistant.local:3000
Unfortunately as of the time of writing you will see nothing but the message
No handler or file found for resource /
In v1 of browserless there used to be a live debugger here which you could use to construct the scripts. Currently, this is not fully ready for v2.
Nonetheless, you can replicate the experience using the Chrome hosted by browserless at:
https://chrome.browserless.io/
This is actually still running v1.6 of browserless, but it shouldn't matter for the most part. Since this is a public website, it is also quite slow, but if you have patience you can get things done.
What you should see is a panel on the left in which you can enter Puppeteer/Javascript and a Play button on the right which will execute the script showing you what the headless Chrome is doing. You can try any of the default scripts to see how it works.
I will demonstrate how to use this to log into my car-charger metering company's site. I will not give you my login credentials, but this should be enough information for you to replicate the process for whatever you actually want to do.
The login page for the website is here:
https://net.olife-energy.com/admin/#/?returnUrl.cn=undefined
If you open it in your browser and inspect it, you will be able to find the elements that are in the login form. However, if you try to GET the page you will see it is just a bunch of Javascript. Try this in a terminal:
curl 'https://net.olife-energy.com/admin/#/?returnUrl.cn=undefined' > page.html
and open page.html in your browser: there is nothing there. This is the usual case for dynamic websites, and the reason scrape can't get anything.
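As an aside: if your target page only needs Javascript rendering, with no login or clicks, browserless also exposes a /content endpoint that returns the fully rendered HTML directly, so you may not need a script at all. A rough sketch, assuming the add-on is reachable on port 3000 as above:
curl -X POST 'http://homeassistant.local:3000/content' -H 'Content-Type: application/json' -d '{"url": "https://net.olife-energy.com/admin"}' > rendered.html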
Automating user actions on the website
We will now log into the webpage in Browserless. Click on the new file icon in the top right of the left panel. An empty My-Script will appear. Copy the following there:
export default async ({ page }) => {
  await page.goto("https://net.olife-energy.com/admin");
  await page.type('input[id*="loginLogin"]', "MySecretUsr");
  await page.type('input[id*="loginPassword"]', "MySecretPassword");
  await Promise.all([
    // Wait for navigation to complete
    //page.waitForNavigation(),
    page.click(".btn-auth"),
    //page.waitForSelector("tr:nth-child(2) td:nth-child(6)"),
  ]);
  const html = await page.content();
  return html;
};
Run this script using the Play button in the top right. You will see that Browserless opens the webpage, types the user/password and clicks the login button. Since the login is wrong, it does not go anywhere, but after some seconds the browser offers you a file to download containing the static version of the login page. As you may imagine, had the login actually succeeded, you would have received the static version of the page behind the login for you to scrape in the usual way.
Let's briefly explain how the script is constructed, so you can do it yourself. If you inspect the login page in your browser, you will see that the selector loginLogin corresponds to the username box in the page. Browserless opens a headless Chrome tab and attaches a Puppeteer instance to it. Then page.goto goes to the address, and the page.type commands enter the username and password into the boxes identified by their selectors.
The next bit, Promise.all, is a set of things that must all complete before the script moves on. Here there is a page.click instruction which emulates a click on the login button identified by its selector .btn-auth. Then there are a couple of lines I've commented out which, if uncommented, make the Promise wait until the click actually navigates away to the next page and until the selector I need has loaded and rendered. Finally, page.content() gets the static content of the page, loads it into the html variable and returns it for browserless to offer to your browser.
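For a login that actually succeeds and navigates somewhere, you will usually want those waits enabled. Here is a sketch of that version; the credentials are of course placeholders, and the waitForSelector target is the table cell from my site, so substitute your own:
export default async ({ page }) => {
  await page.goto("https://net.olife-energy.com/admin");
  await page.type('input[id*="loginLogin"]', "MySecretUsr");
  await page.type('input[id*="loginPassword"]', "MySecretPassword");
  // Click and wait for the resulting navigation in parallel, otherwise
  // the navigation can fire before we start listening for it.
  await Promise.all([
    page.waitForNavigation(),
    page.click(".btn-auth"),
  ]);
  // Wait until the element we want to scrape has actually rendered.
  await page.waitForSelector("tr:nth-child(2) td:nth-child(6)");
  return await page.content();
};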
This should be enough to allow you to experiment with your target websites and emulate the correct data entry and clicks to get to the page that you want and to download it.
Once you are happy that your script works, save it somewhere accessible to your HA instance, e.g. I put my scripts into /config/js_scrapers. For concreteness, let's call it my_scraper.js.
Automatic execution of the script
In browserless v1 it was necessary to restructure this script a bit before allowing HA to use it. There is now a simpler method for v2 thanks to @atlflyer:
We will write a bash script to be executed by HA which will call the curl command and submit your javascript and save the rendered static website. We will then point multiscrape at this static website.
Create a bash script with a name like browserless_scraper.sh in /config/scripts (e.g. using the VS Code add-on; create the scripts directory if you don't have it):
#!/bin/sh
# Submit the JS scraper ($1) to browserless and save the rendered page as $2
curl -X POST 'http://localhost:3000/function' --data-binary "@/config/js_scrapers/$1" -H 'Content-Type: application/javascript' > "/config/www/browserless/$2"
This will submit the Javascript file $1 (the first argument supplied to the command) as unaltered binary data to the browserless instance running on your HAOS machine (localhost). The output will be saved in the /config/www/browserless folder with the file name given by the second argument $2.
SSH into HA, go to /config/scripts and make your bash script executable:
cd /config/scripts
chmod +x ./browserless_scraper.sh
Also create the correct directory structure to contain your static pages:
cd /config
mkdir ./www
cd www
mkdir ./browserless
The /config/www folder is publicly accessible if you know the precise filename (so think about the security there if your HA is exposed to the internet, I guess; see here).
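At this point you can test the whole chain by hand from the same SSH session and check that the output contains the rendered page:
cd /config/scripts
./browserless_scraper.sh my_scraper.js my_output.html
cat /config/www/browserless/my_output.html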
Now go to VS Code again and edit your configuration.yaml adding the following:
shell_command:
  browserless_scraper: "./scripts/browserless_scraper.sh {{function}} {{output}}"
[Note: it seems this script sometimes fails with a FileNotFound error. On my HAOS installation it works fine, but @giantmuskrat suggests invoking bash explicitly in the shell command:
shell_command:
  browserless_scraper: "bash ./scripts/browserless_scraper.sh {{function}} {{output}}"
End note]
Restart HA.
You will now have a shell_command service which will be able to take any JS script and submit it to browserless, which will output the static page to a file with the name you have supplied.
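Before building the automation, you can try the new service out from Developer Tools > Services, using the file names from above:
service: shell_command.browserless_scraper
data:
  function: "my_scraper.js"
  output: "my_output.html"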
Creating a scrape sensor
This bit should be roughly standard. I am using the multiscrape HACS extension to scrape, since it allows you to create multiple sensors from a single call to a webpage. This is especially important in this case, since running the script takes quite a long time; you do not want to be repeating the log-in for every sensor separately.
We define the sensor in the following manner:
- name: Charger Stats
  resource: "http://localhost:8123/local/browserless/my_output.html"
  scan_interval: 0
  timeout: 30
  log_response: true
  method: GET
  sensor:
    - name: Home Charger status
      unique_id: home_charger_status
      select: ".spot-list-item div:nth-child(1) div div .charger-status-dot"
      attribute: "class"
      value_template: "{{value}}"
A couple of comments: the resource is the static page that the curl command got from browserless, not the website you want to scrape. I have set scan_interval to zero, which disables automatic updating, since it makes no sense to run this scrape without having updated the static page first. I will trigger the update through an automation (see below).
If multiscrape tries to scrape multiple Browserless-rendered pages simultaneously, it might fail. I just stagger the scraping calls in separate automations (see the sketch at the end of the next section).
Automating the scraping
The scraping is driven by an automation, for example:
- alias: "Scrape: Trigger charger scraping"
id: dorhgoidhfgiosds
trigger:
- platform: homeassistant
event: start
id: HAStart
- platform: time_pattern
minutes: 7
action:
- if:
- condition: trigger
id: HAStart
then:
- delay: "0:05:00"
- service: shell_command.browserless_scraper
data:
function: "my_scraper.js"
output: "my_output.html"
- delay: "0:00:45"
- service: multiscrape.trigger_charger_stats
In this automation, at 7 minutes past every hour (and 5 minutes after an HA restart), the browserless_scraper command is called, which executes the shell script with the two parameters function and output. Then, after some delay to let browserless complete, multiscrape is called to update the sensors on the basis of the static version of the page sitting in the www folder on HA.
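If you scrape a second website, give it its own automation staggered to a different minute so the two browserless calls never overlap. A sketch, with hypothetical script and output names:
- alias: "Scrape: Trigger second site scraping"
  id: some_other_unique_id
  trigger:
    - platform: time_pattern
      minutes: 37
  action:
    - service: shell_command.browserless_scraper
      data:
        function: "my_other_scraper.js"
        output: "my_other_output.html"
    - delay: "0:00:45"
    # The trigger service name follows whatever you named the second multiscrape block
    - service: multiscrape.trigger_other_stats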
I hope this is helpful to someone!