[GUIDE] Scraping dynamic websites with browserless + multiscrape. v2 update

Update browserless v2

This is an update to the original guide, following the rather breaking changes that came with browserless v2. At the time of writing [3/2024], browserless v2 is still not quite as fully featured as v1 was, but it is in the end simpler to deal with, and it is the only version available as an add-on for HA.

Thanks to @atlflyer for pointing out that browserless can accept JS scripts as binary data, which gets around the requirement to minify the JS that I had in the previous version of this guide.


Intro

This rough initial stab at a guide will describe how to scrape data from dynamic websites – pages which are generated through user interaction using Javascript, and which, when scraped normally, do not contain the data you want. I warn you that I do not know Javascript and have put this together by trial and error. Improvements are welcome. I thought it worthwhile to write up, since this is a question that comes up in the forums frequently, with no solutions.

I am going to assume you already know how to scrape static websites using the scrape or multiscrape integrations, finding the selectors to extract the data you want.

As you may imagine, executing Javascript requires a full browser rather than just an HTML parser. Such a solution is provided by the Browserless Chrome add-on, which you can install from the AlexBelgium repository (https://github.com/alexbelgium/hassio-addons). Add the repository to the add-on store and install the add-on.

Browserless Chrome is two things: a headless Chromium instance and Puppeteer, an engine which allows you to script and automate user actions. Possibly it’s Puppeteer running Chromium. In any case, we are going to use this to log into a dynamic website, render it and send the static page to multiscrape. You can also use Browserless to e.g. take screenshots of the rendered website for display on your ESPHome device. You can read the docs at [browserless.io](https://www.browserless.io/).
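As a taste of what such a script looks like, here is a minimal sketch of a screenshot function (untested, and the URL is just a placeholder – the { data, type } return shape is what browserless uses to decide the Content-Type of the response, as you will see again later in this thread):

export default async ({ page }) => {
  await page.goto("https://example.com");
  // Capture the fully rendered page as a PNG
  const data = await page.screenshot({ fullPage: true });
  // { data, type } tells browserless what Content-Type to send back
  return { data, type: "image/png" };
};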

Launch the browserless add-on. You should be able to access it at port 3000 of your Home Assistant.

http://homeassistant.local:3000

Unfortunately as of the time of writing you will see nothing but the message

No handler or file found for resource /

In v1 of browserless there used to be a live debugger here which you could use to construct the scripts. Currently, this is not fully ready for v2.

Nonetheless, you can replicate the experience using the Chrome hosted by browserless at:

https://chrome.browserless.io/

This is actually still running v1.6 of browserless, but it shouldn’t matter for the most part. Since this is a public website, it is also quite slow, but if you have patience you can get things done.

What you should see is a panel on the left in which you can enter Puppeteer/Javascript and a Play button on the right which will execute the script showing you what the headless Chrome is doing. You can try any of the default scripts to see how it works.
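For a first test, a trivial function like this (a sketch – any address works, example.com is just a placeholder) confirms that the headless browser is actually rendering pages:

export default async ({ page }) => {
  await page.goto("https://example.com");
  // Return the page title to prove the page was loaded and rendered
  return await page.title();
};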

I will demonstrate how to use this to log into my car-charger metering company. I will not give you login credentials, but it should be enough information for you to replicate this for what you actually want to do.

The login page for the website is here:

https://net.olife-energy.com/admin/#/?returnUrl.cn=undefined

If you open it in your browser and inspect it, you will be able to find the elements that are in the login form. However, if you try to GET the page you will see it is just a bunch of Javascript. Try this in a terminal:

curl 'https://net.olife-energy.com/admin/#/?returnUrl.cn=undefined' > page.html

and open page.html in your browser – there is nothing there. This is the usual case for dynamic websites, and why scrape can’t get anything.

Automating user actions on the website

We will now log into the webpage in Browserless. Click on the new file icon in the top right of the left panel. An empty My-Script will appear. Copy the following there:

export default async ({ page }) => {
  await page.goto("https://net.olife-energy.com/admin");
  // Fill in the login form; the selectors match the ids of the input elements
  await page.type('input[id*="loginLogin"]', "MySecretUsr");
  await page.type('input[id*="loginPassword"]', "MySecretPassword");
  await Promise.all([
    // Wait for navigation to complete
    //page.waitForNavigation(),
    page.click(".btn-auth"),
    //page.waitForSelector("tr:nth-child(2) td:nth-child(6)"),
  ]);

  // Grab the rendered static HTML of the page
  const html = await page.content();

  return html;
};

Run this script using the Play button in the top right. You will see that Browserless opens the webpage, types the user/password and clicks the login button. Since the login is wrong, it does not go anywhere, but after some seconds the browser offers you a file to download containing the static version of the login page. As you may imagine, had the login actually succeeded, you would have received the static version of the page behind the login for you to scrape in the usual way.

Let’s briefly explain how the script is constructed, so you can do it yourself. If you inspect the login page in your browser, you will see that the selector loginLogin corresponds to the username box on the page. Browserless opens a headless Chrome tab and attaches a Puppeteer instance to it. Then page.goto navigates to the address, and the page.type commands enter the username and password into the boxes identified by their selectors.

The next bit, Promise.all, is a set of promises that must all complete before the script moves on. Here there is a page.click instruction which emulates a click on the login button identified by its selector .btn-auth. Then there are a couple of lines I’ve commented out which, if uncommented, make the Promise wait until the click actually navigates away to the next page and then until the selector I need loads and gets rendered. Finally, page.content() gets the static content of the page, loads it into the html variable and returns it to browserless, which offers it to your browser.
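For reference, here is the same script with those waits enabled (still a sketch – the credentials and the selector are the placeholders from above and must be replaced with your own):

export default async ({ page }) => {
  await page.goto("https://net.olife-energy.com/admin");
  await page.type('input[id*="loginLogin"]', "MySecretUsr");
  await page.type('input[id*="loginPassword"]', "MySecretPassword");
  await Promise.all([
    // Resolves once the click has actually navigated to the next page
    page.waitForNavigation(),
    page.click(".btn-auth"),
  ]);
  // Wait until the element we want to scrape has been rendered
  await page.waitForSelector("tr:nth-child(2) td:nth-child(6)");
  return await page.content();
};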

This should be enough to allow you to experiment with your target websites and emulate the correct data entry and clicks to get to the page that you want and to download it.

Once you are happy that your script works, save it somewhere accessible to your HA instance, e.g. I put my scripts into /config/js_scrapers. For concreteness, let’s call it my_scraper.js.

Automatic execution of the script

In browserless v1 it was necessary to restructure this script a bit before allowing HA to use it. There is now a simpler method for v2 thanks to @atlflyer:

We will write a bash script to be executed by HA which will call the curl command and submit your javascript and save the rendered static website. We will then point multiscrape at this static website.

Create a bash script with a name e.g. browserless_scraper.sh in /config/scripts (e.g. using the VS Code add-on; create the scripts directory if you don’t have it):

#!/bin/sh
# Submit the JS function ($1) to browserless and save the rendered page as $2
curl -X POST 'http://localhost:3000/function' --data-binary "@/config/js_scrapers/$1" -H 'Content-Type: application/javascript' > "/config/www/browserless/$2"

This will submit the Javascript file $1 (the first parameter supplied to the command) as an unaltered binary file to the browserless instance running on your HAOS machine (“localhost”). The output will be saved in the /config/www/browserless folder with the file name given by the second parameter $2. For example, browserless_scraper.sh my_scraper.js my_output.html runs my_scraper.js and saves the rendered page as /config/www/browserless/my_output.html.

SSH into HA and go to /config/scripts and make your bash script executable:

cd /config/scripts
chmod +x ./browserless_scraper.sh

Also create the correct directory structure to contain your static pages:

cd /config
mkdir ./www
cd www
mkdir ./browserless

The /config/www folder is publicly accessible if you know the precise filename (so think about the security there if your HA is exposed to the internet, I guess – see here.)

Now go to VS Code again and edit your configuration.yaml adding the following:

shell_command:
  browserless_scraper: "./scripts/browserless_scraper.sh {{function}} {{output}}"

[Note: it seems this script sometimes fails with a FileNotFound error. On my HAOS installation it works fine, but @giantmuskrat suggests invoking it via bash in the shell command:

shell_command:
  browserless_scraper: "bash ./scripts/browserless_scraper.sh {{function}} {{output}}"

End note]

Restart HA.

You will now have a shell_command service which will be able to take any JS script and submit it to browserless, which will output the static page to a file with the name you have supplied.

Creating a scrape sensor

This bit should be roughly standard – I am using the multiscrape HACS extension to scrape, since it allows you to create multiple sensors from a single call to a webpage. This is especially important in this case, since running the script takes quite a lot of time. You do not want to be repeating the log-in for every sensor separately.

We define the sensor in the following manner:

- name: Charger Stats
  resource: "http://localhost:8123/local/browserless/my_output.html"
  scan_interval: 0
  timeout: 30
  log_response: true
  method: GET
  sensor:
    - name: Home Charger status
      unique_id: home_charger_status
      select: ".spot-list-item div:nth-child(1) div div .charger-status-dot"
      attribute: "class"
      value_template: "{{value}}"

A couple of comments – the resource is the static page that the curl command got from browserless, not the website you want to scrape. I have set the scan_interval to zero, which disables automatic updating, since it makes no sense to run this scrape without having updated the static page first. I will do this through an automation (see below).

If you have multiscrape try to scrape multiple websites with Browserless simultaneously, it might fail. I just schedule the scraping calls separately using automations.

Automating the scraping

The scraping is done by an automation, for example:

- alias: "Scrape: Trigger charger scraping"
  id: dorhgoidhfgiosds
  trigger:
    - platform: homeassistant
      event: start
      id: HAStart
    - platform: time_pattern
      minutes: 7
  action:
    - if:
        - condition: trigger
          id: HAStart
      then:
        - delay: "0:05:00"
    - service: shell_command.browserless_scraper
      data:
        function: "my_scraper.js"
        output: "my_output.html"
    - delay: "0:00:45"
    - service: multiscrape.trigger_charger_stats

In this automation, at 7 minutes past every hour (and on HA restart), the browserless_scraper command is called, which executes the shell script with the two parameters function and output. Then, after some delay to let browserless complete, multiscrape is called to update the sensors on the basis of the static version of the page sitting in the www folder on HA.

I hope this is helpful to someone!


Hi, thanks for your writeup. I’m trying to do something similar but I’m struggling with converting the browserless script to JSON. I have nested " and I tried everything '"` like you suggested, and \ on the outer ' and \ on the inner ".
Is there anything you can suggest to make this work?

curl 'http://localhost:3033/function' --json '{"code": "module.exports=async({page:t,context:a})=>{const{url:e}=a;await t.goto(e),await t.type(\"input[type=email][name=email]\",\"******\"),await t.type(\"input[type=password][name=otp]\",\"******\"),await t.click(\"button[type=submit][name=otl]\"),await t.waitForNavigation({waitUntil:\"domcontentloaded\"}),await t.click( '\"'input[type=submit][value=\"Add Job\"] '\"'),await t.waitForTimeout(5e3);return{data:await t.content(),type:\"application/html\"}};", "context": {"url": "*****"}}'

This is the line that is causing trouble, I think.

await t.click( '\"'input[type=submit][value=\"Add Job\"] '\"')

and this is the original line of code

await page.click('input[type=submit][value="Add Job"]');

thanks for your help

I think I would re-express

await page.click('input[type=submit][value="Add Job"]');

as

await page.click(\"input[type=submit][value=\\\"Add Job\\\"]\");

i.e. replace the single quotes with double quotes and escape them, and then escape the escape for the inner quotes – i.e. the inner " becomes a \\\" – I’ve just noticed that the website’s parser parses my escapes, mangling them. The code part is fine. It is supposed to be a triple \ before the double quote.

My heuristic on this is that there can be no single quotes in the JSON at all, so that the whole curl parameter can be surrounded by single quotes.

I think I already tried that but it didn’t work.
I found another way to do what I wanted with Node-RED, but thanks anyway.

Hi @wigster, thanks for writing this up!
Since the release of browserless v2, it seems the debugger is no longer available in the browserless-chrome addon.
I want to write my first scripts based on yours but I have no idea where to place them in the HA file system. Any advice?

Alas I upgraded only yesterday and have hit the same problem. From the documentation it seems as if the debugger should be there, so maybe it’s a misconfiguration of the container?

The hosted demo server at chrome.browserless.io still runs 1.61, so maybe it’s a bit early to try to migrate? It seems like the project is in transition without taking hostages. In any case, you could try to prototype there.

But some minor changes do seem to be necessary in the form of the function call, see

browserless/MIGRATION-2.0.md at main · browserless/browserless (github.com)


Thank you for this guide, it’s helped me solve a problem that’s vexed me for some time!

I had problems calling the /function endpoint using JSON on the command line (it always caused an error in the add-on about no matching WebSocket route handler), but the endpoint also supports receiving the JS code as a binary file upload. This approach avoids the need to minify and escape quotation marks, so I think it’s a bit easier. The curl command line in this case looks like

curl --data-binary @/config/myfunction.js -H 'Content-Type: application/javascript' http://localhost:3000/function

I used the debugger on the demo server, and it seems to be generating the 2.0-style browserless functions. I prototyped there and ran the resulting function in the add-on with no issues.


That’s great to know. Can you share the yaml configuration that allows the binary JS file to be used by multiscrape?

I am still trying to figure out how best to do this, but I do not think you can through multiscrape.

Right now, I have a shell script that is a curl command like the above from @atlflyer. I then call it from the automation and send the output to a file in the /config/www folder, which is accessible as a public URL. I then call the multiscrape service from the automation to update that set of sensors, and multiscrape is pointed to scrape the file sitting in /config/www.

Once I am happy with the results, I will update the guide.

Have a look at the guide – I’ve updated it for how I have it running on my machine now with browserless v2. There might be a nicer solution though – suggestions welcome.

Thank you for the excellent writeup!
I’m trying to use this method for scraping, but I can’t get the shell command to work. I’ve created a “scripts” folder and put a “browserless_scraper.sh” in the folder, and have this in my configuration.yaml:

shell_command:
  browserless_scraper: "./scripts/browserless_scraper.sh {{function}} {{output}}"

I can find the shell command in home assistant, but when I try to run it, I always get this error logged: FileNotFoundError: [Errno 2] No such file or directory: ‘./scripts/browserless_scraper.sh’

I’m using Home Assistant on an Intel NUC, and I’m guessing I need to specify the path to the folder containing the script somehow differently, but I can’t figure out how.

If I open a terminal through the HA UI, I can see the file, and run it. I’m not all that familiar with the docker containers and what file structure is accessible from where :frowning:

I have a standard HAOS installation running on an arm machine. In my case you can attach to the Docker container of HA by SSHing into the machine and running:

docker exec -it homeassistant /bin/bash

This opens a bash shell in the container, and I believe the environment, paths etc. should be the same as what HA sees. In my case, the /config directory is the default path, but maybe that’s not the same for everyone.

You could try:

shell_command:
  browserless_scraper: "/config/scripts/browserless_scraper.sh {{function}} {{output}}"

So maybe this is the problem – I’ve created the scripts directory inside /config. Is that what you have?

If I SSH into my Home Assistant I come to the “Welcome to the Home Assistant command line”, and the prompt starts with [core-ssh ~]$. I can’t run the docker command as it says “command not found”. This is what I see in terms of directories, and I have the script under config/scripts.

Still, I get this when trying to run it:
FileNotFoundError: [Errno 2] No such file or directory: ‘/config/scripts/browserless_scraper.sh’

I had the same issue.

Adding the word ‘bash’ to the line:

browserless_scraper: "bash ./scripts/browserless_scraper.sh {{function}} {{output}}"

Fixed it for me.

@Linqman - I’m curious if this helped you too.

It did! Thank you very much for the help!

I wrote some code and tested it in https://chrome.browserless.io/.
Then I copied the code to a file js_scrapers/warema_zip_runter.js.
Studio Code does not like the file:

export default async ({ page }: { page: Page }) => {

“Type annotations can only be used in typescript files”

So I changed the filename to warema_zip_runter.ts, and now a different error appears in Studio Code: “Cannot find name ‘Page’”.

Then I tried the service shell_command.browserless_scraper, and I get an error in my_output.html: “missing ) after argument list”.
If I change the export call to export default async ( page ) => { and try again, another error appears in my_output.html: “cannot find page.goto”.


The installation at http://homeassistant.local:3000 is working.

This is my code warema_zip_runter.ts:

// Full TypeScript support for both puppeteer and the DOM
// https://chrome.browserless.io/
export default async ({ page }: { page: Page }) => {
  const optionAll = {
    wa_E5: "Alle runter",
    wa_E6: "Zip runter",
    wa_E7: "Dach runter",
    wa_E8: "Zip hoch",
    wa_E9: "Alle hoch",
  };
  const option = "wa_E6";

  const startTime = Date.now();
  await page.goto('http://webcontrol-warema.t9lewrtkt6fgp5wt.myfritz.net:80', { timeout: 60000 });
  const endTime = Date.now();
  const durationInSeconds = (endTime - startTime) / 1000; // convert milliseconds to seconds
  console.log(`Duration for webcontroll-warema: ${durationInSeconds} seconds`);

  // delay helper
  function delay(time) {
    return new Promise(function (resolve) {
      setTimeout(resolve, time);
    });
  }
  console.log('warten 10s ladezeit warema');
  await delay(10000);

  console.log(`start: ${optionAll[option]}`);
  // Set the value of the select element
  await page.evaluate(`document.getElementById("KanalSelectBox").getElementsByTagName("select")[0].value = "${option}"`);
  // Fire a "change" event so the dropdown menu updates
  await page.evaluate("document.getElementById('KanalSelectBox').getElementsByTagName('select')[0].dispatchEvent(new Event('change'))");
  // Triggering the click only works for scenes, otherwise the slider would have to be moved
  await page.evaluate("document.getElementById('btn-start').click()");
  // Logs show up in the browser's devtools
  console.log(`beendet ${option}: ${optionAll[option]}`);
  return `beendet ${option}: ${optionAll[option]}`;
};

Any help welcome!


I made 2 errors:

  1. I had installed only Browserless Chrome and forgot the Repository Updater from AlexBelgium.
  2. It works with a pure JS signature in my js-file: “export default async ({ page }) => {”

It is working!

The browserless add-on is stuck on 2.2.05 and has no option to update to the latest 2.8.0. Is this a Home Assistant bug? And there is no manual installation process for add-ons.

I tried the “Repository Updater” but I just get an error:
CRITICAL: API request was denied despite using an API token. Missing scopes? Expired token? Invalid token?

What does this do? Why isn’t the repository updated automatically?