Hi, I have a JavaScript website that I need to log in to and then pull a value from (rendered via JS). I currently have a script running daily on my laptop to extract this data and import it into a MariaDB database. I previously had all of this running on a Raspberry Pi Supervised instance of HA, via a daily Python script that used Chromium / Selenium to manage it.
I would ideally like to offload this back onto the Home Assistant install (HAOS on an RPi). I have looked into the following a bit, but the fact that it is a JS website seems to keep putting me up against it:
Node-RED with various Selenium webdriver nodes etc. - these all seem to just provide a port to communicate with a remote Selenium install
Node-RED with webdriver - again, as above, it just appears to provide a port to another machine that actually runs it
AppDaemon - the instances of people potentially doing this appear to be years old, and my attempts to follow the few things I found have been unsuccessful.
I am aware that HAOS on a Raspberry Pi is not the most powerful machine, but it doesn't really matter if this takes 1 or even 5 minutes to run (in some instances I have seen it take more than a few minutes). I need some guidance as to how I could achieve this again, keeping in mind the need to log in and to extract a value rendered through JavaScript.
Many thanks; any guidance as to what to look into is much appreciated.
If I look at the Network tab in Developer Tools, the site submits my login payload to an API (I believe it is non-public) and returns a token / UUID in the "sessions" object of the response. I have reached out to the website previously and they advised the API is not accessible: it is in place solely for the webpage rendering, so a user can't query it directly. Looking at the data this way, the value I want to extract is in the first "User" object, as in the following response data, from which I have removed most irrelevant values but left the nesting structure intact in case it is relevant:
Hope this helps to clarify what I am attempting to do, the layout of the website, how the login works, and how the data I want to scrape is generated. Any tips on what to use to get this data are appreciated. As stated, I currently run my Python script daily on my laptop to do this, but I am not always at home and need to offload it to something that can run on a Raspberry Pi with Home Assistant OS (speed, as stated, is irrelevant, though if it takes more than about 10 minutes between submitting the login data and doing the extraction it may time out).
P.S. No idea why the mid section of the code didn't format properly. Tried multiple times unsuccessfully, so I just tabbed it out as best I could to keep the structure readable. Guessing something about its structure is slightly off.
All this head scratching, just to get your super balance daily?
What website? What engine do they use to generate their web page behind the scenes?
Is the website stable, not changing the underlying code at whim, just the data passed back to you?
Does the token/UUID change? How often? Can you re-use it instead of re-authenticating?
Have you considered the “Beautiful Soup” Python web scraping library as well?
Is the data available elsewhere on another website in a different format? JSON, API, etc?
So you have already solved the data extraction challenge?
You could pull all this into a HomeAssistant integration, or alternatively just have it as a Python script you call from a cron job on a dedicated Raspberry Pi (or docker instance) and pass the data to HomeAssistant through something like MQTT.
Do you want the data in the HomeAssistant historic database, or an external MariaDB database? Are you interrogating that database for other purposes on a daily basis, or just keeping it for historic purposes? Why choose MariaDB specifically?
Are you planning on just using the HomeAssistant platform for scheduling and hosting or for integration to other things?
Any need for the cut-down HomeAssistant OS, or is a generic Raspberry Pi platform going to be adequate for your needs? Even a cheap headless $5 Pi Zero W you can just plug into your router's spare USB2 port? Can you host this setup on the cloud, and bypass all this mucking around with HomeAssistant altogether?
Most of your issues are already wrapped into well proven libraries you can call. Tying them together is the challenge. Python is but one choice to do this.
Many options…
Standing on the shoulders of giants
Start here.
Two active HomeAssistant web scraping integrations that spring to mind are the HACS integrations for a SEMS Portal, and the Waste Collection scheduler. Have a look at their code for ideas.
It's not pulling a super balance, but the balance of an investment account. For this investment I can only get the data through the website (i.e. it's not a listed stock whose price I can pull from a stock exchange).
I do also have a range of stocks whose values I pull in daily via simple JSON, and some other mutual funds, but those are easy. These are also in Python at the moment, but given the easy platform that data sits on, I figure they are the easy part of my current problem. All this data is pulled in daily and linked to a front-end software package I made outside of HAOS, which lets me track how my investments are going on a daily, weekly, EOM etc. basis. I also have other systems, outside of all this, which check in on it daily and alert on major concerns (drops of over 10% from the previous day's close). It also tracks all my dividends, which makes tax time so much easier.
I was previously running Home Assistant Supervised with all of this on the one system, but it has become harder to manage recently: when I do OS updates I am increasingly seeing Selenium / Chromium compatibility fall out of sync. This system also runs a range of other things, such as a Plex server, energy monitoring, and a bit of automation based off energy usage.
I don't really want to set up another Raspberry Pi, or have to remember to run the scripts from my PC daily.
All this data is currently stored in MariaDB on my Home Assistant server (Raspberry Pi). This was chosen primarily because it's available as an HA addon. It also stores some energy data (not available on the dashboard). This is the only machine I have that runs all day, every day, and is capable of running scripts etc.
The website is pretty stable for formatting; in almost 6 years I have only had to make 4 changes, due to minor things such as a classname being modified.
I don't know how often the UUID/token changes, as it is all currently handled through Selenium / Chrome without me needing to deal with it programmatically.
Beautiful Soup, urllib and a few others don't work because, as I stated, there is nothing useful in the page source.
Data is not available elsewhere for this one investment. It is not simply a case of me owning x units at a published NAV price. It has been a good investment, but if I can't sort out a solution for this then I will likely need to let it go.
I will look through the links you referenced and see what I can adapt, but the key issues with this website are the need for JS-rendered information and the API not being accessible.
If you can see it, you can scrape it. Even if you have to keyboard stuff it, using something like AutoIT in Windoze to shuffle things in and out of the clipboard.
The cheap headless $5 Pi Zero W you can just plug into your router's spare USB2 port (or a phone charger for power), wirelessly connected, coupled with MQTT for data interchange, is starting to look like the set-and-forget solution. Keep it away from the hustle and bustle of HomeAssistant updates on Raspberry Pi OS. Trigger with a cron job.
Happy coding.
Thanks, but utilising another Raspberry Pi is not really an option I want to consider (and I already have some to spare). I have too many things scattered around the house and near my router, which is on display in the living room (to my wife's disgust). What I am aiming for is having it all on the one Raspberry Pi as it was before (albeit now HAOS instead of Debian with HA Supervised), and without the need to run a set of scripts from my Windows PC each night. Trying to make it all as clean as possible, and HAOS should be easier to maintain than what I had.
Alternatively (and I don't really want to go this route), my other option is to run HAOS in a VM on a NUC I have, which is dedicated to network computer backups. Currently that only runs about once a week, depending on what work people in the household have done (and ideally I don't want it running 24/7).
Like I said, if it all ends up being too hard I will likely consider folding on this investment. Still to get through those links you provided, though, so hopefully I will find something there; that will be a job for later tonight or tomorrow.
This is pretty standard stuff, really.
Do you know the “curl” and “jq” commands?
Your objective is to get that “user object” url.
Firstly, you'll have to log in somehow via "curl", then retrieve that user object, then use "jq" to extract the value from the JSON.
All of that will be in a shell script that you'll call in HA through the "command_line" sensor.
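The HA side of that wiring could look something like this minimal sketch, where the sensor name, script path and polling interval are all placeholders:

```yaml
# configuration.yaml (hypothetical sensor name and script path)
command_line:
  - sensor:
      name: investment_balance
      command: "/config/scripts/fetch_balance.sh"
      scan_interval: 86400  # run once a day
```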
Below, some pointers, based upon a script of mine that does basically what you want.
The actual urls, headers, … you’ll get by inspecting the exchanges in your browser.
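As a rough sketch of the shape such a script could take (every URL, header and jq path below is hypothetical; substitute the ones you see in your browser's Network tab):

```shell
#!/bin/sh
# Hypothetical endpoints; replace with the real ones from the Network tab.
LOGIN_URL="https://example-investment-site/api/login"
DATA_URL="https://example-investment-site/api/user"

# 1. Log in and capture the session token from the JSON response.
TOKEN=$(curl -s -X POST "$LOGIN_URL" \
  -H "Content-Type: application/json" \
  -d '{"email":"[email protected]","password":"myPassword"}' \
  | jq -r '.sessions.token')

# 2. Re-use the token to fetch the user object and extract the balance.
curl -s "$DATA_URL" \
  -H "Authorization: Bearer $TOKEN" \
  | jq -r '.user.balance'
```

Whatever the script prints on stdout becomes the sensor state, so the last line should emit only the value itself.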
Thanks, Koying, for providing an idea. I have spent some time on that and haven't succeeded. I suspect that, given it's a financial entity, the heavy JS rendering plus the tokens and whatever other security is in place are simply blocking me going this route, albeit it's a quick and short possible solution and one I can see myself utilising, as I still have other scrapes to sort which aren't as crucial day to day. Perhaps a key point: if I log in with Developer Tools open in Chrome, I can see the response of that object in the developer section, but if I double-click that object I get an authentication error. On a few other sites where I am logged in, doing the same with equivalent objects shows the response in the browser's main section. So I feel there are layers of protection beyond what I know.
As a side note, while continuing to google this last night I did come across another possible solution: the Browserless Chromium addon. I have only got it working in the debugger at this stage, but it is logging in without issue and pulling the required div class value that I want. There is other info I still need to pull, but this proves it should work, and I can expand from there. So at this stage I may continue down this path for this investment, assuming I have no issue getting the data into the SQL database. Working that out will be my next step before finalising the rest of the harvesting.
Exactly. Just trying to work out my best methods for it, and ideally to make it as consistent as possible between this investment and all the others. Currently exploring Pyscript as an option, as then in theory all the others can stay the same and I will basically just change the data-extraction part for this one investment: either pull the data from Browserless directly, or have Browserless store it in an HA entity and pull it back from there.
Appreciate everyone's guidance on this. I have been looking into it on and off for quite a while, and then in a short timeframe got, I feel, most of the way to a solution. Just wish I had the time to finish it all off today while it is still fresh in my mind.
Finally got it all functional. There were a few headaches getting from the Browserless Chromium debugger to HA and onwards to SQL, but I believe most of these related to it being a financial site: extra measures were needed to enable stealth and explicitly disable headless (which changes the user agent). This method is so much faster than my previous version, with the whole scrape and load into SQL completing in around 10 seconds, as opposed to the 1 minute minimum it used to take with Selenium / Chromium.
Pyscript was the option I ended up going with, mainly due to also needing to send the data out to SQL. There is a Multiscrape integration that is generally used with Browserless Chromium, but if I went that route I would have had to add a separate additional step to send the data onwards to SQL, which is my primary destination. Though, since I have it going to HA as well, I will likely look to incorporate it into my dashboard, which is visible constantly (but that is a project for another time).
Code which was finally used (with all but the first line of the SQL removed, as it is about 100 extra lines and I didn't need to make any mods to that code at all, except adjusting the indentation to match this). So anyone wanting to adapt a previous Python / Selenium setup and move over to this could do the same:
import aiohttp
import time
import datetime
import pymysql

@service
def Investment1():
    browserless_url = "http://localhost:3000/function?headless=false&stealth=true"

    # This is the JS which will be run by Browserless
    js_code = """
    export default async function ({ page }) {
        await page.goto("https://InvestmentSite/auth/login");
        await page.type('input[name="email"]', "[email protected]");
        await page.type('input[name="password"]', "myPassword");
        await Promise.all([
            page.click('button[type="submit"]'),
            page.waitForNavigation({ waitUntil: 'networkidle2' })
        ]);
        await page.waitForSelector('.css-19344h9.e4o4i6u8');
        const result = await page.evaluate(() => {
            return document.querySelector('.css-19344h9.e4o4i6u8').innerHTML;
        });
        return { data: result, type: 'application/json' };
    };
    """

    try:
        # Using aiohttp for non-blocking I/O. requests / urllib cause Home Assistant to
        # freeze unless you use task.executor to offload them to a background thread,
        # which is not advised.
        async with aiohttp.ClientSession() as session:
            async with session.post(
                browserless_url,
                data=js_code,
                headers={"Content-Type": "application/javascript"},
                timeout=60
            ) as response:
                if response.status == 200:
                    res_json = await response.json()
                    scraped_value = res_json.get("data")
                    # Update Home Assistant entity (didn't really need this line, but
                    # wanted to test working within HA before invoking the SQL code)
                    state.set("sensor.Investment1", value=scraped_value)
                    log.info(f"Async Scrape Success: {scraped_value}")  # A secondary check
                else:
                    log.error(f"Browserless Error: {response.status}")
        db = pymysql.connect(host='localhost', user='User', password='Password', database='dbInvest')
        # A series of code directly transferred from my original Python code was placed here.
    except Exception as e:
        log.error(f"Async scraping failed: {e}")