Struggling to scrape

byrnecore · March 17, 2023, 12:33pm

Hey team!
I am trying to scrape the North Central fire danger days in Australia from one of these two websites, and no matter how I write the CSS selector, it returns “unknown”. I can return basic values like <title>, so I know it can scrape, but as soon as I get to the content I want, it fails. This is my last resort. I suspect it might be because of lazy loading or JS content that isn’t present when the scrape occurs.

Option 1

  - platform: scrape
    name: "Fire scrape test 2"
    resource: https://www.cfa.vic.gov.au/warnings-restrictions/total-fire-bans-fire-danger-ratings/north-central-fire-district
    select: "#dvFDRCode > div"

Option 2

  - platform: scrape
    name: "Fire scrape test"
    resource: http://www.bom.gov.au/vic/forecasts/fire-danger-ratings.shtml
    select: "#content > div > div > table:nth-child(5) > tbody > tr:nth-child(4) > td:nth-child(2)"

Any pointers or confirmations to my suspicions?

Hellis81 · March 17, 2023, 12:49pm

If you just open the page source, you can see the JavaScript almost at the top.

byrnecore · March 20, 2023, 6:37am

Hellis81, can you help me understand more about how that helps me? I’ve looked through most of the JS, but I couldn’t see any obvious way to grab what I need.

In the second example, it seems like it’s the <table> I am having trouble with… I can easily select IDs above the table, but not in it. It seems like everything in #content comes from the same source of info.

Any more help would be super nice and appreciated

Shaun

Hellis81 · March 20, 2023, 6:43am

You can’t grab it. It’s impossible.
Scrapers only grab the HTML. You need to find a different source, like an API.

chris.huitema · March 20, 2023, 7:35am

Have a look at the BOM integration in HACS. I think it pulls what you need from an API; it has fire ratings and warnings as sensors. I have this info at the top of the first card on my dashboard. Today the fire danger is high.

Troon · March 20, 2023, 7:38am

Taking the second URL first (which does have the info in the static HTML): if you look at the HTML source, you’ll see there isn’t a <tbody> element in the table. Try removing it from your selector. Tools that work out a selector for you will often insert one. The element is optional in the HTML spec, but if you specify it in the selector and it isn’t in the source, the selector will fail. Best to work the selector out manually.
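
As a rough sketch of that change, Option 2 with the tbody step dropped might look like the block below. It assumes the rest of the path and the nth-child positions still match the served HTML, which is not guaranteed because the browser's DOM view can differ from the raw source. The top-level sensor: key is also an assumption about where the entry sits in configuration.yaml.

```yaml
sensor:
  - platform: scrape
    name: "Fire scrape test"
    resource: http://www.bom.gov.au/vic/forecasts/fire-danger-ratings.shtml
    # Same selector as Option 2, minus the "tbody" step that is absent from the served HTML.
    # Assumption: the nth-child indexes are unchanged; verify them against "view source".
    select: "#content > div > div > table:nth-child(5) > tr:nth-child(4) > td:nth-child(2)"
```

If this still returns “unknown”, the row and column indexes are the next thing to check by hand.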

For the first URL, turn on the Network tab of whatever browser DevTools you’re using and hit refresh. Look for an “xhr” (XMLHttpRequest, aka AJAX) request.

Then look at the response.

Oooh! There’s your data!

So this one can be solved with a rest sensor using that URL as a resource:

https://www.cfa.vic.gov.au/api/cfa/tfbfdr/district

…although it’s a POST request and may need some headers set up like the ones the site sends.
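
A minimal sketch of such a rest sensor is below, under loud assumptions: the sensor name, the Content-Type header, the empty payload, and the JSON key in value_template are all placeholders, not values confirmed by the site. Copy the real request headers and body from the DevTools Network tab, and pick the key from the actual response.

```yaml
sensor:
  - platform: rest
    name: "CFA fire danger test"  # hypothetical name
    resource: https://www.cfa.vic.gov.au/api/cfa/tfbfdr/district
    method: POST
    headers:
      # Assumption: replace with the headers the site actually sends.
      Content-Type: application/json
    # Assumption: placeholder body; copy the real request payload from DevTools.
    payload: "{}"
    # Hypothetical key: replace "rating" with a field that exists in the JSON response.
    value_template: "{{ value_json.rating }}"
```

The method, headers, payload, and value_template options are standard for the rest sensor platform; only the values here are guesses.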

byrnecore · March 21, 2023, 1:33am

Oh man… Can’t believe I couldn’t find this before. Thanks!

So much data, and it is accurate! Automations, here I come!

byrnecore · March 21, 2023, 1:35am

Awesome advice. I got the BOM integration working. But I will see if I can get it working myself, for my own learning.

There will be challenges to get around, which is why I didn’t try this right away. Permissions are the issue.

Not sure how to solve for this.

Troon · March 21, 2023, 7:10am

Like I said, it’s a POST request. See the end of my post above.