Scraper question

gr4z · January 27, 2020, 12:39pm

Hi, I am trying to get my head around the scraper sensor. I am trying to get my oil tank level from my supplier’s website (BoilerJuice in the UK). They provide a web page where you can check your oil level, and I would like to get this into HA.
Here is the code from their site. I am trying to extract the part which in this example reads ‘(693 litres)’, or the ‘hidden percentage’ value of ‘63’. Can anyone help me, as I am racking my brain trying to understand what to search for? Thanks

		<h3>Oil Remaining</h3>
		<p>All readings approximate</p>

		<div class="jerryCan-container">

			<div class="jerryCan" style="background-image: url(https://s3-eu-west-1.amazonaws.com/boiler-juice-static-assets-prod/images/my-tank/my_tank_only_base.png);">

				<div class="progress-bar-container">
					<div class="progress vertical bottom">
													<div
									class="progress-bar"
									role="progressbar"
									data-transitiongoal="63"
									data-delay="2000"
									data-transition="1500"
									style="
										-webkit-transition: height 1.5s ease-in-out;
										-moz-transition: height 1.5s ease-in-out;
										-ms-transition: height 1.5s ease-in-out;
										-o-transition: height 1.5s ease-in-out;
										transition: height 1.5s ease-in-out;"
							></div>
											</div>
				</div>

				<div class="text">
											<p><span class="single">0</span><span>%</span></p>
						<p>(693 litres)</p>
						<p class="hidden percentage">63</p>
					
				</div>

			</div>

			<div class="side-progress hidden">

				<div class="bar-container">

					<div class="space"></div>

					<div class="bar"></div>

					<div class="status">
						<span class="dot yellow"></span>
						<p>Medium</p>
					</div>


					<div class="bubble" style="background-image: url(https://s3-eu-west-1.amazonaws.com/boiler-juice-static-assets-prod/images/my-tank/[email protected]" >
						<p>Please take a top up</p>
					</div>
				</div>

			</div>

			<div class="clear"></div>

		</div>

code-in-progress · January 27, 2020, 12:55pm

I think it would just be select: "p.hidden, p.percentage". Alternatively, I think you could also do something like select: "div.text > p.percentage". The second one might be a better choice as it requires the <p> tag to be a direct child of the <div> tag.

Another idea would be to use Node-RED with the Cheerio node and scrape it that way. I’ve had a lot of success with Cheerio over BeautifulSoup.
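
To make the difference between the two selectors above concrete, here is a minimal sketch that spells out their matching rules by hand, using only the Python standard library. It is an illustration of the selector semantics, not how the Scrape sensor evaluates them (the sensor relies on BeautifulSoup), and the helper names `classes`, `select_either` and `select_direct_child` are made up for this example. The fragment is the `div.text` block from the first post.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the first post (it happens to be well-formed XML).
FRAGMENT = """
<div class="text">
  <p><span class="single">0</span><span>%</span></p>
  <p>(693 litres)</p>
  <p class="hidden percentage">63</p>
</div>
"""


def classes(el):
    """Return the set of CSS classes on an element."""
    return set(el.get("class", "").split())


def select_either(root):
    """Hand-written equivalent of "p.hidden, p.percentage":
    any <p> carrying at least one of the two classes, at any depth."""
    return [p for p in root.iter("p") if classes(p) & {"hidden", "percentage"}]


def select_direct_child(root):
    """Hand-written equivalent of "div.text > p.percentage":
    a <p class="percentage"> that is an immediate child of <div class="text">."""
    matches = []
    for div in root.iter("div"):
        if "text" in classes(div):
            matches.extend(
                child for child in div
                if child.tag == "p" and "percentage" in classes(child)
            )
    return matches


root = ET.fromstring(FRAGMENT)
print([p.text for p in select_either(root)])        # ['63']
print([p.text for p in select_direct_child(root)])  # ['63']
```

On this fragment both rules find the same element. They would differ on a page where a `p.hidden` or `p.percentage` also appears somewhere outside `div.text`, or nested more deeply inside it.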

gr4z · January 27, 2020, 1:16pm

Thank you. I have used the following:

  - platform: scrape
    resource: https://www.boilerjuice.com/my-tank/
    username: <redacted>
    password: <redacted>
    name: Oil
    select: "div.text > p.percentage"

But I get an error:

ERROR (SyncWorker_0) [homeassistant.components.scrape.sensor] Unable to extract data from HTML

Sorry for the dumb questions, but not sure what I am doing wrong.

code-in-progress · January 27, 2020, 1:21pm

There are never any dumb questions.

What were you using in the select: before? I’m thinking it’s not finding the div.text tag and that’s why you’re getting the error. Can you try the first example I sent?

gr4z · January 27, 2020, 1:25pm

Same error
ERROR (SyncWorker_4) [homeassistant.components.scrape.sensor] Unable to extract data from HTML

code-in-progress · January 27, 2020, 1:29pm

Ok, that sounds like a different problem. Can you try changing select: to select: "h3"? That should grab the <h3>Oil Remaining</h3> tag and return the text.

I have a feeling though that this might be an authentication issue rather than a select issue. But, the above test should reveal that.

gr4z · January 27, 2020, 1:34pm

OK, so I don’t get an error, but the sensor’s state is ‘Sign in to your account’ so not sure if the scraper can get in…

Valentino_Stillhardt · January 27, 2020, 1:34pm

I did a similar thing and the HA component for extracting HTML didn’t find some things.

The thing that did it was using the: https://www.home-assistant.io/integrations/sensor.command_line/

and Regex (https://www.guru99.com/linux-regular-expressions.html)

Test your Regex with https://regex101.com

Done.
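
As a rough sketch of the regex approach described above, applied to the BoilerJuice fragment from the first post: the patterns below pull out the litres figure and the hidden percentage. This only shows the pattern matching. In a command_line sensor the HTML would come from whatever command fetches the page, and the function names and `SAMPLE_HTML` constant are assumptions made for this example.

```python
import re

# Fragment copied from the first post; a real page would be fetched first.
SAMPLE_HTML = """
<div class="text">
  <p><span class="single">0</span><span>%</span></p>
  <p>(693 litres)</p>
  <p class="hidden percentage">63</p>
</div>
"""

# "(693 litres)" -> captures the digits inside the parentheses.
LITRES_RE = re.compile(r"\((\d+)\s*litres\)")
# <p class="hidden percentage">63</p> -> captures the digits in the tag body.
PERCENT_RE = re.compile(r'<p class="hidden percentage">\s*(\d+)\s*</p>')


def extract_litres(html):
    """Return the litres value as an int, or None if the pattern is absent."""
    match = LITRES_RE.search(html)
    return int(match.group(1)) if match else None


def extract_percentage(html):
    """Return the hidden percentage as an int, or None if the pattern is absent."""
    match = PERCENT_RE.search(html)
    return int(match.group(1)) if match else None


print(extract_litres(SAMPLE_HTML))      # 693
print(extract_percentage(SAMPLE_HTML))  # 63
```

Returning `None` when nothing matches is deliberate: if the request lands on a login page instead of the tank page, neither pattern will be found, and the caller can tell the difference between "no data" and a real reading.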

Valentino_Stillhardt · January 27, 2020, 1:36pm

You need to authenticate yourself first.

This may seem pretty obvious, but it can get pretty tricky…

gr4z · January 27, 2020, 1:37pm

Yep, I am using the username and password variables. They don’t seem to work though.

Valentino_Stillhardt · January 27, 2020, 1:38pm

Can you give us some code, so we see what you did?

gr4z · January 27, 2020, 1:39pm

Same as above really

  - platform: scrape
    resource: https://www.boilerjuice.com/my-tank/
    username: <redacted>
    password: <redacted>
    name: Oil
    select: "h3"

code-in-progress · January 27, 2020, 1:39pm

Yup, I was just testing that. Rather than returning a 401 error (unauthorized), the site is redirecting to a login page, which means that the Scrape sensor isn’t sending the credentials properly. You could try it with the authentication: property set to either basic or digest, but neither of those may work depending on how BoilerJuice has set up their authentication. Another idea is something like what @Valentino_Stillhardt suggested, but I would go for a curl route. Curl has a lot more authentication features that would allow you to try to tailor the request to the site.

[EDIT]: Also, take a look at the headers that are coming back when you open the oil level page in a browser. You may need to change the user agent as a lot of sites scan user agent headers and block unknown user agents. You would do that in the headers: configuration of the Scrape sensor.
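
For anyone wanting to check the two suspicions above (a redirect to the login page, and user-agent filtering) outside Home Assistant, here is a small sketch using only Python's standard library. The `USER_AGENT` value and the function names are assumptions for illustration. It does not log in; it only sends a browser-like header and reports whether the request ended up at a different URL than the one asked for.

```python
import urllib.request

TANK_URL = "https://www.boilerjuice.com/my-tank/"  # URL from the thread
# Hypothetical browser-like value; copy the real one from your browser's dev tools.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url, user_agent=USER_AGENT):
    """Build a GET request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def was_redirected(requested_url, final_url):
    """True if the response came from a different URL than the one requested."""
    return requested_url.rstrip("/") != final_url.rstrip("/")


def fetch(url):
    """Perform the request (needs network access) and return (final_url, html)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
        return response.geturl(), html


# Example use (not run here because it needs network access):
#   final_url, html = fetch(TANK_URL)
#   print(was_redirected(TANK_URL, final_url))
```

If `was_redirected` reports a redirect even with a browser-like user agent, the site most likely expects a logged-in session rather than credentials sent with each request, which would explain why basic or digest authentication has no effect.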

gr4z · January 27, 2020, 1:54pm

Alas that is all beyond my knowledge. Thanks anyway, thought it might be easy!

code-in-progress · January 27, 2020, 2:07pm

No problem. Sorry I couldn’t help more.

octaviz · January 31, 2020, 12:37pm

Hi! I am looking for a way to get temperature, humidity, etc. data from a Mobile-Alerts sensor, but I can’t find one. This is the URL:

https://measurements.mobile-alerts.eu/Home/SensorsOverview?phoneid=041873969167

Looking at the source code, I can’t tell whether it is possible to get that data.

Many thanks

code-in-progress · January 31, 2020, 1:17pm

Please edit the URL and remove your number (for security purposes) (and/or contact an admin to edit it).

Yes, the scraper can easily get this data. You’d just need to figure out the XPath to get to the data you want. This tool should help:

http://xpather.com/

Grab the HTML from the URL you posted and paste it into XPather. Then, start writing your query at the top of the screen and you should be able to come up with a query for the select: property.

You could also use this https://selectorgadget.com/ which is a Chrome extension.

octaviz · January 31, 2020, 1:46pm

Do not worry about the ID number, it is a demo mode for testing.
I’ve tried XPather and I don’t know if I’ve done it right. This is what my YAML code looks like:

  - platform: scrape
    name: Temperatura Exterior
    resource: https://measurements.mobile-alerts.eu/Home/SensorsOverview?phoneid=041873969167
    select: "//div[@class='panel panel-default'][1]/div[@class='panel-body'][1]/div[@class='sensor'][1]/div[@class='sensor-component'][2]/h4[1]/text()[1]"

The select value is the one XPather gave me when I selected the temperature values.

I think it’s not right, because it doesn’t work.

octaviz · January 31, 2020, 1:51pm

Hi! The correct code is this: select: ".sensor-component+ .sensor-component h4"
Many thanks!!!
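
A note on why the second attempt works where the XPath expression did not: the Scrape sensor's select: option takes a CSS selector (it is handed to BeautifulSoup), so an XPath query is not understood. The sketch below spells out by hand what ".sensor-component+ .sensor-component h4" asks for, using only the Python standard library. The sample markup and its values are hypothetical, loosely modelled on the class names in the XPath above, and `has_class` and `adjacent_sibling_h4` are names invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one sensor with three components (values are made up).
SAMPLE = """
<div class="sensor">
  <div class="sensor-component"><h5>Timestamp</h5><h4>sample time</h4></div>
  <div class="sensor-component"><h5>Temperature</h5><h4>21.5 C</h4></div>
  <div class="sensor-component"><h5>Humidity</h5><h4>40%</h4></div>
</div>
"""


def has_class(el, name):
    """True if the element's class attribute contains the given class."""
    return name in el.get("class", "").split()


def adjacent_sibling_h4(root):
    """Hand-written equivalent of ".sensor-component + .sensor-component h4":
    every <h4> inside a .sensor-component whose immediately preceding
    sibling is also a .sensor-component, in document order."""
    results = []
    for parent in root.iter():
        children = list(parent)
        for previous, current in zip(children, children[1:]):
            if has_class(previous, "sensor-component") and has_class(current, "sensor-component"):
                results.extend(h4.text for h4 in current.iter("h4"))
    return results


matches = adjacent_sibling_h4(ET.fromstring(SAMPLE))
print(matches)     # ['21.5 C', '40%']
print(matches[0])  # '21.5 C'
```

The first component is skipped because nothing precedes it, so the first match is the second component's value. As far as I know the sensor uses the first match unless an index: is given, which is why this selector returns the temperature.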

finalbillybong1 · May 24, 2020, 11:08am

Did you ever get this sorted? I’m using BoilerJuice too and want to do the same thing.