Scrape sensor improved - scraping multiple values

papacrown · December 4, 2021, 4:44pm

Yeah I’ve never actually reset the Generac system, I’ll restart it and see if that helps. Thank you so much for bearing with me.

malosaa · December 4, 2021, 4:46pm

No problem at all, just let me know. I’m glad to help.

farberm · December 16, 2021, 9:24pm

Can someone help me determine why I get “unknown” for this Tankpercentage sensor? Here is my YAML config. The tank percentage on the page is in the form “62%”.

multiscrape:
  - resource: AmeriGas Login
    scan_interval: 3600
    headers:
      User-Agent: Mozilla/5.0
    form_submit:
      submit_once: True
      select: "form-control-valid"
      input:
        email: myusername
        password: mypassword
    sensor:
      - select: "#layoutDiv > main > div.container.pl-0.pr-0.pl-xl-3.pr-xl-3.pl-lg-3.pr-lg-3.pl-md-3.pr-md-3.pl-sm-0.pr-sm-0 > div:nth-child(2) > div.col-12.col-xl-6.col-lg-6.col-md-12.col-sm-12.pl-0.pr-0.pr-xl-3.pr-lg-3.pr-md-0.pr-sm-0 > div.col-12.bg-white.tankanddeliveries-padding.top-margin > div:nth-child(3) > div.col-12.col-xl-4.col-lg-4.col-md-12.col-sm-12.p-0.mt-3.EstimatedTankDiv > div > div.col-12.p-0.lblvalue-Estimatedtank"
        name: Tankpercentage
      - unique_id: Tank_percentage

Here is the error from Log:

Logger: custom_components.multiscrape.sensor
Source: custom_components/multiscrape/sensor.py:139
Integration: Multiscrape scraping component (documentation, issues)
First occurred: December 15, 2021, 10:20:05 PM (19 occurrences)
Last logged: 4:20:05 PM

Sensor Tankpercentage was unable to extract data from HTML

Thanks in advance

danieldotnl · January 2, 2022, 8:16pm

As of release v5.7.0, Multiscrape can also be used as an improved REST component. It now supports JSON in value_templates, so you can use the same syntax as the RESTful sensors together with all the extras of Multiscrape, e.g. form-submit, entity pictures, icon templates, etc.!

fab_ha_git · January 3, 2022, 9:40pm

How does it work? I don’t understand… Can you give some examples?

danieldotnl · January 6, 2022, 2:28pm

Check out this: Issue with Multiscrape / Scrape - #34 by Robi07

joem · January 11, 2022, 2:42am

I’m trying to get access to my electricity data from powernet.nz
The login page is https://secure.powershop.co.nz

To access it, I’m using

      select: "#Container2"
      input: 
        username: "user"
        password: "password"

The error I’m getting is WARNING (MainThread) [custom_components.multiscrape.sensor] Sensor Powershop daily consumption was unable to extract data from HTML

Either I am using the wrong select or there is something deeper hidden in the webpage that won’t let me log in.
In the inspector I notice a token, that changes with every page reload: <input type="hidden" name="authenticity_token" value="xxxxxxxxxtNom93qzs2QsUFLwYswaz9uWG5PczZzzrJNBvXB78pnrKQUH9ss4vybfoxdaZiL9Bg==">

Could someone help me out?

danieldotnl · January 14, 2022, 9:09am

It is very difficult to help without the username/password. Those hidden input fields are taken into account (submitted) by the form-submit feature, though.

Could you try out pre-release 6.0.0? I released it yesterday and it’s stuffed with extra logging and debug information! Also check out the updated troubleshooting part in the README.

joem · January 14, 2022, 9:37am

I installed 6.0.0.
The log, amongst other things, returns the following:

2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape] # Start loading multiscrape
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape] # Reload service registered
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape] # Start processing config from configuration.yaml
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape] # Found no name for scraper, generated a unique name: Scraper_noname_0
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape] Scraper_noname_0 # Setting up multiscrape with config:
OrderedDict([('resource', 'https://secure.powershop.co.nz/customers/REDACTED/balance'), ('scan_interval', datetime.timedelta(seconds=3600)), ('form_submit', OrderedDict([('submit_once', True), ('resource', 'https://secure.powershop.co.nz/customers/REDACTED/balance'), ('select', '#Container2'), ('input', OrderedDict([('username', '[email protected]'), ('password', 'REDACTED')])), ('resubmit_on_error', True)])), ('sensor', [OrderedDict([('unique_id', 'powershop_daily_consumption'), ('name', 'Powershop daily consumption'), ('select', Template("#unit-balance-container > div.estimated-cost.white-box > p > span")), ('on_error', OrderedDict([('log', 'warning'), ('value', 'last')])), ('force_update', False)])]), ('timeout', 10), ('log_response', False), ('parser', 'lxml'), ('verify_ssl', True), ('method', 'GET')])
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Initializing scraper
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Found form-submit config
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Refresh triggered
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Continue with form-submit
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Requesting page with form from: https://secure.powershop.co.nz/customers/REDACTED/balance
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Executing form_page-request with a GET to url: https://secure.powershop.co.nz/customers/REDACTED/balance.
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Response status code received: 302
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Start trying to capture the form in the page
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Parse HTML with BeautifulSoup parser lxml
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Try to find form with selector #Container2
2022-01-14 22:32:13 INFO (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Unable to extract form data from.
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Exception extracing form data: list index out of range
2022-01-14 22:32:13 ERROR (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Exception in form-submit feature. Will continue trying to scrape target page
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Updating data from https://secure.powershop.co.nz/customers/REDACTED/balance
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Executing page-request with a get to url: https://secure.powershop.co.nz/customers/REDACTED/balance.
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Response status code received: 302
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Data succesfully refreshed. Sensors will now start scraping to update.
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Start loading the response in BeautifulSoup.
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape] Finished fetching scraper data data in 2.041 seconds (success: True)
2022-01-14 22:32:26 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # Powershop daily consumption # Setting up sensor
2022-01-14 22:32:26 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # Powershop daily consumption # Start scraping to update sensor
2022-01-14 22:32:26 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Powershop daily consumption # Select selected tag: None
2022-01-14 22:32:26 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Exception occurred while scraping, will try to resubmit the form next interval.
2022-01-14 22:32:26 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # Powershop daily consumption # Exception selecting sensor data: 'NoneType' object has no attribute 'name'
HINT: Use debug logging and log_response for further investigation!
2022-01-14 22:32:26 WARNING (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # Powershop daily consumption # Unable to extract data
2022-01-14 22:32:26 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # Powershop daily consumption # On-error, keep old value: None
2022-01-14 22:32:26 DEBUG (MainThread) [custom_components.multiscrape.entity] Scraper_noname_0 # Powershop daily consumption # Updated sensor and attributes, now adding to HA

danieldotnl · January 14, 2022, 9:54am

So it’s stuck finding the form because #Container2 is pointing to a div and not to the form.
Also, the input field has the id email instead of username.

Try this:

select: "#Container2 > div > form"
input: 
  email: "user"
  password: "password"
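
To illustrate why the selector has to land on the form element itself, here is a minimal Python sketch using only the standard library. It is not Multiscrape's own code (the log above shows it parsing with BeautifulSoup). The sample HTML, the `FormFieldCollector` class, the `build_payload` helper and the token value are all made up for this illustration. The sketch assumes the hidden authenticity_token sits inside the form, so collecting the form's inputs picks it up, and the configured input values are merged on top by field name.

```python
from html.parser import HTMLParser

# Hypothetical login page, loosely modelled on the structure described above:
# the div with id "Container2" wraps another div, which wraps the form.
SAMPLE_PAGE = """
<div id="Container2">
  <div>
    <form action="/login" method="post">
      <input type="hidden" name="authenticity_token" value="abc123">
      <input type="text" name="email" id="email">
      <input type="password" name="password">
    </form>
  </div>
</div>
"""


class FormFieldCollector(HTMLParser):
    """Collects name/value pairs of <input> tags located inside a <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self.in_form and "name" in attrs:
            # Inputs without a value attribute are stored as an empty string.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def build_payload(html, user_input):
    """Merge the form's own fields (including hidden ones) with user input."""
    collector = FormFieldCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    payload.update(user_input)
    return collector.action, payload


action, payload = build_payload(
    SAMPLE_PAGE, {"email": "user", "password": "password"}
)
print(action)
print(payload)
```

In this sketch, passing `username` instead of `email` would leave the form's real `email` field empty and add an extra field the form does not define, which is why the key names under input have to match the form's field names.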

totesz · January 18, 2022, 3:23pm

Anyone else having troubles with debugging?
I’m trying to find out why scraping doesn’t work, but even though I’ve enabled debugging as per the user guide, when I set log_response to file, I get an error when validating the config:

Invalid config for [multiscrape]: [log_response] is an invalid option for [multiscrape]. Check: multiscrape->multiscrape->0->log_response. (See /config/configuration.yaml, line 114).

I even get the same issue with adding log_response to the sample config:

multiscrape:
  - resource: https://www.home-assistant.io
    scan_interval: 3600
    log_response: file
    sensor:
      - unique_id: ha_latest_version
        name: Latest version
        select: ".current-version > h1:nth-child(1)"
        value_template: '{{ (value.split(":")[1]) }}'
      - unique_id: ha_release_date
        icon: >-
          {% if is_state('binary_sensor.ha_version_check', 'on') %}
            mdi:alarm-light
          {% else %}
            mdi:bat
          {% endif %}
        name: Release date
        select: ".release-date"

danieldotnl · January 18, 2022, 3:53pm

This option has been added in the latest pre-release (6.0.0). So you need to either enable the pre-release in HACS or wait for the final release.

danieldotnl · January 18, 2022, 3:55pm

Never mind, I see I forgot to update the README. The responses are always written to files, so the value should just be ‘True’ instead of ‘file’.

Update: the README on GitHub has been updated.

totesz · January 18, 2022, 4:19pm

It worked with 6.0.0 and True. I managed to figure out the problem: the URL contained special UTF characters. Thanks for the quick help!
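
The post does not say how the URL was changed. One common way to handle non-ASCII characters in a URL, assuming the server expects UTF-8 percent-encoding, is sketched below in Python with the standard library. The `encode_url` helper and the example URL are hypothetical and are not taken from the configuration discussed here.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url):
    """Percent-encode non-ASCII characters in the path and query of a URL."""
    parts = urlsplit(url)
    # "%" is kept as safe so already-encoded sequences are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


# Hypothetical URL, used only to show the effect.
encoded = encode_url("https://example.com/wetter/zürich?ort=Zürich")
print(encoded)
```

The encoded form can then be used as the resource value in place of the raw URL.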

kaizersoje · January 19, 2022, 12:53pm

Need help with a multiscrape sensor.

multiscrape:
  - resource: "https://www.amaysim.com.au/my-account/my-amaysim/products"
    name: Amaysim
    scan_interval: 30
    log_response: true
    method: GET
    form_submit:
      submit_once: true
      resubmit_on_error: false
      resource: "https://accounts.amaysim.com.au/identity/login"
      select: "#new_session"
      input:
        username: !secret amaysim_username
        password: !secret amaysim_password
    sensor:
      - select: "#outer_wrap > div.inner-wrap > div.page-container > div:nth-child(2) > div.row.margin-bottom > div.small-12.medium-6.columns > div > div > div:nth-child(2) > div:nth-child(2)"
        name: amaysim_remaining_data
        value_template: "{{ value }}"

I can see from the logs that after successfully submitting the form, it says that it is getting the data from the resource url, with a response code of 200.

I have pasted the contents from the log_response file page_soup.txt below

<html><body><p>/**/('OK')</p></body></html>

Below is the content from the form_submit_response_body.txt

<html><body>You are being <a href="https://accounts.amaysim.com.au/identity">redirected</a>.</body></html>

It seems that after submitting the form, the sensor is scraping data from the intermediate page and not from the resource url.
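
One way to check what a saved response file actually contains is to test it for the redirect stub pattern shown above. This is a minimal Python sketch, standard library only. The names `LinkCollector` and `redirect_target` are made up for this example and are not part of Multiscrape, and matching on the phrase "You are being" is an assumption based on the two bodies quoted in this thread.

```python
from html.parser import HTMLParser

# Body copied from form_submit_response_body.txt above.
BODY = (
    "<html><body>You are being "
    '<a href="https://accounts.amaysim.com.au/identity">redirected</a>.'
    "</body></html>"
)


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def redirect_target(body):
    """Return the link target if the body looks like a redirect stub, else None."""
    collector = LinkCollector()
    collector.feed(body)
    if "You are being" in body and len(collector.links) == 1:
        return collector.links[0]
    return None


print(redirect_target(BODY))
```

Run against the page_soup.txt content above, the same function returns None, so by this check that file holds a different response than the redirect stub.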

Joao-Sousa-71 · February 5, 2022, 1:33pm

@danieldotnl first let me thank you for this great component! I’m using form submit and need to extract to a sensor (or sensor attribute) an XSRF-TOKEN and a cookie that are shown in the file “page_response_headers.txt”. Is there an easy way? With these I can then call the endpoint address (JSON API) to grab the information that I need, since the page content is created with JavaScript. Can you help?
Thank you!

RuprectDK · February 11, 2022, 2:53pm

I am trying to scrape the XML file from my HP Envy printer, but each element has a namespace prefix. This is a small sample of what I get from the printer:

<?xml version="1.0" encoding="UTF-8"?>
<!--THIS DATA SUBJECT TO DISCLAIMER(S) INCLUDED WITH THE PRODUCT OF ORIGIN.-->
<pudyn:ProductUsageDyn xsi:schemaLocation="http://www.hp.com/schemas/imaging/con/ledm/productusagedyn/2007/12/11 ../schemas/ProductUsageDyn.xsd" xmlns:dd="http://www.hp.com/schemas/imaging/con/dictionaries/1.0/" xmlns:dd2="http://www.hp.com/schemas/imaging/con/dictionaries/2008/10/10" xmlns:pudyn="http://www.hp.com/schemas/imaging/con/ledm/productusagedyn/2007/12/11" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
	<dd:Version>
		<dd:Revision>SVN-IPG-LEDM.119</dd:Revision>
		<dd:Date>2010-08-31</dd:Date>
	</dd:Version>
	<pudyn:PrinterSubunit>
		<dd:TotalImpressions PEID="5082">369</dd:TotalImpressions>
		<dd:MonochromeImpressions>0</dd:MonochromeImpressions>
		<dd:ColorImpressions>135</dd:ColorImpressions>
		<dd:A4EquivalentImpressions>
			<dd:TotalImpressions PEID="5082">369</dd:TotalImpressions>
			<dd:MonochromeImpressions>0</dd:MonochromeImpressions>
		</dd:A4EquivalentImpressions>
		<dd:SimplexSheets>64</dd:SimplexSheets>
		<dd:DuplexSheets PEID="5088">143</dd:DuplexSheets>
		<dd:JamEvents PEID="16076">3</dd:JamEvents>
		<dd:MispickEvents>2</dd:MispickEvents>
		<dd:TotalFrontPanelCancelPresses PEID="30033">4</dd:TotalFrontPanelCancelPresses>
		<pudyn:UsageByMarkingAgent>
			<dd2:CumulativeMarkingAgentUsed PEID="64100">
				<dd:ValueFloat>12</dd:ValueFloat>
				<dd:Unit>milliliters</dd:Unit>
			</dd2:CumulativeMarkingAgentUsed>
			<dd2:CumulativeHPMarkingAgentUsed PEID="64101">
				<dd:ValueFloat>12</dd:ValueFloat>
				<dd:Unit>milliliters</dd:Unit>
			</dd2:CumulativeHPMarkingAgentUsed>
			<dd:CumulativeHPMarkingAgentInserted PEID="64001">
				<dd:ValueFloat>14</dd:ValueFloat>
				<dd:Unit>milliliters</dd:Unit>
			</dd:CumulativeHPMarkingAgentInserted>
		</pudyn:UsageByMarkingAgent>
	</pudyn:PrinterSubunit>
</pudyn:ProductUsageDyn>

I have created the following sensor:

- resource: http://10.0.0.97/DevMgmt/ProductUsageDyn.xml
  scan_interval: 10
  method: get
  sensor:
    - name: HP Printer Total Impressions
      unique_id: hp_printer_total_impressions
      select: "TotalImpressions"

But I get an error:

2022-02-11 15:41:46 DEBUG (MainThread) [custom_components.multiscrape.scraper] Updating from http://10.0.0.97/DevMgmt/ProductUsageDyn.xml
2022-02-11 15:41:47 DEBUG (MainThread) [custom_components.multiscrape] Finished fetching scraper data data in 1.857 seconds (success: True)
2022-02-11 15:41:47 DEBUG (MainThread) [custom_components.multiscrape.sensor] Exception selecting sensor data: list index out of range
2022-02-11 15:41:47 ERROR (MainThread) [custom_components.multiscrape.sensor] Sensor HP Printer Total Impressions was unable to extract data from HTML

How do I select within XML?

RuprectDK · February 14, 2022, 12:49pm

I have found a HACS HP Printer Integration that will parse the XML, but I would still like to know how to parse the above XML, because my printer includes details on what type of paper has been used, and the HACS HP Printer Integration does not export those values.
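
For reference, outside of Multiscrape, namespaced XML like the sample above can be read with Python's standard library by passing a prefix-to-URI mapping to the lookup. This is only a sketch of the general technique on a shortened copy of the sample. Whether and how Multiscrape's select option can express the same lookup is not covered here.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the printer sample above, keeping the namespace declarations.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<pudyn:ProductUsageDyn
    xmlns:dd="http://www.hp.com/schemas/imaging/con/dictionaries/1.0/"
    xmlns:pudyn="http://www.hp.com/schemas/imaging/con/ledm/productusagedyn/2007/12/11">
  <pudyn:PrinterSubunit>
    <dd:TotalImpressions PEID="5082">369</dd:TotalImpressions>
    <dd:ColorImpressions>135</dd:ColorImpressions>
    <dd:JamEvents PEID="16076">3</dd:JamEvents>
  </pudyn:PrinterSubunit>
</pudyn:ProductUsageDyn>
"""

# Prefix-to-URI mapping; the URIs are the ones declared in the document.
NS = {
    "dd": "http://www.hp.com/schemas/imaging/con/dictionaries/1.0/",
    "pudyn": "http://www.hp.com/schemas/imaging/con/ledm/productusagedyn/2007/12/11",
}

root = ET.fromstring(SAMPLE.encode("utf-8"))

# A lookup by the bare tag name finds nothing, because the element's
# real name includes its namespace.
plain = root.find(".//TotalImpressions")

# The same lookup with the namespace prefix and mapping finds the element.
total = root.find(".//dd:TotalImpressions", NS)
jams = root.find("pudyn:PrinterSubunit/dd:JamEvents", NS)

print(plain)
print(total.text, jams.text)
```

The PEID attribute is available as well, for example through `total.get("PEID")`.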

joem · February 21, 2022, 2:49am

Did you manage to find out how to “skip” the redirect page?
I’m in a similar situation: after login the scraper is redirected and scrapes the wrong page instead of the resource:
<html><body>You are being <a href="https://secure.powershop.co.nz/">redirected</a>.</body></html>

kaizersoje · February 21, 2022, 9:57am

No, I haven’t. I have raised issue #89 as well.