Scrape sensor improved - scraping multiple values

Yeah I’ve never actually reset the Generac system, I’ll restart it and see if that helps. Thank you so much for bearing with me.

No problem at all, just let me know, I'm glad to help :smile:


Can someone help determine why I get an "unknown" for this "Tankpercentage" sensor? Here is my YAML config. The tank percentage is in the form of "62%".

multiscrape:
  - resource: AmeriGas Login
    scan_interval: 3600
    headers:
      User-Agent: Mozilla/5.0
    form_submit:
      submit_once: True
      select: "form-control-valid"
      input:
        email: myusername
        password: mypassword
    sensor:
      - select: "#layoutDiv > main > div.container.pl-0.pr-0.pl-xl-3.pr-xl-3.pl-lg-3.pr-lg-3.pl-md-3.pr-md-3.pl-sm-0.pr-sm-0 > div:nth-child(2) > div.col-12.col-xl-6.col-lg-6.col-md-12.col-sm-12.pl-0.pr-0.pr-xl-3.pr-lg-3.pr-md-0.pr-sm-0 > div.col-12.bg-white.tankanddeliveries-padding.top-margin > div:nth-child(3) > div.col-12.col-xl-4.col-lg-4.col-md-12.col-sm-12.p-0.mt-3.EstimatedTankDiv > div > div.col-12.p-0.lblvalue-Estimatedtank"
        name: Tankpercentage
        unique_id: Tank_percentage

Here is the error from the log:

Logger: custom_components.multiscrape.sensor
Source: custom_components/multiscrape/sensor.py:139
Integration: Multiscrape scraping component (documentation, issues)
First occurred: December 15, 2021, 10:20:05 PM (19 occurrences)
Last logged: 4:20:05 PM

Sensor Tankpercentage was unable to extract data from HTML

Thanks in advance


:partying_face: :partying_face: :partying_face:

As of release v5.7.0, Multiscrape can also be used as an improved REST component. It now supports JSON in the value_templates, giving you the same syntax as the RESTful sensors, plus all the extras of Multiscrape: form-submit, entity pictures, icon templates, etc.!

:partying_face: :partying_face: :partying_face:


How does it work? I don’t understand… can you give some examples?

Check out this: Issue with Multiscrape / Scrape - #34 by Robi07
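
For reference, here is a minimal sketch of scraping a JSON endpoint this way. The URL and JSON field names are invented for illustration; the value_template syntax follows the RESTful-sensor style mentioned in the announcement above:

multiscrape:
  - resource: https://api.example.com/status   # hypothetical JSON endpoint
    scan_interval: 300
    sensor:
      - unique_id: example_json_temperature
        name: Example JSON temperature
        # no select needed; parse the JSON response via value_json,
        # just like the built-in RESTful sensor
        value_template: "{{ value_json.data.temperature }}"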


I’m trying to get access to my electricity data from powernet.nz
The login page is https://secure.powershop.co.nz

To access it, I’m using

      select: "#Container2"
      input: 
        username: "user"
        password: "password"

The error I’m getting is WARNING (MainThread) [custom_components.multiscrape.sensor] Sensor Powershop daily consumption was unable to extract data from HTML

Either I am using the wrong select or there is something deeper hidden in the webpage that won’t let me log in.
In the inspector I notice a token, that changes with every page reload: <input type="hidden" name="authenticity_token" value="xxxxxxxxxtNom93qzs2QsUFLwYswaz9uWG5PczZzzrJNBvXB78pnrKQUH9ss4vybfoxdaZiL9Bg==">

Could someone help me out?

It is very difficult to help without the username/password :slight_smile: Those hidden input fields are taken into account (submitted) by the form-submit feature though.

Could you try out pre-release 6.0.0? I released it yesterday and it’s stuffed with extra logging and debug information! Also check out the updated troubleshooting part in the readme.
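
For anyone following along: the debug logging referred to here is the standard Home Assistant logger configuration in configuration.yaml, roughly like this (the default level is just an example):

logger:
  default: warning                          # keep other components quiet
  logs:
    custom_components.multiscrape: debug    # full debug output for this component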


Installed 6.0.0.
The log, amongst other things, returns the following:

2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape] # Start loading multiscrape
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape] # Reload service registered
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape] # Start processing config from configuration.yaml
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape] # Found no name for scraper, generated a unique name: Scraper_noname_0
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape] Scraper_noname_0 # Setting up multiscrape with config:
OrderedDict([('resource', 'https://secure.powershop.co.nz/customers/REDACTED/balance'), ('scan_interval', datetime.timedelta(seconds=3600)), ('form_submit', OrderedDict([('submit_once', True), ('resource', 'https://secure.powershop.co.nz/customers/REDACTED/balance'), ('select', '#Container2'), ('input', OrderedDict([('username', '[email protected]'), ('password', 'REDACTED')])), ('resubmit_on_error', True)])), ('sensor', [OrderedDict([('unique_id', 'powershop_daily_consumption'), ('name', 'Powershop daily consumption'), ('select', Template("#unit-balance-container > div.estimated-cost.white-box > p > span")), ('on_error', OrderedDict([('log', 'warning'), ('value', 'last')])), ('force_update', False)])]), ('timeout', 10), ('log_response', False), ('parser', 'lxml'), ('verify_ssl', True), ('method', 'GET')])
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Initializing scraper
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Found form-submit config
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Refresh triggered
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Continue with form-submit
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Requesting page with form from: https://secure.powershop.co.nz/customers/REDACTED/balance
2022-01-14 22:32:11 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Executing form_page-request with a GET to url: https://secure.powershop.co.nz/customers/REDACTED/balance.
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Response status code received: 302
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Start trying to capture the form in the page
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Parse HTML with BeautifulSoup parser lxml
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Try to find form with selector #Container2
2022-01-14 22:32:13 INFO (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Unable to extract form data from.
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Exception extracing form data: list index out of range
2022-01-14 22:32:13 ERROR (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Exception in form-submit feature. Will continue trying to scrape target page
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Updating data from https://secure.powershop.co.nz/customers/REDACTED/balance
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Executing page-request with a get to url: https://secure.powershop.co.nz/customers/REDACTED/balance.
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Response status code received: 302
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Data succesfully refreshed. Sensors will now start scraping to update.
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Start loading the response in BeautifulSoup.
2022-01-14 22:32:13 DEBUG (MainThread) [custom_components.multiscrape] Finished fetching scraper data data in 2.041 seconds (success: True)
2022-01-14 22:32:26 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # Powershop daily consumption # Setting up sensor
2022-01-14 22:32:26 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # Powershop daily consumption # Start scraping to update sensor
2022-01-14 22:32:26 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Powershop daily consumption # Select selected tag: None
2022-01-14 22:32:26 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Exception occurred while scraping, will try to resubmit the form next interval.
2022-01-14 22:32:26 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # Powershop daily consumption # Exception selecting sensor data: 'NoneType' object has no attribute 'name'
HINT: Use debug logging and log_response for further investigation!
2022-01-14 22:32:26 WARNING (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # Powershop daily consumption # Unable to extract data
2022-01-14 22:32:26 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # Powershop daily consumption # On-error, keep old value: None
2022-01-14 22:32:26 DEBUG (MainThread) [custom_components.multiscrape.entity] Scraper_noname_0 # Powershop daily consumption # Updated sensor and attributes, now adding to HA

So it’s stuck finding the form because #Container2 points to a div and not to the form.
Also, the input field has id email instead of username.

Try this:

select: "#Container2 > div > form"
input: 
  email: "user"
  password: "password"

Anyone else having trouble with debugging?
I’m trying to find out why scraping doesn’t work, but even though I’ve enabled debugging as per the user guide, when I set log_response to file, I get a problem when validating the config:

Invalid config for [multiscrape]: [log_response] is an invalid option for [multiscrape]. Check: multiscrape->multiscrape->0->log_response. (See /config/configuration.yaml, line 114).

I even get the same issue with adding log_response to the sample config:

multiscrape:
  - resource: https://www.home-assistant.io
    scan_interval: 3600
    log_response: file
    sensor:
      - unique_id: ha_latest_version
        name: Latest version
        select: ".current-version > h1:nth-child(1)"
        value_template: '{{ (value.split(":")[1]) }}'
      - unique_id: ha_release_date
        icon: >-
          {% if is_state('binary_sensor.ha_version_check', 'on') %}
            mdi:alarm-light
          {% else %}
            mdi:bat
          {% endif %}
        name: Release date
        select: ".release-date"

This option has been added in the latest pre-release (6.0.0). So you need to either enable the pre-release in HACS or wait for the final release.

Never mind, I see I forgot to update the README. The responses are always written to files, so the value should just be ‘True’ instead of ‘file’.

Update: the readme on GitHub has been updated.
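
In other words, the sample config above would be corrected to (only the relevant lines shown):

multiscrape:
  - resource: https://www.home-assistant.io
    scan_interval: 3600
    log_response: true   # a boolean, not 'file'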

It was working with 6.0.0 and True. I managed to figure out the problem: it was special UTF characters in the URL. Thanks for the quick help!

Need help with a multiscrape sensor.

multiscrape:
  - resource: "https://www.amaysim.com.au/my-account/my-amaysim/products"
    name: Amaysim
    scan_interval: 30
    log_response: true
    method: GET
    form_submit:
      submit_once: true
      resubmit_on_error: false
      resource: "https://accounts.amaysim.com.au/identity/login"
      select: "#new_session"
      input:
        username: !secret amaysim_username
        password: !secret amaysim_password
    sensor:
      - select: "#outer_wrap > div.inner-wrap > div.page-container > div:nth-child(2) > div.row.margin-bottom > div.small-12.medium-6.columns > div > div > div:nth-child(2) > div:nth-child(2)"
        name: amaysim_remaining_data
        value_template: "{{ value }}"

I can see from the logs that after successfully submitting the form, it says that it is getting the data from the resource url, with a response code of 200.

I have pasted the contents from the log_response file page_soup.txt below

<html><body><p>/**/('OK')</p></body></html>

Below is the content from the form_submit_response_body.txt

<html><body>You are being <a href="https://accounts.amaysim.com.au/identity">redirected</a>.</body></html>

It seems that after submitting the form, the sensor is scraping data from the intermediate page and not from the resource url.


@danieldotnl, first let me thank you for this great component! I’m using form submit and need to extract to a sensor (or sensor attribute) an XSRF-TOKEN and a cookie that are shown in “page_response_headers.txt”. Is there an easy way? With these I can then call the endpoint address (JSON API) to grab the information that I need, since the page content is created with JavaScript. Can you help?
Thank you!


I am trying to scrape the XML file from my HP Envy printer, but each element contains predicates. This is a small sample of what I get from the printer:

<?xml version="1.0" encoding="UTF-8"?>
<!--THIS DATA SUBJECT TO DISCLAIMER(S) INCLUDED WITH THE PRODUCT OF ORIGIN.-->
<pudyn:ProductUsageDyn xsi:schemaLocation="http://www.hp.com/schemas/imaging/con/ledm/productusagedyn/2007/12/11 ../schemas/ProductUsageDyn.xsd" xmlns:dd="http://www.hp.com/schemas/imaging/con/dictionaries/1.0/" xmlns:dd2="http://www.hp.com/schemas/imaging/con/dictionaries/2008/10/10" xmlns:pudyn="http://www.hp.com/schemas/imaging/con/ledm/productusagedyn/2007/12/11" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
	<dd:Version>
		<dd:Revision>SVN-IPG-LEDM.119</dd:Revision>
		<dd:Date>2010-08-31</dd:Date>
	</dd:Version>
	<pudyn:PrinterSubunit>
		<dd:TotalImpressions PEID="5082">369</dd:TotalImpressions>
		<dd:MonochromeImpressions>0</dd:MonochromeImpressions>
		<dd:ColorImpressions>135</dd:ColorImpressions>
		<dd:A4EquivalentImpressions>
			<dd:TotalImpressions PEID="5082">369</dd:TotalImpressions>
			<dd:MonochromeImpressions>0</dd:MonochromeImpressions>
		</dd:A4EquivalentImpressions>
		<dd:SimplexSheets>64</dd:SimplexSheets>
		<dd:DuplexSheets PEID="5088">143</dd:DuplexSheets>
		<dd:JamEvents PEID="16076">3</dd:JamEvents>
		<dd:MispickEvents>2</dd:MispickEvents>
		<dd:TotalFrontPanelCancelPresses PEID="30033">4</dd:TotalFrontPanelCancelPresses>
		<pudyn:UsageByMarkingAgent>
			<dd2:CumulativeMarkingAgentUsed PEID="64100">
				<dd:ValueFloat>12</dd:ValueFloat>
				<dd:Unit>milliliters</dd:Unit>
			</dd2:CumulativeMarkingAgentUsed>
			<dd2:CumulativeHPMarkingAgentUsed PEID="64101">
				<dd:ValueFloat>12</dd:ValueFloat>
				<dd:Unit>milliliters</dd:Unit>
			</dd2:CumulativeHPMarkingAgentUsed>
			<dd:CumulativeHPMarkingAgentInserted PEID="64001">
				<dd:ValueFloat>14</dd:ValueFloat>
				<dd:Unit>milliliters</dd:Unit>
			</dd:CumulativeHPMarkingAgentInserted>
		</pudyn:UsageByMarkingAgent>
	</pudyn:PrinterSubunit>
</pudyn:ProductUsageDyn>

I have created the following sensor:

- resource: http://10.0.0.97/DevMgmt/ProductUsageDyn.xml
  scan_interval: 10
  method: get
  sensor:
    - name: HP Printer Total Impressions
      unique_id: hp_printer_total_impressions
      select: "TotalImpressions"

But I get an error:

2022-02-11 15:41:46 DEBUG (MainThread) [custom_components.multiscrape.scraper] Updating from http://10.0.0.97/DevMgmt/ProductUsageDyn.xml
2022-02-11 15:41:47 DEBUG (MainThread) [custom_components.multiscrape] Finished fetching scraper data data in 1.857 seconds (success: True)
2022-02-11 15:41:47 DEBUG (MainThread) [custom_components.multiscrape.sensor] Exception selecting sensor data: list index out of range
2022-02-11 15:41:47 ERROR (MainThread) [custom_components.multiscrape.sensor] Sensor HP Printer Total Impressions was unable to extract data from HTML

How do I get select to work within XML?

I have found a HACS HP Printer Integration that will parse the XML, but I would still like to know how to parse the above XML, because my printer includes detail on what type of paper has been used, and the HACS HP Printer Integration does not export those values.

Did you manage to find out how to “skip” the redirect page?
I’m having a similar situation, where after login the scraper is redirected and scrapes the wrong page instead of the resource:
<html><body>You are being <a href="https://secure.powershop.co.nz/">redirected</a>.</body></html>

No, I haven’t. I have raised issue #89 as well.
