Scrape sensor improved - scraping multiple values

I’m using Firefox. You can find it in the developer tools on the Network tab. Open it first, clear everything, and then submit the form. Select the POST request and check the ‘Request’ tab on the right.


Amazing, thanks so much for the help - that was the tip I needed. I have installed Firefox and can see the request data.
I also have the scrape working now - out of interest, I used the Firefox dev tools to get the CSS selector and it gave a totally different one to Chrome… and the Firefox one works perfectly with the multiscrape component! :smiley:

Firefox Gives:
section.activity-blocks__block:nth-child(2) > div:nth-child(1) > div:nth-child(2) > p:nth-child(2)

Whereas Chrome gave:
#flogas-content > div:nth-child(2) > div > div > section:nth-child(2) > div > div.activity-blocks__secondary > p:nth-child(2)


Anyone know how I can select a sub-product with a “select element click” like on this site before scraping the data?
Thanks.

I’m trying to scrape a timetable from Marprom Interaktivni vozni redi.

But the CSS selector from Chrome or Firefox doesn’t work. Does scrape support tables, nth-child selectors, etc.?
CSS selector from chrome: table.table:nth-child(9) > tbody:nth-child(2) > tr:nth-child(2) > td:nth-child(2)

I’ve spent a few days now trying to debug my login config. I found a missing ‘/’, which allowed me to get further along, but now I am getting an error message:

2022-08-05 01:23:09 DEBUG (MainThread) [custom_components.multiscrape.http] VIA Farm - Chameleon Sensors # Error executing POST request to url: https://via.farm/signinwithpassword/.

Error message:

HTTPStatusError("Client error '403 Forbidden' for url 'https://via.farm/signinwithpassword/'\nFor more information check: https://httpstatuses.com/403")

The form_submit_response_body.txt includes the following info:

  <p>CSRF verification failed. Request aborted.</p>

  <p>You are seeing this message because this HTTPS site requires a &#39;Referer header&#39; to be sent by your Web browser, but none was sent. This header is required for security reasons, to ensure that your browser is not being hijacked by third parties.</p>
  <p>If you have configured your browser to disable &#39;Referer&#39; headers, please re-enable them, at least for this site, or for HTTPS connections, or for &#39;same-origin&#39; requests.</p>
  <p>If you are using the &lt;meta name=&quot;referrer&quot; content=&quot;no-referrer&quot;&gt; tag or including the &#39;Referrer-Policy: no-referrer&#39; header, please remove them. The CSRF protection requires the &#39;Referer&#39; header to do strict referer checking. If you&#39;re concerned about privacy, use alternatives like &lt;a rel=&quot;noreferrer&quot; ...&gt; for links to third-party sites.</p>

As part of my debugging process (and to set myself another challenge), I attempted the scraping in Python and managed to get it working there (that’s how I found the missing trailing ‘/’ in my resource URL). So I know it can be scraped, but it seems to want a Referer header, which HA isn’t sending.
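
I’m wondering whether adding a headers block like this would help (I haven’t confirmed whether multiscrape also sends custom headers on the form submit request, so this is just a guess):

multiscrape:
  - resource: MY_SCRAPE_URL            # placeholder for the page I actually scrape
    headers:
      Referer: "https://via.farm/"     # guessing any same-site referer would satisfy the CSRF check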

Anyone else come across this error?

I’m trying to scrape values from a weather service, and the precipitation values use a comma as the decimal separator in the forecast. I use select_list like this:

multiscrape:
  - resource: https://www.foreca.fi/Finland/Raisio/10vrk
    scan_interval: 600
    log_response: true
    sensor:
      - unique_id: foreca_10_day
        name: Foreca 10 day
        select: "#updated p"
        value_template: "{{ value }}"
        attributes:
          - name: "condition"
            select_list: "#tenday .row .day a"
            attribute: "title"
          - name: "datetime"
            select_list: "#tenday .row .day a h5"
          - name: "wind_bearing"
            select_list: "#tenday .row .day a .dayfc .wind .wd img"
            attribute: "title"
          - name: "temperature"
            select_list: "#tenday .row .day a .dayfc p.tx span"
          - name: "templow"
            select_list: "#tenday .row .day a .dayfc p.tn span"
          - name: "wind_speed"
            select_list: "#tenday .row .day a .dayfc .wind .ws em"
          - name: "precipitation"
            select_list: "#tenday .row .day a .dayfc .rain em"

And all the other values are good for me and I can manipulate them for my custom weather entity, but the precipitation values, like I said, produce a weird string due to the comma separator:
precipitation: 10,9,22,2,0,0,4,0,0,0,0,7,7,4,5
in that case it should be 10.9, 22.2, 0, 0.4, 0, 0, 0, 0, 0, 7.7, 4.5

I have been trying to figure out a way to fix it, but is it a lost cause? If I split by comma then my values are completely wrong, and there are no period-separated precipitation values on that website.
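
For example, in the template editor the split just gives a flat list of number fragments, with no way to tell where one value ends and the next begins:

{{ "10,9,22,2,0,0,4".split(",") }}

That returns ['10', '9', '22', '2', '0', '0', '4'], which could just as well be 10.9 / 22.2 / 0 / 0.4 as 10 / 9.22 / 2.0 / 0.4.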

@brendio @3raser95 I’m currently on vacation and unfortunately cannot look into your issues over the coming weeks.
If you can, feel free to create a PR.

This is such a great custom component. I’m using it to scrape bin collection dates (probably not the most imaginative use, but useful). This was quite easy, as my address is represented by a unique ID which forms part of the URL, so I didn’t need to use forms.

I’m trying to do the same at my parents’ house, but they live in a different council area, so the website is different, too. I need to input the Post Code and hit a “Find address” button, which then brings up a drop-down selection box from which I need to select the address… The page then reloads and shows the collection dates. There seem to be two submissions.

Would someone be able to help me check if scraping will even be possible?

Start page: https://www.barnet.gov.uk/citizen-home/rubbish-waste-and-recycling/household-recycling-and-waste/collections-for-postcode.html

Post Code: ‘N2 9ED’ - note: this is not my parents’ post code, but a random post code in Barnet. All post codes are available on Google Maps anyway.
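
For the first step I was imagining something along these lines, but I don’t see how to handle the address drop-down that appears afterwards (the form selector and field name below are pure guesses, not taken from the actual page, and I haven’t got as far as the sensor selectors yet):

multiscrape:
  - resource: https://www.barnet.gov.uk/citizen-home/rubbish-waste-and-recycling/household-recycling-and-waste/collections-for-postcode.html
    form_submit:
      resource: https://www.barnet.gov.uk/citizen-home/rubbish-waste-and-recycling/household-recycling-and-waste/collections-for-postcode.html
      select: "form"           # guess: the postcode search form
      input:
        postcode: "N2 9ED"     # guessed field name - the real one is in the page source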

Any help is very much appreciated!!

Thanks for detailing your journey here. As a fellow Flogas user, I’ve been trying to follow but can’t seem to get the scrape to work. Would you be kind enough to post your config minus any personal details please?

Sure, here you go - note it still seems a bit finicky, failing the login the first time. I thought it was sorted, but I think that was because I had the scan interval set really low, so it had already done the first attempt by the time I checked.

multiscrape:
  - resource: 'https://my.flogas.co.uk/account/telemetry'
    scan_interval: 43200
    method: 'GET'
    log_response: false
    form_submit:
      submit_once: true
      resource: 'https://my.flogas.co.uk'
      select: "#login_form"
      input:
        email_address: YOUR_EMAIL
        password: 'YOUR_PASSWORD'
        remember_me: "0"
        redirect: "/account/telemetry"
        ajax_origin: "login_form"
        module: "flogas\\portal\\login_form"
        ajax_act: "do_ajax_submit"
        rp: ""
        window_location: "/"
        hc_timing: "94409"
        ajax_hash: "1ed3133e89ddf3cca1414d81cec88be967c0921fc1da54dcd96f1d36200ae4579db43545c52610e7125856ed9d67920ae41b6b920919225448326e8e2e88f416"
    sensor:
      - select: 'section.activity-blocks__block:nth-child(1) > div:nth-child(1) > div:nth-child(2) > p:nth-child(2)'
        value_template: '{{ value.replace("%","") }}'
        unit_of_measurement: "%"
        name: LPG_Level_Percent
        on_error:
          value: last
      - select: 'section.activity-blocks__block:nth-child(1) > div:nth-child(1) > div:nth-child(2) > p:nth-child(3)'
        value_template: '{{ value.replace(" litres","") }}'
        unit_of_measurement: "litres"
        name: LPG_Level_Litres
        on_error:
          value: last

I have the interval set to 12 hours since the telemetry data is ‘live’, so there’s no point trying to update it all the time.
If you have the same problem as me, you will initially find the sensor value is ‘unavailable’ since the first login attempt didn’t work… To ‘fix’ this, I just go to dev tools and reload the multiscrape component, and then it logs in OK and pulls the stats from the page.
It would be nice to not have to do this, either by restoring the previous value at startup, or somehow fixing the login so it doesn’t need two attempts before working… but I’ve not had any luck yet - if you do, please let me know :smiley:

Maybe you can try to create an automation that will call the multiscrape trigger service when the sensor becomes unavailable? (instead of manual reloading)
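
Something like this, perhaps (the service name below is a guess based on how multiscrape names its trigger services - check the exact name in Developer Tools → Services, and adjust the entity id to match your sensor):

automation:
  - alias: Retrigger multiscrape when the Flogas sensor is unavailable
    trigger:
      - platform: state
        entity_id: sensor.lpg_level_percent
        to: "unavailable"
        for: "00:02:00"
    action:
      - service: multiscrape.trigger_flogas   # guessed service name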


Didn’t even spot the service, will give it a try - thanks :laughing:

A massive thank you to both. I am now connected, and I did experience the same issue re the need to log in twice. As suggested, I set up an automation to restart the service if it was unavailable, and can confirm that was a fix for my system.

Hi,
I am trying to scrape a website that requires login.

My Configuration:

multiscrape:
- resource: 'https://xxxxx.com/studportal/Student-Home.aspx'
  scan_interval: 3600
  form_submit:
    submit_once: True
    resource: 'https://xxxxx.com/studportal/frmLogin.aspx'
    select: "#aspnetForm"
    input:
      ctl00_ContentPlaceHolder1_txtUserID_Raw: 'user'
      ctl00$ContentPlaceHolder1$txtUserID: 'user'
      ctl00_ContentPlaceHolder1_txtPassword_Raw: 'passwd'
      ctl00$ContentPlaceHolder1$txtPassword: 'passwd'
      ctl00$ContentPlaceHolder1$BnLogin: Login
      #ctl00$ContentPlaceHolder1$BnForgotPassword: Login
  sensor:
    - select: '#studInfoPanelDetailTBL > tr > td'
      name: studentname
  log_response: true

Form submitted by HA: [screenshot]

Form submitted in browser on successful login: [screenshot]

From the log response I see that the form is submitted for ‘Forgot Password’ instead of ‘Login’.
Can anybody suggest how I should configure this to log in?

Hi Joem,

Did you ever manage to get this working? I’m trying to do the same thing, but you appear to be the only example I can find that is trying this with Powershop NZ.

Hi Swifty, after a couple of days of working beautifully, Flogas changed the site layout - is your config still working?

Unfortunately not, they have massively changed the website (typical, as it’s been the same for the last ~7 years I’ve been with them!).
From what I can tell (I’m absolutely no expert in this area) it’s doing some kind of JavaScript chunking to load the site, as the log_response results now contain very little for the login page - mostly just JavaScript references as far as I can see.

I’m afraid I’m not sure where we go from here unless @danieldotnl knows of some way to get the site data to load?

No, it will be complicated :frowning:

Hi, I’m trying to follow the price of one item at multiple grocery stores.
For some it works perfectly, but for the others the log returns an almost empty page, with only CSS and scripts in the body…
Can someone try?
Example: Épicerie Maxi | Faites votre épicerie en magasin ou en ligne

multiscrape:
  - resource: https://www.maxi.ca/beurre-d-arachide-cr-meux/p/20039581001_EA
    scan_interval: 3600
    log_response: True
    headers:
      user-agent: Mozilla/5.0
      accept-encoding: gzip
      accept-language: fr-CA
    sensor:
      - unique_id: multiscrape_test
        name: Multiscrape Test
        select: "#site-content > div > div > div.product-tracking > div > div.product-details-page-details__content__name > div > div > div.product-details-page-details__content__sticky-placeholder > div > div.product-details-page-details__content__prices > div > div > div > span > span.price__value.selling-price-list__item__price.selling-price-list__item__price--now-price__value"
        value_template: '{{ value | replace (",", ".") }}'

and the log returns this body:

> <body data-engine-version="0.5.6">
> <div id="root">
> <style>
>             *,:after,:before{box-sizing:border-box}*,:after,:before{box-sizing:border-box}@keyframes spin{0%{transform:rotate(0)}100%{transform:rotate(360deg)}}body,html{height:100%;margin:0;padding:0;font-family:Univers,Helvetica,Arial,sans-serif;font-size:12px;font-weight:400}#root{position:relative;height:100%;z-index:1}.root-spinner-wrapper{display:-ms-flexbox;display:flex;-ms-flex-pack:center;justify-content:center;-ms-flex-align:center;align-items:center;height:100%;width:100%}.root-spinner{content:'';animation:spin 1s linear infinite;border-radius:50%;border-color:#000 #000 transparent transparent;border-style:solid;border-width:.3rem;width:50px;height:50px;margin:0 auto}
>             </style>
> <div class="root-spinner-wrapper">
> <div class="root-spinner"></div>
> </div>
> </div>
> <div id="privacy-policy"></div>
> <script>
>             window['ldBronxCDNPath'] = 'https://assets.loblaws.ca/pcx_bronx_fe_prod/builds/production/2.34.34/3778e612/maxi-mkt';
>             window['ldBronxAppBundle'] = 'https://assets.loblaws.ca/pcx_bronx_fe_prod/builds/production/2.34.34/3778e612/maxi-mkt/maxi-mkt-bundle.js';
>         </script>
> <script async="true" src="https://assets.loblaws.ca/pcx_bronx_fe_prod/builds/production/2.34.34/3778e612/maxi-mkt/maxi-mkt-vendor.bundle.js"></script>
> </body>

Thanks

Anyone know what’s going on here? No sensor is created.

- platform: multiscrape
  name: MinEnergi scraper
  resource: https://minenergi.elvaco.se/#!/consumption?selection=1day&medium=el&offset=0
  verify_ssl: false
  scan_interval: 60
  selectors:
    kwh:
      name: Förbrukning
      select: 'table > tr:nth-child(1) > td:nth-child(3)'
  prelogin:
      preloginpage: https://minenergi.elvaco.se/#!/login
      preloginform: 'login()'
      username_field: 'username'
      password_field: 'password'
      username: 'x'
      password: 'x'