Scrape Help

I’m trying to use the scrape sensor to make a simple internet uptime monitor.

My modem (Arris/Motorola SB6141) has a pretty simple web page that displays the uptime at the bottom of the page. If I inspect the page source, the value is stored here:

  <tbody>
  <tr>
    <th style="background-color: rgb(115, 107, 8);"><font color="#ffffff">Cable Modem Operation</font></th>
    <th style="background-color: rgb(115, 107, 8);"><font color="#ffffff">Value</font> </th></tr>
  <tr>
    <td>Current Time and Date</td>
    <td>Apr 03 2018 07:26:40</td></tr>
  <tr>
    <td>System Up Time</td>
    <td>0 days 11h:44m:5s</td></tr>
  </tbody>

I created a scrape sensor config like so:

- platform: scrape
  resource: http://192.168.100.1
  name: Uptime
  select: "td:nth-of-type(21)"

If I am understanding correctly, that should get the value of the 21st field. Maybe?

Instead I get this:

2018-04-03 08:20:45 ERROR (MainThread) [homeassistant.components.sensor] scrape: Error on device update!
Traceback (most recent call last):
  File "/usr/src/app/homeassistant/helpers/entity_platform.py", line 188, in _async_add_entity
    await entity.async_device_update(warning=False)
  File "/usr/src/app/homeassistant/helpers/entity.py", line 327, in async_device_update
    yield from self.hass.async_add_job(self.update)
  File "/usr/local/lib/python3.6/asyncio/futures.py", line 327, in __iter__
    yield self  # This tells Task to wait for completion.
  File "/usr/local/lib/python3.6/asyncio/tasks.py", line 250, in _wakeup
    future.result()
  File "/usr/local/lib/python3.6/asyncio/futures.py", line 243, in result
    raise self._exception
  File "/usr/local/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/src/app/homeassistant/components/sensor/scrape.py", line 120, in update
    value = raw_data.select(self._select)[0].text
IndexError: list index out of range

What am I doing wrong here?

It looks like there are only 4 <td> elements on the page and you’re trying to scrape the 21st one.

Try changing the number.

Sorry, I should have been clearer. That is just the section of the page that has the uptime. I didn’t want to crowd the thread with the full code, but I can post it if it helps. When I searched, it was the 21st instance of <td> out of 23.
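For what it’s worth, `td:nth-of-type(21)` does not mean "the 21st td on the page". In CSS, `nth-of-type` counts an element among the siblings of the same type under its parent, so the cells in each row are numbered 1 and 2 and nothing ever reaches 21. Here is a rough sketch in plain Python (standard library only, not the sensor’s actual code; `CellCollector` and `collect_cells` are names made up for illustration) that prints both numbers for the snippet in the first post:

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect each <td>'s text with its position among the <td>s of its row."""

    def __init__(self):
        super().__init__()
        self.cells = []      # (position_in_row, text), in document order
        self._in_td = False
        self._pos = 0        # running <td> count inside the current <tr>
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._pos = 0    # a new row restarts the sibling count
        elif tag == "td":
            self._pos += 1
            self._in_td = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self.cells.append((self._pos, "".join(self._buf).strip()))
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._buf.append(data)


def collect_cells(html):
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


# Trimmed copy of the table fragment from the first post.
SAMPLE = """
<table><tbody>
  <tr><th>Cable Modem Operation</th><th>Value</th></tr>
  <tr><td>Current Time and Date</td><td>Apr 03 2018 07:26:40</td></tr>
  <tr><td>System Up Time</td><td>0 days 11h:44m:5s</td></tr>
</tbody></table>
"""

cells = collect_cells(SAMPLE)
for doc_index, (pos_in_row, text) in enumerate(cells, start=1):
    # doc_index: position on the whole page; pos_in_row: what nth-of-type sees
    print(doc_index, pos_in_row, text)
```

The uptime cell comes out as the 4th td in this fragment but only the 2nd td in its row, so a selector has to narrow down to the right row first and then ask for the second cell in it.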

It’s pretty hard to get scrape sensors right on the first go. My approach is to just spam them - create like 12 of them from 1-12 just to get my bearings and see how the scraper is reading the page, then I work from there.

Ended up trying exactly that, with

select: "td:nth-of-type(x)" 

and replaced x with each number from 1 through 23, just to see if I got anything. I ended up with 23 copies of the same error message, so now I’m at a complete loss. Here is the full page source, in case anyone can tell me why I’m an idiot.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0037)indexdata.html -->
<HTML><HEAD>
<META content="text/html; charset=windows-1252" http-equiv=Content-Type>
<META content=no-cache http-equiv=Pragma>
<META content="Wed, 30 Apr 1975 02:00:00 GMT" http-equiv=Expires>
<META content="Microsoft FrontPage 4.0" name=GENERATOR>

<script language="JavaScript" src="utility.js" type="text/javascript">
</script>

</HEAD>


<BODY aLink=#7b2939 link=#485a91 text=#000000 vLink=#7b2939 onload="onloadmainpage()">

<script language="javascript" type="text/javascript">
var infoText = 'This page provides information about the startup \
      process of the Cable Modem. If there is a problem with the startup, the \
      word "Failed" may appear in the Status column. Should this occur, visit \
      the Help area and perform the Checkup procedures listed there. If the \
      problem continues, click on the word "Failed" for more detailed \
      information about the failure, or call your service provider for \
      assistance.';

document.write(displayHeader("cm","cmStatus",infoText));
</script>
	   

<CENTER>
<TABLE align=center border=1 cellPadding=8 cellSpacing=0 width="100%">
  <TBODY>
  <TR>
    <TH><FONT color=#ffffff>Task</FONT></TH>
    <TH><FONT color=#ffffff>Status</FONT> </TH></TR>
  <TR>
    <TD>DOCSIS Downstream Channel Acquisition</TD>
    <TD>Done</TD></TR>
  <TR>
    <TD>DOCSIS Ranging</TD>
    <TD>Done</TD></TR>
  <TR>
    <TD>Establish IP Connectivity using DHCP</TD>
    <TD>Done</TD></TR>
  <TR>
    <TD>Establish Time Of Day</TD>
    <TD>Done</TD></TR>
  <TR>
    <TD>Transfer Operational Parameters through TFTP</TD>
    <TD>Done</TD></TR>
  <TR>
    <TD>Register Connection</TD>
    <TD>Done</TD></TR>
  <TR>
    <TD>Cable Modem Status</TD>
    <TD>Operational</TD></TR>
  <TR>
    <TD>Initialize Baseline Privacy</TD>
    <TD>Done</TD></TR>
  </TBODY>
</TABLE>
</CENTER>

<P></P>

<TABLE align=center border=1 cellPadding=8 cellSpacing=0 width="100%">
  <TBODY>
  <TR>
    <TH><FONT color=#ffffff>Cable Modem Operation</FONT></TH>
    <TH><FONT color=#ffffff>Value</FONT> </TH></TR>
  <TR>
    <TD>Current Time and Date</TD>
    <TD>Apr 03 2018 08:59:38</TD></TR>
  <TR>
    <TD>System Up Time</TD>
    <TD>0 days 0h:27m:48s</TD></TR>
  </TBODY>
</TABLE>
	
<P></P>



<script language="javascript" type="text/javascript">
document.write(displayFooter("cm"));
</script>
	  
</BODY>

It seems that Firefox sees the td’s as lower-case and Chrome as upper-case; so I tried both wondering if it was case sensitive and ended up with the same results.
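Tag case is very unlikely to be the culprit. HTML tag names are case-insensitive, and the browsers just display them differently. A quick check, assuming the sensor’s parser behaves like Python’s built-in `html.parser` in this respect (`TagNames` and `tag_names` are illustrative names only):

```python
from html.parser import HTMLParser


class TagNames(HTMLParser):
    """Record the name of every start tag exactly as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def tag_names(html):
    parser = TagNames()
    parser.feed(html)
    parser.close()
    return parser.names


upper = tag_names("<TR><TD>System Up Time</TD><TD>0 days 0h:27m:48s</TD></TR>")
lower = tag_names("<tr><td>System Up Time</td><td>0 days 0h:27m:48s</td></tr>")

# Both spellings are reported with lower-case tag names.
print(upper == lower, upper)
```

If every index from 1 to 23 fails, it may be worth checking that the `resource` URL really returns this document. The saved source mentions `indexdata.html`, and the shell script further down fetches `/indexData.htm` rather than the bare address.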

I have the same problem, and I just want to scrape the one value shown here:

<div class="col-md-6 compra">
<h4 class="pull-left">Compra <span class="pull-right">$ 34,31</span></h4>
</div>

That is, 34,31.
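For this snippet the value sits in a span with class `pull-right` inside a div with class `compra`, so the matching CSS selector would presumably be something like `div.compra span.pull-right` (untested against the real page). A standard-library Python sketch of the same lookup, plus stripping the currency sign; `PriceFinder` is a hypothetical helper, not part of any library:

```python
from html.parser import HTMLParser


class PriceFinder(HTMLParser):
    """Collect text of <span class="pull-right"> found inside <div class="compra">."""

    def __init__(self):
        super().__init__()
        self._div_depth = 0     # > 0 while inside a div carrying class "compra"
        self._in_span = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            # Track nesting so inner divs do not end the match early.
            if self._div_depth or "compra" in classes:
                self._div_depth += 1
        elif tag == "span" and self._div_depth and "pull-right" in classes:
            self._in_span = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1
        elif tag == "span":
            self._in_span = False

    def handle_data(self, data):
        if self._in_span:
            self.texts[-1] += data


SAMPLE = """
<div class="col-md-6 compra">
<h4 class="pull-left">Compra <span class="pull-right">$ 34,31</span></h4>
</div>
"""

finder = PriceFinder()
finder.feed(SAMPLE)
finder.close()

raw = finder.texts[0]                    # text exactly as it appears in the span
value = raw.replace("$", "").strip()     # drop the currency sign and padding
number = float(value.replace(",", "."))  # decimal comma -> decimal point
print(raw, value, number)
```

Note that the attribute quotes have to be plain `"` characters. The curly quotes the forum sometimes substitutes would not parse as attribute delimiters.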

Although a couple year old is the thread, help someone, it might.

Here is a shell script, modified to add MQTT,
from here.

It checks the modem status and restarts the modem if it is down.

#!/usr/bin/env bash

# mqtt setup -------------------------------------------------------------------------- #
mqtthost="hassio.local"
mqtttopic0="modemCheck"
mqtttopic1="netgear"


LOG="$HOME/modemCheck.log"
HOST="192.168.100.1"

# Test for connectivity
curl -m 5 -s $HOST/indexData.htm &> /dev/null

if [ $? -eq 0 ]; then
    # Check modem for Failed status.
    FAILED="$(curl -m 5 -s $HOST/indexData.htm | grep '<TD>' | grep 'Failed')"
    UPTIME="$(curl -s $HOST/indexData.htm | grep 'days' | sed -e 's/    <TD>//g' | sed -e 's/<\/TD><\/TR>//g')"
    if [ "$FAILED" = "" ]; then
        echo "[$(date)] The modem is operational. Uptime: $UPTIME" >> $LOG
        mosquitto_pub -h $mqtthost -q 0 -t $mqtttopic0/$mqtttopic1/state -m "Online"
        mosquitto_pub -h $mqtthost -q 0 -t $mqtttopic0/$mqtttopic1/uptime -m "$UPTIME"

    else
        echo "[$(date)] Modem Failure. Restarting modem. Uptime was: $UPTIME" >> $LOG
        # Quote the URL and percent-encode the spaces so the shell and curl see one valid URL.
        curl -m 5 -s "$HOST/reset.htm?reset_modem=Restart%20Cable%20Modem" &> /dev/null
        echo "[$(date)] Modem restarted." >> $LOG
        mosquitto_pub -h $mqtthost -q 0 -t $mqtttopic0/$mqtttopic1/state -m "Restart"
        mosquitto_pub -h $mqtthost -q 0 -t $mqtttopic0/$mqtttopic1/uptime -m "$UPTIME"

    fi
else
    echo "[$(date)] No network connectivity." >> $LOG
    mosquitto_pub -h $mqtthost -q 0 -t $mqtttopic0/$mqtttopic1/state -m "Offline"

    exit 1
fi
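
If you want the uptime as a number (for graphing, or for spotting a reboot when the value drops), the string the script publishes, e.g. `0 days 11h:44m:5s`, can be converted to seconds. A small Python sketch, assuming the format always looks like the two samples in this thread; `uptime_to_seconds` is a hypothetical helper:

```python
import re

# Matches strings such as "0 days 11h:44m:5s" (format assumed from this thread).
UPTIME_RE = re.compile(r"(\d+)\s+days?\s+(\d+)h:(\d+)m:(\d+)s")


def uptime_to_seconds(text):
    """Convert the modem's uptime string to a total number of seconds."""
    match = UPTIME_RE.search(text)
    if match is None:
        raise ValueError(f"unrecognised uptime string: {text!r}")
    days, hours, minutes, seconds = (int(group) for group in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


print(uptime_to_seconds("0 days 11h:44m:5s"))
print(uptime_to_seconds("0 days 0h:27m:48s"))
```

A value that is lower than the previous reading would then indicate that the modem rebooted in between.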


FYI this won’t work on newer firmware; restarting through the web UI has been disabled “in the name of security”. A smart plug would work, though, by letting you power-cycle the modem.