Support for xpath?

reef-actor · October 10, 2018, 7:09pm

I have an xpath query to parse some html into json for a sensor. I know I could use command_line sensor and curl+xmllint to do this, but I am hoping I can do it without external dependencies.

'{',
  string-join(
  //td[not(following::td[@class='today'])]//div[@class=('rubbish','recycling_box','recycling_bin')]
    /concat(
      '"',
      normalize-space(text()),
      '":"',
      xs:date(replace(replace(@id,<regex>), <regex>)) + xs:yearMonthDuration('P1M')
      ,'"'
      )
,',')
,'}'

It selects all divs in any td that matches the specified classes and that doesn’t have a td of class today after it. The text of a div becomes a json key and the id is regex manipulated into a date.
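
For comparison, the selection described above can be sketched in plain Python with only the standard library. This is a rough sketch, not a drop-in sensor: the id layout (`c_<day>_<zero-based month>_<year>`), the sample HTML, and the names `CollectionParser` and `collections_to_json` are all assumptions made up for illustration, since the real regexes are elided in the query. It also assumes the "+1 month" is there because the month in the id is zero-based.

```python
import json
import re
from datetime import date
from html.parser import HTMLParser

WANTED = {"rubbish", "recycling_box", "recycling_bin"}
# Hypothetical id layout: ..._<day>_<zero-based month>_<year>
ID_RE = re.compile(r"_(\d{1,2})_(\d{1,2})_(\d{4})$")


class CollectionParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.td_depth = 0       # how many <td> elements we are inside
        self.current_id = None  # id of the wanted div being read, if any
        self.text = []
        self.result = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            self.td_depth += 1
            if attrs.get("class") == "today":
                # Anything collected so far had a td.today after it: drop it.
                self.result.clear()
        elif tag == "div" and self.td_depth and attrs.get("class") in WANTED:
            self.current_id = attrs.get("id", "")
            self.text = []

    def handle_data(self, data):
        if self.current_id is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "td":
            self.td_depth = max(0, self.td_depth - 1)
        elif tag == "div" and self.current_id is not None:
            key = " ".join("".join(self.text).split())  # like normalize-space()
            match = ID_RE.search(self.current_id)
            if match:
                day, month0, year = map(int, match.groups())
                self.result[key] = date(year, month0 + 1, day).isoformat()
            self.current_id = None


def collections_to_json(html):
    parser = CollectionParser()
    parser.feed(html)
    parser.close()
    return json.dumps(parser.result)


SAMPLE = """
<table><tr>
<td><div class="rubbish" id="c_9_9_2018"> Rubbish </div></td>
<td class="today"><div class="recycling_box" id="c_16_9_2018">Recycling  box</div></td>
<td><div class="rubbish" id="c_23_9_2018">Rubbish</div></td>
</tr></table>
"""
print(collections_to_json(SAMPLE))
```

One difference from the XPath version: a dict keeps only the last value for a repeated key, whereas `string-join` would emit every matching div.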

The only beautifulsoup examples I can find are simple in comparison.
Is it possible to do this using something like the scrape sensor?

reef-actor · October 16, 2018, 9:08am

So it turns out that Python has no XPath 2.0 support, and given that the spec is 10 years old at this point, I don’t expect that to change.
xqilla is an app that will process XPath 2.0, but it isn’t available for my platform.
So instead I have created a regex to parse the HTML and do date addition. Behold!

class=\"today\".*?JSC[0-9_]+_(?P<d1>3[01]|[12]?[0-9])(?:(?P<m1_10>10)|(?P<m1_11>11)|(?P<m1_0>0)|(?P<m1_1>1)|(?P<m1_2>2)|(?P<m1_3>3)|(?P<m1_4>4)|(?P<m1_5>5)|(?P<m1_6>6)|(?P<m1_7>7)|(?P<m1_8>8)|(?P<m1_9>9))(?=.*?(?P<m1>(?(m1_0)1)(?(m1_1)2)(?(m1_2)3)(?(m1_3)4)(?(m1_4)5)(?(m1_5)6)(?(m1_6)7)(?(m1_7)8)(?(m1_8)9)(?(m1_9)10)(?(m1_10)11)(?(m1_11)12)))(?P<y1>2[0-9]{3})JSC..\" +class=\"(?P<t1>[^\"]+).*?JSC[0-9_]+_(?P<d2>3[01]|[12]?[0-9])(?:(?P<m2_10>10)|(?P<m2_11>11)|(?P<m2_0>0)|(?P<m2_1>1)|(?P<m2_2>2)|(?P<m2_3>3)|(?P<m2_4>4)|(?P<m2_5>5)|(?P<m2_6>6)|(?P<m2_7>7)|(?P<m2_8>8)|(?P<m2_9>9))(?=.*?(?P<m2>(?(m2_0)1)(?(m2_1)2)(?(m2_2)3)(?(m2_3)4)(?(m2_4)5)(?(m2_5)6)(?(m2_6)7)(?(m2_7)8)(?(m2_8)9)(?(m2_9)10)(?(m2_10)11)(?(m2_11)12)))(?P<y2>2[0-9]{3})JSC..\" +class=\"(?P<t2>[^\"]+).*?JSC[0-9_]+_(?P<d3>3[01]|[12]?[0-9])(?:(?P<m3_10>10)|(?P<m3_11>11)|(?P<m3_0>0)|(?P<m3_1>1)|(?P<m3_2>2)|(?P<m3_3>3)|(?P<m3_4>4)|(?P<m3_5>5)|(?P<m3_6>6)|(?P<m3_7>7)|(?P<m3_8>8)|(?P<m3_9>9))(?=.*?(?P<m3>(?(m3_0)1)(?(m3_1)2)(?(m3_2)3)(?(m3_3)4)(?(m3_4)5)(?(m3_5)6)(?(m3_6)7)(?(m3_7)8)(?(m3_8)9)(?(m3_9)10)(?(m3_10)11)(?(m3_11)12)))(?P<y3>2[0-9]{3})JSC..\" +class=\"(?P<t3>[^\"]+)
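
For anyone trying to follow the pattern, here is a minimal sketch of what appears to be the month trick, scaled down to months 0 to 2. Each zero-based month gets its own named group, and a lookahead with conditional groups then captures the literal text of the incremented month from somewhere later in the input. The hyphenated id layout and the name `PATTERN` are assumptions for illustration only, not the real page format.

```python
import re

# Scaled-down, hypothetical version of the month mapping (months 0-2 only).
PATTERN = re.compile(
    r"_(?P<d>\d{1,2})-"
    # Exactly one of these groups matches the zero-based month.
    r"(?:(?P<m_0>0)|(?P<m_1>1)|(?P<m_2>2))"
    # Each conditional demands its literal only if its group matched, so
    # group "m" captures month + 1 as text found later in the input.
    r"(?=.*?(?P<m>(?(m_0)1)(?(m_1)2)(?(m_2)3)))"
    r"-(?P<y>\d{4})",
    re.DOTALL,
)

match = PATTERN.search('id="x_16-1-2018"')
print(match.group("d"), match.group("m"), match.group("y"))
```

Note the catch: the regex cannot compute anything, so the incremented number has to occur as text somewhere after the month. In this sketch, a zero-based month of 2 only matches if a "3" appears later in the input.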

vloris · October 16, 2018, 12:49pm

I really like the fact that you mention the famous Stack Overflow comment about parsing HTML with regex yourself.

Mahko_Mahko · November 19, 2021, 10:54am

OMG was laughing like a madman at that stackoverflow comment. Not sure I’m convinced yet though;)