Help scraping a Google Sheet

So moving into the digital age my wife made a Lunch schedule in a google sheet and shared that with the family. But I want to add it into my hass install to!
The Google Sheet is published online and I can access it but I’m having a hard time scraping it. Teh HTML looks like this:

<table class="waffle" cellspacing="0" cellpadding="0">
  <thead>
    <tr>
      <th class="row-header freezebar-origin-ltr header-shim row-header-shim"></th>
      <th id="67C0" style="width:243px" class="header-shim"></th>
      <th id="67C1" style="width:440px" class="header-shim"></th>
    </tr>
  </thead>
  <tbody>
    <tr style="height:54px;">
      <th id="67R0" style="height: 54px;" class="row-headers-background row-header-shim"><div class="row-header-wrapper" style="line-height: 54px;">1</div></th>
      <td class="s0" dir="ltr" colspan="2">Vecka 8</td>
    </tr>
    <tr style="height:40px;"><th id="67R1" style="height: 40px;" class="row-headers-background row-header-shim"><div class="row-header-wrapper" style="line-height: 40px;">2</div></th>
      <td class="s1" dir="ltr">Dag</td>
      <td class="s1" dir="ltr">Mat</td>
    </tr>
    <tr style="height:40px;">
      <th id="67R2" style="height: 40px;" class="row-headers-background row-header-shim"><div class="row-header-wrapper" style="line-height: 40px;">3</div></th>
      <td class="s2" dir="ltr">Måndag</td>
      <td class="s3" dir="ltr">Limekyckling och broccoli + ris</td>
    </tr>
    <tr style="height:40px;">
      <th id="67R3" style="height: 40px;" class="row-headers-background row-header-shim"><div class="row-header-wrapper" style="line-height: 40px;">4</div></th>
      <td class="s4" dir="ltr">Tisdag</td>
      <td class="s5" dir="ltr">Köttbullar och mos/broccoli + gräddsås</td>
    </tr>
etc etc etc.

I’ve tried using select: ".waffle .s3" works and it picks out the first row with the s3 class. the problem is that the same class is in more place further down so that won’t work.
Using select: "tr#67R2 > .s3" just throws me an error in the log file, I even tried select: 'td:nth-of-type(4)' but that to gave me errors for some reason

[edit, things I’ve tried so far]
select: "tr#67R2 > .s3" = soupsieve.util.SelectorSyntaxError: Malformed id selector at position 2
line 1:
select: "td:nth-of-type(5)" = no values
select: ".waffle td:nth-of-type(5)" = no values
select: "#67R7~ .s7" = Malformed id selector at position 0, so I can’t start with a #
select: ".waffle s3:nth-of-type(3)" = no values (Tried to get the 3d time s3 was mentioned)

Running out of ideas now… anyone with a suggestion?

So this is how far I’ve gotten so far! Looking as it should besides the duplicates that I can’t find out the correct syntax for!
Current setup with scrape sensors is here
image

Hi, I happened upon this during a google search. I see you were having an issue with soupsieve:

select: "tr#67R2 > .s3" = soupsieve.util.SelectorSyntaxError: Malformed id selector at position 2

People run into these kinds of issues, and aren’t familiar with why they get them unless they are used to using real CSS selectors. #67R2 is an invalid ID because in CSS, an ID selector is a # followed by a CSS identifier, and CSS identifiers cannot start with numbers unless you use a CSS escape code. To represent 6 you have to use the hexadecimal escape code \36 (we follow it with a space so not to capture 7 in the escape code): r"#\36 7R2".

If nothing, I hope that at least eases the confusion as to why the errors occur. Soup Sieve simply follows the CSS specifications.

This confuses people because Beautiful Soup (which uses Soup Sieve in the latest releases) used to have a select implementation in old versions that did not follow the CSS specifications. Soup Sieve does follow the spec, so things like this can throw errors and people aren’t sure why.

Just something to be aware of.

Note: I am the author of Soup Sieve.

Thanks for the tip, so I tried the following now
select: "th#\x36 7R2~ .s7" and that still gives me

soupsieve.util.SelectorSyntaxError: Malformed id selector at position 2
  line 1:
th#6 7R2~ .s7
  ^

So it seems that the problem lies in the # is there any other way I can get it to use the th id="67R2"?

That is because you are using Python string escapes. In my example, I did not use \x36 I used \36 and I put it in a Python raw string r"string", not a normal string "string". It may make more sense if I demonstrate:

We need to make sure the CSS selector parser gets th#\36 7R2 ~ .s7. You can tell in your error, that is not what it got th#6 7R2~ .s7.

In Python, if I print th#\x36 7R2 ~ .s7, I will get what your code is failing on:

>>> print('th#\x36 7R2 ~ .s7')
th#6 7R2 ~ .s7

This is not what we want.

Now if we use a raw string, things change a bit. Now the \ is treated as a literal \ and not an escape:

>>> print(r'th#\x36 7R2 ~ .s7')
th#\x36 7R2 ~ .s7

This is also the same as using a normal string and escaping the \ to ensure we get a literal \:

>>> print('th#\\x36 7R2 ~ .s7')
th#\x36 7R2 ~ .s7

As I mentioned earlier, the parser needs to get a string that uses a CSS escape code, not a Python escape code. So \x36 is not correct. CSS starts a hex escape code with \ and consumes every hex number after unless you provide a space. So what we actually want is:

>>> print('th#\\36 7R2 ~ .s7')
th#\36 7R2 ~ .s7

or with the raw string:

>>> print(r'th#\36 7R2 ~ .s7')
th#\36 7R2 ~ .s7

Now let’s feed this through soupsieve to show that it will not break:

>>> import soupsieve as sv
>>> sv.compile(r'th#\36 7R2 ~ .s2', flags=sv.DEBUG)
## PARSING: 'th#\\36 7R2 ~ .s2'
TOKEN: 'tag' --> 'th' at position 0
TOKEN: 'id' --> '#\\36 7R2' at position 2
TOKEN: 'combine' --> ' ~ ' at position 10
TOKEN: 'class' --> '.s2' at position 13
## END PARSING
SoupSieve(pattern='th#\\36 7R2 ~ .s2', namespaces=None, custom=None, flags=65536)

Selector is parsed with no errors.

Hope that helps. I probably shouldn’t have assumed that you caught that I was using a raw string, but as I mentioned earlier, you can also use normal strings, you just have to escape the \: 'th#\\36 7R2 ~ .s2'.

Ah, I through the r was a typo and just using th#\36 gave me unknown escape sequence

Thanks for the explanation but I must be super stupid… I tried almost all your suggestions and got the following result:


Using the original suggestion (select: r"#\36 7R2 ~ .s3")

soupsieve.util.SelectorSyntaxError: Invalid character '"' position 1
  line 1:
r"#\36 7R2 ~ .s3"
 ^

Using paranthesis (select: (r'th#\36 7R2 ~ .s3')

soupsieve.util.SelectorSyntaxError: Invalid character '(' position 0
  line 1:
(r'th#\36 7R2 ~ .s3')
^

Using double backslash to escape the bakslash (select: 'th#\\36 7R2 ~ .s7')

soupsieve.util.SelectorSyntaxError: Invalid character '7' position 8
  line 1:
th#\\36 7R2 ~ .s7
        ^

Using your raw string from the post above (select: r'th#\36 7R2 ~ .s7')

soupsieve.util.SelectorSyntaxError: Invalid character "'" position 1
  line 1:
r'th#\36 7R2 ~ .s7'
 ^

And also trying raw string with paranthesis (select: (r'th#\36 7R2 ~ .s7'))

soupsieve.util.SelectorSyntaxError: Invalid character '(' position 0
  line 1:
(r'th#\36 7R2 ~ .s7')
^

Unfortunately, I’m not familiar with the interface you are using. From a pure Python perspective what I’m saying is true.

I’m not familiar with this forum or the tools you may be using though that wrap around SoupSieve and BeautifulSoup. It appears from the syntax you keep posting that your are not using Python directly, or there is some level of indirection that sends what you input down to the underlying library.

It is possible to get a Google sheet as XML or json. That might be easier:
https://coderwall.com/p/duapqq/use-a-google-spreadsheet-as-your-json-backend

No worries, learned a lot of Python from you!
I’m working in yaml and Home-Assisstant so that explains it! :smiley:

Sweet! Getting it as json changes everything! Then I can use a Rest sensor instead of scraping it!

Oh, YAML! That makes more sense. Theoretically it should be possible to send that through properly down to Python, but JSON makes way more sense.

1 Like

So, since I suck at figuring out how to do templating I just did it my way instead!

if it works, it works! :smiley:
image

1 Like

How to convert google spreadsheet to json. your spreadsheets ID 2PACX-1vTu-7iqusFN7jReZI_8CbE1eQigFHBanT4VgQUQ8g7e8cDClt7s47-IazkQuPlGijUDh4aSr4NmrZ-f is not same in Jason link ID 1cEHjlTG8j4QpMZU6bAmqlf11e3vj7bh9hL5cNgHPoU4

https://docs.google.com/spreadsheets/d/e/2PACX-1vTu-7iqusFN7jReZI_8CbE1eQigFHBanT4VgQUQ8g7e8cDClt7s47-IazkQuPlGijUDh4aSr4NmrZ-f/pubhtml?gid=67&single=true

https://spreadsheets.google.com/feeds/list/1cEHjlTG8j4QpMZU6bAmqlf11e3vj7bh9hL5cNgHPoU4/default/public/values?alt=json

HI. Sorry to resurrect this, but this is exactly what I want to do with my wife making a dinner list. Could I have the code you are using in yaml? This is exactly what I am looking for.

Thanks.