Asking Assist to scrape data/get data from API

Cambionn · January 2, 2025, 1:09pm

I’m not 100% sure if this is the correct categories to place this, but since this is only really useful when working with a Voice Assistant (although the same thing in a different context could be used elsewhere) and it doesn’t really seem to fit better in the other 5 categories I see I’ve placed it here. If it should’ve been elsewhere, my apologies and do let me know.

So I’ve finally got my HA setup last week as replacement to be able to get rid of Google. I run everything locally on a small PC, and purposely stay away from uploading to LLMs online. My hardware isn’t capable of running an LLM, so I run on the basic Assist with Wisper and Piper setup. So far, very happy. Everything is running smoothly and commands work as expected.

I’m trying to add some fun assistant abilities. For example, my Assist can now recite some jokes as well as some Elder Scrolls facts on request. Next step I wanted to take is to let it answer simple “what is/who is” questions. Idea was doing so by scraping the first paragraph of a Wikipedia article on request. That way it should be rather broadly usable but not too heavy for the machine. But there I’m getting a bit stuck.

I first came out on the Scrape integration. However, that doesn’t seem to allow variables in the URL. I’m also not sure if it allows me to scrape on request only (as there is no point in scraping without input in this case, all it can do is affect performance and traffic). But to create an example of what I’m trying to do, let’s ignore that for a second (as well as that my select might be wrong).

What I want is to do something like asking Assist:

What is {foo}?

Then have it reply with the scraped answer from:

scrape:
  - resource: https://en.wikipedia.org/wiki/{foo}
    sensor:
      - name: wikipedia_info
        select: "div.mw-content-ltr.mw-parser-output p"
        index: 1

And then have Assist tell this back to me.

Since Scrape doesn’t seem to do this with the variables, this wouldn’t work one on one. I also have been searching online and found a lot of people scraping static data. But not much variable data, let alone based on a variable input. I did find this topic, but outside of “using that scraper instead as it allows templating” it doesn’t answer the “send variable with command” and “scrape on command” parts yet, or if that’s even possible with that scraper.

So I guess the question is, how do I approach this. Does anyone have a similar kind of thing set up already? It doesn’t seem to be so crazy in theory, but I’m not entirely sure how to get this started in HA.

For context, in case it matters, I’m running HAOS in a VM on a small Linux computer tucked away somewhere at home. HAOS has access to the internet, but cannot be reached from outside the local network.

NathanCu · January 2, 2025, 1:42pm

Let’s start with first no scrape does not do this.

What you’re describing is most appropriate at the tool use level in a local llm. (is where I’m currently at architecture wise in my build) This would most likely be handled by a machine capable of driving an llm and a local installation of something like open-webui

Let me save you some head banging…
You won’t be able to scrape on demand reliably enough with the intent recognizer you have. It will pule and choke on sentences trying to understand what’s going on. Stuff an LLM breathes like air. You will need that before you crack web scrape on demand. Before you crack that you also need web search on demand which is also something an llm solves and is best at its tool level.

(in essence local ollama-like tool using llm solves all that for you in one go… Intent recognizer dynamic search scrape will be omega level mutant hard)

Cambionn · January 2, 2025, 3:05pm

NathanCu:

Let’s start with first no scrape does not do this.

What you’re describing is most appropriate at the tool use level in a local llm. (is where I’m currently at architecture wise in my build) This would most likely be handled by a machine capable of driving an llm and a local installation of something like open-webui

You won’t be able to scrape on demand reliably enough with the intent recognizer you have. It will pule and choke on sentences trying to understand what’s going on. Stuff an LLM breathes like air.

You will need that before you crack web scrape on demand. Before you crack that you also need web search on demand which is also something an llm solves and is best at its tool level.

(in essence local ollama-like tool using llm solves all that for you in one go… Intent recognizer dynamic search scrape will be omega level mutant hard)

That seems overkill for what I want really. Especially when the whole point is to make a solution that doesn’t require hardware that can run a LLM properly. I really just need the first paragraph of a Wikipedia page returned.

I think you’re either misunderstanding my intention or are over complicating it. It doesn’t need to know intent, what’s going on, or do a web search for what I’m trying to do. It needs needs to grab a section of text from a (dynamically pre-set) webpage and return it, that’s it. Which webpage is dynamically set based on the command, it doesn’t need to find one itself. That is something a scraper can do. It will have some ugly formatting, but that’s something that can be dealt with as well.

So the flow will be:
Get command → scrape Wikipedia → clean received data → output data

The idea in theory isn’t that complex. And it’s actually similar to what digital assistants (like Google Assistant) did before the whole LLM hype came by. It used to say “according to Wikipedia…” all the time a few years back.

And no, it won’t be flawless. But it’s accurate enough for personal use and works most of the time. And surely an LLM can do it prettier. But an LLM is also way more resource intensive to run. The whole point is to run this kind of command without those resources nor third party cloud LLM. And to be fair, outside of not having those resources available at this time, it’s not really worth it for this case. Grabbing and altering some strings isn’t exactly a heavy task normally, and since we only scrape on demand we don’t use more bandwidth than if we would just look it up ourselves. Assuming you’re not asking thousands of questions at the same time on potato hardware there shouldn’t be much of an issue to this approach. Due to that, it may just end up being quicker than an LLM too.

So the main question is, how to get this flow into HA.

Edit: all the edits are some typo’s and some small stuff in the detailed explanation.

More into detail on the flow I’m trying to create

Most subjects can be relatively reliably found by using the URL in my example (https://en.wikipedia.org/wiki/{foo}), and the first paragraph can be found with the mentioned elements (1st p element inside the div.mw-content-ltr.mw-parser-output element).

So with that knowledge I can then setup that on the sentences “who is {foo}”, “what is {foo}”, “what is a {foo}”, and “what is an {foo}” it executes the scraper and returns the data.

If needed, some cleaning by rules can be done on the text. Since Wikipedia always has the same format, I can clean out the worst easily.

For example, with that setup I could ask: “Who is Bruce Springsteen?” (so foo == Bruce Springsteen).

It will then use the webpage https://en.wikipedia.org/wiki/Bruce Springsteen which Wikipedia knowns (it will auto-direct to https://en.wikipedia.org/wiki/bruce_springsteen). Grabbing the first paragraph returns:

<b>Bruce Frederick Joseph Springsteen</b>
(born September 23, 1949) is an American
<a href="/wiki/Rock_music" title="Rock music">rock</a>
singer, songwriter, and guitarist. Nicknamed "the Boss",
<sup class="reference" id="cite_ref-2"><a href="#cite_note-2"><span class="cite-bracket">[</span>2<span class="cite-bracket">]</span></a></sup>
he has released 21 studio albums over six decades, most featuring the
<a href="/wiki/E_Street_Band" title="E Street Band">E Street Band</a>
, his backing band since 1972. Springsteen is a pioneer of
<a href="/wiki/Heartland_rock" title="Heartland rock">heartland rock</a>
, combining commercially successful rock with poetic, socially conscious lyrics which reflect
<a href="/wiki/Working_class" title="Working class">working class</a>
American life. He is known for his descriptive lyrics and energetic concerts, which sometimes last over four hours.
<sup id="cite_ref-3" class="reference"><a href="#cite_note-3"><span class="cite-bracket">[</span>3<span class="cite-bracket">]</span></a></sup>

From there, I can work to sanitize where needed. Let’s say we remove , , </a>. If the  tags that are around this are also returned ofc we remove those too. Those are the easy ones. Then we remove everything starting <a until we find another >. At last, we remove everything starting with <sup until . Other than that we just put a space behind each sentence (in whatever way) and paste them together (if they don’t come in a string without a proper enter character).

Then we come up with the following:

Bruce Frederick Joseph Springsteen (born September 23, 1949) is an American rock singer, songwriter, and guitarist. Nicknamed "the Boss", he has released 21 studio albums over six decades, most featuring the E Street Band, his backing band since 1972. Springsteen is a pioneer of heartland rock, combining commercially successful rock with poetic, socially conscious lyrics which reflect working class American life. He is known for his descriptive lyrics and energetic concerts, which sometimes last over four hours.

That is perfect as an answer. Then just tell Assist to say:

According to Wikipedia:
{cleaned output from scrape}

And we’ll get an Assist that when asked:

Who is Bruce Springsteen?

Can tell me:

According to Wikipedia:
Bruce Frederick Joseph Springsteen (born September 23, 1949) is an American rock singer, songwriter, and guitarist. Nicknamed “the Boss”, he has released 21 studio albums over six decades, most featuring the E Street Band, his backing band since 1972. Springsteen is a pioneer of heartland rock, combining commercially successful rock with poetic, socially conscious lyrics which reflect working class American life. He is known for his descriptive lyrics and energetic concerts, which sometimes last over four hours.

That’s pretty decent, and will sound fine when pronounced by Piper. At worst, a bit vast due to all the comma’s used (I noticed it sometimes forgets to have a small break on those).

If I do the same with another input, it would also work. For example, if we now ask the question:

What is a Chromecast?

With the same logic Assist will tell me:

According to Wikipedia:
Chromecast is a discontinued line of digital media players developed by Google. The devices, designed as small dongles, can play Internet-streamed audio-visual content on a high-definition television or home audio system. The user can control playback with a mobile device or personal computer through mobile and web apps that can use the Google Cast protocol, or by issuing commands via Google Assistant; later models introduced an interactive user interface and remote control. Content can be mirrored to video models from the Google Chrome web browser on a personal computer or from the screen of some Android devices.

Again, perfectly fine answer that Piper can pronounce decently.

Wikipedia redirects also show their usefulness here, because if you said “What is a Google Chromecast?” instead of just “What is a Chromecast?” it still works as Wikipedia redirects https://en.wikipedia.org/wiki/Google Chromecast to https://en.wikipedia.org/wiki/Google_Chromecast which just returns https://en.wikipedia.org/wiki/chromecast with an extra “(Redirected from Google Chromecast)” note. So it’s still giving you the correct data.

Mikefila · January 2, 2025, 6:09pm

If this is solely for wikipedia, using their api would probably be better than trying to scrape the site.

https://www.mediawiki.org/wiki/API:Main_page

Cambionn · January 2, 2025, 7:21pm

That works indeed, I agree it’s better.
Using for example:

https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&titles=Bruce%20Springsteen&redirects=1&formatversion=2&exsentences=5&exintro=1&explaintext=1&exsectionformat=wiki

returns:

{
    "batchcomplete": true,
    "query": {
        "pages": [
            {
                "pageid": 60192,
                "ns": 0,
                "title": "Bruce Springsteen",
                "extract": "Bruce Frederick Joseph Springsteen (born September 23, 1949) is an American rock singer, songwriter, and guitarist. Nicknamed \"the Boss\", he has released 21 studio albums over six decades, most featuring the E Street Band, his backing band since 1972. Springsteen is a pioneer of heartland rock, combining commercially successful rock with poetic, socially conscious lyrics which reflect working class American life. He is known for his descriptive lyrics and energetic concerts, which sometimes last over four hours.\nSpringsteen released his first two albums, Greetings from Asbury Park, N.J. and The Wild, the Innocent & the E Street Shuffle, in 1973."
            }
        ]
    }
}

It sadly does seem that I can’t set a paragraph as limit. Although limiting it to sentences would probably be well enough in most cases, with only an odd ending here and there.

That being said, with this I still need to parse foo into the URL, and read the JSON properly so only the extract value will be returned by Assist (and the \ and \n need to be handled properly but that shouldn’t be too much of an issue). The flow itself doesn’t change that much, although I definitely agree it’s better to use the API than to scrape.

I go sit for it myself after dinner as I’m now quickly checking while cooking, but any directions on how to do that nicely in Home Assistant for use with Assist? I mean, I’ve set up voice commands, but the whole parsing it to the API and handling the returned JSON properly to get the output the way I want (I can do it in programming languages, but I’m just not sure about the correct approach in HA as I’ve just been playing with it for a week now ).

Cambionn · January 2, 2025, 11:55pm

Okay, I feel like I’m almost there. Using an API definitely made stuff easier to find.
I added the following to my configuration.yaml

# Get wikidata by restAPI
rest_command:
  get_wiki_data:
    url: https://en.wikipedia.org/w/api.php?action=query&format=json&redirects=1&formatversion=2&exsentences=5&exintro=1&explaintext=1&exsectionformat=wiki&prop=extracts&titles={{ foo }}

Then I added the following automation:

- id: "1735853490742"
  alias: Request info
  description: Grabs info of Wikipedia about a who or what
  triggers:
    - trigger: conversation
      command:
        - Who['s][ is] {foo}
        - What['s][ is][( a| an)] {foo}
  conditions: []
  actions:
    - sequence:
        - set_conversation_response: Looking up {{ trigger.slots.foo }}.
        - action: rest_command.get_wiki_data
          response_variable: wiki_data
          data:
            foo: "{{trigger.slots.foo}}"
        - if: "{{ wiki_data['status'] == 200 }}"
          then:
            if: "{{ value_json.wiki_data.wiki_data.query.pages[:1].missing == TRUE }}"
            then:
              - set_conversation_response: "Could not find information on {{ trigger.slots.foo }}."
            else:
              - set_conversation_response: "'According to Wikipedia:' {{ value_json.wiki_data.query.pages[:1].extract }}"
          else:
            - set_conversation_response: "Could not connect to find information on {{ trigger.slots.foo }}."
  mode: single

Now I’m only left with three issues. Maybe I’m just tired as it’s 1AM here, and I’m missing some common sense due to that. But here I could use some help:

For some reason, the 1st conversation response isn’t executed. It goes straight to the second. Not that I really need it, as it’s rather quick. I put it there for testing, and probably gonna remove it. But still wish to know why it’s not triggering.
For some reason it gets mad when I try to add an enter between “according to Wikipedia:” and the output. It’s odd as I’m sure I have enters in other conversation responses. The use of " and ’ is due to me messing ti try to get it to work. However, it had the same issue without the quotes on the string.
I can’t seem to figure out how to parse the JSON file properly in YAML. I found this thread but I can’t seem to figure it either way. But while I’m trying to select specific values I currently get the output:

According to Wikipedia:{‘content’: {‘batchcomplete’: True, ‘query’: {‘pages’: [{‘pageid’: 60192, ‘ns’: 0, ‘title’: ‘Bruce Springsteen’, ‘extract’: ‘Bruce Frederick Joseph Springsteen (born September 23, 1949) is an American rock singer, songwriter, and guitarist. Nicknamed “the Boss”, he has released 21 studio albums over six decades, most featuring the E Street Band, his backing band since 1972. Springsteen is a pioneer of heartland rock, combining commercially successful rock with poetic, socially conscious lyrics which reflect working class American life. He is known for his descriptive lyrics and energetic concerts, which sometimes last over four hours.\nSpringsteen released his first two albums, Greetings from Asbury Park, N.J. and The Wild, the Innocent & the E Street Shuffle, in 1973.’}]}}, ‘status’: 200}

And on a non-existing page I get:

According to Wikipedia:{‘content’: {‘batchcomplete’: True, ‘query’: {‘normalized’: [{‘fromencoded’: False, ‘from’: ‘uwifbcdhisabgkfij’, ‘to’: ‘Uwifbcdhisabgkfij’}], ‘pages’: [{‘ns’: 0, ‘title’: ‘Uwifbcdhisabgkfij’, ‘missing’: True}]}}, ‘status’: 200}

Which if my code would work would be (but obviously with the JSON not parsing it’s never TRUE):

Could not find information on Uwifbcdhisabgkfij.

Cambionn · January 3, 2025, 11:22am

Well I figured it out. Guess some sleep was a good idea indeed.
The working YAML for the automation is:

- id: "1735853490742"
  alias: Request info
  description: Grabs info of Wikipedia about a who or what
  triggers:
    - trigger: conversation
      command:
        - Who['s][ is] {foo}
        - What['s][ is][( a| an)] {foo}
  conditions: []
  actions:
    - sequence:
        - action: rest_command.get_wiki_data
          response_variable: wiki_data
          data:
            foo: "{{trigger.slots.foo}}"
        - if: "{{ wiki_data.status == 200 }}"
          then:
            if: "{{ wiki_data.content.query.pages[0].missing == 'True' }}"
            then:
              - set_conversation_response: "Could not find information on {{ trigger.slots.foo }}."
            else:
              - set_conversation_response: "According to Wikipedia:

                  {{ wiki_data.content.query.pages[0].extract }}"
          else:
            - set_conversation_response: "Could not connect to find information on {{ trigger.slots.foo }}."
  mode: single

It fixes issue 2 and 3. I’m still not sure about the 1st, but as that one would be removed anyways I just leave it at that. If anyone knows it I’d still be interested to know.