Discussion thread for tools to make inexpensive models smarter (gpt-4o-mini in my case)

Hello everyone,

Since the arrival of LLMs, we talk to Assist much more naturally and freely, and we are often quite happy to believe what it tells us :wink:.

However, I’ve noticed that the bullshit level for smart-home related questions is still quite high.
The reason is usually anything involving math: the models don’t use a calculator, for example, but answer according to “linguistic probability”.

The more expensive models usually produce far better results in this respect, but

  • far from always
  • I am afraid of the costs, as the children are also on board. :stuck_out_tongue:

That’s why I started with a first test script today, to see if this can be improved with a tool-set. The first one: determine the min/max value from a list of numbers.

Here is an example of what can go wrong without additional guidance for the LLM:

“Phew, it’s so hot in my study, what’s the coldest room in the house I could move to?”

The current temperatures in the various rooms are

  • Bathroom: 26.0°C
  • Roof bathroom: 28.5°C
  • Office: 27.96°C
  • Kitchen: 25.48°C
  • <name_of_kid_1>: 26.57°C
  • <name_of_kid_2>: 26.46°C
  • Bedroom: 25.42°C
  • Games room: 27.89°C
  • WC: 25.8°C
  • Living room: 25.08°C

The coolest room is the bedroom at 25.42°C.

Uhhhm, well. That’s rather wrong …
And if you repeat the question a few more times, you get a different room each time. :face_with_spiral_eyes:

So I created the following script and shared it with Assist:

alias: Min-Max-Calculator
icon: mdi:calculator-variant
description: >-
  Returns the minimum or maximum of a list of numbers (LLM tool). 

  Can, for example, find the lowest temperature among many.

  Depending on the mode set, the function returns the smallest or largest number
  from the passed array.
mode: single
fields:
  operation:
    name: Operation
    description: >-
      'min' to get the smallest value from a list, or 'max' to get the largest
      value from a list
    required: true
    selector:
      select:
        options:
          - min
          - max
  numbers:
    name: Numbers
    description: A JSON array string with several numbers, e.g. [20.43, 21.5, 22.05]
    required: true
    selector:
      text: null
sequence:
  - variables:
      nums: "{{ numbers | from_json }}"
      result: |-
        {% if operation == 'min' %}
          {{ {'value': nums | min } }}
        {% else %}
          {{ {'value': nums | max } }}
        {% endif %}
  - stop: ""
    response_variable: result

Then I added this text in the Conversation Agent settings (in the OpenAI Integration configuration):

You are VERY bad at calculations, finding min/max values, comparing numbers, or date calculations like what date tomorrow or next week is, or how many days it is until a given date.
Always, really always, use the tools provided when possible to get the solution. Do NOT try to calculate yourself as long as there is another way.

After that, Assist can handle questions about min / max values of things like temperatures, illumination, battery states, … without problems.

At the beginning I always thought about large and cool use-cases when looking for ideas for new intents / scripts.
But I think there is a lot of potential in smaller tools for Assist, as many problems in a smart home are not about linguistic probability but about technical details. :wink:
Getting this right might improve the user experience a lot, at least until cheap models get better at this kind of problem.

Some other ideas in my head:

  • A kind of real calculator that can handle +, -, *, /, avg, …
  • Date calculations like “what’s the date of the day after tomorrow”, “what date is next Tuesday”, “what day is today + x days”.

The second example is also something I had a lot of problems with.
Asking for appointments in the calendar, or asking about the weather next weekend, …
There’s really a lot where these models fail.
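
I haven’t built the date tool yet, but a minimal sketch could look something like this (untested, the field name is just a placeholder; it simply shifts today’s date by a number of days):

alias: Date-Calculator
icon: mdi:calendar-clock
description: >-
  Returns the date and weekday that lies a given number of days from today (LLM tool).

  Example: 2 gives the day after tomorrow, -1 gives yesterday.
mode: single
fields:
  days_offset:
    name: Days offset
    description: Number of days from today, can be negative, e.g. 2 for the day after tomorrow
    required: true
    selector:
      number:
        min: -366
        max: 366
sequence:
  - variables:
      result:
        # now() and timedelta() are available in Home Assistant templates
        date: "{{ (now() + timedelta(days=days_offset | int)).date() | string }}"
        weekday: "{{ (now() + timedelta(days=days_offset | int)).strftime('%A') }}"
  - stop: ""
    response_variable: result

Questions like “what date is next Tuesday” would still need an extra field or a second tool, as they are an offset to a weekday rather than to today.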

So if you

  • have scripts like this in your setup too,
  • know other use-cases where LLMs fail at simple general tasks that come up in everyday questions,
  • or simply want to discuss this :stuck_out_tongue:

Feel free to participate. :slightly_smiling_face:

Context is king to make an LLM work.

Start reading.

2 Likes

Hi Nathan,

sure, I’ve been following Friday’s Party since the beginning. :wink:
(But I’m currently lagging a few posts behind with my reading …)

I haven’t had the time to build a really large tool-set library for my own setup so far, as my spare time is somewhat limited at the moment.

So I still have only a few little tools with hopefully good descriptions for the AI, to handle the most important stuff for me. (Which already works very well and is such a difference to the voice assistants of the past. :slightly_smiling_face:)

But I already noticed there that calculations are often a problem when using LLMs.
Adding additional context via guidance in the prompt (handing over a default calendar so it can look up days) didn’t solve that completely, in case that is what your “context is king” comment was about? (Maybe I misinterpreted it.)

So this thread was more about the small problems that spread into a lot of different use-cases, as calculations are everywhere in this smart world …

1 Like

Calculations ARE ALWAYS a problem. That’s my point. They’re not designed to do math - they’re designed to place the next most likely token after the last token - THAT’S IT. I’m actually giving Friday an actual calculator that does ALL math for her. (NEVER rely on the LLM.)

LLMs are not designed for most things people ascribe to them - Friday is about PUSHING ALL THAT to tools.

Calendar - Tool. Todo - Tool. Entire ERP system - just a tool. Airspace Management - Tool. File management - Tool. Email, Notifications, IM, all Tools. THAT’S what makes Friday able to do what she does. You’re on the right track, but make no mistake, she is absolutely solving the same problem - LLMs (even the big ones) simply can’t do most things people want them to, so you have to design chains of tool use. GPT-5, when it hits, is a different animal - but even thinking LLMs need to be thought of as pattern matchers - story tellers. You’re just putting a tool in its hands and telling it to use the tool at the same time.

1 Like

Thanks for the fast reply. :slight_smile:

Yes, this has been pretty clear to me since I saw the first ideas about custom intents combined with an LLM in one of the release party videos some (or more) time ago.

That was also when I started to hope we could set up something better on our own than the big companies offer, as they won’t allow that kind of customization.

Later also your Friday’s Party (and some other threads around here) became a large source of inspiration.

Now I just need the kids to grow up, so I have more time for this kind of stuff. :upside_down_face:

1 Like

Well I can’t help you with that part. :rofl:

So let’s talk about what we CAN do. If we target ‘commodity’, let’s call that anything at an Nvidia 40xx level or better. This is basically what the monastery is for Friday. The next set of posts is designed to offload as much of her runtime as possible to the monastery (which is basically the Intel version of that).

So I’ll be cutting up the inference jobs (the stuff that makes her seem smarter than she is) into tiny chunks that run all the time on my local hardware, then come back and summarize that once in a while. (That’s what the kata cabinet is, when you get to that part - where I park those summaries.)

Then, instead of the big giant glob of stuff Friday knows, I load the summary and pointers to all the other junk.

It’s really the only way it’ll be effective in the long run on small iron. (There, that’s what 12-18 months of your question taught me.) :rofl:

In the meantime - she runs the ‘front’ model on GPT-4.1-mini (I noticed you’re on 4o-mini; 4.1-mini is MUCH better at a similar price, that’s the first change I’d recommend. You can even make it work with Nano.)

1 Like

Ok, I also extended the script so it can handle all basic math operations and some aggregations.
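
For anyone who wants to build something similar, here is a simplified sketch of what such an extended calculator script could look like (untested as posted here; the option names are only an example):

alias: Calculator
icon: mdi:calculator
description: >-
  Performs a basic calculation on a list of numbers (LLM tool).

  'add', 'avg', 'min' and 'max' work on the whole list; 'subtract', 'multiply'
  and 'divide' are applied left to right, e.g. [10, 2, 2] with 'divide' gives 2.5.
mode: single
fields:
  operation:
    name: Operation
    description: One of add, subtract, multiply, divide, avg, min, max
    required: true
    selector:
      select:
        options:
          - add
          - subtract
          - multiply
          - divide
          - avg
          - min
          - max
  numbers:
    name: Numbers
    description: A JSON array string with several numbers, e.g. [20.43, 21.5, 22.05]
    required: true
    selector:
      text: null
sequence:
  - variables:
      nums: "{{ numbers | from_json }}"
      result:
        value: >-
          {% if operation == 'add' %} {{ nums | sum }}
          {% elif operation == 'avg' %} {{ nums | average }}
          {% elif operation == 'min' %} {{ nums | min }}
          {% elif operation == 'max' %} {{ nums | max }}
          {% else %}
            {# subtract / multiply / divide: fold the list from left to right #}
            {% set ns = namespace(acc=nums[0]) %}
            {% for n in nums[1:] %}
              {% if operation == 'subtract' %} {% set ns.acc = ns.acc - n %}
              {% elif operation == 'multiply' %} {% set ns.acc = ns.acc * n %}
              {% else %} {% set ns.acc = ns.acc / n %}
              {% endif %}
            {% endfor %}
            {{ ns.acc }}
          {% endif %}
  - stop: ""
    response_variable: result

The description fields matter most: the clearer the operations and the expected input format are described, the more reliably a cheap model picks the right tool and fills in the fields.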

This is really starting to get funny. :partying_face:

How can I assist?

      Do you have a tool called calculator?

Yes, I have access to a calculator tool. How can I assist you with calculations?

      show me the room temps

Here are the current room temperatures:  
- Bathroom: 26.0°C 
- Attic Bathroom: 27.0°C 
- Office: 26.8°C 
- Kitchen: 25.0°C 
- <Child 1> Room: 26.0°C 
- <Child 2> Room: 26.0°C 
- Bedroom: 25.5°C 
- Playroom: 26.98°C 
- WC: 25.5°C 
- Living Room: 25.0°C  

Let me know if you need anything else!

      calculate the average temp of the rooms above

The average temperature of the rooms is approximately 25.98°C. If you need further assistance, just let me know!

      show me the calculation method

The average temperature is calculated by adding all the room temperatures together and then dividing by the number of rooms.

For the given temperatures: 

1. Add the temperatures: 
  26.0 + 27.0 + 26.8 + 25.0 + 26.0 + 26.0 + 25.5 + 26.98 + 25.5 + 25.0 = 259.78  
2. Divide by the number of rooms (10): 
  259.78 / 10 = 25.978  

So, the average temperature is approximately 25.98°C.

I guess at some point I’ll have to provide access to my InfluxDB with an additional tool. :exploding_head:

1 Like

recorder get_statistics came out last month - I already have a tool in Friday’s pack - go look for History CRUD… If you’re using Influx for recorder… You’re welcome. Friday's Party: Creating a Private, Agentic AI using Voice Assistant tools - #114 by NathanCu

1 Like

Like I said, I’m lagging a few posts behind in your thread. :smile:

Thanks a lot. :+1:

1 Like

Ok, something else I noticed that fits perfectly in this thread, but that I haven’t fixed so far:

The LLM fails amazingly often at listing the state of multiple entities.
Like “Which lights in the living room are turned on?”.
If I tell it (after a wrong answer) that this isn’t true and that it should take a close look at all devices and their states, then it can provide the correct answer most of the time.
But on the first try it fails far too often.

This is really something I didn’t expect, as the entity data, their status and the room are shared with the LLM by Home Assistant.
I have A LOT of entities in my Home Assistant installation, and we have a lot of ambient lights in our rooms that get activated automatically when one of the main lights in the room is turned on.

But still, this seems like a simple and common task for the LLM.

No idea how the data is provided to the assistant, but is it in such bad shape that we really need a tool to fetch and filter entities (by type, room, status, maybe tags, …) for easier access? :face_with_monocle:

The same happens with open windows or other things where it has to find the correct entities in a large list and then filter them by status / room.

It often doesn’t just miss some of the devices, it even mixes in devices from other rooms.
So my feeling really is that it gets confused and could use some help …

edit:
I also tested it with better/more expensive models than my default gpt-4o-mini.
They get the answer right more often.
But they still fail from time to time.
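
In case someone wants to experiment in the meantime, a rough, untested sketch of such a fetch-and-filter tool could look like this (the names and fields are just placeholders):

alias: Entity-Filter
icon: mdi:filter
description: >-
  Lists all entities of a given domain in a given area that are currently in a
  given state (LLM tool). For example all lights in the living room that are on.
mode: single
fields:
  domain:
    name: Domain
    description: The entity domain to search, e.g. light, binary_sensor, cover
    required: true
    selector:
      text: null
  area:
    name: Area
    description: The name of the area/room, e.g. Living room
    required: true
    selector:
      text: null
  state:
    name: State
    description: The state to filter on, e.g. on, off, open, closed
    required: true
    selector:
      text: null
sequence:
  - variables:
      # Iterates over all states, so it may be slow with very many entities
      matches: >-
        {{ states
           | selectattr('domain', 'eq', domain)
           | selectattr('entity_id', 'in', area_entities(area))
           | selectattr('state', 'eq', state)
           | map(attribute='name')
           | list }}
      result:
        count: "{{ matches | count }}"
        entities: "{{ matches }}"
  - stop: ""
    response_variable: result

That way the model only has to read back a short, pre-filtered list instead of scanning hundreds of exposed entities itself.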

You’re describing a major failure on our part.

Assuming it can do anything asked. Full stop. It cannot. It is assembling tokens, period.

Even the thinking models are just assembling tokens and then checking themselves - they’re not really… Counting. So you have to make it easy.

Any time you want the LLM to answer factually about anything, don’t give it enough lead to fail. You give it a tool. If you want it to count lights, you have to give it a way to count lights or live with bad numbers.

In my system the index tool is that. There’s a function to expand it, and it basically gives the LLM a cheat sheet… “How many lights are on in the living room” translates to an index call for lights in the living room filtered on, and voila, there’s a simple list with a definite count. No math, just answers that are easy for the LLM to regurgitate.

LLMs are story tellers. If you don’t explicitly give them exactly what they need to tell a factual story and complete the task, they make it up to complete the task. It’s that simple. If it’s making something up, it doesn’t have enough to give the answer. You say yes it does, but that answer involves it looping back and doing math. It will fail almost every time, unless you provide the answer with a tool and instructions on how to use it.

1 Like

Ok, I was already afraid of that. :smile:

Thanks a lot for sharing so much of your lessons learned.
Will take a closer look at the grand index in your Friday’s Party thread. :+1: