A JRPG style conversation with the voice assistant on the S3 Box 3

Lajos · February 29, 2024, 10:00am

I had created a few assistant image packs when the idea came to me: how fun would it be if the box could show the conversation in JRPG style? So I dug around a bit in the original firmware, added parts where necessary, and it came out like this:

Firmware:

Chris.le · February 29, 2024, 6:27pm

This is amazing

jenova70 · March 14, 2024, 12:50pm

I really like this! I am currently updating the official firmware of the S3 Box to display text on it thanks to your idea.

Kudos man, the idea is awesome

Lajos · March 14, 2024, 1:32pm

Wow looks pretty nice

jenova70 · March 14, 2024, 3:18pm

I used almost everything from your firmware.
I changed it so that the “boxes” are drawn programmatically on top of the pictures.
That means the pictures that receive the text no longer need boxes baked into them.
You just provide your picture, and the box and the text are drawn on top of it.

It should soon be available

Lajos · March 14, 2024, 3:31pm

I was struggling with text trimming. I think it should be handled more elegantly by doing the calculation with the DisplayBuffer, but I did not have the time or knowledge to implement it properly.

jenova70 · March 15, 2024, 8:54am

Ongoing progress

Thanks again for your idea. I had no clue the box was fast enough to handle voice while also displaying an illustration and text.

github.com/esphome/firmware

Text on S3 Boxes

esphome:main ← jlpouffier:text-on-s3-box

opened 04:35PM - 14 Mar 24 UTC

jlpouffier

+456 -0

The idea is coming from the Voice Assistant Contest. Credit to [user Lajos and …his entry](https://community.home-assistant.io/t/a-jrpg-style-conversation-with-the-voice-assistant-on-the-s3-box-3/697172).

## Features added

**Spoken text is displayed on the box during the thinking phase**

![IMG_2577](https://github.com/esphome/firmware/assets/5878296/d77ee940-4ea2-4ccb-b188-aedf62efeae0)

**Response text is displayed on the box during the replying phase**

![IMG_2578](https://github.com/esphome/firmware/assets/5878296/3a6b6e15-fcd8-4912-8d6c-aaa94be70c6a)

This behavior is user-configurable via a switch called `Display conversation` on Home Assistant.

![CleanShot 2024-03-14 at 17 33 47](https://github.com/esphome/firmware/assets/5878296/a81c149c-e224-4c4d-a2cf-a9dfd387133e)

The value of the switch is restored, but `ON` by default if no value is found (It will be `ON` when updating for the first time)

## Specific changes of the firmware.

### Allowed characters

In ESPHome, we need to load what character we are planning to display. Because the firmware is supposed to be used by all our supported languages, I searched for a proxy that would be a good approximation of every character that we could display.
I ended up extracting all unique characters used in our test file on the [intent repository of Home Assistant](https://github.com/home-assistant/intents). This is this part of the firmware:

```yaml
# These unique characters have been extracted from every test file of every language available on https://github.com/home-assistant/intents (14 March 2024)
allowed_characters: " !#%'()+,-./0123456789:;<>?@ABCDEFGHIJKLMNOPQRSTUVWYZ[]_abcdefghijklmnopqrstuvwxyz{|}°²³µ¿ÁÂÄÅÉÖÚßàáâãäåæçèéêëìíîðñòóôõöøùúûüýþāăąćčďĐđēėęěğĮįıļľŁłńňőřśšťũūůűųźŻżŽžơưșțΆΈΌΐΑΒΓΔΕΖΗΘΚΜΝΠΡΣΤΥΦάέήίαβγδεζηθικλμνξοπρςστυφχψωϊόύώАБВГДЕЖЗИКЛМНОПРСТУХЦЧШЪЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяёђєіїјљњћאבגדהוזחטיכלםמןנסעפץצקרשת،ءآأإئابةتجحخدذرزسشصضطظعغفقكلمنهوىيٹپچڈکگںھہیےংকচতধনফবযরলশষসచయలഅആഇഈഉഎഓകഗങചജഞടഡണതദധനപഫബഭമയരറലളവശസഹൺൻർൽൾაბგდევზთილმნოპრსტუფქყშჩცძჭხạảấầẩậắặẹẽếềểệỉịọỏốồổỗộớờởợụủứừửữựỳ—、一上不个中为主乾了些亮人任低佔何作供依侧係個側偵充光入全关冇冷几切到制前動區卧厅厨及口另右吊后吗启吸呀咗哪唔問啟嗎嘅嘛器圍在场執場外多大始安定客室家密寵对將小少左已帘常幫幾库度庫廊廚廳开式後恆感態成我戲戶户房所扇手打执把拔换掉控插摄整斯新明是景暗更最會有未本模機檯櫃欄次正氏水沒没洗活派温測源溫漏潮激濕灯為無煙照熱燈燥物狀玄现現瓦用發的盞目着睡私空窗立笛管節簾籬紅線红罐置聚聲脚腦腳臥色节著行衣解設調請謝警设调走路車车运連遊運過道邊部都量鎖锁門閂閉開關门闭除隱離電震霧面音頂題顏颜風风食餅餵가간감갔강개거게겨결경고공과관그금급기길깥꺼껐꼽나난내네놀누는능니다닫담대더데도동됐되된됨둡드든등디때떤뜨라래러렇렌려로료른를리림링마많명몇모무문물뭐바밝방배변보부불블빨뽑사산상색서설성세센션소쇼수스습시신실싱아안않알았애야어얼업없었에여연열옆오온완외왼요운움워원위으은을음의이인일임입있작잠장재전절정제져조족종주줄중줘지직진짐쪽차창천최추출충치침커컴켜켰쿠크키탁탄태탬터텔통트튼티파팬퍼폰표퓨플핑한함해했행혀현화활후휴힘，？"
```

Because this solution is not perfect, this list is loaded as a `substitution` so that a user can still add a few missed characters in the list.

### 2-stage thinking phase.

Interestingly enough, we are starting our thinking phase at the end of the VAD stage, in the middle of the STT phase. This is because we want to take into account the time it takes for the STT engine to fully decode the spoken command.

This means that when we start our thinking phase, the spoken text is not known, the silence has just been detected, and the processing of the last chunk of audio is still ongoing. At first, I thought that this would be an issue, but I like it even better now.
- At the end of the VAD stage, the thinking phase is displayed. The spoken text is not known yet, so 3 dots `...` are displayed instead. (Basically meaning: "I am still trying to figure out what you told me")
- Once the STT phase is over, the screen is refreshed with the spoken text (Basically meaning: "Ok now I understand what you told me... I am still figuring out what to do, and how to reply to you")

It is visible when the STT engine is slow.

![CleanShot 2024-03-14 at 17 31 22](https://github.com/esphome/firmware/assets/5878296/3349eb91-7bdf-4c8b-936d-813b82bee450)

We do not have this problem for the response, as the thinking phase extends until the streaming of the audio.
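
Coming back to the allowed-characters list: the PR does not say how the extraction was done, so the following is only a minimal Python sketch of one way to collect every unique character from a local checkout of test files. The function names, the `*.yaml` pattern, and the choice to read raw file contents (rather than only the sentences inside them) are all assumptions for illustration.

```python
from pathlib import Path


def unique_characters(root: Path, pattern: str = "*.yaml") -> str:
    """Return every distinct character found in matching files under root.

    Hypothetical helper: reads raw file contents, which is an assumption,
    not necessarily how the list in the PR was produced.
    """
    seen: set[str] = set()
    for path in sorted(root.rglob(pattern)):
        seen.update(path.read_text(encoding="utf-8"))
    # Line breaks and tabs are layout, not glyphs, so leave them out.
    seen -= {"\n", "\r", "\t"}
    # Sort by code point so the result is stable between runs.
    return "".join(sorted(seen))


def merge_extra(allowed: str, extra: str) -> str:
    """Add characters the extraction missed, without creating duplicates."""
    return "".join(sorted(set(allowed) | set(extra)))
```

The resulting string could then be pasted as the value of the `allowed_characters` substitution, and `merge_extra` mirrors the idea that a user can still append a few missed characters afterwards.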

jenova70 · March 17, 2024, 9:44am

The results of the contest are out!
They may be of interest to you
Have a look!

Lajos · March 17, 2024, 1:04pm

I am very flattered

youkorr · May 29, 2024, 3:15pm

Hello, I could not get your firmware (S3-Box-3-firmware-fork) to work.

styphonthal · June 5, 2024, 5:31pm

Really fast replies. What are you using for speech to text and TTS?

Lajos · June 17, 2024, 7:01pm

The firmware has already been merged into the official one. I suggest using that instead.

Lajos · June 17, 2024, 7:01pm

Nabu Casa cloud