Really it's not about hardware; it's the software and the complex audio DSP algorithms.
The ESP32-S3 has the only free, albeit closed-source, BSS (Blind Source Separation) blob, likely a DUET algorithm, where 2x positionally unique signals are separated out of a binaural mix.
This works well for smart speakers because, 80/20, you mostly have command speech & 3rd party noise. Also, because it separates positionally it dereverberates, which solves a huge problem for far-field.
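To make the positional separation idea concrete, here is a toy sketch of frequency-bin masking on the level difference between two mics, which is the basic idea DUET builds on. It is not the Espressif blob (that is closed source) and not full DUET, which also uses the inter-mic delay, clusters the estimates and works frame by frame on an STFT. The function names, the single 64-sample frame, the two test tones and the 0.3 attenuation are all assumptions picked so the example stays tiny.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (fine for one short frame)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Naive inverse DFT, returning the real part."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def split_by_level(left, right):
    """Give each frequency bin to whichever mic hears it louder, then resynthesise."""
    spec_l, spec_r = dft(left), dft(right)
    near_left = [l if abs(l) >= abs(r) else 0 for l, r in zip(spec_l, spec_r)]
    near_right = [r if abs(r) > abs(l) else 0 for l, r in zip(spec_l, spec_r)]
    return idft(near_left), idft(near_right)

# Two hypothetical sources that do not overlap in frequency (bins 4 and 9).
N = 64
src_a = [math.sin(2 * math.pi * 4 * t / N) for t in range(N)]
src_b = [math.sin(2 * math.pi * 9 * t / N) for t in range(N)]

# src_a sits nearer the left mic, src_b nearer the right (assumed 0.3 attenuation).
left = [a + 0.3 * b for a, b in zip(src_a, src_b)]
right = [0.3 * a + b for a, b in zip(src_a, src_b)]

est_a, est_b = split_by_level(left, right)
err_a = max(abs(e - s) for e, s in zip(est_a, src_a))
err_b = max(abs(e - s) for e, s in zip(est_b, src_b))
```

With clean tones like these the two estimates match the original sources to within float rounding. Real speech overlaps in frequency and arrives with delay and reverb, which is where the hard science comes in.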
There is also beamforming, which can be less effective than BSS as really it just focuses and dereverberates. Many conference mics such as the Jabra use beamforming, but unlike a smart speaker they have no mechanism to lock onto the speaker direction for that command sentence. So in the presence of noise or another voice they will just jump to the loudest.
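A minimal sketch of a two-mic delay-and-sum beamformer shows both the focusing and the "jump to the loudest" behaviour. This is a generic textbook version with whole-sample delays, not any particular product's code; the helper names, the white-noise stand-ins for the voice and the competing source, and the 3 and -2 sample delays are all assumptions for illustration.

```python
import random

def shift(signal, delay):
    """Delay `signal` by `delay` samples (negative = advance), zero-padded."""
    n = len(signal)
    return [signal[i - delay] if 0 <= i - delay < n else 0.0 for i in range(n)]

def delay_and_sum(mic_a, mic_b, steer):
    """Advance mic_b by `steer` samples so a source with that inter-mic delay
    lines up with mic_a, then average the two channels."""
    aligned_b = shift(mic_b, -steer)
    return [0.5 * (a + b) for a, b in zip(mic_a, aligned_b)]

def energy(signal):
    return sum(x * x for x in signal)

def loudest_steer(mic_a, mic_b, max_delay):
    """Scan candidate steering delays and return the one with most output energy."""
    candidates = range(-max_delay, max_delay + 1)
    return max(candidates, key=lambda d: energy(delay_and_sum(mic_a, mic_b, d)))

rng = random.Random(0)
voice = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # stand-in for command speech
noise = [rng.gauss(0.0, 2.0) for _ in range(2000)]  # louder competing source

# Voice reaches mic_b 3 samples late; the noise reaches mic_b 2 samples early.
quiet_a, quiet_b = voice, shift(voice, 3)
noisy_a = [v + n for v, n in zip(voice, noise)]
noisy_b = [v + n for v, n in zip(shift(voice, 3), shift(noise, -2))]

steer_quiet = loudest_steer(quiet_a, quiet_b, 5)  # only the voice is present
steer_noisy = loudest_steer(noisy_a, noisy_b, 5)  # louder noise joins in
```

With only the voice present the scan steers to the voice (3 samples); once the louder source appears the same scan steers to it instead (-2 samples). Holding onto `steer_quiet` for the rest of the sentence, rather than rescanning, is roughly what locking onto the speaker direction means.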
As said, it's software, and the critically important start-of-chain audio DSP, that gets you a clear voice from differing volumes and distances.
It's a myth that a piece of hardware being able to employ multiple mics is all that is needed, as it needs DSP algorithms containing quite high-end science.
Google and even Xiaomi have the resources and contacts to get these things commissioned. In fact Google go one further than BSS and use targeted voice extraction, which is a type of BSS that works with user profiles.
The one thing about hardware is that being able to dictate both hardware and software, as Google, Amazon & Xiaomi do, is extremely beneficial, as the ASR models are trained specifically for that hardware and the software earlier in the chain.
Google & Amazon are miles ahead, firstly because they have the resources, but also because they have the weight and engineers to create application-specific SoCs, with absolutely huge purchasing power through economies of scale.
Even then, Google makes a small loss on them whilst Amazon is currently leaking like a sieve.
The gap between the best in academia, working for them and publishing papers on the latest technical innovation, and some very capable HA Python/ESPHome programmers is still huge. In the DSP world the latter are completely dependent on free and open-source software provided by the big guys & academia, which is in very short supply.
Basically you have the BSS blob provided by Espressif, a delay-sum beamformer by myself and various filters such as DTLN, DeepFilterNet & Conv-TasNet that are a massive evolution over early attempts such as RNNoise.
On hardware, the Amlogic chips you mention are very low-end, as they expect custom embedded systems written in a performant language like C/C++ or even Rust or Go.
Python is more of a RAD language, hence why we never see it in kernels or drivers. So to make provision for Python, and for coding that is far removed from performant custom embedded systems, the hardware choice likely needs some Victorian over-engineering: compensate by going higher-end.
The OrangePi5 is considerably better than the Rpi5 at a similar price, and nearly all RK3588(s) boards have received good support, especially in mainline.
OKdo now supply the OKdo ROCK ZERO 3W 1GB with Wi-Fi/BLE without GPIO, which is a Cortex-A55 and an even bigger step up over the Rpi02W, again at a similar price. Unlike the RK3588(s), its support is currently not good, but with some community backing it can likely be supported quickly.
When an Rpi5 4GB is only £2 more than an Rpi4, the Rpi4 doesn't make sense anymore, and likely nor does its CM4, as a similar CM5 is likely.
The Rpi5 is strangely inefficient for Arm, which is one reason why I have a preference for the Opi5, which posts nearly 2x the Gflops/watt.
The Raspberry Zero2W is an underclocked Pi3 at £17, whilst the £18 Radxa Zero3W has a lot more under the hood.
Be it Raspberry, OrangePi or Radxa, the current SBCs that stand out are the above Pi5 & Zero ones.
[EDIT] The Radxa Zero3W will likely get an image from Joshua-Riek/ubuntu-rockchip (GitHub: Ubuntu 22.04 for Rockchip RK3588 Devices), who does the best Ubuntu RK3588(s) images. It would be great to have Ubuntu images for the Radxa Zero3W too, as the Radxa ones still look bad.
The OrangePi3B is likely to be supported also.