Why does the network go unavailable periodically?

I have a few devices (I think all of them are Novostellar floods, though not all of my Novostellar floods are affected) that periodically go stupid (for want of a better word).

I am using the native API, and they are on the network and associated, but they start giving errors about “network unavailable”. While they are doing that though, I can actually log in via the web at the regular address, on the same subnet as the HA server (which at that moment is working fine).

The following is a good example. While one was in this state, I went to the ESPHome dashboard and told it to install the same firmware again (basically to force a reboot). It did. Notice how it goes from “not available” to accepting an upload from the same IP address. It’s as though some aspect of the API is rejecting connections while the device still accepts an OTA. As soon as the OTA finished it rebooted, reconnected to my browser, the API connected and it was happy. After a reboot it works fine for a day or so, but at some point (say on a Wi-Fi change of some sort) it is likely to go stupid again.

In the config I have “api” alone, no settings. The wifi section is a standard manual IP, nothing unusual. The board type is “generic-bk7231n-qfn32-tuya”, and I’m running ESPHome 2023.10.4. I have 28 ESPHome devices (not sure if that’s a lot or a few), and HA is running on a beefy Hyper-V server with plenty of horsepower, memory, etc.
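For anyone following along, the setup described above would look roughly like the sketch below. This is a reconstruction under assumptions, not the actual config: the node name, the secret names, and every IP address are placeholders, and the `web_server` entry is included only because the web interface is mentioned in this thread. The bare `ota:` form is the one accepted by ESPHome releases of the 2023.10 era.

```yaml
# Hypothetical reconstruction; names and addresses are placeholders.
esphome:
  name: flood-example          # placeholder node name

bk72xx:
  board: generic-bk7231n-qfn32-tuya

wifi:
  ssid: !secret wifi_ssid      # placeholder secret names
  password: !secret wifi_password
  manual_ip:
    static_ip: 192.168.1.50    # placeholder
    gateway: 192.168.1.1       # placeholder
    subnet: 255.255.255.0      # placeholder

api:                           # no options, so the defaults apply

ota:

web_server:
  port: 80

logger:
```

With `api:` left bare, the documented default `reboot_timeout` of 15 minutes applies, so a node that has no API client for that long should restart on its own.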

Any thoughts where to look?

Next time I’ll try to get a Wireshark capture, though I’m not exactly sure how to do that given the way HA is virtualized inside the already virtualized Hyper-V container. Hoping someone has a suggestion where to start before I go down that rathole.

Linwood

Are you monitoring the WiFi signal strength?

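sensor:
  # Reports the RSSI (in dBm) of the current Wi-Fi connection
  - platform: wifi_signal
    name: ${friendly_name} WiFi Level
    # Note: an id must be a valid identifier, so this only works
    # if ${friendly_name} contains no spaces or hyphens
    id: ${friendly_name}_WiFi_level
    update_interval: 300s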

Yes. Sorry, should have mentioned, it was -68dBm before it went nuts. After the upload it decided to change APs and was at -77dBm. Not sure why it opted for a worse one, but that should still be viable.
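Since the node moved between APs here, it may help to record which AP it is associated with alongside the signal level. A minimal sketch using the `wifi_info` text sensor follows; the entity names are illustrative, and whether every field is populated on BK72xx boards is an assumption worth verifying.

```yaml
text_sensor:
  - platform: wifi_info
    bssid:
      # MAC address of the AP the node is currently associated with
      name: ${friendly_name} Connected BSSID
    ssid:
      name: ${friendly_name} Connected SSID
    ip_address:
      name: ${friendly_name} IP Address
```

With the BSSID in HA history, an "unavailable" episode can be lined up against AP changes to see whether roaming triggers it.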

But the fact that I could do an OTA upload literally at the same time it was refusing to connect to the API seems to say it is more than a signal strength issue. Something, not sure on which side, was rejecting the connections. Notice there were 8 attempts shown above, all in the same second; a loss of TCP connection or association would take a lot longer to show up.

Is this a mesh network?

I have multiple APs with the same SSID on different channels, but all are wired (the word “mesh” means different things to different people).

Is there some very small connection limit perhaps? I have the opposite problem right now with a different one: the API connection is up and fine and showing me a log, but if I try to go to the web server it fails with a “refused”. I sniffed that, and indeed as soon as my PC sends a SYN for the web TCP connection I get back an RST. The log output clearly shows the web server component active, at the right port and address. It’s as though the web server either crashed and is not running, or is denying access for some other reason. The only other things I have running are sntp for time services and the API. That one’s signal is a nice strong -58dBm. It pings without missing. It just seems port 80 has been slammed shut.

I left that one (which is sitting right by my desk just getting prepped) in that condition overnight, nothing changed, the API and pings were still working, no disassociation or change, but the web server still refused to connect. I power cycled it, all back to normal. I’m assuming the web server is one process, the API responder is another process. This would seem to imply they may be crashing? Except the API process seemed to have been running enough that it logged – unless it’s a listener separately from the process that needs to own the API connection, and one was up (and logging) and the other crashed?

I’m not sure there is anything I can DO about it if that’s the case. Maybe something specific to this hardware not working well with the drivers esphome is using.

Depending on the complexity of your yaml, the web server component might be too heavy.

If you check the docs, you will see a warning about using the web server on the ESP8266.

You could just recompile a funky node without web server component to see how it behaves.
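As a sketch of that test, the only change needed on the affected node is to comment out (or delete) the `web_server` block and reinstall. The fragment assumes the rest of the config stays exactly as it is, and that the block used port 80 as mentioned earlier in the thread.

```yaml
api:

ota:

# Disabled for the test; restore once the node has run long enough to compare.
# web_server:
#   port: 80
```

If the API stays reachable for noticeably longer than the usual day or so, that points toward memory pressure from the web server rather than a Wi-Fi problem.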

Also the debug component might be useful to easily see heap memory etc. of your node (but not sure if the bk chips support that…)
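A minimal sketch of what that could look like is below. The entity names are illustrative, and as noted above it is an open question whether every value is reported on the BK chips; `free` and `loop_time` are the least platform-specific of the debug sensors.

```yaml
logger:
  level: DEBUG          # the debug component prints its details at this level

debug:
  update_interval: 5s

sensor:
  - platform: debug
    free:
      name: "Heap Free"       # free heap in bytes
    loop_time:
      name: "Loop Time"       # longest main-loop duration in the interval

text_sensor:
  - platform: debug
    reset_reason:
      name: "Reset Reason"    # why the node last restarted
```

A free-heap value that drifts steadily downward before the API or web server stops answering would support the idea that the node is running out of memory.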

And yes, web server, OTA, API are independent from each other while all depending on WIFI.