APsystems APS ECU R local inverters data pull

I had another weird issue today:

2021-01-10 02:33:09 WARNING (MainThread) [custom_components.apsystems_ecur] Using cached data from last successful communication from ECU. Error: Result on 'Inverter data' incorrect end signature '202' != END data=b'4150533131303031353030303230320a'
2021-01-10 02:34:09 WARNING (MainThread) [custom_components.apsystems_ecur] Using cached data from last successful communication from ECU. Error: Result on 'Inverter data' incorrect end signature '202' != END data=b'4150533131303031353030303230320a'
2021-01-10 13:03:31 WARNING (MainThread) [custom_components.apsystems_ecur] Using cached data from last successful communication from ECU. Error: Error getting checksum int from 'Inverter data' data=b''
2021-01-10 13:04:51 WARNING (MainThread) [custom_components.apsystems_ecur] Using cached data from last successful communication from ECU. Error: Error getting checksum int from 'ECU Query' data=b''
2021-01-10 13:05:51 WARNING (MainThread) [custom_components.apsystems_ecur] Using cached data from last successful communication from ECU. Error: [Errno 111] Connection refused
2021-01-10 13:06:51 WARNING (MainThread) [custom_components.apsystems_ecur] Using cached data from last successful communication from ECU. Error: [Errno 111] Connection refused
2021-01-10 13:07:51 WARNING (MainThread) [custom_components.apsystems_ecur] Using cached data from last successful communication from ECU. Error: [Errno 111] Connection refused

And then it continued to do connection refused for the next many hours, until I noticed and power cycled the ECU.

I’ve altered my code (and pushed to github) so it will only provide cached data 5 times. After that an exception and error will occur and HA will mark them as unavailable. I figured this was the “correct” thing to do - so if you wanted you could put the ECU on a smart switch and power cycle it automatically.

I’m testing now with this change. Not a particular easy thing to test with since I need it to go unavailable, and this is the first time in a week that it has.

If you guys think it should be longer than 5 times, or we should change this behavior, let me know. Maybe i’ll make it a variable in the yaml file you can adjust.

Not sure it makes me happy that i was not the only one, or sad that it happens with you too. Think it’s the latter, if device cant recover a stuck port and starts refusing, its not good.
I do suspect it comes with the Wifi connection having a transmission error and then the port screws up. This level of possible network trafficking issues, i cant investigate, that requires whole different kind of networking hardware.

Was there any specific error in data just before KEvin?
With me it happened after some literal errors, so it didn’t look to be on its own, but a ramp up from previous errors resulting in this refusal on connection.

this morning after some time, i also ran into ‘connection refused’ again. Didnt recover from it (did not update integration yet, now it’s updated).
Power cycle ecu did resolve it, but first few queries by hand, did not give any good response. There is some logic to the device that makes it hard to predict when it starts doing bad responses and when it runs just fine.
Lets see what newer version does today. normally once connection refused is resolved, the rest of the day it does its work fine.

Got some new logged error

2021-01-11 15:24:19 WARNING (MainThread) [custom_components.apsystems_ecur] Using cached data from last successful communication from ECU. Error: Error getting checksum int from 'Inverter data' data=b''
2021-01-11 15:24:19 ERROR (MainThread) [custom_components.apsystems_ecur] Unexpected error fetching apsystems_ecur data: local variable 'cache_count' referenced before assignment
Traceback (most recent call last):
  File "/home/pi/PI/.homeassistant/custom_components/apsystems_ecur/APSystemsECUR.py", line 130, in check_ecu_checksum
    checksum = int(data[5:9])
ValueError: invalid literal for int() with base 10: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/pi/PI/.homeassistant/custom_components/apsystems_ecur/__init__.py", line 50, in async_update_data
    data = await hass.async_add_executor_job(ecu.query_ecu)
  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/pi/PI/.homeassistant/custom_components/apsystems_ecur/APSystemsECUR.py", line 80, in query_ecu
    data = self.process_inverter_data()
  File "/home/pi/PI/.homeassistant/custom_components/apsystems_ecur/APSystemsECUR.py", line 195, in process_inverter_data
    self.check_ecu_checksum(data, "Inverter data")
  File "/home/pi/PI/.homeassistant/custom_components/apsystems_ecur/APSystemsECUR.py", line 133, in check_ecu_checksum
    raise APSystemsInvalidData(f"Error getting checksum int from '{cmd}' data={debugdata}")
custom_components.apsystems_ecur.APSystemsECUR.APSystemsInvalidData: Error getting checksum int from 'Inverter data' data=b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/hass/lib/python3.7/site-packages/homeassistant/helpers/update_coordinator.py", line 144, in async_refresh
    self.data = await self._async_update_data()
  File "/srv/hass/lib/python3.7/site-packages/homeassistant/helpers/update_coordinator.py", line 132, in _async_update_data
    return await self.update_method()
  File "/home/pi/PI/.homeassistant/custom_components/apsystems_ecur/__init__.py", line 55, in async_update_data
    cache_count += 1
UnboundLocalError: local variable 'cache_count' referenced before assignment

feels like complete empty response?

Data equaling to a 0 byte string would indicate an empty response from the ECU.

Clearly I introduced a new bug in my caching check though. I’m working through that now.

Since the device is pretty stable for a while, and then it’s not - I’m leaning on getting a switchable outlet and just creating an automation that power cycles the device a couple of hours before sunrise or something like that. This way it seems like it would work at least through the sunny parts of the day to gather data.

Preventive power cycle doesn’t remedy. Mine is off until sun rises. Then suddenly in 2 hours it’s refused connection. So I will make it depend on data age of ecu, if last date update is more then 40 minutes ago, power cycle, but only during sun above horizon.

I think sockets die on us. With sendall() it’s clear but with recv() we can’t say for sure when it is finished receiving if there’s no check. As a result we see incomplete messages. Maybe that’s why I see loops in most cases.

Please verify what I think is also happening: does the socket get shutdown and closed after an exeption?

Oh that sucks. Mine seems to have been pretty stable, only a 3-4 times has it gone unavailable, but good to know. I guess I’ll implement something like this too.

I’m gonna keep an eye on the group that’s trying to intercept the zigbee signals from the inverters, that in the end might be our best bet for stability.

I can change the code to do some looping and check for more data, I’ll have to put the socket into non-blocking mode - so it will take some time. I’ll try and get to this tonight.

I’m not sure we are gonna get much better since I really think this is a hardware/firmware issue at this point. The code as written should work fine for a device that accepts a short command and sends the data back - since it does that correctly 99% of the time for me. The device at some point screws up and gets into a weird state and goes so far as killing off the code that’s listening for incoming connections.

Agree. Some other question then, could the polling be stopped after sunset inside the code ?

Well, I’m not exactly sure how I’d determine sunset at this point. I might be able to make a service where you can tell it to stop collecting, and start collecting - then you could create your own automation to do that. I’ll look into the home assistant API a little and see what I can find.

HA has sun state, but indeed not sure if you have it available on your code level like that. Hence question. I know that HA has ‘turn off temporary devices’ support now, i’ll look at that first, dont waste your time on that yet. I’ll get back on that

New version pushed to github. Complete re-write of the socket code.

It will loop around the socket recv function until the data matches the recv_suffix (which is set to b’END\n’) or the self.timeout expires (which is set to 5 seconds). If it never got data that ended in b’END\n’ it will try the re-send the command to the ECU, and try the whole recv loop again. It tries the entire send command, recv loop for a total self.cmd_attempts (which is set to 3 times). You can edit these variables at the top of APSystemsECUR.py if you want to try different timeouts, or loop counts.

Relevant code changes if you want to review. At the bottom you should see the error that will show up in the log. You should get the command sent, and what ever we got from the recv buffer.

    async def async_read_from_socket(self):
        self.read_buffer = b''
        end_data = None

        while end_data != self.recv_suffix:
            self.read_buffer += await self.reader.read(self.recv_size)
            size = len(self.read_buffer)
            end_data = self.read_buffer[size-4:]

        return self.read_buffer

    async def async_send_read_from_socket(self, cmd):
        current_attempt = 0
        while current_attempt < self.cmd_attempts:
            current_attempt += 1

            self.writer.write(cmd.encode('utf-8'))
            await self.writer.drain()

            try:
                return await asyncio.wait_for(self.async_read_from_socket(), 
                    timeout=self.timeout)
            except Exception as err:
                pass

        self.writer.close()
        raise APSystemsInvalidData(f"Incomplete data from ECU after {current_attempt} attempts, cmd='{cmd.rstrip()}' data={self.read_buffer}")

I’m not going to get to the service to turn on/off the collection of data today, but I did find the HomeAssistant API I need to implement this.

While Sander you are right there is sun state, but that’s only if you have the sun integration turned on, and i’m not sure I should make any assumptions about that.

What i’m planning on doing is you being able to create an automation like these - I still need to code the start_query and stop_query services.

alias: Turn off query of ecu at sunset
trigger:
  - platform: sun
    event: sunset
action:
  - service: apsystem_ecur.stop_query

alias: Turn on query of ecu at sunrise
trigger:
  - platform: sun
    event: sunrise
action:
  - service: apsystem_ecur.start_query

cool stuff man. I’lll give it a go today straight away

Looks great, thank you. I added a late notice on my previous remark about exceptions. If an exception takes place, is the socket being correctly closed as well? I can’t interpret if your code does that, but it should of course.

Before installing the update I will follow results on my additions to your script: “sock.settimeout(2)” before “recv”. Don’t know if that really does something but until now I only have had exception errors: “Exception. Using cached data from last successful communication from ECU. Error: timed out” (added the text Exception to indicate which of the error handlers is being called).

Yes the code does close the socket when an exception happens see self.writer.close() before it raises an APSystemsInvalidData exception.

FYI this new code ran overnight for me, and I got pretty much exactly the same results (which is unfortunately what i thought would happen)

2021-01-11 20:40:35 WARNING (MainThread) [homeassistant.loader] You are using a custom integration for apsystems_ecur which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant.
2021-01-12 02:32:57 WARNING (MainThread) [custom_components.apsystems_ecur] Using cached data from last successful communication from ECU. Error: Incomplete data from ECU after 3 attempts, cmd='APS1100280002216000047832END' data=b'APS110015000202\n'
2021-01-12 02:34:12 WARNING (MainThread) [custom_components.apsystems_ecur] Using cached data from last successful communication from ECU. Error: Incomplete data from ECU after 3 attempts, cmd='APS1100280002216000047832END' data=b'APS110015000202\n'
2021-01-12 02:35:27 WARNING (MainThread) [custom_components.apsystems_ecur] Using cached data from last successful communication from ECU. Error: Incomplete data from ECU after 3 attempts, cmd='APS1100280002216000047832END' data=b'APS110015000202\n'

This is pretty consistent for me at 2am. I wonder if this response message is something to do with ECU busy or something along those lines. If you scroll through my other messages you can see it’s happened quite a few times, all around the same time.

Everything has recovered and I’m still getting new data, and that 2:35 message is the last in the log.

yes, i also had slowly a timing feeling that it start doing ‘stuff’.
my data has not been stale whole day… image
maybe i leave the device on tonight and see more details on timing

Unfortunate that the code adjustment didn’t do anything. Until now I captured no issues except voor the timeouts. So the issue might indeed be a busy ECU . These seem to be under control :crossed_fingers: I let it run for another evening/night. I would almost think the ECU is single threaded or something because of the data mix-up.

2021-01-12 09:58:06 WARNING (MainThread) [custom_components.apsystems_ecur] Exception. Using cached data from last successful communication from ECU. Error: timed out
2021-01-12 11:29:19 WARNING (MainThread) [custom_components.apsystems_ecur] Exception. Using cached data from last successful communication from ECU. Error: timed out
2021-01-12 14:06:23 WARNING (MainThread) [custom_components.apsystems_ecur] Exception. Using cached data from last successful communication from ECU. Error: timed out
2021-01-12 14:26:25 WARNING (MainThread) [custom_components.apsystems_ecur] Exception. Using cached data from last successful communication from ECU. Error: timed out
2021-01-12 14:36:27 WARNING (MainThread) [custom_components.apsystems_ecur] Exception. Using cached data from last successful communication from ECU. Error: timed out
2021-01-12 14:41:29 WARNING (MainThread) [custom_components.apsystems_ecur] Exception. Using cached data from last successful communication from ECU. Error: timed out
2021-01-12 14:51:31 WARNING (MainThread) [custom_components.apsystems_ecur] Exception. Using cached data from last successful communication from ECU. Error: timed out

I would think the ESP device that’s handling the socket connections would be. These micro-controllers are pretty low performance devices single core 80MHz.

I don’t know if it’s the main CPU for the entire ECU or if there is something else also handling the zigbee stuff.