The recent API outage showed some weaknesses in how the integration handles abnormal conditions. In particular, except for the implausible status of pets, there was no apparent indication of a problem unless one checked the rarely used SurePetcare web application. If one relied entirely on the mobile app, everything seemed to be all right. This led, among other effects, to people instantly suspecting the integration to be at fault.
The situation could be improved if the SurePetcare integration exposed an entity of type “problem” with the following behavior:
The entity returns “normal” as long as the following criteria are met:
- All devices have at least 15% battery remaining
- All devices show as connected
- All APIs are reachable and return codes in the 200-399 range (n.b. this would also catch failed authentications)
- All APIs return sane data matching the expected structure, having all relevant fields present and filled
In case one or more of the criteria are not met, the entity has the status “problem”.
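
The criteria above could be evaluated roughly as in the following sketch. It is only an illustration of the proposed logic, not the integration's actual code: the `DeviceState` and `ApiResult` classes, the function names, and the field names in `REQUIRED_FIELDS` are all hypothetical, and the real SurePetcare data structures will differ.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

BATTERY_THRESHOLD_PERCENT = 15  # threshold taken from the criteria above
# Hypothetical field names; the real API responses will use different ones.
REQUIRED_FIELDS = ("id", "name", "status")


@dataclass
class DeviceState:
    """Hypothetical snapshot of one device as seen by the integration."""
    name: str
    battery_percent: float
    connected: bool


@dataclass
class ApiResult:
    """Hypothetical result of one API call."""
    endpoint: str
    status_code: Optional[int]  # None means the API could not be reached
    payload: Any


def find_problems(devices: List[DeviceState], api_results: List[ApiResult]) -> List[str]:
    """Return a human-readable description for every criterion that is violated."""
    problems = []
    for device in devices:
        if device.battery_percent < BATTERY_THRESHOLD_PERCENT:
            problems.append(f"{device.name}: battery at {device.battery_percent}%")
        if not device.connected:
            problems.append(f"{device.name}: not connected")
    for result in api_results:
        if result.status_code is None:
            problems.append(f"{result.endpoint}: unreachable")
            continue
        if not 200 <= result.status_code <= 399:
            # This also catches failed authentication (401/403).
            problems.append(f"{result.endpoint}: HTTP {result.status_code}")
            continue
        if not isinstance(result.payload, dict):
            problems.append(f"{result.endpoint}: unexpected response structure")
            continue
        # A field counts as missing if it is absent, None, or an empty string.
        missing = [f for f in REQUIRED_FIELDS if result.payload.get(f) in (None, "")]
        if missing:
            problems.append(f"{result.endpoint}: missing fields {', '.join(missing)}")
    return problems


def overall_status(devices: List[DeviceState], api_results: List[ApiResult]) -> str:
    """Return "normal" if every criterion is met, otherwise "problem"."""
    return "problem" if find_problems(devices, api_results) else "normal"
```

Returning the list of violated criteria, rather than only a boolean, means the same check could also supply the details for a problem report. In Home Assistant the result would presumably map onto a binary sensor with the “problem” device class, but that wiring is left out here.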
As an added bonus, the integration might file a repair with details of the problem whenever it sets the entity to “problem” status.
Implementing this would offer a quick canary to determine whether or not the data from the various entities of the integration can be trusted.