VM Network Watchdog

TL;DR - My VM loses network connectivity sometimes, so I built a watchdog that resets it when this happens.

Hello, I’ve been using HA for about 9-10 months now as a long-time HomeSeer convert, and I love it! It runs on an HP Elite Mini PC (Win 10 / VirtualBox), alongside Blue Iris 5 (I’m starting to look at Frigate, but that’s another topic). With the addition of a wall tablet in the living room, it has become a part of everyone’s daily routines.

Overall the system has been very reliable, until anything disrupts the controller PC’s network connection. That could be a momentary internet outage, a router / switch reset or s/w update, an unplugged network cable on the PC, etc. If any of these happen, the host PC re-establishes its network connection, but the VirtualBox VM’s network connection does not fully restore.

When in this state, the following conditions are seen:

  • No HA access from any PCs on the network or from Nabu Casa, including the wall tablet. Note that all local PCs are on the same subnet.
  • HA can be accessed from the Host PC however. From there, HA can be found to be running fine.
  • Pinging the VM from the Host PC works, but pinging the VM from any other PC on the network fails (this works fine when all is running ok).
  • Most devices / integrations on the network and internet are Unavailable from HA. I say ‘most’ because, strangely, some like the Ecobee are not showing Unavailable. I suspect those connections are down too, but I haven’t tested for sure.

When in this state, shutting down the VM from the host PC, and restarting it restores the connection.

The best solution would be to fix the root cause of the issue. After some digging, this appears to be a long-standing VirtualBox issue, with no solutions that worked for me (at least none that I could find). The next best option is a watchdog that can reboot the VM when this condition occurs! I am new to Python, and thought this would be a good project to tackle.

One thing to note - the ‘Reboot System’ from HA does not resolve this issue for me. It does reboot the VM, but the conditions described above still persist. Only when the VM is shut down from the Host side and restarted does it restore full network functionality.

My solution is a Python script running on the host (Windows) system that establishes an MQTT connection with HA. The script subscribes to a topic, and if HA detects that network connectivity is lost, HA publishes a reboot command to that topic. The host system then shuts down the VM and starts it up again.

The automation that detects the network outage has two triggers - one fires if the main router goes unavailable, and the other fires on HA startup. Conditions then check whether a few other network devices are unavailable as well. If so, the automation sends a persistent notification, waits a short time, then sends the reboot signal.
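
To make the trigger / condition pairing easier to follow, here is a rough Python sketch of the decision the automation makes. It is only an illustration of the logic, not something HA runs: the function name `should_reboot` and the `states` dictionary are made up for this example, the entity IDs and trigger IDs are copied from the automation YAML further down, and the `for:` durations are ignored to keep it short.

```python
UNAVAILABLE = "unavailable"


def should_reboot(trigger_id, states):
    """Mirror the automation's OR-of-ANDs condition block.

    `states` maps entity_id -> current state string (hypothetical input).
    The `for:` durations from the YAML are deliberately left out.
    """
    def is_down(entity_id):
        return states.get(entity_id) == UNAVAILABLE

    # Branch 1: HA has just started and the router sensor is unavailable.
    started_offline = trigger_id == "hastarting" and is_down("sensor.nokia")

    # Branch 2: the router sensor dropped and two other devices are down too.
    router_dropped = (
        trigger_id == "nokiacellunavailable"
        and is_down("climate.garagetemp_thermostat")
        and is_down("sensor.obihai_sp1_service_status")
    )
    return started_offline or router_dropped
```

The second branch is what keeps a single flaky device from rebooting the VM: the router trigger alone is not enough, both of the other entities have to be unavailable as well.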

It’s been running great for a few weeks now. I haven’t seen too many natural network drops, but in all test cases it’s been 100% so far. I do need to get rid of the hacky time delays in the script and replace them with wait logic.

Network Watchdog Python script (it uses the paho-mqtt client, version 2.0 or later because of the CallbackAPIVersion argument):

import paho.mqtt.client as mqtt
import time
import subprocess
from time import localtime, strftime

broker="10.0.0.11"
port=1883

def printmessage(newmessage1, newmessage2):
    # Print a timestamped, two-part log line to the console.
    print(strftime("%Y-%m-%d %H:%M:%S", localtime()), newmessage1, newmessage2)

def on_message(client, userdata, message):
    payload = message.payload.decode("utf-8")
    printmessage("message received ", payload)
    if payload == "ON":
        printmessage("Reboot Command Received... Rebooting VM", "")
        # Ask the guest to shut down cleanly via an ACPI power-button event.
        vboxmanage = 'C:\\Progra~1\\Oracle\\VirtualBox\\VBoxManage.exe'
        subprocess.Popen([vboxmanage, "controlvm", "HomeAssistantR02", "acpipowerbutton"])
        printmessage("Delay 30 seconds... ", "")
        time.sleep(30)
        printmessage("Launching VM (takes about 90 seconds)..", "")
        # Start the VM again from the host side.
        vboxvm = 'C:\\Progra~1\\Oracle\\VirtualBox\\VirtualBoxVM.exe'
        subprocess.Popen([vboxvm, "--startvm", "HomeAssistantR02"])
        time.sleep(90)

def on_publish(client, userdata, mid, reason_code, properties):  # callback for completed publishes
    printmessage("data published \n", "")

def on_connect(client, userdata, flags, reason_code,properties):
    client.inprogress_flag=False
    if reason_code == 0:
        # success connect
        client.connected_flag=True #set flag
        printmessage("Connected to ",broker)
        printmessage("Subscribing to topic hostPC/reboot","")
        client.subscribe("hostPC/reboot")

    if reason_code > 0:
        # error processing
        printmessage("Error Connecting: ",reason_code)
        
def on_disconnect(client, userdata, flags, reason_code,properties):
    client.inprogress_flag=False
    if reason_code == 0:
        # success disconnect
        printmessage("Client Disconnected","")
        client.connected_flag=False #reset flag
    if reason_code > 0:
        # error processing
        printmessage("Error Disconnecting: ",reason_code)
 
mqtt.Client.connected_flag=False#create flag in class
mqtt.Client.inprogress_flag=False#create flag in class

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)

client.username = 'removed for post'
client.password = 'removed for post'
client.on_publish = on_publish                          #assign function to callback
client.on_message=on_message #attach function to callback
client.on_connect=on_connect  #bind call back function
client.on_disconnect=on_disconnect  #bind call back function

try:
    if not client.inprogress_flag:
        printmessage("Connecting to broker", "")
        client.inprogress_flag = True
        client.connect(broker, port, keepalive=60)  # establish connection
except Exception as exc:
    # Log why the first connection attempt failed instead of hiding it.
    printmessage("connect failed", exc)
    time.sleep(5)

client.loop_forever(retry_first_connection=True)
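
As a possible replacement for the fixed 30-second delay, here is a minimal sketch of the wait logic, assuming the VM state is polled with `VBoxManage showvminfo <vm> --machinereadable` and read from its `VMState="..."` line. The helper names (`parse_vm_state`, `get_vm_state`, `wait_for_state`, `restart_vm`), the timeouts, and the hard power-off fallback are all made up for illustration. It also starts the VM with `VBoxManage startvm` rather than `VirtualBoxVM.exe`, which is a swap from the script above, and I have not run it against the real setup.

```python
import subprocess
import time

# Install path and VM name copied from the script above (adjust as needed).
VBOXMANAGE = r"C:\Progra~1\Oracle\VirtualBox\VBoxManage.exe"
VM_NAME = "HomeAssistantR02"


def parse_vm_state(showvminfo_output):
    """Pull the VMState value out of `showvminfo --machinereadable` output."""
    for line in showvminfo_output.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "VMState":
            return value.strip().strip('"')
    return None


def get_vm_state(vm_name=VM_NAME):
    """Ask VBoxManage for the current state, e.g. 'running' or 'poweroff'."""
    result = subprocess.run(
        [VBOXMANAGE, "showvminfo", vm_name, "--machinereadable"],
        capture_output=True, text=True, check=True,
    )
    return parse_vm_state(result.stdout)


def wait_for_state(target, get_state=get_vm_state, timeout=120, interval=2,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the VM reports `target`; return True, or False on timeout."""
    deadline = clock() + timeout
    while True:
        if get_state() == target:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def restart_vm(vm_name=VM_NAME):
    """Shut the VM down via ACPI, wait until it is off, then start it again."""
    def state():
        return get_vm_state(vm_name)

    subprocess.run([VBOXMANAGE, "controlvm", vm_name, "acpipowerbutton"], check=True)
    if not wait_for_state("poweroff", get_state=state):
        # Hypothetical fallback: force the VM off if the guest ignores ACPI.
        subprocess.run([VBOXMANAGE, "controlvm", vm_name, "poweroff"], check=True)
        wait_for_state("poweroff", get_state=state, timeout=30)
    subprocess.run([VBOXMANAGE, "startvm", vm_name], check=True)
```

With something like this, `on_message` could call `restart_vm()` instead of sleeping for fixed periods. The call would still block the MQTT callback while it waits, so moving it to a separate thread may be worth considering.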

And the HA Automation :

alias: "Control: Network Watchdog"
description: ""
trigger:
  - platform: state
    entity_id:
      - sensor.nokia
    from: null
    to: unavailable
    for:
      hours: 0
      minutes: 1
      seconds: 0
    id: nokiacellunavailable
  - platform: homeassistant
    event: start
    id: hastarting
condition:
  - condition: or
    conditions:
      - condition: and
        conditions:
          - condition: trigger
            id:
              - hastarting
          - condition: state
            state: unavailable
            entity_id: sensor.nokia
            for:
              hours: 0
              minutes: 2
              seconds: 0
      - condition: and
        conditions:
          - condition: trigger
            id:
              - nokiacellunavailable
          - condition: state
            state: unavailable
            entity_id: climate.garagetemp_thermostat
            for:
              hours: 0
              minutes: 0
              seconds: 0
          - condition: state
            entity_id: sensor.obihai_sp1_service_status
            state: unavailable
action:
  - service: persistent_notification.create
    metadata: {}
    data:
      message: Home Assistant detected a Network disconnect and is rebooting...
  - delay:
      hours: 0
      minutes: 0
      seconds: 5
      milliseconds: 0
  - service: mqtt.publish
    data:
      qos: "0"
      topic: hostPC/reboot
      payload: "ON"
mode: single

Or … Stop using virtualbox.

Yes, the option 3 I didn’t mention. I did consider that, and looked at other hosting options. My system is still hosting Blue Iris, so I think I would still need to use a VM for HA for now.

I do need to spend more time on different migration options though. Like I said, I had been learning Python and this was a good ‘project’, albeit a bandaid.

What setup do you use, @nickrout ?

Allow me to jump in and suggest Proxmox. The instructions can be found here and firing up any VM (including Frigate LXC) is as simple as a 1-line install using tteck’s scripts.

If you’re still dependent on Blue Iris, it should be as simple as installing a minimal image of Windows inside Proxmox (though there’s no 1-line installer for this on tteck’s site - you’ll have to install the “hard” way).

I can also recommend VMware; Workstation is free for personal use now that Broadcom acquired VMware :wink:

I use a HA VM on proxmox, and LXCs for other services.

Wanted to update here. I went ahead and switched to Proxmox a few weeks ago and wow, have my eyes been opened! Simple, configurable, and fast! I was a bit worried about the learning curve, but there’s so much information available online that it was an easy transition. I used TTeck’s scripts, which made it pretty painless.

I had HA up and running within a few hours, then MariaDB running in a separate LXC after a few days, and I’m now working on getting Frigate up and running (to replace BI, which I abandoned when switching over). I have a few cameras up and am playing with detection settings now. It’s also running in its own LXC, and I have not installed the HA Integration just yet.

Anyhow, thanks @nickrout and @ShadowFist for the Proxmox recommendation!
