This project is a work in progress. I’ll provide updates as I develop things further. The goal is to have a binary sensor in Home Assistant that represents a prediction of whether our furnace is burning oil or not. Ideally, this sensor will alert me when I should be lighting wood fires; I try not to light a wood fire unless the oil is cutting in more than three times an hour.
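To make that rule concrete, the check I have in mind is basically to count how many times the burner kicked on in the last hour. Something along these lines should do it (the data below is just made up for illustration):
import pandas as pd

# Made-up example: one predicted on/off value per minute for two hours
idx = pd.date_range('2024-01-01 00:00', periods=120, freq='min')
predictions = pd.Series(False, index=idx)
predictions.iloc[70:80] = True    # a couple of burner cycles
predictions.iloc[100:110] = True

def cut_ins_last_hour(states: pd.Series) -> int:
    """Count off -> on transitions in the most recent hour of data."""
    cutoff = states.index[-1] - pd.Timedelta(hours=1)
    recent = states[states.index >= cutoff]
    return int((recent.astype(int).diff() == 1).sum())

# Only worth lighting a wood fire if the oil is cutting in more than three times an hour
print(cut_ins_last_hour(predictions) > 3)  # False here: only two cut-ins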
Looking at the following temperature graph from our furnace room, it seems obvious when the furnace is burning oil, burning wood, or not burning anything. I want to use this data to predict whether the furnace is burning oil or not using a machine learning model. The only issue is that I have no idea how to set up and train a machine learning model. That isn’t going to stop me from trying though.
I’ll be using Python since it’s the language I’m most proficient with. The first step is to get the temperature data from InfluxDB. I’m still running InfluxDB v1 and the Python client library for InfluxDB only seems to work with v2. As a workaround, I’m extracting the data using the InfluxDB HTTP API instead. This works like so:
import json
import requests

# Limit query to approx. a month’s worth of data
limit = 45000
# Build query string
query = f"""SELECT value FROM "°C" WHERE "entity_id"::tag = 'temperature_furnace_room' ORDER BY time DESC limit {limit}"""
# Prep request params
params = {
    'db': 'home_assistant',
    'q': query
}
# Send GET request to the query endpoint
res = requests.get("http://influxdb.local:8086/query", params=params)
# Extract data from response and convert from JSON
data = json.loads(res.content)['results'][0]['series'][0]['values']
To make the data easier to work with, I convert it to a Pandas dataframe:
import pandas as pd

# Convert to DataFrame (the query returned newest-first, so sort chronologically)
df = pd.DataFrame(data, columns=['Timestamp', 'Temperature'])
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.sort_values('Timestamp').reset_index(drop=True)
Next, I manually tag the data so the model can learn from it. I figured the easiest way to do that would be to click a bunch of “start/stop” points on the graph. This is possible with the matplotlib and mplcursors libraries. It’s important not to lose track of the “start/stop” points, so I print out little reminders for myself. To keep things from getting too cluttered, I only plot a day at a time.
import matplotlib.pyplot as plt
import mplcursors

# Create empty list to save start/stop points in
time_intervals = []

# Iterate over each day
for day, group in df.groupby(df['Timestamp'].dt.date):
    # Create a scatter plot of the day's values
    fig = plt.figure(frameon=False, layout='tight')
    plt.scatter(group['Timestamp'], group['Temperature'], marker='.')
    plt.grid(True)
    # Use mplcursors to enable interactive selection
    cursor = mplcursors.cursor()

    # Create a callback function for mplcursors
    @cursor.connect("add")
    def on_add(sel):
        # Get index of selected value
        index = sel.index
        # Get timestamp of selected value
        timestamp = group.iloc[index]['Timestamp']
        # Print out helpful reminders
        if len(time_intervals) % 2 == 0:
            print("Start")
        else:
            print("Stop")
        # Append start (or stop) point to start/stop list
        time_intervals.append(timestamp)

    plt.show()
Here’s what that looks like. I tried to minimize the amount of whitespace so it’s easier to see which points I need to click.
Now, with all the “start/stop” points collected, I can produce a dataset and save it to a CSV file. The following adds a column of Boolean values to the dataframe indicating whether the furnace is on (true) or off (false) and then saves it to dataset.csv.
# Create column of true/false values: pair the clicked points into (start, stop)
# intervals and flag any timestamp that falls inside one of them
df['Furnace'] = df['Timestamp'].apply(lambda x: any(start <= x <= stop for start, stop in zip(time_intervals[::2], time_intervals[1::2])))
# Save data to csv file
df.to_csv('dataset.csv', index=False)
As a sanity check, I’ll graph the values to see if they match up or not.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('dataset.csv')
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
for day, group in df.groupby(df['Timestamp'].dt.date):
    # Plot furnace-on points in red and furnace-off points in blue
    on = group[group['Furnace']]
    off = group[~group['Furnace']]
    plt.scatter(off['Timestamp'], off['Temperature'], c='blue', marker='.', label='Furnace off')
    plt.scatter(on['Timestamp'], on['Temperature'], c='red', marker='.', label='Furnace on')
    plt.title('Temperature Plot with Furnace Status')
    plt.xlabel('Timestamp')
    plt.ylabel('Temperature (°C)')
    plt.legend()
    plt.show()
Here’s a sample of a single day. It looks good to me. There are some areas that look like they could be the oil burning, but I’m pretty sure it’s actually wood.
That was all straightforward, but now I’m lost. I assume I can train a model by passing it the last so many temperature values along with the current oil-burning state (true/false). I’ll try it with the last ten values first and see how that goes.
import pandas as pd

window = 10
df = pd.read_csv("dataset.csv")
# Each sample is the current temperature reading plus the previous nine
X = pd.concat([df['Temperature'].shift(i) for i in range(window)], axis=1).dropna().values
# dropna() removed the first (window - 1) rows, so align the labels to match
y = df['Furnace'][window - 1:].values
Not knowing what to do next, I decided to ask ChatGPT. It suggested this:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize or normalize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Choose a machine learning model (Random Forest in this example)
model = RandomForestClassifier(random_state=42)
# Train the model
model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = model.predict(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print('Accuracy: ', accuracy)
print('Classification Report:\n', report)
After all that I ended up with these results:
My next steps are to compile more training data and fine-tune (i.e. completely redo) the ML model. Once I get something I’m happy with, I’ll proceed with integrating it into Home Assistant as a sensor. I’m not sure if I’ll go with AppDaemon, PyScript, or something else entirely yet.
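In case I end up going the AppDaemon route, here’s a very rough sketch of what that might look like. Everything in it is an assumption at this point: the entity names, the file paths, the once-a-minute polling, and the idea of saving the model and scaler with joblib.
import datetime
import appdaemon.plugins.hass.hassapi as hass
import joblib

class FurnaceOilPredictor(hass.Hass):
    def initialize(self):
        # Assumes the trained model and scaler were saved with joblib.dump()
        self.model = joblib.load("/config/appdaemon/apps/furnace_model.joblib")
        self.scaler = joblib.load("/config/appdaemon/apps/furnace_scaler.joblib")
        self.history = []
        # Run a prediction once a minute, starting shortly after the app loads
        start = datetime.datetime.now() + datetime.timedelta(seconds=5)
        self.run_every(self.predict, start, 60)

    def predict(self, kwargs):
        # Entity name is a placeholder for whatever my temperature sensor ends up being
        state = self.get_state("sensor.temperature_furnace_room")
        if state in (None, "unknown", "unavailable"):
            return
        self.history.append(float(state))
        if len(self.history) < 10:
            return  # not enough readings yet
        self.history = self.history[-10:]
        # Reverse so the most recent reading comes first, matching the training feature order
        features = self.scaler.transform([self.history[::-1]])
        burning = bool(self.model.predict(features)[0])
        self.set_state("binary_sensor.furnace_burning_oil", state="on" if burning else "off")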