mdadm RAID 5 monitor

Hiya,

Is there a way to monitor mdadm (Linux) RAID health with HA?

TIA.

Yes.

I use Netdata as one of my system-monitoring solutions, and it natively monitors mdstat. There are four ways to get those metrics:

  • Scrape the web page.
  • Export to a time-series database, then read that directly from HA.
  • Set up an alert in Netdata, then monitor that alert platform in HA (less CPU usage).
  • Or, the easiest: use the local JSON REST API.

This is the URL that pulls the number of failed disks:

http://192.168.0.2:19999/api/v1/data?chart=mdstat.mdstat_health&after=-600&before=0&points=1&group=average&gtime=0&format=json&options=seconds&options=jsonwrap

and it returns this data, which you parse:

{
   "api": 1,
   "id": "mdstat.mdstat_health",
   "name": "mdstat.mdstat_health",
   "view_update_every": 600,
   "update_every": 1,
   "first_entry": 1618544491,
   "last_entry": 1618631481,
   "before": 1618631400,
   "after": 1618630801,
   "dimension_names": ["md126"],
   "dimension_ids": ["md126"],
   "latest_values": [0],
   "view_latest_values": [0],
   "dimensions": 1,
   "points": 1,
   "format": "json",
   "result": {
      "labels": ["time", "md126"],
      "data": [
         [1618631400, 0]
      ]
   },
   "min": 0,
   "max": 0
}

The latest_values array gives you the current failed-disk count for each RAID array listed in dimension_names (the two arrays line up index by index); if any value is anything but 0 you have a problem. dimensions matches the number of RAID arrays you have in mdraid, so use that value as the size of the arrays.
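If you only have one array, a minimal RESTful sensor in HA can pull that value straight from the URL above (just a sketch; swap in your own Netdata host and adjust scan_interval to taste):

sensor:
  - platform: rest
    name: mdstat_failed_disks
    resource: "http://192.168.0.2:19999/api/v1/data?chart=mdstat.mdstat_health&after=-600&before=0&points=1&group=average&gtime=0&format=json&options=seconds&options=jsonwrap"
    # the state becomes the failed-disk count of the first (here: only) array
    value_template: "{{ value_json.latest_values[0] }}"
    unit_of_measurement: "failed disks"
    scan_interval: 300

An automation can then simply trigger when the sensor's numeric state goes above 0.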


Thanks for the answer.

I have Netdata installed, but there's no mdstat on mine. I'm trying to find out why and how.

It is part of the proc plugin; it should just read the output of /proc/mdstat if it exists.
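For reference, a healthy /proc/mdstat looks roughly like this (illustrative output, not from your box):

Personalities : [raid5]
md126 : active raid5 sdc1[2] sdb1[1] sda1[0]
      976512 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>

The [3/3] [UUU] part means all member disks are up; a failed member shows as _ instead of U.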

Yes, I know, but it doesn't show up.

Solved. I was using 1.9.0; after updating to 1.30.1 it's working. Thanks!

Hi @richieframe,
thanks to your help I found the values Netdata reports about the health of the RAID on my QNAP NAS.

My NAS is configured like this:

  • hard-disks 1 to 4 are in RAID5;
  • the fifth slot is empty;
  • the sixth contains an SSD with "cache" functions.

At the moment there are no problems at either the hard-disk level or the RAID level.

I've modified the URL according to my configuration, and below is the output:

{
   "api": 1,
   "id": "mdstat.mdstat_health",
   "name": "mdstat.mdstat_health",
   "view_update_every": 600,
   "update_every": 1,
   "first_entry": 1642872373,
   "last_entry": 1642876369,
   "before": 1642876200,
   "after": 1642875601,
   "dimension_names": ["md1", "md3", "md322", "md256", "md321", "md13", "md9"],
   "dimension_ids": ["md1", "md3", "md322", "md256", "md321", "md13", "md9"],
   "latest_values": [0, 0, 0, 0, 1, 19, 19],
   "view_latest_values": [0, 0, 0, 0, 1, 19, 19],
   "dimensions": 7,
   "points": 1,
   "format": "json",
   "result": {
      "labels": ["time", "md1", "md3", "md322", "md256", "md321", "md13", "md9"],
      "data": [
         [1642876200, 0, 0, 0, 0, 1, 19, 19]
      ]
   },
   "min": 0,
   "max": 19
}
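Following @richieframe's explanation, I'm also experimenting with a single REST sensor that reads this URL directly and keeps every array as an attribute (just a sketch; NAS_IP is a placeholder for my NAS address):

sensor:
  - platform: rest
    name: ts653a_mdstat_health
    resource: "http://NAS_IP:19999/api/v1/data?chart=mdstat.mdstat_health&after=-600&before=0&points=1&group=average&format=json&options=jsonwrap"
    # md1 is the first dimension in my output, so this is its failed-disk count
    value_template: "{{ value_json.latest_values[0] }}"
    # keep the raw arrays around for templating
    json_attributes:
      - dimension_names
      - latest_values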

I also found on the net that you can pull further data from the "all metrics" page of a running Netdata agent; this is its output for mdstat.mdstat_health:

	"mdstat.mdstat_health": {
		"name":"mdstat.mdstat_health",
		"family":"health",
		"context":"md.health",
		"units":"failed disks",
		"last_updated": 1642929989,
		"dimensions": {
			"md1": {
				"name": "md1",
				"value": 0.0000000
			},
			"md3": {
				"name": "md3",
				"value": 0.0000000
			},
			"md322": {
				"name": "md322",
				"value": 0.0000000
			},
			"md256": {
				"name": "md256",
				"value": 0.0000000
			},
			"md321": {
				"name": "md321",
				"value": 1.0000000
			},
			"md13": {
				"name": "md13",
				"value": 19.0000000
			},
			"md9": {
				"name": "md9",
				"value": 19.0000000
			}
		}
	},

Starting from this point I would like to set up a template sensor that provides more readable values (like "status ok", "degraded" and so on…).

At first I set up the sensors for Home Assistant like this:

      disk1_status:
        data_group: "mdstat.mdstat_health"
        element: "md1"
        icon: mdi:harddisk
      disk2_status:
        data_group: "mdstat.mdstat_health"
        element: "md3"
        icon: mdi:harddisk
      disk3_status:
        data_group: "mdstat.mdstat_health"
        element: "md322"
        icon: mdi:harddisk
      disk4_status:
        data_group: "mdstat.mdstat_health"
        element: "md321"
        icon: mdi:harddisk
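(For completeness: those entries sit inside the netdata sensor platform, so the whole block looks roughly like this, again with NAS_IP as a placeholder for my NAS address:)

sensor:
  - platform: netdata
    host: NAS_IP
    port: 19999
    resources:
      disk1_status:
        data_group: "mdstat.mdstat_health"
        element: "md1"
        icon: mdi:harddisk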

All return '0.0 failed disks' except the last one, which reports '1.0 failed disks'.

However, I don't understand whether they refer to the status of the individual hard disks or to that of the RAID.
Unfortunately I am not a Linux user (I know only a few basic concepts), but with some help from Mr. Google it seems to me that the md1 sensor perhaps refers to the health of the RAID (and that the other "mdX" sensors represent other aspects).

If this is true, then the following sensor would be enough for me:

      raid_status:
        data_group: "mdstat.mdstat_health"
        element: "md1"
        icon: mdi:harddisk

Taking into account my limitations, and while waiting to understand these concepts better, I've tried to sketch this proof-of-concept 'template sensor':

  - platform: template
    sensors:
      ts653a_raid_status_readable:
        friendly_name: "ts653a_raid_status_readable"
        value_template: >-
          {% if is_state('ts653a_raid_status', '0.0000000') %}
            OK
          {% elif is_state('ts653a_raid_status', '1.0000000') %}
            KO
          {% else %}
            -unknown-
          {% endif %}

Unfortunately, with poor results, because it always gives me -unknown-.
I'm still far from what I would like to do…

Thanks in advance

I've done it.
It was a trivial syntax error.
I forgot to add the "sensor." prefix to the entity ID, so the template couldn't work (note that the state string also had to be "0.0" instead of "0.0000000").

The right syntax is:

  - platform: template
    sensors:
      ts653a_raid_status_readable:
        friendly_name: "ts653a_raid_status_readable"
        value_template: >-
          {% if is_state('sensor.ts653a_raid_status', '0.0') %}
            OK
          {% elif is_state('sensor.ts653a_raid_status', '1.0') %}
            KO
          {% else %}
            -unknown-
          {% endif %}

The result is OK, so it's working.
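As a further tweak, a slightly more robust variant might convert the state to a number first, so any non-zero failed-disk count maps to KO (assuming an HA version where the float filter accepts a default):

  - platform: template
    sensors:
      ts653a_raid_status_readable:
        friendly_name: "ts653a_raid_status_readable"
        value_template: >-
          {# -1 marks unavailable or non-numeric states #}
          {% set failed = states('sensor.ts653a_raid_status') | float(-1) %}
          {% if failed == 0 %}
            OK
          {% elif failed > 0 %}
            KO
          {% else %}
            -unknown-
          {% endif %}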

Now the point is to understand whether the md1 element of mdstat.mdstat_health represents the health of the RAID or not…

Hi @liuk4friends,

did you ever manage to finish it?


Hi @RobinB,
Unfortunately no.
I'm stuck at the point of my last message.
The sensor works, but I don't know if it is set up correctly.
I don't know Linux well enough to understand the values of those "mdstat" sensors provided by the Netdata agent.

After asking you yesterday, I went ahead and used the HACS add-on RAID Monitor, which actually works out fine for me. I also set up Netdata Cloud, and that alerts on broken disks as well.


Hello,
Could you be more specific?
What is the HACS add-on you're talking about? I can't find it.

Sorry, it's called "RAID Status": https://github.com/LorenzoVasi/HA_mdadm


Thanks, I'll take a look.