mdadm RAID 5 monitor

Hiya,

Is there a way to monitor mdadm (Linux) RAID health with HA?

TIA.

Yes.

I use Netdata as one of my system-monitoring solutions, and it natively monitors mdstat. There are four ways to get those metrics:

  • Scrape the web page.
  • Export to a time-series database, then read that directly from HA.
  • Set up an alert in Netdata, then monitor that alert platform in HA (less CPU usage).
  • Or, the easiest: use the local JSON REST API.

This is the URL that pulls the number of failed disks:

http://192.168.0.2:19999/api/v1/data?chart=mdstat.mdstat_health&after=-600&before=0&points=1&group=average&gtime=0&format=json&options=seconds&options=jsonwrap

and it returns this data, which you parse:

{
   "api": 1,
   "id": "mdstat.mdstat_health",
   "name": "mdstat.mdstat_health",
   "view_update_every": 600,
   "update_every": 1,
   "first_entry": 1618544491,
   "last_entry": 1618631481,
   "before": 1618631400,
   "after": 1618630801,
   "dimension_names": ["md126"],
   "dimension_ids": ["md126"],
   "latest_values": [0],
   "view_latest_values": [0],
   "dimensions": 1,
   "points": 1,
   "format": "json",
   "result": {
      "labels": ["time", "md126"],
      "data": [
         [1618631400, 0]
      ]
   },
   "min": 0,
   "max": 0
}

The latest_values array gives you the current failed-disk count for each RAID array listed in dimension_names (the two arrays line up index by index); if any value is anything but 0 you have a problem. dimensions matches the number of RAID arrays you have in mdraid, so use that value as the size of the arrays.
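If you only have one array, a minimal RESTful sensor in HA can pull that value straight from the URL above (just a sketch; swap in your own Netdata host and adjust scan_interval to taste):

sensor:
  - platform: rest
    name: mdstat_failed_disks
    resource: "http://192.168.0.2:19999/api/v1/data?chart=mdstat.mdstat_health&after=-600&before=0&points=1&group=average&gtime=0&format=json&options=seconds&options=jsonwrap"
    # the state becomes the failed-disk count of the first (here: only) array
    value_template: "{{ value_json.latest_values[0] }}"
    unit_of_measurement: "failed disks"
    scan_interval: 300

An automation can then simply trigger when the sensor's numeric state goes above 0.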


Thanks for the answer.

I have Netdata installed, but there's no mdstat on mine. I'm trying to find out why and how.

It is part of the proc plugin; it should just read the output of /proc/mdstat if it exists.
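For reference, a healthy /proc/mdstat looks roughly like this (illustrative output, not from your box):

Personalities : [raid5]
md126 : active raid5 sdc1[2] sdb1[1] sda1[0]
      976512 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>

The [3/3] [UUU] part means all member disks are up; a failed member shows as _ instead of U.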

Yes, I know, but it doesn't show up.

Solved. I was using 1.9.0; after updating to 1.30.1 it's working. Thanks!

Hi @richieframe,
thanks to your help I found the values Netdata reports about the health of the RAID on my QNAP NAS.

My NAS is configured like this:

  • hard-disks 1 to 4 are in RAID5;
  • the fifth slot is empty;
  • the sixth contains an SSD with "cache" functions.

At the moment there are no problems at either the hard-disk level or the RAID level.

I've modified the URL according to my configuration, and below is the output:

{
   "api": 1,
   "id": "mdstat.mdstat_health",
   "name": "mdstat.mdstat_health",
   "view_update_every": 600,
   "update_every": 1,
   "first_entry": 1642872373,
   "last_entry": 1642876369,
   "before": 1642876200,
   "after": 1642875601,
   "dimension_names": ["md1", "md3", "md322", "md256", "md321", "md13", "md9"],
   "dimension_ids": ["md1", "md3", "md322", "md256", "md321", "md13", "md9"],
   "latest_values": [0, 0, 0, 0, 1, 19, 19],
   "view_latest_values": [0, 0, 0, 0, 1, 19, 19],
   "dimensions": 7,
   "points": 1,
   "format": "json",
   "result": {
      "labels": ["time", "md1", "md3", "md322", "md256", "md321", "md13", "md9"],
      "data": [
         [1642876200, 0, 0, 0, 0, 1, 19, 19]
      ]
   },
   "min": 0,
   "max": 19
}
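Following @richieframe's explanation, I'm also experimenting with a single REST sensor that reads this URL directly and keeps every array as an attribute (just a sketch; NAS_IP is a placeholder for my NAS address):

sensor:
  - platform: rest
    name: ts653a_mdstat_health
    resource: "http://NAS_IP:19999/api/v1/data?chart=mdstat.mdstat_health&after=-600&before=0&points=1&group=average&format=json&options=jsonwrap"
    # md1 is the first dimension in my output, so this is its failed-disk count
    value_template: "{{ value_json.latest_values[0] }}"
    # keep the raw arrays around for templating
    json_attributes:
      - dimension_names
      - latest_values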

I also found on the net that you can pull further data from the "all metrics" page of a running Netdata agent; this is its output for mdstat.mdstat_health:

	"mdstat.mdstat_health": {
		"name":"mdstat.mdstat_health",
		"family":"health",
		"context":"md.health",
		"units":"failed disks",
		"last_updated": 1642929989,
		"dimensions": {
			"md1": {
				"name": "md1",
				"value": 0.0000000
			},
			"md3": {
				"name": "md3",
				"value": 0.0000000
			},
			"md322": {
				"name": "md322",
				"value": 0.0000000
			},
			"md256": {
				"name": "md256",
				"value": 0.0000000
			},
			"md321": {
				"name": "md321",
				"value": 1.0000000
			},
			"md13": {
				"name": "md13",
				"value": 19.0000000
			},
			"md9": {
				"name": "md9",
				"value": 19.0000000
			}
		}
	},

Starting from this point I would like to set up a template sensor that provides more readable values (like "status ok", "degraded" and so on…).

At first I set up the sensors for Home Assistant like this:

      disk1_status:
        data_group: "mdstat.mdstat_health"
        element: "md1"
        icon: mdi:harddisk
      disk2_status:
        data_group: "mdstat.mdstat_health"
        element: "md3"
        icon: mdi:harddisk
      disk3_status:
        data_group: "mdstat.mdstat_health"
        element: "md322"
        icon: mdi:harddisk
      disk4_status:
        data_group: "mdstat.mdstat_health"
        element: "md321"
        icon: mdi:harddisk
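(For completeness: those entries sit inside the netdata sensor platform, so the whole block looks roughly like this, again with NAS_IP as a placeholder for my NAS address:)

sensor:
  - platform: netdata
    host: NAS_IP
    port: 19999
    resources:
      disk1_status:
        data_group: "mdstat.mdstat_health"
        element: "md1"
        icon: mdi:harddisk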

All return '0.0 failed disks' except the last one, which reports '1.0 failed disks'.

However, I don't understand whether they refer to the status of the individual hard disks or to that of the RAID.
Unfortunately I am not a Linux user (I know only a few basic concepts), but with some help from Mr. Google it seems to me that the md1 sensor perhaps refers to the health of the RAID (and that the other "mdX" sensors represent other aspects).

If this is true, then the following sensor would be enough for me:

      raid_status:
        data_group: "mdstat.mdstat_health"
        element: "md1"
        icon: mdi:harddisk

Taking into account my limitations, and while waiting to understand these concepts better, I've tried to sketch this proof-of-concept 'template sensor':

  - platform: template
    sensors:
      ts653a_raid_status_readable:
        friendly_name: "ts653a_raid_status_readable"
        value_template: >-
          {% if is_state('ts653a_raid_status', '0.0000000') %}
            OK
          {% elif is_state('ts653a_raid_status', '1.0000000') %}
            KO
          {% else %}
            -unknown-
          {% endif %}

Unfortunately, with poor results, because it always gives me -unknown-.
I'm still far from what I would like to do…

Thanks in advance

I've done it.
It was a trivial syntax error.
I forgot to add the "sensor." prefix to the entity ID, so the template couldn't work (note that the state string also had to be "0.0" instead of "0.0000000").

The right syntax is:

  - platform: template
    sensors:
      ts653a_raid_status_readable:
        friendly_name: "ts653a_raid_status_readable"
        value_template: >-
          {% if is_state('sensor.ts653a_raid_status', '0.0') %}
            OK
          {% elif is_state('sensor.ts653a_raid_status', '1.0') %}
            KO
          {% else %}
            -unknown-
          {% endif %}

The result is OK, so it's working.
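As a further tweak, a slightly more robust variant might convert the state to a number first, so any non-zero failed-disk count maps to KO (assuming an HA version where the float filter accepts a default):

  - platform: template
    sensors:
      ts653a_raid_status_readable:
        friendly_name: "ts653a_raid_status_readable"
        value_template: >-
          {# -1 marks unavailable or non-numeric states #}
          {% set failed = states('sensor.ts653a_raid_status') | float(-1) %}
          {% if failed == 0 %}
            OK
          {% elif failed > 0 %}
            KO
          {% else %}
            -unknown-
          {% endif %}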

Now the point is to understand whether the md1 element of mdstat.mdstat_health represents the health of the RAID or not…

Hi @liuk4friends,

did you ever manage to finish it?


Hi @RobinB,
Unfortunately no.
I'm stuck at the point of my last message.
The sensor works, but I don't know if it is set up correctly.
I don't know Linux well enough to understand the values of those "mdstat" sensors provided by the Netdata agent.

After asking you yesterday, I went ahead and used the HACS add-on RAID Monitor, which actually works out fine for me. I also set up Netdata Cloud, and that alerts on broken disks as well.


Hello,
Could you be more specific?
What is the HACS add-on you're talking about? I can't find it.

Sorry, it's called "RAID Status": https://github.com/LorenzoVasi/HA_mdadm


Thanks, I'll take a look.