In addition, if SELinux is enabled in enforcing mode, you will need to add the corresponding labels to the directory where you'll mount the RAID device. Otherwise, you'll run into this warning message while attempting to mount it:

(Screenshot: SELinux RAID Mount Error)

You can fix this by running:

# restorecon -R /mnt/raid1

There are various reasons why a storage device can fail (SSDs have greatly reduced the chances of this happening, though), but regardless of the cause, you can be sure that issues can occur at any time, and you need to be prepared to replace the failed part and to ensure the availability and integrity of your data.

A word of advice first. Even though you can inspect /proc/mdstat to check the status of your RAIDs, there's a better and time-saving method that consists of running mdadm in monitor + scan mode, which will send alerts via email to a predefined recipient.

To set this up, add the following line to /etc/mdadm.conf, followed by the email address that should receive the alerts:

MAILADDR

(Screenshot: RAID Monitoring Email Alerts)

To run mdadm in monitor + scan mode, add the following crontab entry as root:

/sbin/mdadm --monitor --scan --oneshot

By default, mdadm will check the RAID arrays every 60 seconds and send an alert if it finds an issue. You can modify this behavior by adding the --delay option to the crontab entry above, along with a number of seconds (for example, --delay 1800 means 30 minutes).

Finally, make sure you have a Mail User Agent (MUA) installed, such as mutt or mailx. Otherwise, you will not receive any alerts. In a minute we will see what an alert sent by mdadm looks like.

Simulating and Replacing a Failed RAID Storage Device

To simulate an issue with one of the storage devices in the RAID array, we will use the --manage and --set-faulty options as follows:

# mdadm --manage --set-faulty /dev/md0 /dev/sdc1

This will result in /dev/sdc1 being marked as faulty, as we can see in /proc/mdstat:

(Screenshot: Simulate Issue with RAID Storage)

More importantly, let's see if we received an email alert with the same warning:

(Screenshot: Email Alert on Failed RAID Device)

In this case, you will need to remove the device from the software RAID array:

# mdadm /dev/md0 --remove /dev/sdc1

Then you can physically remove it from the machine and replace it with a spare part (/dev/sdd, where a partition of type fd has been previously created):

# mdadm --manage /dev/md0 --add /dev/sdd1

Luckily for us, the system will automatically start rebuilding the array with the part that we just added. We can test this by marking /dev/sdb1 as faulty, removing it from the array, and making sure that the file tecmint.txt is still accessible at /mnt/raid1:

# mdadm --detail /dev/md0

The image above clearly shows that after adding /dev/sdd1 to the array as a replacement for /dev/sdc1, the rebuilding of data was automatically performed by the system without intervention on our part.

Though not strictly required, it's a great idea to have a spare device handy so that the process of replacing the faulty device with a good drive can be done in a snap. To do that, let's re-add /dev/sdb1 and /dev/sdc1:

# mdadm --manage /dev/md0 --add /dev/sdb1

(Screenshot: Replace Failed RAID Device)

Recovering from a Redundancy Loss

As explained earlier, mdadm will automatically rebuild the data when one disk fails. But what happens if two disks in the array fail? Let's simulate such a scenario by marking /dev/sdb1 and /dev/sdd1 as faulty:

# umount /mnt/raid1
# mdadm --manage --set-faulty /dev/md0 /dev/sdb1
# mdadm --manage --set-faulty /dev/md0 /dev/sdd1

Attempts to re-create the array the same way it was originally created at this point (or using the --assume-clean option) may result in data loss, so that should be left as a last resort.

Please note that up to this point, we haven't touched /dev/sdb or /dev/sdd, the disks that held the partitions that were part of the RAID array. Let's try to recover the data from /dev/sdb1, for example, into a similar disk partition (/dev/sde1 -- note that this requires that you create a partition of type fd on /dev/sde before proceeding) using ddrescue:

# ddrescue -r 2 /dev/sdb1 /dev/sde1
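For reference, the monitoring setup described above boils down to one configuration line and one crontab entry. In this sketch, the recipient address and the @hourly schedule are illustrative values, not taken from the original setup:

```
# /etc/mdadm.conf -- recipient that mdadm --monitor will alert
MAILADDR admin@example.com

# root's crontab (edit with: crontab -e) -- run a one-shot scan periodically;
# the @hourly schedule here is only an example
@hourly /sbin/mdadm --monitor --scan --oneshot
```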
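Since the article relies on reading /proc/mdstat, here is a minimal sketch of how a script could flag a degraded array from that file's format. The function name and the parsing regex are illustrations of the idea, not part of mdadm: a degraded array shows an underscore in its member-status field (e.g. [U_] instead of [UU]).

```shell
#!/bin/sh
# Report degraded md arrays in mdstat-format input.
# A degraded array shows "_" in its member-status field, e.g. [U_] vs [UU].
check_degraded() {
    # The status line ("[2/1] [U_]") follows the "mdX : active ..." line,
    # so grab both (-B1) and print the array name from the preceding line.
    grep -B1 -E '\[[U_]*_[U_]*\]' "$1" | awk '/^md/ { print $1 " is degraded" }'
}

# Demo on a sample snapshot resembling /proc/mdstat after /dev/sdc1 failed:
cat > /tmp/mdstat.sample <<'EOF'
Personalities : [raid1]
md0 : active raid1 sdb1[0] sdc1[1](F)
      1044160 blocks super 1.2 [2/1] [U_]
EOF
check_degraded /tmp/mdstat.sample    # prints: md0 is degraded
```

On a live system you would point the function at /proc/mdstat itself, e.g. from the same cron schedule used for mdadm --monitor.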
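The fail / remove / replace cycle described above can be collected into one small script. This is only a sketch: the device names are the ones used in this article and must be adapted, and the DRYRUN guard (an addition of this sketch, defaulting to on) prints the mdadm commands instead of executing them, so you can review them before running with DRYRUN=0 as root.

```shell
#!/bin/sh
# Sketch of the fail / remove / replace cycle with the article's device
# names. DRYRUN defaults to 1: commands are printed, not executed.
set -e
ARRAY="${1:-/dev/md0}"
BAD="${2:-/dev/sdc1}"
SPARE="${3:-/dev/sdd1}"

run() { if [ "${DRYRUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run mdadm --manage --set-faulty "$ARRAY" "$BAD"   # flag the failing member
run mdadm "$ARRAY" --remove "$BAD"                # detach it from the array
run mdadm --manage "$ARRAY" --add "$SPARE"        # add the spare; rebuild starts
run mdadm --detail "$ARRAY"                       # check the rebuild status
```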
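The double-failure simulation and the ddrescue recovery step can be sketched the same way. Again the device names follow the article and are assumptions for your system, and the DRYRUN guard (default on) prints the commands rather than executing them:

```shell
#!/bin/sh
# Sketch of the two-disk failure scenario and the ddrescue recovery step.
# DRYRUN defaults to 1: commands are printed, not executed.
run() { if [ "${DRYRUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run umount /mnt/raid1                                  # stop using the array
run mdadm --manage --set-faulty /dev/md0 /dev/sdb1     # first "failed" disk
run mdadm --manage --set-faulty /dev/md0 /dev/sdd1     # second "failed" disk

# Copy whatever is still readable from the failed member onto a fresh
# fd-type partition, retrying bad sectors twice (-r 2):
run ddrescue -r 2 /dev/sdb1 /dev/sde1
```

In a real recovery, GNU ddrescue also accepts a mapfile as a third argument so an interrupted copy can be resumed where it left off.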