Help! My FreeNAS device says my drives are DEGRADED.
The first issue that cropped up from my FreeNAS build was the boot drives going into a DEGRADED state a little over a month into operation. I had the foresight to mirror that boot pool since it's all run from USB keys, and USB nubbins are sometimes flaky to begin with. In a large batch run, maybe 2% of the thumb drives come out bad, and then it's roulette time with your files.
Fixing this was as easy as identifying the failed disk in the zpool listing, pulling it, replacing it with another 32GB thumb drive, and resilvering the boot pool. FreeNAS made that process super easy.
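For the record, the CLI path looks roughly like this (the FreeNAS UI wraps the same steps, and also handles the partitioning and bootcode bits I'm glossing over here). The pool and device names below are assumptions, not gospel: check `zpool status` for yours. I'm printing the plan rather than executing it so you can eyeball it first.

```shell
pool="freenas-boot"   # FreeNAS's default boot pool name (verify yours)
dead="da1p2"          # the mirror member zpool reports as FAULTED/UNAVAIL

# Dry-run plan: detach the dead stick, swap in the new 32GB key so it
# comes back at the same node, then resilver and watch progress.
plan="zpool offline $pool $dead
zpool replace $pool $dead $dead
zpool status $pool"
printf '%s\n' "$plan"
```

Run the printed lines one at a time once the new key is physically in place.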
However, now I’m in somewhat of a different pickle…
My Mission Critical RAIDZ2 pool is showing 2 drives as unavailable.
Well shit. This doesn't bode well. The power flickered on and off for a little over 30 minutes today, and that evidently hit my FreeNAS box pretty hard. I really do need to get this thing behind a UPS. (More on that in a follow-up post.)
Here's some output from `zpool status -v`:
```
  pool: volume1
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 0 days 00:20:20 with 0 errors on Sun Sep 9 00:20:20 2018
config:

        NAME                                                STATE     READ WRITE CKSUM
        volume1                                             DEGRADED     0     0     0
          raidz2-0                                          DEGRADED     0     0     0
            gptid/44c05816-9861-11e8-8b15-7085c2795350.eli  ONLINE       0     0     0
            gptid/4555ba97-9861-11e8-8b15-7085c2795350.eli  ONLINE       0     0     0
            8894497950539516152                             UNAVAIL      0     0     0  was /dev/gptid/45ecef77-9861-11e8-8b15-7085c2795350.eli
            4278200619945568431                             UNAVAIL      0     0     0  was /dev/gptid/46735d83-9861-11e8-8b15-7085c2795350.eli

errors: No known data errors
```
We can see that two GPTIDs are missing; the "was" entries give it away. Does this mean the device identifiers changed?
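If you want to pull those "was" paths out programmatically (handy for scripting alerts later), a one-liner over the status text does it. Here I'm feeding a saved two-line excerpt of the output above; on the box you'd pipe `zpool status -v volume1` directly.

```shell
# Saved excerpt of the zpool status output above.
zstatus='8894497950539516152 UNAVAIL 0 0 0 was /dev/gptid/45ecef77-9861-11e8-8b15-7085c2795350.eli
4278200619945568431 UNAVAIL 0 0 0 was /dev/gptid/46735d83-9861-11e8-8b15-7085c2795350.eli'

# Any line carrying a "was /dev/..." note names a missing vdev member;
# the path is the last field on the line.
missing=$(printf '%s\n' "$zstatus" | awk '/was \/dev\// { print $NF }')
printf '%s\n' "$missing"
```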
What the hell even is /dev/gptid?
`geom disk list` shows /dev/foo entries, but what about those UUIDs?
Here's a tool I'm not super familiar with, but am happy exists and is well documented in its man page.
```
$ glabel status
                                      Name  Status  Components
gptid/c14bf7e0-b7d4-11e8-bfac-7085c2795350     N/A  da0p1
                              label/efibsd     N/A  da1p1
gptid/21049873-95e5-11e8-8d4b-7085c2795350     N/A  da1p1
gptid/75470614-a296-11e8-976f-7085c2795350     N/A  ada0p2
gptid/611a6069-a296-11e8-976f-7085c2795350     N/A  ada1p2
gptid/44c05816-9861-11e8-8b15-7085c2795350     N/A  ada2p2
gptid/4555ba97-9861-11e8-8b15-7085c2795350     N/A  ada3p2
gptid/45ecef77-9861-11e8-8b15-7085c2795350     N/A  ada4p2
gptid/46735d83-9861-11e8-8b15-7085c2795350     N/A  ada5p2
```
And for extended info, `glabel list` will show even more details:
```
Geom name: ada0p2
Providers:
1. Name: gptid/75470614-a296-11e8-976f-7085c2795350
   Mediasize: 2998445408256 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 5856338688
   length: 2998445408256
   index: 0
Consumers:
1. Name: ada0p2
   Mediasize: 2998445408256 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
```
Cool, so now we can map those UUIDs back to our disks with this output.
It looks to me like ada4 and ada5 are the problem children in this scenario. Let's keep digging in.
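To make that mapping mechanical instead of eyeballing the table, you can match a UUID against the components column of `glabel status`. I'm using a saved two-line excerpt of the output above; live, pipe `glabel status` itself.

```shell
# Saved excerpt of the glabel status output above.
glabel_out='gptid/45ecef77-9861-11e8-8b15-7085c2795350 N/A ada4p2
gptid/46735d83-9861-11e8-8b15-7085c2795350 N/A ada5p2'

uuid="45ecef77-9861-11e8-8b15-7085c2795350"   # one of the UNAVAIL members

# Field 1 is the gptid label, field 3 the backing partition.
dev=$(printf '%s\n' "$glabel_out" | awk -v u="$uuid" '$1 == "gptid/" u { print $3 }')
echo "$dev"   # -> ada4p2
```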
So, what's actually in the GPT partition devfs path?
Note: I don't actually know whether FreeBSD uses devfs; I'm just inferring with my Linux-admin hat on.
So maybe we should look at what's actually in /dev/gptid at this point and figure out what's on disk.
```
$ ls /dev/gptid
21049873-95e5-11e8-8d4b-7085c2795350      46735d83-9861-11e8-8b15-7085c2795350
44c05816-9861-11e8-8b15-7085c2795350      611a6069-a296-11e8-976f-7085c2795350
44c05816-9861-11e8-8b15-7085c2795350.eli  611a6069-a296-11e8-976f-7085c2795350.eli
4555ba97-9861-11e8-8b15-7085c2795350      75470614-a296-11e8-976f-7085c2795350
4555ba97-9861-11e8-8b15-7085c2795350.eli  75470614-a296-11e8-976f-7085c2795350.eli
45ecef77-9861-11e8-8b15-7085c2795350      c14bf7e0-b7d4-11e8-bfac-7085c2795350
```
Confirmed: the raw GPTIDs for the two problem disks are actually still present, but their .eli providers are not, which suggests GELI never attached those two members after the power event.
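You can make that comparison mechanical: for each pool member UUID, check whether the listing has a matching .eli entry. The listing below is a saved copy of the relevant /dev/gptid entries above; on the live box, swap in `ls /dev/gptid`.

```shell
# The relevant /dev/gptid entries from the listing above.
listing='44c05816-9861-11e8-8b15-7085c2795350
44c05816-9861-11e8-8b15-7085c2795350.eli
4555ba97-9861-11e8-8b15-7085c2795350
4555ba97-9861-11e8-8b15-7085c2795350.eli
45ecef77-9861-11e8-8b15-7085c2795350
46735d83-9861-11e8-8b15-7085c2795350'

# The four pool member UUIDs, from zpool status; flag any without a
# matching .eli provider in the listing.
noeli=""
for u in 44c05816-9861-11e8-8b15-7085c2795350 \
         4555ba97-9861-11e8-8b15-7085c2795350 \
         45ecef77-9861-11e8-8b15-7085c2795350 \
         46735d83-9861-11e8-8b15-7085c2795350; do
    printf '%s\n' "$listing" | grep -qxF "$u.eli" || noeli="$noeli $u"
done
echo "no .eli provider:$noeli"
```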
Something about that power flicker has shuffled these devices around. I don't have output to confirm this yet, but it appears a pair of disks on a different SATA channel has moved into different positions in the disk layout, which may be why this change showed up.
What I would have expected is for the disk geometry to list the HGST disks first, since they comprise the RAIDZ2 pool.
```
$ sudo camcontrol devlist
<WDC WD30EFRX-68EUZN0 82.00A82>   at scbus0 target 0 lun 0 (pass0,ada0)
<WDC WD40EFRX-68WT0N0 82.00A82>   at scbus1 target 0 lun 0 (pass1,ada1)
<HGST HDN726040ALE614 APGNW7JH>   at scbus2 target 0 lun 0 (pass2,ada2)
<HGST HDN726040ALE614 APGNW7JH>   at scbus3 target 0 lun 0 (pass3,ada3)
<HGST HDN726040ALE614 APGNW7JH>   at scbus4 target 0 lun 0 (pass4,ada4)
<HGST HDN726040ALE614 APGNW7JH>   at scbus5 target 0 lun 0 (pass5,ada5)
<SanDisk Cruzer Fit 1.00>         at scbus8 target 0 lun 0 (pass6,da0)
<SanDisk Ultra Fit 1.00>          at scbus9 target 0 lun 0 (pass7,da1)
```
Notice that it's showing the WDCs as the first disks in the geometry pass. This reeks of something real shitty happening with the disk layout, unless I'm misremembering how the channels were set up.
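If you want an at-a-glance device-to-vendor mapping to diff against your saved layout notes, a little sed over the devlist output gets you there. Saved excerpt again; on the box, pipe `camcontrol devlist` directly.

```shell
# Saved excerpt of the camcontrol devlist output above.
devlist='<WDC WD30EFRX-68EUZN0 82.00A82>   at scbus0 target 0 lun 0 (pass0,ada0)
<HGST HDN726040ALE614 APGNW7JH>   at scbus2 target 0 lun 0 (pass2,ada2)'

# Capture the vendor token after "<" and the adaN name before ")".
map=$(printf '%s\n' "$devlist" | sed -E 's/^<([^ >]+).*,(ada[0-9]+)\)$/\2: \1/')
printf '%s\n' "$map"
```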
This is a fine note to add post-facto: when you set up FreeNAS the first time, copy your dataset and disk layout somewhere you can reference later, AND KEEP IT UPDATED. Having even a shallow copy of the disk layout, like what's captured in this blog post, will be invaluable for identifying chicanery like fallback settings in your BIOS kicking in after a change.
An aside on smartctl
At this point I kicked off a long smartctl scan of the drives, which takes around 530 minutes to complete. Best to set it and not touch the rig while it probes the disks. When it's finished, you can view the SMART output with:
`smartctl -a /dev/ada3`, for example; just substitute your drive letters.
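To kick the long self-test off on all four pool disks in one go, something like the following works. This assumes smartmontools is installed and my ada2–ada5 layout; it prints the commands as a dry run so you can sanity-check before firing tests that tie the drives up for hours.

```shell
drives="ada2 ada3 ada4 ada5"    # the RAIDZ2 members on my box; adjust to yours

# Build the command list; pipe the output to `sh` to actually start the tests.
cmds=$(for d in $drives; do echo "smartctl -t long /dev/$d"; done)
printf '%s\n' "$cmds"
```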
Since the drives all passed, that's no longer interesting to this story, but it's a worthy step when doing forensics to see what's up with your NAS.
BACKUP, BACKUP, BACKUP
Before we go any further, this is where I'd highly recommend backing up data. Ideally you've been doing this all along, offsite or to physical media kept in a secure location. The majority of this data is already offsite; only a small segment exists solely in this cluster, due to time constraints on getting my file-server pipeline back up and running over secure tunnels.
That being said, plugging in a 1TB USB drive and copying files over should be no problem, and then we can start the process of exporting and reimporting the disks. Before you do anything, ensure you've got a copy of your GELI keys if you've set up full-disk encryption on these drives. You should have done this when you initially set them up, but for the sake of this article, do so now.
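A quick sketch of what grabbing those keys might look like. I believe FreeNAS keeps pool GELI keys under /data/geli, but treat that path, and the USB mountpoint, as assumptions to verify on your version (the GUI also offers a key download, which is the safer route). Printed as a dry run:

```shell
src="/data/geli"            # assumed GELI key location; verify on your box
dst="/mnt/usb/geli-backup"  # assumed USB mountpoint

# Dry-run plan for copying the keys off-box.
plan="mkdir -p $dst
cp -a $src/. $dst/"
printf '%s\n' "$plan"
```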
Inspect the BIOS for those SATA disks
Now that we know our disks have changed location, let's see why. I could just export and reimport the disks, but there's clearly more to this than some funky drive-moving behavior.
As I restart the NAS, I mash F2 to get into the BIOS and immediately see that two disks on SATA3_A1 and SATA3_A2 are marked as disconnected. OK, what the heck? Let me power-cycle and see if they come back… They do. The plot thickens.
I'm at the point where one of three things is failing, and I need to dig deeper:
- SATA cabling
- BIOS settings reset by the power flicker
- SATA controller problems
As I search the internet for these SATA3_A# ports, I start finding the most likely culprit. ASRock incorporated two different SATA controllers on this board, both running in AHCI mode, with an ASMedia chip controlling the SATA3_A ports. Great. Further searching yields that this ASMedia controller is notoriously finicky, with reports going back as far as 2013. Not a good look for using these ports in the disk configuration.
ASMedia controller appears to be garbage, what's next?
After doing some further reading, talking with others who run their own storage arrays, and cross-referencing FreeNAS forum posts, I've settled on picking up a new SAS controller to plug into one of the PCIe ports and giving it the business end of disk management.
The LSI SAS controller is on the way and should arrive Wednesday evening. I'll be doing a follow-up post on that.
Additionally, I ordered two WD Red drives from Newegg (to help mitigate batch problems). We'll run some exercises with these once I have a confirmed third copy of the data sitting safely offsite.
Here’s the new parts list to update on top of the original post:
| Part | Price (USD) | Qty | Vendor |
|------|-------------|-----|--------|
| SAS9211-8I 8Port Int 6GB SATA+SAS PCIE 2.0 | 98.55 | 1 | Amazon |
| Cable Matters Internal Mini SAS to SATA Cable (SFF-8087 to SATA Forward Breakout) 3.3 Feet | 11.99 | 1 | Amazon |
| WD Red Pro WD4003FFBX 4TB 7200 RPM 256MB Cache SATA 6.0Gb/s 3.5" Internal Hard Drive | 184.99 | 2 | NewEgg |
Hardware Failure: nobody ever expects it, but it happens
We all SAY we expect hardware to fail, but when it happens, it's always a messy process. I for one was only half expecting problems out of this rig within the first couple of years. Turns out the shipping SATA controller configuration was kind of bulldonk. I can't be mad though; this is outlined elsewhere and I clearly missed some reading. Everything else about this board has been stellar. I'm actually pretty hyped to investigate this LSI SAS controller and see if it gives me better results.
As always, I'm over on Mastodon as @[email protected] if you want to discuss any of this with me. Catch you next week!