SpinRite RC5 / ZimaBoard / NVMe

slim724

Member
Dec 8, 2023
7
0
Minneapolis
Team,

I bought a PCIe card for my 1TB NVMe drive to run on a ZimaBoard, running SpinRite RC5 under FreeDOS. When it gets to 23.1037%, it hangs with "After an error occurred, this drive was reset....". I've tried levels 1, 2 and 4, all with the same result.

I know there are issues with this drive; that's why it's been replaced. I hoped RC5 would be able to handle it. Is this to be expected? I tried searching these forums for a similar match.

Advice?

J
 
@Steve is working on a change for RC6 to potentially address a specific issue in this area. As I mentioned in another recent post, however, the drive firmware is in control here. If the drive firmware becomes unhappy, it can choose from a number of reactions, including simply locking up, timing out, or asserting a "fatal error" condition wherein there is nothing the software (OS/SpinRite) can do to coax it back online (a power cycle will be required). You can check for manufacturer utilities, or perhaps a S.M.A.R.T. utility, to hopefully learn more... but that's not guaranteed to be very helpful... Drive manufacturers don't seem to try to be heroic when it comes to recovering from "fatal" drive issues.
 
Thanks for the super prompt reply.

I'll wait for RC6. I'm not interested in manufacturer utilities; I want SpinRite to succeed in identifying and marking the bad areas. The drive isn't toast: last time it was in use, I was able to boot up and use it. I didn't even know there were issues with it until I tried to clone it. CHKDSK wasn't much help except to validate Macrium. Hoping SpinRite is the solution - I want to keep that in my toolbox.

J
 
As I mentioned in another recent post, however, the drive firmware is in control here. If the drive firmware becomes unhappy, it can choose from a number of reactions, including simply locking up, timing out, or asserting a "fatal error" condition wherein there is nothing the software (OS/SpinRite) can do to coax it back online (a power cycle will be required).

I agree; it's what I see all the time when working with NAND-flash-based drives. One of the first things I'll always try is to see whether I can create a sector-by-sector disk image. A common failure mode during the imaging process is the drive becoming totally unresponsive. With a spinning drive you'd try various reset commands, but almost without exception these will not work with NAND-flash-based drives; a power-cycle is the only way to get them out of this state.

Another interesting observation is that although it may seem certain sectors trigger the behavior, it does not mean these are bad sectors that can then be flagged as such. If I power-cycle and then skip the sector, setting it aside for a next pass, I will run into the same issue again. If, however, I power-cycle and immediately re-read the problematic sector, it often returns valid data without any delay. If a sector were truly bad, the latter shouldn't happen. Now, I'd argue there are issues with the sector (or with some sectors in a block, since I hardly ever literally read sector by sector), but that's not an issue that can't be overcome; instead, the firmware somehow gets caught up in some recovery procedure.
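The imaging loop described above could be sketched roughly like this. Everything here is hypothetical: `read_block` and `power_cycle` stand in for real hardware control, which an actual recovery rig would do very differently.

```python
# Sketch of the "power-cycle + immediate re-read" imaging strategy.
# The device object is a stand-in for real hardware; read_block and
# power_cycle are hypothetical methods, not a real imaging API.

def image_with_powercycle_retry(dev, total_blocks, block_size=256):
    """Read every block; on a failed/hung read, power-cycle the drive
    and immediately retry the SAME block before moving on."""
    recovered = []   # blocks that succeeded only after a power-cycle
    failed = []      # blocks that failed even after the retry
    data = {}
    for lba in range(0, total_blocks, block_size):
        try:
            data[lba] = dev.read_block(lba, block_size)
        except IOError:
            dev.power_cycle()            # the only thing that revives most NAND drives
            try:
                # Immediate re-read, before the firmware gets busy again
                data[lba] = dev.read_block(lba, block_size)
                recovered.append(lba)
            except IOError:
                failed.append(lba)       # genuinely unreadable
    return data, recovered, failed
```

The key design point, per the observation above, is the *immediate* retry: deferring the sore block to a second pass tends to just reproduce the hang.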

[Attached screenshot: 1702905389818.png]


In the image we see "errors in imager = 1", but this is after we ran into 744 failed reads, each followed by a power-cycle plus an immediate re-read of that same block of sectors. The yellow blocks are all blocks of sectors that initially failed and were then successfully read after power-cycling the device.

And if we consider that, then trying to get the sector reallocated may not be the best approach: we'd read the sector, it would fail and hang the firmware. We could power-cycle and get a good read, so in other words there's no reason for the firmware to initiate reallocation. Plus there's the risk that the drive will stop responding at all at some point; I have had this happen after, for example, having imaged 80% of the drive using the above method.
 
But, especially with NVMe drives, if the first read has failed, do we KNOW that the second read was actually trying the same physical page, or could the drive have silently worked out what was there and re-allocated the data to a different physical sector but using the same logical address?
 
I bought a PCIe card for my 1TB NVMe drive to run on a ZimaBoard, running SpinRite RC5 under FreeDOS. When it gets to 23.1037%, it hangs with "After an error occurred, this drive was reset....". I've tried levels 1, 2 and 4, all with the same result.

I know there are issues with this drive; that's why it's been replaced. I hoped RC5 would be able to handle it. Is this to be expected? I tried searching these forums for a similar match.
Expanding a bit on what Paul wrote earlier...

After fully resetting a drive following an error, SpinRite has been waiting for up to 10 seconds for the drive's status to report that it is again ready to continue. It turns out that for some drives, 10 seconds is not sufficient. In experiments over this past weekend I've verified that if drives are given more time they often will come back online. So, starting with RC6 (release candidate #6), SpinRite will give drives up to a full 60 seconds to "get their act together" and come back online. Since that is a LONG TIME for someone to wait for SpinRite while nothing appears to be going on, SpinRite RC6 and later will post an on-screen "Waiting for drive: xx" countdown timer while waiting, so that the user knows what's happening and that the system hasn't died.
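A rough sketch of the new wait behavior (this is not SpinRite's actual code, which runs at the hardware level; `drive_ready` here is a made-up status-check callback):

```python
import time

def wait_for_drive(drive_ready, timeout=60, poll_interval=1.0, show=print):
    """Poll a drive's ready status for up to `timeout` seconds,
    displaying a countdown so the user knows the system isn't hung.
    `drive_ready` is a hypothetical status-check callable returning True
    once the drive reports ready again after a reset."""
    deadline = time.monotonic() + timeout
    while True:
        if drive_ready():
            return True                      # drive came back online
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False                     # give up after the full timeout
        show(f"Waiting for drive: {int(remaining):2d}")
        time.sleep(min(poll_interval, remaining))
```

Using a monotonic clock for the deadline avoids surprises if the wall clock changes mid-wait.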

I'll be VERY INTERESTED to see whether this actually does help with your NVMe drive. I can see a mechanical drive needing some time to do whatever it might need to do following a full reset, but I'd expect a solid-state drive to get back online sooner. (And it'll also be interesting to know, as that countdown proceeds, at which point the counter disappears and work resumes.)

I'll wait for RC6. Not interested in manufacturer utilities. I want Spinrite to succeed in identifying and marking the bad areas. The drive isn't toast. Last time it was in use, I was able to boot up and use it. I didn't even know there were issues with it until I tried to clone it. CHKDSK wasn't much help except to validate Macrium. Hoping Spinrite is the solution - I want to keep that in my toolbox.
I'm gratified to see that you feel this way (since I do too!). If RC6 does not resolve this problem, I'll want to work with you to figure out whether there's anything SpinRite can do to resolve it!

I have this written and working now, but not as thoroughly tested as I'd like. So once this week's podcast is behind me I'll verify that it's all doing what I expect and we'll all move to RC6. (y)
 
But, especially with NVMe drives, if the first read has failed, do we KNOW that the second read was actually trying the same physical page, or could the drive have silently worked out what was there and re-allocated the data to a different physical sector but using the same logical address?
Yes. Of course I can't know for certain, but it's sort of derived. So I hit a bad spot, a block of sectors I cannot read, do the power-cycle, retry the read, and then it works. This is what I observe a lot. But this I had to discover. The usual method used by data recovery techs is: failed read > power-cycle > then start reading again after some preset skip. So rather than an immediate re-read after the power-cycle, the skipped area is kept to be tried on a next pass. And, like I said, that's when I ran into the same issue at the same spot.

And so, to test my theories, I tried this with drives after I had recovered the data. And I get a somewhat consistent picture there:

Method 1: failed read @ x > power-cycle > immediate retry @ x > success! Due to a remap or something else?
Method 2: failed read @ x > power-cycle > skip > etc., finish pass 1; on pass 2, return to the skipped area > failed read @ x again!! This would probably not happen if we assume @ x was reallocated!! If it were, we'd be able to read it just like with method 1. Also, using this method, @ x will fail just like it did before, even though I had already successfully read it using method 1.

This is when I retest and retest on the same drive. So success in reading @ x depends on an immediate retry after the power-cycle. So my working theory is that if I retry right after a power-cycle, the controller firmware hasn't yet had a chance to get preoccupied with whatever it gets preoccupied with.

So... if @ x had been reallocated after the Method 1 success (failed read @ x > power-cycle > immediate retry @ x > success!), we should be able to go back to the sector and read it again at any time. This is how I "know" it's not reallocated: that same sector will be a problem again.
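The working theory above can be modeled as a toy: a "sore" LBA that reads fine only as the first command issued after a power-cycle, before the firmware gets preoccupied. This is purely an illustration of the hypothesis; `SoreDrive` is invented, not real firmware behavior.

```python
# Toy model: a sore LBA succeeds only as the FIRST command after a
# power-cycle. Shows why Method 1 (immediate retry) succeeds where
# Method 2 (skip, retry later) keeps failing at the same spot.

class SoreDrive:
    def __init__(self, sore_lba):
        self.sore = sore_lba
        self.cmds_since_cycle = 0
    def power_cycle(self):
        self.cmds_since_cycle = 0
    def read(self, lba):
        first = self.cmds_since_cycle == 0
        self.cmds_since_cycle += 1
        if lba == self.sore and not first:
            raise IOError("firmware stuck in recovery")
        return b"data"

def method1(drive, lba):
    """Failed read -> power-cycle -> IMMEDIATE retry at the same LBA."""
    try:
        return drive.read(lba)
    except IOError:
        drive.power_cycle()
        return drive.read(lba)      # first command after cycle: succeeds

def method2(drive, lba, skip_to):
    """Failed read -> power-cycle -> skip ahead; retry the sore LBA later."""
    try:
        return drive.read(lba)
    except IOError:
        drive.power_cycle()
        drive.read(skip_to)         # resume imaging elsewhere first...
        return drive.read(lba)      # ...so the deferred retry fails again
```

In this model, reallocation never happens, yet Method 1 always "works", matching the observed behavior without assuming a remap.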

I agree, weird shit. I hope to discover more: I've noticed I can predict what the controller is doing by observing power consumption. The differences are often subtle yet measurable. For example, I can measure whether we're reading actual zeros from a zeroed drive, or whether zeros are simply being returned because I am reading unmapped addresses (after a TRIM, for example). I hope, and am trying, to get better at measuring and determining what a drive is actually doing. A simple, yet measurable and repeatable, example:

[Attached screenshot: 1702949052695.png]

So, for example, if I get one of these problem drives again, I can measure power consumption when the drive is idle, and then before and after a power-cycle and immediate re-read. I got the idea after reading and watching some material on power-fault injection and glitching. These "hackers" use power-consumption patterns to determine what a controller is doing: they take a sample of the device they intend to attack and essentially build a map of power-consumption patterns linked to specific controller tasks. It's fascinating stuff.

Anyway, I think it should be measurable whether the controller gets preoccupied, and it may even be possible to guess what it's doing.
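A minimal sketch of that classification idea: record baseline traces for known controller states, then match a new trace to the baseline with the nearest mean draw. The milliwatt numbers below are invented placeholders; real work needs an actual measurement rig (shunt resistor, scope, etc.).

```python
# Nearest-mean classification of a power trace against known baselines.
# All sample values are made up for illustration.

from statistics import mean

def classify_trace(trace_mw, baselines):
    """Return the name of the baseline whose mean power draw (mW)
    is closest to the mean of the given trace."""
    m = mean(trace_mw)
    return min(baselines, key=lambda name: abs(mean(baselines[name]) - m))

baselines = {
    "idle":           [120, 118, 122, 119],   # invented numbers
    "real_read":      [310, 305, 315, 308],   # NAND actually being accessed
    "unmapped_zeros": [150, 148, 152, 151],   # TRIMmed region: no NAND access
}
```

A real classifier would want more than the mean (timing structure matters, as the glitching folks show), but even this crude version separates "real read" from "synthesized zeros" if the draws differ enough.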
 
After fully resetting a drive following an error, SpinRite has been waiting for up to 10 seconds for the drive's status to report that it is again ready to continue. It turns out that for some drives, 10 seconds is not sufficient. In experiments over this past weekend I've verified that if drives are given more time they often will come back online. So, starting with RC6 (release candidate #6), SpinRite will give drives up to a full 60 seconds to "get their act together" and come back online. Since that is a LONG TIME for someone to wait for SpinRite while nothing appears to be going on, SpinRite RC6 and later will post an on-screen "Waiting for drive: xx" countdown timer while waiting, so that the user knows what's happening and that the system hasn't died.
Yes, that's a cool idea.
 
@slim724:

The latest pre-release of SpinRite (5.02) incorporates this new 60-second wait with a countdown, and it fixes the mistake I made in doing that for pre-release 5.01. So, when you can, let's see how release 5.02 functions with that ZimaBoard-mounted NVMe drive. You can find the details for grabbing the latest here: https://forums.grc.com/threads/pre-release-5-02.1417/

Thanks!!
Thanks. I'll get started.

J
 
Terrific! I'm pretty sure you could start at just before the trouble you've seen since the trouble is almost certainly about a specific "sore spot" of the media. (y)
 
No luck. See screen shot.
Okay. So the question is... Before that happened, did a ~60-second countdown timer appear in the upper left of the screen? And did it count down to zero before that message appeared?

It IS still entirely possible that the drive really has just gone offline. The difference between the original RC5 and the later incremental 5.02 is that it will give the drive a full 60 seconds to get back online rather than just 10 seconds.

And, as I mentioned previously, I was always skeptical that a solid-state drive might require more than 10 seconds. The instances where this behavior was observed were on mechanical ("spinner") drives.

Here's one question: If you hit this screen, then you exit SpinRite WITHOUT powering down, and then restart SpinRite, is the drive again ready to go? Or does it remain "offline" until the power is cycled?

Thanks!!!
 
WHOA!! I just noticed that this was a BIOS connected drive. When you said that you had obtained a PCIe card for an NVMe drive, I (wrongly) assumed that it was somehow emulating a SATA drive. But apparently that PCIe card has brought along its own BIOS.

THAT means that I have a bit more work to do. I've been working to get a 5.03 pre-release out. I'll get that posted and let you know.

Thanks! (There's still reason to believe I can fix this! (y) )
 
Hi @Steve,

OMG, you're right. Since I didn't get the pesky 137GB message, I thought I was using SATA.

No, no delay before the red screen appears.

Here is what I see when the drive scan completes. It just confirms what you're thinking.

John
[Attached screenshot: 2023-12-27_18-23-57.jpg]
 
John (@slim724):

I just checked the SRPR-503 release source code. It does have the updated BIOS reset recovery code. So if anything's going to be able to work on that drive, the currently published release should. There's a limit to what can be done through the BIOS... but SpinRite will do what it can. And SpinRite 7 will be able to access that NVMe drive directly at the hardware level. (y)
 
@Steve,

Testing against 5.03 was different but produced the same end result. Different in that I now see the countdown timer in the upper left corner. I tried both level 1 and 3. Same.

Perhaps I should purchase a proper SATA -> NVMe enclosure? It would be interesting to see if it succeeds when BIOS is no longer a factor.

Thoughts?

John
 
It's true that if you're able to put the NVMe drive into a SATA enclosure, so that you can then plug it into a SATA port, then SpinRite will almost certainly have a much more "intimate" interaction with the drive. For example, you should then see the drive's make/model and serial number (none of which are available through the "insulation" created by the BIOS). Also, SpinRite will then be able to run at its maximum speed on that drive, using 1024-sector transfers for levels 1 & 2 and 32768-sector transfers for levels 3-5. The BIOS imposes a strict limit of 127 sectors for everything.
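Assuming standard 512-byte sectors, the per-transfer sizes quoted above work out as follows:

```python
# Per-transfer sizes for the limits mentioned above,
# assuming standard 512-byte sectors.
SECTOR = 512  # bytes

bios_limit = 127   * SECTOR   # what the BIOS allows per transfer
native_l12 = 1024  * SECTOR   # native access, levels 1 & 2
native_l35 = 32768 * SECTOR   # native access, levels 3-5

print(bios_limit)   # 65024 bytes (~63.5 KiB)
print(native_l12)   # 524288 bytes (512 KiB), 8x the BIOS limit
print(native_l35)   # 16777216 bytes (16 MiB), ~258x the BIOS limit
```

So beyond the loss of identify data, the 127-sector BIOS cap alone explains a large share of the speed difference.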

I'm glad to know that BIOS access is now properly being very patient and giving the drive ample time to come back online. And this issue — of SpinRite giving up on highly troubled drives when there may be some way for it to get them back online — is the issue I'm currently working to resolve. So please stick around, either way! (y)