SRPR/Spinrite 6.1 scanning a 1TB SSD with GPT partition


wgk · Member · Joined Dec 3, 2020 · Calgary, Alberta, Canada
I have an 8-9 year old desktop running Linux with 3 drives: a 500GB SSD, a 1TB SSD, and a 2TB HDD. I copied SRPR.EXE to a ReadSpeed bootable USB and ran it. Both ReadSpeed and SRPR find all the drives. ReadSpeed runs fine on all drives, and an SRPR Level 2 scan works on the 500GB and 2TB drives, but it gets stuck trying to recover data very early on the 1TB drive, which has a GPT partition table. I have left it for hours and it does not make any progress. I checked the drive with fsck and it is clean. I have seen conflicting info on GPT drives and UEFI-bootable systems. It looks like my system has a BIOS but can boot UEFI. The OS on the 1TB drive does use UEFI, but the ReadSpeed/SRPR USB boots and loads fine.

Should I be worried about the 1TB drive, or is it a limitation of SpinRite 6.1?

Some config info:
wgking@HAL:~$ sudo parted -l
Model: ATA Samsung SSD 870 (scsi)
Disk /dev/sda: 1000GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system     Name  Flags
 1      1049kB  500MB   499MB   fat32                 boot, esp
 2      500MB   16.5GB  16.0GB  linux-swap(v1)        swap
 3      16.5GB  1000GB  984GB   ext4


Model: ATA ST2000DX002-2DV1 (scsi)
Disk /dev/sdb: 2000GB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos
Disk Flags:

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  2000GB  2000GB  primary  btrfs


Model: ATA Samsung SSD 840 (scsi)
Disk /dev/sdc: 500GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:

Number  Start   End    Size   Type     File system  Flags
 1      1049kB  500GB  500GB  primary  btrfs


wgking@HAL:~$ sudo file -s /dev/sda
[sudo] password for wgking:
/dev/sda: DOS/MBR boot sector; partition 1 : ID=0xee, start-CHS (0x0,0,2), end-CHS (0x3ff,255,63), startsector 1, 1953525167 sectors, extended partition table (last)
wgking@HAL:~$ sudo file -s /dev/sdb
/dev/sdb: DOS/MBR boot sector; partition 1 : ID=0x83, start-CHS (0x0,32,33), end-CHS (0x3ff,254,63), startsector 2048, 3907026944 sectors
wgking@HAL:~$ sudo file -s /dev/sdc
/dev/sdc: DOS/MBR boot sector


wgking@HAL:~$ efibootmgr
BootCurrent: 000A
Timeout: 1 seconds
BootOrder: 000A,000C,000D,0009,0008
Boot0008* CD/DVD Drive
Boot0009* Hard Drive
Boot000A* ubuntu
Boot000C* UEFI OS
Boot000D* ubuntu
 
but gets stuck trying to recover data very early in the 1TB drive which has a GPT partition table.
The SpinRite 6.1 pre-releases are unaffected by GPT. They work against the entire drive, not partitions.

I have left it for hours and it does not make any progress.
It's probably spending a bunch of time in DynaStat. I recommend uploading the log here, which can be found in the SRLOGS folder on your SpinRite boot drive.

What is the model number of your desktop?
 
Based on the logs it appears SpinRite is spending a lot of time trying to recover some of the sectors, so there is an issue very early on. If you don't want it to try so hard, I think you can decrease the DynaStat recovery attempts via the command line. The first thing, perhaps, is to tell SpinRite to start further into the drive so that it checks the rest of the drive. You should be able to do this via the GUI when you select your drive. Start somewhere after 1%.
 
Hey Greg (@wgk),

As others have noted, everything appears to be working correctly. The front of that SSD appears to be in trouble. That last log you posted shows that the SMART health for "uncorrectable" sectors has been pushed down from its starting value of 99 to 75. That would have shown as RED blocks while running SpinRite.

What you might consider doing is just running SpinRite first at Level 1. That would provide a read-only assessment of the drive, with no data recovery attempted.

If you have any important data on that drive I would remove it as your first priority.

But I do not believe from the evidence so far that the drive is necessarily dead or even dying. It might well be that rewriting the front of the drive would effectively repair it. This is a controversial position that I don't yet have sufficient evidence to defend... but I suspect that's what we're going to be learning over the next several years. I believe that SSDs need to be rewritten occasionally to keep their data from being upset by reading. (Google: “read disturb”) Your drive appears to be an extreme example of this.
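To make the "read disturb" idea concrete, here is a toy simulation of the mechanism described above: every read of a block slightly nudges the threshold voltage of unread neighbouring cells, and an occasional rewrite re-programs cells back to their nominal levels. All constants are invented for illustration; they are not measured from any real NAND.

```python
# Toy model of NAND "read disturb". Each read of a block slightly shifts the
# threshold voltage of *unread* neighbouring cells; once a cell drifts past
# the halfway point toward the next programmed state it reads back wrong.
# A rewrite re-programs the cell to its nominal level. Constants are
# arbitrary (the drift step is a power of two so the arithmetic is exact).

LEVEL_GAP = 1.0           # nominal voltage gap between adjacent states
DRIFT_PER_READ = 2**-13   # drift a neighbour suffers per read (~1.2e-4)

def reads_until_misread(gap=LEVEL_GAP, drift=DRIFT_PER_READ):
    """Number of neighbour reads before a cell crosses the decision point."""
    voltage, reads = 0.0, 0
    while voltage < gap / 2:
        voltage += drift
        reads += 1
    return reads

def simulate(reads, rewrite_every=None):
    """True if the cell still reads correctly after `reads` neighbour reads."""
    voltage = 0.0
    for i in range(1, reads + 1):
        voltage += DRIFT_PER_READ
        if voltage >= LEVEL_GAP / 2:
            return False              # drifted into the next state: misread
        if rewrite_every and i % rewrite_every == 0:
            voltage = 0.0             # a rewrite restores the programmed level
    return True

print(reads_until_misread())                  # 4096 reads before a misread
print(simulate(10_000))                       # False: never rewritten
print(simulate(10_000, rewrite_every=1000))   # True: periodically rewritten
```

The point of the sketch is only the shape of the behaviour: a heavily read, never-rewritten region eventually misreads, while periodic rewriting keeps it healthy, which is exactly the effect a full rewrite pass would exploit.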

If you were to first remove your data from this drive, you could then run SpinRite with the (dangerous) command line option "dynastat 0", which completely disables SpinRite's data recovery. Then run SpinRite at Level 2, which will perform a full rewrite of the drive but will NOT first attempt any data recovery.

The question then will be... does this "fix" your drive so that another regular pass of SpinRite finds NO problems and the drive is repaired?

If you do this, PLEASE continue to share your experiences.

(And everything that others said about BIOS, UEFI, GPT, etc. not having any bearing on this was correct. Once you have SpinRite running, it's running!) :)
 
Page 12: "Retention errors are caused by charge leakage over time after a flash cell is programmed, and are the dominant source of flash memory errors, as demonstrated previously"
If you've been paying attention over in the newsgroups, Joep, you'll know that this is what I've been saying all along. For example, this is why storing unpowered offline SSDs in a hot data center results in higher levels of data loss than storing the memory in a cold environment. Higher temperatures increase thermally induced electron migration, primarily across the insulating dielectric barrier. This results in "charge confusion" over time.
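The temperature dependence of retention loss is usually modeled with the Arrhenius equation. A quick back-of-the-envelope calculation, assuming a typical activation energy of about 1.1 eV (a commonly quoted figure for NAND retention; treat the exact numbers as illustrative only):

```python
import math

# Arrhenius acceleration factor for charge-retention loss:
#   AF = exp( (Ea/k) * (1/T_cold - 1/T_hot) )
# Ea = 1.1 eV is an assumed, commonly quoted activation energy for NAND
# retention; k is Boltzmann's constant in eV/K. Temperatures in kelvin.

K_EV = 8.617e-5   # Boltzmann constant, eV/K
EA = 1.1          # assumed activation energy, eV

def acceleration_factor(t_cold_c, t_hot_c):
    """How many times faster retention degrades at t_hot_c vs. t_cold_c (°C)."""
    t_cold = t_cold_c + 273.15
    t_hot = t_hot_c + 273.15
    return math.exp((EA / K_EV) * (1 / t_cold - 1 / t_hot))

# How much faster does retention degrade in 55 °C storage vs. 25 °C?
print(round(acceleration_factor(25, 55), 1))
```

With these assumed constants the factor comes out around fifty-fold for a 30 °C rise, which is why hot offline storage is so much harsher on unpowered flash than cold storage.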

However, that said, we also have clear evidence from our own work with ReadSpeed that "read disturb" is an issue.

In another paper by the same people you have quoted above, they lead with: "NAND flash memory reliability continues to degrade as the memory is scaled down and more bits are programmed per cell. A key contributor to this reduced reliability is read disturb, where a read to one row of cells impacts the threshold voltages of unread flash cells in different rows of the same block. Such disturbances may shift the threshold voltages of these unread cells to different logical states than originally programmed, leading to read errors that hurt endurance" (The Referenced Paper)
 
While read disturb may pose more of a problem for future NAND generations, retention errors will also become an even bigger problem. We can already see this: as more levels are stored per cell, the margins that decide whether a value is a 0 or a 1 become increasingly small. A small charge drop in a multi-level cell will have even bigger consequences.
Joep: I'm sure you see that the storage of additional "bits" inside individual cells through multi-level storage reduces storage reliability regardless of whether the change in a cell's charge is brought about by static charge decay or dynamic read disturbance. EITHER of those undesirable factors will become a larger problem when the error margins of a cell's charge determination are reduced.
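The shrinking margins are easy to quantify: with b bits per cell, the same threshold-voltage window must be divided among 2^b states, so the spacing between adjacent states (and hence the tolerable charge drift) falls roughly as 1/(2^b − 1). A quick sketch, normalizing the usable window to an arbitrary 1 volt:

```python
# Spacing between adjacent threshold-voltage states as bits-per-cell grows.
# The total usable voltage window is normalized to 1.0 V; real windows and
# state distributions vary by process, so treat the numbers as relative only.

def state_spacing(bits_per_cell, window=1.0):
    levels = 2 ** bits_per_cell
    return window / (levels - 1)   # gap between adjacent states

for bits, name in [(1, "SLC"), (2, "MLC"), (3, "TLC"), (4, "QLC")]:
    print(f"{name}: {state_spacing(bits):.3f} V between states")
```

Going from SLC to QLC shrinks the gap between states by a factor of fifteen, so the same absolute charge loss, whether from decay or read disturb, is fifteen times more likely to flip the read-back value.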

The original point I was making was that rewriting the SSD's storage would likely repair it in this instance. And that's true regardless of the cause of the memory's current inability to be read. (y)
 
Thanks for all the feedback. On the off chance it might be hardware related, I swapped the SATA cable (no change), used 2 different SATA ports (no change), and started the scan at 5% (no change), so I have a new SSD ordered. I will play around with this one afterwards to see if I can rejuvenate it. This problem seems to be common to all SATA SSD brands. Do M.2 format SSDs fare any better? Might be time to upgrade the desktop...
 
Do M.2 format SSDs fare any better?
If you're referring to NVMe, then in my experience, no. They tend to run hot, even in desktops, unless you carefully cool them. My Samsung 970 EVO always ran hot, and eventually started racking up Media and Data Integrity Errors.

 
Finished a Level 1 scan. The drive looks to be in pretty poor shape. I am a bit surprised the system seems to be running fine. Hopefully it will continue until I pick up the replacement drive tomorrow.
 

Attachments

  • SR9-LOG.txt
    7.7 KB · Views: 130
Finished a Level 1 scan. The drive looks to be in pretty poor shape. I am a bit surprised the system seems to be running fine. Hopefully it will continue until I pick up the replacement drive tomorrow.
If it only needs to read data from certain areas in order to function then it's likely to continue operating, but you may find that you try to run some executable or read a file or image and it just won't work.
 
I replaced the failing drive with a new Samsung 870 EVO, restored the new drive, and waited a bit to experiment with the old drive in case my backups missed something. I think I got everything, so yesterday I connected the failing Samsung SSD to my system and did some experiments. It still looks fine to Linux; all partitions run fsck clean. I ran badblocks against each partition and all but the initial small boot partition had lots of errors. I then used badblocks to completely overwrite those partitions with 0xFF. Now badblocks and SpinRite Level 2 scans are clean. I also had the drive run its internal long and short self-tests, which are also clean.

wgking@HAL:~$ sudo smartctl -l selftest /dev/sdd
[sudo] password for wgking:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-89-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description   Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline   Completed without error  00%        6538             -
# 2  Extended offline   Aborted by host          90%        6536             -
# 3  Short offline      Completed without error  00%        6535             -
# 4  Short offline      Completed without error  00%        20               -


This drive is not that old (purchased Aug 2022) and still covered by Samsung warranty. Should I return it for a replacement, or is this typical of all SSDs? If SSDs reliably store data for only ~1 year, maybe it is time to switch to spinning drives with a big cache.
 

Attachments

  • srpr-14-LOG.txt
    6.9 KB · Views: 92
Greg (@wgk): If you're curious you could run GRC's ReadSpeed utility on that drive. It would be interesting to see what that drive reports for its read performance at five different locations. Then run a Level 3 pass over the drive and re-run ReadSpeed. Also, after that Level 3 pass do another Level 2 and compare it with the results you first reported. My guess is that (a) after the Level 3, ReadSpeed's results will be much improved and (b) the final Level 2 will also be 100% clean.

You could still return that drive, but it might well be that it's only suffering from a prolonged lack of writing which a Level 3 pass will resolve. (y)
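ReadSpeed's approach, timing short bursts of reads at a handful of spots across the drive, is easy to sketch. The version below runs against a scratch file so it is safely runnable anywhere; pointing it at a block device such as /dev/sdd (which needs root) would test real hardware, though unlike ReadSpeed it goes through the OS page cache, so absolute numbers would be optimistic:

```python
import os
import time

def spot_check(path, points=(0, 25, 50, 75, 99), chunk=1 << 20, burst=16):
    """Time short read bursts at several offsets; return [(percent, MiB/s)]."""
    size = os.path.getsize(path)
    results = []
    with open(path, "rb") as f:
        for pct in points:
            f.seek(size * pct // 100)
            start = time.perf_counter()
            read = 0
            for _ in range(burst):
                read += len(f.read(chunk))
            elapsed = time.perf_counter() - start
            results.append((pct, read / (1 << 20) / max(elapsed, 1e-9)))
    return results

# A 64 MiB scratch file stands in for the drive under test.
with open("scratch.img", "wb") as f:
    f.truncate(64 << 20)

for pct, speed in spot_check("scratch.img"):
    print(f"{pct:3d}%: {speed:9.1f} MiB/s")

os.remove("scratch.img")
```

On a drive suffering from slow, marginal reads at the front, the 0% sample would stand out sharply from the other four, which is exactly the signature ReadSpeed was built to expose.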
 
Should I return it for a replacement
Yes. 870 EVOs had a known firmware flaw that causes the issues you're seeing. As I noted in a previous comment, your drive has an affected firmware version.

You could try updating the firmware and running a full Level 3 scan, but I personally wouldn't trust that drive at this point.

Yep, and their drive has the oldest firmware, SVT01B6Q.
 
Although it may still be under warranty, if the diagnostics are now coming up clean, they will probably not replace it.
 
I did run ReadSpeed after the full disk rewrite; it now looks the same as my new drive (attached). The badblocks rewrite (Linux command: badblocks -v -w -s -b 4096 -t 0xff /dev/sdd3) should have the same effect as a SpinRite Level 3 scan, without all the bothersome data preservation, since it rewrites the whole partition with a fixed bit pattern. I'll see if I can upgrade the firmware on it before reusing it. If I can't then I may try for a warranty replacement. That might be tricky since it never actually "failed" in operation, and if something had overwritten the bad sectors before I tried to read them I might never have known about the problem (but thanks to a SpinRite scan I did find out).
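The write-mode pass that badblocks performs here is conceptually simple: fill every block with a fixed pattern, read each block back, and flag any that fail to verify. A minimal Python sketch of that cycle, run against a scratch file rather than a real partition (pointing anything like this at a device, as the badblocks command above does, destroys its contents):

```python
import os

# Sketch of badblocks' write-mode test (-w -t 0xff): write a fixed pattern
# across the whole "device", then read every block back and report any that
# fail to verify. A scratch file stands in for the partition here.

BLOCK = 4096
PATTERN = b"\xff" * BLOCK

def pattern_test(path):
    """Return the list of block numbers that failed verification."""
    nblocks = os.path.getsize(path) // BLOCK
    bad = []
    with open(path, "r+b") as f:
        for n in range(nblocks):          # write pass
            f.seek(n * BLOCK)
            f.write(PATTERN)
        f.flush()
        os.fsync(f.fileno())              # force the writes to stable storage
        for n in range(nblocks):          # verify pass
            f.seek(n * BLOCK)
            if f.read(BLOCK) != PATTERN:
                bad.append(n)
    return bad

with open("scratch.bin", "wb") as f:
    f.truncate(BLOCK * 256)               # 1 MiB stand-in "partition"

print("bad blocks:", pattern_test("scratch.bin"))
os.remove("scratch.bin")
```

The real badblocks additionally cycles through several patterns (0xaa, 0x55, 0xff, 0x00 by default) unless a single -t pattern is given, but the write-then-verify structure is the same.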
 

Attachments

  • RS026.TXT
    1 KB · Views: 76