New RAM and errors caught by Memtest86+


Dan Linder

Member
Jan 4, 2024
9
3
Much like I use SpinRite on all my new HDDs, I like to run Memtest86+ on all my new RAM sticks. I'm building a small home server and I am maxing it out with two Crucial 32GB SODIMM RAM sticks (CT2K32G4SFD832A).

The first set I ran Memtest86+ on for four+ days, and it noted two errors so I returned those SODIMMs for two more (identical). Those have been running for 100+ hours and they too are throwing errors.

IMHO, new RAM from a reputable vendor shouldn't have errors this early on. Am I just "lucky" and return these for a third time, or is this to be expected with the density of RAM today?

Sadly, no ECC options on this small system. Attached are screen shots of Memtest86+ running on the two sets of RAM. The errors are in different places in both, so I don't *think* it is the motherboard.

What do you and the rest of the SR community think? Is there a different manufacturer I should try? Or is this just the state of the "electronic art" today and we really need to get ECC in our RAM like we do in our HDDs?
 

Attachments

  • image-2.jpeg (45.4 KB)
  • image-3.jpeg (38.4 KB)
  • image-1.jpeg (63 KB)
There should never, ever, be RAM errors. A computer cannot run reliably with unreliable memory, and no one wants an unreliable computer. You should check that your system BIOS/UEFI is not overclocking the memory by default, as many do.
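One way to check for a default overclock from a running Linux system is to compare each module's rated speed with its configured speed, as reported by `dmidecode --type memory` (run as root). The sketch below parses text in that style. The `SAMPLE` text is made up for illustration and is not from this thread, `find_overclocked` is a hypothetical helper name, and the exact field names can differ between dmidecode versions, so treat this as a starting point rather than a finished tool.

```python
import re

# Hypothetical sample in the style of `dmidecode --type memory` output.
# The values are invented for illustration, not taken from this thread.
SAMPLE = """\
Memory Device
	Locator: DIMM 0
	Speed: 3200 MT/s
	Configured Memory Speed: 3600 MT/s
Memory Device
	Locator: DIMM 1
	Speed: 3200 MT/s
	Configured Memory Speed: 3200 MT/s
"""

def find_overclocked(text):
    """Return (locator, rated, configured) for modules configured above their rated speed."""
    results = []
    # Each module is described in its own "Memory Device" block.
    for block in text.split("Memory Device")[1:]:
        locator = re.search(r"^\s*Locator:\s*(.+)$", block, re.M)
        rated = re.search(r"^\s*Speed:\s*(\d+)", block, re.M)
        configured = re.search(r"^\s*Configured Memory Speed:\s*(\d+)", block, re.M)
        if not (locator and rated and configured):
            continue  # empty slot or unknown speed
        rated_speed = int(rated.group(1))
        configured_speed = int(configured.group(1))
        if configured_speed > rated_speed:
            results.append((locator.group(1).strip(), rated_speed, configured_speed))
    return results

if __name__ == "__main__":
    for locator, rated_speed, configured_speed in find_overclocked(SAMPLE):
        print(f"{locator}: rated {rated_speed}, configured {configured_speed}")
```

If nothing is flagged, the memory is at least not running above its rated speed, though timings set in the BIOS/UEFI would still need checking by hand.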
 
New RAM with errors could be the motherboard, the CPU, or the RAM itself. Throughout my 50+ year career I've seen a lot of weird things, including a faulty SPARC CPU causing FTP errors in Solaris: only FTP and no other apps. I would suggest grabbing another CPU to see if the errors are still present. Or try the RAM on another MB, both with your CPU and with a different CPU.

Remember, when you run your system hard there is a greater likelihood of discovering RAM errors. For example, my RAM was not an issue on my FreeBSD systems using the UFS filesystem, but ZFS, whose ARC exercises RAM quite hard, can expose RAM and CPU errors because RAM, CPU, and northbridge are exercised more intensely due to its heavy use of RAM for cache. I've always played with RAM clock rate and interleave when drilling down to discover load-related problems on consumer hardware.

First try your RAM on a friend's computer. If the errors persist it's likely the RAM. If the RAM is ok on your friend's computer, try it on your motherboard with a different CPU. If that resolves the problem you likely have a marginal CPU. Failing that, it may be the MB.

The other thing to try is reducing the memory clock rate or interleave, though most MBs detect these automatically. Depending on the MB, you should be able to reduce either.

Hope this gives you a diagnostic roadmap.
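
The swap tests above can be sketched as a small decision function. This is only an illustration of the order of the steps. The function name `diagnose` and its parameters are hypothetical, and a real diagnosis would also need the clock-rate and interleave checks.

```python
def diagnose(errors_in_other_pc, errors_with_other_cpu=None):
    """Walk the swap-test roadmap and name the most likely culprit.

    errors_in_other_pc: did the RAM still show errors in a different computer?
    errors_with_other_cpu: did errors persist on the original motherboard
        with a different CPU? Use None if that test has not been run yet.
    """
    if errors_in_other_pc:
        # Errors follow the RAM to another machine.
        return "RAM"
    if errors_with_other_cpu is None:
        return "undetermined: test the original motherboard with a different CPU"
    if not errors_with_other_cpu:
        # Swapping the CPU made the errors go away.
        return "CPU"
    # RAM is fine elsewhere and a new CPU did not help.
    return "motherboard"

if __name__ == "__main__":
    print(diagnose(errors_in_other_pc=False, errors_with_other_cpu=True))
```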
 
Yes, more likely a motherboard issue. As suggested, run the RAM slower, with a much longer access time, and see if that helps. This looks a lot like Heartbleed, in that the cells in one region are affected by another being read or written, so a slower access or more time between writes and reads will likely help. As well, if you have a pair of smaller sticks around, like two 8GB ones, run them for 100 hours at the fastest rate they can handle. That will likely show whether motherboard issues are the cause: slight timing differences between data lines causing the wrong data to be latched in or out with the right mix of data and clock timings.
 
In technical terms this is called a rowhammer attack. I doubt this is the issue.

WRT rowhammer attacks: Colin Percival wrote this article in 2005, long before anyone else was discussing it. His solution at the time was to disable hyperthreading. A link to the article PDF is at the bottom of the page.
 
The screenshot shows that the RAM is quite hot, although probably within limits. Have you checked the slots carefully and made sure the RAM is seated correctly?
 
I'm away from home and on mobile so can't write a lot, but yesterday I removed one stick to start comparing different combinations of slots, two vs one stick, and even a second chassis.

These are small Beelink (like Intel NUC) so cooling options aren't great but I'll see what I can do for that test, too.
 
Sorry for the late update. I put one DIMM in each of my Beelinks (both identical models with AMD Ryzen 7 5800H) and they both ran 100% error free for over 25 hours. My next test is to put both into the second Beelink system and see what the status is after 24 hours.
 
After running for 24 hours in that configuration (single DIMM in each system in slot one), I moved the DIMMs to slot two and have had them running for another 24 hours. This confirms that the error isn't in one slot or the other, and that both systems are capable of running a single 32GB DIMM just fine. The temps on both have been fine, but the "A" system is running in the 73/83C range, and the "B" system is in the 65/68C range. System-A has an internal 3.5" SSD which is putting off some heat.

Next test is to move both DIMMs to System-B (no SSD) to see how it performs. The original test screen shots were from System-A with the SSD, so heat may easily have been an issue, as @AlanD suggested.
 
Now three hours with the 2x32GB SODIMMs in System-B. Memtest86+ has gone 3+ hours (two passes) without any errors, and the temperature is showing 65/78C.

I'll let this run overnight, but I suspect temperature and ventilation might be the big factor.
 
It's been 13 hours (six passes) and still zero errors. The temp is still showing 72/79C. Going to add in the SSD tonight and see how that changes things.
 
So I added the SSD last night, but when I was putting it in I noticed the ventilation fan (about 1.5" diameter) was directly below the SSD and with it installed the air-flow was mostly sealed off. See the attached photo - I placed a small USB dongle to give a rough idea of how little space there is. The bottom cover plate is removed, but when installed it is tight against the surface of the SSD.

To ensure some spacing for air flow, I placed a small stick-on rubber pad (about 1/2" diameter, roughly 2-3mm thick) on the left side in the photo, over the "HDD" label (spacer not pictured). I was then able to get the bottom cover on (barely), but I didn't screw it on yet.

After 24 hours of running the Memtest86+ utility, it is still sitting at zero RAM errors, even with the temperature showing 72/79C.

I'm going to let it run this way a bit longer, and screw the bottom on and see if it continues to hold temperatures well.
 

Attachments

  • SSD spacing.jpeg (65.6 KB)
It's now been a solid 45 hours of Memtest86+ running. The last 20 hours have been with the unit closed (i.e. covers on, screwed down), and the RAM is still showing zero errors. Temps are currently at 72/79C.

I think @AlanD has the winning answer - keeping the heat down was the key.

Closing this as a "case solved". (Spoiler, still issues...) Thanks everyone.
 

Attachments

  • SSD 45 hours.jpeg (73.5 KB)
Well, that didn't take long. :(

With the first "Beelink SER5 MAX Mini PC, AMD Ryzen 7 5800H" system fixed (remember, I bought two), I put the second one together and started running Memtest86+ on it with the RAM it came with (2 x 16GB SODIMM, Crucial). We'll call this one "System A" ("System B" is the one that is working well). It was "System A" that originally had the RAM issues; by the time I finally got the successful runs, I had moved the RAM and SSD to "B".

I started setting up "System A" with the original RAM and the 1TB M.2 drive (

The first 24 hours went well - no errors and multiple passes. (Sadly, I didn't take a picture of that 'good' result.) This morning I got up and found that at hour 32 I had a (single) RAM failure on pass #18 through the RAM. (See the attached screen shot, taken on June 23 at 0630, when it was on pass #20.)

The temp on this system is at 82/85C - as @AlanD mentioned a few days ago - but this system is in my basement office, which is a constant 20.0-22.2C, and "System B" is still running along at the 72/79C reported by Memtest86+.

I now suspect that the heating problem was more due to "System A" than the cooling fan being blocked, so I'm going to have to go back to Amazon and Beelink to get a return on the "System A" unit. I'm writing this here more so I have a documentation trail to show Amazon and Beelink what I've been through as the return window for these closed on June 13 (they were both ordered May 12).
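
To make the comparison between the two units easier to see, here is a small sketch that tabulates the temperature readings and error counts posted in this thread and summarizes them per system. The `Reading` type and the `summarize` helper are hypothetical names. The meaning of the two temperature figures is not stated in the posts, so they are stored as-is, and the original failing 2x32GB runs are left out because their temperatures only appear in the screenshots.

```python
from collections import namedtuple

# One snapshot as posted: system, RAM/drive config, hours elapsed,
# the two temperature figures exactly as reported (in C), and error count.
Reading = namedtuple("Reading", "system config hours temp_first temp_second errors")

READINGS = [
    Reading("A", "1x32GB", 24, 73, 83, 0),
    Reading("B", "1x32GB", 24, 65, 68, 0),
    Reading("B", "2x32GB", 3, 65, 78, 0),
    Reading("B", "2x32GB", 13, 72, 79, 0),
    Reading("B", "2x32GB + SSD", 24, 72, 79, 0),
    Reading("B", "2x32GB + SSD", 45, 72, 79, 0),
    Reading("A", "2x16GB + M.2", 32, 82, 85, 1),
]

def summarize(readings):
    """Return per-system highest second temperature figure and worst error count seen."""
    summary = {}
    for reading in readings:
        entry = summary.setdefault(reading.system, {"max_temp": 0, "errors": 0})
        entry["max_temp"] = max(entry["max_temp"], reading.temp_second)
        # Error counts in snapshots are cumulative, so keep the largest one.
        entry["errors"] = max(entry["errors"], reading.errors)
    return summary

if __name__ == "__main__":
    for system, entry in sorted(summarize(READINGS).items()):
        print(f"System {system}: max temp {entry['max_temp']}C, errors {entry['errors']}")
```

With these numbers, System A reads hotter than System B in every configuration posted, which fits the suspicion that the unit itself is the problem rather than the blocked fan.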

Thanks for all the help - I'll document any resolution and confirm what the culprit was for this system, too.
 

Attachments

  • SystemA-2x16-1TM.2-NoSSD-240623-0635.jpg (70.2 KB)