Data Center Downtime: DRAM Row Hammer Failures in the Field

Written by FuturePlus Systems | Apr 28, 2017 5:09:51 PM

 

DDR3 memory is at the heart of almost all cloud computing servers today. A recently publicized failure mechanism in DDR3 memory, dubbed Row Hammer, has been shown to be not only a reliability issue but also a security risk for servers, laptops, desktops and embedded systems around the world. In short, excessive accesses to a single row in memory can cause bit flips in adjacent rows, leading to system crashes, corrupted data and even security exploits. Several research papers have been published, and more work is under way to help the industry understand the problem. No industry standards group, government agency or trade association has signed up to address this issue. Data centers and end users are on their own.

Computer architecture relies on three basic building blocks: the CPU (central processing unit), the I/O (input and output) and the memory. When it comes to memory, the dominant technology is DRAM, or Dynamic Random Access Memory. Today’s most prevalent version is DDR3, the third generation of Double Data Rate memory. In the quest to make memories smaller and faster, memory vendors have had to shrink physical geometries. These small geometries put memory cells so close together that one cell’s charge can leak into an adjacent cell and cause a bit flip. It has come to the attention of the industry that this does happen under certain conditions.

Very simply, the problem occurs when the memory controller, under command of the software, issues ACTIVATE commands to a single row address over and over. If the physically adjacent rows have not been activated or refreshed recently, charge from the over-activated row leaks into the dormant adjacent rows and causes bits to flip. This failure mechanism has been named ‘Row Hammer’ because a row of memory cells is being ‘hammered’ with ACTIVATE commands. Double-sided Row Hammering has also been demonstrated. It involves two ‘aggressor’ rows, one on either side of a ‘victim’ row, and it produces failures faster and flips more bits. Once a bit has flipped, the next Refresh command from the memory controller locks the error into the memory cell. Current understanding is that the charge leakage does not permanently damage the physical memory cell, so rerunning memory tests to find a ‘failing’ device is useless.
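
A toy simulation can make the mechanism easier to picture. The Python sketch below is not a model of real DRAM physics. The ToyBank class, the flip threshold of 1000 and the row numbers are all hypothetical, and it assumes an open-page policy in which accessing the row that is already open issues no new ACTIVATE. Under those assumptions, a single-sided attack has to alternate between the aggressor and some distant row to force a fresh ACTIVATE each time, while a double-sided attack alternates between two rows that both border the victim.

```python
from collections import defaultdict

# Hypothetical tolerance: neighbour ACTIVATEs a row survives between refreshes.
# Real thresholds vary by device and are not given in this article.
FLIP_THRESHOLD = 1000


class ToyBank:
    """Toy model of one DRAM bank. It counts disturbances, not real charge."""

    def __init__(self, threshold=FLIP_THRESHOLD):
        self.threshold = threshold
        self.open_row = None                 # row currently held in the row buffer
        self.activates = 0                   # ACTIVATE commands issued so far
        self.disturbance = defaultdict(int)  # row -> neighbour ACTIVATEs since last restore
        self.flipped = set()                 # rows that have suffered a bit flip

    def access(self, row):
        if row == self.open_row:
            return  # row-buffer hit: no ACTIVATE is issued (assumed open-page policy)
        self.open_row = row
        self.activates += 1
        self.disturbance[row] = 0  # activating a row restores its own cells
        for neighbour in (row - 1, row + 1):
            self.disturbance[neighbour] += 1
            if self.disturbance[neighbour] >= self.threshold:
                self.flipped.add(neighbour)

    def refresh_all(self):
        # Refresh recharges every row but does not undo flips that already happened.
        self.disturbance.clear()


def hammer(bank, rows, accesses):
    """Issue the given number of accesses, cycling through the listed rows."""
    for i in range(accesses):
        bank.access(rows[i % len(rows)])


single = ToyBank()
hammer(single, [9, 500], 1000)   # aggressor row 9 plus a distant row 500
double = ToyBank()
hammer(double, [9, 11], 1000)    # aggressors on both sides of victim row 10

print(sorted(single.flipped), sorted(double.flipped))  # prints: [] [10]
```

With the same 1000 ACTIVATEs, the victim row in the double-sided case is disturbed twice as often, which is why it crosses the (made-up) threshold first in this sketch.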

DDR3 memory is pervasive today and is used in nearly all cloud server systems, many embedded applications and military applications. Most critical applications do use error detection and correction (ECC). However, the ECC typically used corrects only single-bit errors and detects, but cannot correct, double-bit errors. When more than two bits flip in the same protected word, which has been demonstrated with Row Hammer failures, ECC falls short. Our dependence on DDR3 memory and this known failure mechanism should be a wake-up call for the industry. So far the most common workaround is to double the refresh rate of the memory. This is an attempt to ‘charge up’ the dormant memory cells so that they do not fall victim to adjacent rows that might be ‘hammered’. It reduces performance and increases power consumption, yet it does not eliminate the problem; it only lowers the statistical probability of a failure.
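
To see why single-correct, double-detect ECC falls short, consider a small stand-in code. Server ECC usually protects 64 data bits with 8 check bits; the sketch below instead uses an extended Hamming code with 4 data bits and 4 check bits, chosen here only because it is short enough to read. The function names encode, decode and flip are made up for this illustration. It has the same guarantees (one flipped bit is corrected, two are flagged) and shows what can happen with three.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # overall parity over positions 1..7
    return code


def decode(code):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position  # XOR of set positions is 0 for a valid codeword
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: assume exactly one bit flipped and repair it.
        # Syndrome 0 here means the overall parity bit itself was hit.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but a non-zero syndrome: two bits flipped, cannot repair.
        status = "uncorrectable"
    return status, [code[3], code[5], code[6], code[7]]


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for p in positions:
        out[p] ^= 1
    return out


original = [1, 0, 1, 1]
word = encode(original)

for flips in ((6,), (3, 6), (3, 5, 6)):
    status, data = decode(flip(word, *flips))
    print(len(flips), "bit(s) flipped:", status, "data intact:", data == original)
```

In this small code the three-bit case is reported as ‘corrected’ even though the returned data is wrong. The decoder cannot tell three flips from one, so the corruption passes silently. Real 72-bit codes behave somewhat differently in detail, but they offer no guarantee beyond two flipped bits either.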

Why does this happen? Simply put, the memory controller’s job is to read and write information to and from the memory under program control. If the running software executes instructions that repeatedly access a single location, the memory controller will generate excessive ACTIVATE commands. Currently, nothing in DDR3 memory controller designs prevents this from happening.
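
A rough calculation gives a sense of how many ACTIVATE commands can land on one bank before a neighbouring row is refreshed. The figures below are assumptions, not values from this article: a 64 ms refresh window is typical of DDR3 at normal temperature, and 50 ns is a round number for the minimum time between two ACTIVATEs to the same bank. The result is an upper bound, since it ignores the time the refresh commands themselves take.

```python
def max_activates_per_window(refresh_window_ms, row_cycle_ns):
    """Upper bound on ACTIVATEs to one bank within a single refresh window."""
    window_ns = refresh_window_ms * 1_000_000
    return window_ns // row_cycle_ns


ROW_CYCLE_NS = 50  # assumed round figure for the ACTIVATE-to-ACTIVATE time

normal = max_activates_per_window(64, ROW_CYCLE_NS)   # assumed standard window
doubled = max_activates_per_window(32, ROW_CYCLE_NS)  # refresh rate doubled

print(normal, doubled)  # prints: 1280000 640000
```

Under these assumptions, doubling the refresh rate halves the number of ACTIVATEs an aggressor can issue before its neighbours are recharged, but still leaves room for hundreds of thousands of them.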

 

Figure 1: OCP Server being tested for Row Hammer Failures

 

DDR3 memory is a critical part of the world’s cloud computing strategy, and today’s servers contain an extensive amount of it. Studies have shown a potential for millions of Row Hammer failures per system. Given the vast amount of DDR3 memory in today’s systems, these failures should clearly be a concern. This known failure mechanism can lead to undetected data corruption, reliability issues and security breaches. Current mitigation strategies for deployed systems are impractical, expensive or merely reduce the statistical likelihood of a failure. Should we upgrade to DDR4? Not so fast: studies have shown that DDR4 has the same problem. A strategy to determine whether applications even create the conditions for a Row Hammer failure should be considered. Understanding whether an application is at risk can reduce the pressure to implement unneeded, expensive and time-consuming mitigation strategies, saving organizations millions of dollars. If applications are shown to be at risk, steps can then be taken to upgrade hardware, rewrite the application and warn the field that such failures might occur.
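
One way to picture such a risk assessment is to count ACTIVATE commands per row from a captured command trace. The sketch below is only an illustration. The trace format (timestamp in nanoseconds, bank, row), the function name find_hammered_rows, the 64 ms window and the alert threshold are all assumptions made for this example, and fixed time windows are a simplification of how rows are actually refreshed.

```python
from collections import Counter

REFRESH_WINDOW_NS = 64_000_000  # assumed 64 ms refresh window
ALERT_THRESHOLD = 100_000       # hypothetical; not a figure from this article


def find_hammered_rows(trace, window_ns=REFRESH_WINDOW_NS, threshold=ALERT_THRESHOLD):
    """Return {(window_index, bank, row): count} for heavily activated rows.

    trace is an iterable of (timestamp_ns, bank, row) ACTIVATE events.
    """
    counts = Counter()
    for timestamp_ns, bank, row in trace:
        counts[(timestamp_ns // window_ns, bank, row)] += 1
    return {key: n for key, n in counts.items() if n >= threshold}


# Tiny synthetic trace with a deliberately low threshold, just to show the shape.
trace = [(i * 50, 0, 7) for i in range(10)] + [(i * 50, 0, 8) for i in range(3)]
print(find_hammered_rows(trace, window_ns=1000, threshold=5))  # prints: {(0, 0, 7): 10}
```

An application whose traces never produce an entry here, for a threshold chosen conservatively for the installed memory, would be a weaker candidate for costly mitigation than one that does.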