Memory failure causes the server to work abnormally

  

Symptoms


A server on the office LAN, equipped with an Intel Pentium® processor, had been working normally, but recently blue screens began to appear frequently. The server kept crashing, and management work could not be carried out. After one such crash I shut the server down, upgraded the SDRAM from the original 128 MB to 512 MB, and chose to load the optimal parameter settings in the system CMOS setup (that is, I selected "LOAD BIOS DEFAULTS"), with the memory test set to check each unit of memory. After I completed the setup and restarted the server, the memory check passed, but the screen prompted me to run SETUP again, and pressing the indicated "F2" key crashed the system.


Diagnostic Process

At first this problem looked fairly simple. Since the fault appeared after the upgrade, the first step was to check whether the newly added device had a physical fault or had been configured incorrectly. Following the principle of narrowing down the source of the fault, I restored the original memory module in the server and then checked the floppy disk drive, hard disk, optical drive and other devices in turn, removing only one hardware device at a time. No fault was found in any of the system's hardware devices. With a hardware failure ruled out, the problem had to be approached from a different angle.

Since the system prompted me to run SETUP again after startup, I reasoned that the fault might be related to the system SETUP settings, and in particular to the setting that tests each unit of memory. Having decided where to look, I turned off the server, removed the CMOS battery and shorted the battery contacts to discharge them, but the problem remained. I then located the clear-CMOS jumper on the motherboard, moved the jumper cap from pins 1-2 to pins 2-3, and returned it to its original position after a short while. When I started the server again, the system returned to normal and no further prompts appeared. I then reinstalled the upgraded memory and adjusted the CMOS SETUP settings, taking particular care to set memory detection to check every MB, and the system was fully restored.

The root cause of the failure was that the memory verification mode configured on the server did not match the mode supported by the installed memory. The lesson is that a network administrator needs to pay attention not only to the state of the network connections but also to every device used on the network.

Troubleshooting

1 Memory verification methods.

The cause of this failure is that the server's memory defaults to ECC (with check bits) and the system CMOS was set to test each unit of memory, but the ordinary memory used for the upgrade does not support this operation, which produced the failure described above. The key to troubleshooting such problems is to clear the CMOS settings and then adjust the relevant parameters.

In fact, there are three common verification methods for memory. The following is a brief summary.

Memory parity starts from the concept of the bit. The bit is the smallest unit in memory and has only two states, represented by 1 and 0. Eight consecutive bits make up one byte. Each byte of non-parity memory has only 8 bits, so if one of those bits stores the wrong value, the data held in that byte changes and the application may fail. Parity memory adds one extra bit to each byte (8 bits) for error detection. For example, suppose a byte stores "10011110". Adding up its bits gives 1+0+0+1+1+1+1+0 = 5. If the result is odd, the parity bit is set to 1; otherwise it is set to 0. When the CPU reads the stored data, it adds up the 8 data bits again and compares the result with the parity bit. If the two do not match, it responds to the error. Current motherboards can use memory modules either with or without parity, but note that the two types cannot be mixed.
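The parity rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how a memory controller is built; the function names `parity_bit` and `check` are made up for this example.

```python
def parity_bit(byte: int) -> int:
    """Return 1 if the byte holds an odd number of 1 bits, otherwise 0."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in 0..255")
    return bin(byte).count("1") % 2


def check(byte: int, stored_parity: int) -> bool:
    """Recompute the parity of the 8 data bits and compare it with the stored bit."""
    return parity_bit(byte) == stored_parity


data = 0b10011110              # the example byte from the text: five 1 bits
p = parity_bit(data)           # 5 is odd, so the parity bit is 1
corrupted = data ^ 0b00000100  # flip one bit to simulate a memory fault

print(p)                       # 1
print(check(data, p))          # True
print(check(corrupted, p))     # False
```

A single flipped bit changes the count of 1 bits from odd to even (or back), so the check fails. Two flipped bits in the same byte cancel out and go unnoticed, which is why parity can detect errors but cannot locate or repair them.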

ECC (Error Checking and Correcting) memory is also implemented by adding extra bits to the original data bits. For example, 8-bit data requires 1 bit for a parity check but 5 bits for ECC; the extra bits are used to reconstruct erroneous data. Each time the number of data bits doubles, the number of parity bits also doubles, whereas ECC needs only one more bit. At 64 data bits, ECC and parity use the same number of bits (8 each). Where parity can only detect errors, ECC can correct most of them. In normal operation the user does not notice that the data was ever wrong; the computer's instructions simply continue to execute once the memory has corrected the error.
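The bit counts quoted above can be reproduced with a short calculation. The text does not name the code used, so the sketch below assumes a Hamming code with one extra overall parity bit (often called SECDED), which happens to give the same numbers; `parity_bits` and `secded_bits` are names invented for this example.

```python
def parity_bits(data_bits: int) -> int:
    """Plain parity: one check bit per 8-bit byte."""
    return data_bits // 8


def secded_bits(data_bits: int) -> int:
    """Check bits for a Hamming code plus one overall parity bit (assumed scheme)."""
    r = 0
    # Find the smallest r whose 2**r syndromes cover every data bit,
    # every check bit, and the "no error" case.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # the extra bit allows double-bit errors to be detected


for width in (8, 16, 32, 64):
    print(width, parity_bits(width), secded_bits(width))
```

Under that assumption the loop prints 1 and 5 check bits for 8 data bits, 2 and 6 for 16, 4 and 7 for 32, and 8 and 8 for 64, matching the figures in the paragraph above.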

SPD (Serial Presence Detect) is a 256-byte EEPROM (Electrically Erasable Programmable ROM) chip in an 8-pin SOIC package (3 mm x 4 mm). Most are model 24LC01B. The chip usually sits on the right side of the front of the memory module. It records parameters such as the module's speed, capacity, voltage, and row and column address widths. At boot, the PC's BIOS automatically reads the information recorded in the SPD. Without an SPD, the system is prone to crashes or fatal errors. The SPD is an important indicator of PC100 memory. To cut production costs while appearing to conform to the PC100 standard, some manufacturers solder a blank SPD chip onto the PCB, which may prevent the module from working properly at FSB speeds above 100 MHz.
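As a rough illustration of what a "blank" SPD means, the sketch below flags a dump whose bytes all hold the same erased value. It assumes the SPD contents have already been read into a `bytes` object by some other tool, and that an unprogrammed EEPROM reads back as all 0xFF (or all 0x00); both points, and the function name `looks_blank`, are assumptions made for this example rather than anything stated in the text.

```python
def looks_blank(spd: bytes) -> bool:
    """True if the dump is empty or every byte is the same erased value."""
    return set(spd) in (set(), {0xFF}, {0x00})


# Hypothetical dumps, invented for illustration only.
blank_dump = bytes([0xFF] * 256)
programmed_dump = bytes([0x80, 0x08, 0x04]) + bytes(253)

print(looks_blank(blank_dump))       # True
print(looks_blank(programmed_dump))  # False
```

A dump that passes this test carries no timing or capacity information for the BIOS to read, which is the situation the paragraph above warns about.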

2 The way to load optimal parameters in CMOS.

This fault also involved loading the optimal parameter settings. CMOS SETUP usually offers two ways to load default parameters. One is the SETUP optimized parameters, which tune the whole system but depend on system support, so stability cannot be guaranteed. The other is the BIOS optimized parameters, which give the best stability and are generally recommended. When the system fails, it is worth loading the most stable parameters first and restoring the other settings only after the problem is solved. Note that, as a rule, the battery should not be removed casually. The discharge operation used in this case served only to help clear the CMOS settings.

Copyright © Windows knowledge All Rights Reserved