Failure analysis of the three major components of the server

  
x86 servers have much in common with desktops, from initial deployment through mid-term maintenance to later management. So although the x86 server architecture is mature and stable, a server will inevitably "go on strike" at some point. Enterprise servers in particular carry much heavier workloads, so failures are common. This article shares failure cases for the three major components, in the hope that readers can avoid the same problems on their own business platforms.

Server core - CPU
Degree of harm: ★

Fault playback: Anyone who has done server testing will recognize this scenario. An Intel Xeon-based server shows no display at boot, and the system indicator flashes wildly. The first suspicion is poor contact between the CPU and the motherboard, but moving the CPU to another socket on the multi-processor motherboard still produces no response.

Solution: Measurement showed that the CPU voltage was abnormal. The CPU's original VRM (Voltage Regulator Module) had failed, so it could not perform DC conversion on the motherboard circuit and could not supply the CPU with a stable operating voltage. In this case, according to the original report, the only option was to replace the CPU.

This kind of failure is fairly fatal: a damaged CPU makes the entire server unavailable. However, the CPU itself is very reliable and has a very low failure rate. In day-to-day maintenance, service interruptions caused by CPU damage are therefore relatively rare, and the degree of harm is not too high. On a multi-way server, there is even less need to worry about a crash caused by CPU damage.

In addition, the other two core components of the server platform are memory and the hard disk. Starting with memory, server memory differs in some ways from ordinary desktop memory. Users who look closely at a server memory module will find that it usually has 9 chips on one side, compared with the 8-chip single-sided design of ordinary memory. This is what we usually call ECC memory.

Server read performance - memory
Degree of harm: ★★☆

Fault playback: A server originally had two 2GB memory modules installed. Because it carried too many services, data processing became slower and slower, so the server was upgraded by adding two more modules of the same model. After all the memory was inserted into the motherboard, the system detected only 6GB; the other 2GB had mysteriously disappeared. The newly added memory still could not be detected normally.

Solution: According to the official website for the server product, the cause is that the server's memory slots are paired: 1-4, 2-5, 3-6, 7-10, 8-11, and 9-12. The new memory had been inserted in slots 2 and 3, which do not form a pair, so naturally only one of the modules was detected. After a module was moved to slot 5, the full 8GB was detected.
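The pairing rule above can be checked mechanically before opening the chassis. The following is a minimal sketch, not a vendor tool: the pair table comes from the text, while the function name `check_pairing` and the assumption that the two original modules sat in slots 1 and 4 are hypothetical.

```python
# Slot pairs as listed in the text above.
SLOT_PAIRS = [(1, 4), (2, 5), (3, 6), (7, 10), (8, 11), (9, 12)]

def check_pairing(populated):
    """Split populated slots into complete pairs and slots whose partner is empty."""
    populated = set(populated)
    complete, unpaired = [], []
    for a, b in SLOT_PAIRS:
        if a in populated and b in populated:
            complete.append((a, b))
        elif a in populated:
            unpaired.append(a)
        elif b in populated:
            unpaired.append(b)
    return complete, sorted(unpaired)

# Assumption: the original modules were in slots 1 and 4 (not stated in the text).
before = check_pairing([1, 4, 2, 3])   # new modules in slots 2 and 3
after = check_pairing([1, 4, 2, 5])    # one module moved to slot 5
print(before)
print(after)
```

The sketch only reports which slots lack a partner. It deliberately does not try to predict how much capacity the firmware will report, since that behavior is specific to the board.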

As you can see, the advantage of server memory lies not only in performance. A great deal of effort also goes into fault tolerance, with the aim of giving the whole platform a highly stable environment. The technologies used in server memory, namely ECC (Error Checking and Correction), registered modules, and Chipkill, are all designed to improve memory stability, so that memory modules and slots work together better.
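The error-correcting idea behind ECC can be illustrated with a toy Hamming(7,4) code, which protects 4 data bits with 3 check bits and can locate and flip back any single wrong bit. This is only a sketch under assumptions: the text does not say which code the modules use, real ECC modules typically protect 64 data bits with 8 check bits, and the function names here are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(codeword):
    """Return (data bits after correction, 1-based error position or 0 if none)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # syndrome spells out the faulty position
    if position:
        c[position - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                          # simulate a single-bit memory error
print(hamming74_correct(word))
```

A single flipped bit is repaired transparently, which is the property that lets a server keep running through an isolated memory error. This toy code does not detect double-bit errors.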

As the server's storage end, the hard disk's stable operation bears directly on the security of enterprise data. The server hard disk is the core of the data warehouse: all software and data are stored there, so the requirements for its stability and reliability are very high.

In addition, servers are generally required to run 24/7, so their hard disks also run around the clock. Three main types of hard disks are used in the server market: SATA, SCSI, and SAS. SATA hard disks are mainly used in low-end servers, while SCSI and SAS hard disks are aimed at mid-range and high-end servers.

Server storage core - hard disk
Degree of harm: ★★☆

Fault playback: A server crashed from time to time and restarted without warning, and the problem became frequent. Testing by the data center's IT operations staff found that the hard disk had been in service too long and had developed physical bad sectors. The best course was to back up the data and replace the disk immediately, so the data on the disk was exported. During the transfer, however, I/O errors kept popping up, which made the transfer very slow and caused a lot of important data to be lost.

Solution: Most cases like this are caused by a head or platter fault. If the platter is scratched but the damaged area is small, a professional company can restore the data by replacing the magnetic head and recover more than 95% of it. That outcome is relatively lucky.

As a rule, though, prevention comes first. When a fault is found, resolve it promptly, before further physical damage occurs, because once a disk suffers moderate to severe damage the data is lost permanently. To avoid this situation, the following is recommended:

When choosing hard disks, pick professional server-grade drives, for example with an MTBF of over 1.6 million hours, an annual failure rate below 0.55%, and shock resistance of 300G/2ms or more. In addition, apply RAID array technology on the server, such as RAID 5. A RAID 5 array consists of at least three hard disks. As data is written to the disks, parity (verification) information is written as well. When one disk fails, its data can be computed from the other two disks, which greatly improves data safety.
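The RAID 5 recovery described above rests on XOR parity: the parity block is the XOR of the data blocks, so any one missing block equals the XOR of the survivors. The following is a minimal sketch of a single stripe on three disks, assuming equally sized blocks. The function names are hypothetical, and a real RAID 5 array also rotates the parity block across disks from stripe to stripe, which is not shown here.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

def make_stripe(data_blocks):
    """Return the data blocks followed by their parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def rebuild(stripe, failed_index):
    """Recompute the block at failed_index from the surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)

# Two data blocks plus one parity block: the three-disk minimum from the text.
stripe = make_stripe([b"ABCD", b"1234"])
print(rebuild(stripe, 0))   # contents of the "failed" first disk
```

The same function rebuilds a data block or the parity block, since XOR treats them symmetrically. Losing two blocks of the same stripe is not recoverable with this scheme.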
The three components above are only a brief starting point for discussion. Server failures are not limited to them: the power management module and the cards have similar problems. I hope users will accumulate experience in practice, minimize the incidence of failures, and provide a stable and flexible IT application environment.