3 major server hardware failures: have you ever encountered them?

Today, when it comes to CPUs for the x86 server platform, most people think of the many products from the two chip giants, Intel and AMD: from the earlier Xeon 5400 to the mainstream Xeon 5600 and Xeon 7500, as well as AMD's powerful 12-core x86 processor "Magny-Cours". Alongside the CPU, the server's other two core components, memory and hard disks, should not be underestimated. Memory with ECC and Chipkill, hot-swap technology, and RAID disk arrays that prevent data loss together make for a rock-solid x86 server.

However, x86 servers and desktops have a great deal in common, from initial deployment through ongoing maintenance to later management. So although the x86 server has a mature and stable architecture, it will inevitably "go on strike" from time to time. Enterprise workloads are especially heavy, so failures are common. This article shares failures of the three major components so that readers can keep them from appearing on their own business platforms.

Server Core - CPU

Hazard Level: ★

Fault Playback: Anyone who has tested servers will know this one. An Intel Xeon-based server showed no display at boot, and the system indicator light flashed wildly. The most direct suspicion was poor contact between the CPU and the motherboard, so the CPU was moved to another CPU socket on the multi-socket motherboard to rule this out.

Solution: In this situation the CPU voltage is abnormal. The CPU's original VRM (Voltage Regulator Module) has failed, so the DC conversion circuit on the motherboard cannot supply the CPU with a stable operating voltage. The only remedy is to replace the CPU.

In the author's view, this fault is relatively fatal: a damaged CPU makes the entire server unavailable. However, the CPU itself is very reliable and its failure rate is extremely low. In day-to-day maintenance, service interruptions caused by CPU damage are therefore relatively rare, and the overall hazard level is not high. On a multi-way server, there is little need to worry about a crash caused by the failure of one CPU.

The other two core components of the server platform are memory and hard disks. When choosing memory, note that server memory still differs from ordinary desktop memory. Users who look closely will find that a server memory module usually has 9 chips on one side, compared with the 8-chip single-sided design of ordinary memory. This is what is commonly called ECC memory.

Server Read Performance - Memory

Hazard Level: ★★☆

Fault Playback: A server with two 2GB memory modules installed was carrying too many services and processed data more and more slowly, so it was upgraded by adding two memory modules of the same model. After all the memory was inserted into the motherboard, the system detected only 6GB; the other 2GB had mysteriously disappeared. Swapping in a new module did not help: it still could not be detected normally.

Solution: According to the server vendor's official website, this happens because the server's memory slots are paired: 1-4, 2-5, 3-6, 7-10, 8-11, and 9-12. The new modules had been inserted in slots 2 and 3, which do not form a pair, so only one of them could be detected. After one module was moved to slot 5, the full 8GB was detected successfully.
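The pairing rule lends itself to a quick check before modules are installed. The following is a minimal Python sketch, not a vendor tool: the slot pairs are the ones listed in the text, while the function name `unpaired_slots` and the assumption that the two original modules sat in slots 1 and 4 are hypothetical.

```python
# Slot pairs as listed in the text: 1-4, 2-5, 3-6, 7-10, 8-11, 9-12.
SLOT_PAIRS = [(1, 4), (2, 5), (3, 6), (7, 10), (8, 11), (9, 12)]


def unpaired_slots(occupied):
    """Return the occupied slots whose partner slot is empty, in ascending order."""
    occupied = set(occupied)
    partner = {}
    for a, b in SLOT_PAIRS:
        partner[a] = b
        partner[b] = a
    # A slot is "unpaired" when a module sits in it but not in its partner slot.
    return sorted(s for s in occupied if partner.get(s) not in occupied)


# Assumption (not stated in the text): the original modules were in slots 1 and 4.
before = unpaired_slots({1, 4, 2, 3})  # new modules placed in slots 2 and 3
after = unpaired_slots({1, 4, 2, 5})   # one new module moved from slot 3 to slot 5

print(before)  # [2, 3]
print(after)   # []
```

An empty result means every installed module has a partner. The actual pairing table differs between server models, so the vendor's documentation remains the authority.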

As this shows, the advantage of server memory lies not only in performance; a great deal of effort also goes into fault tolerance, with the aim of providing a highly stable environment for the entire platform. The technologies mentioned earlier, ECC (error checking and correction), registered memory, and Chipkill, are all designed to improve memory stability. Modules and slots still have to be matched correctly for the system to benefit from them.
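To illustrate how error checking and correction works in principle, here is a toy Python sketch of a Hamming(7,4) code, which protects 4 data bits with 3 check bits and can repair any single flipped bit. It is an illustration only: ECC modules typically store 8 check bits alongside every 64 data bits (hence the ninth chip) and use a wider code, and the function names below are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword laid out as positions 1..7."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Fix at most one flipped bit; return (data bits, error position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome is the 1-based error position
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate a single-bit memory error at position 5
recovered, error_position = hamming74_correct(stored)
print(recovered, error_position)  # [1, 0, 1, 1] 5
```

The check bits cost extra storage, which is the trade-off behind the additional chip on an ECC module.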

The hard disk is the server's storage endpoint, and its stable operation bears directly on the security of enterprise data. The server hard disk is the core data warehouse: all software and data are stored there, so it must meet very high requirements for reliability and stability.

In addition, a server generally runs continuously, 24×7, so its hard disks must also run around the clock. There are three main types of hard disk in the server market: SATA, SCSI, and SAS. SATA disks are mainly used in low-end servers, while SCSI and SAS disks are aimed at mid-range and high-end servers.

Server Storage Core - Hard Disk

Hazard Level: ★★☆

Fault Playback: A server kept crashing and restarting without warning, and it happened frequently. After testing, the data center's IT operations staff found that the hard disk had been in service for too long and had developed physical bad sectors. The best course was to back up the data and replace the disk immediately, so the data on the disk was exported. During the transfer, however, I/O errors kept popping up, which made the transfer very slow and caused a lot of important data to be lost.

Solution: Most such cases are caused by a fault in the heads or the platters. If a platter is scratched but the damaged area is small, a professional data recovery company can replace the heads and recover more than 95% of the data. That is a relatively lucky outcome.

As the saying goes, though, it is better to prevent trouble before it happens. If the fault is detected in time, it can be dealt with before the platters suffer further physical damage. Once the platters are seriously damaged, the data is lost permanently. To avoid this, the following is recommended:

When choosing hard disks, use professional server-grade drives: for example, a mean time between failures of more than 1.6 million hours, an annual failure rate below 0.55%, and shock resistance of more than 300G/2ms. In addition, apply the server's RAID array technology. RAID5, for example, consists of at least three hard disks. As data is written to the disks, parity information is written as well. When one disk fails, its data can be reconstructed from the remaining disks by calculation, which greatly improves data safety.
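The parity idea behind RAID5 can be shown in a few lines of Python. This is a simplified sketch of a single stripe on a three-disk array, assuming XOR parity; real RAID5 implementations also rotate the parity block across the disks, which is omitted here. The helper name `xor_blocks` and the sample data are hypothetical.

```python
def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


# One stripe across three disks: two data blocks plus one parity block.
data_a = b"server01"
data_b = b"backup02"
parity = xor_blocks([data_a, data_b])  # written alongside the data

# Suppose the disk holding data_b fails: rebuild it from the two survivors.
rebuilt_b = xor_blocks([data_a, parity])
print(rebuilt_b == data_b)  # True
```

Because XOR is its own inverse, any one missing block in a stripe can be recomputed from the others. A second simultaneous disk failure cannot be recovered this way, so RAID5 does not remove the need for backups.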

The faults of these three components are only a brief introduction. In practice, server failures are not limited to them: similar problems occur in power supplies, management modules, and network cards. Hopefully users will accumulate experience in day-to-day use, keep the incidence of failures to a minimum, and provide a stable and flexible IT application environment.
