A Brief Analysis of Server Availability

  

Servers are an indispensable part of the hardware architecture in any informatization effort and have always attracted attention. Successive server generations have also tracked the development of the world's leading technology: from the original 16-bit processors, to the 32-bit processors that later became a huge success, to today's processors that support both 32-bit and 64-bit operation, and on to the coming era of pure 64-bit processors. However much servers have changed, one theme has remained constant: availability. A server that cannot guarantee even the most basic availability has no place on that stage.

What exactly is server availability? What does it include? Why does it matter so much? We answer these questions one by one below.

Server availability means that the server must be highly reliable, highly stable, and easy to manage and maintain; it must not crash or malfunction from time to time, and downtime must be kept to a minimum. Because a server is usually expected to work continuously and without interruption, stable and reliable operation is essential. If an ordinary PC crashes and restarts, at worst some documents and a small amount of data on that computer are lost, which does not cause huge economic losses. If a server crashes, the consequences are far more serious, because many important documents, data, and records are stored on the server and many network services run on it. Once the server fails, a great deal of data may be lost and many important services will stop: proxy Internet access, security authentication, e-mail, and so on. On a network that bills for usage, accurate billing data can no longer be provided. Not only is safe operation impossible, but the entire network may be paralyzed, and the losses are difficult to estimate. Ease of management and maintenance means that even non-professional users can look after all the devices on the network with the simplest management tools. In summary, high reliability, high stability, and easy management and maintenance are the concrete expressions of server availability.

How, then, is availability guaranteed in the design of the server's hardware architecture? The key lies in hardware redundancy and online hardware diagnostics. Common forms of hardware redundancy include disk redundancy, power supply redundancy, and fan redundancy, and in some cases RAM redundancy, PCI adapter redundancy, and network card redundancy. Online hardware diagnostics include hot-swap technology, memory protection, memory checking and error correction, memory mirroring, memory hot-add and hot-swap, active PCI, active diagnostics, and so on.
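A rough calculation helps show why redundancy matters. The sketch below uses the standard steady-state formula availability = MTBF / (MTBF + MTTR) and assumes that redundant components fail independently, which real hardware only approximates. The function names and the power supply figures are illustrative assumptions, not values from any particular server.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time a component is working."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` identical parts where one working part suffices.

    Assumes the parts fail independently of each other.
    """
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(avail: float) -> float:
    """Expected downtime over a 365-day year."""
    return (1.0 - avail) * 365 * 24


if __name__ == "__main__":
    # Hypothetical figures for one power supply, chosen only for illustration.
    psu = availability(mtbf_hours=50_000, mttr_hours=24)
    for n in (1, 2):
        a = redundant_availability(psu, n)
        print(f"{n} PSU(s): availability={a:.6f}, "
              f"downtime={downtime_hours_per_year(a):.3f} h/year")
```

Under these assumptions, adding a second copy of a component squares its unavailability, which is why redundancy is usually reserved for the few components whose failure would stop the whole machine.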

Hardware redundancy is the easier of the two to understand. It provides backup copies of hardware components so that damage to an individual component does not bring down the whole system. Because of equipment cost, not every component can be made redundant; redundancy is generally limited to certain key components. Disk redundancy, for example, is the RAID technology that people often mention: several independent hard disks (physical disks) are combined in different ways into a disk group (a logical disk) that offers higher storage performance than a single disk and also provides data redundancy. Current server products almost all adopt this technology and support RAID 0 and RAID 1. RAID 0 stripes data across disks so that the server can make full use of the bus bandwidth and significantly improve overall disk access performance, while RAID 1 mirrors data across disks to protect the availability of user data. Some current server products also provide redundant dual power supplies and dual fans with hot-swap support. This lets the power supplies and fans run under a lighter load, reduces the system problems caused by power supply or fan failure, and largely avoids unstable operation and downtime.
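The difference between RAID 0 and RAID 1 can be sketched with a toy model in which each "disk" is just a byte array. This is only an illustration of striping and mirroring, not how a real RAID controller is implemented; the function names and the 4-byte stripe size are assumptions made for the example.

```python
from typing import List, Optional


def raid0_write(data: bytes, disks: int, stripe: int = 4) -> List[bytearray]:
    """Stripe data round-robin across `disks` in chunks of `stripe` bytes."""
    out = [bytearray() for _ in range(disks)]
    for i in range(0, len(data), stripe):
        out[(i // stripe) % disks] += data[i:i + stripe]
    return out


def raid0_read(disks_data: List[bytearray], length: int, stripe: int = 4) -> bytes:
    """Reassemble striped data; every disk is needed, so there is no redundancy."""
    out = bytearray()
    offsets = [0] * len(disks_data)
    d = 0
    while len(out) < length:
        chunk = disks_data[d][offsets[d]:offsets[d] + stripe]
        if not chunk:
            raise IOError("stripe missing: RAID 0 cannot survive a lost disk")
        out += chunk
        offsets[d] += stripe
        d = (d + 1) % len(disks_data)
    return bytes(out)


def raid1_write(data: bytes, disks: int = 2) -> List[Optional[bytearray]]:
    """Mirror the same data onto every disk."""
    return [bytearray(data) for _ in range(disks)]


def raid1_read(copies: List[Optional[bytearray]]) -> bytes:
    """Read from the first surviving mirror; None marks a failed disk."""
    for copy in copies:
        if copy is not None:
            return bytes(copy)
    raise IOError("all mirrors have failed")


if __name__ == "__main__":
    payload = b"ABCDEFGHIJ"
    mirrors = raid1_write(payload)
    mirrors[0] = None  # simulate one failed disk
    print(raid1_read(mirrors) == payload)
```

In this model RAID 0 spreads work over all disks but loses data if any disk is lost, whereas RAID 1 halves the usable capacity in exchange for surviving a single disk failure.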

Hardware redundancy alone is not enough. Online hardware diagnostic technologies are also needed to push server availability as far as possible. Hot-swap technology, for example, allows certain components to be inserted and removed while the system is powered on. This matters because when a component is found to be damaged, hardware redundancy keeps the system running, but the damaged device still has to be replaced. Without hot-swap support, the server would have to be powered off for the replacement, which means deliberately stopping the server. Most Aerospace Alliance server products support hot plugging of hardware such as power supplies, hard disks, fans, memory, and network cards.

Memory error correction also deserves mention here, in particular Chipkill memory technology, a new ECC memory protection standard. CPU performance in servers based on Intel processor architectures has grown geometrically, while hard disk drive performance has increased only by a factor of five. To obtain sufficient performance, the server therefore needs a large amount of memory to temporarily hold the data the CPU reads. With such heavy data access, each access reads 4 (32-bit) or 8 (64-bit) bits of data from a single memory chip. Reading this much data at once greatly increases the chance of multi-bit errors, and ordinary ECC cannot correct double-bit errors, so all of the bits read may be lost and the system may quickly crash.
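The limit of ordinary ECC mentioned above, correcting one flipped bit but only detecting two, can be illustrated with a toy extended Hamming code over 4 data bits. This is a sketch for illustration only: real ECC memory protects much wider words, and this is not the Chipkill algorithm. The function names are assumptions.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit SEC-DED codeword.

    Index 0 holds an overall parity bit; indices 1..7 are Hamming positions
    with parity bits at 1, 2, 4 and data bits at 3, 5, 6, 7.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(received: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(received)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(code) % 2  # 0 when the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single-bit error at `syndrome`
        # (syndrome 0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity holds but the syndrome is non-zero: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1            # one flipped bit is repaired
    print(decode(word))
    word[6] ^= 1            # a second flipped bit can only be detected
    print(decode(word))
```

When several bits from the same memory chip go bad at once, a code of this kind can do no better than report the failure, which is the gap that Chipkill-style protection is meant to close.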

As the amount of memory installed in servers grows, so does the likelihood of memory-related errors in the system. To ensure the reliability of server products, Chipkill correction is therefore combined with other techniques: purely hardware methods such as memory protection, memory mirroring, and hot swap, and software-assisted methods such as memory hot-add. Together these keep the equipment reliable and maximize the availability of the whole system.

Memory mirroring keeps two copies of memory data, one in main memory and one in mirror memory. While the system is running, data is written to both at the same time, so two complete copies of the memory data always exist. Because the mirroring is crossed between channels, each channel holds a complete copy of the memory data.

A "fault tolerance threshold" is set in the system chipset. If any memory reaches this threshold, its channel is marked as faulty and the other channel works alone, while dual-channel memory bandwidth is still maintained.

Memory mirroring effectively prevents data loss caused by memory failure. The mirror memory and the main memory are distributed diagonally across the channels, so if one channel fails, the other channel still holds the faulty channel's data. This prevents data loss from a memory channel failure and greatly improves server reliability. The mirror memory must have a capacity greater than or equal to that of the main memory, and while the system is working the mirror memory is not visible to the system. In terms of cost, memory mirroring therefore requires twice the memory investment of a system without memory protection.
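The mirroring behavior described above can be modeled in a few lines. The sketch below is a simplified software simulation, not chipset logic: the class name, the default threshold of 3, and the `report_error` hook are assumptions made for illustration.

```python
class MirroredMemory:
    """Toy model of two-channel memory mirroring with a fault threshold."""

    def __init__(self, size: int, threshold: int = 3) -> None:
        self.channels = [bytearray(size), bytearray(size)]
        self.errors = [0, 0]
        self.disabled = [False, False]
        self.threshold = threshold  # hypothetical "fault tolerance threshold"

    @property
    def installed_capacity(self) -> int:
        return sum(len(ch) for ch in self.channels)

    @property
    def usable_capacity(self) -> int:
        # Only one copy is visible to the system; the mirror is hidden.
        return len(self.channels[0])

    def write(self, addr: int, value: int) -> None:
        """Write the same byte to every channel that is still enabled."""
        for ch, mem in enumerate(self.channels):
            if not self.disabled[ch]:
                mem[addr] = value

    def report_error(self, channel: int) -> None:
        """Count a detected error; disable the channel at the threshold."""
        self.errors[channel] += 1
        if self.errors[channel] >= self.threshold:
            self.disabled[channel] = True

    def read(self, addr: int) -> int:
        """Read from the first channel that has not been marked faulty."""
        for ch, mem in enumerate(self.channels):
            if not self.disabled[ch]:
                return mem[addr]
        raise MemoryError("both mirrored channels have failed")


if __name__ == "__main__":
    mem = MirroredMemory(size=16)
    mem.write(5, 0xAB)
    for _ in range(3):
        mem.report_error(0)  # channel 0 reaches the threshold
    print(hex(mem.read(5)), mem.usable_capacity, mem.installed_capacity)
```

The two capacity properties make the cost visible: only half of the installed memory is usable, which matches the doubled investment noted above.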

With memory hot sparing (Sparing), the memory set aside as a hot spare is not used under normal conditions, which means the system does not see this part of the memory capacity. One DIMM in each memory channel is left unused and reserved as hot spare memory. A threshold for memory check errors, expressed as the number of errors per unit of time, is set in the chipset. When the number of errors in working memory reaches this "fault tolerance threshold", the system starts double writes: each write goes both to the main memory and to the hot spare memory. Once the system detects that the data in the two memories is consistent, the hot spare memory takes over from the main memory and the faulty memory is disabled. This completes the replacement of the faulty memory by the hot spare and effectively avoids data loss or system downtime caused by memory failure. The capacity of the hot spare should be greater than or equal to that of the largest memory module in its channel, so that it can accommodate the largest possible data migration.
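The sparing sequence above (count errors, start double writes, verify, switch over) can be sketched as a small state machine. This is a simplified simulation under stated assumptions: it counts total errors rather than errors per unit of time, it gives the spare the same size as the active memory, and the class and method names are hypothetical.

```python
class SparedChannel:
    """Toy model of one memory channel with a hot spare DIMM."""

    def __init__(self, size: int, threshold: int = 3) -> None:
        self.active = bytearray(size)
        # A real spare must be at least as large as the memory it replaces;
        # here it is simply the same size. It is hidden from the system.
        self.spare = bytearray(size)
        self.errors = 0
        self.threshold = threshold
        self.migrating = False

    def write(self, addr: int, value: int) -> None:
        self.active[addr] = value
        if self.migrating:
            self.spare[addr] = value  # double write during migration

    def read(self, addr: int) -> int:
        return self.active[addr]

    def report_error(self) -> None:
        """Count a check error; start migration when the threshold is hit."""
        if self.spare is None or self.migrating:
            return
        self.errors += 1
        if self.errors >= self.threshold:
            self.migrating = True
            self.spare[:] = self.active  # copy the existing contents

    def finish_migration(self) -> bool:
        """Switch to the spare once both copies match; True on success."""
        if not self.migrating or self.spare != self.active:
            return False
        self.active, self.spare = self.spare, None  # faulty DIMM is retired
        self.migrating = False
        self.errors = 0
        return True


if __name__ == "__main__":
    ch = SparedChannel(size=8)
    ch.write(0, 7)
    for _ in range(3):
        ch.report_error()
    ch.write(1, 9)  # goes to both DIMMs while migrating
    print(ch.finish_migration(), ch.read(0), ch.read(1))
```

Unlike mirroring, only one spare DIMM per channel is held back here, so the capacity overhead is smaller, at the price of a migration step before the faulty memory can be retired.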

As is well known, overheating is a main cause of server instability. How can the server's temperature be kept under control when it runs at full load for long periods in a harsh environment? One approach is front air intake rather than side air intake: front intake ensures that a rack-mounted server has a completely unobstructed air source in actual use. Redundant fans, for their part, only ensure that when one cooling fan fails and stops dissipating heat, the other fan takes over immediately and preserves a certain level of cooling capacity.

Some current server products add a dedicated air duct that concentrates the airflow and controls its direction. Solving the heat dissipation problem in this way improves system reliability and effectively extends component life.
