Case review: Hot Standby Keeps the Server Uninterrupted

System failures can have many causes. It may take 10 minutes, hours, or even days for the server to return to normal.

My former unit runs the core of the government network in a district of Beijing. It hosts more than 20 application platforms for different business departments, plus more than 20 external websites and OA systems for important departments. The government office automation (OA) platform was built in early 2001; after five years of upgrades it is now on its fourth version. The platform serves not only as the information and communication channel for all government units, commissions, offices, and subdistrict offices, but also as the carrier through which all official documents circulate. Its importance can be imagined.

Finding a substitute server

One day, a core server in our district suffered a serious failure that resulted in data loss. As an information director with ten years of work experience, I was shocked that a server only a few years old could run into such a problem. I think everyone knows one piece of network management common sense: the more heavily a machine is used, the higher its failure rate.

One month after the server was repaired, my unit sent me to attend a network security class. I had long since heard the term "dual-machine hot standby", but I only came to really understand it through the "ensuring business continuity" part of that class.

Many vendors tell us that switchover takes "zero" time. In fact that is impossible: both practical results and real cases make it easy to see that this is only a relative "zero". A typical system that can complete a host switchover within 1 minute is already well designed.

Dual-machine hot standby keeps the service running when the main server fails, but in practice there may be more than two servers, that is, a server "cluster". (A note on terminology: the English word "cluster" has more than one Chinese rendering, and I prefer the one that most accurately names this multi-server arrangement.) Put more specifically, a hot standby system can be understood as two servers running in Active/Standby mode that share one storage device. Only one server runs the service at any given time. When the active server fails, the standby server detects this through software diagnostics (usually called heartbeat detection) and takes over, so that the application is fully restored to normal use within a short time.
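The heartbeat idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular cluster product works: the class name `StandbyMonitor`, the 3-second timeout, and the use of plain numbers for timestamps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StandbyMonitor:
    """Standby-side view of the active server's heartbeat (illustrative only)."""

    timeout_s: float          # assumed: silence longer than this means failure
    last_heartbeat_s: float   # time at which the last heartbeat arrived
    role: str = "standby"

    def on_heartbeat(self, now_s: float) -> None:
        # The active server is alive; remember when we last heard from it.
        self.last_heartbeat_s = now_s

    def check(self, now_s: float) -> str:
        # Promote to active once the heartbeat has been silent for too long.
        silent_for = now_s - self.last_heartbeat_s
        if self.role == "standby" and silent_for > self.timeout_s:
            self.role = "active"
        return self.role


monitor = StandbyMonitor(timeout_s=3.0, last_heartbeat_s=0.0)
monitor.on_heartbeat(1.0)
print(monitor.check(2.0))   # heartbeat is recent, so the role stays "standby"
print(monitor.check(5.5))   # 4.5 s of silence exceeds the timeout: "active"
```

The timeout is the knob that decides how close to "zero" the switchover can get: a shorter timeout detects failure sooner but is more likely to promote the standby on a brief network hiccup.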

Preparing to deploy a dual-machine hot standby system

After I returned from the class, our unit held its regular monthly work meeting. Tying in with the plan to build a security information platform for our district, I proposed that we set up hot standby. Our systems department receives no fewer than 30 fault reports every day, of many kinds: equipment failures, operating system failures, software failures, and so on.

It can take network operators and system administrators anywhere from 10 minutes to several hours, or even days, to restore a server manually. If the technician is not on site, restoring service takes even longer. This OA failure was unusual; some system engineers may go a whole career without seeing the like: two hard disks in the RAID 5 array dropped out at the same time, just after the backup system had been migrated to the new computer room. I had never felt such pressure before. While I was grateful to the IBM engineers for their timely repair, I came to feel that building a more complete safeguard system mattered more.

Everyone knows that servers fail far more often than switches or storage devices. The reason is easy to understand: a server is a much more complex device than a switch or storage device, combining hardware with an operating system and application software. The right way to decide whether to use hot standby is to analyze the importance of the existing system and its tolerance for service disruption.

Equipment failure is not the only cause of service interruptions; software problems can also stop a server from working properly. We decided whether to adopt dual-machine hot standby using the formula "final condition = user tolerance time - system recovery time". According to an earlier questionnaire and our daily support calls, OA clients will wait no more than one hour at most, while the fastest we can recover from a backup is more than 6 hours. Recovery time far exceeds what users will tolerate, so establishing a dual-machine hot standby system was imperative.
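The arithmetic behind that decision can be written out as a small Python sketch. The function names are hypothetical, and reading a negative result as "hot standby is needed" is my interpretation of the formula rather than something the formula itself states. The inputs use the most favourable bounds from the text: 1 hour of tolerance and 6 hours of recovery.

```python
def final_condition_hours(user_tolerance_h: float, system_recovery_h: float) -> float:
    """final condition = user tolerance time - system recovery time (in hours)."""
    return user_tolerance_h - system_recovery_h


def needs_hot_standby(user_tolerance_h: float, system_recovery_h: float) -> bool:
    # Assumption: a negative margin means users give up before service returns.
    return final_condition_hours(user_tolerance_h, system_recovery_h) < 0


# Restoring from backup: at best 6 hours, against a 1-hour tolerance.
print(final_condition_hours(1.0, 6.0))   # -5.0
print(needs_hot_standby(1.0, 6.0))       # True

# A hot standby switchover of about 1 minute fits well inside the tolerance.
print(needs_hot_standby(1.0, 1.0 / 60))  # False
```

Because the real recovery time is "more than 6 hours", the actual margin is even worse than the -5 hours shown here.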

Choosing how to deploy hot standby

The report was submitted and the funds were approved, but when it came time to build the system I ran into a dilemma. As I understand it, there are two ways to implement hot standby, and I hesitated over which to choose. One is based on a shared storage device; the other uses no shared storage device and is generally called the pure software approach.

Shared storage

In this approach, two servers are attached to a shared storage device (a disk array cabinet or a storage area network, SAN). The two servers present a single virtual IP address to clients. When one server fails, the other detects this through heartbeat detection and switches over to take on the service. Because the storage is shared, the two servers work on effectively the same data, and the pair is managed by dual-machine or cluster software.
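A toy model in Python may make the shared-storage arrangement easier to picture. Everything here is hypothetical: a plain dictionary stands in for the disk array or SAN, and the `VirtualIP` class stands in for whatever the cluster software does to move the address between nodes.

```python
class Server:
    """One cluster node; it holds no data of its own, only a handle to the shared store."""

    def __init__(self, name: str, shared_store: dict):
        self.name = name
        self.store = shared_store  # both nodes reference the same storage device

    def write(self, key: str, value: str) -> None:
        self.store[key] = value

    def read(self, key: str) -> str:
        return self.store[key]


class VirtualIP:
    """Stands in for the virtual IP address: clients talk to it, not to a node."""

    def __init__(self, owner: Server, backup: Server):
        self.owner = owner
        self.backup = backup

    def fail_over(self) -> None:
        # Cluster software would trigger this after heartbeat detection fails.
        self.owner, self.backup = self.backup, self.owner


disk_array: dict = {}  # hypothetical stand-in for the disk array or SAN
vip = VirtualIP(Server("node-a", disk_array), Server("node-b", disk_array))

vip.owner.write("doc-001", "approved")  # served by node-a
vip.fail_over()                         # node-a fails, node-b takes the address
print(vip.owner.name, vip.owner.read("doc-001"))  # node-b approved
```

Nothing is copied at failover time: the surviving node simply continues working on the data that is already on the shared device.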



Pure software

In brief, the pure software approach uses mirroring software to replicate data to the other server in real time, so that each of the two servers holds its own copy of the same data. If one server fails, service can be switched to the other in time. There is also a case where the cluster needs no shared storage at all, and dual-machine or cluster software can be used directly. That case actually has nothing to do with mirroring software; it is just a variation on the shared storage mode described above.
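For contrast with shared storage, here is a minimal Python sketch of the mirroring idea. It assumes synchronous replication (every write is copied before it returns), which is a simplification; real mirroring software may work differently. The class and attribute names are hypothetical.

```python
from typing import Optional


class MirroredServer:
    """A server with its own local disk and an optional mirror partner."""

    def __init__(self, name: str):
        self.name = name
        self.local_disk: dict = {}  # each server keeps its own copy of the data
        self.mirror: Optional["MirroredServer"] = None

    def write(self, key: str, value: str) -> None:
        self.local_disk[key] = value
        if self.mirror is not None:
            # Assumption: the copy to the partner happens synchronously.
            self.mirror.local_disk[key] = value


primary = MirroredServer("primary")
secondary = MirroredServer("secondary")
primary.mirror = secondary

primary.write("doc-002", "draft")
print(secondary.local_disk)  # {'doc-002': 'draft'}
```

The two disks are separate objects holding equal contents, so the secondary can serve the same data if the primary is lost, at the cost of storing everything twice.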

After discussion within the systems group, we ultimately chose the shared storage approach. There were three reasons:

1. The OA is built on the Windows IIS + SQL Server platform, so using Windows Cluster Services raises no compatibility problems.

2. With simple training, the whole systems group can set up and manage a Windows Cluster; it is widely used, which also helps ensure that future upgrades will not cause trouble.

3. Considering how much data the OA will hold in the future, it is more reasonable to put the money into the storage device than into buying software.

In the years that followed, the hot standby system did experience single-point failures. In one case, IIS could not be started after a system patch was installed; the cause was found only after we reproduced the failure together with the OA vendor. Even so, the OA has not stopped running in all this time.
