Approaches to Resolving Common Server Software Faults

  
                              

Software faults are the most common class of server failure, accounting for roughly 70% of cases, and resolving them calls for careful thought. They have many causes. The most common are an outdated server BIOS, bugs in the server management software or drivers, conflicts between applications, and human error. The examples below show how each type of fault can be repaired.

An HP LH6000R server was configured with dual PIII Xeon 700 CPUs with 2 MB of cache and 512 MB of RAM. After power-on, the system log reported an error in the voltage regulator module (VRM): “Voltage Regulator Module (VRM) over/under-voltage 2.88V/0V”. On the surface, this suggests that the VRM or some other hardware has failed, so maintenance personnel could easily conclude that it is a hardware fault. They immediately tested with hardware taken from other LH6000Rs and found that even with replacement parts the server still reported VRM errors. As it happened, the maintenance engineer had brought the latest firmware for the CPU management controller. After the firmware of the CPU management section was upgraded, the server returned to normal immediately.

To upgrade the firmware, extract the CPU management board (CMC) firmware flash utility, FLASH.EXE, from the server's Navigator, then download LH6KC.BIN (the CPU management board's firmware) from the Internet. Copy them to a DOS boot disk and boot the server from that disk. Then run "FLASH /CMC A:LH6KC.BIN" under DOS, and restart the server once the flash is complete. The same method can be used to flash the system BIOS and other components, but the parameters of the FLASH command and the firmware and BIOS file names differ. For the parameters, refer to the server's documentation.

Every server's firmware and BIOS contain bugs of one kind or another, because bugs are unavoidable. Do not assume that the server's BIOS is perfect, and keep the firmware and BIOS up to date. Take care before upgrading, though: an incorrect upgrade procedure can have serious consequences.

Today's mid-range and high-end servers all come with powerful management programs that give customers convenient ways to manage them. They also ship with drivers for various operating systems, so customers can run the server under the system of their choice. However, every program has some bugs, and those bugs affect users. Server vendors release updated programs promptly, so customers only need to apply the updates in good time to avoid this kind of fault.

These software faults show up in different ways. In general, a bug in a management program slows the system down, drives CPU usage up, and leaves some functions unusable. A driver bug causes crashes, conflicts with some software, and unstable disk operation. The best way to check whether a management program is at fault is to disable the management tools in the system and then observe whether the server is still abnormal. Because the management tools start when the system boots, they must first be kept from starting. Taking Windows NT 4 as an example, first disable the server software's services in the Services management tool, then modify the startup entries in the registry. If a driver is suspected, boot into safe mode and see whether the system behaves normally. Note that in safe mode it is normal for the system to run more slowly (especially disk I/O).
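The isolation idea in the paragraph above can be sketched in Python. This is only an illustration under stated assumptions: the service names are invented, and `disable`, `enable`, and `server_is_healthy` are hypothetical callbacks standing in for whatever the administrator actually does (stopping a service by hand, watching CPU usage, and so on). Disabling one tool at a time is a refinement of the text's advice, not something the original procedure spells out.

```python
def find_culprits(services, server_is_healthy, disable, enable):
    """Disable one service at a time; return those whose absence restores health.

    All three callbacks are hypothetical placeholders supplied by the caller.
    """
    culprits = []
    for name in services:
        disable(name)
        try:
            if server_is_healthy():
                culprits.append(name)
        finally:
            # Always restore the service so each test starts from the same state.
            enable(name)
    return culprits


# Simulated environment: "agent_b" is the (made-up) faulty management agent.
running = {"agent_a", "agent_b", "agent_c"}

def disable(name):
    running.discard(name)

def enable(name):
    running.add(name)

def server_is_healthy():
    return "agent_b" not in running

print(find_culprits(sorted(running), server_is_healthy, disable, enable))
# prints ['agent_b']
```

On a real server the health check would be an observation period rather than an instant test, so each iteration may take minutes or hours.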

Server administrators should regularly download the latest management tools and drivers from the server vendor's website. This prevents a large share of software faults.

In contrast, faults caused by software conflicts are harder to diagnose and call for rich experience and keen observation on the administrator's part.

A friend once told me that he had an Inspur server on which SQL Server 2000 could not be installed. He had reinstalled NT many times, and the fault persisted. This was also the only server available to become a very important database server, so the matter was urgent. I went with my friend to his company to take a look. The server was housed in a very standard, well-equipped computer room. I checked the server and found no hardware fault, which ruled out a poorly reading optical drive. However, my friend's burned copy of the SQL Server 2000 CD made me suspicious, so I asked him to bring out the genuine SQL Server installation disc. The result was the same. The installation completed without the slightest error, but the program exited automatically when run, without any prompt. I then found a message in the system log in Event Viewer, under the management tools: windata.exe had caused an invalid data overflow. Windata was a program my friend had written himself, and it was launched when the operating system started. I ended this process immediately, and after that SQL Server ran without any problem.

For software faults like this, the operator should first check the relevant logs to see whether there are any suspicious processes in the system. Current servers, whether high-end or low-end, support standard programs such as SQL Server quite reliably, so troubleshooting should focus on ending the suspicious process.
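As a rough illustration of the log check, the Python sketch below tallies which programs appear in error entries of a log that has been exported as text. The line format, the timestamps, and the sample entries are all invented for the example (only the windata.exe message comes from the case above), so the regular expression would need adapting to whatever a real export looks like.

```python
import re
from collections import Counter

# Assumed (hypothetical) export format: "<date> <time> <LEVEL> <source>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) (?P<level>[A-Z]+) (?P<source>[^:]+): (?P<msg>.*)$"
)

def suspicious_sources(lines, levels=("ERROR",)):
    """Count log entries per source for the given severity levels."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") in levels:
            counts[match.group("source")] += 1
    # Most frequent offenders first.
    return counts.most_common()

# Made-up sample data for demonstration only.
sample = [
    "2001-05-14 09:00:01 INFO eventlog: service started",
    "2001-05-14 09:00:07 ERROR windata.exe: invalid data overflow",
    "2001-05-14 09:03:12 ERROR windata.exe: invalid data overflow",
    "2001-05-14 09:04:40 WARNING disk: nearly full",
]

print(suspicious_sources(sample))
# prints [('windata.exe', 2)]
```

A program that shows up repeatedly in error entries, especially one that starts with the operating system, is a natural first candidate to end before retrying the failing application.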

Software faults can also be caused by human factors. These usually come from operator error (including not following operating procedures), unexpected shutdown (including sudden power failure), or abnormal termination of applications.

Operator error can be avoided by strengthening management. The rest of this article describes in detail how to deal with faults caused by unexpected shutdown or by programs being closed abnormally.

Shutting down system programs properly is very important, especially on a web server. One of my friends suffered data corruption, and even data loss, because the system had not been shut down properly. He was using an HP web hosting server appliance, so I gave him some rules for using it.

These rules are very effective for server maintenance. They cover how to shut the system down correctly, how to avoid data loss, and how to recover after an abnormal shutdown. Let's take my friend's HP web hosting server appliance as an example (it runs UNIX, but the approach is valid for other operating systems).

A proper shutdown is performed by pressing the Power button. Keep the power switch pressed for a few seconds so that the system goes into a normal shutdown.

In addition, in order to avoid data loss, you should follow the steps below:

· Back up the data of the Web Hosting Server Appliance frequently. This can be done through the network management interface.

· Install a second hard disk and set it up as a mirror of the original hard disk.

If the Server Appliance was not shut down properly and cannot be restarted, restore it as follows:

1. With the appliance powered off, connect a null-modem serial cable (found in the box) to the console port on the back.

2. Connect the other end of the serial cable to the serial port of a PC running Windows.

3. Run HyperTerminal and set the port parameters to 19200, 8-N-1, Flow control: None. You will see the appliance's console prompt, which asks you to enter the administrator password.

4. Restart the appliance and wait for the prompt "LILO boot:", then hold down the Tab key for 5 seconds until the prompt changes to "boot:".

5. Type "emergency" and press Enter, then wait patiently for a few minutes. The login prompt will appear again, and the LCD screen will start working again.

6. Select a random password on the LCD screen (this password is used only for emergency recovery):

Go to Defaults... and press the right arrow key to select it.
Go to Root Password... and press the right arrow key to select it.
Go to Random and press the right arrow key to select it; a randomly generated password is displayed.
Make a note of this password.
Go to Yes and press the right arrow key to select it; the system password is changed immediately.

7. Return to the HyperTerminal console screen and log in to the appliance with the username "root" and the password just noted. The "#" prompt will appear.

8. To repair the partition, proceed as follows:

For sa1100, enter in order:
[...]#: fsck /dev/hda5

[...]#: fsck /dev/hda6

[...]#: fsck /dev/hda7

For sa1120, enter in order:
[...]#: fsck /dev/sda5

[...]#: fsck /dev/sda6

[...]#: fsck /dev/sda7

When all partitions have been repaired, you are returned to the "#" prompt.

9. Enter "reboot" to restart the system.

If the system still does not start, record what is shown on the console and contact technical support.
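The per-model repair sequence from steps 8 and 9 can be captured in a small Python helper, which may be handy as a checklist when typing the commands at the console. It is only a sketch: the function and dictionary names are invented here, and the code merely builds the command strings listed in the procedure. It does not run fsck or talk to the appliance.

```python
# Partition lists copied from step 8 of the recovery procedure.
PARTITIONS = {
    "sa1100": ["/dev/hda5", "/dev/hda6", "/dev/hda7"],
    "sa1120": ["/dev/sda5", "/dev/sda6", "/dev/sda7"],
}

def repair_commands(model):
    """Return the commands to type, in order, for the given appliance model."""
    try:
        partitions = PARTITIONS[model.lower()]
    except KeyError:
        raise ValueError(f"unknown appliance model: {model!r}") from None
    # Check each partition in turn, then restart (step 9).
    return [f"fsck {partition}" for partition in partitions] + ["reboot"]

for command in repair_commands("sa1120"):
    print(command)
# prints:
# fsck /dev/sda5
# fsck /dev/sda6
# fsck /dev/sda7
# reboot
```

Keeping the partition lists in one table makes the difference between the two models explicit: the sa1100 uses hda devices and the sa1120 uses sda devices.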

Server software faults can largely be avoided as long as the administrator keeps up with maintenance.
