Eight steps to troubleshoot AIX servers

Problem 1: A bigger server, but less computing power

At the time, I needed to migrate an AIX 5.3 LPAR from an old POWER4-based IBM pSeries p670 server to a new POWER6-based pSeries p570 server. The old server's resources were insufficient (I was using Workload Manager to ration resources for the main application on the server), so the dynamic processor resources on the new hardware should have provided the computing power I needed. I ran mksysb on the LPAR, restored it onto the new hardware with the Network Installation Manager (NIM), and mapped its disks through the SAN.
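
For reference, the outline of that migration looks roughly like the sketch below. It is only a sketch: the client name lpar01, the backup path, and the NIM resource names are hypothetical placeholders, and the exact NIM attributes will vary with your environment.

    # On the source LPAR: create a bootable system backup (-i regenerates image.data).
    mksysb -i /backup/lpar01.mksysb

    # On the NIM master: register the backup as a mksysb resource, then
    # push a Base Operating System install of it to the target LPAR.
    nim -o define -t mksysb -a server=master \
        -a location=/backup/lpar01.mksysb lpar01_mksysb
    nim -o bos_inst -a source=mksysb -a mksysb=lpar01_mksysb \
        -a spot=spot_aix53 -a accept_licenses=yes lpar01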

I booted the LPAR and everything looked fine until I started the application. Suddenly the users started calling: they simply could not access the application. When I logged in, I found the server almost completely idle, with no processes consuming significant resources. Why were the users having problems?
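
The checks that told me the server was idle were nothing exotic; here is a quick sketch of the sort of commands involved (run as root on the affected LPAR):

    # List the top CPU consumers (%CPU is the third column of ps aux output).
    ps aux | sort -rn -k3 | head -15

    # CPU, memory, and run-queue activity sampled every 5 seconds, 3 times.
    vmstat 5 3

    # Interactive overview of processes, CPU, disks, and network.
    topas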

Problem 2: A failing hard disk cannot be unmirrored

One of my servers had a mirrored root volume group. One day, the error report indicated that bad blocks on one of the disks could not be relocated. I knew this was a precursor to a hardware failure, so I started to unmirror the disk. However, the server reported that the mirror could not be completely removed, because one of the logical volumes had only one good copy and that copy was on the failing disk. How could I get around this and replace the hardware?
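
The warning signs came from the AIX error report. Below is a sketch of the commands I used to confirm what was failing (hdisk0 stands in for whichever disk is throwing the errors):

    # Summary of logged errors, newest first.
    errpt | more

    # Detailed entries for the suspect disk only.
    errpt -a -N hdisk0 | more

    # Which logical volumes have copies on the suspect disk?
    lsvg -l rootvg
    lspv -l hdisk0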

Troubleshooting Procedures

Keep these two example problems in mind as we walk through the troubleshooting process.

Step 1: Don't make any rash moves

When you find yourself in trouble, the most sensible move is not to move at all. Like Indiana Jones in "Raiders of the Lost Ark," if a dart whizzes past when you step on the wrong floor tile, stop right where you are and don't take another step. Further changes will only complicate the issue and may make things worse. When a problem is already affecting normal operation of the system, the last thing you need is several more problems to solve at once.

For the first example problem, I had the users log off the system right away and then shut down the application. I knew that with performance that poor, users' queries and inputs could be interrupted, possibly corrupting their data. I did not want their environment to change any further before I examined the system. Although users never want to hear that they cannot use the new server right away, they are happier knowing that someone is tracking down the cause of the problem. It also gave me time to work through the remaining troubleshooting steps at my own pace.

Step 2: Start with basic commands, then add complexity

When I was studying kung fu, I heard a story about a second-degree black belt who subdued a mugger at a bus stop. The students wanted to know which technique she had used to put the attacker down. Was it a golden tiger strike? A circling palm? We even imagined she was so skilled that she had floored him with drunken-immortal boxing. It was none of these: she used one of the first techniques taught to white belts in their very first class, an elbow to the chest followed by a fist to the nose.

AIX provides commands for checking every aspect of the server, hardware and software alike. Even the most basic commands provide a good foundation for analyzing a problem. When the information they give is not enough, or something still is not working properly, you can move on to more complex and powerful tools. But start with the simplest commands and ideas first, and only then bring out the heavier tools.

For the second example problem, I first looked for hardware problems in the errpt output, then used the unmirrorvg command, a simple but powerful tool, to try to unmirror the volume group instead of running rmlvcopy on every logical volume on the disk. When I found that one logical volume copy could not be removed, I used other basic commands such as lspv, lsvg, and migratepv to gather information. I tried to create another copy of the volume group on a different disk with extendvg and mirrorvg. That still left some stale partitions, so I went a step further and used syncvg and synclvodm to bring the Object Data Manager (ODM) back in line with the server. Finally, I used migratelp to try to move each logical partition off the disk. Unfortunately, none of these tools fixed the problem, but they provided a great deal of information.
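
For readers who want to follow along, the shape of that sequence is sketched below. Treat it as an outline only: in my case it did not clear the failing disk, but it shows how the tools escalate. Here hdisk0 is the failing disk, hdisk2 the replacement, and each command's result should be checked before running the next.

    # Try the simple route first: drop the mirror copies on the failing disk.
    unmirrorvg rootvg hdisk0

    # See what is still tied to hdisk0.
    lspv -l hdisk0
    lsvg -l rootvg

    # Add a replacement disk to the volume group and re-mirror onto it.
    extendvg rootvg hdisk2
    mirrorvg rootvg hdisk2

    # Resynchronize stale copies and bring the ODM back in line.
    syncvg -v rootvg
    synclvodm -v rootvg

    # Move any remaining partitions off the failing disk, one logical
    # partition at a time if necessary, then remove the disk from the VG.
    migratepv hdisk0 hdisk2
    migratelp hd2/1 hdisk2
    reducevg rootvg hdisk0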

Step 3: Reproduce the Problem

According to the scientific method, the hallmark of any hypothesis and experiment is the ability to repeat the process and produce the same result. If you cannot, the conclusion is at best uncertain. At worst, it overturns the scientist's theory and damages their reputation, as happened to the physicists who claimed to have achieved room-temperature cold fusion in the late 1980s.

Or, as I like to put it: if at first you don't succeed, try it somewhere else and see whether you can cause the same problem there.

When managing an AIX server, if something goes wrong and you have the resources to reproduce the problem, make the same change on another LPAR of a similar type and see whether it produces the same result. If modifying the same attribute on another server produces the same result, you can infer that the change is the source of the problem. If it produces the opposite result, study the subtle differences between the servers and try to work out the cause from there.

For the LPAR in the first example problem, I found that the problem did not occur when I mapped the SAN disks back to the old p670 server and booted it there. Users could access their application, the CPUs carried a normal load, and CPU utilization ran above 80% (roughly 10% kernel plus 70% user). I could therefore conclude that something specific to the p570 server was causing the problem, not something introduced during the migration itself.
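
The utilization figures come from standard monitoring commands. A sketch of the kind of check I ran on both servers, sampling every 5 seconds:

    # Per-interval CPU breakdown: %usr, %sys, %wio, %idle.
    sar -u 5 12

    # Run queue, paging, and the us/sy/id/wa CPU columns.
    vmstat 5 12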

Step 4: Research the Problem

In the information age, you can get a wealth of information with a few keystrokes and mouse clicks. Better still, system administrators belong to a large community that has documented many years of collective experience.

Start with the manufacturer's and vendor's documentation. Companies like IBM publish all of their manuals, Redbooks, technical documents, and even man pages online. Typing a few simple keywords into the search bar on the main site turns up plenty of helpful suggestions and information.

Other sources I recommend include newsgroups, forums, and sites that other system administrators frequent. People who work with these servers every day visit technical sites regularly and comment on what they see on the job. When asked publicly for help, most system administrators are happy to offer pointers or follow up by email. You can also often find older material covering previous versions of the operating system and related software, which can lead you to still more information.

With all of these sources, the main trick is to use the right set of keywords. If I am researching an AIX issue on a general search engine like Google, I make sure the search string starts with AIX, to exclude results for other flavors of UNIX. It might then include the output of a command or a label generated by errpt. I also put double quotes around specific phrases to limit the search to those exact terms and avoid irrelevant hits, especially for commonly used terms (like Logical Volume Manager).

For the bad block relocation failure, searching Google for AIX "bad block relocation" failure produced hundreds of results, but none of them seemed to match my situation.

Step 5: Back out all changes

Sometimes the most sensible way to deal with a problem is to back out every change you have made and return to the original state. This step is not always feasible. Sometimes an over-anxious C-level executive forces you to roll their servers back, or time constraints leave you no other choice. In any case, rolling back is one of the better tactics to have available.

I put this step in the middle of the troubleshooting list because sometimes I have to do it earlier and sometimes later. In my experience, though, it is best to complete the first four steps before considering backing everything out. If you undo your changes at the very start of the troubleshooting process, the problem is likely to go unresolved, and you will run into the same trouble the next time you attempt the job. If you roll back too late in the process, it can hurt uptime, or the problem can become so tangled that backing out is no longer possible.

For the first example, I actually had to roll back the server migration because of time: the longer a production server stays out of service, the more money the users and the company lose. Rescheduling the work took a week, which let me do more research, but when I attempted the migration again, the problem reappeared. For the second example, you cannot roll back a hardware problem. There is no telling the server, "Go back to the state before the bad block relocation errors started!" I had to keep working to overcome the disk failure.

Step 6: Change only one thing at a time

If none of the steps above has worked and you decide to start changing major components or taking more aggressive action on the server, remember one of the most important rules: change only one thing at a time.

Making multiple changes at once leads to one of two situations. First, if the changes resolve the issue, you do not know which change actually fixed it. That may not seem like a big deal if you do not care what solved the problem, but good system administrators want to know, because they know problems tend to recur in the same places. Second, if the problem is not resolved, you have introduced more complexity. Keep it up and you will not know which changes to back out; go far enough and the system becomes a muddled mess and you are left thoroughly confused. (There is an xkcd comic about exactly this situation.)

If the problem is not resolved after a change, you usually want to undo it and try something else. That is what happened in the first example: when I compared the two servers' Hardware Management Console profiles, I saw that they differed. The old POWER4 hardware used dedicated CPUs, while the new POWER6 hardware used an uncapped shared processor pool. I wanted to know how this difference affected CPU performance, so I modified the profile on the POWER6 hardware to use dedicated CPUs. Oddly enough, users reported that the server was back to "normal," and I could see the load on the processors. So I knew the problem was definitely related to CPU resources; I just had to find out why.
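
One quick way to see this difference from inside the LPAR, rather than on the HMC, is lparstat. A sketch, assuming an AIX level that ships the command:

    # Static partition configuration: processor type (Dedicated or Shared),
    # mode (Capped or Uncapped), entitled capacity, and virtual CPU counts.
    lparstat -i

    # Live utilization; in shared mode, compare physc against the entitlement.
    lparstat 5 6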

Step 7: Call IBM Support

If you have tried all the reasonable steps and need fresh ideas, it is usually time to contact IBM Support. They have advanced troubleshooting tools, experts in every aspect of the operating system and related products (such as VIO and PowerHA), and access to similar cases that can confirm a diagnosis and help resolve the problem. However, if you have never called 800-IBM-SERV before, there are a few things to understand.

First, you should have an IBM contract number. There are multiple levels of support, from top-tier 24x7x365 coverage with dedicated personnel down to 8 a.m. to 5 p.m. coverage for non-critical servers. These support packages can be purchased directly from IBM or contracted through value-added resellers.

You also need to provide some information so that IBM Support can pull up your account, usually the phone number, serial number, contract number, or physical location of the server. Exactly what is required depends largely on whether you are opening a hardware case or a software case.

You must also tell the support staff the severity, or priority, of the issue. Priority runs from 1 to 4. Severity 1 usually involves system downtime or a production impact, and those calls are forwarded to a technician immediately. Severity 4 means the issue can take longer to handle and is usually reserved for general administrative questions.

After you describe the problem and a support case is created, you are given a tracking number, often referred to as a PMR number. This number identifies the case to any other support personnel who work with you. Hardware and software PMRs are separate, so if your problem crosses that boundary, you will need to get a new number.

For both example problems, I had to contact IBM. For the first, IBM brought in many people, from VIO support all the way to the kernel team, to solve it. For the second, only the hardware technicians were involved, and I supplied output from the snap command for them to analyze.
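
If you have never gathered data for IBM before, snap collects system configuration and error data into a single archive that you can send to support. A minimal sketch:

    # Remove any data left over from a previous snap run.
    snap -r

    # Gather general system information and compress it into an archive
    # (by default under /tmp/ibmsupt).
    snap -gc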

Step 8: Go to extremes

Sometimes there is no other way to solve a problem, only unorthodox measures that most people would consider crazy. This usually happens when you are desperate and your job, or even the business, is on the line. In these cases, IBM Support will often say, "If you do this, you will be in an unsupported state, and you will have to back it out before we can support you again." But if your solution works, it may save the day.
