Eight steps to troubleshoot an AIX server

2024-09-05 04:01:50

Problem 1: A bigger server, but less computing power

At the time, I needed to migrate an AIX 5.3 LPAR from an old POWER4-based IBM pSeries p670 server to a new POWER6-based pSeries p570 server. The old server was short on resources (I was using Workload Manager to manage the resources of the main application on it), so the dynamic processor resources on the new hardware should have provided the computing power I needed. I ran mksysb on the LPAR, used Network Installation Manager to restore it on the new hardware, and mapped the SAN disks to it.

The LPAR booted, and everything looked fine until the application was started. Suddenly, users started calling: they simply could not access their products. When I logged in, I found the server completely idle, with no process consuming a significant amount of resources. Why were users having problems?

Problem 2: A failing disk that cannot be unmirrored

One of my servers had a mirrored root disk. One day, the error report showed that bad blocks on one of the disks could not be relocated. I knew this was a harbinger of hardware failure, so I started to unmirror the volume group. However, the server reported that the mirror could not be completely removed, because one of the logical volumes had only one good copy and it was on the failing disk. How could I fix this and replace the hardware?

Troubleshooting process

Keep these two example problems in mind as we walk through the process of solving them.

Step 1: Don't mess things up

Once you are in trouble, the wisest move is not to move. Like Indiana Jones in "Raiders of the Lost Ark", if you find that stepping on the floor sends darts flying at you, stop where you are and go no further. More changes will only complicate the problem and may make the situation worse. When a problem is already affecting the normal operation of the system, it makes no sense to give yourself several problems to solve.

For the first example problem, I had the users log out of the system immediately, and then I shut down the application. I knew that users' queries and input could be interrupted when performance was that poor, which might corrupt their data. I did not want their environment to change any further before I had examined the system. Although users were not pleased to hear that they could not use the new server yet, they were glad to know that I was looking for the cause of the problem. It also gave me time to work through the other troubleshooting steps at my own pace.

Step 2: Start with basic commands, then add complexity

When I was studying kung fu, I heard the story of a second-degree black belt who subdued a thief at a bus stop. We students wanted to know which technique she had used to take the attacker down. Was it the Golden Tiger style? The Eight Trigrams Palm? We even imagined that she was so skilled that she had used Drunken Eight Immortals to put him down. It was none of these: she used one of the first techniques a white belt learns in class, an elbow to the chest and a punch to the nose.

AIX provides commands for checking various aspects of the server, including hardware and software. Even the most basic commands provide a good basis for analyzing problems. When there is not enough information or something still does not behave properly, you can start experimenting with more complex and powerful tools. However, you should start with the simplest commands and ideas before using more powerful tools.

For the second example problem, I first looked for hardware problems in the errpt output, and then used the unmirrorvg command (a simple but powerful tool that attempts to unmirror the whole volume group) instead of running rmlvcopy on each logical volume on the disk. When I found that one logical volume copy could not be removed, I used other basic commands such as lspv, lsvg, and migratepv to collect information. I tried extendvg and mirrorvg to create another copy of the volume group on another disk. This still left some old partitions behind, so I went one step further and used syncvg and synclvodm to reconcile the Object Data Manager with the actual state of the server. Finally, I used migratelp to try to move each logical partition off the disk. Unfortunately, none of these tools worked, but they provided a lot of information.
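As a rough illustration of starting with the simplest tool, the sketch below filters errpt summary output for permanent hardware errors and counts them per resource. It assumes the usual summary columns (identifier, timestamp, type, class, resource name, description). The sample text is a made-up placeholder rather than output from the server in this story, and the function names are hypothetical.

```python
# Hypothetical sample in the errpt summary layout. The identifiers,
# timestamps, and descriptions are placeholders, not real log data.
SAMPLE_ERRPT = """\
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
AAAAAAAA   0101120024 P H hdisk0         EXAMPLE DISK ERROR
BBBBBBBB   0101110024 T S SYSPROC        EXAMPLE SOFTWARE ERROR
CCCCCCCC   0101100024 P H hdisk0         EXAMPLE DISK ERROR
"""

KEYS = ("identifier", "timestamp", "type", "class", "resource", "description")


def parse_errpt_summary(text):
    """Turn errpt summary lines into dictionaries, skipping the header."""
    entries = []
    for line in text.splitlines():
        # Split into at most six fields; the description may contain spaces.
        fields = line.split(None, 5)
        if len(fields) < 6 or fields[0] == "IDENTIFIER":
            continue
        entries.append(dict(zip(KEYS, fields)))
    return entries


def permanent_hardware_errors(entries):
    """Keep entries of type P (permanent) and class H (hardware)."""
    return [e for e in entries if e["type"] == "P" and e["class"] == "H"]


def errors_per_resource(entries):
    """Count how many entries name each resource (for example, a disk)."""
    counts = {}
    for entry in entries:
        counts[entry["resource"]] = counts.get(entry["resource"], 0) + 1
    return counts


if __name__ == "__main__":
    hardware = permanent_hardware_errors(parse_errpt_summary(SAMPLE_ERRPT))
    print(errors_per_resource(hardware))
```

On a real system the text would come from the command itself; a disk that keeps showing up in this count is the one to look at with the more powerful volume group tools.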

Step 3: Reproduce the problem

According to the scientific method, the key test of any hypothesis and experiment is being able to repeat the process and get the same result. If that is not possible, the conclusion is uncertain at best. In the worst case, it overturns the scientists' theory and damages their reputation, as happened to the researchers who claimed to have achieved cold fusion at room temperature in the late 1980s.

Or, as I like to say: if at first you don't succeed, see whether trying it elsewhere causes the same problem.

When managing an AIX server, if something goes wrong and you have the resources to reproduce the problem, perform the same operation on another LPAR of a similar type to see whether it produces the same result. If changing the same attribute on another server causes the same result, you can conclude that this operation is the source of the problem. If it does not, study the subtle differences between the servers and try to work out the cause from them.

For the LPAR in the first example problem, I found that when I swapped the SAN disks back to the old p670 server and booted it, the problem did not appear. Users could access their applications, and the CPU was under normal load, with utilization of more than 80% (10% kernel + 70% user). I was therefore able to conclude that something specific to the p570 server was causing the problem, not something introduced during the migration.

Step 4: Research the problem

In the information age, you can get a lot of information with just a few keystrokes and a few mouse clicks. Even better, system administrators are often members of large communities, and the community records many years of experience.

First of all, consult the vendor's own documentation. Companies like IBM publish their manuals, Redbooks, technical documents, and even man pages online. Enter a few simple keywords in the search bar of the main site, and you can find many suggestions and a lot of information that may be helpful.

Other sources of information I recommend include the newsgroups, forums, and sites frequented by other system administrators. People who deal with servers all day long often visit technology sites and comment on what they see in their work. Most system administrators are happy to respond to an open request for assistance with pointers or help by email. You can also often find older material about other versions of the operating system and software, which can lead you to further information.

With all of these sources, the main trick is to use an appropriate set of keywords. If I use a general search engine like Google to research an AIX issue, I make sure that the search string starts with AIX in order to exclude information about other flavors of UNIX. After that, I may include the output of a command or a label generated by errpt. I also put double quotes ("") around specific phrases to limit the search to those exact phrases and avoid irrelevant results, especially for phrases made up of common words (e.g., Logical Volume Manager).

For the failed bad block relocation on the disk, a Google search for AIX "bad block relocation" failure produced hundreds of results, but none seemed to match my situation.
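The keyword approach described above can be sketched as a small helper. This is only an illustration of the habit (platform name first, exact phrases in double quotes, loose terms last). The function name and its parameters are hypothetical.

```python
def build_search_query(platform, phrases=(), terms=()):
    """Build a search string: platform first, quoted phrases, then loose terms."""
    parts = [platform]
    # Exact phrases go in double quotes so common words are matched together.
    parts += ['"{}"'.format(phrase) for phrase in phrases]
    parts += list(terms)
    return " ".join(parts)


if __name__ == "__main__":
    # "mirror" is just an illustrative loose keyword.
    print(build_search_query("AIX", ["Logical Volume Manager"], ["mirror"]))
```

Leading with the platform name keeps results for other UNIX flavors out, and quoting keeps a phrase made of everyday words from matching every page that happens to contain them separately.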

Step 5: Roll back all changes

Sometimes, the most sensible way to solve a problem is to back out every change that has been made and return to the original state. This step is not always feasible. At other times it is forced on you: overzealous C-level executives insist that you roll back their servers, or time constraints make it necessary. In any case, retreating is one of the best tactics available.

I put this step in the middle of the list of troubleshooting steps because sometimes it is needed earlier and sometimes later. But based on my experience, I think it is best to complete the first four steps before considering a full rollback. If you back out the changes right at the start of the troubleshooting process, the problem is probably not resolved, and you will run into the same trouble the next time you attempt the same work. If you roll back too late in the process, it will affect uptime or complicate the problem to the point where rollback is impossible.

For the first example, I actually had to roll back the server migration because of time. If this production server had stayed down any longer, the users and the company would have lost money. It took a week to reschedule the work, which allowed me to do more research, but when I tried the migration again, the problem reappeared. For the second example, a rollback was not possible, because you cannot roll back a hardware problem. You cannot tell the server, "Return to the state before the bad block relocation error occurred!" I had to keep working to overcome the disk failure.

Step 6: Change only one thing at a time

If none of the above steps works and you decide to start changing major components or performing more aggressive operations on the server, remember the most important rule: change only one thing at a time.

Making multiple changes at once leads to one of two situations. First, if the changes solve the problem, you don't know which change was the effective one. If you don't care what exactly solved the problem, this may not be a big deal, but good system administrators want to know, because they know that problems often recur in the same place. Second, if the problem is not resolved, you may have introduced more complexity. If you keep going, you will not know which change to back out. If you go far enough, the system will be a mess and you will be confused. (There is a joke about this situation on xkcd.)
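The rule can be sketched as a loop that applies one candidate change, checks whether the problem is gone, and backs the change out before trying the next. Everything here is a toy model: the `state` dictionary, its keys, and the helper names are hypothetical stand-ins for real configuration changes.

```python
def try_changes_one_at_a_time(changes, problem_fixed):
    """Apply each (name, apply, revert) change alone and test its effect.

    Returns the name of the first change that fixes the problem, or None.
    A change that does not help is reverted before the next one is tried.
    """
    for name, apply_change, revert_change in changes:
        apply_change()
        if problem_fixed():
            return name
        revert_change()
    return None


# Toy configuration standing in for a real server (hypothetical keys).
state = {"cpu_mode": "shared", "memory_gb": 8}


def set_value(key, value):
    """Return a function that sets state[key] to value when called."""
    def setter():
        state[key] = value
    return setter


changes = [
    ("more memory", set_value("memory_gb", 16), set_value("memory_gb", 8)),
    ("dedicated CPU", set_value("cpu_mode", "dedicated"), set_value("cpu_mode", "shared")),
]

if __name__ == "__main__":
    fixed_by = try_changes_one_at_a_time(
        changes, lambda: state["cpu_mode"] == "dedicated"
    )
    print(fixed_by, state)
```

Because each unsuccessful change is reverted before the next is tried, the system ends up differing from its starting point by exactly one change, and you know which one made the difference.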

If the problem is not resolved after making a change, you usually want to back it out and try something else. This was the case in the first example: when I compared the Hardware Management Console profiles of the two servers, I saw that they differed. I noticed that the old POWER4 hardware used a dedicated CPU, while the new POWER6 hardware used an uncapped shared CPU pool. I wanted to know how this difference affected CPU performance, so I modified the profile on the POWER6 hardware to use a dedicated CPU. Strangely, according to user feedback the server was "normal", and I could see load on the processor. So I knew the problem had to be related to CPU resources, but I still needed to find out why.
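Comparing two profiles side by side is easy to sketch once their settings are written down as key/value pairs. The keys and values below are invented for illustration and are not real HMC attribute names. The point is only that a mechanical diff surfaces differences like dedicated versus shared processing.

```python
def diff_profiles(old, new):
    """Return {setting: (old_value, new_value)} for every setting that differs.

    A setting missing from one profile is reported with None on that side.
    """
    differences = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            differences[key] = (old.get(key), new.get(key))
    return differences


# Hypothetical profile settings, not actual HMC attribute names or values.
old_profile = {"processing_mode": "dedicated", "processors": 4, "memory_gb": 16}
new_profile = {
    "processing_mode": "shared",
    "sharing_mode": "uncapped",
    "processors": 4,
    "memory_gb": 16,
}

if __name__ == "__main__":
    for setting, (before, after) in diff_profiles(old_profile, new_profile).items():
        print(setting, before, "->", after)
```

Each reported difference is a candidate for the next single change to test.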

Step 7: Turn to IBM Support

If you have tried all reasonable steps and need new ideas, it is usually time to contact IBM Support. They have advanced troubleshooting tools and experts who are proficient in every aspect of the operating system and related products (such as VIO and PowerHA). They can also draw on related cases to identify and help solve similar problems. However, if you have not dialed 800-IBM-SERV before, there are a few things to know.

First, you should have an IBM contract number. There are multiple levels of support, from the highest level, 24x7x365 support from a dedicated person, down to 8 a.m. to 5 p.m. support for non-critical servers. These support packages can be purchased directly from IBM or contracted through value-added resellers.

You also need to provide some information so that IBM Support can look up your account, usually the phone number, serial number, contract number, or physical location of the server. Which information is needed depends largely on whether you have a hardware case or a software case.

Support personnel must also be told the severity or priority of the problem. Priorities range from 1 to 4. Level 1 usually involves a system outage or production impact, and the call is transferred to a technician immediately. Level 4 means that the processing time can be longer, and it is usually used for general administration questions.

After you describe the problem and a support case is created, you will be given a tracking number, usually called a PMR. This number identifies the case to the other support staff working with you. Hardware and software PMRs are separate: if your problem crosses that boundary, you need to get a new number.

For both example problems, I had to contact IBM. For the first problem, IBM mobilized many people, from VIO support to the kernel team, to solve it. For the second problem, only hardware technicians were involved. I provided output from the snap command for analysis.

Step 8: Go to extremes

Sometimes, there is no other way to solve the problem, and you can only try unorthodox measures that most people would consider crazy. This is usually done when you are already desperate and your job, or even your life, is at stake. In this case, IBM support staff often say, "If you do this, you will be in an unsupported state, and you will have to start over before we can support you again." If your solution works, though, you may be able to turn danger into safety.
