I. Fault description
A customer reported that their IBM P770 (9117-MMB) minicomputer was down, and we went on site to investigate. The machine consists of four CEC cabinets and one I/O expansion cabinet and hosts four LPARs. The alarm information on the HMC and in ASMI reported errors against the FSP card, the CPU board, the I/O board, the midplane, memory, and other parts.
II. Failure analysis
After analyzing the alarm information, inspecting the equipment on site, and running a power-up test, we found that the power indicator on the FSP card in cabinet 2 (DBJM790) did not light up and the device could not be started. We concluded that this FSP card, at location U78C0.001.DBJM790-P1-C1, had failed. Working through the remaining error messages, we ruled out the CPU and memory. Our initial judgment was that the outage was also related to a failure of the I/O board at location U78C0.001.DBJM782-P2, reported with error code 1100262D.
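The location codes above are easier to compare once they are split into parts. The following is a minimal sketch, assuming the usual layout of an IBM physical location code: an enclosure part of the form U<type>.<model>.<serial>, followed by a position inside the enclosure such as P1-C1 (planar 1, card slot 1). The function name parse_location_code is ours, not an IBM tool.

```shell
# Split a physical location code such as U78C0.001.DBJM790-P1-C1.
# Assumed layout: U<type>.<model>.<serial>-<position inside the enclosure>
parse_location_code() {
    code=$1
    unit=${code%%-*}              # enclosure part, e.g. U78C0.001.DBJM790
    case $code in
        *-*) path=${code#*-} ;;   # position in the enclosure, e.g. P1-C1
        *)   path= ;;             # code names the enclosure only
    esac
    type=${unit%%.*}              # U78C0
    rest=${unit#*.}               # 001.DBJM790
    model=${rest%%.*}             # 001
    serial=${rest#*.}             # DBJM790
    printf 'type=%s model=%s serial=%s path=%s\n' \
        "${type#U}" "$model" "$serial" "$path"
}
```

Running it on the two codes from this case shows at a glance that the faults sit in different enclosures (serials DBJM790 and DBJM782), which is why two separate parts had to be considered.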
III. Troubleshooting
We decided to begin by replacing the faulty FSP card, because the other faults could only be checked and dealt with once the FSP was working normally. The process was as follows:
1. Back up the partition data: select the host, then Configuration --> Manage Partition Data --> Backup.
2. View the partition backup data on the HMC.
3. Open the Restricted shell terminal on the HMC console.
4. View the partition data using the command:
>ls -l /var/hsc/profiles/.
5. Log in to ASMI and record the device hostname, network settings, microcode information, time, and boot options.
6. Power down the device and replace the FSP card in CEC cabinet 2.
7. Plug in the cables but do not connect the HMC, then power up the device for testing.
8. The FSP in the main cabinet now fails to light up; replace the main cabinet FSP card as well.
9. Again without connecting the HMC cable, power on the device.
10. Connect a laptop directly to the HMC management port and restore the FSP card to its factory configuration: ASMI --> System Service Aids --> Factory Configuration --> Reset service processor settings --> Continue. Wait for the factory reset to complete; the FSP card then reboots automatically, which takes about 10 to 20 minutes.
11. Modify the time, hostname, and HMC management port IP address.
12. Connect the HMC and wait for the connection to refresh.
13. Enter the HMC and ASMI passwords as prompted.
14. Once the connection succeeds, the host state is Recovery. Select the host, choose the first item in the taskbar, Recover Partition Data, and select Restore profile data from HMC backup data to recover the partition data from this HMC. Wait for the recovery to complete; the device then powers on automatically and runs its self-test.
15. The self-test still ends with a red X and the device cannot start. The error message again points to the main cabinet I/O board.
16. Shut down, power off again, and replace the I/O board at location U78C0.001.DBJM782-P2 in the main cabinet.
17. Reboot the device; the FSP now powers up normally.
18. After the HMC recognizes the device normally, repeat the partition recovery operation. This time it succeeds and the device boots to standby.
19. Find the corresponding partition profile and start each partition.
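The backup and recovery steps above were done through the HMC graphical interface. For reference, the HMC command line offers bkprofdata, rstprofdata, and lssyscfg for the same tasks. The following is a hedged sketch, not the procedure used in this case: the host name, user, managed-system name, and backup file name are placeholders, the hmc wrapper is our own helper, and it assumes remote command execution over ssh is enabled on the HMC. By default it only prints the commands.

```shell
# Placeholders, not values from this case. Override them in the environment.
HMC_HOST=${HMC_HOST:-hmc.example.com}
HMC_USER=${HMC_USER:-hscroot}
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the command, 0 = run it over ssh

# Hypothetical helper: print or run one HMC command.
hmc() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "ssh $HMC_USER@$HMC_HOST $*"
    else
        ssh "$HMC_USER@$HMC_HOST" "$@"
    fi
}

# Back up profile data before powering down.
# $1 = managed system name, $2 = backup file name
backup_profiles() {
    hmc bkprofdata -m "$1" -f "$2"
}

# Full restore (restore type 1) from a named backup after the FSP swap.
restore_profiles() {
    hmc rstprofdata -m "$1" -l 1 -f "$2"
}

# List each managed system with its state (for example Recovery or Standby).
show_state() {
    hmc lssyscfg -r sys -F name,state
}
```

For example, backup_profiles Server-9117-MMB pre_fsp_swap prints the bkprofdata command it would run. Set DRY_RUN=0 only after checking the printed commands against the HMC documentation for the installed HMC level.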
IV. Lessons learned
This IBM Power 770 failure affected a key business system, so the customer was anxious and the on-site pressure was considerable. It was also a compound failure of both the FSP card and the I/O board, which caused the logs to report errors against the CPU board, the midplane, the I/O board, and other components and made the fault harder to locate. On IBM Power 770, 780, and similar minicomputers there is a certain probability that an FSP card is damaged during power-up, and that happened here as well. Fortunately the spare parts arrived in time, the diagnosis was accurate, and the fault was repaired on schedule.
The main points are summarized below:
1. On IBM P770 and 780 minicomputers, an FSP card failure often triggers alarms against several other components. It is best to confirm on site that, with power applied, the power indicator on every FSP card is normal (steady green). If an indicator is off, that FSP card is bad; replace the FSP card first and then troubleshoot the other components.
2. Normally the FSP card on a P770 or 780 lights up as soon as power is applied. On these models, however, the FSP card is prone to failing at power-up, so further cards may fail during the repair itself, and a card that does not light up can only be replaced. Prepare several spare FSP cards where conditions allow, and back up the partition information before powering down (in the HMC, select the host --> Configuration --> Manage Partition Data --> Backup).
3. After replacing the FSP card, do not connect the HMC right away. Restore the factory settings first; otherwise, once the HMC connects, the device's partition information may overwrite the partition information held on the HMC, and the partitions can no longer be recovered.
4. After replacing the FSP card, make sure it lights up, then power on and run the self-test to check the other components, and deal with any further problems as they are found.
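The first point, checking every FSP indicator before chasing other alarms, can be written down as a small checklist helper. This is only an illustration of the ordering: the function name fsp_first and the "serial=on|off" input format are our own, and the LED states still have to be read by eye on site.

```shell
# Arguments are "enclosure=led" pairs, where led is "on" or "off".
# Prints the enclosures whose FSP card should be replaced before any
# other reported component is examined.
fsp_first() {
    bad=
    for pair in "$@"; do
        case ${pair#*=} in
            off) bad="$bad ${pair%%=*}" ;;   # collect enclosures with a dark LED
        esac
    done
    if [ -n "$bad" ]; then
        echo "replace FSP first:$bad"
    else
        echo "all FSP LEDs on: continue with other components"
    fi
}
```

Called as fsp_first DBJM782=on DBJM790=off, it names DBJM790 as the enclosure to deal with first, matching the initial finding in this case.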
For more information, please visit Antute's official website: btjf.31baglady.com