On Monday morning, at about 0200, we took the mail server (hal) down for a system patch that would update its timezone definitions. Without the patch, email passing through the system would carry incorrect timestamps on the dates where the new DST schedule differs from the old one. Because of the official use and nature of Department email, its timestamps must be accurate. The patch included an update to the kernel, and it reported a successful installation. The system was then rebooted, as the patch installation requires. At that point the machine reported kernel modules that could not be loaded, and it would not boot.
Over the course of the next 48 hours, we learned the following things:
1) The root filesystem on hal had become silently corrupted. The
filesystem's superblocks were okay, and the filesystem mounted and
unmounted cleanly on hal. However, files within the filesystem, namely
the kernel files, were no longer being written correctly. When the
update manipulated those files, the corruption damaged the new files
and rendered them unusable for booting. The silent corruption was
undetectable by routine checking methods. Only a full fsck of the
filesystem uncovered the damage, and that is not an option on a server
that must be available full-time, because fsck requires the filesystem
to be unmounted, and you can't unmount the root filesystem of a live
OS.
2) After diagnosing the problem, we found, through more arduous and
time-consuming testing, that the metadevice the root filesystem lived
on was also corrupt, due to the filesystem damage. As a result, all
attempts to boot the system from the network or from CD/DVD media
failed: Solaris has an undocumented 'feature' that attempts to mount
the root device of the previous Solaris installation if it can find
it. Every time we booted from media or the network, the environment
would discover the corrupted root metadevice, attempt to mount it, and
promptly crash or freeze when critical device nodes such as /dev/null
came back corrupt.
3) After running fsck on the corrupted root metadevice, we were able to
repair its status, and then to run a Solaris 10 CD-based installation
on the system. Before resorting to the CD, we tried a network-based
JumpStart, as it's much faster. It was at this point that we
discovered that the RARP-based JumpStart for Solaris 10 is not
functioning properly, even though the DHCP-based JumpStart for the
same installation is functioning just fine. It turns out that the
BootPROM for hal is out of date, and a new version is available whose
updates address the problems we were having getting a network-based
installation to work. Due to time constraints, we have opted not to
update the BootPROM at this time.
4) The IMAP and POP3 daemons that we installed did not work properly at
first: mail users could not authenticate against the mail server. We
first investigated the SASL authentication daemon, and only after
thoroughly inspecting that installation did we move on to the POP3 and
IMAP daemons. We then had to rebuild them from source with different
compile options, because Solaris' PAM implementation is broken in a
known way; the new build options accommodate this. After the rebuild,
the daemons began to authenticate users correctly.
5) Postfix has a new option that turns on TLS without absolutely
requiring it, and we had to accommodate it in our configuration files.
With this option, Postfix offers TLS but falls back to plaintext if
TLS is not used.
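As a sketch of what such a setting looks like, assuming the option in question is smtpd_tls_security_level (available in Postfix 2.3 and later) and using placeholder certificate paths rather than hal's actual ones:

```
# main.cf excerpt (illustrative only; file paths are hypothetical)
# "may" announces STARTTLS to clients but does not require them to use it
smtpd_tls_security_level = may
smtpd_tls_cert_file = /etc/postfix/tls/server-cert.pem
smtpd_tls_key_file = /etc/postfix/tls/server-key.pem
```

With the level set to "may", clients that do not issue STARTTLS continue to deliver mail in plaintext, which matches the fallback behavior described in item 5.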
Once all this was addressed, the mail server began processing the backlog of mail held at the OIT mail exchanger. We are still assessing whether all the mail has been processed, as we do not have direct access to that system.
By Tuesday afternoon, everything was working normally.
On Wednesday morning at approximately 10:20, network connectivity for the LDAP, Kerberos, Web, Database, and file servers went down. This left the mail server and login servers running without the services they need to function normally. The result was a mail server that could neither perform user lookups nor access user mail spools. As far as the mail server could establish, none of our users existed. Consequently, any mail received between 10:20 on Wednesday morning and approximately 18:30 Wednesday evening was rejected as mail for invalid users.
The network connectivity drop was due to the failure of a Cisco Catalyst switch in the telco closet upstairs that serves our server room. Mr. Cotton replaced the switch in question. When the replacement was finished, CS Sysadmins began to assess server functionality. It was quickly established that the servers on the Gigabit Ethernet ports were not getting proper network access. Mr. Cotton then did network troubleshooting, and it was established that three of the port blades in the Catalyst switch had been swapped into the incorrect slots. After the blades were restored to their correct locations, network connectivity was restored.
The decision was made at this time to move the network connections for all servers on the gigabit ports to our Catalyst switches in the server room, which interface directly with the fiber connections in the core room downstairs, bypassing the telco closet entirely. Proper network function was then permanently restored, and all systems were brought back online without further incident.