Information last updated May 26, 2008

05-26-08
The CSS01 access server is operational again! The hijacker has been disconnected.
05-23-08
The CSS01 access server is having its IP address hijacked by an unauthorized DHCP server somewhere at Parkview. Doug Johns (OIT) has been working on locating the MAC address to track down the port number. As of 16:00 on May 23rd, the port has been located in rm B-123 (Pulp Lab), port E-5. I was unable to gain access to the room, so we'll try again next week.
05-09-08
The Litany of Lunacy OR Why email was broken for three days

On Monday morning, at about 0200, we took the mail server (hal) down for a system patch that would update the timezone definitions in use. Without the patch, emails passing through the system would carry incorrect timestamps on dates where the new DST rules differ from the old ones. The timestamps need to be correct because of the official use and nature of Department email. Part of this patch required a patch update to the kernel. The patch reported a successful update. The system was then rebooted, as the patch installation requires. It was at this point that the machine reported kernel modules that could not be loaded, and the machine would not boot.
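
To illustrate the timestamp problem, here is a minimal sketch. It assumes the patch concerned the United States DST change that took effect in 2007 (the spring start moved from the first Sunday in April to the second Sunday in March); the entry above does not name the rule, and the function names are made up for illustration.

```python
from datetime import date, timedelta

def nth_sunday(year, month, n):
    """Return the date of the n-th Sunday (1-based) of the given month."""
    first = date(year, month, 1)
    # weekday(): Monday == 0 ... Sunday == 6
    offset = (6 - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))

def dst_start_old(year):
    # Assumed old US rule (before 2007): first Sunday in April.
    return nth_sunday(year, 4, 1)

def dst_start_new(year):
    # Assumed new US rule (2007 onward): second Sunday in March.
    return nth_sunday(year, 3, 2)

def misstamped_days(year):
    """Spring days on which stale definitions would be one hour off."""
    return (dst_start_old(year) - dst_start_new(year)).days

if __name__ == "__main__":
    print(dst_start_new(2008), dst_start_old(2008), misstamped_days(2008))
```

Under these assumptions, a server still using the old definitions would stamp mail one hour off for several weeks each spring; the corresponding window in the fall is not modeled here.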

Over the course of the next 48 hours, we learned the following things:
1) The root filesystem on hal had become silently corrupted. The filesystem's superblocks were okay, and the filesystem mounted and unmounted cleanly on hal. However, files within the filesystem, namely the kernel files, were no longer written correctly. When the update manipulated those files, the corruption damaged the new files and rendered them unusable for booting. The silent corruption was undetectable by routine checking methods. Only a full fsck of the filesystem uncovered the damage, and this was not an option on a full-time available server system, as it requires the filesystem being checked to be unmounted. You can't unmount a root filesystem on a live OS.
2) After diagnosing the problem, we then found, through more arduous and time-consuming testing, that the metadisk the root filesystem lived on was also corrupt, due to the filesystem damage. This meant that all attempts to boot the system from the network and from CD/DVD media failed, as Solaris has an undocumented 'feature' that attempts to mount the root device of the previous Solaris installation if it can find it. Every time we tried booting from media or the network, the environment would discover the corrupted root metadisk, attempt to mount it, and promptly crash or freeze when critical device pointers like /dev/null came back corrupt.
3) After running fsck on the corrupted root metadevice, we were able to repair it and then run a Solaris 10 CD-based installation on the system. Before resorting to the CD, we had tried a network-based jumpstart, as it is much faster. That is how we discovered that the RARP-based jumpstart for Solaris 10 is not functioning properly, even though the DHCP-based jumpstart for the same installation works just fine. It turns out that the BootPROM for hal is out of date, and a new version is available that addresses the problems we were having with the net-based installation. Due to time constraints, we have opted not to update the BootPROM at this time.
4) The imap and pop3 daemons that we installed did not work properly at first. This manifested as an inability of mail users to authenticate against the mail server. We first investigated the SASLauthdaemon. Only after we had thoroughly inspected that installation did we move on to the pop3 and imap daemons. We then had to rebuild them from source with different compile options, as Solaris' PAM implementation is broken in a known way. That has been accommodated in the build options. After the rebuild, the daemons began to authenticate users correctly.
5) Postfix has a new option to turn on TLS without absolutely requiring it, which we had to accommodate in our configuration files. With this option the server offers TLS, but falls back to plaintext if the client does not use it.
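
The behavior described in item 5 can be sketched as a small model. This is an illustration only: it assumes the option in question is Postfix's `smtpd_tls_security_level` parameter set to `may` (the text does not name it), and the `negotiate` function is hypothetical, not part of Postfix.

```python
def negotiate(server_level, client_starts_tls):
    """Model the outcome of a mail session under a server TLS policy.

    server_level: "none" (TLS never offered), "may" (offered but optional),
                  or "encrypt" (TLS mandatory).
    client_starts_tls: True if the client issues STARTTLS when it is offered.
    """
    if server_level not in ("none", "may", "encrypt"):
        raise ValueError("unknown TLS level: %r" % (server_level,))
    if server_level == "none":
        # STARTTLS is never announced, so the session stays in the clear.
        return "plaintext"
    if client_starts_tls:
        return "tls"
    # The client declined TLS: optional mode falls back, mandatory mode refuses.
    return "plaintext" if server_level == "may" else "rejected"

if __name__ == "__main__":
    for level in ("none", "may", "encrypt"):
        for starts in (True, False):
            print(level, starts, negotiate(level, starts))
```

The "may" row is the one that matches the description above: clients that can do TLS get it, and older clients still get their mail delivered.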

Once all this was addressed, the mail server began processing the backlog of mail held at the OIT mail exchanger. We are still assessing whether all the mail has been processed, as we do not have direct access to that system.

As of Tuesday afternoon, everything began working normally.



On Wednesday morning at approximately 10:20, network connectivity for the LDAP, Kerberos, Web, Database, and fileservers went down. This left the mail server and login servers running without the services they need to function normally. The result was a mail server that could not perform user lookups and could not access user mail spools. The symptom is that any mail received between 10:20 am on Wednesday and approximately 18:30 Wednesday evening was rejected as mail for invalid users. As far as the mail server could establish, none of our users existed, as it could neither perform user lookups nor deliver mail to user mail spools.

The network connectivity drop was due to a Cisco Catalyst switch failure in the Telco Closet upstairs that serves our server room. Mr. Cotton replaced the switch in question. When the replacement was finished, CS Sysadmins began to assess server functionality. It was quickly established that the servers on the Gigabit Ethernet ports were not getting proper network access. Mr. Cotton then did network troubleshooting, and it was established that three of the port blades in the Catalyst switch had been swapped into the incorrect slots. After the blades were restored to their correct locations, network connectivity was restored. The decision was made at this time to move the network connections for all servers on the gigabit ports to our Catalyst switches in the server room, which interface directly with fiber connections in the core room downstairs, bypassing the telco closet entirely. Proper network function was then permanently restored, and all systems were brought back online without further incident.


03-09-07


Servers have been moved into the new racks, and they are now all balanced across the UPSes. The remote access servers are now CSS01 - CSS10.

12-04-06


SpamAssassin is now up and working. Contact Sysadmin if you need assistance in setting up your account.

10-18-06

The new webmail host webmail.cs.wmich.edu is now working.
We now have dedicated remote servers for student and faculty access on or off campus. These machines reside in the data center, so they are isolated from inadvertent disconnection, and they are available 24/7. They are "css01-css20". The csx group of machines has been rerouted (CNAME) to the css group.
NOTE: These are x86 machines (Intel-based chipset) running Solaris 10.

We are aware of the ridiculous amount of spam (we receive it all!). The new (isolated) mail server is in progress; with it we plan to reduce current spam levels and flag suspect mail.

08-10-06


The intranet server can only be accessed from the cs.wmich.edu domain.

OIT has turned off port 25 across campus. POP is no longer supported unless it is secure.

We have been experiencing a flood of spoofed mail from an outside source, claiming that passwords and accounts have been altered.
This is false information. If you receive this mail, just ignore it. The only department accounts that you need to be concerned with are
"sysadmin" and "csadmin"; all others are forged (or are not supposed to send mail).

We will be upgrading the UPS for all servers. We may experience short outages of system services once the Physical Plant has completed the wiring.

Hal experienced some temporary hardware issues due to overheating. These problems should now be corrected.

The following software is installed on all csx/csy machines:
Sun C/C++, Fortran Compilers and utilities
GNU C/C++, Fortran, ObjectiveC and utilities
CLISP, DrScheme, GNU Prolog, SWI-Prolog
XFig, Maple 6, MySQL, spim, teTeX, Acrobat 5, Jabberwocky
Java SDK 1.4.2_03, Java SDK 1.5.0


01-23-06

CS had a hardware failure causing intermittent NFS failures. The system is back up and working. (Cause unknown at this time.)