We have a cluster of HP workstations and, once upon a time, were using 1/4-inch tape as the backup medium. This was very slow and cumbersome, as we were forever increasing the amount of disk space on our system, so we decided to purchase HP's optical jukebox to use both as large removable media and as the primary backup device.
We had been experiencing occasional problems with the 1/4-inch tape backups, but HP's hardware service engineer convinced us that the problems were resolved. A complete backup was performed prior to installation (by the HP engineer) of the jukebox. Two unfortunate things happened. First, the problems on our backup tapes were due to intermittent hardware problems on the tape drive which were not discovered by the extensive diagnostics performed on the tape drive. Second, the engineer installed the jukebox with the same hardware SCSI address as our root file system.
As you may have anticipated, the attempt to mediainit the first optical cartridge resulted in a rather ungraceful failure of the root file system. This was compounded by the fact that much of the data on the backup tapes was not recoverable.
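One way to catch unreadable backups before they are needed is to record a checksum for every file at backup time and compare a test restore against that list. The following is only a minimal sketch of that idea, not what this site actually ran; the function names and the in-memory dictionaries standing in for tape contents are hypothetical.

```python
import hashlib


def make_manifest(files):
    """Record a SHA-256 digest per file name at backup time.

    `files` maps a file name to its contents as bytes (a stand-in
    for whatever is being written to the backup medium).
    """
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def verify_restore(manifest, restored):
    """Return the sorted names of files that are missing or corrupt
    in a test restore, judged against the manifest."""
    bad = []
    for name, expected in manifest.items():
        data = restored.get(name)
        if data is None or hashlib.sha256(data).hexdigest() != expected:
            bad.append(name)
    return sorted(bad)
```

A backup whose test restore returns a non-empty list from `verify_restore` should not be trusted as the only copy before risky hardware work.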
We had an operator lay a book on the console keyboard, throwing the console into system monitor mode. This stops the system clock, which locks every session dead in its tracks. At that time we had over 100 user sessions running. Most of our inbound lines are essentially modem lines on a very large "rotor". After their session hung for a minute or so, many users disconnected and called back. They got connected, but received no login prompt (the system was in a sort of suspended animation). Little did they know that they were now on a different port than the one they had just abandoned.
A call to the computer room soon identified the problem, and the operator was given the commands to resume normal system operation. As near as we can figure, somewhere around half of the users had disconnected but the system didn't notice because it never saw carrier drop on those ports (being dead). New, different users had now connected to those ports. We received several semi-confused user calls, realized what had happened and invoked the magic "/etc/shutdown NOW" command. The procedure (should this ever happen again) will be to manually panic the system and reboot. I also surgically removed the keycap from that particular key on our terminal - you have to work to press it now!
Many years ago a tiny little college in the middle of nowhere purchased an NCR tower, then a newfangled contraption. A half-dozen of us were using it for an assembly class. The prof should have made his warnings about TRAP a little more clear. One student ran his program and it suddenly began spawning processes, rapidly filling the machine. The prof came in, amused, logged on as superuser, and killed a process. Another process was immediately spawned. The prof tried again. He was ignored. He was also no longer amused. After several minutes he gave up and turned off the box. The tower didn't even flinch. He pulled the plug. Nothing. He ripped the back off the box and dug around. Finally he found the fuse and pulled it, killing the machine. Some of us later claimed we heard laughter as it went down.
Many times since then I have wished other computers came with a backup battery as standard issue.
Many things have happened in the many years I've been with computers. The most horrific story I've seen is not UNIX related, but it is certainly worth telling. Here it goes.
This big (:-) CDC 6600 system was bootable from tape drive 0, using those 12-inch reels of 1/2" tape. The *whole* system was reloaded anew from the tape each time we restarted the machine, because there was no permanent file system yet; the disks were not meant to retain files through computer restarts (unbelievable today, I know :-). The deadstart tapes (as they were called) were quite valuable, and we kept at least a dozen backups of them, going back maybe one or two years in development.
The problem was that the two vacuum capstans that drove the tape on drive 0, near the magnetic heads, were not perfectly synchronized, due to a hardware misadjustment. So they stretched the tape while reading it, wearing it in a way invisible to the eye but nevertheless making the tape unrecoverable. Apart from that, everything looked normal in the tape's physical and electrical operation. Of course, nobody knew about this problem when it suddenly appeared.
All this happened while the whole system administration team was on vacation at the same time. Not being a traveler, I just stayed available `on call'. The knowledgeable operators were able to solve many situations, and being kind to me (as I was to them :-), they would not disturb me just for a non-working deadstart tape. Further, they had a full list of all the deadstart backup tapes. So they first tried (and destroyed) half a dozen backups before turning the machine over to the hardware guys, who destroyed a few more themselves.
The technicians had their own systems for diagnostics, all bootable from tape drive 0, of course. They had far fewer backups than we did. They destroyed almost all of them before calling me in. Once told what had happened, my only suggestion was to alter the deadstart sequence so as to boot from another tape drive. Strangely enough, nobody had thought of it yet. In those old times, software guys were always suspecting hardware, and vice versa :-).
Happily enough, the few tapes left started, both for production and for the technicians. With tape drive 0 now quite suspect, the technicians finally discovered the problem and repaired it. My only job left was to upgrade the system from almost one year back before turning it over to operations. This was at the time, now seemingly lost, when system teams were heavily modifying their operating system sources. This was also the time when everything not on big tapes was on punched Hollerith cards, the only interactive device being the system console. It took me many days, alone, with the machine in standalone mode. The crowd of users stopped regularly at the windows of the computer room, taking bets, as they used to do, on how fast I would get the machine back up (some of my supporters lost their money this time :-).
This was quite hard work for me, done under high pressure. When the rest of the staff returned from their trips and I told them the whole tale, we decided never to synchronize our holidays again.
At one time, there were three of us working on a single SVR3.2 Motorola-based machine, on an R&D project. I took care of all the SysAdmin tasks, I had a backup administrator, and the third person had been stuck into my group (company politics). The group project files were in /user and the individual ones in /user2. We had managed to get backups from the operations department for /user only (not even /; security paranoia?). Anyway, I had another SCSI hard disk that I used for making a disk copy of the primary SCSI hard disk every Friday. This disk was connected, but not mounted, so that I could do the disk backup from my desk when I wanted to. This machine would sometimes get a SCSI error such that you could not log in, but the processes already running on the machine were not affected. If you were logged in on the console, you just powered off the machine for a few minutes and rebooted it. Around holiday time the other admin was off on a long vacation. I had taken Monday off, and headed off for a four-day weekend. The machine does the same blurp. The third person decides to power off the machine and turn it back on immediately. It does not come up properly. She decides to reinstall the machine using the installation tape that I had unfortunately left out in the open. Reformats the hard disk, installs the base system, and is stuck at that point when I come back in on Tuesday. I almost blow a blood vessel but try to keep calm 'cause I had made a disk copy about 10 days before (too anxious to get on my holiday the previous week). Try to mount the disk... hit vacuum. Try using dd to look at the disk... Seemed to be a large /dev/null :-? When the lady decided to reinstall the system, it asked her which SCSI disks she wanted to reformat, and she said "y" for both 0 & 1!! All my sample/trial&error work for a year had bitten the dust. My only (small) consolation was that I was not the only one affected.
Story One is about The Sun 3/260 That Froze Solid. One day a user reported that the Sun 3/260 he was using was "dead". On inspection, I found the Sun at the console prompt and the keyboard totally unresponsive. The L1-A sequence did nothing. So I power cycled it. Nothing. A blank screen, no activity. I was ready to call service, then decided to try rebooting with the normal/diag switch set to diag. On looking at the back of the pedestal, I saw that the ethernet cable had been pressed up against the reset switch! ARGGGHHHH! The user had pushed the machine back just enough to press the switch and keep it pressed. (I don't recall if there was a "watchdog reset" message on the console when I found it, but I was new enough to Suns that that would not have been a dead giveaway.)
Story Two involved connecting an HP LaserJet to a Sun 3/280. This sucker just would NOT do flow control correctly. I put a dumb terminal in place of the HP and manually typed ^S/^Q sequences to prove that the serial port really was honoring X-ON/X-OFF. But for some reason the ^Ss from the HP didn't "taste right" to the Sun, which ignored them. Switching the HP serial port between RS422/RS232 had no effect. It eventually turned out to be some sort of flakiness with the Sun ALM-II board. Everything worked fine after I moved the printer to one of the built-in Zilog ports. Death to flaky hardware...
In article <1992Oct16.152629.29804@nsisrv.gsfc.nasa.gov: williams@nssdcs.gsfc.na
[story about connecting HP LJ to a Sun 3/280 with an ALM-II board deleted]
ARRRGGGHHH!!!! DEATH TO ALM-II BOARDS! Funny though, I do have an HPLJ-2 hooked up to a SUN 690MP through the ALM-2 boards without problems. However, I also had Sun going up the wall along with me over an Okidata 320 printer that would hang the port until we rebooted the machine (not a nice thing to do with a dozen stock brokers). Funny thing is, we had ANOTHER Okidata 320 printer attached to the same Sun on another ALM-2 port, no problem with that one. Hm, switch the printers, no change. Switch the cables, no change. Switch the ports, no change. Weird. Finally discovered it was the DATA that was being sent. The printer with problems was a label printer, which was sending a control-s every 10-20 characters or so to pause the Sun. Apparently the Sun ALM-2 drivers cannot handle control-s'es that frequent. No problem, Sun said, just switch to hardware flow control. Puzzled me, because my docs said the ALM boards had no hardware flow control. But his docs said it was there. Took the printer off line, started the lpd, data scope showed the data going out. Talked to Sun again, tried RTS-CTS, DTR, 'crtscts' in printcap, '-crtscts' in printcap. Tried all kinds of combinations. Finally he asked me which ALM-2 port I was using; 13, I responded. Oh, ALM-2 boards only have hardware flow control on the first four ports. Whoops :-). Both docs were true: my docs said there was no hardware flow control, which was right for the last 12 ports. His docs said that there was hw flow control, but he missed the 'on the first four ports' part. Now it works, and I hope Sun now has this better documented.
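For anyone who has not fought with software flow control, here is a toy simulation of what the printer expects from the host: the printer sends ^S (XOFF, 0x13) when its buffer is filling up and ^Q (XON, 0x11) when it has drained, and a host that ignores the ^S overruns the buffer. This is only a sketch, not how the ALM-2 driver works; the `Printer` class, the `send` function, the buffer sizes, and the per-tick rates are all made up for illustration.

```python
XON = 0x11   # ^Q: receiver has room again, resume sending
XOFF = 0x13  # ^S: receiver is nearly full, pause sending


class Printer:
    """Hypothetical receiver with a small buffer and high/low water marks."""

    def __init__(self, size=8, high=6, low=2):
        self.size, self.high, self.low = size, high, low
        self.buffer = []
        self.printed = []
        self.asked_to_pause = False
        self.overruns = 0

    def receive(self, byte):
        """Accept one byte; return XOFF when the sender should pause."""
        if len(self.buffer) >= self.size:
            self.overruns += 1          # buffer full: this byte is lost
            return None
        self.buffer.append(byte)
        if len(self.buffer) >= self.high and not self.asked_to_pause:
            self.asked_to_pause = True
            return XOFF
        return None

    def print_one(self):
        """Print one buffered byte; return XON once the buffer has drained."""
        if self.buffer:
            self.printed.append(self.buffer.pop(0))
        if self.asked_to_pause and len(self.buffer) <= self.low:
            self.asked_to_pause = False
            return XON
        return None


def send(data, printer, honor_xoff=True, bytes_per_tick=3):
    """Send up to bytes_per_tick bytes per tick while the printer prints one."""
    queue = list(data)
    sending = True
    while queue or printer.buffer:
        for _ in range(bytes_per_tick):
            if not queue or not sending:
                break
            if printer.receive(queue.pop(0)) == XOFF and honor_xoff:
                sending = False         # a well-behaved host stops here
        if printer.print_one() == XON:
            sending = True
    return bytes(printer.printed)
```

With `honor_xoff=True` every byte comes out of the printer in order; with `honor_xoff=False` the `overruns` counter climbs and data is lost, which is roughly the symptom you see when a serial port does not act on the ^S.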
My company markets turnkey solutions for resume-processing, so most of our customers are non-technical HR recruiters. We contract third-party field service to a fairly recognizable name in the industry.
I received a call from an irate user who noticed intolerable delays after some upgrades were done to the customer's branch offices. His ELC would use dial-up to establish a link before running software off the server in a different site.
He attributed the delay to slow dial-up links and software changes, but then the customer mentioned that quitting WordPerfect and switching to our application took over an hour. I asked what the system was doing during that hour. He replied the disk was constantly spinning. Puzzled, I checked his swap, which was more than sufficient. Then finally I noticed his ELC booted with only 4 meg of memory.
I think the field technician swapped their CPU board a month ago and forgot to move the SIMMs over. The worst part of it was the customer went on with this situation for a month before bringing it to our attention!
Moral of the story: Check that the service guy puts everything back in.
I once had problems with files that mysteriously refused to stay changed for very long. It was a PDP-11 Unix system that had crashed, and I brought it up single-user. I would change some file and it would stay changed for a minute or so but then revert to its earlier state (contents, protection mode, etc). What happened was that the write-protect switch on the disk drive had gotten bumped into the "on" position but the device driver failed to report any write errors. As long as the data stayed in kernel buffers the changes "took", but they would disappear once the buffers were reused and the system had to reread the disk.
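The effect is easy to reproduce in miniature. The sketch below is a toy model, not the PDP-11 kernel's buffer cache: the class names, the least-recently-used eviction policy, and the block numbers are all assumptions chosen for illustration. The disk accepts writes and reports success but never stores them, so a change survives only as long as its buffer does.

```python
from collections import OrderedDict


class WriteProtectedDisk:
    """Disk with the write-protect switch on and a driver that
    reports success anyway (the failure mode in the story)."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)

    def read(self, n):
        return self.blocks[n]

    def write(self, n, data):
        # The write is silently dropped; no error reaches the caller.
        return True


class BufferCache:
    """Tiny write-back cache that reuses the least recently used buffer."""

    def __init__(self, disk, capacity):
        self.disk = disk
        self.capacity = capacity
        self.buffers = OrderedDict()    # block number -> data

    def _store(self, n, data):
        self.buffers[n] = data
        self.buffers.move_to_end(n)
        while len(self.buffers) > self.capacity:
            old, old_data = self.buffers.popitem(last=False)
            self.disk.write(old, old_data)   # flush "succeeds", data is gone

    def read(self, n):
        data = self.buffers[n] if n in self.buffers else self.disk.read(n)
        self._store(n, data)
        return data

    def write(self, n, data):
        self._store(n, data)
```

With a two-buffer cache over blocks 0, 1 and 2, writing "new" to block 0 and reading it straight back returns "new"; after reading blocks 1 and 2, block 0's buffer has been reused, and the next read of block 0 returns the old contents from the disk.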