The main server and related systems went offline on Jul 19 at 05:33 UTC during a power spike. The emergency backup power system, although in theory scaled to match the requirements of the server systems, failed with an overload message reported in the logs. Unfortunately this happened while I was visiting Germany, and I wasn't able to reset the power system until I arrived back 5 days later, on the 24th at 16:31 UTC.
The system only draws around 100-300W under normal circumstances, which would seem to suggest that a 1500VA UPS would be sufficient - and indeed it has been handling many of the recent power spikes just fine. However, a more thorough analysis of the issue has revealed that the switch-over from mains power to battery power leaves the systems without power for about 6ms, causing an in-rush of current just as the inverter starts supplying power from the battery. On rare occasions under high load, this in-rush would push the total power draw over the UPS's maximum and cause the UPS to switch off to protect itself.
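To make the arithmetic concrete, here is a rough sketch of how a brief in-rush can exceed a rating that looks generous on paper. Only the 1500VA rating and the 100-300W steady draw come from the situation described above; the power factor and the in-rush multiplier are made-up illustrative assumptions, not measured values from this UPS or these servers.

```python
def ups_watt_capacity(va_rating: float, power_factor: float) -> float:
    """Convert a VA rating to usable watts for a given power factor."""
    return va_rating * power_factor


def overloads(steady_load_w: float, inrush_multiplier: float, capacity_w: float) -> bool:
    """Return True if the momentary in-rush peak exceeds the UPS capacity."""
    return steady_load_w * inrush_multiplier > capacity_w


UPS_VA = 1500.0                   # rating of the old UPS
ASSUMED_POWER_FACTOR = 0.6        # assumption: illustrative only
ASSUMED_INRUSH_MULTIPLIER = 4.0   # assumption: illustrative only

capacity_w = ups_watt_capacity(UPS_VA, ASSUMED_POWER_FACTOR)

for load_w in (100.0, 300.0):     # low and high end of the normal draw
    peak_w = load_w * ASSUMED_INRUSH_MULTIPLIER
    status = "OVERLOAD" if overloads(load_w, ASSUMED_INRUSH_MULTIPLIER, capacity_w) else "ok"
    print(f"steady {load_w:.0f}W -> peak {peak_w:.0f}W vs capacity {capacity_w:.0f}W: {status}")
```

With these assumed numbers the low end of the load range stays within capacity and the high end does not, which matches the "only under high load" pattern. The real multiplier depends on the power supplies involved and would need to be measured.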
Only stand-by and line-interactive UPS systems have this issue because they keep the battery and power generation parts inactive until they are needed.
The solution has been to replace the UPS with a so-called "online UPS" of similar capacity. Online UPSes (or double-conversion UPSes) always keep the power generation system running - in fact, they completely isolate the mains power from the power delivered to the servers, which gives the added benefit of extremely stable voltage and frequency on the outputs.
Another nice side-effect is that the new UPS has some more advanced monitoring and management features and seems to be running cooler than the old one.
Sorry about the downtime!