On Tuesday, 21st of July 2009, at around 12:00, an error occurred in the so-called "Fabric" at the hosting site of our hosting partner GWDG. The "Fabric" is the part of the SAN (Storage Area Network) that connects the SAN with the actual servers. When we decided on the hardware to be acquired for the production environment of PubMan, we deliberately chose a SAN for storage because it is one of the most secure ways to store data: it eliminates most single points of failure.
Due to this error, one of the two connections to the SAN was lost, but as the other one was still working, everything was still fine; that is what the redundant connection is meant for. Approximately 40 minutes later, write errors occurred on the remaining path and the operating system switched the whole filesystem to read-only. The specific reason for the failure of both paths is currently being investigated by GWDG together with the vendor of the SAN, FalconStor.
These errors did not necessarily mean that any data had to be lost. Unfortunately, during this time, between 12:00 and 12:40, a filesystem corruption occurred: blocks of data were written to places on the disks in the SAN where they do not belong. This incident is also being investigated by our hosting partner.
After the connection to the SAN was restored, we had to check the filesystem, but not all errors could be resolved this way. We therefore had to fall back on the last working backup, dating from 21st of July, 4:00 am. All changes made after that point were lost.
We brought PubMan back online on Thursday, 23rd of July 2009. However, due to maintenance work at the GWDG hosting site, the connection to the SAN was lost again on Friday, 24th of July 2009, at around 11:00 am. This time there was no error at the filesystem level. Since recovering the system takes some time and was further delayed by the weekend, the system was not ready again until Monday, 28th of July 2009. If your items were affected, you will receive, or have already received, further information.