Post Mortem Report
The purpose of this post-mortem is to detail the investigation into, and findings on, the impact of the Cloud Mail incident of 26 May 2022.
Background
- On Zimbra, we perform regular proactive maintenance on our LDAP servers.
- This involves reloading the LDAP databases on all master and replica servers from a backup of one of the master databases.
- This is done quarterly to maintain optimal performance of the LDAP service.
- This process has been performed over the last 3 years in conjunction with our vendor without issue.
On Wednesday 25th May 2022:
- An LDAP maintenance change was performed on the evening of the 25th at 20:00.
- A SYNAQ engineer was working with our vendor's engineer on this change. A miscommunication between them resulted in the incorrect backup being applied: a legacy backup from 18 January 2022 instead of the current backup taken from a master that evening.
On Thursday 26th May 2022: Authentication Failures Experienced
- [07:00]: We received reports that a subset of clients were experiencing authentication issues (i.e. they could not log in to their mailboxes).
- Our engineers investigated the issue and found that the LDAP servers were running an incorrect version of the LDAP database.
- [08:03 – 09:31]: SYNAQ engineers took the LDAP system down to restore the most current backup to all LDAP servers. This reload took approximately 90 minutes.
- Authentication services were restored at 09:31.
Remedial Actions
Immediate – 0-3 months:
- Improve and verify standard operating procedure for LDAP optimisation process with additional test cases.
- Build an additional monitoring alert that allows us to detect anomalous changes in expected data found in LDAP after a reload process is performed.
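
As an illustration only, the two immediate actions above could be sketched as a pair of checks: one that rejects a stale backup before it is applied, and one that compares entry counts taken before and after a reload. This is a minimal sketch and not SYNAQ's actual tooling. The function names, the six-hour freshness window, the 1% drop threshold, and the idea of counting entries per category (for example accounts or domains) are all assumptions made for the example. How the counts are collected from LDAP is left out.

```python
from datetime import datetime, timedelta


def backup_is_fresh(backup_taken_at, now, max_age=timedelta(hours=6)):
    """Return True if the backup was taken no more than max_age before now.

    The six-hour default is an assumed value for illustration.
    """
    age = now - backup_taken_at
    # A negative age means the timestamp is in the future, which is also rejected.
    return timedelta(0) <= age <= max_age


def reload_anomalies(before, after, max_drop=0.01):
    """Compare per-category entry counts taken before and after a reload.

    before and after map a category name (hypothetical, e.g. "accounts")
    to an entry count. Returns a list of warning strings; an empty list
    means no category shrank by more than max_drop (a fraction).
    """
    warnings = []
    for category, old_count in before.items():
        if old_count == 0:
            continue  # nothing to compare against
        new_count = after.get(category, 0)
        drop = (old_count - new_count) / old_count
        if drop > max_drop:
            warnings.append(
                f"{category}: {old_count} -> {new_count} ({drop:.1%} fewer entries)"
            )
    return warnings
```

Under these assumptions, a backup dated 18 January 2022 would fail the freshness check when validated on the evening of 25 May 2022, and a reload that silently removed recently created entries would show up as a non-empty list of warnings.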
Long term – 6-12 months:
- Zimbra 9 – LDAP fixes and improvements.