Unscheduled maintenance

October 9, 2016

Fellow bunqers,

Earlier today we had to do some unscheduled maintenance work. We’re very sorry for any inconvenience this may have caused you.

Here’s what happened:

  • According to our monitoring system, the bunq app started experiencing issues at 12:15. 
  • At 12:24 we received reports that the bunq app was down. We responded immediately.
  • It turned out our database cluster was experiencing issues.
  • We saw that 2 out of 3 nodes were no longer connected.
  • Because we think it’s better to be safe than sorry, the app was temporarily switched to maintenance mode.
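To show why losing two of three nodes is a reason to play it safe, here is a minimal sketch. It assumes a cluster that only accepts writes while a node can see a strict majority of its members. That is a common design, but it is an assumption here and not a description of our actual database software. The function name `has_quorum` is made up for illustration.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if a strict majority of the cluster is reachable.

    `reachable_nodes` counts the node itself plus every peer it can
    still talk to. Hypothetical helper, not part of any real database API.
    """
    return reachable_nodes > cluster_size // 2


# A three-node cluster, seen from a single node:
print(has_quorum(3, 3))  # True: all nodes connected
print(has_quorum(2, 3))  # True: one peer lost, majority remains
print(has_quorum(1, 3))  # False: two peers lost, no majority
```

Under that assumption, a node that has lost contact with both peers cannot tell whether it is the one that is isolated, so refusing traffic is the cautious choice.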

Because the database cluster is large, it took quite some time for it to recover completely. The app left maintenance mode at 13:32 and has been working normally since then.

Further investigation showed that we probably hit a bug deep within our database software and its operating system. Of course we are looking into this further. Please contact us if you want additional information or technical details.

Sorry again to have disturbed your Sunday,

Jasper

UPDATE: November 14th

We’d like to share a more detailed recap of what happened:

  • TCP port reuse interacted badly with our stateful firewall;
  • We hit a corner-case bug in the FreeBSD operating system that blocked data on certain ports for a short period of time;
  • This happened twice in quick succession, leaving two databases unable to communicate with the third.
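The interaction between port reuse and a stateful firewall can be sketched with a toy model. This does not reproduce the FreeBSD bug and is not how any real firewall is implemented. It only assumes that a firewall remembers a closed connection for a while and rejects a new connection that reuses the same addresses and ports during that window. The class name, the addresses, the ports and the 30-second linger time are all hypothetical.

```python
class ToyStatefulFirewall:
    """Toy model of connection tracking, keyed by (src_ip, src_port, dst_ip, dst_port)."""

    def __init__(self, linger_seconds: float) -> None:
        # How long state for a closed connection is kept (assumed value).
        self.linger_seconds = linger_seconds
        # Maps a connection 4-tuple to the time it was closed.
        self.closed_at: dict[tuple, float] = {}

    def close(self, conn: tuple, now: float) -> None:
        """Record that a connection was closed at time `now`."""
        self.closed_at[conn] = now

    def allow_new_connection(self, conn: tuple, now: float) -> bool:
        """Reject a new connection while stale state for the same 4-tuple lingers."""
        closed = self.closed_at.get(conn)
        if closed is not None and now - closed < self.linger_seconds:
            return False  # stale state still present: traffic is dropped
        self.closed_at.pop(conn, None)  # state expired (or never existed)
        return True


fw = ToyStatefulFirewall(linger_seconds=30.0)
conn = ("10.0.0.1", 40000, "10.0.0.2", 5432)  # made-up addresses and ports

fw.close(conn, now=0.0)
print(fw.allow_new_connection(conn, now=5.0))   # False: port reused too soon
print(fw.allow_new_connection(conn, now=45.0))  # True: old state has expired
```

In this model a stateless rule set would never return False for this reason, which is the idea behind dropping state from the firewall rules.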

We also hit a bug in the clustering software and had to restart it manually. We then had to make sure that all data was consistent before we could go back online.

Measures taken:

  • We temporarily switched from auto failover to manual failover, since an auto failover is a lot more complicated. Meanwhile we’re writing our own auto failover;
  • We’ve adjusted our firewall rules so they no longer keep state. We’re also looking into a completely different firewall;
  • We already have the FreeBSD 11 upgrade for our operating system, but rolling it out takes time. We have to do this step by step.

While fixing this we also found a bug involving the LSI controller of our SSDs. So far it’s unclear whether it’s a firmware or driver bug, but we’re working on fixing that too.