The planned Database maintenance consisted of running an ALTER
operation on a fairly large MySQL Table (20M+ records). We're attempting to switch the encoding to utf8mb4 to accommodate a richer set of characters for certain columns as required by some of our new customers.
We use RDS for our database needs and a rough calculation showed that we were going to trigger around 5-6M Write IOs. Various benchmarks we run in the past and previous migrations show that our RDS database (m3.xlarge) can handle 2000-2500 Write IOPS for a Burst of around 50 mins. So we think we're safe at this point (I.e we can get the migration done during the planned window).
We launched the migration and we monitored the Write IOPS which rise quickly to 2500K Write IOPS. Our RDS instance is humming along and because the Table is locked all insertions in the table are rejected. This triggers our alerts in newrelic. After 1h, we realise that the total numbers of IOPS required to finish the migration is not going to fit within the burst quota. The Write IOPS crash from 2500K down to 500 IOPS and we're faced with a migration that's going to take an extra 1h30 mins - which means an extra 1h30 of failing all API calls resulting in an insert. This is clearly unacceptable given our SLAs. Decision to roll back the migration and tackle it differently.