Resolving Invalid Key File Errors With Amazon Aurora

Fri Jul 23, 2021
~600 Words

When adding a column to a large table on an Aurora MySQL 5.6 db.r5.large instance, we encountered a disturbing error:

Mysql2::Error: Incorrect key file for table 'applicants'; try to repair it

The migration succeeded in our staging environment (which uses a smaller instance size) and in another production deployment that has very little data, but it failed in our largest deployment, both in terms of database instance type and data volume.

MySQL has commands for checking the health of a database table, but they can apparently cause the server to exit if severe issues are found in an InnoDB table, so we did not run them initially.
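
For reference, here is a minimal sketch of the kind of checks being referred to, using the applicants table from the error above. Note that the error's suggestion to repair the table is of limited use here, since REPAIR TABLE only supports storage engines such as MyISAM, not InnoDB:

-- Default check; the same statement is run further down in this post
CHECK TABLE applicants;

-- QUICK skips scanning rows for incorrect links, so it is lighter-weight
CHECK TABLE applicants QUICK;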

The original migration had used online DDL to avoid locking the table, so to rule that out we attempted to add the column synchronously with:

ALTER TABLE `applicants` ADD `expunge_after` datetime;

This ALTER TABLE statement failed with the same error.

Running CHECK TABLE applicants; resulted in no errors being reported:

+-----------------+-------+----------+----------+
| Table           | Op    | Msg_type | Msg_text |
+-----------------+-------+----------+----------+
| ebdb.applicants | check | status   | OK       |
+-----------------+-------+----------+----------+
1 row in set (10 min 15.57 sec)

At this point I suspected that the failure could be related to a lack of local storage, even though the error didn't obviously suggest this.

Running this command from the Aurora local storage troubleshooting page returned what seemed to be a large number of temporary tables on the affected instance:

SHOW GLOBAL STATUS LIKE 'created_tmp%tables';
+-------------------------+----------+
| Variable_name           | Value    |
+-------------------------+----------+
| Created_tmp_disk_tables | 199544   |
| Created_tmp_tables      | 14833718 |
+-------------------------+----------+
2 rows in set (0.00 sec)
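
Created_tmp_disk_tables counts implicit temporary tables that could not be kept in memory and were written out to the instance's local storage, which is what makes it relevant here. As a rough sketch, these server variables govern the point at which an in-memory temporary table spills to disk (temporary tables containing TEXT or BLOB columns go straight to disk regardless):

-- An implicit temporary table is converted to an on-disk table once it
-- exceeds the smaller of these two limits.
SHOW GLOBAL VARIABLES LIKE 'tmp_table_size';
SHOW GLOBAL VARIABLES LIKE 'max_heap_table_size';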

This article (in Japanese) suggested checking the local storage limits for the various instance sizes, and also looking at the CloudWatch metrics (specifically FreeLocalStorage) of the instance in question. Doing so made it clear that while this instance type (db.r5.large) has more than enough local storage to handle the migration, for some reason its available local storage had been trending steadily down over the past two weeks, leaving only ~5GB, which was insufficient.

This AWS page on Aurora local storage suggests that running out of local storage would produce a 'table full' error, but the error we were seeing referred to a key file instead.

This Reddit discussion includes the rule of thumb that local storage is roughly 2x the memory of the instance (so around 32GB for a db.r5.large with 16GiB of memory). It also contains a hint that the used space may be freed in the event of a failover to the reader.

We first tried failing over to (possibly) free up local storage on the writer instance. After failing over, the temporary table counters dropped almost immediately, though since these are cumulative per-instance counters the lower numbers likely reflect the newly promoted writer rather than reclaimed space:

mysql> SHOW GLOBAL STATUS LIKE 'created_tmp%tables';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Created_tmp_disk_tables | 2     |
| Created_tmp_tables      | 459   |
+-------------------------+-------+
2 rows in set (0.01 sec)
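
Incidentally, one way to confirm which instance a session has landed on after the failover is to check innodb_read_only, which Aurora sets to ON on reader instances and OFF on the writer (a common check, though not output we captured at the time):

-- OFF when connected to the writer, ON when connected to an Aurora replica
SHOW GLOBAL VARIABLES LIKE 'innodb_read_only';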

However, even after waiting ~20 minutes, CloudWatch showed no improvement in the amount of available local storage, so the next step was to size up the instance.

Failing over to the backup instance and sizing up to db.r5.xlarge allowed the migration to complete using online DDL:

mysql> ALTER TABLE `applicants` ADD `expunge_after` datetime, ALGORITHM=INPLACE, LOCK=NONE;

Query OK, 0 rows affected (11 min 18.63 sec)
Records: 0  Duplicates: 0  Warnings: 0

If you are experiencing issues with migrations on large tables, I would recommend confirming exactly how much local storage the affected instance has left. If it is running low, failing over may be enough to resolve the issue; if you are running a heterogeneously-sized setup as we are, sizing up the writer may be the better option.