Debugging "Relay log read failure" With MySQL Master-Slave Replication For Drupal

Our Drupal 7 installation has served us well using a single, optimised MySQL database server. However, a desire to deliver advanced reporting and dashboards driven by Power BI required us to implement a replicated slave so that Power BI could draw data from a datasource that would not impact the performance of the database serving the website.

Mostly the configuration was straightforward:

Replicate a MySQL 5.5 master into a single MySQL 5.7 slave
Stitchdata tunnels into MySQL slave to access binary logs - Stitchdata requires at least MySQL 5.6, which explains the version difference you can see in step 1.
Stitchdata takes care of some data transformations to prep the data for Power BI.

The slave was successfully seeded using mysqldump, the correct master log file and position was specified so that replication would start from the correct place and the slave replication started successfully.

But after a few seconds SHOW SLAVE STATUS \G would tell us:
Relay log read failure: Could not parse relay log event entry.

We could reset the slave and start it again, data would synchronise as expected but a few seconds later we would hit the same error again.

Was it the MySQL version mismatch? No.
Was it because we had configured MySQL to require replication over SSL? No.
Was it a network problem? No.
Was it the log files becoming corrupt? No.

The documentation had no more ideas for us. Time to break the problem down and see where things were going wrong.

MySQL binary log files typically receive a numbered suffix, for example:
mysql-bin.000001
mysql-bin.000002

The binary log is transferred to the slave, the slave reads these binary logs and creates relay logs which it uses to keep track of what it has already written into its local DB and what it will write next. These relays logs typically have a similar numeric suffix.

What was unusual in our environment was that we had one big relay log without the numeric suffix. This was unusual and seemed like a good place to start investigating.

The MySQL docs say this:

By default, relay log file names have the form host_name-relay-bin.nnnnnn in the data directory, where host_name is the name of the slave server host and nnnnnn is a sequence number. Successive relay log files are created using successive sequence numbers, beginning with 000001. The slave uses an index file to track the relay log files currently in use. The default relay log index file name is host_name-relay-bin.index in the data directory.

The default relay log file and relay log index file names can be overridden with, respectively, the --relay-log and --relay-log-index server options (see Section 16.1.6, “Replication and Binary Logging Options and Variables”).

If a slave uses the default host-based relay log file names, changing a slave's host name after replication has been set up can cause replication to fail with the errors Failed to open the relay log and Could not find target log during relay log initialization. This is a known issue (see Bug #2122). If you anticipate that a slave's host name might change in the future (for example, if networking is set up on the slave such that its host name can be modified using DHCP), you can avoid this issue entirely by using the --relay-log and --relay-log-index options to specify relay log file names explicitly when you initially set up the slave. This will make the names independent of server host name changes.

We had setup an explicit relay-log config setting based on our slave server hostname, which seemed to be in line with the recommendations in the documentation, but this is where the problems began. Our database server host names use the following convention:

slave.001.xxx.yyy.com

What we learnt is that using that hostname pattern for the relay-log setting is the cause of the problem:
relay_log = slave.001.xxx.yyy.com

While we haven't reviewed the MySQL source code it seems likely that the binary log naming algorithm uses some form of pattern matching to determine what is the current numeric suffix so that it can determine what the next suffix should be. But apparently the pattern matching it uses looks for a dot followed by digit anywhere in the filename, not just in the suffix. Therefore the presence of ".001" within the relay log name was making this fall over and MySQL could not successfully create the next relay log file and continue pumping everything into a single relay log file.

Once we had uncovered this behaviour the fix was simple, just avoid conflicting with that pattern by removing the dot:
relay_log = slave-001-xxx-yyy-com-relay-bin

Now that our replication setup runs without issue we can:

Create complex dashboards in Power BI and embed them into pages without compromising Drupal performance.
We have real time backups of our data
We can initiate drush sql-sync commands from the slave to avoid extra load on the production DB
We could configure Drupal to read from a slave and write to the master, but we'll be doing some further testing before we take that step.