This talk discusses the development of RDMA Migration and Fault Tolerance as part of the migration system within QEMU, and the challenges involved in developing these two interrelated pieces of work.
RDMA helps make migration more deterministic under heavy load because it offers lower latency and higher throughput than TCP/IP. In particular, under certain types of memory-bound workloads, a TCP-based migration may take a less predictable amount of time to complete if the amount of memory transferred during each live migration iteration cannot keep pace with the rate at which the workload dirties memory. Fault Tolerance in the virtualization community has gone through many growing pains, including implementations from Xen, VMware, Marathon, and academia. This talk also summarizes our attempt to perform Micro-Checkpointing inside KVM, making use of RDMA as well.
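
The convergence condition for iterative live migration mentioned above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not QEMU's actual algorithm: the function name `precopy_rounds`, its parameters, and all numbers are hypothetical, the link bandwidth and the dirty rate are treated as constants, and each round is assumed to resend exactly the memory dirtied during the previous round.

```python
def precopy_rounds(total_mb, bandwidth_mb_s, dirty_rate_mb_s,
                   max_downtime_s=0.3, max_rounds=30):
    """Toy model of iterative (pre-copy) live migration.

    Hypothetical helper for illustration only. Returns a tuple
    (rounds_completed, elapsed_seconds, converged).
    """
    remaining = float(total_mb)  # memory still to be sent, in MB
    elapsed = 0.0                # time spent in iterative rounds, in seconds
    for rounds_done in range(max_rounds):
        # Converged: what is left can be sent within the downtime budget
        # while the guest is paused.
        if remaining / bandwidth_mb_s <= max_downtime_s:
            return rounds_done, elapsed, True
        round_time = remaining / bandwidth_mb_s
        elapsed += round_time
        # Memory dirtied while this round was in flight must be resent.
        # It can never exceed the total guest memory.
        remaining = min(float(total_mb), dirty_rate_mb_s * round_time)
    return max_rounds, elapsed, False


if __name__ == "__main__":
    # Hypothetical 4 GB guest dirtying memory at 1200 MB/s.
    for bandwidth in (1000, 4000):
        rounds, seconds, ok = precopy_rounds(4096, bandwidth, 1200)
        print(f"{bandwidth} MB/s link: rounds={rounds}, "
              f"time={seconds:.2f}s, converged={ok}")
```

In this model the remaining memory shrinks by the ratio of dirty rate to bandwidth each round, so the migration only converges when the link outpaces the workload. That is the sense in which a higher-throughput transport makes completion time more predictable.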