Synchronous write performance: cache, RAM, power failures, BBU.
In a previous post I wrote about ZFS for virtualization. One recurring topic I ran into while evaluating that solution is the performance of sync (synchronized, synchronous) writes. Let's look at this excellent post, "Sync writes, or: Why is my ESXi NFS so slow, and why is iSCSI faster?":

"A properly executed sync write provides assurance to the storage consumer (in this case the filesystem layer) that the requested operation has been committed to stable storage; upon completion, an attempt to read the block is guaranteed to return the data just written, regardless of any crashes or power failures or other bad events. This usually means waiting for the drive to be available, then seeking to the sector, then writing the sector… a process that can take a little time. File data blocks are typically much less important, and are usually written asynchronously by most filesystems. This means that the disk controller might have many write requests simultaneously queued up, and the controller and drive are writing them out in a convenient manner. Because for the most part users are copying around data files, usually we’re used to seeing throughput in MBytes/sec. Sync writes, even on a local system, are usually substantially slower. So here’s the problem. ESXi represents your VM’s disk as data in a vmdk file. However, it really has no clue as to what that VM is trying to write to disk, and as a result, it treats everything as precious. In other words, VMware writes nearly everything to the datastore as a sync write. This can result in significant slowdowns. On a local host, the typical solution to the problem is to get a RAID controller with write back cache and a BBU. […] So VMware’s solution is to make sure writes are flagged as sync. Basically they hand the problem over to the storage subsystem, which is, in their defense, a fair thing to do. A lot of NAS devices cope with this by turning on async writes. You can do this, but then you become susceptible to damage from a crash or power loss… the very thing VMware was trying to protect you from."

As the post explains, vSphere cannot tell whether the guest OS is writing synchronously or asynchronously, so it always writes synchronously, since that is the safest option in the event of a power failure. The problem is the poor performance of sync writes on traditional hard drives:
"The option sync means that all changes to the according filesystem are immediately flushed to disk; the respective write operations are being waited for. For mechanical drives that means a huge slow down since the system has to move the disk heads to the right position; with sync the userland process has to wait for the operation to complete. In contrast, with async the system buffers the write operation and optimizes the actual writes; meanwhile, instead of being blocked the process in userland continues to run. (If something goes wrong, then close() returns -1 with errno = EIO.)"

Asynchronous writes rely on caches and buffers that speed the process up enormously. To improve synchronous writes, the usual approach is a RAM cache protected against power cuts by a battery (a battery backup unit, BBU). Here is an example involving a RAID controller:

Dear all, I have a VMware ESXi 5 server with nine VPS in my office. I don’t have any problem with this server since this week. All virtual machines which is installed on it become slow in performance and in speed specially SharePoint 2010 virtual machine server. VMware vSphere Client environment is very slow and some time «Not Responding». Shutting down the virtual machines and VMware server and turning them on again will fix the problem, but the next day I have the same problem. I don’t know what is the reason. Anyone can explain about this issue and its reason and how to fix it? Thanks.
_____________________________________
Please provide some details about the host system (vendor, disks, RAID controller, …). In case you are using a RAID controller with battery-buffered write-back cache, double check whether the battery is still ok.
André
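
To get a rough feel for the gap between buffered (async) and sync writes discussed above, here is a minimal Python sketch. It is only an illustration: the function names `timed_writes` and `compare`, the file name `probe.bin`, and the block count and size are all made up for this example. It uses `os.fsync` after every write as a stand-in for a sync write, which is an assumption about how the client behaves and not how ESXi itself issues its writes. The temporary directory may also live on a RAM-backed filesystem, in which case the difference will look much smaller than on a mechanical disk.

```python
import os
import tempfile
import time


def timed_writes(path, count, block, sync):
    """Write `count` copies of `block` to `path` and return elapsed seconds.

    When `sync` is True, os.fsync() is called after every write, so each
    write must reach stable storage before the next one starts.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
            if sync:
                os.fsync(fd)  # block until the data is on the device
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)  # an I/O error here would surface as OSError
    return elapsed


def compare(count=200, block_size=4096):
    """Time buffered versus fsync-per-write runs on a temporary file."""
    block = b"\0" * block_size
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "probe.bin")  # hypothetical file name
        buffered = timed_writes(path, count, block, sync=False)
        synced = timed_writes(path, count, block, sync=True)
        size = os.path.getsize(path)
    return {"buffered_s": buffered, "synced_s": synced, "bytes_written": size}


if __name__ == "__main__":
    result = compare()
    print(f"buffered: {result['buffered_s']:.4f} s")
    print(f"fsync per write: {result['synced_s']:.4f} s")
```

On a plain hard drive without a write-back cache the second figure is typically far larger than the first. A controller with a healthy battery-backed cache should narrow the gap, so running the same probe before and after a battery check is one way to see whether the cache is still doing its job.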