Every night, with our customers' permission, Tintri VMstores upload a summary of their health status and activity. This not only lets Tintri support proactively reach out to customers, but also gives Tintri engineering a better sense of how our product is actually used, and what areas might need attention. Read on for a data dive with this information—it helps us so we can help you.
I’ve previously written about the impact of virtualization on the NFS operation mix, and on data from Tintri VMstore autosupports about VM sizes. I thought I’d follow up by taking a look at the size of read and write operations.
The data I present here was collected from 2000 real-world systems, with several days of coverage from each system. It covers only NFS (not SMB, which the VMstore also supports) but does include both VMware and non-VMware hypervisors.
Tintri does not currently track operation sizes per-VM, only at the NFS layer. This tracking uses a histogram with power-of-two bucket sizes, and the operation count in each bucket is reported to the nearest power of two in nightly autosupports.
While this lets us cover a broad range of values at low cost, it does limit our ability to look at interesting things like the number of odd-sized reads or writes, and may reduce the accuracy of results. When a read size exceeds a power-of-two boundary, it is rounded down for reporting purposes.
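To make the bucketing concrete, here is a minimal sketch of a power-of-two size histogram. It is an illustration only, not the VMstore's actual code: the function names and the sample sizes are hypothetical, and it assumes that every operation (read or write) is rounded down to the power of two at or below its size.

```python
from collections import Counter

def bucket_for(size_bytes: int) -> int:
    """Round a positive operation size down to a power of two (assumed policy)."""
    if size_bytes < 1:
        raise ValueError("operation size must be positive")
    return 1 << (size_bytes.bit_length() - 1)

def build_histogram(sizes):
    """Count operations per power-of-two bucket."""
    return Counter(bucket_for(s) for s in sizes)

# Hypothetical operation sizes in bytes; these are not measured data.
sample_sizes = [64, 512, 4096, 4096, 6000, 8192, 65536, 70000]
histogram = build_histogram(sample_sizes)

for bucket in sorted(histogram):
    print(f"{bucket:>6} B: {histogram[bucket]}")
```

Note that the odd-sized 6000-byte operation lands in the 4096-byte bucket and becomes indistinguishable from a true 4KB operation, which is exactly the limitation described above.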
Looking at the data as a whole, 4KB operations clearly still dominate the virtualized data center. About 47% of writes, and 25% of reads, were 4KB in size. However, the most common read size was 8KB, at 30%.
There is a surprising spike at 64 bytes. This is caused by the presence of VMware lock files, which are frequently updated using a small write. Across the Tintri customers in this study, more than 8% of the write operations were this size, almost all of them likely coming from accesses to the lock files. The only other small write size with any significant number of operations was 8 bytes, at 0.12% of the sample.
While there is a corresponding spike in the number of 64-byte reads compared to other small read sizes, it constitutes only 0.004% of read operations. Based on the assumption that all of these operations are lock file accesses, we can estimate that a VMware lock file is written 31 times as often as it is read.
The minimum SCSI operation we would expect to see from a VM is 512 bytes (one disk sector). These small operations constitute only 2.6% of reads, and 5.1% of writes.
Large reads or writes (64KB or more) are much more common. 21% of reads and 8% of writes are large, and in fact most bytes are transferred in large operations. Here is the same data, with each operation weighted by the number of bytes transferred instead of the number of I/Os:
Nearly half of all bytes read or written are in 256KB blocks. 64KB is overrepresented compared to 128KB; one possible explanation is that VMware Storage vMotion uses a 64KB size.
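The shift from counting I/Os to counting bytes is simple arithmetic: multiply each bucket's operation count by its size, then normalize. The sketch below uses made-up counts (not the autosupport data) and a hypothetical helper name to show how a small share of operations can carry most of the bytes.

```python
def byte_weighted_shares(op_counts):
    """Convert {size_bytes: op_count} into {size_bytes: fraction of total bytes}."""
    total_bytes = sum(size * count for size, count in op_counts.items())
    return {size: size * count / total_bytes for size, count in op_counts.items()}

# Hypothetical operation counts per size bucket; not measured data.
op_counts = {4096: 700, 8192: 200, 262144: 100}
total_ops = sum(op_counts.values())
shares = byte_weighted_shares(op_counts)

for size in sorted(op_counts):
    op_share = op_counts[size] / total_ops
    print(f"{size:>6} B: {op_share:6.1%} of ops, {shares[size]:6.1%} of bytes")
```

In this invented example the 256KB bucket holds only 10% of the operations but well over half of the bytes, which is the same qualitative effect seen in the real data.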
The data presented here average over many different arrays and many different times of day. One technique for describing the data is to use “K-means clustering”. This algorithm divides the eight million measurements in my data set into clusters based on how closely their distributions of access sizes match, treating each distribution as a point in a multi-dimensional space. I found that K=5 gave the clusters that seemed most distinct. Each cluster is the “center” of about 20% of the measurements, so we can think of them as representing five approximately equally common access patterns. (However, there may not be an equal number of accesses in each cluster, only an equal number of 10-minute samples.)
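For readers unfamiliar with the technique, here is a minimal pure-Python sketch of K-means (Lloyd's algorithm). It is not the tooling used for this analysis, which is not described here. Each sample is a hypothetical distribution over four size buckets (4KB, 8KB, 64KB, 256KB); the sample values, the bucket choice, K=3, and the fixed seed are all assumptions made for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def kmeans(samples, k, iterations=50, seed=0):
    """Return (centroids, assignment) after Lloyd's algorithm."""
    rng = random.Random(seed)
    # Start from k randomly chosen samples.
    centroids = [list(s) for s in rng.sample(samples, k)]
    assignment = [0] * len(samples)
    for _ in range(iterations):
        # Assign each sample to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(s, centroids[c]))
            for s in samples
        ]
        # Move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [s for s, a in zip(samples, assignment) if a == c]
            new_centroids.append(mean_vector(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment

# Hypothetical 10-minute samples: fraction of reads at 4KB, 8KB, 64KB, 256KB.
samples = [
    [0.80, 0.10, 0.05, 0.05], [0.75, 0.15, 0.05, 0.05],
    [0.10, 0.80, 0.05, 0.05], [0.15, 0.75, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.80], [0.05, 0.10, 0.10, 0.75],
]
k = 3
centroids, assignment = kmeans(samples, k)
for index, centroid in enumerate(centroids):
    print(index, [round(x, 2) for x in centroid], assignment.count(index))
```

Because each sample is a distribution that sums to one, each centroid is also a distribution, which is why a centroid can be read directly as a "typical" mix of sizes. K-means is sensitive to its starting points, so a real analysis would normally try several seeds.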
For reads, the five clusters are:
That is, nearly 80% of the time we would expect the VM workload to have a predominant single read size! Here is a doughnut graph showing the distribution of sizes in the centroid of each cluster:
For writes, the story is similar but a little more complicated. The five clusters can be described as:
It seems likely that backup workloads drive the large-block read periods in my data set; there is no corresponding heavy large-block write workload. I have not yet attempted to correlate the mix of operation sizes with the time of day, nor with the number of IOPS, but these might make interesting follow-up studies.
There are two practical lessons I take away from this data. The first is about benchmarking, the second is about quality of service on mixed workloads.
The behavior of virtualized systems is complex, and measuring performance on just one or two block sizes is likely to represent only 20-40% of the observed workload at best. If a storage system is benchmarked at 8KB reads only, that test ignores 70% of real-world reads.
Additionally, most measurement results are for a fixed block size, when real-world results tell us that the most common read and write sizes are different! A better model might be to combine writes of 4KB with reads of 8KB. A benchmark needs to more closely resemble this sort of real-world data in order to provide guidance. The best approach is to use your own workloads, but a whole-stack benchmark such as IOMark or VDBench provides a more realistic mix.
The second lesson is that small- and large-block I/O coexist, and the storage system needs to gracefully handle a mixture, and a mixture that changes over time. At 10Gbps, a 64KB NFS read takes about 50 microseconds to send. If multiple such requests are in flight at once, an unrelated VM’s traffic could see 200 microseconds or more of extra latency. Recall that while most read requests are small in our data set, most bytes are sent in response to large reads. So, latency-sensitive tasks may encounter this situation relatively frequently.
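The latency figures above follow from the time needed to put the bytes on the wire. Here is a quick back-of-the-envelope check; it counts payload only and ignores protocol headers and framing, and the queue depth of four is an assumption chosen to illustrate the "200 microseconds or more" case.

```python
def serialization_delay_us(size_bytes: int, link_gbps: float) -> float:
    """Microseconds needed to transmit size_bytes on a link of link_gbps."""
    bits = size_bytes * 8
    return bits / (link_gbps * 1e9) * 1e6

one_read = serialization_delay_us(64 * 1024, 10.0)
queued = 4 * one_read  # assumed: four 64KB responses ahead in the queue

print(f"one 64KB response:   {one_read:.1f} us")
print(f"four queued 64KB:    {queued:.1f} us")
```

A single 64KB response works out to roughly 52 microseconds, so a small read that arrives behind four of them waits a little over 200 microseconds before its own response can start.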
The Tintri VMstore includes an automatic mechanism to adapt to this mix of read sizes. We perform fair sharing per VM on the outgoing network link. So, if one VM has made several large reads, other VMs’ responses will be sent to the network first. This ensures that small-block workloads do not see excess latency from queueing up behind a sequence of large-block reads. This is an important optimization for isolating latency-sensitive VMs from backup traffic or other noisy neighbors.
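To illustrate the general idea of per-VM fair sharing, here is a toy round-robin scheduler. It is not the VMstore's implementation, whose details are not given here: the class name, the VM names, and the choice of simple one-response-per-VM round-robin are all assumptions. A byte-aware scheme such as deficit round-robin would share bandwidth more evenly.

```python
from collections import OrderedDict, deque

class PerVmRoundRobin:
    """Toy scheduler: one pending response per VM per pass (hypothetical)."""

    def __init__(self):
        self._queues = OrderedDict()  # vm name -> deque of pending responses

    def enqueue(self, vm, response):
        self._queues.setdefault(vm, deque()).append(response)

    def drain(self):
        """Yield (vm, response) pairs in round-robin order across VMs."""
        while self._queues:
            for vm in list(self._queues):
                queue = self._queues[vm]
                yield vm, queue.popleft()
                if not queue:
                    del self._queues[vm]

scheduler = PerVmRoundRobin()
# A hypothetical backup VM issues three large reads before a database VM's small read.
for _ in range(3):
    scheduler.enqueue("backup-vm", "64KB response")
scheduler.enqueue("db-vm", "4KB response")

order = [vm for vm, _ in scheduler.drain()]
print(order)  # ['backup-vm', 'db-vm', 'backup-vm', 'backup-vm']
```

With a single first-in, first-out queue the small read would have been sent fourth; with per-VM queues it goes out second, behind only one large response.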
More broadly, I think that the data here shows some familiar features but also some surprises. 4KB is still “king” of the workload. This is expected given that common Linux and Windows file systems still use 4KB blocks. But 8KB reads are more frequently observed, and larger sizes are not uncommon. This suggests that guest file systems are succeeding at allocating data sequentially, even though the file systems themselves have not evolved their block size in decades.
Our observation that most bytes are transferred in large block sizes is also not surprising, but is a good confirmation that large-block performance is still important in the virtualized workload. If much of this traffic is backup, recovery, or live migration, perhaps it could be productively offloaded onto the VMstore’s built-in data protection and replication features.
The presence of lots of 64-byte writes is a definite signature of a VMware workload. The frequency of such lock file writes is probably increased by the fact that locking continues even if the application itself is idle. It might be worth looking at whether special-case handling of these lock files would provide overall performance benefit.
Our ability to classify different loads by a dominant read or write size is, I feel, the most interesting part of this analysis. Although all of the clusters show a mixture of sizes, we see that 80% of the time, read traffic is dominated by a single block size. (Recall that each cluster represents 20% of the collected samples across both time intervals and VMstores.)
The VMstore uses an 8KB block size internally, which our measurements show works well across a range of different request sizes. But a “phase change” to a different dominant size could be an interesting signal to show to the user, or could drive internal adaptations to the changing workload.
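As a thought experiment, detecting such a phase change could be as simple as tracking the dominant bucket across consecutive samples. The sketch below is speculative and does not describe an existing VMstore feature: the function names, the 50% dominance threshold, and the sample histograms are all assumptions.

```python
def dominant_size(histogram, threshold=0.5):
    """Return the bucket holding at least `threshold` of operations, else None."""
    total = sum(histogram.values())
    if total == 0:
        return None
    size, count = max(histogram.items(), key=lambda item: item[1])
    return size if count / total >= threshold else None

def phase_changes(samples, threshold=0.5):
    """Return (index, old_size, new_size) each time the dominant size changes."""
    changes = []
    previous = None
    for index, histogram in enumerate(samples):
        current = dominant_size(histogram, threshold)
        if current is not None and previous is not None and current != previous:
            changes.append((index, previous, current))
        if current is not None:
            previous = current
    return changes

# Hypothetical consecutive 10-minute read histograms: {size_bytes: op_count}.
read_samples = [
    {4096: 90, 8192: 10},
    {4096: 80, 8192: 20},
    {4096: 10, 262144: 90},
]
changes = phase_changes(read_samples)
print(changes)  # [(2, 4096, 262144)]
```

Samples with no single dominant size are skipped rather than treated as a change, so a brief mixed interval does not trigger a spurious signal.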