Using machine learning to adapt to complex I/O patterns for superior QoS

Tintri Auto-QoS uses proven machine learning algorithms to optimize I/O scheduling when the I/O mix becomes complex.

  • Most storage QoS implementations work only at the LUN or volume level and allow you to set a “ceiling” on I/O but not guarantee performance.
  • Tintri Auto-QoS allows you to guarantee both minimum and maximum performance levels for every VM and container.
  • To increase QoS efficiency, Auto-QoS uses machine learning to schedule I/O from individual VM queues, providing optimum utilization of storage resources.

The first post in this series looked at the role of machine learning in optimizing VM placement. This time we’ll take a closer look at Auto-QoS.

One of the key advantages of Tintri all-flash storage that differentiates it from every other all-flash array is our unique approach to Quality of Service (QoS). Operating at the VM and container level allows Tintri Auto-QoS to be smarter and work better than any other QoS implementation, making it a cornerstone of Tintri autonomous operation. The Tintri enterprise cloud platform is designed for autonomous operation, freeing your infrastructure, cloud, and DevOps teams from routine management and allowing you to focus on higher-value tasks.

Overcoming storage QoS challenges

Most IT teams are working to modernize infrastructure and consolidate workloads to streamline operations, but if you can’t deliver predictable performance for workloads sharing the same storage, your ability to consolidate is severely limited. In the absence of effective QoS mechanisms, the only option is to over-provision storage—effectively putting you right back where you started.

Most storage QoS implementations have two big limitations:

  • They only allow you to set a performance ceiling. This means you can limit the impact of performance “hogs,” but you can’t guarantee performance for important workloads.
  • They work at LUN or volume granularity. It’s a virtualized world, and typical storage LUNs may contain tens or hundreds of VMs. It’s difficult to determine what QoS settings are appropriate for each container and hard to predict the effect of QoS on the individual VMs inside. Plus, the I/O profiles of the applications running inside each VM can be wildly different.

The few QoS implementations that overcome the first limitation (by allowing you to establish both a floor and a ceiling) don't overcome the second. You are forced to set limits on a LUN with little or no visibility into what's happening with the VMs inside it. Getting good performance results may require you to work out the right limits on every LUN or volume, which ends up being little better than a guessing game.

A previous blog described how Tintri Auto-QoS delivers guaranteed performance with fine-grained control. Tintri accomplishes this with:

  • Per-VM performance isolation. The Tintri QoS scheduler maintains an I/O queue for every VM, using each VM's I/O request size and per-request overhead to determine the cost of every I/O in the system. I/O from each queue is scheduled proportionally into the pipeline for execution, ensuring resources are allocated fairly. It's important to note that storage that lacks visibility at the VM level cannot offer this capability.
  • Per-VM performance protection. Minimum and maximum performance settings can be enabled on individual VMs or sets of VMs should that become necessary, giving you fine-grained control when you need it.

What this means in practice is that noisy neighbor and I/O blender problems are a thing of the past, and, for most workloads, you get good performance without having to do anything. You can set specific limits on particular VMs as needed. Because Tintri has much better visibility of I/O performance from the system level down to the level of the individual VM or container, the risk that a storage system will run out of performance unexpectedly is minimized. This allows Tintri storage to allocate performance resources efficiently based on actual I/O patterns.
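To make the per-VM scheduling idea concrete, here is a minimal sketch of cost-based proportional admission across per-VM queues. This is an illustration of the general technique, not Tintri's implementation; the cost constants, function names, and round-robin policy are all assumptions.

```python
# Hypothetical sketch of per-VM proportional I/O scheduling. The cost of a
# request is modeled as a fixed per-request overhead plus a size-dependent
# component, and requests are admitted round-robin across VM queues until
# a cost budget is spent, so no single VM can monopolize the pipeline.

from collections import deque

PER_REQUEST_OVERHEAD = 1.0  # assumed fixed cost per request (arbitrary units)
COST_PER_KB = 0.1           # assumed incremental cost per KB transferred

def request_cost(size_kb):
    """Cost of a single I/O: fixed overhead plus a size-dependent part."""
    return PER_REQUEST_OVERHEAD + COST_PER_KB * size_kb

def schedule(vm_queues, budget):
    """Admit requests round-robin across VM queues until the budget runs out."""
    admitted = []
    spent = 0.0
    while spent < budget and any(vm_queues.values()):
        for vm, queue in vm_queues.items():
            if not queue:
                continue
            cost = request_cost(queue[0])
            if spent + cost > budget:
                return admitted
            admitted.append((vm, queue.popleft()))
            spent += cost
    return admitted

# Example: a "noisy" VM issuing large I/Os next to a quiet VM issuing small ones.
queues = {
    "noisy-vm": deque([256] * 10),  # ten 256 KB requests
    "quiet-vm": deque([8] * 10),    # ten 8 KB requests
}
batch = schedule(queues, budget=60.0)
```

Because admission interleaves the queues, the quiet VM keeps getting its small requests serviced even while the noisy neighbor is streaming large ones.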

How can storage accommodate complex I/O workloads?

The various applications running inside your VMs and containers can have very different I/O patterns. Some applications may be streaming data in big chunks, while others are doing small, random transactional I/Os. Some applications are write heavy, while others do mostly reads. As you would expect, these varying I/O behaviors don’t all have the same impact on performance.

Naturally, every storage system has an upper limit on the performance it can deliver. Tintri all-flash storage can mete out performance resources more accurately and efficiently and achieve higher utilization without running into trouble. Our systems characterize varying workloads with precision, allowing the software to know when performance resources have been overprovisioned—and act accordingly. The software monitors request sizes and types, as well as the load (queue depth) for each VM, and then admits requests from each VM queue accordingly.

Using machine learning to accommodate mixed I/O

So how does Tintri software determine how much work to schedule at one time? If the incoming requests are uniform across the set of VMs, it simply measures the throughput achieved for various queue depths to find the one that maximizes throughput.
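For the uniform case, the search described above can be sketched very simply: sweep a set of candidate queue depths, measure throughput at each, and keep the best. The throughput numbers below are invented for illustration; only the shape of the curve (rising, then flattening and dipping at saturation) matters.

```python
# Illustrative sketch, not Tintri's code: with a uniform workload, pick the
# queue depth that maximizes measured throughput.

def best_queue_depth(measure_throughput, candidate_depths):
    """measure_throughput(depth) -> observed throughput (e.g. MB/s)."""
    return max(candidate_depths, key=measure_throughput)

# Toy throughput curve: rises with depth, then dips as the device saturates.
curve = {1: 120, 2: 230, 4: 410, 8: 540, 16: 560, 32: 530}
depth = best_queue_depth(curve.get, sorted(curve))
```

With this made-up curve the search settles on a depth of 16, the deepest queue before throughput starts to fall off.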

If requests are a mix of types (reads vs. writes, small vs. large), the problem is more complex. This is where machine learning comes in. Tintri software uses linear regression algorithms to estimate how many requests of each size and type to admit to achieve optimum throughput.

In general, requests of various types share some resources (e.g., all requests use CPU) and may also use distinct resources (e.g., only writes use NVRAM). Throughput will be proportional when requests use shared resources and additive when the mix of requests uses distinct resources. By using smarter scheduling, Tintri is able to accommodate dynamically changing I/O patterns to deliver more performance.
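The regression idea can be sketched as follows: model the service time of a batch as a linear combination of per-type request counts, then fit the per-type costs from observed batches. The request types, costs, and sample data here are invented for illustration; Tintri's actual model is not public.

```python
# Hedged sketch of fitting per-type request costs with least-squares
# regression. Columns of X are counts of (small reads, large reads, writes)
# in a batch; y is the batch service time in ms, generated here from
# assumed "true" per-type costs so the fit can be checked.

import numpy as np

X = np.array([
    [50, 10, 5],
    [20, 30, 10],
    [80, 5, 2],
    [10, 20, 40],
    [40, 15, 15],
], dtype=float)
true_costs = np.array([0.2, 1.0, 1.5])  # assumed ms per request, by type
y = X @ true_costs                       # observed batch service times

# Fit per-type costs; with these estimates, a scheduler can predict how
# many requests of each type fit into a fixed time budget.
est_costs, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In practice the observations would be noisy and the model refit continuously as the workload mix drifts, which is what lets the scheduler adapt to dynamically changing I/O patterns.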

Direct visibility of performance data

Another Tintri performance advantage is that Tintri Analytics allows you to visualize all this performance information directly at the VM level should it become necessary for troubleshooting. Varying I/O sizes and types are characterized in terms of “normalized IOPS,” making it possible to easily compare activity across VMs as well as see the overall impact on the storage system. No other all-flash array provides this level of visibility.
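A toy version of the normalization shows why it makes mixed workloads comparable. The 8 KB baseline and the counting rule below are assumptions for illustration; the exact normalization Tintri Analytics applies may differ.

```python
# Sketch of "normalized IOPS": weight raw operation counts by request size
# so large and small I/Os become comparable. An 8 KB baseline is assumed.

import math

BASELINE_KB = 8

def normalized_iops(requests_kb):
    """Each request counts as ceil(size / baseline) normalized operations."""
    return sum(max(1, math.ceil(size / BASELINE_KB)) for size in requests_kb)

# One 256 KB streaming read has the same normalized weight as 32 small
# 8 KB reads, reflecting its larger impact on the storage system.
streaming = normalized_iops([256])
transactional = normalized_iops([8] * 32)
```

Under this scheme, a VM doing a few big sequential I/Os and a VM doing many small random I/Os can be compared on the same axis.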

A smarter approach to storage performance

At Tintri, our goal is to deliver the simplest management experience possible. Auto-QoS ensures that every VM running on Tintri storage delivers good performance. And QoS works hand in glove with Tintri VM Scale-out to optimize your entire environment.

As you learned in the previous blog, Tintri VM Scale-out accurately predicts future capacity and performance needs to optimize VM placement. As a result, Tintri all-flash storage operates autonomously, delivers better performance with little or no performance tuning, and makes smart recommendations to make sure your environment remains optimal. Both Auto-QoS and VM Scale-out give you fine-grained control to address unique requirements.

Next time we’ll look at machine learning in Tintri Analytics.

Sumedh Sakdeo / Sep 26, 2017

Sumedh is a Sr. Staff Engineer at Tintri and has worked on groundbreaking file systems for storage infrastructures. His areas of interest are Virtualization, File Systems, QoS, and Lockless Algorithms.