
Help! My HCI has fallen and it can't get up

  • KEY TAKEAWAYS
  • Spectre, a vulnerability born of a processor "feature," can affect storage.
  • Architectures with shared systems—namely hyperconverged infrastructures—suffer from the most performance degradation and patching pain.
  • Tintri is unaffected by Spectre and offers per-VM QoS to permanently resolve any pains around predictable performance.

 

Chances are this is not the first you’ve read about the Intel security vulnerabilities referred to as Meltdown and Spectre (we’ll use the latter as shorthand). But we want to take the opportunity to be clear about Tintri’s position: some architectures are clearly more compromised than others.

Before we get into it, for those who are newer to the topic … what is Spectre?

In brief, Spectre is a vulnerability rooted in speculative execution, a performance ‘feature’ of affected processors. Since we’re talking about Intel’s processors, their reach means the issue is quite widespread. The vulnerability allows rogue processes to access privileged information, including kernel memory. The flaw can be mitigated with OS and firmware patches, but those patches can have a negative effect on processor performance.

How is Tintri impacted?

Tintri arrays are not affected by Spectre. Tintri storage is not a shared system—there is no provision for any user code to run and exploit vulnerabilities. So, we do not need to issue any patches, and since our management system (Tintri Global Center) is outside the IO path, there’s no concern about degraded performance.

Tintri Analytics, our SaaS-based analytics toolset, runs on AWS, and Amazon has already patched the underlying infrastructure. Since Tintri Analytics is a SaaS application managed by Tintri, cloud elasticity means customers don’t have to worry about performance degradation.

How are other vendors impacted?

Storage vendors with closed appliances are unlikely to require a patch, and many legacy vendors have published statements to that effect. But for those issuing patches, the performance impact cannot be brushed aside: every application that shares a LUN or volume with a VM being patched could be affected by the IO storm that results from patching and rebooting guest VMs.

However, the story is even rougher for hyperconverged vendors, specifically those that run storage as a VM or through a mid-level file system. Consider a hyperconverged product in which a guest VM runs on the file system of a virtual storage controller, which itself runs as a VM on yet another file system. Each IO crosses numerous syscalls, and since the patches make every syscall more expensive, performance suffers; some estimates indicate a 10% to 30% impact.
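To see why layering matters, here is a back-of-the-envelope model. All numbers below (per-syscall penalty, base latency, syscall counts) are illustrative assumptions, not vendor measurements: if a mitigation adds a fixed cost to every syscall, an IO path that crosses more layers pays that cost more times.

```python
# Back-of-the-envelope model of patch overhead on an IO path.
# All numbers are illustrative assumptions, not measurements.

SYSCALL_PENALTY_US = 1.0  # assumed extra cost per syscall after patching

def patched_latency(base_latency_us, syscalls_per_io):
    """IO latency after adding a fixed per-syscall mitigation cost."""
    return base_latency_us + syscalls_per_io * SYSCALL_PENALTY_US

def overhead_pct(base_latency_us, syscalls_per_io):
    patched = patched_latency(base_latency_us, syscalls_per_io)
    return (patched - base_latency_us) / base_latency_us * 100

# Direct path to dedicated storage: few syscalls per IO.
direct = overhead_pct(base_latency_us=100.0, syscalls_per_io=4)

# Nested HCI path (guest FS -> storage-controller VM -> host FS)
# multiplies the syscalls each IO must cross.
nested = overhead_pct(base_latency_us=100.0, syscalls_per_io=24)

print(f"direct path overhead: {direct:.0f}%")   # 4%
print(f"nested path overhead: {nested:.0f}%")   # 24%
```

The same per-syscall penalty that is a rounding error on a short IO path lands in that estimated 10% to 30% range once the path is nested.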

HCI is already burdened by a CPU and licensing tax as a result of its shared storage and compute architecture. This performance tax only exacerbates the problem. Customers may need to add more systems and licenses just to get performance back to previous levels.

That’s why some HCI vendors have already released patches. But every VM and host needs to be patched, so the maintenance window is huge. It’s also risky: if you don’t have enough headroom (remember that 10% to 30% hit), you can’t effectively evacuate and rebalance nodes. That could trigger an application failure, or worse, a node failure that creates a domino effect across the infrastructure.
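The headroom math can be sketched simply (the node count, utilization, and overhead figures below are assumed for illustration): to evacuate one node for patching, the remaining nodes must absorb its load on top of their own, and the patch overhead shrinks the capacity available to do so.

```python
# Can a cluster evacuate one node for patching? Illustrative model:
# utilization and patch_overhead are fractions of total cluster capacity.

def can_evacuate(nodes, utilization, patch_overhead):
    """After losing one node and paying the patch overhead, does the
    remaining capacity still cover the cluster's current load?"""
    remaining_capacity = (nodes - 1) / nodes * (1 - patch_overhead)
    return remaining_capacity >= utilization

# 4 nodes at 60% utilization: evacuation works before the penalty...
print(can_evacuate(4, 0.60, 0.00))  # True  (0.75 remaining >= 0.60)
# ...but a 30% performance hit erases the headroom.
print(can_evacuate(4, 0.60, 0.30))  # False (0.525 remaining < 0.60)
```

A cluster that comfortably tolerated a node outage before patching can fail the same test afterward, which is exactly the domino-effect risk described above.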

This is where Tintri stands apart with its per-VM automatic QoS. We’ve made the design choice to isolate storage from compute, and we’ve built into our file system foundation the ability to isolate every individual application for guaranteed performance.

Bottom line

The Intel bug highlights two things. First is the pain of unpredictable performance. Of course, lack of predictability is never a concern for Tintri customers because we isolate storage from compute, and isolate every individual application. We’re the only system that can provide per-VM quality of service to guarantee performance resources for every VM. Per-VM QoS can be set manually or handled autonomously.
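Per-VM QoS is commonly built on rate-limiting primitives such as token buckets; the sketch below illustrates that general idea, with one limiter per VM so a noisy neighbor cannot consume another VM’s IOPS budget. This is a generic textbook technique with made-up VM names, not Tintri’s actual implementation.

```python
import time

class TokenBucket:
    """Simple token-bucket IOPS limiter. One bucket per VM isolates
    each VM's performance budget from its neighbors."""

    def __init__(self, iops_limit):
        self.rate = iops_limit           # tokens (IOs) refilled per second
        self.tokens = float(iops_limit)  # start with a full burst allowance
        self.last = time.monotonic()

    def allow_io(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over budget: this IO must be queued or delayed

# One independent bucket per VM (names are hypothetical).
buckets = {"vm-a": TokenBucket(iops_limit=2), "vm-b": TokenBucket(iops_limit=1000)}

# A burst of 10 IOs from vm-a: only its own small budget is admitted,
# and vm-b's budget is untouched.
burst = sum(buckets["vm-a"].allow_io() for _ in range(10))
print(f"vm-a IOs admitted from burst: {burst}")
```

Whether limits like these are set manually or managed autonomously, the point is the same: each VM’s budget is enforced in isolation, so one VM’s patch-and-reboot IO storm cannot starve the others.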

Second is the risk of hyperconverged infrastructures. Sure, they promised web-scale, but the reality is that with mixed workloads, beyond a few nodes all sorts of problems surface. That includes the ‘best practice’ of balancing nodes, the challenge of moving data across nodes and now the danger of shared systems.

These risks are exposed by the Intel bug, but they’re certainly not new. The patching process outlined above will have to be repeated for any widespread virus or vulnerability.

So, what’s the best way to avoid a meltdown? Stay cool with Tintri.  

 

Satinder Sharma / Jan 10, 2018

Satinder Sharma is a subject matter expert for storage & virtualization.  He works for Tintri, based out of Toronto, Canada. He is responsible for evangelizing and helping customers design th...more
