Here at Tintri, the technical marketing lab is our production environment, which we rely on for important demos, new feature validation, education and other critical activities. So we do our best to make sure it's in tip-top shape, 24/7/365.
Along the same lines, we address outages or service degradation ASAP. Of course, when a strange issue surfaced last week, we jumped right in.
On the case
While running some maintenance activities, we noticed the following:
The Tintri VMstore UI showed that one of the busier VMs was experiencing latency, around half of which was due to network delay. This looked odd for several reasons:
The total network throughput at the VMstore level was very low
There is a separate storage network, so non-storage noise can’t have an impact on this network’s load
The environment was supposed to be quiet at that time of night
It was pretty obvious that this was storage traffic, but it wasn't coming from one of the VMs in this VMstore.
Almost by instinct, we went to check our vCenter. It showed the following:
Here we learned that many packets were being sent from one host (which we had already assumed), but we still didn’t know which VMs were behind it. That wasn’t very helpful.
But we were so busy looking deep into the details, we almost forgot about the one place where we could see the complete picture: Tintri Global Center (TGC).
TGC consolidates data from all our VMstores, so it holds a complete view of the entire infrastructure.
And sure enough, we logged in and found the answer immediately.
The top graph shows the total throughput across all 549 VMs in our infrastructure, which are spread across our multiple Tintri VMstores. When we clicked a point in the graph (the red line), we saw a list of the VMs that contributed to the throughput at that point in time.
We could immediately see that VM “SHARE2012” was waaaay up there at the top.
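To make the idea concrete, here is a minimal sketch of the kind of roll-up that view performs: take per-VM throughput samples from every VMstore, pick one point in time, and rank the VMs by their share of the total. This is not the TGC API or its implementation. The sample data, the VMstore names, the VM names other than SHARE2012, and the `top_contributors` function are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical samples: (vmstore, vm_name, timestamp, throughput in MBps).
# All values are invented for illustration; this is not real Tintri data.
SAMPLES = [
    ("vmstore-a", "SHARE2012", 100, 410.0),
    ("vmstore-a", "web01", 100, 12.5),
    ("vmstore-b", "db01", 100, 30.0),
    ("vmstore-b", "build01", 100, 7.5),
    ("vmstore-a", "SHARE2012", 200, 5.0),
]


def top_contributors(samples, timestamp, limit=3):
    """Rank VMs by throughput at one timestamp, across all VMstores.

    Returns a list of (vm_name, mbps, share_of_total) tuples, largest first.
    """
    totals = defaultdict(float)
    for _vmstore, vm, ts, mbps in samples:
        if ts == timestamp:
            totals[vm] += mbps

    if not totals:
        # No samples at that point in time.
        return []

    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(vm, mbps, mbps / grand_total) for vm, mbps in ranked[:limit]]


if __name__ == "__main__":
    for vm, mbps, share in top_contributors(SAMPLES, timestamp=100):
        print(f"{vm:10s} {mbps:7.1f} MBps  {share:.0%} of total")
```

The point of the sketch is that the ranking only works when samples from every VMstore land in one place. Looking at a single VMstore, or a single host in vCenter, shows only part of the list.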
As it turns out, someone was running IO-Meter on the “SHARE2012” VM with some extreme settings for testing purposes. This VM created load on the entire network, which slowed down even VMs that were not on the same VMstore or ESXi host.
This turned out to be a classic situation where a single-pane-of-glass view, an oft-misused term, was invaluable in diagnosing the root of a cross-system, cross-platform problem.