CTO Kieran Harty discusses DP / DR strategies with two Tintri customers.
The recent WannaCry ransomware crisis reminds us all how important it is to have a good data protection and disaster recovery strategy (DP / DR) in place. Tintri CTO Kieran Harty (KH) recently had the chance to speak with two Tintri customers about their DP / DR strategies.
Matt Crape (MC) is the IT Manager at The C3 Group of Companies, a multi-discipline engineering/contracting organization specializing in building science, restoration, industrial maintenance, foundation systems, environmental technologies, and advanced materials.
Geoff Grice (GG) is Head of IT responsible for applications and systems at CMC Coal Marketing Company (CMC) in Ireland. CMC coordinates the sale and delivery of high-quality thermal coal to numerous power stations around the world.
KH: Where are you with DR? Are you rethinking your strategy?
MC: DR is something I always think about and treat as a continuous process. A lot of organizations are hesitant to throw a bulk sum of money into DR because they may never see value from the investment. I tend to think of it more like insurance: you hope you don't need it, but if you do need it, it can be a lifesaver.
GG: Our DR strategy has been quite successful for many years, protecting company information using VMware and Tintri technology. But like many global companies, we are now subject to the European General Data Protection Regulation (GDPR), which may have a direct effect on where we are allowed to store personal data that belongs to EU citizens. So we might need to move our DR site and replication out of the USA and back to Europe in order to meet the newly established legislation. I think a lot of companies will be looking at this same thing over the next few months.
KH: Is your approach more complement or rip-and-replace?
MC: I am always looking to make improvements. Sometimes you need to dip your toes in the water to get a DR plan off the ground. This might be something as simple as moving a copy of your data offsite. Over time you can grow the DR strategy and make changes as funds become available or as business needs change. I like to target key components and make changes when possible.
I can see a complete rip-and-replace making sense if there is a drastic change in infrastructure and your current DR equipment can't handle it. But you need to figure out what your business needs are beforehand in order to operate during a DR situation. Senior management might state that they want “five 9s,” which is fine, but that can often change once they see the price tag.
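To put the “five 9s” target in perspective, the arithmetic can be sketched in a few lines of Python. This is an illustrative calculation only, assuming a 365-day year; the function name `allowed_downtime_minutes` is hypothetical and not part of any product.

```python
# Illustrative arithmetic: how much downtime per year an availability
# target of N "nines" allows. Assumes a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(nines: int) -> float:
    """Return the yearly downtime budget, in minutes, for N nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    downtime_fraction = 10.0 ** (-nines)  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * downtime_fraction


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes/year")
```

Under these assumptions, five 9s leaves a budget of roughly five minutes of downtime per year, compared with nearly nine hours for three 9s, which helps explain why the price tag changes the conversation.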
GG: I like to have a focused solution or combination of solutions in place that very clearly meets the business requirements from Day One. In the past I have seen complementary or evolutionary “bolt on”-type approaches fall short in meeting expectations. For something as crucial as DR or Data Protection, you really want to know that your system is going to deliver from the get-go. For that reason I prefer to review the whole requirement holistically and design from the ground up.
KH: In terms of DP / DR, what do you think about standardizing on one vendor?
MC: I am generally a big fan of standardizing. There are many reasons, including the fact that you have one point of contact. If my environment is made up of three different primary storage vendors, four different secondary storage vendors, and two DP/DR solution vendors—well, that can be a lot of calls to make if I need support. A much more streamlined support structure can not only save money, but also save the business in case of a disaster.
However, blind standardization can pose potential rip-and-replace risks. Focusing on the benefits you are trying to achieve will help you to work the separate pieces into your consistent DR plan.
GG: I like one storage vendor, but prefer mixed vendors for the whole environment—for example, servers from Dell and storage from Tintri. I would not rule out using just a single vendor, but my experience to date is that no single vendor has exactly what we need.
With regard to DP / DR, I don’t really lean either way. I like to align the technology to our functional requirements, then pick best-of-breed options to put the whole picture together. Sometimes you find overlap between two or more different products that might be involved in making the system work (e.g., Tintri replication, VMware replication, Veeam replication). In my view it’s better to have multiple options, although the tradeoff is you’re probably paying for the same functionality more than once.
KH: What about heterogeneous environments and cloud?
MC: Heterogeneous definitely has its place. When you consider cloud providers such as AWS or Microsoft Azure, it becomes a whole new ball game. Leveraging technologies like Amazon S3-based snapshot replication can definitely be worked into the mix. Cloud can be tricky:
In some cases, it can remove complexity, e.g., no equipment to maintain on-premises.
In other cases, it can add complexity, e.g., failing over a handful of services from on-premises to cloud might not be clean-cut.
GG: We've pursued a private cloud environment and that has worked well to date. I guess we can do that because we don't have an incredibly dynamic data storage requirement compared to some companies. So I remain open to public cloud and I understand the benefits of being able to grow or shrink storage repositories on demand, but it's not something we need right now. Ironically, if we hadn’t invested in Tintri, we probably would need to do something like that.
KH: Have you had success using primary storage snapshots as your first line of defense?
MC: Snapshots are fantastic if you need to roll back a lot of data in a hurry. We had an instance a few years ago where due to a seeding error (on our part) with Windows DFS, a whole division's data was wiped out. A snapshot would have recovered that data quickly, but we had multiple VMs on the same LUN. We couldn't roll back the snapshot without sacrificing data from those other VMs.
One of the technologies that initially attracted me to Tintri was the fact that it is VM-Aware Storage. We could have been back up and running in seconds by just restoring the single snapshot for that one host. Instead, we had to migrate those other VMs off of the LUN, restore the snapshot, and then migrate them back over. That was overhead that we did not want to be dealing with in a stressful situation.
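The situation MC describes can be illustrated with a toy model. This is a sketch under stated assumptions, not Tintri's or any array's API: the VM names, the version labels, and the functions `rollback_lun`, `rollback_vm`, and `collateral_vms` are all hypothetical.

```python
# Toy model: a snapshot maps each VM name to the data version captured.
# All names here are hypothetical and only illustrate the idea.

def rollback_lun(current: dict, snapshot: dict) -> dict:
    """LUN-level restore: every VM on the LUN returns to the snapshot."""
    return dict(snapshot)


def rollback_vm(current: dict, snapshot: dict, vm: str) -> dict:
    """Per-VM restore: only the named VM returns to the snapshot."""
    restored = dict(current)
    restored[vm] = snapshot[vm]
    return restored


def collateral_vms(current: dict, snapshot: dict, target: str) -> list:
    """VMs other than the target that a LUN-level restore would rewind."""
    return sorted(
        vm for vm in current
        if vm != target and current[vm] != snapshot.get(vm)
    )


snapshot = {"dfs-server": "v1", "mail": "v1", "erp": "v1"}
current = {"dfs-server": "wiped", "mail": "v2", "erp": "v2"}

if __name__ == "__main__":
    print("LUN-level restore:", rollback_lun(current, snapshot))
    print("Per-VM restore:   ", rollback_vm(current, snapshot, "dfs-server"))
    print("VMs losing data:  ", collateral_vms(current, snapshot, "dfs-server"))
```

In this model the LUN-level restore rewinds the two healthy VMs along with the damaged one, which is why those VMs had to be migrated off the LUN first. The per-VM restore touches only the VM that needs it.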
GG: I know this is not a new “gee whiz” feature; the power of Tintri snapshots for DR has been there for a number of years. But when you put them in the context of current issues affecting all businesses, you realize, “Wow, this feature is invaluable and continues to deliver year in, year out.”
Late last year one of our staff members mistakenly ran a ransomware email attachment that encrypted some shared folders and the whole PC (virtual desktop). Unfortunately the user ran this late at night, did not realize what was causing the slow PC performance, and left the system to happily encrypt itself overnight.
The next day staff arrived at my desk asking why they couldn’t access important logistics files in the shared folders. I reviewed the files and found a Locky ransom note left behind demanding payment to unlock our files.
We powered off the offending system that ran the ransomware, deleted all of the infected files on the network share, and simply spun up Tintri clones of the file server and the user’s virtual desktop from before the encryption took hold. Then it was just a matter of copying out the clean files and switching the user onto the new desktop.
Total time to recover was about 20 or 30 minutes, whereas before we had Tintri VMstores, we could have been running restore jobs for days or maybe even a week. Thanks to Locky (an early relative of WannaCry), we found out that Tintri’s cloning technology is so solid, it really does protect the efficiency of our workforce.
KH: How about any “gotchas” using snapshots?
MC: Snapshots should not be relied on as the sole backup strategy. Things happen—what if there is a huge power spike or a pipe bursts? If your snapshots are sitting on the primary storage array and the array is damaged, then you just lost your snapshots. You can look at replication, which definitely helps alleviate that danger, but if the underlying data is corrupt, then you are just replicating corrupted data. Even though there are lots of safeguards in place, snapshots should be used as a part of your data protection strategy, not as the sole tool.
GG: Having snapshots does reduce the amount of usable disk space that is available on a storage array, so there is a penalty or “cost” to having them. But as we have seen, the benefit can be great, so how do you quantify that cost? I see snapshots as a disaster recovery tool, one that I like to combine with traditional backup to give me a full-house data protection ability. When you have the combination of backup software and snapshots to rely on, it becomes an exercise in common sense as to where you allocate your storage resources.
KH: How is Tintri working with your existing solutions?
MC: We are a Veeam shop, and we've had great success using VMstore for our production workloads. With our previous storage array, we were seeing large spikes in latency when a backup would kick off. This would lead to large backup windows and a performance impact not only on the VMs being backed up, but on any other VMs in that disk group.
Since we've made the move to Tintri, we no longer see those spikes. Veeam has never let us down and adding Tintri into the mix has created a perfect marriage of the two products. In fact, the way I used to know the backups were running is I would get email alerts at night letting me know about the latency problem. I sleep better now that I don't hear my phone chiming in the middle of the night with alert notifications.
GG: We’re using Veeam, SRM, and Tintri integration through the VAAI plug-in. It’s just amazing performance all around—could not live without it. I am looking at Commvault for possible GDPR compliance, so who knows, maybe in the future I’ll have a better feel for that also.