Tintri’s Auto-QoS Improves Application Performance and Reduces Time Spent on Storage Operations to Almost Zero
SEGA Games Co., Ltd. is a Japanese multinational video game developer and publisher headquartered in Tokyo, with multiple offices around the world. SEGA Games has released a number of popular game titles in recent years, including hits such as "Phantasy Star Online 2", "Yakuza Series", "Puyopuyo!! Quest", and "Hortensia Saga" designed for PCs, as well as home game machines and smart devices. SEGA Games is also successful in many other business areas, including a cross-promotional network for smart devices called "Noah Pass" that currently has a monthly active user count (MAU) exceeding 10 million, and which continues to grow steadily as a B2B service.
Soichiro Fujise, manager of the Infrastructure & DB Section at SEGA Games, is responsible for establishing and operating the IT infrastructure for all of SEGA Games’ organizations that produce games for smart devices and B2B services. He and his team of ten engineers support the storage, network, and virtualization infrastructure for the company’s production service and development environments. The stability and availability of the IT infrastructure is of utmost importance to the company. If the infrastructure fails, even if service is interrupted for only minutes or hours or performance merely deteriorates, it has a major impact on SEGA Games’ business.
Currently, more than half of SEGA Games’ physical servers are used to host virtualization environments, where 2,000-3,000 VMs are running. According to Fujise, “total virtualization” is the best way to further improve the fault tolerance of SEGA Games’ IT infrastructure. “At present, all of our applications that require high performance run on physical servers,” noted Fujise. “However, if a physical server fails, recovery takes a lot of time because parts need to be replaced. If we had total virtualization, then even if there was a failure, we could move the virtual machine to another healthy host and resume service within a short period.”
Daily Struggles with the Sudden Appearance of “Monster VMs”
Kyohei Aso, who also works in SEGA Games’ Infrastructure & DB section, explains the reason for introducing new storage solutions. "There haven’t been any major issues with normal performance and capacity in the storage systems that we were using for virtualized environments,” he explained. “However, conventional storage does not have a function to automatically adjust QoS. A ‘Monster VM’ is a virtual machine that consumes a lot of storage I/O resources. When a Monster VM emerges, the performance of other VMs using the same storage will be greatly affected.”
According to Fujise, the appearance of Monster VMs is totally unpredictable. “If you have a VM with a lot of storage I/O at all times, you can deal with it in advance, but VMs that do not have much I/O can suddenly become ‘monsterized’. The most common reasons for this are data backup processing or log salvaging processing to check for problems with applications. When such processing begins, the storage I/O of that VM will increase suddenly. Since these operations are carried out by the service department, it is not possible for us to be aware of the situation in advance."
Because Monster VMs threatened the stable operation of all games, SEGA Games’ Infrastructure & DB section needed to deal with them as soon as they appeared. Before moving to Tintri, the IT team had to identify which VMs were causing the high I/O, manually adjust the storage QoS settings, and move the Monster VM to another host to reduce the impact on other VMs.
In reality, however, it was very difficult to determine which Monster VMs and processes were responsible for the sudden and dramatic increase in storage I/O. "Initially, we logged into all VMs and retrieved event logs to investigate what kind of processing was being carried out at the time the storage I/O suddenly increased, but in most cases it was unknown,” Fujise said. “After that, we introduced a monitoring tool that charted the I/O of each VM, but it was not useful when there was no regularity and the I/O increased rapidly for a short time. Even if we discovered the cause after a day or two, it still took longer to work on the issue after that. We would carefully examine the appropriate QoS values that did not affect the performance of the VM, examine the configuration change procedure, and in some cases, call the storage vendor and consider future responses. Sometimes all of this would take a week. After doing these procedures over and over, I asked myself, ‘Is it right that we spend so much time on this? There are other important tasks and we need to use our time efficiently!’ The Monster VMs needed to be defeated in order to stabilize our IT infrastructure to achieve total virtualization.”
SEGA Games tested infrastructure solutions from several storage companies. “During product selection, we borrowed test machines from four storage vendors and conducted PoC tests focusing on two points: how long storage I/O would stop in the event of a hardware failure, and how much improvement we would see from an operational standpoint,” explained Aso. “For the hardware failure test, we created a ‘pseudo failure’ by disconnecting the LAN cable of the test machine or breaking a hard disk. With the other vendors’ systems, I/O stopped for as long as 2-3 minutes. Tintri, however, recovered within a few seconds to tens of seconds during our tests.”
Impressed with Tintri Auto-QoS
“In terms of operational improvement, Tintri’s Auto-QoS function was overwhelmingly effective,” noted Aso. “This function monitors storage I/O for each VM and dynamically controls it when I/O is suddenly increased so as not to affect the I/O of other VMs. Some of the other companies’ storage also had auto-QoS functionality, but when we actually tested them, I/O was only mildly suppressed and the performance of other VMs was impacted. On the other hand, Tintri QoS greatly reduced the impact on other VMs."
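The idea Aso describes, capping a suddenly busy VM so its neighbors keep the I/O they need, can be illustrated with a small sketch. This is not Tintri's algorithm, which the source does not describe; it is a generic max-min fair-share allocation, and the VM names, IOPS figures, and the function name `max_min_fair_share` are all hypothetical.

```python
def max_min_fair_share(demands, capacity):
    """Split a fixed IOPS budget across VMs using max-min fairness.

    demands:  dict mapping VM name -> requested IOPS (hypothetical values)
    capacity: total IOPS the storage can serve
    VMs asking for less than an equal share get everything they ask for;
    the leftover budget is split equally among the remaining heavy VMs.
    """
    allocation = {vm: 0.0 for vm in demands}
    remaining = dict(demands)
    left = float(capacity)

    while remaining and left > 1e-9:
        share = left / len(remaining)
        satisfied = [vm for vm, d in remaining.items() if d <= share]
        if not satisfied:
            # Every remaining VM wants more than an equal share: cap them all.
            for vm in remaining:
                allocation[vm] = share
            break
        for vm in satisfied:
            allocation[vm] = float(remaining[vm])
            left -= remaining[vm]
            del remaining[vm]
    return allocation


# Hypothetical example: a backup job suddenly demands far more than the others.
demands = {"game-db": 500, "web": 300, "backup-job": 20000}
result = max_min_fair_share(demands, capacity=5000)
print(result)
```

In this example the two small VMs receive their full demand and only the "monster" is throttled to what is left, which is the behavior the test results above attribute to an effective auto-QoS function.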
After three months of testing, all members of SEGA Games’ Infrastructure & DB section agreed to purchase two Tintri systems for the company’s primary site and two more for its DR site. It took only 30 minutes to configure Tintri and about one week for the integrator to bring the whole system into operation. The back-end systems of games including "Puyopuyo!! Quest" and "Hortensia Saga" are now hosted on Tintri.
The effectiveness of the Tintri Auto-QoS function was evident. "As QoS takes place automatically, I don’t need to do anything after starting operations. Since I don’t have to do much, I worry if it is really okay,” laughed Fujise.
Aso said he no longer needs to monitor storage I/O. “I occasionally open the Tintri management screen and check VMs, sorting in descending order of IOPS. However, since Auto-QoS is working well, I have noticed that these checks are not really necessary. We first introduced a 500 VM per unit model as a guideline, and there was some concern that the capacity wouldn’t be sufficient. However, those worries seem to be unfounded.”
2.2x Data Reduction
"The data reduction rate in the real environment is 2.2x, in accordance with the Tintri catalogue specifications,” Aso said. “Although Tintri currently hosts 420 VMs, there is still more space on the system, and it is likely to host about 800 VMs in that one unit. It is also important that there is no impact on performance at all, even if we turn on the deduplication and compression functions."
More Efficient Backups
“It was not possible to use the deduplication/compression function with traditional storage environments because of performance degradation, and it took six hours to back up to the DR site each time,” Aso reported. “With Tintri, the amount of data to be transferred is greatly reduced thanks to deduplication/compression, so a one-time backup (snapshot) can be achieved in less than 20 minutes using Tintri ReplicateVM.”
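A rough way to see how data reduction shortens a replication window is the back-of-the-envelope model below. It is only a sketch: the data size and link speed are hypothetical values chosen to reproduce a six-hour baseline, and the model ignores everything except bytes on the wire. The source reports the before and after times but does not break down how much of the gain comes from deduplication and compression.

```python
def transfer_minutes(data_gb, link_gbps, reduction_ratio=1.0):
    """Minutes needed to send data_gb gigabytes over a link_gbps link.

    reduction_ratio is the dedupe/compression factor applied before sending
    (1.0 means no reduction). All inputs here are illustrative assumptions.
    """
    if data_gb < 0 or link_gbps <= 0 or reduction_ratio <= 0:
        raise ValueError("data_gb must be >= 0; link_gbps and reduction_ratio must be > 0")
    gigabits_on_wire = data_gb * 8 / reduction_ratio
    return gigabits_on_wire / link_gbps / 60


def speedup(old_minutes, new_minutes):
    """How many times shorter the new backup window is."""
    return old_minutes / new_minutes


# Hypothetical baseline: 2,700 GB over a 1 Gbps link with no data reduction.
baseline = transfer_minutes(2700, 1.0)
# Same transfer with the 2.2x data reduction rate quoted in the case study.
reduced = transfer_minutes(2700, 1.0, reduction_ratio=2.2)
print(baseline, reduced)

# Reported figures: six hours before, under 20 minutes after.
print(speedup(6 * 60, 20))
```

Under this model a 2.2x reduction alone would not take a six-hour transfer down to 20 minutes, so the reported improvement (at least 18x) presumably reflects more than the reduction ratio; the case study does not say what else contributes.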
Using Tintri Analytics
"The new cloud service called Tintri Analytics is also interesting,” noted Aso. “It analyzes the amount of storage used, so it allows you to predict when your capacity will run out and you can add a unit to avoid this. Our company still has plenty of space on our Tintri systems, but it is reassuring to know we can predict such things down the line."
SEGA Games’ Infrastructure & DB section is currently working towards its goal of “total virtualization,” by developing a virtualization environment that is capable of achieving performance comparable to the current physical servers. There is no doubt that Tintri has brought that goal closer to reality. With the time spent on regular storage operations reaching almost zero, the IT organization has more time to focus on further improving IT infrastructure and working towards future goals. “By introducing Tintri, the Monster VMs causing so many issues have disappeared. In gaming terms, this would be the moment we beat the monster and peace is restored,” Fujise concluded.
Software Publisher: Computer Games
High-I/O VMs (“Monster VMs”) were impacting the performance of other virtual machines
Wanted to virtualize remaining physical servers to reduce recovery times
Tintri Storage Systems
Improved application performance
Reduced I/O recovery times from hardware failure from 2-3 minutes to tens of seconds
Obtained a data reduction rate of 2.2x
Reduced backup times from 6 hours to under 20 minutes per backup using ReplicateVM