I’ve spent the better part of three decades leading one of the most demanding high-performance computing infrastructures in the world. One of the greatest challenges in HPC infrastructure is keeping data available and meeting the needs of the business while supporting engineers in dozens of locations around the world. Here are some key takeaways for anyone struggling with this problem.
“What we have is a data glut.” – Vernor Vinge
We are all dealing with a lot of data (some say too much data!). Shrinking chip geometries are driving a massive increase in the amount of data being generated. In addition, logging applications, compliance requirements, and the ever-expanding need to leverage the data you already have are all driving data growth.
1. Many organizations are not cleaning up data that has been created.
As a result, they are dealing with data that no one actually uses but that still consumes resources! CPUs have seen dramatic performance increases over the past decade, and most storage platforms haven’t kept up… especially NFS. This has led organizations to purchase more arrays to provide enough bandwidth for the computing grid, which in turn leads to under-utilized arrays and overspending on the most expensive tier of storage. Because almost every large company operates at a global scale, and the data needs to be local to the engineer, copying data to multiple engineering design locations becomes a choke point and a costly drain on network bandwidth. It gets worse when you add the technical debt that comes with storage infrastructure upgrades every 3-5 years. For example, let’s say you start out with 5 arrays with 200TB each on them. When it’s time to upgrade, what do you do? The opportunity here is to purchase 1PB of storage and clean up 500TB, but this is easier said than done!
2. Understanding the security requirements of the data you are storing is also another key problem today.
It can be difficult enough to understand which intellectual property (IP) should be on a need-to-know basis versus which IP can be shared more broadly. However, identifying the data is only half the problem; you then need to lock it down without causing a performance or access problem for the engineers who need to use it. Data protection is a similar issue to security; you obviously want to make sure you are protecting the critical IP and not just treating all the data the same. One failed backup may mean weeks of lost work for engineers if they need a restore.
“In God we trust. All others must bring data.” -W. Edwards Deming
3. It’s all about the data.
So where do we start to try and untangle some of this? Most data in the high-performance computing world is considered unstructured, usually at the scale of billions of files and many thousands of directories or paths. However, unstructured data does have underlying patterns. You’ll just have to work with your business partners to figure out what they are. Start with the big datasets first, then work from there. Understanding the patterns will help IT determine the best places to store this data, who really uses it, how long you should keep it, and so much more! Engagement with the data owners is crucial, and you may need to leverage some third-party software as well; believe me, it is well worth your time.
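“Start with the big datasets first” can begin with nothing fancier than a metadata walk. As a minimal sketch (the function name and the top-level-directory granularity are my own choices, not a specific tool), here is one way to surface the largest datasets under a root path so you know where to focus the cleanup conversation:

```python
import os
from collections import defaultdict

def summarize_top_level(root):
    """Aggregate total bytes and file counts under each top-level
    directory of `root`, so the biggest datasets surface first."""
    totals = defaultdict(lambda: [0, 0])  # dir name -> [bytes, files]
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = rel.split(os.sep)[0] if rel != "." else "."
        for name in filenames:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            totals[top][0] += size
            totals[top][1] += 1
    # Largest datasets first -- the place to start the conversation
    return sorted(totals.items(), key=lambda kv: kv[1][0], reverse=True)
```

At the scale of billions of files you would run something like this in parallel, or pull the same numbers from the array’s own metadata reporting, but the principle is the same: rank by size, then go talk to the owners of the top entries.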
4. Self-service is key.
Engineers or scientists should be free to provision and quickly utilize the storage services they need. They must also be able to return storage that is no longer needed. This can be scary for some IT departments, but with the right seat belts, it can be made possible. IT has made good progress in recent years in gaining insight into how data is being accessed and used from a pure metadata view, but to be useful, this must be made visible to the people actually using the data. In addition, enabling the data owner to take actions (archive, delete, etc.) must be fully automated and easy for the end user to control. Ideally, wrap all of this in a data lifecycle policy that can be automated with your storage platform.
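The core of an automated lifecycle policy is usually just a mapping from last-access age to an action. A minimal sketch, assuming hypothetical thresholds (every shop tunes these differently) and keeping the actual delete behind data-owner approval, as described above:

```python
import time

# Hypothetical policy thresholds -- tune these for your environment.
ARCHIVE_AFTER_DAYS = 180
DELETE_AFTER_DAYS = 730

def classify(last_access_epoch, now=None):
    """Map a file's last-access time to a lifecycle action.

    Returns 'keep', 'archive', or 'flag-for-delete'. Deletion is only
    flagged here; the data owner still approves it via self-service.
    """
    now = time.time() if now is None else now
    idle_days = (now - last_access_epoch) / 86400
    if idle_days >= DELETE_AFTER_DAYS:
        return "flag-for-delete"
    if idle_days >= ARCHIVE_AFTER_DAYS:
        return "archive"
    return "keep"
```

In practice the classification feeds a report the data owner sees, and the storage platform (or a third-party data-management tool) executes the approved moves. Note that access-time data can be unreliable if your filesystems mount with atime updates disabled, so verify the source of your “last accessed” metadata first.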
5. Data workflow patterns vary. How does yours look?
Understanding your data at rest is definitely important, but understanding the way that your grid or applications use the storage platform is just as important. Some performance requirements ramp up quickly in the beginning but then drop off and are rarely used, while some data performance needs ramp up slowly, then are critical in the middle of their life, then drop down again.
Ask yourself: Does this workflow have the possibility of bursting back up? Do data performance needs change, moving from critical to mildly needed throughout the life-cycle of a project? These are questions that need to be asked and answered. The good news is that the storage platforms are doing a much better job of helping us with these questions.
6. Every organization has more data than they need, which causes many problems.
These problems include spending more on infrastructure such as storage, networking, and datacenter costs. It can be difficult to sort the gems in your datasets from the junk. Doing so, though, can save millions of dollars and ensure you are protecting the right data instead of treating it all the same. Replicating junk globally is not only costly, but it also gets in the way of replicating the critical data that engineers need to work globally. Consolidating arrays? Over the years, IT has been able to consolidate successfully, but it still ends up with more under-utilized arrays. This happens for a variety of reasons: the business wanting dedicated infrastructure, performance requirements not being well understood, or the failure domain getting too big. This pendulum swings back and forth, and you’ll have to determine what is right for you.
7. Data needs to be moved around constantly and generally “close” to the engineers.
Your data may need to be moved around outside of the tools that use it, but it must be done in concert with those tools. Some of this data is owned in one location and then pushed out globally; some is owned in multiple locations and pushed around globally; some moves in an orbit-the-sun model. Essentially, there are many ways that data needs to move; therefore, you need a platform and tooling to enable this for your engineering and scientific community.
8. Find the right people for the job.
Storage custodian – This is the storage admin. They care about the health of the array and they make sure the data is safe and protected. Typically, they would be responsible for the infrastructure and all the data management tools.
Data analysts – These folks are very focused on the metadata and are data-owner (customer) centric. They make sure everyone knows who owns the data, work on data lifecycle policies, and are the primary users of the data management tools.
Data Owners – These are the engineers, scientists, or people in the business. They need to be informed about their data and usage so that they can make informed decisions.
Spend less time managing storage and instead focus on visibility and self-service.
Understand your data workflow and data usage and create a data lifecycle policy.
Make sure you have the right roles identified and put data on the platform that best suits these patterns and performance requirements.
Paul Ferraro is Technology Advisor to Evotek. Previously, Paul was Senior Director, IT Global Infrastructure Services at Qualcomm, where he oversaw all aspects of global data management, storage, data protection, server management, and virtual infrastructure services.