As Allison mentioned in this blog post, scalability is an integral part of any infrastructure strategy for machine learning and artificial intelligence (ML/AI) workflows. Today, I’m going to delve further into why that is and how you can plan for it.
She noted that I&O leaders often feel like they have to choose between scalability and accessibility, but in 2018, companies can’t afford to make that choice. Data must be accessible to the compute layer, and infrastructure must be scalable enough to accommodate that data. There’s no way around it.
Why does AI matter?
Machine learning applications, when deployed strategically and thoughtfully, can make or break a company’s bottom line. Common use cases include product recommendations, sentiment analysis, image recognition and classification, churn prediction, and fraud detection.
In many of these situations, companies couldn’t stay competitive using anything other than a machine learning workflow. In others, machine learning workflows allow companies to build artificial intelligence applications that replace or complement manual labor.
We’re talking right now about workflows that rely on unstructured data: how can you tell, from a pile of scanned microscope slides, which ones show cancerous tissue? Can you identify a couch in a TV show and advertise it to the show’s viewers? How can you tell from a video feed whether a self-driving car is too close to the car in front of it?
A big part of the reason that modern image classification systems have come to the forefront in recent years is that advances in infrastructure have made it possible. Although convolutional neural networks have existed in theory for decades, they were largely an academic pipe dream prior to the widespread adoption of the GPU and the ready availability of dense storage systems.
What does scalability look like in 2018?
Although Igneous provides all of the necessary components of an AI-compatible storage tier as described by Neil Stobart at Cloudian, scalability is arguably the key to success in a competitive landscape. We can quibble over the exact minimum dataset size needed to train a functional model for any given task, but the fact remains: more good data means more accurate results, and updating and growing your datasets over time keeps those results improving. That can’t be done without scale-out infrastructure.
Our partners Pure Storage and NVIDIA have together architected a fully integrated storage and compute infrastructure. Pure Storage AIRI gives modern companies (whether enterprises just beginning to take advantage of their rich data, or startups eager to create value from data that slower-moving competitors can’t) the tools to leverage all of the unstructured data available to them. AIRI lets IT do its job, so that data scientists can focus on iterating on their models rather than worrying about compute time or overloading their infrastructure with very large datasets.
Scalability of Secondary Storage
The need for well-integrated compute and primary storage is really intuitive, because the intersection of datasets and algorithms is where all the “glamour” is. That’s where you’ll diagnose a pancreatic tumor, sort people by face, or tell your new car it’s about to run over a dog.
But the need for scalability doesn’t stop at primary storage. To make the best use of top-of-the-line primary storage products, such as Pure Storage’s FlashBlade, you have to replicate your data to a secondary tier to guard against data loss, and actively archive any data that isn’t in use. The beauty of a scale-out system is that you only pay for what you use; the beauty of policy-based archiving is that you only use what you need.
With teams of under ten people regularly wrangling petabytes of data, scale-out secondary storage needs to be more than just scale-out. It needs to scale out without hassle. The storage itself needs to know how much capacity is required and adapt accordingly. It needs to be just as hands-off as using a public cloud provider. In short, it needs to be delivered as-a-Service.
Since cold storage is even less “sexy” than secondary storage, data scientists and their IT support teams don’t exactly want to spend a lot of time managing that, either. That’s why Igneous lets users set up three-tier data movement policies, which, once set, will automatically move cold data off of the high-throughput, on-premises Igneous Hybrid Storage Cloud and into AWS Glacier, Azure Cool Blob, or Google Cloud Platform’s Coldline Storage.
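A three-tier data movement policy like the one described above amounts to a simple age-based classification. The sketch below is purely illustrative; the tier names, thresholds, and function are invented for this example and do not represent how Igneous policies are actually configured.

```python
# Hypothetical three-tier policy: classify data by days since last
# access. Thresholds and tier labels are invented for illustration.
POLICY = [
    (30, "primary"),    # accessed within the last 30 days: stay on primary
    (180, "secondary"), # 31-180 days: Igneous on-premises tier
]
COLD_TIER = "cloud-cold"  # older than 180 days: e.g. AWS Glacier


def tier_for(age_days):
    """Return the storage tier for data last accessed age_days ago."""
    for max_age, tier in POLICY:
        if age_days <= max_age:
            return tier
    return COLD_TIER
```

Once a rule like this is in place, data movement needs no ongoing attention: as data ages out of one tier’s window, it becomes a candidate for the next, cheaper tier.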
Learn more about how we support ML/AI workflows in our solution brief!