My last blog covered a few questions IT leaders are working on to best enable machine learning and AI projects.
- Should we architect for scale or data accessibility out of the gate?
- How much of our legacy infrastructure and data management investments can be used for new machine learning initiatives?
- What about our data protection strategy—does this need to change to address machine-generated data?
Since then I’ve been digging deeper into the topic, gathering more questions and challenges, and discussing strategies with customers. Here are a few shareable "aha moments" that some of our customers have had recently.
Aha moment #1: The kind of data (structured vs. unstructured), where it lives today, and where those data sets need to be stored tomorrow are critical factors in your AI infrastructure strategy.
Arthur Cole from IT Business Edge writes that “support for multi-format storage infrastructure is [...] crucial, given that machine learning, cognitive computing and other forms of AI must pull both structured and unstructured data from multiple sources that rely on iSCSI, NFS, SMB and other solutions.”
Balancing the proximity of data sets with storage tiers that offer the right levels of performance, and that can scale, is part of this decision. While public cloud storage may be the best fit for static backup data sets with long retention periods, unstructured data sets needed for trending analysis should stay closer to users. Doing that in a cost-effective, predictable, and accountable way can be tough.
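The placement trade-off above can be sketched as a simple policy function. This is a toy illustration with hypothetical thresholds and tier names; a real policy would also weigh storage cost, egress fees, latency, and compliance requirements.

```python
def choose_tier(days_since_access, retention_days, needed_for_analysis):
    """Toy placement rule: keep analysis data close to users,
    push static long-retention backups to cheaper cloud storage.

    Thresholds (90 days, 365 days) are illustrative assumptions,
    not recommendations.
    """
    if needed_for_analysis:
        return "local-nas"  # trending analysis wants low-latency access
    if days_since_access > 90 and retention_days > 365:
        return "public-cloud-archive"  # static, long-retention backup data
    return "local-nas"
```

For example, a backup set untouched for six months with a multi-year retention requirement lands in the archive tier, while a data set feeding active trending analysis stays local regardless of age.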
Aha moment #2: An organizing system is needed—and it needs to be across ALL unstructured data sets.
Many of our customers have been struggling with this problem for years, and have devised elegant ways to systematically organize their digital assets using metadata.
But these methods tend to be brittle, hard to manage, and don’t cover the enterprise’s entire unstructured data set. Teams are looking for a more universal way to scan, index, and classify the millions of files being created and analyzed by machines.
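The scan-index-classify step described above can be sketched at small scale with nothing but filesystem metadata. This is a minimal, hypothetical example; a production system would also capture ownership, ACLs, and custom tags, and would scale far beyond a single directory walk.

```python
import os
import time

def scan_metadata(root):
    """Walk a directory tree and collect basic metadata for each file.

    Sketch of the 'scan and index' step; fields are limited to what
    os.stat() provides.
    """
    index = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            index.append({
                "path": path,
                "extension": os.path.splitext(name)[1].lower(),
                "size_bytes": st.st_size,
                "modified": st.st_mtime,
            })
    return index

def classify(entry, stale_after_days=365):
    """Tag an entry as 'active' or 'cold' by last-modified age.

    A toy classification rule; real systems classify on richer
    metadata than modification time alone.
    """
    age_days = (time.time() - entry["modified"]) / 86400
    return "cold" if age_days > stale_after_days else "active"
```

Calling `scan_metadata("/data/projects")` yields a list of per-file records that `classify` can then bucket, which is the same index-then-classify shape, applied to millions of files, that a universal organizing system needs.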
Aha moment #3: Even once you know what data you have, who owns it, who uses it, how often it changes, and what its access requirements are, data portability and movement can clog the pipeline if not planned for in advance.
Arik Hesseldahl unpacks some of the problem in a recent CIO.com article, stating, “Most companies have a serious data flow problem with bottlenecks aplenty across their data spectrum.” Data security policies have to be rethought, and data processing and storage locations strategically aligned.
Monica Rogati wrote a great article on the “AI Hierarchy of Needs” that underscores the foundational importance of thinking through the data flow and storage of structured and unstructured data—proactively.
Arik ends his article with this well-articulated statement: “You can make all kinds of big plans for an AI initiative and spend months working on it. But nine months in if you find out that your data is stuck, you can expect to spend another nine months or more figuring out what to do,” he says. “And while you’re doing that, your competition may pass you by.”
Bottom line, IT can—and should—have a huge impact on enabling data science, deep learning, and AI pipelines.
But the key is putting the data first: its organizing principles, its processing needs, its characteristics, its access requirements, and its visibility and search requirements, all before deciding which infrastructure changes are required.
Learn how Igneous supports ML workflows in our ML Solution Brief.