
Finding My Data

by Christian Smith – October 21, 2015

As anyone who knows me knows, I hate traffic. Which is why the Waze app caught my eye. The moving map is amazingly accurate regarding traffic congestion, accidents, and police locations. It’s as close to real time as I have ever seen.

Curious, I later researched how this was happening and realized the solution the Waze guys had stumbled upon has a direct correlation to a storage problem I had been noodling on.

Let me start with how the traffic problem was solved. Previous traffic-reporting solutions depended on out-of-band methods to collect the data (rider reports, traffic cameras, government systems). These were slow to respond to changes and available only in select urban areas.

Recent solutions, such as Waze, have switched to using aggregated mobile phone location data. Your carrier knows precisely where you are at every moment. Mapping that data onto the highways yields the speed and congestion of traffic in any area, which provides an extremely accurate, real-time view of traffic.

The key difference is instead of sitting on the outside and ‘watching’ for traffic, the system now sits on the inside and ‘reports’ traffic back to the main systems. The system is inherently scalable as there is a one-to-one relationship between cars and monitoring devices. More cars mean more devices, so the monitoring never falls behind.

This whole scenario applies directly to a problem petabyte-scale storage systems have: Search. Let’s do a quick review of the history of search.

In simple storage systems, searching for data is done in a brute-force manner: you start by examining the first item and continue until you reach the last one. The brute-force method breaks down with even moderately large storage systems because it takes too long. You have no doubt witnessed this with primitive email searches, where a simple search can take 20 minutes or longer to complete.
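
To make that concrete, here is a minimal sketch of brute-force search in Python (the root path and search term are illustrative). Every query walks every item, so search time grows with the size of the system.

```python
import os

def brute_force_search(root, term):
    """Scan every file under root, one item at a time, for the search term."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="ignore") as f:
                    if term in f.read():
                        matches.append(path)
            except OSError:
                continue  # skip unreadable items and keep scanning
    return matches

# Every search pays the full cost of touching every item in the system.
print(brute_force_search("/data", "quarterly report"))
```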

So, system folks improved this process by pre-indexing. The system creates an initial index of search terms by crawling the data item by item. This takes an enormous amount of time, of course, but when a user searches for something the file system refers to the index and returns the search results virtually instantaneously.
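
Here is an equally minimal sketch of the pre-index approach (again in Python, with illustrative paths): the expensive crawl happens once up front, and every query afterward is a near-instant lookup.

```python
import os
from collections import defaultdict

def build_index(root):
    """Crawl the storage item by item, mapping each term to the files containing it."""
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="ignore") as f:
                    for term in f.read().split():
                        index[term.lower()].add(path)
            except OSError:
                continue  # skip unreadable items
    return index

index = build_index("/data")        # slow: one full crawl of the data
print(index.get("report", set()))   # fast: a simple dictionary lookup
```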

Great, but when applied to petabyte-scale storage systems this pre-indexing is of little value. It takes so long that by the time it is finished the underlying storage system has changed and the pre-index is no longer valid.

So, users are faced with two bad options for finding information on really big storage systems: the brute-force method, which is way too slow to be usable, and the pre-index method, which is out of date before it is even ready.

The way the traffic guys fixed their problem is precisely how storage guys should fix the search problem. The key is to architect a ‘watch’ service that is integrated into the storage tier and can maintain an index in essentially real time.

Kiran talked about AWS Lambda in a previous post and postulated that while Lambda was awesome for AWS S3 storage, a similar capability is needed for on-premise storage. This search problem is a perfect use case for why this is true. Here is how search indexing would work if you had an on-premise equivalent to Lambda:

Instead of sitting on the outside and watching, the indexing service would be triggered automatically every time the storage changed. If something was added, changed, deleted or moved, the indexing service would immediately (within seconds) kick in and revise the master index.
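
There is no on-premise Lambda in the sketch below; as a stand-in for a storage-tier notification service, it uses the Python watchdog library’s file-system events to show the pattern. The update_index and remove_from_index helpers are hypothetical placeholders for the real indexing work.

```python
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

def update_index(path):
    print(f"re-indexing {path}")          # placeholder for revising the master index

def remove_from_index(path):
    print(f"dropping {path} from index")  # placeholder for pruning the master index

class IndexUpdater(FileSystemEventHandler):
    """Revise the index within seconds of each add, change, delete, or move."""
    def on_created(self, event):
        if not event.is_directory:
            update_index(event.src_path)

    def on_modified(self, event):
        if not event.is_directory:
            update_index(event.src_path)

    def on_deleted(self, event):
        if not event.is_directory:
            remove_from_index(event.src_path)

    def on_moved(self, event):
        if not event.is_directory:
            remove_from_index(event.src_path)
            update_index(event.dest_path)

observer = Observer()
observer.schedule(IndexUpdater(), "/data", recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)   # the index stays current without ever re-crawling
finally:
    observer.stop()
    observer.join()
```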

The first advantage is that the index could no longer get out of sync with the underlying information. The second advantage is that the indexing system would scale automatically to meet demand. As with Lambda, the service would automatically fire up as many independent instances of the indexing service as needed. In other words, it would scale out as needed.
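
A rough sketch of that scale-out behavior, using a standard Python thread pool as a stand-in for Lambda-style instances (the change-event queue and the reindex stub are hypothetical): each change notification gets its own worker, so indexing keeps pace with the rate of change.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

# Hypothetical queue of change notifications coming from the storage tier.
change_events = queue.Ueue() if False else queue.Queue()
change_events.put({"path": "/data/report.docx", "action": "modified"})
change_events.put({"path": "/data/old/archive.tar", "action": "deleted"})
change_events.put(None)  # sentinel so this sketch exits cleanly

def reindex(event):
    """Revise the master index entry for a single changed item (stub)."""
    print(f"re-indexing {event['path']} ({event['action']})")

# Each event is handed to an independent worker, so indexing throughput
# grows with the rate of change instead of falling behind.
with ThreadPoolExecutor(max_workers=16) as pool:
    while True:
        event = change_events.get()
        if event is None:
            break
        pool.submit(reindex, event)
```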

This is a perfect example of why on-premise storage needs an AWS Lambda-like service architecture. It vastly simplifies tasks like indexing while providing the scale-out capacity that large-scale storage systems require.

At least, that’s my opinion. What’s your take? Oh, and by the way, I still hate traffic!
