
A Fail-in-Place Platform—RatioPerfect™ Every Time

by Jeff Hughes – October 25, 2016

We have previously described our vision around Zero-Touch Infrastructure™, the first of two key architectural components that enable us to deliver on the promise of a True Cloud for Local Data. In this article, I will expand on the second key architectural component: on-premises appliances that are a true fail-in-place platform.

Multiple conversations with customers running storage and server infrastructure at scale confirmed our own experience:

  1. At scale, components routinely fail.
  2. Servicing failures manually is laborious, cumbersome, and causes friction. The cost of labor to service failed parts exceeds the actual cost of the parts themselves.

Our goal was that no matter what component (disk, memory, power supply, fans, etc.) of our on-premises appliances failed, it should (a) not impact the customer workflows, and (b) not require human intervention to mitigate.

As a platform, traditional storage servers (dual or quad Intel Xeon processors, 32–128GB of DRAM, and 60 or more disks) were anathema to us. To begin with, a great deal of compute power is accessing hundreds of terabytes of data through a thin straw (a 6Gbps SAS bus!).
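To make the "thin straw" concrete, here is a quick back-of-envelope sketch. The disk count and bus speed come from the article; treating the SAS link as evenly shared across all drives is a simplifying assumption for illustration:

```python
# Illustrative numbers: a traditional storage server funnels all of its
# disks through one shared SAS link.
SAS_LINK_GBPS = 6        # the 6Gbps SAS bus mentioned above
DISKS_PER_SERVER = 60    # "60 or more disks"

# If the link is shared evenly, each disk's effective share is tiny.
per_disk_gbps = SAS_LINK_GBPS / DISKS_PER_SERVER
print(f"Effective bandwidth per disk: {per_disk_gbps * 1000:.0f} Mbps")
```

Under these assumptions, each drive's share of the bus works out to roughly 100 Mbps, well below what a single drive can stream on its own.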

[Diagram: traditional storage server architecture]

Even more worrisome, though, was the fact that a storage server failure made hundreds of terabytes of data instantly unavailable and required human intervention to resolve by replacing the server. Sure, software erasure coding techniques could be used to “recover” the unavailable data, but the time to rebuild hundreds of terabytes is measured in weeks, during which the customer’s workflow would be significantly impacted.
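The "measured in weeks" claim is easy to sanity-check. The capacity figure follows the article's "hundreds of terabytes"; the sustained rebuild bandwidth is a hypothetical assumption, since the real rate depends on the cluster and its workload:

```python
# Rough rebuild-time estimate (assumed numbers, not vendor figures).
FAILED_CAPACITY_TB = 500       # "hundreds of terabytes" behind one server
REBUILD_BANDWIDTH_GBPS = 2.0   # assumed sustained erasure-coding rebuild rate

# Total bits to reconstruct, divided by the sustained rebuild rate.
bits_to_move = FAILED_CAPACITY_TB * 1e12 * 8
seconds = bits_to_move / (REBUILD_BANDWIDTH_GBPS * 1e9)
weeks = seconds / (7 * 24 * 3600)
print(f"Estimated rebuild time: {weeks:.1f} weeks")
```

Even with a generous sustained rebuild rate, the estimate lands at multiple weeks, consistent with the point above.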

Enter our patented RatioPerfect architecture. We leveraged off-the-shelf commodity ARM processors (low cost, low power) and built a dedicated “controller” for every drive.

[Diagram: fail-in-place RatioPerfect architecture]

Each drive has its own ARM-based controller that runs Linux and is dedicated to managing just that drive. This converts a dumb disk drive into an intelligent nano-server that is (dual) Ethernet-connected and runs part of our software stack. The problem of lots of compute power accessing hundreds of terabytes of data through a thin straw is completely resolved, as we now have unrestricted gigabit bandwidth to every drive. The “blast radius” of a component failure shrinks from hundreds of terabytes to the capacity of a single drive. Our software automatically detects component failures, recovers data from the failed nano-server, and routes around the failed components, all without requiring any human intervention.
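The blast-radius argument above can be sketched with the same kind of arithmetic. The per-drive gigabit link comes from the article; the drive and server capacities are illustrative assumptions:

```python
# Illustrative blast-radius comparison (assumed capacities).
DRIVE_TB = 10          # assumed capacity of one drive
DRIVE_LINK_GBPS = 1.0  # gigabit Ethernet to each nano-server
SERVER_TB = 500        # "hundreds of terabytes" behind one traditional server

# Rebuilding one failed drive over its own gigabit link.
drive_rebuild_hours = DRIVE_TB * 1e12 * 8 / (DRIVE_LINK_GBPS * 1e9) / 3600
print(f"Single-drive rebuild: ~{drive_rebuild_hours:.0f} hours")

# How much smaller the failure domain becomes.
blast_radius_reduction = SERVER_TB / DRIVE_TB
print(f"Blast radius reduced {blast_radius_reduction:.0f}x")
```

Under these assumptions, a failed drive rebuilds in under a day rather than weeks, and the failure domain shrinks by roughly the ratio of server capacity to drive capacity.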

The result is an appliance platform that is both commodity and truly fail-in-place, so we can bring our customers the True Cloud experience for their Local Data. In a future article, I will dig deeper into the technical reasons behind this architecture.
