A Fail-in-Place Platform—RatioPerfect™ Every Time

by Jeff Hughes – October 25, 2016

We have previously described our vision around Zero-Touch Infrastructure™, the first of two key architectural components that enable us to deliver on the promise of a True Cloud for Local Data. In this article, I will expand on the second key architectural component: on-premises appliances that are truly fail-in-place.

Multiple conversations with customers running "at-scale" storage and server infrastructure confirmed our own experience:

  1. At scale, components routinely fail.
  2. Servicing failures manually is laborious, cumbersome, and causes friction. The cost of labor to service failed parts exceeds the actual cost of the parts themselves.

Our goal was that no matter what component (disk, memory, power supply, fans, etc.) of our on-premises appliances failed, it should (a) not impact the customer workflows, and (b) not require human intervention to mitigate.

As a platform, traditional storage servers — with dual or quad Intel Xeon processors, 32–128GB of DRAM, and 60 or more disks — were anathema to us. To begin with, all that compute power accesses hundreds of terabytes of data through a thin straw (a 6Gbps SAS bus!).

[Diagram: traditional storage server architecture]
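To put rough numbers on that straw (the drive count, per-drive throughput, and bus speed below are illustrative assumptions, not measurements from any specific system), the back-of-the-envelope math looks like this:

```python
# Back-of-the-envelope: aggregate drive bandwidth vs. one shared SAS bus.
# All figures are illustrative assumptions, not measurements.
SAS_BUS_GBPS = 6        # the shared 6Gbps SAS "straw"
DRIVE_COUNT = 60        # drives behind that one bus
DRIVE_MBPS = 150        # assumed sustained throughput per drive, MB/s

bus_mb_per_s = SAS_BUS_GBPS * 1000 / 8          # ~750 MB/s through the straw
drives_mb_per_s = DRIVE_COUNT * DRIVE_MBPS      # 9,000 MB/s the drives could deliver

print(f"Bus ceiling:     {bus_mb_per_s:,.0f} MB/s")
print(f"Drives combined: {drives_mb_per_s:,.0f} MB/s")
print(f"Fraction usable: {bus_mb_per_s / drives_mb_per_s:.0%}")   # ~8%
```

Under these assumptions, the bus can service less than a tenth of what the drives could collectively deliver.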

Even more worrisome, though, was the fact that a storage server failure meant hundreds of terabytes of data were instantly unavailable, and resolving the failure required human intervention to replace the server. Sure, software erasure coding techniques could be used to "recover" the unavailable data, but the time to rebuild hundreds of terabytes is measured in weeks, during which the customer's workflow would be significantly impacted.
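To see why the rebuild window stretches into weeks, consider a rough estimate (the capacity and the effective rebuild rate below are assumed values, chosen only for illustration):

```python
# Rough rebuild-time estimate after a whole storage server fails.
# Both inputs are illustrative assumptions.
FAILED_CAPACITY_TB = 300      # "hundreds of terabytes" on one server
REBUILD_RATE_MBPS = 200       # assumed effective erasure-coded rebuild rate, MB/s

rebuild_seconds = FAILED_CAPACITY_TB * 1_000_000 / REBUILD_RATE_MBPS
print(f"Rebuild time: {rebuild_seconds / 86_400:.1f} days")   # ~17 days
```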

Enter our patented RatioPerfect architecture. We leveraged off-the-shelf commodity ARM processors (low cost, low power) and built a dedicated “controller” for every drive.

[Diagram: RatioPerfect fail-in-place architecture, one ARM controller per drive]

Each drive has its own ARM-based controller that runs Linux and is dedicated to managing just that drive. This converts a dumb disk drive into an intelligent nano-server that is (dual) Ethernet-connected and runs part of our software stack. The problem of lots of compute power accessing hundreds of terabytes of data through a thin straw is completely resolved, as we now have unrestricted gigabit bandwidth to each and every drive. The "blast radius" of a component failure is reduced from hundreds of terabytes to the capacity of a single drive. Our software automatically detects any component failures, recovers data from the failed nano-server, and routes around the failed components — all without requiring any human intervention.
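As a conceptual sketch only (this is not Igneous's actual software; the heartbeat mechanism, class names, and in-memory blocks are my own simplifications), the route-around behavior can be pictured like this:

```python
import time

HEARTBEAT_TIMEOUT_S = 10

class NanoServer:
    """One drive plus its dedicated ARM controller, addressed over Ethernet."""
    def __init__(self, address):
        self.address = address
        self.blocks = {}                        # block_id -> bytes (in-memory stand-in)
        self.last_heartbeat = time.monotonic()

    def healthy(self):
        return time.monotonic() - self.last_heartbeat < HEARTBEAT_TIMEOUT_S

def read_block(block_id, placement, servers):
    """Serve a read from the first healthy replica; a single failed
    drive is simply skipped, with no operator in the loop."""
    for addr in placement[block_id]:
        server = servers[addr]
        if server.healthy():
            return server.blocks[block_id]
    raise IOError("all replicas down; trigger erasure-coded rebuild here")

# Demo: two nano-servers hold the same block, then one goes silent.
servers = {a: NanoServer(a) for a in ("10.0.0.1", "10.0.0.2")}
for s in servers.values():
    s.blocks["b1"] = b"payload"
servers["10.0.0.1"].last_heartbeat -= 60        # simulate a failed drive
print(read_block("b1", {"b1": ["10.0.0.1", "10.0.0.2"]}, servers))  # b'payload'
```

The key point the sketch captures is that a dead nano-server is simply skipped at read time; repair happens in the background rather than on an operator's pager.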

The result is an appliance platform that is both a commodity and truly fail-in-place, so we can bring our customers the True Cloud experience for their Local Data.  In a future article, I will dig deeper into the technical reasons behind this architecture.
