Congratulations to StorSimple for building an innovative product compelling enough that Microsoft recently acquired the company. For those of you who have not had a chance to look into StorSimple yet, it offers an interesting hybrid storage capability: on-premises storage combined with Windows Azure-based storage. Simply drop the storage appliance into your network and start using it as a storage device. You can expect capabilities similar to any enterprise-class storage device, including high availability through dual controllers, battery-backed memory and RAID.

Under the covers, however, the StorSimple appliance seamlessly spreads your data across its three types of storage: high-performance flash SSDs, high-capacity SAS disk drives and Windows Azure-based cloud storage — essentially giving you access to virtually unlimited amounts of storage. The technique of automatically moving data between high-cost and low-cost media is not new, though. For years, the industry has referred to it as HSM (Hierarchical Storage Management), or tiered storage. However, HSM products such as IBM Tivoli Storage Manager and Oracle’s SAM-QFS are considered high-end products and are typically outside the reach of most small- to medium-sized businesses. This is why some believe that StorSimple may have an opportunity to bring HSM to the masses.

So why is this interesting?

  1. Cloud storage is cheap, and is in fact getting cheaper all the time, as cloud storage providers are forced to pass along the “economies of scale” savings to consumers in order to stay competitive in an increasingly tight market.
  2. Cloud storage is highly available given the local and geo-redundant options.
  3. Based on its activity, your data may be locally available on the appliance or downloaded to the appliance when requested. There are a number of knobs you can turn to customize this behavior (e.g. dictate the activity threshold before data is moved out to the cloud). In any case, you are shielded from the complexity of shuttling the data between on-premises and Windows Azure storage.
  4. By identifying duplicate data and removing excess copies (deduplication, or dedupe), the appliance reduces the amount of data stored.
  5. Last but certainly not least, the StorSimple appliance is no slouch when it comes to its hardware specifications — with a max capacity ranging from 100 TB to 500 TB, dual redundant power, full RAID protection and SSD acceleration.
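To make the activity-threshold idea in point 3 concrete, here is a minimal sketch of how such a tiering decision might work. This is purely my own illustration — the threshold value and function names are hypothetical, not StorSimple's actual policy or tunables:

```python
import time

# Hypothetical knob: data accessed within the last week stays on the appliance.
# The real appliance exposes its own tunables for this threshold.
HOT_WINDOW_SECONDS = 7 * 24 * 3600

def choose_tier(last_access_ts, now=None):
    """Keep recently accessed data on local SSD/SAS; move idle data to cloud storage."""
    now = time.time() if now is None else now
    return "local" if now - last_access_ts <= HOT_WINDOW_SECONDS else "cloud"

now = time.time()
print(choose_tier(now - 3600, now))            # accessed an hour ago -> local
print(choose_tier(now - 30 * 24 * 3600, now))  # idle for a month -> cloud
```

The point of the abstraction is that callers never see the tier at all — reads against "cloud" data simply incur a download first.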

Oh, yes… if you are concerned about security/compliance regarding the data stored in the cloud, you will find it comforting to know that StorSimple applies AES-256 encryption to all data transmitted to and stored in the cloud. In fact, your files are never moved to the cloud in their entirety. Instead, only the fragments of deduplicated files (referred to as fingerprints) are moved to the cloud. In order to recreate a file from its fingerprints, you need a metadata map that captures the relationship and sequence of fingerprints that make up the file. This metadata map needs to be downloaded to the appliance before users can access their data natively. The diagram below illustrates this point. Step 1 is to download the metadata before users can access their data natively (Step 2):

Source: StorSimple_CiS_White_Paper_Rev5.pdf

The key point to note here is that code hosted in the cloud (i.e. Windows Azure-hosted code) cannot natively access the data in the cloud directly.
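To make the fingerprint-and-metadata-map mechanism concrete, here is a minimal sketch (my own illustration, not StorSimple's actual implementation, which also encrypts the fingerprints and likely uses variable-size chunking): a file is split into chunks, each chunk is fingerprinted by its hash and stored once, and the metadata map is the ordered list of fingerprints needed to rebuild the file.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks for simplicity

def dedupe(data, store):
    """Split data into chunks, store each unique chunk keyed by its hash,
    and return the metadata map: the ordered list of chunk fingerprints."""
    metadata_map = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # duplicate chunks are stored only once
        metadata_map.append(fingerprint)
    return metadata_map

def rebuild(metadata_map, store):
    """Recreate the original file by fetching chunks in map order."""
    return b"".join(store[fp] for fp in metadata_map)

store = {}                                       # stands in for the cloud chunk store
data = b"A" * 10000 + b"B" * 5000 + b"A" * 10000  # repetitive content dedupes well
mmap = dedupe(data, store)
assert rebuild(mmap, store) == data
print(len(mmap), "chunks referenced,", len(store), "unique chunks stored")
```

Notice that the `store` alone is just a bag of anonymous chunks; without the metadata map there is no way to reassemble a file — which is exactly why cloud-hosted code cannot read the data directly.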

This brings me to my feature request for StorSimple: give us a way to access the data natively in the cloud. One approach would be to allow customers to provision a virtual StorSimple appliance in Windows Azure, complementing the hardware-based StorSimple appliance located on-premises. Once the virtual StorSimple appliance is in place, Windows Azure-hosted code could directly access the data.

Interestingly, this is not very different from how Windows Azure site-to-site VPN works. A hardware-based VPN device (Juniper, Cisco, etc.) is placed on-premises. To complement the VPN device, a virtual VPN gateway is provisioned in Windows Azure. Together, the VPN device on-premises and the virtual gateway in Windows Azure establish seamless connectivity between the two locations.

How would it help? One pattern that is becoming increasingly common is big-data-style analysis of the ever-growing volumes of business event data being generated these days. Rather than set up an on-premises Hadoop cluster, in many cases it is simply easier (and less expensive) to transport the business events to the cloud and use a cloud-based Hadoop implementation (HDInsight is the Windows Azure-based Hadoop implementation).

[pullquote]Organizations also want to retain the ability to react to business events almost immediately.[/pullquote]But it is not a simple matter of transporting the data; there are a few additional things to consider. While organizations are willing to perform long-term data analysis in the cloud, they also want to retain the ability to react to business events almost immediately. This means that in the short term (typically up to seven days from the time the data is collected), the business data needs to reside on-premises. Another requirement is guaranteed delivery of business events: data must never be lost in transit in the face of network and other outages (both on-premises and in the cloud).

I think that StorSimple, with its hybrid storage model, is uniquely suited to address this requirement. In the short term, business data will reside in the tier-1 SAN-based storage on-premises, making it possible to react to events in a low-latency manner. Beyond that, the data will be securely moved to Windows Azure Blob storage, where it can interface with HDFS (Hadoop Distributed File System) via ASV (Azure Storage Vault).

I strongly believe that such a solution would go a long way toward democratizing the benefits of “big data.” As I stated earlier, organizations can stand up HDInsight clusters easily (without requiring the technical know-how to set up a cluster) and cheaply (provisioning the cluster only for the time needed to conduct the analysis). Mind you, Windows Azure levies no data ingress charges. Egress charges do apply, but the data coming out of the Windows Azure datacenter will be small, since one will typically pull down only the analysis results.