Hybrid drives can improve performance by storing frequently accessed data in flash memory, where it can be read and written very quickly. How much performance can be gained by replacing a hard disk drive with a hybrid drive? The answer depends on the workload. In this post, we'll talk about the workload access distribution and why it is important for estimating hybrid drive performance.

How much of the data is frequently accessed? How frequently is it accessed relative to the rest of the data? These are the essential questions for characterizing the workload access distribution. Imagine that we have 10 data records and that we rank them from the most frequently accessed to the least frequently accessed. Now imagine choosing a record at random. The access distribution gives the probability of choosing record 1, 2, 3, and so on. In practice, this distribution is often skewed so that a small fraction of the records (the hot data) is chosen with high probability.

What do words in the English language, corporation sizes, and information accesses have in common? They have all been modeled by the Zipf distribution. Using the example of 10 records of data, the Zipf distribution states that the probability of accessing record n is proportional to 1/n^s, where the skew parameter s characterizes how skewed the distribution is toward the most frequently accessed records. If s is zero, all 10 records are equally likely to be accessed. If s is large, the first record will be accessed nearly all the time. The graph below shows an access distribution for the example of 10 data records with the skew parameter s=0.7.
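The probabilities in the graph are easy to compute yourself. As a small sketch (the function name is ours, not part of any library), the 1/n^s weights just need to be normalized so they sum to 1:

```python
def zipf_probabilities(n_records, s):
    """Access probabilities for records ranked 1..n_records under a
    Zipf distribution: probability of record n is proportional to
    1/n**s, normalized so the probabilities sum to 1."""
    weights = [1.0 / n**s for n in range(1, n_records + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# The example from the post: 10 records, skew parameter s = 0.7.
probs = zipf_probabilities(10, 0.7)
print(round(probs[0], 3))  # probability of the hottest record, about 0.252
```

Setting s=0 makes every record equally likely (each probability is 0.1), and increasing s concentrates more and more of the probability on the first record.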

In this example, all 10 records have a reasonably large probability of being accessed, but the probability of an access involving the first record is over 25%. Storing this record on faster storage will significantly improve performance because it is so frequently accessed. Hybrid drives use caching algorithms to keep as much of the frequently accessed data as possible in fast storage.
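To see why caching the hot data pays off, we can estimate the fraction of accesses served from flash when the cache holds the k most frequently accessed records. This is a simplified sketch, not an actual hybrid drive caching algorithm; the function names are ours:

```python
def zipf_probabilities(n_records, s):
    """Normalized Zipf access probabilities for ranks 1..n_records."""
    weights = [1.0 / n**s for n in range(1, n_records + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def cache_hit_rate(probs, cache_slots):
    """Fraction of accesses served from fast storage when the cache
    holds the cache_slots most frequently accessed records.
    probs must be sorted from most to least frequently accessed."""
    return sum(probs[:cache_slots])

probs = zipf_probabilities(10, 0.7)
print(round(cache_hit_rate(probs, 3), 2))  # about 0.52
```

With s=0.7, caching just 3 of the 10 records serves over half of all accesses from flash, which is the essence of why a skewed access distribution favors hybrid drives.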

In this example, we looked at 10 data records. In practice, there are millions, billions, or even more pieces of data. These pieces could be images, web pages, database records, or any one of the many types of information that we store digitally.

The Zipf distribution can be used to model real workloads with an appropriate choice of the skew parameter. If you choose to use a Zipf distribution to model your workload, be sure to compare the model to the actual data to ensure that the model is reasonable in your particular application. The resulting Zipf model gives a high-level characterization of your workload that can be useful when deciding between spinning media and hybrid drives.
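One simple way to choose the skew parameter is to fit the model to observed access counts. Under a Zipf model, log(count) is a linear function of log(rank) with slope -s, so an ordinary least-squares fit on the log-log data gives an estimate of s. A minimal sketch, assuming you have per-record access counts (the function name is ours):

```python
import math

def estimate_skew(access_counts):
    """Estimate the Zipf skew parameter s from observed access counts.

    Sorts the counts in descending order and fits a least-squares line
    to log(count) versus log(rank); under a Zipf model the slope of
    that line is -s, so the estimate is the negated slope."""
    counts = sorted(access_counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope
```

Comparing the fitted line to the actual log-log data is also a quick visual check of whether the Zipf model is reasonable for your workload, as suggested above.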

Author: Dan Lingenfelter
