ExaGrid's Eye on Deduplication

Assessing the Deduplication Tax

Posted by Marc Crespi on Thu, Sep 25, 2008 @ 05:17 PM

As we wind through this intense political season, discussions of taxes are everywhere. Taxes are a fact of life. But before this post is misunderstood as a political position, let me explain the type of tax I am discussing today: the dedupe tax.

Deduplication is an extremely important storage technology, especially in disk-based backup products. It allows organizations to store large amounts of data on a small amount of disk, and to send backups over wide area networks to disaster recovery sites while moving only a very small amount of data.

However, as with most compelling technologies, there are trade-offs to be made. Each disk-based backup vendor decides which trade-offs are right for your data center and your backups. Given the importance of your backup window and restore times, the most critical trade-offs are the performance trade-offs, or what I call the "dedupe tax". The dedupe tax is a performance hit that can show up as a longer backup window or a dramatically slower restore.

The assessment of the tax varies with the deduplication method employed by the vendor:

  • Post-process deduplication (implemented by ExaGrid Systems): backups are written directly to disk in their entirety, and the most recent backup is kept in that form for rapid restore. Since 90% or more of all restores are done from the most recent backup, this method avoids the dedupe tax for 90 to 95% of restores and 100% of backups.
  • In-line deduplication: deduplication is performed on backup data on its way to disk. This method charges you the dedupe tax for every backup into the system and every restore out of the system. The promise is the use of less disk (but not lower cost) and simplicity. With in-line, you are in the 100% tax bracket!

So, what is the cost of the dedupe tax? It can be substantial. It can slow your backups by a factor of 2 to 5 compared with raw disk speeds. Similarly, it can dramatically slow restores and force your organization to wait much longer to recover data when you can least afford it: during a critical recovery scenario.
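To put rough numbers on that slowdown, here is a minimal Python sketch of the arithmetic. The 10 TB backup size and the 500 MB/s raw disk speed are made-up figures for illustration, not measurements of any product; only the 2x and 5x factors come from the range above.

```python
def backup_window_hours(data_tb: float, raw_disk_mb_per_s: float, slowdown: float = 1.0) -> float:
    """Hours needed to write data_tb terabytes when the effective
    throughput is the raw disk speed divided by the slowdown factor."""
    data_mb = data_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    effective_mb_per_s = raw_disk_mb_per_s / slowdown
    return data_mb / effective_mb_per_s / 3600  # seconds -> hours


# Hypothetical example: a 10 TB backup to disk that sustains 500 MB/s raw.
for slowdown in (1.0, 2.0, 5.0):
    hours = backup_window_hours(data_tb=10, raw_disk_mb_per_s=500, slowdown=slowdown)
    print(f"slowdown {slowdown:.0f}x: {hours:.1f} hours")
```

Under these assumptions the window grows in direct proportion to the tax: whatever the untaxed window is, a 5x slowdown makes it five times longer.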

Look for implementations that allow you to determine when and how much dedupe tax should be paid. With ExaGrid's post-processing GRID architecture, all backups and most restores are tax free. Only restores from older, deduplicated data incur the dedupe tax, and these are generally smaller restores with much less urgency associated with them.

Taxes must be paid. But do not pay a gigabyte per second more in dedupe tax than needed!

Marc Crespi is the Vice President of Product Management for ExaGrid Systems, Inc.

Clarity about client side, inline and post-process data deduplication

Posted by Bill Andrews on Tue, Sep 16, 2008 @ 04:15 PM

There is great debate over which approach is better for disk-based backup systems with data de-duplication: client-level, inline, or post-process de-duplication.

The idea of data de-duplication is to avoid storing redundant data. Some products store only unique blocks of roughly 8KB, and some store only the bytes that change, working at the byte level. Both of these methods deliver similar de-duplication rates. But the question remains... where is the best place to de-duplicate the data?
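The block-level idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (fixed 8KB blocks, SHA-256 digests as block identities, an in-memory dictionary as the block store); the function names are hypothetical and it does not describe how any particular product works.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # fixed 8KB blocks, assumed here for simplicity


def dedupe(data: bytes, store: dict) -> list:
    """Split data into blocks, keep each unique block once in store,
    and return the ordered list of digests needed to rebuild the data."""
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()
        store.setdefault(digest, block)  # only the first copy is stored
        recipe.append(digest)
    return recipe


def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original data by looking up every block in order."""
    return b"".join(store[digest] for digest in recipe)


# Two backups of four blocks each that differ only in the last block.
store = {}
monday = bytes(BLOCK_SIZE) * 3 + b"a" * BLOCK_SIZE
tuesday = bytes(BLOCK_SIZE) * 3 + b"b" * BLOCK_SIZE
monday_recipe = dedupe(monday, store)
tuesday_recipe = dedupe(tuesday, store)
print(f"blocks written: 8, unique blocks stored: {len(store)}")
```

Note that a restore has to reassemble the data block by block from the store, which is where the extra work on deduplicated data comes from.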

Client-level de-duplication processes the data where the "backup agent" lives, on each application server. The advantage of this approach is that less traffic is sent over the network, so it gives the shortest backup window. The disadvantage is that you have to replace your existing backup application with the new client-based de-duplication application.

Inline de-duplication is when the disk-based backup appliance, connected to your existing backup server, de-duplicates the data on the way to the disk. The advantage of this approach is that it uses less disk than post process and theoretically should cost less. The disadvantage is that it gives the slowest backups: the de-duplication work slows the writes to disk, which expands the backup window. These systems also require more memory and processor, so they are not necessarily less costly.

Post process de-duplication is when the disk-based backup appliance, connected to your existing backup server, lets the backup server write the data directly to disk, at disk speed. The de-duplication work begins after the backup is complete. The advantage is that backups run much faster than with the inline approach, resulting in a shorter backup window. The disadvantage is that more disk is required to land the backup and then compare. However, the additional disk costs no more than the additional processor and memory required for inline processing, so post process systems do not cost more than inline. In fact, in most cases they cost less. If you choose a post process system, make sure that it is sized properly to de-duplicate all your data well in advance of the next backup coming in.
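That sizing check is simple arithmetic, sketched below in Python. The function names, the throughput figures and the 24-hour backup cycle are all assumptions for illustration; a real sizing exercise would use measured rates for your own data.

```python
def dedupe_hours_needed(backup_tb: float, dedupe_tb_per_hour: float) -> float:
    """Hours of post-process work needed to de-duplicate one backup."""
    return backup_tb / dedupe_tb_per_hour


def sized_properly(backup_tb: float, dedupe_tb_per_hour: float,
                   backup_hours: float, hours_between_backup_starts: float = 24.0) -> bool:
    """True if de-duplication, which starts after the backup completes,
    finishes before the next backup starts."""
    idle_hours = hours_between_backup_starts - backup_hours
    return dedupe_hours_needed(backup_tb, dedupe_tb_per_hour) <= idle_hours


# Hypothetical nightly job: 10 TB lands in 6 hours, leaving 18 hours idle.
print(sized_properly(backup_tb=10, dedupe_tb_per_hour=1.0, backup_hours=6))  # 10 h of work
print(sized_properly(backup_tb=10, dedupe_tb_per_hour=0.5, backup_hours=6))  # 20 h of work
```

In this made-up case the first system finishes with hours to spare, while the second is still de-duplicating when the next backup arrives.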

There is a further debate over which approach gets replicated changed data to the offsite system the fastest. I plan to expand upon this in a separate post.

Bill Andrews is President and CEO of ExaGrid Systems, a company that provides fast, low-cost, and scalable disk-based backup solutions with data de-duplication.

Confusion about VTL and Disk-based Backup

Posted by Bill Andrews on Mon, Sep 08, 2008 @ 10:05 AM

There seems to be a lot of confusion around disk-based backup and VTL (virtual tape library). Many use these terms interchangeably, as if to imply that disk-based backup is VTL and VTL is disk-based backup. The truth is that VTL is an interface between the backup server and the disk.

All backup applications can write to three targets:

  1. Tape library
  2. NAS share (a share on a network-attached storage device)
  3. Disk volume - any disk

If you want to back up to disk, you have three choices:

  1. If you want to write in tape mode, you need to put a VTL between the backup server and the disk. The VTL emulates a tape library on the front end and writes to disk on the back end. Until a few years ago this was the only way to write to disk. But then the backup applications all added the ability to write natively to disk, either to a NAS share or to a disk volume. As a result, VTL, which was a stopgap, has gone away in the mass market. However, it still has value in Fibre SAN environments.
  2. You can point backup jobs at NAS shares. Simply plug a NAS server in behind your backup server and point your backups at its shares.
  3. You can point backup jobs at disk volumes. This is the least common method in the industry, as all the products with data de-duplication use either NAS or VTL.

The industry is dividing into two camps, as the disk-based backup systems with data de-duplication on the market offer either NAS or VTL.

  1. If your backup server is on a Fibre SAN and you want the disk-based backup product on the Fibre SAN as well, VTL can handle SAN block-level traffic. This tends to be the case mostly in large enterprises.
  2. Mid-market and small enterprise customers make up the mass market. They either have no Fibre SAN or, if they do have one, run their backup application on the Ethernet network rather than on Fibre. For them the solution of choice is a NAS-based disk-based backup system. The backup server connects to it via Ethernet, either over the existing Ethernet network or, to keep backup traffic off that network, over a private Ethernet connection between the backup server and the disk-based backup system.

Bill Andrews is President and CEO of ExaGrid Systems, a company that provides fast, low-cost, and scalable disk-based backup solutions with data de-duplication.
