Computer Systems and Engineering Seminar Series

The Logic of Physical Garbage Collection in Deduplicating Storage
Fred Douglis, Dell EMC

Speaker: Fred Douglis
Date: Friday, January 27, 2017
Time: 11:45am - 12:45pm
Location: North Bldg. 311, Duke

Abstract

Most storage systems that write in a log-structured manner need a mechanism for garbage collection (GC), reclaiming and consolidating space by identifying unused areas on disk. In a deduplicating storage system, GC is complicated by the possibility of numerous references to the same underlying data. We describe two variants of garbage collection in a commercial deduplicating storage system, a logical one that operates on the files containing deduplicated data and a physical one that performs sequential I/O on the underlying data. The need for the second approach arises from a shift in the underlying workloads, in which exceptionally high duplication ratios or the existence of millions of individual small files result in unacceptably slow GC using the file-level approach. Under such workloads, determining the liveness of chunks becomes a slow phase of logical GC. We find that physical GC decreases the execution time of this phase by up to two orders of magnitude in the case of extreme workloads and improves it by approximately 10-60% in the common case, but only after additional optimizations to compensate for its higher initialization overheads. Joint work with Abhinav Duggal, Philip Shilane, Tony Wong, Shiqin Yan, and Fabiano Botelho.

Biography

Fred Douglis is a computer scientist with interests in distributed systems, storage, web technologies, and many other areas. He is currently a research scientist in the CTO office of the EMC Core Technologies Division, working on deduplicating backup systems and related technologies. He also serves as an advocate and internal resource for academic publishing within the division, identifying and nurturing promising technical work for external dissemination. From 2002 to 2009 he was a researcher at IBM Research in Hawthorne, NY, working on stream computing, deduplication, and other areas in distributed systems. Before joining IBM, he was with AT&T Labs--Research, where he was most recently a Division Manager. Before that, he was a scientist with the Mobios Project at the Matsushita Information Technology Laboratory, later called the Panasonic Information and Networking Technology Laboratory and eventually closed. Prior to that, he was a visiting professor at the Vrije Universiteit in Amsterdam, the Netherlands, working with Andy Tanenbaum.