Study of Chunking Algorithm in Data Deduplication

  • Conference paper
Proceedings of the International Conference on Soft Computing Systems

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 398)

Abstract

Data deduplication is an emerging technology that reduces storage utilization and handles data replication efficiently in backup environments. In cloud data storage, deduplication plays a major role in virtual machine frameworks, data-sharing networks, the handling of structured and unstructured social media data, and disaster recovery. In deduplication, data are broken into multiple pieces called “chunks,” and each chunk is identified by a unique hash identifier. These identifiers are compared against those of previously stored chunks to detect duplicates. Because chunking is the first step toward an efficient deduplication ratio and throughput, the choice of chunking algorithm is very important. In this paper, we discuss different chunking models and algorithms and compare their performance.
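
The chunk-and-hash workflow described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: it assumes fixed-size chunking (the simplest possible model), SHA-256 as the hash identifier, and an in-memory dictionary as the chunk store. The function names `fixed_size_chunks`, `deduplicate`, and `restore` are hypothetical and chosen only for illustration.

```python
import hashlib


def fixed_size_chunks(data: bytes, chunk_size: int = 4096):
    """Yield consecutive chunk_size-byte pieces of data (the last may be shorter)."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


def deduplicate(data: bytes, store: dict, chunk_size: int = 4096) -> list:
    """Add unseen chunks to `store` and return the list of hash identifiers.

    Assumption: SHA-256 hex digests serve as the chunk identifiers, and
    `store` maps identifier -> chunk bytes.
    """
    recipe = []
    for chunk in fixed_size_chunks(data, chunk_size):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:      # chunk not seen before: keep one copy
            store[digest] = chunk
        recipe.append(digest)        # always record the reference
    return recipe


def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original data from its list of chunk identifiers."""
    return b"".join(store[digest] for digest in recipe)


# Example: the first and third 4 KiB blocks are identical.
store = {}
backup = b"A" * 4096 + b"B" * 4096 + b"A" * 4096
recipe = deduplicate(backup, store)
logical_bytes = len(backup)
stored_bytes = sum(len(chunk) for chunk in store.values())
print(len(recipe), len(store), logical_bytes / stored_bytes)
```

In this example the data references three chunks but only two are stored, so the ratio of logical to stored bytes is 1.5. With fixed-size chunks, inserting a single byte near the start of the data shifts every later chunk boundary, which is one reason the choice of chunking algorithm affects the achievable deduplication ratio.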

Author information

Corresponding author

Correspondence to A. Venish.

Copyright information

© 2016 Springer India

About this paper

Cite this paper

Venish, A., Siva Sankar, K. (2016). Study of Chunking Algorithm in Data Deduplication. In: Suresh, L., Panigrahi, B. (eds) Proceedings of the International Conference on Soft Computing Systems. Advances in Intelligent Systems and Computing, vol 398. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2674-1_2

  • DOI: https://doi.org/10.1007/978-81-322-2674-1_2

  • Publisher Name: Springer, New Delhi

  • Print ISBN: 978-81-322-2672-7

  • Online ISBN: 978-81-322-2674-1

  • eBook Packages: Engineering, Engineering (R0)
