


There are a number of approaches to de-duplication. Having to know the internal structure of a file is probably the least attractive option - at least to me. We did something similar to what you're asking for and built a product around it.

A couple of observations first - and you've probably heard enough of this, given the age of the question - MD5 isn't your best friend. The probability of a collision is too high for use in these types of applications. We chose SHA-1, as have a lot of other products that do similar work.

You recognized the problem with simple "chunking" of the data...that all subsequent chunks might have to be rewritten in the case of an early-in-file insert.

First, you will probably recognize that below a certain size threshold, this matters very little. The IO for changed smaller files is something you will probably just absorb.

But, with larger files, it would be nice to not have to rewrite all the data if only a small portion changes...and with many large writable files, small changes in a large, otherwise static data set is exactly what happens. Files that have an internal database-ish structure, for example.

The trick, if it can be considered a trick, is to recognize ranges of data that are static. What this amounts to is calculating hashes on the data that you recognize.

For example, imagine calculating a rolling hash, byte by byte, on the file as you go through it. If your hashing function is reasonably distributive, each new byte will result in a hash value that's pretty randomly distant from the previous byte's contribution.

Recognizing a hash simply means that the hash value is in a certain subset of values that you choose arbitrarily...values that you've decided represent sentinel hashes. For example, you might recognize all hashes that divide evenly by a constant value.
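
As a rough illustration of the rolling-hash / sentinel idea (not our product's actual algorithm), here's a minimal Python sketch. The window size, divisor, and chunk-size limits are arbitrary values I've picked for the example; a real implementation would tune them:

    import hashlib

    # Assumed parameters (illustrative only): a polynomial rolling hash over a
    # 48-byte window, with a boundary whenever the hash divides evenly by 8192,
    # giving chunks of roughly 8 KB on average.
    WINDOW = 48
    MODULUS = 1 << 32
    BASE = 257
    DIVISOR = 8192
    MIN_CHUNK = 2048     # avoid pathologically small chunks
    MAX_CHUNK = 65536    # force a boundary eventually

    def chunk_boundaries(data: bytes):
        """Yield (start, end) offsets of content-defined chunks."""
        base_pow = pow(BASE, WINDOW, MODULUS)  # weight of the byte leaving the window
        start = 0
        h = 0
        for i, byte in enumerate(data):
            # Slide the window: add the new byte, drop the one that fell out.
            h = (h * BASE + byte) % MODULUS
            if i - start >= WINDOW:
                h = (h - data[i - WINDOW] * base_pow) % MODULUS
            length = i - start + 1
            # A "sentinel" hash (one that divides evenly by DIVISOR) marks a boundary.
            if (length >= MIN_CHUNK and h % DIVISOR == 0) or length >= MAX_CHUNK:
                yield start, i + 1
                start = i + 1
                h = 0
        if start < len(data):
            yield start, len(data)

    def chunk_digests(data: bytes):
        """SHA-1 each chunk; identical chunks dedupe to the same digest."""
        return [(s, e, hashlib.sha1(data[s:e]).hexdigest())
                for s, e in chunk_boundaries(data)]

The point of cutting chunks this way is that a boundary depends only on the bytes inside the rolling window. An insert near the start of the file changes the chunk it lands in (and maybe a neighbor), but once the scan hits the next sentinel the boundaries line up with the old ones again, so the later chunks hash to the same SHA-1 digests and don't need to be rewritten.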
