Update README

2021-12-30 20:07:06 +01:00 · 2021-12-30 20:07:06 +01:00 · 2ce24281d4
parent daf555e4c5
commit 2ce24281d4
3 changed files with 67 additions and 2 deletions
--- a/README.md
+++ b/README.md
@@ -1,11 +1,76 @@

 # Cyclic Poly

-From section 2.3 
+This crate is an implementation of the hashing algorithm given in Section 2.3 of the paper:
+> Jonathan D. Cohen, Recursive Hashing Functions for n-Grams,
+> ACM Trans. Inf. Syst. 15 (3), 1997.
+
+Other open-source implementations of this paper usually concentrate on the algorithms that
+appear in its later sections.
+
+This hashing algorithm has two features that stand out in comparison to other hashing algorithms:
+* Recurrent/Rolling hash calculation is supported.
+* Hash values are decomposable.
+
+## Recurrent/Rolling
+
+
+A rolling hash (sometimes called a recurrent hash) can calculate a new hash value from the previous hash value and a
+small change in the data. Rolling hashes are useful for hashing a window that slides along a larger collection of data:
+
+![Rolling Hash](docs/rolling.png)
+
+This is much more efficient than recalculating the hash of the whole window from scratch at every position in the data.
+
+Rolling hashes are an efficient way to find a block of data that has a given hash value.
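+
+The sliding-window update can be sketched as follows. This is a minimal illustration of a cyclic polynomial hash, not this crate's API: `table`, `hash_block` and `roll` are hypothetical names, and the mixing function in `table` merely stands in for the table of random words that a real implementation would use.
+
+```rust
+/// Hypothetical byte-to-word mapping; a real implementation would use a table of random words.
+fn table(byte: u8) -> u32 {
+    let mut z = (byte as u64 + 1).wrapping_mul(0x9E37_79B9_7F4A_7C15);
+    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
+    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
+    (z ^ (z >> 31)) as u32
+}
+
+/// Hash a whole block from scratch: rotate the running value by one bit,
+/// then XOR in the word for the next byte.
+fn hash_block(block: &[u8]) -> u32 {
+    block.iter().fold(0u32, |h, &b| h.rotate_left(1) ^ table(b))
+}
+
+/// Slide a window of `len` bytes by one position: drop `old`, append `new`.
+fn roll(hash: u32, len: usize, old: u8, new: u8) -> u32 {
+    // After the extra rotation, the word of the outgoing byte has been
+    // rotated `len` times in total, so XOR-ing that value cancels it.
+    hash.rotate_left(1) ^ table(old).rotate_left((len % 32) as u32) ^ table(new)
+}
+
+fn main() {
+    let data = b"the quick brown fox jumps over the lazy dog";
+    let len = 8;
+    let mut hash = hash_block(&data[..len]);
+    for start in 1..=data.len() - len {
+        hash = roll(hash, len, data[start - 1], data[start + len - 1]);
+        // The rolled value matches a full recalculation of the same window.
+        assert_eq!(hash, hash_block(&data[start..start + len]));
+    }
+    println!("rolling and from-scratch hashes agree");
+}
+```
+
+Each slide costs a constant number of operations, independent of the window length.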
+
+## Decomposable
+
+A decomposable hash can calculate the hash value of a block of data given the hash value of a bigger block that contains it and
+the hash value of the remaining data:
+
+![Decomposable Hash](docs/decomposable.png)
+
+In the example above, the hash $h_3$ can be calculated from the hashes $h_1$ and $h_2$.
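+
+One way to see why this works, as a sketch that assumes the usual cyclic-polynomial form (the symbols $A$, $B$, $s$ and $T$ are introduced here for illustration and do not come from the crate): let $s$ rotate a word left by one bit and let $T$ map each byte to a random word, so that a block $c_0 \dots c_{k-1}$ hashes to $h = s^{k-1}(T(c_0)) \oplus \dots \oplus s(T(c_{k-2})) \oplus T(c_{k-1})$. Because rotation distributes over XOR, a block $A$ followed by a block $B$ satisfies:
+
+$$
+h(A \Vert B) = s^{|B|}\bigl(h(A)\bigr) \oplus h(B)
+$$
+
+Either part can then be recovered from the whole and the other part:
+
+$$
+h(B) = h(A \Vert B) \oplus s^{|B|}\bigl(h(A)\bigr), \qquad
+h(A) = s^{-|B|}\bigl(h(A \Vert B) \oplus h(B)\bigr)
+$$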
+
+# Alternative Algorithms
+
+The Adler32 hash/checksum algorithm can be used as a rolling hash and was made popular by the `zlib` library and `rsync` tool.
+
+# Performance
+
+Performance comparisons were done on a laptop and are relative to the [adler32 crate](https://docs.rs/adler32/latest/adler32/). To run this comparison, execute `cargo bench`.
+
+## Single block
+
+Cyclic polynomial hashing is about 17% slower than Adler32 when hashing a single block.
+
+| Algorithm         | MB/sec |
+| ----------------- | ------ |
+| Cyclic Poly 32bit | 2127   |
+| Cyclic Poly 64bit | 2126   |
+| Adler32           | 2562   |

 ## Rolling

-## Decomposable
+Calculating rolling hashes is roughly six to seven times faster with cyclic polynomial hashing than with Adler32.
+
+| Algorithm         | MB/sec |
+| ----------------- | ------ |
+| Cyclic Poly 32bit | 1254   |
+| Cyclic Poly 64bit | 1048   |
+| Adler32           | 170    |
+
+## Collisions
+
+Hash value collisions can be measured by executing `cargo test -- --ignored`, which runs the test called `collisions()`.
+The [expected number of collisions](https://math.stackexchange.com/questions/35791/birthday-problem-expected-number-of-collisions) is calculated as $n \left(1 - \left(1 - \frac{1}{N}\right)^{n-1}\right)$, where $n$ is the number of blocks hashed and $N$ is $2^{32}$.
+
+| Algorithm         | Collisions |
+| ----------------- | ---------- |
+| Expected Value    | ???        |
+| Cyclic Poly 32bit | 114        |
+| Adler32           | 2835       |
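+
+For a quick estimate when $n \ll N$, the formula above can be approximated by expanding $(1 - \frac{1}{N})^{n-1}$ to first order. This is a sketch for orientation only and is not part of the `collisions()` test:
+
+$$
+n \left(1 - \left(1 - \frac{1}{N}\right)^{n-1}\right) \approx n \cdot \frac{n-1}{N} = \frac{n(n-1)}{N}
+$$
+
+With $N = 2^{32}$, the expected value therefore grows roughly with the square of the number of blocks hashed.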

 ## TODO

--- a/docs/decomposable.png
+++ b/docs/decomposable.png
--- a/docs/rolling.png
+++ b/docs/rolling.png