Update README

Alienscience 2021-12-30 20:07:06 +01:00
parent daf555e4c5
commit 2ce24281d4
3 changed files with 67 additions and 2 deletions

@ -1,11 +1,76 @@
# Cyclic Poly
This crate is an implementation of the hashing algorithm given in Section 2.3 of the paper:
> Jonathan D. Cohen, Recursive Hashing Functions for n-Grams,
> ACM Trans. Inf. Syst. 15 (3), 1997.
Other open-source implementations based on this paper usually concentrate on the algorithms that
appear in its later sections.
This hashing algorithm has two features that stand out in comparison to other hashing algorithms:
* Recurrent/Rolling hash calculation is supported.
* Hash values are decomposable.
## Recurrent/Rolling
A rolling hash (sometimes called a recurrent hash) can calculate a new hash value from a previous hash value and a
small change in the data. Rolling hashes are useful for hashing a window that slides along a larger collection of data:
![Rolling Hash](docs/rolling.png)
This is much more efficient than recalculating the hash of the whole window each time the window moves by one byte.
Rolling hashes are an efficient way to find a block of data that has a given hash value.
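The rolling update can be sketched as follows. This is a minimal illustration of the rotate-and-XOR construction usually described for cyclic polynomial hashing, not this crate's API: the function names `byte_table`, `hash_block` and `roll`, and the way the byte table is generated, are assumptions made for the example.

```rust
/// Illustrative byte-to-word table (an assumption, NOT the crate's table).
/// Entries come from a SplitMix64-style mixer so no dependencies are needed.
fn byte_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^= z >> 31;
        *entry = (z >> 32) as u32;
    }
    table
}

/// Hash a whole block: rotate the running value left by one bit,
/// then XOR in the table entry for the next byte.
fn hash_block(table: &[u32; 256], data: &[u8]) -> u32 {
    data.iter()
        .fold(0u32, |h, &b| h.rotate_left(1) ^ table[b as usize])
}

/// Slide a window of `window_len` bytes one position to the right.
/// The outgoing byte has been rotated `window_len` times in total once the
/// hash is rotated again, so that rotated entry is XORed back out.
fn roll(table: &[u32; 256], hash: u32, window_len: usize, outgoing: u8, incoming: u8) -> u32 {
    let k = (window_len % 32) as u32;
    hash.rotate_left(1) ^ table[outgoing as usize].rotate_left(k) ^ table[incoming as usize]
}
```

Each call to `roll` costs a fixed number of rotations and XORs regardless of the window length, whereas `hash_block` touches every byte in the window.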
## Decomposable
A decomposable hash can calculate the hash value for a block of data, given the hash value for a bigger block that
contains it and the hash value for the remaining data:
![Decomposable Hash](docs/decomposable.png)
In the example above, the hash $h_3$ can be calculated from the hashes $h_1$ and $h_2$.
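A minimal sketch of why a rotate-and-XOR hash decomposes is shown below. It is an illustration under stated assumptions rather than this crate's API: the names `byte_table`, `hash_block`, `combine`, `remove_suffix` and `remove_prefix` are hypothetical, the byte table is generated only for the example, and the functions are not tied to the labels used in the figure.

```rust
/// Illustrative byte-to-word table (an assumption, NOT the crate's table).
fn byte_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^= z >> 31;
        *entry = (z >> 32) as u32;
    }
    table
}

/// Rotate-and-XOR hash of a whole block.
fn hash_block(table: &[u32; 256], data: &[u8]) -> u32 {
    data.iter()
        .fold(0u32, |h, &b| h.rotate_left(1) ^ table[b as usize])
}

/// Hash of `prefix ++ suffix`, computed from the two part hashes only.
/// Hashing the suffix rotates the prefix hash once per suffix byte.
fn combine(prefix_hash: u32, suffix_hash: u32, suffix_len: usize) -> u32 {
    prefix_hash.rotate_left((suffix_len % 32) as u32) ^ suffix_hash
}

/// Recover the prefix hash from the whole-block hash and the suffix hash.
fn remove_suffix(whole_hash: u32, suffix_hash: u32, suffix_len: usize) -> u32 {
    (whole_hash ^ suffix_hash).rotate_right((suffix_len % 32) as u32)
}

/// Recover the suffix hash from the whole-block hash and the prefix hash.
fn remove_prefix(whole_hash: u32, prefix_hash: u32, suffix_len: usize) -> u32 {
    whole_hash ^ prefix_hash.rotate_left((suffix_len % 32) as u32)
}
```

Because rotation distributes over XOR, none of these functions needs to read the underlying data again; the hashes and the suffix length are enough.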
# Alternative Algorithms
The Adler32 hash/checksum algorithm can be used as a rolling hash and was made popular by the `zlib` library and `rsync` tool.
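For comparison, a rolling update for Adler32 can be sketched from the standard definition of the checksum (two running sums modulo 65521). This is an illustrative sketch only: the function names are hypothetical and it does not reproduce the API of the `adler32` crate.

```rust
/// Largest prime below 2^16, the modulus used by Adler32.
const MOD: u32 = 65_521;

/// Compute the two Adler32 sums `(a, b)` for a block of data.
fn adler32_sums(data: &[u8]) -> (u32, u32) {
    let mut a = 1u32;
    let mut b = 0u32;
    for &byte in data {
        a = (a + byte as u32) % MOD;
        b = (b + a) % MOD;
    }
    (a, b)
}

/// Pack the two sums into the usual 32-bit checksum.
fn adler32_checksum(a: u32, b: u32) -> u32 {
    (b << 16) | a
}

/// Slide a window of `window_len` bytes one position to the right.
/// Uses a' = a - out + in and b' = b - window_len * out + a' - 1 (mod 65521);
/// MOD is added before each subtraction so the u32 arithmetic never underflows.
fn adler32_roll(a: u32, b: u32, window_len: u32, outgoing: u8, incoming: u8) -> (u32, u32) {
    let out = outgoing as u32;
    let inc = incoming as u32;
    let new_a = (a + MOD - out + inc) % MOD;
    let n_out = (window_len % MOD) * out % MOD;
    let new_b = (b + MOD - n_out + new_a + MOD - 1) % MOD;
    (new_a, new_b)
}
```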
# Performance
Performance comparisons were run on a laptop and are relative to the [adler32 crate](https://docs.rs/adler32/latest/adler32/). To run this comparison, execute `cargo bench`.
## Single block
Cyclic polynomial hashing is somewhat slower than Adler32 when hashing a single block, reaching about 83% of Adler32's throughput.
| Algorithm | MB/sec |
| ----------------- | ------ |
| Cyclic Poly 32bit | 2127 |
| Cyclic Poly 64bit | 2126 |
| Adler32 | 2562 |
## Rolling
Calculating rolling hashes is roughly 6 to 7 times faster with cyclic polynomial hashing than with Adler32.
| Algorithm | MB/sec |
| ----------------- | ------ |
| Cyclic Poly 32bit | 1254 |
| Cyclic Poly 64bit | 1048 |
| Adler32 | 170 |
## Collisions
Hash value collisions can be measured by executing `cargo test -- --ignored`, which runs the test called `collisions()`.
The [expected number of collisions](https://math.stackexchange.com/questions/35791/birthday-problem-expected-number-of-collisions) is calculated as $n (1 - (1 - \frac{1}{N})^{n-1})$, where $n$ is the number of blocks hashed and $N = 2^{32}$ is the number of possible 32-bit hash values.
| Algorithm | Collisions |
| ----------------- | ---------- |
| Expected Value | ??? |
| Cyclic Poly 32bit | 114 |
| Adler32 | 2835 |
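The expected-value formula quoted above can be evaluated with a few lines of Rust. This is a sketch, not the code used by the `collisions()` test: the function name `expected_collisions` is hypothetical, and because the number of blocks hashed by the test is not stated here, any $n$ passed in is the caller's assumption.

```rust
/// Evaluate n * (1 - (1 - 1/N)^(n-1)) with N = 2^hash_bits.
/// `ln_1p` and `exp_m1` are used because 1/N is tiny compared to 1, and
/// forming `1.0 - 1.0 / N` directly would lose precision.
fn expected_collisions(n: u64, hash_bits: u32) -> f64 {
    if n == 0 {
        return 0.0;
    }
    let big_n = 2f64.powi(hash_bits as i32);
    let n = n as f64;
    // (1 - 1/N)^(n-1) = exp((n - 1) * ln(1 - 1/N))
    let exponent = (n - 1.0) * (-1.0 / big_n).ln_1p();
    // 1 - exp(x) = -(exp(x) - 1)
    n * -exponent.exp_m1()
}
```

When $n$ is much smaller than $N$, the result is close to $n(n-1)/N$, which gives a quick sanity check on the value.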
## TODO

BIN docs/decomposable.png (binary file not shown, 57 KiB)

BIN docs/rolling.png (binary file not shown, 38 KiB)