Update README (commit 2ce24281d4, parent daf555e4c5)

# Cyclic Poly

This crate is an implementation of the hashing algorithm given in Section 2.3 of the paper:

> Jonathan D. Cohen, Recursive Hashing Functions for n-Grams,
> ACM Trans. Inf. Syst. 15 (3), 1997.

Other open-source implementations based on this paper usually concentrate on the algorithms that appear in its later sections.

This hashing algorithm has two features that distinguish it from most other hashing algorithms:

* Recurrent/Rolling hash calculation is supported.
* Hash values are decomposable.

## Recurrent/Rolling

A rolling hash (sometimes called a recurrent hash) calculates a new hash value from a previous hash value and a small change in the data. Rolling hashes are useful for hashing a window of data that slides along a collection of data:

![Rolling Hash](docs/rolling.png)

This is much more efficient than recalculating the hash value from scratch at every position in the data.

Rolling hashes are an efficient way to find a block of data that has a given hash value.

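As a sketch of how the recurrence works: a cyclic-polynomial hash folds each byte through a substitution table and a one-bit rotation, so sliding the window only needs to cancel the oldest byte and mix in the new one. The code below is an illustrative stand-alone implementation, not this crate's API, and `table` is a simple deterministic mixer standing in for the table of random values the paper assumes.

```rust
/// Stand-in substitution table: maps a byte to a pseudo-random 32-bit
/// value. The paper assumes a table of truly random values; this
/// mixer is only for illustration.
fn table(byte: u8) -> u32 {
    let mut x = byte as u32 ^ 0x9E37_79B9;
    x = x.wrapping_mul(0x85EB_CA6B);
    x ^= x >> 13;
    x = x.wrapping_mul(0xC2B2_AE35);
    x ^ (x >> 16)
}

/// Hash an n-byte window from scratch:
/// h = s^(n-1)(T[x1]) ^ ... ^ s(T[x(n-1)]) ^ T[xn],
/// where s is a one-bit left rotation.
fn hash(window: &[u8]) -> u32 {
    window.iter().fold(0, |h, &b| h.rotate_left(1) ^ table(b))
}

/// Roll the hash of an n-byte window one position: remove the oldest
/// byte `old` (rotated out by n bits) and append the new byte `new`.
fn roll(h: u32, n: u32, old: u8, new: u8) -> u32 {
    h.rotate_left(1) ^ table(old).rotate_left(n) ^ table(new)
}

fn main() {
    let data = b"the quick brown fox";
    let n = 8;
    let mut h = hash(&data[..n]);
    for i in n..data.len() {
        h = roll(h, n as u32, data[i - n], data[i]);
        // The rolled hash matches hashing the new window from scratch.
        assert_eq!(h, hash(&data[i - n + 1..=i]));
    }
    println!("rolling hash matches from-scratch hash");
}
```

Each slide costs two table lookups and three rotations, independent of the window size — which is what makes the sliding-window search above cheap.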
## Decomposable

A decomposable hash can calculate a hash value for a block of data given a hash value for a bigger block of data and the hash value for the remaining data:

![Decomposable Hash](docs/decomposable.png)

In the example above, the hash $h_3$ can be calculated from the hashes $h_1$ and $h_2$.

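A sketch of why this works, using an illustrative cyclic-polynomial hash (one-bit rotations over a byte substitution table; not this crate's API): for blocks `A` and `B`, the hash obeys `H(A ++ B) = rotl(H(A), |B|) ^ H(B)`, so the hash of the remaining data can be recovered with one rotation and one XOR.

```rust
/// Illustrative substitution table (a deterministic stand-in mixer;
/// a real implementation would use a table of random values).
fn table(byte: u8) -> u32 {
    let mut x = byte as u32 ^ 0x9E37_79B9;
    x = x.wrapping_mul(0x85EB_CA6B);
    x ^= x >> 13;
    x = x.wrapping_mul(0xC2B2_AE35);
    x ^ (x >> 16)
}

/// Cyclic-polynomial hash: fold with a one-bit left rotation.
fn hash(data: &[u8]) -> u32 {
    data.iter().fold(0, |h, &b| h.rotate_left(1) ^ table(b))
}

fn main() {
    let (a, b) = (b"hello ".as_slice(), b"world".as_slice());
    let whole = hash(&[a, b].concat()); // hash of the bigger block

    // H(A ++ B) = rotl(H(A), |B|) ^ H(B), so the hash of the
    // remaining data B decomposes out of the two known hashes:
    let h_b = whole ^ hash(a).rotate_left(b.len() as u32);
    assert_eq!(h_b, hash(b));
    println!("decomposed hash matches direct hash");
}
```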
# Alternative Algorithms

The Adler32 hash/checksum algorithm can also be used as a rolling hash; it was made popular by the `zlib` library and the `rsync` tool.

# Performance

Performance comparisons were done on a laptop and are relative to the [adler32 crate](https://docs.rs/adler32/latest/adler32/). To run this comparison, execute `cargo bench`.

## Single block

Cyclic Polynomial hashing is slightly slower than Adler32 when hashing a single block.

| Algorithm         | MB/sec |
| ----------------- | ------ |
| Cyclic Poly 32bit | 2127   |
| Cyclic Poly 64bit | 2126   |
| Adler32           | 2562   |

## Rolling

The calculation of rolling hashes is faster than Adler32.


| Algorithm         | MB/sec |
| ----------------- | ------ |
| Cyclic Poly 32bit | 1254   |
| Cyclic Poly 64bit | 1048   |
| Adler32           | 170    |

## Collisions

Hash value collisions can be measured by executing `cargo test -- --ignored`, which runs the test called `collisions()`.
The [expected number of collisions](https://math.stackexchange.com/questions/35791/birthday-problem-expected-number-of-collisions) is calculated as $n (1 - (1 - \frac{1}{N})^{n-1})$, where $n$ is the number of blocks hashed and $N$ is $2^{32}$.

| Algorithm         | Collisions |
| ----------------- | ---------- |
| Expected Value    | ???        |
| Cyclic Poly 32bit | 114        |
| Adler32           | 2835       |

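The birthday-problem formula above can be evaluated directly. A minimal sketch — the block count of one million used here is an arbitrary example, not the number of blocks the `collisions()` test actually hashes:

```rust
/// Expected number of colliding items when hashing `n` blocks into a
/// space of 2^hash_bits values: n * (1 - (1 - 1/N)^(n-1)).
fn expected_collisions(n: f64, hash_bits: i32) -> f64 {
    let big_n = 2f64.powi(hash_bits);
    n * (1.0 - (1.0 - 1.0 / big_n).powf(n - 1.0))
}

fn main() {
    // Example: hashing one million blocks into 32-bit values gives
    // roughly n^2 / N, i.e. about 233 expected collisions.
    let e = expected_collisions(1_000_000.0, 32);
    println!("{e:.1}");
}
```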
## TODO
|
BIN
docs/decomposable.png
Normal file
BIN
docs/decomposable.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 57 KiB |
BIN
docs/rolling.png
Normal file
BIN
docs/rolling.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 38 KiB |
Loading…
Reference in a new issue