Cyclic Polynomial Recursive/Rolling Hash
Go to file
Alienscience c94512007b
ci/woodpecker/push/woodpecker Pipeline was successful Details
chore: Release cyclic-poly-23 version 0.3.1
2024-02-10 12:48:10 +01:00
benches Made api clearer. 2022-01-30 10:52:14 +01:00
docs Update README 2021-12-30 20:07:06 +01:00
examples Made api clearer. 2022-01-30 10:52:14 +01:00
scripts Fix release script 2024-02-10 12:47:48 +01:00
src Added reset_hash 2022-03-16 18:12:51 +01:00
.gitignore Initial put 2021-12-28 18:01:46 +01:00
.gitlab-ci.yml Speed up CI 2022-10-13 16:30:52 +02:00
.woodpecker.yml Fix release script 2024-02-10 12:47:48 +01:00
Cargo.toml chore: Release cyclic-poly-23 version 0.3.1 2024-02-10 12:48:10 +01:00
LICENSE-APACHE2 Move collisions test to an example program. 2021-12-31 16:59:23 +01:00
LICENSE-MIT Move collisions test to an example program. 2021-12-31 16:59:23 +01:00
README.md Added support for no_std. 2022-01-30 11:42:33 +01:00

README.md

Docs

Cyclic Poly

This crate is an implementation of the hashing algorithm given in Section 2.3 of the paper:

Jonathan D. Cohen, Recursive Hashing Functions for n-Grams, ACM Trans. Inf. Syst. 15 (3), 1997.

Other opensource implementations from this paper usually concentrate on other algorithms that appear in later sections.

This hashing algorithm has features that stand out in comparison to other hashing algorithms:

  • The algorithm supports recurrent/rolling hash calculations.
    • The rolling hash calculation is around 5 times faster than Adler32.
  • Hash values are decomposable.

Recurrent/Rolling

A rolling hash (sometimes called a recurrent hash), can calculate a new hash value from a previous hash value and a small change in the data. Rolling hashes are useful to hash a window of data that slides along a collection of data:

Rolling Hash

This is much faster than recalculating a hash value for every block.

Rolling hashes are an efficient way for finding a block of data that has a given hash value.

Decomposable

A decomposable hash, can calculate a hash value for a block of data given a hash value for a bigger block of data and the hash value for the remaining data:

Decomposable Hash

In the example above, the hash `h_3` can be calculated from the hashes `h_1` and `h_2`.

Alternative Algorithms

The Adler32 hash/checksum algorithm can be used as a rolling hash and was made popular by the zlib library and rsync tool.

Performance

Performance comparisons were done on a laptop and are relative to the adler32 crate. To run this comparison execute cargo bench.

Single block

The Cyclic Polynomial hashing is slower than Adler32 when hashing a single block.

Algorithm MB/sec
Cyclic Poly 32bit 2127
Cyclic Poly 64bit 2126
Adler32 2562

Rolling

The calculation of rolling hashes is faster than Adler32.

Algorithm MB/sec
Cyclic Poly 32bit 1254
Cyclic Poly 64bit 1048
Adler32 170

Collisions

To calculate hash collisions on random data execute:

cargo run --example collisions

This will output the number of collisions.

Algorithm Collisions
Random Number Generator 118
Cyclic Poly 32bit 140
Adler32 2872

Cargo Features

std

By default, this crate uses the Rust standard library enabled with the feature std . The crate supports no_std if it is included as a dependency without default features e.g.:

cyclic-poly-23 = { version = "xxx", default-features = false }