Hashing don't care bits - data-structures

I have a set of rules for Deep Packet Inspection (DPI), meaning that when a packet arrives I have to go over the set of rules to find a matching rule and decide what action to take (Accept/Deny).
Our algorithm hashes that set of rules twice so that we get small chunks of sub-rules. When a packet arrives, we "send" it through the same hash tables, so we end up searching for a matching rule only within a small chunk of rules.
The only problem is rules with don't-care bits. We thought about duplicating each such rule, but this would take a vast amount of memory: we hash 17 bits of the 104 in the first table and 16 bits of the 104 in the second, so if all 17 and 16 bits of a specific rule are don't-care, that rule alone would need an extra 2^33 entries(!). We are okay with a lot of pre-processing time.
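To make the blow-up concrete, here is a small sketch (my own illustration, assuming rules are written as 104-character strings of '0'/'1'/'*' and that the two tables cover the first 17 and the next 16 bits, as described) that counts how many concrete entries naive duplication would create for one rule:

```python
# Sketch only; assumes rules are 104-char strings of '0'/'1'/'*', with the
# first hash table covering bits 0..16 and the second covering bits 17..32.

def duplication_cost(rule: str) -> int:
    """Concrete entries naive duplication would create for this one rule
    across both hash tables (2**k per table, k = don't-care bits hashed)."""
    first = rule[:17]     # field used by the first hash table
    second = rule[17:33]  # field used by the second hash table
    return (2 ** first.count('*')) * (2 ** second.count('*'))

# A rule whose whole 17-bit and 16-bit hashed fields are don't-care
# explodes into 2**33 entries, as noted above.
worst = '*' * 33 + '0' * 71
print(duplication_cost(worst))  # 8589934592 == 2**33
```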

Related

If a Bitcoin mining nonce is just 32 bits long, how come it is increasingly difficult to find the winning hash?

I'm learning about mining, and the first thing that surprised me is that the nonce part of the algorithm, which is supposed to be looped over until you get a hash smaller than the target, is just 32 bits long.
Can you explain why it is then so difficult to loop over an unsigned int, and why it becomes increasingly difficult over time? Thank you.
The task is: try different nonce values in your potential block until you reach a block having a hash value below some given threshold.
I can't find the source right now, but I'm quite sure that since the introduction of special mining ASICs the 32-bit nonce is no longer enough to keep the miners busy for the planned 10-minute interval between blocks. They are able to compute 4 billion block hashes in less than 10 minutes.
Increasing the difficulty didn't help anymore, as that reached the point where none of the 4 billion possible nonce values gave a hash below the threshold.
So they found some additional fields in the block that are now used as nonce-extension. The principle is still the same: try different values until you reach a block with a hash below the threshold, only now it's more than 32 bits that can be varied, allowing for the threshold to be lowered beyond the former 32-bit-implied barrier.
Because it's not just the 32-bit nonce that is involved in the calculation. The 1 MB of transaction data is also part of the mining input. There is then a non-trivial amount of arithmetic to arrive at the output, which can then be compared with the target.
Bitcoin mining is looping over all 4 billion uints until you find a "right" one.
The way difficulty is increased is that only some of the bits of the output matter. E.g. early on, the lowest 11 bits had to be some specific pattern and the remaining 21 bits could be anything. In theory there would be 2 million "right" values for each transaction block, uniformly distributed across the range of a uint. Then the "difficulty" is increased so that 13 bits have to match some pattern, so now there are 4x fewer "right" answers and it takes (on average) 4x longer to find one.
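To see the effect, here is a toy sketch (not real Bitcoin mining: it double-SHA-256 hashes a made-up header plus a 32-bit nonce and demands some number of leading zero bits); each extra required bit roughly doubles the expected number of nonce attempts:

```python
# Toy illustration only. Requiring k leading zero bits in a double SHA-256
# output means ~2**k nonce attempts on average, so each extra bit of
# "difficulty" doubles the expected work.
import hashlib
from itertools import count

def mine(header: bytes, zero_bits: int):
    target = 1 << (256 - zero_bits)           # hash must be below this value
    for attempts, nonce in enumerate(count(), start=1):
        data = header + nonce.to_bytes(4, 'little')
        digest = hashlib.sha256(hashlib.sha256(data).digest()).digest()
        if int.from_bytes(digest, 'big') < target:
            return nonce, attempts

print(mine(b'example block header', 16))  # ~2**16 = 65k attempts on average
print(mine(b'example block header', 20))  # ~16x the work of the line above
```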

Using hashing for efficient Deep Packet Inspection

In order to enhance the performance of Deep Packet Inspection, we preprocess the set of rules with a hashing algorithm that divides the rules into smaller chunks of sub-rules, making the inspection much faster.
The hashing is done on the first 17 bits of the original 104. After the preprocessing is done, whenever a packet arrives we hash its first 17 bits and check it against the much smaller set of rules selected by the result.
(The algorithm is applied twice: after hashing the first 17 bits it hashes the next 16 bits and stores those results as well, but for this specific problem we can assume that we're only performing a single hash on a fixed number of bits.)
The algorithm is indeed efficient; however, we can't seem to find a way to apply it to entries with don't-care bits, of which we get a lot.
We searched for a solution in numerous places and tried, for instance, the suggestion of duplicating rules with don't-care bits. That didn't work, however, because of the vast amount of memory it would take (each don't-care bit among the 17 can be either 1 or 0, so duplicating every such rule demands an exponential amount of space).
We would very much appreciate any suggestion or insight, even a partial solution would be great.
Note: There is no limit on preprocessing time or additional space as long as it is not exponential or anything impractical.
If you use the hash table as a cache and revert to something slower if an entry for the current value isn't found then you don't need to populate it completely. You could either build it ahead of time based on an analysis of previous traffic, creating as many entries as you can afford, or you could populate it dynamically, creating new entries after you process a packet when an entry was not found, and removing old entries that had not been used for some time to reclaim store.
2^17 is 131,072: large, but not inordinately so. If you use a bit of indirection (for example, storing your rules in an array without duplication, and then building a size-2^17 table of indices into that array), then you should be able to do this in well under 1 MB.
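A minimal sketch of that indirection, combined with the cache-plus-fallback idea from the previous answer (names here are illustrative, and a real implementation would use compact integer arrays rather than Python lists):

```python
# Sketch: a 2**17-entry table maps the hashed 17-bit field to a small list
# of candidate rule indices; rules whose 17-bit field contains don't-care
# bits go into a fallback list that is scanned linearly.
TABLE_SIZE = 1 << 17

def build_index(rules):
    """rules: list of (key, action); key is the rule's 17-bit value,
    or None if that field contains don't-care bits."""
    table = [[] for _ in range(TABLE_SIZE)]
    wildcard = []                                # rules with no fixed bucket
    for i, (key, _) in enumerate(rules):
        (wildcard if key is None else table[key]).append(i)
    return table, wildcard

def classify(first17, rules, table, wildcard):
    # Check the small bucket first, then the wildcard rules as a fallback.
    for i in table[first17] + wildcard:
        key, action = rules[i]
        if key is None or key == first17:
            return action
    return 'accept'                              # illustrative default action

rules = [(0x1ABCD, 'deny'), (None, 'accept')]
table, wildcard = build_index(rules)
print(classify(0x1ABCD, rules, table, wildcard))  # 'deny'
print(classify(0x00001, rules, table, wildcard))  # 'accept' (wildcard rule)
```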

Such a thing as a constant quality (variable bit) digest hashing algorithm?

Problem space: We have a ton of data to digest, ranging over 6 orders of magnitude in size. We are looking for a way to be more efficient, and thus use less disk space to store all of these digests.
So I was thinking about lossy audio encoding, such as MP3. There are two basic approaches - constant bitrate and constant quality (aka variable bitrate). Since my primary interest is quality, I usually go for VBR. Thus, to achieve the same level of quality, a pure sine tone requires a significantly lower bitrate than something like a complex classical piece.
Using the same idea, two very small data chunks should require significantly fewer total digest bits than two very large data chunks to ensure roughly the same statistical improbability (what I am calling quality in this context) of their digests colliding. This is an assumption that seems intuitively correct to me, but then again, I am not a crypto mathematician. Also note that this is all about identification, not security. It's okay if a small data chunk has a small digest, and is thus computationally feasible to reproduce.
I tried searching around the inter-tubes for anything like this. The closest thing I found was a posting somewhere that talked about using a fixed-size digest hash, like SHA256, as an initialization vector for AES/CTR acting as a pseudo-random generator, and then taking the first x bits off that.
That seems like a totally doable thing. The only problem with this approach is that I have no idea how to calculate the appropriate value of x as a function of the data chunk size. I think my target quality would be the statistical improbability of a SHA256 collision between two 1 GB data chunks. Does anyone have thoughts on this calculation?
Are there any existing digest hashing algorithms that already do this? Or are there any other approaches that will yield this same result?
Update: Looks like there is the SHA3 Keccak "sponge" that can output an arbitrary number of bits. But I still need to know how many bits I need as a function of input size for a constant quality. It sounded like this algorithm produces an infinite stream of bits and you just truncate at however many you want. However, testing in Ruby, I would have expected the first half of a SHA3-512 digest to be exactly equal to a SHA3-256 digest, but it was not...
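The arbitrary-length members of the SHA-3 family are the SHAKE extendable-output functions, not truncations of SHA3-256/512; SHA3-256 and SHA3-512 use different internal capacities, which is why one is not a prefix of the other. With an XOF the prefix property does hold, so you can choose the digest length per chunk. A quick sketch using Python's hashlib (rather than Ruby):

```python
import hashlib

data = b'some data chunk'
print(hashlib.shake_256(data).hexdigest(16))    # 128-bit digest
print(hashlib.shake_256(data).hexdigest(32))    # 256-bit digest; its first
                                                # 16 bytes match the line above
print(hashlib.sha3_256(data).hexdigest())
print(hashlib.sha3_512(data).hexdigest()[:64])  # NOT equal to the sha3_256 line
```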
Your logic from the comment is fairly sound. Quality hash functions will not generate a duplicate/previously generated output until the input length is nearly (or has exceeded) the hash digest length.
But the key factor in collision risk is the size of the input set relative to the size of the hash digest. When using a quality hash function, the chance of a collision for two 1 TB files is not significantly different from the chance of a collision for two 1 KB files, or even one 1 TB and one 1 KB file. This is because hash functions strive for uniformity; good functions achieve it to a high degree.
Due to the birthday problem, however, the collision risk becomes significant with far fewer inputs than the 2^n outputs an n-bit hash can produce; roughly speaking, collisions become likely once you have around 2^(n/2) inputs. The Wikipedia article on the pigeonhole principle, which is the basis for the birthday problem, says:
The [pigeonhole] principle can be used to prove that any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger. Otherwise, the set of all input sequences up to a given length L could be mapped to the (much) smaller set of all sequences of length less than L, and do so without collisions (because the compression is lossless), which possibility the pigeonhole principle excludes.
So going to a 'VBR' hash digest is not guaranteed to save you space. The birthday problem provides the math for calculating the chance that two random things will share the same property (a hash code is a property, in a broad sense), but this article gives a better summary, including the following table.
Source: preshing.com
The top row of the table says that in order to have a 50% chance of a collision with a 32-bit hash function, you only need to hash 77k items. For a 64-bit hash function, that number rises to 5.04 billion for the same 50% collision risk. For a 160-bit hash function, you need 1.42 * 10^24 inputs before there is a 50% chance that a new input will have the same hash as a previous input.
Note that 1.42 * 10^24 160-bit numbers would themselves take up an unreasonably large amount of space; tens of trillions of terabytes, if I'm doing the math right. And that's without counting the 10^24 item values they represent.
The bottom end of that table should convince you that a 160-bit hash function has a sufficiently low risk of collisions. In particular, you would have to have 10^21 hash inputs before there is even a 1 in a million chance of a hash collision. That's why your searching turned up so little: it's not worth dealing with the complexity.
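If you want to reproduce the table's numbers yourself, the standard birthday approximation n ≈ sqrt(2 * 2^b * ln(1/(1-p))) gets you there; a small sketch:

```python
# Birthday-bound sketch: approximate number of inputs n needed to reach a
# collision probability p with a b-bit hash: n ~ sqrt(2 * 2**b * ln(1/(1-p))).
import math

def inputs_for_collision_prob(bits: int, p: float) -> float:
    return math.sqrt(2 * (2.0 ** bits) * math.log(1 / (1 - p)))

print(f"{inputs_for_collision_prob(32, 0.5):.3g}")    # ~7.72e4 (the 77k above)
print(f"{inputs_for_collision_prob(64, 0.5):.3g}")    # ~5.06e9
print(f"{inputs_for_collision_prob(160, 0.5):.3g}")   # ~1.42e24
print(f"{inputs_for_collision_prob(160, 1e-6):.3g}")  # ~1.71e21
```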
No matter what hash strategy you decide upon however, there is a non-zero risk of collision. Any type of ID system that relies on a hash needs to have a fallback comparison. An easy additional check for files is to compare their sizes (works well for any variable length data where the length is known, such as strings). Wikipedia covers several different collision mitigation and detection strategies for hash tables, most of which can be extended to a filesystem with a little imagination. If you require perfect fidelity, then after you've run out of fast checks, you need to fallback to the most basic comparator: the expensive bit-for-bit check of the two inputs.
If I understand the question correctly, you have a number of data items of different lengths, and for each item you are computing a hash (i.e. a digest) so the items can be identified.
Suppose you have already hashed N items (without collisions), and you are using a 64-bit hash code.
The next item you hash will take one of 2^64 values, so there is an N / 2^64 probability of a hash collision when you add it.
Note that this probability does NOT depend on the original size of the data item. It does depend on the total number of items you have to hash, so you should choose the number of bits according to the probability you are willing to tolerate of a hash collision.
However, if you have partitioned your data set in some way such that there are different numbers of items in each partition, then you may be able to save a small amount of space by using variable sized hashes.
For example, suppose you use 1TB disk drives to store items, and all items >1GB are on one drive, while items <1KB are on another, and a third is used for intermediate sizes. There will be at most 1000 items on the first drive so you could use a smaller hash, while there could be a billion items on the drive with small files so a larger hash would be appropriate for the same collision probability.
In this case the hash size does depend on file size, but only in an indirect way based on the size of the partitions.
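As a sketch of that sizing decision (illustrative numbers only), you can pick the digest width for each partition from the expected item count N and the total collision probability p you are willing to tolerate, using P(collision) ≈ N(N-1) / (2 * 2^b):

```python
# Sketch: smallest digest width b such that the expected total collision
# probability for n items stays below p, using P ~ n*(n-1) / (2 * 2**b).
import math

def bits_needed(n_items: int, p: float) -> int:
    return math.ceil(math.log2(n_items * (n_items - 1) / (2 * p)))

print(bits_needed(1_000, 1e-12))           # drive holding at most 1000 big items
print(bits_needed(1_000_000_000, 1e-12))   # drive holding a billion small items
```

The drive full of large items gets away with a much smaller digest than the drive full of small ones, even though the collision math never looks at the items' sizes directly.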

Does the length of numbers affect sorting time?

I have a simple question: does the length of the numbers that need to be sorted affect the sorting time?
Example: Suppose we need to sort 10 million 6-digit numbers (like 204134) and 10 million 2/3-digit numbers (like 24, 143), sorting each set individually. Is the set of 6-digit numbers going to take more time than the set of 2/3-digit numbers?
I thought the hardware uses one logic gate per digit, so 6 logic gates for 6 digits compared to 2/3 gates for the other set, but I don't know whether this affects the sorting time or not. Can someone explain this to me?
Help will be appreciated.
Thanks
The hardware works with bits, not with decimal digits. Furthermore, the hardware always works with the same fixed amount of bits (for a given operation); smaller values are padded. For example, a 32 bit CPU usually has 32 bit comparator units with exactly as much circuitry needed for 32 bit comparisons, and uses those regardless of whether the values currently being compared would fit into fewer bits.
Another issue with your thinking is that the exact number of logic gates doesn't matter much for performance. The propagation time of individual gates is much smaller than a clock cycle; only rather complicated circuits with long dependency chains actually take longer than a single cycle (and even then they might be pipelined to still get a throughput of 1 op per cycle). A surprisingly large number of logic gates in sequence (and a virtually unlimited number of logic gates in parallel) can easily finish their work within one clock cycle. Hence, a smart 64-bit comparison doesn't take more clock cycles than an 8-bit one.
The short answer: It depends, but probably not
The longer answer:
It's hard to know because you haven't said much about the hardware or the sorting algorithm. You mentioned later that you're using some MPI variant of Quicksort. So you're asking if there could be a performance difference between 6-digit numbers and 2/3-digit numbers due to the hardware. Well, if you pack those smaller numbers together, then you're going to get better bandwidth when transferring the dataset from memory to the processor. Since you haven't mentioned anything about compacted arrays, I'll assume you're not doing this. Once the value is in a register, it will have the same latency and throughput regardless of how many digits it has.
There are algorithms like radix sort that perform differently depending on the number of bits needed for your range of numbers. Since you're not using this, it doesn't apply.
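If you want a quick, rough sanity check of the "probably not" (a single-machine CPython measurement, nothing to do with MPI), sort equally long lists drawn from equally many distinct values, so that only the digit count differs:

```python
# Rough check: comparison cost of machine-word integers doesn't depend on
# how many decimal digits they have, so the two timings should be similar.
import random
import timeit

n = 1_000_000  # scaled down from 10 million to keep the check quick
six_digit   = [random.randint(100_000, 100_899) for _ in range(n)]  # 900 distinct values
three_digit = [random.randint(100, 999)         for _ in range(n)]  # 900 distinct values

print(timeit.timeit(lambda: sorted(six_digit), number=3))
print(timeit.timeit(lambda: sorted(three_digit), number=3))  # roughly the same
```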

Scalability of Aho-Corasick

I want to search a text document for occurrences of keyphrases from a database of keyphrases (extracted from Wikipedia article titles). (I.e., given a document, I want to find whether any of its phrases have a corresponding Wikipedia article.) I found out about the Aho-Corasick algorithm. I want to know whether building an Aho-Corasick automaton for a dictionary of millions of entries is efficient and scalable.
Let's just make a simple calculation:
Assume that you have 1 million patterns (strings, phrases) with an average length of 10 characters, and a value (label, token, pointer, etc.) of 1 word (4 bytes) assigned to each pattern.
Then you will need an array of (10 + 4) x 1 million = 14 million bytes (14 MB) just to hold the list of patterns.
From 1 million patterns of 10 bytes (letters, chars) each, you could build an AC trie with no more than 10 million nodes. How big this trie is in practice depends on the size of each node.
Each node should keep at least 1 byte for a label (letter) and a word (4 bytes) for a pointer to the next node in the trie (or to a pattern, for a terminal node), plus 1 bit (boolean) to mark a terminal node -
about 5 bytes in total.
So, for a trie over 1 million patterns of 10 chars each you will need a minimum of about 50 million bytes, or roughly 50 MB, of memory.
In practice it might be 3-10 times more, but that is still very manageable, as even 500 MB of memory is quite moderate today. (Compare it with Windows applications like Word or Outlook.)
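For what it's worth, here is that back-of-envelope arithmetic written out (illustrative only, same assumptions as above):

```python
# Back-of-envelope check of the estimate above.
patterns      = 1_000_000
avg_len_chars = 10
value_bytes   = 4                     # one 4-byte label/pointer per pattern

pattern_list_bytes = patterns * (avg_len_chars + value_bytes)  # 14,000,000 (~14 MB)
max_nodes          = patterns * avg_len_chars                  # 10,000,000 nodes
bytes_per_node     = 1 + 4            # label + pointer (ignoring the terminal bit)
min_trie_bytes     = max_nodes * bytes_per_node                # 50,000,000 (~50 MB)

print(pattern_list_bytes, max_nodes, min_trie_bytes)
```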
In terms of speed, the Aho-Corasick (AC) algorithm is almost unbeatable; it remains the best algorithm for multiple pattern matching ever devised. That's my strong personal, educated opinion, the academic garbage aside.
All reports of "new" latest-and-greatest algorithms that might outperform AC are highly exaggerated (except maybe for some special cases with short patterns, like DNA).
In practice, the only improvement to AC comes along the lines of more and faster hardware (multiple cores, faster CPUs, clusters, etc.).
Don't take my word for it, test it for yourself. But remember that the real speed of AC strongly depends on the implementation (language and quality of coding).
In theory, it should maintain linear speed subject only to the effects of the memory hierarchy - it will slow down as it gets too big to fit in cache, and when it gets really big, you'll have problems if it starts getting paged out.
OTOH the big win with Aho-Corasick is when searching for decent sized substrings that may occur at any possible location within the string being fed in. If your text document is already cut up into words, and your search phrases are no more than e.g. 6 words long, then you could build a hash table of K-word phrases, and then look up every K-word contiguous section of words from the input text in it, for K = 1..6.
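A rough sketch of that K-word lookup (names are illustrative; real code would also strip punctuation and normalise the phrases the same way as the document):

```python
# Sketch: store the phrase database as a set of word tuples, then test every
# contiguous window of 1..max_k words from the document against it.

def find_phrases(document, phrases, max_k=6):
    words = document.lower().split()
    hits = []
    for i in range(len(words)):
        for k in range(1, max_k + 1):
            window = tuple(words[i:i + k])
            if len(window) == k and window in phrases:
                hits.append(' '.join(window))
    return hits

titles = {('aho', 'corasick', 'algorithm'), ('hash', 'table')}
doc = "The Aho Corasick algorithm builds a trie, unlike a plain hash table"
print(find_phrases(doc, titles))  # ['aho corasick algorithm', 'hash table']
```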
(Answer to comment)
Aho-Corasick needs to live in memory, because you will be following pointers all over the place. If you have to work outside memory, it's probably easiest to go back to old-fashioned sort/merge. Create a file of K-word records from the input data, where K is the maximum number of words in any phrase you are interested in. Sort it, and then merge it against a file of sorted Wikipedia phrases. You can probably do this almost by hand on Unix/Linux, using utilities such as sort and join, and a bit of shell/awk/perl/whatever. See also http://en.wikipedia.org/wiki/Key_Word_in_Context (I'm old enough to have actually used one of these indexes, provided as bound pages of computer printout).
Well, there is a workaround: write the built AC trie of the dictionary into a text file in an XML-like format, make an index file for the first 6 levels of that trie, and so on. In my tests I search for all partial matches of a sentence in the dictionary (500,000 entries), and I get ~150 ms for ~100 results for a sentence of 150-200 symbols.
For more details, check out this paper : http://212.34.233.26/aram/IJITA17v2A.Avetisyan.doc
There are other ways to get performance:
- condense state transitions: you can get them down to 32 bits.
- ditch the pointers; write the state transitions to a flat vector.
- pack nodes near the tree root together: they will be in cache.
The implementation takes about 3 bytes per char of the original pattern set; with 32-bit nodes it can handle a pattern space of about 10M chars, and with 64-bit nodes I have yet to hit (or work out) the limit.
Doc: https://docs.google.com/document/d/1e9Qbn22__togYgQ7PNyCz3YzIIVPKvrf8PCrFa74IFM/view
Src: https://github.com/mischasan/aho-corasick
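For illustration only (this is not the linked implementation, and it leaves out failure links and the transition-condensing tricks listed above), here is the bare "flat vector instead of pointers" idea: the trie's goto function stored as one contiguous array of 32-bit state numbers, so matching walks the array instead of chasing pointers. The dense 256-entry rows are for clarity; condensing them is exactly what the tips above are about.

```python
# Sketch of a pointer-free trie: transitions live in one flat array of
# 32-bit state numbers indexed by state * 256 + input byte.
from array import array

def build_flat_goto(patterns):
    goto = array('I', [0] * 256)   # state 0 = root; 0 also means "no edge"
    outputs = {}                   # terminal state -> pattern ending there
    nstates = 1
    for pat in patterns:
        s = 0
        for b in pat:
            nxt = goto[s * 256 + b]
            if nxt == 0:           # allocate a new state (one new 256-entry row)
                nxt = nstates
                nstates += 1
                goto.extend([0] * 256)
                goto[s * 256 + b] = nxt
            s = nxt
        outputs[s] = pat
    return goto, outputs

goto, outputs = build_flat_goto([b'he', b'she', b'his'])

# Walk "she" from the root: contiguous array reads, no pointer chasing.
s = 0
for b in b'she':
    s = goto[s * 256 + b]
print(outputs.get(s))  # b'she'
```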
