IP substring search when IP is stored as an integer - algorithm

Let's say I have a lot of IPv4s stored as integers (specifically, in a relational database) and I want to do a substring search on them given a string representation of an IP.
For example, a user types in 12.3 and expects that they get back results such as 12.30.45.67, 192.168.12.3, 1.12.34.5, 9.212.34.5.
If the IP were a string, I could just do a plain substring search. It might not be efficient but it is at least simple to implement and understand. But because I can't readily change it into a string at the moment, I don't see any efficient (in terms of CPU cycles, memory, and also my development/implementation time) way of doing this, but maybe I am just missing something.

You aren't missing anything.
For example, try to turn 12.3 into a series of ranges. In whichever octet the 12 is, there will be 3 options (12, 112, 212). In whichever octet the 3 is, there will be 2 options (exactly 3, or 30-39). That's 6 ranges per combination of preceding octets.
But the preceding octets? We have 1 + 256 + 256*256 possibilities, depending on whether 0, 1 or 2 octets precede your start.
That's a grand total of 3 * 2 * (1 + 256 + 256*256) = 394758 ranges of numbers you have to search in. It is unlikely that doing that many index searches will be faster than scanning everything.
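A quick back-of-the-envelope check of that count (same assumptions as above: the pattern 12.3 means some octet ends in 12 and the next one starts with 3):
# Sketch: counting the ranges for the pattern "12.3" as described above.
endings = [12, 112, 212]                        # octet values ending in "12"
starts = [(3, 3), (30, 39)]                     # ranges for octets starting with "3": exactly 3, or 30-39
positions = 1 + 256 + 256 * 256                 # 0, 1 or 2 unconstrained octets before the match
print(len(endings) * len(starts) * positions)   # 3 * 2 * 65793 = 394758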
Incidentally, the worst case would be 1.2. In that case you'd have 17 * 3 * (1 + 256 + 256*256) = 3355443 range lookups to do!
If they want this badly enough, you need to do a full text search on strings.

Anything other than some pre-processing, indexing, or caching sounds too inefficient (and very hard to implement) to me in that case.
Here are a few ideas:
Look into creating a custom index, if possible, that enables you to do a string search.
Add an automatic field to the table that represents the IP as a string and lets you do a string search (see the conversion sketch after this answer). Add a corresponding index, of course.
If you can't or don't want to change the schema of that table, create another one with string representations of the rows in your ip table and corresponding foreign keys that map to the primary keys of the ip table.
If you can't or don't want to edit that database at all, create an external key/value store where the keys are string representations of the IPs and the values hold (or point to) the corresponding records of the IP table.
In any case, I don't think searching that table in its current form (integer) is feasible (both performance-wise and implementation-complexity-wise) considering your requirements.
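Whichever of these you choose, the common preprocessing step is turning the stored integer into its dotted string form. A minimal Python sketch, assuming the integers are standard 32-bit IPv4 values:
# Sketch: convert a stored integer into a dotted IPv4 string for the string column / external store.
import ipaddress

def ip_int_to_str(n):
    return str(ipaddress.ip_address(n))

print(ip_int_to_str(3232238595))   # '192.168.12.3'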

Related

How to map 100 million strings into 100 thousand ints?

I have about 100 million simple key-value pairs (it's legacy data that never needs to be updated, and the keys are random strings), and I want to store them in Redis for querying.
My thought was to use the first four characters as a hash key and store the pairs in a hash type, so there would be about a million hash keys in Redis, with each hash key having about 1000 sub-keys.
But things just didn't go as planned. For some reason, I found that some hash keys have only one sub-key, while some have more than 500,000 sub-keys, which may not be encoded in memory very efficiently.
So I'd like to know: is there a simple, understandable algorithm which can divide my 100 million strings evenly into 100 thousand buckets (ints)? When I pick up a string, I should be able to tell which bucket it belongs to by applying the same algorithm.
Thanks!
Using only a small portion of the string to compute the hash function can be a problem because your strings could, for example, all share the same prefix.
There is a description of string hash functions which take the entire string at http://www.javamex.com/tutorials/collections/hash_function_technical_2.shtml and Good Hash Function for Strings (actually they give two different descriptions of the same function).
One way to look at this is that it regards the characters of a string as the coefficients A, B, C of a polynomial of the form A + Bx + Cx^2 + Dx^3... where in this case x is 31 and arithmetic is modulo 2^32. If x is well chosen then this is a scheme with which there is a lot of experience and some maths may apply which gives it good properties. Even better is to do the arithmetic modulo the size of the hash table, and to choose the size of the hash table to be a prime. If your data is static, it might be worth trying a few different primes of around your preferred table size and a few different values of x, and picking the combination which gives you the most evenly populated table.
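A minimal sketch of that polynomial scheme in Python (x = 31 and the prime table size 100,003 are example choices, not requirements):
# Sketch: polynomial string hash over the whole string, taken modulo a prime table size.
def poly_hash(s, x=31, table_size=100_003):
    h = 0
    for ch in s:
        h = (h * x + ord(ch)) % table_size
    return h

print(poly_hash("some random legacy key"))   # a bucket index in [0, 100002]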

How to search in polynomial hash table in ascending order in C?

Hi there guys,
I'm developing a small program in C that reads strings from a .txt file in a two-letters-plus-three-numbers format, like this:
AB123
I developed a polynomial hash function that calculates a hash key like this:
hash key(k) = k1 + k2*A^2 + k3*A^3 + ... + kn*A^n
where k1 is the 1st letter of the word, k2 the 2nd, and so on, and A is a prime number chosen to reduce the number of collisions; in my case it's 11.
OK, so I got the table generated and I can search in the table no problem, but only if I have the full word... That much I could figure out.
But what if I only want to use the first letter? Is it possible to search the hash table and get the elements starting with, for example, 'A' without going through every element?
In order to have more functionality you have to introduce more data structures. It all depends on how deep you want to go, which depends on what exactly you need the code to do.
I suspect that you want some kind of filtering for the user. When the user enters "A" they should be given all strings that start with "A", and when they then enter "B" the list should be filtered down to all strings starting with "AB".
If this is the case then you don't need over-complicated structures. Just iterate through the list and give the user the appropriate sublist. Humans are slow, and they won't notice the difference between 3 ms response and 300 ms response.
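A minimal sketch of that linear filtering (the variable names are just illustrative):
# Sketch: filter the full list by prefix on each keystroke; a linear scan is plenty fast for small lists.
strings = ["AB123", "AC456", "BA789"]
prefix = "A"
matches = [s for s in strings if s.startswith(prefix)]
print(matches)   # ['AB123', 'AC456']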
If your hash function is well designed, every slot in the table can hold a string beginning with any prefix, so searching the hash table by prefix is doomed from the start.
It sounds like what you really want might be a trie.
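For reference, a minimal prefix-trie sketch (in Python rather than C, purely to keep it short):
# Sketch: a trie that returns every stored word under a given prefix.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.words = []                     # words ending at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.words.append(word)

    def with_prefix(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        out, stack = [], [node]             # collect everything below the prefix node
        while stack:
            n = stack.pop()
            out.extend(n.words)
            stack.extend(n.children.values())
        return out

t = Trie()
for w in ["AB123", "AC456", "BA789"]:
    t.insert(w)
print(t.with_prefix("A"))                   # ['AC456', 'AB123'] (order not guaranteed)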

Performance impact of index datatype in MongoDB?

I need a new Mongo collection that associates data with an IP address, the address being the collection key. I'm wondering if there's any performance advantage to using the decimal notation of the IP address (e.g. 3299551096 as an integer) instead of the dotted notation (e.g. "198.252.206.16" as a string).
I haven't found any evidence for or against, nor any performance comparison between integer and string indexes. Is there any reason to prefer one over the other?
An integer value's storage requirement is smaller, but of course not significantly so. The sorting/indexing algorithm for a number would normally be slightly faster than for a string, but the difference would be extremely small as the string here is also very short.
I wouldn't expect a compelling performance difference between the two. If you're planning on storing IPV6 addresses, the issue will be that BSON (http://bsonspec.org/#/specification) doesn't have a simple data type for storing a 16-byte number, so it's not necessarily a natural fit to store as a number only.
In the end, I'd likely just use strings if you want to avoid doing translation from storage to screen, or if you want to make queries more natural for most of us to write :) :
db.ips.find({addr: "192.168.1.1"})
If using strings, I'd also suggest you consider storing as a fixed format string such as 192.168.001.001 if you want to do more complex searches, such as a range search. Since a string stored with a consistent fixed format will sort naturally, you can use it in more ways than you'd otherwise be able to. If ranges aren't important, it's not necessary to store this way.
With a fixed format, you could do a query like:
db.ips.find({ addr: {
$gte: "192.168.000.000",
$lte: "192.168.000.255" } })
That would find all IP addresses between (inclusive) 192.168.0.0 and 192.168.0.255.
Ideally, you'll have an index on the field either way:
db.ips.ensureIndex({ addr: 1 })
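If you go with the fixed-format idea, the zero-padding is easy to produce on the application side; a quick Python sketch:
# Sketch: pad each octet to 3 digits so lexicographic order matches numeric order.
def to_fixed(ip_str):
    return ".".join("%03d" % int(octet) for octet in ip_str.split("."))

print(to_fixed("192.168.1.1"))   # '192.168.001.001'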

Generating/Compression unique key

In my work I have many users, and each user has a set of files in their home directory. Due to some predefined rules I have given each file a UID (unique identification), based on the file's content and its creation time. But now I have come to know that the number of files in a user account cannot exceed, say, 1 million. The current UID is about 32 characters long. Is there any way I can bring my UID down to about 6 characters (ideally), or at least 10-12 characters, as the current UID is using lots of space in my NoSQL database?
The current UID looks like
timestamp.process_which_created_it.size
EDIT
Let me rephrase the problem. What I actually need is a compression algorithm:
For example:
I have a list of 1,000,000 strings (each unique), each 32 characters long. I need a compression function F such that F(s1) = s2, where s2 is 10 characters long and all the s2 strings are uniquely mapped (no two inputs produce the same output).
Sort your UIDs and replace each old UID with a new UID indicating its index in the sorted array of the old UIDs.
Simplified pseudocode would look like this:
sorted <- sort(UIDs)
for each file:
    file.UID <- sorted.indexOf(file.UID)
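The same idea as a runnable Python sketch (the UID values here are just placeholders):
# Sketch: map each 32-char UID to its position in the sorted list of all UIDs.
uids = ["c3f1" * 8, "09ab" * 8, "77de" * 8]          # stand-ins for the real 32-char UIDs
index_of = {uid: i for i, uid in enumerate(sorted(uids))}
print(index_of["09ab" * 8])                           # 0 - its position becomes the short ID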
It is very difficult to take a UNIQUE id, compress it, and keep it UNIQUE. You tend to run into collisions.
#amit's suggestion really is the best one. Perhaps his implementation was a bit glib though.
How about you create a table with an AUTO INCREMENTING INTEGER "ID" column and a string/varchar "OldGUID" column? INSERT all your old/current GUIDs into the table and you have a 1-to-1 match between each GUID and a shorter/compressed "ID". As you create new GUIDs, just INSERT them into the table and you'll keep the 1-to-1 match, so you can switch back and forth between the long and short versions.
If you only need a Unique Identifier, then my first thought goes to UUID.
However, a generic UUID consumes 16 bytes and is in binary format. It does not meet your requirement of 6 characters. Compared to your current method using 32 characters, it "only" saves 50% of the space.
Therefore, a milder scheme would be to use a 64-bit UID (8 bytes) produced by a general hash function. With a good hash, the probability of collision remains fairly reasonable as long as the total number of UIDs generated stays below 100 million. If that seems acceptable, then 8 bytes is pretty close to your space requirement.
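A minimal sketch of the 64-bit hashed UID idea (SHA-256 truncated to 8 bytes is just one possible choice of hash, not a prescription):
# Sketch: derive a 64-bit UID from the existing 32-char identifier.
import hashlib

def uid64(old_uid):
    digest = hashlib.sha256(old_uid.encode()).digest()
    return int.from_bytes(digest[:8], "big")          # 8 bytes = 64 bits

print(uid64("timestamp.process.size"))                # an integer below 2**64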

String comparison algorithm, relevancy, how much "alike" 2 strings are

I have 2 sources of information for the same data (companies), which I can join together via a unique ID (contract number). The presence of the second, different source, is due to the fact that the 2 sources are updated manually, independently. So what I have is an ID and a company Name in 2 tables.
I need to come up with an algorithm that would compare the Name in the 2 tables for the same ID, and order all the companies by a variable which indicates how different the strings are (to highlight the most different ones, to be placed at the top of the list).
I looked at the simple Levenshtein distance calculation algorithm, but it's at the letter level, so I am still looking for something better.
The reason why Levenshtein doesn't really do the job is this: companies have a name, prefixed or postfixed by the organizational form (LTD, JSC, co. etc). So we may have a lot of JSC "Foo" which will differ a lot from Foo JSC., but what I am really looking for in the database is pairs of different strings like SomeLongCompanyName JSC and JSC OtherName.
Are there any good ways to do this? (I don't really like the idea of using a regex to separate the words in each string and then finding matches for every word in the other string using the Levenshtein distance, so I am searching for other ideas.)
How about:
1. Replace all punctuation by whitespace.
2. Break the string up into whitespace-delimited words.
3. Move all words of <= 4 characters to the end, sorted alphabetically.
4. Levenshtein.
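A rough Python sketch of those four steps (the length threshold of 4 and the plain dynamic-programming Levenshtein follow the description above; everything else is just illustrative):
# Sketch: normalize each name as in steps 1-3, then apply Levenshtein (step 4).
import re

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalize(name):
    words = re.sub(r"[^\w\s]", " ", name).split()            # 1-2: punctuation -> whitespace, split
    long_words = [w for w in words if len(w) > 4]
    short_words = sorted(w for w in words if len(w) <= 4)     # 3: short words to the end, sorted
    return " ".join(long_words + short_words)

def name_distance(a, b):
    return levenshtein(normalize(a), normalize(b))            # 4: Levenshtein on normalized names

print(name_distance('JSC "SomeLongCompanyName"', "SomeLongCompanyName JSC"))   # 0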
Could you filter out (remove) those "common words" (similar to removing stop words for fulltext indexing) and then search on that? If not, could you sort the words alphabetically before comparing?
As an alternative or in addition to the Levenshtein distance, you could use Soundex. It's not terribly good, but it can be used to index the data (which is not possible when using Levenshtein).
Thank you both for ideas.
I used 4 indices, each a Levenshtein distance divided by the sum of the lengths of both strings (a relative distance), computed on the following:
Just the 2 strings
The string composed of the words after separating the word sequences, eliminating the non-word characters, ordering the words ascending and joining them with a space as separator.
The string which is contained between quotes (if no such string is present, the original string is taken)
The string composed of alphabetically ordered first characters of each word.
Each of these in turn is an integer value between 1 and 1000. The resulting value is the product of:
X1^E1 * X2^E2 * X3^E3 * X4^E4
where X1..X4 are the indices, and E1..E4 are user-provided weights expressing how valuable (significant) each index is. To keep the result inside the reasonable range of 1..1000, the vector (E1..E4) is normalized.
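A small sketch of that combination (effectively a weighted geometric mean: since each Xi is in 1..1000 and the normalized Ei sum to 1, the product stays in 1..1000; the sample values are made up):
# Sketch: combine the four relative indices with normalized user weights.
import math

def combine(xs, es):
    total = sum(es)
    es = [e / total for e in es]                     # normalize so the exponents sum to 1
    return math.prod(x ** e for x, e in zip(xs, es))

print(combine([765, 120, 300, 40], [2, 1, 1, 1]))    # somewhere between 40 and 765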
The results are impressive. The whole thing works much faster than I expected (I built it as a CLR assembly in C# for Microsoft SQL Server 2008). After picking E1..E4 correctly, the largest index (biggest difference) on non-null values in the whole database is 765. Right until about 300 there is virtually no matching company name. Around 200 there are companies that have somewhat similar names, and some are the same names but written in very different ways, with abbreviations, additional words, etc. When it comes down to 100 and less, practically all the records contain names that are the same but written with slight differences, and by 30, only the order or the punctuation may differ.
It totally works; the result is better than I expected.
I wrote a post on my blog, to share this library in case someone else needs it.
