Identify whether a string has unique WORDS without an additional data structure - algorithm

I can check whether the words in a string are unique using an additional data structure such as a HashMap, but I can't figure out how to do it without one.
Unique characters can be detected using an additional integer or bit set, but how can words be made to fit there?
Please suggest a solution.
Thanks

Related

How much space and processing will be optimized in Lucene index by storing a field as Byte instead of String for billions of documents

I understand the concept of inverted-index and how Dictionary storage optimization could help to load entire dictionary in main memory for the faster query.
I am trying to understand how a Lucene index works.
Suppose I have a String field that has only four distinct values across the 200 billion documents indexed in Lucene. This field is a Stored field.
Suppose I change the field to a Byte or Int type to represent the 4 distinct values, then re-index and store all 200 billion documents.
What storage and query optimization, if any, would this data type change bring?
Please suggest a test I could run on my laptop to get a sense of it.
As far as I know, a document in Lucene consists of a simple list of field-value pairs. A field must have at least one value, but any field can contain multiple values. Similarly, a single string value may be converted into multiple values by the analysis process.
Lucene doesn’t care if the values are strings or numbers or dates. All values are just treated as opaque bytes.
For more information, please see this document.

Algorithm to search for a list of words in a text

I have a fairly small list of words, about 1000 or so. I want to check whether any of the words in that list occur in an input text and, if so, which ones. The input texts are a few hundred words each; they are text paragraphs from the web, meaning there are a lot of them from different sites. I am trying to find the best algorithm for this.
I can see two obvious ways to do this --
1. A brute force way of searching for each word from the list in the text.
2. Create a hash table of words from the input text and then search for each word from the list in the hash table. This is fast.
Is there a better solution?
I am using Python, though I am not sure whether that changes the algorithm in any way.
Also, as an optimization to solution 2 above, I would like to store the generated hash table in persistent storage (a DB) so that if the list of words changes I can re-use the hash table without having to create it again. Of course, if the input text changes I have to regenerate the hash table. Is it possible to save a hash table to a DB? Any recommendations? I am currently using MongoDB for my project and I can only store JSON documents in it. I am new to MongoDB, have only just started working with it, and do not yet understand its full potential.
I have searched SO and see two questions along similar lines and one of them suggests a hash table but I would like to get any pointers towards the optimization I have in mind.
Here are the previously asked questions on SO -
Is there an efficient algorithm to perform inverted full text search?
Searching a large list of words in another large list
EDIT: I just found another question on SO which is about the same problem.
Algorithm for multiple word matching in text
I guess there is no better solution than a hash table. But I would really like to optimize it so that changes to the word list can let me run the algorithm on all the text I have stored up quickly. Should I change the tags added to the question to also include some database technologies?
There is a better solution than a hash table. If you have a fixed set of words that you want to search for over a large body of text, the way you do it is with the Aho-Corasick string matching algorithm.
The algorithm builds a state machine from the words you want to search, and then runs the input text through that state machine, outputting matches as they're found. Because it takes some amount of time to build the state machine, the algorithm is best suited for searching very large bodies of text.
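As a rough illustration of the state machine described above, here is a minimal Python sketch of Aho-Corasick. The function names `build_automaton` and `find_all` are made up for this example, and the sketch reports every occurrence, including matches inside longer words; it is not a tuned implementation.

```python
from collections import deque


def build_automaton(words):
    """Build a trie with failure links for the given words."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> state to fall back to on a mismatch
    output = [[]]  # output[state] -> words that end at this state

    # Phase 1: insert every word into the trie.
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    # Phase 2: breadth-first pass to compute the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(ch, 0)
            # Inherit the words that end at the fallback state.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, automaton):
    """Return (start_index, word) for every occurrence of a word in text."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            matches.append((i - len(word) + 1, word))
    return matches


automaton = build_automaton(["dog", "dogma"])
matches = find_all("My karma ate your dogma.", automaton)
```

The automaton is built once per word list and can then be reused for every input text, which is where the up-front cost pays off.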
You can do something similar with regular expressions. For example, you might want to find the words "dog", "cat", "horse", and "skunk" in some text. You can build a regular expression:
"dog|cat|horse|skunk"
And then run a regular expression match on the text. How you get all matches will depend on your particular regular expression library, but it does work. For very large lists of words, you'll want to write code that reads the words and generates the regex, but it's not terribly difficult to do and it works quite well.
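Since the question mentions Python, generating the regex from a word list might look like the sketch below. `build_pattern` and its `whole_words` flag are names invented for this example; the flag wraps the alternation in `\b` anchors, which assumes every word starts and ends with a letter, digit, or underscore.

```python
import re


def build_pattern(words, whole_words=False):
    """Compile one alternation regex from a list of literal words."""
    # re.escape keeps characters such as "." or "+" in a word literal.
    alternation = "|".join(re.escape(word) for word in words)
    if whole_words:
        return re.compile(rf"\b(?:{alternation})\b")
    return re.compile(alternation)


words = ["dog", "cat", "horse", "skunk"]
text = "The dog chased a cat into dogma."

partial = build_pattern(words).findall(text)
whole = build_pattern(words, whole_words=True).findall(text)
```

Here `partial` also picks up the "dog" inside "dogma", while `whole` only contains the standalone words.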
There is a difference, though, between the results from a regex and the results from the Aho-Corasick algorithm. For example, suppose you're searching for the words "dog" and "dogma" in the string "My karma ate your dogma." A regex search reports only one match at that position: "dogma" with a leftmost-longest engine, or whichever alternative is listed first with a backtracking engine. The Aho-Corasick implementation will report finding both "dog" and "dogma" at the same position.
If you want the Aho-Corasick algorithm to report whole words only, you have to modify the algorithm slightly.
Regex, too, will report matches on partial words. That is, if you're searching for "dog", it will find it in "dogma". But you can modify the regex to only give whole words. Typically, that's done with the \b, as in:
"\b(cat|dog|horse|skunk)\b"
The algorithm you choose depends a lot on how large the input text is. If the input text isn't too large, you can create a hash table of the words you're looking for. Then go through the input text, breaking it into words, and checking the hash table to see if the word is in the table. In pseudo code:
hashTable = Build hash table from target words
for each word in input text
    if word in hashTable then
        output word
Or, if you want a list of matching words that are in the input text:
hashTable = Build hash table from target words
foundWords = empty hash table
for each word in input text
    if word in hashTable then
        add word to foundWords
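In Python, the second variant could be sketched with the built-in `set` type playing the role of the hash table. The function name `words_in_text`, the lowercasing, and the tokenizing regex are assumptions made for this example; real web text may need a more careful tokenizer.

```python
import re


def words_in_text(target_words, text):
    """Return the target words that occur as whole words in text."""
    targets = {word.lower() for word in target_words}  # the hash table
    found = set()
    # Break the text into lowercase tokens of letters, digits and apostrophes.
    for token in re.findall(r"[a-z0-9']+", text.lower()):
        if token in targets:
            found.add(token)
    return found


found = words_in_text(
    ["dog", "cat", "horse"],
    "My Dog ate your dogma. The CAT watched.",
)
```

Each lookup is a constant-time membership test on average, so the cost is dominated by tokenizing the text.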

What is the fastest way to store huge amount of unique strings?

I wonder what the best way is to store a huge number of strings and check for duplicates.
We have to think about our priorities:
duplicate check speed
inserting new string time
storage space on hard disk
random access time
What is the best solution when our target is fast duplicate checking and fast insertion of new strings (random access and storage space don't matter)?
I'm thinking about an SQL database, but which DB is best for this?
If we use an SQL DB like MySQL, which storage engine would be best? (Of course, we have to exclude memory because of the amount of data.)
Use a hash function on the input string. The output hash would be the primary key/id of the record.
Then you can check whether the DB has this hash/id/primary key:
If it doesn't: this is a new string; you add a new record containing the string, with the hash as its id.
If it does: check that the string from the loaded record is the same as the input string.
    If the string is the same: it is a duplicate.
    If the string is different: this is a collision. Use a collision resolution scheme to resolve it. (A couple of examples below)
You will have to consider which hash function/scheme/strength to use based on speed and expected number of strings and hash collision requirements/guarantees.
A couple of ways to resolve collisions:
Use a 2nd hash function to come up with a new hash in the same table.
Mark the record (e.g. with NULL) and repeat with a stronger 2nd hash function (with wider domain) on a secondary "collision" table. On query, if the string is marked as collided (e.g. NULL) then do the lookup again in the collision table. You might also want to use dynamic perfect hashing to ensure that this second table does not have further collisions.
Of course, depending on how persistent this needs to be and how much memory you are expecting to take up/number of strings, you could actually do this without a database, directly in memory which would be a lot faster.
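The check-then-insert flow above can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: SQLite (via the standard `sqlite3` module) stands in for whatever database you pick, SHA-256 stands in for the hash function, and the names `open_store` and `add_string` are made up. Collision resolution is left to the caller.

```python
import hashlib
import sqlite3


def sha256_hex(text):
    """Default hash: SHA-256 of the UTF-8 bytes, as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def open_store(path=":memory:"):
    """Open (or create) the table that uses the hash as primary key."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS strings ("
        "hash TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    return conn


def add_string(conn, text, hash_fn=sha256_hex):
    """Return 'new', 'duplicate' or 'collision' for the given string."""
    key = hash_fn(text)
    row = conn.execute(
        "SELECT value FROM strings WHERE hash = ?", (key,)
    ).fetchone()
    if row is None:
        # Unknown hash: this is a new string, so store it.
        conn.execute(
            "INSERT INTO strings (hash, value) VALUES (?, ?)", (key, text)
        )
        conn.commit()
        return "new"
    if row[0] == text:
        return "duplicate"
    # Same hash, different string: a collision scheme must take over here.
    return "collision"


conn = open_store()
first = add_string(conn, "hello")
second = add_string(conn, "hello")
```

Passing `":memory:"` keeps the table in RAM, which matches the in-memory variant; a file path makes it persistent.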
You may want to consider a NoSQL solution:
Redis. Some of the use cases solved using Redis:
http://highscalability.com/blog/2011/7/6/11-common-web-use-cases-solved-in-redis.html
http://dr-josiah.blogspot.com/2011/02/some-redis-use-cases.html
(Josiah L. Carlson is the author of Redis in Action)
http://www.paperplanes.de/2010/2/16/a_collection_of_redis_use_cases.html
memcached. Some comparisons between memcached and Redis:
http://www.quora.com/What-are-the-differences-between-memcached-and-redis
Is memcached a dinosaur in comparison to Redis?
http://coder.cl/2011/06/concurrency-in-redis-and-memcache/
Membase/Couchbase, which counts OMGPOP's Draw Something as one of its success stories. Comparison between Redis and Membase:
What is the major difference between Redis and Membase?
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Some questions:
how large is the set of strings?
will the application be read heavy or write heavy? or both?
how often would you like data to be persisted to disk?
is there a N most recent strings requirement?
Hope this helps.
Generate suffix trees to store the strings. Ukkonen's algorithm, as described in http://www.daimi.au.dk/~mailund/slides/Ukkonen-2005.pdf, gives some insight into how to build a suffix tree. There are a number of ways to store the suffix tree, but once it is generated, lookup time is very low.

calculating a hash of a data structure?

Let's say I want to calculate a hash of a data structure, using a hash algorithm like MD5 which accepts a serial stream, for the purposes of equivalence checking. (I want to record the hash, then recalculate the hash on the same or an equivalent data structure later, and check the hashes to gauge equivalence with high probability.)
Are there standard methods of doing this?
Issues I can see that are problematic are
if the data structure contains an array of binary strings, I can't just concatenate them since ["abc","defg"] and ["ab","cdefg"] are not equivalent arrays
if the data structure contains a collection that isn't guaranteed to enumerate in the same order, e.g. a key-value dictionary {a: "bc", d: "efg", h: "ijkl"}, which should be considered equivalent to the dictionary {d: "efg", h: "ijkl", a: "bc"}.
For the first issue, also hash the lengths of the strings. This will differentiate their hashes.
For the second, sort the keys.
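Both suggestions (hash the lengths, sort the keys) can be combined into one deterministic encoding that is then fed to the hash. The sketch below is only one possible encoding: the type tags, the 8-byte length prefix, and the names `_encode` and `structure_hash` are choices made for this example, and SHA-256 is used in place of MD5.

```python
import hashlib


def _encode(obj):
    """Serialize obj to bytes deterministically (type tag + length prefix)."""
    if isinstance(obj, bytes):
        return b"b" + len(obj).to_bytes(8, "big") + obj
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        return b"s" + len(data).to_bytes(8, "big") + data
    if isinstance(obj, (list, tuple)):
        # Element order matters for sequences, so it is preserved.
        parts = b"".join(_encode(item) for item in obj)
        return b"l" + len(obj).to_bytes(8, "big") + parts
    if isinstance(obj, dict):
        # Sort by encoded key so enumeration order does not matter.
        items = sorted((_encode(k), _encode(v)) for k, v in obj.items())
        parts = b"".join(k + v for k, v in items)
        return b"d" + len(items).to_bytes(8, "big") + parts
    raise TypeError(f"unsupported type: {type(obj).__name__}")


def structure_hash(obj):
    """Hash the deterministic encoding of obj."""
    return hashlib.sha256(_encode(obj)).hexdigest()


same = structure_hash({"a": "bc", "d": "efg", "h": "ijkl"}) == structure_hash(
    {"d": "efg", "h": "ijkl", "a": "bc"}
)
different = structure_hash(["abc", "defg"]) != structure_hash(["ab", "cdefg"])
```

With this encoding both `same` and `different` come out true: reordered dictionaries hash alike, while the two string arrays from the question do not.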
A "standard" way of doing this is to define a serialized form of the data structure, and digest the resulting byte stream.
For example, a TBSCertificate is a data structure comprising a subject name, extensions, and other information. This is converted to a string of octets in a deterministic way and hashed as part of a digital signature operation to produce a certificate.
There is also another problem with structs and it is the alignment of data members on different platforms.
If you want a stable and portable solution, you can solve this by implementing "serialize" method for your data structure in such a way that serialize will produce byte stream (or more commonly, output to the byte stream).
Then, you can apply the hash algorithm to the serialized stream. That way, you can solve the problems you mentioned by explicitly traversing your data. As an added benefit, you will be able to save your data to disk or send it over the network.
For the strings, you can implement Pascal type storage where length comes first.
If the strings can't contain any nul characters, you can use C strings to guarantee uniqueness, e.g. "abc\0defg\0" is distinct from "ab\0cdefg\0".
For dictionaries, maybe you can sort before hashing.
This also reminds me of an issue I heard of once... I don't know what language you are using, but if you are also hashing C structs without filtering them in any way, be careful about the space between fields that the compiler might have introduced for alignment reasons. Sometimes those will not be zeroed out.

What Data Structure could I use to find the Phone number of a person given the person's name?

What Data Structure could I use to find the Phone number of a person given the person's name?
Assuming you will only ever query using the person's name, the best option is to use an associative data structure. This is basically a data structure, usually implemented as a hashtable or a balanced binary search tree, that stores data as key=>value (or, stated in another way, as (key,value) pairs). You query the data structure by using the key and it returns the corresponding value. In your case, the key would be the name of the person and the value would be the phone number.
Rather than implementing a hashtable or a binary search tree for this yourself, check whether your language already has something like this in its library; most languages these days do. Python has dict, Perl has hashes, Java has Map, C# has Dictionary, and C++ has the STL map.
Things can get a little trickier if you have several values for the same key (e.g. the same person having multiple phone numbers), but there are workarounds like using a list/vector as the value, or using a slightly different structure that supports multiple values for the same key (e.g. STL multimap). But you probably don't need to worry about that anyway.
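As a small Python sketch of the list-as-value workaround, a `defaultdict(list)` maps each name to all of that person's numbers. The helper names `add_number` and `find_numbers`, and the sample names and numbers, are made up for this example.

```python
from collections import defaultdict

# Maps a person's name (key) to a list of phone numbers (value).
phone_book = defaultdict(list)


def add_number(name, number):
    """Record one more phone number for the given person."""
    phone_book[name].append(number)


def find_numbers(name):
    """Return all numbers for name, or an empty list if unknown."""
    # .get avoids creating an empty entry for names never added.
    return phone_book.get(name, [])


add_number("Alice", "555-0100")
add_number("Alice", "555-0101")
add_number("Bob", "555-0199")

alice_numbers = find_numbers("Alice")
unknown_numbers = find_numbers("Carol")
```

If each person has exactly one number, a plain `dict` from name to number is enough.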
An associative array, such as a hashtable.
Really, anything that maps keys to values. The specific data structure will depend on the language you are using (unless you want to implement your own, in which case you have free rein).
