I want to build a Shakespeare unique-ID generator that can generate unique IDs up to 32 characters long based on a Shakespeare text corpus. I did some research, but I couldn't find any algorithm that generates unique IDs based on a text corpus.
So the generator would create unique hashes like "Hellemptyalldevilshere".
Does anybody know an algorithm to generate such IDs?
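One possible approach, sketched here with no claim that it is the standard one: hash the input value and map fixed-size chunks of the digest onto words from the corpus, so the ID is exactly as unique as the underlying hash (truncating to 32 characters weakens that guarantee). The file name shakespeare.txt and the 4-bytes-per-word choice are assumptions for illustration.

import hashlib

# Load a word list from a Shakespeare corpus (file name is an assumption).
with open("shakespeare.txt", encoding="utf-8") as f:
    words = sorted({w for w in f.read().split() if w.isalpha()})

def shakespeare_id(value, max_len=32):
    # Map 4-byte chunks of a SHA-256 digest onto corpus words, capped at max_len chars.
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    parts = []
    for i in range(0, len(digest), 4):
        index = int.from_bytes(digest[i:i + 4], "big") % len(words)
        if len("".join(parts) + words[index]) > max_len:
            break
        parts.append(words[index])
    return "".join(parts)

print(shakespeare_id("user-42"))  # e.g. a concatenation like "Hellemptyalldevils"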
I can identify the uniqueness of words in a string using additional data structures like a HashMap, but I'm not able to figure out how to do it without additional data structures.
Unique characters can be determined using an additional integer or bit set, but how can words fit into that?
Please suggest a solution.
Thanks
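For reference, a small sketch of the two ideas mentioned above, plus the usual workaround for words: a single integer used as a bit set covers unique characters, and for words one common trick is to sort them and compare neighbours, which avoids a HashMap at the cost of O(n log n). This is illustrative Python, not a prescribed solution.

def all_chars_unique(s):
    # One integer as a bit set; assumes lowercase a-z input.
    seen = 0
    for ch in s:
        bit = 1 << (ord(ch) - ord("a"))
        if seen & bit:
            return False
        seen |= bit
    return True

def all_words_unique(sentence):
    # No extra lookup structure: sort the words and compare adjacent entries.
    words = sorted(sentence.split())
    return all(words[i] != words[i + 1] for i in range(len(words) - 1))

print(all_chars_unique("abcdef"))              # True
print(all_words_unique("to be or not to be"))  # False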
I am classifying text using the bag of words model. I read in 800 text files, each containing a sentence.
The sentences are then represented like this:
[{"OneWord":True,"AnotherWord":True,"AndSoOn":True},{"FirstWordNewSentence":True,"AnSoOn":True},...]
How many dimensions does my data have?
Is it the number of entries in the largest vector? Or is it the number of unique words? Or something else?
For each document, the bag of words model has a set of sparse features. For example (using the first sentence in your example):
OneWord
AnotherWord
AndSoOn
The above three are the active features for the document. The representation is sparse because we never list the inactive features explicitly AND we have a very large vocabulary (all possible unique words that you consider as features). In other words, we did not say:
OneWord: true
AnotherWord: true
AndSoOn: true
FirstWordNewSentence: false
(and so on, with false for every other word in the vocabulary)
We only include those words that are "true".
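For illustration, here is how the sparse representation above compares with the dense one it avoids; the four-word vocabulary is an assumption to keep the example small.

# Sparse: only the active features are stored.
doc_sparse = {"OneWord": True, "AnotherWord": True, "AndSoOn": True}

# Dense: one slot per vocabulary word, most of them False.
vocabulary = ["OneWord", "AnotherWord", "AndSoOn", "FirstWordNewSentence"]
doc_dense = [word in doc_sparse for word in vocabulary]
print(doc_dense)  # [True, True, True, False]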
How many dimensions does my data have?
Is it the number of entries in the largest vector? Or is it the number of unique words? Or something else?
If you stick with the sparse feature representation, you might want to estimate the average number of active features per document instead. That number is 2.5 in your example ((3+2)/2 = 2.5).
If you use a dense representation (e.g., one-hot encoding; not a good idea, though, if the vocabulary is large), the input dimension is equal to your vocabulary size.
If you use a 100-dimensional word embedding and combine all the words' embeddings to form a new input vector representing the document, your input dimension is then 100. In this case, you convert your sparse features into dense features via the embedding.
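A small sketch of both dense options using the two example sentences; the random embedding table is a stand-in for a real one, and the shapes are what matter here.

import numpy as np

docs = [{"OneWord": True, "AnotherWord": True, "AndSoOn": True},
        {"FirstWordNewSentence": True, "AnSoOn": True}]
vocabulary = sorted({w for d in docs for w in d})

# Dense one-hot style: input dimension == vocabulary size.
bow_matrix = np.array([[w in d for w in vocabulary] for d in docs], dtype=float)
print(bow_matrix.shape)   # (2, 5) -- 5 is the vocabulary size

# Embedding style: average 100-dim word vectors, so input dimension == 100.
embedding = {w: np.random.rand(100) for w in vocabulary}  # placeholder table
doc_vectors = np.array([np.mean([embedding[w] for w in d], axis=0) for d in docs])
print(doc_vectors.shape)  # (2, 100)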
I would like to know, when hashing a very large dynamic query with many parameters, which hash algorithm would be best for generating a unique key.
Also, is hashing the query the best method for creating a unique key from the large query, or is there some other method to create a unique key for a large dynamic query to be used as its key in Redis?
The answer to the first part of the question is in the "Which hashing algorithm is best for uniqueness and speed?" question. In a few words: SHA-1/Murmur2 look like potential choices.
The second part is more complex and depends on the hashing function. For example, Doctrine uses sha1 as the hash key function, as sha1(query + params). Here is a nice explanation of the probability of SHA1 collisions.
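A minimal sketch of that idea in Python: the Redis key is derived as sha1(query + serialized params), in the spirit of the Doctrine example above. The "query:" prefix and JSON serialization are illustrative choices, not requirements.

import hashlib
import json

def query_cache_key(query, params):
    # Sort the params so the same query + params always hashes to the same key.
    payload = query + json.dumps(params, sort_keys=True)
    return "query:" + hashlib.sha1(payload.encode("utf-8")).hexdigest()

key = query_cache_key("SELECT * FROM books WHERE author = :a AND year > :y",
                      {"a": "Shakespeare", "y": 1600})
print(key)  # query:<40 hex chars>, usable as a Redis key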
I have a very long array (~2 million entries) of strings (ISBNs for books).
I am parsing an XML file that adds to my existing library. Instead of hitting the database on every book in the XML file, I've loaded the existing library of ISBNs into an array.
If a new ISBN is found, I create a new book model. If the ISBN already exists, I update a column for that title.
Currently I am using index to find each ISBN from the XML file:
array.index(ISBN)
I also experimented with converting the array to a hash of ISBN keys and that's a little faster.
hash[ISBN]
Any ideas for doing the lookup faster? The array and hash methods are giving me roughly 15 and 20 checks/sec, respectively.
Hash lookup is the fastest you can get, I think. Its computational complexity is constant with respect to the collection size. If that is not fast enough, there is probably something wrong with other parts of your code.
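To make that concrete, a sketch in Python (the same applies to a Ruby Hash or Set): build the lookup structure once, then each membership test is O(1) on average. The sample ISBNs are placeholders.

# Build the lookup structure once from the existing library (~2 million entries in practice).
existing_isbns = {"9780140714548", "9780743477116"}

# Hypothetical values parsed from the XML file.
isbns_from_xml = ["9780140714548", "9780199535811"]

for isbn in isbns_from_xml:
    if isbn in existing_isbns:   # O(1) average membership test
        print("update existing title", isbn)
    else:
        print("create new book model", isbn)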
I am designing a web system with pagination like reddit's, for example:
http://example.com/list.html?next=1234
Assume we display 50 items per page; the above URL will retrieve the 50 items after primary key 1234.
The problem with this approach is that the total number of items is guessable because the PKs are AUTO_INCREMENT. To hide business-sensitive data like this, is there any hash/encryption algorithm that is:
comparable, i.e. you can tell which hash is larger/smaller than another
not guessable in terms of growth or total number, because it's sparse and randomized
not very long, so it can be translated into very short base36
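One possible scheme, offered only as a sketch under these constraints: map each PK to pk * STRIDE + random_offset with a secret STRIDE, then base36-encode it. The result preserves ordering, is sparse and jittered (the exact count is not readable off a single value, though the gap between two tokens still bounds how many items lie between them), and stays short. The constant 7919 is an arbitrary assumption; in practice you would generate the token once per row and store it, since the offset is random.

import random

STRIDE = 7919   # secret constant, an assumption for illustration
BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(n):
    digits = []
    while True:
        n, r = divmod(n, 36)
        digits.append(BASE36[r])
        if n == 0:
            return "".join(reversed(digits))

def obscure(pk):
    # Order-preserving and sparse: pk * STRIDE plus a random offset smaller than STRIDE.
    return to_base36(pk * STRIDE + random.randrange(STRIDE))

def recover(token):
    # Invert with integer division by the secret STRIDE.
    return int(token, 36) // STRIDE

token = obscure(1234)
print(token)           # short base36 string, e.g. 5 characters for pk 1234
print(recover(token))  # 1234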