Large text file dictionary of random words for benchmarking purposes? - data-structures

I was wondering if anyone could point me to a very, very large dictionary of random words that could be used to test some high-performance string data structures? I'm finding some that are in the ~2MB range... however I'd like something larger if possible. I'm guessing there has to be some large standard string dataset somewhere that could be used. Thanks!

http://norvig.com/big.txt
The above link was mentioned in Norvig's spell checker article - http://norvig.com/spell-correct.html
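If big.txt suits you, it is easy to turn it into a large deduplicated word list for data-structure benchmarks. A minimal Python sketch (the output file name and the simple tokenization are my own choices):

```python
import re
import urllib.request

# Fetch Norvig's big.txt (several MB of running English text).
url = "http://norvig.com/big.txt"
text = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")

# Tokenize into lowercase words and deduplicate to get a "dictionary".
words = sorted(set(re.findall(r"[a-z]+", text.lower())))

# One word per line, ready to feed into a string data structure benchmark.
with open("words.txt", "w") as f:
    f.write("\n".join(words))

print(len(words), "unique words written to words.txt")
```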

I'd recommend taking a look through the material available from TREC (the Text REtrieval Conference); it has some good datasets which might meet your requirements.

Related

Keyword search algorithm

I'm developing an Android application where the user keys in a string or a sentence as keyword(s), and based on that input some strings should be retrieved from a database. I am looking for a suitable algorithm for this purpose. I have gone through many answers and many algorithms, such as TF-IDF and Boyer-Moore, but I am still unsure which algorithm is the most efficient choice. Does anyone have a suggestion?
(the purpose is to retrieve some string based on entered keywords)
Thanks
I wrote an autocomplete search (by phrases/subphrases); you can see its performance and the dictionary sizes in my demo:
http://olegh.ftp.sh/autocomplete.html
That is a Celeron-300 machine running FreeBSD, and it loads the CPU at less than 1% during active searching.
However, it is written in C++ and uses the mmap/pread system calls, so I am not sure whether it would work on Android. I can share the sources upon request.
Regarding the algorithm: it uses a pre-built hash-index file based on all possible prefixes of the phrases in the dictionary. The mmapped hash table locates the bucket, which is then fetched into memory with pread.
Indexing is a relatively slow operation: indexing 15,000,000 dictionary entries can take about an hour in a Perl script. But search/retrieval is extremely fast, and performance does not depend on the dictionary size.
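For illustration, here is a much-simplified, in-memory Python analogue of that prefix-index idea (no mmap/pread, no on-disk hash file, and a toy dictionary; all of that is my simplification, not the author's code):

```python
from collections import defaultdict

def build_prefix_index(phrases, max_results=10):
    """Map every prefix of every phrase to a small bucket of matching phrases."""
    index = defaultdict(list)
    for phrase in phrases:
        key = phrase.lower()
        for i in range(1, len(key) + 1):
            bucket = index[key[:i]]
            if len(bucket) < max_results:
                bucket.append(phrase)
    return index

def autocomplete(index, query):
    """Lookup is a single hash probe, so it does not depend on dictionary size."""
    return index.get(query.lower(), [])

phrases = ["new york", "new jersey", "newcastle", "los angeles"]
index = build_prefix_index(phrases)
print(autocomplete(index, "new"))  # ['new york', 'new jersey', 'newcastle']
```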
I wouldn't recommend writing your own algorithm; rather, you should use an existing library like Apache Lucene.

Algorithm to compare similarity of ideas (as strings)

Consider an arbitrary text box that records the answer to the question, "What do you want to do before you die?"
Using a collection of response strings (max length 240), I'd like to somehow sort and group them and count them by idea (which may be just string similarity as described in this question).
Is there another or better way to do something like this?
Is this any different than string similarity?
Is this the right question to be asking?
The idea here is to have people write in the text box over and over again, and for me to provide a number that says, generally speaking, that 802 people wrote approximately the same thing.
It is much more difficult than string similarity. This is what you need to do at a minimum (a rough code sketch is given below):
Perform some text formatting/cleaning tasks, like removing punctuation characters and common "stop words".
Construct a corpus (a collection of words with their usage statistics) from the terms that occur in the answers.
Calculate a weight for every term.
Construct a document vector from every answer (each term corresponds to a dimension in a very high dimensional Euclidean space).
Run a clustering algorithm on document vectors.
Read a good statistical natural language processing book, or search Google for good introductions/tutorials (likely search terms: statistical NLP, text categorization, clustering). You can probably find some libraries (Weka or NLTK come to mind) depending on the language of your choice, but you need to understand the concepts to use a library anyway.
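For instance, the steps above collapse into a few lines with scikit-learn (my choice of library; the example answers, English stop-word list and number of clusters are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder answers; in practice these are the real responses.
answers = [
    "travel the world",
    "see the whole world",
    "write a novel",
    "publish a book",
]

# Cleaning, corpus construction and term weighting (TF-IDF) in one step.
vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(answers)

# Cluster the document vectors; each cluster is (roughly) one "idea".
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(doc_vectors)

for answer, label in zip(answers, labels):
    print(label, answer)
```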
Latent Semantic Analysis (LSA) might interest you. Here is a nice introduction.
Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
[...]
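As a rough illustration, LSA is often run as a truncated SVD of the TF-IDF matrix; a minimal scikit-learn sketch (the documents and the number of concepts are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "travel the world",
    "see the whole world",
    "write a novel",
    "publish a book",
]

# Term-document matrix weighted by TF-IDF.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# LSA: project the documents onto a small number of latent "concepts".
lsa = TruncatedSVD(n_components=2, random_state=0)
concept_space = lsa.fit_transform(tfidf)

print(concept_space)  # each row is a document expressed in concept space
```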
What you want is very much an open problem in NLP. @Ali's answer describes the idea at a high level, but the part "construct a document vector from every answer" is the really hard one. There are a few obvious ways of building a document vector from the vectors of the words it contains. Addition, multiplication and averaging are fast, but they effectively ignore the syntax: "Man bites dog" and "Dog bites man" will have the same representation, but clearly not the same meaning. Google "compositional distributional semantics"; as far as I know, there are people at the Universities of Texas, Trento, Oxford and Sussex, and at Google, working in this area.

OpenCV: Fingerprint Image and Compare Against Database

I have a database of images. When I take a new picture, I want to compare it against the images in this database and receive a similarity score (using OpenCV). This way I want to detect whether I already have an image that is very similar to the fresh picture.
Is it possible to create a fingerprint/hash of my database images and match new ones against it?
I'm searching for an algorithm, a code snippet, or a technical demo, not for a commercial solution.
Best,
Stefan
As Paul R has commented, this "fingerprint/hash" is usually a set of feature vectors or feature descriptors. But most feature vectors used in computer vision are too computationally expensive to search against a database directly. So this task needs a special kind of feature descriptor, because descriptors such as SURF and SIFT would take too much time to search even with various optimizations.
The only thing OpenCV has for your task (object categorization) is its implementation of Bag of Visual Words (BoW).
It can compute a special kind of image feature and train a visual-word vocabulary. You can then use this vocabulary to find similar images in your database and compute a similarity score.
Here is the OpenCV documentation for bag of words. OpenCV also ships a sample named bagofwords_classification.cpp; it is quite big, but it might be helpful.
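A rough Python sketch of that BoW pipeline, for orientation only (the SIFT features, brute-force matcher, vocabulary size of 100, histogram comparison and file names are all my assumptions):

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

# 1. Collect descriptors from the database images and train a vocabulary.
database = ["db1.jpg", "db2.jpg"]  # placeholder file names
bow_trainer = cv2.BOWKMeansTrainer(100)
for path in database:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    if desc is not None:
        bow_trainer.add(np.float32(desc))
vocabulary = bow_trainer.cluster()

# 2. Describe every image as a histogram over the visual words.
bow_extractor = cv2.BOWImgDescriptorExtractor(sift, matcher)
bow_extractor.setVocabulary(vocabulary)

def bow_descriptor(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    keypoints = sift.detect(img, None)
    return bow_extractor.compute(img, keypoints)

# 3. Similarity score between the fresh picture and each database image.
query = np.float32(bow_descriptor("fresh.jpg"))
for path in database:
    hist = np.float32(bow_descriptor(path))
    score = cv2.compareHist(query, hist, cv2.HISTCMP_CORREL)
    print(path, score)
```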
Content-based image retrieval systems are still a field of active research: http://citeseerx.ist.psu.edu/search?q=content-based+image+retrieval
First you have to be clear about what constitutes "similar" in your context:
Similar color distribution: use something like color descriptors for subdivisions of the image; you should get some fairly satisfying results.
Similar objects: since the computer does not know what an object is, you will not get very far unless you have some extensive domain knowledge about the objects (or a few object classes). A good overview of the current state of research can be seen here (results) and soon here.
There is no "serves all needs" algorithm for the problem you described. The more you can share about the specifics of your problem, the better the answers you might get. Posting some representative images (if possible) and describing the desired outcome is also very helpful.
This would be a good question for computer-vision.stackexchange.com, if it already existed.
You can use the pHash algorithm, store the hash values in your database, and then use this code:
double const mismatch = algo->compare(image1Hash, image2Hash);
Here the 'mismatch' value can easily tell you how similar the two images are.
The available hash functions are:
AverageHash
PHash
MarrHildrethHash
RadialVarianceHash
BlockMeanHash
ColorMomentHash
These functions are good enough to evaluate image similarity in most respects.
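For reference, a minimal Python sketch of the same idea with OpenCV's img_hash module (it needs the contrib modules, and the file names are placeholders):

```python
import cv2

# PHash is one of the hash functions listed above; the others are created
# the same way (e.g. cv2.img_hash.AverageHash_create()).
hasher = cv2.img_hash.PHash_create()

stored = cv2.imread("stored.jpg")
fresh = cv2.imread("fresh.jpg")

stored_hash = hasher.compute(stored)  # this compact hash is what you keep in the database
fresh_hash = hasher.compute(fresh)

# For PHash, compare() returns a Hamming distance: 0 means identical,
# larger values mean more mismatch.
mismatch = hasher.compare(stored_hash, fresh_hash)
print(mismatch)
```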

Extract only English sentences

I need to extract posts and tweets from Facebook and Twitter into our database for analysis. My problem is that the system can process English sentences (phrases) only, so how can I remove non-English posts and tweets from my database?
If you know of any NLP algorithm that can do this, please tell me.
Thanks and regards
Avoiding automatic language identification where possible is usually preferable - for instance, https://dev.twitter.com/docs/api/1/get/search shows that returned tweets contain a field iso_language_code which might be helpful.
If that's not good enough, you'll have to either
look for existing language identification libraries in whatever language you're using; or
get your hands on a sufficient amount of English text (dumps of English Wikipedia, say, or any of the Google n-gram models) and implement something like http://www.cavar.me/damir/LID/.
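For example, once you have the parsed search results, filtering on that iso_language_code field is a one-liner (the tweets variable below is a placeholder for whatever the API returned):

```python
# 'tweets' stands in for the parsed JSON results from the Twitter search API.
tweets = [
    {"text": "Hello world", "iso_language_code": "en"},
    {"text": "Hola mundo", "iso_language_code": "es"},
]

english_tweets = [t for t in tweets if t.get("iso_language_code") == "en"]
print(english_tweets)
```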
Get an English dictionary and see if the majority of the words in your text are in it. Since you are looking at online text, be sure to include common slang and abbreviations.
This can run very quickly if you store the dictionary in a trie data structure.
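A minimal sketch of that dictionary-lookup heuristic; here a plain Python set stands in for the trie, and the word list path and the 60% threshold are my assumptions:

```python
import re

# Any English word list will do (plus slang/abbreviations for online text);
# /usr/share/dict/words is a common location on Unix-like systems.
with open("/usr/share/dict/words") as f:
    english = {line.strip().lower() for line in f}

def looks_english(text, threshold=0.6):
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    hits = sum(1 for w in words if w in english)
    return hits / len(words) >= threshold

print(looks_english("this is clearly an english tweet"))    # True
print(looks_english("ceci n'est pas une phrase anglaise"))  # probably False
```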
I think fancy NLP is a bit overkill for this task. You don't need to identify the language if it's not English so all you have to do is test your text with some simple characteristics of the English language.
I have tried using standard libraries for language detection on tweets. You will get a lot of false negatives because there are a lot of non-standard characters in names, smilies etc. This problem is more severe in smaller posts where the signal-to-noise ratio is lower.
The main problem is not the algorithm but the outdated data sources. I would suggest crawling/streaming a new one from Twitter. The language flag in Twitter is based on geographical information, so it will not work in all cases (a Chinese person can still write Chinese posts in the USA). I would suggest using a whitelist of many English-speaking users and collecting their posts.
I wrote a little tweet language classifier (either English or not) that was 95+% accurate, if I'm remembering right. I think it was just naive Bayes plus 1,000 training instances. Combine that with location information and you can do even better.
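A rough sketch of that kind of classifier with scikit-learn and character n-grams (the library, the toy training set and the feature choice are my assumptions, not the original author's code):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny placeholder training set; the original used ~1000 labelled tweets.
texts = ["good morning everyone", "see you later", "buenos dias a todos",
         "hasta luego amigos", "guten morgen zusammen", "bis spaeter"]
labels = ["en", "en", "other", "other", "other", "other"]

# Character n-grams are fairly robust to odd spellings and smilies in tweets.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    MultinomialNB(),
)
model.fit(texts, labels)

print(model.predict(["good night twitter"]))     # likely ['en']
print(model.predict(["buenas noches twitter"]))  # likely ['other']
```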
I found this project; the source code is very clear. I have tested it, and it runs pretty well.
http://code.google.com/p/guess-language/
Have you tried SVD (Singular Value Decomposition) for LSI (Latent Semantic Indexing) and LSA (Latent Semantic Analysis)? See: http://alias-i.com/lingpipe/demos/tutorial/svd/read-me.html

English texts lexicon comparison

Let's imagine we can build a statistics table of how often each word is used in some English text or book, and that we can gather such statistics for each text/book in a library.
What is the simplest way to compare these statistics with each other? How can we find groups/clusters of texts with very statistically similar lexicons?
First, you'd need to normalize the lexicons (i.e. ensure that both lexicons have the same vocabulary).
Then you could use a similarity metric like the Hellinger distance or the cosine similarity to compare the two lexicons.
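For instance, with two word-count tables normalized to a shared vocabulary, cosine similarity takes only a few lines (the example lexicons are placeholders):

```python
import math

def cosine_similarity(lexicon_a, lexicon_b):
    # Normalize to a shared vocabulary (missing words count as 0).
    vocab = set(lexicon_a) | set(lexicon_b)
    a = [lexicon_a.get(w, 0) for w in vocab]
    b = [lexicon_b.get(w, 0) for w in vocab]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

book1 = {"whale": 120, "sea": 80, "ship": 60}
book2 = {"sea": 50, "ship": 70, "island": 30}
print(cosine_similarity(book1, book2))  # 1.0 = identical usage, 0.0 = no overlap
```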
It may also be a good idea to look into machine learning packages such as Weka.
This book is an excellent source for machine learning and you may find it useful.
I would start by seeing what Lucene (http://lucene.apache.org/java/docs/index.html) has to offer. After that you will need to use a machine learning method and look at http://en.wikipedia.org/wiki/Information_retrieval.
You might consider the Kullback-Leibler (KL) divergence. For reference, see page 18 of Cover and Thomas:
Chapter 2, Cover and Thomas
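A quick sketch of KL divergence between two word-frequency tables (the smoothing constant and the example counts are my assumptions; smoothing is needed because KL is undefined when a word appears in one text but not the other):

```python
import math

def kl_divergence(counts_p, counts_q, smoothing=1e-9):
    # Turn raw counts into smoothed probability distributions over a shared vocabulary.
    vocab = set(counts_p) | set(counts_q)
    total_p = sum(counts_p.values()) + smoothing * len(vocab)
    total_q = sum(counts_q.values()) + smoothing * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (counts_p.get(w, 0) + smoothing) / total_p
        q = (counts_q.get(w, 0) + smoothing) / total_q
        kl += p * math.log(p / q)
    return kl  # in nats; >= 0, zero only for identical distributions; note it is asymmetric

book1 = {"whale": 120, "sea": 80, "ship": 60}
book2 = {"sea": 50, "ship": 70, "island": 30}
print(kl_divergence(book1, book2))
```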
