How to efficiently match hundreds of thousands of substrings in one string using Elasticsearch [closed] - elasticsearch

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
My problem is simple: I have a database containing 400,000 substrings (movie and TV show titles).
I'd like to match these titles in a message such as:
I really love Game Of Thrones and Suits, also Spotlight is an awesome
movie.
What I need is to match Game Of Thrones, Suits and Spotlight in this string.
I tried sending all the titles to wit.ai, but it seems it can't handle 100,000 substrings.
I'm wondering if Elasticsearch could do the job?
If this is a common problem, sorry; could you point me in the right direction?
Thanks!

One of the best algorithms for finding dictionary strings in a text is the Aho-Corasick algorithm, a
dictionary-matching algorithm that locates elements of a finite set of
strings (the "dictionary") within an input text. It matches all
strings simultaneously. The complexity of the algorithm is linear in
the length of the strings plus the length of the searched text plus
the number of output matches.
That said, I would check whether your database engine already provides full-text search for this kind of query; it may well support it without you knowing.
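To make this concrete, here is a minimal, stdlib-only Python sketch of Aho-Corasick, using the titles and message from the question (for production use, reach for a tested library such as pyahocorasick rather than this sketch):

```python
from collections import deque

def build_automaton(patterns):
    # Trie nodes: transitions, failure link, and matched patterns ("output").
    trie = [{"next": {}, "fail": 0, "out": []}]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in trie[node]["next"]:
                trie.append({"next": {}, "fail": 0, "out": []})
                trie[node]["next"][ch] = len(trie) - 1
            node = trie[node]["next"][ch]
        trie[node]["out"].append(pat)
    # Breadth-first pass to compute failure links (the longest proper
    # suffix of the current path that is also a prefix of some pattern).
    queue = deque(trie[0]["next"].values())
    while queue:
        node = queue.popleft()
        for ch, child in trie[node]["next"].items():
            queue.append(child)
            fail = trie[node]["fail"]
            while fail and ch not in trie[fail]["next"]:
                fail = trie[fail]["fail"]
            trie[child]["fail"] = trie[fail]["next"].get(ch, 0)
            trie[child]["out"] += trie[trie[child]["fail"]]["out"]
    return trie

def find_all(trie, text):
    # Single pass over the text; reports (start_index, pattern) per match.
    node, matches = 0, []
    for i, ch in enumerate(text):
        while node and ch not in trie[node]["next"]:
            node = trie[node]["fail"]
        node = trie[node]["next"].get(ch, 0)
        for pat in trie[node]["out"]:
            matches.append((i - len(pat) + 1, pat))
    return matches

titles = ["Game Of Thrones", "Suits", "Spotlight"]
message = "I really love Game Of Thrones and Suits, also Spotlight is an awesome movie."
automaton = build_automaton(titles)
print(find_all(automaton, message))
# [(14, 'Game Of Thrones'), (34, 'Suits'), (46, 'Spotlight')]
```

The automaton is built once over all 400,000 titles and then scans each message in a single pass, which is what makes the approach linear rather than one search per title.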

Related

How to classify a string as a person's name, a company's name, or neither? [closed]

Closed 4 years ago.
Let's say I have a string such as 'John Doe' and I want to determine whether it is the name of a person, the name of a company, or neither.
Every minute more strings come into my system, and it needs to classify each one into one of these three categories.
You would need a dictionary of strings in different categories to compare them against.
Without a dictionary you would need some kind of AI/machine learning that could do this automatically, but that is far beyond the scope of the kind of answer you'll get here.
NLTK provides corpora of the most common English words (nltk.corpus.words.words('en')) and the most common English names (nltk.corpus.names.words()).
Alternatively, use word2vec (the word-embedding model originally from Google, available in the gensim library), which provides vectors capturing relationships between words.
When text enters the system, first obtain the vector for each word.
On top of this you can apply any classification algorithm to categorize the string.
Hope this helps!
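To make the dictionary-lookup idea concrete, here is a toy sketch; the tiny word lists and the classify function are invented for illustration (in practice you would load NLTK's names corpus and a real list of legal-entity suffixes):

```python
# Toy dictionaries -- stand-ins for nltk.corpus.names and a proper
# list of company suffixes.
PERSON_FIRST_NAMES = {"john", "jane", "mary", "robert"}
COMPANY_MARKERS = {"inc", "inc.", "ltd", "llc", "corp", "gmbh"}

def classify(name):
    tokens = name.lower().split()
    # A legal-entity suffix anywhere strongly suggests a company.
    if any(t in COMPANY_MARKERS for t in tokens):
        return "company"
    # A known first name in the leading position suggests a person.
    if tokens and tokens[0] in PERSON_FIRST_NAMES:
        return "person"
    return "none of these"

print(classify("John Doe"))    # person
print(classify("Acme Inc."))   # company
print(classify("Zxqv Blorp"))  # none of these
```

Lookup tables like this give high precision but low recall; the ML route trades the other way, which is why the two are often combined.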

Find duplicate images algorithm [closed]

Closed 5 years ago.
I want to create a program that finds duplicate images in a directory, something like this app does, and I wonder what algorithm could determine whether two images are the same.
Any suggestion is welcome.
This task can be solved by perceptual-hashing, depending on your use-case, combined with some data-structure responsible for nearest-neighbor search in high-dimensions (kd-tree, ball-tree, ...) which can replace the brute-force search (somewhat).
There are tons of approaches for images: DCT-based, wavelet-based, statistics-based, feature-based, CNNs, and more.
Their designs are usually based on different assumptions about the task, e.g. rotation allowed or not?
A Google Scholar search on perceptual image hashing will list a lot of papers. You can also look for the term image fingerprinting.
Here is some older ugly python/cython code doing the statistics-based approach.
Remark: digiKam can do that for you too. It uses an older Haar-wavelet-based approach, I think.
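As a concrete illustration of the perceptual-hashing idea, here is a minimal average-hash (aHash) sketch in plain Python. It assumes the images have already been decoded and downscaled to a small grayscale grid (real code would use a library such as Pillow for that step), and the 2x2 "images" below are invented toy data:

```python
def average_hash(pixels):
    # pixels: 2D list of grayscale values, already downscaled (e.g. 8x8).
    flat = [p for row in pixels for p in row]
    avg = sum(flat) / len(flat)
    # One bit per pixel: brighter than the image's average or not.
    return tuple(1 if p > avg else 0 for p in flat)

def hamming(h1, h2):
    # Number of differing bits; a small distance means "perceptually similar".
    return sum(a != b for a, b in zip(h1, h2))

bright = [[200, 200], [10, 10]]   # toy 2x2 "image"
bright2 = [[190, 210], [5, 20]]   # same pattern, slightly different values
dark = [[10, 10], [200, 200]]     # inverted pattern
print(hamming(average_hash(bright), average_hash(bright2)))  # 0: likely duplicates
print(hamming(average_hash(bright), average_hash(dark)))     # 4: different
```

Because the hash compares each pixel to the image's own average, uniform brightness or contrast changes leave it unchanged; rotation or cropping will break it, which is where the fancier DCT/wavelet/feature-based schemes come in.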

Split text files into two groups - unsupervised learning [closed]

Closed 5 years ago.
Imagine you are a librarian, and over time you have classified a bunch of text files (approx. 100) under one general, ambiguous keyword.
Every text file is actually about either keyword_meaning1 or keyword_meaning2.
Which unsupervised learning approach would you use to split the text files into two groups?
What precision (in percent) of correct classification can be achieved, given the number of text files?
And can one group somehow indicate that a librarian needs to check certain files, because they may be classified incorrectly?
The easiest starting point would be a naive Bayes classifier (note that this is supervised, so you would need to label some examples first). It's hard to speculate about the expected precision; you have to test it yourself. Just get a program for e-mail spam detection and try it out. For example, SpamBayes (http://spambayes.sourceforge.net/) is quite a good starting point and easily hackable. SpamBayes has a nice feature: it labels messages as "unsure" when there is no clear separation between the two classes.
Edit: if you really want an unsupervised clustering method, then something like Carrot2 (http://project.carrot2.org/) is more appropriate.
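A toy sketch of one fully unsupervised two-group split, stdlib only; the function names and example documents are invented for illustration, and a real system would cluster TF-IDF vectors with something like k-means from scikit-learn:

```python
import math
from collections import Counter

def vectorize(doc):
    # Bag-of-words term counts (a real system would use TF-IDF weights).
    return Counter(doc.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def two_group_split(docs):
    vecs = [vectorize(d) for d in docs]
    # Seed the two groups with the most dissimilar pair of documents,
    # then assign every document to the nearer seed.
    i, j = min(
        ((i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))),
        key=lambda p: cosine(vecs[p[0]], vecs[p[1]]),
    )
    return [0 if cosine(v, vecs[i]) >= cosine(v, vecs[j]) else 1 for v in vecs]

docs = [
    "bank loan interest money",    # keyword_meaning1: finance
    "money bank credit loan",
    "river bank water fish",       # keyword_meaning2: geography
    "fish water river swim",
]
print(two_group_split(docs))  # [0, 0, 1, 1]
```

The "unsure" idea from the answer maps naturally onto this sketch: documents whose two cosine similarities are nearly equal are the ones to flag for the librarian.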

Algorithm for finding similar words [closed]

Closed 8 years ago. This question needs details or clarity.
In order to support users learning English, I want to make a multiple-choice quiz using the vocabulary that the user is studying.
For example, if the user is learning "angel" then I need an algorithm to produce some similar words such as "angle" and "angled"
As another example, if the user is learning "accountant" then I need an algorithm to produce similar-looking misspellings such as "accounttant", "acountant" and "acounttant".
You could compute the Levenshtein distance from the target word to each word in your vocabulary and pick the 2 or 3 closest ones.
Depending on how many words are in your dictionary this might take a long time, so I would recommend bailing out after a certain (small) number of edits; i.e. if you have made 3 edits and still haven't reached the target word, stop and move on to the next one.
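A minimal sketch of this approach in plain Python; the tiny vocabulary is invented for illustration, and the early-bailout optimisation mentioned above is omitted for clarity:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def closest_words(target, vocabulary, n=3):
    # The n vocabulary words with the smallest edit distance to target.
    return sorted(vocabulary, key=lambda w: levenshtein(target, w))[:n]

print(levenshtein("angel", "angle"))  # 2
print(closest_words("angel", ["angle", "angled", "apple", "banana"], 2))
```

For a quiz you would likely also filter the candidates so the correct answer itself is excluded and only real dictionary words (or plausible misspellings) remain.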

What are the best string matching algorithms to search several patterns at once? [closed]

Closed 8 years ago.
What are the best string matching algorithms which can be used to search multiple patterns within a string?
When looking for an exact match against a number of different strings, I favour the Aho-Corasick string matching algorithm, but there are a number of possible contenders, depending on what your patterns are. One starting point to see what is around in practical use would be to look at the different variants of grep mentioned on Wikipedia or pointed to from there.
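As a contrast to dedicated multi-pattern algorithms, the quick practical baseline is a single regex alternation; the pattern set below is the textbook {he, she, his, hers} example, and note that re.finditer reports non-overlapping matches only:

```python
import re

patterns = ["he", "she", "his", "hers"]
# Longest-first alternation so longer patterns win at the same position.
rx = re.compile("|".join(map(re.escape,
                             sorted(patterns, key=len, reverse=True))))

matches = [(m.start(), m.group()) for m in rx.finditer("ushers")]
print(matches)  # [(1, 'she')]
```

A multi-pattern algorithm like Aho-Corasick would also report "he" and "hers", both starting at index 2, which overlap the "she" match; that gap is exactly what the dedicated algorithms fill.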
