Finding documents containing a phrase in a big corpus - elasticsearch

I have a large corpus of text documents that is updated dynamically: nearly 100 new documents are added every second.
I want to find, in real time, the documents that contain an input phrase (or one of several input phrases). The queries also arrive at a high rate.
What is the right tool to implement this? Is Elasticsearch the appropriate one, or are there lighter tools?
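If Elasticsearch is used, a phrase lookup maps to a match_phrase query. A minimal sketch with the Elasticsearch Java API, assuming an already-connected Client and a hypothetical index docs with a content field:

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;

public class PhraseSearch {
    // Assumes `client` is an already-connected Elasticsearch Client.
    public static SearchResponse findPhrase(Client client, String phrase) {
        return client.prepareSearch("docs")                                // hypothetical index name
                .setQuery(QueryBuilders.matchPhraseQuery("content", phrase)) // hypothetical field name
                .setSize(50)                                               // cap the number of hits returned
                .execute()
                .actionGet();
    }
}
```

For several input phrases, the individual phrase queries can be combined under a bool query's should clauses, as shown in the sketch after the next question.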

Related

Elasticsearch: the best way to compare large texts

I have several thousand texts in Elasticsearch which I have to compare with text segments in order to detect plagiarism (exact matches). My plan is to take several large segments from different parts of each checked text and look for them in the documents loaded into Elasticsearch. I am trying to find the best way to do this in Elasticsearch.
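One way to express this with the Java client is to combine one match_phrase query per segment under a bool/should, so a document matches if it contains any of the checked segments verbatim. A minimal sketch; the index name texts and field name body are assumptions:

```java
import java.util.List;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class SegmentCheck {
    // Returns documents containing at least one of the given segments verbatim.
    public static SearchResponse findSegments(Client client, List<String> segments) {
        BoolQueryBuilder query = QueryBuilders.boolQuery();
        for (String segment : segments) {
            query.should(QueryBuilders.matchPhraseQuery("body", segment)); // exact phrase match per segment
        }
        return client.prepareSearch("texts")   // hypothetical index of source texts
                .setQuery(query)
                .execute()
                .actionGet();
    }
}
```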

Are there any approaches/suggestions for classifying a keyword so that the search space in Elasticsearch is reduced?

I was wondering whether there is any way to classify a single word before running a search on Elasticsearch. Let's say I have 4 indexes, each holding a few million documents about a specific category.
I'd like to avoid searching the whole search space each time.
The problem is more challenging because the query is not a sentence: it usually consists of only one or two words, so NLP techniques (named-entity recognition, POS tagging, etc.) can't be applied.
I have read a few questions on Stack Overflow, such as:
Semantic search with NLP and elasticsearch
Identifying a person's name vs. a dictionary word
and a few more, but couldn't find an approach. Are there any suggestions I should try?
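One possible approach, offered here only as a hedged sketch rather than anything from the question: run a cheap hit-count probe against each category index and route the full search to the index where the keyword occurs most often. Sketch with the Java client, assuming a pre-7.x client (where getTotalHits() returns a long) and hypothetical index and field names:

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;

public class IndexRouter {
    private static final String[] CATEGORY_INDEXES = {"cat_a", "cat_b", "cat_c", "cat_d"}; // hypothetical names

    // Picks the category index in which the keyword occurs most often,
    // so the full search only has to hit that one index.
    public static String classify(Client client, String keyword) {
        String best = CATEGORY_INDEXES[0];
        long bestHits = -1;
        for (String index : CATEGORY_INDEXES) {
            SearchResponse resp = client.prepareSearch(index)
                    .setQuery(QueryBuilders.matchQuery("content", keyword)) // hypothetical field
                    .setSize(0)                                             // only the hit count is needed
                    .execute()
                    .actionGet();
            long hits = resp.getHits().getTotalHits();
            if (hits > bestHits) {
                bestHits = hits;
                best = index;
            }
        }
        return best;
    }
}
```

This trades one extra cheap request per index against scanning the whole search space with every query; whether that pays off depends on how expensive the full search is.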

Elasticsearch - many small documents vs fewer large documents?

I'm creating a search-by-image system (similar to Google's reverse image search) for a cataloging system used internally at my company. We've already been using Elasticsearch successfully for our regular search functionality, so I'm planning to hash all our images, create a separate index for them, and use it for searching. There are many items in the system, each item may have multiple images associated with it, and an item should be findable by reverse-image-searching any of its related images.
There are two possible schemas we've thought of:
1. A document for each image, containing only the hash of the image and the ID of the item it belongs to. This would result in roughly 7M documents, but they would be small, since each holds a single hash and an ID.
2. A document for each item, storing the hashes of all its associated images in an array on the document. This would result in around 100K documents, but some documents would be fairly large, since some items have hundreds of images.
Which of these schemas would be more performant?
Having attended a recent Under the Hood talk by Alexander Reelsen, I suspect he would say "it depends" and "benchmark it".
As #Science_Fiction already hinted:
1. Are the images frequently updated? That could be a negative cost factor.
2. On the other hand, the overhead of ~7M documents probably shouldn't be neglected, whereas in your second scenario the hashes would just be not_analyzed terms in a single field.
If 1. is a low factor, I would probably start with your second approach.
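With the second schema, looking up an item by any of its image hashes is a plain term query against the not_analyzed hash field. A minimal sketch with the Java client; the index name items and field name image_hashes are assumptions:

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;

public class ReverseImageLookup {
    // Finds the item whose image_hashes array contains the given hash.
    // The field is assumed to be mapped as not_analyzed, so the stored
    // hash is matched verbatim by the term query.
    public static SearchResponse findItemByHash(Client client, String imageHash) {
        return client.prepareSearch("items")   // hypothetical per-item index
                .setQuery(QueryBuilders.termQuery("image_hashes", imageHash))
                .setSize(1)                    // a hash should identify at most one item
                .execute()
                .actionGet();
    }
}
```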

How to extract keywords from lots of documents?

I have many documents, over ten thousand (maybe more). I'd like to extract some keywords from each document, say 5 keywords per document, using Hadoop. Each document may talk about a unique topic. My current approach is to use Latent Dirichlet Allocation (LDA) as implemented in Mahout. However, as each document talks about a different topic, the number of extracted topics should be equal to the number of documents, which is very large. Since LDA becomes very inefficient when the number of topics is large, my approach is to randomly group the documents into small groups of 100 documents each and then use Mahout LDA to extract 100 topics from each group. This approach works, but may not be very efficient, because each time Hadoop runs on only a small set of documents. Does anyone have a better (more efficient) idea for this?
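This is not the Mahout/LDA pipeline from the question, but for per-document keywords a plain TF-IDF ranking often suffices and sidesteps the topic-count problem entirely. A self-contained sketch (no Hadoop, documents assumed to be already tokenized by the caller):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdfKeywords {

    // Returns the top-k terms of one document, scored by TF-IDF against the corpus.
    public static List<String> topKeywords(List<String> docTokens,
                                           List<List<String>> corpus,
                                           int k) {
        // Term frequency within this document.
        Map<String, Integer> tf = new HashMap<>();
        for (String t : docTokens) {
            tf.merge(t, 1, Integer::sum);
        }
        // Document frequency of each term across the corpus.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : corpus) {
            for (String t : new HashSet<>(doc)) {
                df.merge(t, 1, Integer::sum);
            }
        }
        int n = corpus.size();
        // Score every term of the document and keep the k highest.
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) n / (1 + df.getOrDefault(e.getKey(), 0)));
            scored.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue() * idf));
        }
        scored.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> keywords = new ArrayList<>();
        for (int i = 0; i < Math.min(k, scored.size()); i++) {
            keywords.add(scored.get(i).getKey());
        }
        return keywords;
    }
}
```

The document-frequency pass is shared across all documents, so it only needs to be computed once per corpus; whether this quality is good enough compared to LDA topics depends on the texts.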

Score equivalence Oracle Text / Lucene

Is there an equivalence between the scores Oracle Text would calculate and the ones Lucene would?
Would you be able to mix the two sources into one unified result set ranked by score?
Scores are not comparable between queries or across data changes in Lucene, much less comparable to scores from another technology. The Lucene score of a given document can change dramatically when other documents are added to or removed from the index. Scoring as a percentage of the maximum looks like the obvious fix, but the same problems remain, and another technology's algorithms will likely produce a different score distribution. You can read about why you should not compare scores like this here and here.
A way I managed to lash something similar together was to fetch the matches from the other data source, build a temporary index in a RAMDirectory, and then search again, incorporating it with a MultiSearcher. That way everything is scored against a single, cohesive data set within a single search. Scoring should be reasonable enough, though this isn't exactly the most efficient way to search.
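A rough sketch of that idea, assuming Lucene 3.x-era APIs (RAMDirectory and MultiSearcher were removed in later versions) and a hypothetical content field holding the rows fetched from the other source:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class MixedSourceSearch {

    // Indexes the rows fetched from the other data source into an in-memory
    // index, then scores them together with the existing index in one search.
    public static TopDocs searchBothSources(IndexSearcher mainSearcher,
                                            Iterable<String> externalRows,
                                            Query query) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(ramDir, new IndexWriterConfig(Version.LUCENE_36, analyzer));
        for (String row : externalRows) {
            Document doc = new Document();
            doc.add(new Field("content", row, Field.Store.YES, Field.Index.ANALYZED)); // hypothetical field
            writer.addDocument(doc);
        }
        writer.close();

        IndexSearcher ramSearcher = new IndexSearcher(IndexReader.open(ramDir));
        MultiSearcher multi = new MultiSearcher(mainSearcher, ramSearcher);
        return multi.search(query, 10); // both sources scored within a single search
    }
}
```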
