I have several thousand texts in Elasticsearch which I have to compare against text segments to detect plagiarism (exact matches). My plan is to take several large segments from different parts of each checked text and look them up in the documents loaded into Elasticsearch. I am trying to find the best way to do this in Elasticsearch.
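A minimal sketch of what that segment lookup could look like with the Python client; the index name, field name, and client call style are assumptions, not a recommendation:

```python
# Rough sketch using the elasticsearch-py client; "documents" and "content"
# are placeholders for your own index and field names.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

segment = "one of the large text segments taken from the checked text"

# match_phrase requires the terms to appear contiguously and in order,
# which approximates an exact match at the segment level.
resp = es.search(
    index="documents",
    body={
        "query": {"match_phrase": {"content": segment}},
        "size": 10,
    },
)

for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```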
Related
I am using Elasticsearch to compute the cosine similarity between paragraphs and search queries. However, the tutorials I find online seem to indicate that you can only have one vector per indexed document. That is unfortunate for my use case, since each document contains multiple paragraphs and thus has multiple vectors associated with it.
So, when using similarity metrics like cosine similarity or k-nearest neighbors, what is the best way to deal with this? Do I add multiple JSONs per document to the index? That is, one for each vector?
Or, is there a smarter way to do this?
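For what it's worth, the "one JSON per vector" option mentioned above might look roughly like this; the index name, field names, and dims value are assumptions, not a prescription:

```python
# Sketch: one indexed document per paragraph, each carrying its own vector.
# Index/field names and dims=384 are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="paragraphs",
    body={
        "mappings": {
            "properties": {
                "parent_doc_id": {"type": "keyword"},
                "text": {"type": "text"},
                "vector": {"type": "dense_vector", "dims": 384},
            }
        }
    },
)

# Each paragraph becomes its own small document. At query time you run the
# similarity query against this index and group/collapse hits on
# parent_doc_id to get back to document-level results.
es.index(
    index="paragraphs",
    body={
        "parent_doc_id": "doc-42",
        "text": "first paragraph of document 42 ...",
        "vector": [0.01] * 384,
    },
)
```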
I have a big corpus of text documents, which is dynamically updated. Every second nearly 100 new documents are added to this corpus.
I want to find the documents that contain an input phrase query (or one of the input phrases) in real time. The queries also arrive sequentially at a high rate.
What is the right tool to implement this? Is Elasticsearch the appropriate one, or are there lighter tools?
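If Elasticsearch is the direction taken, one feature that may be worth benchmarking for this kind of reversed workload is the percolator: phrase queries are registered once, and each newly arriving document is matched against all of them. A rough sketch, with placeholder index and field names:

```python
# Sketch of Elasticsearch's percolator: stored phrase queries are matched
# against each incoming document. Index/field names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One index holds the registered phrase queries.
es.indices.create(
    index="phrase-queries",
    body={
        "mappings": {
            "properties": {
                "query": {"type": "percolator"},
                "body": {"type": "text"},  # must mirror the document field
            }
        }
    },
)

# Register a phrase query.
es.index(
    index="phrase-queries",
    body={"query": {"match_phrase": {"body": "some input phrase"}}},
)

# For each newly added document, ask which stored phrases it matches.
resp = es.search(
    index="phrase-queries",
    body={
        "query": {
            "percolate": {
                "field": "query",
                "document": {"body": "text of the newly added document ..."},
            }
        }
    },
)
print([hit["_id"] for hit in resp["hits"]["hits"]])
```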
I have various text documents (many from OCR). Many of these documents have medium-to-large sections that are entirely the same across documents, and I would like to exclude those sections before doing analysis on them.
I've done a good bit of searching without much luck. There are some things out there that exploit the regularity of HTML documents' DOM structures, but I'm not working with HTML. Is there an algorithm I can use, or a paper I can read, to figure out how to find blocks of identical text across different documents?
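One family of techniques that fits this (a hedged sketch below, not any specific paper's algorithm) is k-gram shingling/fingerprinting, as used in near-duplicate detection: hash every window of k consecutive words in each document and look for hashes that occur in more than one document; runs of consecutive shared hashes mark the repeated blocks. The window size and the toy corpus are arbitrary assumptions:

```python
# Sketch of k-gram shingling for finding shared text blocks across documents.
import hashlib
from collections import defaultdict

def shingles(text, k=8):
    """Yield (position, hash) for every window of k consecutive words."""
    words = text.lower().split()
    for i in range(len(words) - k + 1):
        window = " ".join(words[i:i + k])
        yield i, hashlib.md5(window.encode("utf-8")).hexdigest()

def shared_blocks(docs, k=8):
    """Map each shingle hash to the documents/positions where it occurs,
    keeping only hashes seen in more than one document."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, h in shingles(text, k):
            index[h].append((doc_id, pos))
    return {h: occ for h, occ in index.items()
            if len({doc_id for doc_id, _ in occ}) > 1}

if __name__ == "__main__":
    docs = {
        "a": "boilerplate header text that repeats everywhere plus unique content of a",
        "b": "boilerplate header text that repeats everywhere plus unique content of b",
    }
    # Consecutive shared shingles correspond to larger identical blocks.
    print(len(shared_blocks(docs, k=4)))
```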
I'm creating a search-by-image system (similar to Google's reverse image search) for a cataloging system used internally at my company. We've already been using Elasticsearch with success for our regular search functionality, so I'm planning on hashing all our images, creating a separate index for them, and using it for searching. There are many items in the system, each item may have multiple images associated with it, and an item should be findable by reverse image searching any of its associated images.
There are two possible schemas we've thought of:

1. Making a document for each image, containing only the hash of the image and the id of the item it belongs to. This would result in roughly 7m documents, but they would be small, since each contains only a single hash and an id.
2. Making a document for each item, and storing the hashes of all the images associated with it in an array on the document. This would result in around 100k documents, but each document would be fairly large; some items have hundreds of images associated with them.

Which of these schemas would be more performant?
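For reference, the two candidate layouts might look roughly like this with the Python client; the index and field names are illustrative, and keyword is used so the hashes are matched exactly:

```python
# Sketch of the two candidate index layouts; names are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Schema 1: one small document per image (~7m docs).
es.indices.create(
    index="images",
    body={
        "mappings": {
            "properties": {
                "item_id": {"type": "keyword"},
                "image_hash": {"type": "keyword"},
            }
        }
    },
)

# Schema 2: one document per item, with all of its image hashes in an array
# (~100k docs, some with hundreds of hashes).
es.indices.create(
    index="items",
    body={
        "mappings": {
            "properties": {
                "image_hashes": {"type": "keyword"},  # arrays need no special mapping
            }
        }
    },
)
```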
Having attended a recent Under the Hood talk by Alexander Reelsen, I suspect he would say "it depends" and "benchmark it".
As @Science_Fiction already hinted:

1. Are the images frequently updated? That could be a significant negative cost factor.
2. On the other hand, the overhead of 7m documents maybe shouldn't be neglected, whereas in your second scenario the hashes would just be not_analyzed terms in a single field.

If 1. is a low factor, I would probably start with your second approach.
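Under the second approach, a reverse image lookup then becomes a single exact term query against the hash array; a sketch, reusing the illustrative index and field names from the question:

```python
# Sketch: find the item that owns a given image hash (second schema).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="items",
    body={"query": {"term": {"image_hashes": "d41d8cd98f00b204e9800998ecf8427e"}}},
)
print([hit["_id"] for hit in resp["hits"]["hits"]])
```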
I have many documents, over ten thousand (maybe more). I'd like to extract some keywords from each document, say 5 keywords per document, using Hadoop. Each document may talk about a unique topic. My current approach is to use Latent Dirichlet Allocation (LDA) as implemented in Mahout. However, as each document talks about a different topic, the number of extracted topics should be equal to the number of documents, which is very large. Since LDA becomes very inefficient when the number of topics becomes large, my workaround is to randomly group the documents into small groups of 100 documents each and then use Mahout LDA to extract 100 topics from each group. This approach works, but may not be very efficient, because each time I run Hadoop it is on only a small set of documents. Does anyone have a better (more efficient) idea for this?
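For illustration only, here is a rough sketch of the grouping-plus-LDA workaround described above, with scikit-learn standing in for Mahout; the group size (100), topic count (100), and keyword count (5) are the question's own numbers, everything else is an assumption:

```python
# Sketch of the described workaround: shuffle documents into groups of ~100,
# run LDA per group, and take the top words of each document's dominant topic
# as its keywords. scikit-learn stands in for Mahout here.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def keywords_for_group(docs, n_topics=100, n_keywords=5):
    """Return a list of keyword lists, one per document in this group."""
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(counts)            # shape: (docs, topics)
    vocab = np.array(vectorizer.get_feature_names_out())
    keywords = []
    for topic_weights in doc_topics:
        topic = topic_weights.argmax()                # document's dominant topic
        top_terms = lda.components_[topic].argsort()[::-1][:n_keywords]
        keywords.append(list(vocab[top_terms]))
    return keywords

# Usage (with your own corpus loaded into `documents`):
#   random.shuffle(documents)
#   groups = [documents[i:i + 100] for i in range(0, len(documents), 100)]
#   all_keywords = [kw for g in groups for kw in keywords_for_group(g)]
```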