I have a large database of image annotations stored in Elasticsearch, and I want to use it for keyword extraction. The input is text (typically a newspaper article). My basic idea is to go through each term in the article and use Elasticsearch to find out how frequent that term is in the image annotations, then output the article terms that are not frequent (in order to prefer names of people or places over common English words).
I don't need anything very sophisticated, since the keywords are only used as suggestions for user input, but I want something faster than sending N search queries to Elasticsearch (where N is the number of terms in the text), which can be slow on large texts. Is there a robust and fast technique for keyword extraction in Elasticsearch?
You can use an Elasticsearch terms aggregation for this. It returns bucketed keywords with document counts that indicate their relative frequency. Here is an example query in YAML:
query:
  match:
    annotation:
      query: text of your article
aggregations:
  term_frequencies:
    terms:
      field: annotation
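To connect this to the original goal of preferring rare terms, one possible approach is to restrict the aggregation to the article's own terms and then sort the returned buckets by ascending document count. The sketch below is only an illustration: the helper names, the crude tokenizer, and the use of the aggregation's include parameter with an exact list of values are assumptions, not part of the answer above. Note also that a terms aggregation on an analyzed text field needs fielddata enabled on that field.

```python
import re


def build_request(article_text, field="annotation"):
    """Build one search body that counts only the article's own terms."""
    # Crude tokenizer (assumption): lowercase alphanumeric runs. It should
    # mirror whatever analyzer the annotation field really uses.
    terms = sorted(set(re.findall(r"[a-z0-9]+", article_text.lower())))
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {"match": {field: {"query": article_text}}},
        "aggs": {
            "term_frequencies": {
                "terms": {
                    "field": field,
                    "include": terms,  # exact values to bucket
                    "size": max(len(terms), 1),
                }
            }
        },
    }


def rarest_terms(response, limit=10):
    """Return bucket keys ordered from least to most frequent."""
    buckets = response["aggregations"]["term_frequencies"]["buckets"]
    ranked = sorted(buckets, key=lambda b: (b["doc_count"], b["key"]))
    return [b["key"] for b in ranked[:limit]]
```

The body from build_request would be sent as a single search request, and rarest_terms would be applied to the JSON response, so the whole article costs one round trip instead of N.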
Related
How do I dump out the term dictionary from Elasticsearch?
I just want to look for pathological cases in indexing, so I want to see what is actually getting put into the index.
These are text fields, so I can't just do a terms aggregation.
_termvectors only works for single or multiple documents. I want top terms in the index, like the terms component in Solr.
I am using Elasticsearch 6.2, and I have some queries that analyze a massive number of documents. I am sorting on one field in the index. Elasticsearch examines 10,000 documents (the default configuration value) and then returns them paginated.
I tried to read the documentation, but I cannot find any information on whether the sorting is applied before or after the documents from the index are analyzed.
In other words, is the sort applied directly during index analysis, or are the documents sorted once they have been analyzed? If the latter, what kind of sort does Elasticsearch apply during the scan?
Thanks a lot.
Sorting, aggregations, and access to field values in scripts requires
a different data access pattern. Instead of looking up the term and
finding documents, we need to be able to look up the document and find
the terms that it has in a field.
This quote from the Elasticsearch reference documentation implies to me that sorting happens at the non-analyzed level, but I decided to double-check and run some tests.
Elasticsearch can sort on non-analyzed fields such as keyword. Those fields use doc values for sorting, and my test showed that sorting uses the original (pre-analysis) values, ordered by the codes representing the characters: digits first, then uppercase letters, then lowercase letters.
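The ordering observed in that test can be reproduced outside Elasticsearch. The snippet below is a small sketch, assuming keyword values are compared by their encoded bytes; the function name is made up for illustration.

```python
def keyword_sort_order(values):
    """Order strings by a byte-wise comparison of their UTF-8 encoding."""
    return sorted(values, key=lambda v: v.encode("utf-8"))


print(keyword_sort_order(["banana", "Apple", "42", "apple", "Zebra"]))
# ['42', 'Apple', 'Zebra', 'apple', 'banana']
```

Digits sort before uppercase letters, and every uppercase letter sorts before any lowercase one, which is why "Zebra" comes before "apple".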
It is also possible to sort on text fields, with some caveats and tuning (for example, you need to enable fielddata, since text fields do not support doc_values).
In this case the documents are sorted according to the analyzed values. Of course a lot depends on the analysis pipeline, since it can transform the text in various ways. Also, just as a reminder:
Fielddata can consume a lot of heap space, especially when loading
high cardinality text fields. Once fielddata has been loaded into the
heap, it remains there for the lifetime of the segment. Also, loading
fielddata is an expensive process which can cause users to experience
latency hits. This is why fielddata is disabled by default.
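Given that cost, a commonly used alternative is to keep the text field for searching and add a keyword sub-field for sorting. The sketch below only builds the JSON bodies as Python dicts; the sub-field name "raw" and the helper names are hypothetical, and the exact mapping envelope depends on the Elasticsearch version (6.x still expects a mapping type around "properties").

```python
def text_with_sortable_subfield(field, enable_fielddata=False):
    """Mapping fragment: analyzed text plus a keyword sub-field named 'raw'."""
    prop = {"type": "text", "fields": {"raw": {"type": "keyword"}}}
    if enable_fielddata:
        # Only needed when sorting on the analyzed tokens themselves.
        prop["fielddata"] = True
    return {"properties": {field: prop}}


def sort_clause(field, analyzed=False, order="asc"):
    """Sort on the keyword sub-field by default, or on the text field itself."""
    target = field if analyzed else f"{field}.raw"
    return {"sort": [{target: {"order": order}}]}
```

With analyzed=False the sort runs on doc values of the original string; with analyzed=True it runs on the analyzed tokens and requires the fielddata flag in the mapping.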
My problem is searching the data of thousands of users, e.g. mailboxes. Almost all the time the search is filtered by user id. How can this locality of searches be taken into account? I'm trying to achieve performance comparable to the case where each user has a dedicated index.
Sharding alone is not the answer, because it will already be in use (the total number of users is ~1M); I'm looking for a solution to use inside a shard of ~4k users.
This can be done in Sphinx with attributes. Most of the time you can make the search more efficient by also adding the user id as a fake keyword, so that the documents can be filtered during the full-text stage. (Still keep the attribute too, to avoid the possibility of someone manipulating results by constructing a careful query that returns results from other users.)
E.g., add _user1234 to a full-text field and then add it to the query: WHERE MATCH('example _user1234') AND user = 1234 then finds documents from just that user.
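A minimal sketch of both sides of this trick follows, assuming the fake keyword has the form _user<id> as in the example above. The helper names are hypothetical, and real code should still pass the values to Sphinx through properly escaped or parameterized queries.

```python
def user_token(user_id):
    """Fake keyword that marks a document as belonging to one user."""
    return f"_user{int(user_id)}"


def index_text(body, user_id):
    """Text to feed to the full-text field: the body plus the owner token."""
    return f"{body} {user_token(user_id)}"


def scoped_search(query, user_id):
    """Return (MATCH expression, attribute value) for one user's search."""
    # Drop any owner token the caller tried to smuggle in; the attribute
    # filter (user = <id>) remains the real safety net.
    words = [w for w in query.split() if not w.startswith("_user")]
    match_expr = " ".join(words + [user_token(user_id)])
    return match_expr, int(user_id)
```

For example, scoped_search("example", 1234) yields the pair ("example _user1234", 1234), which corresponds to the MATCH string and the user attribute in the WHERE clause above.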
One possible solution is to group the documents of the same user within an inverted index block. Given that an inverted index block is sorted by document id, such grouping can be done only by assigning ids to documents appropriately: the documents of the same user should have monotonic ids. Minor violations of this rule would not harm performance significantly.
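The id assignment can be illustrated independently of any search engine. This is a toy sketch with made-up helper names: it sorts documents by owner before numbering them, so each user ends up with one contiguous id range.

```python
def assign_grouped_ids(docs, start=1):
    """docs: iterable of (user_id, payload) pairs.

    Returns a list of (doc_id, user_id, payload) where each user's
    documents receive consecutive ids.
    """
    ordered = sorted(docs, key=lambda d: d[0])  # stable: keeps per-user order
    return [(start + i, user, payload) for i, (user, payload) in enumerate(ordered)]


def id_ranges(assigned):
    """Map each user to the (lowest, highest) doc id they own."""
    ranges = {}
    for doc_id, user, _ in assigned:
        lo, hi = ranges.get(user, (doc_id, doc_id))
        ranges[user] = (min(lo, doc_id), max(hi, doc_id))
    return ranges
```

A search filtered by user then only has to touch the part of each posting list that falls inside that user's id range, which is the locality the answer is after.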
Implementations:
Index sorting has just become a first-class citizen in Lucene 6.2.
It could be achieved in Elasticsearch 2.3 (see here), and I think it is achievable in Solr in the same way.
As for Sphinx, I suppose the same technique of assigning monotonic document ids should work.
For more technical reasoning see the previous link.
I'm using Elasticsearch to find documents similar to a given document using the "more like this" query.
Is there an easy way to get the Elasticsearch score between 0 and 1 (using cosine similarity)?
Thanks!
You may want to look into the Function Score functionality of Elasticsearch, more specifically the script_score and field_value_factor functions. These allow you to take the score from default scoring (_score) and enhance or replace it in other ways. It really depends on what sort of boosting or transformation you'd like. The default scoring model takes the vector space model into account, but other things as well.
I don't think that's possible to retrieve directly.
But perhaps this workaround would make sense?
Elasticsearch always returns max_score in the hits section of the response.
You could divide each document's _score by max_score. The document with the highest score will then score 1, and documents that are less like the given one will score lower.
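A minimal sketch of that workaround, applied to an already parsed search response (the function name is made up for illustration):

```python
def normalize_scores(response):
    """Return (doc id, score scaled into [0, 1]) for every hit."""
    hits = response["hits"]
    max_score = hits.get("max_score")
    if not max_score:
        # max_score is null when sorting on a field, and 0 if nothing scored.
        return [(h["_id"], 0.0) for h in hits["hits"]]
    return [(h["_id"], h["_score"] / max_score) for h in hits["hits"]]
```

Keep in mind that the result is relative to the best hit of this particular query. It is not a cosine similarity, and values from two different queries are not comparable.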
Elasticsearch uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model but adds more-modern features like a coordination factor, field length normalization, and term or query clause boosting.
I want to query Elasticsearch documents within a date range. I have two options now, and both work fine for me; I have tested both of them.
1. Range Query
2. Range Filter
Since I have a small data set for now, I am unable to test the performance of both of them. What is the difference between the two, and which one would result in faster retrieval of documents and a faster response?
The main difference between queries and filters has to do with scoring. Queries return documents with a relative ranked score for each document. Filters do not. This difference allows a filter to be faster for two reasons. First, it does not incur the cost of calculating the score for each document. Second, it can cache the results as it does not have to deal with possible changes in the score from moment to moment - it's just a boolean really, does the document match or not?
From the documentation:
Filters are usually faster than queries because:
they don’t have to calculate the relevance _score for each document —
the answer is just a boolean “Yes, the document matches the filter” or
“No, the document does not match the filter”.
the results from most filters can be cached in memory, making subsequent executions faster.
As a practical matter, the question is whether you use the relevance score in any way. If not, filters are the way to go. If you do, filters may still be of use but should be used where they make sense. For instance, if you had a language field (let's say language: "EN" as an example) in your documents and wanted to query by language along with a relevance score, you would combine a query for the text search with a filter for language. The filter would cache the document ids for all documents in English, and the query could then be applied to that subset.
I'm oversimplifying a bit, but those are the basics. Good places to read up on this:
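The language example, combined with the date range from the question, could look roughly like this. The sketch uses the bool query with a filter clause, which replaced the filtered query from Elasticsearch 2.0 onward; the field names body and published, and the function name, are assumptions made for illustration.

```python
def build_search(text, language, date_from=None, date_to=None,
                 text_field="body", date_field="published"):
    """Scored full-text match plus unscored, cacheable filters."""
    filters = [{"term": {"language": language}}]
    date_range = {}
    if date_from:
        date_range["gte"] = date_from
    if date_to:
        date_range["lte"] = date_to
    if date_range:
        filters.append({"range": {date_field: date_range}})
    return {
        "query": {
            "bool": {
                "must": [{"match": {text_field: text}}],  # contributes to _score
                "filter": filters,                        # yes/no only, no scoring
            }
        }
    }
```

Only the match clause affects ranking; the term and range clauses just narrow the candidate set, which is the case where a filter is the better fit.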
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-filtered-query.html
http://exploringelasticsearch.com/searching_data.html
http://elasticsearch-users.115913.n3.nabble.com/Filters-vs-Queries-td3219558.html
Filters are cached so they are faster!
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/filter-caching.html