ElasticSearch results aren't relevant

In ElasticSearch, I've created two documents with one field, "CategoryMajor"
In doc1, I set CategoryMajor to "Restaurants"
In doc2, I set CategoryMajor to "Restaurants Restaurants Restaurants Restaurants Restaurants"
If I perform a search for CategoryMajor:Restaurants, doc1 shows up as MORE RELEVANT than doc2. This is not typical Lucene behavior, which gives more relevance the more times a term appears: doc2 should be MORE RELEVANT than doc1.
How do I fix this?

You can add &explain=true to your GET query to see that the score of doc2 is lowered by the "fieldNorm" factor. This is caused by the default Lucene similarity formula, which lowers the score of longer documents. Please read this document about the default Lucene similarity formula:
http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/search/Similarity.html
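For example, a minimal sketch of fetching the explanation with the Python client (the index and type names are placeholders matching the mapping URL below; appending ?explain=true to a plain _search GET works just as well):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
# explain=True asks Elasticsearch to return the score breakdown,
# including the fieldNorm factor, for every hit
resp = es.search(index="index", doc_type="type",
                 body={"query": {"match": {"CategoryMajor": "Restaurants"}}},
                 explain=True)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
    print(hit["_explanation"])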
To disable this behaviour, add "omit_norms": true for the CategoryMajor field in your index mapping by sending a PUT request to:
http://localhost:9200/index/type/_mapping
with request body:
{
    "type": {
        "properties": {
            "CategoryMajor": {
                "type": "string",
                "omit_norms": true
            }
        }
    }
}
I'm not certain, but it may be necessary to delete your index, create it again, put the above mapping and then reindex your documents. Reindexing after changing the mapping is definitely necessary, though :).
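If you prefer the Python client, a minimal sketch of putting that mapping, using the same placeholder index/type names as the URL above (reindexing the documents afterwards is up to your own indexing code):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
# PUT the mapping for the "type" type on the "index" index;
# with norms omitted, document length no longer penalizes the score
es.indices.put_mapping(
    index="index",
    doc_type="type",
    body={
        "type": {
            "properties": {
                "CategoryMajor": {"type": "string", "omit_norms": True}
            }
        }
    }
)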

Related

Really huge query or optimizing an elasticsearch update

I'm working on document visualization for binary classification of a large number of documents (around 150 000). The challenge is how to present general visual information to end users, so they can get an idea of the main "concepts" in each category (positive/negative). As each document has an associated set of topics, I thought about asking Elasticsearch, through aggregations, for the top-20 topics on positively classified documents, and then the same for the negatives.
I created a Python script that downloads the data from Elastic and classifies the docs, BUT the problem is that the predictions on the dataset are not registered in Elasticsearch, so I cannot ask for the top-20 topics in a certain category. First I thought about creating a query in Elastic to ask for the aggregations and passing a match on the document ids.
As I have the ids of the positive/negative documents, I can write a query to retrieve the aggregation of topics, BUT in the query I would have to provide a really large number of document IDs to indicate, for instance, just the positive documents. That is impossible, since there is a limit on the endpoint and I cannot pass 50 000 ids like:
"query": {
"bool": {
"should": [
{"match": {"id_str": "939490553510748161"}},
{"match": {"id_str": "939496983510742348"}}
...
],
"minimum_should_match" : 1
}
},
"aggs" : { ... }
So I tried to register the predicted categories of the classification in the Elastic index, but as the number of documents is really huge, it takes about half an hour (compared to less than a minute for running the classification)... which is a LOT of time just for storing the predictions. Then I also need to query the index to get the right data for the visualization. To update the documents, I am using:
for id in docs_ids:
    es.update(
        index=kwargs["index"],
        doc_type=kwargs["doc_type"],
        id=id,
        body={"doc": {
            "prediction": kwargs["category"]
        }}
    )
Do you know an alternative to update the predictions faster?
You could use the bulk API, which lets you serialize your requests and hit Elasticsearch only once for a whole batch of operations.
Try:
from elasticsearch import Elasticsearch, helpers

query_list = []
list_ids = ["1", "2", "3"]
es = Elasticsearch("myurl")
for id in list_ids:
    query_dict = {
        '_op_type': 'update',
        '_index': kwargs["index"],
        '_type': kwargs["doc_type"],
        '_id': id,
        'doc': {"prediction": kwargs["category"]}
    }
    query_list.append(query_dict)
helpers.bulk(client=es, actions=query_list)
Please have a read here
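If a single bulk call is still too slow, the same helpers module also has parallel_bulk, which spreads the chunks over several threads. A minimal sketch, reusing query_list from above (the thread_count and chunk_size values are just illustrative defaults):
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("myurl")
# parallel_bulk returns a generator of (ok, item) tuples,
# so it must be iterated for the requests to actually be sent
for ok, item in helpers.parallel_bulk(client=es,
                                      actions=query_list,
                                      thread_count=4,
                                      chunk_size=500):
    if not ok:
        print(item)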
Regarding querying the list of ids: to get a faster response you shouldn't match on the id_str value, as you did in the question, but on the _id field. That lets you use the multi get query, a bulk query for the get operation, which is also available in the python library. Try:
my_ids_list = [<some_ids_here>]
es.mget(index=kwargs["index"],
        doc_type=kwargs["doc_type"],
        body={'ids': my_ids_list})
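For completeness, a small sketch of reading the mget response (the "docs"/"_source" layout is the standard mget response shape; the kwargs names follow the snippets above):
resp = es.mget(index=kwargs["index"],
               doc_type=kwargs["doc_type"],
               body={'ids': my_ids_list})
# every requested id comes back as an entry under "docs";
# only the ones that were found carry a "_source"
found_docs = [d["_source"] for d in resp["docs"] if d.get("found")]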

Less restrictive search doesn't return any hits in ElasticSearch

The query below returns hits, for example where name is "Balances by bank":
GET /_search
{
    "query": {
        "multi_match": {
            "query": "Balances",
            "fields": ["name", "descrip", "notes"]
        }
    }
}
So why doesn't this one return anything? Note that the query is less restrictive: the word is "Balance" and not "Balances" with an s.
GET /_search
{
    "query": {
        "multi_match": {
            "query": "Balance",
            "fields": ["name", "descrip", "notes"]
        }
    }
}
What search would return both?
You need to change your mapping to be able to do that.
If you didn't specify a mapping with specific analyzers when creating your index, Elasticsearch uses the default dynamic mapping and the default analyzer.
The default mapping maps each text field as both text and keyword, so you can perform full-text search (match part of the string) and keyword search (match the whole string), but it uses the standard analyzer.
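You can check what that auto-generated mapping looks like; a minimal sketch with the Python client (the index name my-index is just a placeholder):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
# dynamic mapping indexes a string value as "text" plus a "keyword"
# sub-field, roughly:
#   "name": {"type": "text",
#            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}}
print(es.indices.get_mapping(index="my-index"))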
With the standard analyzer, your example Balances by bank becomes the following list of tokens: [balances, by, bank] (the standard analyzer also lowercases). Those tokens are added to the inverted index, and Elasticsearch can find the document when you search for any of them.
When you search for just Balance, that term does not exist in the inverted index, so Elasticsearch returns nothing.
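You can see the tokens yourself with the _analyze API; a minimal sketch with the Python client, using the example text from the question:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
# the standard analyzer splits on word boundaries and lowercases
resp = es.indices.analyze(body={
    "analyzer": "standard",
    "text": "Balances by bank"
})
print([t["token"] for t in resp["tokens"]])   # ['balances', 'by', 'bank']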
To be able to match both Balance and Balances you need to change your mapping and use the analyzer for the english language; this analyzer reduces your terms to their stems and will match Balance and Balances, as well as Balancing, Balanced, Balancer, etc.
Look at this part of the documentation to see how the analysis process works.
And of course, you can also search for Balance* and it will return both Balance and Balances, but it is a different query.
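If you go the mapping route, a minimal sketch with the Python client, assuming a hypothetical my-index index and the three fields from your query (on Elasticsearch 7+; older versions also need a type name under mappings):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
# the english analyzer stems terms at both index and search time,
# so "Balance" and "Balances" end up as the same token
es.indices.create(index="my-index", body={
    "mappings": {
        "properties": {
            "name":    {"type": "text", "analyzer": "english"},
            "descrip": {"type": "text", "analyzer": "english"},
            "notes":   {"type": "text", "analyzer": "english"}
        }
    }
})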

Elasticsearch - Term Aggregation In Theory

Imagine a random terms aggregation on a specific field:
"aggs":{
"top_terms": {
"terms": {
"field": "any.specific.field"
}
}
}
My question here is: how does ES aggregate terms?
If the inverted & fielddata index looks like this (ES: inverted index & fielddata index), and unique terms are stored per document rather than per field, how does ES aggregate terms per field? What is happening behind the scenes to aggregate them?
Can somebody shed some light on this? Thanks in advance.

Short queries return not enough results

Hey I have a field in elasticsearch that is analyzed with the alphanumeric_analyzer. Then I index data into that field that looks like this:
Test-00001
Test-00002
to
Test-01000
If I execute the following query, I consistently get 250 results, but they aren't necessarily Test-00001 to Test-00250.
{
    "query": {
        "match": {
            "filename_Analyzed": {
                "type": "phrase_prefix",
                "query": "0"
            }
        }
    }
}
I was expecting to get 1000 results, but I only get 250. Are my expectations correct, or is the search incorrect?
EDIT 1:
Gist for the mapping:
https://gist.github.com/goalie7960/8ffd1536269a901f18bc
EDIT 2:
If I double the number of shards, the number of results also doubles. So 5 shards = 250 results, 10 shards = 500 results, etc.
EDIT 3:
Here's a gist for the analyzer I am using. But I can also reproduce with the standard analyzer.
https://gist.github.com/goalie7960/b0bbbddf1cee29b4b5ed
Turns out the phrase_prefix query was exceeding the max expansion limit in Elasticsearch: max_expansions defaults to 50 and is applied per shard, which lines up with 250 results on 5 shards and 500 on 10. A not-so-simple solution was to switch to ngram analysis, which fixed the problem. Yay.
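For reference, a hedged sketch of what such an ngram-based setup could look like (the index, analyzer and filter names are made up, not taken from the gists above; on older Elasticsearch versions the field type would be string and the mapping would need a type name). Because the edge ngrams are produced at index time, a plain match query for "0" no longer needs any prefix expansion:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
# edge ngrams are generated at index time only; the search side
# just matches the literal prefix tokens
es.indices.create(index="files", body={
    "settings": {
        "analysis": {
            "filter": {
                "edge_filter": {"type": "edge_ngram", "min_gram": 1, "max_gram": 10}
            },
            "analyzer": {
                "prefix_index": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "edge_filter"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "filename_Analyzed": {
                "type": "text",
                "analyzer": "prefix_index",
                "search_analyzer": "standard"
            }
        }
    }
})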

Django-Haystack elasticsearch queries

Haystack generates elasticsearch queries to get results from elasticsearch. The queries get prepended with a filter containing the following query:
"query": {
"query_string": {
"query": "django_ct:(customers.customer)"
}
}
What is the meaning of the django_ct(..) query? Is this a function that haystack installs in elasticsearch? Is it some caching magic? Can I get rid of this part altogether?
The reason why I'm asking is that I have to build a custom query to use an elasticsearch multi_field. In order to change the queries I want to understand first how haystack generates its own queries.
Haystack uses Django's content types to determine which model attributes to search against in Elasticsearch. This is not really best practice, but it's how it's done in HS.
Basically, the code in HS looks something like this:
from django.contrib.contenttypes.models import ContentType

app_name, model_name = django_ct.split('.')
ct = ContentType.objects.get_by_natural_key(app_name, model_name)
model = ct.model_class()
# do stuff with model
So, you really don't want to ignore it when using haystack, if you are indexing more than one model in your index.
I have a couple other answers based on elasticsearch here: index analyzer vs query analyzer in haystack - elasticsearch? and here: Django Haystack Distinct Value for Field
EDIT regarding multi-fields:
I've used Haystack and multi-fields in the past, so I'm not sure you need to write your own backend. The key is understanding how Haystack creates searches. As I said in one of the other posts, everything goes into query_string and from there it creates a Lucene-based search string. Again, not really best practice.
So let's say you have a multi-field that looks like this:
"some_field": {
"type": "multi_field",
"fields": {
"some_field_edgengram": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"some_field": {
"type": "string",
"index": "not_analyzed"
}
}
},
In haystack, you can just search against some_field and some_field_edgengram directly.
For example, SearchQuerySet().filter(some_field="cat") and SearchQuerySet().filter(some_field_edgengram="cat") will both work, but the first will only match tokens that are exactly cat, while the second will match cat, cats, catlin, catch, etc., at least with my edgengram analyzers.
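As a quick illustration, a sketch assuming a Haystack SearchIndex that exposes both sub-field names shown above:
from haystack.query import SearchQuerySet

# exact match against the not_analyzed sub-field
exact = SearchQuerySet().filter(some_field="cat")

# prefix-style match against the edgengram sub-field
fuzzy = SearchQuerySet().filter(some_field_edgengram="cat")

print(exact.count(), fuzzy.count())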
However, just because you use haystack for indexing and search doesn't mean you have to use it for 100% of your search solutions. In the past, I've used PYES in some areas of the app and haystack in others, because haystack lacked the support for more advanced features and the query_string parsing was losing some of the finer grained accuracy we were looking for.
In your case, you could get results from the search engine via elasticutils or python-elasticsearch directly for some more advanced searches, and use haystack for the other, more routine searches.
