Elasticsearch DSL: Bucket not working

Running the code,
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q, A
client = Elasticsearch(timeout=100)
s = Search(using=client, index="cms*")
s.aggs.bucket('ExitCode', 'terms', field='ExitCode').metric('avgCpuEff', 'avg', field='CpuEff')
for hit in s[0:20].execute():
    print hit['ExitCode']
yields several hits with ExitCode = 0. I thought a terms bucket was supposed to group together all the results that share the same exit code. What is actually going on?

You're iterating over the hits; you need to iterate over the aggregated buckets instead:
response = s.execute()
for code in response.aggregations.ExitCode.buckets:
    print(code.key, code.avgCpuEff.value)

Related

ElasticSearch get only document ids, _id field, using search query on index

For a given query I want to get only the list of _id values without getting any other information (without _source, _index, _type, ...).
I noticed that by using _source and requesting non-existent fields it returns only minimal data, but can I get even less data in return?
Some answers suggest using the hits part of the response, but I do not want the other info.
It's better to use scroll and scan to get the result list, so Elasticsearch doesn't have to rank and sort the results.
With the elasticsearch-dsl python lib this can be accomplished by:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
es = Elasticsearch()
s = Search(using=es, index=ES_INDEX, doc_type=DOC_TYPE)
s = s.fields([]) # only get ids, otherwise `fields` takes a list of field names
ids = [h.meta.id for h in s.scan()]
I suggest using elasticsearch_dsl for Python; it has a nice API.
from elasticsearch_dsl import Document

# s is an existing Search object; don't return any fields, just the metadata
s = s.source(False)
results = list(s)
Afterwards you can get the id with:
from typing import Union

first_result: Document = results[0]
doc_id: Union[str, int] = first_result.meta.id
See the official documentation for more details: https://elasticsearch-dsl.readthedocs.io/en/latest/search_dsl.html#extra-properties-and-parameters
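Putting the two answers together, here is a minimal sketch (the index name is hypothetical) that returns only the IDs of every matching document:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch()

# Ask for no stored fields at all; each hit then carries only its metadata.
s = Search(using=es, index="my-index").source(False)

# scan() streams every match without scoring, so this also works for large result sets.
ids = [hit.meta.id for hit in s.scan()]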

How can I find the true score from Elasticsearch query string with a wildcard?

My ElasticSearch 2.x NEST query string search contains a wildcard:
Using NEST in C#:
var results = _client.Search<IEntity>(s => s
    .Index(Indices.AllIndices)
    .AllTypes()
    .Query(qs => qs
        .QueryString(qsq => qsq.Query("Micro*")))
    .From(pageNumber)
    .Size(pageSize));
Comes up with something like this:
$ curl -XGET 'http://localhost:9200/_all/_search?q=Micro*'
This code was derived from the Elasticsearch page on using covariants. The results are covariant; they are of mixed types coming from multiple indices. The problem I am having is that all of the hits come back with a score of 1.
This is regardless of type or boosting. Can I boost by type or, alternatively, is there a way to reveal or "explain" the search result so I can order by score?
Multi-term queries like the wildcard query are given a constant score equal to their boost value by default. You can change this behaviour using .Rewrite().
var results = client.Search<IEntity>(s => s
    .Index(Indices.AllIndices)
    .AllTypes()
    .Query(qs => qs
        .QueryString(qsq => qsq
            .Query("Micro*")
            .Rewrite(RewriteMultiTerm.ScoringBoolean)
        )
    )
    .From(pageNumber)
    .Size(pageSize)
);
With RewriteMultiTerm.ScoringBoolean, the rewrite method first translates each term into a should clause in a bool query and keeps the scores as computed by the query.
Note that this can be CPU intensive, and there is a default limit of 1024 bool query clauses that is easily hit for a large document corpus; running your query on the complete Stack Overflow data set (questions, answers and users), for example, hits the clause limit for questions. You may instead want to analyze the text at index time with an analyzer that uses an edge n-gram token filter.
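In elasticsearch_dsl terms, a minimal sketch of such an analyzer might look like this (all names here are hypothetical); indexing prefixes up front lets a plain match query stand in for the wildcard:
from elasticsearch_dsl import analyzer, token_filter

# Hypothetical names: emit 2-15 character prefixes of each token at index time.
edge_ngram_filter = token_filter("edge_ngram_filter", "edge_ngram", min_gram=2, max_gram=15)
prefix_analyzer = analyzer(
    "prefix_analyzer",
    tokenizer="standard",
    filter=["lowercase", edge_ngram_filter],
)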
Wildcard searches will always return a score of 1.
You can boost by a particular type. See this:
How to boost index type in elasticsearch?
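Since the search spans multiple indices, another hedged option is indices_boost, which weights hits per index rather than per type. A raw query-body sketch (the index names are hypothetical):
body = {
    "query": {"query_string": {"query": "Micro*"}},
    # Hypothetical index names: hits from "products" outrank hits from "archive".
    "indices_boost": {"products": 2.0, "archive": 1.0},
}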

Fetch all the rows using elasticsearch_dsl

Currently I am using the following program to extract the ID and severity information from Elasticsearch.
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
client = Elasticsearch(
    [
        #'http://user:secret@10.x.x.11:9200/',
        'http://10.x.x.11:9200/',
    ],
    verify_certs=True
)
s = Search(using=client, index="test")
response = s.execute()
for hit in response:
    print hit.message_id, hit.severity, "\n\n"
I believe the query returns 10 rows by default, but I have more than 10,000 rows in Elasticsearch and need to fetch all of them.
Can someone guide me on how to run the same query so it fetches all records?
You can use the scan() helper function in order to retrieve all docs from your test index:
from elasticsearch import Elasticsearch, helpers
client = Elasticsearch(
    [
        #'http://user:secret@10.x.x.11:9200/',
        'http://10.x.x.11:9200/',
    ],
    verify_certs=True
)
# helpers.scan() yields raw hit dicts, so the fields live under '_source'
docs = list(helpers.scan(client, index="test", query={"query": {"match_all": {}}}))
for hit in docs:
    print hit['_source']['message_id'], hit['_source']['severity'], "\n\n"
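If you would rather stay inside elasticsearch_dsl, Search.scan() pages through every match and keeps the attribute-style access used above. A minimal sketch (host, index and field names are taken from the question):
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

client = Elasticsearch(['http://10.x.x.11:9200/'], verify_certs=True)

# scan() streams every matching document instead of the default 10 hits.
s = Search(using=client, index="test")
for hit in s.scan():
    print(hit.message_id, hit.severity)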

Why not use min_score with Elasticsearch?

New to Elasticsearch. I am interested in returning only the most relevant docs and came across min_score. The docs say "Note, most times, this does not make much sense" but don't provide a reason. So, why does it not make sense to use min_score?
EDIT: What I really want to do is only return documents that have a higher than x "score". I have this:
data = {
    'min_score': 0.9,
    'query': {
        'match': {'field': 'michael brown'},
    }
}
Is there a better alternative to the above so that it only returns the most relevant docs?
thx!
EDIT #2:
I'm using minimum_should_match and it returns a 400 error:
"error": "SearchPhaseExecutionException[Failed to execute phase [query], all shards failed;"
data = {
    'query': {
        'match': {'keywords': 'michael brown'},
        'minimum_should_match': '90%',
    }
}
I've used min_score quite a lot for trying to find documents that are a definitive match to a given set of input data - which is used to generate the query.
The score you get for a document depends on the query, of course. So I'd say try your query in many permutations (different keywords, for example), decide for each which document is the first you would rather it didn't return, and make a note of each of their scores. If the scores are similar, this gives you a good guess at the value to use for your min_score.
However, bear in mind that the score isn't just dependent on the query and the returned document; it also reflects all the other documents that have data in the fields you are querying. This means that if you test your min_score value against an index of 20 documents, the score will probably change greatly when you try it on a production index with, for example, a few thousand documents or more. This change could go either way, and is not easily predictable.
I've found that for my matching uses of min_score you need to build quite a complicated query, and a set of analysers, to tune the scores for the various components of your query. But what is and isn't included is vital to my application, so you may well be happy with what it gives you when keeping things simple.
I don't know if it's the best solution, but it works for me (java):
// "tiny" search to discover maxScore
// it is fast, because it returns only 1 item
SearchResponse response = client.prepareSearch(INDEX_NAME)
.setTypes(TYPE_NAME)
.setQuery(queryBuilder)
.setSize(1)
.execute()
.actionGet();
// get the maxScore and
// and set minScore = 70%
float maxScore = response.getHits().maxScore();
float minScore = maxScore * 0.7;
// second round with minimum score
SearchResponse response = client.prepareSearch(INDEX_NAME)
.setTypes(TYPE_NAME)
.setQuery(queryBuilder)
.setMinScore(minScore)
.execute()
.actionGet();
I search twice, but the first request is fast because it returns only one item, and from it we can get the max_score.
NOTE: minimum_should_match works differently. If you have 4 should clauses and set minimum_should_match = 70%, it doesn't mean that item.score should be > 70%. It means that the item should match 70% of the clauses, i.e. at least 3 of the 4.
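As for the 400 in EDIT #2, the likely cause is placement: in a match query, minimum_should_match belongs inside the field's options object rather than next to the match clause. A hedged sketch of the corrected body:
data = {
    'query': {
        'match': {
            'keywords': {
                'query': 'michael brown',
                'minimum_should_match': '90%',
            }
        }
    }
}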

Setting Elasticsearch limit to "unlimited"

How can I get all the results from Elasticsearch? The results are limited to 10 by default. I have a query like:
@data = Athlete.search :load => true do
  size 15
  query do
    boolean do
      must { string q, {:fields => ["name", "other_names", "nickname", "short_name"], :phrase_slop => 5} }
      unless conditions.blank?
        conditions.each do |condition|
          must { eval(condition) }
        end
      end
      unless excludes.blank?
        excludes.each do |exclude|
          must_not { eval(exclude) }
        end
      end
    end
  end
  sort do
    by '_score', "desc"
  end
end
I have set the limit to 15 but I want to make it unlimited so that I can get all the data.
I can't hard-code a limit because my data keeps changing and I want all of it.
You can use the from and size parameters to page through all your data. This could be very slow depending on your data and how much is in the index.
http://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
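For example, a minimal elasticsearch_dsl (Python) sketch of from/size paging; the index name and page size are hypothetical:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

page_size = 100  # hypothetical page size
search = Search(using=Elasticsearch(), index="athletes")  # hypothetical index name

# Walk the result set one page at a time by slicing, which maps to from/size.
start = 0
while True:
    page = search[start:start + page_size].execute()
    if not page.hits:
        break
    for hit in page:
        print(hit.meta.id, hit.meta.score)
    start += page_size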
Another approach is to first do a searchType: 'count', and then do a normal search with size set to results.count.
The advantage here is it avoids depending on a magic number for UPPER_BOUND as suggested in this similar SO question, and avoids the extra overhead of building too large of a priority queue that Shay Banon describes here. It also lets you keep your results sorted, unlike scan.
The biggest disadvantage is that it requires two requests. Depending on your circumstance, this may be acceptable.
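A rough elasticsearch_dsl (Python) equivalent of that two-request idea, with a hypothetical index name; the count search type is gone in later versions, but Search.count() plays the same role:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

search = Search(using=Elasticsearch(), index="athletes")  # hypothetical index name

# First request: how many documents match?
total = search.count()

# Second request: fetch exactly that many hits, keeping the normal score sort.
results = search[0:total].execute()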
From the docs, "Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000". So my admittedly very ad-hoc solution is to just pass size: 10000 or 10,000 minus from if I use the from argument.
Note that, following Matt's comment below, the proper way to do this if you have a large number of documents is to use the scroll API. I have used this successfully, but only with the Python interface.
Use the scan method, e.g.
curl -XGET 'localhost:9200/_search?search_type=scan&scroll=10m&size=50' -d '
{
    "query" : {
        "match_all" : {}
    }
}'
see here
You can use search_after to paginate, and the Point in Time API to avoid having your data change while you paginate. Example with elasticsearch-dsl for Python:
from typing import Any, List

from elasticsearch_dsl import Search
from elasticsearch_dsl.connections import connections

# Set up paginated query with search_after and a fixed point_in_time
# (elastic_host, MY_INDEX and filter_ are placeholders from the original answer)
elasticsearch = connections.create_connection(hosts=[elastic_host])
pit = elasticsearch.open_point_in_time(index=MY_INDEX, keep_alive="3m")
pit_id = pit["id"]
query_size = 500
search_after = [0]
hits: List[Any] = []
while query_size:
    if hits:
        search_after = hits[-1].meta.sort
    search = (
        Search()
        .extra(size=query_size)
        .extra(pit={"id": pit_id, "keep_alive": "5m"})
        .extra(search_after=search_after)
        .filter(filter_)
        .sort("url.keyword")  # Note you need a unique field to sort on or it may never advance
    )
    response = search.execute()
    hits = [hit for hit in response]
    pit_id = response.pit_id
    query_size = len(hits)
    for hit in hits:
        ...  # Do work with hits
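Once the loop finishes, the point in time can be released explicitly. A sketch assuming elasticsearch-py 7.10 or later (the exact signature varies between client versions):
# Free the server-side point-in-time context when pagination is done.
elasticsearch.close_point_in_time(body={"id": pit_id})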
