Unique count of terms aggregations - elasticsearch

I want to count the distinct values of a field in my dataset. For example:
The terms aggregation gives me the number of occurrences per username. I want to count only unique usernames, not every occurrence.
Here's my request:
POST appzz/messages/_search
{
  "aggs": {
    "words": {
      "terms": {
        "field": "username"
      }
    }
  },
  "size": 0,
  "from": 0
}
Is there a unique option or something like that?

You're looking for the cardinality aggregation, which was added in Elasticsearch 1.1. It lets you make a request like this:
{
  "aggs": {
    "unique_users": {
      "cardinality": {
        "field": "username"
      }
    }
  }
}
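Applied to the request in the question, the whole call might look like this (a sketch; precision_threshold is optional and this value is only an example: it trades memory for accuracy, and the maximum supported value is 40000):
POST appzz/messages/_search
{
  "size": 0,
  "aggs": {
    "unique_users": {
      "cardinality": {
        "field": "username",
        "precision_threshold": 1000
      }
    }
  }
}
Keep in mind the count is approximate: cardinality is based on the HyperLogLog++ algorithm, so on high-cardinality fields the returned value can deviate slightly from the true count.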

We had a long discussion about this with one of the ES guys at a recent Elasticsearch meetup we had here. The short answer is no, there isn't, and according to him it's not something to expect soon.
One option to approximate it is to request all the terms (set a really big size limit) and count how many terms come back, but that is expensive and not really viable if you have a lot of unique terms.
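For illustration, that workaround applied to the request from the question would look something like this (a sketch; the size of 10000 is an arbitrary ceiling, not a value from the original post, and you then count the buckets in the response on the client side):
POST appzz/messages/_search
{
  "size": 0,
  "aggs": {
    "words": {
      "terms": {
        "field": "username",
        "size": 10000
      }
    }
  }
}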

#DerMiggel: I tried using cardinality for my project. Surprisingly, on my local system, out of a total dump of some 200,000 documents, I tried cardinality with a precision_threshold of 100, 0, and 40,000 (the maximum value). The first two runs gave different results (counts of 175 and 184 respectively), and with 40,000 I got an out-of-memory exception. The computation time was also huge compared to other aggregations. So I feel cardinality is not actually that accurate and might crash your system when you need high accuracy and precision.

I'm still fairly new to ES, but if I understand you correctly, you should be able to get your answer by simply counting the number of buckets returned in the response (see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html).
NOTE, though: contrary to what that doc currently says about size 0 ("It is possible to not limit the number of terms that are returned by setting size to 0."), my testing with the latest version (1.0.1 as of now) shows that this does not work! On the contrary, setting size to 0 gives you 0 buckets! For now, you should set (sigh) size to some arbitrarily high figure instead if you want to get all the terms.
EDIT: whoops, my bad! I just reread the doc and only now noticed the version note there, and realized this is only coming in 1.1.0. The note is in the past tense ("Added in 1.1.0."), which is confusing, but I guess 1.1.0 hasn't been released yet...
Oh, and by the way, there seems to be something wrong with your URL; I hope you're aware of that.

Related

Elasticsearch: how to find out if a value matches any value in a list?

I have just started learning Elasticsearch. My data contains company names and their websites, and I have a list of all the domain aliases for a company. I am trying to write a query that boosts records whose website matches one of the domains in the list.
My data looks like:
{"company_name": "Kaiser Permanente",
"website": "http://www.kaiserpermanente.org"},
{"company_name": "Kaiser Permanente - Urgent Care",
"website": "http://kp.org"}.
The list of domain aliases is:
["kaiserpermanente.org","kp.org","kpcomedicare.org", "kp.com"]
The actual list is longer than the above example. I've tried this query:
{
  "bool": {
    "should": {
      "terms": {
        "website": [
          "kaiserpermanente.org",
          "kp.org",
          "kpcomedicare.org",
          "kp.com"
        ],
        "boost": 20
      }
    }
  }
}
The query doesn't return anything because the terms query requires an exact match. The domains in the list and the indexed URLs are similar but not the same.
What I expect is for the query to return the two records in my example. I think match could work, but I couldn't figure out how to match a value against any similar value in a list.
I found a similar question, How to do multiple "match" or "match_phrase" values in ElasticSearch. That solution works, but my alias list contains more than 50 elements, so writing a separate "match_phrase" for each element would be very verbose. Is there a more efficient way, like "terms", where I can just pass in a list?
I'd appreciate it if anyone could help me out with this, thanks!
What you are observing has been covered in many Stack Overflow posts and in the ES docs: the difference between terms and match. When you store that info, I assume you are using the standard analyzer. This means that when you push "http://kp.org", Elasticsearch indexes the broken-out tokens [ "http", "kp", "org" ]. However, when you use terms, it looks for "kp.org", and there is no "kp.org" token to match because the analyzer broke the value apart at index time. match, on the other hand, analyzes what you query for, so "kp.org" => [ "kp", "org" ], and it is able to find one or both tokens. Phrase matching additionally requires the tokens to be next to each other, which is probably necessary for what you need.
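If you want to see exactly what tokens get indexed for a given piece of text, the _analyze API is handy. A minimal sketch (the exact token output depends on your analyzer and Elasticsearch version, so run it against your own setup):
GET _analyze
{
  "analyzer": "standard",
  "text": "http://www.kaiserpermanente.org"
}
The tokens array in the response shows which terms a terms query would actually have to match against.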
Unfortunately, there does not appear to be an option that works like match but accepts many values to match against like terms. I believe you have three options:
1. Programmatically generate the query, as described in the Stack Overflow post you referenced. You noted this would be verbose, but I think it might be just fine unless you have on the order of 1k aliases.
2. Analyze the website field so that analysis transforms "http://www.kaiserpermanente.org" => "kaiserpermanente.org" and "http://kp.org" => "kp.org" at index time. With this index-time analysis approach, you can then successfully use the terms filter when querying. This should be fine given that URLs are structured and your use case only appears to be concerned with domains. If you do this, use multi fields to analyze the one website value in multiple ways (see the sketch after this list). It's nice to have Elasticsearch do this kind of work for you so you don't have to worry about it in your own code.
3. Do this processing beforehand (before pushing data to ES), so that you store not only the website field but also a domain field, paths, and whatever else you need, computed up front. You get control, at the cost of the effort you have to put in.
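A minimal sketch of option 2, assuming ES 7+ mapping syntax and that the stored URLs contain no paths. The index name, the domain analyzer, and the domain_extract char filter are invented for illustration, and the pattern only strips the scheme and an optional "www." prefix:
PUT companies
{
  "settings": {
    "analysis": {
      "char_filter": {
        "domain_extract": {
          "type": "pattern_replace",
          "pattern": "^https?://(www\\.)?",
          "replacement": ""
        }
      },
      "analyzer": {
        "domain": {
          "type": "custom",
          "char_filter": ["domain_extract"],
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "company_name": { "type": "text" },
      "website": {
        "type": "text",
        "fields": {
          "domain": { "type": "text", "analyzer": "domain" }
        }
      }
    }
  }
}
With a mapping like this, the terms query from the question should work as-is once it targets website.domain instead of website.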

ElasticSearch random score combined with boost?

I am building an iOS app with Firebase, and I'm using Elasticsearch as a search engine to get more advanced queries.
I am trying to build a system where I can get a random record from the index based on a query. I already have this working using the random_score function with a seed.
So right now all documents have an equal chance of being selected. Is it possible to add a boost or something (sorry, I am new to ES)?
Let's say a document has the field boost_enabled set to true; that document should then be 3 times more likely to be selected, "increasing" its chance of coming up as the random pick.
So in theory it should look like this:
Documents that matches the query:
"document1"
"document2"
"document3"
They all have an equal chance of being selected (33%)
What I wish to achieve is if "document1" has the field "boost_enabled" = true
It should look like this:
"document1"
"document1"
"document1"
"document2"
"document3"
So now "document1" is 3 times more likely to be selected as the random record.
Would really appreciate some help.
EDIT:
I've come up with something like this. Is it correct or not? I'm pretty sure it's not, though...
"query" : {
"function_score": {
"query": {
"bool" : {
"must": {
"match_all": {}
},
"should": [
{ "exists" : {
"field" : "boost_enabled",
"boost" : 3
}
}
]
"filter" : filterArray
}
},
"functions": [
{
"random_score": {"seed": seed}
}
]
}
}
/ Mads
Yes, Elasticsearch has something like that - refer to Elasticsearch: Query-Time Boosting.
In your case, you would have a portion of your query that notes the presence of the flag you described, and this "subquery" would carry a boost. bool with its should clause will probably be useful.
NB: this is not EXACTLY the same as being able to say a matching document is n times as likely to be a result.
EDITS:
--
EDIT 1:
Elasticsearch will tell you how it comes up with the score via the Explain API, which might be helpful in tweaking parameters.
--
EDIT 2:
I apologize for what I posted above. Upon further thought and exploration, I think the boost parameter is not quite what is required here. function_score does have the notion of weight, but even that falls short. I have found other users with requirements similar to yours, but it looks like no good solutions have been proposed for this yet.
References:
Elasticsearch Github Issue on Weighted Random Sampling
Stackoverflow Post with a Request Identical to Github Issue
I do not think the solutions proposed in those posts are quite right. I put together a quick shell script that hits the Elasticsearch REST API and relies on jq (a popular CLI for processing JSON) to demonstrate: Github Gist: Flawed Attempt At Weighed Random Sampling with Elasticsearch
In the script, featured_flag is equivalent to your boost_enabled, and undesired_flag is there to demonstrate how to consider only a subset of the documents in the index. You can copy the script and tweak the global variables at the top (Elasticsearch server, index, etc.) to try it out.
Some notes on the script:
The script creates one document with featured_flag enabled and one document with undesired_flag enabled that should never be chosen.
TOTAL_DOCUMENTS adjusts how many documents are created in total (including the first two).
FEATURED_FLAG_WEIGHT is the weight applied at query time via function_score.
The script reruns the same query 1000 times and reports how many times each created document came back as the first result.
I would imagine your index has many "featured" or "boosted" samples among many that are not. With the described requirements, the probability of choosing a sample depends on the weight of the document (say 3 for boosted documents, 1 for the rest) and on the sum of weights across all valid documents you want taken into consideration. Simple weights, boosts, and randoms are therefore just insufficient.
A lot of people have considered and posted solutions for the task of weighted random sampling without Elasticsearch. This appears to be a good stab at explaining a few approaches: electric monk: Weighted Random Distribution. A lot of the algorithmic details may not be quite relevant here, but I thought they were interesting.
I think the ideal solution would require work to be done outside of Elasticsearch (without delving into creating Elasticsearch plugins, scorers, etc). Here is the best I can come up with at the moment (a sketch of the two requests follows below):
A numeric weight field stored in the documents (you can keep using boolean fields, but a weight seems more flexible).
Hit Elasticsearch with an initial query leveraging aggregations for some stats we need:
a sum aggregation for the sum of weights, required for document probabilities;
a terms aggregation to get counts of documents by weight (ex: m documents with weight 1, n documents with weight 3).
Outside of Elasticsearch (in the app), choose the sample:
generate a random number in the range 0 to sum_of_weights - 1;
use the aggregation results and the generated random number to select an index (see the algorithmic solutions for weighted random sampling outside of Elasticsearch) in the range 0 to total_valid_documents - 1 (call this selected_index).
Hit Elasticsearch a second time with appropriate filters for considering only valid documents, a sort parameter that guarantees the document set is ordered the same way each time we run this process (perhaps sorted by weight and then by a unique document id), and a from parameter set to selected_index.
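A rough sketch of the two requests under those assumptions. The feed index, the weight field, the status filter, and the unique id field used as a tie-breaker are all invented for illustration, and { selected_index } stands for the value computed in the app-side step:
GET feed/_search
{
  "size": 0,
  "query": {
    "term": { "status": "published" }
  },
  "aggs": {
    "sum_of_weights": {
      "sum": { "field": "weight" }
    },
    "docs_by_weight": {
      "terms": { "field": "weight" }
    }
  }
}
GET feed/_search
{
  "size": 1,
  "from": { selected_index },
  "query": {
    "term": { "status": "published" }
  },
  "sort": [
    { "weight": "desc" },
    { "id": "asc" }
  ]
}
Keep in mind that from is subject to the index.max_result_window setting (10,000 by default), so picking deep into a very large candidate set would need that raised or a different fetch strategy.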
Related to all this, I posted a slightly different write-up.
In ES 7.15 I use the script_score query; below you can see my example.
The script "source": "_score + Math.random()" adds a random number between 0.0 and 1.0 to my native boosted score. For more information you can see this
{
  "size": { YOUR_SIZE_LIMIT },
  "query": {
    "script_score": {
      "query": {
        "query_string": {
          "fields": [
            "{ YOUR_FIELD_1 }^6",
            "{ YOUR_FIELD_2 }^3"
          ],
          "query": "{ YOUR_SEARCH_QUERY }"
        }
      },
      "script": {
        "source": "_score + Math.random()"
      }
    }
  }
}

How to deal with too many irrelevant results for an Elasticsearch query?

I'm trying to implement product search for a German e-commerce site and am having great trouble finding the right resources for some specific problems.
I had the problem that searches for a partial word wouldn't return viable results, e.g. a match for etikett wouldn't return documents containing Rolletiketten.
Ngrams introduced too many problems, so I got rid of them again after some tests. I then found out about word decomposition for the German language and tried some plugins. Now I'm getting far too many completely irrelevant results, e.g. searching for rolletikett returns documents containing möbelrollen, which is something completely different.
While I understand most of the mechanics and why I get those results, I have no clue how to fix my problems, and it seems I'm unable to find the right resources online to clear things up.
A few hints would be awesome. Thank you.
With Elasticsearch you should get what you describe out of the box (e.g. with a wildcard search).
Maybe you are doing a boolean query that matches whole words only.
I suggest the following links that walk through the query language:
Introductory: http://logz.io/blog/elasticsearch-queries/
In detail: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
Hope that helps, Christian
Tip: the document mapping and the exact query you are submitting will help others assist in solving your problem.
When you say that introducing ngrams caused issues, I think you may have ended up putting too much pressure on indexing. Changing the min and max gram values can help with that.
For example, below is an analysis filter that I am using and that performs well:
"autocomplete": {
"type": "edgeNGram",
"min_gram": "1",
"max_gram": "10"
}
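For context, here is a minimal sketch of how a filter like that might be wired into a custom analyzer in the index settings (the autocomplete_analyzer name is invented for illustration):
"analysis": {
  "filter": {
    "autocomplete": {
      "type": "edgeNGram",
      "min_gram": "1",
      "max_gram": "10"
    }
  },
  "analyzer": {
    "autocomplete_analyzer": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["lowercase", "autocomplete"]
    }
  }
}
A common pattern is to apply such an analyzer at index time only and keep a plain analyzer at search time (via search_analyzer in the field mapping), so queries themselves are not expanded into grams.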
Here is another question on Stack Overflow; the problem statement is different, but the solution is relevant to this problem too: https://stackoverflow.com/a/42592722/3133937
"Now I'm getting far too many completely irrelevant results"
Try using min_score: docs
Some of your ES queries may be broad enough that poor quality hits make it into your results. Just setting a threshold for score helps keep them at bay. For me, I had strong score 10 hits, then along with them were a ton of score 0 hits; unwanted. If you see this, I'd guess that your query could be more efficient, but at least min_score keeps the fluff down.
GET /myIndex/_search
{
  "from": 0,
  "size": 10,
  "min_score": 1,
  "query": {
    "match": {
      "Title": {
        "query": "Bake a Cake",
        "fuzziness": 2
      }
    }
  }
}

Elasticsearch get matching documents after specific document id

When I search for documents I take the first 10 and hand them to the view; if the user scrolls to the end of the list, the next 10 elements should be displayed.
I know the document id of the last displayed document, so now I have to get the next 10. I could simply rerun the exact same search with an offset of 10, but it would be much better to run the same query, pass it the id of the last retrieved document, and get the matching documents that come after the document with that id.
Is that possible with Elasticsearch?
=== UPDATE
I want to clarify my issue a bit more, because it seems it is not described clearly enough. Sorry for that.
The case: you have a kind of feed, and the feed grows every second. When a user opens the feed he gets the 10 most recent entries; when he scrolls down he wants the next 10 entries.
Because the feed grows every second, a plain offset/limit (from/size in Elasticsearch) can't solve this: you would either display entries again or skip over newer ones, depending on the time between the first request (first 10 entries) and the request for the next batch.
The request for the next 10 elements AFTER the already displayed entries gives the backend the id of the last displayed entry, and the backend knows to ignore all entries before that specific one.
At the moment I'm handling this in code: I request the list of all matching entries from Elasticsearch and iterate over it, which lets me do everything I want (no surprise) and extract the needed chunk of entries.
My question is: is there a built-in solution for this in Elasticsearch? Solving the problem my way is not the fastest.
It's an old topic, but it feels like the Search After API, available since Elasticsearch 5.0, does exactly what is needed. Provide the id of your last doc and its timestamp, for example:
GET twitter/tweet/_search
{
  "size": 10,
  "query": {
    "match": {
      "title": "elasticsearch"
    }
  },
  "search_after": [
    1463538857,
    "tweet#654323"
  ],
  "sort": [
    { "date": "asc" },
    { "_uid": "desc" }
  ]
}
Source: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html
You just have to create your query DSL and a pagination system with
{ "size": 10, "from" : YOUR_OFFSET }
If I understood your question correctly, you can use ES scrolls for such a thing.
This is an example of how one would do that in Java; note that it uses SearchType.SCAN:
SearchRequestBuilder requestBuilder = ....getClient().prepareSearch(indices);
/**
 * Set up the scroll and go from there...
 * To do that, change the search type to <code>SearchType.SCAN</code>
 * and set up the scroll itself.
 * Once the search type and scroll are set and the search is executed,
 * whoever handles the result will need to check and poll the scroll.
 */
requestBuilder.setSearchType(SearchType.SCAN);
requestBuilder.setScroll(new TimeValue(SCROLL_WINDOW_IN_MILLISECONDS)); // this value is in milliseconds
requestBuilder.setSize(10); // how many hits per shard each scroll round will return
SearchResponse results = requestBuilder.execute().actionGet();
while (true) {
    results = client.prepareSearchScroll(results.getScrollId())
                    .setScroll(new TimeValue(60000))
                    .execute().actionGet();
    if (results.getHits().getHits().length == 0) {
        break;
    }
    // do what you need to do with this batch of scroll results here
}
So every time inside the while loop you grab the next batch of consecutive results, until you have gone through all of them.
I know this is old, but I encountered the same dilemma, and I'd rather think out loud.
In that feed, you care about less and less relevant documents with every request. I am deliberately not saying a timestamp, comment count, etc.: in ES terms you are talking about a score that can be calculated from many factors, and what you want is to continue down that scoring road.
The solution that came to my mind was this: if you also care about newly arrived, more relevant documents (like Facebook's "X new stories available" at the top), you can first search from the beginning until you reach the first document you previously encountered (the one that used to be most relevant). By adding the count of documents before it to the count of documents you have already displayed in the feed, you can determine an estimated offset (you might get a few duplicates in race conditions; just drop them).
So what you actually need to do is search from the top until you reach that first document, then search at the estimated bottom and drop everything more relevant than your last displayed document.
This all assumes the bulk of the feed never changes: if document Y was between X and Z, it will stay there forever.
If the score is constant (unlikely, as the score would have to keep rising for the feed to keep changing), you could also filter for everything below the score of the last document.

Trouble with facet counts

I'm attempting to use ElasticSearch for analytics -- specifically to track "top content" for a hand-rolled Rails CMS. The requirement is quite a bit more complicated than keeping a counter per piece of content, but I won't get into the depths of the problem right now, as I can't seem to get even the basics working.
My problem is this: I'm using facets, and the counts aren't what I expect them to be. For example:
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":1,"all_terms":false,"order":"count"}}}}
Result:
{"el_ids":{"_type":"terms","missing":0,"total":16672,"other":16657,"terms":[{"term":"quis","count":15}]}}
Ok, great: the piece of content with id "quis" had 15 hits, and since the order is count, it should be my top piece of content. Now let's get the top 5 pieces of content.
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":5,"all_terms":false,"order":"count"}}}}
Result (just the facet):
[
{"term":"qgz9","count":26},
{"term":"quis","count":15},
{"term":"hnqn","count":15},
{"term":"higp","count":15},
{"term":"csns","count":15}
]
Huh? So the piece of content with id "qgz9" had more hits, with 26? Why wasn't it the top result in the first query?
Ok, let's get the top 100 now.
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":100,"all_terms":false,"order":"count"}}}}
Results (just the facet):
[
{"term":"qgz9","count":43},
{"term":"difc","count":37},
{"term":"zryp","count":31},
{"term":"u65r","count":31},
{"term":"sxsi","count":31},
...
]
So now "qgz9" has 43 hits instead of 26? How can that be? I can assure you there's nothing happening in the background modifying the index. If I repeat these queries, I get the same results.
As I repeat this process of increasing the result size, counts continue to change and new content ids emerge at the top. Can someone explain to me what I'm doing wrong or where my understanding of how this works is flawed?
It turns out that this is a known issue:
...the way top N facets work now is by getting the top N from each shard, and merging the results. This can give inaccurate results.
By default, my index was created with 5 shards. By changing this so the index has only a single shard, the counts behave in line with my expectations. Another workaround is to always set size to a value greater than the number of expected facets and peel off the top N results yourself.
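For reference, the single-shard workaround is applied at index-creation time (a sketch; the index name is a placeholder). With one shard there is no cross-shard merging, so the counts are exact:
PUT /top_content
{
  "settings": {
    "index": {
      "number_of_shards": 1
    }
  }
}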
