How to get the total token count in documents in Elasticsearch

I am trying to get the total number of tokens in the documents that match a query. I haven't defined any custom mapping, and the field I want the token count for is of type 'string'.
I tried the following query, but it returns a very large number on the order of 10^20, which cannot be correct for my dataset.
curl -XPOST 'localhost:9200/nodename/comment/_search?pretty' -d '
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "tk_count": {
      "sum": {
        "script": "_index[\"body\"].sumttf()"
      }
    }
  },
  "size": 0
}'
Any idea how to get the correct count of all tokens? (I do not need per-term counts, just the total.)

This worked for me; is it what you need?
Rather than computing the token count at query time (using a tk_count aggregation, as suggested in the other answer), my solution stores the token count at indexing time using the token_count datatype, so that "name.stored_length" values can be returned in query results.
token_count is a "multi-field"; it works on one field at a time (i.e. the "name" field or the "body" field). I modified the example slightly to store "name.stored_length".
Notice that it does not count the cardinality of tokens (i.e. distinct values); it counts total tokens: "John John Doe" has 3 tokens, so "name.stored_length" === 3, even though it has only 2 distinct tokens. Notice also that I ask for the specific "stored_fields": ["name.stored_length"].
Finally, you may need to re-index your documents (i.e. send a PUT again), or use any other technique to refresh the values. In this case I PUT "John John Doe" again, even though it had already been POST/PUT into Elasticsearch; the tokens were not counted until the document was PUT again, after adding the token_count field to the mapping.
PUT test_token_count
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "stored_length": {
              "type": "token_count",
              "analyzer": "standard",
              //------------------v
              "store": true
            }
          }
        }
      }
    }
  }
}
PUT test_token_count/_doc/1
{
  "name": "John John Doe"
}
Now we can search, configuring the results to include the name.stored_length field (which is both a multi-field and a stored field!):
GET/POST test_token_count/_search
{
  //------------------v
  "stored_fields": ["name.stored_length"]
}
And the search results should include the total token count as name.stored_length...
{
  ...
  "hits": {
    ...
    "hits": [
      {
        "_index": "test_token_count",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "fields": {
          //------------------v
          "name.stored_length": [
            3
          ]
        }
      }
    ]
  }
}
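Since name.stored_length is indexed as a numeric field, a natural follow-up for the original question (the total token count across all matching documents) would be a sum aggregation over it. A minimal sketch, assuming the mapping above (the aggregation name total_tokens is illustrative):
GET test_token_count/_search
{
  "size": 0,
  "aggs": {
    "total_tokens": {
      "sum": {
        "field": "name.stored_length"
      }
    }
  }
}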

It seems like you want to retrieve the cardinality (number of distinct tokens) of the body field.
In that case you can just use a cardinality aggregation, like below.
curl -XPOST 'localhost:9200/nodename/comment/_search?pretty' -d '
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "tk_count": {
      "cardinality": {
        "field": "body"
      }
    }
  },
  "size": 0
}'
For more detail, see the official documentation.

Related

How to apply exact match on single field and distinct on multiple fields together in ElasticSearch?

I recently started working with Elasticsearch, and I am trying to search with the following criteria:
I want an exact match on ENAME and a distinct over both EID & ENAME on the above data.
Let's say that for matching I have the string ABC.
The result should then look like this:
[
  { "EID": 111, "ENAME": "ABC" },
  { "EID": 444, "ENAME": "ABC" }
]
You can achieve this via a combination of a term query and a terms aggregation.
Assuming that you have the following mapping:
PUT my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "EID": {
          "type": "keyword"
        },
        "ENAME": {
          "type": "keyword"
        }
      }
    }
  }
}
And inserted the documents like this:
POST my_index/doc/3
{
  "EID": "111",
  "ENAME": "ABC"
}
POST my_index/doc/4
{
  "EID": "222",
  "ENAME": "XYZ"
}
POST my_index/doc/12
{
  "EID": "444",
  "ENAME": "ABC"
}
The query that will do the job might look like this:
POST my_index/doc/_search
{
  "query": {
    "term": { 1️⃣
      "ENAME": "ABC"
    }
  },
  "size": 0, 3️⃣
  "aggregations": {
    "by EID": {
      "terms": { 2️⃣
        "field": "EID"
      }
    }
  }
}
Let me explain how it works:
1️⃣ - the term query asks Elasticsearch to filter on the exact value of the keyword field "ENAME";
2️⃣ - the terms aggregation collects the list of all possible values of another keyword field, "EID", and gives back the N most frequent ones;
3️⃣ - "size": 0 tells Elasticsearch not to return any search hits (we are only interested in the aggregations).
The output of the query will look like this:
{
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "by EID": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "111",   <== Here is the first "distinct" value that we wanted
          "doc_count": 3
        },
        {
          "key": "444",   <== Here is another "distinct" value
          "doc_count": 2
        }
      ]
    }
  }
}
The output does not look exactly like what you posted in the question, but I believe it is the closest you can get with Elasticsearch.
However, this output is equivalent:
"ENAME" is implicitly present (since its value was used for filtering);
"EID" values are present under the "buckets" of the aggregations section.
Note that "doc_count" gives you the number of documents having that "EID".
What if I want to do a DISTINCT on several fields?
For a more complex scenario (e.g. when you need to do a distinct on many fields) see this answer.
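For instance, on Elasticsearch 6.1+ one possible sketch for a distinct over both fields uses the composite aggregation (the aggregation name distinct_pairs is illustrative):
POST my_index/doc/_search
{
  "size": 0,
  "aggregations": {
    "distinct_pairs": {
      "composite": {
        "sources": [
          { "EID": { "terms": { "field": "EID" } } },
          { "ENAME": { "terms": { "field": "ENAME" } } }
        ]
      }
    }
  }
}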
More information about aggregations is available here.
Hope that helps!

Terms aggregation (to achieve hierarchical faceting) query performance slow

I am indexing metric names in Elasticsearch. Metric names are of the form foo.bar.baz.aux. Here are the index settings I use:
{
  "index": {
    "analysis": {
      "analyzer": {
        "prefix-test-analyzer": {
          "filter": "dotted",
          "tokenizer": "prefix-test-tokenizer",
          "type": "custom"
        }
      },
      "filter": {
        "dotted": {
          "patterns": [
            "([^.]+)"
          ],
          "type": "pattern_capture"
        }
      },
      "tokenizer": {
        "prefix-test-tokenizer": {
          "delimiter": ".",
          "type": "path_hierarchy"
        }
      }
    }
  }
}
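And here is the mapping: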
{
  "metrics": {
    "_routing": {
      "required": true
    },
    "properties": {
      "tenantId": {
        "type": "string",
        "index": "not_analyzed"
      },
      "unit": {
        "type": "string",
        "index": "not_analyzed"
      },
      "metric_name": {
        "index_analyzer": "prefix-test-analyzer",
        "search_analyzer": "keyword",
        "type": "string"
      }
    }
  }
}
The above index creates the following terms for the metric name foo.bar.baz:
foo
bar
baz
foo.bar
foo.bar.baz
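As a sanity check, you can verify which terms the analyzer emits with the _analyze API (the query-string form shown here works on the older Elasticsearch versions this mapping targets; on 5.x+ the parameters move into a JSON body):
curl -XGET 'http://localhost:9200/metrics_alias/_analyze?analyzer=prefix-test-analyzer&text=foo.bar.baz&pretty'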
If I have a bunch of metrics, like the ones below:
a.b.c.d.e
a.b.c.d
a.b.m.n
x.y.z
I have to write a query to grab the nth level of tokens. In the example above:
for level = 0, I should get [a, x]
for level = 1, with 'a' as the first token I should get [b]
with 'x' as the first token I should get [y]
for level = 2, with 'a.b' as the prefix I should get [c, m]
I couldn't think of any other way than a terms aggregation. To figure out the level-2 tokens of a.b, here is the query I came up with:
time curl -XGET http://localhost:9200/metrics_alias/metrics/_search\?pretty\&routing\=12345 -d '{
  "size": 0,
  "query": {
    "term": {
      "tenantId": "12345"
    }
  },
  "aggs": {
    "metric_name_tokens": {
      "terms": {
        "field": "metric_name",
        "include": "a[.]b[.][^.]*",
        "execution_hint": "map",
        "size": 0
      }
    }
  }
}'
This results in the following buckets; I parse the output and grab [c, m] from there.
"buckets": [
  {
    "key": "a.b.c",
    "doc_count": 2
  },
  {
    "key": "a.b.m",
    "doc_count": 1
  }
]
So far so good. The query works great for most tenants (notice the tenantId term query above). For certain tenants with large amounts of data (around 1 million documents), though, performance is really slow. I am guessing the terms aggregation takes time.
I am wondering if terms aggregation is the right choice for this kind of data and also looking for other possible kinds of queries.
Some suggestions:
"mirror" the filter at the aggregations level in the query part as well. So, for a.b. matching, use the following as a query and keep the same aggs section:
"bool": {
"must": [
{
"term": {
"tenantId": 123
}
},
{
"prefix": {
"metric_name": {
"value": "a.b."
}
}
}
]
}
or even use a regexp query with the same regular expression as in the aggregation part. This way the aggregation will have to evaluate fewer buckets, since fewer documents reach the aggregation phase.
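For example, a sketch of the regexp variant, reusing the exact pattern from the aggregation's include:
"bool": {
  "must": [
    {
      "term": {
        "tenantId": 123
      }
    },
    {
      "regexp": {
        "metric_name": "a[.]b[.][^.]*"
      }
    }
  ]
}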
You mentioned that regexp works better for you; my initial guess was that prefix would perform better.
change "size": 0 in the aggregation to "size": 100. After testing, you mentioned this doesn't make any difference.
remove "execution_hint": "map" and let Elasticsearch use the defaults. After testing, you mentioned that the default execution_hint performed far worse.
the only other thing I can think of is to relieve the pressure at search time by moving it to indexing time. What I mean by that: at indexing time, in your own application or whatever indexing method you are using, split the text to be indexed programmatically (not ES doing it) and index each element of the hierarchy in a separate field: for example, a.b in field2, a.b.c in field3, and so on, all within the same document. Then, at search time, you look at the specific field depending on the search text. This whole idea, though, requires some additional work outside ES, as sketched below.
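Here is that sketch (the level_N field names are purely illustrative, not part of the original mapping); a document for a.b.c.d could be indexed as:
PUT metrics_alias/metrics/1?routing=12345
{
  "tenantId": "12345",
  "metric_name": "a.b.c.d",
  "level_1": "a",
  "level_2": "a.b",
  "level_3": "a.b.c",
  "level_4": "a.b.c.d"
}
A level-2 lookup under a.b then becomes a terms aggregation on level_3 filtered by a term query on level_2, with no regex work at search time.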
Of all the suggestions above, the first one had the greatest impact: query response times improved from 23 seconds to 11 seconds.

How to limit ElasticSearch results by a field value?

We've got a system that indexes resume documents in ElasticSearch using the mapper attachment plugin. Alongside the indexed document, I store some basic info, like if it's tied to an applicant or employee, their name, and the ID they're assigned in the system. A query that runs might look something like this when it hits ES:
{
  "size": 100,
  "query": {
    "query_string": {
      "query": "software AND (developer OR engineer)",
      "default_field": "fileData"
    }
  },
  "_source": {
    "includes": ["applicant.*", "employee.*"]
  }
}
And gets me results like:
"hits": [100]
0: {
"_index": "careers"
"_type": "resume"
"_id": "AVEW8FJcqKzY6y-HB4tr"
"_score": 0.4530588
"_source": {
"applicant": {
"name": "John Doe"
"id": 338338
}
}
}...
What I'm trying to do is limit the results, so that if John Doe with id 338338 has three different resumes in the system that all match the query, I only get back one match, preferably the highest scoring one (though that's not as important, as long as I can find the person). I've been trying different options with filters and aggregates, but I haven't stumbled across a way to do this.
There are various approaches I can take in the app that calls ES to tackle this after I get results back, but if I can do it on the ES side, that would be preferable. Since I'm limiting the query to say, 100 results, I'd like to get back 100 individual people, rather than getting back 100 results and then finding out that 25% of them are docs tied to the same person.
What you want to do is an aggregation to get the top 100 unique records, and then a sub-aggregation asking for the "top_hits". Here is an example from my system. In my example I'm:
setting the result size to 0, because I only care about the aggregations
setting the size of the aggregation to 100
for each bucket, getting the top 1 hit
GET index1/type1/_search
{
  "size": 0,
  "aggs": {
    "a1": {
      "terms": {
        "field": "input.user.name",
        "size": 100
      },
      "aggs": {
        "topHits": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
There's a simpler way to accomplish what #ckasek is looking to do by making use of Elasticsearch's collapse functionality.
Field Collapsing, as described in the Elasticsearch docs:
Allows to collapse search results based on field values. The collapsing is done by selecting only the top sorted document per collapse key.
Based on the original query example above, you would modify it like so:
{
  "size": 100,
  "query": {
    "query_string": {
      "query": "software AND (developer OR engineer)",
      "default_field": "fileData"
    }
  },
  "collapse": {
    "field": "id"
  },
  "_source": {
    "includes": ["applicant.*", "employee.*"]
  }
}
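Note that collapse works on a single keyword or numeric field with doc_values, so with the separate applicant/employee objects above you would likely collapse on applicant.id or employee.id, or index a combined person-id field to collapse on.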
Using the answer above and the link from IanGabes, I was able to restructure my search like so:
{
  "size": 0,
  "query": {
    "query_string": {
      "query": "software AND (developer OR engineer)",
      "default_field": "fileData"
    }
  },
  "aggregations": {
    "employee": {
      "terms": {
        "field": "employee.id",
        "size": 100
      },
      "aggregations": {
        "score": {
          "max": {
            "script": "scores"
          }
        }
      }
    },
    "applicant": {
      "terms": {
        "field": "applicant.id",
        "size": 100
      },
      "aggregations": {
        "score": {
          "max": {
            "script": "scores"
          }
        }
      }
    }
  }
}
This gets me back two sets of buckets: one containing all the applicant IDs along with the highest score from the matched docs, and the same for employees. The script is nothing more than a groovy script stored on the shard whose content is simply '_score'.
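If you also want the buckets returned best-match first, a hedged variation is to order each terms aggregation by its max-score sub-aggregation (shown for the applicant side only, reusing the same 'scores' script):
"applicant": {
  "terms": {
    "field": "applicant.id",
    "size": 100,
    "order": {
      "score": "desc"
    }
  },
  "aggregations": {
    "score": {
      "max": {
        "script": "scores"
      }
    }
  }
}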

How to query for many facets in a single Elasticsearch query

I'm looking for a way to query the distribution of the top n values for many object fields in a single query.
My object in Elasticsearch looks like:
obj: {
  os: "Android",
  device_model: "Samsung Galaxy S II (GT-I9100)",
  device_brand: "Samsung",
  os_version: "Android-2.3",
  country: "BR",
  interests: [1, 2, 3],
  behavioral_segment: ["sport", "lifestyle"]
}
The following query returns the distribution of values for a specific field, with the number of appearances of each value, for UK users only:
curl -XPOST http://<endpoint>/profiles/_search?search_type=count -d '
{
  "query": {
    "match": {
      "country": "UK"
    }
  },
  "facets": {
    "ItemsPerCategoryCount": {
      "terms": {
        "field": "behavioral_segment"
      }
    }
  }
}'
How can I query for many fields? For example, I would like to get results for behavioral_segment, device_brand, and os in a single query. Is it possible?
In the facets section of the query, you should use the fields parameter.
"facets": {
"ItemsPerCategoryCount": {
"terms": {
"fields": ["behavioral_segment","device_brand"]
}
}
}
That should solve your problem, though of course it does not guarantee the coherence of the data.
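Note that facets were removed in Elasticsearch 2.0; on newer versions the same result comes from sending several terms aggregations in one request, one per field. A sketch (the aggregation names are illustrative):
curl -XPOST http://<endpoint>/profiles/_search -d '
{
  "size": 0,
  "query": {
    "match": {
      "country": "UK"
    }
  },
  "aggs": {
    "segments": { "terms": { "field": "behavioral_segment" } },
    "brands": { "terms": { "field": "device_brand" } },
    "os": { "terms": { "field": "os" } }
  }
}'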

Elasticsearch shuffle index sorting

Thanks in advance. I'll describe the situation first and give the solution at the end.
I have a collection of 2M documents with the following mapping:
{
  "image": {
    "properties": {
      "timestamp": {
        "type": "date",
        "format": "dateOptionalTime"
      },
      "title": {
        "type": "string"
      },
      "url": {
        "type": "string"
      }
    }
  }
}
I have a webpage which paginates through all the documents with the following search:
{
  "from": STARTING_POSITION_NUMBER,
  "size": 15,
  "sort": [
    { "_id": { "order": "desc" } }
  ],
  "query": {
    "match_all": {}
  }
}
And a hit looks like this (note that the _id value is a hash of the url, to prevent duplicate documents):
{
  "_index": "images",
  "_type": "image",
  "_id": "2a750a4817bd1600",
  "_score": null,
  "_source": {
    "url": "http://test.test/test.jpg",
    "timestamp": "2014-02-13T17:01:40.442307",
    "title": "Test image!"
  },
  "sort": [
    null
  ]
}
This works pretty well. The only problem is that the documents come back sorted chronologically (the oldest documents on the first page, the most recently indexed ones on the last page), but I want them to appear in a random order. For example, page 10 should always show the same N documents, but they don't have to be sorted by date.
I thought of something like sorting all the documents by their hash, which is kind of random and deterministic. How could I do it?
I've searched the docs, and the sorting API only sorts the results, not the full index. If I don't find a solution I will pick documents randomly and index them in a separate collection.
Thank you.
I solved it using the following search:
{
"from":STARTING_POSITION_NUMBER,
"size":15,
"query" : {
"function_score": {
"random_score": {
"seed" : 1
}
}
}
}
Thanks to David from the Elasticsearch mailing list for pointing out the function score with random scoring.
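One caveat for readers on more recent Elasticsearch versions: when random_score is given a seed, it also expects a field to derive the per-document random value from (commonly _seq_no), so the equivalent query would look like:
{
  "from": STARTING_POSITION_NUMBER,
  "size": 15,
  "query": {
    "function_score": {
      "random_score": {
        "seed": 1,
        "field": "_seq_no"
      }
    }
  }
}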
