Elasticsearch: total term frequency and doc count from a given set of documents

I am trying to get the total term frequency and document count for a given set of documents, but _termvectors in Elasticsearch returns ttf and doc_count computed over all documents in the index. Is there any way to specify a list of documents (document ids) so that the result is based on those documents only?
Below are the document details and the query to get total term frequency:
Index details:
PUT /twitter
{
  "mappings": {
    "tweets": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  }
}
Document Details:
PUT /twitter/tweets/1
{
"name":"Hello bar"
}
PUT /twitter/tweets/2
{
"name":"Hello foo"
}
PUT /twitter/tweets/3
{
"name":"Hello foo bar"
}
This creates three documents with ids 1, 2 and 3. Now suppose the tweets with ids 1 and 2 belong to user1 and tweet 3 belongs to another user, and I want to get the term vectors for user1.
Query to get this result:
GET /twitter/tweets/_mtermvectors
{
"ids" : ["1", "2"],
"parameters": {
"fields": ["name"],
"term_statistics": true,
"offsets":false,
"payloads":false,
"positions":false
}
}
Response:
{
"docs": [
{
"_index": "twitter",
"_type": "tweets",
"_id": "1",
"_version": 1,
"found": true,
"took": 1,
"term_vectors": {
"name": {
"field_statistics": {
"sum_doc_freq": 7,
"doc_count": 3,
"sum_ttf": 7
},
"terms": {
"bar": {
"doc_freq": 2,
"ttf": 2,
"term_freq": 1
},
"hello": {
"doc_freq": 3,
"ttf": 3,
"term_freq": 1
}
}
}
}
},
{
"_index": "twitter",
"_type": "tweets",
"_id": "2",
"_version": 1,
"found": true,
"took": 1,
"term_vectors": {
"name": {
"field_statistics": {
"sum_doc_freq": 7,
"doc_count": 3,
"sum_ttf": 7
},
"terms": {
"foo": {
"doc_freq": 2,
"ttf": 2,
"term_freq": 1
},
"hello": {
"doc_freq": 3,
"ttf": 3,
"term_freq": 1
}
}
}
}
}
]
}
Here we can see that hello has doc_count 3 and ttf 3. How can I make it consider only the documents with the given ids?
One approach I am considering is to create a different index for each user, but I am not sure that this is correct: the number of indices would grow with the number of users. Is there another solution?

To obtain the per-term document count on a subset of documents you can try a simple terms aggregation.
You will have to enable fielddata in the mapping of the field (though this can be heavy on memory; check the documentation page about fielddata for more details):
PUT /twitter
{
  "mappings": {
    "tweets": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "english",
          "fielddata": true,
          "term_vector": "yes"
        }
      }
    }
  }
}
Then use a terms aggregation:
POST /twitter/tweets/_search
{
  "size": 0,
  "query": {
    "terms": {
      "_id": [
        "1",
        "2"
      ]
    }
  },
  "aggs": {
    "my_term_doc_count": {
      "terms": {
        "field": "name"
      }
    }
  }
}
The response will be:
{
"hits": ...,
"aggregations": {
"my_term_doc_count": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "hello",
"doc_count": 2
},
{
"key": "bar",
"doc_count": 1
},
{
"key": "foo",
"doc_count": 1
}
]
}
}
}
I couldn't find a way to calculate the total term frequency on a subset of documents this way, though; I'm afraid it can't be done with aggregations alone.
I would suggest computing the term vectors offline with the _analyze API and storing them explicitly in a separate index. That way you can also compute the total term frequency with simple aggregations. Here is an example usage of the _analyze API; a sketch of how the tokens could be stored and queried follows after it.
POST twitter/_analyze
{
"text": "Hello foo bar"
}
{
"tokens": [
{
"token": "hello",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "foo",
"start_offset": 6,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "bar",
"start_offset": 10,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 2
}
]
}
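To make that concrete, here is a minimal sketch of what the separate index could look like. All of the names below (the user_tokens index, the token type, the user/tweet_id/token fields, the user1 tag) are made up for illustration: the idea is to store one small document per token occurrence returned by _analyze, tagged with the owning user and tweet.
PUT /user_tokens
{
  "mappings": {
    "token": {
      "properties": {
        "user": { "type": "keyword" },
        "tweet_id": { "type": "keyword" },
        "token": { "type": "keyword" }
      }
    }
  }
}
POST /user_tokens/token
{ "user": "user1", "tweet_id": "1", "token": "hello" }
Index one such document per token of every tweet, then aggregate over one user's documents:
POST /user_tokens/token/_search
{
  "size": 0,
  "query": {
    "term": { "user": "user1" }
  },
  "aggs": {
    "per_term": {
      "terms": { "field": "token" },
      "aggs": {
        "doc_freq": { "cardinality": { "field": "tweet_id" } }
      }
    }
  }
}
With tweets 1 and 2 stored this way, each per_term bucket's doc_count is the total term frequency restricted to user1 (hello: 2, foo: 1, bar: 1), and the doc_freq sub-aggregation gives the corresponding per-user document frequency.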
Hope that helps!

Related

Elasticsearch count multiple

I have two profiles, "A" and "B", and both have events in Elasticsearch.
This is the data, for example:
{hits: [
{tag:"A"},
{tag:"B"},
{tag:"B"}
]}
I want to count how many events have tag "A" and how many have tag "B", in one request.
I've tried this, but it counts 3 in total, and I want A:1 and B:2:
GET forensics/_count
{
"query": {
"terms": {
"waas_tag": ["A","B"]
}
}
}
You can use the term vectors API to get information about the terms of a particular field.
Adding a working example with index data and response
Index Data
{
"waas_tag": [
{
"tag": "A"
},
{
"tag": "B"
},
{
"tag": "B"
}
]
}
Term Vector API:
GET _termvectors/1?fields=waas_tag.tag
Response:
"term_vectors": {
"waas_tag.tag": {
"field_statistics": {
"sum_doc_freq": 2,
"doc_count": 1,
"sum_ttf": 3
},
"terms": {
"a": {
"term_freq": 1, // note this
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 1
}
]
},
"b": {
"term_freq": 2, // note this
"tokens": [
{
"position": 101,
"start_offset": 2,
"end_offset": 3
},
{
"position": 202,
"start_offset": 4,
"end_offset": 5
}
]
}
}
}
}
In the end I found a solution using _msearch instead of _count:
GET forensics/_msearch
{}
{"query":{"term":{"waas_tag":"A"}}}
{}
{"query":{"bool":{"must":[{"term":{"waas_tag":"B"}},{"range":{"@timestamp":{"gte":"now-20d","lt":"now/s"}}}]}}}
(the empty {} header lines mean "use the forensics index from the request URL"; with _msearch every header and every query body has to be on a single line)
You can use a filters aggregation to get the count for each tag in a single query, without using the _msearch endpoint. This query should work:
{
  "size": 0,
  "aggs": {
    "counts": {
      "filters": {
        "filters": {
          "CountA": {
            "term": {
              "waas_tag": "A"
            }
          },
          "CountB": {
            "term": {
              "waas_tag": "B"
            }
          }
        }
      }
    }
  }
}
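For the example data above (one "A" event and two "B" events), the aggregation part of the response should look roughly like this:
{
  "aggregations": {
    "counts": {
      "buckets": {
        "CountA": {
          "doc_count": 1
        },
        "CountB": {
          "doc_count": 2
        }
      }
    }
  }
}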

Elasticsearch match query in array

I have the following terms query, which works fine:
{
"query": {
"terms": {
"130": [
"jon#domain.com",
"mat#domain.com"
]
}
}
}
It finds 2 docs.
But now I would like to build a similar query with match (I want to find all users in the domain). I've tried the following query, without any result:
{
"query": {
"match": {
"130": {
"query":"#domain.com"
}
}
}
}
It finds 0 docs. Why?
Field 130 has the following mapping:
"130":{"type":"text","analyzer":"whitespace","fielddata":true}
If you are using a whitespace analyzer, then the token generated will be:
{
"tokens": [
{
"token": "jon#domain.com",
"start_offset": 0,
"end_offset": 14,
"type": "word",
"position": 0
}
]
}
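For reference, the token above is what the _analyze API returns for the whitespace analyzer:
POST /_analyze
{
  "analyzer": "whitespace",
  "text": "jon@domain.com"
}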
So the terms query matches the token above, since it returns documents that contain one or more exact terms in a provided field; the match query, however, gives 0 results, because "@domain.com" is analyzed into the single token "@domain.com", which does not match the indexed token "jon@domain.com".
Instead, you should use the standard analyzer (which is the default one), which will generate the following tokens:
{
"tokens": [
{
"token": "jon",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "domain.com",
"start_offset": 4,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
}
]
}
You could even use the uax_url_email tokenizer, which is like the standard tokenizer except that it recognizes URLs and email addresses as single tokens.
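For illustration, the behaviour of that tokenizer can be checked directly with the _analyze API (no index needed); the response should look roughly like the one shown after the request:
POST /_analyze
{
  "tokenizer": "uax_url_email",
  "text": "jon@domain.com"
}
{
  "tokens": [
    {
      "token": "jon@domain.com",
      "start_offset": 0,
      "end_offset": 14,
      "type": "<EMAIL>",
      "position": 0
    }
  ]
}
Note that because the whole address stays a single token, a plain match for "@domain.com" still would not find it; the standard-analyzer approach shown below is what makes the domain-level match work.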
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"mappings": {
"properties": {
"130": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
Index Data:
{
"130":"jon#domain.com"
}
Search Query:
{
"query": {
"match": {
"130": {
"query": "#domain.com"
}
}
}
}
Search Result:
"hits": [
{
"_index": "65121147",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"130": "jon#domain.com"
}
}
]

Elasticsearch match vs. term in filter

I don't see any difference between term and match in filter:
POST /admin/_search
{
"query": {
"bool": {
"filter": [
{
"term": {
"partnumber": "j1knd"
}
}
]
}
}
}
And the result also contains partnumbers that are not exact matches, e.g. "52527.J1KND-H".
Why?
Term queries are not analyzed: whatever you send is used as-is to match the tokens in the inverted index. Match queries are analyzed, and the same analyzer that was applied to the field at index time is applied to the query text, so documents are matched accordingly.
Read more about the term query and the match query. As the match query documentation puts it:
Returns documents that match a provided text, number, date or boolean
value. The provided text is analyzed before matching.
You can also use the _analyze API to see the tokens generated for a particular field or analyzer.
These are the tokens generated by the standard analyzer for the text 52527.J1KND-H:
POST /_analyze
{
"text": "52527.J1KND-H",
"analyzer" : "standard"
}
{
"tokens": [
{
"token": "52527",
"start_offset": 0,
"end_offset": 5,
"type": "<NUM>",
"position": 0
},
{
"token": "j1knd",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "h",
"start_offset": 12,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 2
}
]
}
The above explains why you also get partnumbers that are not exact matches, e.g. "52527.J1KND-H". Taking your example, here is how you can make it work.
Index mapping
{
"mappings": {
"properties": {
"partnumber": {
"type": "text",
"fields": {
"raw": {
"type": "keyword" --> note this
}
}
}
}
}
}
Index docs
{
"partnumber" : "j1knd"
}
{
"partnumber" : "52527.J1KND-H"
}
Search query to return only the exact match
{
"query": {
"bool": {
"filter": [
{
"term": {
"partnumber.raw": "j1knd" --> note `.raw` in field
}
}
]
}
}
Result
"hits": [
{
"_index": "so_match_term",
"_type": "_doc",
"_id": "2",
"_score": 0.0,
"_source": {
"partnumber": "j1knd"
}
}
]
}

Extract keywords from fields

I want to write a query that analyzes one or more fields of a document.
That is, the analyze APIs require text to be passed in explicitly; instead of passing text, I want to pass a field value.
If I have a document like this
{
"desc": "A document description",
"name": "This name is not original",
"amount": 3000
}
I would like to return something like the below
{
"desc": ["document", "description"],
"name": ["name", "original"],
"amount": 3000
}
You can use Term Vectors or Multi Term Vectors to achieve what you're looking for:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-multi-termvectors.html
You specify the ids of the documents you want as well as the fields, and it returns the analyzed tokens for each document, along with some other information that you can easily disable.
GET /exampleindex/_doc/_mtermvectors
{
"ids": [
"1","2"
],
"parameters": {
"fields": [
"*"
]
}
}
Will return something along the lines of:
"docs": [
{
"_index": "exampleindex",
"_type": "_doc",
"_id": "1",
"_version": 2,
"found": true,
"took": 0,
"term_vectors": {
"desc": {
"field_statistics": {
"sum_doc_freq": 5,
"doc_count": 2,
"sum_ttf": 5
},
"terms": {
"amazing": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 3,
"end_offset": 10
}
]
},
"an": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 2
}
]
}

Elasticsearch - Script filter on array

I'm a newbie in ES and I want to use a script filter to get all documents where the array has at least one element less than max and greater than min (max and min are params of the script).
The document like:
{
"number": "5",
"array": {
"key": [
10,
5,
9,
20
]
}
}
I tried this script, but it does not work:
{
"script": {
"lang": "groovy",
"params": {
"max": 64,
"min": 6
},
"script": "for(element in doc['array.key'].values){element>= min + doc['number'].value && element <=max + doc['number'].value}"
}
}
There is no error message, but the search result is wrong. Is there a way to iterate over an array field?
Thank you all.
Yes, it's doable, but your script is not doing that. Try using Groovy's any() method instead:
doc['array.key'].values.any{ it -> it >= min + doc['number'].value && it <= max + doc['number'].value }
A few things:
Your script just iterates over a collection and checks a condition; it doesn't return a boolean value, which is what you need
you might consider changing the mapping for number into an integer type
not really sure why you have a field array and inside it a nested field key. Couldn't you just have a field array that would be... an array? ;-)
remember that in ES by default each field can be a single value or an array.
As @Val has mentioned, you need to enable dynamic scripting in your conf/elasticsearch.yml, but I'm guessing you've done that already, otherwise you'd be getting exceptions (a sketch of the relevant settings follows below).
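For completeness, a minimal sketch of the elasticsearch.yml settings involved; which line applies depends on the Elasticsearch version in use (both predate Painless), so treat this as an assumption to verify against the docs for your version:
# Elasticsearch 1.x: allow dynamic (inline) scripts
script.disable_dynamic: false
# Elasticsearch 2.x: inline and indexed scripts are toggled separately
script.inline: true
script.indexed: true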
A very simple mapping like this should work:
{
"mappings": {
"document": {
"properties": {
"value": {
"type": "integer"
},
"key": {
"type": "integer"
}
}
}
}
}
Example:
POST /documents/document/1
{
"number": 5,
"key": [
10,
5,
9,
20
]
}
POST /documents/document/2
{
"number": 5,
"key": [
70,
72
]
}
Query:
GET /documents/document/_search
{
"query": {
"bool": {
"filter": {
"script": {
"lang": "groovy",
"params": {
"max": 64,
"min": 6
},
"script": "doc['key'].values.any{ it -> it >= min + doc['number'] && it <= max + doc['number'] }"
}
}
}
}
}
Result:
{
"took": 22,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": [
{
"_index": "documents",
"_type": "document",
"_id": "1",
"_score": 0,
"_source": {
"number": 5,
"key": [
10,
5,
9,
20
]
}
}
]
}
}
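As a side note for newer Elasticsearch versions, where Groovy scripting has been replaced by Painless: a roughly equivalent filter might look like the sketch below. It is untested and assumes the same documents index with number and key mapped as integers, as above; the logic mirrors the any() version, returning true as soon as one element of key falls inside [min + number, max + number].
GET /documents/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "lang": "painless",
            "params": {
              "max": 64,
              "min": 6
            },
            "source": "def n = doc['number'].value; for (def v : doc['key']) { if (v >= params.min + n && v <= params.max + n) { return true; } } return false;"
          }
        }
      }
    }
  }
}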
