Elasticsearch: Count terms in document

I'm fairly new to Elasticsearch and use version 6.5. My database contains website pages and their content, like this:
Url Content
abc.com There is some content about cars here. Lots of cars!
def.com This page is all about cars.
ghi.com Here it tells us something about insurances.
jkl.com Another page about cars and how to buy cars.
I have been able to perform a simple query that returns all documents that contain the word "cars" in their content (using Python):
current_app.elasticsearch.search(index=index, doc_type=index,
body={"query": {"multi_match": {"query": "cars", "fields": ["*"]}},
"from": 0, "size": 100})
Result looks something like this:
{'took': 2521,
'timed_out': False,
'_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
'hits': {'total': 29, 'max_score': 3.0240571, 'hits': [{'_index':
'pages', '_type': 'pages', '_id': '17277', '_score': 3.0240571,
'_source': {'content': '....'}}]}}
The "_id"s are referring to a domain, so I basically get back:
abc.com
def.com
jkl.com
But I now want to know how often the search term ("cars") occurs in each document, like:
abc.com: 2
def.com: 1
jkl.com: 2
I found several solutions for obtaining the number of documents that contain the search term, but none that show how to get the number of occurrences of a term within a document. I also couldn't find anything in the official documentation, although I'm pretty sure it is in there somewhere and I'm just not realising that it is the solution to my problem.
Update:
As suggested by @Curious_MInd I tried a terms aggregation:
current_app.elasticsearch.search(index=index, doc_type=index,
body={"aggs" : {"cars_count" : {"terms" : { "field" : "Content"
}}}})
Result:
{'took': 729, 'timed_out': False, '_shards': {'total': 5, 'successful':
5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 48, 'max_score': 1.0,
'hits': [{'_index': 'pages', '_type': 'pages', '_id': '17252',
'_score': 1.0, '_source': {'content': '...'}}]}, 'aggregations':
{'skala_count': {'doc_count_error_upper_bound': 0,
'sum_other_doc_count': 0, 'buckets': []}}}
I don't see where it would display the counts per document here, but I'm assuming that's because "buckets" is empty? On another note: The results found by term aggregation are significantly worse than those with multi_match query. Is there any way to combine those?

What you are trying to achieve can't be done in a single query. The first query filters the documents and returns the doc ids for which the term counts are required.
Let's assume you have the following mapping:
{
"test": {
"mappings": {
"_doc": {
"properties": {
"details": {
"type": "text",
"store": true,
"term_vector": "with_positions_offsets_payloads"
},
"name": {
"type": "keyword"
}
}
}
}
}
}
Assuming your query returns the following two docs:
{
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "1",
"_score": 1,
"_source": {
"details": "There is some content about cars here. Lots of cars!",
"name": "n1"
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "2",
"_score": 1,
"_source": {
"details": "This page is all about cars",
"name": "n2"
}
}
]
}
}
From the above response you can get all the document ids that matched your query. Here we have "_id": "1" and "_id": "2".
Now we use the _mtermvectors API to get the frequency (count) of each term in a given field:
GET test/_doc/_mtermvectors
{
"docs": [
{
"_id": "1",
"fields": [
"details"
]
},
{
"_id": "2",
"fields": [
"details"
]
}
]
}
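As a rough sketch of how the first step feeds the second, the request body above could be generated from the search response rather than written by hand. The helper name build_mtermvectors_body and the trimmed sample response are illustrative assumptions, not part of the original answer.

```python
def build_mtermvectors_body(search_response, field):
    """Turn the hits of a search response into a _mtermvectors request body."""
    hits = search_response["hits"]["hits"]
    return {"docs": [{"_id": hit["_id"], "fields": [field]} for hit in hits]}


# Trimmed version of the search response shown earlier (only the ids matter here).
search_response = {
    "hits": {
        "total": 2,
        "hits": [
            {"_index": "test", "_type": "_doc", "_id": "1"},
            {"_index": "test", "_type": "_doc", "_id": "2"},
        ],
    }
}

body = build_mtermvectors_body(search_response, "details")
print(body)
```

With the Python client used in the question, this body could presumably be passed to the client's mtermvectors method together with the index and doc type; check the exact signature for your client version.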
The above returns the following result:
{
"docs": [
{
"_index": "test",
"_type": "_doc",
"_id": "1",
"_version": 1,
"found": true,
"took": 8,
"term_vectors": {
"details": {
"field_statistics": {
"sum_doc_freq": 15,
"doc_count": 2,
"sum_ttf": 16
},
"terms": {
....
,
"cars": {
"term_freq": 2,
"tokens": [
{
"position": 5,
"start_offset": 28,
"end_offset": 32
},
{
"position": 9,
"start_offset": 47,
"end_offset": 51
}
]
},
....
}
}
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "2",
"_version": 1,
"found": true,
"took": 2,
"term_vectors": {
"details": {
"field_statistics": {
"sum_doc_freq": 15,
"doc_count": 2,
"sum_ttf": 16
},
"terms": {
....
,
"cars": {
"term_freq": 1,
"tokens": [
{
"position": 5,
"start_offset": 23,
"end_offset": 27
}
]
},
....
}
}
}
]
}
Note that I have used .... to denote the data for the other terms in the field, since the term vectors API returns the details for every term.
You can extract the info about the required term from the above response. Here I have shown it for cars, and the value you are interested in is term_freq.
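Pulling term_freq out of that response can be sketched in a few lines of Python. The function name term_frequency_per_doc and the trimmed sample response are made up for illustration; the dictionary layout follows the response shown above.

```python
def term_frequency_per_doc(mtermvectors_response, field, term):
    """Map each document id to how often `term` occurs in `field` (0 if absent)."""
    counts = {}
    for doc in mtermvectors_response.get("docs", []):
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        counts[doc["_id"]] = terms.get(term, {}).get("term_freq", 0)
    return counts


# Trimmed version of the _mtermvectors response shown above.
response = {
    "docs": [
        {"_id": "1", "term_vectors": {"details": {"terms": {"cars": {"term_freq": 2}}}}},
        {"_id": "2", "term_vectors": {"details": {"terms": {"cars": {"term_freq": 1}}}}},
    ]
}

counts = term_frequency_per_doc(response, "details", "cars")
print(counts)  # {'1': 2, '2': 1}
```

Keep in mind that the keys under terms are the analyzed tokens, so the term you look up has to be in the form the analyzer produced (lowercased, for example, with the standard analyzer).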

I guess you need a Terms Aggregation here, like below:
GET /_search
{
"aggs" : {
"cars_count" : {
"terms" : { "field" : "Content" }
}
}
}

Related

How do I get an accurate sum in Elasticsearch based on source hits?

How do I get an exact sum aggregation in Elasticsearch? For reference, I am currently using Elasticsearch 5.6 and my index mapping looks like this:
{
"my-index":{
"mappings":{
"my-type":{
"properties":{
"id":{
"type":"keyword"
},
"fieldA":{
"type":"double"
},
"fieldB":{
"type":"double"
},
"fieldC":{
"type":"double"
},
"version":{
"type":"long"
}
}
}
}
}
}
The search query generated (using java client) is:
{
/// ... some filters here
"aggregations" : {
"fieldA" : {
"sum" : {
"field" : "fieldA"
}
},
"fieldB" : {
"sum" : {
"field" : "fieldB"
}
},
"fieldC" : {
"sum" : {
"field" : "fieldC"
}
}
}
}
However my result hits generate the following:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 3.8466966,
"hits": [
{
"_index": "my-index",
"_type": "my-type",
"_id": "25a203b63e264fd2be13db006684b06d",
"_score": 3.8466966,
"_source": {
"fieldC": 108,
"fieldA": 108,
"fieldB": 0
}
},
{
"_index": "my-index",
"_type": "my-type",
"_id": "25a203b63e264fd2be13db006684b06d",
"_score": 3.8466966,
"_source": {
"fieldC": -36,
"fieldA": 108,
"fieldB": 144
}
},
{
"_index": "my-index",
"_type": "my-type",
"_id": "25a203b63e264fd2be13db006684b06d",
"_score": 3.8466966,
"_source": {
"fieldC": -7.2,
"fieldA": 1.8,
"fieldB": 9
}
},
{
"_index": "my-index",
"_type": "my-type",
"_id": "25a203b63e264fd2be13db006684b06d",
"_score": 3.8466966,
"_source": {
"fieldC": 14.85,
"fieldA": 18.9,
"fieldB": 4.05
}
},
{
"_index": "my-index",
"_type": "my-type",
"_id": "25a203b63e264fd2be13db006684b06d",
"_score": 3.8466966,
"_source": {
"fieldC": 36,
"fieldA": 36,
"fieldB": 0
}
}
]
},
"aggregations": {
"fieldA": {
"value": 272.70000000000005
},
"fieldB": {
"value": 157.05
},
"fieldC": {
"value": 115.64999999999999
}
}
}
Why do I get:
115.64999999999999 instead of 115.65 in fieldC
272.70000000000005 instead of 272.7 in fieldA
Should I use float instead of double? Or is there a way I can change the query without using a Painless script and using Java's BigDecimal with a specified precision and rounding mode?
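It has to do with floating-point precision. The sums are computed on IEEE 754 doubles, and values such as 7.2 or 14.85 have no exact binary representation. The same effect shows up in any language that uses doubles, JavaScript for instance.
Here are two ways to check this:
A. If you have node.js installed, just type node at the prompt and then enter the sum of all fieldC values: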
$ node
108 - 36 - 7.2 + 14.85 + 36
115.64999999999999 <--- this is the answer
B. Open the Developer tools of your browser and pick the Console view. Then type the same sum as above:
> 108-36-7.2+14.85+36
< 115.64999999999999
As you can see, both results are consistent with what you're seeing in your ES response.
One way to circumvent this is to store your numbers either as plain integers (i.e. 1485 instead of 14.85, 3600 instead of 36, etc.) or as scaled_float with a scaling factor of 100 (or bigger, depending on the precision you need).
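The integer idea can be tried out locally before changing the mapping. The sketch below uses the fieldC values from the question; the helper to_cents and the factor of 100 are assumptions chosen for this example, and the Decimal sum is included only as a client-side comparison.

```python
from decimal import Decimal

# fieldC values from the hits in the question.
field_c = [108, -36, -7.2, 14.85, 36]

# Plain double arithmetic, in the same order as the node example.
float_total = 108 - 36 - 7.2 + 14.85 + 36


def to_cents(value, factor=100):
    """Scale a value to an integer, as you would before indexing it."""
    return round(value * factor)


# Integer arithmetic is exact, so no rounding error accumulates in the sum.
cents_total = sum(to_cents(v) for v in field_c)

# Decimal arithmetic on the textual values is exact as well.
decimal_total = sum(Decimal(str(v)) for v in field_c)

print(float_total)
print(cents_total, cents_total / 100)
print(decimal_total)
```

The scaled total is 11565, which you divide by the factor only once, when displaying the result.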

Extract keywords from fields

I want to write a query that analyzes one or more fields.
I.e. current analyzers require text as input; instead of passing text, I want to pass a field value.
If I have a document like this
{
"desc": "A document description",
"name": "This name is not original",
"amount": 3000
}
I would like to return something like the below
{
"desc": ["document", "description"],
"name": ["name", "original"],
"amount": 3000
}
You can use Term Vectors or Multi Term Vectors to achieve what you're looking for:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-multi-termvectors.html
You'd have to specify the ids of the documents you want as well as the fields, and it will return the analyzed tokens for each document, along with certain other info which you can easily disable.
GET /exampleindex/_doc/_mtermvectors
{
"ids": [
"1","2"
],
"parameters": {
"fields": [
"*"
]
}
}
Will return something along the lines of:
"docs": [
{
"_index": "exampleindex",
"_type": "_doc",
"_id": "1",
"_version": 2,
"found": true,
"took": 0,
"term_vectors": {
"desc": {
"field_statistics": {
"sum_doc_freq": 5,
"doc_count": 2,
"sum_ttf": 5
},
"terms": {
"amazing": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 3,
"end_offset": 10
}
]
},
"an": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 2
}
]
}
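To get from a response like this to the per-field token lists asked for in the question, a small post-processing step along these lines might do. The function name tokens_by_field and the sample document are illustrative assumptions; tokens are ordered by their position value, which has to be present in the response.

```python
def tokens_by_field(doc):
    """Return {field: [tokens in document order]} for one term-vectors doc."""
    result = {}
    for field, vector in doc.get("term_vectors", {}).items():
        positioned = []
        for term, info in vector.get("terms", {}).items():
            for token in info.get("tokens", []):
                positioned.append((token["position"], term))
        # Sort by position so the tokens come back in their original order.
        result[field] = [term for _, term in sorted(positioned)]
    return result


# Sample doc using the two terms visible in the response above.
doc = {
    "_id": "1",
    "term_vectors": {
        "desc": {
            "terms": {
                "amazing": {"term_freq": 1, "tokens": [{"position": 1}]},
                "an": {"term_freq": 1, "tokens": [{"position": 0}]},
            }
        }
    },
}

tokens = tokens_by_field(doc)
print(tokens)  # {'desc': ['an', 'amazing']}
```

Whether stop words such as "an" show up depends on the analyzer configured for the field; the desired output in the question suggests one that removes them.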

Reindex multiple types from one index to single type in another index

I have two indexes:
twitter and reitwitter
twitter has multiple documents across different types like:
"hits": [
{
"_index": "twitter",
"_type": "tweet",
"_id": "1",
"_score": 1,
"_source": {
"message": "trying out Elasticsearch"
}
},
{
"_index": "twitter",
"_type": "tweet2",
"_id": "1",
"_score": 1,
"_source": {
"message": "trying out Elasticsearch2"
}
},
{
"_index": "twitter",
"_type": "tweet1",
"_id": "1",
"_score": 1,
"_source": {
"message": "trying out Elasticsearch1"
}
}
]
Now, when I reindex, I wanted to get rid of all the different types and just use one because essentially they have the same field mappings.
I tried several different combinations but I always only get one document instead of those three:
Approach 1:
POST _reindex/
{
"source": {
"index": "twitter"
}
,
"dest": {
"index": "reitwitter",
"type": "reitweet"
}
}
Response:
{
"took": 12,
"timed_out": false,
"total": 3,
"updated": 3,
"created": 0,
"deleted": 0,
"batches": 1,
"version_conflicts": 0,
"noops": 0,
"retries": {
"bulk": 0,
"search": 0
},
"throttled_millis": 0,
"requests_per_second": -1,
"throttled_until_millis": 0,
"failures": []
}
Note: It says updated 3 because this was the second time I made the same call, I guess?
Second approach:
POST _reindex/
{
"source": {
"index": "twitter",
"query": {
"match_all": {
}
}
}
,
"dest": {
"index": "reitwitter",
"type": "reitweet"
}
}
Same response as first one.
In both cases when I make this GET call:
GET reitwitter/_search
{
"query": {
"match_all": {
}
}
}
I only get one document:
{
"_index": "reitwitter",
"_type": "reitweet",
"_id": "1",
"_score": 1,
"_source": {
"message": "trying out Elasticsearch1"
}
Is this use case even supported by reindex? If not, do I have to write a script using scan and scroll to get all the documents from the source index and reindex them with the same doc type in the destination?
PS: I don't want to use "_source": ["tweet1", "tweet"] because I have around a million doc types, each with one document, that I want to map to the same doc type in the destination.
The problem is that all the documents have the same id (1). Once they share a single type in the destination, they overwrite each other during the reindex process.
Try indexing your documents with different ids and you will see that it works.
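The overwrite can be reproduced without a cluster. The sketch below models the destination as a dictionary keyed by document id; simulate_reindex and the id scheme that prefixes the source type are hypothetical names and choices made for illustration, not an Elasticsearch API.

```python
def simulate_reindex(hits, make_id):
    """Model a single-type destination index as a dict keyed by document id."""
    dest = {}
    for hit in hits:
        # A later document with the same id replaces the earlier one.
        dest[make_id(hit)] = hit["_source"]
    return dest


hits = [
    {"_type": "tweet", "_id": "1", "_source": {"message": "trying out Elasticsearch"}},
    {"_type": "tweet2", "_id": "1", "_source": {"message": "trying out Elasticsearch2"}},
    {"_type": "tweet1", "_id": "1", "_source": {"message": "trying out Elasticsearch1"}},
]

# Keeping the original ids: all three documents collapse into one.
same_ids = simulate_reindex(hits, lambda h: h["_id"])

# Deriving the id from source type and id: all three survive.
unique_ids = simulate_reindex(hits, lambda h: h["_type"] + "_" + h["_id"])

print(len(same_ids), same_ids)
print(len(unique_ids), sorted(unique_ids))
```

If re-indexing the source with new ids is not an option, a reindex script that rewrites the id might achieve the same effect; whether the original type is still visible to such a script should be verified on your Elasticsearch version before relying on it.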

ElasticSearch query with conditions on multiple documents

I have data of this format in Elasticsearch, each one in a separate document:
{ 'pid': 1, 'nm' : 'tom'}, { 'pid': 1, 'nm' : 'dick'}, { 'pid': 1, 'nm' : 'harry'}, { 'pid': 2, 'nm' : 'tom'}, { 'pid': 2, 'nm' : 'harry'}, { 'pid': 3, 'nm' : 'dick'}, { 'pid': 3, 'nm' : 'harry'}, { 'pid': 4, 'nm' : 'harry'}
{
"took": 137,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 8,
"max_score": null,
"hits": [
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KS86AaDUbQTYUmwY",
"_score": null,
"_source": {
"pid": 1,
"nm": "Harry"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KJ9BAaDUbQTYUmwW",
"_score": null,
"_source": {
"pid": 1,
"nm": "Tom"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KRlbAaDUbQTYUmwX",
"_score": null,
"_source": {
"pid": 1,
"nm": "Dick"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KYnKAaDUbQTYUmwa",
"_score": null,
"_source": {
"pid": 2,
"nm": "Harry"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KXL5AaDUbQTYUmwZ",
"_score": null,
"_source": {
"pid": 2,
"nm": "Tom"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KbcpAaDUbQTYUmwb",
"_score": null,
"_source": {
"pid": 3,
"nm": "Dick"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9Kdy5AaDUbQTYUmwc",
"_score": null,
"_source": {
"pid": 3,
"nm": "Harry"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KetLAaDUbQTYUmwd",
"_score": null,
"_source": {
"pid": 4,
"nm": "Harry"
}
}
]
}
}
I need to find the pids which have 'harry' and do not have 'tom', which in the above example are 3 and 4. In other words, look for the groups of documents sharing the same pid where none of them has nm with value 'tom' but at least one of them has nm with value 'harry'.
How do I query that?
EDIT: Using Elasticsearch version 5
What about a POST request body like the one below, using a bool query:
POST _search
{
"query": {
"bool" : {
"must" : {
"term" : { "nm" : "harry" }
},
"must_not" : {
"term" : { "nm" : "tom" }
}
}
}
}
I am relatively new to Elasticsearch, so I might be wrong, but I have never seen such a query. Simple filters cannot be used here because they are applied to individual docs (not to aggregations), which is not what you want. What you want is a "group by" query with a "having" clause (in SQL terms). But group-by queries involve some metric aggregation (like avg, max, or min of a field), which is then used in the "having" clause; basically, you use a reducer to post-process the aggregation results. For queries like this, the Bucket Selector Aggregation can be used.
But your case is different. You do not want to apply a "having" clause to a metric aggregation; you want to check whether some value is present in a field (or column) of your "group by" data. In SQL terms, you want to do a "where" query inside a "group by". This is what I have never seen.
However, at the application level, you can easily do this by splitting your query. First find the unique pids where nm = harry using a terms aggregation. Then, among those pids, find the ones that also have a doc with nm = tom and drop them.
P.S. I am very new to ES, and I will be very happy if anyone contradicts me and shows a way to do this in one query. I will learn from that too.
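The application-level step could look roughly like the sketch below. It is a simplified variant that groups hits already fetched from the index rather than issuing two queries; the function name pids_with_and_without is an assumption, and names are lowercased because the stored values ("Harry", "Tom") differ in case from the ones in the question.

```python
def pids_with_and_without(hits, required, excluded):
    """Return the pids that have a doc with `required` and none with `excluded`."""
    names_by_pid = {}
    for hit in hits:
        source = hit["_source"]
        names_by_pid.setdefault(source["pid"], set()).add(source["nm"].lower())
    return sorted(
        pid
        for pid, names in names_by_pid.items()
        if required in names and excluded not in names
    )


# The eight documents from the question, reduced to their _source.
rows = [
    (1, "Harry"), (1, "Tom"), (1, "Dick"),
    (2, "Harry"), (2, "Tom"),
    (3, "Dick"), (3, "Harry"),
    (4, "Harry"),
]
hits = [{"_source": {"pid": pid, "nm": nm}} for pid, nm in rows]

result = pids_with_and_without(hits, "harry", "tom")
print(result)  # [3, 4]
```

For a large index you would not pull every document to the client; the same set logic applies to the pid lists returned by the two queries described above.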

MLT (More Like This) elasticsearch query

I'm trying to use elasticsearch MLT (More Like This) query.
Only one doc in store:
{
"_index": "monitors",
"_type": "monitor",
"_id": "AVTnvJ8SancUpEdFLMiq",
"_score": 1,
"_source": {
"ProcessGroup": "test",
"ProcessName": "test",
"OpName": "test",
"Domain": "test",
"LogLevel": "Info",
"StartDateTime": "2016-05-04 04:46:47",
"EndDateTime": "2016-05-04 04:47:47",
"MessageDateTime": "2016-05-04 04:46:47",
"ApplicationCode": "test",
"Status": "10",
}
}
Query:
POST /_search
{
"query": {
"more_like_this" : {
"fields" : ["ProcessName"],
"like" : "test",
"min_term_freq" : 1,
"max_query_terms" : 12
}
}
}
ProcessName is a not-analyzed field.
I expected to get this document as a response, but instead I got nada:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
Why is that?
Another question:
Suppose I have search-engine docs, and I search for "stph". I expect to get "Stephan Curry" as a suggestion because it's commonly searched. Fuzzy search doesn't fit because the edit distance is greater than 2, so is an MLT query a good option for this scenario?
