How to get the word count for all documents in a given index and type in Elasticsearch?

If I have a few documents, how do I get the count of each word across all of them for a particular field?
ex:
doc1: "aaa bbb aaa ccc"
doc2: "aaa ccc"
doc3: "www"
I want it like aaa-3, bbb-1, ccc-2, www-1

If you want the document counts, you can do it by using a terms aggregation like this:
POST your_index/_search
{
  "aggs": {
    "counts": {
      "terms": { "field": "your_field" }
    }
  }
}
UPDATE
If you want the per-term counts, you need to use the _termvector API; however, you can only query one document at a time.
GET /your_index/your_type/1/_termvector?fields=your_field
And for doc1 you'll get
aaa: 2
bbb: 1
ccc: 1
The multi-term vectors API can help but you'll still need to specify the documents to get the term vectors from.
POST /your_index/your_type/_mtermvectors
{
  "docs": [
    { "_id": "1" },
    { "_id": "2" },
    { "_id": "3" }
  ]
}
And summing across your docs you'll get:
aaa: 2 + 1 = 3
bbb: 1
ccc: 1 + 1 = 2
www: 1
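Once you have the per-document term vectors back, the summing step is plain client-side work. Here is a minimal Python sketch, assuming you have already extracted each document's `{term: term_freq}` mapping from the `_mtermvectors` response (the response shape and field names below are illustrative, not the raw API output):

```python
from collections import Counter

def total_term_counts(per_doc_term_freqs):
    """Sum per-document term frequencies into corpus-wide counts.

    per_doc_term_freqs: list of {term: term_freq} dicts, one per document,
    e.g. as pulled out of each _mtermvectors response.
    """
    totals = Counter()
    for freqs in per_doc_term_freqs:
        totals.update(freqs)  # Counter.update adds the counts together
    return dict(totals)

# Term frequencies for doc1, doc2, doc3 from the example above
docs = [
    {"aaa": 2, "bbb": 1, "ccc": 1},  # doc1: "aaa bbb aaa ccc"
    {"aaa": 1, "ccc": 1},            # doc2: "aaa ccc"
    {"www": 1},                      # doc3: "www"
]
print(total_term_counts(docs))  # {'aaa': 3, 'bbb': 1, 'ccc': 2, 'www': 1}
```

This gives exactly the aaa-3, bbb-1, ccc-2, www-1 result the question asks for.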

Related

Search using “OR” condition on keyword field that contains spaces

My data "keywords" contain spaces. So "X AAA" is one "keyword". And "B AAA" is another keyword. My data will only have one of these in the actual field. So the data field will never look like a combination of the two "X AAA B AAA". There will always be just one "keyword" in the field.
Here is a sample data set of 6 rows for the field:
X AAA
Y AAA
Z AAA
X BBB
Y BBB
Z BBB
My mapping looks like this for the field
"mappings" : {
"properties" : {
"MYKEYWORDFIELD" : {
"type" : "keyword"
},
...
When I query MYKEYWORDFIELD for only part of the "keyword", such as "AAA", I don't get any results. This is what I want. So my understanding is that the entire contents of the field are being treated as a single keyword. Am I understanding this correctly?
Also, I want to query MYKEYWORDFIELD for "X AAA" OR "X BBB" in a single query. Is it possible to do so? If so, how would I do so?
====
1/7/20 Update: To clarify, I don't want the query results to include rows other than those I asked for. For that reason I assumed I couldn't use "should", believing it only affects scoring and might therefore let other rows like "Y BBB" show up in my results.
You can use a bool query with should clauses. When a bool query contains only should clauses, at least one of them must match (minimum_should_match defaults to 1), so only documents whose MYKEYWORDFIELD is exactly "X AAA" or "X BBB" are returned:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "MYKEYWORDFIELD": "X AAA"
          }
        },
        {
          "match": {
            "MYKEYWORDFIELD": "X BBB"
          }
        }
      ]
    }
  }
}
Note that because the field itself is mapped as type keyword, you query MYKEYWORDFIELD directly; there is no .keyword subfield here.
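To see why "Y BBB" cannot show up, here is a plain-Python model of what the query does over an exact-match keyword field (illustrative only, not an Elasticsearch API):

```python
# A document matches only if its field value exactly equals one of the
# requested keywords -- an OR over whole-field values, not over tokens.
rows = ["X AAA", "Y AAA", "Z AAA", "X BBB", "Y BBB", "Z BBB"]

def keyword_or_filter(values, wanted):
    """Return the values that exactly equal one of the wanted keywords."""
    wanted = set(wanted)
    return [v for v in values if v in wanted]

print(keyword_or_filter(rows, ["X AAA", "X BBB"]))  # ['X AAA', 'X BBB']
```

Rows like "Y BBB" never appear in the result because they fail the exact-equality test, matching the behavior the asker wants.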

ElasticSearch boosting relevance based on the count of the field value

I'm trying to boost relevance based on the count of a field value: the lower the count, the more relevant the document.
For example, I have 1001 documents. 1000 documents are written by John, and only one is written by Joe.
// 1000 documents by John
{"title": "abc 1", "author": "John"}
{"title": "abc 2", "author": "John"}
// ...
{"title": "abc 1000", "author": "John"}
// 1 document by Joe
{"title": "abc 1", "author": "Joe"}
I'll get 1001 documents when I search for "abc" in the title field, and they should have very similar relevance scores unless they are exactly the same. The count of the field value "John" is 1000 and the count of "Joe" is 1. I'd like to boost the relevance of the document {"title": "abc 1", "author": "Joe"}; otherwise it would be really hard to see the document by the author Joe.
Thank you!
In case someone runs into the same use case, I'll explain my workaround using a Function Score Query. This approach makes at least two calls to the Elasticsearch server:
1. Get the counts for each author (you can use an aggregation). In our example, we get 1000 for John and 1 for Joe.
2. Generate a weight from each count: the more documents, the lower the relevance weight. Something like 1 + sqrt(1/1000) for John and 1 + sqrt(1/1) for Joe.
3. Use the weights in a script to adjust the score according to the author value (the script can be much better):
{
  "query": {
    "function_score": {
      "query": {
        "match": { "title": "abc" }
      },
      "script_score": {
        "script": {
          "inline": "if (doc['author'].value == 'John') {return (1 + sqrt(1.0/1000)) * _score}\nreturn (1 + sqrt(1.0/1)) * _score;"
        }
      }
    }
  }
}
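The two-step workaround can be sketched in Python. This is illustrative only: `counts_by_author` stands in for the result of the aggregation call, and `hits` stands in for search hits with their base `_score` values:

```python
from math import sqrt

def author_weights(counts_by_author):
    """Step 2: turn per-author doc counts into weights.
    The more documents an author has, the smaller the boost."""
    return {author: 1 + sqrt(1 / count) for author, count in counts_by_author.items()}

def rescore(hits, weights):
    """Step 3: multiply each hit's base score by its author's weight,
    then re-sort by the adjusted score (mirrors the script above)."""
    rescored = [(title, author, score * weights[author])
                for title, author, score in hits]
    return sorted(rescored, key=lambda h: h[2], reverse=True)

weights = author_weights({"John": 1000, "Joe": 1})  # from the first call
hits = [("abc 1", "John", 1.0), ("abc 1", "Joe", 1.0)]  # similar base scores
top = rescore(hits, weights)
print(top[0])  # ('abc 1', 'Joe', 2.0) -- Joe's document now ranks first
```

With similar base scores, Joe's weight of 1 + sqrt(1/1) = 2.0 comfortably beats John's weight of roughly 1.03, which is exactly the effect the workaround is after.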

elasticsearch query on single array item

If I have a document in elasticsearch that looks like the following:
{
  "_id": 1,
  "sentences": [
    "The cat lives in Chicago",
    "The dog lives in Milan",
    "The pig lives in Mexico"
  ]
}
How can I perform a search that only matches if all conditions are met within the same sentence?
For example, searching sentences:(+Chicago +cat) should match, but searching sentences:(+Mexico +dog) should not.
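For context, per-item matching inside an array usually requires mapping the array as nested documents in Elasticsearch (or using a large position_increment_gap with span/phrase queries); the details depend on your version. Independent of any ES API, here is a plain-Python model of the desired semantics, i.e. "all terms must occur in one sentence":

```python
def matches_same_sentence(sentences, terms):
    """True if at least one sentence contains every query term
    (case-insensitive whole-word match)."""
    wanted = {t.lower() for t in terms}
    for sentence in sentences:
        words = {w.lower() for w in sentence.split()}
        if wanted <= words:  # all wanted terms appear in this one sentence
            return True
    return False

doc = [
    "The cat lives in Chicago",
    "The dog lives in Milan",
    "The pig lives in Mexico",
]
print(matches_same_sentence(doc, ["Chicago", "cat"]))  # True
print(matches_same_sentence(doc, ["Mexico", "dog"]))   # False
```

A flat text field would match both queries because all tokens land in one bag of words; the per-sentence check above is what the nested approach recovers.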

Aggregation distinct values in ElasticSearch

I'm trying to get the distinct values and their counts in Elasticsearch.
This can be done via:
"distinct_publisher": {
  "terms": {
    "field": "publisher",
    "size": 0
  }
}
The problem I have is that it counts the terms, so if a value in publisher contains a space, e.g.:
"Chicken Dog"
and 5 documents have this value in the publisher field, then I get 5 for chicken and 5 for dog:
"buckets" : [
{
"key" : "chicken",
"doc_count" : 5
},
{
"key" : "dog",
"doc_count" : 5
},
...
]
But I want to get as the result:
"buckets" : [
{
"key" : "Chicken Dog",
"doc_count" : 5
}
]
The reason you're getting separate buckets for chicken and dog, each with a count of 5, is that your documents were analyzed at the time you indexed them.
This means elasticsearch did some small processing to turn Chicken Dog into chicken and dog (lowercase, and tokenize on space). You can see how elasticsearch will analyze a given piece of text into searchable tokens by using the Analyze API, for example:
curl -XGET 'localhost:9200/_analyze?&text=Chicken+Dog'
In order to aggregate over the "raw" distinct values, you need to map the field as not_analyzed (or as the keyword type in newer versions) so Elasticsearch skips this processing. You may need to reindex your data with the updated mapping to get the result you want.
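The difference between the two mappings can be modeled in a few lines of Python (a rough sketch: the real standard analyzer does more than lowercasing and splitting on spaces):

```python
from collections import Counter

def analyze(text):
    """Rough model of the standard analyzer: lowercase, split on spaces."""
    return text.lower().split()

def buckets(values, analyzed=True):
    """Model of a terms aggregation: analyzed fields bucket per token,
    not_analyzed/keyword fields bucket on the raw field value."""
    counts = Counter()
    for value in values:
        if analyzed:
            # one doc count per distinct token, in first-seen order
            for token in dict.fromkeys(analyze(value)):
                counts[token] += 1
        else:
            counts[value] += 1
    return dict(counts)

publishers = ["Chicken Dog"] * 5
print(buckets(publishers, analyzed=True))   # {'chicken': 5, 'dog': 5}
print(buckets(publishers, analyzed=False))  # {'Chicken Dog': 5}
```

The second call reproduces the single "Chicken Dog" bucket the question asks for, which is what the not_analyzed/keyword mapping gives you.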

elasticsearch custom_score multiplication is inaccurate

I've inserted some documents which are all identical except for one floating-point field, called a.
When the script of a custom_score query is set to just _score, the resulting score is 0.40464813 for a particular query matching some fields. When the script is then changed to _score * a (mvel) for the same query, where a is 9.908349251612433, the final score becomes 4.0619955.
Now, if I run this calculation via Chrome's JS console, I get 4.009394996051871.
4.0619955 (elasticsearch)
4.009394996051871 (Chrome)
This is quite a difference and produces an incorrect ordering of results. Why could it be, and is there a way to correct it?
If I run a simple calculation using the numbers you provided, then I get the result that you expect.
curl -XPOST 'http://127.0.0.1:9200/test/test?pretty=1' -d '
{
  "a" : 9.90834925161243
}
'
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
{
  "query" : {
    "custom_score" : {
      "script" : "0.40464813 * doc[\u0027a\u0027].value",
      "query" : {
        "match_all" : {}
      }
    }
  }
}
'
# {
#   "hits" : {
#     "hits" : [
#       {
#         "_source" : {
#           "a" : 9.90834925161243
#         },
#         "_score" : 4.009395,
#         "_index" : "test",
#         "_id" : "lPesz0j6RT-Xt76aATcFOw",
#         "_type" : "test"
#       }
#     ],
#     "max_score" : 4.009395,
#     "total" : 1
#   },
#   "timed_out" : false,
#   "_shards" : {
#     "failed" : 0,
#     "successful" : 5,
#     "total" : 5
#   },
#   "took" : 1
# }
I think what you are running into here is testing too little data across multiple shards.
Doc frequencies are calculated per shard by default. So if you have two identical docs on shard_1 and one doc on shard_2, then the docs on shard_1 will score lower than the docs on shard_2.
With more data, the document frequencies tend to even out over shards. But when testing small amounts of data you either want to create an index with only one shard, or to add search_type=dfs_query_then_fetch to the query string params.
This calculates global doc frequencies across all involved shards before calculating the scores.
If you set explain to true in your query, you can see exactly how your scores are being calculated.
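The shard effect can be made concrete with a small numerical sketch. The exact scoring formula depends on the Lucene version; this uses the classic TF-IDF idf term, idf = 1 + ln(numDocs / (docFreq + 1)), purely for illustration:

```python
from math import log

def idf(num_docs, doc_freq):
    """Classic Lucene TF-IDF inverse document frequency:
    idf = 1 + ln(num_docs / (doc_freq + 1))."""
    return 1 + log(num_docs / (doc_freq + 1))

# Default (query_then_fetch): each shard uses its own local frequencies.
# shard_1 holds two identical matching docs, shard_2 holds one,
# with 100 docs per shard overall.
idf_shard_1 = idf(num_docs=100, doc_freq=2)
idf_shard_2 = idf(num_docs=100, doc_freq=1)

# dfs_query_then_fetch: frequencies are summed across shards first,
# so every copy of the doc gets the same idf and hence the same score.
idf_global = idf(num_docs=200, doc_freq=3)

print(idf_shard_1 < idf_shard_2)  # True: shard_1's docs score lower
```

With per-shard frequencies, identical documents on different shards end up with different idf values and therefore different scores; computing the frequencies globally removes the discrepancy, which is exactly what dfs_query_then_fetch (or a single-shard index) does.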
