Elasticsearch query to find documents with number of values of a term equal to a specified number

I have an Elasticsearch index "library" with the mapping below:
{
"mappings": {
"book": {
"properties": {
"title": { "type": "keyword" },
"author": { "type": "keyword" },
"price": { "type": "integer" }
}
}
}
}
Now I want to make a query to find all documents (books) where the number of authors is equal to 3, i.e. I want to make a query which would match:
curl -XGET "http://localhost:9200/library/_search?pretty=true" -d '{
"query": {
"match": {
Number of values of term "author" = 3.
}
}
}'
Is there any way to make such a query without adding an extra field?
[I know the aggregation that finds all possible values of a term in the search result, but I wasn't able to adapt that aggregation to the criteria above.]

I can't find a way to get exactly each author with 3 documents. An aggregation will give you all possible values, but it also shows you the doc_count, and that is where we can find our way:
{
"size": 0,
"aggregations": {
"authors": {
"terms": {
"field": "author",
"min_doc_count": 3,
"size": 5
}
}
}
}
min_doc_count will keep only the buckets with at least 3 documents.
size will give you only the first 5 buckets (remember that, by default, buckets are sorted by doc_count descending).
You can raise size to make sure every author with 3 documents shows up.
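If "at least 3" is not enough and you need the authors with exactly 3 documents, one option (a sketch on top of the answer above, not part of the original answer; the size of 100 is picked arbitrarily) is to add a bucket_selector pipeline sub-aggregation that keeps only the buckets whose doc_count is exactly 3:
POST /library/_search
{
  "size": 0,
  "aggs": {
    "authors": {
      "terms": {
        "field": "author",
        "min_doc_count": 3,
        "size": 100
      },
      "aggs": {
        "exactly_three": {
          "bucket_selector": {
            "buckets_path": { "count": "_count" },
            "script": "params.count == 3"
          }
        }
      }
    }
  }
}
The min_doc_count still discards smaller buckets cheaply; the bucket_selector then drops everything above 3.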

Related

Composite and Terms Aggregations on a field with a high cardinality

I am facing a huge performance problem with ES, with response times of more than 2 minutes.
I have an index of more than 25M files that consists of the following 4 fields (among others):
...
"group_write": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"user_write": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"group_read": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"user_read": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
...
I have something like 100K unique users and groups and each field is a list of users/groups that holds ~100 values. For example:
"user_read": ["user_1", "group_1", ...],
"user_write": ["user_1", "group_2", ...]
...
I am using 2 kinds of aggregations, composite and terms: composite aggregations for getting only the first X results to display, and terms aggregations for prefix search.
Composite aggregation:
{
"size": 0,
"aggs": {
"Group_Read_Permissions": {
"composite": {
"sources": [
{
"Group Read": {
"terms": {
"field": "group_read.raw"
}
}
}
],
"size": 10
}
},
"Group_Write_Permissions": {
"composite": {
"sources": [
{
"Group Write": {
"terms": {
"field": "group_write.raw"
}
}
}
]
}
},
"User_Write_Permissions": {
"composite": {
"sources": [
{
"User Write": {
"terms": {
"field": "user_write.raw"
}
}
}
]
}
},
"User_Read_Permissions": {
"composite": {
"sources": [
{
"User Read": {
"terms": {
"field": "user_read.raw"
}
}
}
]
}
}
}
}
Terms aggregation:
{
"size": 0,
"aggs": {
"Group_Read_Permissions": {
"terms": {
"field": "group_read.raw",
"include": ".*[Ss].*"
}
},
"Group Write Permissions": {
"terms": {
"field": "group_write.raw",
"include": ".*[Ss].*"
}
},
"User Read Permissions": {
"terms": {
"field": "user_read.raw",
"include": ".*[Ss].*"
}
},
"User Write Permissions": {
"terms": {
"field": "user_write.raw",
"include": ".*[Ss].*"
}
}
}
}
Composite aggregation returns results within 1 min and the terms aggregation can take up to 5 min.
What I have tried so far:
Adding a new field user_group_permissions and adding "copy_to": "user_group_permissions" to the above 4 fields
Adding the property "eager_global_ordinals": true to the above 4 fields and to the field "user_group_permissions"
Increasing the refresh_interval up to 200s
** I reindexed for the first 2 suggestions [it took something like 6 hours]
All of the above helped a little with the retrieval time, but still: the composite aggregation takes up to 20s and the terms aggregation takes up to 3 min.
[The best results were on the field user_group_permissions created in the first suggestion, with eager_global_ordinals = true and refresh_interval = 120s.]
Please, if someone has any idea how to improve the retrieval times, I will be grateful.
First of all, if you only need the first 10 results, you don't need to use the composite aggregation, which is meant to be used only if you need to paginate over all results. Simply use the terms aggregation with default size 10, that'll do the job.
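For example, something like this (a sketch, assuming only the first 10 buckets per field are needed for display):
{
  "size": 0,
  "aggs": {
    "Group_Read_Permissions": { "terms": { "field": "group_read.raw", "size": 10 } },
    "Group_Write_Permissions": { "terms": { "field": "group_write.raw", "size": 10 } },
    "User_Read_Permissions": { "terms": { "field": "user_read.raw", "size": 10 } },
    "User_Write_Permissions": { "terms": { "field": "user_write.raw", "size": 10 } }
  }
}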
Second, what you're doing with the terms aggregation is not prefix filtering but infix filtering, which is completely different in terms of performance. While it's easy to search for prefixes, searching for infixes requires the equivalent of a "full table scan", because each and every term must be visited.
A first optimization I would suggest is that in your second query you should do your regex in the query part (bool/should with one regex query per field), so as to reduce the document set on which the terms aggregations need to run. That might help a bit.
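Something along these lines (a sketch; the include patterns are kept on the aggregations so the buckets themselves stay filtered):
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        { "regexp": { "group_read.raw": ".*[Ss].*" } },
        { "regexp": { "group_write.raw": ".*[Ss].*" } },
        { "regexp": { "user_read.raw": ".*[Ss].*" } },
        { "regexp": { "user_write.raw": ".*[Ss].*" } }
      ],
      "minimum_should_match": 1
    }
  },
  "aggs": {
    "Group_Read_Permissions": {
      "terms": { "field": "group_read.raw", "include": ".*[Ss].*" }
    },
    "Group_Write_Permissions": {
      "terms": { "field": "group_write.raw", "include": ".*[Ss].*" }
    },
    "User_Read_Permissions": {
      "terms": { "field": "user_read.raw", "include": ".*[Ss].*" }
    },
    "User_Write_Permissions": {
      "terms": { "field": "user_write.raw", "include": ".*[Ss].*" }
    }
  }
}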
A second optimization is to leverage the wildcard field type which is a specialized field type made specially for grep-like wildcard and regexp queries.
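As a sketch of what that could look like for one of the fields (the sub-field name wild is made up here, and the wildcard type requires a reasonably recent Elasticsearch version and a reindex):
{
  "mappings": {
    "properties": {
      "group_read": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" },
          "wild": { "type": "wildcard" }
        }
      }
    }
  }
}
The regexp queries would then target group_read.wild instead of group_read.raw.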
Another possible optimization is to lowercase all your permissions, so that you only need to search for .*s.* instead of the uppercase variant.
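One way to do that without touching the stored values (a sketch; the normalizer and sub-field names are invented for the example) is a lowercase normalizer on an extra keyword sub-field, after which both the regexp queries and the include patterns can use .*s.* against group_read.lower:
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "group_read": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" },
          "lower": { "type": "keyword", "normalizer": "lowercase_normalizer" }
        }
      }
    }
  }
}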
Depending on your comments, I'll add more optimizations as the discussion goes on.

Elasticsearch fuzzy query

I am trying to make a fuzzy search work the way I intend, and I have my index like this:
{
"test": {
"aliases": {},
"mappings": {
"properties": {
"first_name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"last_name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"settings": {
"index": {
"routing": {
"allocation": {
"include": {
"_tier_preference": "data_content"
}
}
},
"number_of_shards": "1",
"provided_name": "test",
"creation_date": "1617623285742",
"number_of_replicas": "1",
"uuid": "MxSWoxSoS6y6x5Jdt2AvMQ",
"version": {
"created": "7120099"
}
}
}
}
}
Inside that index there is one document with
{
"first_name": "homo sapiens",
"last_name": "moho"
}
I tried to query like this but it doesn't work
{
"query": {
"match": {
"first_name": {
"query": "hosan",
"fuzziness": "AUTO:0,0"
}
}
}
}
but if I search with "hoom", "homoo" or "homos" it works.
Can someone help me with this fuzzy search? Thanks!
With a query term consisting of 5 characters (hosan), a fuzziness value of AUTO will only give you an edit distance of 1, which is not going to be enough to get you from hosan to homo (that takes 3 edits). The max edit distance you can achieve with AUTO is 2, and you will only achieve that if your query term is longer than 5 characters. You cannot simply force a fuzziness value of 3 or 4 to reach your desired results: Elasticsearch (via Lucene) caps the edit distance at 2, precisely because higher numbers start yielding unexpected and unwieldy results. Note also that your other search examples (hoom, homoo, etc.) are matching only on the word homo. Match queries are OR queries by default, and will return results for any matched term.
Just for reference, auto will give you 0 edit distance for query terms of length 1-2 characters, 1 edit distance for query term of 3-5 characters, and 2 edit distance for query terms greater than 5 characters.
So you can bump your fuzziness up to an explicit 2 to prove out what I'm outlining here, but even that will not return a result for hosan against homo sapiens. I personally would not rely on aggressive fuzziness in any production environment.
After a lot of research about Elasticsearch and fuzzy search, I found that it is impossible to get a result like "homo sapiens" for the search keyword "hosan" using fuzziness alone. To solve this I had to combine the fuzzy match with a regexp query in Elasticsearch.
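Roughly along these lines (a sketch of such a combination, not the exact query from this answer; the regexp pattern here is only an illustration):
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "first_name": {
              "query": "hosan",
              "fuzziness": "AUTO"
            }
          }
        },
        {
          "regexp": {
            "first_name.keyword": "ho.*sa.*"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}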

How can I get ElasticSearch aggregations to count the parent documents instead of the nested documents

My ElasticSearch index has nested documents to indicate the places where various events related to the document occurred. I am using aggregations to get facets of the places. The count returned is the number of occurrences of the place. For example, if a document has a birth and death place of California, the aggregation count for California is 2. I would like the aggregation count to be the number of documents containing a particular place, rather than the number of nested documents containing the place. The relevant part of my schema looks like this:
"mappings": {
"document": {
"properties": {
"docId" : { "type": "keyword" },
"place": {
"type": "nested",
"properties": {
"id": { "type": "keyword" },
"type": { "type": "keyword" },
"loc": { "type" : "geo_point" },
"text": {
"type": "text",
"analyzer": "english",
"copy_to" : "text"
}
},
"dynamic": false
}
}
}
}
I can get facets with a simple aggregation like this, which retrieves the places with type place.vital.* (e.g. place.vital.birth, place.vital.death, etc), but counts the number of nested documents, not the number of parent documents.
"aggs": {
"place.vital": {
"aggs": {
"types": {
"aggs": {
"values": {
"terms": {
"field": "place.id"
}
}
},
"terms": {
"field": "place.type",
"include": "place\\.vital\\..*"
}
}
},
"nested": {
"path": "place"
}
}
Is it possible to tweak my aggregation so that it only counts each parent document once?
Use a reverse_nested aggregation. This will then create an aggregation with the nested counts and a sub-aggregation with the parent counts.
See how to return the count of unique documents by using elasticsearch aggregation for more detail.
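Applied to the aggregation in the question, a sketch could look like this (names and nesting kept from the question; the reverse_nested sub-aggregation is the only addition, and its doc_count counts each parent document once per place.id bucket):
"aggs": {
  "place.vital": {
    "nested": { "path": "place" },
    "aggs": {
      "types": {
        "terms": {
          "field": "place.type",
          "include": "place\\.vital\\..*"
        },
        "aggs": {
          "values": {
            "terms": { "field": "place.id" },
            "aggs": {
              "parent_docs": {
                "reverse_nested": {}
              }
            }
          }
        }
      }
    }
  }
}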
I'm sure you can do it with nested fields, but not with parent-child relationships. If you are looking for places, why don't you search on a places index and filter by child?
Has child query

Aggregation in elastic search

Need help with aggregation in Elasticsearch. Is it possible to aggregate the values of a particular field as an array or list? This is more of a grouping. For example, instead of getting the result as
{"Book_Id":"102","Review_Text":"DescentRead"},{"Book_Id":"102","Review_Text":"For Kids."},{"Book_Id":"103","Review_Text":"Great"},{"Book_Id":"103","Review_Text":"Excellent"}
can I get all the reviews of each book as a list?
[ { "Book_Id": 102, "Review_Text": [ "DescentRead", "For Kids." ] }, { "Book_Id": 103, "Review_Text": [ "Great", "Excellent" ] } ]
I tried a few things with aggs but was not able to get it working. Any pointers would help!
Could aggregations with top hits work? The limitation is that you need to specify a max number of hits per aggregation (the example below will return the top 100 results per book ID, ordered by the review text), but apart from that you can run it as a normal query and specify which fields to return, how they should be sorted (to get the top hits), etc.
Example aggs query:
POST
http://myserver:9200/books/book/_search
{
"size": 0,
"aggs": {
"BookReviews": {
"terms": {
"field": "Book_Id.keyword"
},
"aggs": {
"top_reviews": {
"top_hits": {
"sort": [ { "Review_Text.keyword": { "order": "desc" } } ],
"size": 100,
"_source": {
"includes": [ "Review_Text" ]
}
}
}
}
}
}
}
Note that for the aggregation names ("BookReviews" and "top_reviews") you can use any names you choose, and those same names will appear in the resulting aggregation tree. You can do multi-level aggregations on terms in your index and include top hits on any level, typically for drill-down reporting or similar cases.
Mapping used:
{
"books": {
"mappings": {
"book": {
"properties": {
"Book_Id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"Review_Text": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
"size": 0 in the root node will omit any hits for the search and only return the aggs trees.
You can also add a normal "query": {} block on the same level as size and aggs if you need to filter the results before elastic starts aggregating.
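For reference, the interesting part of the response to the query above would look roughly like this (a trimmed sketch built from the sample reviews in the question; ordering follows the descending sort on Review_Text.keyword):
"aggregations": {
  "BookReviews": {
    "buckets": [
      {
        "key": "102",
        "doc_count": 2,
        "top_reviews": {
          "hits": {
            "hits": [
              { "_source": { "Review_Text": "For Kids." } },
              { "_source": { "Review_Text": "DescentRead" } }
            ]
          }
        }
      },
      {
        "key": "103",
        "doc_count": 2,
        "top_reviews": {
          "hits": {
            "hits": [
              { "_source": { "Review_Text": "Great" } },
              { "_source": { "Review_Text": "Excellent" } }
            ]
          }
        }
      }
    ]
  }
}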
Read more in the elasticsearch documentation pages:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html
(If you provide a more complete example dataset, we can give a more realistic example query, as there isn't a lot of data in the example for sorting or scoring the results)

querying multiple indexes in elasticsearch and remove duplicates

We have some documents indexed in two (or more than two) indexes in elasticsearch.
How can I get all the documents from the two indexes without duplication?
Change the _id to be indexed and then use a terms aggregation.
The mapping change:
{
"mappings": {
"test": {
"_id": {
"store": true,
"index": "not_analyzed"
}
...
The query:
GET /test*/_search
{
"aggs": {
"NAME": {
"terms": {
"field": "_id",
"size": 10
}
}
}
}
The tricky part is the "get all the documents from the two indexes" requirement: increasing the size of the terms aggregation will eat a lot of memory.
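If the number of distinct IDs is large, one way to avoid a single huge size (a sketch, assuming a version in which aggregating on _id is still permitted) is to split the terms aggregation into partitions and page through them one request at a time:
GET /test*/_search
{
  "size": 0,
  "aggs": {
    "ids": {
      "terms": {
        "field": "_id",
        "include": {
          "partition": 0,
          "num_partitions": 20
        },
        "size": 10000
      }
    }
  }
}
Running the request once per partition (0 through 19 here) covers all IDs while keeping each response bounded.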
