Elasticsearch - Any way to find out all the documents with field value as text

In an Elasticsearch cluster, I accidentally pushed some text into a field that should be a Number. Later, I fixed that and started pushing Number values. Now I want to replace all the old values with some Number, and for that I need to find all the documents that have this field as text.
Is there any Elasticsearch query that I can use to get this information?

I think this is possible using nested aggregations.
At the top level, use a terms aggregation to find the text values; at the sub-level, use a top_hits aggregation to get the documents that contain these values.
For instance:
GET example_index/_search
{
"size": 0,
"aggs": {
"NAME": {
"terms": {
"field": "example_field.keyword",
"size": 10
},
"aggs": {
"documents": {
"top_hits": {
"size": 10
}
}
}
}
}
}
This query will return the distinct values of the field, with the related documents at the sub-level, something like:
{
"aggregations": {
"NAME": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "mistake",
"doc_count": 2,
"documents": {
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "example_index",
"_type": "example_index",
"_id": "2QoDoXEBOCkJkkpwq5P0",
"_score": 1,
"_source": {
"example_field": "mistake"
}
},
{
"_index": "example_index",
"_type": "example_index",
"_id": "qAoDoXEBOCkJkkpwq5T0",
"_score": 1,
"_source": {
"example_field": "mistake"
}
}
]
}
}
},
{
"key": "520",
"doc_count": 1,
"documents": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "example_index",
"_type": "example_index",
"_id": "5goDoXEBOCkJkkpwq5P0",
"_score": 1,
"_source": {
"example_field": "520"
}
}
]
}
}
}
]
}
}
}
In the example above, we need to delete the documents with the value mistake; you can simply delete them by id.
NOTE: if you have a big index, it's better to write a function in your code that builds the aggregation, gets the response, filters out the values that can be parsed as a number, then removes the remaining documents by id.
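The function described in the note could look roughly like the sketch below. It only covers the filtering step and works on an already-parsed response dictionary shaped like the example above (aggregation NAME, sub-aggregation documents). The names is_number and ids_with_text_values are made up for illustration; sending the search and deleting by id are left to whatever client you use.

```python
def is_number(value):
    """Return True if the bucket key can be parsed as a number.

    Note: float() also accepts strings such as "nan" or "inf";
    tighten this check if such values can occur in your data.
    """
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def ids_with_text_values(response, agg_name="NAME", hits_name="documents"):
    """Collect the _id of every top_hits document whose bucket key is not numeric."""
    ids = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        if is_number(bucket["key"]):
            continue  # numeric value, nothing to fix
        for hit in bucket[hits_name]["hits"]["hits"]:
            ids.append(hit["_id"])
    return ids


# Trimmed-down copy of the example response above.
sample = {
    "aggregations": {
        "NAME": {
            "buckets": [
                {
                    "key": "mistake",
                    "doc_count": 2,
                    "documents": {"hits": {"hits": [
                        {"_id": "2QoDoXEBOCkJkkpwq5P0"},
                        {"_id": "qAoDoXEBOCkJkkpwq5T0"},
                    ]}},
                },
                {
                    "key": "520",
                    "doc_count": 1,
                    "documents": {"hits": {"hits": [
                        {"_id": "5goDoXEBOCkJkkpwq5P0"},
                    ]}},
                },
            ]
        }
    }
}

print(ids_with_text_values(sample))
# ['2QoDoXEBOCkJkkpwq5P0', 'qAoDoXEBOCkJkkpwq5T0']
```

Keep in mind that only the documents actually returned by top_hits are seen (10 per bucket in the query above), so a bucket with more documents needs further passes.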

Related

Remove results with same id from Elasticsearch search result

Let's assume we have a search result with 3 documents. Two of them share a key attribute (product-ID or similar).
Is it possible to remove duplicates from the search result by using Elasticsearch, so that only 2 documents would be returned in that case? I don't want to implement this in application logic as I would still like to use pagination, aggregation, etc. It does not matter which of the two documents with the same id is removed.
Thanks,
Philipp
Edit:
This would be the example in Elasticsearch:
PUT /tmp_pd_articles
{
"mappings": {
"properties": {
"name": { "type": "text" },
"articleNumber": { "type": "keyword" }
}
}
}
PUT /tmp_pd_articles/_doc/1
{
"name": "My Book 1",
"articleNumber": "A9781"
}
PUT /tmp_pd_articles/_doc/2
{
"name": "My Book 1 (with some other title)",
"articleNumber": "A9781"
}
PUT /tmp_pd_articles/_doc/3
{
"name": "My Book 2",
"articleNumber": "A9782"
}
GET /tmp_pd_articles/_search
{
"query": { "match_all": {} }
}
The goal is to write a query that returns only two articles instead of all three:
#1 ("A9781", "My Book 1") OR #2 ("A9781", "My Book 1 (with some other title)") AND
#3 ("A9782", "My Book 2")
This reduction should be applied because #1 and #2 share the same articleNumber "A9781". I wonder whether there is an Elasticsearch query to accomplish this goal.
Yes, it's possible using a top_hits aggregation. I tested the query below against your mapping and sample data, and it returns the expected result. Setting "size": 0 returns only the aggregation data; remove it if you also want all 3 documents in the hits.
GET /tmp_pd_articles/_search
{
"size": 0,
"aggs": {
"dedup": {
"terms": {
"field": "articleNumber"
},
"aggs": {
"dedup_docs": {
"top_hits": {
"size": 1
}
}
}
}
}
}
And the search result:
"aggregations": {
"dedup": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "A9781",
"doc_count": 2,
"dedup_docs": {
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "tmp_pd_articles",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"name": "My Book 1",
"articleNumber": "A9781"
}
}
]
}
}
},
{
"key": "A9782",
"doc_count": 1,
"dedup_docs": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "tmp_pd_articles",
"_type": "_doc",
"_id": "3",
"_score": 1.0,
"_source": {
"name": "My Book 2",
"articleNumber": "A9782"
}
}
]
}
}
}
]
}
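If the client needs a flat list again, the buckets can be unpacked in a few lines. This is only a sketch: it assumes the response has been parsed into a dictionary and uses the aggregation names dedup and dedup_docs from the query above; the function name one_doc_per_bucket is hypothetical.

```python
def one_doc_per_bucket(aggregations, agg_name="dedup", hits_name="dedup_docs"):
    """Return the _source of the first top_hits document of every bucket."""
    docs = []
    for bucket in aggregations[agg_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        if hits:  # a bucket can be empty if top_hits returned nothing
            docs.append(hits[0]["_source"])
    return docs


# Trimmed-down copy of the "aggregations" part of the result above.
sample_aggs = {
    "dedup": {
        "buckets": [
            {
                "key": "A9781",
                "doc_count": 2,
                "dedup_docs": {"hits": {"hits": [
                    {"_id": "1", "_source": {"name": "My Book 1", "articleNumber": "A9781"}},
                ]}},
            },
            {
                "key": "A9782",
                "doc_count": 1,
                "dedup_docs": {"hits": {"hits": [
                    {"_id": "3", "_source": {"name": "My Book 2", "articleNumber": "A9782"}},
                ]}},
            },
        ]
    }
}

for doc in one_doc_per_bucket(sample_aggs):
    print(doc["articleNumber"], doc["name"])
```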

How to select buckets of aggregation results based on top hit document attribute?

I am trying to get the result of the following Elasticsearch query, and I got the response shown below. Now I want to select the buckets based on the "source" field of the top hit document.
POST /data/_search?size=0
{
"aggs":{
"by_partyIds":{
"terms":{
"field":"id.keyword"
},
"aggs":{
"oldest_record":{
"top_hits":{
"sort":[
{
"createdate.keyword":{
"order":"asc"
}
}
],
"_source":[
"source"
],
"size":1
}
}
}
}
}
}
Response :
{
"aggregations": {
"by_partyIds": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "1",
"doc_count": 3,
"oldest_record": {
"hits": {
"total": 3,
"max_score": null,
"hits": [
{
"_index": "data",
"_type": "osr",
"_id": "DcagSm4B9WnM0Ke-MgGk",
"_score": null,
"_source": {
"source": "US"
},
"sort": [
"20-09-18 05:45:26.000000000AM"
]
}
]
}
}
},
{
"key": "2",
"doc_count": 3,
"oldest_record": {
"hits": {
"total": 3,
"max_score": null,
"hits": [
{
"_index": "data",
"_type": "osr",
"_id": "7caiSm4B9WnM0Ke-HwGx",
"_score": null,
"_source": {
"source": "UK"
},
"sort": [
"22-09-18 05:45:26.000000000AM"
]
}
]
}
}
}
]
}
}
}
Now I want to get only the buckets whose top hit has US as source. Can we write a query for that? I tried a bucket_selector aggregation, which is a parent pipeline aggregation that executes a script to determine whether the current bucket is retained in the parent multi-bucket aggregation. The specified metric must be numeric and the script must return a boolean value. If the script language is expression then a numeric return value is permitted; in this case 0.0 is evaluated as false and all other values evaluate to true.
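Since the script only sees numeric metrics, the string "source" value of the top hit is not something it can compare directly. One fallback, sketched below under that assumption, is to filter the buckets on the client once the response has been parsed. The aggregation names come from the query above; the function name buckets_with_top_source is made up for illustration.

```python
def buckets_with_top_source(response, wanted,
                            agg_name="by_partyIds", hits_name="oldest_record"):
    """Keep only the buckets whose first top_hits document has source == wanted."""
    kept = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        if hits and hits[0]["_source"].get("source") == wanted:
            kept.append(bucket)
    return kept


# Trimmed-down copy of the response above.
sample = {
    "aggregations": {
        "by_partyIds": {
            "buckets": [
                {"key": "1", "doc_count": 3, "oldest_record": {"hits": {"hits": [
                    {"_id": "DcagSm4B9WnM0Ke-MgGk", "_source": {"source": "US"}},
                ]}}},
                {"key": "2", "doc_count": 3, "oldest_record": {"hits": {"hits": [
                    {"_id": "7caiSm4B9WnM0Ke-HwGx", "_source": {"source": "UK"}},
                ]}}},
            ]
        }
    }
}

us_buckets = buckets_with_top_source(sample, "US")
print([b["key"] for b in us_buckets])  # ['1']
print(len(us_buckets))                 # 1
```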

Terms Aggregation return multiple fields (min_doc_count: 0)

I'm making a Terms Aggregation but I want to return multiple fields. I want a user to select buckets via "slug" (my-name), but show the actual "name" (My Name).
At this moment I'm making a TopHits SubAggregation like this:
"organisation": {
"aggregations": {
"label": {
"top_hits": {
"_source": {
"includes": [
"organisations.name"
]
},
"size": 1
}
}
},
"terms": {
"field": "organisations.slug",
"min_doc_count": 0,
"size": 20
}
}
This gives the desired result when my whole query actually finds some buckets/results.
As you can see, I've set min_doc_count to 0, which returns buckets with a doc count of 0. The problem I'm facing is that the TopHits response is empty for those buckets, so I cannot render the proper name to the client.
Example response:
"organisation": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "my-name",
"doc_count": 27,
"label": {
"hits": {
"total": 27,
"max_score": 1,
"hits": [
{
"_index": "users",
"_type": "doc",
"_id": "4475",
"_score": 1,
"_source": {
"organisations": [
{
"name": "My name"
}]
}
}]
}
}
},
{
"key": "my-name-2",
"doc_count": 0,
"label": {
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
},
.....
Has anyone accomplished this? I feel like TopHits won't help me here. It should always fetch the name.
What I've also tried:
Working with a terms sub aggregation. (same result)
Working with a significant terms sub aggregation. (same result)
What I think could be a solution, but feels dirty:
Index a new field with "organisations.slug___organisations.name" and work the magic via this.
Manually query the name field where the count is 0 (i.e. where TopHits is empty)
Kind regards,
Thanks in advance
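For reference, the combined-field idea from the list above would need a small decoding step on the client. The sketch below assumes a terms aggregation on such a combined field, with bucket keys of the form slug___name; the sample keys, the "___" separator handling, and the function names are hypothetical.

```python
SEPARATOR = "___"


def split_bucket_key(key, separator=SEPARATOR):
    """Split a combined "slug___name" key at the first separator."""
    slug, _, name = key.partition(separator)
    return slug, name


def labelled_buckets(buckets):
    """Turn terms buckets on the combined field into slug/name/doc_count records."""
    result = []
    for bucket in buckets:
        slug, name = split_bucket_key(bucket["key"])
        result.append({"slug": slug, "name": name, "doc_count": bucket["doc_count"]})
    return result


# Hypothetical buckets, as a terms aggregation on the combined field might return them.
sample_buckets = [
    {"key": "my-name___My name", "doc_count": 27},
    {"key": "my-name-2___My name 2", "doc_count": 0},
]

for item in labelled_buckets(sample_buckets):
    print(item["slug"], "->", item["name"], item["doc_count"])
```

Because the name travels inside the bucket key, it is available even when doc_count is 0. The split relies on the separator never appearing inside a slug.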

Elasticsearch: Top k results per keyword

We have the following document in elasticsearch.
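# Older elasticsearch-dsl releases; newer ones name the base class Document.
from elasticsearch_dsl import DocType, Keyword, Text

class Query(DocType):
    text = Text(analyzer='snowball', fields={'raw': Keyword()})
    src = Keyword()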
Now we want top k results for each src. How can we achieve this?
Example: let's assume we index the following:
# src: place_order
Query(text="I want to order food", src="place_order")
Query(text="Take my order", src="place_order")
...
# src: payment
Query(text="How to pay ?", src="payment")
Query(text="Do you accept credit card ?", src="payment")
...
Now if the user writes the query "take my order please along with the credit card details" and k=1, then we should return the following two results:
[{"text": "Take my order", "src": "place_order", },
{"text": "Do you accept credit card ?", "src": "payment"}
]
Here, since k=1, we return just one result for each src.
You may try the top hits aggregation, which returns the top N matching documents for each bucket of an aggregation.
For the example in your post the query might look like this:
POST queries/query/_search
{
"query": {
"match": {
"text": "take my order please along with the credit card details"
}
},
"aggs": {
"src types": {
"terms": {
"field": "src"
},
"aggs": {
"best hit": {
"top_hits": {
"size": 1
}
}
}
}
}
}
The search on the text query restricts the set of documents for the aggregation. The "src types" aggregation groups all src values found in the matched documents, and "best hit" selects the single most relevant document per bucket (the size parameter can be changed according to your needs).
The result of the query would be like the following:
{
"hits": {
"total": 3,
"max_score": 1.3862944,
"hits": [
{
"_index": "queries",
"_type": "query",
"_id": "VD7QVmABl04oXt2HGbGB",
"_score": 1.3862944,
"_source": {
"text": "Do you accept credit card ?",
"src": "payment"
}
},
{
"_index": "queries",
"_type": "query",
"_id": "Uj7PVmABl04oXt2HlLFI",
"_score": 0.8630463,
"_source": {
"text": "Take my order",
"src": "place_order"
}
},
{
"_index": "queries",
"_type": "query",
"_id": "UT7PVmABl04oXt2HKLFy",
"_score": 0.6931472,
"_source": {
"text": "I want to order food",
"src": "place_order"
}
}
]
},
"aggregations": {
"src types": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "place_order",
"doc_count": 2,
"best hit": {
"hits": {
"total": 2,
"max_score": 0.8630463,
"hits": [
{
"_index": "queries",
"_type": "query",
"_id": "Uj7PVmABl04oXt2HlLFI",
"_score": 0.8630463,
"_source": {
"text": "Take my order",
"src": "place_order"
}
}
]
}
}
},
{
"key": "payment",
"doc_count": 1,
"best hit": {
"hits": {
"total": 1,
"max_score": 1.3862944,
"hits": [
{
"_index": "queries",
"_type": "query",
"_id": "VD7QVmABl04oXt2HGbGB",
"_score": 1.3862944,
"_source": {
"text": "Do you accept credit card ?",
"src": "payment"
}
}
]
}
}
}
]
}
}
}
Hope that helps!
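To make k a parameter, the request body can be assembled in code. A minimal sketch, assuming the field and aggregation names from the query above; build_top_k_query is a hypothetical helper, and the resulting dictionary is meant to be sent as the search body by whatever client you use.

```python
import json


def build_top_k_query(user_text, k, text_field="text", group_field="src"):
    """Build a search body returning the k best-matching documents per group_field value."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return {
        "query": {"match": {text_field: user_text}},
        "aggs": {
            "src types": {
                "terms": {"field": group_field},
                "aggs": {
                    # top_hits keeps the k highest-scoring documents of each bucket
                    "best hit": {"top_hits": {"size": k}}
                },
            }
        },
    }


body = build_top_k_query("take my order please along with the credit card details", k=1)
print(json.dumps(body, indent=2))
```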

Elasticsearch aggregation turns results to lowercase

I've been playing with ElasticSearch a little and found an issue when doing aggregations.
I have two endpoints, /A and /B. The first one holds the parents of the objects in the second one, so one or many objects in B must belong to one object in A. Therefore, objects in B have an attribute "parentId" holding the parent id generated by ElasticSearch.
I want to filter parents in A by attributes of their children in B. To do that, I first filter the children in B by attributes and get their unique parent ids, which I'll later use to get the parents.
I send this request:
POST http://localhost:9200/test/B/_search
{
"query": {
"query_string": {
"default_field": "name",
"query": "derp2*"
}
},
"aggregations": {
"ids": {
"terms": {
"field": "parentId"
}
}
}
}
And get this response:
{
"took": 91,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "child",
"_id": "AU_fjH5u40Hx1Kh6rfQG",
"_score": 1,
"_source": {
"parentId": "AU_ffvwM40Hx1Kh6rfQA",
"name": "derp2child2"
}
},
{
"_index": "test",
"_type": "child",
"_id": "AU_fjD_U40Hx1Kh6rfQF",
"_score": 1,
"_source": {
"parentId": "AU_ffvwM40Hx1Kh6rfQA",
"name": "derp2child1"
}
},
{
"_index": "test",
"_type": "child",
"_id": "AU_fjKqf40Hx1Kh6rfQH",
"_score": 1,
"_source": {
"parentId": "AU_ffvwM40Hx1Kh6rfQA",
"name": "derp2child3"
}
}
]
},
"aggregations": {
"ids": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "au_ffvwm40hx1kh6rfqa",
"doc_count": 3
}
]
}
}
}
For some reason, the aggregated key is returned in lowercase, so I cannot use it to request the parent from ElasticSearch:
GET http://localhost:9200/test/A/au_ffvwm40hx1kh6rfqa
Response:
{
"_index": "test",
"_type": "A",
"_id": "au_ffvwm40hx1kh6rfqa",
"found": false
}
Any ideas on why this is happening?
The difference between the hits and the aggregation results is that aggregations work on the indexed terms and return those terms, whereas the hits return the original source.
How are these terms created? Based on the chosen analyser, which in your case is the default one, the standard analyser. One of the things this analyser does is lowercase all the characters of the terms. As mentioned by Andrei, you should configure the field parentId to be not_analyzed.
PUT test
{
"mappings": {
"B": {
"properties": {
"parentId": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
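Until the index has been recreated with that mapping, a stopgap is to take the original casing from the hits instead of from the bucket keys. This is only a sketch under two assumptions: every parent id of interest appears in the returned hits, and each id was indexed as a single lowercased term (as in the response above). The function name original_case_ids is made up for illustration.

```python
def original_case_ids(response, agg_name="ids", field="parentId"):
    """Map lowercased bucket keys back to the original values found in the hits."""
    by_lower = {}
    for hit in response["hits"]["hits"]:
        value = hit["_source"].get(field)
        if value is not None:
            by_lower[value.lower()] = value
    buckets = response["aggregations"][agg_name]["buckets"]
    # Fall back to the bucket key itself when no matching hit was returned.
    return [by_lower.get(bucket["key"], bucket["key"]) for bucket in buckets]


# Trimmed-down copy of the response above.
sample = {
    "hits": {"hits": [
        {"_id": "AU_fjH5u40Hx1Kh6rfQG",
         "_source": {"parentId": "AU_ffvwM40Hx1Kh6rfQA", "name": "derp2child2"}},
        {"_id": "AU_fjD_U40Hx1Kh6rfQF",
         "_source": {"parentId": "AU_ffvwM40Hx1Kh6rfQA", "name": "derp2child1"}},
    ]},
    "aggregations": {"ids": {"buckets": [
        {"key": "au_ffvwm40hx1kh6rfqa", "doc_count": 3},
    ]}},
}

print(original_case_ids(sample))  # ['AU_ffvwM40Hx1Kh6rfQA']
```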
I am late to the party, but I had the same issue and found that it is caused by normalization.
You have to change the mapping of the index if you want to prevent normalization from changing the aggregated values to lowercase.
You can check the current mapping in the DevTools console by typing
GET /A/_mapping
GET /B/_mapping
In the structure of the index, look at the settings of the parentId field.
If you don't want to change the behaviour of the field but also want to avoid normalization during the aggregation, you can add a sub-field to the parentId field.
For changing the mapping you have to delete the index and recreate it with the new mapping. See the documentation on:
creating the index
Adding multi-fields to an existing field
In your case it looks like this (it contains only the parentId field):
PUT /B/_mapping
{
"properties": {
"parentId": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
Then you have to use the sub-field in the query:
POST http://localhost:9200/test/B/_search
{
"query": {
"query_string": {
"default_field": "name",
"query": "derp2*"
}
},
"aggregations": {
"ids": {
"terms": {
"field": "parentId.keyword",
"order": {"_key": "desc"}
}
}
}
}
