Finding duplicate field values in Elasticsearch

Using elasticsearch 0.19.4 (I know this is old, but it's what is required by a dependency).
I have a field "digest" in an Elasticsearch index, and I would like to execute a query that returns all the cases where there are duplicate values of digest. Can this be done?
For the records that have duplicate values, I would also like to return other fields, such as "url", which may not be duplicated.

You can use a Terms Aggregation for this.
POST <index>/<type>/_search?search_type=count
{
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "digest",
        "size": 0,
        "min_doc_count": 2
      }
    }
  }
}
This will return all values of the field digest that occur in at least 2 documents. I agree this does not exactly match your use case, but it might help.
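If you also need other fields such as url for each duplicate group, a top_hits sub-aggregation can be added. A sketch only: this assumes a cluster new enough for aggregations (1.0+) and top_hits (1.3+), unlike the 0.19.4 from the question, and the sub-aggregation name duplicateDocs is arbitrary:
POST <index>/<type>/_search?search_type=count
{
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "digest",
        "size": 0,
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocs": {
          "top_hits": {
            "size": 10,
            "_source": ["url"]
          }
        }
      }
    }
  }
}
Each duplicateNames bucket then carries the matching documents (with their url values) in its duplicateDocs hits.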

Related

elasticsearch: count appearance of terms aggregation on other fields

I want to count how many times unique values (the result of a terms aggregation) appear in other fields, in the same query. Let's say:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "unique_products": {
      "terms": {
        "field": "products.name.keyword",
        "min_doc_count": 10
      }
    }
  }
}
What I want is to count how many times each of the keys returned in the buckets appeared in another field.
My ideal output is:
"aggregations": {
  "product_stat": {
    "key": "<product_name>",
    "sold": "<#>",    # how many times the key appeared in another field like "sold"
    "bought": "<#>"
  }
}
Elasticsearch cannot do terms aggregations over multiple fields; in short, if it could, aggregations would not be blazing fast.
As the documentation suggests, there are two options:
use a scripted terms aggregation (with a performance penalty),
change how the documents are indexed so that a normal terms aggregation can be used.
Depending on the structure of your data and your use cases, you might get by with a complex aggregation plus some processing on the client side. This can be done with sub-aggregations, for example, as in the sketch below.
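One way to read that, as a sketch: run sibling terms aggregations on the other fields in the same request and join the buckets by key on the client. The field names sold.keyword and bought.keyword are assumptions based on the desired output above:
{
  "size": 0,
  "aggs": {
    "unique_products": {
      "terms": {
        "field": "products.name.keyword",
        "min_doc_count": 10
      }
    },
    "sold_counts": {
      "terms": { "field": "sold.keyword" }
    },
    "bought_counts": {
      "terms": { "field": "bought.keyword" }
    }
  }
}
On the client, for each key in the unique_products buckets, look up the same key in the sold_counts and bought_counts buckets and read its doc_count.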
Hope that helps!

Elasticsearch: get documents only when value changes

I have an ES index with such kind of documents:
from_1,to_1,timestamp_1
from_1,to_1,timestamp_2
from_1,to_2,timestamp_3
from_2,to_3,timestamp_4
from_1,to_2,timestamp_5
from_2,to_3,timestamp_6
from_1,to_1,timestamp_7
from_2,to_4,timestamp_8
I need a query that would return a document only if its combination of from and to values is different from the previously seen document with the same from value.
So with the provided sample above:
document with timestamp_1 should be in the result because there is no earlier document with from_1+to_1 combination
document with timestamp_2 must be skipped because its from+to combination is exactly the same as the last seen document with from = from_1
document with timestamp_3 should be in the result because its to field (to_2) is different from the value of the last seen document with the same from (to_1 in the document with timestamp_1)
document with timestamp_4 should be in the result
document with timestamp_5 must not be in the result because it has the same combination of from+to as the last seen with from_1 (document with timestamp_3)
document with timestamp_6 must not be in the result because it has the same combination of from+to as the last seen with from_2 (document with timestamp_4)
document with timestamp_7 should be in the result because it has a different combination of from+to than the last seen with from_1 (document with timestamp_3)
document with timestamp_8 should be in the result because its combination is completely new so far
I need to fetch all such "semi-unique" documents from the index, so it would be nice if it were possible to use a scroll request, or after_key if an aggregation is used.
Any ideas how to approach it?
The closest thing I could come up with is the following (let me know if it does not work with your data).
{
  "size": 0,
  "aggs": {
    "from_and_to": {
      "composite": {
        "size": 5,
        "sources": [
          {
            "from_to_collected": {
              "terms": {
                "script": {
                  "lang": "painless",
                  "source": "doc['from'].value + '_' + doc['to'].value"
                }
              }
            }
          }
        ]
      },
      "aggs": {
        "top_from_and_to_hits": {
          "top_hits": {
            "size": 1,
            "sort": [{ "timestamp": { "order": "asc" } }],
            "_source": { "includes": ["_id"] }
          }
        }
      }
    }
  }
}
Keep in mind that the terms aggregation is approximate, so the bucket counts are not guaranteed to be exact.
This will allow you to page to the next set of buckets using the after_key returned for the from_to_collected source.
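For example, a follow-up request for the next page of buckets would pass the after_key from the previous response (the key value here is purely illustrative):
{
  "size": 0,
  "aggs": {
    "from_and_to": {
      "composite": {
        "size": 5,
        "after": { "from_to_collected": "from_1_to_2" },
        "sources": [
          {
            "from_to_collected": {
              "terms": {
                "script": {
                  "lang": "painless",
                  "source": "doc['from'].value + '_' + doc['to'].value"
                }
              }
            }
          }
        ]
      }
    }
  }
}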

How to find all duplicate documents in ElasticSearch

We have a need to walk over all of the documents in our AWS ElasticSearch cluster, version 6.0, and gather a count of all the duplicate user ids.
I have tried using a Data Visualization to aggregate counts on the user ids and export them, but the numbers don't match another source of our data that is searchable via traditional SQL.
What we would like to see is like this:
USER ID     COUNT
userid1     4
userid22    3
...
I am not an advanced Lucene query person and have yet to find an answer to this question. If anyone can provide some insight into how to do this, I would be appreciative.
The following query will count each id and keep only the ids that occur in at least 2 documents (min_doc_count filters out the rest), so you'll get something like:
id:2, count:2
id:4, count:15
GET /index/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "user_id": {
      "terms": {
        "field": "user_id",
        "size": 100000,
        "min_doc_count": 2
      }
    }
  }
}
More here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
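The buckets in the response map directly onto the desired USER ID / COUNT table; a trimmed sketch of what comes back:
"aggregations": {
  "user_id": {
    "buckets": [
      { "key": "userid1",  "doc_count": 4 },
      { "key": "userid22", "doc_count": 3 }
    ]
  }
}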
If you want to get all duplicate userids with their counts, you first need an upper bound for the terms aggregation's size.
Find the number of distinct userid values via a cardinality aggregation:
GET index/type/_search
{
  "size": 0,
  "aggs": {
    "maximum_match_counts": {
      "cardinality": {
        "field": "userid",
        "precision_threshold": 100
      }
    }
  }
}
Take the value returned by the maximum_match_counts aggregation and use it as the terms size. Now you can get all duplicate userids:
GET index/type/_search
{
  "size": 0,
  "aggs": {
    "userIds": {
      "terms": {
        "field": "userid",
        "size": maximum_match_counts,
        "min_doc_count": 2
      }
    }
  }
}
When you go with a terms aggregation (Bharat's suggestion) and set the aggregation size to more than 10K, you will get a warning that this approach will throw an error in future releases.
Instead of using a terms aggregation, you should go with a composite aggregation to scan all of your documents by pagination (the after key method).
The composite aggregation can be used to paginate all buckets from a multi-level aggregation efficiently. This aggregation provides a way to stream all buckets of a specific aggregation, similar to what scroll does for documents.
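A sketch of that approach for the userid field follows. Note that composite has no min_doc_count, so buckets with a doc_count of 1 have to be skipped on the client while paging with the returned after_key (the aggregation name and page size here are arbitrary):
GET index/_search
{
  "size": 0,
  "aggs": {
    "duplicate_ids": {
      "composite": {
        "size": 1000,
        "sources": [
          { "userid": { "terms": { "field": "userid" } } }
        ]
      }
    }
  }
}
Each response carries an after_key; feed it back via "after" in the next request until no buckets are returned.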

Elasticsearch: Sorting giving weird results

When I search for the first time, it sorts all documents and gives me the first 5 records. However, if the same search query is executed with the sort direction changed (ASC -> DESC), it does not sort all documents again: it gives me the last 5 retrieved documents (from the previous search query), sorted in descending order. I was expecting it to sort all available documents in DESC order and then return the first 5 results.
Am I doing something wrong, or have I missed a concept?
My search query:
{
  "sort": {
    "taskid": {
      "order": "ASC"
    }
  },
  "from": 0,
  "size": 5,
  "query": {
    "filtered": {
      "query": {
        "match_all": []
      }
    }
  }
}
I have data with taskid 1 to 100. The above query fetched records with taskid 1 to 5 in the first attempt. When I changed the sort direction to desc, I expected documents with taskid 96-100 (in the sequence 100, 99, 98, 97, 96) to be returned; however, documents with taskid 5, 4, 3, 2, 1 were returned, in that sequence. That means the sorting was done on the previously returned results only.
Please note that taskid and _id are the same in my documents; I added a redundant field to my mapping that is the same as _id.
Just change the case of the value of the order key and you are good to go:
{
  "sort": {
    "taskid": {
      "order": "asc" // or "desc"
    }
  },
  "from": 0,
  "size": 5,
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      }
    }
  }
}
Hope this helps.
In Elasticsearch, the sort is applied to the results the query extracts: as per the query in your question, the results are first filtered based on the search criteria, and then sorting is applied to that filtered result.
If it looks like you are only getting results based on an old subset of your data, it may be that your newer data has not been indexed yet. This can happen easily in an automated test, but with manual testing it is less likely.
By default the index is refreshed every second, so adding a delay/sleep of about a second between indexing and searching should fix your test if this is the problem.
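If you control the test, you can also force a refresh explicitly instead of sleeping; this is the standard refresh API (<index> is a placeholder):
POST /<index>/_refresh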

Elasticsearch delete duplicates

Some of the records in my index are duplicated, identified by a numeric field recordid.
There is delete-by-query in Elasticsearch. Can I use it to delete any one of the duplicate records?
Or is there some other way to achieve this?
Yes, you can find duplicated documents with an aggregation query:
curl -XPOST http://localhost:9200/your_index/_search -d '
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "recordid",
        "min_doc_count": 2,
        "size": 10
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  }
}'
Then delete the duplicated documents, preferably using a bulk query. Have a look at es-deduplicator for automated duplicate removal (disclaimer: I'm the author of that script).
NOTE: Aggregation queries can be very expensive and might lead to a crash of your nodes (in case your index is too large and the number of data nodes too small).
Elasticsearch recommends "use(ing) the scroll/scan API to find all matching ids and then issue a bulk request to delete them".
The first challenge here is to identify the duplicate documents. For that you need to run a terms aggregation on the fields that define the uniqueness of a document. On the second level of the aggregation, use top_hits to get the document ID too. Once you are there, you will have the IDs of the documents that have duplicates.
Now you can safely remove them, for example using the Bulk API, as sketched below.
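A sketch of such a bulk delete, using document IDs collected from the top_hits buckets (the index, type, and IDs are placeholders):
POST /_bulk
{ "delete": { "_index": "your_index", "_type": "your_type", "_id": "duplicate_id_1" } }
{ "delete": { "_index": "your_index", "_type": "your_type", "_id": "duplicate_id_2" } }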
You can read about other approaches to detect and remove duplicate documents here.
