Elasticsearch: Remove duplicates from index

I have an index with multiple duplicate entries. They have different ids but the other fields have identical content.
For example:
{id: 1, content: 'content1'}
{id: 2, content: 'content1'}
{id: 3, content: 'content2'}
{id: 4, content: 'content2'}
After removing the duplicates:
{id: 1, content: 'content1'}
{id: 3, content: 'content2'}
Is there a way to delete all duplicates and keep only one distinct entry without manually comparing all entries?

This can be accomplished in several ways. Below I outline two possible approaches:
1) If you don't mind generating new _id values and reindexing all of the documents into a new index, then you can use Logstash and the fingerprint filter to generate a unique fingerprint (hash) from the fields that you are trying to de-duplicate, and use this fingerprint as the _id for documents as they are written into the new index. Since the _id field must be unique, any documents that have the same fingerprint will be written to the same _id and therefore deduplicated. (A minimal sketch of this idea is shown after this answer.)
2) You can write a custom script that scrolls over your index. As each document is read, you create a hash from the fields that you consider to define a unique document (in your case, the content field). Then use this hash as the key in a dictionary (aka hash table). The value associated with this key would be a list of all of the documents' _ids that generate this same hash. Once you have all of the hashes and associated lists of _ids, you can execute a delete operation on all but one of the _ids associated with each identical hash. Note that this second approach does not require writing documents to a new index in order to de-duplicate, as you would delete documents directly from the original index.
I have written a blog post and code that demonstrates both of these approaches at the following URL: https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/
Disclaimer: I am a Consulting Engineer at Elastic.
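For reference, a minimal sketch of the first approach using the Elasticsearch Python client instead of Logstash: scroll over the original index and write every document into a new index, using an MD5 fingerprint of the de-duplication field as the new _id. The index names and the content field below are assumptions based on the question.

import hashlib

from elasticsearch import Elasticsearch, helpers

# Assumptions: source index 'myindex', target index 'myindex_dedup',
# and 'content' as the field that defines a duplicate.
es = Elasticsearch(['localhost:9200'])

def dedup_actions(source_index='myindex', target_index='myindex_dedup'):
    for hit in helpers.scan(es, index=source_index):
        content = hit['_source']['content']
        # Documents with identical content get identical _ids, so later copies
        # simply overwrite earlier ones in the target index.
        fingerprint = hashlib.md5(content.encode('utf-8')).hexdigest()
        yield {
            '_op_type': 'index',
            '_index': target_index,
            '_id': fingerprint,
            '_source': hit['_source'],
        }

helpers.bulk(es, dedup_actions())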

I use Rails, and if necessary I will import things with the FORCE=y command, which removes and re-indexes everything for that index and type... however, I'm not sure what environment you are running ES in. The only issue I can see is if the data source you are importing from (i.e. a database) has duplicate records. I guess I would first see whether the data source can be fixed, if that is feasible, and then re-index everything; otherwise you could try to create a custom import method that only indexes one of the duplicate items for each record.
Furthermore, and I know this doesn't address your wish to remove the duplicate entries, but you could simply customize your search so that you only return one of the duplicate ids, either by most recent "timestamp" or by indexing deduplicated data and grouping by your content field -- see if this post helps. Even though this would still retain the duplicate records in your index, at least they won't come up in the search results.
I also found this: Elasticsearch delete duplicates
I tried to think through several possible scenarios to see whether any of these options work, or at least could serve as a temporary fix.
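If query-time de-duplication is enough, one option (assuming Elasticsearch 5.3 or later and a content.keyword sub-field) is field collapsing, which returns only one hit per distinct value of the collapse field. A minimal sketch with the Python client; the index name is an assumption:

from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:9200'])

# Assumption: the content field is mapped with a keyword sub-field (content.keyword).
response = es.search(
    index='myindex',
    body={
        'query': {'match_all': {}},
        # Only one document is returned per distinct content value; duplicates
        # stay in the index but are hidden from the search results.
        'collapse': {'field': 'content.keyword'},
    },
)

for hit in response['hits']['hits']:
    print(hit['_id'], hit['_source']['content'])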

Here is a script I created based on Alexander Marquardt's answer.
import hashlib

from elasticsearch import Elasticsearch, helpers

ES_HOST = 'localhost:9200'
es = Elasticsearch([ES_HOST])


def scroll_over_all_docs(index_name='squad_docs'):
    # Scroll over the whole index and group document ids by a hash of their text.
    dict_of_duplicate_docs = {}

    index_docs_count = es.cat.count(index_name, params={"format": "json"})
    total_docs = int(index_docs_count[0]['count'])
    count = 0

    for hit in helpers.scan(es, index=index_name):
        count += 1
        # Change 'text' to the field(s) that define a duplicate in your index (e.g. 'content').
        text = hit['_source']['text']
        id = hit['_id']
        hashed_text = hashlib.md5(text.encode('utf-8')).digest()
        dict_of_duplicate_docs.setdefault(hashed_text, []).append(id)
        if count % 100 == 0:
            print(f'Progress: {count} / {total_docs}')

    return dict_of_duplicate_docs


def delete_duplicates(duplicates, index_name='squad_docs'):
    # For every hash that maps to more than one id, keep the first id and delete the rest.
    for hash, ids in duplicates.items():
        if len(ids) > 1:
            print(f'Number of docs: {len(ids)}. Number of docs to delete: {len(ids) - 1}')
            for id in ids:
                if id == ids[0]:
                    continue  # keep the first document of the group
                res = es.delete(index=index_name, doc_type='_doc', id=id)
                id_deleted = res['_id']
                results = res['result']
                print(f'Document id {id_deleted} status: {results}')
            remaining_doc = es.get(index=index_name, doc_type='_doc', id=ids[0])
            print('Remaining document:')
            print(remaining_doc)


def main():
    dict_of_duplicate_docs = scroll_over_all_docs()
    delete_duplicates(dict_of_duplicate_docs)


if __name__ == "__main__":
    main()
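If the index is large, the one-by-one delete calls in delete_duplicates can be replaced with a single bulk request. A sketch that reuses the same duplicates dictionary and the es client defined at the top of the script:

from elasticsearch import helpers

def bulk_delete_duplicates(duplicates, index_name='squad_docs'):
    # Build a delete action for every id except the first one in each group;
    # 'es' is the client created at the top of the script above.
    actions = (
        {'_op_type': 'delete', '_index': index_name, '_id': doc_id}
        for ids in duplicates.values() if len(ids) > 1
        for doc_id in ids[1:]
    )
    helpers.bulk(es, actions)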

Related

Is it possible to get data contained in another document by id, when the map function is running for some document in a Couchbase view?

I have two kinds of documents in my couchbase bucket with keys like -
product.id.1.main
product.id.2.main
product.id.3.main
and
product.id.1.extended
product.id.2.extended
product.id.3.extended
I want to write a view for documents of the first kind, such that when some conditions are matched for a document, I can emit the attributes contained in the document of the first kind as well as the corresponding document of the second kind.
Something like -
function(doc, meta) {
    if (meta.id.match("product.id.*.main") && doc.attribute1.match("value1")) {
        var extendedDocId = replaceMainWithExtended(meta.id);
        emit(meta.id, doc.attribute1 + getExtendedDoc(extendedDocId).extendedAttribute1);
    }
}
I want to know how to implement this kind of function in couchbase views -
getExtendedDoc(extendedDocId).extendedAttribute1
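A Couchbase view map function only sees the document currently being indexed, so it cannot load another document during the map phase. A common workaround is to emit only the main document's data from the view and do the join on the client after querying it. A rough sketch with the older Couchbase 2.x Python SDK; the bucket name ('products'), design document ('dev_products') and view name ('main_matches') are assumptions:

from couchbase.bucket import Bucket

# Assumption: a view 'main_matches' in design document 'dev_products' that
# emits meta.id for the matching *.main documents.
bucket = Bucket('couchbase://localhost/products')

for row in bucket.query('dev_products', 'main_matches'):
    main_id = row.key                                    # e.g. 'product.id.1.main'
    extended_id = main_id.replace('.main', '.extended')  # derive the sibling key

    main_doc = bucket.get(main_id).value
    extended_doc = bucket.get(extended_id).value

    # Client-side join: combine attributes from both documents.
    print(main_doc['attribute1'], extended_doc['extendedAttribute1'])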

Skip invalid documents and upload valid ones from a single file using an INSERT AQL query with ArangoDB

I am using the following AQL query for uploading documents from a file into the database: "FOR document IN @file INSERT document INTO @@collection LET newDoc = NEW RETURN newDoc".
I have created a unique hash index on all attributes in the collection, so when attempting to upload a duplicate document I get an error (which I wanted), but then none of the documents from the file are uploaded to the database.
I would like to know if there is a way to upload only the valid documents and skip the wrong ones (in my case, duplicate documents) using an AQL query.
UPDATE:
I am using python and I can upload documents one by one as:
for document in file:
    doc = collection.createDocument()  # function from pyArango
    try:
        for key, value in document.iteritems():
            doc[key] = value
        doc.save()
    except:
        print "wrong document"
I was wondering if I can do it using an AQL query rather than "manually" uploading them one by one.
You can specify ignoreErrors: true in the OPTIONS clause of the INSERT statement, like this:
FOR document IN @file
    INSERT document INTO @@collection OPTIONS { ignoreErrors: true }
    RETURN NEW
It will then ignore the documents with collisions and will only return those documents that were actually created.
If you try to only return the _key field, you will get a null for each failed document:
FOR document IN @file
    INSERT document INTO @@collection OPTIONS { ignoreErrors: true }
    RETURN NEW._key
The result will look like this: the first entry is a duplicate, the second is a freshly generated _key with the value 23225:
[
null,
"23225"
]
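For completeness, a rough sketch of running that query from pyArango (the driver used in the question); the connection details, database and collection names are assumptions, and the bind parameters follow the standard AQL syntax (@file for the list of documents, @@collection for the collection name):

from pyArango.connection import Connection

# Assumptions: ArangoDB on localhost, a database 'mydb' and a collection 'mycol'.
conn = Connection(arangoURL='http://localhost:8529', username='root', password='')
db = conn['mydb']

aql = """
FOR document IN @file
    INSERT document INTO @@collection OPTIONS { ignoreErrors: true }
    RETURN NEW._key
"""

documents = [{'name': 'a'}, {'name': 'a'}, {'name': 'b'}]  # the second one collides

result = db.AQLQuery(aql, rawResults=True,
                     bindVars={'file': documents, '@collection': 'mycol'})

for key in result:
    # A new _key for inserted documents, None for the skipped duplicates.
    print(key)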

Index JSON Array in Postgres DB

I have a table where each row has a JSON structure like the following, which I'm trying to index in a PostgreSQL database, and I was wondering what the best way to do it is:
{
    "name": "Mr. Jones",
    "wish_list": [
        {"present_name": "Counting Crows",
         "present_link": "www.amazon.com"},
        {"present_name": "Justin Bieber",
         "present_link": "www.amazon.com"}
    ]
}
I'd like to put an index on each present_name within the wish_list array. The goal here is that I'd like to be able to find each row where the person wants a particular gift through an index.
I've been reading on how to create an index on a JSON which makes sense. The problem I'm having is creating an index on each element of an array within a JSON object.
The best guess I have is using something like the json_array_elements function and creating an index on each item returned through that.
Thanks for a push in the right direction!
Please check the JSONB Indexing section in the Postgres documentation.
For your case, the index config may be the following:
CREATE INDEX idx_gin_wishlist ON your_table USING gin ((jsonb_column -> 'wish_list'));
It will store copies of every key and value inside wish_list, but you should be careful to write queries that actually hit the index. You should use the @> (containment) operator:
SELECT jsonb_column->'wish_list'
FROM your_table
WHERE jsonb_column->'wish_list' @> '[{"present_link": "www.amazon.com", "present_name": "Counting Crows"}]';
I strongly suggest checking these existing answers:
How to query for array elements inside JSON type
Index for finding an element in a JSON array
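A minimal sketch of issuing that containment query from Python with psycopg2; the table and column names (your_table, jsonb_column) follow the answer above, and the connection parameters are assumptions:

import psycopg2
from psycopg2.extras import Json

# Assumed connection parameters.
conn = psycopg2.connect(dbname='mydb', user='postgres', host='localhost')
cur = conn.cursor()

# Json() adapts the Python structure to a JSON literal, so the @> containment
# operator (and therefore the GIN index) can be used with a query parameter.
wanted = [{"present_name": "Counting Crows"}]
cur.execute(
    """
    SELECT jsonb_column->'wish_list'
    FROM your_table
    WHERE jsonb_column->'wish_list' @> %s
    """,
    (Json(wanted),),
)
for (wish_list,) in cur.fetchall():
    print(wish_list)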

Getting the objects with similar secondary index in Riak?

Is there a way to get all the objects, in key/value format, that fall under one secondary index value? I know we can get the list of keys for one secondary index (bucket/{{bucketName}}/index/{{index_name}}/{{index_val}}), but my requirements are such that I need the full objects as well. I don't want to perform a separate query for each key to get the object details if there is a way around it.
I am completely new to Riak and I am totally a front-end guy, so please bear with me if something I ask is of novice level.
In Riak, it's sometimes the case that the better way is to do separate lookups for each key. Coming from other databases this seems strange, and likely inefficient; however, you may find your query will be faster over an index and a bunch of single-object gets than a map/reduce over all the objects in a single go.
Try both of these approaches and see which turns out fastest for your dataset - the variables that affect this are: the size of the data being queried; the size of each document; the power of your cluster; the load the cluster is under, etc.
Python code demonstrating the index and separate gets (if the data you're getting is large, this method can be made memory-efficient on the client, as you don't need to store all the objects in memory):
query = riak_client.index("bucket_name", 'myindex', 1)
query.map("""
    function(v, kd, args) {
        return [v.key];
    }""")
results = query.run()

bucket = riak_client.bucket("bucket_name")
for key in results:
    obj = bucket.get(key)
    # .. do something with the object
Python code demonstrating a map/reduce for all objects (returns a list of {key:document} objects):
query = riak_client.index("bucket_name", 'myindex', 1)
query.map("""
    function(v, kd, args) {
        var obj = Riak.mapValuesJson(v)[0];
        return [ {
            'key': v.key,
            'data': obj,
        } ];
    }""")
results = query.run()

Query two or more sphinx indexes

I'm using the PHP API to query two Sphinx indexes as below:
$cl->Query("test", "index1 index2");
I'm getting results from both of them successfully, but I can't tell which result came from which index. Is there a way to tell the difference, or do I need to do 2 separate queries?
Set a unique attribute on each source:
source1 {
    sql_query = SELECT id, 1 as index_id, ....
    sql_attr_uint = index_id
}
source2 {
    sql_query = SELECT id, 2 as index_id, ....
    sql_attr_uint = index_id
}
Results will contain an 'index_id' attribute.
It's almost the same if you're using RT indexes; you just need to define an rt_attr_uint and then populate it appropriately when you insert data into the index.
The other way: presumably you've already arranged for the ids in the two indexes to be non-overlapping (it won't work if you have the same ids in both indexes), so you can look at the ID to deduce the source index.
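For what it's worth, reading that attribute from the bundled Python API looks roughly like this (a sketch assuming the classic sphinxapi client, searchd on its default port, and the index_id attribute configured as above; the PHP API exposes the same data in its result array):

import sphinxapi

cl = sphinxapi.SphinxClient()
cl.SetServer('localhost', 9312)  # assumed searchd host/port

res = cl.Query('test', 'index1 index2')
if res:
    for match in res['matches']:
        # index_id is 1 for documents from index1 and 2 for documents from index2.
        print(match['id'], match['attrs']['index_id'])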
