Duplicates in Solr index - items added twice or more times - solrnet

Consider a Solr index with approximately 20 million items. When these items are indexed, they are added to the index in batches.
Roughly 5% of the items end up indexed twice or more, causing a duplicates problem.
The log confirms that these items are indeed added twice (or more), often 2-3 minutes apart, with other items indexed in between.
The web server that triggers the indexing is in a load-balanced environment (2 web servers). However, the indexing itself is done by a single web server.
Here are some of the config elements in solrconfig.xml:
<indexDefaults>
  .....
  <mergeFactor>10</mergeFactor>
  <ramBufferSizeMB>128</ramBufferSizeMB>
  <maxFieldLength>10000</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>10000</commitLockTimeout>
  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
    <double name="maxMergeMB">1024.0</double>
  </mergePolicy>
<mainIndex>
  <useCompoundFile>false</useCompoundFile>
  <ramBufferSizeMB>128</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>
I'm using Solr 1.4.1 and Tomcat 7.0.16, together with the latest SolrNet library.
What might be causing these duplicates? Thanks for all input!

To answer your question completely I would need to see the schema. The schema has a unique ID field that works much like a primary key in a database: make sure the unique identifier of the document is declared as the unique key, and duplicates will be overwritten so that only one document remains.

It is not possible to have two documents with an identical value in the field marked as the unique ID in the schema. Adding two documents with the same value just results in the latter overwriting (replacing) the former.
So it sounds like the documents are not really identical.
Make sure your schema and ID fields are correct.
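For reference, this is what declaring such a key looks like in schema.xml; the field name id is just an illustration, but the uniqueKey element is what makes Solr overwrite instead of duplicate:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>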

To complete what was said above, one solution in this case is to generate a unique ID (or designate one of the existing fields as the unique ID) for the document in code, before sending it to Solr.
That way you make sure the document you want to update is overwritten, not recreated.
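A minimal SolrNet sketch of that idea; the Product class, its fields, and the dbRow source are assumptions for illustration. The point is that the key is derived from a stable business identifier, so re-indexing the same item reuses the same ID and Solr overwrites it:

using Microsoft.Practices.ServiceLocation;
using SolrNet;
using SolrNet.Attributes;

public class Product
{
    // Stable key derived from the source system (e.g. the database primary key),
    // NOT generated fresh on every add.
    [SolrUniqueKey("id")]
    public string Id { get; set; }

    [SolrField("name")]
    public string Name { get; set; }
}

// In the indexing code (assumes Startup.Init<Product>(solrUrl) ran at startup;
// dbRow is a hypothetical source record):
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Product>>();
solr.Add(new Product { Id = "product-" + dbRow.Id, Name = dbRow.Name });
solr.Commit();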

Actually, all added documents get an auto-generated unique key, through Solr's own uuid type:
<field name="uid" type="uuid" indexed="true" stored="true" default="NEW"/>
So any document added to the index is considered a new one, since it gets a fresh GUID. However, I think we've got a problem in some other code here: code that adds items to the index when they are updated, instead of just updating them.
I'll be back! Thanks so far!

OK, it turned out there were a couple of bugs in the code updating the index. Instead of updating, we always added a new document, even though it already existed.
It wasn't overwritten because every document in our Solr index gets its own GUID.
Thank you for your answers and time!
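A hedged sketch of the kind of fix this implies, assuming a SolrNet mapping where uid is the unique key and itemno is a business field used to find existing documents (both names are illustrative):

// Item is a hypothetical mapped class with [SolrUniqueKey("uid")] on Uid.
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Item>>();

var existing = solr.Query(new SolrQueryByField("itemno", item.ItemNo));
if (existing.Count > 0)
    item.Uid = existing[0].Uid;  // reuse the stored GUID, so Solr overwrites
// otherwise leave Uid unset and the uuid type's default="NEW" assigns one

solr.Add(item);
solr.Commit();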

Related

How will the indexing of a new document be done by a Couchbase view? Is it added at the end?

I want to know: if a new document is added to Couchbase and I am accessing documents of this kind via a map-reduce view, will the new document come last in the list of these documents, or can it appear at any position in the list?
It does not depend on the time of creation; it depends entirely on the keys your view emits.
So if you use a name field as the key, the results will be displayed in alphabetical order.
See Writing Views for more detail.
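As a concrete illustration (the name field is an assumption), a view with this map function returns its rows sorted by the emitted key, regardless of when the documents were inserted:

// Couchbase view map function: rows are ordered by the emitted key
// (doc.name here), not by insertion time.
function (doc, meta) {
  if (doc.name) {
    emit(doc.name, null);
  }
}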

Are IDs guaranteed to be unique across indices in Elasticsearch 6+?

With mapping types being removed in Elasticsearch 6.0, I wonder whether document IDs are guaranteed to be unique across indices.
Say I have three indices, all with a "parent" field that contains an ID. Do I need to record which index the ID belongs to, or can I just search through all three indices when looking for a document with the given ID?
IDs are not unique across indices.
If you want to refer to a document you need to know both the index name and the ID.
Explicit IDs
If you explicitly set the document ID when indexing, nothing prevents you from using the same ID twice for documents going in different indices.
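For example (index names and the payload are assumed), both of these requests succeed in Elasticsearch 6.x and leave two distinct documents sharing the _id 42:

PUT /index1/_doc/42
{ "parent": "7" }

PUT /index2/_doc/42
{ "parent": "9" }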
Autogenerated IDs
If you don't set the ID when indexing, ES will generate one before storing the document.
According to the code, the ID is securely generated from a random number, the host MAC address and the current timestamp in ms. Additional work is done to ensure that the timestamp (and thus the ID sequence) increases monotonically.
To generate the same ID twice, the JVM would have to pick the same random number at startup, and the document ID would have to be generated at a specific moment with sub-millisecond precision. So while the chance exists, it's so small that I wouldn't worry about it (just as I wouldn't worry about collisions when using a hash function to check file integrity).
Final note: as a code comment notes, the implementation is opaque and could change at any time, so what I wrote might not hold true in future versions.

Lucene: Filter query by doc ID

I want the search response to contain only documents with a specified doc ID. On Stack Overflow I found this question (Lucene filter with docIds), but as far as I understand, it creates an additional field in the document and then searches by that field. Is there another way to deal with it?
Lucene's docids are intended only to be internal keys. You should not be using them as search keys, or storing them for later use. Those ids are subject to change without warning. They will be changed when updating or reindexing documents, and can change at other times, such as segment merges, as well.
If you want your documents to have a unique identifier, you should generate that key separate from the docId, and index it as a field in your document.
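A minimal sketch of that approach, here in Lucene.NET syntax (the field name and the open writer/searcher instances are assumptions):

using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

// At index time: store your own stable identifier as an untokenized field.
var doc = new Document();
doc.Add(new StringField("id", "doc-42", Field.Store.YES));
writer.AddDocument(doc);  // writer is an already-open IndexWriter

// At search time: look the document up by that field, never by the internal docid.
var hits = searcher.Search(new TermQuery(new Term("id", "doc-42")), 1);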

Avoid duplicate documents in Elasticsearch

I parse documents from a JSON file and add them as children of a parent document. I just post the items to the index, without setting an ID myself.
Sometimes there are updates to the JSON and items are added to it. So e.g. I parse 2 documents from the JSON, and after a week or two I parse the same JSON again; this time it contains 3 documents.
I found answers like 'remove all children and insert all items again', but I doubt this is the solution I'm looking for.
I could compare each item to the children of my target parent and add new documents only where there is no equal child.
I wondered if there is a way to let Elasticsearch handle duplicates.
Duplication needs to be handled in the ID handling itself.
Choose a key that is unique per document and make that the _id. If the key is too large, or is composed of multiple fields, create a SHA checksum of it and make that the _id (as sketched below).
If you already have duplicates in the index, you can use a terms aggregation nested with a top_hits aggregation to detect them.
You can read more about this approach here.
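A small sketch of the checksum idea; the assumption here is that a document's identity is the combination of its parent ID and an item name (both illustrative):

using System;
using System.Security.Cryptography;
using System.Text;

static string DeterministicId(string parentId, string itemName)
{
    // Same logical document -> same bytes -> same _id, so re-posting
    // the re-parsed JSON overwrites instead of duplicating.
    var key = parentId + "|" + itemName;
    using (var sha = SHA256.Create())
    {
        var hash = sha.ComputeHash(Encoding.UTF8.GetBytes(key));
        return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
    }
}

// e.g. index with: PUT /myindex/_doc/<DeterministicId(parent, name)>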
When a document is added to Elasticsearch with an explicit ID, Elasticsearch first checks whether a document with that ID already exists. If it does, the document is updated (replaced) instead of a duplicate being added, and its _version field is incremented to track the number of updates. You therefore need to keep track of your document IDs somehow, and reuse the same ID for matching documents, to eliminate the possibility of duplicates.

Elasticsearch not searching some fields

I have just updated a website; the update adds new fields to Elasticsearch.
In my dev environment it all works fine, but on the live site the new fields are not being found.
E.g. I have added a new field with the value 1.
However, a filtered query of
{"field":1}
does not find any matching results.
When I look at the documents, I can see docs with the field set to 1.
Could the reason be that the new field was added after the mapping was set? I am not all that familiar with Elasticsearch, so I am not really sure where to start looking to fix it.
Any help would be appreciated.
Update:
Querying from the URL shows nothing either:
_search/?pretty=true&size=50&q=field1:*
However, there is another field that was added at the same time which I can search on.
I can see field1 in the result set, but it just won't let me search on it.
The only difference I see in the mapping is that the field that works has type:long, whereas the one that doesn't has type:string.
Is it a length issue with the ngram tokenizer? What are your "min_gram" settings?
You can check your index settings like this:
GET <host>/<index_name>/_settings
Does it work when you filter on a two-digit field?
Are all the field values one digit?
It's OK to add a field after the mapping was set; Elasticsearch will guess the mapping for you (in fact, that's one of its selling points: no need to define the mapping, just throw data at it).
There are a few things that can go wrong:
Verify that data is actually in the index. To do that, just navigate to the _search URL with no parameters; you should see the field if it is indexed.
Look at your mapping. Could it be that the field is explicitly set not to be indexed? (See the mapping check below.)
Another possibility is that your query is wrong, but that is unlikely, since you say it works in the development environment.
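A quick way to run that mapping check (the index name myindex is an assumption); in the pre-5.x string-typed mappings this question describes, a field mapped with "index": "no" is stored and visible in results but not searchable, which would match these symptoms:

GET /myindex/_mapping?pretty

An excerpt of a problematic mapping would look like:

"field1": { "type": "string", "index": "no" }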