Finding duplicate documents - elasticsearch

I have some documents whose IDs are randomly generated. The issue is that I need to find the duplicates amongst these documents. There are three fields whose combination should not be identical for any two documents. So how do I check for duplicates based on multiple fields?
Sample documents
document 1 = {
  "process" : "business",
  "processId" : 5433321,
  "country" : "US"
}
document 2 = {
  "process" : "operations",
  "processId" : 334233,
  "country" : "UK"
}
document 3 = {
  "process" : "business",
  "processId" : 5433321,
  "country" : "US"
}
As you can see, document 1 and document 3 are the same, but they have different IDs in my database, so they exist as separate documents. When I run the check I need to detect the above as duplicates and, if possible, keep only one.

The best option here would be to model your documents around the document ID. For each unique document, create a docID which is a hash of the content of the document. This makes sure that only one copy of a unique document exists across the index. Next, use the _create API to index documents; it will reject any request that would overwrite an existing document with the same document ID.
You can further read about other duplication issues and their solutions here.
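A minimal sketch of this approach, assuming a hypothetical index named processes with type doc, and a document ID derived by hashing the three fields (e.g. the sha1 of "business|5433321|US", abbreviated below):

PUT processes/doc/9f2c.../_create
{
  "process" : "business",
  "processId" : 5433321,
  "country" : "US"
}

Indexing document 3 produces the same hash, so a second _create call with that ID is rejected with a 409 conflict instead of silently creating a duplicate.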

Related

Handle Frequently updated geo data in Elastic Search

I have location (geo point) data updated every 5 minutes for millions of users. We have to search for users with specific attributes (age/interests/languages) and within a particular geo range. I wanted to understand the right strategy to store such data in Elastic.
Option 1
Create a user document with the following keys:
User metadata & attributes (age, interests, languages, salary, etc.; around 8-10 searchable attributes)
Live location (changing every few minutes):
"liveLocation" : {
  "type" : "Point",
  "coordinates" : [-72.333077, 30.856567]
}
Location data: multiple addresses (home address, work address, etc.) along with geo points; not updated frequently:
"addresses" : [
  {
    "type" : "home",
    "address" : "first floor, xyz, near landmark",
    "city" : "Newyork",
    "country" : "Country",
    "zipcode" : "US1029",
    "location" : {
      "type" : "Point",
      "coordinates" : [-73.856077, 40.848447]
    }
  },
  {
    ... more types of addresses
  }
]
We want to perform geo search queries over all of the geo fields. My worry is that the live location for users will be updated quite frequently. Is this a viable approach?
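For reference, a search combining an attribute filter with a geo filter on the live location might look roughly like the sketch below (the users index name and the interests value are assumptions; liveLocation is assumed to be mapped as geo_point, and the distance is illustrative):

GET users/_search
{
  "query" : {
    "bool" : {
      "filter" : [
        { "term" : { "interests" : "music" } },
        {
          "geo_distance" : {
            "distance" : "10km",
            "liveLocation" : { "lat" : 30.856567, "lon" : -72.333077 }
          }
        }
      ]
    }
  }
}

The same geo_distance clause can target addresses.location to search over the stored addresses instead.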
Option 2
Treat every location update as time series data and insert a new document. This avoids updating documents; instead, a new document is inserted for each user every few minutes.
Problems:
While searching for all users (home/office/live location) in a particular geo range, I have to consider only the last updated document for each user. How to do that in Elastic (see the sketch after this list)?
We have to search for users with specific attributes (age/interests/language) and within a particular geo range. If option 2 is preferable, should user attribute metadata and location updates be treated as a parent-child relationship, or is some other approach better?
In conclusion: what is the right approach?
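One hedged sketch for the "latest document per user" part of option 2, assuming a user_locations index where every update document carries a userId keyword field and a timestamp field (all three names are hypothetical), is field collapsing: collapse the hits on userId and sort by timestamp, so that at most one document per user, the most recent matching one, is returned.

GET user_locations/_search
{
  "query" : {
    "geo_distance" : {
      "distance" : "10km",
      "liveLocation" : { "lat" : 30.856567, "lon" : -72.333077 }
    }
  },
  "collapse" : { "field" : "userId" },
  "sort" : [ { "timestamp" : "desc" } ]
}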

elasticsearch - query between document types

I have a production_order document_type
i.e.
{
  part_number: "abc123",
  start_date: "2018-01-20"
},
{
  part_number: "1234",
  start_date: "2018-04-16"
}
I want to create a commodity document type
i.e.
{
  part_number: "abc123",
  commodity: "1 meter machining"
},
{
  part_number: "1234",
  commodity: "small flat & form"
}
Production orders are loaded into the data warehouse every week and are immutable.
Commodities, on the other hand, could change over time, e.g. abc123 could change from 1 meter machining to 5 meter machining, so I don't want to store this data with the production_order records.
If a user searches for "small flat & form" in the commodity document type, I want to pull all matching records from the production_order document type, the match being between part number.
Obviously I can do this in a relational database with a join. Is it possible to do the same in elasticsearch?
If it helps, we have about 500k part numbers that will be commoditized and our production order data warehouse currently holds 20 million records.
I have found that you can indeed now query between indexes in Elasticsearch; however, you have to ensure your data is stored correctly. Here is an example from the 6.3 Elasticsearch docs:
Terms lookup twitter example: at first we index the information for user with id 2, specifically its followers, then index a tweet from user with id 1. Finally we search on all the tweets that match the followers of user 2.
PUT /users/user/2
{
  "followers" : ["1", "3"]
}

PUT /tweets/tweet/1
{
  "user" : "1"
}

GET /tweets/_search
{
  "query" : {
    "terms" : {
      "user" : {
        "index" : "users",
        "type" : "user",
        "id" : "2",
        "path" : "followers"
      }
    }
  }
}
Here is the link to the original page
https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-terms-query.html
In my case above I need to set up my storage so that commodity is a field and its values are an array of part numbers.
i.e.
{
  "1 meter machining": ["abc123", "1234"]
}
I can then look up the 1 meter machining part numbers against my production_order documents.
I have tested and it works.
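For example, with the commodities stored in that shape, the lookup against production_order could look roughly like this (the commodities index name, the commodity type and the document id "1" are assumptions; the path is the commodity field shown above):

GET production_order/_search
{
  "query" : {
    "terms" : {
      "part_number" : {
        "index" : "commodities",
        "type" : "commodity",
        "id" : "1",
        "path" : "1 meter machining"
      }
    }
  }
}

Using the commodity name itself (or a slug of it) as the lookup document's _id makes that document easy to address when the commodity changes.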
There are no joins supported in Elasticsearch.
You can query twice: first get all the part numbers matching "small flat & form", and then use all of those part numbers to query the other index (a sketch follows below).
Otherwise, try to find a way to merge these into a single index; that would be better. Updating the commodities would not cause you any problem if you combine the two.
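A sketch of that two-step variant, treating production_order and commodity as index names (an assumption) and reusing the field names from the question:

GET commodity/_search
{
  "query" : { "match" : { "commodity" : "small flat & form" } },
  "_source" : ["part_number"]
}

GET production_order/_search
{
  "query" : {
    "terms" : { "part_number" : ["1234"] }
  }
}

The second query is built from the part_number values returned by the first (here just "1234" from the sample data).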

How to update a document using index alias

I have created an index "index-000001" with primary shards = 5 and replica = 1. And I have created two aliases
alias-read -> index-000001
alias-write -> index-000001
for indexing and searching purposes. When I do a rollover on alias-write once it reaches its maximum capacity, it creates a new "index-000002" and updates the aliases as
alias-read -> index-000001 and index-000002
alias-write -> index-000002
How do I update/delete a document that exists in index-000001 (what if all I know is the document ID but not which index the document resides in)?
Thanks
Updating through an index alias is not directly possible. The best solution is to run a search using the document ID or a term, find the index the document lives in, and then update the document in that index directly.
GET alias-read/{type}/{doc_id} will get the required document if doc_id is known.
If doc_id is not known, find the document using a unique identifying field:
GET alias-read/_search
{
  "query" : {
    "term" : { "field" : "value" }
  }
}
In both cases, you will get a single document as a response.
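For example, a search through the alias might return something like the following (abridged, values illustrative):

{
  "hits" : {
    "total" : 1,
    "hits" : [
      {
        "_index" : "index-000001",
        "_type" : "{type}",
        "_id" : "{doc_id}",
        "_source" : { ... }
      }
    ]
  }
}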
Once the document is obtained, you can use the "_index" field to get the required index.
POST {index_name}/{type}/{id}/_update
{
  "doc" : {
    "required_field" : "new_value"
  }
}
to update the document.

ElasticSearch - Unique Tags for multiple documents (indexing)

We would like a unique tag and multiple values in Elasticsearch. To be clearer: we need to build a time series graph, so we fetch values between two dates. But of course we have different kinds of data; that is where our tags come in. We want to search our tags with autocompletion, then choose our values using the dates.
{
  "tag" : ["sdfsf", "fddsfsd", "fsdfsf"],
  {
    "values" : 145.45,
    "date" : "2004-10-23"
  },
  {
    "values" : 556.09,
    "date" : "2010-02-13"
  }
}
After a bit of research we found the parent/child technique, but because we want to do completion on the tag (which lives in the parent), we would need an aggregation, which is impossible in ES with "has_parent".
Our solution is to do:
{
  {
    "tag" : ["sdfsf", "fddsfsd", "fsdfsf"],
    "values" : 145.45,
    "date" : "2004-10-23"
  },
  {
    "tag" : null,
    "values" : 556.09,
    "date" : "2010-02-13"
  },
  { etc... }
}
So we only have one tag, which is easy to check with completion. But it's kind of "ugly".
Does anybody have a proper way to do what we want to do?
Thanks in advance.

can terms lookup mechanism query by other field but id?

Here is the Elasticsearch official documentation about the terms query:
https://www.elastic.co/guide/en/elasticsearch/reference/2.1/query-dsl-terms-query.html
As we can see, if we want to do a terms lookup query, we should use a command like this:
curl -XGET localhost:9200/tweets/_search -d '{
  "query" : {
    "terms" : {
      "user" : {
        "index" : "users",
        "type" : "user",
        "id" : "2",
        "path" : "followers"
      }
    }
  }
}'
But what if I want to query by another field of users?
Assume that users has some other fields, such as name. Can I use the terms lookup mechanism to find the tweets by giving the user's name instead of the id?
I have tried to use a command like this:
curl -XGET localhost:9200/tweets/_search -d '{
  "query" : {
    "terms" : {
      "user" : {
        "index" : "users",
        "type" : "user",
        "name" : "Jane",
        "path" : "followers"
      }
    }
  }
}'
but it results in an error.
Looking forward to your help. Thank you!
The terms lookup mechanism is basically a built-in optimization that saves you from making two queries to JOIN two indices, i.e. one query in index A to get the ids to look up and a second to fetch the documents with those ids in index B.
Contrary to SQL, such a JOIN can only work on the id field, since this is the only way to uniquely retrieve a document from Elasticsearch via a GET call, which is exactly what Elasticsearch does in the terms lookup.
So to answer your question: the terms lookup mechanism will not work on any field other than the id field, since the document to be retrieved first must be unique. In your case, ES would not know how to fetch the document for the user with name Jane, since name is just a field present in the user document and in no way a unique identifier for the user Jane.
I think you did not understand exactly how this works. A terms lookup query works by reading values from a field of a document with the given id. In this case, you are trying to match the value of the field user in the tweets index against the values of the field followers in the document with id "2" in the users index and user type.
If you want to read from any other field, then simply mention that field in "path" (see the sketch below).
What you mainly need to understand is that the lookup values are all fetched from a field of a single document, not from multiple documents.
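As a sketch of that last point: if the user document with id "2" also had, say, a friends field (hypothetical), the same lookup could read that field instead, while the lookup document itself is still fetched by id:

GET /tweets/_search
{
  "query" : {
    "terms" : {
      "user" : {
        "index" : "users",
        "type" : "user",
        "id" : "2",
        "path" : "friends"
      }
    }
  }
}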
