Removing duplicates in Elasticsearch cross cluster search

I'm using cross cluster search and searching for a document by _id that exists in both clusters.
ES returns 2 hits (1 in the local index, 1 in the remote index). I just want the one in the local index. How can I remove the duplicate from the remote cluster?
Query :
{
"query": {
"terms": {
"_id": [ "123"]
}
}
}

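You should be able to achieve this by using field collapsing on the _id field and defining a sort condition under which documents from your local cluster rank higher (e.g. a cluster ID or a timestamp). Note that collapsing requires a single-valued keyword or numeric field with doc values, so if collapsing on _id itself is rejected, you may need to index a copy of the ID into a keyword field and collapse on that.
(see Elasticsearch Reference: Field Collapsing)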
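If collapsing is not an option, the duplicates can also be dropped on the client after the search returns. The following is a minimal sketch of that fallback (it is not field collapsing), relying on the fact that cross-cluster search prefixes the _index of remote hits with the cluster alias. The function name, the alias remote_cluster and the index name my_index are made up for illustration.

```python
def prefer_local_hits(hits):
    """Keep one hit per _id, preferring hits from the local cluster.

    In cross-cluster search responses, hits from a remote cluster carry an
    _index of the form "<cluster_alias>:<index>", while local hits have no
    cluster prefix.
    """
    best = {}
    for hit in hits:
        is_remote = ":" in hit["_index"]
        current = best.get(hit["_id"])
        # Take the first hit seen for an _id; replace it only when a local
        # hit supersedes a remote one.
        if current is None or (":" in current["_index"] and not is_remote):
            best[hit["_id"]] = hit
    return list(best.values())


# Hypothetical hits: the same _id found locally and on "remote_cluster".
hits = [
    {"_index": "remote_cluster:my_index", "_id": "123", "_source": {}},
    {"_index": "my_index", "_id": "123", "_source": {}},
]
deduped = prefer_local_hits(hits)
```

Here deduped holds only the hit from the local my_index. This only deduplicates the hits of a single response page, so totals and pagination still count both copies.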

Related

Moving data from one Elasticsearch index to another with higher number of shards or increasing shard number in existing index

I am new to Elasticsearch and I have been reading the documentation to find a way to increase the number of shards in my index. Currently my index looks like this:
country_data 0 p STARTED 227 100.7kb 192.168.0.115 $HOSTNAME
country_data 0 r STARTED 227 100.7kb 192.168.0.116 $HOSTNAME
I wanted to increase the number of shards to 5, but I was unable to find a proper way of doing it. I learnt from another Stack Overflow question that I should be able to do it like this:
POST _reindex?slices=5
{
"source": {
"index": "country_data"
},
"dest": {
"index": "country_data_new"
}
}
However, when I did that I got a copy of my country_data with the same number of shards and replicas (1 and 1). I tried to learn more about it in the documentation but all I found is this: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/option_slices.html
I couldn't find anything in the documentation about increasing the number of shards in an existing index, or about how to move data to a new index with more shards. I would be grateful for any insights into this problem, or at least a website where I could learn how to do it.

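This can be done in either of the following ways. Note that the slices parameter of _reindex only parallelizes the copy; it does not change the number of shards in the destination index.
1st option: Use the Elasticsearch Split Index API.
I suggest reading its documentation before proceeding with this method, since the source index has to meet some prerequisites (for example, it must be made read-only first).
2nd option: Create a new index with the same mappings and the required number of shards in its settings. Then use the Reindex API to copy the data from the source index to the destination index.
To create the new index: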
PUT /<NEW_INDEX_NAME>
{
"settings": {
"number_of_shards": <REQUIRED_NUMBER_OF_SHARDS>
},
"mappings": {<MAPPINGS_OF_SOURCE_INDEX>}
}
If you don't specify the number of shards in the settings when creating an index, Elasticsearch (since 7.0) creates it with one primary shard and one replica by default.
To reindex from the source index to the newly created index:
POST _reindex
{
"source": {
"index": "<SOURCE_INDEX_NAME>"
},
"dest": {
"index": "<NEW_INDEX_NAME>"
}
}
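The second option can be scripted as two request bodies sent in order. The sketch below only builds and prints them; it assumes the index names from the question, and the helper name and the example mappings are hypothetical (the real mappings would be copied from the source index).

```python
import json


def build_resharding_requests(source, dest, shards, mappings):
    """Return (method, path, body) tuples: create the index, then reindex."""
    create_index = {
        "settings": {"number_of_shards": shards},
        "mappings": mappings,
    }
    reindex = {"source": {"index": source}, "dest": {"index": dest}}
    return [
        ("PUT", f"/{dest}", create_index),
        ("POST", "/_reindex", reindex),
    ]


# Hypothetical mappings; copy the real ones from GET /country_data/_mapping.
example_mappings = {"properties": {"name": {"type": "keyword"}}}
requests_to_send = build_resharding_requests(
    "country_data", "country_data_new", 5, example_mappings
)
for method, path, body in requests_to_send:
    print(method, path)
    print(json.dumps(body, indent=2))
```

The order matters: if the destination index does not exist when _reindex runs, it is created with default settings, which is how the copy in the question ended up with a single shard.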

AWS elasticsearch disable replication of all indices

I am using a single-node AWS ES cluster. Currently, its health status is yellow, which is expected because there is no other node to which Amazon ES can assign a replica. I want to set the replication of all my current and upcoming indices to 0. I have indices created in this pattern:
app-one-2021.02.10
app-two-2021.01.11
so on...
These indices currently have number_of_replicas set to 1. To disable replication for all of them I am sending a PUT request with an index pattern:
PUT /app-one-*/_settings
{
"index" : {
"number_of_replicas":0
}
}
Since I am using a wildcard here, it should set number_of_replicas to 0 in all the matching indices, which it does successfully.
But if a new index is created in the future, let's say app-one-2021.03.10, then number_of_replicas is set to 1 again in this index.
Every time I have to run a PUT request to set number_of_replicas to 0, which is tedious. Why don't new indices automatically get number_of_replicas set to 0 even though I am using a wildcard (*) in my PUT request?
Is there any way to set number_of_replicas to 0 once and for all, regardless of whether it's a new index or an old one? How can I achieve this?

Yes, the way is to define index templates.
Before Elasticsearch v7.8, you could only use the _template API (see docs). E.g., in your case, you can create a template matching all the app-* indices:
PUT _template/app_settings
{
"index_patterns": ["app-*"],
"settings": {
"number_of_replicas": 0
}
}
Since Elasticsearch v7.8, the old API is still supported but deprecated, and you can use the _index_template API instead (see docs).
PUT _index_template/app_settings
{
"index_patterns": ["app-*"],
"template": {
"settings": {
"number_of_replicas": 0
}
}
}
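As a rough sketch, the choice between the two APIs can be made from the cluster version. The helper name and the version tuples below are illustrative assumptions; the paths and bodies mirror the two requests above.

```python
def replica_template_request(name, patterns, version):
    """Build (method, path, body) for a template that disables replicas.

    `version` is a (major, minor) tuple of the Elasticsearch version.
    """
    settings = {"number_of_replicas": 0}
    if version >= (7, 8):
        # Composable index templates nest settings under "template".
        body = {"index_patterns": patterns, "template": {"settings": settings}}
        return "PUT", f"/_index_template/{name}", body
    # Legacy templates take the settings at the top level.
    body = {"index_patterns": patterns, "settings": settings}
    return "PUT", f"/_template/{name}", body


method, path, body = replica_template_request("app_settings", ["app-*"], (7, 10))
```

A template only applies to indices created after it is stored, so existing indices still need the one-off PUT on _settings from the question.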

elastic search fulltext search on multiple index

Designing an index layout for Elasticsearch:
I have 10 tables in my MySQL database (news, emails, etc.) which I would sync into Elasticsearch, and I want to search across all of these tables in one go.
There are no relationships between the tables, and all of them have a txt field. I just want to search in the txt field, so should I have multiple indices or just one index?
How should I organize my indices?
Option 1: Should I have just one Elasticsearch index (with an attribute for the table type) for all the tables
OR
Option 2: Should I have multiple Elasticsearch indices, one per table
Considering:
I want to run a combined query across multiple data sources, ordered by hits. Example: search all email + news
or a single query that searches only email or only news

Have multiple indices and query any number of them at any given time:
POST emails/_doc
{
"txt": "abc"
}
POST news/_doc
{
"txt": "ab"
}
GET emails,news/_search
{
"query": {
"query_string": {
"default_field": "txt",
"query": "ab OR abc"
}
}
}
Wildcard index names are supported too in case you've got, say, timestamp-bucketed names such as emails_2020, emails_2019 etc:
GET em*,ne*/_search
...

You could also use the multi search (_msearch) API to search multiple indices:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html
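The _msearch request body is newline-delimited JSON: one header line naming the index, followed by one line with the search body, repeated per search and terminated by a newline. A minimal sketch of building such a body, reusing the emails and news indices from the answer above (the helper name is made up):

```python
import json


def build_msearch_body(searches):
    """Serialize (index, query) pairs into an NDJSON _msearch body."""
    lines = []
    for index, query in searches:
        lines.append(json.dumps({"index": index}))   # header line
        lines.append(json.dumps({"query": query}))   # search body line
    # The body must end with a newline character.
    return "\n".join(lines) + "\n"


query = {"query_string": {"default_field": "txt", "query": "ab OR abc"}}
body = build_msearch_body([("emails", query), ("news", query)])
print(body)
```

Note that _msearch returns one response per search rather than a single merged ranking, so a combined result list ordered by score is easier to get from the comma-separated index form shown above.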

Elastic search string filter - does such option exists?

I am wondering, is there such an option like string filter?
I've recently bumped into the following error:
RequestError(400, 'search_phase_execution_exception',
'too_many_clauses: maxClauseCount is set to 1024')
According to Lucene's documentation, it says:
Use a filter to replace the part of the query that causes the exception.
Do you have any ideas?

The Lucene FAQ mentions a few approaches to overcoming the TooManyClauses exception, but they don't apply directly to Elasticsearch: it used to have a separate terms filter, which is now part of the terms query itself.
Below is an example of how you could use terms in the filter context:
{
"query": {
"bool": {
"filter": [
{ "terms": { "user": ["kimchy", "elasticsearch"] } }
]
}
}
}
If you really need to use a query instead of a filter, you can set indices.query.bool.max_clause_count: n in the elasticsearch.yml file of each node of the cluster (replace n with the clause count that you need) and restart the cluster.
Note that this will increase the memory requirements for searches that expand to many terms.
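When the list of values comes from application code, the filter above can be generated instead of written by hand. A small sketch, with a made-up helper name, that also drops repeated values before building the request body:

```python
def terms_filter_query(field, values):
    """Build a search body that matches `field` against any of `values`.

    The terms clause sits in bool/filter, so it is evaluated in filter
    context and does not contribute to scoring.
    """
    unique_values = list(dict.fromkeys(values))  # de-duplicate, keep order
    return {
        "query": {
            "bool": {
                "filter": [{"terms": {field: unique_values}}]
            }
        }
    }


body = terms_filter_query("user", ["kimchy", "elasticsearch", "kimchy"])
```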

Application-side Joins Elasticsearch

I have two indexes in Elasticsearch, a system index, and a telemetry index. I'd like to perform queries and aggregations on the telemetry index using filters from the systems index. The systems index is relatively small and only receives new documents occasionally, but the telemetry index is much larger and is constantly receiving new documents. This seems like an ideal situation for using an application-side join.
I tried emulating the example query at the previous link, but it turns out the filtered query is deprecated as of ES 5.0. (Why is this example in the current documentation?!)
Here are my queries:
GET /system/_search
{
"query": {
"match": {
"name": "George's system"
}
}
}
GET /telemetry/_search
{
"query": {
"bool":{
"must": {
"multi_match": {
"operator": "and",
"fields": ["systemId"]
, [1] }
}
}
}
}
}
The second one fails with a json_parse_exception because for some reason it doesn't like the [ ] characters after "fields".
Can anyone provide a simple example of using application-side joins?
Once such a query is defined (perhaps in Kibana's Dev Tools console) is there a way to visualize it in Kibana?

With Elasticsearch there is no way to execute two nested queries like in a relational database, where one query uses the response of the other. The application-side join in the example means that you actually make two queries (two separate requests to Elasticsearch) from the application.
With the first query you get the list of IDs you need to filter on.
With the second query you pass the list of IDs that you got to the terms filter.
This works as long as the number of systemId values stays within the limit of the terms query (in recent versions this is index.max_terms_count, 65,536 by default).
Because such a join cannot be expressed as a single Elasticsearch query, you can't visualize it in Kibana.
In that case you have to sacrifice a little space and add the systemId to your mapping.
Good Luck!
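The two steps can be sketched as follows. This is only an illustration under assumptions: search stands for any callable that takes an index name and a request body and returns the parsed response (for example a thin wrapper around your client), and systemId in the telemetry index is assumed to hold the _id of the matching system document. The stub fake_search and its documents are invented so the sketch runs without a cluster.

```python
def join_telemetry_on_systems(search, system_name):
    """Application-side join: look up system IDs, then filter telemetry."""
    # Step 1: find the matching systems and collect their IDs.
    systems = search("system", {"query": {"match": {"name": system_name}}})
    system_ids = [hit["_id"] for hit in systems["hits"]["hits"]]

    # Step 2: pass the collected IDs to a terms filter on telemetry.
    telemetry_query = {
        "query": {"bool": {"filter": [{"terms": {"systemId": system_ids}}]}}
    }
    return search("telemetry", telemetry_query)


def fake_search(index, body):
    """Stand-in for a real search call, returning canned responses."""
    if index == "system":
        return {"hits": {"hits": [{"_id": "1"}, {"_id": "7"}]}}
    wanted = body["query"]["bool"]["filter"][0]["terms"]["systemId"]
    docs = [
        {"_id": "a", "_source": {"systemId": "1"}},
        {"_id": "b", "_source": {"systemId": "2"}},
    ]
    matching = [d for d in docs if d["_source"]["systemId"] in wanted]
    return {"hits": {"hits": matching}}


result = join_telemetry_on_systems(fake_search, "George's system")
```

Aggregations on telemetry can be added to the second request body in the same way, since the terms filter only narrows the documents they run over.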
