Query to compare two columns from two different indexes - elasticsearch

I have a column in one index with a number of countries in it, and I want to check whether these countries are similar to or the same as the countries in a column in another index.
So in one index we have user data with the countries the user has specified, and in the other index we have the master data with the actual countries. Now I want to check whether the countries entered by the user are the same as the ones in the master data.
If anybody knows how to write a query for this in Kibana, kindly help. This is what I tried:
GET final,master/_count
{
  "query": {
    "bool": {
      "must": [
        {
          "script": {
            "script": "doc['A_OPERATINGCOUNTRY'].value == doc['AD_Country Name.keyword'].value"
          }
        }
      ]
    }
  }
}

You need to manage this outside Elasticsearch. Besides, since the incoming data doesn't have the country name you want, why do you want to check this in Elasticsearch? The form you are using should provide exactly the value you want to index.

You can use the terms query with a lookup document.
Suppose we have an index of page access logs like so:
PUT /mybeat-2018/_doc/1
{
  "host" : "elastic.co",
  "ttl" : 40
}
PUT /mybeat-2018/_doc/2
{
  "host" : "elastic.co",
  "ttl" : 666
}
PUT /mybeat-2018/_doc/3
{
  "host" : "google.com",
  "ttl" : 55
}
and an independent whitelist that can shrink or grow, with a bunch of hosts:
PUT /whitelist/_doc/1
{
  "hosts" : [
    { "name" : "elastic.co" },
    { "name" : "twitter.com" }
  ]
}
Then a search on mybeat-* for whatever is in the whitelist should reference the whitelist document (in our case the document with id 1) like so:
GET /mybeat-*/_search
{
  "query" : {
    "terms" : {
      "host" : {
        "index" : "whitelist",
        "type" : "_doc",
        "id" : "1",
        "path" : "hosts.name"
      }
    }
  }
}
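With the sample documents above, this search should match documents 1 and 2 (host elastic.co, which appears in the whitelist) and skip document 3 (google.com, which doesn't). And since the whitelist can shrink or grow, it's enough to re-index the lookup document whenever the list changes; the next search picks up the new list automatically. A minimal sketch (the added host is made up for illustration):
PUT /whitelist/_doc/1
{
  "hosts" : [
    { "name" : "elastic.co" },
    { "name" : "twitter.com" },
    { "name" : "github.com" }
  ]
}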

Related

Protecting data in elastic search

I have an Elasticsearch engine running locally with an index which contains data from multiple customers. When a customer makes a query, is there a way to dynamically add the customer id to the filtering criteria so a customer cannot access the records of other customers?
Yes, you can achieve that using filtered aliases. So you'd create one alias per customer like this:
POST /_aliases
{
  "actions" : [
    {
      "add" : {
        "index" : "customer_index",
        "alias" : "customer_1234",
        "filter" : { "term" : { "customer_id" : "1234" } }
      }
    }
  ]
}
Then your customer can simply query the alias customer_1234 and only their data is going to come back.
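For instance, a minimal sketch of such a query (the match_all body is just a placeholder for the customer's real query):
GET /customer_1234/_search
{
  "query" : { "match_all" : {} }
}
Elasticsearch resolves the alias to customer_index and silently applies the term filter on customer_id, so the customer never sees anyone else's documents.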

Aggregation on different document type

I'm trying to solve a problem here, and I'm not even sure it can be done the way I want it to be done.
Basically I need to query my index on the type Company, which has a field named Company_Name. I also have another type, Employee, that has a Company_Name field as well, which basically "links" the employee to the Company.
What I need to do is add an aggregation to my query of companies that gets me the number of employees within each company (for each result).
Is there any way to do this?
EDIT :
The mapping:
"company" : {
  "properties" : {
    "company_name" : {
      "type" : "text"
    }
  }
},
"employee" : {
  "properties" : {
    "employee_name" : {
      "type" : "text"
    },
    "company_name" : {
      "type" : "text"
    }
  }
}
Not sure if I fully understood what you want, but you mentioned:
Now I have another type which is Employee that does have a field named Company_Name
And then you said you want the number of employees that are within the company.
So to me it looks like all you need is an aggregation on the Employee doc type by company.
If you really need to query both types, you can apply a multi-type search and run the aggregation on the result of that search.
This or something close to this should work:
http://localhost:9200/$index/employee/_search?search_type=count
{
  "aggs" : {
    "company" : {
      "terms" : {
        "field" : "company_name"
      }
    }
  }
}
One thing you should notice is that if your company_name field is analyzed, it will be tokenized and you can't meaningfully aggregate on it. You should then add a raw (not analyzed) sub-field with the company name and use that for your aggregation.
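A minimal sketch of such a mapping, using a keyword sub-field (the sub-field name raw is just a convention; on older releases you'd use a not_analyzed string instead):
"company_name" : {
  "type" : "text",
  "fields" : {
    "raw" : { "type" : "keyword" }
  }
}
The aggregation above would then target "field" : "company_name.raw" so each company name is treated as a single, untokenized value.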
Some useful references:
https://www.elastic.co/guide/en/elasticsearch/guide/current/relations.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/_closing_thoughts.html

Elastic search Update by Query to Update Complex Document

I have a use case of Elasticsearch to update a doc.
My doc is something like this:
{
  "first_name" : "firstName",
  "last_name" : "lastName",
  "version" : 1234,
  "user_roles" : {
    "version" : 12345,
    "id" : 1234,
    "name" : "role1"
  },
  "groups" : {
    "version" : 123,
    "list" : [
      { "id" : 123, "name" : "ashd" },
      { "id" : 1234, "name" : "awshd" }
    ]
  }
}
Now, depending on some feed, I will be updating either the parent doc or the nested doc.
I am able to figure out how to update basic attributes like first_name and last_name, but I can't work out how to update the complex/nested ones.
From a REST client I did something like:
"script": {
"inline": "ctx._source.user_roles = { "id" : 5678, "name" :"hcsdl"}
}
but it's giving me an exception.
Actual use case:
I will actually be getting a Map in Java.
A key can be a simple key like "first_name" or a complex key like "user_roles" or "groups".
I want to update the document using update-by-query, matching on version.
The code I wrote is something like:
for (String key : document.keySet()) {
    // serialize the value to a JSON string and splice it into the script
    String value = defaultObjectMapper.writeValueAsString(document.get(key));
    // builds e.g. ctx._source.first_name="firstName";
    scriptBuilder.append("ctx._source.");
    scriptBuilder.append(key);
    scriptBuilder.append('=');
    scriptBuilder.append(value);
    scriptBuilder.append(";");
}
where document is the Map.
The incoming keys might be simple fields or complex objects.
I tried keys like user_roles.id and user_roles.name, and also tried passing the complete user_roles object, but nothing is working.
Can someone help out?
Try this, with Groovy maps instead of verbatim JSON inside your script:
"script": {
"inline": "ctx._source.user_roles = [ 'id' : 5678, 'name' : 'hcsdl']}
}
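Groovy map literals use square brackets, and the single quotes avoid clashing with the JSON double quotes that wrap the script. As a complete request, a minimal sketch assuming the index is called myindex and that the version field identifies the doc to update:
POST /myindex/_update_by_query
{
  "script" : {
    "inline" : "ctx._source.user_roles = [ 'id' : 5678, 'name' : 'hcsdl' ]"
  },
  "query" : {
    "term" : { "version" : 1234 }
  }
}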

ElasticSearch Filtered Aliases Creation - Best Practice

We are planning to use Filtered Aliases as mentioned here - https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html
Our input data is going to be a stream with each line of the stream corresponding to an object we would like to store in ES.
Each object contains an 'id', which we are using for routing and filtering.
QUESTION:
How do we create aliases and index data in a performant way?
-- Do we index all the data, keep track of all the unique 'id's, and at the very end create the filtered aliases? OR
-- For each object, check if an alias for that 'id' exists; if it doesn't, create one?
I'm leaning towards the first approach. Is it advisable and performant compared to the second approach?
TIA.
Based on our discussion above and after having glanced over the blog article you posted, I'm pretty positive that in your case you don't need aliases at all and the routing key would suffice. Again, this is only because you have a single index; if you had many indices this would not be true anymore!
You simply need to specify the routing key to use when indexing your document. Until ES 2.0 you can use the _routing field for that purpose; it's been deprecated since ES 1.5, but in your case it serves the purpose.
{
  "customer" : {
    "_routing" : {
      "required" : true,
      "path" : "customer_id"    <----- the field you use as the routing key
    },
    "properties" : { ... }
  }
}
Then when searching you simply need to specify &routing=<customer_id> in your search URL in addition to your customer id filter (since a given shard can host documents for different customers). Your search will go directly to the shard identified by the given routing key, and thus only retrieve data from the specified customer.
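For example, a minimal sketch (the index name customers and the id cid1 are made up for illustration):
curl -XGET 'localhost:9200/customers/_search?routing=cid1' -d '{
  "query" : { "term" : { "customer_id" : "cid1" } }
}'
The routing parameter restricts the search to one shard, and the term filter weeds out any other customers' documents that happen to live on that same shard.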
Using a filtered alias for this brings nothing, as the filter and routing key you'd include in the alias definition would not contribute anything additional: the retrieved documents are already "filtered" (kind of) by the routing key. This is also way easier than trying to detect, for each new document to index, whether an alias exists, and creating it if it doesn't.
UPDATE:
Now if you absolutely have/want to create filtered aliases, the more performant way would be the first one you mentioned:
First index your daily data
Then run a terms aggregation on your customer_id field with a size high enough (i.e. higher than the cardinality of the field, which was ~100 in your case) to make sure you capture all unique customer ids (see the sketch right after this list)
Loop over all the buckets to retrieve all unique customer ids
Create all aliases in one shot using one action for each customer_id
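A minimal sketch of the aggregation from step 2 (the index name customers is assumed; size 200 is comfortably above the ~100 cardinality):
POST /customers/_search?search_type=count
{
  "aggs" : {
    "unique_customer_ids" : {
      "terms" : { "field" : "customer_id", "size" : 200 }
    }
  }
}
With the ids collected from the buckets, the alias creation from step 4 then looks like this: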
curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions" : [
    {
      "add" : {
        "index" : "customers",
        "alias" : "alias_cid1",
        "routing" : "cid1",
        "filter" : { "term" : { "customer_id" : "cid1" } }
      }
    },
    {
      "add" : {
        "index" : "customers",
        "alias" : "alias_cid2",
        "routing" : "cid2",
        "filter" : { "term" : { "customer_id" : "cid2" } }
      }
    },
    {
      "add" : {
        "index" : "customers",
        "alias" : "alias_cid3",
        "routing" : "cid3",
        "filter" : { "term" : { "customer_id" : "cid3" } }
      }
    },
    ...
  ]
}'
Note that you don't have to worry if an alias already exists: the command won't fail, it will simply ignore the existing alias.
When this command has run, you'll have all your aliases on your unique index, properly configured with a filter and a routing key.

Elasticsearch querying alias with routing giving partial results

In an effort to create a multi-tenant architecture for my project, I've created an Elasticsearch cluster with an index 'tenant':
"tenant" : {
"some_type" : {
"_routing" : {
"required" : true,
"path" : "tenantId"
},
Now, I've also created some aliases:
"tenant" : {
  "aliases" : {
    "tenant_1" : {
      "index_routing" : "1",
      "search_routing" : "1"
    },
    "tenant_2" : {
      "index_routing" : "2",
      "search_routing" : "2"
    },
    "tenant_3" : {
      "index_routing" : "3",
      "search_routing" : "3"
    },
    "tenant_4" : {
      "index_routing" : "4",
      "search_routing" : "4"
    }
  }
}
I've added some data with tenantId = 2
After all that, I tried to query 'tenant_2', but I only got partial results, while querying the 'tenant' index directly returns the full results.
Why is that?
I was sure that routing is supposed to send the query to all the shards that documents with tenantId = 2 reside on.
Once you have created aliases in Elasticsearch, you have to do all operations through the aliases only, be it indexing, updating, or searching. Documents indexed directly into the index without the routing key land on arbitrary shards, so a routed search through the alias only looks at one shard and misses the rest.
Try reindexing the data and check, if possible (I hope it is a test index).
Remove all the indices:
curl -XDELETE 'localhost:9200/_all' # Warning!! Don't use this in production.
Use this command only if it is a test index.
Create the index again, create the aliases again, and do all indexing, search, and delete operations on the alias name. Even the import of data should be done via the alias name.
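A minimal sketch of indexing and searching through the alias (the doc id and field value are made up; some_type and tenantId come from your mapping):
PUT /tenant_2/some_type/1
{
  "tenantId" : "2",
  "name" : "some tenant data"
}
GET /tenant_2/_search
Because the alias carries index_routing : "2", the document is stored on exactly the shard that search_routing : "2" will later query, so routed searches return complete results.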
