Differentiating _delete_by_query tasks in a multi-tenant index - elasticsearch

Scenario:
I have an index with a bunch of multi-tenant data in Elasticsearch 6.x. This data is frequently deleted (via _delete_by_query) and populated by the tenants.
When issuing a _delete_by_query request with wait_for_completion=false, supplying a query JSON to delete a tenant's data, I am able to see generic task information via the _tasks API. The problem is, with a large number of tenants, it is not immediately clear who is deleting data at any given time.
My question is this:
Is there a way I can view the query that a _delete_by_query task is operating on? Or can I attach an additional param to the URL that is cached in the task to differentiate them?
Side note: looking at the docs (https://www.elastic.co/guide/en/elasticsearch/reference/6.6/tasks.html) I see there is a description field in the _tasks API response that contains the query as a string; however, I do not see that level of detail in my description field:
"description" : "delete-by-query [myindex]"
Thanks in advance

One way to identify these tasks is to add the X-Opaque-Id HTTP header to your requests:
For instance, when deleting all tenant data for, say, user 3, you can issue the following command:
curl -XPOST -H 'X-Opaque-Id: 3' -H 'Content-Type: application/json' "http://localhost:9200/my-index/_delete_by_query?wait_for_completion=false" -d '{"query":{"term":{"user": 3}}}'
You then get a task ID, and when checking the related task document, you'll be able to identify which task is/was deleting which tenant data thanks to the headers section which contains your HTTP header:
"_source" : {
"completed" : true,
"task" : {
"node" : "DB0GKYZrTt6wuo7d8B8p_w",
"id" : 20314843,
"type" : "transport",
"action" : "indices:data/write/delete/byquery",
"status" : {
"total" : 3,
"updated" : 0,
"created" : 0,
"deleted" : 3,
"batches" : 1,
"version_conflicts" : 0,
"noops" : 0,
"retries" : {
"bulk" : 0,
"search" : 0
},
"throttled_millis" : 0,
"requests_per_second" : -1.0,
"throttled_until_millis" : 0
},
"description" : "delete-by-query [deletes]",
"start_time_in_millis" : 1570075424296,
"running_time_in_nanos" : 4020566,
"cancellable" : true,
"headers" : {
"X-Opaque-Id" : "3" <--- user 3
}
},
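If you want to check a specific task directly, the task management API accepts the node:id pair from the response; a minimal sketch using the IDs from this example:
curl -XGET "http://localhost:9200/_tasks/DB0GKYZrTt6wuo7d8B8p_w:20314843?pretty"
And to spot all in-flight delete-by-query tasks at once, a detailed task listing should also surface the headers section (the actions value is a wildcard match on the action name shown above):
curl -XGET "http://localhost:9200/_tasks?detailed=true&actions=*delete/byquery&pretty"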

Related

How to form index stats API?

ES Version : 7.10.2
I have a requirement to show index statistics. I have come across the Index Stats API, which does fulfill my requirement.
But the issue is that I don't necessarily need all the fields for a particular metric.
Ex: curl -XGET "http://localhost:9200/order/_stats/docs"
It shows a response as below (omitted for brevity):
"docs" : {
"count" : 7,
"deleted" : 0
}
But I only want "count" not "deleted" field, from this.
So, in Index Stats API documentation, i came across a query param as :
fields:
(Optional, string) Comma-separated list or wildcard expressions of fields to include in the statistics.
Used as the default list unless a specific field list is provided in the completion_fields or fielddata_fields parameters
As per the above, when I perform curl -XGET "http://localhost:9200/order/_stats/docs?fields=count"
It throws an exception
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "request [/order/_stats/docs] contains unrecognized parameter: [fields]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "request [/order/_stats/docs] contains unrecognized parameter: [fields]"
  },
  "status" : 400
}
Am I understanding the usage of fields correctly?
If yes/no, how can I achieve the above requirement?
Any help is much appreciated :)
You can use the filter_path argument, like:
curl -XGET "http://localhost:9200/order/_stats?filter_path=_all.primaries.docs.count"
This will return only that one field, like:
{
  "_all" : {
    "primaries" : {
      "docs" : {
        "count" : 10
      }
    }
  }
}
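Note that filter_path also works together with the metric-scoped endpoint from the question, so you can keep the /docs path and still trim the response; a sketch against the same order index:
curl -XGET "http://localhost:9200/order/_stats/docs?filter_path=_all.primaries.docs.count&pretty"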

Scroll id returned by Scroll API is too long

I'm trying to use the Scroll API to fetch 100K records from Kibana logs. The default size is set to 500 and I do not have authorization to change it. I tried scrolling using the command below:
curl -XPOST "http://elasticsearch.us-central1.gcp.cloud.internal/shared/_search?scroll=1m&size=500&pretty" -H "Content-Type: application/json" -d '{
"_source": ["message"],
"query": {
"match_phrase": {
"kubernetes.container_name": {
"query": "my-container-name"
}
}
}
}'
The output looks something like this:
{
  "scroll_id" : "DnF1ZXJ5VGhlbkZldGNo5TkAA....",   // 300,000 characters long
  "took" : 16626,
  "timed_out" : false,
  "_shards" : {
    "total" : 7397,
    "successful" : 7397,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    .....
Since the scroll id is too long, I cannot pass it on to the Scroll API to fetch the next batch of results. How can I resolve this? Is this due to the large number of shards, and is there any way to limit the number of shards?
Based on the discussion on ES Community, there seems to be a direct relation between the length of the scroll_id and the number of shards in the index.
The recommendation is to pass the scroll_id in the request body, e.g.:
POST /_search/scroll
{
  "scroll" : "1m",
  "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
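If the scroll_id is too long to paste on the command line, one way around shell argument-length limits is to keep the request body in a file and let curl read it; the file name scroll.json is just illustrative:
curl -XPOST "http://elasticsearch.us-central1.gcp.cloud.internal/_search/scroll?pretty" -H "Content-Type: application/json" --data-binary @scroll.json
where scroll.json contains the {"scroll": "1m", "scroll_id": "..."} body shown above.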
is there any way to limit the number of shards?
You can create a new index with a smaller number of shards,
then reindex the data into the new index using the Reindex API, as sketched below.
There is no way to reduce the number of shards of an existing index.
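A minimal sketch of that approach, assuming the source index is shared; the new index name shared-v2 and the shard count are just illustrative:
PUT /shared-v2
{
  "settings" : {
    "index.number_of_shards" : 5
  }
}

POST /_reindex
{
  "source" : { "index" : "shared" },
  "dest" : { "index" : "shared-v2" }
}
Once the reindex finishes, searches (and scrolls) against shared-v2 should return much shorter scroll IDs, since far fewer shards are involved.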

Cosmos DB Collection not using _id index when querying by _id?

I have a Cosmos DB (MongoDB API) collection that I'm using purely as a key/value store for arbitrary data, where the _id is the key for my collection.
When I run the query below:
globaldb:PRIMARY> db.FieldData.find({_id : new BinData(3, "xIAPpVWVkEaspHxRbLjaRA==")}).explain(true)
I get this result:
{
  "_t" : "ExplainResponse",
  "ok" : 1,
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "data.FieldData",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "$and" : [ ]
    },
    "winningPlan" : {
    },
    "rejectedPlans" : [ ]
  },
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 1,
    "executionTimeMillis" : 106,
    "totalKeysExamined" : 0,
    "totalDocsExamined" : 3571,
    "executionStages" : {
    },
    "allPlansExecution" : [ ]
  },
  "serverInfo" : #REMOVED#
}
Notice that totalKeysExamined is 0, totalDocsExamined is 3571, and the query took over 106 ms. If I run it without .explain(), it does find the document.
I would have expected this query to be lightning quick given that the _id field is automatically indexed as a unique primary key on the collection. As this collection grows in size, I only expect this problem to get worse.
I'm definitely not understanding something about the index and how it works here. Any help would be most appreciated.
Thanks!

How to store nested document as String in elastic search

Context:
1) We are building a CDC pipeline (using Kafka & the Connect framework)
2) We are using Debezium for capturing MySQL transaction logs
3) We are using the Elasticsearch connector to add documents to the ES index
Sample change event generated by Debezium:
{
  "source" : {
    "before" : {
      "Id" : 97,
      "name" : "Northland",
      "code" : "NTL",
      "country_id" : 6,
      "is_business_mapped" : 0
    },
    "after" : {
      "Id" : 97,
      "name" : "Northland",
      "code" : "NTL",
      "country_id" : 6,
      "is_business_mapped" : 1
    },
    "source" : {
      "version" : "0.7.5",
      "name" : "__",
      "server_id" : 252639387,
      "ts_sec" : 1547805940,
      "gtid" : null,
      "file" : "mysql-bin-changelog.000570",
      "pos" : 236,
      "row" : 0,
      "snapshot" : false,
      "thread" : 614,
      "db" : "bazaarify",
      "table" : "state"
    },
    "op" : "u",
    "ts_ms" : 1547805939683
  }
}
What we want:
We want to visualize only 3 columns in Kibana:
1) before - containing the nested JSON as a string
2) after - containing the nested JSON as a string
3) source - containing the nested JSON as a string
I can think of the following possibilities here:
a) Converting the nested JSON to a string before indexing
b) Combining the column data in Elasticsearch
I am a newbie to Elasticsearch. Can someone please guide me on how to do this?
I tried defining a custom mapping as well, but it gives me an exception.
You can always view your document as raw JSON in Kibana.
You don't need to manipulate it before indexing it in Elasticsearch.
As this is related to visualization, handle this in Kibana only.
Check this link for a screenshot.
Refer to this to add the columns which you want to see in the results.
I don't fully understand your use case, but if you would like to turn some JSON objects into their string representations, then you can use Logstash for that, or even Elasticsearch ingest capabilities to convert an object (JSON) to a string.
From the link above, an example:
PUT _ingest/pipeline/my-pipeline-id
{
  "description": "converts the content of the source field to a string",
  "processors" : [
    {
      "convert" : {
        "field" : "source",
        "type": "string"
      }
    }
  ]
}
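If you go the ingest route, you can reference the pipeline when indexing a change event; here the index name change-events and the document ID are just illustrative, and the pipeline is the one defined above:
PUT change-events/_doc/1?pipeline=my-pipeline-id
{
  "before" : { "Id" : 97, "is_business_mapped" : 0 },
  "after" : { "Id" : 97, "is_business_mapped" : 1 },
  "source" : { "db" : "bazaarify", "table" : "state" }
}
The intent is that the source object gets stored as a string; you would add similar convert processors for before and after if all three columns need the same treatment.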

Elasticsearch - how to return only data, not meta information?

When doing a search, Elasticsearch returns a data structure that contains various meta information.
The actual result set is contained within a "hits" field within the JSON result returned from the database.
Is it possible for Elasticsearch to return only the needed data (the contents of the "hits" field) without it being embedded within all the other metadata?
I know I could parse the result into JSON and extract it, but I don't want the complexity, hassle, or performance hit.
Thanks!
Here is an example of the data structure that Elasticsearch returns.
{
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "hits" : [
      {
        "_index" : "twitter",
        "_type" : "tweet",
        "_id" : "1",
        "_source" : {
          "user" : "kimchy",
          "postDate" : "2009-11-15T14:12:12",
          "message" : "trying out Elastic Search"
        }
      }
    ]
  }
}
You can at least filter the results, even if you cannot extract them. The "common options" page of the REST API explains the "filter_path" option. This lets you filter only the portions of the tree you are interested in. The tree structure is still the same, but without the extra metadata.
I generally add the query option:
&filter_path=hits.hits.*,aggregations.*
The documentation doesn't say anything about this making your query any faster (I doubt that it does), but at least you could return only the interesting parts.
Corrected to show only hits.hits.*, since the top level "hits" has metadata as well.
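For example, applied to the twitter example from the question (index and field names as shown there), the following request returns only the documents themselves, i.e. hits.hits[]._source, with the surrounding metadata stripped:
curl -XGET "http://localhost:9200/twitter/_search?filter_path=hits.hits._source&pretty"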
No, it's not possible at the moment. If performance and the complexity of parsing are the main concerns, you might want to consider using different clients: the Java client or the Thrift plugin, for example.
