Elasticsearch - how to return only data, not meta information?

When doing a search, Elasticsearch returns a data structure that contains various meta information.
The actual result set is contained within a "hits" field within the JSON result returned from the database.
Is it possible for Elasticsearch to return only the needed data (the contents of the "hits" field) without it being embedded within all the other metadata?
I know I could parse the result as JSON and extract it, but I don't want the complexity, hassle, or performance hit.
thanks!
Here is an example of the data structure that Elasticsearch returns.
{
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "hits" : [
      {
        "_index" : "twitter",
        "_type" : "tweet",
        "_id" : "1",
        "_source" : {
          "user" : "kimchy",
          "postDate" : "2009-11-15T14:12:12",
          "message" : "trying out Elastic Search"
        }
      }
    ]
  }
}

You can at least filter the results, even if you cannot extract them. The "common options" page of the REST API explains the "filter_path" option. This lets you filter only the portions of the tree you are interested in. The tree structure is still the same, but without the extra metadata.
I generally add the query option:
&filter_path=hits.hits.*,aggregations.*
The documentation doesn't say anything about this making your query any faster (I doubt that it does), but at least you could return only the interesting parts.
Corrected to show only hits.hits.*, since the top level "hits" has metadata as well.
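For example, to return just the documents' sources from a search against the twitter index shown above, a request along these lines should work:
curl 'localhost:9200/twitter/_search?filter_path=hits.hits._source&pretty'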

No, it's not possible at the moment. If performance and the complexity of parsing are the main concerns, you might want to consider using a different client: the Java client or the Thrift plugin, for example.

Related

Differentiating _delete_by_query tasks in a multi-tenant index

Scenario:
I have an index with a bunch of multi-tenant data in Elasticsearch 6.x. This data is frequently deleted (via _delete_by_query) and populated by the tenants.
When issuing a _delete_by_query request with wait_for_completion=false, supplying a query JSON to delete a tenant's data, I am able to see generic task information via the _tasks API. The problem is, with a large number of tenants, it is not immediately clear who is deleting data at any given time.
My question is this:
Is there a way I can view the query that the _delete_by_query task is operating on? Or can I attach an additional parameter to the URL that is cached in the task, to differentiate the tasks?
Side note: looking at the docs (https://www.elastic.co/guide/en/elasticsearch/reference/6.6/tasks.html), I see there is a description field in the _tasks API response that contains the query as a string; however, I do not see that level of detail in my description field:
"description" : "delete-by-query [myindex]"
Thanks in advance
One way to identify queries is to add the X-Opaque-Id HTTP header to your queries:
For instance, when deleting all tenant data for (e.g.) User 3, you can issue the following command:
curl -XPOST -H 'X-Opaque-Id: 3' -H 'Content-Type: application/json' 'http://localhost:9200/my-index/_delete_by_query?wait_for_completion=false' -d '{"query":{"term":{"user": 3}}}'
You then get a task ID, and when checking the related task document, you'll be able to identify which task is or was deleting which tenant's data, thanks to the headers section which contains your HTTP header:
"_source" : {
"completed" : true,
"task" : {
"node" : "DB0GKYZrTt6wuo7d8B8p_w",
"id" : 20314843,
"type" : "transport",
"action" : "indices:data/write/delete/byquery",
"status" : {
"total" : 3,
"updated" : 0,
"created" : 0,
"deleted" : 3,
"batches" : 1,
"version_conflicts" : 0,
"noops" : 0,
"retries" : {
"bulk" : 0,
"search" : 0
},
"throttled_millis" : 0,
"requests_per_second" : -1.0,
"throttled_until_millis" : 0
},
"description" : "delete-by-query [deletes]",
"start_time_in_millis" : 1570075424296,
"running_time_in_nanos" : 4020566,
"cancellable" : true,
"headers" : {
"X-Opaque-Id" : "3" <--- user 3
}
},
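You can also spot the header on tasks that are still running by asking the tasks API for detailed delete-by-query task info; something along these lines (output trimmed) should list each running task together with its headers section:
curl 'localhost:9200/_tasks?detailed=true&actions=*/delete/byquery&pretty'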

Attempting to delete all the data for an Index in Elasticsearch

I am trying to delete all the documents, i.e. all the data, from an index. I am using v6.6 along with the Dev Tools in Kibana.
In the past, I have done this operation successfully, but now it is saying 'not found':
{
  "_index" : "new-index",
  "_type" : "doc",
  "_id" : "_query",
  "_version" : 1,
  "result" : "not_found",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 313,
  "_primary_term" : 7
}
Here is my Kibana statement:
DELETE /new-index/doc/_query
{
  "query": {
    "match_all": {}
  }
}
Also, here is the GET operation which verified that the index exists and has data:
GET new-index/doc/_search
I verified that the type is doc, but I can post the whole mapping if needed.
An easier way is to navigate in Kibana to Management -> Index Management, select the indices you would like to delete via the checkboxes, and click Manage index -> Delete index (or Flush index, depending on your need).
I was able to resolve the issue by using a delete by query:
POST new-index/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}
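If you want to double-check that the index is empty afterwards, a quick count against the same index should return 0:
GET new-index/_count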
Deleting documents is a problematic way to clear data.
It is preferable to delete the index itself:
DELETE [your-index]
from the Kibana console, and recreate it from scratch.
An even better way is to create an index template, so that the index is created again along with the first indexed document.
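A minimal sketch of such a template for 6.x (the template name, index pattern, shard count, and sample mapping here are only illustrative):
PUT _template/new-index-template
{
  "index_patterns": ["new-index*"],
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "doc": {
      "properties": {
        "message": { "type": "text" }
      }
    }
  }
}
With this in place, DELETE new-index followed by indexing the first document recreates the index with the desired settings.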
The only solutions currently are to either delete the index itself (faster) or use delete-by-query (slower):
https://www.elastic.co/guide/en/elasticsearch/reference/7.4/docs-delete-by-query.html
POST new-index/_delete_by_query?conflicts=proceed
{
  "query": {
    "match_all": {}
  }
}
The Delete API only removes a single document: https://www.elastic.co/guide/en/elasticsearch/reference/7.4/docs-delete.html
My guess is that someone changed a field's name, and now the name in the DB (NoSQL) and the name Elasticsearch has for that field don't match. So Elasticsearch tried to delete that field, but the field was "not found".
It's not an error I would lose sleep over.

Bulk indexing using elastic search

Until now I was indexing data into Elasticsearch document by document, and as the data started increasing this became very slow and is not an optimized approach. So I was searching for a bulk insert mechanism and found the Elasticsearch Bulk API. The documents on their official site confused me. The approach I am using is to pass the data as a WebRequest and execute it on the Elasticsearch server. So when creating a batch/bulk insert request, the API wants us to form a template like
localhost:9200/_bulk as the URL, and
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
to index a document with id 1 and field1 set to value1. The API also suggests sending the data as JSON (unprettified, to avoid escaping issues or the like). So how can I structure my data to pass multiple documents with multiple properties?
I tried this in the Firefox RESTClient, with POST and a JSON header, but RESTClient throws an error and I know it's not valid JSON:
{ "index" : { "_index" : "indexName", "_type" : "type1", "_id" : "111" },
{ "Name" : "CHRIS","Age" : "23" },"Gender" : "M"}
Your data is not well-formed:
You don't need the comma after the first line
You're missing a closing } on the first line
You have a closing } in the middle of your second line; you need to remove it as well.
The correct way of formatting your data for a bulk insert looks like this:
curl -XPOST localhost:9200/_bulk -H 'Content-Type: application/x-ndjson' -d '
{ "index" : { "_index" : "indexName", "_type" : "type1", "_id" : "111" }}
{ "Name" : "CHRIS", "Age" : "23", "Gender" : "M"}
'
This will work.
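To index several documents in one request, you simply stack more action/source line pairs in the same body (the second document here is made up for illustration):
curl -XPOST localhost:9200/_bulk -H 'Content-Type: application/x-ndjson' -d '
{ "index" : { "_index" : "indexName", "_type" : "type1", "_id" : "111" }}
{ "Name" : "CHRIS", "Age" : "23", "Gender" : "M"}
{ "index" : { "_index" : "indexName", "_type" : "type1", "_id" : "112" }}
{ "Name" : "ALEX", "Age" : "31", "Gender" : "F"}
'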
UPDATE
Using Postman on Chrome it looks similar (screenshot omitted); make sure to add a new line after line 2.
Using Elasticsearch 7.9.2:
When sending the bulk request I was getting a "newline" error (screenshot of the failed request omitted).
This is weird, but after adding a newline after the last of all the operations, it works fine with Postman (screenshot of the successful request omitted).

MongoDB embedded vs reference schema for large data documents

I am designing my first MongoDB database schema for a log management system. I would like to store information from log files in MongoDB, and I can't decide which schema I should use for large documents (embedded vs. referenced).
Note: a project has many sources and a source has many logs (in some cases over 1,000,000 logs).
{
  "_id" : ObjectId("5141e051e2f56cbb680b77f9"),
  "name" : "projectName",
  "source" : [{
    "name" : "sourceName",
    "log" : [{
      "time" : ISODate("2012-07-20T13:15:37Z"),
      "host" : "127.0.0.1",
      "status" : 200.0,
      "level" : "INFO",
      "message" : "test"
    }, {
      "time" : ISODate("2012-07-20T13:15:37Z"),
      "host" : "127.0.0.1",
      "status" : 200.0,
      "level" : "ERROR",
      "message" : "test"
    }]
  }]
}
My focus is on read performance (NOT writing), e.g. filtering, searching, pagination, etc. Users can filter source logs by date, status, etc., so I want to focus on read performance when users search or filter data.
I know that MongoDB has a 16 MB document size limit, so I am worried about how this will work if I have 1,000,000 logs for one source (since I can have many sources for one project and sources can have many logs). What is the better solution when working with large documents if I want good read performance: should I use an embedded or a referenced schema? Thanks
The answer to your question is neither. Instead of embedding or using references, you should flatten the schema to one doc per log entry so that it scales beyond whatever can fit in the 16MB doc limit and so that you have access to the full power and performance of MongoDB's querying capabilities.
So get rid of the array fields and move everything up to top-level fields using an approach like:
{
  "_id" : ObjectId("5141e051e2f56cbb680b77f9"),
  "name" : "projectName",
  "sourcename" : "sourceName",
  "time" : ISODate("2012-07-20T13:15:37Z"),
  "host" : "127.0.0.1",
  "status" : 200.0,
  "level" : "INFO",
  "message" : "test"
}, {
  "_id" : ObjectId("5141e051e2f56cbb680b77fa"),
  "name" : "projectName",
  "sourcename" : "sourceName",
  "time" : ISODate("2012-07-20T13:15:37Z"),
  "host" : "127.0.0.1",
  "status" : 200.0,
  "level" : "ERROR",
  "message" : "test"
}
I think having logs in an array might get messy. If the project and source entities don't have any attributes (keys) other than a name, and logs are not to be stored for long, you may use a capped collection with one log per document:
{
  _id: ObjectId("5141e051e2f56cbb680b77f9"),
  p: "project_name",
  s: "source_name",
  "time" : ISODate("2012-07-20T13:15:37Z"),
  "host" : "127.0.0.1",
  "status" : 200.0,
  "level" : "INFO",
  "message" : "test"
}
Refer to this as well: http://docs.mongodb.org/manual/use-cases/storing-log-data/
Capped collections maintain natural order, so you don't need an index on the timestamp to return the logs in natural order. In your case, you may want to retrieve all logs from a particular source/project. You can create the index {p: 1, s: 1} to speed up this query.
But I'd recommend you do some benchmarking to check performance. Try the capped collection approach above, and also try bucketing documents with the fully embedded schema that you have suggested. This technique is used in the classic blog-comments problem: you only store so many logs of each source inside a single document and overflow to a new document whenever the custom-defined size is exceeded.
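As a sketch of the bucketing idea (the bucket _id format, the count field, and the bucket size are illustrative choices, not a fixed scheme), each document would hold a bounded batch of logs for one source, e.g. one bucket per source per hour:
{
  "_id" : "projectName|sourceName|2012-07-20T13",
  "p" : "projectName",
  "s" : "sourceName",
  "count" : 2,
  "logs" : [
    { "time" : ISODate("2012-07-20T13:15:37Z"), "host" : "127.0.0.1", "status" : 200.0, "level" : "INFO", "message" : "test" },
    { "time" : ISODate("2012-07-20T13:15:37Z"), "host" : "127.0.0.1", "status" : 200.0, "level" : "ERROR", "message" : "test" }
  ]
}
Once count reaches the chosen limit, you start a new bucket, which keeps every document comfortably under the 16 MB cap.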

Index huge data into Elasticsearch

I am new to Elasticsearch and have a lot of data (more than 16k large rows in a MySQL table). I need to push this data into Elasticsearch and am facing problems indexing it.
Is there a way to make indexing the data faster? How should I deal with huge data?
Expanding on the Bulk API
You will make a POST request to the /_bulk endpoint.
Your payload must follow this format, where \n is the newline character:
action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
...
Make sure your JSON is not pretty-printed.
The available actions are index, create, update and delete.
Bulk Load Example
To answer your question, here is what it looks like if you just want to bulk load data into your index:
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
The first line contains the action and metadata. In this case, we are calling create. We will be inserting a document of type type1 into the index named test with a manually assigned id of 3 (instead of Elasticsearch auto-generating one).
The second line contains all the fields in your mapping, which in this example is just field1 with a value of value3.
You just concatenate as many of these as you'd like to insert into your index.
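In practice you would usually put all of those lines into a file and send them in a single request; a common pattern (the file name is illustrative) is:
curl -XPOST localhost:9200/_bulk -H 'Content-Type: application/x-ndjson' --data-binary @requests.ndjson
Note that --data-binary preserves the newlines the bulk format requires, whereas plain -d @file would strip them.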
This may be an old thread, but I thought I would comment anyway for anyone who is looking for a solution to this problem. The JDBC river plugin for Elasticsearch is very useful for importing data from a wide array of supported DBs.
Link to the JDBC River source here.
Using Git Bash's curl command, I PUT the following configuration document to allow communication between the ES instance and the MySQL instance:
curl -XPUT 'localhost:9200/_river/uber/_meta' -d '{
  "type" : "jdbc",
  "jdbc" : {
    "strategy" : "simple",
    "driver" : "com.mysql.jdbc.Driver",
    "url" : "jdbc:mysql://localhost:3306/elastic",
    "user" : "root",
    "password" : "root",
    "sql" : "select * from tbl_indexed",
    "poll" : "24h",
    "max_retries" : 3,
    "max_retries_wait" : "10s"
  },
  "index" : {
    "index" : "uber",
    "type" : "uber",
    "bulk_size" : 100
  }
}'
Ensure you have the mysql-connector-java-VERSION-bin JAR in the river-jdbc plugin directory, which contains the jdbc-river's necessary JAR files.
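Once the river has polled, you can sanity-check that documents arrived with a plain search against the target index from the config above (size=1 just keeps the output small):
curl -XGET 'localhost:9200/uber/_search?q=*&size=1&pretty'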
Try the Bulk API:
http://www.elasticsearch.org/guide/reference/api/bulk.html
