Index huge data into Elasticsearch - elasticsearch

I am new to elasticsearch and have huge data(more than 16k huge rows in the mysql table). I need to push this data into elasticsearch and am facing problems indexing it into it.
Is there a way to make indexing data faster? How to deal with huge data?

Expanding on the Bulk API
You will make a POST request to the /_bulk
Your payload will follow the following format where \n is the newline character.
action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
...
Make sure your json is not pretty printed
The available actions are index, create, update and delete.
Bulk Load Example
To answer your question, if you just want to bulk load data into your index.
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
The first line contains the action and metadata. In this case, we are calling create. We will be inserting a document of type type1 into the index named test with a manually assigned id of 3 (instead of elasticsearch auto-generating one).
The second line contains all the fields in your mapping, which in this example is just field1 with a value of value3.
You will just concatenate as many as these as you'd like to insert into your index.

This may be an old thread but I though I would comment anyway for anyone who is looking for a solution to this problem. The JDBC river plugin for Elastic Search is very useful for importing data from a wide array of supported DB's.
Link to JDBC' River source here..
Using Git Bash' curl command I PUT the following configuration document to allow for communication between ES instance and MySQL instance -
curl -XPUT 'localhost:9200/_river/uber/_meta' -d '{
"type" : "jdbc",
"jdbc" : {
"strategy" : "simple",
"driver" : "com.mysql.jdbc.Driver",
"url" : "jdbc:mysql://localhost:3306/elastic",
"user" : "root",
"password" : "root",
"sql" : "select * from tbl_indexed",
"poll" : "24h",
"max_retries": 3,
"max_retries_wait" : "10s"
},
"index": {
"index": "uber",
"type" : "uber",
"bulk_size" : 100
}
}'
Ensure you have the mysql-connector-java-VERSION-bin in the river-jdbc plugin directory which contains jdbc-river' necessary JAR files.

Try bulk api
http://www.elasticsearch.org/guide/reference/api/bulk.html

Related

Attempting to delete all the data for an Index in Elasticsearch

I am trying to delete all the documents, i.e. data from an index. I am using v6.6 along with the dev tools in Kibana.
In the past, I have done this operation successfully but now it is saying 'not found'
{
"_index" : "new-index",
"_type" : "doc",
"_id" : "_query",
"_version" : 1,
"result" : "not_found",
"_shards" : {
"total" : 2,
"successful" : 2,
"failed" : 0
},
"_seq_no" : 313,
"_primary_term" : 7
}
Here is my kibana statement
DELETE /new-index/doc/_query
{
"query": {
"match_all": {}
}
}
Also, the index GET operation which verified the index has data and exists:
GET new-index/doc/_search
I verified the type is doc but I can post the whole mapping, if needed.
Easier way is to navigate in Kibana to Management->Elasticsearch index mapping then select indexes you would like to delete via checkboxes, and click on Manage index -> delete index or flush index depending on your need.
I was able to resolve the issue by using a delete by query:
POST new-index/_delete_by_query
{
"query": {
"match_all": {}
}
}
Delete documents is a problematic way to clear data.
Preferable delete index:
DELETE [your-index]
From kibana console.
And recreate from scratch.
And more preferable way is to make a template for an index that creates index as well with the first indexed document.
Only solutions currently are to either delete the index itself (faster), or delete-by-query (slower)
https://www.elastic.co/guide/en/elasticsearch/reference/7.4/docs-delete-by-query.html
POST new-index/_delete_by_query?conflicts=proceed
{
"query": {
"match_all": {}
}
}
Delete API only removes a single document https://www.elastic.co/guide/en/elasticsearch/reference/7.4/docs-delete.html
My guess is that someone changed a field's name and now the DB (NoSQL) and Elasticsearch string name for that field doesn't match. So Elasticsearch tried to delete that field, but the field was "not found".
It's not an error I would lose sleep over.

Bulk indexing using elastic search

Till now i was indexing data to elastic document by document and now as the data started increasing it has become very slow and not an optimized approach. So i was searching for a bulk insert thing and found Elastic Bulk API. From the documents in their official site i got confused. The approach i am using is by passing the data as WebRequest and executing them in the elastic server. So while creating a batch/bulk insert request the API wants us to form a template like
localhost:9200/_bulk as URL and
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
to index a document with id 1 and field1 values as value 1. Also the API suggests to send the data as JSON (unpretty, to maintain a non escaping character or so). So to pass multiple document with multiple properties how can i structure my data.
I tried like this in FF RestClient , with POST and header as JSON , but RestClient is throwing some error and i know its not a valid JSON
{ "index" : { "_index" : "indexName", "_type" : "type1", "_id" : "111" },
{ "Name" : "CHRIS","Age" : "23" },"Gender" : "M"}
Your data is not well-formed:
You don't need the comma after the first line
You're missing a closing } on the first line
You have a closing } in the middle of your second line you need to remove it as well.
The correct way of formatting your data for a bulk insert look like this:
curl -XPOST localhost:9200/_bulk -d '
{ "index" : { "_index" : "indexName", "_type" : "type1", "_id" : "111" }}
{ "Name" : "CHRIS","Age" : "23" ,"Gender" : "M"}
-H 'Content-Type: application/x-ndjson'
This will work.
UPDATE
Using Postman on Chrome it looks like this. Make sure to add a new line after line 2:
Using the elasticsearch 7.9.2
In order to send the bulk update I was getting the error of new line as below
Failed update without new line
This is wierd but after adding the new line in the last of the all the operations it is working fine with postman, notice line number 5 in below screenshot
bulk update success after adding newline in last of all the commands in postman

how to copy ElasticSearch field to another field

I have 100GB ES index now. Right now I need to change one field to multi-fields, such as: username to username.username and username.raw (not_analyzed). I know it will apply to the incoming data. But how can I make this change affect on the old data? Should I using index scroll to copy the whole index to a new one, Or there is a better solution to just copy one filed please.
There's a way to achieve this without reindexing all your data by using the update by query plugin.
Basically, after installing the plugin, you can run the following query and all your documents will get the multi-field re-populated.
curl -XPOST 'localhost:9200/your_index/_update_by_query' -d '{
"query" : {
"match_all" : {}
},
"script" : "ctx._source.username = ctx._source.username;"
}'
It might take a while to run on 100GB docs, but after this runs, the username.raw field will be populated.
Note: for this plugin to work, one needs to have scripting enabled.
POST index/type/_update_by_query
{
"query" : {
"match_all" : {}
},
"script" :{
"inline" : "ctx._source.username = ctx._source.username;",
"lang" : "painless"
}
}
This worked for me on es 5.6, above one did not!

ElasticSearch Filtered Aliases Creation - Best Practice

We are planning to use Filtered Aliases as mentioned here - https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html
Our input data is going to be a stream with each line of the stream corresponding to an object we would like to store in ES.
Each object contains an 'id', which we are using for routing and filtering.
QUESTION -
How do we create alias and index data in a performant way ?
-- Do we index all data, keep track of all the unique 'id's and the very end create the filtered alias ? OR
-- For each object, check if an alias for that 'id' exists; if it doesn't create one ?
I'm leaning towards the first approach. Is it advisable and performant when compared to the second approach ?
TIA.
Based on our discussion above and after having glanced over the blog article you posted, I'm pretty positive that in your case you don't need aliases at all and the routing key would suffice. Again, only because you have a single index, if you had many indices this would not be true anymore!
You simply need to specify the routing key to use when indexing your document. Until ES 2.0, you can use the _routing field for that purpose, even though it's been deprecated in ES 1.5, but in your case it serves your purpose.
{
"customer" : {
"_routing" : {
"required" : true,
"path" : "customer_id" <----- the field you use as the routing key
},
"properties": { ... }
}
}
Then when searching you simply need to specify &routing=<customer_id> in your search URL in addition to your customer id filter (since a given shard can host documents for different customers). Your search will go directly to the shard identified by the given routing key, and thus, only retrieve data from the specified customer.
Using a filtered alias for this brings nothing as the filter and routing key you'd include in your alias definition would not contribute anything additional, since the retrieved documents are already "filtered" (kind of) by the routing key. This is way easier than trying to detect (on each new document to index) if an alias exists or not and create it if it doesn't.
UPDATE:
Now if you absolutely have/want to create filtered aliases, the more performant way would be the first one you mentioned:
First index your daily data
Then run a terms aggregation on your customer_id field with size high enough (i.e. higher than the cardinality of the field, which was ~100 in your case) to make sure you capture all unique customer ids to create your aliases
Loop over all the buckets to retrieve all unique customer ids
Create all aliases in one shot using one action for each customer_id
curl -XPOST 'http://localhost:9200/_aliases' -d '{
"actions" : [
{
"add" : {
"index" : "customers",
"alias" : "alias_cid1",
"routing" : "cid1",
"filter" : { "term" : { "customer_id" : "cid1" } }
}
},
{
"add" : {
"index" : "customers",
"alias" : "alias_cid2",
"routing" : "cid2",
"filter" : { "term" : { "customer_id" : "cid2" } }
}
},
{
"add" : {
"index" : "customers",
"alias" : "alias_cid3",
"routing" : "cid3",
"filter" : { "term" : { "customer_id" : "cid3" } }
}
},
...
]
}'
Note that you don't have to worry if an alias already exists, the whole command won't fail and silently ignore the existing alias.
When this command has run, you'll have all your aliases on your unique index, properly configured with a filter and a routing key.

Elasticsearch querying alias with routing giving partial results

In an effort to create multi-tenant architecture for my project.
I've created an elasticsearch cluster with an index 'tenant'
"tenant" : {
"some_type" : {
"_routing" : {
"required" : true,
"path" : "tenantId"
},
Now,
I've also created some aliases -
"tenant" : {
"aliases" : {
"tenant_1" : {
"index_routing" : "1",
"search_routing" : "1"
},
"tenant_2" : {
"index_routing" : "2",
"search_routing" : "2"
},
"tenant_3" : {
"index_routing" : "3",
"search_routing" : "3"
},
"tenant_4" : {
"index_routing" : "4",
"search_routing" : "4"
}
I've added some data with tenantId = 2
After all that, I tried to query 'tenant_2' but I only got partial results, while querying 'tenant' index directly returns with the full results.
Why's that?
I was sure that routing is supposed to query all the shards that documents with tenantId = 2 resides on.
When you have created aliases in elasticsearch, you have to do all operations using aliases only. Be it indexing, update or search.
Try reindexing the data again and check if possible (If it is a test index, I hope so).
Remove all the indices.
curl -XDELETE 'localhost:9200/' # Warning:!! Dont use this in production.
Use this command only if it is test index.
Create the index again. Create alias again. Do all the indexing, search and delete operations on alias name. Even the import of data should also be done via alias name.

Resources