Elasticsearch: assign own IDs while indexing with Logstash

I am indexing a large corpus of information and I have a string key that I know is unique. I would like to avoid using search and instead access documents directly by this artificial identifier.
Since the path directive for _id is deprecated as of ES 1.5, does anyone know a workaround for this problem?
My data look like:
{unique-string},val1, val2, val3...
{unique-string2},val4, val5, val6...
I am using logstash to index the files and would prefer to fetch the documents through a direct get, rather than through an exact-match.

In your elasticsearch output plugin, just specify the document_id setting with a reference to the field you want to use as id, i.e. the one named 1 in your csv filter.
input {
  file {...}
}
filter {
  csv {
    columns => ["1", "2", "3"]
    separator => ","
  }
}
output {
  elasticsearch {
    action => "index"
    host => "localhost"
    port => "9200"
    index => "index-name"
    document_id => "%{1}"    # add this line
    workers => 2
    cluster => "elasticsearch-cluster"
    protocol => "http"
  }
}

Related

Generate multiple types with multiple csv

I am trying to generate several types in the same index from several CSV files. As I don't know how many CSV files there will be, writing an input for each one is not viable.
So does anyone know how to generate types named after the files and load each CSV into its own type?
input {
file {
path => "/home/user/Documents/data/*.csv"
start_position => "beginning"
sincedb_path => "/dev/null"
}
}
filter {
csv {
separator => ","
skip_header => "true"
autodetect_column_names => true
autogenerate_column_names => true
}
}
output {
elasticsearch {
hosts => "http://localhost:9200"
index => "final_index"
}
stdout {}
}
Thank you so much
Support for multiple mapping types in the same index was removed starting with Elasticsearch 6. If a document's fields conflict with the mapping the index already has, Elasticsearch will reject it. What you can do is make sure that all fields are known in advance and use one general template containing all possible fields.
Is there a reason why you want all of it in one index?
If it is for querying purposes or for Kibana, note that you can use wildcards in index names when searching, and index patterns in Kibana.
Update after your comment:
Use a grok filter to extract the filename:
filter {
grok {
match => ["path","%{GREEDYDATA}/%{GREEDYDATA:filename}\.csv"]
}
}
And use the filename in your output like this:
elasticsearch {
hosts => "http://localhost:9200"
index => "final_index-%{[filename]}"
}
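Elasticsearch index names must be lowercase, so a file name containing uppercase letters would produce an index name that is rejected. A minimal sketch of one way to handle this, assuming the path field is populated by the file input as in the filter above: lowercase the extracted name with mutate, and only interpolate it when grok actually matched. The fallback index name final_index-unknown is a hypothetical placeholder.

```
filter {
  grok {
    match => ["path", "%{GREEDYDATA}/%{GREEDYDATA:filename}\.csv"]
  }
  mutate {
    # index names must be lowercase
    lowercase => ["filename"]
  }
}
output {
  if [filename] {
    elasticsearch {
      hosts => "http://localhost:9200"
      index => "final_index-%{filename}"
    }
  } else {
    # grok did not match, so there is no filename to interpolate
    elasticsearch {
      hosts => "http://localhost:9200"
      index => "final_index-unknown"
    }
  }
}
```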

Logstash agent not indexing anymore

I have a Logstash instance running as a service that reads from Redis and outputs to Elasticsearch. I just noticed there was nothing new in Elasticsearch for the last few days, but the Redis lists were increasing.
Logstash log was filled with 2 errors repeated for thousands of lines:
:message=>"Got error to send bulk of actions"
:message=>"Failed to flush outgoing items"
The reason being:
{"error":"IllegalArgumentException[Malformed action/metadata line [107], expected a simple value for field [_type] but found [START_ARRAY]]","status":500},
Additionally, trying to stop the service failed repeatedly; I had to kill it. Restarting it emptied the Redis lists and imported everything into Elasticsearch. It seems to work OK now.
But I have no idea how to prevent that from happening again. The mentioned type field is set as a string for each input directive, so I don't understand how it could have become an array.
What am I missing?
I'm using Elasticsearch 1.7.1 and Logstash 1.5.3. The logstash.conf file looks like this:
input {
redis {
host => "127.0.0.1"
port => 6381
data_type => "list"
key => "b2c-web"
type => "b2c-web"
codec => "json"
}
redis {
host => "127.0.0.1"
port => 6381
data_type => "list"
key => "b2c-web-staging"
type => "b2c-web-staging"
codec => "json"
}
# other redis inputs, only key/type variations
}
filter {
grok {
match => ["msg", "Cache hit %{WORD:query} in %{NUMBER:hit_total:int}ms. Network: %{NUMBER:hit_network:int} ms. Deserialization %{NUMBER:hit_deserial:int}"]
add_tag => ["cache_hit"]
tag_on_failure => []
}
# other groks, not related to type field
}
output {
elasticsearch {
host => "[IP]"
port => "9200"
protocol=> "http"
cluster => "logstash-prod-2"
}
}
According to your log message:
{"error":"IllegalArgumentException[Malformed action/metadata line [107], expected a simple value for field [_type] but found [START_ARRAY]]","status":500},
It seems you're trying to index a document with a type field that's an array instead of a string.
I can't say more without seeing the rest of the logstash.conf file.
But check the following to make sure:
If you use add_field to change type on an event that already has a type, you actually turn type into an array with multiple values, which is what Elasticsearch is complaining about.
You can use the mutate filter's join option to convert arrays to strings:
filter {
mutate {
join => { "fieldname" => "," }
}
}
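As a sketch of the alternative, assuming the type really is being changed somewhere in the filter section: the mutate filter's replace option overwrites the field instead of appending to it, so type stays a single string. The condition and the value "cache-hit" below are hypothetical placeholders, not taken from the question's config.

```
filter {
  if "cache_hit" in [tags] {
    mutate {
      # replace overwrites the existing value; add_field would append to it
      replace => { "type" => "cache-hit" }
    }
  }
}
```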

How to efficiently move data from elasticsearch index (with one shard) to another index (with 5 shards)?

I have an Elasticsearch index which contains around 5 GB of data on a single node in a single shard. I have now created another index with the same settings as the old one, but with number_of_shards set to 5 instead of 1.
I am looking for the most efficient approach to copy the data from older index to newer index without any downtime.
I would suggest using Logstash for this. You could use the following configuration. Make sure to replace the source and target hosts, as well as the index and type names to match your local environment.
File: reindex.conf
input {
  elasticsearch {
    hosts => "localhost:9200"    # your source host
    index => "my_source_index"
  }
}
filter {
  mutate {
    remove_field => [ "@version", "@timestamp" ]
  }
}
output {
  elasticsearch {
    host => "localhost"    # your target host
    port => 9200
    protocol => "http"
    manage_template => false
    index => "my_target_index"
    document_type => "my_type"
    workers => 5
  }
}
And then you can simply launch it with
bin/logstash -f reindex.conf

Copy ElasticSearch-Index with Logstash

I have a ready-built Apache index on one machine that I would like to clone to another machine using Logstash. Fairly easy, I thought:
input {
elasticsearch {
host => "xxx.xxx.xxx.xxx"
index => "logs"
}
}
filter {
}
output {
elasticsearch {
cluster => "Loa"
host => "127.0.0.1"
protocol => http
index => "logs"
index_type => "apache_access"
}
}
That pulls over the docs, but it doesn't stop, as it uses the default query "*" (the original index has ~50,000 docs, and I killed the script when the new index was over 600,000 docs and rising).
Next I tried to make sure the docs would get updated instead of duplicated, but this commit hasn't made it yet, so I don't have a primary..
Then I remembered sincedb, but I don't seem to be able to use that in the query (or is that possible?).
Any advice? Maybe a completely different approach? Thanks a lot!
Assuming that the elasticsearch input creates a Logstash event that carries the document ID (I assume it will be _id or something similar), try setting up the elasticsearch output the following way:
output {
elasticsearch {
cluster => "Loa"
host => "127.0.0.1"
protocol => http
index => "logs"
index_type => "apache_access"
document_id => "%{_id}"
}
}
That way, even if the elasticsearch input, for whatever reason, continues to push the same documents indefinitely, Elasticsearch will merely update the existing documents instead of creating new documents with new IDs.
Once you reach 50,000, you can stop.
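Whether _id is available as a top-level field depends on the version of the elasticsearch input plugin. In versions that support the docinfo option, the source document's metadata is typically stored under [@metadata] instead (the exact location can differ by version and settings; see docinfo_target). A minimal sketch under that assumption, which also assumes a plugin version that accepts hosts; the addresses are the question's own placeholders:

```
input {
  elasticsearch {
    hosts => "xxx.xxx.xxx.xxx"
    index => "logs"
    # copy _index, _type and _id of each source document into [@metadata]
    docinfo => true
  }
}
output {
  elasticsearch {
    hosts => "127.0.0.1"
    index => "logs"
    # reuse the source ID so that re-reading a document overwrites it
    document_id => "%{[@metadata][_id]}"
  }
}
```

Because [@metadata] is not sent to the output, the ID is used for addressing only and does not end up as an extra field in the copied documents.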

Way to populate Logstash output variable without getting it from an Input?

Is there another way to tell Logstash to supply a value to an output variable without pulling it from a Logstash input? For example, in my case I'd like to create an Elasticsearch index based on a performance run ID (which I'd do from an external script) and then have Logstash send to that. For now I was thinking of creating a tcp input just for receiving perf run info and then have a filter to match on the run id. Seems like a convoluted way to do this though. For example:
input {
  tcp {
    type => "perfinfo"
    port => 8888
  }
}
filter {
  if [type] == "perfinfo" {
    # do some matching to extract the id
  }
}
output {
  elasticsearch {
    cluster => "mycluster"
    manage_template => false
    index => "%{id}-perftest"
  }
}
I'm not sure if setting manage_template to false would actually be necessary. I've read that it is.
Update
Thanks Nirdesh for that. Using Ruby might be very handy.
While I was waiting I tried using a grok filter like so:
grok {
match => { "message" => "%{WORD:perftype}-%{POSINT:perfid}" }
}
Which produced this stdout during debugging:
{
  "message" => "awperf-14",
  "@version" => "1",
  "@timestamp" => "2014-10-17T20:01:19.758Z",
  "host" => "0:0:0:0:0:0:0:1:33361",
  "type" => "perfinfo",
  "perftype" => "awperf",
  "perfid" => "14"
}
I then tried to create an index based on this like so:
index => "%{perftype}-%{perfid}"
So when I passed 'awperf-14' to the input, I ended up with these two indexes:
%{perftype}-%{perfid}
awperf-14
Which is not what I was expecting. Also, it's the %{perftype}-%{perfid} index that starts to be populated, not awperf-14, the one I actually wanted.
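A likely explanation, assuming other events flow through the same pipeline: Logstash leaves a %{field} reference as literal text when the event has no such field, so every event without perftype and perfid is sent to an index literally named %{perftype}-%{perfid}. One way to guard against that is to wrap the output in a conditional. This is only a sketch; the fallback index name other-events is a hypothetical placeholder.

```
output {
  if [perftype] and [perfid] {
    # both fields exist, so the index name resolves fully
    elasticsearch {
      cluster => "mycluster"
      index => "%{perftype}-%{perfid}"
    }
  } else {
    # events without the fields go to a fixed index instead
    elasticsearch {
      cluster => "mycluster"
      index => "other-events"
    }
  }
}
```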
Yes.
You can add any number of your own variables, either for intermediate results or permanently, using a property called add_field. Almost all filters in Logstash support this property.
So, for your solution, you can use a ruby filter to work out the id dynamically and store it in a new variable called id, which you can then use in the output.
For example:
input {
  tcp {
    type => "perfinfo"
    port => 8888
  }
}
filter {
  if [type] == "perfinfo" {
    ruby {
      # replace this no-op with your own Ruby processing
      code => "nil"
      add_field => { "id" => "Some value" }
    }
  }
}
output {
  elasticsearch {
    cluster => "mycluster"
    manage_template => false
    index => "%{id}-perftest"
  }
}
I'm not sure I can do what I was trying to do via Logstash. To be clearer, I simply wanted to change the index based on the performance run ID I'm executing. There's nothing in the data that would have this information (I have to pull it from a DB). So instead of trying to have Logstash listen for a performance run ID, I scripted this externally. The script uses the Elasticsearch API to create a new index, and then does a string replace for the index in the Logstash config file. It then restarts Logstash, which normally happens between performance runs anyway. This approach was much easier to do, and seems cleaner.
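For readers on newer Logstash releases, a variant of this external-script approach avoids rewriting the config file: Logstash can substitute environment variables into the configuration when it starts. This assumes a release with ${VAR} support, which the version discussed in this thread predates. PERF_RUN_ID is a hypothetical variable name that the external script would export before restarting Logstash, and the hosts value is likewise an assumption.

```
output {
  elasticsearch {
    hosts => "localhost:9200"
    manage_template => false
    # resolved once at startup; falls back to "unknown" if the variable is unset
    index => "${PERF_RUN_ID:unknown}-perftest"
  }
}
```

The value is read only when Logstash starts, so this still relies on the restart between performance runs described above.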
