No effect of “size” in ‘query’ while reindexing in elasticsearch

I have been using Logstash to migrate one index to another. Recently I tried to reindex a limited amount of data from a large dataset in my local environment, so I used the following configuration for the migration:
input {
  elasticsearch {
    hosts => "localhost:9200"
    index => "old_index"
    query => '{ "query": { "match_all": {} }, "size": 10 }'
  }
}
filter {
  mutate {
    remove_field => [
      "@version",
      "@timestamp"
    ]
  }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "new_index"
    document_type => "contact"
    manage_template => false
    document_id => "%{contactId}"
  }
}
But this reindexes all the documents from old_index into new_index, whereas I was expecting only 10 documents to be reindexed into new_index.
Am I missing some concept of how Logstash works with Elasticsearch?

The elasticsearch input doesn't run a conventional search; it uses a scan/scroll search instead. This means that all data will be retrieved from the index, and the size parameter only defines how many documents are fetched on each scroll request, not how many documents are fetched in total.
Also note that the size parameter inside the query itself has no effect. You need to use the size setting of the elasticsearch input and not specify it in the query:
input {
  elasticsearch {
    hosts => "localhost:9200"
    index => "old_index"
    query => '{ "query": { "match_all": {} } }'
    size  => 10    # <--- size goes here, not in the query
  }
}
That being said, if you're running ES 2.3 or later, there's a way to achieve what you desire using the Reindex API, like this:
POST /_reindex
{
  "size": 10,
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index"
  }
}
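For reference, here's one way to send the same request from the command line; a minimal sketch assuming Elasticsearch is reachable on localhost:9200 (Elasticsearch 6.0 and later also require the explicit Content-Type header):
curl -XPOST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d '{
  "size": 10,
  "source": { "index": "old_index" },
  "dest":   { "index": "new_index" }
}'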

Related

Insert aggregation results into an index

The goal is to build an Elasticsearch index containing only the most recent document from each group of related documents, in order to track the current state of some monitoring counters and states.
I have crafted a simple Elasticsearch aggregation query:
{
  "size": 0,
  "aggs": {
    "group_by_monitor": {
      "terms": {
        "field": "monitor_name"
      },
      "aggs": {
        "get_latest": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "timestamp": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
It groups related documents into buckets and selects the most recent document for each bucket.
Here are the different ideas I had to get the job done:
directly use the aggregation query to push the results into the index, but it does not seem possible: Is it possible to put the results of an ElasticSearch aggregation back into the index?
use the Logstash Elasticsearch input plugin to execute the aggregation query and the Elasticsearch output plugin to push into the index, but it seems the input plugin only looks at the hits field and cannot handle aggregation results: Aggregation Query possible input ES plugin!
use the Logstash http_poller plugin to get a JSON document, but it does not seem to allow specifying a body for the HTTP request!
use the Logstash exec plugin to execute cURL commands to get the JSON, but this seems quite cumbersome and is my last resort.
use the NEST API to build a basic application that would do the polling, extract the results, clean them, and inject the resulting documents into the target index, but I'd like to avoid adding a new tool to maintain.
Is there a way to accomplish this that isn't overly complex?
Edit the logstash.conf file as follows:
input {
  elasticsearch {
    hosts => "localhost"
    index => "source_index_name"
    type => "index_type"
    query => '{Query}'
    size => 500
    scroll => "5m"
    docinfo => true
  }
}
output {
  elasticsearch {
    index => "target_index_name"
    document_id => "%{[@metadata][_id]}"
  }
}
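If you also want the copied documents to keep their original type, the docinfo option already exposes it as [@metadata][_type], so the output can reference it too; a sketch of the output section based on the metadata fields documented for the elasticsearch input plugin:
output {
  elasticsearch {
    index => "target_index_name"
    document_type => "%{[@metadata][_type]}"
    document_id => "%{[@metadata][_id]}"
  }
}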

elasticsearch: define field's order in returned doc

I'm sending queries to Elasticsearch and it responds with the fields of each document in an unpredictable order.
How can I fix the order in which Elasticsearch returns the fields inside documents?
I mean, I'm sending this query:
{
  "index": "my_index",
  "_source": {
    "includes": ["field1", "field2", "field3", "field14"]
  },
  "size": X,
  "body": {
    "query": {
      // stuff
    }
  }
}
and when it responds, it gives me the fields in a different order.
I ultimately want to convert this to CSV and need fixed CSV headers.
Is there something I can do so I get something like
doc1: {"field1","field2","field3","field14"}
doc2: {"field1","field2","field3","field14"}
...
in the same order as my "_source"?
Thanks for your help.
A document in Elasticsearch is a JSON hash/map and by definition maps are unordered.
One way around this is to use Logstash to extract the docs from ES with an elasticsearch input and write them out with a csv output. That way you can guarantee that the fields in the CSV file will be in the exact order you specify. Another benefit is that you don't have to write your own boilerplate code to extract from ES and write to CSV; Logstash does it all for you.
The Logstash configuration would look something like this:
input {
  elasticsearch {
    hosts => "localhost"
    query => '{ "query": { "match_all": {} } }'
    size => 100
    index => "my_index"
  }
}
filter {}
output {
  csv {
    fields => ["field1", "field2", "field3", "field14"]
    path => "/path/to/file.csv"
  }
}
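With that configuration each event becomes one CSV row whose columns follow the fields list, so the file would look roughly like this (the values below are made up):
value1,value2,value3,value14
value1,value2,value3,value14
...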

Get elasticsearch indices before specific date

My logstash service sends the logs to elasticsearch as daily indices.
elasticsearch {
  hosts => [ "127.0.0.1:9200" ]
  index => "%{type}-%{+YYYY.MM.dd}"
}
Does Elasticsearch provide an API to look up the indices created before a specific date?
For example, how could I get the indices created before 2015-12-15?
The only time I really care about what indexes are created is when I want to close/delete them using curator. Curator has "age" type features built in, if that's also your use case.
I think you are looking for the Indices Query, have a look here.
Here is an example:
GET /_search
{
  "query": {
    "indices": {
      "query": {
        "term": { "description": "*" }
      },
      "indices": ["2015-01-*", "2015-12-*"],
      "no_match_query": "none"
    }
  }
}
Each index has a creation_date setting.
Since the number of indices is supposed to be quite small, there's no such feature as 'searching for indices'. You just get their metadata and filter them inside your app. The creation_date is also available via the _cat API.
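For example, the creation date (in epoch milliseconds) can be read from an index's settings, or listed per index via _cat; a sketch, where the index name is only an illustration and the _cat column names may differ on older versions:
# creation_date is part of the index settings
GET /access-2015.12.14/_settings

# or list it for all indices via the _cat API
GET /_cat/indices?v&h=index,creation.date.string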

Why doesn't elasticsearch allow changes to an indexed data?

I tried to convert some of the fields in previously indexed data from string to integer. But when I ran Logstash again, the fields didn't get converted (I only checked in Kibana). Why can't I make changes to already indexed data and, if I can't, how can I make the required changes to my index?
I've only been making changes in Logstash. Here is a piece of logstash.conf:
input {
  file {
    type => "movie"
    path => "C:/TestLogs/Test5.txt"
    start_position => "beginning"
  }
}
filter {
  grok {
    match => { "message" => "(?<Movie_Name>[\w.\-\']*)\s(?<Rating>[\d.]+)\s(?<No. Of Downloads>\d+)\s(?<No. of views>\d+)" }
  }
  mutate {
    convert => { "Rating" => "float" }
    convert => { "No. of Downloads" => "integer" }
    convert => { "No. of views" => "integer" }
  }
}
Elasticsearch uses Lucene at its core for indexing and storing data. Lucene uses write-once data structures to store data, and that's the reason why it is not possible to change the data structure of data that is already stored in Elasticsearch. It is possible to update documents with new values, but not to change the structure of an entire index.
If you want to change the mappings, i.e. the data structure, then you have to create a new index with the new mapping and store the data there.
This of course is not that easy if Elasticsearch is the master of the data. In that case you have to create the new index with the new mapping, read the data from the old index, and put it into the new index. You can do this using the scan and scroll approach.
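On ES 2.3 or later, the Reindex API shown earlier can perform this copy for you; a sketch using the data_v1/data_v2 names from the example below:
POST /_reindex
{
  "source": { "index": "data_v1" },
  "dest": { "index": "data_v2" }
}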
If you want to make this transparent to the application reading from elasticsearch you can use an alias:
At first the index name is data_v1 and the alias is data:
data -> data_v1
Then you create a new index data_v2 with the new mapping, read all the data from data_v1, and store it in data_v2. Having done this, change the alias to point to data_v2:
data -> data_v2
To change aliases you can use the 'remove' and 'add' actions:
POST /_aliases
{
  "actions": [
    { "remove": {
        "alias": "data",
        "index": "data_v1"
    }}
  ]
}
POST /_aliases
{
  "actions": [
    { "add": {
        "alias": "data",
        "index": "data_v2"
    }}
  ]
}

Converting nginx access log bytes to number in Kibana4

I would like to create a visualization of the sum of bytes sent, using the data from my nginx access logs. When trying to create a "Metric" visualization, I can't sum the bytes field because it is of type string.
And I'm not able to change it under settings.
How do I go about changing this field type to a number/bytes type?
Here is my logstash config for nginx access logs
filter {
  if [type] == "nginx-access" {
    grok {
      match => { "message" => "%{NGINXACCESS}" }
    }
    geoip {
      source => "clientip"
    }
    useragent {
      source => "agent"
      target => "useragent"
    }
  }
}
Since a new logstash index is created each day, I'm guessing I need to change it here.
I tried adding
mutate {
  convert => { "bytes" => "integer" }
}
But it doesn't seem to make a difference.
Field types are configured using mappings, which are defined at the index level and can hardly be changed once a field exists. With Logstash, a new index is created every day, so if you want to change these mappings you can either wait for the next day or delete the current index if you can afford to.
By default these mappings are generated automatically by Elasticsearch based on the syntax of the indexed JSON document and the applied index templates:
# Type String
{"bytes":"123"}
# Type Integer
{"bytes":123}
In the end there are 2 solutions:
Tune Logstash to make it generate an integer and let Elasticsearch guess the field type → use the mutate/convert filter (as you already did; note it only takes effect for indices created after the change).
Tune Elasticsearch to force the bytes field of the nginx-access document type to be of type integer → use an index template:
Index Template API:
PUT _template/logstash-nginx-access
{
  "order": 1,
  "template": "logstash-*",
  "mappings": {
    "nginx-access": {
      "properties": {
        "bytes": {
          "type": "integer"
        }
      }
    }
  }
}
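Once the template is stored and a new daily index has been created, you can verify that it was picked up; a quick sketch, where the index name below is just an example of whichever logstash-* index gets created next:
# confirm the template is stored
GET _template/logstash-nginx-access

# confirm that bytes is now mapped as integer in the new index
GET logstash-2016.01.01/_mapping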
