Elasticsearch: load CSV data with context

I have 3M records. The headers are value, type, other_fields...
I need to load the data so that the type is specified as the context for the value in each record. Is there any way to do this with Logstash, or are there other options? A sample of the data:
val,val_type,id
Sunnyvale it labs, seller, 10223667

For this, I'd use the new CSV ingest processor.
First, create the ingest pipeline to parse your CSV data:
PUT _ingest/pipeline/csv-parser
{
  "processors": [
    {
      "csv": {
        "field": "message",
        "target_fields": [
          "val",
          "val_type",
          "id"
        ]
      }
    },
    {
      "script": {
        "source": """
          def val = ctx.val;
          ctx.val = [
            'input': val,
            'contexts': [
              'type': [ctx.val_type]
            ]
          ];
        """
      }
    },
    {
      "remove": {
        "field": "message"
      }
    }
  ]
}
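Optionally, you can test the pipeline with the simulate API before indexing anything, using the sample row from the question:
POST _ingest/pipeline/csv-parser/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "Sunnyvale it labs,seller,10223667"
      }
    }
  ]
}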
Then, you can index your documents as follows:
PUT index/_doc/1?pipeline=csv-parser
{
  "message": "Sunnyvale it labs,seller,10223667"
}
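With roughly 3M records you'll want to index in batches rather than one document at a time; the same pipeline parameter works on a bulk request (index name and document are just the ones from the example above):
POST index/_bulk?pipeline=csv-parser
{ "index": {} }
{ "message": "Sunnyvale it labs,seller,10223667" }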
After ingestion, the document will look like this:
{
  "val_type": "seller",
  "id": "10223667",
  "val": {
    "input": "Sunnyvale it labs",
    "contexts": {
      "type": [
        "seller"
      ]
    }
  }
}
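Note that for val to actually work as a completion suggester with a type context, the index mapping needs to declare it as such before you index anything. A minimal sketch (Elasticsearch 7.x syntax; the keyword types for the other fields are just an assumption):
PUT index
{
  "mappings": {
    "properties": {
      "val": {
        "type": "completion",
        "contexts": [
          {
            "name": "type",
            "type": "category"
          }
        ]
      },
      "val_type": { "type": "keyword" },
      "id": { "type": "keyword" }
    }
  }
}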
UPDATE: Logstash solution
Using Logstash, it's also feasible. The configuration file would look something like this:
input {
  file {
    path => "/path/to/your/file.csv"
    sincedb_path => "/dev/null"
    start_position => "beginning"
  }
}
filter {
  csv {
    skip_header => true
    separator => ","
    columns => ["val", "val_type", "id"]
  }
  mutate {
    rename => { "val" => "value" }
    add_field => {
      "[val][input]" => "%{value}"
      "[val][contexts][type]" => "%{val_type}"
    }
    remove_field => [ "value" ]
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "your-index"
  }
}
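Either way, once the documents are indexed, a context-filtered suggestion query would look roughly like this (the prefix and context value are just sample inputs):
POST your-index/_search
{
  "suggest": {
    "val-suggest": {
      "prefix": "sunny",
      "completion": {
        "field": "val",
        "contexts": {
          "type": [ "seller" ]
        }
      }
    }
  }
}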

Related

Ruby filter plugin creates two records for a single input json

There are two conf files used to load data from two JSON files, testOrders and testItems, each containing only one document, into the same index. I am trying to create a parent-child relationship between the two documents.
Below is my conf for testOrders:
input {
  file {
    path => ["/path_data/testOrders.json"]
    type => "json"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  json {
    source => "message"
    target => "testorders_collection"
    remove_field => [ "message" ]
  }
  ruby {
    code => "
      event.set('[my_join_field][name]', 'testorders')
    "
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "testorder"
    document_id => "%{[testorders_collection][eId]}"
    routing => "%{[testorders_collection][eId]}"
  }
}
Below is the conf for testItems
input {
  file {
    path => ["/path_to_data/testItems.json"]
    type => "json"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  json {
    source => "message"
    target => "test_collection"
    remove_field => [ "message" ]
  }
}
filter {
  ruby {
    code => "
      event.set('[my_join_field][name]', 'testItems')
      event.set('[my_join_field][parent]', event.get('[test_collection][foreignKeyId]'))
    "
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "testorder"
    document_id => "%{[test_collection][eId]}"
    routing => "%{[test_collection][foreignKeyId]}"
  }
}
As expected, Logstash creates one record for testOrders, but it creates two records for testItems, even though there is one JSON document for each. One document is created properly with the data, but the other is a duplicate in which nothing seems to be parsed. The document that is created without parsed data looks like this:
{
  "_index": "testorder",
  "_type": "doc",
  "_id": "%{[test_collection][eId]}",
  "_score": 1,
  "_routing": "%{[test_collection][foreignKeyId]}",
  "_source": {
    "type": "json",
    "@timestamp": "2018-07-10T04:15:58.494Z",
    "host": "<hidden>",
    "test_collection": null,
    "my_join_field": {
      "name": "testItems",
      "parent": null
    },
    "path": "/path_to_data/testItems.json",
    "@version": "1"
  }
}
Defining the join relationship in the Elasticsearch mapping solved the issue. This is the way to define the relationship:
PUT fulfillmentorder
{
  "mappings": {
    "doc": {
      "properties": {
        "my_join_field": {
          "type": "join",
          "relations": {
            "fulfillmentorders": "orderlineitems"
          }
        }
      }
    }
  }
}
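With that join mapping in place, the parent document only carries the relation name, while each child names its parent and is routed to the parent's shard. A minimal sketch of what the indexed documents need to look like (the document IDs here are made up):
PUT fulfillmentorder/doc/1
{
  "my_join_field": {
    "name": "fulfillmentorders"
  }
}

PUT fulfillmentorder/doc/2?routing=1
{
  "my_join_field": {
    "name": "orderlineitems",
    "parent": "1"
  }
}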

Logstash split root message

I am collecting some metrics about my application and periodically export them over REST, one by one. The output JSON looks like this:
{
  "name": "decoder.example.type-3",
  "value": 2000,
  "from": 1517847790049,
  "to": 1517847840004
}
This is my Logstash configuration, which works well. It removes all HTTP headers and the original counter name, and adds example as interface and type-3 as transaction.
input {
  http {
    port => 31311
  }
}
filter {
  json {
    source => "message"
  }
  grok {
    match => [ "name", "decoder.%{WORD:interface}.%{NOTSPACE:transaction}" ]
  }
  mutate {
    remove_field => [ "name", "headers", "message" ]
  }
}
output {
  elasticsearch {
    hosts => [ "http://localhost:9200" ]
    index => "metric.decoder-%{+YYYY.MM.dd}"
  }
}
What I am trying to do now is send all of my metrics at once as a JSON array, split them into individual messages, and apply the same logic that was previously applied to them one by one. An example input message would look like this:
[
  {
    "name": "decoder.example.type-3",
    "value": 2000,
    "from": 1517847790049,
    "to": 1517847840004
  },
  {
    "name": "decoder.another.type-0",
    "value": 3500,
    "from": 1517847790049,
    "to": 1517847840004
  }
]
I am pretty certain I am supposed to use the split filter, but I can't figure out how to use it. I have tried putting split before and after my json plugin, using different field settings and targets, but nothing seems to work as expected.
Could someone point me in the right direction?
In my config I used split first and then applied the rest of the logic. Based on that, yours should look something like this:
input {
  http {
    port => 31311
  }
}
filter {
  json {
    source => "message"
  }
  split {
    field => "message"
  }
  grok {
    match => [ "name", "decoder.%{WORD:interface}.%{NOTSPACE:transaction}" ]
  }
  mutate {
    remove_field => [ "name", "headers", "message" ]
  }
}
output {
  elasticsearch {
    hosts => [ "http://localhost:9200" ]
    index => "metric.decoder-%{+YYYY.MM.dd}"
  }
}
But this presumes that you always have a message field that is an array.
Also, you should check whether, after the split, the new message field contains the object you posted. If it does, your grok won't find anything under name; you need to match [message][name] instead. (I usually create a temporary field from [message][name] and remove it later, because I never bothered to look up how to reference nested fields directly in grok. There must be a smarter way.)
This is the configuration I ended up with. Perhaps it can be done in fewer steps, but it works well. I had to move some fields around to keep the same structure, so it is a bit bigger than my initial configuration, which processed the metrics one by one.
The basic idea is to put the parsed JSON into a specific field rather than into the root, and then split on that new field.
input {
  http {
    port => 31311
  }
}
filter {
  json {
    source => "message"
    target => "stats"
  }
  split {
    field => "stats"
  }
  grok {
    match => [ "[stats][name]", "decoder.%{WORD:interface}.%{NOTSPACE:transaction}" ]
  }
  mutate {
    add_field => {
      "value" => "%{[stats][value]}"
      "from" => "%{[stats][from]}"
      "to" => "%{[stats][to]}"
    }
    remove_field => [ "headers", "message", "stats" ]
  }
  mutate {
    convert => {
      "value" => "integer"
      "from" => "integer"
      "to" => "integer"
    }
  }
}
output {
  elasticsearch {
    hosts => [ "http://localhost:9200" ]
    index => "metric.decoder-%{+YYYY.MM.dd}"
  }
}
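For reference, with this configuration each element of the posted array should end up as its own event, looking roughly like this (@timestamp, @version and host omitted for brevity; values taken from the first sample metric):
{
  "interface": "example",
  "transaction": "type-3",
  "value": 2000,
  "from": 1517847790049,
  "to": 1517847840004
}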

How to use a field for determining index in Logstash without saving it?

I am using Logstash for the first time and can't figure out how to derive the index name from a parsed field without persisting that field.
This is my configuration file:
input {
  http {
    port => 31311
  }
}
filter {
  json {
    source => "message"
  }
  mutate {
    remove_field => [ "headers", "message" ]
  }
  grok {
    match => [ "name", "^(?<metric-type>\w+)\..*" ]
  }
}
output {
  elasticsearch {
    hosts => [ "localhost:9200" ]
    index => "%{metric-type}-%{+YYYY.MM.dd}"
  }
}
JSON example sent to the http plugin:
{
  "name": "counter.custom",
  "value": 321,
  "from": "2017-11-30T10:43:17.213Z",
  "to": "2017-11-30T10:44:00.001Z"
}
This record is saved in the counter-2017.11.30 index as expected. However, I don't want the field metric-type to be saved, I just need it to determine the index.
Any suggestions please?
I used grok to put my metric-type into a regular field, since the grok regex named-capture syntax does not support [@metadata][metric-type] as a field name. I then used a mutate filter to copy that field into @metadata and removed the temporary field.
input {
  http {
    port => 31311
  }
}
filter {
  json {
    source => "message"
  }
  mutate {
    remove_field => [ "headers", "message" ]
  }
  grok {
    match => [ "name", "^(?<metric-type>\w+)\..*" ]
  }
  mutate {
    add_field => { "[@metadata][metric-type]" => "%{metric-type}" }
    remove_field => [ "metric-type" ]
  }
}
output {
  elasticsearch {
    hosts => [ "http://localhost:9200" ]
    index => "%{[@metadata][metric-type]}-%{+YYYY.MM.dd}"
  }
}
-- EDIT --
As suggested by @Phonolog in the discussion, there is a simpler and much better solution. By using a grok pattern (%{WORD:...}) instead of a raw regex named capture, I was able to save the captured group directly into @metadata.
input {
  http {
    port => 31311
  }
}
filter {
  json {
    source => "message"
  }
  mutate {
    remove_field => [ "headers", "message" ]
  }
  grok {
    match => [ "name", "%{WORD:[@metadata][metric-type]}." ]
  }
}
output {
  elasticsearch {
    hosts => [ "http://localhost:9200" ]
    index => "%{[@metadata][metric-type]}-%{+YYYY.MM.dd}"
  }
}
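Note that @metadata fields are not shown by the default rubydebug output, so if you want to verify the value while debugging, metadata display has to be switched on explicitly, for example:
output {
  stdout { codec => rubydebug { metadata => true } }
}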

geo_point in Elastic

I'm trying to map a latitude and longitude to a geo_point in Elastic.
Here's my log file entry:
13-01-2017 ORDER COMPLETE: £22.00 Glasgow, 55.856299, -4.258845
And here's my conf file
input {
  file {
    path => "/opt/logs/orders.log"
    start_position => "beginning"
  }
}
filter {
  grok {
    match => { "message" => "(?<date>[0-9-]+) (?<order_status>ORDER [a-zA-Z]+): (?<order_amount>£[0-9.]+) (?<order_location>[a-zA-Z ]+)"}
  }
  mutate {
    convert => { "order_amount" => "float" }
    convert => { "order_lat" => "float" }
    convert => { "order_long" => "float" }
    rename => {
      "order_long" => "[location][lon]"
      "order_lat" => "[location][lat]"
    }
  }
}
output {
  elasticsearch {
    hosts => "localhost"
    index => "sales"
    document_type => "order"
  }
  stdout {}
}
I start logstash with /bin/logstash -f orders.conf and this gives:
"#version"=>{"type"=>"keyword", "include_in_all"=>false}, "geoip"=>{"dynamic"=>true,
"properties"=>{"ip"=>{"type"=>"ip"},
"location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"},
"longitude"=>{"type"=>"half_float"}}}}}}}}
See? It's seeing location as a geo_point. Yet GET sales/_mapping results in this:
"location": {
"properties": {
"lat": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"lon": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
Update
Each time I reindex, I stop Logstash and then remove the .sincedb from /opt/logstash/data/plugins/inputs/file.... I have also made a brand-new log file, and I increment the index each time (I'm currently up to sales7).
conf file
input {
  file {
    path => "/opt/ag-created/logs/orders2.log"
    start_position => "beginning"
  }
}
filter {
  grok {
    match => { "message" => "(?<date>[0-9-]+) (?<order_status>ORDER [a-zA-Z]+): (?<order_amount>£[0-9.]+) (?<order_location>[a-zA-Z ]+), (?<order_lat>[0-9.]+), (?<order_long>[-0-9.]+)( - (?<order_failure_reason>[A-Za-z :]+))?" }
  }
  mutate {
    convert => { "order_amount" => "float" }
  }
  mutate {
    convert => { "order_lat" => "float" }
  }
  mutate {
    convert => { "order_long" => "float" }
  }
  mutate {
    rename => { "order_long" => "[location][lon]" }
  }
  mutate {
    rename => { "order_lat" => "[location][lat]" }
  }
}
output {
  elasticsearch {
    hosts => "localhost"
    index => "sales7"
    document_type => "order"
    template_name => "myindex"
    template => "/tmp/templates/custom-orders2.json"
    template_overwrite => true
  }
  stdout {}
}
JSON file
{
  "template": "sales7",
  "settings": {
    "index.refresh_interval": "5s"
  },
  "mappings": {
    "sales": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  },
  "aliases": {}
}
index => "sales7"
document_type => "order"
template_name => "myindex"
template => "/tmp/templates/custom-orders.json"
template_overwrite => true
}
stdout {}
}
Interestingly, when the geo_point mapping doesn't work (i.e. both lat and long are indexed as floats), my data is indexed (30 rows). But when location is correctly mapped as a geo_point, none of my rows are indexed.
There are two ways to do this. The first is to create an index template so that the correct mapping is applied while your data is being indexed; Elasticsearch cannot know on its own that these values form a geo_point, so you have to tell it, as shown below.
First, create a template.json file for your mapping structure:
{
  "template": "sales*",
  "settings": {
    "index.refresh_interval": "5s"
  },
  "mappings": {
    "sales": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  },
  "aliases": {}
}
After that, change your Logstash configuration to apply this template to your index:
input {
  file {
    path => "/opt/logs/orders.log"
    start_position => "beginning"
  }
}
filter {
  grok {
    match => { "message" => "(?<date>[0-9-]+) (?<order_status>ORDER [a-zA-Z]+): (?<order_amount>£[0-9.]+) (?<order_location>[a-zA-Z ]+), (?<order_lat>[0-9.]+), (?<order_long>[-0-9.]+)" }
  }
  mutate {
    convert => { "order_amount" => "float" }
    convert => { "order_lat" => "float" }
    convert => { "order_long" => "float" }
    rename => {
      "order_long" => "[location][lon]"
      "order_lat" => "[location][lat]"
    }
  }
}
output {
  elasticsearch {
    hosts => "localhost"
    index => "sales"
    document_type => "order"
    template_name => "myindex"
    template => "/etc/logstash/conf.d/template.json"
    template_overwrite => true
  }
  stdout {}
}
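With the template in place and the index recreated, you can check that location is usable as a geo_point with a geo query, for example a geo_distance search (the distance is arbitrary; the coordinates are taken from the sample log line):
GET sales/_search
{
  "query": {
    "geo_distance": {
      "distance": "10km",
      "location": {
        "lat": 55.856299,
        "lon": -4.258845
      }
    }
  }
}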
The second option is the ingest node feature. I will update my answer for this option, but for now you can check my dockerized repository. In that example, I used the ingest node feature instead of a template when parsing the location data.

logstash and elasticsearch geo_point

I am using Logstash to load geospatial data from a CSV into Elasticsearch as geo_points.
The CSV looks like the following:
$ head -5 geo_data.csv
"lon","lat","lon2","lat2","d","i","approx_bearing"
-1.7841,50.7408,-1.7841,50.7408,0.982654,1,256.307
-1.7841,50.7408,-1.78411,50.7408,0.982654,1,256.307
-1.78411,50.7408,-1.78412,50.7408,0.982654,1,256.307
-1.78412,50.7408,-1.78413,50.7408,0.982654,1,256.307
I have created a mapping template that looks like the following:
$ cat map_template.json
{
"template": "base_map_template",
"order": 1,
"settings": {
"number_of_shards": 1
},
{
"mappings": {
"base_map": {
"properties": {
"lon2": { "type" : "float" },
"lat2": { "type" : "float" },
"d": { "type" : "float" },
"appox_bearing": { "type" : "float" },
"location": { "type" : "geo_point" }
}
}
}
}
}
My config file for logstash has been set up as follows:
$ cat map.conf
input {
  stdin {}
}
filter {
  csv {
    columns => [
      "lon","lat","lon2","lat2","d","i","approx_bearing"
    ]
  }
  if [lon] == "lon" {
    drop { }
  } else {
    mutate {
      remove_field => [ "message", "host", "@timestamp", "@version" ]
    }
    mutate {
      convert => { "lon" => "float" }
      convert => { "lat" => "float" }
      convert => { "lon2" => "float" }
      convert => { "lat2" => "float" }
      convert => { "d" => "float" }
      convert => { "i" => "integer"}
      convert => { "approx_bearing" => "float"}
    }
    mutate {
      rename => {
        "lon" => "[location][lon]"
        "lat" => "[location][lat]"
      }
    }
  }
}
output {
  # stdout { codec => rubydebug }
  stdout { codec => dots }
  elasticsearch {
    index => "base_map"
    template => "map_template.json"
    document_type => "node_points"
    document_id => "%{i}"
  }
}
I then try to use Logstash to load the CSV data into Elasticsearch as geo_points using the following command:
$ cat geo_data.csv | logstash-2.1.3/bin/logstash -f map.conf
I get the following error:
Settings: Default filter workers: 16
Unexpected character ('{' (code 123)): was expecting double-quote to start field name
at [Source: [B@278e55d1; line: 7, column: 3]{:class=>"LogStash::Json::ParserError", :level=>:error}
Logstash startup completed
....Logstash shutdown completed
What am I missing?
There's a wayward "{" on line 7 of your template file: remove that stray opening brace (the one just before "mappings") and the template will parse.
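For completeness, the corrected template (same content as above, with only the stray brace removed) would look like this:
{
  "template": "base_map_template",
  "order": 1,
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "base_map": {
      "properties": {
        "lon2": { "type" : "float" },
        "lat2": { "type" : "float" },
        "d": { "type" : "float" },
        "appox_bearing": { "type" : "float" },
        "location": { "type" : "geo_point" }
      }
    }
  }
}
Note also that the template value is treated as an index name pattern; for it to be applied to the base_map index it needs to match that name, e.g. "base_map*".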
