Transform an Elasticsearch index from field explosion into nested documents via Logstash

So we have an old Elasticsearch index that succumbed to field explosion. We have redesigned the structure of the index to fix this using nested documents. However, we are attempting to figure out how to migrate the old index data into the new structure. We are currently looking at using Logstash plugins, notably the aggregate plugin, to try to accomplish this. However, all the examples we can find show how to create the nested documents from database calls, as opposed to from a field-exploded index. For context, here is an example of what an old index might look like:
"assetID": 22074,
"metadata": {
"50": {
"analyzed": "Phase One",
"full": "Phase One",
"date": "0001-01-01T00:00:00"
},
"51": {
"analyzed": "H 25",
"full": "H 25",
"date": "0001-01-01T00:00:00"
},
"58": {
"analyzed": "50",
"full": "50",
"date": "0001-01-01T00:00:00"
}
}
And here is what we would like the transformed data to look like in the end:
"assetID": 22074,
"metadata": [{
"metadataId": 50,
"ngrams": "Phase One", //This was "analyzed"
"alphanumeric": "Phase One", //This was "full"
"date": "0001-01-01T00:00:00"
}, {
"metadataId": 51,
"ngrams": "H 25", //This was "analyzed"
"alphanumeric": "H 25", //This was "full"
"date": "0001-01-01T00:00:00"
}, {
"metadataId": 58,
"ngrams": "50", //This was "analyzed"
"alphanumeric": "50", //This was "full"
"date": "0001-01-01T00:00:00"
}
]
As a dumbed-down example, here is what we can figure out from the aggregate plugin:
input {
elasticsearch {
hosts => "my.old.host.name:9266"
index => "my-old-index"
query => '{"query": {"bool": {"must": [{"term": {"_id": "22074"}}]}}}'
size => 500
scroll => "5m"
docinfo => true
}
}
filter {
aggregate {
task_id => "%{id}"
code => "
map['assetID'] = event.get('assetID')
map['metadata'] ||= []
map['metadata'] << {
metadataId => ? //somehow parse the Id out of the exploded field name "metadata.#.full",
ngrams => event.get('metadata.#.analyzed'),
alphanumeric => event.get('metadata.#.full'),
date => event.get('metadata.#.date'),
}
"
push_previous_map_as_event => true
timeout => 150000
timeout_tags => ['aggregated']
}
if "aggregated" not in [tags] {
drop {}
}
}
output {
elasticsearch {
hosts => "my.new.host:9266"
index => "my-new-index"
document_type => "%{[#metadata][_type]}"
document_id => "%{[#metadata][_id]}"
action => "update"
}
file {
path => "C:\apps\logstash\logstash-5.6.6\testLog.log"
}
}
Obviously the above example is basically just pseudocode, but that is all we can gather from looking at the documentation for both Logstash and Elasticsearch, as well as the aggregate filter plugin, and generally Googling things within an inch of their life.

You can play around with the event object, massage it, and then add it into the new index. Something like below (the Logstash code is untested, so you may find some errors; check the working Ruby code after this section):
aggregate {
  task_id => "%{id}"
  code => "
    arr = Array.new()
    map['assetID'] = event.get('assetID')
    metadataObj = event.get('metadata')
    metadataObj.to_hash.each do |key, value|
      transformedMetadata = {}
      transformedMetadata['metadataId'] = key
      value.to_hash.each do |k, v|
        if k == 'analyzed' then
          transformedMetadata['ngrams'] = v
        elsif k == 'full' then
          transformedMetadata['alphanumeric'] = v
        else
          transformedMetadata['date'] = v
        end
      end
      arr.push(transformedMetadata)
    end
    map['metadata'] ||= []
    map['metadata'].concat(arr)
  "
}
Try to play around with the above based on the event input and you will get there. Here's a working example, with the input you have in the question, for you to play around with: https://repl.it/repls/HarshIntelligentEagle
json_data = {"assetID": 22074,
"metadata": {
"50": {
"analyzed": "Phase One",
"full": "Phase One",
"date": "0001-01-01T00:00:00"
},
"51": {
"analyzed": "H 25",
"full": "H 25",
"date": "0001-01-01T00:00:00"
},
"58": {
"analyzed": "50",
"full": "50",
"date": "0001-01-01T00:00:00"
}
}
}
arr = Array.new()
transformedObj = {}
transformedObj["assetID"] = json_data[:assetID]
json_data[:metadata].to_hash.each do |key,value|
transformedMetadata = {}
transformedMetadata["metadataId"] = key
value.to_hash.each do |k , v|
if k == :analyzed then
transformedMetadata["ngrams"] = v
elsif k == :full then
transformedMetadata["alphanumeric"] = v
else
transformedMetadata["date"] = v
end
end
arr.push(transformedMetadata)
end
transformedObj["metadata"] = arr
puts transformedObj
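For reference, running that script prints a hash along these lines (exact formatting depends on the Ruby version):
{"assetID"=>22074, "metadata"=>[{"metadataId"=>:"50", "ngrams"=>"Phase One", "alphanumeric"=>"Phase One", "date"=>"0001-01-01T00:00:00"}, {"metadataId"=>:"51", "ngrams"=>"H 25", "alphanumeric"=>"H 25", "date"=>"0001-01-01T00:00:00"}, {"metadataId"=>:"58", "ngrams"=>"50", "alphanumeric"=>"50", "date"=>"0001-01-01T00:00:00"}]}
Note that the metadataId values come out as symbols such as :"50" only because the json_data literal above uses symbol keys; inside Logstash the keys arrive as strings, and the working script below converts them to integers with key.to_i.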

In the end, we used Ruby code to solve it in a script:
# Must use the input plugin for elasticsearch at version 4.0.2, or it cannot contact a 1.X index
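# (A specific plugin version can usually be pinned with something like
#   bin/logstash-plugin install --version 4.0.2 logstash-input-elasticsearch
# before running this pipeline.)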
input {
elasticsearch {
hosts => "my.old.host.name:9266"
index => "my-old-index"
query => '{
"query": {
"bool": {
"must": [
{ "match_all": { } }
]
}
}
}'
size => 500
scroll => "5m"
docinfo => true
}
}
filter {
mutate {
remove_field => ['@version', '@timestamp']
}
}
#metadata
filter {
mutate {
rename => { "[metadata]" => "[metadata_OLD]" }
}
ruby {
code => "
metadataDocs = []
metadataFields = event.get('metadata_OLD')
metadataFields.each { |key, value|
metadataDoc = {
'metadataID' => key.to_i,
'date' => value['date']
}
if !value['full'].nil?
metadataDoc[:alphanumeric] = value['full']
end
if !value['analyzed'].nil?
metadataDoc[:ngrams] = value['analyzed']
end
metadataDocs << metadataDoc
}
event.set('metadata', metadataDocs)
"
}
mutate {
remove_field => ['metadata_OLD']
}
}
output {
elasticsearch {
hosts => "my.new.host:9266"
index => "my-new-index"
document_type => "searchasset"
document_id => "%{assetID}"
action => "update"
doc_as_upsert => true
}
file {
path => "F:\logstash-6.1.2\logs\esMigration.log"
}
}
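One thing the config above does not show is the mapping of my-new-index itself. For the metadata array to behave as true nested documents, the index has to declare the field as type nested before the migration runs. A rough sketch of what such a mapping could look like, assuming an Elasticsearch 5.x/6.x destination and reusing the field names from the script (metadataID, ngrams, alphanumeric, date); the actual analyzers behind ngrams and alphanumeric are not shown in the question and are left out here:
{
  "mappings": {
    "searchasset": {
      "properties": {
        "assetID": { "type": "integer" },
        "metadata": {
          "type": "nested",
          "properties": {
            "metadataID":   { "type": "integer" },
            "ngrams":       { "type": "text" },
            "alphanumeric": { "type": "text" },
            "date":         { "type": "date" }
          }
        }
      }
    }
  }
}
This would be created with a PUT my-new-index request (or via an index template) before starting Logstash.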

Related

Incorrect document_id for Logstash elastic search output

I'm using Logstash to read JSON messages from a Solace queue and write them to Elasticsearch. I'm using doc_as_upsert => true along with the document_id parameter in the output. This is how my logstash configuration looks:
logstash.conf
input
{
jms {
include_header => false
include_properties => false
include_body => true
use_jms_timestamp => false
destination => 'SpringBatchTestQueue'
pub_sub => false
jndi_name => '/JMS/CF/MDM'
jndi_context => {
'java.naming.factory.initial' => 'com.solacesystems.jndi.SolJNDIInitialContextFactory'
'java.naming.security.principal' => 'EDM_Test_User@NovartisDevVPN'
'java.naming.provider.url' => 'tcp://localhost:55555'
'java.naming.security.credentials' => 'EDM_Test_User'
}
require_jars=> ['/app/elasticsearch/jms/commons-lang-2.6.jar',
'/app/elasticsearch/jms/sol-jms-10.10.0.jar',
'/app/elasticsearch/jms/geronimo-jms_1.1_spec-1.1.1.jar']
}
}
output
{
elasticsearch
{
hosts => ["https://glchbs-sd220240.eu.novartis.net:9200/"]
index => "test-%{+YYYY.MM.dd}"
document_id => "%{customerId}"
doc_as_upsert => true
ssl => true
ssl_certificate_verification => true
cacert => "/app/elasticsearch/config/ssl/Novartis_Silver_Three_Chain.pem"
}
}
Json Message:
{
"customerId": "N-CA-Z9II2YJ1YJ",
"name": "Alan Birch",
"customerRecordType": "Health Care Professional",
"country": "CA",
"language": "EN",
"privacyLawStatus": false,
"salutation": "Mr.",
"firstName": "Alan",
"lastName": "Birch",
"customerType": "Non Prescriber",
"hcpType": "Pharmacist Assistant",
"isMedicalExpert": false,
"customerAddresses": [
{
"addressType": "Primary Address",
"addressLine1": "4001 Leslie Street"
},
{
"addressType": "Other",
"addressLine1": "3004 Center St"
}
],
"meansOfContact": [
{
"type": "Email1",
"value": "alab#noname.com",
"status": "Active"
},
{
"type": "Email2",
"value": "balan#gmail.com",
"status": "Active"
}
],
"specialities": [
{
"specialtyType": "Primary Specialty",
"specialty": "Pharmacy Technician",
"status": "Active"
}
]
}
As you can see, I'm trying to use the customerId field of the JSON message as the document id for Elasticsearch. But the document_id is not being mapped to the customerId field; the document is inserted with the literal id %{customerId}.
How do I fix this? Appreciate your help.
That is telling you that the [customerId] field does not exist on that event. If the [message] field is JSON then you should add a json filter to parse it. That will create the [customerId] field, which you can then use as the document_id.
json { source => "message" }

Json Array splitting issue Logstash configuration : Unexpected end-of-input: expected close marker for Array (start marker at [Source: (S

This is how my JSON object looks; I have verified that the JSON I am getting is valid. I tried setting up configuration files for it, but I always get the same error:
JSON parse error, original data now in message field {:error=>#, :data=>"{\"total_rows\":15587,\"offset\":0,\"rows\":[\r"}
[2019-08-05T21:07:49,799][WARN ][logstash.filters.split ] Only String and Array types are splittable. field:[doc][serversGroups] is of type = NilClass
[2019-08-05T21:07:50,584][WARN ][logstash.filters.split ] Only String and Array types are splittable. field:[doc][serversGroups][ActiveUsers] is of type = NilClass
This is my source Config file i am using for logstash
filter {
json {
source => "message"
skip_on_invalid_json => "true"
target => "doc"
}
split {
field => "[doc][serversGroups]"
}
split {
field => "[doc][serversGroups][ActiveUsers]"
}
date {
match => [ "[doc][date]", "UNIX" ]
target => "unix_time"
}
mutate {
convert => { "[doc][serversGroups][ActiveUsers][handle]" => "integer"
"[doc][serversGroups][list][UsedLicenses]" => "integer"
"[doc][serversGroups][list][issuedLicenses]" => "integer"
}
}
fingerprint {
concatenate_all_fields => "true"
method => "SHA256"
target => "fingerprint"
}
}
output {
stdout {
codec => "rubydebug"
}
elasticsearch {
hosts => ["localhost:9200"]
index => "pyyython"
codec => "json"
document_id => "%{[fingerprint]}"
}
}
This is my source JSON
{
"total_rows": 156122,
"offset": 12,
"rows": [
{
"id": "12345",
"key": "12345",
"value": {
"rev": "1-12345"
},
"doc": {
"_id": "12345",
"_rev": "1-12345",
"date": "15645348122",
"HostServerName": "abc.com",
"serversGroups": [
{
"ServiceName": "--- ",
"list": {
"issuedLicenses": "123",
"UsedLicenses": "12"
},
"ActiveUsers": [
{}
]
},
{
"ServiceName": "--- ",
"list": {
"issuedLicenses": "123",
"UsedLicenses": "12"
},
"ActiveUsers": [
{}
]
},
{
"ServiceName": "--- ",
"list": {
"issuedLicenses": "123",
"UsedLicenses": "12"
},
"ActiveUsers": [
{}
]
},
{
"ServiceName": "--- ",
"list": {
"issuedLicenses": "123",
"UsedLicenses": "1"
},
"ActiveUsers": [
{
"user": "me",
"user_host": "myself",
"dispay": "andI",
"version": "v1.1",
"server_host": "testing.abc.com",
"handle": "12345",
"last_date_license_check": "7/7",
"last_time_license_check": "12:12"
}
]
}
]
}
}
]
}
I keep getting this error
JSON parse error, original data now in message field {:error=>#<LogStash::Json::ParserError: Unexpected end-of-input: expected close marker for Array (start marker at [Source: (S"; line: 1, column: 39])87,"offset":0,"rows":[
"; line: 2, column: 41]>, :data=>"{\"total_rows\":15587,\"offset\":0,\"rows\":[\r"}
[2019-08-05T21:07:49,799][WARN ][logstash.filters.split ] Only String and Array types are splittable. field:[doc][serversGroups] is of type = NilClass
[2019-08-05T21:07:50,584][WARN ][logstash.filters.split ] Only String and Array types are splittable. field:[doc][serversGroups][ActiveUsers] is of type = NilClass
not sure if my splitting is wrong!
The source JSON that you show is clearly invalid, since it ends with a comma. If I replace the comma with
]
}
}
]
}
then it is valid. With that change made it can be split using
split { field => "[doc][rows][0][doc][serversGroups]" }
split { field => "[doc][rows][0][doc][serversGroups][ActiveUsers]" }

Unable to create nested json output (aggregated) from CSV input

The issue I am facing is that I need to aggregate CSV inputs on ID, and the result contains multiple levels of nesting. I am able to produce a single level of nesting, but for deeper nesting I cannot work out the correct syntax.
INPUT:
input {
generator {
id => "first"
type => 'csv'
message => '829cd0e0-8d24-4f25-92e1-724e6bd811e0,GSIH1,2017-10-10 00:00:00.000,HCC,0.83,COMMUNITYID1'
count => 1
}
generator {
id => "second"
type => 'csv'
message => '829cd0e0-8d24-4f25-92e1-724e6bd811e0,GSIH1,2017-10-10 00:00:00.000,LACE,12,COMMUNITYID1'
count => 1
}
generator {
id => "third"
type => 'csv'
message => '829cd0e0-8d24-4f25-92e1-724e6bd811e0,GSIH1,2017-10-10 00:00:00.000,CCI,0.23,COMMUNITYID1'
count => 1
}
}
filter
{
csv {
columns => ['id', 'reference', 'occurrenceDateTime', 'code', 'probabilityDecimal', 'comment']
}
mutate {
rename => {
"reference" => "[subject][reference]"
"code" => "[prediction][outcome][coding][code]"
"probabilityDecimal" => "[prediction][probabilityDecimal]"
}
}
mutate {
add_field => {
"[resourceType]" => "RiskAssessment"
"[prediction][outcome][text]" => "Member HCC score based on CMS HCC V22 risk adjustment model"
"[status]" => "final"
}
}
mutate {
update => {
"[subject][reference]" => "Patient/%{[subject][reference]}"
"[comment]" => "CommunityId/%{[comment]}"
}
}
mutate {
remove_field => [ "#timestamp", "sequence", "#version", "message", "host", "type" ]
}
}
filter {
aggregate {
task_id => "%{id}"
code => "
map['resourceType'] = event.get('resourceType')
map['id'] = event.get('id')
map['status'] = event.get('status')
map['occurrenceDateTime'] = event.get('occurrenceDateTime')
map['comment'] = event.get('comment')
map['[reference]'] = event.get('[subject][reference]')
map['[prediction]'] ||= []
map['[prediction]'] << {
'code' => event.get('[prediction][outcome][coding][code]'),
'text' => event.get('[prediction][outcome][text]'),
'probabilityDecimal'=> event.get('[prediction][probabilityDecimal]')
}
event.cancel()
"
push_previous_map_as_event => true
timeout => 3
}
mutate {
remove_field => [ "#timestamp", "tags", "#version"]
}
}
output{
elasticsearch {
template => "templates/riskFactor.json"
template_name => "riskFactor"
action => "index"
hosts => ["localhost:9201"]
index => ["deepak"]
}
stdout {
codec => json{}
}
}
OUTPUT:
{
"reference": "Patient/GSIH1",
"comment": "CommunityId/COMMUNITYID1",
"id": "829cd0e0-8d24-4f25-92e1-724e6bd811e0",
"status": "final",
"resourceType": "RiskAssessment",
"occurrenceDateTime": "2017-10-10 00:00:00.000",
"prediction": [
{
"probabilityDecimal": "0.83",
"code": "HCC",
"text": "Member HCC score based on CMS HCC V22 risk adjustment model"
},
{
"probabilityDecimal": "0.23",
"code": "CCI",
"text": "Member HCC score based on CMS HCC V22 risk adjustment model"
},
{
"probabilityDecimal": "12",
"code": "LACE",
"text": "Member HCC score based on CMS HCC V22 risk adjustment model"
}
]
}
REQUIRED OUTPUT:
{
"resourceType": "RiskAssessment",
"id": "829cd0e0-8d24-4f25-92e1-724e6bd811e0",
"status": "final",
"subject": {
"reference": "Patient/GSIH1"
},
"occurrenceDateTime": "2017-10-10 00:00:00.000",
"prediction": [
{
"outcome": {
"coding": [
{
"code": "HCC"
}
],
"text": "Member HCC score based on CMS HCC V22 risk adjustment model"
},
"probabilityDecimal": 0.83
},
{
"outcome": {
"coding": [
{
"code": "CCI"
}
],
"text": "Member HCC score based on CMS HCC V22 risk adjustment model"
},
"probabilityDecimal": 0.83
}
],
"comment": "CommunityId/COMMUNITYID1"
}
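Building deeper nesting inside the aggregate code block generally comes down to writing nested Ruby hash and array literals, much like the ruby filter earlier on this page builds the metadata array. A hedged, untested sketch of how the prediction entries from the required output could be assembled, reusing the field names from the question (the subject block can be built the same way):
map['prediction'] ||= []
map['prediction'] << {
  'outcome' => {
    'coding' => [
      { 'code' => event.get('[prediction][outcome][coding][code]') }
    ],
    'text' => event.get('[prediction][outcome][text]')
  },
  'probabilityDecimal' => event.get('[prediction][probabilityDecimal]').to_f
}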

Show location points in a tile map with kibi

I'm using logstash 2.3.1, elasticsearch 2.3.1 and kibi 0.3.2. I have problems visualizing locations in a map with kibi.
I have the following configuration in logstash:
input {
file {
path => "/opt/logstash-2.3.1/logTest/Dades.csv"
type => "Dades"
start_position => "beginning"
}
}
filter {
csv {
columns => ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10", "c11", "c12", "c13", "c14", "c15", "c16", "c17", "c18", "c19", "c20", "c21", "c22", "c23"]
separator => ";"
}
ruby {
code => "
temp = event['c17']
event['c17'] = temp[0..1].to_f+ (temp[2..8].to_f/60)
temp = event['c19']
event['c19'] = temp[0..2].to_f+ (temp[3..8].to_f/60)
"
}
mutate {
convert => {
"c3" => "float"
"c5" => "float"
"c7" => "float"
"c9" => "float"
"c11" => "float"
"c13" => "float"
"c15" => "float"
"c21" => "float"
"c23" => "float"
}
}
date {
match => [ "c1", "dd/MM/YYYY HH:mm:ss.SSS", "ISO8601"]
target => "ts_date"
}
mutate {
rename => [ "c17", "[location][lat]",
"c19", "[location][lon]" ]
}
}
output {
elasticsearch {
hosts => localhost
index => "tram3"
manage_template => false
template => "tram3_template.json"
template_name => "tram3"
template_overwrite => "true"
}
stdout {
codec => rubydebug
}
}
The mapping configuration file (tram3_template.json) is like this:
{
"template": "tram3",
"order": 1,
"settings": {
"number_of_shards": 1
},
"mappings": {
"tram3": {
"_all": {
"enabled": false
},
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
When I import the CSV file into Elasticsearch it seems that everything works OK. The output is something like this:
{
"message" => "26/02/2016 00:00:22.984;Total;4231.143555;Trac1;26.547932;Trac2;-338.939697;AA1;-364.611511;AA2;3968.135010;Reo1;0.000000;Reo2;0.000000;Latitud;4125.1846;Longitud;00213.5219;Speed;0.000000;CVS;3873.429443;\r",
"#version" => "1",
"#timestamp" => "2016-04-25T14:02:52.901Z",
"path" => "/opt/logstash-2.3.1/logTest/Dades.csv",
"host" => "ubuntu",
"type" => "Dades",
"c1" => "26/02/2016 00:00:22.984",
"c2" => "Total",
"c3" => 4231.143555,
"c4" => "Trac1",
"c5" => 26.547932,
"c6" => "Trac2",
"c7" => -338.939697,
"c8" => "AA1",
"c9" => -364.611511,
"c10" => "AA2",
"c11" => 3968.13501,
"c12" => "Reo1",
"c13" => 0.0,
"c14" => "Reo2",
"c15" => 0.0,
"c16" => "Latitud",
"c18" => "Longitud",
"c20" => "Speed",
"c21" => 0.0,
"c22" => "CVS",
"c23" => 3873.429443,
"column24" => nil,
"ts_date" => "2016-02-25T23:00:22.984Z",
"location" => {
"lat" => 41.41974333333334,
"lon" => 2.22535
}
}
But when I try to visualize the location parameter in a map, it doesn't show any results.
I don't know what I'm doing wrong. Why don't the location points appear on the map?
In your ES mapping file, you probably need to enable the storage of the geohash sub-field (defaults to false) as the geohash aggregation cannot work without it.
{
"template": "tram3",
"order": 1,
"settings": {
"number_of_shards": 1
},
"mappings": {
"tram3": {
"_all": {
"enabled": false
},
"properties": {
"location": {
"type": "geo_point",
"geohash": true, <-- add this
"geohash_prefix": true <-- add this
}
}
}
}
}
Then you can build a geohash aggregation on the location.geohash field
Note that if you want to also index all geohash prefixes, you can also add "geohash_prefix": true to your field mapping.
UPDATE
After reproducing the case, here are some more fixes to do:
You need to change the type in your file input, as it will be used as the document type, and your mapping specifies that the mapping type is named dades2, not Dades:
file {
path => "/opt/logstash-2.3.1/logTest/Dades.csv"
type => "dades2"
start_position => "beginning"
sincedb_path => "/dev/null"
}
Your elasticsearch output should look like below; namely, manage_template should be true and template should point to the full path of your dades2_template.json file (make sure to replace /full/path/to with the actual path name).
elasticsearch {
hosts => localhost
index => "dades2"
manage_template => true
template => "/full/path/to/dades2_template.json"
template_name => "dades2"
template_overwrite => "true"
}
The new dades2_template.json file should look like this
{
"template": "dades2",
"order": 1,
"settings": {
"number_of_shards": 1
},
"mappings": {
"dades2": {
"_all": {
"enabled": false
},
"properties": {
"location": {
"type": "geo_point",
"geohash": true,
"geohash_prefix": true
}
}
}
}
}
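As a quick sanity check that the points are being indexed into the geo_point field, a geohash_grid aggregation can be run directly against the index (a sketch; index and field names are taken from the config above, and the precision value is arbitrary):
curl -XGET 'localhost:9200/dades2/_search?pretty' -d '{
  "size": 0,
  "aggs": {
    "grid": {
      "geohash_grid": { "field": "location", "precision": 5 }
    }
  }
}'
If this returns non-empty buckets, the mapping and the ingested data are fine and the tile map should be able to render them.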

How do I access JSON array data?

I have the following array:
[ { "attributes": {
"id": "usdeur",
"code": 4
},
"name": "USD/EUR"
},
{ "attributes": {
"id": "eurgbp",
"code": 5
},
"name": "EUR/GBP"
}
]
How can I get both ids for further processing as output?
I tried a lot, but with no success. My problem is that I always get only one id as output:
Market.all.select.each do |market|
present market.id
end
Or:
Market.all.each{|attributes| present attributes[:id]}
which gives me only "eurgbp" as a result while I need both ids.
JSON#parse should help you with this
require 'json'
json = '[ { "attributes": {
"id": "usdeur",
"code": 4
},
"name": "USD/EUR"
},
{ "attributes": {
"id": "eurgbp",
"code": 5
},
"name": "EUR/GBP"
}]'
ids = JSON.parse(json).map{|hash| hash['attributes']['id'] }
#=> ["usdeur", "eurgbp"]
JSON#parse turns a jSON response into a Hash then just use standard Hash methods for access.
I'm going to assume that the data is JSON that you're parsing (with JSON.parse) into a Ruby Array of Hashes, which would look like this:
hashes = [ { "attributes" => { "id" => "usdeur", "code" => 4 },
"name" => "USD/EUR"
},
{ "attributes" => { "id" => "eurgbp", "code" => 5 },
"name" => "EUR/GBP"
} ]
If you wanted to get just the first "id" value, you'd do this:
first_hash = hashes[0]
first_hash_attributes = first_hash["attributes"]
p first_hash_attributes["id"]
# => "usdeur"
Or just:
p hashes[0]["attributes"]["id"]
# => "usdeur"
To get them all, you'll do this:
all_attributes = hashes.map {|hash| hash["attributes"] }
# => [ { "id" => "usdeur", "code" => 4 },
# { "id" => "eurgbp", "code" => 5 } ]
all_ids = all_attributes.map {|attrs| attrs["id"] }
# => [ "usdeur", "eurgbp" ]
Or just:
p hashes.map {|hash| hash["attributes"]["id"] }
# => [ "usdeur", "eurgbp" ]
The JSON library that Rails uses is very slow...
I prefer to use:
gem 'oj'
from https://github.com/ohler55/oj
It's fast and simple! LET'S GO!
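For the example in the question, Oj is essentially a drop-in replacement for JSON.parse (a sketch; the :compat mode mimics JSON.parse and returns plain hashes with string keys):
require 'oj'

json = '[{"attributes":{"id":"usdeur","code":4},"name":"USD/EUR"},
         {"attributes":{"id":"eurgbp","code":5},"name":"EUR/GBP"}]'

# Oj.load parses the JSON string into plain Ruby arrays and hashes
ids = Oj.load(json, mode: :compat).map { |hash| hash['attributes']['id'] }
#=> ["usdeur", "eurgbp"]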
