Unable to create nested json output (aggregated) from CSV input - elasticsearch

The issue I am facing: I need to aggregate CSV inputs on ID, and the target document contains multiple levels of nesting. I can produce a single level of nesting, but for deeper nesting I cannot work out the correct syntax.
INPUT:
input {
generator {
id => "first"
type => 'csv'
message => '829cd0e0-8d24-4f25-92e1-724e6bd811e0,GSIH1,2017-10-10 00:00:00.000,HCC,0.83,COMMUNITYID1'
count => 1
}
generator {
id => "second"
type => 'csv'
message => '829cd0e0-8d24-4f25-92e1-724e6bd811e0,GSIH1,2017-10-10 00:00:00.000,LACE,12,COMMUNITYID1'
count => 1
}
generator {
id => "third"
type => 'csv'
message => '829cd0e0-8d24-4f25-92e1-724e6bd811e0,GSIH1,2017-10-10 00:00:00.000,CCI,0.23,COMMUNITYID1'
count => 1
}
}
filter
{
csv {
columns => ['id', 'reference', 'occurrenceDateTime', 'code', 'probabilityDecimal', 'comment']
}
mutate {
rename => {
"reference" => "[subject][reference]"
"code" => "[prediction][outcome][coding][code]"
"probabilityDecimal" => "[prediction][probabilityDecimal]"
}
}
mutate {
add_field => {
"[resourceType]" => "RiskAssessment"
"[prediction][outcome][text]" => "Member HCC score based on CMS HCC V22 risk adjustment model"
"[status]" => "final"
}
}
mutate {
update => {
"[subject][reference]" => "Patient/%{[subject][reference]}"
"[comment]" => "CommunityId/%{[comment]}"
}
}
mutate {
remove_field => [ "#timestamp", "sequence", "#version", "message", "host", "type" ]
}
}
filter {
aggregate {
task_id => "%{id}"
code => "
map['resourceType'] = event.get('resourceType')
map['id'] = event.get('id')
map['status'] = event.get('status')
map['occurrenceDateTime'] = event.get('occurrenceDateTime')
map['comment'] = event.get('comment')
map['[reference]'] = event.get('[subject][reference]')
map['[prediction]'] ||= []
map['[prediction]'] << {
'code' => event.get('[prediction][outcome][coding][code]'),
'text' => event.get('[prediction][outcome][text]'),
'probabilityDecimal'=> event.get('[prediction][probabilityDecimal]')
}
event.cancel()
"
push_previous_map_as_event => true
timeout => 3
}
mutate {
remove_field => [ "#timestamp", "tags", "#version"]
}
}
output{
elasticsearch {
template => "templates/riskFactor.json"
template_name => "riskFactor"
action => "index"
hosts => ["localhost:9201"]
index => ["deepak"]
}
stdout {
codec => json{}
}
}
OUTPUT:
{
"reference": "Patient/GSIH1",
"comment": "CommunityId/COMMUNITYID1",
"id": "829cd0e0-8d24-4f25-92e1-724e6bd811e0",
"status": "final",
"resourceType": "RiskAssessment",
"occurrenceDateTime": "2017-10-10 00:00:00.000",
"prediction": [
{
"probabilityDecimal": "0.83",
"code": "HCC",
"text": "Member HCC score based on CMS HCC V22 risk adjustment model"
},
{
"probabilityDecimal": "0.23",
"code": "CCI",
"text": "Member HCC score based on CMS HCC V22 risk adjustment model"
},
{
"probabilityDecimal": "12",
"code": "LACE",
"text": "Member HCC score based on CMS HCC V22 risk adjustment model"
}
]
}
REQUIRED OUTPUT:
{
"resourceType": "RiskAssessment",
"id": "829cd0e0-8d24-4f25-92e1-724e6bd811e0",
"status": "final",
"subject": {
"reference": "Patient/GSIH1"
},
"occurrenceDateTime": "2017-10-10 00:00:00.000",
"prediction": [
{
"outcome": {
"coding": [
{
"code": "HCC"
}
],
"text": "Member HCC score based on CMS HCC V22 risk adjustment model"
},
"probabilityDecimal": 0.83
},
{
"outcome": {
"coding": [
{
"code": "CCI"
}
],
"text": "Member HCC score based on CMS HCC V22 risk adjustment model"
},
"probabilityDecimal": 0.83
}
],
"comment": "CommunityId/COMMUNITYID1"
}
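One way to get that shape (a sketch only, untested) is to build the nested Ruby hashes directly inside the aggregate filter's code block rather than relying on flat field references. The field names are taken from the config above; the .to_f call is an assumption to turn probabilityDecimal into a number as in the required output:
filter {
aggregate {
task_id => "%{id}"
code => "
# scalar fields: set once per ID
map['resourceType'] ||= event.get('resourceType')
map['id'] ||= event.get('id')
map['status'] ||= event.get('status')
map['occurrenceDateTime'] ||= event.get('occurrenceDateTime')
map['comment'] ||= event.get('comment')
map['subject'] ||= { 'reference' => event.get('[subject][reference]') }
# prediction: append one nested entry per CSV line
map['prediction'] ||= []
map['prediction'] << {
'outcome' => {
'coding' => [ { 'code' => event.get('[prediction][outcome][coding][code]') } ],
'text' => event.get('[prediction][outcome][text]')
},
'probabilityDecimal' => event.get('[prediction][probabilityDecimal]').to_f
}
event.cancel()
"
push_previous_map_as_event => true
timeout => 3
}
}
The csv and mutate filters and the output can stay as they are; only the aggregate code block changes.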

Related

Incorrect document_id for Logstash Elasticsearch output

I'm using Logstash to read JSON messages from a Solace queue and write them to Elasticsearch. I'm using doc_as_upsert => true along with the document_id parameter in the output. This is what my Logstash configuration looks like:
logstash.conf
input
{
jms {
include_header => false
include_properties => false
include_body => true
use_jms_timestamp => false
destination => 'SpringBatchTestQueue'
pub_sub => false
jndi_name => '/JMS/CF/MDM'
jndi_context => {
'java.naming.factory.initial' => 'com.solacesystems.jndi.SolJNDIInitialContextFactory'
'java.naming.security.principal' => 'EDM_Test_User@NovartisDevVPN'
'java.naming.provider.url' => 'tcp://localhost:55555'
'java.naming.security.credentials' => 'EDM_Test_User'
}
require_jars=> ['/app/elasticsearch/jms/commons-lang-2.6.jar',
'/app/elasticsearch/jms/sol-jms-10.10.0.jar',
'/app/elasticsearch/jms/geronimo-jms_1.1_spec-1.1.1.jar']
}
}
output
{
elasticsearch
{
hosts => ["https://glchbs-sd220240.eu.novartis.net:9200/"]
index => "test-%{+YYYY.MM.dd}"
document_id => "%{customerId}"
doc_as_upsert => true
ssl => true
ssl_certificate_verification => true
cacert => "/app/elasticsearch/config/ssl/Novartis_Silver_Three_Chain.pem"
}
}
Json Message:
{
"customerId": "N-CA-Z9II2YJ1YJ",
"name": "Alan Birch",
"customerRecordType": "Health Care Professional",
"country": "CA",
"language": "EN",
"privacyLawStatus": false,
"salutation": "Mr.",
"firstName": "Alan",
"lastName": "Birch",
"customerType": "Non Prescriber",
"hcpType": "Pharmacist Assistant",
"isMedicalExpert": false,
"customerAddresses": [
{
"addressType": "Primary Address",
"addressLine1": "4001 Leslie Street"
},
{
"addressType": "Other",
"addressLine1": "3004 Center St"
}
],
"meansOfContact": [
{
"type": "Email1",
"value": "alab#noname.com",
"status": "Active"
},
{
"type": "Email2",
"value": "balan#gmail.com",
"status": "Active"
}
],
"specialities": [
{
"specialtyType": "Primary Specialty",
"specialty": "Pharmacy Technician",
"status": "Active"
}
]
}
As you can see, I'm trying to use the customerId field of the JSON message as the document id for Elasticsearch, but the document_id is not being mapped to the customerId field: documents are being inserted with the literal id %{customerId}.
How do I fix this? I appreciate your help.
That is telling you that the [customerId] field does not exist on that event. If the [message] field is JSON then you should add a json filter to parse it. That will create the [customerId] field, which you can then use as the document_id.
json { source => "message" }

Transform ElasticSearch index from field explosions into nested documents via Logstash

So we have an old elasticsearch index that succumbed to field explosion. We have redesigned the structure of the index to fix this using nested documents. However, we are attempting to figure out how to migrate the old index data into the new structure. We are currently looking at using Logstash plugins, notably the aggregate plugin, to try to accomplish this. However, all the examples we can find show how to create the nested documents from database calls, as opposed to from a field-exploded index. For context, here is an example of what an old index might look like:
"assetID": 22074,
"metadata": {
"50": {
"analyzed": "Phase One",
"full": "Phase One",
"date": "0001-01-01T00:00:00"
},
"51": {
"analyzed": "H 25",
"full": "H 25",
"date": "0001-01-01T00:00:00"
},
"58": {
"analyzed": "50",
"full": "50",
"date": "0001-01-01T00:00:00"
}
}
And here is what we would like the transformed data to look like in the end:
"assetID": 22074,
"metadata": [{
"metadataId": 50,
"ngrams": "Phase One", //This was "analyzed"
"alphanumeric": "Phase One", //This was "full"
"date": "0001-01-01T00:00:00"
}, {
"metadataId": 51,
"ngrams": "H 25", //This was "analyzed"
"alphanumeric": "H 25", //This was "full"
"date": "0001-01-01T00:00:00"
}, {
"metadataId": 58,
"ngrams": "50", //This was "analyzed"
"alphanumeric": "50", //This was "full"
"date": "0001-01-01T00:00:00"
}
]
As a dumbed-down example, here is what we can figure from the aggregate plugin:
input {
elasticsearch {
hosts => "my.old.host.name:9266"
index => "my-old-index"
query => '{"query": {"bool": {"must": [{"term": {"_id": "22074"}}]}}}'
size => 500
scroll => "5m"
docinfo => true
}
}
filter {
aggregate {
task_id => "%{id}"
code => "
map['assetID'] = event.get('assetID')
map['metadata'] ||= []
map['metadata'] << {
metadataId => ? //somehow parse the Id out of the exploded field name "metadata.#.full",
ngrams => event.get('metadata.#.analyzed'),
alphanumeric => event.get('metadata.#.full'),
date => event.get('metadata.#.date'),
}
"
push_previous_map_as_event => true
timeout => 150000
timeout_tags => ['aggregated']
}
if "aggregated" not in [tags] {
drop {}
}
}
output {
elasticsearch {
hosts => "my.new.host:9266"
index => "my-new-index"
document_type => "%{[#metadata][_type]}"
document_id => "%{[#metadata][_id]}"
action => "update"
}
file {
path => "C:\apps\logstash\logstash-5.6.6\testLog.log"
}
}
Obviously the above example is basically just pseudocode, but that is all we can gather from looking at the documentation for both Logstash and ElasticSearch, as well as the aggregate filter plugin and generally Googling things within an inch of their life.
You can play around with the event object, massage it and then add it into the new index. Something like below (The logstash code is untested, you may find some errors. Check the working ruby code after this section):
aggregate {
task_id => "%{id}"
code => "arr = Array.new()
map["assetID"] = event.get("assetID")
metadataObj = event.get("metadata")
metadataObj.to_hash.each do |key,value|
transformedMetadata = {}
transformedMetadata["metadataId"] = key
value.to_hash.each do |k , v|
if k == "analyzed" then
transformedMetadata["ngrams"] = v
elsif k == "full" then
transformedMetadata["alphanumeric"] = v
else
transformedMetadata["date"] = v
end
end
arr.push(transformedMetadata)
end
map['metadata'] ||= []
map['metadata'] << arr
"
}
Try to play around with the above based on the event input and you will get there. Here's a working example, using the input from the question, for you to play around with: https://repl.it/repls/HarshIntelligentEagle
json_data = {"assetID": 22074,
"metadata": {
"50": {
"analyzed": "Phase One",
"full": "Phase One",
"date": "0001-01-01T00:00:00"
},
"51": {
"analyzed": "H 25",
"full": "H 25",
"date": "0001-01-01T00:00:00"
},
"58": {
"analyzed": "50",
"full": "50",
"date": "0001-01-01T00:00:00"
}
}
}
arr = Array.new()
transformedObj = {}
transformedObj["assetID"] = json_data[:assetID]
json_data[:metadata].to_hash.each do |key,value|
transformedMetadata = {}
transformedMetadata["metadataId"] = key
value.to_hash.each do |k , v|
if k == :analyzed then
transformedMetadata["ngrams"] = v
elsif k == :full then
transformedMetadata["alphanumeric"] = v
else
transformedMetadata["date"] = v
end
end
arr.push(transformedMetadata)
end
transformedObj["metadata"] = arr
puts transformedObj
In the end, we used ruby code to solve it in a script:
# Must use the input plugin for elasticsearch at version 4.0.2, or it cannot contact a 1.X index
input {
elasticsearch {
hosts => "my.old.host.name:9266"
index => "my-old-index"
query => '{
"query": {
"bool": {
"must": [
{ "match_all": { } }
]
}
}
}'
size => 500
scroll => "5m"
docinfo => true
}
}
filter {
mutate {
remove_field => ['@version', '@timestamp']
}
}
#metadata
filter {
mutate {
rename => { "[metadata]" => "[metadata_OLD]" }
}
ruby {
code => "
metadataDocs = []
metadataFields = event.get('metadata_OLD')
metadataFields.each { |key, value|
metadataDoc = {
'metadataID' => key.to_i,
'date' => value['date']
}
if !value['full'].nil?
metadataDoc[:alphanumeric] = value['full']
end
if !value['analyzed'].nil?
metadataDoc[:ngrams] = value['analyzed']
end
metadataDocs << metadataDoc
}
event.set('metadata', metadataDocs)
"
}
mutate {
remove_field => ['metadata_OLD']
}
}
output {
elasticsearch {
hosts => "my.new.host:9266"
index => "my-new-index"
document_type => "searchasset"
document_id => "%{assetID}"
action => "update"
doc_as_upsert => true
}
file {
path => "F:\logstash-6.1.2\logs\esMigration.log"
}
}

How to preprocess a document before indexing?

I'm using Logstash and Elasticsearch to collect tweets using the Twitter plugin. My problem is that I receive a document from Twitter and I would like to do some preprocessing before indexing it. Let's say that I have this as a document coming from Twitter:
{
"tweet": {
"tweetId": 1025,
"tweetContent": "Hey this is a fake document for stackoverflow #stackOverflow #elasticsearch",
"hashtags": ["stackOverflow", "elasticsearch"],
"publishedAt": "2017 23 August",
"analytics": {
"likeNumber": 400,
"shareNumber": 100,
}
},
"author":{
"authorId": 819744,
"authorAt": "the_expert",
"authorName": "John Smith",
"description": "Haha it's a fake description"
}
}
Now out of this document that twitter is sending me I would like to generate two documents:
the first one will be indexed in twitter/tweet/1025 :
# The id for this document should be the one from tweetId `"tweetId": 1025`
{
"content": "Hey this is a fake document for stackoverflow #stackOverflow #elasticsearch", # this field has been renamed
"hashtags": ["stackOverflow", "elasticsearch"],
"date": "2017/08/23", # the date has been formated
"shareNumber": 100 # This field has been flattened
}
The second one will be indexed in twitter/author/819744:
# The id for this document should be the one from authorId `"authorId": 819744 `
{
"authorAt": "the_expert",
"description": "Haha it's a fake description"
}
I have defined my output as follows:
output {
stdout { codec => dots }
elasticsearch {
hosts => [ "localhost:9200" ]
index => "twitter"
document_type => "tweet"
}
}
How can I process the information from twitter?
EDIT:
So my full config file should look like:
input {
twitter {
consumer_key => "consumer_key"
consumer_secret => "consumer_secret"
oauth_token => "access_token"
oauth_token_secret => "access_token_secret"
keywords => [ "random", "word"]
full_tweet => true
type => "tweet"
}
}
filter {
clone {
clones => ["author"]
}
if([type] == "tweet") {
mutate {
remove_field => ["authorId", "authorAt"]
}
} else {
mutate {
remove_field => ["tweetId", "tweetContent"]
}
}
}
output {
stdout { codec => dots }
if [type] == "tweet" {
elasticsearch {
hosts => [ "localhost:9200" ]
index => "twitter"
document_type => "tweet"
document_id => "%{[tweetId]}"
}
} else {
elasticsearch {
hosts => [ "localhost:9200" ]
index => "twitter"
document_type => "author"
document_id => "%{[authorId]}"
}
}
}
You could use the clone filter plugin in Logstash. Here is a sample Logstash configuration file that takes JSON input from stdin and simply shows the output on stdout:
input {
stdin {
codec => json
type => "tweet"
}
}
filter {
mutate {
add_field => {
"tweetId" => "%{[tweet][tweetId]}"
"content" => "%{[tweet][tweetContent]}"
"date" => "%{[tweet][publishedAt]}"
"shareNumber" => "%{[tweet][analytics][shareNumber]}"
"authorId" => "%{[author][authorId]}"
"authorAt" => "%{[author][authorAt]}"
"description" => "%{[author][description]}"
}
}
date {
match => ["date", "yyyy dd MMMM"]
target => "date"
}
ruby {
code => '
event.set("hashtags", event.get("[tweet][hashtags]"))
'
}
clone {
clones => ["author"]
}
mutate {
remove_field => ["author", "tweet", "message"]
}
if([type] == "tweet") {
mutate {
remove_field => ["authorId", "authorAt", "description"]
}
} else {
mutate {
remove_field => ["tweetId", "content", "hashtags", "date", "shareNumber"]
}
}
}
output {
stdout {
codec => rubydebug
}
}
Using as input:
{"tweet": { "tweetId": 1025, "tweetContent": "Hey this is a fake document", "hashtags": ["stackOverflow", "elasticsearch"], "publishedAt": "2017 23 August","analytics": { "likeNumber": 400, "shareNumber": 100 } }, "author":{ "authorId": 819744, "authorAt": "the_expert", "authorName": "John Smith", "description": "fake description" } }
You would get these two documents:
{
"date" => 2017-08-23T00:00:00.000Z,
"hashtags" => [
[0] "stackOverflow",
[1] "elasticsearch"
],
"type" => "tweet",
"tweetId" => "1025",
"content" => "Hey this is a fake document",
"shareNumber" => "100",
"#timestamp" => 2017-08-23T20:36:53.795Z,
"#version" => "1",
"host" => "my-host"
}
{
"description" => "fake description",
"type" => "author",
"authorId" => "819744",
"#timestamp" => 2017-08-23T20:36:53.795Z,
"authorAt" => "the_expert",
"#version" => "1",
"host" => "my-host"
}
You could alternatively use a ruby script to flatten the fields, and then use rename on mutate, when necessary.
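For instance, a rename-based variant of the flattening step (untested; field paths assumed from the sample document above) might look like:
filter {
mutate {
rename => {
"[tweet][tweetContent]" => "content"
"[tweet][publishedAt]" => "date"
"[tweet][analytics][shareNumber]" => "shareNumber"
"[tweet][hashtags]" => "hashtags"
"[author][authorAt]" => "authorAt"
"[author][description]" => "description"
}
}
}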
If you want Elasticsearch to use authorId and tweetId instead of the default ID, you can configure the elasticsearch output with document_id:
output {
stdout { codec => dots }
if [type] == "tweet" {
elasticsearch {
hosts => [ "localhost:9200" ]
index => "twitter"
document_type => "tweet"
document_id => "%{[tweetId]}"
}
} else {
elasticsearch {
hosts => [ "localhost:9200" ]
index => "twitter"
document_type => "tweet"
document_id => "%{[authorId]}"
}
}
}

How to query array of objects as part of term query

I am using elasticsearch 5.5.0.
In my index I have data of type attraction; part of the JSON in Elasticsearch looks like:
"directions": "Exit the M4 at Junction 1",
"phoneNumber": "03333212001",
"website": "https://www.londoneye.com/",
"postCode": "SE1 7PB",
"categories": [
{
"id": "ce4cf4d0-6ddd-49fd-a8fe-3cbf7be9b61d",
"name": "Theater"
},
{
"id": "5fa1a3ce-fd5f-450f-92b7-2be6e3d0df90",
"name": "Family"
},
{
"id": "ed492986-b8a7-43c3-be3d-b17c4055bfa0",
"name": "Outdoors"
}
],
"genres": [],
"featuredImage": "https://www.daysoutguide.co.uk/media/1234/london-eye.jpg",
"images": [],
"region": "London",
My NEST query looks like:
var query2 = Query<Attraction>.Bool(
bq => bq.Filter(
fq => fq.Terms(t => t.Field(f => f.Region).Terms(request.Region.ToLower())),
fq => fq.Terms(t => t.Field(f => f.Categories).Terms(request.Category.ToLower()))));
The query generated looks like:
{
"query": {
"bool": {
"filter": [
{
"terms": {
"region": [
"london"
]
}
},
{
"terms": {
"categories": [
"family"
]
}
}
]
}
}
}
That returns no results. If I take out the categories bit I get results. So I am trying to do a terms filter on categories, which is an array of objects. It looks like I am writing this query wrong. Does anyone have any hints on how to get this to work?
Regards
Ismail
You can still use strongly typed property access by using:
t.Field(f => f.Categories.First().Name)
NEST's property inferrer will read over .First() and yield categories.name.
t.Field(f => f.Categories[0].Name) works as well.
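With either expression, the generated filter should then target the categories.name field, roughly:
{
"terms": {
"categories.name": [
"family"
]
}
}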

Show location points in a tile map with kibi

I'm using Logstash 2.3.1, Elasticsearch 2.3.1 and Kibi 0.3.2. I have problems visualizing locations on a map with Kibi.
I have the following configuration in logstash:
input {
file {
path => "/opt/logstash-2.3.1/logTest/Dades.csv"
type => "Dades"
start_position => "beginning"
}
}
filter {
csv {
columns => ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10", "c11", "c12", "c13", "c14", "c15", "c16", "c17", "c18", "c19", "c20", "c21", "c22", "c23"]
separator => ";"
}
ruby {
code => "
temp = event['c17']
event['c17'] = temp[0..1].to_f+ (temp[2..8].to_f/60)
temp = event['c19']
event['c19'] = temp[0..2].to_f+ (temp[3..8].to_f/60)
"
}
mutate {
convert => {
"c3" => "float"
"c5" => "float"
"c7" => "float"
"c9" => "float"
"c11" => "float"
"c13" => "float"
"c15" => "float"
"c21" => "float"
"c23" => "float"
}
}
date {
match => [ "c1", "dd/MM/YYYY HH:mm:ss.SSS", "ISO8601"]
target => "ts_date"
}
mutate {
rename => [ "c17", "[location][lat]",
"c19", "[location][lon]" ]
}
}
output {
elasticsearch {
hosts => localhost
index => "tram3"
manage_template => false
template => "tram3_template.json"
template_name => "tram3"
template_overwrite => "true"
}
stdout {
codec => rubydebug
}
}
The mapping configuration file (tram3_template.json) is like this:
{
"template": "tram3",
"order": 1,
"settings": {
"number_of_shards": 1
},
"mappings": {
"tram3": {
"_all": {
"enabled": false
},
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
When I import the CSV file into Elasticsearch it seems that everything works OK. The output is something like this:
{
"message" => "26/02/2016 00:00:22.984;Total;4231.143555;Trac1;26.547932;Trac2;-338.939697;AA1;-364.611511;AA2;3968.135010;Reo1;0.000000;Reo2;0.000000;Latitud;4125.1846;Longitud;00213.5219;Speed;0.000000;CVS;3873.429443;\r",
"#version" => "1",
"#timestamp" => "2016-04-25T14:02:52.901Z",
"path" => "/opt/logstash-2.3.1/logTest/Dades.csv",
"host" => "ubuntu",
"type" => "Dades",
"c1" => "26/02/2016 00:00:22.984",
"c2" => "Total",
"c3" => 4231.143555,
"c4" => "Trac1",
"c5" => 26.547932,
"c6" => "Trac2",
"c7" => -338.939697,
"c8" => "AA1",
"c9" => -364.611511,
"c10" => "AA2",
"c11" => 3968.13501,
"c12" => "Reo1",
"c13" => 0.0,
"c14" => "Reo2",
"c15" => 0.0,
"c16" => "Latitud",
"c18" => "Longitud",
"c20" => "Speed",
"c21" => 0.0,
"c22" => "CVS",
"c23" => 3873.429443,
"column24" => nil,
"ts_date" => "2016-02-25T23:00:22.984Z",
"location" => {
"lat" => 41.41974333333334,
"lon" => 2.22535
}
}
But when I try to visualize the location parameter on a map, it doesn't show any results.
I don't know what I'm doing wrong. Why don't the location points appear on the map?
In your ES mapping file, you probably need to enable the storage of the geohash sub-field (defaults to false) as the geohash aggregation cannot work without it.
{
"template": "tram3",
"order": 1,
"settings": {
"number_of_shards": 1
},
"mappings": {
"tram3": {
"_all": {
"enabled": false
},
"properties": {
"location": {
"type": "geo_point",
"geohash": true, <-- add this
"geohash_prefix": true <-- add this
}
}
}
}
}
Then you can build a geohash aggregation on the location.geohash field
Note that if you also want to index all geohash prefixes, you can add "geohash_prefix": true to your field mapping.
UPDATE
After reproducing the case, here are some more fixes to do:
You need to change the type in your file input as it will be used as the document type and your mapping specifies that the mapping type is named dades2 not Dades:
file {
path => "/opt/logstash-2.3.1/logTest/Dades.csv"
type => "dades2"
start_position => "beginning"
sincedb_path => "/dev/null"
}
Your elasticsearch output should look like below; namely, manage_template should be true and template should use the full path to your dades2_template.json file (make sure to replace /full/path/to with the actual path):
elasticsearch {
hosts => localhost
index => "dades2"
manage_template => true
template => "/full/path/to/dades2_template.json"
template_name => "dades2"
template_overwrite => "true"
}
The new dades2_template.json file should look like this
{
"template": "dades2",
"order": 1,
"settings": {
"number_of_shards": 1
},
"mappings": {
"dades2": {
"_all": {
"enabled": false
},
"properties": {
"location": {
"type": "geo_point",
"geohash": true,
"geohash_prefix": true
}
}
}
}
}
