CSV to nested JSON using NiFi - apache-nifi

I want to create nested JSON from CSV using NiFi.
CSV file:
"Foo",12,"newyork","North avenue","123213"
"Foo1",12,"newyork","North avenue","123213"
"Foo2",12,"newyork","North avenue","123213"
Required JSON:
{
  "studentName":"Foo",
  "Age":"12",
  "address__city":"newyork",
  "address":{
    "address__address1":"North avenue",
    "address__zipcode":"123213"
  }
}
I am using the NiFi 1.4 ConvertRecord processor with the Avro schema below, but I am not able to get the nested JSON.
Avro schema:
{
  "type" : "record",
  "name" : "MyClass",
  "namespace" : "com.test.avro",
  "fields" : [ {
    "name" : "studentName",
    "type" : "string"
  }, {
    "name" : "Age",
    "type" : "string"
  }, {
    "name" : "address__city",
    "type" : "string"
  }, {
    "name" : "address",
    "type" : {
      "type" : "record",
      "name" : "address",
      "fields" : [ {
        "name" : "address__address1",
        "type" : "string"
      }, {
        "name" : "address__zipcode",
        "type" : "string"
      } ]
    }
  } ]
}

You will need to:
Split your flowfile into individual records using SplitRecord
Convert from CSV to flat JSON files using ConvertRecord
Use the JoltTransformJSON processor to transform your flat JSON into nested JSON objects in your desired format (a sketch of the spec follows).
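A minimal JOLT shift spec sketch for that last step, assuming the flat JSON keeps the double-underscore field names from the schema above (untested, so treat it as a starting point rather than the exact spec):
[
  {
    "operation": "shift",
    "spec": {
      "studentName": "studentName",
      "Age": "Age",
      "address__city": "address__city",
      "address__address1": "address.address__address1",
      "address__zipcode": "address.address__zipcode"
    }
  }
]
In JoltTransformJSON, set the Jolt Transformation DSL to Shift (or wrap the spec in a Chain) and paste the spec as the Jolt Specification.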

Related

Getting a timestamp exception when I try to update an unrelated field using painless in elasticsearch

I'm trying to run the following script:
POST /data_hip/_update/1638643727.0
{
  "script":{
    "source":"ctx._source.avgmer=4;"
  }
}
But I am getting the following error.
{
  "error" : {
    "root_cause" : [
      {
        "type" : "mapper_parsing_exception",
        "reason" : "failed to parse field [#timestamp] of type [date] in document with id '1638643727.0'. Preview of field's value: '1.638642742E12'"
      }
    ],
    "type" : "mapper_parsing_exception",
    "reason" : "failed to parse field [#timestamp] of type [date] in document with id '1638643727.0'. Preview of field's value: '1.638642742E12'",
    "caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "failed to parse date field [1.638642742E12] with format [epoch_millis]",
      "caused_by" : {
        "type" : "date_time_parse_exception",
        "reason" : "Failed to parse with all enclosed parsers"
      }
    }
  },
  "status" : 400
}
This is strange because on queries (not updates) the datetime is parsed fine.
The timestamp field mapping is as follows:
"#timestamp": {
"type":"date",
"format":"epoch_millis"
},
I am running Elasticsearch 7+.
EDIT:
Adding my index settings:
{
  "data_hip" : {
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "provided_name" : "data_hip",
        "creation_date" : "1638559533343",
        "number_of_replicas" : "1",
        "uuid" : "CHjkvSdhSgySLioCju9NqQ",
        "version" : {
          "created" : "7150199"
        }
      }
    }
  }
}
I'm not running an ingest pipeline.
The problem is the scientific notation, the 'E12' suffix, in a field that Elasticsearch expects to parse as plain epoch milliseconds.
Using this reprex:
PUT so_test
{
  "mappings": {
    "properties": {
      "ts": {
        "type": "date",
        "format": "epoch_millis"
      }
    }
  }
}
# this works
POST so_test/_doc/
{
  "ts" : "123456789"
}
# this does not, throws the same error you have IRL
POST so_test/_doc/
{
  "ts" : "123456789E12"
}
I'm not sure how/where those values are creeping in, but they are there in the document you are passing to ES.
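As a quick sanity check of the fix (an addition of mine, not part of the original answer): writing the same magnitude as a plain epoch-millis value parses fine, so whatever produces the document needs to emit the timestamp without scientific notation.
# this works: the same value written as a plain long
POST so_test/_doc/
{
  "ts" : "1638642742000"
}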

Adding Geo_shape to Elasticsearch using Logstash

I have a CSV file which contains geometries in WKT format. I was trying to ingest the geo_shape data from the CSV file. I created a mapping as given in the file "input_mapping.json":
{
  "mappings" : {
    "doc" : {
      "properties" : {
        "Lot" : {
          "type" : "long"
        },
        "Lot_plan" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "Parcel_Address_Line_1" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "Plan" : {
          "type" : "long"
        },
        "Tenure" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "WKT" : {
          "type" : "geo_shape"
        }
      }
    }
  }
}
WKT is my geo_shape field, and its values are WKT strings.
Below is the input CSV file which I am trying to insert using Logstash:
WKT,Lot_plan,Tenure,Parcel_Address_Line_1,Lot,Plan
"POLYGON ((148.41503356 -26.62829003,148.44798048 -26.62800857,148.45234634 -26.63457929,148.45507096 -26.64778132,148.41735984 -26.64808729,148.41514107 -26.64091476,148.41503356 -26.62829003))",21MM1,FH,MASSEY DOWNS,21,1
"POLYGON ((148.45507096 -26.64778132,148.45779641 -26.66098396,148.45859297 -26.66259081,148.45801376 -26.66410383,148.45989472 -26.67278979,148.42510081 -26.67310328,148.42434355 -26.67065659,148.41735984 -26.64808729,148.45507096 -26.64778132))",21MM2,FH,,21,2
"POLYGON ((148.39514404 -26.68791317,148.37228669 -26.68894235,148.37188338 -26.68895271,148.37092744 -26.68897445,148.37051869 -26.68898023,148.36312088 -26.68908468,148.36261958 -26.66909425,148.39598678 -26.66869309,148.39584372 -26.66934742,148.39583604 -26.66968184,148.39590526 -26.67007957,148.39598629 -26.67039933,148.39614586 -26.67085156,148.39625052 -26.67085085,148.42434355 -26.67065659,148.42510081 -26.67310328,148.42537156 -26.67397795,148.42549108 -26.68541445,148.41781484 -26.68547248,148.39988482 -26.68562107,148.39966009 -26.68562292,148.39704234 -26.68564442,148.39514404 -26.68791317))",21MM3,LL,DERWENT PARK,21,3
And my Logstash conf file is:
input {
  file {
    path => "D:/input.csv"
    start_position => "beginning"
    sincedb_path => "D:/sample.text"
  }
}
filter {
  csv {
    separator => ","
    columns => ["WKT","Lot_plan","Tenure","Parcel_Address_Line_1","Lot","Plan"]
    skip_header => true
    skip_empty_columns => true
    convert => {
      "Lot" => "integer"
      "Plan" => "integer"
    }
    remove_field => [ "_source","message","host","path","@version","@timestamp" ]
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9701"
    index => "input_mapping"
    template => "D:/input_mapping.json"
    template_name => "input_mapping"
    manage_template => true
  }
}
For some reason it is not getting ingested into Elasticsearch. I am using Elasticsearch version 6.5.4 and Logstash version 6.5.4.
Kindly let me know if I have missed anything.
I realized there will be many other developers looking for a problem similar to the one I faced. Later on, I checked GDAL (ogr2ogr), which provides Elasticsearch ingestion. I also use PostgreSQL to ingest the CSV file, so the ogr2ogr tool helped me via the steps below:
First, ingest the CSV file into PostgreSQL, putting WKT as a text column in a table.
Create another column in the table and update it with the ST_GeomFromText function:
update TableName set WKT_GEOM=ST_GeomFromText("WKT",4632)
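For completeness, the WKT_GEOM column has to exist before that UPDATE runs; a sketch of adding it (the plain geometry column type is my assumption):
ALTER TABLE TableName ADD COLUMN WKT_GEOM geometry;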
(Note: I had already installed PostGIS in PostgreSQL.)
Now start Elasticsearch.
Use ogr2ogr by following the examples provided in the GDAL documentation:
a. First create the Elasticsearch mapping using ogr2ogr.
b. Then ingest the data from PostgreSQL into Elasticsearch.
https://gdal.org/drivers/vector/elasticsearch.html
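A command-line sketch of steps a/b based on that driver page (the driver name, connection strings, table and index names here are assumptions to verify against your GDAL version):
ogr2ogr -f "Elasticsearch" http://localhost:9200 PG:"host=localhost dbname=mydb user=postgres" -nln input_mapping TableName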
In this way, I was able to perform geo queries in Elasticsearch. But unfortunately it was without Logstash. :(
Please comment if you have any doubts.

How do I flatten nested Avro records in a Pig query?

The Avro schema looks like this:
{
  "type" : "record",
  "name" : "name1",
  "fields" :
  [
    {
      "name" : "f1",
      "type" : "string"
    },
    {
      "name" : "f2",
      "type" :
      {
        "type" : "array",
        "items" :
        {
          "type" : "record",
          "name" : "name2",
          "fields" :
          [
            {
              "name" : "time",
              "type" : [ "float", "int", "double", "long" ]
            }
          ]
        }
      }
    }
  ]
}
After reading it in Pig:
grunt> A = load 'data' using AvroStorage();
grunt> DESCRIBE A;
A: {f1: chararray,f2: {ARRAY_ELEM: (time: (FLOAT: float,INT: int,DOUBLE: double,LONG: long))}}
What I want is probably a bag of (f1:chararray, timestamp:double). This is what I did:
grunt> B = FOREACH A GENERATE f1, f2.time AS timestamp;
grunt> DESCRIBE B;
B: {f1: chararray,timestamp: {(time: (FLOAT: float,INT: int,DOUBLE: double,LONG: long))}}
So how do I flatten this record?
I'm new to Pig and Avro, and I don't know whether what I'm trying to do even makes sense. Thanks for your help.
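A sketch of one way it might be flattened (my assumption, not a verified answer): FLATTEN un-nests the bag that f2.time produces, giving one output row per element of the f2 array; the union-typed time still arrives as a tuple of its FLOAT/INT/DOUBLE/LONG branches, from which you would then project whichever branch is actually populated.
-- one row per array element; the time column remains a (FLOAT, INT, DOUBLE, LONG) tuple
B = FOREACH A GENERATE f1, FLATTEN(f2.time);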

How to add default values while adding a new field in existing mapping in elasticsearch

This is my existing mapping in Elasticsearch for one of the child documents:
sessions" : {
"_routing" : {
"required" : true
},
"properties" : {
"operatingSystem" : {
"index" : "not_analyzed",
"type" : "string"
},
"eventDate" : {
"format" : "dateOptionalTime",
"type" : "date"
},
"durations" : {
"type" : "integer"
},
"manufacturer" : {
"index" : "not_analyzed",
"type" : "string"
},
"deviceModel" : {
"index" : "not_analyzed",
"type" : "string"
},
"applicationId" : {
"type" : "integer"
},
"deviceId" : {
"type" : "string"
}
},
"_parent" : {
"type" : "userinfo"
}
}
In the above mapping, the "durations" field is an integer array. I need to update the existing mapping by adding a new field called "durationCount" whose default value should be the size of the durations array.
PUT sessions/_mapping
{
  "properties" : {
    "durationCount" : {
      "type" : "integer"
    }
  }
}
Using the above request I am able to update the existing mapping, but I am not able to figure out how to assign a value (which would vary for each session document, i.e. the size of its durations array) while updating the mapping. Any ideas?
Well, 2 recommendations here:
Instead of adding a default value, you can handle it at query time using a missing filter. Let's say you want to search based on a match query: instead of just the match query, use a bool query with a should clause containing the match and a missing filter, inside a filtered query. This way, documents which do not have the field are also accounted for.
If you absolutely need the value in that field for existing documents, you need to reindex the whole set of documents, or use the out-of-the-box plugin, update by query (a sketch follows).
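A sketch of that second option on a recent Elasticsearch, where _update_by_query is built in (the index and field names follow the question; this is my assumption, not the original plugin's syntax):
POST sessions/_update_by_query
{
  "script": {
    "source": "ctx._source.durationCount = ctx._source.durations == null ? 0 : ctx._source.durations.size()"
  }
}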

How to search on a URL exactly in ElasticSearch / Kibana

I have imported an IIS log file, and the data has moved through Logstash (1.4.2) into Elasticsearch (1.3.1) and is then displayed in Kibana.
My filter section is as follows:
filter {
  grok {
    match =>
      ["message" , "%{TIMESTAMP_ISO8601:iisTimestamp} %{IP:serverIP} %{WORD:method} %{URIPATH:uri} - %{NUMBER:port} - %{IP:clientIP} - %{NUMBER:status} %{NUMBER:subStatus} %{NUMBER:win32Status} %{NUMBER:timeTaken}"]
  }
}
When using a Terms panel in Kibana with "uri" (one of my captured fields from Logstash), it matches the tokens within the URI. Therefore it matches items like:
'Scripts'
'/'
'EN
Q: How do I display the 'Top URLs' in their full form?
Q: How do I inform Elasticsearch that the field should be 'not_analyzed'? I don't mind having 2 fields, for example:
uri - The tokenized URI
uri.raw - the fully formed URL.
Can this be done on the Logstash side, or is this a mapping that needs to be set up in Elasticsearch?
The mapping is as follows:
//http://localhost:9200/iislog-2014.10.09/_mapping?pretty
{
  "iislog-2014.10.09" : {
    "mappings" : {
      "iislogs" : {
        "properties" : {
          "@timestamp" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "@version" : {
            "type" : "string"
          },
          "clientIP" : {
            "type" : "string"
          },
          "device" : {
            "type" : "string"
          },
          "host" : {
            "type" : "string"
          },
          "id" : {
            "type" : "string"
          },
          "iisTimestamp" : {
            "type" : "string"
          },
          "logFilePath" : {
            "type" : "string"
          },
          "message" : {
            "type" : "string"
          },
          "method" : {
            "type" : "string"
          },
          "name" : {
            "type" : "string"
          },
          "os" : {
            "type" : "string"
          },
          "os_name" : {
            "type" : "string"
          },
          "port" : {
            "type" : "string"
          },
          "serverIP" : {
            "type" : "string"
          },
          "status" : {
            "type" : "string"
          },
          "subStatus" : {
            "type" : "string"
          },
          "tags" : {
            "type" : "string"
          },
          "timeTaken" : {
            "type" : "string"
          },
          "type" : {
            "type" : "string"
          },
          "uri" : {
            "type" : "string"
          },
          "win32Status" : {
            "type" : "string"
          }
        }
      }
    }
  }
}
In your Elasticsearch mapping:
"uri": {
  "type": "string",
  "index": "not_analyzed"
}
The problem is that the iislog- prefix does not match the logstash- index pattern, and hence did not pick up the default template:
My index format was iislog-YYYY.MM.dd; this did not use the out-of-the-box mappings provided by Logstash. When using the logstash- index format, Logstash creates a pair of fields for each string. For example, uri becomes:
uri (appears in Kibana)
uri.raw (does not appear in Kibana)
Note that the uri.raw will not appear in Kibana - but it is queryable.
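For example, a sketch of pulling the top URLs in full from the raw sub-field with a terms aggregation (the index name and aggregation name are assumptions):
GET /logstash-2014.10.09/_search
{
  "size": 0,
  "aggs": {
    "top_urls": {
      "terms": { "field": "uri.raw" }
    }
  }
}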
So the solution, rather than using an alternative index format, is to:
Don't bother! Use the default index format of logstash-%{+YYYY.MM.dd}
Add a "type" to the file input to help you filter the correct logs in Kibana (whilst using the logstash- index format)
input {
  file {
    type => "iislog"
    ....
  }
}
Apply filtering in Kibana based on the type.
OR
If you really, really do want a different index format:
Copy the default template file to a new file, say iislog-template.json
Reference the template file in the elasticsearch output like this:
output {
elasticsearch_http {
host => localhost
template_name => "iislog-template.json"
template => "<path to template>"
index => "iislog-%{+YYYY.MM.dd}"
}
}
