NiFi Convert JSON to CSV via JsonPathReader or JsonTreeReader - apache-nifi

I am trying to convert a JSON file into CSV, but I don't seem to have any luck doing so. My JSON looks something like this:
...
{
{"meta": {
"contentType": "Response"
},
"content": {
"data": {
"_type": "ObjectList",
"erpDataObjects": [
{
"meta": {
"lastModified": "2020-08-10T08:37:21.000+0000",
},
"head": {
"fields": {
"number": {
"value": "1",
},
"id": {
"value": "10000"
},
}
}
{
"meta": {
"lastModified": "2020-08-10T08:37:21.000+0000",
},
"head": {
"fields": {
"number": {
"value": "2",
},
"id": {
"value": "10001"
},
}
}
{
"meta": {
"lastModified": "2020-08-10T08:37:21.000+0000",
},
"head": {
.. much more data
I basically want my csv to look like this:
number,id
1,10000
2,10001
My flow looks like this:
GetFile -> Set the output-file name -> ConvertRecord -> UpdateAttribute -> PutFile
ConvertRecord uses a JsonTreeReader and a CSVRecordSetWriter:
[Screenshots: JsonTreeReader and CsvRecordSetWriter configuration]
They both reference an AvroSchemaRegistry, which looks like this:
[Screenshot: AvroSchemaRegistry configuration]
The AvroSchema itself looks like this:
{
"type": "record",
"name": "head",
"fields":
[
{"name": "number", "type": ["string"]},
{"name": "id", "type": ["string"]}
]
}
But I only get this output:
number,id
,
Which makes sense, because I'm not specifying where those values are located. Before that I used the JsonPathReader instead, configured like this:
[Screenshot: JsonPathReader configuration]
That obviously only gave me one record. I'm not sure how to configure either of the two readers to output exactly what I want. Help would be much appreciated!

Using ConvertRecord for JSON -> CSV is mostly intended for "flat" JSON files, where each field in the object becomes a column in the outgoing CSV file. For nested or complex structures, consider JoltTransformRecord, which allows more complex transformations. Your example doesn't appear to be valid JSON as-is, but assuming you have something like this as input:
{
"meta": {
"contentType": "Response"
},
"content": {
"data": {
"_type": "ObjectList",
"erpDataObjects": [
{
"meta": {
"lastModified": "2020-08-10T08:37:21.000+0000"
},
"head": {
"fields": {
"number": {
"value": "1"
},
"id": {
"value": "10000"
}
}
}
},
{
"meta": {
"lastModified": "2020-08-10T08:37:21.000+0000"
},
"head": {
"fields": {
"number": {
"value": "2"
},
"id": {
"value": "10001"
}
}
}
}
]
}
}
}
The following JOLT spec should give you what you want for output:
[
{
"operation": "shift",
"spec": {
"content": {
"data": {
"erpDataObjects": {
"*": {
"head": {
"fields": {
"number": {
"value": "[&4].number"
},
"id": {
"value": "[&4].id"
}
}
}
}
}
}
}
}
}
]
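As a cross-check on what the shift spec is meant to produce, here is a minimal Python sketch of the same flattening done outside NiFi. It is not part of the flow: the names flatten_erp_objects and rows_to_csv are made up for illustration, and the sample input is a trimmed copy of the cleaned-up JSON above.

```python
import csv
import io
import json

# Trimmed copy of the cleaned-up sample input (only the keys used below).
SAMPLE = json.loads("""
{
  "content": {
    "data": {
      "erpDataObjects": [
        {"head": {"fields": {"number": {"value": "1"}, "id": {"value": "10000"}}}},
        {"head": {"fields": {"number": {"value": "2"}, "id": {"value": "10001"}}}}
      ]
    }
  }
}
""")


def flatten_erp_objects(doc):
    """Return one flat {"number", "id"} row per element of erpDataObjects.

    This mirrors the intent of "[&4].number" and "[&4].id" in the shift spec:
    the array index of each object selects the output row.
    """
    rows = []
    for obj in doc["content"]["data"]["erpDataObjects"]:
        fields = obj["head"]["fields"]
        rows.append({
            "number": fields["number"]["value"],
            "id": fields["id"]["value"],
        })
    return rows


def rows_to_csv(rows):
    """Serialize the flat rows as CSV with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["number", "id"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(flatten_erp_objects(SAMPLE)), end="")
```

The flat rows are the shape that a record writer with the number/id schema can map directly onto CSV columns.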

How in strapi graphql I can pull records from a given month

Hi, I would like to fetch from GraphQL only those records whose date falls in a given month, for example August.
To pull another month, it should be enough to change the value in the query. At the moment, my query returns records from all months instead of only the month given in the filter.
schema.json
{
"kind": "collectionType",
"collectionName": "product_popularities",
"info": {
"singularName": "product-popularity",
"pluralName": "product-popularities",
"displayName": "Popularity",
"description": ""
},
"options": {
"draftAndPublish": true
},
"pluginOptions": {},
"attributes": {
"podcast": {
"type": "relation",
"relation": "manyToOne",
"target": "api::product.products",
"inversedBy": "products"
},
"value": {
"type": "integer"
},
"date": {
"type": "date"
}
}
}
My query
query {
Popularities(filters: {date: {contains: [2022-08]}}) {
data {
attributes {
date
value
}
}
}
}
Response
{
"data": {
"Popularities": {
"data": [
{
"attributes": {
"date": "2022-08-03",
"value": 50
}
},
{
"attributes": {
"date": "2022-08-04",
"value": 1
}
},
{
"attributes": {
"date": "2022-08-10",
"value": 100
}
},
{
"attributes": {
"date": "2022-07-06",
"value": 20
}
}
]
}
}
}
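One possible direction, sketched in Python under stated assumptions: express the month as a date range and pass its bounds as variables. The root field Popularities and the attribute names come from the question; the gte/lte filter operators and the Date scalar are assumptions about the Strapi GraphQL schema and should be checked against the generated schema. The helper names are hypothetical.

```python
import calendar
from datetime import date

# GraphQL document. "Popularities", "date" and "value" come from the question;
# the gte/lte operators and the Date scalar are assumptions to verify.
QUERY = """
query PopularitiesInMonth($start: Date, $end: Date) {
  Popularities(filters: { date: { gte: $start, lte: $end } }) {
    data {
      attributes {
        date
        value
      }
    }
  }
}
"""


def month_bounds(year, month):
    """Return the first and last day of the month as ISO date strings."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1).isoformat(), date(year, month, last_day).isoformat()


def build_request(year, month):
    """Build the JSON payload for a GraphQL POST; only the variables change per month."""
    start, end = month_bounds(year, month)
    return {"query": QUERY, "variables": {"start": start, "end": end}}


def in_month(iso_date, year, month):
    """Client-side equivalent of the range filter (ISO dates sort as strings)."""
    start, end = month_bounds(year, month)
    return start <= iso_date <= end
```

With the dates from the response above, in_month keeps 2022-08-03, 2022-08-04 and 2022-08-10 for August 2022 and drops 2022-07-06.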

Elastic Search Wildcard query with space failing 7.11

My data is indexed in Elasticsearch 7.11. This is the mapping I got when I added documents directly to my index:
{"properties":{"name":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}}
I haven't added the keyword part and have no idea where it came from.
I am running a wildcard query on this field, but I am unable to get results for search terms containing spaces.
{
"query": {
"bool":{
"should":[
{"wildcard": {"name":"*hello world*"}}
]
}
}
}
I have seen many answers related to not_analyzed, and I have tried setting {"index":"true"} in the mapping, but it did not help. How can I make the wildcard search work in this version of Elasticsearch?
I tried changing the field to the wildcard type:
PUT http://localhost:9001/indexname/_mapping
{
"properties": {
"name": {
"type" :"wildcard"
}
}
}
And got the following response:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "mapper [name] cannot be changed from type [text] to [wildcard]"
}
],
"type": "illegal_argument_exception",
"reason": "mapper [name] cannot be changed from type [text] to [wildcard]"
},
"status": 400
}
Adding a sample document to match
{
"_index": "accelerators",
"_type": "_doc",
"_id": "602ec047a70f7f30bcf75dec",
"_score": 1.0,
"_source": {
"acc_id": "602ec047a70f7f30bcf75dec",
"name": "hello world example",
"type": "Accelerator",
"description": "khdkhfk ldsjl klsdkl",
"teamMembers": [
{
"userId": "karthik.r#gmail.com",
"name": "Karthik Ganesh R",
"shortName": "KR",
"isOwner": true
},
{
"userId": "anand.sajan#gmail.com",
"name": "Anand Sajan",
"shortName": "AS",
"isOwner": false
}
],
"sectorObj": [
{
"item_id": 14,
"item_text": "Cross-sector"
}
],
"geographyObj": [
{
"item_id": 4,
"item_text": "Global"
}
],
"technologyObj": [
{
"item_id": 1,
"item_text": "Artificial Intelligence"
}
],
"themeColor": 1,
"mainImage": "assets/images/Graphics/Asset 35.svg",
"features": [
{
"name": "Ideation",
"icon": "Asset 1007.svg"
},
{
"name": "Innovation",
"icon": "Asset 1044.svg"
},
{
"name": "Strategy",
"icon": "Asset 1129.svg"
},
{
"name": "Intuitive",
"icon": "Asset 964.svg"
}
],
"logo": {
"actualFileName": "",
"fileExtension": "",
"fileName": "",
"fileSize": 0,
"fileUrl": ""
},
"customLogo": {
"logoColor": "#B9241C",
"logoText": "EC",
"logoTextColor": "#F6F6FA"
},
"collaborators": [
{
"userId": "muhammed.arif#gmail.com",
"name": "muhammed Arif P T",
"shortName": "MA"
},
{
"userId": "anand.sajan#gmail.com",
"name": "Anand Sajan",
"shortName": "AS"
}
],
"created_date": "2021-02-18T19:30:15.238000Z",
"modified_date": "2021-03-11T11:45:49.583000Z"
}
}
You cannot modify a field mapping once created. However, you can create another sub-field of type wildcard, like this:
PUT http://localhost:9001/indexname/_mapping
{
"properties": {
"name": {
"type": "text",
"fields": {
"wildcard": {
"type" :"wildcard"
},
"keyword": {
"type" :"keyword",
"ignore_above":256
}
}
}
}
}
Once the mapping is updated, you need to reindex your existing documents so that the new sub-field gets populated, like this:
POST http://localhost:9001/indexname/_update_by_query
And then when this finishes, you'll be able to query on this new field like this:
{
"query": {
"bool": {
"should": [
{
"wildcard": {
"name.wildcard": "*hello world*"
}
}
]
}
}
}
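To illustrate why the sub-field helps, here is a rough Python simulation, not Elasticsearch code. A wildcard query on a text field is matched against the individual analyzed terms, none of which contains a space, whereas a wildcard (or keyword) field matches the pattern against the whole original value. The analyze function is a crude stand-in for the standard analyzer, and fnmatchcase stands in for the wildcard matcher; both are simplifications.

```python
import re
from fnmatch import fnmatchcase


def analyze(text):
    """Crude stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def wildcard_on_text_field(value, pattern):
    """A text field is searched term by term, so the pattern sees one token at a time."""
    return any(fnmatchcase(tok, pattern) for tok in analyze(value))


def wildcard_on_whole_value(value, pattern):
    """A wildcard or keyword field keeps the value intact, spaces included."""
    return fnmatchcase(value, pattern)


if __name__ == "__main__":
    name = "hello world example"
    print(analyze(name))
    print(wildcard_on_text_field(name, "*hello world*"))
    print(wildcard_on_whole_value(name, "*hello world*"))
```

In this simulation the pattern with a space fails against the tokens of the text field but matches the intact value, which is the behaviour the name.wildcard sub-field is meant to provide.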

Term aggregation on ElasticSearch join

I would like to perform an aggregation on a join relation using Elasticsearch 7.7.
I need to know how many children each parent has.
The only way I have found to solve this is to use a script inside a terms aggregation, but I am concerned about performance.
/my_index/_search
{
"size": 0,
"aggs": {
"total": {
"terms": {
"script": {
"lang": "painless",
"source": "params['_source']['my_join']['parent']"
}
}
},
"max_total": {
"max_bucket": {
"buckets_path": "total>_count"
}
}
}
}
Does anyone know a faster way to execute this aggregation that avoids the script?
If the join field weren't a parent/child relation, I could replace the terms aggregation with:
"terms": { "field": "my_field" }
To give more context, here is the mapping along with some sample documents (I'm using Elasticsearch 7.7):
{
"mappings": {
"properties": {
"my_join": {
"relations": {
"other": "doc"
},
"type": "join"
},
"reader": {
"type": "keyword"
},
"name": {
"type": "text"
},
"content": {
"type": "text"
}
}
}
}
PUT example/_doc/1
{
"reader": [
"A",
"B"
],
"my_join": {
"name": "other"
}
}
PUT example/_doc/2
{
"reader": [
"A",
"B"
],
"my_join": {
"name": "other"
}
}
PUT example/_doc/3
{
"content": "abc",
"my_join": {
"name": "doc",
"parent": 1
}
}
PUT example/_doc/4
{
"content": "def",
"my_join": {
"name": "doc",
"parent": 2
}
}
PUT example/_doc/5
{
"content": "def",
"my_join": {
"name": "doc",
"parent": 1
}
}
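A possible script-free variant, sketched in Python as a request-body builder. According to the parent-join documentation, a join field also creates an internal field named after the join field, a "#", and the parent relation (here my_join#other) that holds the parent id and can be used in aggregations. Restricting the search to child documents with a term query on the join field is an assumption to verify on a test index, since parent documents are expected to carry their own id in that field as well. The helper names are hypothetical, and count_children only shows locally which buckets to expect.

```python
from collections import Counter


def children_per_parent_request(join_field="my_join",
                                parent_relation="other",
                                child_relation="doc"):
    """Build a search body that buckets child documents by parent id without a script."""
    return {
        "size": 0,
        # Assumption: a term query on the join field matches the relation name,
        # which keeps parent documents out of the buckets.
        "query": {"term": {join_field: child_relation}},
        "aggs": {
            "total": {"terms": {"field": f"{join_field}#{parent_relation}"}},
            "max_total": {"max_bucket": {"buckets_path": "total>_count"}},
        },
    }


def count_children(docs, join_field="my_join"):
    """Local reference count: number of child documents per parent id."""
    counts = Counter()
    for doc in docs:
        join = doc.get(join_field, {})
        if "parent" in join:
            counts[str(join["parent"])] += 1
    return dict(counts)


# The sample documents from the question, reduced to their join field.
SAMPLE_DOCS = [
    {"my_join": {"name": "other"}},
    {"my_join": {"name": "other"}},
    {"my_join": {"name": "doc", "parent": 1}},
    {"my_join": {"name": "doc", "parent": 2}},
    {"my_join": {"name": "doc", "parent": 1}},
]
```

For the sample documents, parent 1 should end up with two children and parent 2 with one, so max_total would be 2.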

Return only elements of an array in an object that contain a certain value

I've got the following document in an elastic search index:
{
"type": "foo",
"components": [{
"id": "1234123",
"data_collections": [{
"date_time": "2020-03-02T08:14:48+00:00",
"group": "1",
"group_description": "group1",
"measures": [{
"measure_name": "MEASURE_1",
"actual": "23.34"
}, {
"measure_name": "MEASURE_2",
"actual": "5"
}, {
"measure_name": "MEASURE_3",
"actual": "string_message"
}, {
"measure_name": "MEASURE_4",
"actual": "another_string"
}
]
},
{
"date_time": "2020-03-03T08:14:48+00:00",
"group": "2",
"group_description": "group2",
"measures": [{
"measure_name": "MEASURE_1",
"actual": "23.34"
}, {
"measure_name": "MEASURE_4",
"actual": "foo"
}, {
"measure_name": "MEASURE_5",
"actual": "bar"
}, {
"measure_name": "MEASURE_6",
"actual": "4"
}
]
}
]
}
]
}
Now I'm trying to figure out a mapping and a query for this document so that the result contains only the groups and measure_names I am interested in. So far I'm able to query, but I always retrieve the whole document, which is not feasible since the array of measures can be quite large and most of the time I only want a small subset.
For example, I search for documents with "group": "1" and "measure_name": "MEASURE_1", and the result I'd like to achieve looks like this:
{
"_id": "oiqwueou8931283u12",
"_source": {
"type": "foo",
"components": [{
"id": "1234123",
"data_collections": [{
"date_time": "2020-03-02T08:14:48+00:00",
"group": "1",
"group_description": "group1",
"measures": [{
"measure_name": "MEASURE_1",
"actual": "23.34"
}
]
}
]
}
]
}
}
I think what comes closest to what I am looking for is the _source parameter, but as far as I know there is no way to filter on values like {"measure_name": {"value": "MEASURE_1"}}.
Thanks.
The simplest mapping that comes to mind is
PUT timo
{
"mappings": {
"properties": {
"components": {
"type": "nested",
"properties": {
"data_collections": {
"type": "nested",
"properties": {
"measures": {
"type": "nested"
}
}
}
}
}
}
}
}
and the search query would be
GET timo/_search
{
"_source": ["inner_hits", "type", "components.id"],
"query": {
"bool": {
"must": [
{
"nested": {
"path": "components.data_collections",
"query": {
"term": {
"components.data_collections.group.keyword": {
"value": "1"
}
}
},
"inner_hits": {}
}
},
{
"nested": {
"path": "components.data_collections.measures",
"query": {
"term": {
"components.data_collections.measures.measure_name.keyword": {
"value": "MEASURE_1"
}
}
},
"inner_hits": {}
}
}
]
}
}
}
Notice the inner_hits param under each subquery, and that the _source param is limited so that we don't return the whole hit but only the subgroups that matched. type and components.id cannot be "seen" in the nested fields, so they are included explicitly.
The response then carries the matching nested sub-documents under each hit's inner_hits section.
You now have precisely the attributes you need, so a bit of post-processing will get you the desired format!
I'm not familiar w/ a cleaner way of doing this but if any of y'all do, I'd be glad to learn it.
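The post-processing could look roughly like the following Python sketch. It assumes each hit has the usual inner_hits layout (keyed by the nested path, with a _nested offset pointing at the parent component) and that the inner hit's _source still contains the full measures array, which is then filtered client-side. SAMPLE_HIT is hand-written for illustration and trimmed to the keys the code reads; reshape_hit is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-written, trimmed example of a single search hit (an assumption, not real output).
SAMPLE_HIT = {
    "_id": "oiqwueou8931283u12",
    "_source": {"type": "foo", "components": [{"id": "1234123"}]},
    "inner_hits": {
        "components.data_collections": {
            "hits": {
                "hits": [
                    {
                        "_nested": {
                            "field": "components",
                            "offset": 0,
                            "_nested": {"field": "data_collections", "offset": 0},
                        },
                        "_source": {
                            "date_time": "2020-03-02T08:14:48+00:00",
                            "group": "1",
                            "group_description": "group1",
                            "measures": [
                                {"measure_name": "MEASURE_1", "actual": "23.34"},
                                {"measure_name": "MEASURE_2", "actual": "5"},
                            ],
                        },
                    }
                ]
            }
        }
    },
}


def reshape_hit(hit, measure_name):
    """Rebuild the desired document shape from one hit and its inner hits."""
    by_component = defaultdict(list)
    inner = hit["inner_hits"]["components.data_collections"]["hits"]["hits"]
    for inner_hit in inner:
        component_index = inner_hit["_nested"]["offset"]
        collection = dict(inner_hit["_source"])
        # Keep only the measures with the requested name.
        collection["measures"] = [
            m for m in collection.get("measures", [])
            if m.get("measure_name") == measure_name
        ]
        by_component[component_index].append(collection)

    components = []
    for index, component in enumerate(hit["_source"].get("components", [])):
        if index in by_component:
            components.append({**component, "data_collections": by_component[index]})

    return {
        "_id": hit["_id"],
        "_source": {"type": hit["_source"].get("type"), "components": components},
    }
```

Calling reshape_hit(SAMPLE_HIT, "MEASURE_1") yields a document with one component, one data collection for group 1, and only the MEASURE_1 entry.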

Elasticsearch - How to get filtered response without harming the links of each entity/field

Assuming I have a mapping structure like the following:
{
"mappings": {
"users": {
"properties": {
"user": {
"type": "nested"
}
}
}
}
}
and I have indexed the following
users/52
{
"user": [
{
"id": 52,
"first": "John",
"last": "Smith",
"age": 21,
"school": {
"name": "STC",
"location": "Mt LV",
"District": "Western"
}
}
]
}
users/57
{
"user": [
{
"id": 57,
"first": "Alice",
"last": "White",
"age": 25,
"school": {
"name": "HFC",
"location": "DEH WLA",
"District": "Western"
}
}
]
}
What if I want to get only certain fields by id, without breaking the structure that links them to each other?
For example,
if id == 57
then the returned structure should consist only of "first", "age", "school.name" and "school.District":
{
"user": [
{
"first": "Alice",
"age": 25,
"school": {
"name": "HFC",
"District": "Western"
}
}
]
}
How should you write a query for this sort of response in Elasticsearch?
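Use response filtering (the filter_path parameter) in Elasticsearch. For your scenario, a GET request would look something like GET /_search?user=57&filter_path=first,age,school.name,school.District. Note that filter_path is evaluated against the whole response body, so the field paths may need their full prefix (for example hits.hits._source.user.first).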
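To make the requested result shape concrete, here is a small Python sketch of such a projection applied to an already fetched document. It is plain Python, not an Elasticsearch API: the function project is a hypothetical helper that keeps only the given dotted paths while preserving nesting and arrays, which is the kind of trimming that response filtering or _source filtering performs server-side.

```python
def project(value, paths):
    """Keep only the given dotted paths, preserving nesting and lists."""
    if isinstance(value, list):
        # Apply the same projection to every array element.
        return [project(item, paths) for item in value]
    if not isinstance(value, dict):
        return value

    # Group the remaining path tails by their first segment.
    grouped = {}
    for path in paths:
        head, _, tail = path.partition(".")
        grouped.setdefault(head, []).append(tail)

    result = {}
    for key, tails in grouped.items():
        if key not in value:
            continue
        if "" in tails:
            # The path ends at this key: keep the whole value.
            result[key] = value[key]
        else:
            result[key] = project(value[key], tails)
    return result


# The users/57 document from the question.
DOC_57 = {
    "user": [
        {
            "id": 57,
            "first": "Alice",
            "last": "White",
            "age": 25,
            "school": {"name": "HFC", "location": "DEH WLA", "District": "Western"},
        }
    ]
}

WANTED = ["user.first", "user.age", "user.school.name", "user.school.District"]
```

project(DOC_57, WANTED) returns the user array with only first, age, and the school's name and District, matching the structure shown in the question.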
