How can I two-phase split a large JSON file on NiFi - apache-nifi

I'm using NiFi to retrieve data and push it to Kafka. I'm currently in a test phase and I'm using a large JSON file.
My JSON file contains 500K records.
At the moment I have a GetFile processor to fetch the file and a SplitJson processor.
JsonPath Expression : $..posts.*
This configuration works with small files containing 50K records, but it crashes with large files.
My JSON file looks like this, with the 500K records in "posts":[]
{
    "meta": {
        "requestid": "request1000",
        "http_code": 200,
        "network": "twitter",
        "query_type": "realtime",
        "limit": 10,
        "page": 0
    },
    "posts": [
        {
            "network": "twitter",
            "posted": "posted1",
            "postid": "id1",
            "text": "text1",
            "lang": "lang1",
            "type": "type1",
            "sentiment": "sentiment1",
            "url": "url1"
        },
        {
            "network": "twitter",
            "posted": "posted2",
            "postid": "id2",
            "text": "text2",
            "lang": "lang2",
            "type": "type2",
            "sentiment": "sentiment2",
            "url": "url2"
        }
    ]
}
I read some documentation about this problem, but the topics cover text files, and the answers suggest chaining several SplitText processors to split the file progressively. With a rigid structure like my JSON, I don't understand how I can do that.
I'm looking for a solution that handles all 500K records reliably.

Unfortunately I think this case (large array inside a record) is not handled very well right now...
SplitJson requires the entire flow file to be read into memory, and it also doesn't have an outgoing split size. So this won't work.
SplitRecord generally would be the correct solution, but currently there are two JSON record readers - JsonTreeReader and JsonPathReader. Both of these stream records, but the issue here is that there is only one huge record, so each of them will read the entire document into memory.
There have been a couple of efforts around this specific problem, but unfortunately none of them have made it into a release.
This PR, which is now closed, added a new JSON record reader that could stream records starting from a JSON path, which in your case would be $.posts:
https://github.com/apache/nifi/pull/3222
With that reader you wouldn't even do a split; you would just send the flow file to PublishKafkaRecord_2_0 (or whichever version of PublishKafkaRecord is appropriate), and it would read each record and publish it to Kafka.
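To make that concrete, the publish side would then only need the record reader and the Kafka connection details configured. A rough sketch, assuming the streaming reader from that PR were available (the broker list and topic name below are placeholders; the other property names are the ones exposed by the released PublishKafkaRecord processors):
PublishKafkaRecord_2_0
    Kafka Brokers -> broker1:9092
    Topic Name    -> posts
    Record Reader -> the path-streaming JSON reader from the PR, configured with $.posts
    Record Writer -> JsonRecordSetWriter (serializes each record as it is published)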
There is also an open PR for a new SelectJson processor that looks like it could potentially help:
https://github.com/apache/nifi/pull/3455

Try using the SplitRecord processor in NiFi.
Define Record Reader/Writer controller services in the SplitRecord processor.
Then configure Records Per Split to 1 and use the splits relationship for further processing.
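A minimal configuration sketch, assuming a JsonTreeReader and a JsonRecordSetWriter controller service with default settings (use whatever reader/writer services match your data):
SplitRecord
    Record Reader     -> JsonTreeReader
    Record Writer     -> JsonRecordSetWriter
    Records Per Split -> 1
Each flow file routed to the splits relationship then contains a single record.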
(OR)
If you want to flatten and fork the record, then use the ForkRecord processor in NiFi.
For usage refer to this link.

I had the same issue with JSON and ended up writing a streaming parser.
Use the ExecuteGroovyScript processor with the following code.
It should split a large incoming file into small ones:
@Grab(group='acme.groovy', module='acmejson', version='20200120')
import groovyx.acme.json.AcmeJsonParser
import groovyx.acme.json.AcmeJsonOutput

def ff = session.get()
if (!ff) return

def objMeta = null
def count = 0

ff.read().withReader("UTF-8") { reader ->
    new AcmeJsonParser().withFilter {
        onValue('$.meta') {
            // just remember the meta object to reuse later
            objMeta = it
        }
        onValue('$.posts.[*]') { objPost ->
            def ffOut = ff.clone(false)  // clone without content
            ffOut.post_index = count     // add an attribute with the post index
            // write a small json containing the meta object and one post
            ffOut.write("UTF-8") { writer ->
                AcmeJsonOutput.writeJson([meta: objMeta, post: objPost], writer, true)
            }
            REL_SUCCESS << ffOut  // transfer to success
            count++
        }
    }.parse(reader)
}
ff.remove()
output file example:
{
    "meta": {
        "requestid": "request1000",
        "http_code": 200,
        "network": "twitter",
        "query_type": "realtime",
        "limit": 10,
        "page": 0
    },
    "post": {
        "network": "twitter",
        "posted": "posted11",
        "postid": "id11",
        "text": "text11",
        "lang": "lang11",
        "type": "type11",
        "sentiment": "sentiment11",
        "url": "url11"
    }
}

Related

Read data from CSV and create a JSON array in JMeter

I have one POST request and below is my body payload.
{
    "ABC": "ITBK1",
    "Code": "AH01001187",
    "ScheduleDate": "2021-09-02T22:59:00Z",
    "FilterType": 2,
    "FilterData": [
        "LoadTest92", "LoadTest93"
    ]
}
I'm passing the contractorId to FilterData as below.
{
    "ABC": "ITBK1",
    "Code": "AH01001187",
    "ScheduleDate": "${startTo}",
    "FilterType": 2,
    "FilterData": ["${contractorId}"]
}
But it is taking only one id at a time for this JSON. How can I send multiple values for the FilterData JSON array from CSV? Please help with this.
First of all, don't post code (including CSV file content) as an image.
As per CSV Data Set Config documentation:
By default, the file is only opened once, and each thread will use a different line from the file. However the order in which lines are passed to threads depends on the order in which they execute, which may vary between iterations. Lines are read at the start of each test iteration. The file name and mode are resolved in the first iteration.
So it means that you need to go to the next iteration in order to read the next line.
If you want to send all the values from column J as the "FilterData" you could do something like:
Add JSR223 PreProcessor as a child of the request you want to parameterize
Put the following code into "Script" area:
def lines = new File('/path/to/your/file.csv').readLines()  // note: this reads every line, including a header row if your CSV has one
def payload = []
lines.each { line ->
    def contractor = line.split(',')[9]  // column J is the 10th column, i.e. index 9
    payload.add(contractor as String)
}
vars.put('payload', new groovy.json.JsonBuilder(payload).toPrettyString())
That's it, use ${payload} instead of your ["${contractorId}"] variable in the HTTP request.
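For example, assuming the CSV has two data rows whose column J values are LoadTest92 and LoadTest93, ${payload} evaluates to a pretty-printed JSON array and the request body becomes:
{
    "ABC": "ITBK1",
    "Code": "AH01001187",
    "ScheduleDate": "${startTo}",
    "FilterType": 2,
    "FilterData": [
        "LoadTest92",
        "LoadTest93"
    ]
}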
More information:
JsonBuilder
JMeterVariables
Apache Groovy - Why and How You Should Use It

Trying to get a JSONPath array of all leaves of a JSON object

New to Go. My first project is to compare a NodeJS proxy and a Go proxy for account number tokenization. I have been doing NodeJS for a few years and am very comfortable with it. My proxies will not know the format of any request or response from the target servers, but they do have configurations coming from Redis/MongoDB that are similar to JSONPath expressions. These configurations can change things like the target server/path, query parameters, headers, request body and response body.
For NodeJS, I am using deepdash's paths function to get an array of all the leaves in a JSON object in JSONPath format. I am using this array and RegEx to find my matching paths that I need to process from any request or response body. So far, it looks like I will be using gjson for my JSONPath needs, but it does not have anything for the paths command I was using in deepdash.
Will I need to create a recursive function to build this JSONPath array myself, or does anyone know of a library that will produce something similar?
for example:
{
    "response": {
        "results": [
            {
                "acctNum": 1234,
                "someData": "something"
            },
            {
                "acctNum": 5678,
                "someData": "something2"
            }
        ]
    }
}
I will get an array back in the format:
[
"response.results[0].acctNum",
"response.results[0].someData",
"response.results[1].acctNum",
"response.results[1].someData"
]
and I can then use my filter of response.results[*].acctNum which translates to response\.results\[.*\]\.acctNum in Regex to get my acctNum fields.
From this filtered array, I should be able to use gjson to get the actual value, process it and then set the new value (I am using lodash in NodeJS)
There are a number of JSONPath implementations in Go, and I cannot really give a recommendation in this respect.
However, I think all you need is this basic path: $..*
It should return, in pretty much any implementation that is able to return paths instead of values:
[
"$['response']",
"$['response']['results']",
"$['response']['results'][0]",
"$['response']['results'][1]",
"$['response']['results'][0]['acctNum']",
"$['response']['results'][0]['someData']",
"$['response']['results'][1]['acctNum']",
"$['response']['results'][1]['someData']"
]
If I understand correctly, this should still work with your approach of filtering using RegEx.
Go JSONPath implementations:
https://github.com/PaesslerAG/jsonpath
https://github.com/bhmj/jsonslice
https://github.com/ohler55/ojg
https://github.com/oliveagle/jsonpath
https://github.com/spyzhov/ajson
https://github.com/vmware-labs/yaml-jsonpath

How to assert a JSON response which has results in random order every time in JMeter?

I am using JSON Assertion to assert if a JSON path exists. Suppose I have a JSON response of an array of 'rooms' that 'contains' an array of cabinets, just like the following example
"rooms":
[
{
"cabinets":
[
{
"id":"HFXXXX",
"locationid":null,
"name":"HFXXXX",
"type":"Hosp"
},
{
"id":"HFYYYY",
"locationid":null,
"name":"HFYYYY",
"type":"Hosp"
},
{
"id":"HFZZZZ",
"locationid":null,
"name":"HFZZZZ",
"type":"Hosp"
}
],
"hasMap":false,
"id":"2",
"map":
{
"h":null,
"w":null,
"x":null,
"y":null
},
"name":"Fantastic Room#3"
}
],
[
{ "cabinets":
[
{
"id":"HFBBBB",
"locationid":null,
"name":"HFBBBB",
"type":"Hosp"
}
],
"hasMap":false,
"id":"3",
"map":
{
"h":null,
"w":null,
"x":null,
"y":null
},
"name":"BallRoom #4"
}
]
I want to make sure that the 'id' values of all the cabinets are correct, therefore I define the JSON path as rooms[*].cabinets[*].id and expect the value to be ["HFXXXX","HFYYYY","HFZZZZ","HFBBBB"].
This works perfectly except that sometimes the values are returned in a different order, ["HFBBBB","HFXXXX","HFYYYY","HFZZZZ"] instead of ["HFXXXX","HFYYYY","HFZZZZ","HFBBBB"], hence the assertion will fail. The problem is with the order of the returned array and not the values themselves.
Is there a way to sort the order of a response before asserting and keep using the JSON Assertion? Or is the only way of doing this to extract the values I want to assert against and use them in a JSR223 Assertion (Groovy or JavaScript)?
If that is the case, can you show me an example of how I could do it with the JSR223 plugin.
I would recommend using a dedicated library, for instance JSONAssert; this way you will not have to reinvent the wheel and can compare 2 JSON objects in a single line of code.
Download jsonassert-x.x.x.jar and put it somewhere in the JMeter Classpath.
Download a suitable version of the JSON in Java library and put it in the JMeter Classpath as well. If you're uncertain regarding what the "JMeter Classpath" is, just drop the .jars into the "lib" folder of your JMeter installation.
Restart JMeter so it would be able to load the new libraries
Add JSR223 Assertion as a child of the request which returns the above JSON
Put the following code into "Script" area:
def expected = vars.get('expected')          // expected JSON stored in the ${expected} JMeter variable
def actual = prev.getResponseDataAsString()  // response body of the parent sampler
org.skyscreamer.jsonassert.JSONAssert.assertEquals(expected, actual, false)  // non-strict mode: element order doesn't matter
It will compare the response of the parent sampler with the contents of the ${expected} JMeter Variable; the order of elements, presence of new lines, and formatting do not matter, it compares only keys and values.
In case of mismatch you will have the error message stating that as the Assertion Result and the full debugging output will be available in STDOUT (console where you started JMeter from)
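If you would rather assert just the cabinet ids regardless of order, and without adding extra libraries, a minimal JSR223 Assertion sketch in Groovy along the following lines should also work (it assumes the response root is an object containing the rooms array from the question; the expected ids are hard-coded here purely for illustration):
import groovy.json.JsonSlurper

// parse the sampler response and collect all cabinet ids
def json = new JsonSlurper().parseText(prev.getResponseDataAsString())
def actualIds = json.rooms*.cabinets.flatten()*.id.sort()

// sort the expected ids as well so that element order no longer matters
def expectedIds = ['HFXXXX', 'HFYYYY', 'HFZZZZ', 'HFBBBB'].sort()

if (actualIds != expectedIds) {
    AssertionResult.setFailure(true)
    AssertionResult.setFailureMessage("Expected ${expectedIds} but got ${actualIds}".toString())
}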

Reducing duplication for JSON test input in RSpec

I'm working on an application that reads JSON content from files and uses them to produce output. I'm testing with RSpec, and my specs are littered with JSON literal content all over the place. There's a ton of duplication, the files are big and hard to read, and it's getting to the point where it's so painful to add new cases, it's discouraging me from covering the corner cases.
Is there a good strategy for me to reuse large sections of JSON in my specs? I'd like to store the JSON somewhere that's not in the spec file, so I can focus on the test logic in the specs, and just understand which example JSON I'm using.
I understand that if the tests are hard to write, I may need to refactor the application, but until I can get the time to do that, I need to cover these test cases.
Below is one modified example from the application. I have to load many different JSON formatted strings like this; many are considerably larger and more complex:
RSpec.describe DataGenerator do
  describe "#create_data" do
    let(:input) {
      '{ "schema": "TEST_SCHEMA",
         "tables": [
           { "name": "CASE_INFORMATION",
             "rows": 1,
             "columns": [
               { "name": "case_location_id", "type": "integer", "initial_value": "10000", "strategy": "next" },
               { "name": "id", "type": "integer", "delete_key": true, "initial_value": "10000", "strategy": "next" }
             ]
           }
         ]
       }'
    }

    it "generates the correct number of tables" do
      generator = DataGenerator.new(input)
      expect(generator.tables.size).to eq 1
    end
  end
end
We had the very same problem. We solved it by creating the following helper:
module JsonHelper
  def get_json(name)
    File.read(Rails.root.join 'spec', 'fixtures', 'json', "#{name}.json")
  end
end
We moved all the JSON into files in the spec/fixtures/json folder. Now you will be able to use it as:
include JsonHelper
let(:input){ get_json :create_data_input }
Naturally you can tweak it as much as you like/need. For example, we were stubbing external services' JSON responses, so we created a get_service_response(service_name, request_name, response_type) helper. It is much more readable now when we use get_service_response('cdl', 'reg_lookup', 'invalid_reg').
This assumes you put your JSON into spec/fixtures/json/create_data_input.json.

How to elegantly beautify topojson code?

Is there some tool to explore the tree structure of one-line topojson files? (beautify)
{"type":"Topology","transform":{"scale":[0.0015484881821515486,0.0010301030103010299],"translate":[-5.491666666666662,41.008333333333354]},"objects": {"level_0001m":{"type":"GeometryCollection","geometries":[{"type":"Polygon","arcs":[[0]],"properties":{"name":1}},{"type":"Polygon","arcs":[[1]],"properties":{"name":1}},{ ... }]},"level_0050m":{ ... }}}
Comments: My current method is to open the topojson .json in a text editor and manually look for clues while browsing. I end up summarizing the whole thing by hand and keeping a handy note, something like:
{
    "type": "Topology",
    "transform": {
        "scale": [0.0015484881821515486, 0.0010301030103010299],
        "translate": [-5.491666666666662, 41.008333333333354]
    },
    "objects": {
        "level_0001m": {
            "type": "GeometryCollection",
            "geometries": [
                {"type": "Polygon", "arcs": [[0]], "properties": {"name": 1}},
                {"type": "Polygon", "arcs": [[1]], "properties": {"name": 1}},
                { ... }
            ]
        },
        "level_0050m": { ... }
    }
}
But are there more advanced tools to open, explore, and edit topojson?
Try this: jsbeautifier. I just did it this way.
If you're working with Windows, try JSONedit. It's a generic JSON editor, but it is relatively efficient when handling medium-size JSON files (like your world-50m.json: 747 kB, 254k nodes including 165k int and 88k array nodes). Files similar to your notes can be created by deleting array elements after a few initial ones (RMB + "Delete all siblings after node").
http://jsonprettyprint.com/json-pretty-printer.php
I tried this one with a file of 1.9 MB and it worked; maybe it'll work for you as well.
js-beautify from the command line generates JSON the way I would write it by hand.
https://github.com/einars/js-beautify
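For a quick one-off, an invocation along these lines should do (js-beautify writes the formatted result to stdout by default, so it can simply be redirected to a new file; the file name is just an example):
js-beautify world-50m.json > world-50m.pretty.json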
Use a JSON prettifier, example: http://pro.jsonlint.com
Use http://jsoneditoronline.org
