I am using Spring Integration to process/load data from CSV files.
My configuration is:
1) Poll for incoming files
2) Split the file using a splitter - this gives me the individual lines (records) of the file
3) Tokenize each line - this gives me the values or columns
4) Use an aggregator to aggregate/collect the lines (records) and write them to the database in a batch
Poller -> Splitter -> Tokenizer -> Aggregator
Now I want to wait till all the content of the file has been written to the database and then move the file to a different folder.
But how do I identify when the file processing is finished?
The problem here is: if the file has 1 million records and my aggregator has a batch size of 500, how would I know when every record of my file has been aggregated and written out to the database?
The FileSplitter can optionally add markers (BOF, EOF) to the output - you would have to filter and/or route them before your secondary splitter.
See FileSplitter.
(markers) Set to true to emit start/end of file marker messages before and after the file data. Markers are messages with FileSplitter.FileMarker payloads (with START and END values in the mark property). Markers might be used when sequentially processing files in a downstream flow where some lines are filtered. They enable the downstream processing to know when a file has been completely processed. In addition, a file_marker header containing START or END is added to these messages. The END marker includes a line count. If the file is empty, only START and END markers are emitted with 0 as the lineCount. Default: false. When true, apply-sequence is false by default. Also see markers-json.
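As an illustration, here is a minimal Java sketch of that idea (the channel names, the @Configuration wiring and the downstream file-move step are assumptions, not part of your actual flow): the splitter is created with markers enabled, and a router separates the markers from the data lines so the END marker can be used to trigger moving the file once the batch writes are done.

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.annotation.Router;
import org.springframework.integration.annotation.Splitter;
import org.springframework.integration.config.EnableIntegration;
import org.springframework.integration.file.splitter.FileSplitter;
import org.springframework.messaging.Message;

@Configuration
@EnableIntegration
public class FileSplitterMarkerConfig {

    @Bean
    @Splitter(inputChannel = "filesChannel")
    public FileSplitter fileSplitter() {
        // iterator mode, and emit FileMarker(START/END) messages around the file content
        FileSplitter splitter = new FileSplitter(true, true);
        splitter.setApplySequence(false);
        splitter.setOutputChannelName("linesChannel");
        return splitter;
    }

    @Router(inputChannel = "linesChannel")
    public String routeLinesAndMarkers(Message<?> message) {
        Object payload = message.getPayload();
        if (payload instanceof FileSplitter.FileMarker) {
            FileSplitter.FileMarker marker = (FileSplitter.FileMarker) payload;
            // the END marker is emitted after the last line of the file
            return marker.getMark() == FileSplitter.FileMarker.Mark.END
                    ? "fileDoneChannel"   // hypothetical channel used to trigger moving the file
                    : "nullChannel";      // discard START markers
        }
        return "tokenizerChannel";        // regular data line: continue to tokenizer/aggregator
    }
}

The channels other than nullChannel are assumed to be defined elsewhere in the context, and whatever listens on fileDoneChannel still needs to make sure the aggregator has released its last (possibly partial) batch before moving the file.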
I am using NiFi 1.9.2
I am reading a text file which happens to be a CSV file. I have the contents of the file in the contents of a flowfile.
The contents are:
a,b,c
d,e,f
g,h,i
I want to prepend a line number to all records in the flowfile and get
1,a,b,c
2,d,e,f
3,g,h,i
each time I feed a file through this processor.
I can achieve something close by using the ReplaceText processor with properties as follows:
Search Value : (?m)(^.*$)
Replacement Value : ${nextInt()},$1
But because nextInt() persists its value over the lifetime of the running NiFi instance, I get
0,a,b,c
1,d,e,f
2,g,h,i
for the first execution, and
3,a,b,c
4,d,e,f
5,g,h,i
for the next execution, and so on.
Additionally, from the NiFi expression-language-guide, the "counter is shared across all NiFi components, so calling this function multiple times from one Processor will not guarantee sequential values within the context of a particular Processor."
Is there a way to ensure the line numbers always start at 0 for each execution of this processor for the lifetime of the NiFi instance, and are always sequential?
What is the range of the counter?
Can I get the counter to start at 1?
You can split the content into individual lines, then use fragment.index to prepend the counter to each line. After that, you can merge them again.
The flow: GenerateFlowFile -> SplitText -> ReplaceText -> MergeContent.
Don't forget to add a new line (Shift+Enter) to the Demarcator attribute of MergeContent.
You can use ${fragment.index:minus(1)} if you want to count from zero.
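Since the original ReplaceText screenshot is not reproduced here, this is a sketch of what its properties in the new flow could look like (assuming SplitText uses a Line Split Count of 1, so each split flowfile is a single line), keeping the original regex but swapping the persistent counter for the per-file fragment index:

Search Value : (?m)(^.*$)
Replacement Value : ${fragment.index},$1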
I have two configuration files for Logstash: test1.conf and test2.conf.
Each one of them has its own flow of input -> filter -> output.
Both of them have the same filter and the same Elasticsearch output, writing to the same index.
My problem is that Logstash is writing duplicate events to the Elasticsearch index, no matter which input I choose to test (every event becomes two identical events instead of one).
How can I fix this?
By default, Logstash has one pipeline named main which automatically picks up all .conf files in the conf.d folder; this is configured in the pipelines.yml file:
- pipeline.id: main
  path.config: "/etc/logstash/conf.d/*.conf"
If you have multiple .conf files under one pipeline, Logstash merges them together, so all filters and outputs are applied to all of the inputs. In this case, no matter which input receives events, they go through both filter/output paths, causing duplicate writes to Elasticsearch (identical events, if the filters/outputs are the same in both .conf files).
Solutions
1. Move filter/output into a separate file
If your filters/outputs are the same across the config files, move filter/output into a separate file. So now you have two .conf files, one for each input, and a third .conf file for the filter/output. With this setup every input will go through only one processing path.
For example:
input1.conf
input {
# input 1
}
input2.conf
input {
# input 2
}
filter_output.conf
filter {
# common filter
}
output {
# common output
}
You can check out this answer for another example of when this solution should be chosen.
Note that if the filters/outputs are the same but you still want to treat them as completely different processing paths, keep reading.
2. Split the .conf files into different pipelines
If you need every .conf file to be independent, split the .conf files into different pipelines.
In order to do that, just edit the pipelines.yml file.
For example:
pipelines.yml
- pipeline.id: test1
  path.config: "/etc/logstash/conf.d/test1.conf"
- pipeline.id: test2
  path.config: "/etc/logstash/conf.d/test2.conf"
Read more about Multiple Pipelines
3. Separate by types
Tag each input with a different type and check it later in the filters/outputs with an if statement.
You can read more about it in this answer.
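For example, a minimal sketch of this approach could look like the following (the input plugins, paths, ports and index names are placeholders, not your actual configuration):

input {
  file {
    path => "/var/log/app1/*.log"
    type => "test1"
  }
  tcp {
    port => 5000
    type => "test2"
  }
}
filter {
  if [type] == "test1" {
    # filter for test1 events
  } else if [type] == "test2" {
    # filter for test2 events
  }
}
output {
  if [type] == "test1" {
    elasticsearch { index => "index-test1" }
  } else if [type] == "test2" {
    elasticsearch { index => "index-test2" }
  }
}

This way both inputs can stay in one pipeline while each event only goes through the filter and output branch matching its type.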
I have been writing a Spring Batch job wherein I have to perform some error handling.
I know that spring batch has its own way of handling errors and restarts.
However, when the batch fails and restarts, I want to pass my own values, conditions and parameters (for the restart), which need to be applied before starting/executing the first step.
So, is it possible to write such custom restart in spring batch?
UPDATE 1: (providing a better explanation of the above question)
Let's say that the input to my reader in Step 1 is in the following format:
A table with the following columns:
CompanyName1 -> VehicleId1
CN1 -> VID2
CN1 -> VID3
.
.
CN1 -> VID30
CN2 -> VID1
CN2 -> VID2
.
.
CNn -> VIDn
The reader reads this table row by row with a chunk size of 1 (so in this case the row retrieved will be CN -> VID), processes it, and writes it to a File object.
This process goes on until all the CN1 data has been written into the File object. When the reader sends a row with a company name of type CN2, the File object that was created earlier (for company name CN1) is stored in a remote location. File object creation then continues for CN2 until we encounter CN3, at which point the CN2 File object is sent for storage to the remote location, and the process continues.
Now, once you understand this, here's the catch.
Let's say the data currently being written by the writer into the File object is for company name 2 (CN2) and vehicle ID VID20 (CN2 -> VID20). Then, for some reason, we have to stop the job, or the job fails. In that case, the instance that will be saved is CN2 -> VID20, so the next time the job runs it will start from CN2 -> VID20.
As you might have guessed, all 19 entries before CN2 -> VID20 that were written into the File object were permanently lost when the File object got destroyed, and those entries were never sent through the file to the remote location.
So my question here is this:
Is there a way where I can write my custom restart for the batch where I could tell the job to start from CN2->VID1 instead of CN2->VID20?
If you can think of any other way to handle this scenario, such suggestions are also welcome.
Since you want to write the data of each company in a separate file, I would use the company name as a job parameter. With this approach:
- Each job will read the data of a single company and write it to a file.
- If a job fails, you can restart it and it will resume where it left off (Spring Batch will truncate the file to the last known written offset before writing new data). So there is no need for a custom restart "policy".
- There is no need to set the chunk size to 1; that is not efficient. You can use a reasonable chunk size with this approach.
- If the number of companies is small enough, you can run the jobs manually. Otherwise, you can always get the distinct values with something like select distinct(company_name) from yourTable and write a script/loop to launch batch jobs with different parameters. Bottom line: make each job do one thing and do it well.
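A rough sketch of that launching loop could look like this (the job name, table name and wiring are assumptions for illustration, not your actual code):

import java.util.List;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.jdbc.core.JdbcTemplate;

public class CompanyJobLauncher {

    private final JobLauncher jobLauncher;
    private final Job exportCompanyJob;   // hypothetical job that writes one company's file
    private final JdbcTemplate jdbcTemplate;

    public CompanyJobLauncher(JobLauncher jobLauncher, Job exportCompanyJob, JdbcTemplate jdbcTemplate) {
        this.jobLauncher = jobLauncher;
        this.exportCompanyJob = exportCompanyJob;
        this.jdbcTemplate = jdbcTemplate;
    }

    public void launchAll() throws Exception {
        // one job instance per company, identified by the companyName job parameter
        List<String> companies = jdbcTemplate.queryForList(
                "select distinct(company_name) from yourTable", String.class);
        for (String company : companies) {
            JobParameters params = new JobParametersBuilder()
                    .addString("companyName", company)
                    .toJobParameters();
            // a failed instance can be restarted later by running it again with the same parameters
            jobLauncher.run(exportCompanyJob, params);
        }
    }
}

The reader inside the job can then pick up the parameter through late binding, for example a step-scoped bean using @Value("#{jobParameters['companyName']}"), so each execution only selects that company's rows.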
I'm trying to understand how CSV Data Set Config behaves with respect to JMeter's scoping rules and execution order.
For CSV Data Set Config, the documentation says "Lines are read at the start of each test iteration." At first I thought that referred to thread iterations; then I read Use jmeter to test multiple Websites, where the config is put inside a Loop Controller and lines are read on each loop iteration. I've tested with 5.1.1 and it works that way. But if I put the config at the root of the test plan, it will read a new line only on each thread iteration. Can I expect such behaviour based on the docs alone, without trial and error? I cannot see how it follows from the scoping rules, the execution order and the docs on the CSV config element. Am I missing something?
I would also appreciate some ideas on why the actual behaviour is convenient and why the functionality was implemented this way.
P.S. How can I read a one-line CSV into variables at the start of the test and then stop running that config to save CPU time? In version 2.x there was a VariablesFromCSV config for that...
The Thread Group has an implicit Loop Controller inside it: the next line from the CSV will be read as soon as a LoopIterationListener.iterationStart() event occurs, no matter its origin.
It is safe to use CSV Data Set Config, as it doesn't keep the whole file in memory; it reads the next line only when the aforementioned iterationStart() event occurs. However, it does keep an open file handle. If you have plenty of RAM but not enough file handles, you can read the file into memory at the beginning of the test using, for example, a setUp Thread Group and a JSR223 Sampler with the following code:
// don't generate a sample result for this setup-only sampler
SampleResult.setIgnore()

// read the whole CSV into JMeter properties: line_1, line_2, ...
new File('/path/to/csv/file').readLines().eachWithIndex { line, index ->
    props.put('line_' + (index + 1), line)
}
Once done, you will be able to refer to the first line via the __P() function as ${__P(line_1,)}, to the second line as ${__P(line_2,)}, and so on.
I have a JMeter test with multiple While Controllers, each looping through data in a separate file.
I want each while loop to loop through to the end of that file.
Structure:
While controller 1
- CSV Data Config 1
- Http sampler 1
While Controller 2
- CSV Data Config 2
- http sampler 2
When I set ${__javaScript(${myVar}!="<EOF>")} as the end condition, with Stop thread on EOF set to true, it stops the whole test completely.
If I set Stop thread on EOF to false, it loops on the <EOF> value as well, meaning it loops one time too many.
Is there another way to do this?
Thanks
A possible solution would be to use a Beanshell processor to find the number of records in each file and use the number of lines as the condition to break out of the loop. Please note that you will have to use parseInt in the While condition, as discussed in this thread.
Another option could be changing your While Controller to loop "forever" and controlling thread stopping via the CSV Data Set Config.
As per the Using CSV DATA SET CONFIG guide:
It is worth mentioning that every CSV Data Set Config is visible to all Thread Groups by default. If you need to use separate CSV Data Set Config for every Thread, you create a number of data files that you need and in every CSV Data Set Config set “Sharing mode” to “Current Thread”
So the following combination:
Recycle on EOF = false
Stop Thread on EOF = true
Sharing mode = Current Thread
Should do the trick for you.