Spring Batch Two Files With Different Structure

I have a project in Spring Batch where I must read from two .txt files: one has many lines, and the other is a control file that contains the number of lines that should be read from the first file. I know that I must use partitioning to process these files, because the first one is very large and I need to divide it and be able to restart the job in case it fails, but I don't know how the reader should handle these files, since the two files do not have the same line width.
Neither file has a header or a separator in its lines, so I have to obtain the fields by character ranges, mainly in the first one.
One of my doubts is whether I should read both files in the same reader, and if so, how I should set up the reader's FixedLengthTokenizer and DefaultLineMapper to handle both files.
Here are examples of the input file and the control file:
- input file
09459915032508501149343120020562580292792085100204001530012282921883101
The .txt file can contain up to 50,000 lines.
- control file
00128*
It only has one line
Thanks!

I must read from two .txt files, one has many lines and the other is a control file that has the number of lines that should be read from the first file
Here is a possible way to tackle your use case:
Create a first step (a tasklet) that reads the control file and puts the number of lines to read into the job execution context (to share it with the next step).
Create a second step (chunk-oriented) with a step-scoped reader that is configured to read only the number of lines determined by the first step (getting the value from the job execution context). A sketch of both steps follows the link below.
You can read more about sharing data between steps here: https://docs.spring.io/spring-batch/docs/4.2.x/reference/html/common-patterns.html#passingDataToFutureSteps
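Here is a minimal sketch of both steps, assuming Spring Batch 4.x Java configuration; the bean names, file paths ("control.txt", "input.txt"), field names, and column ranges are illustrative, not taken from your project:

import java.nio.file.Files;
import java.nio.file.Paths;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.mapping.PassThroughFieldSetMapper;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.batch.item.file.transform.FixedLengthTokenizer;
import org.springframework.batch.item.file.transform.Range;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.core.io.FileSystemResource;

// Step 1: a tasklet that reads the single control line (e.g. "00128*")
// and promotes the count to the job execution context.
@Bean
public Step readControlFileStep(StepBuilderFactory steps) {
    return steps.get("readControlFileStep")
            .tasklet((contribution, chunkContext) -> {
                String line = Files.readAllLines(Paths.get("control.txt")).get(0);
                int linesToRead = Integer.parseInt(line.replace("*", "").trim());
                chunkContext.getStepContext().getStepExecution()
                        .getJobExecution().getExecutionContext()
                        .putInt("lines.to.read", linesToRead);
                return RepeatStatus.FINISHED;
            })
            .build();
}

// Step 2's reader: step-scoped so the count can be injected from the
// job execution context; it stops after that many lines.
@Bean
@StepScope
public FlatFileItemReader<FieldSet> inputFileReader(
        @Value("#{jobExecutionContext['lines.to.read']}") Integer linesToRead) {
    FixedLengthTokenizer tokenizer = new FixedLengthTokenizer();
    tokenizer.setNames("field1", "field2");                    // illustrative field names
    tokenizer.setColumns(new Range(1, 10), new Range(11, 71)); // illustrative ranges

    DefaultLineMapper<FieldSet> lineMapper = new DefaultLineMapper<>();
    lineMapper.setLineTokenizer(tokenizer);
    lineMapper.setFieldSetMapper(new PassThroughFieldSetMapper());

    FlatFileItemReader<FieldSet> reader = new FlatFileItemReader<>();
    reader.setName("inputFileReader"); // required so the reader can save restart state
    reader.setResource(new FileSystemResource("input.txt"));
    reader.setLineMapper(lineMapper);
    reader.setMaxItemCount(linesToRead); // read only the number of lines from the control file
    return reader;
}

In other words, you do not need one reader for both files: the control file is handled by the tasklet, and only the input file goes through the FixedLengthTokenizer/DefaultLineMapper. Since FlatFileItemReader saves its state by default, the second step can be restarted from where it failed.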

Related

How to wait until a specific file arrives in a folder before NiFi's ListFile processor lists the entire contents of the folder

I need to move several hundred files from a Windows source folder to a destination folder together in one operation. The files are named sequentially (e.g. part-0001.csv, part-0002.csv). It is not known what the final file in the sequence will be called. The files will arrive in the source folder over a number of weeks, and it is not ascertainable when the final one will arrive. The users want to use a trigger file (i.e. the arrival of a specifically named file in the folder, e.g. trigger.txt) to cause the flow to start. My first two thoughts were using a first ListFile processor as an input to a second, or as the input to an ExecuteProcess processor that would call a script to start the second one; however, neither of these processors accepts an input, so I am a bit stumped as to how I might achieve this, or indeed whether it is possible with NiFi. Has anyone encountered this use case, and if so, how did you resolve it?

How can I cut several sound files using a script?

I am new to Praat and wondering if someone can help me find out how I can cut all my sound files with a script.
I have about 100 sound files I need for my research. They all have different lengths; some are 1 min and others are 3 min long.
I would like to have only the first 22 sec from each sound file.
Thanks in advance!
Kind regards
Olga
The first step is to construct a script that extracts the initial 22 seconds of some specific sound object that is already open. In general, the easiest way to at least start a script is to do the thing manually once and then, in a Praat script window, copy the command history (with Ctrl-H) to see what the underlying commands are. The manual approach is to look for "Extract part" under "Convert", which corresponds to the command
Extract part: 0, 22, "rectangular", 1, "no"
There is also a command to save a file as a wav file, so you would add that to the core of the script.
Then you need to add a loop that does this a number of times, to different files. You will (probably) need a file with the wav file names, and some system for naming the output files; for example, if you have "input1.wav", you might want to call the cut-down version "output1.wav". This implies computing the output file name from the input file name, so you need to get familiar with how string manipulation works in Praat.
If you have that much sorted out, then the basic logic is as follows, sketched here as a minimal runnable script (the input/ and output/ folder names are only examples):
# build a list of input file names
strings = Create Strings as file list: "list", "input/*.wav"
numberOfFiles = Get number of strings
for i from 1 to numberOfFiles
    # get the next input file name and compute the output name
    selectObject: strings
    fileName$ = Get string: i
    # open the input file
    sound = Read from file: "input/" + fileName$
    # extract from that file
    part = Extract part: 0, 22, "rectangular", 1, "no"
    # save the extracted file
    Save as WAV file: "output/" + fileName$
    # remove the extract and the original, then loop until no more files
    removeObject: part, sound
endfor
I would plan on spending a lot of time trying to understand simple things like string variables and object selection. Note that every command works on "the selected object", and it is easy to lose track of what is selected; that is why the sketch above reselects the file list at the start of each pass.
Another common approach is to beg a colleague to write it for you.

JMeter: read 2 different CSV files in different loops

Currently, I have a requirement where I need to make sure that data, once read, is not read again. Earlier I used HttpSimpleTableServer when I had to run only one loop, with keep=false. However, now I need to run 2 loops, for which the above option doesn't work, as the same CSV is read from the start again for the second loop. So I was wondering if there is a way to read data from a different CSV file per loop. If not, how can I make sure that different data is read from the CSV on every loop and no data is ever repeated? My JMeter version is 5.3.
You can use the CSV Data Set Config component to read the data from CSV files.
Set the `Recycle on EOF?` flag to false to read the data only once.
You may set the remaining flags based on your need.
You may add two different CSV Data Set Config elements to work with different CSV files.
If you want to handle this programmatically, the JMeter API documentation will be useful.
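For example, a minimal sketch of building the same element through the API (the property names follow what the element stores in a .jmx file; the file and variable names are made up):

import org.apache.jmeter.config.CSVDataSet;

CSVDataSet csv = new CSVDataSet();
csv.setProperty("filename", "file0.csv");      // path to the CSV file
csv.setProperty("variableNames", "col1,col2"); // JMeter variables to fill per row
csv.setProperty("delimiter", ",");
csv.setProperty("recycle", false);             // do not rewind at EOF: read the data only once
csv.setProperty("stopThread", true);           // stop the thread once the file is exhausted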
If you need to read 2 different files in 2 different loops, you should consider going for the __CSVRead() function instead.
Create 2 files like file0.csv and file1.csv
Once done you will be able to:
${__CSVRead(file${__jm__Thread Group__idx}.csv,0)} - read first column
${__CSVRead(file${__jm__Thread Group__idx}.csv,1)} - read second column
${__CSVRead(file${__jm__Thread Group__idx}.csv,next)} - proceed to next row
etc.
The __CSVRead() function will proceed to the next file on the next Thread Group iteration.
More information: How to Pick Different CSV Files at JMeter Runtime

Processing large text files through NiFi returns null

I'm having a pretty weird issue.
I currently have a NiFi flow which uses a GetFile processor to grab a log file that is dropped in a directory. From there, it's passed to a custom processor where the file is processed inside session.read using the BufferedReader class. I do all of my necessary processing and pass the flow files on. Simple stuff.
This works perfectly fine for moderately sized files, but when I try to process a large log file (around 2.5 GB) I start getting null returned when calling readLine() on the BufferedReader. It seems as if very large files aren't being opened/read by the BufferedReader.
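For reference, the read pattern described above is roughly the following (a simplified sketch inside onTrigger, where session and flowFile are available; the variable names are illustrative):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.nifi.processor.io.InputStreamCallback;

session.read(flowFile, new InputStreamCallback() {
    @Override
    public void process(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        String line;
        // readLine() should return null only when the end of the stream is reached
        while ((line = reader.readLine()) != null) {
            // per-line processing goes here
        }
    }
});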
Any advice on areas to troubleshoot to figure out why this behavior happens for bigger files but not for smaller ones?
Try to split the file into smaller pieces, for example 1,000 lines per flowfile; you can do this using the SplitText processor.
I had this problem in the past, and it was solved using this approach.
It can also happen that SplitText blocks NiFi; in that case you can chain several SplitText processors to split in stages, 100,000 lines -> 10,000 lines -> 1,000 lines. With this trick you can avoid the problem.

Custom input splits for streaming the data in MapReduce

I have a large data set that is ingested into HDFS as sequence files, with the key being the file metadata and the value the entire file contents. I am using SequenceFileInputFormat, and hence my splits are based on the sequence file sync points.
The issue I am facing is that when I ingest really large files, I am basically loading the entire file into memory in the Mapper/Reducer, as the value is the entire file content. I am looking for ways to stream the file contents while retaining the sequence file container. I even thought about writing custom splits, but I am not sure how I would retain the sequence file container.
Any ideas would be helpful.
The custom split approach is not suitable for this scenario, for the following 2 reasons:
1) The entire file is loaded onto the Map node because the Map function needs the entire file (as value = entire content). If you split the file, the Map function receives only a partial record (value) and it would fail.
2) The sequence file container is probably treating your file as a 'single record' file, so it would have only 1 sync point at most, after the header. So even if you retain the sequence file container's sync points, the whole file gets loaded onto the Map node, just as it is being loaded now.
I had concerns about losing the sequence file's sync points when writing a custom split. I was thinking of modifying the SequenceFileInputFormat/RecordReader to return chunks of the file contents as opposed to the entire file, but to return the same key for every chunk.
The chunking strategy would be similar to how file splits are calculated in MapReduce.
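For comparison, the split-size calculation referred to here is the one in Hadoop's FileInputFormat, which clamps the HDFS block size between a configured minimum and maximum; a chunking record reader could size its chunks the same way:

// From org.apache.hadoop.mapreduce.lib.input.FileInputFormat
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}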
