Multiple threads problem in Spring Batch when using the StaxEventItemReader

I want to speed up saving data to the database, so I chose to use multithreading with Spring Batch.
The first time, the input was a CSV file, so I used the FlatFileItemReader and got better performance using a ThreadPoolTaskExecutor.
The second time, I wanted to read the same data from an XML file and save it to the database, so I used the StaxEventItemReader. In this case I could not use multithreading and got errors like:
"Error while reading from event reader".
I don't know exactly what the problem is, but I guess the StaxEventItemReader can't be used with multiple threads.
Is there another XML file reader I can use that allows me to use multiple threads?
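For what it's worth, the StaxEventItemReader is not thread-safe, so a multi-threaded step will fail in exactly this way. One commonly used option is to keep the multi-threaded step but wrap the reader in a SynchronizedItemStreamReader, so reads are serialized while processing and writing still run in parallel. A rough sketch, assuming Spring Batch 4.x style configuration; the Person class, bean names, chunk size and file name are placeholders, not taken from the question:

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.support.SynchronizedItemStreamReader;
import org.springframework.batch.item.xml.StaxEventItemReader;
import org.springframework.batch.item.xml.builder.StaxEventItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;
import org.springframework.oxm.jaxb.Jaxb2Marshaller;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

import javax.xml.bind.annotation.XmlRootElement;

@Configuration
public class XmlImportStepConfig {

    // Placeholder domain type mapped from each <person> fragment.
    @XmlRootElement(name = "person")
    public static class Person {
    }

    @Bean
    public SynchronizedItemStreamReader<Person> personReader() {
        Jaxb2Marshaller marshaller = new Jaxb2Marshaller();
        marshaller.setClassesToBeBound(Person.class);

        StaxEventItemReader<Person> delegate = new StaxEventItemReaderBuilder<Person>()
                .name("personXmlReader")
                .resource(new FileSystemResource("persons.xml"))   // placeholder path
                .addFragmentRootElements("person")
                .unmarshaller(marshaller)
                .build();

        // StaxEventItemReader is not thread-safe; the wrapper serializes read() calls.
        SynchronizedItemStreamReader<Person> reader = new SynchronizedItemStreamReader<>();
        reader.setDelegate(delegate);
        return reader;
    }

    @Bean
    public Step xmlImportStep(StepBuilderFactory steps) {
        ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
        taskExecutor.setCorePoolSize(4);
        taskExecutor.setMaxPoolSize(4);
        taskExecutor.afterPropertiesSet();

        return steps.get("xmlImportStep")
                .<Person, Person>chunk(100)
                .reader(personReader())
                .writer(items -> { /* JDBC batch insert, as in the CSV job */ })
                .taskExecutor(taskExecutor)   // processing and writing run on multiple threads
                .build();
    }
}

With this setup the read itself ends up single-threaded, so the speed-up comes from running the processing and database writes on the task executor's threads; as with any multi-threaded step, restart guarantees are weaker.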

Related

Writing in the same files using Spring Batch

I am trying to build a batch application in which I pick up files from some folders, filter them by name, and pass them to the batch operation using a MultiResourceItemReader. Then I implement my own ItemProcessor to change a few rows based on some condition.
My requirement is to write the updated data back to the same files I am taking input from; I don't know if this can really be done with Spring Batch.
Basically, I can't figure out how to implement the ItemWriter here, because I need to write the data to the same files the reader used, which means writing to multiple files.
I guess ClassifierCompositeItemWriter or MultiResourceItemWriter could be used here; I have tried to read about them in different Stack Overflow answers but couldn't find anything related to my requirement.
Can anyone help me implement this? A code example would be really helpful.
Thanks
While this should be technically possible, I'm not sure it is a good idea for restartability. What would you do if something goes wrong? The input file would have been overwritten with new data and you would lose the original items.
I'm probably missing something here, but I see no reason not to keep the original file, which could be deleted before the new one is renamed with the same name in a final step at the end of the job (after all the processing has been successfully done).
Partitioning can be used here. I don't know if this is the right approach, but it is working very well in my case.
First I created 3 sets of partitions for the 3 types of files, using a MultiResourcePartitioner. I filtered the files from the file system using the java.nio.file.Files class and fed the collection of files to the MultiResourcePartitioner. It automatically creates a partition for every file.
Then I wrote the Reader, Writer and Processor for every partition. In the Reader I picked the file name dynamically from the stepExecutionContext using the @StepScope annotation, and in the Writer I used a temporary file name to store the output.
Finally I created a Tasklet for deleting the original files and renaming the temporary files.
Read the documentation about partitioning and parallel processing to see whether it works in your case: https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html#partitioning
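To make the answer above more concrete, here is a rough sketch of that setup, assuming Spring Batch's MultiResourcePartitioner with step-scoped file reader/writer beans. The file pattern, bean names and the .tmp suffix are placeholders, and the master/worker step wiring plus the delete-and-rename Tasklet are omitted:

import java.io.IOException;

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.batch.item.file.builder.FlatFileItemWriterBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;

@Configuration
public class PartitionedRewriteConfig {

    // One partition per input file; the partitioner stores each file's URL
    // in the step execution context under the default key "fileName".
    @Bean
    public MultiResourcePartitioner partitioner() throws IOException {
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        Resource[] inputFiles = new PathMatchingResourcePatternResolver()
                .getResources("file:/data/input/type-a/*.txt");   // placeholder pattern
        partitioner.setResources(inputFiles);
        return partitioner;
    }

    // Reader bound to the file of the current partition.
    @Bean
    @StepScope
    public FlatFileItemReader<String> partitionedReader(
            @Value("#{stepExecutionContext['fileName']}") Resource file) {
        return new FlatFileItemReaderBuilder<String>()
                .name("partitionedReader")
                .resource(file)
                .lineMapper((line, lineNumber) -> line)
                .build();
    }

    // Writer writes to a temporary file next to the original; a final Tasklet
    // (not shown) deletes the originals and renames the *.tmp files back.
    @Bean
    @StepScope
    public FlatFileItemWriter<String> partitionedWriter(
            @Value("#{stepExecutionContext['fileName']}") Resource file) throws IOException {
        return new FlatFileItemWriterBuilder<String>()
                .name("partitionedWriter")
                .resource(new FileSystemResource(file.getFile().getAbsolutePath() + ".tmp"))
                .lineAggregator(item -> item)
                .build();
    }
}

MultiResourcePartitioner puts each file's URL into the step execution context under the key "fileName" by default, which is why the step-scoped beans can pick it up with the SpEL expression above.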

Loading Data into the application from GUI using Ruby

Problem:
Hi everyone, I am currently building an automation suite using Ruby, Selenium WebDriver and Cucumber to load data into the application through its GUI. I take input from mainframe .txt files. The scenarios are, for example, to create a customer and then load multiple accounts for them as per the data provided in the inputs.
Current Approach
I execute the scenario using a rake task, passing the line number as a parameter, so the script is executed for only one set of data.
To read the data for a particular line, I'm using the code below:
File.readlines(file_path)[line_number.to_i - 1]
My purpose in loading line by line is to keep the execution running even if one line fails to load.
Shortcomings
Suppose I have to load 10 accounts for a single customer. My current script has to run 10 times to load each account. I want something that can load all the accounts in a single go.
What I am looking for
To overcome the above shortcoming, I want to capture the entire data for a single customer from the file (accounts etc.) and load it into the application in a single execution.
Also, I have to keep track of the execution time and memory allocation as well.
Please share your thoughts on this approach; any suggestions or improvements are welcome. (Sorry for the long post.)
The first thing I'd do is break this down into steps -- as you said in your comment, but more formally here:
1. Get the data to apply to all records. Put up a page with the necessary information (or support command-line specification if that's not too much?).
2. For each line in the file, do the following (automated):
   - Get the web page for inputting its data;
   - Fill in the fields;
   - Submit the form.
Given this, I'd say the 'for each line' instruction should definitely be reading a line at a time from the file using File.foreach or similar.
Is there anything beyond this that needs to be taken into account?

two programs accessing one file

New to this forum - looks great!
I have some Processing code that periodically reads data wirelessly from remote devices and writes that data as bytes to a file, e.g. data.dat. I want to write an Objective-C program on my Mac Mini using Xcode to read this file, parse the data, and act on it if the values indicate a problem. My question is: can my two different programs access the same file asynchronously without a problem? If this is a problem, can you suggest a technique that would allow these operations?
Thanks,
Kevin H.
Multiple processes can read from the same file at a time without any problem. A process can also read from a file while another writes without problem, although you'll have to take care to ensure that you read in any new data that was written. Multiple processes should not write to the same file at the same time, though. The OS will let you do it, but the ordering of data will be undefined, and you'll likely overwrite data; in general, you're gonna have a bad time if you do that. So you should take care to ensure that only one process writes to a file at a time.
The simplest way to protect a file so that only one process can write to it at a time is with the C function flock(), although that function is admittedly a bit rudimentary and may or may not suit your use case.
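Since Processing sketches are Java underneath, the writing side of that advisory-locking idea could look roughly like the sketch below, using java.nio's FileChannel.lock(). Whether a Java FileLock actually interoperates with a C flock() taken in the Objective-C reader is platform-dependent, so treat this as an illustration of the idea rather than a drop-in pairing; the file name and record format are placeholders:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical writer: takes an exclusive lock while appending a batch of
// bytes, so a cooperating reader never sees a half-written record.
public class LockedWriter {

    public static void append(Path file, byte[] record) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            try (FileLock lock = channel.lock()) {   // blocks until the exclusive lock is acquired
                channel.write(ByteBuffer.wrap(record));
            }
        }
    }
}

The reader has to take the same lock (shared or exclusive) for the protection to mean anything; advisory locks only guard against processes that also ask for them.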

How to process an open file using MapReduce framework

I have a file that gets aggregated and written into HDFS. The file will be open for an hour before it is closed. Is it possible to process this file with the MapReduce framework while it is still open? I tried it, but it's not picking up all the appended data. I can query the data in HDFS and it is available, but not when the read is done by MapReduce. Is there any way I could force MapReduce to read an open file? Perhaps by customizing the FileInputFormat class?
You can read what was physically flushed. Since close() makes the final flush of the data, your reads may miss some of the most recent data regardless of how you access it (MapReduce or command line).
As a solution I would recommend periodically closing the current file and then opening a new one (with an incremented index suffix). You can run your MapReduce job over multiple files. You would still end up with some data missing from the most recent file, but at least you can control that through the frequency of your file "rotation".
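A rough sketch of that rotation idea, assuming the aggregating process writes through the Hadoop FileSystem API; the class, path pattern and rotation trigger are hypothetical, not part of the original setup:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical writer that rolls over to a new HDFS file with an incremented
// index suffix, so earlier files are closed and fully visible to MapReduce.
public class RollingHdfsWriter implements AutoCloseable {

    private final FileSystem fs;
    private final String basePath;   // e.g. "/data/events/part-"
    private int index = 0;
    private FSDataOutputStream out;

    public RollingHdfsWriter(Configuration conf, String basePath) throws IOException {
        this.fs = FileSystem.get(conf);
        this.basePath = basePath;
        roll();
    }

    public synchronized void write(byte[] record) throws IOException {
        out.write(record);
    }

    // Call this periodically (e.g. every few minutes); closed files can then
    // be picked up completely by a MapReduce job over basePath + "*".
    public synchronized void roll() throws IOException {
        if (out != null) {
            out.close();
        }
        out = fs.create(new Path(basePath + index++));
    }

    @Override
    public synchronized void close() throws IOException {
        out.close();
    }
}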

Is it possible to use Pig streaming (StreamToPig) in a way that handles multiple lines as a single input tuple?

I'm streaming data in a Pig script through an executable that returns an XML fragment for each line of input I stream to it. That XML fragment happens to span multiple lines, and I have no control whatsoever over the output of the executable I stream to.
In relation to "Use Hadoop Pig to load data from text file w/ each record on multiple lines?", the answer there suggested writing a custom record reader. The problem is that this works fine if you want to implement a LoadFunc that reads from a file, but to be able to use streaming it has to implement StreamToPig, and StreamToPig only lets you read one line at a time, as far as I understood.
Does anyone know how to handle such a situation?
If you are absolutely sure, then one option is to manage it internally in the streaming solution. That is to say, you build up the tuple yourself, and when you hit whatever your desired size is, you do the processing and return a value. In general, eval funcs in Pig have this issue.
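A minimal sketch of that buffering idea, independent of the exact StreamToPig method signature: accumulate incoming lines until the fragment's closing tag shows up, then hand the whole fragment on as one logical record. The </record> end tag is an assumption for illustration, not something from the question:

import java.util.ArrayList;
import java.util.List;

// Hypothetical accumulator: feed it lines as they arrive; it returns a
// complete XML fragment once the (assumed) closing tag has been seen,
// and null while the fragment is still being built up.
public class XmlFragmentBuffer {

    private static final String END_TAG = "</record>";   // placeholder end tag
    private final List<String> lines = new ArrayList<>();

    public String append(String line) {
        lines.add(line);
        if (line.trim().endsWith(END_TAG)) {
            String fragment = String.join("\n", lines);
            lines.clear();
            return fragment;          // one logical tuple's worth of input
        }
        return null;                  // fragment not complete yet
    }
}

The accumulator can sit inside whatever deserialization hook the streaming interface gives you; only the "when is a record complete" check is specific to your XML fragments.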
