Spring Batch to upload a CSV file and insert it into a database - spring

My project has a requirement where a user uploads a CSV file which has to be pushed to a SQL Server database.
I know we can use Spring Batch to process a large number of records, but I'm not able to find any tutorial/sample code for this requirement of mine.
All the tutorials I came across just hardcode the CSV file name and use in-memory databases, like this one:
https://spring.io/guides/gs/batch-processing/
The user input file lands in a shared drive location at a scheduled time, with a file name prefix like stack_overflow_dd-MM-yyyy HH:mm, on a daily basis. How can I poll the network shared drive every 5-10 minutes, for at least one hour daily, and upload the file to the database if its name matches a regex?
How can I take the CSV file from the shared location first, store it in memory or somewhere, and then configure Spring Batch to read that as input?
Any help here would be appreciated. Thanks in advance.

All the tutorials I came across just hardcode the CSV file name and use in-memory databases
You can find samples in the official repo here. Here is an example where the input file name is not hardcoded but passed as a job parameter.
How can I take the CSV file from the shared location first, store it in memory or somewhere, and then configure Spring Batch to read that as input?
You can proceed in two steps: download the file locally, then read/process/write it to the database (see https://stackoverflow.com/a/52110781/5019386).
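For illustration, here is a minimal sketch of a step-scoped reader whose resource comes from a job parameter, to be declared in your batch @Configuration class; the Employee class, its column names, and the inputFile parameter name are assumptions, not taken from the linked sample:

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.core.io.FileSystemResource;

@Bean
@StepScope
public FlatFileItemReader<Employee> reader(
        @Value("#{jobParameters['inputFile']}") String inputFile) {
    // The file to read is resolved at step execution time from the job parameter
    BeanWrapperFieldSetMapper<Employee> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
    fieldSetMapper.setTargetType(Employee.class);
    return new FlatFileItemReaderBuilder<Employee>()
            .name("employeeItemReader")
            .resource(new FileSystemResource(inputFile))
            .delimited()
            .names(new String[] {"employeeId", "employeeName", "address", "country"})
            .fieldSetMapper(fieldSetMapper)
            .build();
}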
How can I poll the network shared drive every 5-10 minutes, for at least one hour daily, and upload to the database if the name matches a regex?
Once you have defined your job, you can schedule it to run when you want using:
a scheduler like Quartz,
or Spring's task scheduling features (see the sketch below),
or a combination of Spring Integration and Spring Batch: Spring Integration polls the directory and launches a Spring Batch job when appropriate. This approach is described here.
More details on job scheduling here.
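As a hedged example of the Spring task-scheduling option, the sketch below polls a mounted share and launches the job with the resolved file as a job parameter. The cron window, the share path, the file-name regex, and the csvImportJob bean name are all assumptions for illustration, and @EnableScheduling must be present for @Scheduled to fire:

import java.io.File;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class CsvImportScheduler {

    @Autowired
    private JobLauncher jobLauncher;

    @Autowired
    private Job csvImportJob;

    // Runs every 5 minutes during the 08:00-08:59 window each day (adjust to your window)
    @Scheduled(cron = "0 */5 8 * * *")
    public void pollSharedDrive() throws Exception {
        File dir = new File("/mnt/shared/incoming"); // mounted network share (assumption)
        File[] matches = dir.listFiles(
                (d, name) -> name.matches("stack_overflow_.*\\.csv"));
        if (matches == null) {
            return;
        }
        for (File file : matches) {
            // A new file path makes a new job instance, so already-processed files are not re-run
            JobParameters params = new JobParametersBuilder()
                    .addString("inputFile", file.getAbsolutePath())
                    .toJobParameters();
            jobLauncher.run(csvImportJob, params);
        }
    }
}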

You can create a service layer that processes the uploaded file, reads the data from it, and builds Java objects to save into the DB. Here I have used Apache POI to parse the data and read it from an Excel sheet.
import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class FileUploadService {

    @Autowired
    FileUploadDao fileUploadDao;

    public String uploadFileData(String inputFilePath) {
        try {
            // getWorkBook(...) is a helper that opens the XLS/XLSX file with Apache POI
            Workbook workbook = getWorkBook(new File(inputFilePath));
            Sheet sheet = workbook.getSheetAt(0);

            // Header columns expected in the input file
            String headerDetails = "EmployeeId,EmployeeName,Address,Country";
            String[] headerNames = headerDetails.split(",");

            // Read and process each row
            ArrayList<ExcelTemplateVO> employeeList = new ArrayList<>();
            Iterator<Row> rowIterator = sheet.iterator();
            while (rowIterator.hasNext()) {
                Row row = rowIterator.next();
                // Read each column of the row and set it on the value object
                ExcelTemplateVO excelTemplateVO = new ExcelTemplateVO();
                int count = 0;
                while (count < headerNames.length) {
                    String methodName = "set" + headerNames[count];
                    // getCellValueBasedOnCellType(...) and setValueIntoObject(...) are helpers
                    // that read the cell value and call the matching setter via reflection
                    String inputCellValue = getCellValueBasedOnCellType(row, count++);
                    setValueIntoObject(excelTemplateVO, ExcelTemplateVO.class,
                            methodName, "java.lang.String", inputCellValue);
                }
                employeeList.add(excelTemplateVO);
            }
            fileUploadDao.saveFileDataInDB(employeeList);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return "Success";
    }
}

I believe your question has already been answered here.
The author of that question has even uploaded a repository with his working result:
https://github.com/PriyankaBolisetty/SpringBatchUploadCSVFileToDatabase/tree/master/src/main/java/springbatch_example
You can retrieve and filter the list of files on a shared drive using the JCIFS API method SmbFile.listFiles(String wildcard).
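A rough sketch with the legacy jcifs 1.x classes; the share URL, domain, and credentials below are placeholders, and note that listFiles takes a '*'/'?' wildcard, not a full regex:

import jcifs.smb.NtlmPasswordAuthentication;
import jcifs.smb.SmbFile;

public class SharedDrivePoller {

    public static void main(String[] args) throws Exception {
        NtlmPasswordAuthentication auth =
                new NtlmPasswordAuthentication("DOMAIN", "user", "password");
        SmbFile dir = new SmbFile("smb://fileserver/shared/incoming/", auth);
        // Match e.g. stack_overflow_* files dropped on the share
        for (SmbFile file : dir.listFiles("stack_overflow_*")) {
            System.out.println("Found candidate file: " + file.getName());
        }
    }
}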

Related

Spring Integration SFTP - issue with filters and the number of messages emitted

I started using Spring Integration SFTP and I have some questions.
Filters are not working. I have this example configuration:
Sftp.inboundAdapter(ftpFileSessionFactory())
    .preserveTimestamp(true)
    .deleteRemoteFiles(false)
    .remoteDirectory(integrationProperties.getRemoteDirectory())
    .filter(sftpFileListFilter())   // doesn't work
    .patternFilter("*.xlsx")        // doesn't work
And my ChainFileListFilter:
private ChainFileListFilter<ChannelSftp.LsEntry> sftpFileListFilter() {
    ChainFileListFilter<ChannelSftp.LsEntry> chainFileListFilter = new ChainFileListFilter<>();
    chainFileListFilter.addFilter(new SftpPersistentAcceptOnceFileListFilter(metadataStore(), "INT"));
    chainFileListFilter.addFilter(new SftpSimplePatternFileListFilter("*.xlsx"));
    return chainFileListFilter;
}
If I understand correctly, only XLSX files should be saved in the local directory. If so, it doesn't work with this configuration. Am I doing something wrong, or have I misunderstood this?
How can I configure SFTP so that each downloaded file emits a message? I see two params in the docs, max-messages-per-poll and max-fetch-size, but I don't know how to set them up so that every file emits a message. I would like to sync files once every 24 hours and produce a batch job queue. Maybe there is a workaround?
Is there a built-in filter which allows me to fetch only files with changed content? The best solution would be to check the checksums of the files.
I will be grateful for your help and explanations.
You cannot combine filter() and patternFilter(). Only one of them can be used: the last one overrides whatever you used before. In other words, use either filter() or patternFilter(), not both. By default the logic is like this:
public SftpInboundChannelAdapterSpec patternFilter(String pattern) {
    return filter(composeFilters(new SftpSimplePatternFileListFilter(pattern)));
}

private CompositeFileListFilter<ChannelSftp.LsEntry> composeFilters(
        FileListFilter<ChannelSftp.LsEntry> fileListFilter) {
    CompositeFileListFilter<ChannelSftp.LsEntry> compositeFileListFilter = new CompositeFileListFilter<>();
    compositeFileListFilter.addFilters(fileListFilter,
            new SftpPersistentAcceptOnceFileListFilter(new SimpleMetadataStore(), "sftpMessageSource"));
    return compositeFileListFilter;
}
So, technically you don't need your custom one if you don't use an external persistent MetadataStore. But if you do, think about flipping SftpSimplePatternFileListFilter and SftpPersistentAcceptOnceFileListFilter, since it is better to check the pattern before storing the file in the MetadataStore.
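A minimal sketch of that reordering, reusing the metadataStore() bean and the "INT" prefix from the question:

private ChainFileListFilter<ChannelSftp.LsEntry> sftpFileListFilter() {
    ChainFileListFilter<ChannelSftp.LsEntry> chainFileListFilter = new ChainFileListFilter<>();
    // Check the file name pattern first ...
    chainFileListFilter.addFilter(new SftpSimplePatternFileListFilter("*.xlsx"));
    // ... so only matching files are remembered in the persistent MetadataStore
    chainFileListFilter.addFilter(new SftpPersistentAcceptOnceFileListFilter(metadataStore(), "INT"));
    return chainFileListFilter;
}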
In fact, every synced remote file that passes those filters is stored in the local directory, and the message for that local file is emitted as soon as the poller makes a request.
The maxFetchSize plays its role when remote files are loaded into the local directory. The maxMessagesPerPoll is used by the poller, but those messages are already built from the local files. A message is emitted per local file, not as a batch for all of them; that's not what messaging is designed for.
Please share more info about what exactly does not work with the files. The SftpPersistentAcceptOnceFileListFilter checks not only the file name but also the mtime of the file. So it is not about any checksum, but rather the last-modified timestamp of the file.

Download file contents and names into a List with Apache Camel FTP

I would like to download a list of files with name and content in Apache Camel.
Currently I am downloading the file content of all files as byte[] and storing them in a List. I then read the list using a ConsumerTemplate.
This works well. This is my Route:
from(downloadUri)
    .aggregate(AggregationStrategies.flexible(byte[].class)
        .accumulateInCollection(LinkedList.class))
    .constant(true)
    .completionFromBatchConsumer()
    .to("direct:" + this.destinationObjectId);
I get the List of all downloaded file contents as byte[] as desired.
I would like to extend it now so that it downloads the content and the file name of each file. They shall be stored in a pair object:
public class NameContentPair {
    private String fileName;
    private byte[] fileContent;

    public NameContentPair(String fileName, byte[] fileContent) { ... }
}
These pair objects for each downloaded file shall in turn be stored in a List. How can I change or extend my Route to do this?
I tried Camel Converters, but was not able to build them properly into my Route. I always got the Route setup wrong.
I solved this by implementing a custom AggregationStrategy.
It reads the file name and the file content from each Exchange and puts them into a list as NameContentPair objects. The file content and file name are present in the Exchange's body as a RemoteFile and are read from there.
The general aggregation implementation is based on the example from https://camel.apache.org/components/3.15.x/eips/aggregate-eip.html
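A rough sketch of such a strategy, assuming the exchange body converts to byte[] and the file name is available via the Exchange.FILE_NAME header:

import java.util.ArrayList;
import java.util.List;

import org.apache.camel.AggregationStrategy;
import org.apache.camel.Exchange;

public class FileContentWithFileNameInListAggregationStrategy implements AggregationStrategy {

    @Override
    public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
        // File name and content of the file just fetched by the FTP consumer
        String fileName = newExchange.getIn().getHeader(Exchange.FILE_NAME, String.class);
        byte[] fileContent = newExchange.getIn().getBody(byte[].class);
        NameContentPair pair = new NameContentPair(fileName, fileContent);

        if (oldExchange == null) {
            // First file of the batch: start the list and use this exchange as the result
            List<NameContentPair> pairs = new ArrayList<>();
            pairs.add(pair);
            newExchange.getIn().setBody(pairs);
            return newExchange;
        }

        // Subsequent files: append to the list accumulated so far
        @SuppressWarnings("unchecked")
        List<NameContentPair> pairs = oldExchange.getIn().getBody(List.class);
        pairs.add(pair);
        return oldExchange;
    }
}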
The aggregation strategy is then added to the route
from(downloadUri)
    .aggregate(new FileContentWithFileNameInListAggregationStrategy())
    .constant(true)
    .completionFromBatchConsumer()
    .to("direct:" + this.destinationObjectId);

How to read multiple files, process and write separately using spring batch

I want to read multiple files matching name*.txt and process them.
For that I am using MultiResourceItemReader.
It is reading all the files, then processing and writing them in one go. I want to read, process, and write each file separately.
The code:
@Bean
public MultiResourceItemReader<POJO> multiResourceItemReader() throws IOException {
    MultiResourceItemReader<POJO> resourceItemReader = new MultiResourceItemReader<POJO>();
    ClassLoader cl = this.getClass().getClassLoader();
    ResourcePatternResolver resolver = new PathMatchingResourcePatternResolver(cl);
    Resource[] resources = resolver.getResources("file:" + filePath);
    resourceItemReader.setResources(resources);
    resourceItemReader.setDelegate(reader());
    return resourceItemReader;
}
That's how the MultiResourceItemReader is designed to work. In your case, you can create a job instance per file.
There are many advantages to making one thing do one thing and do it well; one of them in your use case is restartability: if one of the jobs fails, you only restart the failed one.
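As a hedged sketch of that idea, resolving the files with the same pattern as the question and launching the job once per file; the injected jobLauncher and job variables and the inputFile parameter name are assumptions:

// Inside a component that has a JobLauncher jobLauncher and a Job job injected,
// with filePath pointing at the same pattern used by the reader above.
// (The surrounding method should declare or handle IOException and the JobExecution exceptions.)
Resource[] resources = new PathMatchingResourcePatternResolver()
        .getResources("file:" + filePath);
for (Resource resource : resources) {
    // The absolute path is an identifying parameter, so each file gets its own job instance
    JobParameters jobParameters = new JobParametersBuilder()
            .addString("inputFile", resource.getFile().getAbsolutePath())
            .toJobParameters();
    jobLauncher.run(job, jobParameters);
}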

Hive setup()-like functionality similar to Mapper setup()?

I want to replace a Hadoop job with Hive. My challenge is that in Hadoop I use setup() to build a kd-tree by reading reference data (points of interest) from the distributed cache. I then use the kd-tree in map() to evaluate the distance of the target data against it.
In Hive, I wanted to use a UDF with an evaluate() method to determine the distance, but I don't know how to set up the kd-tree with the reference data. Is this possible?
I probably don't have the entire answer, so I'm just going to throw out some ideas that might be of help.
You can add files to the distributed cache in hive using ADD FILE ...
Hive 11+ (I think) should let you access the distributed cache in GenericUDF.initialize:
https://issues.apache.org/jira/browse/HIVE-1016 which references...
https://issues.apache.org/jira/browse/HIVE-3628
So when you initialize the UDF, you might be able to build your kdtree by accessing the file you added in the distributed cache.
As climbage says, the ADD FILE command adds the file into the distributed cache.
You can access the distributed cache in your UDF simply by opening a file in the current directory,
i.e. open(new File(System.getProperty("user.dir") + "/myfile"));
You can use a ConstantObjectInspector to access the filename in the initialize method of GenericUDF, where you can open the file and read it into your in-memory data structure.
The distributed_map UDF of Brickhouse does something similar ( https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/dcache/DistributedMapUDF.java )
Something like
public ObjectInspector initialize(ObjectInspector[] inspArr) throws UDFArgumentException {
    // The file name is passed to the UDF as a constant argument
    ConstantObjectInspector fileNameInsp = (ConstantObjectInspector) inspArr[0];
    String fileName = fileNameInsp.getWritableConstantValue().toString();
    try (FileInputStream inFile = new FileInputStream("./" + fileName)) {
        doStuff(inFile); // build the in-memory structure (e.g. the kd-tree)
    } catch (IOException e) {
        throw new UDFArgumentException("Cannot read " + fileName + ": " + e.getMessage());
    }
    // ... return the ObjectInspector for the UDF's result type
}

Issue with WatchService in java 7

I'm using JDK 7's WatchService API to monitor a folder on the file system. I'm sending a new file through email to that folder, and when the file arrives I trigger on the ENTRY_CREATE event. It's working fine.
But the issue is that it generates two ENTRY_CREATE events instead of the single event I expect.
Below is the code:
Path dir = Paths.get("/var/mail");
WatchService watcher = dir.getFileSystem().newWatchService();
dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
System.out.println("waiting for new file");
WatchKey watchKey = watcher.take();
List<WatchEvent<?>> events = watchKey.pollEvents();
System.out.println(events.size());
for (WatchEvent<?> event : events) {
    if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
        String fileCreated = event.context().toString().trim();
    }
}
In the above code I'm getting an events size of 2.
Can anyone please help me find out why I'm getting two events?
I am guessing that there might be some temporary files being created in the folder at the same time. Just check the names/paths of the files being created.
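For example, a quick way to see what actually arrives, reusing the events list from the code above:

// Print every event so any temporary or intermediate files show up
// next to the file you actually expect.
for (WatchEvent<?> event : events) {
    System.out.println(event.kind() + " -> " + event.context());
}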
