Spring Integration - processing files

We are using Spring Integration for a scenario where we read an XML file and use a splitter to split it into several smaller files.
The flow goes like this: a file is picked up and split into several smaller files, and these files are processed and transformed individually. During transformation we validate whether the received file is in the correct format. If validation succeeds we move the file to a success folder; if validation fails, it is moved to a rejected folder.
We are facing an issue here: when the input XML file has two elements, it is split into two sub-files, each of which goes through validation. When the first sub-file passes validation, the original file is moved to the success folder, so when the second sub-file fails validation there is no file left to move to the rejected folder.
Can someone please suggest how we can move the file to the success or rejected folder after each individual sub-file's validation?

It's not clear from your requirements what you would like to move "to success or rejected": the original file or just its parts. If you are talking only about the split parts, you should think about creating genuinely unique files for them. From your description it sounds like you reuse the original file name, which would indeed cause the behavior you describe.
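For example, with the Java DSL the post-split renaming could look like this (a minimal sketch assuming a comparable flow; the channel name, the derived part name, and the placement of your transform/validate steps are assumptions):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.IntegrationMessageHeaderAccessor;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.file.FileHeaders;

// Hedged sketch: after the splitter, overwrite the file_name header with a
// per-part name so each sub-file is written, validated, and moved
// independently of the original file.
@Configuration
public class SplitFlowConfig {

    @Bean
    public IntegrationFlow splitAndValidateFlow() {
        return f -> f
                .split() // one message per element of the original XML
                .enrichHeaders(h -> h.headerFunction(FileHeaders.FILENAME,
                        m -> m.getHeaders().get(FileHeaders.FILENAME, String.class)
                                + "." + m.getHeaders().get(IntegrationMessageHeaderAccessor.SEQUENCE_NUMBER),
                        true)) // overwrite the inherited file_name per part
                // ... transform, validate, then route each part to success/ or rejected/
                .channel("validationChannel");
    }
}

With a unique file_name header on every part, the success/rejected move acts on that part's own file instead of competing for the shared original.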

Related

Am I misunderstanding something about using the List and Fetch combination?

I am trying to understand the combination of List and Fetch processors.
I have a directory with three JSON files, and I use ListAzureDataLakeStorage to list them. But when I connect a FetchAzureDataLakeStorage with which I intend to fetch only one of the files, the Fetch takes the same file three times. In short, it fetches the file whose azure.filename matches the value I put in the File Name property, but as many times as there are files in the listed directory.
I really want to use a single List and connect three Fetches to it, each one taking a different file, and use them for different streams.
In each Fetch I put in the "File Name" property the name of the file that I want to take. For example:
File Name: fileName1.json
I have also tried putting the following Expression Language in "File Name":
File Name: ${azure.filename:equals('fileName1.json')}
But this option causes a 404 empty-body error.
Nothing works. Am I misunderstanding something about using the List and Fetch combination?
If you are statically entering file names and you want to respond to each one differently, then the ListX processors aren't very beneficial to your flow.
The easier option would be to use a GenerateFlowFile processor with the appropriate schedule to trigger a corresponding FetchX processor.
If you're only doing this for 3 files, it's not too much manual overhead. You could also achieve something similar using RouteOnContent/Attribute.
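If it helps, the wiring could look roughly like this (property names as I recall them from the Azure Data Lake processors; the values are placeholders for your environment):

GenerateFlowFile   (Run Schedule: whatever cadence you need)
  -> success -> FetchAzureDataLakeStorage
                  Filesystem Name: my-filesystem   (assumed)
                  Directory Name:  my-directory    (assumed)
                  File Name:       fileName1.json

Repeat the pair, or fan a single GenerateFlowFile out to three Fetch processors, one per file, each Fetch carrying a literal File Name.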

How to transfer files in order (first come, first served) using Apache Camel

In my code there are two types of files: data files with extension .csv or .psv, and .trigger files. The .csv files are larger than the .trigger files, so the .trigger files get transferred before the .csv files.
How can I make sure that the .trigger files are transferred only once the .csv files have been transferred?
I am using the same single route to transfer both kinds of files.
You can use the sortBy option of the Camel file component. See http://camel.apache.org/file2.html for more information.
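For instance, since "csv" and "psv" sort before "trigger" alphabetically, sorting each poll by extension may already be enough (a minimal sketch; the inbox/outbox URIs are placeholders):

import org.apache.camel.builder.RouteBuilder;

// Hedged sketch: within each poll the files are released in extension order,
// so .csv/.psv files go out before .trigger files.
public class CsvBeforeTriggerRoute extends RouteBuilder {

    @Override
    public void configure() {
        from("file://inbox?sortBy=file:ext")
                .to("file://outbox");
    }
}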
Another idea is to implement Camel's org.apache.camel.component.file.GenericFileFilter and write your filter logic in its accept method. The logic should pick all the CSV files first and only then the trigger files. Use the filter option of the file component; the from endpoint will look like:
from("file://inbox?filter=#myFilter")

How can I specify filesystem rules?

I have the following two challenges:
I'd like to assert that the filename of an XML file always equals a certain string in the file itself.
I'd like to assert that every folder called 'Foo' contains a file called 'bar.xml'.
How can I do this with Sonar? Is there already a plugin available for this?
There's no plugin for that; you will have to write your own.
For the first point, you can write a sensor that parses the XML files to check whether each file's name appears in the file itself; this should not be complicated.
For the second point, you would have to write a sensor that is executed only on folders.
You can check the "Extension Guide" documentation for code samples on how to do that.
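As a starting point for the first check, a sensor against the current scanner API might look like this (a hedged sketch; the original answer predates this API, and plugin wiring and issue reporting are elided):

import java.io.IOException;

import org.sonar.api.batch.fs.FilePredicates;
import org.sonar.api.batch.fs.InputFile;
import org.sonar.api.batch.sensor.Sensor;
import org.sonar.api.batch.sensor.SensorContext;
import org.sonar.api.batch.sensor.SensorDescriptor;

// Hedged sketch: scan every XML input file and check that the file's own name
// occurs somewhere in its contents.
public class XmlFilenameSensor implements Sensor {

    @Override
    public void describe(SensorDescriptor descriptor) {
        descriptor.name("XML file-name consistency check");
    }

    @Override
    public void execute(SensorContext context) {
        FilePredicates p = context.fileSystem().predicates();
        for (InputFile file : context.fileSystem().inputFiles(p.matchesPathPattern("**/*.xml"))) {
            try {
                if (!file.contents().contains(file.filename())) {
                    // violation found: raise it via context.newIssue() here
                }
            } catch (IOException e) {
                // unreadable files are skipped in this sketch
            }
        }
    }
}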

Is there an efficient way in DocPad to keep static and to-be-rendered files in the same directory?

I am rebuilding a site with DocPad, and it's very liberating to form a folder structure that makes sense for my content-creation workflow, but I'm running into a problem with DocPad's hard division between to-be-rendered and 'static' content.
DocPad recommends that you put things like images in /files instead of /documents, and the documentation makes it sound as if there is otherwise some processing overhead incurred.
First, I'd like an explanation, if anyone has one, of why a file with a single extension (therefore no rendering) and no YAML front matter, such as a .jpg, would impact site-regeneration time when placed within /documents.
Second, the real issue: if it does indeed create a performance hit, is there a way to mitigate it? For example, by specifying an 'ignore' list with regexes, etc.
My use case
I would like to do this for posts and their associated images to make authoring a post more natural. I can easily see the images I have to work with and all the related files are in one place.
I am also doing this for artwork I display. In this case it's an even stronger use case, as the only data in my html.eco file is YAML front matter holding various metadata; my layout automatically generates the gallery from the attached images located in a folder with the same name as the post. I can match the relative output path with a folder in my /files directory, but it's error-prone, because you're in one folder (src/files/artworks/) when creating the folder of images and another (src/documents/artworks/) when creating the HTML file; typos are far more likely (as you can never see the folder and the HTML file side by side)...
Even without justifying a use case, I can't see why DocPad should impose such a hard division. A performance consideration should not be passed on to the end user like that if it can be avoided in any way; since with DocPad I am likely to manage my blog through the file system, I ought to have full control over that structure, and I certainly don't want my content divided up by a framework limitation or performance concern instead of by logical content divisions.
I think the key is the line about "metadata". Even though a file does NOT have a double extension, it can still have metadata at the top of the file, which needs to be scanned and read. The double extension really just tells DocPad to convert the file from one format and output it as another. If I create a plain HTML file in the documents folder, I can still include a metadata header in the form:
---
tags: ['tag1','tag2','tag3']
title: 'Some title'
---
When the file is copied to the out directory, this metadata will be removed. If I do the same thing with an HTML file in the files directory, the file is copied to the out directory with the metadata header intact. So the answer to your question is that even though your file has a single extension and is not "rendered" as such, it still needs to be opened and processed.
The point you make, however, is a good one: keeping images and documents together. I can see a good argument for excluding certain file extensions (like image files) from being processed, or perhaps for only including certain file extensions.
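The closest existing knob I'm aware of is DocPad's ignore configuration; a docpad.coffee sketch (option name from the DocPad configuration docs as I recall them, so verify against your version; note that ignored files are skipped entirely rather than copied, so this only suits files you copy to out by some other means):

# docpad.coffee (hedged sketch): skip common image extensions when scanning
# src/documents; ignored files are neither rendered nor copied to out.
docpadConfig =
    ignoreCustomPatterns: /\.(jpe?g|png|gif)$/i
module.exports = docpadConfig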

How to validate that an uploaded file is complete

I have a system with which users can upload a CSV file via an FTP server or via an HTML form. On my end, a script polls the uploads directory and processes new files it finds. Some users will create the CSV by exporting it from Excel, while others will create it programmatically with scripts of their own.
My concern at the moment is: How can I be 100% certain that the file my processing script acts on is complete - in other words that it isn't a partial file (in progress, failed upload, etc)?
If the file format were something more structured, like XML, I'd be 100% confident the file was complete by checking that the XML structure is valid (i.e. that tags are closed).
Is there a good way to ensure that the uploaded CSV file is complete without burdening and confusing less technical users who are simply uploading a file exported from a spreadsheet program? (E.g. providing an MD5 of the file contents would be beyond them.)
When designing CSV file formats in the past, I've always added a header and footer line as follows:
id,one,two,three,four,five,six
10,1,2,3,4,5,6
11,1,2,3,4,5,6
12,1,2,3,4,5,6
13,1,2,3,4,5,6
14,1,2,3,4,5,6
FOOTER,5
Most CSV file formats have a header to label the columns; the purpose of the footer is to indicate that the file is complete. The footer contains a simple line count, which is easy to audit when looping through the file's contents. Not too complicated for users.
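The audit itself stays small; a minimal sketch in Java, assuming exactly the format above (one header line, data rows, then FOOTER,<row count>):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hedged sketch: an in-progress or failed upload is missing its footer, and a
// truncated one declares more rows in the footer than the file contains.
public class CsvFooterCheck {

    public static boolean isComplete(Path csv) throws IOException {
        List<String> lines = Files.readAllLines(csv);
        if (lines.size() < 2) {
            return false; // not even a header and a footer
        }
        String footer = lines.get(lines.size() - 1);
        if (!footer.startsWith("FOOTER,")) {
            return false; // footer absent: treat the file as incomplete
        }
        try {
            int declared = Integer.parseInt(footer.substring("FOOTER,".length()).trim());
            return declared == lines.size() - 2; // data rows exclude header and footer
        } catch (NumberFormatException e) {
            return false; // malformed footer also counts as incomplete
        }
    }
}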
You could also cross-check whether the file size of the uploaded file matches the file size of the original file.
