How to validate that an uploaded file is complete - validation

I have a system to which users can upload a CSV file via an FTP server or via an HTML form. On my end, a script polls the uploads directory and processes any new files it finds. Some users will create the CSV by exporting it from Excel, while others will generate it programmatically with scripts of their own.
My concern at the moment is: how can I be 100% certain that the file my processing script acts on is complete, in other words that it isn't a partial file (upload in progress, failed upload, etc.)?
If the file format were something more structured, like XML, I could be confident that the file is complete by checking that the XML structure is valid (i.e., that all tags are closed).
Is there a good way to ensure that the uploaded CSV file is complete without burdening and confusing less technical users who are simply uploading a file exported from a spreadsheet program (e.g., providing an MD5 of the file contents would be beyond them)?

When designing CSV file formats in the past, I've always added a header and footer line as follows:
id,one,two,three,four,five,six
10,1,2,3,4,5,6
11,1,2,3,4,5,6
12,1,2,3,4,5,6
13,1,2,3,4,5,6
14,1,2,3,4,5,6
FOOTER,5
Most CSV file formats have a header to label the columns; the purpose of the footer is to indicate that the file is complete. The footer contains a simple count of the data lines (5 in the example above), which is easy to audit when looping through the file's contents. It is not too complicated for users.
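A minimal Python sketch of how the processing script might audit such a footer. The function name `csv_is_complete` is hypothetical, and the sketch assumes the footer is always the last non-empty row and has exactly the form `FOOTER,<count>`:

```python
import csv
import io


def csv_is_complete(text: str) -> bool:
    """Return True if the last row is a FOOTER row whose count matches the data rows."""
    # Skip completely empty rows (e.g. a trailing blank line).
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if len(rows) < 2:
        return False  # need at least a header and a footer
    footer = rows[-1]
    if len(footer) != 2 or footer[0] != "FOOTER":
        return False  # no footer yet: the upload is probably partial
    try:
        expected = int(footer[1])
    except ValueError:
        return False  # footer count is not a number
    data_rows = rows[1:-1]  # everything between header and footer
    return len(data_rows) == expected
```

A file that fails this check can simply be left in the uploads directory and re-examined on the next poll.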

You could cross-check whether the file size of the uploaded file matches the file size of the original file.
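A small Python sketch of that comparison. It assumes the size of the original file is made available to the server somehow (for example, sent alongside the upload); how `expected_size` is obtained is not covered here, and the function name is hypothetical:

```python
import os


def size_matches(path: str, expected_size: int) -> bool:
    """Return True if the file on disk has exactly the expected number of bytes."""
    return os.path.getsize(path) == expected_size
```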

Related

Spring integration - processing files

We are using Spring Integration for a scenario where we read an XML file and use a splitter to split it into several smaller files.
The flow goes like this: a file is picked up and split into several small files, and these files are processed individually and transformed. While transforming, we validate whether the received file is in the correct format. If validation succeeds, we move the file to the success folder; if validation fails, it is moved to the rejected folder.
We are facing an issue here: when the input XML file has two elements, the file is split into two sub-files and each one is validated. When validation of the first sub-file succeeds, the original file is moved to the success folder; when validation of the second sub-file then fails, the file is no longer available to be moved to the rejected folder.
Can someone please suggest how we can move the file to the success or rejected folder after the individual sub-files have been validated?
It's not clear from your requirements what you would like to move to the success or rejected folder: the original file or just a part of it. If you mean only the split parts, you should consider actually creating unique files for them. From your description it sounds like you reuse the original file name, which would indeed cause the behavior you describe.

Issue when copying two text files using the copy command

I am having an issue when attempting to combine two files into a separate file using the Windows copy command. When I open the first file in Notepad, I see the data formatted as expected.
File #1
0900|Y3RN|19944|12/OCT/2016|2|2|1600|C||||||0|0|||Replace
0900|Y3RN|19944|13/OCT/2016|2|2|2000|C||||||0|0|||Replace
0900|Y3RN|19944|14/OCT/2016|2|2|600|C||||||0|0|||Replace
However, in the second file, which has the same fields, the format of the data in Notepad is different.
File #2
0901|ECQQ|339489|18/OCT/2016|2|2|25|C||||||0|0|||Replace0901|ECQQ|339489|19/OCT/2016|2|2|180|C||||||0|0|||Replace0901|EK1P|339489|04/OCT/2016|2|2|100|C||||||0|0|||Replace
Supposedly the same process is generating the two files on my customer's system of record. If I open each file separately in Textpad, both files have the same format as File #1 above.
When I use the Copy command, the resulting file looks as below when viewing in Notepad.
0900|Y3RN|19944|28/OCT/2016|2|2|1400|C||||||0|0|||Replace
0900|Y3RN|19944|31/OCT/2016|2|2|1400|C||||||0|0|||Replace
0900|Y6CJ|19944|10/OCT/2016|2|2|200|C||||||0|0|||Replace0901|ECQQ|339489|18/OCT/2016|2|2|25|C||||||0|0|||Replace0901|ECQQ|339489|19/OCT/2016|2|2|180|C||||||0|0|||Replace0901|EK1P|339489|04/OCT/2016|2|2|100|C||||||0|0|||Replace
However, when viewing the resulting file in Textpad, the format is correct.
There has to be something missing in the format of File #2, but since I do not have access to or visibility into how these files are generated, my hands are tied.
Is there a way to convert File #2 so that it is formatted exactly like File #1?
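The question does not say what actually differs between the two files. One explanation consistent with the symptoms is that File #2 ends its lines with a bare LF rather than CRLF, which older versions of Notepad do not render as line breaks while Textpad does. Under that assumption (unverified for these particular files), a minimal Python sketch of the conversion could look like this; the function name `to_crlf` is hypothetical:

```python
def to_crlf(data: bytes) -> bytes:
    """Normalize all line endings to CRLF and ensure a trailing line break."""
    # Collapse every existing ending to LF first so CRLF is not doubled.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    # Terminate the last line so a file appended afterwards starts on a new line.
    if unified and not unified.endswith(b"\n"):
        unified += b"\n"
    return unified.replace(b"\n", b"\r\n")
```

Running both input files through such a step before concatenating them would also avoid the last record of the first file being joined to the first record of the second.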

How to transfer files in order (first come first serve) using apache camel

In my code there are two types of files: data files with the extension .csv or .psv, and .trigger files. The .csv files are larger than the .trigger files, so the .trigger files are being transferred before the .csv files.
How can I make sure that the .trigger files are transferred only after the .csv files have been transferred?
I am using the same single route to transfer both kinds of files.
You can use the sortBy option of the Camel file component. See http://camel.apache.org/file2.html for more information.
Another idea is to implement Camel's org.apache.camel.component.file.GenericFileFilter and put your filter logic in its accept method. The logic should pick all the .csv files first and then the .trigger files. Use the filter option of the file component; the from endpoint will look like:
from("file://inbox?filter=myFilter")
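The ordering both answers aim for can be illustrated independently of Camel. The Python sketch below is not Camel's API; it only shows the intended rule (data files first, trigger files last), and the function name `transfer_order` is hypothetical:

```python
from pathlib import PurePath


def transfer_order(names):
    """Return file names with .trigger files moved after all other files."""

    def rank(name):
        # 0 for data files (.csv, .psv, ...), 1 for trigger files.
        return 1 if PurePath(name).suffix.lower() == ".trigger" else 0

    # sorted() is stable, so the original order is kept within each group.
    return sorted(names, key=rank)
```

For example, `transfer_order(["a.trigger", "a.csv", "b.psv", "b.trigger"])` yields the two data files followed by the two trigger files.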

Is there a part of a windows file that can't be modified?

I'm trying to build something that will let a user download a file from a web application onto their system. The file will contain a unique five-digit code. Using this code, users can search for the file in their file system.
I'm wondering where the best place is to put this five-digit code in a file so that users can easily search for it. The simplest approach would be to put it in the name of the file; however, users can easily change the file name.
I'm looking for a field where I can put the code so that users won't be able to modify it but will still be able to search for it. Is this possible?
When you say "file", what kind of file format do you mean? I'm asking because a file is just a pile of bytes, and if it is your own file format you can put your five-digit code anywhere in it. But if you tell us which file format you use, there are probably some fields that can be used to search for it. For example, TIFF has many tags, and other image formats have their own metadata.

Ruby: Create files with metadata

We're creating an app that is going to generate some text files on *nix systems with hashed filenames to avoid too-long filenames.
However, it would be nice to tag the files with some metadata that gives a better clue as to what their content is.
Hence my question. Does anyone have any experience with creating files with custom metadata in Ruby?
I've done some searching and there seem to be some (very old) gems that read metadata:
https://github.com/kig/metadata
http://oai.rubyforge.org/
I also found: system file, read write o create custom metadata or attributes extended which seems to suggest that what I need may be at the system level, but dropping down there feels dirty and scary.
Anyone know of libraries that could achieve this? How would one create custom metadata for files generated by Ruby?
A very old but interesting question with no answers!
In order for a file to contain metadata, it has to have a format that has some way (implicitly or explicitly) to describe where and how the metadata is stored.
This can be done by the format, such as having a header that says where the "main" data is stored and where the "metadata" is stored, or perhaps implicitly, such as having a length to the "main" data, and storing metadata as anything beyond the "main" data.
This can also be done by the OS/filesystem by storing information along with the files, such as permission info, modtime, user, and more comprehensive file information like "icon" as you would find with iOS/Windows.
(Note that I am using "quotes" around "main" and "metadata" because the reality is that it's all data, and needs to be stored in some way that tools can retrieve it)
A true text file does not contain any headers or any such file format, and is essentially just a continuous block of characters (disregarding how the OS may store it). This also means that it can be generally opened by any text editor, which will merely read and display all the characters it finds.
So the answer in some sense is that you can't, at least not on a true text file that is truly portable to multiple OS.
A few thoughts on how to get around this:
Use binary at the end of the text file, with the hope (or requirement) that the user's text editor will ignore non-ASCII content.
Store it in the OS metadata for the file and make it OS-specific (such as storing it in the "comments" section that an OS may have for a file).
Store it in a separate file that goes "along with" the file (i.e., file.txt and file.meta) and hope that they keep the files together.
Store it in a separate file and zip the text and the meta file together and have your tool be zip aware.
Come up with a new file format that is not just text but has a text section (though then it can no longer be edited with a text editor).
Store the metadata at the end of the text file in a text format with perhaps comments or some indicator to leave the metadata alone. This is similar to the technique that the vi/vim text editor uses to embed vim commands into a file, it just puts them as comments at the beginning or end of the file.
I'm not sure there are many other ways to accomplish what you want, but perhaps one of those will work.
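As an illustration of the separate-file option above (file.txt plus file.meta), here is a minimal sketch. It is written in Python rather than Ruby purely for illustration; the JSON encoding, the `.meta` suffix handling, and the function names are all assumptions, and the same idea carries over to Ruby directly:

```python
import json
from pathlib import Path


def write_with_sidecar(path, text, metadata):
    """Write a text file plus a sibling .meta file holding its metadata as JSON."""
    path = Path(path)
    path.write_text(text, encoding="utf-8")
    # e.g. 3f2a9c.txt -> 3f2a9c.meta, stored next to the text file
    meta_path = path.with_suffix(".meta")
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return meta_path


def read_sidecar(path):
    """Load the metadata stored alongside the given text file."""
    meta_path = Path(path).with_suffix(".meta")
    return json.loads(meta_path.read_text(encoding="utf-8"))
```

The text file itself stays a plain text file that any editor can open; the trade-off, as noted above, is that the two files can become separated.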
