Converting text file with spaces between CR & LF

I've never seen this line ending before and I am trying to load the file into a database.
The lines all have a fixed width. After the CSV text which contains the data (the length varies line-by-line), there is a CR followed by multiple spaces and ending with LF. The spaces provide the padding to equalize the line width.
Line1,Data 1,Data 2,Data 3,4,50D20202020200A
Line2,Data 11,Data 21,Data 31,41,510D2020200A
Line3,Data12,Data22,Data 32,42,520D202020200A
I am about to handle this with a stream reader / writer in C#, but there are 40 files that come in each month and if there is a way to convert them all at once instead of one line at a time, I would rather do that.
Any thoughts?

Line-by-line processing of a stream doesn't have to be a bottleneck if you implement it at the right point in your overall process.
When I've had to do this kind of preprocessing I put a folder watch on the inbound folder, then automatically pick up each file and process it upon arrival, putting the original into an archive folder and writing the processed file into another location from which data will be parsed or loaded into the database. Unless you have unusual real-time requirements, you'll never notice this kind of overhead. If you do have real-time requirements, this issue will pale in comparison to all the other issues you'll face with batched data files :)
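As a rough sketch of that pick-up / clean / archive step, here is how it might look in Python rather than C#. It is a plain batch pass over a folder, not a real folder watch, and the function and folder names are made up for illustration. It assumes the padding is always CR, one or more spaces, then LF as described in the question, and that each file fits comfortably in memory.

```python
import re
import shutil
from pathlib import Path

# CR, then one or more padding spaces, then LF (the layout described in the question).
PADDED_EOL = re.compile(rb"\r +\n")


def strip_padding(data: bytes) -> bytes:
    """Collapse CR + spaces + LF into a plain CRLF line ending."""
    return PADDED_EOL.sub(b"\r\n", data)


def process_inbound(inbound: Path, processed: Path, archive: Path) -> int:
    """Clean every file in `inbound`, write the result to `processed`,
    and move the untouched original into `archive`. Returns the file count."""
    processed.mkdir(parents=True, exist_ok=True)
    archive.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(inbound.iterdir()):
        if not src.is_file():
            continue
        # Whole-file read: fine for modest files, stream in chunks for huge ones.
        (processed / src.name).write_bytes(strip_padding(src.read_bytes()))
        shutil.move(str(src), str(archive / src.name))
        count += 1
    return count
```

Running process_inbound once a month over the 40 files converts them all in one go; the cleaned copies can then be fed to whatever loader you use.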
But you may not even have to go through a preprocessing step at all. You didn't indicate what database you will be using or how you plan to load the data, but many databases do include utilities to process fixed-length records. In the past, fixed-format files came with every imaginable kind of bizarre format (and contained all kinds of stuff that had to be stripped out or converted). As a result those utilities tend to be very efficient at this kind of task. In my experience they can easily be at least an order of magnitude faster than line-by-line processing, which can make a real difference on larger bulk loads.
If your database doesn't have good bulk import tools, there are a number of open-source or freeware utilities already written that do pretty much exactly what you need. You can find them on GitHub and other places; two examples are the npm replace package and zzzprojects findandreplace.
For a quick and dirty approach that allows you to preview all the changes as you develop a more robust solution, many text editors have the ability to find and replace in multiple files. I've used that approach successfully in the past. For example, the Find in Files dialog in Notepad++ lets you use RegEx to remove or change whatever you like in all files matching defined criteria.

Related

How to create a partially modifiable binary file format?

I'm creating my own custom binary file format.
I use the RIFF standard for encoding the data, and it seems to work pretty well.
But there are some additional requirements:
Binary files can be large, up to 500 MB.
Data is saved into the binary file in real time, at intervals, whenever data in the application has changed.
The application may run in the browser.
The problem I face is that saving requires reading everything from memory and rewriting the whole binary file.
This isn't a problem when the data is small, but as it grows, the real-time saving feature doesn't scale.
So the main requirements for this binary file are:
Able to partially read the binary file (because the file is huge).
Able to write only the changed data into the file without rewriting the whole file.
A streaming protocol like .m3u8 is not an option; we can't split the file into chunks and point to them using separate URLs.
Any guidance on how to design a binary file format that scales in this scenario?
There is an answer from a random user that has been deleted here.
It seems great to me.
You can claim your answer back and I'll delete this one.
He said:
If we design the file to support appending, then we are able to add whatever data we want without needing to rewrite the whole file.
This idea gives me a very great starting point.
So I can append more and more changes at the end of the file.
Then I mark the old chunks of data in the middle of the file as obsolete.
I can then reuse these obsolete data slots later if I want to.
The downside is that I need to clean up the obsolete slot when I have a chance to rewrite the whole file.
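To make the idea concrete, here is a minimal sketch in Python. It is not RIFF and not browser code; the 8-byte chunk header (id and length), the class name and the method names are all invented for illustration. A changed chunk is appended at the end, the in-memory index points at the newest copy, and the older copy is only remembered as an obsolete slot until a full rewrite cleans it up. Reusing obsolete slots for new data is left out.

```python
import os
import struct

HEADER = struct.Struct("<II")  # chunk id, payload length (hypothetical layout)


class AppendOnlyFile:
    def __init__(self, path):
        self.path = path
        self.index = {}     # chunk id -> (payload offset, length) of the live copy
        self.obsolete = []  # (payload offset, length) of superseded copies
        open(path, "ab").close()  # create the file if it does not exist
        self._scan()

    def _scan(self):
        """Rebuild the index by walking the chunk headers; later copies win."""
        with open(self.path, "rb") as f:
            while True:
                head = f.read(HEADER.size)
                if len(head) < HEADER.size:
                    break
                chunk_id, length = HEADER.unpack(head)
                self._remember(chunk_id, f.tell(), length)
                f.seek(length, os.SEEK_CUR)

    def _remember(self, chunk_id, offset, length):
        if chunk_id in self.index:
            self.obsolete.append(self.index[chunk_id])
        self.index[chunk_id] = (offset, length)

    def write(self, chunk_id, payload):
        """Append a new version of a chunk; nothing already on disk is touched."""
        offset = os.path.getsize(self.path) + HEADER.size
        with open(self.path, "ab") as f:
            f.write(HEADER.pack(chunk_id, len(payload)))
            f.write(payload)
        self._remember(chunk_id, offset, len(payload))

    def read(self, chunk_id):
        """Partial read: seek straight to the live copy of one chunk."""
        offset, length = self.index[chunk_id]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def compact(self):
        """The occasional full rewrite that drops every obsolete copy."""
        live = [(cid, self.read(cid)) for cid in sorted(self.index)]
        tmp = self.path + ".tmp"
        with open(tmp, "wb") as f:
            for cid, payload in live:
                f.write(HEADER.pack(cid, len(payload)))
                f.write(payload)
        os.replace(tmp, self.path)
        self.index, self.obsolete = {}, []
        self._scan()
```

Note that compact as written holds all live payloads in memory at once, so for a 500 MB file you would want to copy chunk by chunk instead.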

NiFi: how to avoid copying files that are partially written

I am trying to use NiFi to get a file from an SFTP server. The file can potentially be big, so my question is how to avoid getting the file while it is still being written. I am planning to use ListSFTP+FetchSFTP, but I am also okay with GetSFTP if it can avoid copying partially written files.
thank you
In addition to Andy's solid answer you can also be a bit more flexible by using the ListSFTP/FetchSFTP processor pair by doing some metadata based routing.
After ListSFTP each flowfile will have attributes such as 'file.lastModifiedTime' and others. You can read about them here https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.ListSFTP/index.html
You can put a RouteOnAttribute processor in between the List and Fetch to detect objects that, at least based on the reported last modified time, are 'too new'. You could route those to a processor that is just a slow pass-through to intentionally wait a bit. You can then run those back through the first router until they are 'old enough'. Now, this is admittedly a power user approach, but it does give you a lot of flexibility and control. The approach I'm mentioning here is not foolproof, as the source system may not report the last modified time correctly, an old timestamp may not mean the source file is done being written, etc. But it gives you additional options IF you cannot do the definitely correct thing above that Andy talks about.
If you have control over the process which writes the file in, a common pattern to solve this is to initially write the file with a specific naming structure, such as beginning with a dot (.). After the successful write operation, the file is renamed without the . and it is picked up by the processor. Both GetSFTP and ListSFTP have a processor property called Ignore Dotted Files which is set to true by default and means those processors will not operate on or return files beginning with the dot character.
There is a minimum file age property you can use. The last modification time gets updated as the file is being written. Setting this value to something other than 0 will help fix the problem.
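The two checks discussed in these answers (skip dotted files, skip files modified too recently) boil down to a small predicate. The following is plain Python to show the logic only, not NiFi configuration or Expression Language; the function name and the one-minute threshold are arbitrary choices for the example.

```python
from datetime import datetime, timedelta, timezone


def ready_to_fetch(name, last_modified, now, min_age=timedelta(minutes=1)):
    """True when a listed file looks finished: not dot-prefixed and not modified recently."""
    if name.startswith("."):
        return False  # still being written under its temporary dotted name
    # A recent modification time suggests the writer may still be appending.
    return now - last_modified >= min_age
```

As noted above, the age check is only as good as the modification times the source system reports; the rename-when-done pattern is the more reliable of the two.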

Does SQL*Loader have any functionality that allows for customizing the log file?

I have been asked to create a system for allowing third party companies to dump data into several of our tables. These third parties provide csv files on a periodic basis, and after doing some research it seemed like Oracle themselves had a standard tool for doing so, "sqlldr". I've since gotten it working to an acceptable degree, and we have a job scheduled to run that script once a day.
But one of the third parties supplies really dirty data, of the sort where I can't expect it to always load every row/record (looking like up to about 8% will fail). My boss asked me to forward "all output" from the first few tests to him, and like a moron I also sent the log file.
He has asked that this "report" be modified to include those exceptions that aren't unique constraints along with the line in the input file that caused the exception.
This means that I need data from the log file, but also from the (I believe) reject file in a single document. Rather than write a convoluted shell script to combine those two, does SQL*Loader itself allow any customization that might achieve the same thing? I've read through the Oracle documentation and haven't found anything that suggests this, but I've also learned not to trust it entirely either.
Is this possible? Ideally, the solution would allow me to add values to the reject file that don't exist in the original input file, but I'm also interested in any customization of the log file or reject file.
No.
I was going to stop there, but you can define the name of the log file, which might help with the issue. Most automation with SQL*Loader involves wrapping it within shell scripts; aka "roll your own."
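If you do end up rolling your own, the combining step need not be convoluted. Below is a sketch in Python rather than shell. It assumes the log reports each rejection as a "Record N: Rejected - ..." line followed by the ORA- error on the next line, and that record N is simply line N of the input file (no skipped header rows, one physical line per record). Check both assumptions against your own logs; the function name is made up.

```python
import re

# Assumed shape of a rejection entry in the log:
#   Record 2: Rejected - Error on table EMP, column DEPTNO.
#   ORA-01722: invalid number
REJECTED = re.compile(r"^Record (\d+): Rejected - (.*)$")


def rejection_report(log_text, input_lines, skip_codes=("ORA-00001",)):
    """Return (record number, error line, offending input line) for every
    rejection whose error code is not in skip_codes."""
    report = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        match = REJECTED.match(line.strip())
        if not match:
            continue
        record_no = int(match.group(1))
        error = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if error.startswith(tuple(skip_codes)):
            continue  # e.g. unique constraint violations the boss does not want
        in_range = 0 < record_no <= len(input_lines)
        source = input_lines[record_no - 1] if in_range else ""
        report.append((record_no, error, source))
    return report
```

Reading the log and the CSV with open(...).read() and open(...).read().splitlines() and passing them in gives you one list you can format into the single document your boss asked for.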

Millions of small graphics files and how to overcome slow file system access on XP

I'm rendering millions of tiles which will be displayed as an overlay on Google Maps. The files are created by GMapCreator from the Centre for Advanced Spatial Analysis at University College London. The application renders files into a single folder at a time; in some cases I need to create about 4.2 million tiles. I'm running it on Windows XP using an NTFS filesystem; the disk is 500GB and was formatted using the default operating system options.
I'm finding the rendering of tiles gets slower and slower as the number of rendered tiles increases. I have also seen that if I try to look at the folders in Windows Explorer or using the Command line then the whole machine effectively locks up for a number of minutes before it recovers enough to do something again.
I've been splitting the input shapefiles into smaller pieces, running on different machines and so on, but the issue is still causing me considerable pain. I wondered if the cluster size on my disk might be hindering the thing or whether I should look at using another file system altogether. Does anyone have any ideas how I might be able to overcome this issue?
Thanks,
Barry.
Update:
Thanks to everyone for the suggestions. The eventual solution involved writing a piece of code which monitored the GMapCreator output folder, moving files into a directory hierarchy based upon their filenames; so a file named abcdefg.gif would be moved into \a\b\c\d\e\f\g.gif. Running this at the same time as GMapCreator overcame the filesystem performance problems. The hint about the generation of DOS 8.3 filenames was also very useful - as noted below I was amazed how much of a difference this made. Cheers :-)
There are several things you could/should do:
Disable automatic NTFS short file name generation (google it)
Or restrict file names to the 8.3 pattern (e.g. i0000001.jpg, ...)
In any case, try making the first six characters of the filename as unique/different as possible
If you use the same folder over and over (say adding files, removing files, re-adding files, ...):
Use contig to keep the index file of the directory as unfragmented as possible
Especially when removing many files, consider using the folder remove trick to reduce the directory index file size
As already posted, consider splitting up the files into multiple directories,
e.g. instead of
directory/abc.jpg
directory/acc.jpg
directory/acd.jpg
directory/adc.jpg
directory/aec.jpg
use
directory/b/c/abc.jpg
directory/c/c/acc.jpg
directory/c/d/acd.jpg
directory/d/c/adc.jpg
directory/e/c/aec.jpg
You could try an SSD....
http://www.crucial.com/promo/index.aspx?prog=ssd
Use more folders and limit the number of entries in any given folder. The time to enumerate the number of entries in a directory goes up (exponentially? I'm not sure about that) with the number of entries, and if you have millions of small files in the same directory, even doing something like dir folder_with_millions_of_files can take minutes. Switching to another FS or OS will not solve the problem---Linux has the same behavior, last time I checked.
Find a way to group the images into subfolders of no more than a few hundred files each. Make the directory tree as deep as it needs to be in order to support this.
The solution is most likely to restrict the number of files per directory.
I had a very similar problem with financial data held in ~200,000 flat files. We solved it by storing the files in directories based on their name. e.g.
gbp97m.xls
was stored in
g/b/p97m.xls
This works fine provided your files are named appropriately (we had a spread of characters to work with). The resulting tree of directories and files wasn't optimal in terms of distribution, but it worked well enough to reduce each directory to 100s of files and free the disk bottleneck.
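Both name-based layouts mentioned on this page (abcdefg.gif going to a\b\c\d\e\f\g.gif, and gbp97m.xls going to g/b/p97m.xls) are the same rule with a different depth. A small sketch, with a made-up function name, assuming plain alphanumeric file names and using forward slashes so the result is the same on any platform:

```python
from pathlib import PurePosixPath


def sharded_path(filename: str, depth: int) -> PurePosixPath:
    """Turn the first `depth` characters of the name into nested directories."""
    stem_len = len(PurePosixPath(filename).stem)
    # Always leave at least one character of the stem in the file name itself.
    depth = min(depth, max(stem_len - 1, 0))
    return PurePosixPath(*filename[:depth], filename[depth:])
```

For example, sharded_path("abcdefg.gif", 6) gives a/b/c/d/e/f/g.gif and sharded_path("gbp97m.xls", 2) gives g/b/p97m.xls. How deep to go depends on how evenly the leading characters of your names are spread.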
One solution is to implement haystacks. This is what Facebook does for photos, as the metadata overhead and the number of random reads required to fetch a file are quite high, and offer no value for a data store.
Haystack presents a generic HTTP-based object store containing needles that map to stored opaque objects. Storing photos as needles in the haystack eliminates the metadata overhead by aggregating hundreds of thousands of images in a single haystack store file. This keeps the metadata overhead very small and allows us to store each needle’s location in the store file in an in-memory index. This allows retrieval of an image’s data in a minimal number of I/O operations, eliminating all unnecessary metadata overhead.

Are there alternatives for creating large container files that are cross platform?

Previously, I asked the question.
The problem is the demands of our file structure are very high.
For instance, we're trying to create a container with up to 4500 files and 500mb data.
The file structure of this container consists of
SQLite DB (under 1mb)
Text based xml-like file
Images inside a dynamic folder structure that make up the rest of the 4,500ish files
After the initial creation the images files are read only with the exception of deletion.
The small db is used regularly when the container is accessed.
Tar, Zip and the likes are all too slow (even with 0 compression). Slow is subjective I know, but to untar a container of this size is over 20 seconds.
Any thoughts?
As you seem to be doing arbitrary file system operations on your container (say, creation, deletion of new files in the container, overwriting existing files, appending), I think you should go for some kind of file system. Allocate a large file, then create a file system structure in it.
There are several options for the file system available: for both Berkeley UFS and Linux ext2/ext3, there are user-mode libraries available. It might also be possible that you find a FAT implementation somewhere. Make sure you understand the structure of the file system, and pick one that allows for extending - I know that ext2 is fairly easy to extend (by another block group), and FAT is difficult to extend (need to append to the FAT).
Alternatively, you can put a virtual disk format yet below the file system, allowing arbitrary remapping of blocks. Then "free" blocks of the file system don't need to appear on disk, and you can allocate the virtual disk much larger than the real container file will be.
Three things.
1) What Timothy Walters said is right on, I'll go in to more detail.
2) 4500 files and 500Mb of data is simply a lot of data and disk writes. If you're operating on the entire dataset, it's going to be slow. Just I/O truth.
3) As others have mentioned, there's no detail on the use case.
If we assume a read only, random access scenario, then what Timothy says is pretty much dead on, and implementation is straightforward.
In a nutshell, here is what you do.
You concatenate all of the files in to a single blob. While you are concatenating them, you track their filename, the file length, and the offset that the file starts within the blob. You write that information out in to a block of data, sorted by name. We'll call this the Table of Contents, or TOC block.
Next, then, you concatenate the two files together. In the simple case, you have the TOC block first, then the data block.
When you wish to get data from this format, search the TOC for the file name, grab the offset from the beginning of the data block, add in the TOC block size, and read FILE_LENGTH bytes of data. Simple.
If you want to be clever, you can put the TOC at the END of the blob file. Then, append at the very end, the offset to the start of the TOC. Then you lseek to the end of the file, back up 4 or 8 bytes (depending on your number size), take THAT value and lseek even farther back to the start of your TOC. Then you're back to square one. You do this so you don't have to rebuild the archive twice at the beginning.
If you lay out your TOC in blocks (say 1K byte in size), then you can easily perform a binary search on the TOC. Simply fill each block with the File information entries, and when you run out of room, write a marker, pad with zeroes and advance to the next block. To do the binary search, you already know the size of the TOC, start in the middle, read the first file name, and go from there. Soon, you'll find the block, and then you read in the block and scan it for the file. This makes it efficient for reading without having the entire TOC in RAM. The other benefit is that the blocking requires less disk activity than a chained scheme like TAR (where you have to crawl the archive to find something).
I suggest you pad the files to block sizes as well. Disks like to work with regular-sized blocks of data, and this isn't difficult either.
Updating this without rebuilding the entire thing is difficult. If you want an updatable container system, then you may as well look in to some of the simpler file system designs, because that's what you're really looking for in that case.
As for portability, I suggest you store your binary numbers in network order, as most standard libraries have routines to handle those details for you.
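Here is a minimal sketch of the "TOC at the end" variant in Python. The entry layout (2-byte name length, 8-byte offset, 8-byte length) and the function names are my own choices for the example, all integers are written in network order, and the footer is the 8-byte offset of the TOC. To keep it short it reads the whole TOC into a dictionary instead of doing the blocked binary search, and it does not pad files to block boundaries.

```python
import struct

ENTRY_HEAD = struct.Struct("!HQQ")  # name length, data offset, data length (network order)
FOOTER = struct.Struct("!Q")        # offset of the TOC from the start of the file


def build_archive(path, files):
    """files: dict of name -> bytes. Layout: data blob, TOC sorted by name, footer."""
    toc = []
    with open(path, "wb") as out:
        for name in sorted(files):
            toc.append((name, out.tell(), len(files[name])))
            out.write(files[name])
        toc_offset = out.tell()
        for name, offset, length in toc:
            raw = name.encode("utf-8")
            out.write(ENTRY_HEAD.pack(len(raw), offset, length))
            out.write(raw)
        out.write(FOOTER.pack(toc_offset))


def read_toc(f):
    """Seek to the end, back up over the footer, then jump to the TOC it points at."""
    f.seek(-FOOTER.size, 2)
    footer_pos = f.tell()
    (toc_offset,) = FOOTER.unpack(f.read(FOOTER.size))
    f.seek(toc_offset)
    toc = {}
    while f.tell() < footer_pos:
        name_len, offset, length = ENTRY_HEAD.unpack(f.read(ENTRY_HEAD.size))
        toc[f.read(name_len).decode("utf-8")] = (offset, length)
    return toc


def read_file(path, name):
    """Random access to one member without touching the rest of the blob."""
    with open(path, "rb") as f:
        offset, length = read_toc(f)[name]
        f.seek(offset)
        return f.read(length)
```

Because the data comes first, the offsets in the TOC are absolute and there is no TOC size to add when reading. In a real reader you would load the TOC once and keep it, rather than re-reading it on every lookup as read_file does here.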
Working on the assumption that you're only going to need read-only access to the files why not just merge them all together and have a second "index" file (or an index in the header) that tells you the file name, start position and length. All you need to do is seek to the start point and read the correct number of bytes. The method will vary depending on your language but it's pretty straight forward in most of them.
The hardest part then becomes creating your data file + index, and even that is pretty basic!
An ISO disk image might do the trick. It should be able to hold that many files easily, and is supported by many pieces of software on all the major operating systems.
First, thank-you for expanding your question, it helps a lot in providing better answers.
Given that you're going to need a SQLite database anyway, have you looked at the performance of putting it all into the database? My experience is based around SQL Server 2000/2005/2008 so I'm not positive of the capabilities of SQLite but I'm sure it's going to be a pretty fast option for looking up records and getting the data, while still allowing for delete and/or update options.
Usually I would not recommend putting files inside the database, but given that the total size of all images is around 500MB for 4500 images, you're looking at a little over 100K per image, right? If you're using a dynamic path to store the images, then in a slightly more normalized database you could have an "ImagePaths" table that maps each path to an ID; then you can look for images with that PathID and load the data from the BLOB column as needed.
The XML file(s) could also be in the SQLite database, which gives you a single 'data file' for your app that can move between Windows and OSX without issue. You can simply rely on your SQLite engine to provide the performance and compatibility you need.
How you optimize it depends on your usage, for example if you're frequently needing to get all images at a certain path then having a PathID (as an integer for performance) would be fast, but if you're showing all images that start with "A" and simply show the path as a property then an index on the ImageName column would be of more use.
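A rough sketch of that layout using Python's built-in sqlite3 module. The ImagePaths table, PathID and ImageName come from the description above; the Images table, the remaining column names and the helper functions are my own guesses at a reasonable shape, so adjust to taste.

```python
import sqlite3


def open_container(path=":memory:"):
    """Open (or create) the single-file container and make sure the schema exists."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS ImagePaths (
            PathID INTEGER PRIMARY KEY,
            Path   TEXT NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS Images (
            ImageID   INTEGER PRIMARY KEY,
            PathID    INTEGER NOT NULL REFERENCES ImagePaths(PathID),
            ImageName TEXT NOT NULL,
            Data      BLOB NOT NULL
        );
        CREATE INDEX IF NOT EXISTS idx_images_path ON Images(PathID);
        CREATE INDEX IF NOT EXISTS idx_images_name ON Images(ImageName);
    """)
    return db


def add_image(db, path, name, data):
    """Store one image, creating the ImagePaths row on first use of a path."""
    db.execute("INSERT OR IGNORE INTO ImagePaths (Path) VALUES (?)", (path,))
    (path_id,) = db.execute(
        "SELECT PathID FROM ImagePaths WHERE Path = ?", (path,)
    ).fetchone()
    db.execute(
        "INSERT INTO Images (PathID, ImageName, Data) VALUES (?, ?, ?)",
        (path_id, name, data),
    )


def images_in_path(db, path):
    """All (name, bytes) pairs stored under one path, sorted by name."""
    rows = db.execute(
        "SELECT i.ImageName, i.Data FROM Images i "
        "JOIN ImagePaths p ON p.PathID = i.PathID "
        "WHERE p.Path = ? ORDER BY i.ImageName",
        (path,),
    )
    return [(name, bytes(data)) for name, data in rows]
```

Both indexes are created here only to show the two access patterns described above; in practice you would keep whichever one matches how the application actually queries.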
I am a little concerned though that this sounds like premature optimization, as you really need to find a solution that works 'fast enough', abstract the mechanics of it so your application (or both apps if you have both Mac and PC versions) use a simple repository or similar and then you can change the storage/retrieval method at will without any implication to your application.
Check Solid File System - it seems to be what you need.
