What is the fastest way to load data in Matlab - performance

I have a vast quantity of data (>800Mb) that takes an age to load into Matlab mainly because it's split up into tiny files each <20kB. They are all in a proprietary format which I can read and load into Matlab, its just that it takes so long.
I am thinking of reading the data in and writing it out to some sort of binary file which should make it quicker for subsequent reads (of which there may be many, hence me needing a speed-up).
So, my question is, what would be the best format to write them to disk to make reading them back again as quick as possible?
I guess I have the option of writing using fwrite, or just saving the variables from matlab. I think I'd prefer the fwrite option so if needed, I could read them from another package/language...

Look in to the HDF5 data format, used by recent versions of MATLAB as the underlying format for .mat files. You can manually create your own HDF5 files using the hdf5write function, and this file can be accessed from any language that has HDF bindings (most common languages do, or at least offer a way to integrate C code that can call the HDF5 library).
If your data is numeric (and of the same datatype), you might find it hard to beat the performance of plain binary (fwrite).

Binary mat-files are the fastest. Just use
save myfile.mat <var_a> <var_b> ...

I achieved an amazing speed up in loading when I used the '-v6' option to save the .mat files like so:
save(matlabTrainingFile, 'Xtrain', 'ytrain', '-v6');
Here's the size of the matrices that I used in my test ...
Attr Name Size Bytes Class
==== ==== ==== ===== =====
g Xtest 1430x4000 45760000 double
g Xtrain 3411x4000 109152000 double
g Xval 1370x4000 43840000 double
g ytest 1430x1 11440 double
g ytrain 3411x1 27288 double
g yval 1370x1 10960 double
... and the performance improvements that we achieved:
Before the change:
time to load the training data: 78 SECONDS!!!
time to load validation data: 32
time to load the test data: 35
After the change:
time to load the training data: 0 SECONDS!!!
time to load validation data: 0
time to load the test data: 0
Apparently the reason the reason this works so well is that the old version 6 version used less compression the than new versions.
So your file sizes will be bigger, but they will load WAY faster.

Related

R307 Fingerprint Sensor working with more then 1000 fingerprints

I want to integrate fingerprint sensor in my project. For the instance I have shortlisted R307, which has capacity of 1000 fingerprints.But as project requirement is more then 1000 prints,so I will going to store print inside the host.
The procedure I understand by reading the datasheet for achieving project requirements is :
I will register the fingerprint by "GenImg".
I will download the template by "upchr"
Now whenever a fingerprint come I will follow the step 1 and step 2.
Then start some sort of matching algorithm that will match the recently downloaded template file with
the template file stored in database.
So below are the points for which I want your thoughts
Is the procedure I have written above is correct and optimized ?
Is matching algorithm is straight forward like just comparing or it is some sort of tricky ? How can
I implement that.Please suggest if some sort of library already exist.
The sensor stores the image in 256 * 288 pixels and if I take this file to host
at maximum data rate it takes ~5(256 * 288*8/115200) seconds. which seems very
large.
Thanks
Abhishek
PS: I just mentioned "HOST" from which I going to connect sensor, it can be Arduino/Pi or any other compute device depends on how much computing require for this task, I will select.
You most probably figured it out yourself. But for anyone stumbling here in future.
You're correct for the most part.
You will take finger image (GenImg)
You will then generate a character file (Img2Tz) at BufferID: 1
You'll repeat the above 2 steps again, but this time store the character file in BufferID: 2
You're now supposed to generate a template file by combining those 2 character files (RegModel).
The device combines them for you, and stores the template in both the character buffers
As a last step; you need to store this template in your storage (Store)
For searching the finger: You'll take finger image once, generate a character file in BufferID : 1 and search the library (Search). This performs a linear search and returns the finger id along with confidence score.
There's also another method (GR_Identify); does all of the above automatically.
The question about optimization isn't applicable here, you're using a 3rd party device and you have to follow the working instructions whether it's optimized or not.
The sensor stores the image in 256 * 288 pixels and if I take this file to host at maximum data rate it takes ~5(256 * 288*8/115200) seconds. which seems very large.
I don't really get what you mean by this, but the template file ( that you intend to upload to your host ) is 512 bytes, I don't think it should take much time.
If you want an overview of how this system is implemented; Adafruit's Library is a good reference.

THREDDS OPeNDAP speed Matlab

Using the following code in Matlab:
nc_file_list = {'http://data.nodc.noaa.gov/thredds/dodsC/ghrsst/L2P/MODIS_A/JPL/2015/287/20151014-MODIS_A-JPL-L2P-A2015287235500.L2_LAC_GHRSST_D-v01.nc.bz2'};
temp.sl = ncreadatt(nc_file_list,'/','northernmost_latitude');
I try to get a single attribute from a netcdf file on a THREDDS OPeNDAP server. Ive been told this should be very quick as the netcdf philosophy is build around accessing small parts of data in big data sets.
The total size of the netcdf file is around 20 Mb. Running this code takes 17 seconds (internet speed is 5 Mb/s).
I need to process 19,000 files so I want this netcdf attribute reading to go alot quicker. Is there a way to read the attribute of the link given above in under 1 second?
The file is bz2-compressed, so the whole thing must be decompressed before the NetCDF library can perform random access operations on it. There's no way to avoid that.
You can use the THREDDS DAS Service as is explained by this answer:

what are the various approaches for generating PDFs?

I have an idea for an app that would take some flash content which contains graphics and images like various geometric shapes and polygons and some random images and convert them to PDF.
Also, since I envision this app to be used my multiple users I want this process to be quick and scalable. One possible solution I could think of is have a small flash client with the capability of assembling the above mentioned graphics and images. Generate some sort of XML, send it to a server running a Java process which could render the PDF using iText.
I was wondering what are the other possible ways to do it or the best practices. Technology isn't an issue; open source or commercial.
I understand that image uploads etc will take variable amount of time so consider that images are readily available. Here are the criterias in terms of what I am looking for in a solution for PDF rendering:
No constraint on the flash client because the PDF render engine.
Scalable to multiple users
Speed and Efficiency
Least amount of serialization / deserialization
I would appreciate if you could share your tech stack idea. Thanks a lot!
PS: I would appreciated if you don't get bogged down my Flash >> XML >> Java approach.
I believe it to be one of the many approaches that could be taken.
If generating the PDF in the browser using Flash is an option, then consider using AlivePdf. If not, then check out XSL:FO, we use it for server side conversion to PDF.
I believe iText generates PDFs in Java code. It may or may not use XML as its data source; POJOs will do just as well.
Another way is XSL-FO. It requires an XML data source and an XSL-FO stylesheet to transform the XML and generate a PDF. Apache's Xalan (or any other XSL-T library) can do it for you.
"Quick" and "scalable" may require more than these. Uploading a lot of images is a process that has its own timescale and optimizations that have nothing to do with PDFs.
There's pdflib for PHP, and FPDF (also for PHP).
So you're also willing to consider other clients? It sounds like you've got a kids drawing app and want to generate something that'll preserve the state of their drawing at the time.
Lets face it, XML isn't that efficient. That's not its purpose. It's both machine and human readable, validatable, etc etc.
Instead, how about a <Canvas> based web page that submitted the state of that canvas to the server in JSON (fewer bytes, and less work to build them). The server can then work in whatever the hell library/language it wants. Lots of JSON->my-language libraries floating around out there.
Your choice in PDF libraries is then limited only by what is you have installed on your server. You also said you wanted to do as little reading/writing as possible.
The most efficient possible setup would be to have a read-only partial PDF already loaded into memory tailored to minimize the impact of canvas changes (including images). Each session would dupe that partial PDF, convert the JSON to PDF graphic commands, and save the PDF.
To minimize structural changes to the PDF you'd want to use Inline Images. No new objects in the PDF means you don't need to change your cross reference table at all (until you add fonts or want to reuse an existing image). You could build the "doc info" dictionary padded with a specific amount of spaces between objects so you could fill it in without changing any byte offsets (which would force you to recompute the xref table).
You may or may not need to mess with the page size... we are just talking about one page here, right?
So the PDF would look something like...
%%PDF-1.6
<3-4 random high order bytes to convince folks that we're a binary stream>
1 0 obj
<</Type/Catalog/Pages 2 0 R>>
endobj
2 0 obj
<</Type/Pages/Count 1/Kids[3 0 R]>>
endobj
3 0 obj
<</Type/Page/Contents 4 0 R/MediaBox[0 0 612 792]/Parent 2 0 R>>
endobj
5 0 obj
<</Type/DocInfo/Author() --<insert big whitespace gap here>--
/Title() --<ditto>--
/Subject() --<ditto>--
/Keywords() --<ditto>--
/Creator(My app's Name)
/Producer(My pdf library's name)
/CreationDate(encodedDateWhenThisTemplateWasBuilt) D:YYYYMMDDHHMMSS-timeZoneOffset
/ModDate() --<another, smaller whitespace gap>--
>>
4 0 obj
<</Filter/SeveralDifferentFiltersAvailable/Length --<byte length of the stream in this file>-->>
stream
And your template stops there. You'd have a similar "end of the PDF" template that would look something like this:
endstream
endobj
xref
0 6
0000000000 65535 f
0000000010 00000 n
0000000025 00000 n
0000000039 00000 n
0000000097 00000 n
0000000050 00000 n
trailer
<</Root 1 0 R/Size 6/Info 5 0 R>>
startxref
--<some white space>--
%%EOF
The columns of numbers at the end are all wrong. The first column is the byte offset of that particular object (and I'm not up for counting bytes just now thank you). The second column is largely irrelevant.
PDF filling app will need to know:
The byte offsets of everything you intend to fill in within the first template.
All the "doc info" fields, which are all optional by the way. The /Info key and the dictionary it points to are optional for that matter. You could yank 'em if you cared to.
the /Length key of the content stream. That needs to be the post-filter byte length of the stream itself.
How to convert the JSON into pdf drawing commands. If you wanted to cheat a bit you could use iText[Sharp]'s PdfContentByte class, use its drawing commands, and then get the finished byte stream and slap that into your PDF. Be sure you use Inline Images or this whole scheme goes right out the window. There are probably other libraries you could gut similarly if you felt the need. Or you could just read up on the PDF spec and roll your own. You'll be sticking to a fairly limited subset of PDF's content syntax.
The byte offset of the word "xref" from the start of the file. You can calculate this: LengthOfInitialTemplate + LengthOfContentStream + OffsetFromStartOf2ndTemplateTo'xref'.
The byte offset of the line below "startxref", which is where you write the aforecalculated byte offset of 'xref'
You're not going to get much more efficient than that. You'd read in your templates once. Read/calculate the byte offsets you needed once.

Increasing the Loading Speed of Large Files

There are two large text files (Millions of lines) that my program uses. These files are parsed and loaded into hashes so that the data can be accessed quickly. The problem I face is that, currently, the parsing and loading is the slowest part of the program. Below is the code where this is done.
database = extractDatabase(#type).chomp("fasta") + "yml"
revDatabase = extractDatabase(#type + "-r").chomp("fasta.reverse") + "yml"
#proteins = Hash.new
#decoyProteins = Hash.new
File.open(database, "r").each_line do |line|
parts = line.split(": ")
#proteins[parts[0]] = parts[1]
end
File.open(revDatabase, "r").each_line do |line|
parts = line.split(": ")
#decoyProteins[parts[0]] = parts[1]
end
And the files look like the example below. It started off as a YAML file, but the format was modified to increase parsing speed.
MTMDK: P31946 Q14624 Q14624-2 B5BU24 B7ZKJ8 B7Z545 Q4VY19 B2RMS9 B7Z544 Q4VY20
MTMDKSELVQK: P31946 B5BU24 Q4VY19 Q4VY20
....
I've messed around with different ways of setting up the file and parsing them, and so far this is the fastest way, but it's still awfully slow.
Is there a way to improve the speed of this, or is there a whole other approach I can take?
List of things that don't work:
YAML.
Standard Ruby threads.
Forking off processes and then retrieving the hash through a pipe.
In my usage, reading all or part the file into memory before parsing usually goes faster. If the database sizes are small enough this could be as simple as
buffer = File.readlines(database)
buffer.each do |line|
...
end
If they're too big to fit into memory, it gets more complicated, you have to setup block reads of data followed by parse, or threaded with separate read and parse threads.
Why not use the solution devised through decades of experience: a database, say SQLlite3?
(To be different, although I'd first recommend looking at (Ruby) BDB and other "NoSQL" backend-engines, if they fit your need.)
If fixed-sized records with a deterministic index are used then you can perform a lazy-load of each item through a proxy object. This would be a suitable candidate for a mmap. However, this will not speed up the total access time, but will merely amortize the loading throughout the life-cycle of the program (at least until first use and if some data is never used then you get the benefit of never loading it). Without fixed-sized records or deterministic index values this problem is more complex and starts to look more like a traditional "index" store (eg. a B-tree in an SQL back-end or whatever BDB uses :-).
The general problems with threading here are:
The IO will likely be your bottleneck around Ruby "green" threads
You still need all the data before use
You may be interested in the Widefinder Project, just in general "trying to get faster IO processing".
I don't know too much about Ruby but I have had to deal with the problem before. I found the best way was to split the file up into chunks or separate files then spawn threads to read each chunk in at a single time. Once the partitioned files are in memory combining the results should be fast. Here is some information on Threads in Ruby:
http://rubylearning.com/satishtalim/ruby_threads.html
Hope that helps.

Are there alternatives for creating large container files that are cross platform?

Previously, I asked the question.
The problem is the demands of our file structure are very high.
For instance, we're trying to create a container with up to 4500 files and 500mb data.
The file structure of this container consists of
SQLite DB (under 1mb)
Text based xml-like file
Images inside a dynamic folder structure that make up the rest of the 4,500ish files
After the initial creation the images files are read only with the exception of deletion.
The small db is used regularly when the container is accessed.
Tar, Zip and the likes are all too slow (even with 0 compression). Slow is subjective I know, but to untar a container of this size is over 20 seconds.
Any thoughts?
As you seem to be doing arbitrary file system operations on your container (say, creation, deletion of new files in the container, overwriting existing files, appending), I think you should go for some kind of file system. Allocate a large file, then create a file system structure in it.
There are several options for the file system available: for both Berkeley UFS and Linux ext2/ext3, there are user-mode libraries available. It might also be possible that you find a FAT implementation somewhere. Make sure you understand the structure of the file system, and pick one that allows for extending - I know that ext2 is fairly easy to extend (by another block group), and FAT is difficult to extend (need to append to the FAT).
Alternatively, you can put a virtual disk format yet below the file system, allowing arbitrary remapping of blocks. Then "free" blocks of the file system don't need to appear on disk, and you can allocate the virtual disk much larger than the real container file will be.
Three things.
1) What Timothy Walters said is right on, I'll go in to more detail.
2) 4500 files and 500Mb of data is simply a lot of data and disk writes. If you're operating on the entire dataset, it's going to be slow. Just I/O truth.
3) As others have mentioned, there's no detail on the use case.
If we assume a read only, random access scenario, then what Timothy says is pretty much dead on, and implementation is straightforward.
In a nutshell, here is what you do.
You concatenate all of the files in to a single blob. While you are concatenating them, you track their filename, the file length, and the offset that the file starts within the blob. You write that information out in to a block of data, sorted by name. We'll call this the Table of Contents, or TOC block.
Next, then, you concatenate the two files together. In the simple case, you have the TOC block first, then the data block.
When you wish to get data from this format, search the TOC for the file name, grab the offset from the begining of the data block, add in the TOC block size, and read FILE_LENGTH bytes of data. Simple.
If you want to be clever, you can put the TOC at the END of the blob file. Then, append at the very end, the offset to the start of the TOC. Then you lseek to the end of the file, back up 4 or 8 bytes (depending on your number size), take THAT value and lseek even farther back to the start of your TOC. Then you're back to square one. You do this so you don't have to rebuild the archive twice at the beginning.
If you lay out your TOC in blocks (say 1K byte in size), then you can easily perform a binary search on the TOC. Simply fill each block with the File information entries, and when you run out of room, write a marker, pad with zeroes and advance to the next block. To do the binary search, you already know the size of the TOC, start in the middle, read the first file name, and go from there. Soon, you'll find the block, and then you read in the block and scan it for the file. This makes it efficient for reading without having the entire TOC in RAM. The other benefit is that the blocking requires less disk activity than a chained scheme like TAR (where you have to crawl the archive to find something).
I suggest you pad the files to block sizes as well, disks like work with regular sized blocks of data, this isn't difficult either.
Updating this without rebuilding the entire thing is difficult. If you want an updatable container system, then you may as well look in to some of the simpler file system designs, because that's what you're really looking for in that case.
As for portability, I suggest you store your binary numbers in network order, as most standard libraries have routines to handle those details for you.
Working on the assumption that you're only going to need read-only access to the files why not just merge them all together and have a second "index" file (or an index in the header) that tells you the file name, start position and length. All you need to do is seek to the start point and read the correct number of bytes. The method will vary depending on your language but it's pretty straight forward in most of them.
The hardest part then becomes creating your data file + index, and even that is pretty basic!
An ISO disk image might do the trick. It should be able to hold that many files easily, and is supported by many pieces of software on all the major operating systems.
First, thank-you for expanding your question, it helps a lot in providing better answers.
Given that you're going to need a SQLite database anyway, have you looked at the performance of putting it all into the database? My experience is based around SQL Server 2000/2005/2008 so I'm not positive of the capabilities of SQLite but I'm sure it's going to be a pretty fast option for looking up records and getting the data, while still allowing for delete and/or update options.
Usually I would not recommend to put files inside the database, but given that the total size of all images is around 500MB for 4500 images you're looking at a little over 100K per image right? If you're using a dynamic path to store the images then in a slightly more normalized database you could have a "ImagePaths" table that maps each path to an ID, then you can look for images with that PathID and load the data from the BLOB column as needed.
The XML file(s) could also be in the SQLite database, which gives you a single 'data file' for your app that can move between Windows and OSX without issue. You can simply rely on your SQLite engine to provide the performance and compatability you need.
How you optimize it depends on your usage, for example if you're frequently needing to get all images at a certain path then having a PathID (as an integer for performance) would be fast, but if you're showing all images that start with "A" and simply show the path as a property then an index on the ImageName column would be of more use.
I am a little concerned though that this sounds like premature optimization, as you really need to find a solution that works 'fast enough', abstract the mechanics of it so your application (or both apps if you have both Mac and PC versions) use a simple repository or similar and then you can change the storage/retrieval method at will without any implication to your application.
Check Solid File System - it seems to be what you need.

Resources