Quickest way to write/read/cache simple information as described - caching

I do apologize for the terrible question. I'm a 3D guy who dabbles in Python for plugins and scripts.
I've successfully come up with the worst possible way to export particle information (two vectors per particle per frame, for position and alignment). My first method was to write out a billion vectors per line to a .txt, where each line represented a frame. Now I have it writing out one .txt per frame and opening and closing the right one depending on the frame.
Yeah, it's slow. And dumb. Whatever. What direction would you suggest I go/research? A different file type? A :checks google: .bin, perhaps? Or should my crude method actually not take very long, meaning something else is slowing things down? I don't need an exhaustive answer, just some general information to get me moving in the right direction.
Thanks a million.

If this info is going to be read by another Python application (especially if it's the same application that wrote it out), look into just pickling your data structures. Just build them in memory and use pickle to dump them out to a binary file (a minimal sketch follows the caveats below). The caveats here:
1) Do you have memory to do it all at once, or does it have to be one frame at a time? You can make big combined files in the first case; you'd need one file per frame in the second. If you're running out of memory, the yield statement is your friend.
2) Pickled files need to be written and read by the same Python version to be reliable, so make sure all the reading and writing apps are on the same Python version.
3) Pickled files are binary, so they're not human readable.
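To make that concrete, here's a minimal sketch of the one-pickle-per-frame variant. The cache directory, file naming and the positions/alignments structure are placeholders I made up, not anything your plugin API dictates.

```python
import os
import pickle

CACHE_DIR = "particle_cache"  # hypothetical cache location
os.makedirs(CACHE_DIR, exist_ok=True)

def frame_path(frame):
    return os.path.join(CACHE_DIR, "frame_%04d.pkl" % frame)

def write_frame(frame, positions, alignments):
    """Dump one frame's particle data (sequences of 3-vectors) to a binary pickle."""
    with open(frame_path(frame), "wb") as f:
        pickle.dump({"positions": positions, "alignments": alignments}, f,
                    protocol=pickle.HIGHEST_PROTOCOL)

def read_frame(frame):
    """Load one frame's particle data back into memory."""
    with open(frame_path(frame), "rb") as f:
        return pickle.load(f)

# Example: cache two particles for frame 12, then read them back.
write_frame(12,
            positions=[(0.0, 1.0, 2.0), (3.0, 4.0, 5.0)],
            alignments=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)])
data = read_frame(12)
print(data["positions"][0])
```

If memory allows, the same dict-of-frames structure can go into one combined pickle instead of one file per frame.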
If you need to exchange with other applications, look into Alembic, which is an open-source file format designed for exactly this sort of problem - baking out large volumes of particle or simulation data. There's a commercial exporter available from Exocortex which comes with a Python module for dealing with Alembic data.

Related

Fortran95 access large files fast using direct access

I am currently working on a problem which requires me to store a large amount of well structured information in a file.
It is more data than I can keep in memory, but I need to access different parts of it very often and would like to do so as quickly as possible (of course).
Unfortunately, the file would be large enough that actually reading through it would take quite some time as well.
From what I have gathered so far, it seems to me that ACCESS="DIRECT" would be a good way of handling this problem. Do I understand correctly that with direct access I am basically pointing at a specific chunk of memory and asking "What's in there?"? And do I correctly infer from that that reading time does not depend on the overall file size?
Thank you very much in advance!
You can think of an ACCESS='DIRECT' file as a file consisting of a number of fixed size records. You can do operations like read or write record #N in O(1) time. That is, in order to access record #N you don't need to scan through all the preceding #M (M<N) records in the file.
If this maps reasonably well to the problem you're trying to solve, then ACCESS='DIRECT' might be the correct solution in your case. If not, ACCESS='STREAM' offers a little more flexibility in that the size of each record does not need to be fixed, though you need to compute the correct file offset yourself. If you need even more flexibility, there are things like NetCDF or HDF5, as @HighPerformanceMark suggested, or even things like SQLite.
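To illustrate the offset arithmetic behind that O(1) claim: the question is about Fortran, but the fixed-record idea is language-agnostic, so here is a small sketch in Python using seek(). The record layout (three doubles per record) is an arbitrary assumption for the example.

```python
import struct

RECORD_FMT = "<3d"                        # hypothetical record: three little-endian doubles
RECORD_SIZE = struct.calcsize(RECORD_FMT)

def write_records(path, records):
    """Write fixed-size records sequentially."""
    with open(path, "wb") as f:
        for rec in records:
            f.write(struct.pack(RECORD_FMT, *rec))

def read_record(path, n):
    """Read record #n (1-based) in O(1): seek straight to its byte offset."""
    with open(path, "rb") as f:
        f.seek((n - 1) * RECORD_SIZE)     # no need to scan the preceding records
        return struct.unpack(RECORD_FMT, f.read(RECORD_SIZE))

write_records("data.bin", [(i, i * 2.0, i * 3.0) for i in range(1000)])
print(read_record("data.bin", 500))       # -> (499.0, 998.0, 1497.0)
```

The cost of read_record is the same whether the file holds a thousand records or a billion, which is exactly the property direct access gives you.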

Module for animated plotting from Fortran code

Does anyone have a convenient way to plot time-dependent data? Say you have a program that outputs a trajectory over a period of time, so a 3-column txt file (t, x, y). I'd like to create a video file (mp4, avi, gif, etc.) that shows the latter two columns' evolution in time. I've written a program that outputs data, calls gnuplot, outputs a png, repeats however long is needed, then uses ffmpeg to mash all the pngs into an mp4. It takes a very long time to produce each png, however (somewhere around 0.2 sec per frame), and a 2-minute 30fps video takes about 12 minutes to produce because of this. Also, I end up creating a directory with 3600 pngs and then deleting the directory. I can't help but feel an easier way to do this has been developed by someone over the past few decades. There must be a more elegant way to do something like this. I'm running Windows 10 as well.
It's probably overkill for your application, but you may want to look into writing (or converting) your data to VTK format (see http://www.cacr.caltech.edu/~slombey/asci/vtk/vtk_formats.simple.html), then processing the result through Paraview (http://www.paraview.org/) or VisIt (https://wci.llnl.gov/simulation/computer-codes/visit). Legacy VTK format is relatively easy to write from Fortran; the hardest part is understanding the so-flexible-nobody-can-explain-how-to-do-simple-things-with-it file format. The second hardest part is finding where the options you want are hidden in the VisIt UI. There are existing F90 libraries for writing VTK (see https://people.sc.fsu.edu/~jburkardt/f_src/vtk_io/vtk_io.html) which may give you a head start.
Glowing praise, I know, but once you've sorted the bits out, it's easy to generate animated plots using VisIt and it should be much faster than gnuplot. I've used this method for making animated 2D maps of temperature, heat generation, etc. based on data written directly from Fortran code.
Another tactic is to look for simpler data formats supported by VisIt and use those. I chose VTK because it was (somewhat) documented and supported by multiple viewers but there may be a better format for your needs.
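To give a feel for what "relatively easy to write" means, here is a sketch that emits one legacy-VTK POLYDATA file per time step. It's written in Python for brevity, but legacy VTK is plain text, so the structure is identical whichever language writes it. The file names and toy trajectory are made up.

```python
def write_vtk_frame(filename, points):
    """Write one time step as a legacy-VTK POLYDATA file of points.

    points: list of (x, y, z) tuples for this frame.
    """
    with open(filename, "w") as f:
        f.write("# vtk DataFile Version 3.0\n")
        f.write("trajectory frame\n")
        f.write("ASCII\n")
        f.write("DATASET POLYDATA\n")
        f.write("POINTS %d float\n" % len(points))
        for x, y, z in points:
            f.write("%g %g %g\n" % (x, y, z))
        # One VERTEX cell per point so viewers actually render the points.
        f.write("VERTICES %d %d\n" % (len(points), 2 * len(points)))
        for i in range(len(points)):
            f.write("1 %d\n" % i)

# One file per time step; viewers such as ParaView pick up a numbered
# sequence like traj_0000.vtk, traj_0001.vtk, ... as a time series.
toy_trajectory = [(0.0, 0.0, 0.0), (0.1, 0.5, 0.2)]   # (t, x, y) samples, invented
for step, (t, x, y) in enumerate(toy_trajectory):
    write_vtk_frame("traj_%04d.vtk" % step, [(x, y, 0.0)])
```

From there the animation is rendered in the viewer rather than by stitching pngs yourself.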

How duplicate file search is implemented in Gemini for Mac OS

I tried to search for duplicate files on my Mac via the command line.
This process took almost half an hour for 10 GB of data files, whereas the Gemini and CleanMyMac apps take much less time to find the files.
So my point here is: how is this speed achieved in these apps, what is the concept behind it, and in which language is the code written?
I tried googling for information but did not get anything related to duplicate finders.
If you have any ideas, please share them here.
First of all, Gemini locates files of equal size, then it uses its own hash-like, type-dependent algorithm to compare file contents. That algorithm is not 100% accurate but is much quicker than classical hashes.
I contacted support and asked them what algorithm they use. Their response was that they compare parts of each file to each other, rather than comparing whole files or hashing them. As a result, they only need to check maybe 5% (or less) of each file that's reasonably similar in size to another, and still get a reasonably accurate result. Using this method, they don't have to pay the cost of comparing the whole file OR the cost of hashing files. They could be even more accurate if they used this method for the initial comparison and then did full comparisons among the potential matches.
Using this method, files that are minor variants of each other may be detected as identical. For example, I've had two songs (original mix and VIP mix) that counted as the same. I also had two images, one with a watermark and one without, listed as identical. In both these cases, the algorithm just happened to pick parts of the file that were identical across the two files.
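Roughly that two-pass idea (group by size, then probe a few chunks) looks like the sketch below. The probe positions and the 4 KB chunk size are my guesses for illustration, not what Gemini actually does.

```python
import os
from collections import defaultdict

CHUNK = 4096  # bytes sampled at each probe point (arbitrary choice)

def sample(path, size):
    """Read small chunks from the start, middle and end instead of hashing the whole file."""
    probes = []
    with open(path, "rb") as f:
        for offset in (0, size // 2, max(size - CHUNK, 0)):
            f.seek(offset)
            probes.append(f.read(CHUNK))
    return tuple(probes)

def find_probable_duplicates(root):
    # Pass 1: group by size -- files of different sizes can never be duplicates.
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                pass
    # Pass 2: within each size group, compare only a few sampled chunks.
    dupes = []
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        by_sample = defaultdict(list)
        for path in paths:
            by_sample[sample(path, size)].append(path)
        dupes.extend(group for group in by_sample.values() if len(group) > 1)
    return dupes

print(find_probable_duplicates("/Users/me/Music"))  # hypothetical path
```

A full byte-for-byte comparison within each surviving group would catch false positives like the ones described above, at the cost of reading those files in full.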

How to make one file out of different images?

I have a question which I want to discuss with you. I am a fresh graduate and just got a job as an IT programmer. My company is making a game; the images or graphics used inside the game live in one folder but as separate image files. They gave me the task of converting the different image files into one file that the program can still access. If you have any kind of idea, please share it with me. Thanks.
I'm not really sure what the advantage of this approach is for a game that runs on the desktop, but if you've already carefully considered that and decided that having a single file is important, then it's certainly possible to do so.
Since the question, as Oded points out, shows very little research or other effort on your part, I won't provide a complete solution. And even if I wanted to, I'm not sure I could, because you don't give us any information on what programming language and UI framework you're using. Visual Studio 2010 supports a lot of different ones.
Anyway, the trick involves creating a sprite. This is a fairly common technique for web design, where it actually is helpful to reduce load times by using only a single image, and you can find plenty of explanation and examples by searching the web. For example, here.
Basically, what you do is make one large image that contains all of your smaller images, offset from each other by a certain number of pixels. Then, you load that single large image and access the individual images by specifying the offset coordinates of each image.
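Here is a sketch of that sprite-sheet idea using Pillow, assuming equally sized tiles laid out left to right; the 64x64 tile size and the file names are invented for the example, and your game framework may already have its own way of doing this.

```python
from PIL import Image  # pip install Pillow

TILE_W, TILE_H = 64, 64  # assumed fixed size of each small image

def build_sheet(paths, out_path):
    """Paste equally sized images side by side into one sheet."""
    sheet = Image.new("RGBA", (TILE_W * len(paths), TILE_H))
    for i, path in enumerate(paths):
        tile = Image.open(path).convert("RGBA")
        sheet.paste(tile, (i * TILE_W, 0))
    sheet.save(out_path)

def get_tile(sheet_path, index):
    """Pull one image back out of the sheet by its pixel offset."""
    sheet = Image.open(sheet_path)
    left = index * TILE_W
    return sheet.crop((left, 0, left + TILE_W, TILE_H))

build_sheet(["hero.png", "enemy.png", "coin.png"], "sprites.png")  # hypothetical files
coin = get_tile("sprites.png", 2)
```

In a real game you'd usually keep a small lookup table mapping sprite names to their offsets rather than relying on the index order.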
I do not, however, recommend doing as Jan recommends and compressing the image directory (into a ZIP file or any other format), because then you'll just have to pay the cost of uncompressing it each time you want to use one of the images. That also buys you extremely little; disk storage is cheap nowadays.

Tools for Feature Extraction from Binary Data of Images

I am working on a project where I have image files that have been malformed (fuzzed, i.e. their image data has been altered). When rendered on various platforms, these files lead to a warning/crash/pass report from the platform.
I am trying to build a shield using unsupervised machine learning that will help me identify/classify these images as malicious or not. I have the binary data of these files, but I have no clue what feature set/patterns I can identify from it, because visually these images could be anything. (I need to be able to find a feature set from the binary data.)
I need some advice on the tools/methods I could use for automatic feature extraction from this binary data; feature sets which I can use with unsupervised learning algorithms such as Kohonen's SOM etc.
I am new to this, any help would be great!
I do not think this is feasible.
The problem is that these are old exploits, and training on them will not tell you much about future exploits, because this is an extremely unbalanced problem: no exploit uses the same thing as another. So even if you generate multiple files of the same type, you will likely end up with a single relevant training case per exploit.
Nevertheless, what you need to do is extract features from the file metadata. That is where the exploits are, not in the actual image. As such, parsing the files is itself much of where the problem lies, and your detection tool may become vulnerable to exactly such an exploit.
As the data may be compressed, a naive binary feature approach will not work either.
You probably don't want to look at the actual pixel data at all, since the corruption most likely (almost certainly) lies in the file header with its different "chunks" (the example below is for PNG; other formats differ in detail but work the same way):
http://en.wikipedia.org/wiki/Portable_Network_Graphics#File_header
It should be straightforward to choose features: write a program that reads all the header information from the file (including whether expected information is missing or malformed) and use that information as your features. It will still be much smaller than the unnecessary raw image data.
Oh, and always start out with simpler algorithms like PCA together with k-means or something, and if they fail you can bring out the big guns.
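For the PNG case linked above, a sketch of what header-level feature extraction could look like follows. The particular features (signature validity, chunk counts, declared lengths, truncation) are just one plausible choice I picked for illustration, not something prescribed by the answer.

```python
import struct
from collections import Counter

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_header_features(path):
    """Walk the chunk list of a PNG and return header-level features,
    ignoring the pixel data itself."""
    features = {"valid_signature": 0, "chunk_count": 0,
                "declared_bytes": 0, "truncated": 0}
    chunk_types = Counter()
    with open(path, "rb") as f:
        features["valid_signature"] = int(f.read(8) == PNG_SIGNATURE)
        while True:
            head = f.read(8)               # 4-byte length + 4-byte chunk type
            if len(head) < 8:
                break
            length, ctype = struct.unpack(">I4s", head)
            chunk_types[ctype] += 1
            features["chunk_count"] += 1
            features["declared_bytes"] += length
            body = f.read(length + 4)      # chunk data + 4-byte CRC
            if len(body) < length + 4:     # file ends before the chunk does
                features["truncated"] = 1
                break
            if ctype == b"IEND":
                break
    # Flatten chunk-type counts into the feature dictionary.
    for ctype, n in chunk_types.items():
        features["n_" + ctype.decode("latin-1")] = n
    return features

print(png_header_features("sample.png"))   # hypothetical input file
```

Each file then yields a small dictionary that can be turned into a fixed-length vector before handing it to PCA/k-means or an SOM.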
