I have a large data set (200 GB uncompressed, 9 GB compressed with bzip2 -9) of stock tick data.
I want to run some basic time series analysis on them.
My machine has 16GB of RAM.
I would prefer to:
keep all data, compressed, in memory
decompress that data on the fly, and stream it [so nothing ever hits disk]
do all analysis in memory
Now, I think there are nice interactions here with Clojure's laziness and futures (i.e. I can define objects such that, when I try to access them, they are decompressed on the fly).
Question: what are the things I should keep in mind when doing high performance time series analysis in Clojure?
I'm particularly interested in tricks involving:
efficiently storing tick data in memory
efficiently doing computation
weird convolutions to reduce # of passes over the data
Books / articles / research paper suggestions welcome. (I'm a CS PhD student).
Thanks.
Some ideas:
In terms of storing the compressed data, I don't think you will be able to do much better than your OS's own file system caching. Just make sure it's configured to use 11GB+ of RAM for file system caching, and it should pull your whole compressed data set into memory as it is read the first time.
You should then be able to define your Clojure code to pull in the data lazily via a decompressing input stream, which will perform the decompression for you (note that the JDK's ZipInputStream only handles zip archives; bz2 data needs its own decoder, e.g. from Apache Commons Compress).
If you need to perform a second pass on the data, just create a new input stream over the same file. OS-level caching should ensure that you don't hit the disk again.
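A rough sketch of that streaming read on the JVM, which Clojure can wrap in a lazy sequence (assuming Apache Commons Compress for the bz2 decoding; the file name is illustrative):

```java
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class TickStream {
    // Open a fresh decompressing reader over the compressed file.
    // Calling this again for a second pass re-reads the (hopefully OS-cached) file.
    static BufferedReader openTicks(String path) throws IOException {
        return new BufferedReader(
            new InputStreamReader(
                new BZip2CompressorInputStream(
                    new BufferedInputStream(new FileInputStream(path)))));
    }

    public static void main(String[] args) throws IOException {
        long count = 0;
        try (BufferedReader r = openTicks("ticks.bz2")) {
            String line;
            while ((line = r.readLine()) != null) {
                count++;   // process each tick line here, then let it become garbage
            }
        }
        System.out.println("lines: " + count);
    }
}
```

From Clojure, wrapping such a reader with line-seq gives you the ticks as a lazy sequence directly.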
I have heard of systems like that implemented in Java, so it is possible. You'll certainly want to understand how to create your own lazy sequences in order to accomplish this. I also wouldn't hesitate to drop down into Java if you need to make sure that you're dealing with the primitive types you want. For example, Clojure won't generate code to do math on 32-bit ints; it will only generate code to work with longs, and if you don't want that it could be a pain.
It would also be worth some effort to make your in-memory format compatible with a disk format. That would give you the option of memory-mapping files, or (at the very least) make recovery easy if your program crashes: it could just read the files on disk to rebuild its previous state.
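A minimal sketch of the memory-mapping side on the JVM, assuming fixed-width records (the 16-byte record layout and file name are made up for illustration):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedTicks {
    static final int RECORD_SIZE = 16;   // e.g. 8-byte timestamp + 8-byte price (illustrative)

    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of("ticks.bin"), StandardOpenOption.READ)) {
            // Map the first window of the file; a single mapping is limited to 2 GB,
            // so a 200 GB data set would be walked in windows like this.
            long len = Math.min(ch.size(), 1L << 30);
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, len);
            for (int base = 0; base + RECORD_SIZE <= len; base += RECORD_SIZE) {
                long timestamp = buf.getLong(base);
                double price   = buf.getDouble(base + 8);
                // ... feed (timestamp, price) into the analysis; after a crash the same
                // loop rebuilds state, since the on-disk format is the in-memory format.
            }
        }
    }
}
```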
Related
I am building an ML application for binary classification using ML.NET. It will have multiple ML models of varying sizes (built using different training data) which will be stored in a SQL Server database as blobs. Clients will send items for classification to this app in random order, and based on the client ID the corresponding model is used for classification. To classify an item, the model needs to be read from the database and then loaded into memory. Loading a model into memory takes considerable time depending on its size, and I don't see any way to optimize that, so I am planning to cache models in memory. But if I cache many heavy models, it may put pressure on memory and hamper the performance of other processes running on the server, and there is no straightforward way to limit how much memory the cache uses. So I'm looking for suggestions on how to handle this.
Spawn a new process
In my opinion this is the only viable option to accomplish what you're trying to do: spawn a completely new process that communicates (via IPC?) with your "main application". You could set a memory limit using this property https://learn.microsoft.com/en-us/dotnet/api/system.gcmemoryinfo.totalavailablememorybytes?view=net-5.0 or maybe even use a third-party library (e.g. https://github.com/lowleveldesign/process-governor) that kills your process if it reaches a specific amount of RAM. Both of these approaches are quite rough and will basically just kill your process.
If you have control over the sidecar application, it might make sense to monitor its RAM usage (see "Getting a process's ram usage") and stop the process gracefully.
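A rough sketch of that monitor-and-kill pattern in Java (the question's stack is .NET, so this is purely illustrative; the worker command, the 2 GB limit, and the Linux-only /proc parsing are all assumptions):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class WorkerGovernor {
    static final long LIMIT_BYTES = 2L * 1024 * 1024 * 1024;   // illustrative 2 GB budget

    // Linux-only: read the resident set size of a process from /proc/<pid>/status.
    static long residentBytes(long pid) throws IOException {
        for (String line : Files.readAllLines(Path.of("/proc/" + pid + "/status"))) {
            if (line.startsWith("VmRSS:")) {
                return Long.parseLong(line.replaceAll("\\D+", "")) * 1024;   // value is in kB
            }
        }
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical worker process that loads models and answers classification requests over IPC.
        Process worker = new ProcessBuilder("java", "-jar", "model-worker.jar").inheritIO().start();
        while (worker.isAlive()) {
            if (residentBytes(worker.pid()) > LIMIT_BYTES) {
                worker.destroy();   // or destroyForcibly(); the worker's cache is rebuilt on restart
                break;
            }
            Thread.sleep(5_000);    // poll every few seconds
        }
    }
}
```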
Do it yourself solution (not recommended)
Basically there is no built-in way of limiting memory usage by thread or anything similar.
What counts towards the memory limit?
Shared resources
Since you have a running process, you need to define what exactly counts towards the memory limit. For example, if you have some static Dictionary that is manipulated by the running thread, what does it occupy? Only the diff between the old value and the new value? The whole new value? The key and the value?
There are many more cases like this you'll have to take into consideration.
The actual measuring
You need some kind of way to count the actual memory usage. This will probably be hard/near impossible to "implement":
Reference counting needed?
If you have a hostile thread, it might create an unbounded number of references to one object without ever using the new keyword. For each reference you'd have to count 32/64 bits.
What about built in types?
It might be "easy" to measure a byte[] included in your own type definition, but what about built-in classes? If someone initializes a 100 MB string, that is an amount you need to keep track of.
... and many more ...
As you may have noticed in the previous examples, there is no easy definition of "RAM used by a thread". This is also the reason there is no easy way to get its value.
In my opinion it's insanely complex to do such a thing and needs a lot of definition work on your side. It might be feasible with lots of effort, but I'm not sure that's really what you want. Even if you manage it, what will you do about it? Killing the thread alone might not clean up the resources.
Therefore I'd really think about having an OS-managed, independent process that you can kill whenever you need to.
How big are your models? Even large models (100 MB+) load pretty quickly off fast SSD storage. I would consider caching them on fast drives/SSDs, because pulling them off SQL Server is going to be much slower than raw disk. See if this helps your performance.
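If you do end up caching models in memory, a byte-budgeted LRU cache is a simple way to cap the footprint. A Java sketch of the idea (the question's stack is ML.NET/C#, so this is only illustrative; the budget and the caller-supplied size estimate are assumptions):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Byte-budgeted LRU cache: keeps recently used models loaded and evicts the
// least recently used ones once the estimated total size exceeds the budget.
public class ModelCache<K, V> {
    private record Sized<V>(V model, long sizeBytes) {}

    private final long budgetBytes;
    private long usedBytes = 0;
    private final LinkedHashMap<K, Sized<V>> cache =
            new LinkedHashMap<>(16, 0.75f, true);   // access order => LRU iteration order

    public ModelCache(long budgetBytes) { this.budgetBytes = budgetBytes; }

    public synchronized V get(K clientId) {
        Sized<V> e = cache.get(clientId);
        return e == null ? null : e.model();
    }

    // sizeBytes is a caller-supplied estimate, e.g. the size of the blob the model was loaded from.
    public synchronized void put(K clientId, V model, long sizeBytes) {
        Sized<V> old = cache.put(clientId, new Sized<>(model, sizeBytes));
        usedBytes += sizeBytes - (old == null ? 0 : old.sizeBytes());
        Iterator<Map.Entry<K, Sized<V>>> it = cache.entrySet().iterator();
        while (usedBytes > budgetBytes && cache.size() > 1 && it.hasNext()) {
            Map.Entry<K, Sized<V>> eldest = it.next();
            if (eldest.getValue().model() == model) continue;   // keep the entry just added
            usedBytes -= eldest.getValue().sizeBytes();
            it.remove();
        }
    }
}
```

On a cache miss you would read the blob from SQL Server, deserialize it once, and put it here, so repeat requests for the same client skip both the database read and the load.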
I'm working on an application that reads in a huge text file (up to 5 GB in size). Currently, I am using fscanf to read in the file, because I have found it to be the fastest so far. However, it still takes quite a long time to read the whole file in.
Is there a faster way to read in data from a file?
First, you should strongly avoid reading a 5GB file into memory as a single step. The memory impact alone should keep you away from this approach. Instead, you should try to take another approach such as:
Process the data as you read it and throw away the data (see the sketch after this list)
Convert the file to a Core Data model prior to work
Convert the file to a fixed-length record format so you can do random-access
Modify the file format so that it is less redundant
Index the file so you can do random-access
Partition the data into separate files
Memory map the file using NSFileWrapper (far from a panacea, but can be useful in conjunction with the above; NSFileWrapper automatically does memory mapping)
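For the first option, here is a minimal read-and-discard sketch (in Java for illustration; the same shape applies to a fread/fscanf loop in C, and the buffer size and line-oriented format are assumptions):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StreamingScan {
    public static void main(String[] args) throws IOException {
        long lines = 0, chars = 0;
        // Read with a large buffer, keep only running aggregates,
        // and let each line become garbage as soon as it has been processed.
        try (BufferedReader r = new BufferedReader(new FileReader("huge.txt"), 1 << 20)) {
            String line;
            while ((line = r.readLine()) != null) {
                lines++;
                chars += line.length();
                // ... parse the fields you need and update your aggregates here ...
            }
        }
        System.out.println(lines + " lines, " + chars + " chars");
    }
}
```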
You should start by getting a performance baseline:
time cat thebigfile.dat > /dev/null
It is hard to imagine reading the file much faster than that, so that's your floor.
You should definitely do some performance analysis in Instruments and make sure the problem is the reading and not the processing. In particular, memory allocation can be more expensive than you may expect, particularly in a multi-threaded app.
Once you've investigated the above, and you still need really fast management of on-disk data, look at dispatch_io and dispatch_data. This is a really awesome tool for high-speed data management. But it is almost always better to improve your basic algorithms first before worrying about this kind of optimization.
I need an on-disk key-value store, not too big or distributed. The use case is as follows:
The full DB will be few Gbs in size
Both key and value are of constant size
It's a constant database. Once the entire database is written, I don't need to write any more entries (or only very infrequently)
Keys will be accessed in unpredictable order
Supporting concurrent reads by multiple processes is a must.
Have to be very fast because the readers will be accessing millions of keys in a tight loop. So it should be as close as possible to being as performant as looping over an associative array (STL's std::map say)
Ideally it should allow one to set how much RAM to use; typically it should use a few hundred MB
Written in C or C++. An existing Python extension will be a big plus, but I can add that on my own
So cdb and gdbm look like good choices, but just wanted to know if there are more suitable choices. Pointers to relevant benchmarks or even relevant anecdotal evidence will be appreciated.
What database did you end up using?
If you like cdb and you need a > 4 GB database, please have a look at mcdb, which is originally based on cdb, plus some performance enhancements and support for constant databases larger than 4 GB.
https://github.com/gstrauss/mcdb/
Python, Perl, Lua, and Ruby extensions are provided. mcdb is written in C and uses mmap under the hood and so easily supports lock-free concurrent reads between threads and between processes. Since it is backed by a memory-mapped file, pages are mapped in from disk as needed and memory is effectively constant even as the number of processes accessing the database increases.
Have you looked at bdb? It sounds like a good use of BDB.
I like hamsterdb because I wrote it :)
http://www.hamsterdb.com
frequently used with database sizes of several GBs
keys/values can have any size you wish
random access + directional access (with cursors)
concurrent reads: hamsterdb is thread safe, but not yet concurrent. I'm working on this.
if your cache is big enough then access will be very fast (you can specify the cache size)
written in c++
python extension is available, but terribly outdated; will need fixes
if you want to evaluate hamsterdb and need some help then feel free to drop me a mail.
I am working on an analysis tool that reads output from a process and continuously converts this to an internal format. After the "logging phase" is complete, analysis is done on the data. The data is all held in memory.
However, because all logged information is held in memory, there is a limit on the duration of the logging. For most use cases this is fine, but it should be possible to run for longer, even if this hurts performance.
Ideally, the program should be able to start using hard drive space in addition to RAM once the RAM usage reaches a certain limit.
This leads to my question:
Are there any existing solutions for doing this? It has to work on both Unix and Windows.
To use the disk after memory is full, we use caching technologies such as Ehcache. They can be configured with the amount of memory to use and to overflow to disk.
They also have smarter algorithms you can configure as needed, such as sending to disk data that hasn't been used in the last 10 minutes. This could be a plus for you.
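For example, with Ehcache 3's builder API the tiering looks roughly like this (cache name, types, and sizes are illustrative; check the current Ehcache documentation for the exact signatures):

```java
import org.ehcache.Cache;
import org.ehcache.CacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;
import org.ehcache.config.units.MemoryUnit;

public class OverflowCache {
    public static void main(String[] args) {
        // A cache that keeps up to 256 MB on the heap and spills the rest to a 4 GB disk tier.
        CacheManager manager = CacheManagerBuilder.newCacheManagerBuilder()
                .with(CacheManagerBuilder.persistence("/tmp/log-cache"))
                .withCache("logEntries",
                        CacheConfigurationBuilder.newCacheConfigurationBuilder(
                                Long.class, String.class,
                                ResourcePoolsBuilder.newResourcePoolsBuilder()
                                        .heap(256, MemoryUnit.MB)
                                        .disk(4, MemoryUnit.GB)))
                .build(true);

        Cache<Long, String> cache = manager.getCache("logEntries", Long.class, String.class);
        cache.put(1L, "first logged record");
        System.out.println(cache.get(1L));
        manager.close();
    }
}
```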
Without knowing more about your application it is not possible to provide a perfect answer. However it does sound a bit like you are re-inventing the wheel. Have you considered using an in-process database library like sqlite?
If you used that or something similar, it will take care of moving the data to and from disk and memory, and give you powerful SQL query capabilities at the same time. Even if your logging data is in a custom format, if each item has a key or index of some kind, a small lightweight database may be a good fit.
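A sketch of that approach with SQLite over JDBC (assuming the xerial sqlite-jdbc driver is on the classpath; the table and column names are made up):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class LogStore {
    public static void main(String[] args) throws SQLException {
        // SQLite keeps hot pages in its own cache and leaves the rest on disk,
        // so memory use stays bounded no matter how long the logging runs.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:capture.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS log(ts INTEGER, payload BLOB)");
            }
            conn.setAutoCommit(false);   // batch inserts for throughput
            try (PreparedStatement ins =
                         conn.prepareStatement("INSERT INTO log(ts, payload) VALUES (?, ?)")) {
                for (int i = 0; i < 1000; i++) {   // stand-in for the logging loop
                    ins.setLong(1, System.nanoTime());
                    ins.setBytes(2, new byte[]{1, 2, 3});
                    ins.addBatch();
                }
                ins.executeBatch();
            }
            conn.commit();
        }
    }
}
```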
This might seem too obvious, but what about memory mapped files? This does what you want and even allows a 32 bit application to use much more than 4GB of memory. The principle is simple, you allocate the memory you need (on disk) and then map just a portion of that into system memory. You could, for example, map something like 75% of the available physical memory size. Then work on it, and when you need another portion of the data, just re-map. The downside to this is that you have to do the mapping manually, but that's not necessarily bad. The good thing is that you can use more data than what fits into physical memory and into the per-process memory limit. It works really great if you actually use only part of the data at any given time.
There may be libraries that do this automatically, like the one KLE suggested (though I do not know that one). Doing it manually means you'll learn a lot about it and have more control, though I'd prefer a library if it does exactly what you want with regard to how and when the disk is being used.
This works similarly on both Windows and Unix. For Windows, here is an article by Raymond Chen that shows a simple example.
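A sketch of the manual windowing on the JVM (the 512 MB window and file name are arbitrary; a single JVM mapping tops out at 2 GB, so large files get handled as windows like this anyway):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WindowedData {
    static final long WINDOW = 512L * 1024 * 1024;   // map 512 MB of the data set at a time

    private final FileChannel channel;
    private MappedByteBuffer window;
    private long windowStart = -1;

    WindowedData(String path) throws IOException {
        channel = FileChannel.open(Path.of(path),
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    // Return the byte at an absolute file offset, re-mapping the window if needed.
    byte byteAt(long offset) throws IOException {
        long start = (offset / WINDOW) * WINDOW;
        if (start != windowStart) {   // re-map: drop the old window, map the new one
            long len = Math.min(WINDOW, channel.size() - start);
            window = channel.map(FileChannel.MapMode.READ_WRITE, start, len);
            windowStart = start;
        }
        return window.get((int) (offset - windowStart));
    }
}
```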
I need to store large amounts of data on-disk in approximately 1k blocks. I will be accessing these objects in a way that is hard to predict, but where patterns probably exist.
Is there an algorithm or heuristic I can use that will rearrange the objects on disk based on my access patterns to try to maximize sequential access, and thus minimize disk seek time?
On modern OSes (Windows, Linux, etc) there is absolutely nothing you can do to optimise seek times! Here's why:
You are in a pre-emptive multitasking system. Your application and all its data can be flushed to disk at any time - the user switches task, the screen saver kicks in, the battery runs out of charge, etc.
You cannot guarantee that the file is contiguous on disk. Doing Aaron's first bullet point will not ensure an unfragmented file. When you start writing the file, the OS doesn't know how big the file is going to be so it could put it in a small space, fragmenting it as you write more data to it.
Memory mapping the file only works as long as the file size is less than the available address range in your application. On Win32, the amount of address space available is about 2 GB minus the memory used by the application. Mapping larger files usually involves un-mapping and re-mapping portions of the file, which won't be the best of things to do.
Putting data in the centre of the file is no help as, for all you know, the central portion of the file could be the most fragmented bit.
To paraphrase Raymond Chen, if you have to ask about OS limits, you're probably doing something wrong. Treat your filesystem as an immutable black box, it just is what it is (I know, you can use RAID and so on to help).
The first step you must take (and must be taken whenever you're optimising) is to measure what you've currently got. Never assume anything. Verify everything with hard data.
From your post, it sounds like you haven't actually written any code yet, or, if you have, there is no performance problem at the moment.
The only real solution is to look at the bigger picture and develop methods to get data off the disk without stalling the application. This would usually be through asynchronous access and speculative loading. If your application is always accessing the disk and doing work with small subsets of the data, you may want to consider reorganising the data to put all the useful stuff in one place and the other data elsewhere. Without knowing the full problem domain it's not possible to be really helpful.
Depending on what you mean by "hard to predict", I can think of a few options:
If you always seek based on the same block field/property, store the records on disk sorted by that field. This lets you use binary search for O(log n) efficiency (see the sketch after this list).
If you seek on different block fields, consider storing an external index for each field. A b-tree gives you O(log n) efficiency. When you seek, grab the appropriate index, search it for your block's data file address and jump to it.
Better yet, if your blocks are homogeneous, consider breaking them down into database records. A database gives you optimized storage, indexing, and the ability to perform advanced queries for free.
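For the first option, a sketch of binary search over fixed-size on-disk records (the 1024-byte block size follows the question; the assumption that the sort key is a long in each block's first 8 bytes is mine):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class BlockIndex {
    static final int BLOCK_SIZE = 1024;   // ~1k blocks, as in the question

    // Binary search a file of blocks sorted by a long key stored in each block's first 8 bytes.
    // Returns the matching block's file offset, or -1 if the key is not present.
    static long findBlock(RandomAccessFile file, long key) throws IOException {
        long lo = 0, hi = file.length() / BLOCK_SIZE - 1;
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            file.seek(mid * BLOCK_SIZE);
            long midKey = file.readLong();
            if (midKey == key) return mid * BLOCK_SIZE;   // O(log n) seeks instead of a scan
            if (midKey < key)  lo = mid + 1;
            else               hi = mid - 1;
        }
        return -1;
    }
}
```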
Use memory-mapped file access rather than the usual open-seek-read/write pattern. This technique works on Windows and Unix platforms.
In this way the operating system's virtual memory system will handle the caching for you. Accesses of blocks that are already in memory will result in no disk seek or read time. Writes from memory back to disk are handled automatically and efficiently and without blocking your application.
Aaron's notes are good too as they will affect initial-load time for a chunk that's not in memory. Combine that with the memory-mapped technique -- after all it's easier to reorder chunks using memcpy() than by reading/writing from disk and attempting swapouts etc.
The simplest way to solve this is to use an OS which solves it for you under the hood, like Linux. Give it enough RAM to hold 10% of the objects and it will try to keep as many of them in the cache as possible, reducing the load time to zero. The recent server versions of Windows might work too (some of them didn't for me, which is why I'm mentioning this).
If this is a no go, try this algorithm:
Create a very big file on the hard disk. It is very important that you write this in one go so the OS will allocate a contiguous space on disk.
Write all your objects into that file. Make sure that each object is the same size (or give each the same space in the file and note the length in the first few bytes of each chunk). Use an empty hard disk or a disk which has just been defragmented.
In a data structure, keep the offsets of each data chunk and how often it is accessed. When a chunk is accessed very often, swap its position in the file with a chunk that is closer to the start of the file and has a lower access count.
[EDIT] Access this file with the memory-mapped API of your OS to allow the OS to effectively cache the most used parts to get best performance until you can optimize the file layout next time.
Over time, heavily accessed chunks will bubble to the top. Note that you can collect the access patterns over some time, analyze them, and do the reordering overnight when there is little load on your machine. Or you can do the reordering on a completely different machine and swap in the file (and the offset table) when that's done.
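A sketch of the bubble-up step from the algorithm above (the chunk size, the offset/access-count tables, and doing the swap through a RandomAccessFile are all illustrative choices):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class ChunkReorder {
    static final int CHUNK_SIZE = 1024;

    // chunkAt[slot] = id of the chunk stored in that file slot; counts[id] = its access count.
    // Call after an access: if this chunk is now hotter than the one just before it in the
    // file, swap the two so that hot chunks gradually bubble toward the front.
    static void bubbleUp(RandomAccessFile file, int[] chunkAt, long[] counts, int slot)
            throws IOException {
        if (slot == 0) return;
        int here = chunkAt[slot], ahead = chunkAt[slot - 1];
        if (counts[here] <= counts[ahead]) return;

        byte[] hot = new byte[CHUNK_SIZE], cold = new byte[CHUNK_SIZE];
        file.seek((long) (slot - 1) * CHUNK_SIZE); file.readFully(cold);
        file.seek((long) slot * CHUNK_SIZE);       file.readFully(hot);
        file.seek((long) (slot - 1) * CHUNK_SIZE); file.write(hot);
        file.seek((long) slot * CHUNK_SIZE);       file.write(cold);

        chunkAt[slot - 1] = here;   // keep the offset table in sync with the file
        chunkAt[slot] = ahead;
    }
}
```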
That said, you should really rely on a modern OS where a lot of clever people have thought long and hard to solve these issues for you.
That's an interesting challenge. Unfortunately, I don't know how to solve this out of the box, either. Corbin's approach sounds reasonable to me.
Here's a little optimization suggestion, at least: place the most-accessed items at the center of your disk (or unfragmented file), not at the start or end. That way, seeks to lesser-used data will be shorter on average. Err, that's pretty obvious, though.
Please let us know if you figure out a solution yourself.