Reading a matrix from a file takes too much RAM

I am reading a matrix from a file using readdlm. The file is about 400 MB in size. My PC has 8 GB of RAM. When I try to readdlm the matrix from this file, my PC eventually freezes, while RAM consumption climbs until everything is consumed. The matrix contains only 0s and 1s.
I don't understand why this happens. Storing this matrix in memory shouldn't take more than the 400 MB necessary to store the file.
What can I do?
The code I am using is simple:
readdlm("data.txt")
where data.txt is a 400 MB text file of tab-separated values. I am on Linux Mint 17.3, Julia 0.4.

Related

How do I increase memory limit (contiguous as well as overall) in Matlab r2012b?

I am using Matlab r2012b on win7 32-bit with 4GB RAM.
However, the memory limit on the Matlab process is pretty low. Running the memory command, I get the following output:
Maximum possible array: 385 MB (4.038e+08 bytes) *
Memory available for all arrays: 1281 MB (1.343e+09 bytes) **
Memory used by MATLAB: 421 MB (4.413e+08 bytes)
Physical Memory (RAM): 3496 MB (3.666e+09 bytes)
* Limited by contiguous virtual address space available.
** Limited by virtual address space available.
I need to increase the limit to as much as possible.
System: Windows 7 32 bit
RAM: 4 GB
Matlab: r2012b
For general guidance with memory management in MATLAB, see this MathWorks article. Some specific suggestions follow.
Set the /3GB switch in the boot.ini to increase the memory available to MATLAB. Or set it with a properties dialog if you are averse to text editors. This is mentioned in this section of the above MathWorks page.
Also use pack to increase the Maximum possible array value by compacting memory. 32-bit MATLAB needs blocks of contiguous free memory, which is where this first value comes from. The pack command saves all the variables, clears the workspace, and reloads them so that they are contiguous in memory.
As for overall memory, try disabling virtual machines, closing other programs, and stopping unnecessary Windows services. There is no easy answer for this part.

How do you save a large amount of data using SAGE?

I'm trying to save a 'big' rational matrix in SAGE, but I'm running into problems. After computing my matrix A, which has size 5 x 10,000 and whose entries are rational numbers in fraction form, with the numerator and denominator of each entry running to more than 10 pages of digits in total, I run the following command:
save(A, DATA + 'A').
This gives me the following error message:
Traceback (most recent call last):
...
RuntimeError: Segmentation fault.
I tried the same save command with a 'smaller' matrix and that worked fine. I should also note that I'm using a laptop with a 64-bit operating system, x64-based processor, Windows 8, i7 CPU @ 2.40 GHz and 8 GB RAM. I'm running SAGE on a virtual machine to which I allocated 5237 MB. Let me know if you need further information. My questions are:
Why can't I save my matrix? Why do I get the above error message? What does it mean?
How can I save my matrix A using this command? Is there any other way I can save it?
I have asked these same questions in another forum which specifically deals with SAGE, but I'm not getting an answer there. I have also spent a lot of time searching online about this question, but haven't seen anyone with this same problem.

Can't use more than about 1.5 GB of my 4 GB RAM for a simple sort

I'm using a circa summer 2007 MacBook Pro (x86-64) with a 32 KB L1 (I think), 4 MB L2, and 4 GB RAM; I'm running OS X 10.6.8.
I'm writing a standard radix sort in C++: it copies from one array to another and back again as it sorts (so the memory used is twice the size of the array). I watch it by printing a '.' every million entries moved.
If the array is at most 750 MB then these dots usually move quite fast; however, if the array is larger then the whole process slows to a crawl. If I radix sort 512 MB in blocks and then attempt to merge sort the blocks, the first block goes fast and then again the process slows to a crawl. That is, my process only seems to be able to use 1.5 GB of RAM for the sort. What is odd is that I have 4 GB of physical RAM.
I tried just allocating an 8 GB array and walking through it writing each byte and printing a '.' every million bytes; it seems that everything starts to slow down around 1.5 GB and stays at that rate even past 4 GB when I know it must be going to disk; so the OS starts writing pages to disk around 1.5 GB.
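That probe is easy to reproduce; below is a minimal C sketch of it, assuming a 64-bit build (the 8 GB size and the dot-per-million-bytes reporting are the figures described above, so shrink the size to fit your machine):

/* probe.c - minimal sketch of the experiment described above:
   allocate a large array, touch every byte, and print a '.' every
   million bytes so the slowdown from paging becomes visible. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t size = 8ULL * 1024 * 1024 * 1024;   /* 8 GB, as in the question */
    char *buf = malloc(size);
    if (buf == NULL) {
        fprintf(stderr, "malloc failed\n");
        return 1;
    }
    for (size_t i = 0; i < size; i++) {
        buf[i] = (char)i;                  /* write each byte */
        if (i % 1000000 == 0) {
            putchar('.');                  /* one dot per million bytes */
            fflush(stdout);
        }
    }
    free(buf);
    return 0;
}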
I want to use my machine to sort large arrays. How do I tell my OS to give my process at least, say, 3.5 GB of RAM? I tried using mlock(), but that just seems to slow things down even more. Ideas?

Fortran array memory management

I am working to optimize a fluid flow and heat transfer analysis program written in Fortran. As I try to run larger and larger mesh simulations, I'm running into memory limitation problems. The mesh, though, is not all that big: only 500,000 cells, which is small peanuts for a typical CFD code. Even when I request 80 GB of memory for my job, it crashes due to insufficient virtual memory.
I have a few guesses at which arrays are hogging all that memory. One in particular is being allocated with dimensions (28801,345600). Correct me if I'm wrong in my calculations, but a double precision value is 8 bytes. So the size of this array would be 28801 * 345600 * 8 bytes = 79.6 GB?
Now, I think that most of this array ends up being zeros throughout the calculation so we don't need to store them. I think I can change the solution algorithm to only store the non-zero values to work on in a much smaller array. However, I want to be sure that I'm looking at the right arrays to reduce in size. So first, did I correctly calculate the array size above? And second, is there a way I can have Fortran show array sizes in MB or GB during runtime? In addition to printing out the most memory intensive arrays, I'd be interested in seeing how the memory requirements of the code are changing during runtime.
Memory usage is a rather vaguely defined concept on systems with virtual memory. You can have a large amount of memory allocated (a large virtual memory size) but only a small part of it actually in active use (a small resident set size, or RSS).
Unix systems provide the getrusage(2) system call, which returns information about the amount of system resources used by the calling thread/process/process children. In particular it provides the maximum RSS reached since the process was started. You can write a simple Fortran-callable helper C function that calls getrusage(2) and returns the value of the ru_maxrss field of the rusage structure.
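A minimal sketch of such a helper might look like the following (the function name is just an example, and the trailing underscore assumes the default name mangling of compilers like gfortran; on Linux, ru_maxrss is reported in kilobytes):

/* rss_helper.c - hypothetical Fortran-callable wrapper around getrusage(2) */
#include <sys/resource.h>

void get_max_rss_kb_(long *max_rss_kb)
{
    struct rusage usage;
    getrusage(RUSAGE_SELF, &usage);
    *max_rss_kb = usage.ru_maxrss;   /* peak resident set size, in kB on Linux */
}

From Fortran it would then be called as something like CALL get_max_rss_kb(rss), with rss declared as an integer wide enough to hold a C long.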
If you are running on Linux and don't care about portability, then you may just open and read from /proc/self/status. It is a simple text pseudofile that among other things contains several lines with statistics about the process virtual memory usage:
...
VmPeak: 9136 kB
VmSize: 7896 kB
VmLck: 0 kB
VmHWM: 7572 kB
VmRSS: 6316 kB
VmData: 5224 kB
VmStk: 88 kB
VmExe: 572 kB
VmLib: 1708 kB
VmPTE: 20 kB
...
The various fields are explained in the proc(5) man page; you are mostly interested in VmData, VmRSS, VmHWM and VmSize. You can open /proc/self/status as a regular file with OPEN() and process it entirely in your Fortran code.
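If you would rather not do the string handling in Fortran, a small C helper in the same spirit as the getrusage one above could do the parsing. Here is a minimal sketch (the function name is just an example; it extracts only VmHWM, the peak RSS, using the field name from the listing above):

/* vm_hwm.c - hypothetical helper: read VmHWM from /proc/self/status (Linux only) */
#include <stdio.h>
#include <string.h>

long vm_hwm_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "VmHWM:", 6) == 0) {   /* e.g. "VmHWM:  7572 kB" */
            sscanf(line + 6, "%ld", &kb);        /* value is in kB */
            break;
        }
    }
    fclose(f);
    return kb;                                    /* -1 if the field was not found */
}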
See also what memory limitations are set with ulimit -a and ulimit -aH. You may be exceeding the hard virtual memory size limit. If you are submitting jobs through a distributed resource manager (e.g. SGE/OGE, Torque/PBS, LSF, etc.) check that you request enough memory for the job.

Random write Vs Seek time

I have a very weird question here...
I am trying to write data randomly to a 100 MB file.
The data size is 4 KB and the random offsets are page-aligned (4 KB).
In total I am writing 1 GB of data at random offsets within the 100 MB file.
If I remove the actual code that writes the data to the disk, the entire operation takes less than a second (say 0.04 sec).
If I keep the code that writes the data, it takes several seconds.
In the case of random writes, what happens internally? Is the cost the seek time or the write time? From the above scenario it's really confusing.
Can anybody explain in depth, please?
With the same procedure applied at sequential offsets, writing is very fast.
Thank you.
If you're writing all over the file, then the disk (I presume this is on a disk) needs to seek to a new place on every write.
Also, the write speed of hard disks isn't particularly stunning.
Say for the sake of example (figures taken from a WD Raptor EL150) that we have a 5.9 ms average seek time. If you are writing 1 GB randomly all over the file in 4 KB chunks, that is 1,000,000,000 ÷ 4,000 = 250,000 writes, for a total seek time of 250,000 × 0.0059 s ≈ 1,475 seconds!
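That estimate is easy to sanity-check in a few lines of C (the drive figures below are the assumed example numbers from above, not measurements):

/* seek_estimate.c - back-of-the-envelope seek overhead for random 4 KB writes */
#include <stdio.h>

int main(void)
{
    const double total_bytes = 1e9;     /* 1 GB of data to write        */
    const double chunk_bytes = 4e3;     /* 4 KB per random write        */
    const double seek_time_s = 0.0059;  /* example average seek time    */

    double writes = total_bytes / chunk_bytes;                        /* 250,000 writes */
    printf("writes: %.0f\n", writes);
    printf("seek overhead: ~%.0f seconds\n", writes * seek_time_s);   /* ~1475 seconds  */
    return 0;
}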
