How do you save a large amount of data (a matrix) using SAGE?

I'm trying to save a 'big' rational matrix in SAGE, but I'm running into problems. My matrix A is 5 x 10,000, and each entry is a rational number in fraction form whose numerator and denominator together run to more than 10 pages of digits. After computing A, I run the following command:
save(A, DATA + 'A')
This gives me the following error message:
Traceback (most recent call last):
...
RuntimeError: Segmentation fault.
I tried the same save command with a 'smaller' matrix and that worked fine. I should also note that I'm using a laptop with a 64-bit operating system, an x64-based i7 CPU @ 2.40 GHz, Windows 8, and 8 GB RAM. I'm running SAGE on a virtual machine to which I allocated 5237 MB. Let me know if you need further information. My questions are:
Why can't I save my matrix? Why do I get the above error message? What does it mean?
How can I save my matrix A using this command? Is there any other way I can save it?
I have asked these same questions in another forum that deals specifically with SAGE, but I'm not getting an answer there. I have also spent a lot of time searching online about this question, but haven't seen anyone with this same problem.
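One workaround worth trying is a minimal sketch along these lines, assuming the segmentation fault happens while pickling the whole matrix at once: save A row by row and reassemble it on load. The A_row_%d filenames are made up for illustration; save(), load(), A.row(), A.nrows(), and matrix() are standard Sage functions.
for i in range(A.nrows()):
    save(A.row(i), DATA + 'A_row_%d' % i)  # one .sobj file per row
# Later, reassemble (the question says A has 5 rows):
rows = [load(DATA + 'A_row_%d' % i) for i in range(5)]
A_restored = matrix(QQ, rows)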

Related

Julia drawing from standard normal distribution

I need to draw 53,000,000 observations from a standard normal distribution. My current code takes a long time to run in Julia (in fact, it's been running for the past twenty minutes), and I'm wondering if there's anything I can do to speed it up. Here's what I tried:
using Distributions
d = Normal()
shock = rand(d, 1, 53000000)
The code runs instantaneously when I execute it in the REPL (I am working in Juno/Atom), but lags at this point (drawing from the standard normal) when I step through using the debugger. So I think the debugger may be the real culprit here.
It may be that the roughly half gigabyte of memory used by the allocation of the variable shock (53,000,000 Float64 values at 8 bytes each is about 424 MB) is sometimes causing swapping when the debugger is loaded.
Try running this to see, in the debugger:
using Distributions, Base.Sys
println("Free memory is $(Int(Sys.free_memory()))")  # bytes free before allocating
d = Normal()
shock = rand(d, 1, 53000000)  # 1 x 53,000,000 matrix of Float64 draws
println("shock uses $(sizeof(shock)) bytes.")
println("Free memory is $(Int(Sys.free_memory()))")  # bytes free after allocating
Are you close to running out of memory?

H2O - Not seeing much speed-up after moving to a more powerful machine

I am running a Python program that calls H2O for deep learning (training and testing). The program runs in a loop of 20 iterations, and each iteration calls H2ODeepLearningEstimator() 4 times along with the associated predict() and model_performance() calls. I am doing h2o.remove_all() and cleaning up all data-related Python objects after each iteration.
Data size: training set of 80,000 rows with 122 features (all float), with 20% used for validation (10-fold CV); test set of 20,000 rows. Doing binary classification.
Machine 1: Windows 7, 4 core, Xeon, each core 3.5GHz, Memory 32 GB
Takes about 24 hours to complete
Machine 2: CentOS 7, 20 core, Xeon, each core 2.0GHz, Memory 128 GB
Takes about 17 hours to complete
I am using h2o.init(nthreads=-1, max_mem_size = 96)
So, the speed-up is not that much.
My questions:
1) Is the speed-up typical?
2) What can I do to achieve substantial speed-up?
2.1) Will adding more cores help?
2.2) Are there any H2O configuration or tips that I am missing?
Thanks very much.
- Mohammad,
Graduate student
If the training time is the main effort, and you have enough memory, then the speed-up will be proportional to cores times core-speed. So, you might have expected a (20 cores x 2.0 GHz) / (4 cores x 3.5 GHz) = 40/14 ≈ 2.86 speed-up (i.e. your 24 hrs coming down to the 8-10 hour range).
There is a typo in your h2o.init(): 96 should be "96g". However, I think that was a typo when writing the question, as h2o.init() would return an error message. (And H2O would fail to start if you'd tried "96", with the quotes but without the "g".)
You didn't show your h2o.deeplearning() command, but I am guessing you are using early stopping. And that can be unpredictable. So, what might have happened is that your first 24hr run did, say, 1000 epochs, but your second 17hr run did 2000 epochs. (1000 vs. 2000 would be quite an extreme difference, though.)
It might be that you are spending too much time scoring. If you've not touched the defaults, this is unlikely. But you could experiment with train_samples_per_iteration (e.g. set it to 10 times the number of your training rows), as in the sketch below.
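A minimal sketch of that setting (the import path and parameter name are per the h2o-py API; everything else is a placeholder, not from the question):
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
# With 80,000 training rows, 800,000 samples per iteration means
# scoring roughly once every 10 epochs instead of after every pass.
model = H2ODeepLearningEstimator(train_samples_per_iteration=800000)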
What can I do to achieve substantial speed-up?
Stop using cross-validation. That might be a bit controversial, but personally I think 80,000 training rows is going to be enough to do an 80%/10%/10% split into train/valid/test. That will be 5-10 times quicker; a sketch follows at the end of this answer.
If it is for a paper, and you want to show more confidence in the results, once you have your final model, and you've checked that test score is close to valid score, then rebuild it a couple of times using a different seed for the 80/10/10 split, and confirm you end up with the same metrics. (*)
*: By the way, take a look at the score for each of the 10 cv models you've already made; if they are fairly close to each other, then this approach should work well. If they are all over the place, you might have to re-consider the train/valid/test splits - or just think about what it is in your data that might be causing that sensitivity.
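A minimal sketch of that 80/10/10 approach in h2o-py (df, features, target, and the seed value are placeholders, not from the question):
# split_frame() sizes the first two frames by the given ratios;
# the remainder (~10%) becomes the third frame.
train, valid, test = df.split_frame(ratios=[0.8, 0.1], seed=42)
model.train(x=features, y=target, training_frame=train, validation_frame=valid)
performance = model.model_performance(test)
Rebuilding with a different seed, as suggested above, then amounts to re-running these three lines.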

Why is the execution time of tf.nn.conv2d different when the multiply counts are the same?

I am using TensorFlow to build a CNN for an image classification experiment, and I found the following phenomenon:
Operation 1: tf.nn.conv2d(x, w, strides=[1,1,1,1], padding='SAME')
Here the shape of x is [128,128,32] and the filter w has shape [3,3,32,32], meaning a convolution with a 3x3 kernel on x where both input and output channels are 32, so the total multiply count is
3*3*32*32*128*128 = 150994944
Operation 2: tf.nn.conv2d(x, w, strides=[1,1,1,1], padding='SAME')
Here the shape of x is [64,64,64] and the filter w has shape [3,3,64,64], meaning a convolution with a 3x3 kernel on x where both input and output channels are 64, so the total multiply count is
3*3*64*64*64*64 = 150994944
Compared with operation 1, the feature map of operation 2 is scaled down to 1/2 in each dimension and the channel count is doubled. The multiply counts are the same, so the running times should be the same. But in practice, the running time of operation 1 is longer than that of operation 2.
My measurement method is shown below:
Eliminating one convolution of operation 1 reduced the training time for one epoch by 23 seconds, which means the running time of operation 1 is 23 seconds.
Eliminating one convolution of operation 2 reduced the training time for one epoch by 13 seconds, which means the running time of operation 2 is 13 seconds.
The phenomenon reproduces every time.
My GPU is an NVIDIA GTX 980 Ti; the OS is Ubuntu 16.04.
So the question is: why is the running time of operation 1 longer than that of operation 2?
If I had to guess, it has to do with how the image is ordered in memory. Remember that in memory everything is stored in a flattened format. This means that if you have a tensor of shape [128, 128, 32], the 32 features/channels are stored next to each other, then all of the rows, then all of the columns. https://en.wikipedia.org/wiki/Row-major_order
Accessing closely packed memory is very important to performance, especially on a GPU, which has a large memory bus and is optimized for aligned, in-order memory access. With the larger image you have to skip around the image more, and the memory access is more out of order. In case 2 you can do more in-order memory access, which gives you more speed. Multiplications are very fast operations. I bet that with a convolution, memory access is the bottleneck that limits performance.
chasep255's answer is good and probably correct.
Another possibility (or an alternative way of thinking about chasep255's answer) is to consider how caching (all the little hardware tricks that can speed up memory fetches, address mapping, etc.) could be producing what you see...
You have basically two things: a stream of X input data and a static filter matrix. In case 1, you have 9*1024 static elements; in case 2 you have 4 times as many. Both cases have the same total multiplication count, but in case 2 the process is finding more of its data where it expects it (i.e. where it was the last time it was asked for). Net result: fewer memory access stalls, more speed.
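One way to check the timing claim directly is to benchmark the two operations in isolation. A minimal sketch, assuming TensorFlow 2 eager mode (the question predates TF 2, so treat this as illustrative rather than a reproduction of the original setup):
import time
import tensorflow as tf

# Random inputs and filters matching the two operations (NHWC layout, batch size 1).
x1 = tf.random.normal([1, 128, 128, 32])
w1 = tf.random.normal([3, 3, 32, 32])
x2 = tf.random.normal([1, 64, 64, 64])
w2 = tf.random.normal([3, 3, 64, 64])

def bench(x, w, n=100):
    tf.nn.conv2d(x, w, strides=1, padding='SAME').numpy()  # warm-up run
    start = time.perf_counter()
    for _ in range(n):
        # .numpy() forces the GPU work to finish before the clock stops
        tf.nn.conv2d(x, w, strides=1, padding='SAME').numpy()
    return (time.perf_counter() - start) / n

print('operation 1:', bench(x1, w1))
print('operation 2:', bench(x2, w2))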

Reading matrix from file takes too much RAM

I am reading a matrix from a file using readdlm. The file is about 400 MB in size. My PC has 8 GB of RAM. When I try to readdlm the matrix from this file, my PC eventually freezes, while the RAM consumption goes up until it consumes everything. The matrix is simply a 0, 1 matrix.
I don't understand why this happens. Storing this matrix in memory shouldn't take more than the 400 MB necessary to store the file.
What can I do?
The code I am using is simple:
readdlm("data.txt")
where data.txt is a 400 MB text file of tab-separated values. I am on Linux Mint 17.3, Julia 0.4.
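One missing piece of the 400 MB estimate: readdlm on an all-numeric file produces a dense Float64 matrix, so each one-character entry in the file becomes an 8-byte float in memory. A rough back-of-the-envelope calculation (shown in Python for illustration; the assumption of one digit plus one separator per entry is mine):
file_bytes = 400 * 1024**2                     # ~400 MB of text
bytes_per_text_entry = 2                       # "0" or "1" plus a tab or newline
entries = file_bytes // bytes_per_text_entry   # ~210 million matrix entries
dense_bytes = entries * 8                      # 8 bytes per Float64
print(dense_bytes / 1024**3)                   # ~1.56 GB before any parsing overhead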

Examining Erlang crash dumps - how to account for all memory?

I've been poring over this Erlang crash dump where the VM has run out of heap memory. The problem is that there is no obvious culprit allocating all that memory.
Using some serious black awk magic I've summed up the fields Stack+heap, OldHeap, Heap unused, and OldHeap unused for each process and ranked the processes by memory usage. The problem is that this number doesn't come even close to processes_used, the number representing the total memory used by all processes according to the Erlang crash dump guide.
I've already tried the Crashdump Viewer and either I'm missing something or there isn't much help there for my kind of problem.
The number I get is 525 MB whereas the processes_used value is at 1348 MB. Where can I find the rest of the memory?
Edit: Heap unused and OldHeap unused shouldn't have been included, since they are sub-parts of Stack+heap and OldHeap. That, plus the fact that the numbers displayed for Stack+heap and OldHeap are listed as numbers of words, not bytes, was the problem.
There is a module called crashdump_viewer which is great for these kinds of analyses.
Another thing to keep in mind is that Stack+heap is, as far as I know, in words, not bytes, which means that you have to multiply Stack+heap by 4 on 32-bit and by 8 on 64-bit systems. I can't find a reference in the manual for this, but the Processes section talks about it a bit.
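For reference, a sketch of that bookkeeping with the word-size correction applied (in Python rather than awk; it assumes a 64-bit VM and per-process lines of the form "Stack+heap: N" and "OldHeap: N" in the dump file):
WORD_SIZE = 8  # bytes per word on a 64-bit VM (4 on 32-bit)

total_words = 0
with open('erl_crash.dump') as dump:
    for line in dump:
        # Count only Stack+heap and OldHeap; the "unused" fields are sub-parts of these.
        if line.startswith('Stack+heap:') or line.startswith('OldHeap:'):
            total_words += int(line.split(':')[1])

print('processes total: %.0f MB' % (total_words * WORD_SIZE / 1024**2))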
