I was experimenting with a quaternionic neural network when I realized that, even though I close my current Session inside the for loop, my program slows down massively and I get a memory leak caused by ops being constructed. This is my code:
for step in xrange(0, 200):  # num_epochs * train_size // BATCH_SIZE
    with tf.Session() as sess:
        offset = (BATCH_SIZE) % train_size
        # print "Offset : %d" % offset

        batch_data = []
        batch_labels = []
        batch_data.append(qtrain[0][offset:(offset + BATCH_SIZE)])
        batch_labels.append(qtrain_labels[0][offset:(offset + BATCH_SIZE)])

        retour = sess.run(test, feed_dict={x: batch_data})

        test2 = feedForwardStep(retour, W_to_output, b_output)
        # sess.close()
The problem seems to come from test2 = feedForwardStep(...). I need to declare these ops after executing retour once, because retour can't be a placeholder (I need to iterate through it). Without this line, the program runs very well, fast and without a memory leak. I can't understand why TensorFlow seems to be "saving" test2 even though I close the session ...
TL;DR: Closing a session does not free the tf.Graph data structure in your Python program, and if each iteration of the loop adds nodes to the graph, you'll have a leak.
Your function feedForwardStep creates new TensorFlow operations, and you call it within the for loop, so there is a leak in your code, albeit a subtle one.
Unless you specify otherwise (using a with tf.Graph().as_default(): block), all TensorFlow operations are added to a global default graph. This means that every call to tf.constant(), tf.matmul(), tf.Variable() etc. adds objects to a global data structure. There are two ways to avoid this:
Structure your program so that you build the graph once, then use tf.placeholder() ops to feed in different values in each iteration (a sketch of this approach appears after the note below). You mention in your question that this might not be possible.
Explicitly create a new graph in each for loop. This might be necessary if the structure of the graph depends on the data available in the current iteration. You would do this as follows:
for step in xrange(200):
with tf.Graph().as_default(), tf.Session() as sess:
# Remainder of loop body goes here.
Note that in this version, you cannot use Tensor or Operation objects from a previous iteration. (For example, it's not clear from your code snippet where test comes from.)
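For reference, here is a minimal sketch of the first option (build the graph once, feed new values through tf.placeholder()), assuming TensorFlow 1.x-style APIs; the shapes, the model, and the get_next_batch() helper are invented for illustration:

import tensorflow as tf

# Build the graph ONCE, before the loop.
graph = tf.Graph()
with graph.as_default():
    x = tf.placeholder(tf.float32, shape=[None, 784])   # hypothetical shape
    w = tf.Variable(tf.truncated_normal([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    y = tf.matmul(x, w) + b
    init = tf.global_variables_initializer()

with tf.Session(graph=graph) as sess:
    sess.run(init)
    for step in xrange(200):
        batch = get_next_batch(step)                     # hypothetical helper
        result = sess.run(y, feed_dict={x: batch})       # no new ops created here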
I am referring to the popular paper: Correct Methods For Adding Delays To Verilog Behavioral Models by Cummings
http://www.sunburst-design.com/papers/CummingsHDLCON1999_BehavioralDelays_Rev1_1.pdf
always @(a or b or ci)
  #12 {co, sum} = a + b + ci;
From the paper:

[If] the a input changes at time 15 as shown in Figure 3, then if the a, b and ci inputs all change during the next 9ns, the outputs will be updated with the latest values of a, b and ci. This modeling style has just permitted the ci input to propagate a value to the sum and carry outputs after only 3ns instead of the required 12ns propagation delay.
Can anyone tell me why this is true?
We are using a blocking assignment here. So, shouldn't the always block remain inactive from 15 to 27 ns, because the current always block has not completed? But here it remains active, meaning the always block gets triggered whenever a change is noticed.
It helps to reformat the code with a different layout to more accurately capture what is actually happening:
always
  @(a or b or ci)              // line 1
    #12                        // line 2
      {co, sum} = a + b + ci;  // line 3
The first line suspends the always process until it sees a change in one of the listed signals, which happens at time 15.
The second line suspends the process for 12 time units. During this time period, there are changes to a, b, and ci that are ignored.
The third line wakes up at time 27 (15+12) and uses the current values of a, b, and ci and makes a blocking assignment to {co, sum}.
Since this is an always block, after the assignment it loops back to the first line and waits for another change.
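For contrast, the remedy the paper recommends (if I recall its guideline correctly) is an intra-assignment delay on a nonblocking assignment, so the process never blocks and no input change is lost:

always @(a or b or ci)
  {co, sum} <= #12 a + b + ci;  // each input change schedules its own update 12ns later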
I have a for loop with large objects. From trial and error, I have found I can only load a large object once. If I load the object again, I get the error "Error: cannot allocate vector of size *** Mb". I tried to overcome this by removing the object at the end of the for loop, but I am still returned "Error: cannot allocate vector of size 699.2 Mb" at the beginning of the second run of the loop.
My for loop has the following structure:
for (i in 1:22) {
VeryLargeObject <- ...i...
...
.
.
.
...
rm(VeryLargeObject)
}
The VeryLargeObjects range from 2-3 GB. My PC has 16 GB of RAM, 8 cores, 64-bit Win10.
Any solution on how I can manage to complete the for loop?
The error "cannot allocate..." likely comes from the fact that rm() does not immediately free memory. So the first object still occupies RAM when you load the second one. Objects that are not assigned to any name (variable) anymore get garbage collected by R at time points that R decides for itself.
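One cheap thing to try first: force a collection right after rm(), so R frees the memory before the next allocation. A minimal sketch, where load_object() is a made-up stand-in for however you load the data; whether it helps depends on what else still holds references:

for (i in 1:22) {
  VeryLargeObject <- load_object(i)   # hypothetical loader
  # ... work with VeryLargeObject ...
  rm(VeryLargeObject)
  gc()                                # collect now instead of "eventually"
}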
Most remedies come from not loading the entire object into RAM:
If you are working with a matrix, create a filebacked.big.matrix() with the bigmemory package. Write your data into this object using var[...,...] syntax like a normal matrix. Then, in a new R session (and a new R script to preserve reproducibility), you can load this matrix from disk and modify it.
The mmap package uses a similar approach, relying on your operating system's ability to map RAM pages to disk. The data appear to a program as if they were in RAM, but they are read from disk; to improve speed, the operating system takes care of keeping the relevant parts in RAM.
If you work with data frames, you can use packages like fst and feather that enable you to load only parts of your data frame into a variable (a short fst sketch follows this list).
Transfer your data frame into a database like SQLite and then access the database from R. The package dbplyr enables you to treat a database as a tidyverse-style data set; here is the RStudio help page. You can also use raw SQL commands with the package DBI.
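For instance, with fst (a rough sketch; the file and column names are invented):

library(fst)
write_fst(big_df, "big.fst")                               # one-time conversion
part <- read_fst("big.fst", columns = c("id", "value"))    # load only some columns
head_rows <- read_fst("big.fst", from = 1, to = 100000)    # or only some rows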
Another approach is to not write interactively, but to write an R script that processes only one of your objects:
Write an R script named, say, processBigObject.R that gets the file name of your big object from the command line using commandArgs():
#!/usr/bin/env Rscript
#
# Process a big object
#
# Usage: Rscript processBigObject.R <FILENAME>
input_filename <- commandArgs(trailingOnly = TRUE)[1]
output_filename <- commandArgs(trailingOnly = TRUE)[2]
# I'm making up function names here, do what you must for your object
o <- readBigObject(input_filename)
s <- calculateSmallerSummaryOf(o)
writeOutput(s, output_filename)
Then, write a shell script or use system2() to call the script multiple times, with different file names. Because R is terminated after each object, the memory is freed:
system2("Rscript", c("processBigObject.R", "bigObject1.dat", "bigObject1_result.dat"))
system2("Rscript", c("processBigObject.R", "bigObject2.dat", "bigObject2_result.dat"))
system2("Rscript", c("processBigObject.R", "bigObject3.dat", "bigObject3_result.dat"))
...
I am used to doing this:
do
local a
for i=1,1000000 do
a = <some expression>
<...> --do something with a
end
end
instead of
for i=1,1000000 do
local a = <some expression>
<...> --do something with a
end
My reasoning is that creating a local variable 1000000 times is less efficient than creating it just once and reusing it on each iteration.
My question is: is this true, or is there another technical detail I am missing? I am asking because I don't see anyone doing this, but I am not sure whether that is because the advantage is too small or because it is in fact worse. By "better" I mean using less memory and running faster.
Like any performance question, measure first.
In a unix system you can use time:
time lua -e 'local a; for i=1,100000000 do a = i * 3 end'
time lua -e 'for i=1,100000000 do local a = i * 3 end'
the output:
real 0m2.320s
user 0m2.315s
sys 0m0.004s
real 0m2.247s
user 0m2.246s
sys 0m0.000s
The more local version appears to be a small percentage faster in Lua, since it does not initialize a to nil. However, that speed difference is no reason to choose it: use the most local scope because it is more readable (this is good style in all languages; see this question asked for C, Java, and C#).
If you are reusing a table instead of creating it inside the loop, then there is likely a more significant performance difference (a sketch follows). In any case, measure, and favour readability whenever you can.
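For example, a sketch of that table reuse, where compute() and consume() are made-up placeholders:

local t = {}                  -- allocate the table once
for i = 1, 1000000 do
  t.x, t.y = compute(i)       -- overwrite fields instead of building {x = ..., y = ...}
  consume(t)
end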
I think there's some confusion about the way compilers deal with variables. From a high-level, human perspective, it feels natural to think of defining and destroying a variable as having some kind of "cost" associated with it.
Yet that's not necessarily the case for the optimizing compiler. The variables you create in a high-level language are more like temporary "handles" into memory. The compiler looks at those variables, translates them into an intermediate representation (something closer to the machine), and figures out where to store everything, predominantly with the goal of allocating registers (the most immediate form of memory for the CPU to use). Then it translates the IR into machine code, where the idea of a "variable" doesn't even exist, only places to store data (registers, cache, DRAM, disk).
This process includes reusing the same registers for multiple variables provided that they do not interfere with each other (provided that they are not needed simultaneously: not "live" at the same time).
Put another way, with code like:
local a = <some expression>
The resulting assembly could be something like:
load gp_register, <result from expression>
... or it may already have the result from some expression in a register, and the variable ends up disappearing completely (just using that same register for it).
... which means there's no "cost" to the existence of the variable. It just translates directly to a register which is always available. There's no "cost" to "creating a register", as registers are always there.
When you start creating variables at a broader (less local) scope, contrary to what you think, you may actually slow down the code. When you do this superficially, you're kind of fighting against the compiler's register allocation, and making it harder for the compiler to figure out what registers to allocate for what. In that case, the compiler might spill more variables into the stack which is less efficient and actually has a cost attached. A smart compiler may still emit equally-efficient code, but you could actually make things slower. Helping the compiler here often means more local variables used in smaller scopes where you have the best chance for efficiency.
In assembly code, reusing the same registers whenever you can is efficient to avoid stack spills. In high-level languages with variables, it's kind of the opposite. Reducing the scope of variables helps the compiler figure out which registers it can reuse because using a more local scope for variables helps inform the compiler which variables aren't live simultaneously.
Now there are exceptions when you start involving user-defined constructor and destructor logic in languages like C++ where reusing an object might prevent redundant construction and destruction of an object that can be reused. But that doesn't apply in a language like Lua, where all variables are basically plain old data (or handles into garbage-collected data or userdata).
The only case where you might see an improvement using less local variables is if that somehow reduces work for the garbage collector. But that's not going to be the case if you simply re-assign to the same variable. To do that, you would have to reuse whole tables or user data (without re-assigning). Put another way, reusing the same fields of a table without recreating a whole new one might help in some cases, but reusing the variable used to reference the table is very unlikely to help and could actually hinder performance.
All local variables are "created" at compile (load) time and are simply indexes into the function activation record's locals block. Each time you define a local, that block grows by 1; each time a do..end/lexical block ends, it shrinks back. The peak value is used as the total size:
function ()
local a -- current:1, peak:1
do
local x -- current:2, peak:2
local y -- current:3, peak:3
end
-- current:1, peak:3
do
local z -- current:2, peak:3
end
end
The above function has 3 local slots (determined at load, not at runtime).
Regarding your case, there is no difference in locals-block size; moreover, luac/5.1 generates identical listings (only the register indexes change):
$ luac -l -
local a; for i=1,100000000 do a = i * 3 end
^D
main <stdin:0,0> (7 instructions, 28 bytes at 0x7fee6b600000)
0+ params, 5 slots, 0 upvalues, 5 locals, 3 constants, 0 functions
1 [1] LOADK 1 -1 ; 1
2 [1] LOADK 2 -2 ; 100000000
3 [1] LOADK 3 -1 ; 1
4 [1] FORPREP 1 1 ; to 6
5 [1] MUL 0 4 -3 ; - 3 // [0] is a
6 [1] FORLOOP 1 -2 ; to 5
7 [1] RETURN 0 1
vs
$ luac -l -
for i=1,100000000 do local a = i * 3 end
^D
main <stdin:0,0> (7 instructions, 28 bytes at 0x7f8302d00020)
0+ params, 5 slots, 0 upvalues, 5 locals, 3 constants, 0 functions
1 [1] LOADK 0 -1 ; 1
2 [1] LOADK 1 -2 ; 100000000
3 [1] LOADK 2 -1 ; 1
4 [1] FORPREP 0 1 ; to 6
5 [1] MUL 4 3 -3 ; - 3 // [4] is a
6 [1] FORLOOP 0 -2 ; to 5
7 [1] RETURN 0 1
// [n]-comments are mine.
First note this: defining the variable inside the loop ensures that after one iteration of the loop, the next iteration cannot use that same stored value again. Defining it before the for loop makes it possible to carry a value through multiple iterations, like any other variable not defined within the loop.
Further, to answer your question: yes, it is less efficient, because it re-initializes the variable. If the Lua JIT/compiler has good pattern recognition, it may simply reset the variable instead, but I can neither confirm nor deny that.
The Fortran program I am working on encounters a runtime error when processing an input file.
At line 182 of file ../SOURCE_FILE.f90 (unit = 1, file = 'INPUT_FILE.1')
Fortran runtime error: Bad value during integer read
Looking to line 182 I see a READ statement with an implicit/implied DO loop:
182: READ(IT4, 310 )((IPPRM2(IP,I),IP=1,NP),I=1,16) ! read 6 integers
183: READ(IT4, 320 )((PPARM2(IP,I),IP=1,NP),I=1,14) ! read 5 reals
Format statement:
310 FORMAT(1X,6I12)
When I reach this code in the debugger, NP has a value of 2, I has a value of 6, and IP has a value of 67. I think I and IP should be reinitialized in the loop.
My problem is that when I try to step through in the debugger, once I get to the READ statement it seems to execute and then throw the error, so I'm not sure how to follow it as it reads. I tried stepping into the function, but that seems like a difficult route to take since I am unfamiliar with the gfortran library. The input file looks OK; I think it should be read just fine. This makes me think the READ statement isn't looping as intended.
I am completely new to Fortran and implied DO loops like this, but from what I can gather, line 182 should read 6 integers according to format statement 310. However, when I arrive NP has a value of 2, which makes me think it will only try to read 2 integers, 16 times.
How can I debug this READ statement to examine the values read into IPPRM2 as they are read from the file? Will I have to step through the Fortran library?
Any tips that can clear up my confusion regarding these implicit loops would be appreciated!
Thanks!
NOTE: I'm using gfortran/gcc and gdb on Linux.
Is there any reason you need specific formatting on the read? I would use READ(IT4, *) where feasible...
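That is, keep the same implied DO loops and let list-directed input parse the values:

READ(IT4, *) ((IPPRM2(IP,I), IP=1,NP), I=1,16)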
Later versions of gfortran support unlimited format reads (see link http://fortranwiki.org/fortran/show/Fortran+2008+status)
Then it may be helpful to specify
310 FORMAT(*(1X,6I12))
Or for older compilers
310 FORMAT(1000(1X,6I12))
The variables IP and I are loop indices, so they are reinitialized by the loop. With NP=2, the first statement is going to read a total of 32 integers: the implied DO loops determine the list of items to read, and the format determines how they are read. With "1X,6I12" they will be read as 6 integers per line of the input file. When the first 6 of the requested 32 integers have been read from a line/record, Fortran will consider that line/record completed and advance to the next record.
With a format of "1X,6I12" the integers must be precisely arranged in the file. There should be a single blank, then the integers should each be right-justified in fields of 12 columns. If they get out of alignment you could get the wrong value read or a runtime error.
I have been using Ruby for a while now and I find, for bigger projects, it can take up a fair amount of memory. What are some best practices for reducing memory usage in Ruby?
Please, let each answer have one "best practice" and let the community vote it up.
When working with huge arrays of ActiveRecord objects, be very careful. If, while processing those objects in a loop, you load their related objects on each iteration using ActiveRecord's has_many, belongs_to, etc., memory usage grows a lot, because each object that belongs to the array grows.
The following technique helped us a lot (simplified example):
students.each do |student|
cloned_student = student.clone
...
cloned_student.books.detect {...}
ca_teachers = cloned_student.teachers.detect {|teacher| teacher.address.state == 'CA'}
ca_teachers.blah_blah
...
# Not sure if the following is necessary, but we have it just in case...
cloned_student = nil
end
In the code above, "cloned_student" is the object that grows, but since it is "nullified" at the end of each iteration, this is not a problem for a huge array of students. If we didn't "clone", the loop variable "student" would have grown; but since it belongs to an array, the memory it uses is never released as long as the array object exists.
A different approach works too:
students.each do |student|
loop_student = Student.find(student.id) # just re-find the record into local variable.
...
loop_student.books.detect {...}
ca_teachers = loop_student.teachers.detect {|teacher| teacher.address.state == 'CA'}
ca_teachers.blah_blah
...
end
In our production environment we had a background process that once failed to finish because 8 GB of RAM wasn't enough for it. After this small change it uses less than 1 GB to process the same amount of data...
Don't abuse symbols.
Each time you create a symbol, Ruby puts an entry in its symbol table. The symbol table is a global hash which never gets emptied.
This is not technically a memory leak, but it behaves like one. Symbols don't take up much memory so you don't need to be too paranoid, but it pays to be aware of this.
A general guideline: if you've actually typed the symbol in code, it's fine (you only have a finite amount of code, after all), but don't call to_sym on dynamically generated or user-input strings, as this opens the door to a potentially ever-increasing number of symbols that will never be reclaimed.
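A sketch of the pattern to avoid; params and store are made-up names, and note that Ruby 2.2+ can garbage-collect dynamically created symbols, so this matters most on older rubies:

# Risky: every distinct user-supplied key becomes a symbol
params.each { |key, value| store[key.to_sym] = value }

# Safer: keep dynamic input as strings
params.each { |key, value| store[key] = value }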
Don't do this:
def method(x)
x.split( doesn't matter what the args are )
end
or this:
def method(x)
x.gsub( doesn't matter what the args are )
end
Both will permanently leak memory in ruby 1.8.5 and 1.8.6. (not sure about 1.8.7 as I haven't tried it, but I really hope it's fixed.) The workaround is stupid and involves creating a local variable. You don't have to use the local, just create one...
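As described, the workaround was just the presence of an unused local variable in the method; a sketch of the shape, for those old 1.8.x interpreters:

def method(x)
  dummy = nil      # unused local; merely declaring it avoided the 1.8.x leak
  x.split(/,/)
end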
Things like this are why I have lots of love for the ruby language, but no respect for MRI
Beware of C extensions which allocate large chunks of memory themselves.
As an example, when you load an image using RMagick, the entire bitmap gets loaded into memory inside the ruby process. This may be 30 meg or so depending on the size of the image.
However, most of this memory has been allocated by RMagick itself. All ruby knows about is a wrapper object, which is tiny(1).
Ruby only thinks it's holding onto a tiny amount of memory, so it won't bother running the GC. In actual fact it's holding onto 30 meg.
If you loop over, say, 10 images, you can run yourself out of memory really fast.
The preferred solution is to manually tell the C library to clean up the memory itself - RMagick has a destroy! method which does this. If your library doesn't however, you may need to forcibly run the GC yourself, even though this is generally discouraged.
(1): Ruby C extensions have callbacks which will get run when the ruby runtime decides to free them, so the memory will eventually be successfully freed at some point, just perhaps not soon enough.
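A sketch of that destroy! pattern with RMagick (the directory and the process() helper are invented):

require 'rmagick'

Dir.glob('photos/*.jpg').each do |path|
  img = Magick::Image.read(path).first   # C extension allocates the bitmap
  process(img)                           # hypothetical work
  img.destroy!                           # release the C-side memory immediately
end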
Measure and detect which parts of your code are creating objects that cause memory usage to go up. Improve and modify your code then measure again. Sometimes, you're using gems or libraries that use up a lot of memory and creating a lot of objects as well.
There are many tools out there such as busy-administrator that allow you to check the memory size of objects (including those inside hashes and arrays).
$ gem install busy-administrator
Example # 1: MemorySize.of
require 'busy-administrator'
data = BusyAdministrator::ExampleGenerator.generate_string_with_specified_memory_size(10.mebibytes)
puts BusyAdministrator::MemorySize.of(data)
# => 10 MiB
Example # 2: MemoryUtils.profile
Code
require 'busy-administrator'
results = BusyAdministrator::MemoryUtils.profile(gc_enabled: false) do |analyzer|
BusyAdministrator::ExampleGenerator.generate_string_with_specified_memory_size(10.mebibytes)
end
BusyAdministrator::Display.debug(results)
Output:
{
memory_usage:
{
before: 12 MiB
after: 22 MiB
diff: 10 MiB
}
total_time: 0.406452
gc:
{
count: 0
enabled: false
}
specific:
{
}
object_count: 151
general:
{
String: 10 MiB
Hash: 8 KiB
BusyAdministrator::MemorySize: 0 Bytes
Process::Status: 0 Bytes
IO: 432 Bytes
Array: 326 KiB
Proc: 72 Bytes
RubyVM::Env: 96 Bytes
Time: 176 Bytes
Enumerator: 80 Bytes
}
}
You can also try ruby-prof and memory_profiler. It is best to test and experiment with different versions of your code so you can measure the memory usage and performance of each version; this lets you check whether your optimization really worked. You usually use these tools in development/testing mode and turn them off in production.
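For example, a minimal memory_profiler run looks like this:

require 'memory_profiler'

report = MemoryProfiler.report do
  # code under test; allocations inside the block are tracked
  10_000.times { "str" + "ing" }
end
report.pretty_print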