I have a 2.6 gigabyte text file containing a dump of a database table, and I'm trying to pull it into a logical structure so the fields can all be uniqued. The code I'm using to do this is here:
class Targetfile
  include Enumerable

  attr_accessor :inputfile, :headers, :input_array

  def initialize(file)
    @input_array = false
    @inputfile = File.open(file, 'r')
    @x = @inputfile.each.count
  end

  def get_headers
    @y = 1
    @inputfile.rewind
    @input_array = Array.new
    @headers = @inputfile.first.chomp.split(/\t/)
    @inputfile.each do |line|
      print "\n#{@y} / #{@x}"
      @y += 1
      self.assign_row(line)
    end
  end

  def assign_row(line)
    row_array = line.chomp.encode!('UTF-8', 'UTF-8', :invalid => :replace).split(/\t/)
    @input_array << Hash[ @headers.zip(row_array) ]
  end

  def send_build
    @input_array || self.get_headers
  end

  def each
    self.send_build.each {|row| yield row}
  end
end
The class is initialized successfully and I am left with a Targetfile object.
The problem is that when I then call the get_headers method, which converts the file into an array of hashes, it begins slowing down immediately.
This isn't noticeable to my eyes until around item number 80,000, but then it becomes apparent that every 3,000-4,000 lines of the file, some sort of pause occurs. Each time it occurs, the pause takes slightly longer, until by the millionth line it's taking more than 30 seconds.
For practical purposes, I can just chop up the file to avoid this problem, then combine the resulting lists and unique that to get my final output.
From a curiosity standpoint, however, I'm unsatisfied.
Can anyone tell me why this pause is occurring, why it gets longer, and if there's any way to avoid it elegantly? Really I just want to know what it is and why it happens, because now that I've noticed it, I see it in a lot of other Ruby scripts I run, both on this computer and on others.
I'd suggest doing this in the DBM (the database itself), not Ruby or any other language. A DBM can tell you the unique values for a field very quickly, especially if the field is already indexed.
Trying to do this in any language is duplicating the basic functionality of the database in something designed for general computing.
Instead, use Ruby with an ORM like Sequel or Active Record, issue queries to the database, and let it return the things you want to know. Don't iterate over every row - that's madness. Ask it to give you the unique values and go from there.
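For example, here is a minimal sketch using Sequel (the connection string, table name, and column name are placeholders):
require 'sequel'

DB = Sequel.connect('postgres://user:pass@localhost/mydb') # hypothetical connection

# The database computes the distinct values; only the unique values
# cross the wire, not every row of the 2.6 GB table.
unique_values = DB[:my_table].distinct.select_map(:my_field)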
I wouldn't blame Ruby, because the same problem would occur in any other language given the same host and RAM. C/C++ might delay the inevitable by generating more compact code, but your development time would slow drastically, especially if C is an unfamiliar language for you. And the risk of unintended errors goes up, because you have to do a lot more housekeeping and defensive programming than you would in Ruby, Python, or Perl.
Use each tool for what it's designed for and you'll be ahead.
Looking at your code, you could probably improve the chances of making it through a complete run by NOT trying to keep every row in memory. You said you're trying to determine uniqueness, so keep only the unique column values you're interested in, which you can do easily using Ruby's Set class. Walk the file, throw the values of each column you care about into a Set, and the Set will keep only the unique values, as sketched below.
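Here is a minimal sketch of that approach (the filename and column indexes are placeholders):
require 'set'

columns_of_interest = [0, 3] # hypothetical column indexes
unique_values = Hash.new { |h, k| h[k] = Set.new }

File.foreach('dump.txt') do |line|
  fields = line.chomp.split(/\t/)
  columns_of_interest.each { |i| unique_values[i] << fields[i] }
end
# unique_values now holds only the distinct values per column,
# instead of one hash per row for the whole 2.6 GB file.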
This is the infamous garbage collector -- Ruby's memory management mechanism.
Note: It's worth mentioning that Ruby, at least MRI, isn't a high-performance language.
The garbage collector runs whenever memory starts to run out, pausing the execution of the program to deallocate any objects that can no longer be accessed. Since it only runs when memory is getting scarce, you see it periodically; and since each collection has more live objects to traverse as your array grows, the pauses get longer and longer.
There's nothing you can do to avoid this, except write more memory-efficient code or rewrite in a language that has better/manual memory management.
Also, your OS may be paging. Do you have enough physical memory for this kind of task?
You are using the headers as keys for the hash. They are strings, and hashes duplicate string keys, so every row allocates a fresh copy of every header string. That is a lot of unnecessary strings. Try converting them to symbols and see if that speeds things up:
@headers = @headers.map { |header| header.to_sym }
This is the Garbage Collector. You can force garbage collection by calling GC.start in your program; have it run periodically.
I had to do the same thing for a daemon I wrote. It works well.
http://ruby-doc.org/core-1.9.3/GC.html
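For example, a minimal sketch (the filename, the per-row work, and the 10,000-row interval are all arbitrary placeholders):
File.foreach('dump.txt').with_index do |line, i|
  process(line) # hypothetical per-row work
  GC.start if i > 0 && (i % 10_000).zero? # force a collection every 10,000 rows
end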
Related
I am working on a small project which progressively grows a list of links and then processes them through a queue. There exists the likelihood that a link may be entered into the queue twice and I would like to track my progress so I can skip anything that has already been processed. I'm estimating around 10k unique links at most.
For larger projects I would use a database, but that seems overkill for the amount of data I am working with; I would prefer some form of in-memory solution that can potentially be serialized if I want to save progress across runs.
What data structure would best fit this need?
Update: I am already using a hash to track which links I have completed processing. Is this the most efficient way of doing it?
def process_link(link)
  return if @processed_links[link]
  # ... processing logic ...
  @processed_links[link] = Time.now # or other state
end
If you aren't concerned about memory, then just use a Hash to check inclusion; insert and lookup times are O(1) average case. Serialization is straightforward (Ruby's Marshal class should take care of that for you, or you could use a format like JSON). Ruby's Set is an array-like object that is backed with a Hash, so you could just use that if you're so inclined.
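A minimal sketch of the Hash/Set approach with Marshal persistence (the dump filename is a placeholder):
require 'set'

processed = File.exist?('progress.dump') ? Marshal.load(File.binread('progress.dump')) : Set.new

def process_link(link, processed)
  return if processed.include?(link) # O(1) average-case lookup
  # ... processing logic ...
  processed << link
end

# Save progress so it survives across runs:
File.binwrite('progress.dump', Marshal.dump(processed))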
However, if memory is a concern, then this is a great problem for a Bloom filter! You can achieve set-inclusion testing in constant time, and the filter uses substantially less memory than a hash would. The tradeoff is that Bloom filters are probabilistic - you can get false positives on inclusion. You can reduce the probability of false positives to near zero with the right Bloom filter parameters, but if duplicates are the exception rather than the rule, you could implement something like:
Check for set inclusion in the Bloom filter [O(1)]
If the Bloom filter reports that the entry is found, perform an O(n) check of the input data to see whether this item actually appeared in the input before now.
That would get you very fast and memory-efficient lookups for the common case, and you could make the choice either to accept the occasional wrongly-skipped item (to keep the whole thing small and fast), or to verify set inclusion whenever a duplicate is reported (to only do expensive work when you absolutely have to).
https://github.com/igrigorik/bloomfilter-rb is a Bloom filter implementation I've used in the past; it works nicely. There are also redis-backed Bloom filters, if you need something that can perform set membership tracking and testing across multiple app instances.
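A minimal usage sketch with that gem (the sizing parameters are illustrative; consult the gem's README for real tuning):
require 'bloomfilter-rb'

# Sized for roughly 10k entries; :size and :hashes control the false-positive rate.
filter = BloomFilter::Native.new(:size => 100_000, :hashes => 5)

filter.insert('http://example.com/page')
filter.include?('http://example.com/page') # => true
filter.include?('http://example.com/other') # => false (almost certainly)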
How about a Set, converting your links to value objects (rather than reference objects) such as Structs? By creating a value object, the Set will be able to detect its uniqueness. Alternatively, you could use a hash and store links by their PK.
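A minimal sketch of that idea (the single url field is hypothetical):
require 'set'

Link = Struct.new(:url) # Structs compare by value, not identity

seen = Set.new
seen << Link.new('http://example.com')
seen << Link.new('http://example.com') # a duplicate value, so it is not added

seen.size # => 1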
The data structure could be a hash:
current_status = { links: [link3, link4, link5], processed: [link1, link2, link3] }
To track your progress (in percent):
links_count = current_status[:links].length + current_status[:processed].length
progress = (current_status[:processed].length * 100) / links_count # Will give you percent of progress
To process your links:
Push any new link you need to process to current_status[:links].
Use shift to take the next link to be processed from current_status[:links].
After processing a link, push it to current_status[:processed].
EDIT
As I see it (and understand your question), the logic to process your links would be:
# Add any new link to the queue, unless it has already been processed
def add_link_to_queue(link)
  current_status[:to_process].push(link) unless current_status[:processed].include?(link)
end

# Process the next link on the queue
def process_next_link
  link = current_status[:to_process].shift # take the first link off the queue
  # ... logic to process the link ...
  current_status[:processed].push(link)
end
# shift will not only return the first link but also remove it from the array, which avoids duplication
In a Rails 3.2.x app using (Re)tire to access an ES cluster, a rake task is going through approximately 1M rows to create a new index (Ruby 1.9.3).
The task uses .to_json with a specific list of attributes and methods to limit the resulting hash for each element.
Yet as the task runs, memory is eaten away, and the process usually ends up being killed by the system.
The task already uses find_in_batches. Smaller batch sizes (using find_each) don't help.
checking without index
Removing the index.import call does improve things (obviously). The task goes through the whole collection very fast without a problem, which points at ES, Tire, or the JSON conversion (and the relations it might call upon) as the culprit.
reducing the scope of the task
Adding back index.import and passing a very limited hash (with string keys) for each item does make things slower, but not by much, and does not eat memory away. So JSON might not be the culprit here.
adding attributes and methods back
The culprit seems to be one of the methods used to grab one of the additional attributes. It's based on a relation of the model to another model, ending up with a lot of models being involved and sifted through.
As pointed out in Index the results of a method in ElasticSearch (Tire + ActiveRecord), adding includes does help a bit, but the task still ends up heavy.
going around
I also tried to get around part of the problem by replacing the calls to Tire with the ES bulk API. Generating JSON files and sending them with a Ruby HTTP lib can work. Yet the same memory problem arises, since the same requests to the DB are made.
What's left ?
What I don't get is why, even with find_in_batches, Ruby keeps eating away memory. I would expect the memory related to each batch of data to be freed once the batch is done.
Next to try: GC.start calls, and deactivating Active Record caching around the task.
Still, even if some solution limits memory use drastically (300 or 500 MB instead of 800+), the underlying issue remains: indexing a lot of instances of a model, including data related to some other models.
Am I missing something about the import and includes that would solve the issue?
Would splitting the task into smaller background jobs (Resque, Sidekiq) help? I would suppose so, as each batch would be isolated from the others and, once processed, would really free up the memory (?) (orchestrating those jobs would be another problem).
Are there good practices for indexing large quantities of data into ES?
I've been using Rails + Elasticsearch for a while and have done this kind of dance a few times.
A few things come to mind, in no particular order.
Did you try the recent elasticsearch gem (instead of tire)? I've updated my apps to use it and like having more control over what is done.
I would also try to force a GC sweep after each ActiveRecord loop. You could also be extra careful with memory allocation by explicitly resetting all local variables each time.
You could use the fork & exec trick to fork a brand new process for each loop; it would be the most effective GC you can get. There's a little overhead the first time you write it, but the pay-off is great. Take good care to limit the amount of memory used in the outer part of the task. Using a process-based background task would partly achieve the same goal, but you might still get memory bloat. A sketch follows.
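Here is a minimal sketch of the fork-per-batch idea (the model, association, and index_batch helper are all hypothetical; the child process must establish its own DB connection):
Thing.select(:id).find_in_batches(:batch_size => 1000) do |batch|
  pid = fork do
    ActiveRecord::Base.establish_connection # the child needs its own connection
    records = Thing.includes(:related_things).where(:id => batch.map(&:id))
    index_batch(records) # hypothetical wrapper around the Tire/ES import
  end
  Process.wait(pid)
  # Everything the child allocated is returned to the OS when it exits.
end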
Can you limit the use of ActiveRecord? If you only need some basic associations, you could use a lower-level/simpler tool like Sequel (or something else) to work with Ruby hashes/arrays instead of full-fledged AR models.
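For instance, a hedged sketch with Sequel (table and column names are placeholders); each row comes back as a plain Hash, and paged_each streams rows in pages instead of materializing them all:
require 'sequel'

DB = Sequel.connect(ENV['DATABASE_URL'])

DB[:things].select(:id, :name).paged_each(:rows_per_fetch => 1000) do |row|
  # row is a plain Hash; build its JSON here and feed the bulk indexer
end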
Just curious: how do you list all the symbols used in a running Ruby process? Eventually I want to know the size taken up by all those symbols, whether there is any limit on them, how to keep them limited, and whether one should worry when their size gets too big.
To see them all:
Symbol.all_symbols
reference: http://ruby-doc.org/core-2.1.1/Symbol.html#method-c-all_symbols
I'm not sure how to find out how much memory they are using, or whether there is a limit. But since they are never garbage collected (at least prior to Ruby 2.2), you SHOULD worry a bit about them. In particular, you should never allow untrusted user input to be turned into a symbol - this can be used to run your application out of memory.
For an example of turning user input into symbols, imagine a rails action which turns a user-supplied string into a symbol:
def some_action
  my_sym = params[:p].to_sym # every new value of params[:p] creates a permanent symbol
  # ...
end
Now someone can fill your Ruby process space with as many symbols as they like by requesting millions of URLs like
http://your_app/some_action?p=a
http://your_app/some_action?p=b
http://your_app/some_action?p=c
...
Possibly (depending on lots of things) killing your server when it runs out of memory.
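A common mitigation, sketched below, is to whitelist the permitted values before converting (the ALLOWED constant and its contents are hypothetical):
ALLOWED = %w[name date size].freeze

def some_action
  raise ArgumentError, 'bad parameter' unless ALLOWED.include?(params[:p])
  my_sym = params[:p].to_sym # safe: only a fixed set of symbols can ever be created
  # ...
end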
Starting from Rails 4, everything would have to run in a threaded environment by default. What this means is that all the code we write AND ALL the gems we use are required to be thread-safe.
So, I have a few questions on this:
What is NOT thread-safe in Ruby/Rails, versus what IS thread-safe?
Is there a list of gems that are known to be thread-safe, or vice versa?
Is there a list of common code patterns that are NOT thread-safe, e.g. @result ||= some_method?
Are the data structures in the Ruby core, such as Hash, thread-safe?
On MRI, where there is a GVL/GIL, meaning only one Ruby thread can run at a time except during IO, does the thread-safety change affect us?
None of the core data structures are thread safe. The only thread-safe data structure I know of that ships with Ruby is the queue implementation in the standard library (require 'thread'; q = Queue.new).
MRI's GIL does not save us from thread-safety issues. It only makes sure that two threads cannot run Ruby code at the same time, i.e. on two different CPUs at the exact same moment. Threads can still be paused and resumed at any point in your code. If you write code like @n = 0; 3.times { Thread.start { 100.times { @n += 1 } } }, i.e. mutating a shared variable from multiple threads, the value of the shared variable afterwards is not deterministic. The GIL is more or less a simulation of a single-core system; it does not change the fundamental issues of writing correct concurrent programs.
Even if MRI had been single-threaded like Node.js you would still have to think about concurrency. The example with the incremented variable would work fine, but you can still get race conditions where things happen in non-deterministic order and one callback clobbers the result of another. Single threaded asynchronous systems are easier to reason about, but they are not free from concurrency issues. Just think of an application with multiple users: if two users hit edit on a Stack Overflow post at more or less the same time, spend some time editing the post and then hit save, whose changes will be seen by a third user later when they read that same post?
In Ruby, as in most other concurrent runtimes, anything that is more than one operation is not thread safe. @n += 1 is not thread safe, because it is multiple operations. @n = 1 is thread safe because it is one operation (it's lots of operations under the hood, and I would probably get into trouble if I tried to describe why it's "thread safe" in detail, but in the end you will not get inconsistent results from assignments). @n ||= 1 is not, and no other shorthand operation + assignment is either. One mistake I've made many times is writing return unless @started; @started = true, which is not thread safe at all.
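A minimal sketch of that race and of the standard fix with a Mutex (the counts are arbitrary):
require 'thread'

n = 0
lock = Mutex.new

threads = 3.times.map do
  Thread.new do
    100_000.times do
      lock.synchronize { n += 1 } # the read-modify-write is now atomic
    end
  end
end
threads.each(&:join)

n # => 300000 every time; without the mutex the result may vary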
I don't know of any authoritative list of thread-safe and non-thread-safe statements for Ruby, but there is a simple rule of thumb: if an expression only does one (side-effect free) operation, it is probably thread safe. For example: a + b is ok, a = b is also ok, and a.foo(b) is ok if the method foo is side-effect free (since just about anything in Ruby is a method call, even assignment in many cases, this goes for the other examples too). Side-effects in this context means things that change state. def foo(x); @x = x; end is not side-effect free.
One of the hardest things about writing thread safe code in Ruby is that all core data structures, including array, hash and string, are mutable. It's very easy to accidentally leak a piece of your state, and when that piece is mutable things can get really screwed up. Consider the following code:
class Thing
  attr_reader :stuff

  def initialize(initial_stuff)
    @stuff = initial_stuff
    @state_lock = Mutex.new
  end

  def add(item)
    @state_lock.synchronize do
      @stuff << item
    end
  end
end
An instance of this class can be shared between threads and they can safely add things to it, but there's a concurrency bug (and it's not the only one): the internal state of the object leaks through the stuff accessor. Besides being problematic from the encapsulation perspective, it also opens up a can of concurrency worms. Maybe someone takes that array and passes it on to somewhere else, and that code in turn thinks it now owns that array and can do whatever it wants with it.
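One hedged way to plug that particular leak is to hand out a snapshot instead of the internal array, copying while holding the lock:
class Thing
  def initialize(initial_stuff)
    @stuff = initial_stuff.dup # own our state from the start
    @state_lock = Mutex.new
  end

  def add(item)
    @state_lock.synchronize { @stuff << item }
  end

  def stuff
    @state_lock.synchronize { @stuff.dup } # callers get a copy, not the internal array
  end
end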
Another classic Ruby example is this:
STANDARD_OPTIONS = {:color => 'red', :count => 10}

def find_stuff
  @some_service.load_things('stuff', STANDARD_OPTIONS)
end
find_stuff works fine the first time it's used, but returns something else the second time. Why? The load_things method happens to think it owns the options hash passed to it, and does color = options.delete(:color). Now the STANDARD_OPTIONS constant doesn't have the same value anymore. Constants are only constant in what they reference; they do not guarantee the constancy of the data structures they refer to. Just think what would happen if this code were run concurrently.
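Two hedged defenses, sketched here: freeze the constant so an in-place mutation raises instead of silently corrupting it, and pass the callee its own copy (note that freeze is shallow, so the values inside are still mutable):
STANDARD_OPTIONS = {:color => 'red', :count => 10}.freeze

def find_stuff
  # dup gives load_things a hash it can safely delete keys from;
  # the frozen constant turns any accidental in-place mutation into a RuntimeError.
  @some_service.load_things('stuff', STANDARD_OPTIONS.dup)
end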
If you avoid shared mutable state (e.g. instance variables in objects accessed by multiple threads, or data structures like hashes and arrays accessed by multiple threads), thread safety isn't so hard. Try to minimize the parts of your application that are accessed concurrently, and focus your efforts there. IIRC, in a Rails application a new controller object is created for every request, so it is only going to get used by a single thread, and the same goes for any model objects you create from that controller. However, Rails also encourages the use of global variables (User.find(...) uses the global variable User; you may think of it as only a class, and it is a class, but it is also a namespace for global variables). Some of these are safe because they are read-only, but sometimes you save things in these global variables because it is convenient. Be very careful when you use anything that is globally accessible.
It's been possible to run Rails in threaded environments for quite a while now, so without being a Rails expert I would still go so far as to say that you don't have to worry about thread safety when it comes to Rails itself. You can still create Rails applications that aren't thread safe by doing some of the things I mention above. When it comes to other gems, assume that they are not thread safe unless they say that they are, and if they say that they are, assume that they are not, and look through their code (but just because you see that they do things like @n ||= 1 does not mean that they are not thread safe; that's a perfectly legitimate thing to do in the right context -- you should instead look for things like mutable state in global variables, how they handle mutable objects passed to their methods, and especially how they handle options hashes).
Finally, being thread unsafe is a transitive property. Anything that uses something that is not thread safe is itself not thread safe.
In addition to Theo's answer, I'd add a couple of problem areas to look out for in Rails specifically, if you're switching to config.threadsafe!
Class variables:
@@i_exist_across_threads
ENV:
ENV['DONT_CHANGE_ME']
Threads:
Thread.start
Starting from Rails 4, everything would have to run in a threaded environment by default
This is not 100% correct. Thread-safe Rails is just on by default. If you deploy on a multi-process app server like Passenger (community) or Unicorn, there will be no difference at all. This change only concerns you if you deploy on a multi-threaded environment like Puma or Passenger Enterprise > 4.0.
In the past, if you wanted to deploy on a multi-threaded app server, you had to turn on config.threadsafe!. It is the default now because everything it did either had no effect or applied equally to a Rails app running in a single process (Prooflink).
But if you do want all the Rails 4 streaming benefits and other real-time stuff of a multi-threaded deployment, then maybe you will find this article interesting. As @Theo said, for a Rails app you actually just have to avoid mutating static state during a request. While this is a simple practice to follow, unfortunately you cannot be sure about it for every gem you find. As far as I remember, Charles Oliver Nutter from the JRuby project had some tips about it in this podcast.
And if you want to write pure concurrent Ruby programs, where you need data structures that are accessed by more than one thread, you may find the thread_safe gem useful.
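A minimal sketch of what that gem provides (the API shown here is from memory; check the gem's documentation):
require 'thread_safe'

# Drop-in Hash/Array variants whose individual operations are synchronized.
hash = ThreadSafe::Hash.new

10.times.map { |i| Thread.new { hash[i] = i * i } }.each(&:join)
hash.size # => 10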
I have a chunk of Lua code that I'd like to be able to (selectively) ignore. I don't have the option of not reading it in, and sometimes I'd like it to be processed, sometimes not, so I can't just comment it out (that is, there's a whole bunch of blocks of code and I either have the option of reading none of them or reading all of them). I came up with two ways to implement this (there may well be more - I'm very much a beginner): either enclose the code in a function and then call or not call the function (and once I'm sure I'm past the point where I would call it, I can set the function to nil to free up the memory), or enclose the code in an if ... end block. The former has a slight advantage in that there are several of these blocks and it makes it easier for one block to load another even if the main program didn't request it, but the latter seems the more efficient. However, not knowing much, I don't know whether the efficiency saving is worth it.
So how much more efficient is:
if false then
-- a few hundred lines
end
than
throwaway = function ()
-- a few hundred lines
end
throwaway = nil -- to ensure that both methods leave me in the same state after garbage collection
?
If it depends a lot on the Lua implementation, how big would the "few hundred lines" need to be to reliably spot the difference, and what sort of stuff should they include to make the best test (the main use of the blocks is to define a load of possibly useful functions)?
Lua isn't smart enough to discard the code for the function, so you're not going to save any memory.
In terms of speed, you're talking about a difference of nanoseconds which happens once per program execution. It's harming your efficiency to worry about this; it has virtually no relevance to actual performance. Write the code that you feel expresses your intent most clearly, without trying to be clever. If you run into performance issues, they will be a million miles away from this decision.
If you want to save memory, which is understandable on a mobile platform, you could put your conditional code in its own module and never load it at all if it's not needed (if your framework supports it; e.g. MOAI does, Corona doesn't). A minimal sketch follows.
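Here is a minimal sketch of that approach (the module name and the flag are hypothetical; require compiles and runs the file only on the first call):
-- maybe_used.lua defines and returns the possibly-needed functions
local extras
if need_extras then -- hypothetical flag set elsewhere
  extras = require("maybe_used") -- the file is read and compiled only now
  extras.do_something()
end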
If there is really a lot of unused code, you can define it as a collection of strings and loadstring() them when needed. Storing functions as strings will reduce the initial compile time; however, for most functions the string representation probably takes up more memory than its compiled form, and what you save on compilation is probably not significant below a few thousand lines... Just saying.
If you put this code in a table, you could compile it transparently through a metatable for minimal performance impact on repeated calls.
Example code
local code_uncompiled = {
  f = [=[
    local x, y = ...;
    return x+y;
  ]=]
}

code = setmetatable({}, {
  __index = function(self, k)
    -- compile on first access, then cache the compiled function
    self[k] = assert(loadstring(code_uncompiled[k]));
    return self[k];
  end
});

local ff = code.f; -- code.f gets compiled here
ff = code.f; -- no compilation here
for i=1, 1000 do
  print( ff(2*i, -i) ); -- no compilation here either
  print( code.f(2*i, -i) ); -- no compile either, but table access (slower)
end
The beauty of it is that this compiles as needed and you don't really have to waste another thought on it; it's just like storing a function in a table, and it allows for a lot of flexibility.
Another advantage of this solution is that when the amount of dynamically loaded code gets out of hand, you could transparently change it to load code from external files on demand through the __index function of the metatable. Also, you can mix compiled and uncompiled code by populating the "code" table with "real" functions.
Try the one that makes the code more legible to you first. If it runs fast enough on your target machine, use that.
If it doesn't run fast enough, try the other one.
Lua can ignore multiple lines with a block comment:
function dostuff()
  blabla
  faaaaa
  --[[
  ignore this
  and this
  maybe this
  this as well
  ]]--
end