Let's say I have 4 folders with 25 folders in each. In each of those 25 folders there are 20 folders, each with one very long text document. The method I'm using now seems to have room to improve, and in every scenario in which I implement Ruby's threads, the result is slower than before. I have an array of the 54 names of the folders. I iterate through each and use a foreach method to get the deeply nested files. In the foreach loop I do 3 things: I get the contents of today's file, I get the contents of yesterday's file, and I use my diff algorithm to find what has changed from yesterday to today. How would you do this faster with threads?
def backup_differ_loop device_name
  device_name.strip!

  Dir.foreach("X:/Backups/#{device_name}/#{@today}").each do |backup|
    if backup != "." and backup != ".."
      @today_filename = "X:/Backups/#{device_name}/#{@today}/#{backup}"
      @yesterday_filename = "X:/Backups/#{device_name}/#{@yesterday}/#{backup.gsub(@today, @yesterday)}"

      if File.exists?(@yesterday_filename)
        today_backup_content = File.open(@today_filename, "r").read
        yesterday_backup_content = File.open(@yesterday_filename, "r").read

        begin
          Diffy::Diff.new(yesterday_backup_content, today_backup_content,
                          :include_plus_and_minus_in_html => true, :context => 1).to_s(:html)
        rescue
          # do nothing, just continue
        end
      end
    else
      # file not found
    end
  end
end
The first part of your logic is finding all files in a specific folder. Instead of doing Dir.foreach and then checking against "." and ".." you can do this in one line:
files = Dir.glob("X:/Backups/#{device_name}/#{@today}/*").select { |item| File.file?(item) }
Notice the /* at the end? This will search 1 level deep (inside the @today folder). If you want to search inside sub-folders too, replace it with /**/* so you'll get an array of all files inside all sub-folders of @today.
So I'd first have a method which gives me a two-dimensional array: a bunch of [today, yesterday] pairs of matching files:
def get_matching_files(device_name)
  matching_files = []

  Dir.glob("X:/Backups/#{device_name}/#{@today}/*").select { |item| File.file?(item) }.each do |backup|
    today_filename = File.absolute_path(backup) # should get you X:/Backups/... as an absolute path
    # take just the file name before swapping today's date for yesterday's
    yesterday_filename = "X:/Backups/#{device_name}/#{@yesterday}/#{File.basename(backup).gsub(@today, @yesterday)}"

    if File.exists?(yesterday_filename)
      matching_files << [today_filename, yesterday_filename]
    end
  end

  return matching_files
end
and call it:
matching_files = get_matching_files(device_name)
NOW we can start the multi-threading which is where things probably slow down. I'd first get all the files from the array matching_files into a queue, then start 5 threads which will go until the queue is empty:
queue = Queue.new
matching_files.each { |pair| queue << pair }

# 5 being the number of threads
5.times.map do
  Thread.new do
    until queue.empty?
      begin
        today_filename, yesterday_filename = queue.pop(true) # non-blocking pop
        today_backup_content = File.read(today_filename)
        yesterday_backup_content = File.read(yesterday_filename)

        Diffy::Diff.new(yesterday_backup_content, today_backup_content,
                        :include_plus_and_minus_in_html => true, :context => 1).to_s(:html)
      rescue ThreadError
        # another thread emptied the queue between the check and the pop
      rescue
        # do nothing, just continue
      end
    end
  end
end.each(&:join)
I can't guarantee my code will work because I don't have the entire context of your program. I hope I've given you some ideas.
And the MOST important thing: the standard implementation of Ruby can run only one thread at a time. This means that even if you implement the code above, you won't see a significant performance difference. So get Rubinius or JRuby, which allow more than one thread to run at a time. Or, if you prefer to stay on standard MRI Ruby, restructure your code (you can keep your original version) and start multiple processes instead. You'll just need something like a shared database where you can store the matching_files (one pair per row, for example); every time a process 'takes' something from that database, it marks that row as 'used'. SQLite is a good db for this, I think, because it's thread safe by default.
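If you do go the multi-process route, a rough sketch of that shared-queue idea with the sqlite3 gem could look like this; the database file and table layout here are just placeholders, not part of your program:

require 'sqlite3'

db = SQLite3::Database.new("work_queue.db")
db.execute <<-SQL
  CREATE TABLE IF NOT EXISTS matching_files (
    id        INTEGER PRIMARY KEY,
    today     TEXT,
    yesterday TEXT,
    used      INTEGER DEFAULT 0
  )
SQL

# One process fills the queue with the [today, yesterday] pairs...
matching_files.each do |today_filename, yesterday_filename|
  db.execute("INSERT INTO matching_files (today, yesterday) VALUES (?, ?)",
             [today_filename, yesterday_filename])
end

# ...and each worker process repeatedly claims an unused row and diffs it.
loop do
  row = db.get_first_row("SELECT id, today, yesterday FROM matching_files WHERE used = 0 LIMIT 1")
  break unless row

  db.execute("UPDATE matching_files SET used = 1 WHERE id = ? AND used = 0", [row[0]])
  next unless db.changes == 1 # another process claimed this row first

  _, today_filename, yesterday_filename = row
  Diffy::Diff.new(File.read(yesterday_filename), File.read(today_filename),
                  :include_plus_and_minus_in_html => true, :context => 1).to_s(:html)
end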
Most Ruby implementations don't have "true" multicore threading, i.e. threads won't gain you any performance improvement, since the interpreter can only run one thread at a time. For applications like yours, with lots of disk IO, this is especially true. In fact, even with real multithreading your application might be IO-bound and still not see much of an improvement.
You are more likely to get results by finding some inefficient algorithm in your code and improving it.
I'm working on a JSON file, I think. But regardless, I'm working with a lot of different hashes, fetching different values, and so on. This is what the data looks like:
{"notification_rule"=>
  {"id"=>"0000000",
   "contact_method"=>
     {"id"=>"000000",
      "address"=>"cod.lew#gmail.com",}}}

{"notification_rule"=>
  {"id"=>"000000",
   "contact_method"=>
     {"id"=>"PO0JGV7",
      "address"=>"cod.lew#gmail.com",}}}
Essentially, this is the type of hash I'm currently working with. With my code, I wanted to stop duplicates of the same thing from ending up in the text file, because whenever I run it, it prints the address from both of these hashes. I understand why (it's looping over them again), but I thought the code I added below would help resolve that issue:
Final UPDATE
if jdoc["notification_rule"]["contact_method"]["address"].to_s.include?(".com")
  numbers.print "Employee Name: "
  numbers.puts jdoc["notification_rule"]["contact_method"]["address"].gsub(/#target.com/, '').gsub(/\w+/, &:capitalize)

  file_names = ['Employee_Information.txt']
  file_names.each do |file_name|
    text = File.read(file_name)
    lines = text.split("\n")
    new_contents = lines.uniq.join("\n")
    File.open(file_name, "w") { |file| file.puts new_contents }
  end
else
  nil
end
This code looks really confused and lacks a specific purpose. Generally, Ruby that's this tangled up is on the wrong track; with Ruby there's usually a simple way of expressing something simple, and testing for duplicated addresses is one of those things that shouldn't be hard.
One of the biggest sources of confusion is the responsibility of a chunk of code. In that example you're not only trying to import data, loop over documents, clean up email addresses, and test for duplicates, but somehow facilitate printing out the results. That's a lot of things going on all at once, and they all have to work perfectly for that chunk of code to be fully operational. There's no way of getting it partially working, and no way of knowing if you're even on the right track.
Always try and break down complex problems into a few simple stages, then chain those stages together as necessary.
Here's how you can define a method to clean up your email addresses:
def address_scrub(address)
  address.gsub(/\#target.com/, '').gsub(/\w+/, &:capitalize)
end
Where that can be adjusted as necessary, and presumably tested to ensure it's working correctly, which you can now do independently of the other code.
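For example, a quick sanity check could look like this; the sample addresses are made up, patterned after the ones in the question:

address_scrub("cod.lew#target.com")  # => "Cod.Lew"
address_scrub("jane.doe#target.com") # => "Jane.Doe"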
As for the rest, it looks like this:
require 'set'
# Read in duplicated addresses from a file, clean up with chomp, using a Set
# for fast lookups.
duplicates = Set.new(
  File.readlines("Employee_Information.txt").map(&:chomp)
)
# Extract addresses from jdoc document array
filtered = jdocs.map do |jdoc|
  # Convert to jdoc/address pair
  [ jdoc, address_scrub(jdoc["notification_rule"]["contact_method"]["address"]) ]
end.reject do |jdoc, address|
  # Remove any that are already in the duplicates list
  duplicates.include?(address)
end.map do |jdoc, _|
  # Return only the document
  jdoc
end
Where that processes jdocs, an array of jdoc structures, and removes duplicates in a series of simple steps.
With the chaining approach you can see what's happening before you add on the next "link", so you can work incrementally towards a solution, adjusting as you go. Any mistakes are fairly easy to catch because you're able to, at any time, inspect the intermediate products of those stages.
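As a possible final stage, here's a sketch of how you might use filtered; it assumes numbers is the output handle from the question and that newly printed addresses should be appended to the file so later runs treat them as duplicates:

# Print each surviving document's cleaned-up address and remember it for next time.
File.open("Employee_Information.txt", "a") do |log|
  filtered.each do |jdoc|
    address = address_scrub(jdoc["notification_rule"]["contact_method"]["address"])
    numbers.puts "Employee Name: #{address}"
    log.puts address
  end
end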
Say I have the following Ruby code which, given a hash of insert positions, reads a file and creates a new file with extra text inserted at those positions:
insertpos = {14 => 25, 16 => 25}

File.open('file.old', 'r') do |oldfile|
  File.open('file.new', 'w') do |newfile|
    oldfile.each_with_index do |line, linenum|
      inserthere = insertpos[linenum]
      if !inserthere.nil? then
        line.insert(inserthere, "foo")
      end
      newfile.write(line)
    end
  end
end
Now, instead of creating that new file, I would like to modify this original (old) file. Can someone give me a hint on how to modify the code? Thanks!
At a very fundamental level, this is an extremely difficult thing to do, in any language, on any operating system. Envision a file as a contiguous series of bytes on disk (this is a very simplistic scenario, but it serves to illustrate the point). You want to insert some bytes in the middle of the file. Where do you put those bytes? There's no place to put them! You would have to basically "shift" the existing bytes after the insertion point "down" by the number of bytes you want to insert. If you're inserting multiple sections into an existing file, you would have to do this multiple times! It will be extremely slow, and you will run a high risk of corrupting your data if something goes awry.
You can, however, overwrite existing bytes, and/or append to the end of the file. Most Unix utilities give the appearance of modifying files by creating new files and swapping them with the old. Some more sophisticated schemes, such as those used by databases, allow inserts in the middle of files by 1. reserving space for such operations (when the data is first written), 2. allowing non-contiguous blocks of data within the file through indexing and other techniques, and/or 3. copy-on-write schemes where a new version of the data is written to the end of the file and the old version is invalidated by overwriting an indicator of some kind. You are most likely not wanting to go through all this trouble for your simple use case!
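To illustrate the "overwrite existing bytes" case, here's a tiny sketch; the offset and replacement text are arbitrary:

# Replaces 3 bytes at offset 10 with "foo" without moving anything else in the file.
File.open('file.old', 'r+') do |f|
  f.seek(10)
  f.write("foo")
end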
Anyway, you've already found the best way to do what you're trying to do. The only thing you're missing is a FileUtils.mv('file.new', 'file.old') at the very end to replace the old file with the new. Please let me know in the comments if I can help explain this any further.
(Of course, you can read the entire file into memory, make your changes, and overwrite the old file with the updated contents, but I don't believe that's what you're asking here.)
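Putting that together with your original loop, the whole thing would look like this; only the require and the final rename are new:

require 'fileutils'

insertpos = {14 => 25, 16 => 25}

File.open('file.old', 'r') do |oldfile|
  File.open('file.new', 'w') do |newfile|
    oldfile.each_with_index do |line, linenum|
      inserthere = insertpos[linenum]
      line.insert(inserthere, "foo") unless inserthere.nil?
      newfile.write(line)
    end
  end
end

# Swap the freshly written file into place, replacing the original.
FileUtils.mv('file.new', 'file.old')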
Here's something that hopefully solves your purpose:
# 'source' param is a string, the entire source text
# 'lines' param is an array, a list of line numbers to insert after
# 'new' param is a string, the text to add
def insert(source, lines, new)
  results = []
  source.split("\n").each_with_index do |line, idx|
    if lines.include?(idx)
      results << (line + new)
    else
      results << line
    end
  end
  results.join("\n")
end
File.open("foo", "w") do |f|
  10.times do |i|
    f.write("#{i}\n")
  end
end

puts "initial text: \n\n"
txt = File.read("foo")
puts txt

puts "\n\n after inserting at lines 1,3, and 5: \n\n"
result = insert(txt, [1,3,5], "\nfoo")
puts result
Running this shows:
initial text:
0
1
2
3
4
5
6
7
8
9
after inserting at lines 1,3, and 5:
0
1
foo
2
3
foo
4
5
foo
6
7
8
9
If it's a relatively simple operation you can do it with a Ruby one-liner, like this:
ruby -i -lpe '$_.reverse!' thefile.txt
(found e.g. at https://gist.github.com/KL-7/1590797).
I have a Ruby 1.8.7 script to parse iOS localization files:
singleline_comment = /\/\/(.*)$/
multiline_comment = /\/\*(.*?)\*\//m
string_line = /\s*"(.*?)"\s*=\s*"(.*?)"\s*\;\s*/xm
out = decoded_src.scan(/(?:#{singleline_comment}|#{multiline_comment})?\s*?#{string_line}/)
It used to work fine, but today we tested it with a file that is 800Kb, and that doesn't have ; at the end of each line. The result was a high CPU load and no response from the Rails server. My assumption is that it took the whole file as a single string in the capturing group and that blocked the server.
The solution was to add ? (regex quantificator, 0 or 1 time) to the ; literal character:
/\s*"(.*?)"\s*=\s*"(.*?)"\s*\;?\s*/xm
Now it works fine again even with those files in the old iOS format, but my fear now is, what if a user submits a malformed file, like one with no ending ". Will my server get blocked again?
And how do I prevent this? Is there any way to try to run this for only five seconds? What can I do to avoid halting my whole Rails application?
It looks like you're trying to parse an entire configuration as if it were a single string. While that is doable, it's error-prone. Regular expression engines have to do a lot of looking forward and backward, and poorly written patterns can end up wasting a huge amount of CPU time. Sometimes a minor tweak will fix the problem, but the more text being processed, and the more complex the expression, the higher the chance of something happening that will mess you up.
From benchmarking different ways of getting at data for my own work, I've learned that anchoring regexp patterns can make a huge difference in speed. If you can't anchor a pattern somehow, then you are going to suffer from the backtracking and greediness of patterns unless you can limit what the engine wants to do by default.
I have to parse a lot of device configurations, but instead of trying to treat them as a single string, I break them down into logical blocks consisting of arrays of lines, and then I can provide logic to extract data from those blocks based on knowledge that blocks contain certain types of information. Small blocks are faster to search, and it's a lot easier to write patterns that can be anchored, providing huge speedups.
Also, don't hesitate to use Ruby's String methods, like split to tear apart lines, and sub-string matching to find lines containing what you want. They're very fast and less likely to induce slowdowns.
If I had a string like:
config = "name:\n foo\ntype:\n thingie\nlast update:\n tomorrow\n"
chunks = config.split("\n").slice_before(/^\w/).to_a
# => [["name:", " foo"], ["type:", " thingie"], ["last update:", " tomorrow"]]
command_blocks = chunks.map{ |k, v| [k[0..-2], v.strip] }.to_h
command_blocks['name'] # => "foo"
command_blocks['last update'] # => "tomorrow"
slice_before is a very useful method for this sort of task as it lets us define a pattern that is then used to test for breaks in the master array, and group by those. The Enumerable module has lots of useful methods in it, so be sure to look through it.
The same data could be parsed in other ways, of course.
Of course, without sample data for what you're trying to do it's difficult to suggest something that works better, but the idea is, break down your input into small manageable chunks and go from there.
As a comment on how you're defining your patterns:
Instead of using /\/.../ (which leads to what's known as "leaning toothpick syndrome"), use %r, which allows you to define a different delimiter:
singleline_comment = /\/\/(.*)$/ # => /\/\/(.*)$/
singleline_comment = %r#//(.*)$# # => /\/\/(.*)$/
multiline_comment = /\/\*(.*?)\*\//m # => /\/\*(.*?)\*\//m
multiline_comment = %r#/\*(.*?)\*/#m # => /\/\*(.*?)\*\//m
The first line in each sample above is how you're doing it, and the second is how I'd do it. They result in identical regexp objects, but the second ones are easier to understand.
You can even have Regexp help you by escaping things for you:
NONGREEDY_CAPTURE_NONE_TO_ALL_CHARS = '(.*?)'
GREEDY_CAPTURE_NONE_TO_ALL_CHARS = '(.*)'
EOL = '$'
Regexp.new(Regexp.escape('//') + GREEDY_CAPTURE_NONE_TO_ALL_CHARS + EOL) # => /\/\/(.*)$/
Regexp.new(Regexp.escape('/*') + NONGREEDY_CAPTURE_NONE_TO_ALL_CHARS + Regexp.escape('*/'), Regexp::MULTILINE) # => /\/\*(.*?)\*\//m
Doing this you can iteratively build up extremely complex expressions while keeping them relatively easy to maintain.
As far as halting your Rails app, don't try to process the files in the same Ruby process. Run a separate job that watches for the files and process them and store whatever you're looking for to be accessed as needed later. That way your server will continue to respond rather than lock up. I wouldn't do it in a thread, but would write a separate Ruby script that looks for incoming data, and if nothing is found, sleeps for some interval of time then looks again. Ruby's sleep method will help with that, or you could use the cron capability of your OS.
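A bare-bones sketch of that watcher script; the directory names, the 30-second interval, and the process_file helper are placeholders for your own setup:

# watcher.rb -- run as a separate process, outside the Rails app.
INCOMING_DIR = "/path/to/incoming"  # wherever uploaded files land
DONE_DIR     = "/path/to/processed"

loop do
  files = Dir.glob(File.join(INCOMING_DIR, "*.strings"))

  if files.empty?
    sleep 30 # nothing to do yet; check again in a while
  else
    files.each do |path|
      process_file(path) # your parsing/extraction logic, storing results as needed
      File.rename(path, File.join(DONE_DIR, File.basename(path)))
    end
  end
end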
Let's say I have a large query (for the purposes of this exercise say it returns 1M records) in MongoDB, like:
users = Users.where(:last_name => 'Smith')
If I loop through this result, working with each member, with something like:
users.each do |user|
  # Some manipulation to "user"
  # Some calculation for "user"
  ...
  # Saving "user"
end
I'll often get a Mongo cursor timeout (as the database cursor that is reserved exceeds the default timeout length). I know I can extend the cursor timeout, or even turn it off--but this isn't always the most efficient method. So, one way I get around this is to change the code to:
users = Users.where(:last_name => 'Smith')
user_array = []
users.each do |u|
  user_array << u
end
THEN, I can loop through user_array (since it's a Ruby array), doing manipulations and calculations, without worrying about a MongoDB timeout.
This works fine, but there has to be a better way--does anyone have a suggestion?
If your result set is so large that it causes cursor timeouts, it's not a good idea to load it entirely into RAM.
A common approach is to process records in batches; a rough sketch follows the steps below.
1. Get 1000 users, sorted by _id.
2. Process them.
3. Get another batch of 1000 users whose _id is greater than the _id of the last processed user.
4. Repeat until done.
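A minimal sketch of that loop, assuming a Mongoid model (which the Users.where syntax suggests); the batch size of 1000 is arbitrary:

batch_size = 1000
last_id = nil

loop do
  criteria = Users.where(:last_name => 'Smith').asc(:_id).limit(batch_size)
  criteria = criteria.where(:_id.gt => last_id) if last_id

  batch = criteria.to_a
  break if batch.empty?

  batch.each do |user|
    # Some manipulation to "user"
    # Some calculation for "user"
    # Saving "user"
  end

  last_id = batch.last.id
end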
For a long running task, consider using rails runner.
runner runs Ruby code in the context of Rails non-interactively. For instance:
$ rails runner "Model.long_running_method"
For further details, see:
http://guides.rubyonrails.org/command_line.html
I need to build a huge XML file, about 1-50 MB. I thought that using a builder would be efficient enough, and, well, it is, somewhat. The problem is that after the program reaches its last line it doesn't end immediately; Ruby keeps doing something for several seconds, maybe garbage collection? After that the program finally ends.
To give a real example, I measured the time to build an XML file. It reports 55 seconds (there is a database behind it, so it takes a while) for the XML to be built, but Ruby keeps working for about 15 more seconds and the processor goes crazy.
The pseudo/real code is as follows:
...
builder = Nokogiri::XML::Builder.with(doc) do |xml|
  build_node(xml)
end
...

def build_node(xml)
  ...
  xml["#{namespace}"] if namespace

  xml.send("#{elem_name}", attrs_hash) do |elem_xml|
    ...
    if has_children
      if type
        case type
        when XML::TextContent::PLAIN
          elem_xml.text text_content
        when XML::TextContent::COMMENT
          elem_xml.comment text_content
        when XML::TextContent::CDATA
          elem_xml.cdata text_content
        end
      else
        build_node(elem_xml)
      end
    end
  end
end
Note that I was previously using a different approach with my own structure of classes, and the build speed was the same, but at the last line the program ended normally. Now I am forced to use Nokogiri, so I have to find a solution.
What I can do to avoid that X seconds long overhead after the XML is built? Is it even possible?
UPDATE:
Thanks to a suggestion from Adiel Mittmann, during the creation of my minimal working example I was able to locate the problem. I now have a small (well not that small) example demonstrating the problem.
The following code is causing the problem:
xml.send("#{elem_name}_") do |elem_xml|
  ...
  elem_xml.text text_content # This line is the problem
  ...
end
So the line executes the following code based on Nokogiri's documentation:
def create_text_node string, &block
  Nokogiri::XML::Text.new string.to_s, self, &block
end
So the text node creation code gets executed. What exactly is happening here?
UPDATE 2:
After some other tries, the problem can be easily reproduced by:
builder = Nokogiri::XML::Builder.new do |xml|
  0.upto(81900) do
    xml.text "test"
  end
end

puts "End"
So is it really Nokogiri itself? Is there any option for me?
Your example also takes a long time to execute here. And you were right: it's the garbage collector that's taking so long to execute. Try this:
require 'nokogiri'

class A
  def a
    builder = Nokogiri::XML::Builder.new do |xml|
      0.upto(81900) do
        xml.text "test"
      end
    end
  end
end

A.new.a
puts "End1"
GC.start
puts "End2"
Here, the delay happens between "End1" and "End2". After "End2" is printed, the program closes immediately.
Notice that I created an object to demonstrate it. Otherwise, the data generated by the builder can only be garbage collected when the program finishes.
As for the best way to do what you're trying to accomplish, I suggest you ask another question giving details of what exactly you're trying to do with the XML files.
Try using the Ruby built-in (sic) Builder. I use it to generate large XML files as well, and it has such a small footprint.
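For reference, here's a minimal sketch of what that can look like with the Builder gem; the element names and output path are just placeholders:

require 'builder'

# Stream the XML straight to a file so a large document doesn't pile up in memory.
File.open("output.xml", "w") do |file|
  xml = Builder::XmlMarkup.new(:target => file, :indent => 2)
  xml.instruct! # emits the <?xml ... ?> declaration

  xml.records do
    0.upto(81900) do |i|
      xml.record("test", :id => i)
    end
  end
end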