fastest way to create huge files? - ruby

I need to create a huge file filled with anything. I'm doing it this way but it takes so long:
exit 1 unless ARGV.length > 0
File.open("file-#{ARGV[0]}M.txt", 'w') do |f|
(ARGV[0].to_i*1048576).times {f.write(1) }
end
What's the best way of doing that (in a platform-independent way)?

In *nix, use dd:
system("dd if=/dev/zero of=" + f + " bs=1 count=0 seek=" + ARGV[0] + "M");
If you want some content (instead of zeros) in the file, use /dev/random (or /dev/urandom, which will not block) for if instead of /dev/zero.
If you want a non-sparse file, change count=0 to count=1, use bs=#{ARGV[0]}M, and omit seek, so that dd actually writes the data.
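As a minimal sketch, the non-sparse variant could be called from Ruby like this (assuming ARGV[0] is the size in megabytes, as in the question; size_mb and target are just illustration names, and bs=1M count=#{size_mb} writes the same amount of data as bs=#{size_mb}M count=1 but with a small buffer):
size_mb = Integer(ARGV[0])
target  = "file-#{size_mb}M.txt"

# Write size_mb blocks of 1 MB each from /dev/zero; swap in /dev/urandom for random content.
system("dd", "if=/dev/zero", "of=#{target}", "bs=1M", "count=#{size_mb}") or abort "dd failed"
Passing the arguments to system separately also avoids any shell-quoting surprises in the filename.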
Universal method:
# Create a 1 MB fill buffer
fills = '1' * 1048576
File.open("file-#{ARGV[0]}M.txt", 'w') do |f|
  (ARGV[0].to_i).times { f.write(fills) }
end
It is similar to the one you have, but it writes 1 MB at a time. Your version writes 1 byte at a time, which creates a lot of overhead for the hard disk to seek and write. Writing 1 MB at a time will be much faster. If you have a fast drive, you can try increasing the buffer from 1 MB to 16 MB.
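A sketch of that idea with a configurable chunk size (the 16 MB figure and the chunk_mb knob are just an example, not part of the original answer; the remainder write is added so sizes that are not a multiple of the chunk still come out right):
chunk_mb = 16                                    # larger chunks mean fewer write calls
total_mb = ARGV[0].to_i
fills = '1' * (chunk_mb * 1048576)
File.open("file-#{total_mb}M.txt", 'w') do |f|
  (total_mb / chunk_mb).times { f.write(fills) }
  f.write('1' * ((total_mb % chunk_mb) * 1048576))  # remainder for sizes not divisible by chunk_mb
end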

A pure Ruby option:
n = ARGV[0] or exit 1
File.open("file-#{n}M.txt", 'w') do |f|
contents = "x" * (1024*1024)
n.to_i.times { f.write(contents) }
end
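If a sparse file (like the dd seek trick above) is acceptable and you want to stay in plain Ruby, a minimal sketch is below. Whether the result is actually sparse depends on the filesystem; the pattern itself is just seek-and-write-one-byte:
n = ARGV[0] or exit 1
File.open("file-#{n}M.txt", 'w') do |f|
  f.seek(n.to_i * 1024 * 1024 - 1)  # jump to the last byte of the desired size
  f.write("\0")                     # writing a single byte extends the file to that length
end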

Related

Ruby Zlib compression gives different outputs for the same input

I have this Ruby method for compressing a string:
def compress_data(data)
  output = StringIO.new
  gz = Zlib::GzipWriter.new(output)
  gz.write(data)
  gz.close
  compressed_data = output.string
  compressed_data
end
When I call this method with the same input, I get different outputs at different times. I am trying to get the byte array for the compressed outputs and compare them.
The output is Different when I run the code below:
input = "hello world"
output1 = (compress_data input).bytes.to_a
sleep 1
output2 = (compress_data input).bytes.to_a
if output1 == output2
  puts 'Same'
else
  puts 'Different'
end
The output is Same when I remove the sleep. Does the compression algorithm have something to do with the current time?
Option 1 - fixed mtime:
Yes. The modification time (mtime) is stored in the gzip header. You can use the mtime= method to set it to a fixed value, which will resolve your problem:
gz = Zlib::GzipWriter.new(output)
gz.mtime = 1
gz.write(data)
gz.close
Note that the Ruby documentation says that setting mtime to zero will disable the timestamp. I tried it, and it does not work. I also looked at the source code, and it appears this functionality is missing. Seems like a bug. So you have to set it to something other than 0 (it will reportedly be fixed in future releases).
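Put together, a minimal sketch of the question's method with the fixed timestamp added (the require lines are included only to make the snippet self-contained):
require 'zlib'
require 'stringio'

def compress_data(data)
  output = StringIO.new
  gz = Zlib::GzipWriter.new(output)
  gz.mtime = 1        # fixed timestamp, so identical input produces identical output bytes
  gz.write(data)
  gz.close
  output.string
end

compress_data("hello world") == compress_data("hello world")  # => true, even with a delay in between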
Option 2 - skip the header:
Another option is to just skip the header when checking for equal data. The header is 10 bytes long, so compare only the bytes that follow it:
data = compress_data(input).bytes[10..-1]
Note that you do not need to call to_a on bytes. It is already an Array:
String.bytes -> an_array
Returns an array of bytes in str. This is a shorthand for str.each_byte.to_a.
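Applied to the comparison from the question, that might look like this (10 being the gzip header length mentioned above; only the header timestamp differs between the two runs, so the remaining bytes compare equal):
output1 = compress_data(input).bytes[10..-1]
sleep 1
output2 = compress_data(input).bytes[10..-1]
puts(output1 == output2 ? 'Same' : 'Different')  # prints Same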

Fast and furious random reading of huge files

I know the question isn't new, but I haven't found anything useful. In my case I have a 20 GB file and I need to read random lines from it. I have a simple file index which contains line numbers and the corresponding seek offsets. I also disabled buffering when reading, so that only the needed line is read.
And this is my code:
def create_random_file_gen(file_path, batch_size=0, dtype=np.float32, delimiter=','):
    index = load_file_index(file_path)
    if (batch_size > len(index)) or (batch_size == 0):
        batch_size = len(index)
    lines_indices = np.random.random_integers(0, len(index), batch_size)
    with io.open(file_path, 'rb', buffering=0) as f:
        for line_index in lines_indices:
            f.seek(index[line_index])
            line = f.readline(2048)
            yield __get_features_from_line(line, delimiter, dtype)
The problem is that it's extremely slow: reading 5000 lines takes 89 seconds on my Mac (the file is on an SSD). Here is the code I used for testing:
features_gen = tedlium_random_speech_gen(5000)  # just a wrapper for the function given above
i = 0
for feature, cls in features_gen:
    if i % 1000 == 0:
        print("Got %d features" % i)
    i += 1
print("Total %d features" % i)
I've read something about memory-mapping files, but I don't really understand how it works: what the mapping does in essence, and whether it will speed up the process or not.
So the main question is: what are the possible ways to speed up the process? The only way I see now is to read random blocks of lines rather than individual lines.

Implemented merge sort for large CSV files in Python3, slow performance

So I decided to implement merge sort in Python 3 to handle large CSV files (working with 5 GB files >.<), and I think I have the logic down correctly. The problem is that it's quite slow, and I'm wondering if you have any suggestions on how to alter my code for better performance?
Thanks and please bear with my code, I'm still new to Python ^^
Here's the main piece of the merge sort code; note that this is after breaking the file into chunks and sorting each chunk:
def merge_sort():
    files_to_merge = os.listdir(temp_folder)
    files_left = len(files_to_merge)
    print("Merging {} files...".format(files_left))
    temp_file_count = files_left + 1
    while files_left != 1:
        first_file = temp_folder + files_to_merge[0]
        print(first_file)
        second_file = temp_folder + files_to_merge[1]
        print(second_file)
        # Process both files.
        with open(first_file, 'r', encoding='utf-8') as file_1:
            with open(second_file, 'r', encoding='utf-8') as file_2:
                # Setup
                temp_file = temp_folder + "tempFile - {:03}.csv".format(temp_file_count)
                file1_line, file2_line = file_1.readline(), file_2.readline()
                compare_values_list = [file1_line.split(','), file2_line.split(',')]
                print("Writing to >> {}...".format(temp_file))
                # Keep going until all values have been read from both files.
                with open(temp_file, 'a', encoding='utf-8') as m_file:
                    while len(compare_values_list) != 0 or (file1_line != '' or file2_line != ''):
                        # Grab the highest value from the list, write to a file, and delete it.
                        compare_values_list.sort(key=sorter)  # sorter = operator.itemgetter(sort_key)
                        line_to_write = ','.join(compare_values_list[0])
                        del compare_values_list[0]
                        m_file.write(line_to_write)
                        # Get the next values from the file and check whether to add to the list.
                        file1_line, file2_line = file_1.readline(), file_2.readline()
                        if file1_line != '' and file2_line != '':
                            compare_values_list.append(file1_line.split(','))
                            compare_values_list.append(file2_line.split(','))
                        elif file1_line != '' and file2_line == '':
                            compare_values_list.append(file1_line.split(','))
                        elif file1_line == '' and file2_line != '':
                            compare_values_list.append(file2_line.split(','))
        # Clean up files and update values.
        os.remove(first_file)
        os.remove(second_file)
        temp_file_count += 1
        files_to_merge = os.listdir(temp_folder)
        files_left = len(files_to_merge)
    print("Finish merging files.")
There are 2 slow parts that jump out.
The first is that your script opens the temp file every time it writes something. Move these lines outside the nested while loop:
with open(temp_file, 'a', encoding='utf-8') as m_file:
    m_file.write(line_to_write)
You might also consider saving the data to a variable in memory, but I'm not sure how good an idea that is if the file is large.
Second is your use of compare_values_list. You are frequently appending and deleting, which requires a lot of work reallocating memory. You're also effectively rebuilding the list very often. First try avoiding copying the list on each iteration and sort it in place:
compare_values_list.sort(key=sorter)
should help you avoid that. If you want to try to make it faster, preallocate the list and manage its size. Something like:
compare_values_list_capacity = 1000
compare_values_list_size = 0
compare_values_list = [None]*compare_values_list_capacity
though I am hazy on the details of mixing these two suggestions - I'm not sure preallocation will play well with sorting in place, so it's worth trying both and seeing which works better.

Get numbers from a list in a file, output to another file in Ruby?

I have a big text file that contains - among others - lines like this:
"X" : "452345230"
I want to find all lines that contain "X", take just the number (without the quotation marks), and then output the numbers to another file, in this fashion:
452349532
234523452
213412411
219456433
etc.
What I did so far is this:
myfile = File.open("myfile.txt")
x = []
myfile.grep(/"X"/) {|line|
x << line.match( /"(\d{9})/ ).values_at( 1 )[0]
puts x
File.open("output.txt", 'w') {|f| f.write(x) }
}
It works, but the list it produces is of this form:
["23419230", "2349345234" , ... ]
How do I output it like I showed before - just the numbers, each on its own line?
Thanks.
Here's a solution that doesn't leave files open:
File.open("output.txt", 'w') do |output|
File.open("myfile.txt").each do |line|
output.puts line[/\d{9}/] if line[/"X"/]
end
end
I couldn't reproduce what you saw:
$ cat myfile.txt
"X" : "452345230"
"X" : "452345231"
"X" : "452345232"
"X" : "452345233"
$ ./scanner.rb
452345230
452345230
452345231
452345230
452345231
452345232
452345230
452345231
452345232
452345233
$ cat output.txt
452345230452345231452345232452345233$
However, I did notice that your application is incredibly wasteful and probably not doing what you expect: you open output.txt, write some content to it, then close it again. The next time it is opened in the loop, it is overwritten. If your file is 1,000 lines long, this won't be so bad: you're only recreating the file 1,000 times. If your file is 1,000,000 lines long, this is going to be a pretty horrible performance penalty, as you create the file, write into it, and then overwrite it again, one million times. Oops.
I re-wrote your tool a little bit:
$ cat scanner.rb
#!/usr/bin/ruby -w
myfile = File.open("myfile.txt")
output = File.open("output.txt", 'w')
myfile.grep(/"X"/) {|line|
  x = line.match( /"(\d{9})/ ).values_at( 1 )[0]
  puts x
  output.write(x + "\n")
}
This opens each file exactly once, writes each new line one at a time, and then lets both files be closed when the application quits. Depending on whether this is a small portion of your application or the entire thing, this might be alright. (If it is a small portion of the program, then definitely close the files when you're done with them.)
This might still be wasteful for one million matched lines -- those writes are almost certainly handed straight to the system call write(2), which will involve some overhead.
How many of these will you be running? Millions? Billions? If this needs more refinement feel free to ask...
Solution:
myfile = File.open("myfile.txt")
File.open("output.txt", 'w') do |output|
  content = myfile.lines.map { |line| line.scan(/^"X".*(\d{9})/) }.flatten.join("\n")
  output.write(content)
end
Edit: I updated the code, reducing it a bit. If the example above seems complicated, you can also grab the data you want with the following statement (it may be a little clearer about what's happening):
content = myfile.lines.select { |line| line =~ /"X"/ }.map { |line| line.scan(/\d{9}/) }.join("\n")

test if a PDF file is finished in Ruby (on Solaris/Unix)?

I have a server that generates or copies PDF files to a specific folder.
I wrote a Ruby script (my first ever) that regularly checks for its own PDF files and displays them with Acrobat. So simple, so nice.
But now I have the problem: how do I detect that a PDF is complete?
The generated PDFs end with %%EOF\n,
but the copied ones are produced with some Apple magic (Acrobat Writer, I think) that puts an %%EOF near the beginning of the file, lots of binary zeros, and another %%EOF near the end, followed by a carriage return (or line feed) and a binary zero at the very end.
while true
  dir = readpfad
  Dir.foreach(dir) do |f|
    datei = File.join(dir, f)
    if File.file?(datei)
      if File.stat(datei).owned?
        if datei[-9..-1].upcase == "__PDF.PDF"
          if File.stat(datei).size > 5
            test = File.new(datei)
            dummy = test.readlines
            if dummy[-1][0..4] == "%%EOF"
              # move the file, so it will not be shown again
              cmd = "mv " + datei + " " + movepfad
              system(cmd)
              acro = ACROREAD + " " + File.join(movepfad, f) + "&"
              system(acro)
            else
              puts ">>>" + dummy[-1] + "<<<"
            end
          end
        end
      end
    end
  end
  sleep 1
end
Any help or idea?
Thanks
Peter
All the %%EOF token means is that one should appear within the last 1024 bytes of the physical end of the file. The structure of PDF is such that a PDF document may have one or more %%EOF tokens within it (the details are in the spec).
As such, "contains %%EOF" is not equivalent to "completely copied". Really, the correct answer is that the server should signal when it's done and your code should be a client of that signal. In general, polling - especially IO-bound polling - is the wrong answer to this problem.
