Get numbers from a list in a file, output to another file in Ruby? - ruby

I have a big text file that contains - among others- lines like these:
"X" : "452345230"
I want to find all lines that contain "X" , and take just the number (without the quotation marks), and then output the numbers in another file, in this fashion:
452349532
234523452
213412411
219456433
etc.
What I did so far is this:
myfile = File.open("myfile.txt")
x = []
myfile.grep(/"X"/) {|line|
x << line.match( /"(\d{9})/ ).values_at( 1 )[0]
puts x
File.open("output.txt", 'w') {|f| f.write(x) }
}
it works, but the list it produces is of this form:
["23419230", "2349345234" , ... ]
How do I output it like I showed before, just numbers and each number in a line?
Thanks.

Here's a solution that doesn't leave files open:
File.open("output.txt", 'w') do |output|
File.open("myfile.txt").each do |line|
output.puts line[/\d{9}/] if line[/"X"/]
end
end

I couldn't reproduce what you saw:
$ cat myfile.txt
"X" : "452345230"
"X" : "452345231"
"X" : "452345232"
"X" : "452345233"
$ ./scanner.rb
452345230
452345230
452345231
452345230
452345231
452345232
452345230
452345231
452345232
452345233
$ cat output.txt
452345230452345231452345232452345233$
However, I did notice that your application is incredibly wasteful and probably not doing what you expect: You open output.txt, write some content to it, then close it again. The next time it is opened in the loop, it is overwritten. If your file is 1000 lines long, this won't be so bad, you're only making 1000 files. If your file is 1,000,000 lines long, this is going to represent a pretty horrible performance penalty as you create a file, write into it, and then delete it again, one million times. Oops.
I re-wrote your tool a little bit:
$ cat scanner.rb
#!/usr/bin/ruby -w
myfile = File.open("myfile.txt")
output = File.open("output.txt", 'w')
myfile.grep(/"X"/) {|line|
x = line.match( /"(\d{9})/ ).values_at( 1 )[0]
puts x
output.write(x + "\n")
}
This opens each file exactly onces, writes each new line one at a time, and then lets them both be closed when the application quits. Depending upon if this is a small portion of your application or the entire thing, this might be alright. (If this is a small portion of the program, then definitely close the files when you're done with them.)
This might still be wasteful for one million matched lines -- those writes are almost certainly handed straight to the system call write(2), which will involve some overhead.
How many of these will you be running? Millions? Billions? If this needs more refinement feel free to ask...

Solution:
myfile = File.open("myfile.txt")
File.open("output.txt", 'w') do |output|
content = myfile.lines.map { |line| line.scan(/^"X".*(\d{9})/) }.flatten.join("\n")
output.write(content)
end
Edited: I updated the code reducing it a bit. If the example above seems complicated, you can also grab the data you want with the following statement (could be a little bit clear of what's happening):
content = myfile.lines.select { |line| line =~ /"X"/ }.map { |line| line.scan(/\d{9}/) }.join("\n")

Related

Ruby Zlib compression gives different outputs for the same input

I have this ruby method for compressing a string -
def compress_data(data)
output = StringIO.new
gz = Zlib::GzipWriter.new(output)
gz.write(data)
gz.close
compressed_data = output.string
compressed_data
end
When I call this method with the same input, I get different outputs at different times. I am trying to get the byte array for the compressed outputs and compare them.
The output is Different when I run the below -
input = "hello world"
output1 = (compress_data input).bytes.to_a
sleep 1
output2 = (compress_data input).bytes.to_a
if output1 == output2
puts 'Same'
else
puts 'Different'
end
The output is Same when I remove the sleep. Does the compression algorithm have something to do with the current time?
Option 1 - fixed mtime:
Yes. The compression time is stored in the header. You can use the mtime method to set the time to a fixed value, which will resolve your problem:
gz = Zlib::GzipWriter.new(output)
gz.mtime = 1
gz.write(data)
gz.close
Note that the Ruby documentation says that setting mtime to zero will disable the timestamp. I tried it, and it does not work. I also looked at the source code, and it appears this functionality is missing. Seems like a bug. So you have to set it to something else than 0 (but see comments below - it will be fixed in future releases).
Option 2 - skip the header:
Another option is to just skip the header when checking for similar data. The header is 10 bytes long, so to only check the data:
data = compress_data(input).bytes[10..-1]
Note that you do not need to call to_a on bytes. It is already an Array:
String.bytes -> an_array
Returns an array of bytes in str. This is a shorthand for str.each_byte.to_a.

python regex specific blocks of text from large text file

I'm new to python and this site so thank-you in advance for your... understanding. This is my first attempt at a python script.
I'm having what I think is a performance issue trying to solve this problem which is causing me to not get any data back.
This code works on a small text file of a couple pages but when I try to use it on my 35MB real data text file it just hits the CPU and hasn't returned any data (>24 hours now).
Here's a snippet of the real data from the 35MB text file:
D)dddld
d00d90d
dd
ddd
vsddfgsdfgsf
dfsdfdsf
aAAAAAa
221546
29806916295
Meowing
fs:/mod/umbapp/umb/sentbox/221546.pdu
2013:10:4:22:11:31:4
sadfsdfsdf
sdfff
ff
f
29806916295
What's your cat doing?
fs:/mod/umbapp/umb/sentbox/10955.pdu
2013:10:4:22:10:15:4
aaa
aaa
aaaaa
What I'm trying to copy into a new file:
29806916295
Meowing
fs:/mod/umbapp/umb/sentbox/221546.pdu
2013:10:4:22:11:31:4
29806916295
What's your cat doing?
fs:/mod/umbapp/umb/sentbox/10955.pdu
2013:10:4:22:10:15:4
My Python code is:
import re
with open('testdata.txt') as myfile:
content = myfile.read()
text = re.search(r'\d{11}.*\n.*\n.*(\d{4})\D+(\d{2})\D+(\d{1})\D+(\d{2})\D+(\d{2})\D+\d{2}\D+\d{1}', content, re.DOTALL).group()
with open("result.txt", "w") as myfile2:
myfile2.write(text)
Regex isn't the fastest way to search a string. You also compounded the problem by having a very big string (35MB). Reading an entire file into memory is generally not recommended because you may run into memory issues.
Judging from your regex pattern, it seems like you want to capture 4-line groups that start with an 11-digit string and end with some time-line string. Try this code:
import re
start_pattern = re.compile(r'^\d{11}$')
end_pattern = re.compile(r'^\d{4}\D+\d{2}\D+\d{1}\D+\d{2}\D+\d{2}\D+\d{2}\D+\d{1}$')
capturing = 0
capture = ''
with open('output.txt', 'w') as output_file:
with open('input.txt', 'r') as input_file:
for line in input_file:
if capturing > 0 and capturing <= 4:
capturing += 1
capture += line
elif start_pattern.match(line):
capturing = 1
capture = line
if capturing == 4:
if end_pattern.match(line):
output_file.write(capture + '\n')
else:
capturing = 0
It iterates over the input file, line by line. If it finds a line matching the start_pattern, it will read in 3 more. If the 4th line matches the end_pattern, it will write the whole group to the output file.

error in writing to a file

I have written a python script that calls unix sort using subprocess module. I am trying to sort a table based on two columns(2 and 6). Here is what I have done
sort_bt=open("sort_blast.txt",'w+')
sort_file_cmd="sort -k2,2 -k6,6n {0}".format(tab.name)
subprocess.call(sort_file_cmd,stdout=sort_bt,shell=True)
The output file however contains an incomplete line which produces an error when I parse the table but when I checked the entry in the input file given to sort the line looks perfect. I guess there is some problem when sort tries to write the result to the file specified but I am not sure how to solve it though.
The line looks like this in the input file
gi|191252805|ref|NM_001128633.1| Homo sapiens RIMS binding protein 3C (RIMBP3C), mRNA gnl|BL_ORD_ID|4614 gi|124487059|ref|NP_001074857.1| RIMS-binding protein 2 [Mus musculus] 103 2877 3176 846 941 1.0102e-07 138.0
In output file however only gi|19125 is printed. How do I solve this?
Any help will be appreciated.
Ram
Using subprocess to call an external sorting tool seems quite silly considering that python has a built in method for sorting items.
Looking at your sample data, it appears to be structured data, with a | delimiter. Here's how you could open that file, and iterate over the results in python in a sorted manner:
def custom_sorter(first, second):
""" A Custom Sort function which compares items
based on the value in the 2nd and 6th columns. """
# First, we break the line into a list
first_items, second_items = first.split(u'|'), second.split(u'|') # Split on the pipe character.
if len(first_items) >= 6 and len(second_items) >= 6:
# We have enough items to compare
if (first_items[1], first_items[5]) > (second_items[1], second_items[5]):
return 1
elif (first_items[1], first_items[5]) < (second_items[1], second_items[5]):
return -1
else: # They are the same
return 0 # Order doesn't matter then
else:
return 0
with open(src_file_path, 'r') as src_file:
data = src_file.read() # Read in the src file all at once. Hope the file isn't too big!
with open(dst_sorted_file_path, 'w+') as dst_sorted_file:
for line in sorted(data.splitlines(), cmp = custom_sorter): # Sort the data on the fly
dst_sorted_file.write(line) # Write the line to the dst_file.
FYI, this code may need some jiggling. I didn't test it too well.
What you see is probably the result of trying to write to the file from multiple processes simultaneously.
To emulate: sort -k2,2 -k6,6n ${tabname} > sort_blast.txt command in Python:
from subprocess import check_call
with open("sort_blast.txt",'wb') as output_file:
check_call("sort -k2,2 -k6,6n".split() + [tab.name], stdout=output_file)
You can write it in pure Python e.g., for a small input file:
def custom_key(line):
fields = line.split() # split line on any whitespace
return fields[1], float(fields[5]) # Python uses zero-based indexing
with open(tab.name) as input_file, open("sort_blast.txt", 'w') as output_file:
L = input_file.read().splitlines() # read from the input file
L.sort(key=custom_key) # sort it
output_file.write("\n".join(L)) # write to the output file
If you need to sort a file that does not fit in memory; see Sorting text file by using Python

Recovering hex data from a large log-file using Ruby and RegEx

I'm trying to filter/append lines of hex data from a large log-file, using Ruby and RegEx.
The lines of the log-file that I need look like this:
Data: 10 55 61 (+ lots more hex data)
I want to add all of the hex data, for further processing later. The regex /^\sData:(.+)/ should do the trick.
My Ruby-program looks like this:
puts "Start"
fileIn = File.read("inputfile.txt")
fileOut = File.new("outputfile.txt", "w+")
fileOut.puts "Start of regex data\n"
fileIn.each_line do
dataLine = fileIn.match(/^\sData:(.+)/).captures
fileOut.write dataLine
end
fileOut.puts "\nEOF"
fileOut.close
puts "End"
It works - sort of - but the lines in the output file are all the same, just repeating the result of the first regex match.
What am I doing wrong?
You are iterating over the same entire file. You need to iterate over the line.
fileIn.each_line do |line|
dataLine = line.match(/^\sData:(.+)/).captures
fileOut.write dataLine
end

fastest way to create huge files?

I need to create a huge file filled with anything. I'm doing it this way but it takes so long:
exit 1 unless ARGV.length > 0
File.open("file-#{ARGV[0]}M.txt", 'w') do |f|
(ARGV[0].to_i*1048576).times {f.write(1) }
end
What's the best way of doing that (in platform independent way?)
In *nix, use dd:
system("dd if=/dev/zero of=" + f + " bs=1 count=0 seek=" + ARGV[0] + "M");
If you want some content (instead of zeros) in the file, use
/dev/random
for if instead of /dev/zero
If you want a non-sparse file, use
bs=#{ARGV[0]}M
and omit seek
Universal method:
#Create a 1M fill buffer
fills = '1'*1048576
File.open("file-#{ARGV[0]}M.txt", 'w') do |f|
(ARGV[0].to_i).times {f.write(fills) }
end
It is similar to the one you have, but it writes 1M at a time. You write 1 byte at a time which creates a lot of overhead for hard disk to search and write. Writing 1M at a time will be much faster. If you have an even faster hard drive (like 16M/s), you can try to increase 1M to 16M.
A pure Ruby option:
n = ARGV[0] or exit 1
File.open("file-#{n}M.txt", 'w') do |f|
contents = "x" * (1024*1024)
n.to_i.times { f.write(contents) }
end

Resources