Split a complex file into a hash - ruby

I am running a command-line program called Primer 3. It takes an input file and writes data to standard output. I am trying to write a Ruby script which will accept that output and put the entries into a hash.
The results returned are below, under "Data source". I would like to split the data on the '=' sign, so that the hash would look something like this:
{:SEQUENCE_ID => "example", :SEQUENCE_TEMPLATE => "GTAGTCAGTAGACNAT..etc", :SEQUENCE_TARGET => "37,21", etc. }
I would also like to lowercase the keys, i.e.:
{:sequence_id => "example", :sequence_template => "GTAGTCAGTAGACNAT..etc", :sequence_target => "37,21", etc. }
This is my current script:
#!/usr/bin/ruby
puts 'Primer 3 hash'
primer3 = {}
while line = gets do
  name, height = line.split(/=/)
  primer3[name] = height.to_i
end
puts primer3
It is returning this:
Primer 3 hash
{"SEQUENCE_ID"=>0, "SEQUENCE_TEMPLATE"=>0, "SEQUENCE_TARGET"=>37, "PRIMER_TASK"=>0, "PRIMER_PICK_LEFT_PRIMER"=>1, "PRIMER_PICK_INTERNAL_OLIGO"=>1, "PRIMER_PICK_RIGHT_PRIMER"=>1, "PRIMER_OPT_SIZE"=>18, "PRIMER_MIN_SIZE"=>15, "PRIMER_MAX_SIZE"=>21, "PRIMER_MAX_NS_ACCEPTED"=>1, "PRIMER_PRODUCT_SIZE_RANGE"=>75, "P3_FILE_FLAG"=>1, "SEQUENCE_INTERNAL_EXCLUDED_REGION"=>37, "PRIMER_EXPLAIN_FLAG"=>1, "PRIMER_THERMODYNAMIC_PARAMETERS_PATH"=>0, "PRIMER_LEFT_EXPLAIN"=>0, "PRIMER_RIGHT_EXPLAIN"=>0, "PRIMER_INTERNAL_EXPLAIN"=>0, "PRIMER_PAIR_EXPLAIN"=>0, "PRIMER_LEFT_NUM_RETURNED"=>0, "PRIMER_RIGHT_NUM_RETURNED"=>0, "PRIMER_INTERNAL_NUM_RETURNED"=>0, "PRIMER_PAIR_NUM_RETURNED"=>0, ""=>0}
Data source
SEQUENCE_ID=example
SEQUENCE_TEMPLATE=GTAGTCAGTAGACNATGACNACTGACGATGCAGACNACACACACACACACAGCACACAGGTATTAGTGGGCCATTCGATCCCGACCCAAATCGATAGCTACGATGACG
SEQUENCE_TARGET=37,21
PRIMER_TASK=pick_detection_primers
PRIMER_PICK_LEFT_PRIMER=1
PRIMER_PICK_INTERNAL_OLIGO=1
PRIMER_PICK_RIGHT_PRIMER=1
PRIMER_OPT_SIZE=18
PRIMER_MIN_SIZE=15
PRIMER_MAX_SIZE=21
PRIMER_MAX_NS_ACCEPTED=1
PRIMER_PRODUCT_SIZE_RANGE=75-100
P3_FILE_FLAG=1
SEQUENCE_INTERNAL_EXCLUDED_REGION=37,21
PRIMER_EXPLAIN_FLAG=1
PRIMER_THERMODYNAMIC_PARAMETERS_PATH=/usr/local/Cellar/primer3/2.3.4/bin/primer3_config/
PRIMER_LEFT_EXPLAIN=considered 65, too many Ns 17, low tm 48, ok 0
PRIMER_RIGHT_EXPLAIN=considered 228, low tm 159, high tm 12, high hairpin stability 22, ok 35
PRIMER_INTERNAL_EXPLAIN=considered 0, ok 0
PRIMER_PAIR_EXPLAIN=considered 0, ok 0
PRIMER_LEFT_NUM_RETURNED=0
PRIMER_RIGHT_NUM_RETURNED=0
PRIMER_INTERNAL_NUM_RETURNED=0
PRIMER_PAIR_NUM_RETURNED=0
=
$ primer3_core < example2 | ruby /Users/sean/Dropbox/bin/rb/read_primer3.rb

#!/usr/bin/ruby
puts 'Primer 3 hash'
primer3 = {}
while line = gets do
  key, value = line.split(/=/, 2)
  primer3[key.downcase.to_sym] = value.chomp
end
puts primer3

For fun, here are a couple of purely-functional solutions. Both assume that you've already pulled your data from the file, e.g.
my_data = ARGF.read # read the file passed on the command line
This one feels sort of gross, but it is a (long) one-liner :)
hash = Hash[ my_data.lines.map{ |line|
  line.chomp.split('=', 2).map.with_index{ |s, i| i == 0 ? s.downcase.to_sym : s }
} ]
This one is two lines, but feels cleaner than using with_index:
keys,values = my_data.lines.map{ |line| line.chomp.split('=',2) }.transpose
hash = Hash[ keys.map(&:downcase).map(&:to_sym).zip(values) ]
Both of these are likely less efficient and certainly more memory-intense than your already-accepted answer; iterating the lines and slowly mutating your hash is the best way to go. These non-mutating variations are just a mental exercise.
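If you want the line-by-line, hash-mutating approach but without an explicit accumulator variable, each_with_object is a middle ground; a small sketch reusing my_data from above:

hash = my_data.each_line.each_with_object({}) do |line, h|
  key, value = line.chomp.split('=', 2)
  h[key.downcase.to_sym] = value
end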
Your final answer should use ARGF to allow filenames on the command line or via STDIN. I would write it like so:
#!/usr/bin/ruby
module Primer3
  def self.parse( file )
    {}.tap do |primer3|
      # Process one line at a time, without reading it all into memory first
      file.each_line do |line|
        key, value = line.chomp.split('=', 2)
        primer3[key.downcase.to_sym] = value
      end
    end
  end
end

p Primer3.parse( ARGF ) if __FILE__ == $0
This way you can either call the file from the command line, with or without STDIN, or you can require this file and use the module function it defines in other code.
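For instance, assuming the script is saved as read_primer3.rb (the path used in the command line above), reuse from another script might look like this sketch; the output file name here is hypothetical:

require_relative 'read_primer3'

settings = Primer3.parse(File.open('primer3_output.txt')) # hypothetical file name
puts settings[:sequence_id] # => "example"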

OK, I have it (almost). The only problem is that it is adding a \n at the end of each value.
puts 'Primer 3 hash'
primer3 = {}
while line = gets do
  key, value = line.split(/=/)
  puts key
  puts value
  primer3[key.downcase] = value
end
puts primer3
{"sequence_id"=>"example\n", "sequence_template"=>"GTAGTCAGTAGACNATGACNACTGACGATGCAGACNACACACACACACACAGCACACAGGTATTAGTGGGCCATTCGATCCCGACCCAAATCGATAGCTACGATGACG\n", "sequence_target"=>"37,21\n", "primer_task"=>"pick_detection_primers\n", "primer_pick_left_primer"=>"1\n", "primer_pick_internal_oligo"=>"1\n", "primer_pick_right_primer"=>"1\n", "primer_opt_size"=>"18\n", "primer_min_size"=>"15\n", "primer_max_size"=>"21\n", "primer_max_ns_accepted"=>"1\n", "primer_product_size_range"=>"75-100\n", "p3_file_flag"=>"1\n", "sequence_internal_excluded_region"=>"37,21\n", "primer_explain_flag"=>"1\n", "primer_thermodynamic_parameters_path"=>"/usr/local/Cellar/primer3/2.3.4/bin/primer3_config/\n", "primer_left_explain"=>"considered 65, too many Ns 17, low tm 48, ok 0\n", "primer_right_explain"=>"considered 228, low tm 159, high tm 12, high hairpin stability 22, ok 35\n", "primer_internal_explain"=>"considered 0, ok 0\n", "primer_pair_explain"=>"considered 0, ok 0\n", "primer_left_num_returned"=>"0\n", "primer_right_num_returned"=>"0\n", "primer_internal_num_returned"=>"0\n", "primer_pair_num_returned"=>"0\n", ""=>"\n"}

Related

How to efficiently split file into arbitrary byteranges in ruby

I need to upload a big file to a third party service.
This third party service gives me a list of urls and byteranges:
requests = [
  {url: "https://.../part1", from: 0, to: 20_000_000},
  {url: "https://.../part2", from: 20_000_001, to: 40_000_000},
  {url: "https://.../part3", from: 40_000_001, to: 54_184_279}
]
I'm using the httpx gem to upload the data; the :body option can receive an IO or Enumerable object.
I would like to split and upload the chunks in an efficient way. This is why I think I should avoid writing chunks to disk and also avoid loading the entire file into memory. I suppose the best option would be some kind of "lazy Enumerable", but I don't know how to write the part function that would return this IO or Enumerable object.
file = File.open("bigFile", "rb")
results = requests.map do |request|
  Thread.start { HTTPX.post(request[:url], body: part(file, request[:from], request[:to])) }
end.map(&:value)

def part(file, from, to)
  # ???
end
The easiest way to generate an enumerator for each "byterange" would be to let the part function handle the opening of the file:
def part(filepath, from, to = nil, chunk_size = 4096, &block)
  return to_enum(__method__, filepath, from, to, chunk_size) unless block_given?
  size = File.size(filepath)
  to = size - 1 unless to and to >= from and to < size
  io = File.open(filepath, "rb")
  io.seek(from, IO::SEEK_SET)
  while io.pos <= to
    size = (io.pos + chunk_size <= to) ? chunk_size : 1 + to - io.pos
    chunk = io.read(size)
    yield chunk
  end
ensure
  io.close if io
end
Warning: the chunk size calculation may be wrong; I will check it in a while (I have to take care of my child).
Note: You may want to improve this function to ensure that you always read a full physical HDD block (or a multiple of it), as it will greatly speed-up the IO. You'll have a misalignment when from is not a multiple of the physical HDD block.
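For illustration, here is one way that suggestion could look, as a sketch that assumes a 4096-byte physical block (aligned_part is just an illustrative name): read a short head chunk up to the next block boundary, then continue with whole aligned blocks.

BLOCK = 4096 # assumed physical block size

def aligned_part(filepath, from, to, &block)
  return to_enum(__method__, filepath, from, to) unless block_given?
  File.open(filepath, "rb") do |io|
    io.seek(from, IO::SEEK_SET)
    # Read just enough to reach the next block boundary...
    head = (BLOCK - from % BLOCK) % BLOCK
    head = [head, to - from + 1].min
    yield io.read(head) if head > 0
    # ...then read whole aligned blocks until the range is exhausted.
    yield io.read([BLOCK, to - io.pos + 1].min) while io.pos <= to
  end
end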
The part function now returns an Enumerator when called without a block:
part("bigFile", 0, 1300, 512)
#=> #<Enumerator: main:part("bigFile", 0, 1300, 512)>
And of course you can call it directly with a block:
part("bigFile", 0, 1300, 512) do |chunk|
puts "#{chunk.inspect}"
end
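Wiring this back into the question's upload loop might look like the sketch below; HTTPX is said above to accept an Enumerable body, but the threading shown here is an assumption, not something tested against the httpx API:

threads = requests.map do |request|
  Thread.new do
    HTTPX.post(request[:url], body: part("bigFile", request[:from], request[:to]))
  end
end
responses = threads.map(&:value)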
IO.read("bigFile", 1000, 2000)
will read 1000 bytes, starting at offset 2000. Ruby starts counting at zero, so I think
IO.read("bigFile", 20_000_000, 0) #followed by
IO.read("bigFile,20_000_000,20_000_000) #not 20_000_001
would be correct. Without bookkeeping:
f = File.open("bigFile")
partname = "part0"
until f.eof? do
  partname = partname.succ
  chunk = f.read(20_000_000)
  # do something with chunk and partname
end
f.close
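Equivalent, as a small variation: the block form of File.open closes the handle automatically, even if an exception is raised:

File.open("bigFile") do |f|
  partname = "part0"
  until f.eof?
    partname = partname.succ
    chunk = f.read(20_000_000)
    # do something with chunk and partname
  end
end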

How to remove headers and second column in CSV in ruby?

I have a CSV that looks like this:
user_id,is_user_unsubscribed
131072,1
7077888,1
11010048,1
12386304,1
327936,1
2228480,1
6553856,1
9830656,1
10158336,1
10486016,1
10617088,1
11010304,1
11272448,1
393728,1
7012864,1
8782336,1
11338240,1
11928064,1
4326144,1
8127232,1
11862784,1
but I want the data to look like this:
131072
7077888
11010048
12386304
327936
...
Any ideas on what to do? I have 330,000 rows...
You can read your file as an array and ignore the first row like this:
data = CSV.read("dataset.csv")[1 .. -1]
This way you can remove the header.
Regarding the column: if you read the file with headers enabled, CSV.read returns a CSV::Table, and you can delete a column by name:
data = CSV.read("dataset.csv", headers: true)
data.delete("is_user_unsubscribed") # removes the whole column
data.to_csv # => The new CSV in string format
Check these for more info: http://ruby-doc.org/stdlib-1.9.2/libdoc/csv/rdoc/CSV/Table.html
http://ruby-doc.org/stdlib-2.0.0/libdoc/csv/rdoc/CSV.html
My recommendation would be to read each line from your file as a String, then split it on the comma that separates your columns.
Splitting a Ruby String:
https://code-maven.com/ruby-split
line_num = 0
text = File.open('myfile.csv').read
text.each_line do |line|
  text_array = line.split(',') # the columns are separated by commas
  text_i_want = text_array[0]
  line_num = line_num + 1
  puts text_i_want
end
In this code we open a text file and read it line by line. We split each line on commas, take the text we want from the first column (the zeroth item in the array), and print it.
If you do not want the header, add an if statement so the row where line_num is 0 is not picked up. Even better, use unless, as in the sketch below.
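A minimal sketch of that unless variant:

File.open('myfile.csv').read.each_line.with_index do |line, line_num|
  puts line.split(',')[0] unless line_num == 0
end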
Just rewrite a new file with your new data.
I wound up doing this. Is this kosher?
user_ids = []

CSV.foreach("eds_users_sept15.csv", headers: true) do |row|
  user_ids << row['user_id']
end

user_ids.count
# => 322101

CSV.open('some_new_file.csv', 'w') do |c|
  user_ids.each do |id|
    c << [id]
  end
end
I have 330,000 rows...
So I guess speed matters, right?
I took your method and the other two that were proposed, tested them on a 330,000-row CSV file, and ran a benchmark to show you something interesting.
require 'csv'
require 'benchmark'

Benchmark.bm(10) do |bm|
  bm.report("Method 1:") {
    data = Array.new
    CSV.foreach("input.csv", headers: true) do |row|
      data << row['user_id']
    end
  }
  bm.report("Method 2:") {
    data = CSV.read("input.csv")[1 .. -1]
    data.delete("is_user_unsubscribed")
  }
  bm.report("Method 3:") {
    data = Array.new
    File.open('input.csv').read.each_line do |line|
      data << line.split(',')[0]
    end
    data.shift # => remove headers
  }
end
The output:
                 user     system      total        real
Method 1:    3.110000   0.010000   3.120000 (  3.129409)
Method 2:    1.990000   0.010000   2.000000 (  2.004016)
Method 3:    0.380000   0.010000   0.390000 (  0.383700)
As you can see, handling the CSV file as a simple text file, splitting the lines and pushing them into the array is ~5 times faster than using the CSV module. Of course it has some disadvantages too; e.g., if you ever add columns to the input file you'll have to review the code.
It's up to you if you prefer lightspeed code or easier scalability.
I'm guessing that you plan to convert each string that precedes a comma to an integer. If so,
File.readlines("dataset.csv").drop(1).map(&:to_i)
is all you need, since String#to_i ignores everything from the first non-digit onward. (For example, "131072,1".to_i #=> 131072.)
If you want strings, you could write
File.readlines("dataset.csv").drop(1).map { |line| line[/\d+/] }

Write an array to multi column CSV format using Ruby

I have an array of arrays in Ruby that I'm trying to output to a CSV file (or text) so that I can then easily transfer it over to an XML file for graphing.
I can't seem to get the output (in text format) to look like the example below; instead I get one line of data which is just a large array.
0,2
0,3
0,4
0,5
I originally tried something along these lines:
File.open('02.3.gyro_trends.text', 'w') { |file| trend_array.each { |x, y| file.puts(x, y) } }
And it outputs
0.2
46558
0
46560
0
....etc etc.
Can anyone point me in the "write" direction for getting either:
(i) a .text file that lays my data out like so:
trend_array[0][0], trend_array[0][1]
trend_array[1][0], trend_array[1][1]
trend_array[2][0], trend_array[2][1]
trend_array[3][0], trend_array[3][1]
(ii) .csv file that would put this data in separate columns.
Edit: I recently added more than two values to my array; check out my answer below combining Cameck's solution.
This is what I have at the moment:
trend_array = []
j = 1
# cycle through array and find change in gyro data.
while j < gyro_array.length - 2
  if gyro_array[j+1][1] < 0.025 && gyro_array[j+1][1] > -0.025
    trend_array << [0, gyro_array[j][0]]    # flat: next value is within +/-0.025
    j += 1
  elsif gyro_array[j+1][1] > -0.025         # next value is >= 0.025: log it as a rising trend
    trend_array << [0.2, gyro_array[j][0]]
    j += 1
  elsif gyro_array[j+1][1] < 0.025          # next value is <= -0.025: log it as a falling trend
    trend_array << [-0.2, gyro_array[j][0]]
    j += 1
  end
end

# for graphing and analysis purposes (wanted to print it all as a csv in two columns)
File.open('02.3test.gyro_trends.text', 'w') { |file| trend_array.each { |x, y| file.puts(x, y) } }
File.open('02.3test.gyro_trends_count.text', 'w') { |file| trend_array.each { |x, y| file.puts(y) } }
I know it's something really easy, but for some reason I'm missing it. It's something to do with concatenation: I found that if I try to concatenate a \n in my last line of code, it doesn't come out in the file. It outputs in my console the way I want it, but not when I write it to a file.
Thanks for taking the time to read this all.
puts writes each of its arguments on its own line, which is why your pairs came out split apart; join each row into a single string first:
File.open('02.3test.gyro_trends.text', 'w') { |file| trend_array.each { |a| file.puts(a.join(",")) } }
Alternatively, using the CSV class:
def write_to_csv(row)
  if csv_exists?
    CSV.open(@csv_name, 'a+') { |csv| csv << row }
  else
    # create and add headers if the file doesn't exist already
    CSV.open(@csv_name, 'wb') do |csv|
      csv << CSV_HEADER
      csv << row
    end
  end
end

def csv_exists?
  @exists ||= File.file?(@csv_name)
end
Call write_to_csv with an array [col_1, col_2, col_3]
Thank you both @cameck & @tkupari, both answers were what I was looking for. Went with Cameck's answer in the end, because it "cut out" cutting and pasting text => xml. Here's what I did to get an array of arrays into their proper places.
require 'csv'

CSV_HEADER = [
  "Apples",
  "Oranges",
  "Pears"
]

@csv_name = "Test_file.csv"

def write_to_csv(row)
  if csv_exists?
    CSV.open(@csv_name, 'a+') { |csv| csv << row }
  else
    # create and add headers if the file doesn't exist already
    CSV.open(@csv_name, 'wb') do |csv|
      csv << CSV_HEADER
      csv << row
    end
  end
end

def csv_exists?
  @exists ||= File.file?(@csv_name)
end

array = [ [1, 2, 3], ['a', 'b', 'c'], ['dog', 'cat', 'poop'] ]
array.each { |row| write_to_csv(row) }
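If the whole array is available up front, the row-at-a-time append can also be collapsed into a single CSV.open call; a minimal sketch of that alternative:

require 'csv'

CSV.open("Test_file.csv", "w") do |csv|
  csv << ["Apples", "Oranges", "Pears"]
  array.each { |row| csv << row }
end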

Ruby colorize a section of text

I am writing a command-line script in Ruby and I am trying to color a section of lines. Currently I am using the 'colorize' gem, but from the documentation it only lets you color one line of text at a time:
puts "test".colorize(:green)
puts "test".colorize(:green)
puts "test".colorize(:green)
But that seems a bit redundant to me; I would like to color all the lines of text but only call 'colorize(:green)' once, not three times.
How can this be done in Ruby?
Define a method for this:
def putsg(text)
  puts text.colorize(:green)
end
And then call that method:
putsg "test"
putsg "test"
putsg "test"
puts ["test", "test", "test"].join($/).colorize(:green)
or
puts ["test", "test", "test"].map{|s| s.colorize(:green)}
Regarding your statement "...but from the documentation it only lets you color one line of text at a time"
You can colorize different parts of the same line with different colors.
s = "Hello"
ss = "world"
puts "#{s.red} #{"there".white} #{s.blue}"
You can also accomplish your aim like this:
s = "test"
puts "#{s}\n#{s}\n#{s}".green
Or:
s1 = "check"
s2 = "this"
s3 = "out"
puts "#{s1}\n#{s2}\n#{s3}".green
You can also define a method like this:
The input is the color you want, as a string, and the text you want output. This will work in the command prompt.
def colorized_text(color, text)
  # Find the color codes here: https://en.wikipedia.org/wiki/ANSI_escape_code#8-bit
  color_code_hash = {
    'red' => 196,
    'green' => 40,
    'yellow' => 226,
    'blue' => 27,
  }
  puts "\e[38;5;#{color_code_hash[color]}m#{text}\e[0m"
end
I found this to be much better than using the colorize gem, since that only gives you 8 colors. You can find all the colors on the ANSI escape code Wikipedia page, under the 8-bit section.
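For example, calling the method defined above:

colorized_text('green', 'All tests passed')
colorized_text('red', 'Something went wrong')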

How do I read a gzip file line by line?

I have a gzip file and currently I read it like this:
infile = open("file.log.gz")
gz = Zlib::GzipReader.new(infile)
output = gz.read
puts result
I think this converts the file to a string, but I would like to read it line by line.
What I want to accomplish: the file contains some warning messages mixed in with garbage; I want to grep those warning messages and then write them to another file. But some warning messages are repeated, so I have to make sure I only grab each one once. Hence reading line by line would help me.
You should be able to simply loop over the gzip reader like you do with regular streams (according to the docs):
infile = open("file.log.gz")
gz = Zlib::GzipReader.new(infile)
gz.each_line do |line|
  puts line
end
Try this:
infile = open("file.log.gz")
gz = Zlib::GzipReader.new(infile)
while output = gz.gets
  puts output
end
Other answers show how to read the file line by line, but not how to capture the errors only once. Building on @Tigraine's answer:
require 'set'

infile = open("file.log.gz")
gz = Zlib::GzipReader.new(infile)

errors = Set.new
# or ...
# errors = [].to_set

gz.each_line do |line|
  errors << line if line[/^Error:/]
  # or ...
  # errors << line if line['Error:']
end

puts errors
Set acts like Array, but is built on Hash, so it's like a Hash where we only care about the keys: only unique values are stored. If you try to add duplicates they are thrown away, leaving you with only the unique values. You could use an Array and call uniq on it afterwards, but a Set will manage it for you up-front.
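For comparison, a sketch of the Array-plus-uniq variant just mentioned:

errors = []
gz.each_line do |line|
  errors << line if line[/^Error:/]
end
puts errors.uniq

And here is Set's deduplication in action in IRB: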
>> require 'set'
=> true
>> errors = Set.new
=> #<Set: {}>
>> errors << 'a'
=> #<Set: {"a"}>
>> errors << 'b'
=> #<Set: {"a", "b"}>
>> errors << 'a'
=> #<Set: {"a", "b"}>
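To finish the original task (collecting each unique warning once and writing it to another file), a minimal sketch building on the Set approach; the "Warning:" prefix is an assumption about the log format:

require 'set'
require 'zlib'

warnings = Set.new
Zlib::GzipReader.open("file.log.gz") do |gz|
  gz.each_line do |line|
    warnings << line if line.start_with?("Warning:")
  end
end

File.write("warnings.log", warnings.to_a.join)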
