What's wrong with my code? Is FileNameArray being reused?
f.rb:17: warning: already initialized constant FileNameArray
number = 0
while number < 99
number = number + 1
if number <= 9
numbers = "000" + number.to_s
elsif
numbers = "00" + number.to_s
end
files = Dir.glob("/home/product/" + numbers + "/*/*.txt")
files.each do |file_name|
File.open(file_name,"r:utf-8").each do | txt |
if txt =~ /http:\/\//
if txt =~ /static.abc.com/ or txt =~ /static0[1-9].abc.com/
elsif
$find = txt
FileNameArray = file_name.split('/')
f = File.open("error.txt", 'a+')
f.puts FileNameArray[8], txt , "\n"
f.close
end
end
end
end
end
You might be a Ruby beginner, so I rewrote the same code in a more idiomatic Ruby way:
(1..99).each do |number|
Dir.glob("/home/product/" + ("%04d" % number) + "/*/*.txt").each do |file_name|
File.open(file_name,"r:utf-8").each do | txt |
next unless txt =~ /http:\/\//
next if txt =~ /static.abc.com/ || txt =~ /static0[1-9].abc.com/
$find = txt
file_name_array = file_name.split('/')
f = File.open("error.txt", 'a+')
f.puts file_name_array[8], txt , "\n"
f.close
end
end
end
Points to note:
In Ruby, a variable prefixed with the $ symbol is a global variable. So use $find only if you really need a global.
In Ruby, a name that starts with a capital letter is a constant, and you are normally not supposed to change a constant's value. Reassigning FileNameArray on every loop iteration is what triggers the "already initialized constant" warning in your program.
(1..99) is a literal that creates an instance of the Range class, which yields the values from 1 to 99.
In Ruby, the case of a variable name matters. Local variables must start with a lowercase character; constants start with an uppercase one.
So, please try renaming FileNameArray to fileNameArray (or file_name_array, following Ruby's snake_case convention).
Also, glob accepts more advanced patterns that can save you one loop and a dozen lines of code. In your case the expression should look something like this (note that it also matches a 0000 directory, if one exists):
Dir.glob("/home/product/00[0-9][0-9]/*/*.txt")
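To check which directory names such a pattern covers without touching the disk, File.fnmatch can be run against the same pattern. This is only a sketch: the sample paths and the covered? helper are made up for illustration, and File::FNM_PATHNAME is passed so that * does not match across a / separator.

```ruby
# Sketch: test the glob pattern against made-up paths (no files are read).
PATTERN = "/home/product/00[0-9][0-9]/*/*.txt"

# Hypothetical helper, not part of the original code.
def covered?(path)
  # FNM_PATHNAME keeps "*" from matching across "/" separators.
  File.fnmatch(PATTERN, path, File::FNM_PATHNAME)
end

puts covered?("/home/product/0042/a/notes.txt")  # directory inside 0000..0099
puts covered?("/home/product/0100/a/notes.txt")  # directory outside that range
puts covered?("/home/product/0000/a/notes.txt")  # 0000 is covered as well
```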
I have a group of large text files with a ton of information in them with inconsistent formatting. I don't really care all that much about most of the info, but I'm trying to extract IDs that are included in the file. I've drafted a fairly simple script to do this (IDs are 3 digits - 7 digits).
puts("What's the name of the file you'd like to check? (don't include .txt)")
file_to_check = gets.chomp
file_to_write = file_to_check + "IDs" + ".txt"
file_to_check = file_to_check + ".txt"
output_text = ""
count_of_lines = 0
File.open(file_to_check, "r").each_line do |line|
count_of_lines += 1
if /.*\d{3}-\d{7}.*/ =~ line
temp_case = line.match(/\d{3}-\d{7}/).to_s
temp_case = temp_case + "\n"
output_text = output_text + temp_case
else
# puts("this failed")
end
end
File.open(file_to_write, "w") do |file|
file.puts(output_text)
file.puts(count_of_lines)
end
One of the files includes characters that VIM shows as ^Z, which seem to be killing the script before it actually gets to the end of the file.
Is there anything I can do to have Ruby ignore these characters and keep moving through the file?
Per Mircea's comment, the answer is here. I used "rt" based on one of the comments on the selected answer.
I am looking to extract all Methionine residues to the end from a sequence.
In the below sequence:
MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG
Original nucleotide sequence:
atgtttgaaatcgaagaacatatgaaggattcacaggtggaatacataattggccttcataatatcccattattgaatgcaactatttcagtgaagtgcacaggatttcaaagaactatgaatatgcaaggttgtgctaataaatttatgcaaagacattatgagaatcccctgacgggg
I want to extract from the sequence any M residue to the end, and obtain the following:
- MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG
- MKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG
- MNMQGCANKFMQRHYENPLTG
- MQGCANKFMQRHYENPLTG
- MQRHYENPLTG
With the data I am working with there are cases where there are a lot more "M" residues in the sequence.
The script I currently have is below. This script translates the genomic data first and then works with the amino acid sequences. This does the first two extractions but nothing further.
I have tried to repeat the same scan method after the second scan (See the commented part in the script below) but this just gives me an error:
private method scan called for #<Array:0x7f80884c84b0> No Method Error
I understand I need to make a loop of some kind and have tried, but all in vain. I have also tried matching but I haven't been able to do so - I think that you cannot match overlapping characters with a single match method but then again I'm only a beginner...
So here is the script I'm using:
#!/usr/bin/env ruby
require "bio"
def extract_open_reading_frames(input)
file_output = File.new("./output.aa", "w")
input.each_entry do |entry|
i = 1
entry.naseq.translate(1).scan(/M\w*/i) do |orf1|
file_output.puts ">#{entry.definition.to_s} 5\'3\' frame 1:#{i}\n#{orf1}"
i = i + 1
orf1.scan(/.(M\w*)/i) do |orf2|
file_output.puts ">#{entry.definition.to_s} 5\'3\' frame 1:#{i}\n#{orf2}"
i = i + 1
# orf2.scan(/.(M\w*)/i) do |orf3|
# file_output.puts ">#{entry.definition.to_s} 5\'3\' frame 1:#{i}\n#{orf3}"
# i = i + 1
# end
end
end
end
file_output.close
end
biofastafile = Bio::FlatFile.new(Bio::FastaFormat, ARGF)
extract_open_reading_frames(biofastafile)
The script has to be in Ruby since this is part of a much longer script that is in Ruby.
You can do:
str = "MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG"
str.scan(/(?=(M.*))./).flatten
#=> ["MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG", "MKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG", "MNMQGCANKFMQRHYENPLTG", "MQGCANKFMQRHYENPLTG", "MQRHYENPLTG"]
This works by capturing lookaheads that start with M and advancing one character at a time.
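The same lookahead idea can be wrapped in a small method and tried on short strings, where the overlapping results are easy to verify by eye. This is a sketch only; suffixes_from and its residue parameter are names invented for the example.

```ruby
# Hypothetical helper: every suffix of seq that begins with residue.
def suffixes_from(seq, residue = "M")
  # (?=(X.*)) captures the rest of the string without consuming it;
  # the trailing "." then consumes one character so scan moves on by one.
  pattern = /(?=(#{Regexp.escape(residue)}.*))./
  seq.scan(pattern).flatten
end

p suffixes_from("MMMM")
p suffixes_from("AMBMC")
```

Because the lookahead itself consumes nothing, each M in the string gets its own chance to start a match, which is what allows the results to overlap.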
str = "MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG"
pos = 0
while pos < str.size
if md = str.match(/M.*/, pos)
puts md[0]
pos = md.offset(0)[0] + 1
else
break
end
end
--output:--
MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG
MKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG
MNMQGCANKFMQRHYENPLTG
MQGCANKFMQRHYENPLTG
MQRHYENPLTG
md -- stands for the MatchData object.
match() -- returns nil if there is no match, the second argument is the start position of the search.
md[0] -- is the whole match (md[1] would be the first parenthesized group, etc.).
md.offset(n) -- returns an array containing the beginning and ending position in the string of md[n].
Running the program on the string "MMMM" produces the output:
MMMM
MMM
MM
M
I have also tried matching but I haven't been able to do so - I think
that you cannot match overlapping characters with a single match method but
then again I'm only a beginner...
Yes, that's true. String#scan will not find overlapping matches. After scan finds a match, the search continues from the end of the match. Perl has some ways to make regexes back-up, I don't know whether Ruby has those.
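The non-overlapping behaviour is easy to see on a short string. As an alternative that avoids regexes altogether, String#index accepts a start offset, so the search can be restarted one character after each hit. A minimal sketch, with suffixes_by_index being a made-up name:

```ruby
# A plain scan consumes the whole match, so only one result comes back.
p "MMMM".scan(/M.*/)

# Hypothetical helper: restart the search one character after each hit.
def suffixes_by_index(str, residue = "M")
  results = []
  pos = 0
  while (found = str.index(residue, pos))
    results << str[found..-1]   # everything from this residue to the end
    pos = found + 1             # step past it so the next search overlaps
  end
  results
end

p suffixes_by_index("MMMM")
```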
Edit:
For Ruby 1.8.7:
str = "MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG"
pos = 0
while true
str = str[pos..-1]
if md = str.match(/M.*/)
puts md[0]
pos = md.offset(0)[0] + 1
else
break
end
end
I'm trying to execute this program from the command line, and I'm not able to use gets.chomp, instead, it returns the key value.
I am entering: ruby name_of_file.rb name_of_file.txt
def caesar_cipher(key)
s = gets.chomp
encoded = ""
s.each_byte do |l|
if ((l >= 65 && l <= 90) || (l >= 97 && l <= 122))
encoded += (l+key).chr
else
encoded += l.chr
end
end
encoded
end
File.readlines(ARGV[0]).map(&:to_i).each {|key| puts caesar_cipher(key)}
I know the program does not execute the caesar cipher completely, I am just trying to figure out how to run it from the command line without having to use pry or irb.
You want to manually enter the cipher key?
Use STDIN.gets
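The reason is that Kernel#gets reads from ARGF: when file names are present in ARGV it reads lines from those files, and it only falls back to standard input when ARGV is empty. A minimal sketch of reading from an explicit stream instead (read_message is a hypothetical helper, and StringIO stands in for the keyboard so the example can run unattended):

```ruby
require "stringio"

# Hypothetical helper: read one line from an explicit input stream
# (STDIN by default) instead of relying on Kernel#gets and ARGF.
def read_message(input = STDIN)
  line = input.gets
  line ? line.chomp : ""   # gets returns nil at end of input
end

# StringIO stands in for the keyboard here.
p read_message(StringIO.new("attack at dawn\n"))
```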
@vgoff has the answer, but here's how I'd rewrite the code to be more readable:
def caesar_cipher(key)
encoded = ""
s = STDIN.gets.chomp
s.each_char do |l|
case l
when 'A' .. 'Z', 'a' .. 'z'
encoded += (l.ord + key).chr
else
encoded += l
end
end
encoded
end
# File.readlines(ARGV[0]).map(&:to_i).each {|key| puts caesar_cipher(key)}
puts caesar_cipher(0)
puts caesar_cipher(1)
Instead of splitting characters into bytes, I'd probably use each_char to maintain the character-encoding. I'd use a case statement to let me use two ranges to define upper and lower-case characters cleanly, and use ord to get the actual ordinal value for a character, instead of the byte.
It's more readable, but might not fully satisfy your needs.
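Since the question mentions that the cipher is not complete yet, here is one possible way to add the wraparound at the end of the alphabet. It is a sketch under assumptions rather than the poster's intended design: caesar_shift is an invented name, and the text is passed in as an argument instead of being read with gets.

```ruby
# Hypothetical helper: shift ASCII letters by key, wrapping around within
# A-Z and a-z; every other character is passed through unchanged.
def caesar_shift(text, key)
  text.each_char.map do |ch|
    case ch
    when "A".."Z" then (((ch.ord - "A".ord + key) % 26) + "A".ord).chr
    when "a".."z" then (((ch.ord - "a".ord + key) % 26) + "a".ord).chr
    else ch
    end
  end.join
end

puts caesar_shift("abc xyz", 3)
puts caesar_shift("Hello, World!", 13)
```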
I want to split a txt file into multiple files where each file contains no more than 5Mb. I know there are tools for this, but I need this for a project and want to do it in Ruby. Also, I prefer to do this with File.open in block context if possible, but I fail miserably :o(
#!/usr/bin/env ruby
require 'pp'
MAX_BYTES = 5_000_000
file_num = 0
bytes = 0
File.open("test.txt", 'r') do |data_in|
File.open("#{file_num}.txt", 'w') do |data_out|
data_in.each_line do |line|
data_out.puts line
bytes += line.length
if bytes > MAX_BYTES
bytes = 0
file_num += 1
# next file
end
end
end
end
This works, but I don't think it is elegant. Also, I still wonder if it can be done with File.open in block context only.
#!/usr/bin/env ruby
require 'pp'
MAX_BYTES = 5_000_000
file_num = 0
bytes = 0
File.open("test.txt", 'r') do |data_in|
data_out = File.open("#{file_num}.txt", 'w')
data_in.each_line do |line|
data_out = File.open("#{file_num}.txt", 'w') unless data_out.respond_to? :write
data_out.puts line
bytes += line.length
if bytes > MAX_BYTES
bytes = 0
file_num += 1
data_out.close
end
end
data_out.close if data_out.respond_to? :close
end
Cheers,
Martin
[Updated] Wrote a short version without any helper variables and put everything in a method:
def chunker f_in, out_pref, chunksize = 1_073_741_824
File.open(f_in,"r") do |fh_in|
until fh_in.eof?
File.open("#{out_pref}_#{"%05d"%(fh_in.pos/chunksize)}.txt","w") do |fh_out|
fh_out << fh_in.read(chunksize)
end
end
end
end
chunker "inputfile.txt", "output_prefix" (, desired_chunk_size)
Instead of looping over lines, you can use .read(length) and loop only on the EOF check, relying on the file cursor.
This guarantees that the chunk files are never bigger than your desired chunk size.
On the other hand, it pays no attention to line breaks (\n), so a line may be split across two files!
The numbers for the chunk files are generated by integer division of the current file cursor position by chunksize, formatted with "%05d", which yields 5-digit numbers with leading zeros (starting at 00000).
This is only possible because .read(chunksize) is used. In the second example below, it cannot be used!
Update: Splitting with line break recognition
If you really need complete lines with \n, then use this modified code snippet:
def chunker f_in, out_pref, chunksize = 1_073_741_824
outfilenum = 1
File.open(f_in,"r") do |fh_in|
until fh_in.eof?
File.open("#{out_pref}_#{outfilenum}.txt","w") do |fh_out|
loop do
line = fh_in.readline
fh_out << line
break if fh_out.size > (chunksize-line.length) || fh_in.eof?
end
end
outfilenum += 1
end
end
end
I had to introduce a helper variable, line, because I want to keep the chunk file size below the chunksize limit. Without this extended check you will also get files above the limit, because the break condition is only evaluated after the line has already been written. (Working with .ungetc or other complex calculations would make the code less readable and no shorter than this example.)
Unfortunately you have to have a second EOF check, because the last chunk iteration will mostly result in a smaller chunk.
Also two helper variables are needed: the line is described above, the outfilenum is needed, because the resulting file sizes mostly do not match the exact chunksize.
For files of any size, split will be faster than scratch-built Ruby code, even taking the cost of starting a separate executable into account. It's also code that you don't have to write, debug or maintain:
system("split -C 1M -d test.txt ''")
The options are:
-C 1M Put lines totalling no more than 1M in each chunk
-d Use decimal suffixes in the output filenames
test.txt The name of the input file
'' Use a blank output file prefix
Unless you're on Windows, this is the way to go.
Instead of opening your outfile inside the infile block, open the file and assign it to variable. When you hit the filesize limit, close the file and open a new one.
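That suggestion could look roughly like the following. It is a sketch under assumptions: split_by_lines, the part_N.txt naming, and the tiny 10-byte limit are all invented for the example, and sizes are counted with bytesize rather than length so multi-byte characters are measured in bytes.

```ruby
require "tmpdir"

# Hypothetical helper: copy src line by line into "#{prefix}_N.txt" files,
# starting a new file before a line would push the current one past max_bytes.
# A single line longer than max_bytes still gets a file of its own.
def split_by_lines(src, prefix, max_bytes)
  paths = []
  out = nil
  written = 0
  File.foreach(src) do |line|
    if out.nil? || (written > 0 && written + line.bytesize > max_bytes)
      out&.close
      paths << "#{prefix}_#{paths.size + 1}.txt"
      out = File.open(paths.last, "w")
      written = 0
    end
    out.write(line)
    written += line.bytesize
  end
  paths
ensure
  out&.close   # make sure the last output file is closed
end

# Demo on a throwaway directory with a deliberately tiny limit.
Dir.mktmpdir do |dir|
  src = File.join(dir, "test.txt")
  File.write(src, "aaaa\nbbbb\ncccc\n")
  parts = split_by_lines(src, File.join(dir, "part"), 10)
  parts.each { |path| puts "#{File.basename(path)}: #{File.size(path)} bytes" }
end
```

Checking the size before writing, rather than after, is what keeps each file at or under the limit whenever the individual lines fit.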
This code actually works; it's simple, and it buffers the lines in an array, which makes it faster:
#!/usr/bin/env ruby
data = Array.new()
MAX_BYTES = 3500
MAX_LINES = 32
lineNum = 0
file_num = 0
bytes = 0
filename = 'W:/IN/tangoZ.txt_100.TXT'
r = File.exist?(filename)
puts 'File exists =' + r.to_s + ' ' + filename
file=File.open(filename,"r")
line_count = file.readlines.size
file_size = File.size(filename).to_f / 1024000
puts 'Total lines=' + line_count.to_s + ' size=' + file_size.to_s + ' Mb'
puts ' '
file = File.open(filename,"r")
#puts '1 File open read ' + filename
file.each{|line|
bytes += line.length
lineNum += 1
data << line
if bytes > MAX_BYTES then
# if lineNum > MAX_LINES then
bytes = 0
file_num += 1
#puts '_2 File open write ' + file_num.to_s + ' lines ' + lineNum.to_s
File.open("#{file_num}.txt", 'w') {|f| f.write data.join}
data.clear
lineNum = 0
end
}
## write leftovers
file_num += 1
#puts '__3 File open write FINAL' + file_num.to_s + ' lines ' + lineNum.to_s
File.open("#{file_num}.txt", 'w') {|f| f.write data.join}
I'm looking for a much more idiomatic way to do the following little ruby script.
File.open("channels.xml").each do |line|
if line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
puts line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
end
end
Thanks in advance for any suggestions.
The original:
File.open("channels.xml").each do |line|
if line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
puts line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
end
end
can be changed into this:
m = nil
open("channels.xml").each do |line|
puts m if m = line.match(%r|(mms://{1}[\w\./-]+)|)
end
File.open can be changed to just open.
if XYZ
puts XYZ
end
can be changed to puts x if x = XYZ as long as x has occurred at some place in the current scope before the if statement.
The Regexp '(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)' can be refactored a little bit. Using the %r notation, you can create regular expressions without the need for so many backslashes; the delimiters that follow %r can be a matching pair such as ( and ) or, as in the example above, | and |.
This character class [a-zA-Z\.\d\/\w-] (read: A to Z, case insensitive, the period character, 0 to 9, a forward slash, any word character, or a dash) is a little redundant. \w denotes "word characters", i.e. A-Za-z0-9 and underscore. Since you specify \w as a positive match, A-Za-z and \d are redundant.
Using those 2 cleanups, the Regexp can be changed into this: %r|(mms://{1}[\w\./-]+)|
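A quick way to convince yourself that the cleanup does not change what is matched is to run both patterns against the same line. The sample line below is made up for illustration; only the two patterns come from the text above.

```ruby
# The original pattern (given as a string) and the simplified one.
original   = Regexp.new('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
simplified = %r|(mms://{1}[\w\./-]+)|

# Made-up sample line; the host name is purely illustrative.
sample = "<stream>mms://media.example.com/live/channel_1</stream>"

p sample.match(original)[0]
p sample.match(simplified)[0]
```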
If you'd like to avoid the weird m = nil scoping sorcery, this will also work, but is less idiomatic:
open("channels.xml").each do |line|
m = line.match(%r|(mms://{1}[\w\./-]+)|) and puts m
end
or the longer, but more readable version:
open("channels.xml").each do |line|
if m = line.match(%r|(mms://{1}[\w\./-]+)|)
puts m
end
end
One very easy to read approach is just to store the result of the match, then only print if there's a match:
File.open("channels.xml").each do |line|
m = line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
puts m if m
end
If you want to start getting clever (and have less-readable code), use $&, the global variable that holds the text of the last successful match:
File.open("channels.xml").each do |line|
puts $& if line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
end
Personally, I would probably just use the POSIX grep command. But there is Enumerable#grep in Ruby, too:
puts File.readlines('channels.xml').grep(%r|mms://{1}[\w\./-]+|)
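One difference worth keeping in mind: Enumerable#grep returns the whole matching lines, whereas the earlier versions print only the matched text. A small sketch on an in-memory array (the sample lines are invented) shows both, using scan to pull out just the URLs:

```ruby
pattern = %r|mms://{1}[\w\./-]+|

# Made-up lines standing in for the contents of channels.xml.
lines = [
  "<a>mms://media.example.com/one</a>\n",
  "<b>no stream here</b>\n",
  "<c>mms://media.example.com/two</c>\n"
]

matching_lines = lines.grep(pattern)                   # whole lines that match
urls = lines.flat_map { |line| line.scan(pattern) }    # only the matched text

puts matching_lines
puts urls
```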
Alternatively, you could use some of Ruby's file and line processing magic that it inherited from Perl. If you pass the -p flag to the Ruby interpreter, it will assume that the script you pass in is wrapped with while gets; ...; end and at the end of each loop it will print the current line. You can then use the $_ special variable to access the current line and use the next keyword to skip iteration of the loop if you don't want the line printed:
ruby -pe 'next unless $_ =~ %r|mms://{1}[\w\./-]+|' channels.xml
Basically,
ruby -pe 'next unless $_ =~ /re/' file
is equivalent to
grep -E re file