Why aren't local variables within nested functions associated? - bioinformatics

Could someone please explain why local variables passed to the same nested function give different results?
The following function import_alignment() includes a conditional block if trimref: which changes the input data and calls the function again on this new data.
When I wrote this function it appeared to do the expected manipulations but returned the original data.
I've just realised that I neglected to assign results returned from the second call to the associated variables.
Originally the recursive call was just import_alignment(modified *args); now it is:
sequences, numseqs, refseq, refname = import_alignment(modified *args)
Which does what it should, but I'm still not sure why the first version didn't work.
...more details...
The function imports DNA sequence data from a Fasta file (a text file with >name and sequence on alternate lines), and returns list of sequences, along with the sequence count, and the name and sequence of a gap-free reference (the first sequence in the file).
It first calls the numseqs() function which does a basic quality check on the input file.
It then opens the file, assigns the odd lines to the sequences list (lines 0, 2, 4,... are sequence names; 1, 3, 5 are sequences), extracts the reference name and removes gaps ('-') from the reference sequence.
It then tests whether trimref == True. If so, it harvests the sequence names from the file, and then:
trim_TOref() removes reference gap positions from all sequences
output_data() writes the resulting sequence alignment to a new fasta file (name_trimmed.fas)
this file is then passed back to import_alignment() to extract the modified sequences
def import_alignment(filename, path, trimref=True):
    """
    this function extracts sequences from a fasta file and does basic QC
    """
    infile = path + filename
    filesize = file_size(infile)
    print("Input path and filename : ",infile)
    print("File Size : {0}".format(filesize))
    numseqs = check_even(infile)
    try:
        with open(infile) as f:
            f.seek(0)
            sequences = pick_lines(f, odds=True)
            refseq = sequences[0].replace("-","")
            f.seek(0)
            refname = f.readline().strip('>\n')
            if trimref:
                f.seek(0)
                seqnames = pick_lines(f, False)
                sequences = trim_TOref(sequences)
                outfile = filename[0:-4] + "_trimmed"
                output_data(outfile, sequences, outpath = (path + 'output/'), filetype = '.fas', keys = seqnames)
                trimref = False
                sequences, numseqs, refseq, refname = import_alignment((outfile + '.fas'), outpath, trimref)
    except FileNotFoundError:
        print("File not found. Check the path variable and filename")
        exit()
    return sequences, numseqs, refseq, refname

Related

Ruby script which can replace a string in a binary file to a different, but same length string?

I would like to write a Ruby script (repl.rb) which can replace a string in a binary file (string is defined by a regex) to a different, but same length string.
It works like a filter, outputs to STDOUT, which can be redirected (ruby repl.rb data.bin > data2.bin), regex and replacement can be hardcoded. My approach is:
#!/usr/bin/ruby
fn = ARGV[0]
regex = /\-\-[0-9a-z]{32,32}\-\-/
replacement = "--0ca2765b4fd186d6fc7c0ce385f0e9d9--"
blk_size = 1024
File.open(fn, "rb") {|f|
  while not f.eof?
    data = f.read(blk_size)
    data.gsub!(regex, replacement)
    print data
  end
}
My problem is that the string can be positioned in the file so that it straddles a block boundary. For example, with blk_size=1024, if the first occurrence of the string begins at byte position 1000, I will not find it in the "data" variable, and the same happens in the next read cycle. Should I process the whole file twice with different block sizes to avoid this worst-case scenario, or is there another approach?
I would posit that a tool like sed might be a better choice for this. That said, here's an idea: Read block 1 and block 2 and join them into a single string, then perform the replacement on the combined string. Split them apart again and print block 1. Then read block 3 and join block 2 and 3 and perform the replacement as above. Split them again and print block 2. Repeat until the end of the file. I haven't tested it, but it ought to look something like this:
File.open(fn, "rb") do |f|
  # Prime the loop with block 1 so that it is joined with block 2 below
  this_block = f.read(blk_size).to_s.gsub(regex, replacement)
  while not f.eof?
    last_block, this_block = this_block, f.read(blk_size)
    data = "#{last_block}#{this_block}".gsub(regex, replacement)
    last_block, this_block = data.slice!(0, blk_size), data
    print last_block
  end
  print this_block
end
There's probably a nontrivial performance penalty for doing it this way, but it could be acceptable depending on your use case.
Maybe a cheeky
f.pos = f.pos - replacement.size
at the end of the while loop, just before reading the next chunk.
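A variant of the overlap idea, as a minimal sketch rather than a tested tool. The helper name replace_in_chunks and its parameters are hypothetical, and it assumes every match is exactly match_len bytes long, the replacement has the same length, and matches do not sit directly next to each other. Instead of re-joining whole blocks, it holds back only the last match_len - 1 bytes of each chunk, so a match cut by a boundary is seen whole on the next pass:

```ruby
require "stringio"

# Hypothetical helper: stream `input` to `output`, replacing fixed-length
# matches of `regex` with `replacement` (assumed to be the same length).
def replace_in_chunks(input, output, regex, replacement, match_len, blk_size = 1024)
  carry = "".b                          # bytes held back from the previous chunk
  while (chunk = input.read(blk_size))  # read returns nil at end of input
    data = carry + chunk
    data.gsub!(regex, replacement)
    # A match cut by the chunk boundary has at most match_len - 1 bytes here,
    # so hold that many back and emit everything before them.
    keep = [match_len - 1, data.bytesize].min
    output.write(data.byteslice(0, data.bytesize - keep))
    carry = data.byteslice(data.bytesize - keep, keep)
  end
  output.write(carry)
end

regex = /--[0-9a-z]{32}--/
replacement = "--0ca2765b4fd186d6fc7c0ce385f0e9d9--"
token = "--" + "a" * 32 + "--"          # starts at byte 1000, crosses byte 1024
source = StringIO.new(("x" * 1000 + token + "y" * 500).b)
result = StringIO.new
replace_in_chunks(source, result, regex, replacement, token.bytesize)
```

With a real file, the same call would take File.open(fn, "rb") as input and $stdout as output.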

Custom serialize and parse methods in Ruby

I have developed this class Directory that somewhat emulates a directory using hashes. I have difficulties figuring out how to do the serialize and parse methods. The returned string from the serialize method should look something like this:
2:README:19:string:Hello world!spec.rb:20:string:describe RBFS1:rbfs:4:0:0:
Now to explain what exactly this means. This is the master directory, and the 2 up front is the number of files. Then we have the file name README, and after that the length of the file's contents, 19, represented as a string that I get from the parse method of the other class in the module. After that comes the second file. Notice that the two files are not separated by :, which we don't need here since we know the string length. Laid out a little more clearly:
<file count><file1_data><file2_data>1:rbfs:4:0:0:, here <file1_data>, encompasses the name, length and contents part.
Now the 1:rbfs:4:0:0: means we have one sub-directory with name rbfs, 4 representing the length of its contents as a string and 0:0: representing that it's empty: no files and no sub-directories. Here is another example:
0:1:directory1:40:0:1:directory2:22:1:README:9:number:420: which is equivalent to:
.
`-- directory1
    `-- directory2
        `-- README
I have no problem with the files part, and I know how to get the number of directories and their names, but for the rest I have no idea what to do. I know that recursion is the best answer, but I have no clue what the base case of that recursion should be or how to implement it. Also, solving this will help greatly in figuring out how to do the parse method by reverse engineering it.
The code is below:
module RBFS
  class File
    ... # here I have working `serialize` and `parse` methods for `File`
  end
  class Directory
    attr_accessor :content
    def initialize
      @content = {}
    end
    def add_file(name, file)
      @content[name] = file
    end
    def add_directory(name, subdirectory = nil)
      if subdirectory
        @content[name] = subdirectory
      else
        @content[name] = RBFS::Directory.new
      end
    end
    def serialize
      ...?
    end
    def self.parse(string)
      ...?
    end
  end
end
PS: I check the kind of values in the hash with the is_a? method.
Another example for @Jordan:
2:file1:17:string:Test test?file2:10:number:4322:direc1:34:0:1:dir2:22:1:README:9:number:420:direc2::1:README2:9:number:33:0
...should be this structure (if I've formulated it right):
. ->file1,file2
`-- direc1,.....................................direc2 -> README2
`-- dir2(subdirectory of direc1) -> README
direc1 contains only a directory and no files, while direc2 contains only a file.
You can see that the master directory doesn't specify its string length while all others do.
Okay, let's work through this iteratively, starting with your example:
str = "2:README:19:string:Hello world!spec.rb:20:string:describe RBFS1:rbfs:4:0:0:"
entries = {} # No entries yet!
The very first thing we need to know is how many files there are, and we know that's the number before the first colon:
num_entries, rest = str.split(':', 2)
num_entries = Integer(num_entries)
# num_entries is now 2
# rest is now "README:19:string:Hello world!spec.rb:20:string:describe RBFS1:rbfs:4:0:0:"
(The second argument to split says "I only want 2 pieces," so it stops splitting after the first :.) We use Integer(n) instead of n.to_i because it's stricter. (to_i will convert "10xyz" to 10; Integer will raise an error, which is what we want here.)
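Both behaviours are easy to check in isolation. A small sketch (the variable names are only for illustration):

```ruby
# split with a limit of 2 stops after the first separator
head, tail = "2:README:19:string:Hello world!".split(":", 2)
# head => "2"
# tail => "README:19:string:Hello world!"

lenient = "10xyz".to_i   # => 10, the trailing junk is silently ignored

strict =
  begin
    Integer("10xyz")
  rescue ArgumentError => e
    e.class              # Integer() refuses the malformed input
  end
# strict => ArgumentError
```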
Now we know we have two files. We don't know anything else yet, but what's left of our string is this:
README:19:string:Hello world!spec.rb:20:string:describe RBFS1:rbfs:4:0:0:
The next thing we can get is the name and length of the first file.
name, len, rest = rest.split(':', 3)
len = Integer(len)
# name = "README"
# len = 19
# rest = "string:Hello world!spec.rb:20:string:describe RBFS1:rbfs:4:0:0:"
Cool, now we have the name and length of the first file, so we can get its content:
content = rest.slice!(0, len)
# content = "string:Hello world!"
# rest = "spec.rb:20:string:describe RBFS1:rbfs:4:0:0:"
entries[name] = content
# entries = { "README" => "string:Hello world!" }
We used rest.slice!, which removes len characters from the front of the string and returns them, so content is just what we want (string:Hello world!) and rest is everything that was after it. Then we added it to the entries Hash. One file down, one to go!
For the second file, we do the exact same thing:
name, len, rest = rest.split(':', 3)
len = Integer(len)
# name = "spec.rb"
# len = 20
# rest = "string:describe RBFS1:rbfs:4:0:0:"
content = rest.slice!(0, len)
# content = "string:describe RBFS"
# rest = "1:rbfs:4:0:0:"
entries[name] = content
# entries = { "README" => "string:Hello world!",
# "spec.rb" => "string:describe RBFS" }
Since we do the exact same thing twice, obviously we should do this in a loop! But before we write that, we need to get organized. So far we have two discrete steps: First, get the number of files. Second, get those files' contents. We also know we'll need to get the number of directories and the directories. We'll take a guess at how this'll look:
def parse(serialized)
  files, rest = parse_files(serialized)
  # `files` will be a Hash of file names and their contents and `rest` will be
  # the part of the string we haven't parsed yet
  directories, rest = parse_directories(rest)
  # `directories` will be a Hash of directory names and their contents
  files.merge(directories)
end
def parse_files(serialized)
  # Get the number of files from the beginning of the string
  num_entries, rest = serialized.split(':', 2)
  num_entries = Integer(num_entries)
  entries = {}
  # `rest` now starts with the first file (e.g. "README:19:...")
  num_entries.times do
    name, len, rest = rest.split(':', 3) # get the file name and length
    len = Integer(len)
    content = rest.slice!(0, len) # get the file contents from the beginning of the string
    entries[name] = content # add it to the hash
  end
  [ entries, rest ]
end
def parse_directories(serialized)
  # TBD...
end
That parse_files method is a bit long for my taste, though, so how about we split it up?
def parse_files(serialized)
  # Get the number of files from the beginning of the string
  num_entries, rest = serialized.split(':', 2)
  num_entries = Integer(num_entries)
  entries = {}
  # `rest` now starts with the first file (e.g. "README:19:...")
  num_entries.times do
    name, content, rest = parse_file(rest)
    entries[name] = content # add it to the hash
  end
  [ entries, rest ]
end
def parse_file(serialized)
  name, len, rest = serialized.split(':', 3) # get the name and length of the file
  len = Integer(len)
  content = rest.slice!(0, len) # use the length to get its contents
  [ name, content, rest ]
end
Clean!
Now, I'm going to give you a big spoiler: Since the serialization format is reasonably well-designed, we don't actually need a parse_directories method, because it would do exactly the same thing as parse_files. The only difference is that after this line:
name, content, rest = parse_file(rest)
...we want to do something different if we're parsing directories instead of files. In particular, we want to call parse(content), which will do all of this over again on the directory's contents. Since it's pulling double-duty now, we should probably change its name to something more general like parse_entries, and we also need to give it another argument to tell it when to do that recursion.
Rather than post more code here, I've posted my "finished" product over in this Gist.
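For readers without access to the Gist, here is one way the recursion described above could look. This is a sketch and not the Gist's code. It assumes files are kept as their serialized "type:data" strings, that directories become nested Hashes, and that the second argument of parse_entries (named recurse here, a hypothetical name) says whether to recurse:

```ruby
# Parse a serialized directory into a Hash of name => content.
def parse(serialized)
  files, rest = parse_entries(serialized, false)
  directories, _rest = parse_entries(rest, true)
  files.merge(directories)
end

# `recurse` is true when the entries are directories, whose content is
# itself a serialized directory and is handed back to `parse`.
def parse_entries(serialized, recurse)
  count, rest = serialized.split(":", 2)
  entries = {}
  Integer(count).times do
    name, len, rest = rest.split(":", 3)
    content = rest.slice!(0, Integer(len))
    entries[name] = recurse ? parse(content) : content
  end
  [entries, rest]
end

tree = parse("0:1:directory1:40:0:1:directory2:22:1:README:9:number:420:")
# tree => {"directory1"=>{"directory2"=>{"README"=>"number:42"}}}
```

The recursion bottoms out on a directory serialized as 0:0:, where both counts are zero and neither loop body runs.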
Now, I know that doesn't help you with the serialize part, but hopefully it'll help get you started. serialize is the easier part because there are plenty of questions and answers on SO about recursively iterating over a Hash.
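As a starting point for serialize, a minimal sketch under the same kind of assumptions: the directory is a plain Hash whose String values are already-serialized files ("type:data") and whose Hash values are sub-directories. The real Directory class would test its values with is_a? against its own File and Directory classes instead.

```ruby
# Serialize a Hash-based directory: files first, then sub-directories.
def serialize(entries)
  files, directories = entries.partition { |_name, value| value.is_a?(String) }

  out = "#{files.size}:"
  files.each do |name, content|
    out << "#{name}:#{content.length}:#{content}"
  end

  out << "#{directories.size}:"
  directories.each do |name, subdirectory|
    body = serialize(subdirectory)   # recursion; an empty Hash gives "0:0:"
    out << "#{name}:#{body.length}:#{body}"
  end
  out
end

master = {
  "README"  => "string:Hello world!",
  "spec.rb" => "string:describe RBFS",
  "rbfs"    => {}
}
serialized = serialize(master)
# serialized => "2:README:19:string:Hello world!spec.rb:20:string:describe RBFS1:rbfs:4:0:0:"
```

Note that the top-level call does not prefix its own length, which matches the observation that only the master directory omits it.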

Ruby splitting string into different files

Here I've created an algorithm that extracts an array of the Federalist papers and splits them up saving them into separate files titled "Federalist No." followed by their respective numbers. Everything works perfectly and the files are being created beautifully; however, the only problem I run into now is that it fails to create the last output.
Maybe it's because I've been staring at this for too many hours but I'm at an impasse.
I've inserted the line puts fedSections.length to see what the output is.
Using a smaller version of the compilation of the Fed papers for testing, the terminal output is 3... it creates "Federalist No. 0" a blank document to take into account empty space and "Federalist No. 1" with the first federalist paper. No "Federalist No. 2."
Any thoughts?
# Create new string to add array l to
fedString = " "
for f in 0...l.length-1
  fedString += l[f] + ''
end
# Create variables applied to new files
Federalist_No= "Federalist No."
a = "0"
b = "FEDERALIST No."
fedSections = Array.new() # New array to insert Federalist paper to
fedSections = fedString.split("FEDERALIST No.") # Split string into elements of the array at each change in Federalist paper
puts fedSections.length
# Split gives empty string, off by one
for k in 0...fedSections.length-1 # Use of loop to write each Fed paper to its own file
  new_text = File.open(Federalist_No + a + ".txt", "w") # Open said file with write capabilities
  new_text.puts(b+a) # Write the "FEDERALIST No" and the number from "a"
  new_text.puts fedSections[k] # Write contents of string (section of paper) to a file
  new_text.close()
  a = a.to_i + 1 # Increment "a" by one to accommodate consecutive papers
  a = a.to_s # Restore to string
end
The error is in your for loop:
for k in 0...fedSections.length-1
You actually want:
for k in 0..fedSections.length-1
The three-dot form ... excludes the end of the range, so the last element is never visited.
But as screenmutt said, it is more idiomatic Ruby to use an each loop:
fedSections.each do |section|
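The difference between the two range forms can be seen on a small stand-in array. This is a sketch in which sections is a made-up substitute for fedSections:

```ruby
sections = ["", "first paper", "second paper"]

exclusive = (0...sections.length - 1).to_a   # => [0, 1], the last index is skipped
inclusive = (0..sections.length - 1).to_a    # => [0, 1, 2]

# each_with_index removes the manual range and the string counter entirely
names = sections.each_with_index.map do |_section, i|
  "Federalist No.#{i}.txt"
end
# names => ["Federalist No.0.txt", "Federalist No.1.txt", "Federalist No.2.txt"]
```

The earlier loop that builds fedString uses the same 0...l.length-1 range, so it also skips the last element of l.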

Ruby: How do you search for a substring, and increment a value within it?

I am trying to change a file by finding this string:
<aspect name=\"lineNumber\"><![CDATA[{CLONEINCR}]]>
and replacing {CLONEINCR} with an incrementing number. Here's what I have so far:
file = File.open('input3400.txt' , 'rb')
contents = file.read.lines.to_a
contents.each_index do |i|contents.join["<aspect name=\"lineNumber\"><![CDATA[{CLONEINCR}]]></aspect>"] = "<aspect name=\"lineNumber\"><![CDATA[#{i}]]></aspect>" end
file.close
But this seems to go on forever - do I have an infinite loop somewhere?
Note: my text file is 533,952 lines long.
You are repeatedly concatenating all the elements of contents, making a substitution, and throwing away the result. This is happening once for each line, so no wonder it is taking a long time.
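A tiny sketch of the "throwing away the result" part, using a made-up two-line array and a shortened {X} placeholder in place of the real file contents:

```ruby
lines = ["a {X}\n", "b {X}\n"]

joined = lines.join     # a brand-new String, unrelated to `lines`
joined["{X}"] = "1"     # replaces only the first occurrence, and only in `joined`

# lines  => ["a {X}\n", "b {X}\n"]   (unchanged)
# joined => "a 1\nb {X}\n"
```

In the question the joined string is not even kept in a variable, so each modified copy is discarded as soon as the block iteration ends.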
The easiest solution would be to read the entire file into a single string and use gsub on that to modify the contents. In your example you are inserting the (zero-based) file line numbers into the CDATA. I suspect this is a mistake.
This code replaces all occurrences of <![CDATA[{CLONEINCR}]]> with <![CDATA[1]]>, <![CDATA[2]]> etc. with the number incrementing for each matching CDATA found. The modified file is sent to STDOUT. Hopefully that is what you need.
File.open('input3400.txt' , 'r') do |f|
  i = 0
  contents = f.read.gsub('<![CDATA[{CLONEINCR}]]>') { |m|
    m.sub('{CLONEINCR}', (i += 1).to_s)
  }
  puts contents
end
If what you want is to replace CLONEINCR with the line number, which is what your above code looks like it's trying to do, then this will work. Otherwise see Borodin's answer.
output = File.readlines('input3400.txt').map.with_index do |line, i|
  line.gsub "<aspect name=\"lineNumber\"><![CDATA[{CLONEINCR}]]></aspect>",
            "<aspect name=\"lineNumber\"><![CDATA[#{i}]]></aspect>"
end
File.write('input3400.txt', output.join(''))
Also, you should be aware that when you read the lines into contents, you are creating a String distinct from the file. You can't operate on the file directly. Instead you have to create a new String that contains what you want and then overwrite the original file.

Double "gsub" Variable

Is it possible to use variables in both fields of the gsub method ?
I'm trying to get this piece of code work :
$I = 0
def random_image
  $I.to_s
  random = rand(1).to_s
  logo = File.read('logo-standart.txt')
  logo_aleatoire = logo.gsub(/#{$I}/, random)
  File.open('logo-standart.txt', "w") {|file| File.puts logo_aleatoire}
  $I.to_i
  $I += 1
end
Thanks in advance !
filecontents = File.read('logo-standart.txt')
filecontents.gsub!(/\d+/){rand(100)}
File.open("logo-standart.txt","w"){|f| f << filecontents }
The magic line is the second line.
The gsub! function modifies the string in-place, unlike the gsub function, which would return a new string and leave the first string unmodified.
The single parameter that I passed to gsub! is the pattern to match. Here, the goal is to match any string of one or more digits -- this is the number that you're going to replace. There's no need to loop through all of the possible numbers running gsub on each one. You can even match numbers as high as a googol (or higher) without your program taking longer and longer to run.
The block that gsub! takes is evaluated each time the pattern matches to programmatically generate a replacement number. So each time, you get a different random number. This is different from the more usual form of gsub! that takes two parameters. There, the second parameter is evaluated once before any pattern matching occurs, and all matches are replaced by the same string.
Note that the way this is structured, you get a new random number for each match. So if the number 307 appears twice, it turns into two different random numbers.
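The difference between the two forms is easier to see with a counter than with random numbers. A small sketch, where counter is only there for illustration:

```ruby
counter = 0
fixed = "7 7 7".gsub(/\d+/, (counter += 1).to_s)
# fixed => "1 1 1"   (the argument is evaluated once, before any matching)

counter = 0
fresh = "7 7 7".gsub(/\d+/) { (counter += 1).to_s }
# fresh => "1 2 3"   (the block runs once per match)
```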
If you wanted to map 307 to the same random number each time, you could do the following:
filecontents = File.read('logo-standart.txt')
randomnumbers = Hash.new{|h,k| h[k]=rand(100)}
filecontents.gsub!(/\d+/){|match| randomnumbers[match]}
File.open("logo-standart.txt","w"){|f| f << filecontents }
Here, randomnumbers is a hash that lets you look up the numbers and find what random number they correspond to. The block passed when constructing the hash tells the hash what to do when it finds a number that it hasn't seen before: in this case, generate a new random number and remember that mapping. So gsub!'s block just asks the hash to map numbers for it, and randomnumbers takes care of generating a new random number when you encounter a new number from the original file.
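The same idea can be tried without touching a file. In this sketch the sample text and the names text, mapping and result are made up for illustration:

```ruby
text = "id 307, id 12, id 307 again"

# The default block runs only for keys the hash has not seen yet
mapping = Hash.new { |hash, key| hash[key] = rand(100) }

result = text.gsub(/\d+/) { |match| mapping[match].to_s }
first, _second, third = result.scan(/\d+/)
# first and third are equal because both came from "307";
# mapping now holds exactly two keys, "307" and "12"
```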

Resources