Opening filehandles for use with TabularMSA in skbio - skbio

Hey there skbio team.
So I need to allow either DNA or RNA MSAs. When I do the following, if I leave out the alignment_fh.close() skbio reads the 'non header' line in the except block making me think I need to close the file first so it will start at the beginning, but if I add alignment_fh.close() I cannot get it to read the file. I've tried opening it via a variety of methods, but I believe TabularMSA.read() should allow files OR file handles. Thoughts? Thank you!
try:
aln = skbio.TabularMSA.read(alignment_fh, constructor=skbio.RNA)
except:
alignment_fh.close()
aln = skbio.TabularMSA.read(alignment_fh, constructor=skbio.DNA)

I've tried opening it via a variety of methods, but I believe TabularMSA.read() should allow files OR file handles.
You're correct: scikit-bio generally supports reading and writing files using open file handles or file paths.
The issue you're running into is that your first TabularMSA.read() call reads the entire contents of the open file handle, so that when the second TabularMSA.read() call is hit within the except block, the file pointer is already at the end of the open file handle -- this is why you're getting an error message hinting that the file is empty.
This behavior is intentional; when scikit-bio is given an open file handle, it will read from or write to the file but won't attempt to manage the handle's file pointer (that type of management is up to the caller of the code).
Now, when asking scikit-bio to read a file path (i.e. a string containing the path to a file on disk or accessible at some URI), scikit-bio will handle opening and closing the file handle for you, so that's often the easier way to go.
You can use file paths or file handles to accomplish your goal. In the following examples, suppose aln_filepath is a str pointing to your alignment file on disk (e.g. "/path/to/my/alignment.fasta").
With file paths: You can simply pass the file path to both TabularMSA.read() calls; no open() or close() calls are necessary on your part.
try:
aln = skbio.TabularMSA.read(aln_filepath, constructor=skbio.RNA)
except ValueError:
aln = skbio.TabularMSA.read(aln_filepath, constructor=skbio.DNA)
With file handles: You'll need to open a file handle and reset the file pointer within your except block before reading a second time.
with open(aln_filepath, 'r') as aln_filehandle:
try:
aln = skbio.TabularMSA.read(aln_filehandle, constructor=skbio.RNA)
except ValueError:
aln_filehandle.seek(0) # reset file pointer to beginning of file
aln = skbio.TabularMSA.read(aln_filehandle, constructor=skbio.DNA)
Note: In both examples, I've used except ValueError instead of a "catch-all" except statement. I recommend catching specific error types (e.g. ValueError) instead of any exception because the code could be failing in different ways than what you're expecting. For example, with a "catch-all" except statement, users won't be able to interrupt your program with Ctrl-C because KeyboardInterrupt will be caught and ignored.

Related

How to read a large text file line-by-line and append this stream to a file line-by-line in Ruby?

Let's say I want to combine several massive files into one and then uniq! the one (THAT alone might take a hot second)
It's my understanding that File.readlines() loads ALL the lines into memory. Is there a way to read it line by line, sort of like how node.js pipe() system works?
One of the great things about Ruby is that you can do file IO in a block:
File.open("test.txt", "r").each_line do |row|
puts row
end # file closed here
so things get cleaned up automatically. Maybe it doesn't matter on a little script but it's always nice to know you can get it for free.
you aren't operating on the entire file contents at once, and you don't need to store the entirety of each line either if you use readline.
file = File.open("sample.txt", 'r')
while !file.eof?
line = file.readline
puts line
end
Large files are best read by streaming methods like each_line as shown in the other answer or with foreach which opens the file and reads line by line. So if the process doesn't request to have the whole file in memory you should use the streaming methods. While using streaming the required memory won't increase even if the file size increases opposing to non-streaming methods like readlines.
File.foreach("name.txt") { |line| puts line }
uniq! is defined on Array, so you'll have to read the files into an Array anyway. You cannot process the file line-by-line because you don't want to process a file, you want to process an Array, and an Array is a strict in-memory data structure.

Psychopy: Call a conditions file from the outer loop's conditions file

I am a bit new to PsychoPy and Python coding, so please excuse my question if it is basic. In my task, I have a number of files that dictate the position of stimuli. My outer loop has a variable, ExcelList, which has the previously mentioned file names listed under it. The inner loop, which dictates each trial, attempts to call these files at random by entering $ExcelList into the space asking for a conditions file. As I understand it, the command for $ExcelList should access the conditions file in the outer loop and pull one of the files containing stimuli positions for that trial. However, I am instead presented with the following error:
File "/Users/bencline/Desktop/Psychexp/NegPriming2080_lastrun.py",
line 247, in module>
trialList=data.importConditions(ExcelList), File "/Applications/PsychoPy2.app/Contents/Resources/lib/python2.7/psychopy/data.py",
line 1366, in importConditions
raise ImportError('Conditions file not found: %s' %os.path.abspath(fileName)) ImportError: Conditions file not found:
/Users/bencline/Desktop/Psychexp/Trials_20_95.xlsx
It would appear that the inner loop is not finding the condition in the outer loop (or not reading the outer loop altogether). If I try instead writing $eval(ExcelList) I am presented with the following error:
File "/Users/bencline/Desktop/Psychexp/NegPriming2080_lastrun.py",
line 247, in
trialList=data.importConditions(eval(ExcelList)), File "", line 1, in NameError: name 'Trials_20_95' is not
defined
This seems more indicative of the underlying problem, but I'm still not sure how to proceed from here. Do you have any suggestions for why this is happening and how I could potentially fix it?
Thank you,
-Ben
Your strategy is right. It reads your ExcelList in the outer loop but fails to find the filenames in ExcelList on your filesystem. In particular, in your first error message, it fails to find /Users/bencline/Desktop/Psychexp/Trials_20_95.xlsx. So check whether it actually exists. I strongly suspect that it does not because of one or both of these:
It is in a different folder, e.g. a subfolder. The solution is to write out the path (relative or absolute) in the ExcelList file.
It is spelled differently, e.g. with a small T or capital postfix (XLSX), a spacebar or the like. The solution is of course to make the filenames in the ExcelList and the actual filenames match.

Deleting contents of file after a specific line in ruby

Probably a simple question, but I need to delete the contents of a file after a specific line number? So I wan't to keep the first e.g 5 lines and delete the rest of the contents of a file. I have been searching for a while and can't find a way to do this, I am an iOS developer so Ruby is not a language I am very familiar with.
That is called truncate. The truncate method needs the byte position after which everything gets cut off - and the File.pos method delivers just that:
File.open("test.csv", "r+") do |f|
f.each_line.take(5)
f.truncate( f.pos )
end
The "r+" mode from File.open is read and write, without truncating existing files to zero size, like "w+" would.
The block form of File.open ensures that the file is closed when the block ends.
I'm not aware of any methods to delete from a file so my first thought was to read the file and then write back to it. Something like this:
path = '/path/to/thefile'
start_line = 0
end_line = 4
File.write(path, File.readlines(path)[start_line..end_line].join)
File#readlines reads the file and returns an array of strings, where each element is one line of the file. You can then use the subscript operator with a range for the lines you want
This isn't going to be very memory efficient for large files, so you may want to optimise if that's something you'll be doing.

Opening a file without creating a new one

I have a ruby function that accesses files in my unix filesystem.
I have 2050 files each representing an hashed value in a dedicated directory.
The function reads a file containing email addresses and performs a hashing function, finds out the file id and prints.
Usually I do those things in Java but I wanna start doing it in Ruby. My problem is, that within my function, I try to open the correct file for reading, but I see that open works the same as new when no code block is provided. From the IO class, method ::openWith no associated block, IO.open is a synonym for ::new.
What I simply need to do is, open the file, set the reader pointer to the first available line, write and flush.
For simplicity I will put every code statement in one line. The file should be opened with its current status (see the HERE comment).
def dispatch
while (id=IDS_FILE.gets)
bucket="#{BUCKETS}" << (PERFORM HERE THE HASH CALCULATION) ".txt"
#HERE
bucket_file=File.open("#{bucket}","w")
bucket_file.write(id)
bucket_file.close
end
log "Writing #{id.chomp!} to #{bucket_file.to_path}"
end
end
To get "the file to be opened with its current status" you'll need the "append mode", as documented here:
"a" Write-only, starts at end of file if file exists,
otherwise creates a new file for writing.
So your code should read like so:
File.open(bucket, "a") do |f|
f.write id
end

How to call a Win32API function from Ruby?

I want to lock a file as MS Office applications do from a Ruby program so that deletions won't be allowed because "the file is opened on another program".
The Ruby standard library does not seem to be able to do that - I just tried flock() - so I'm trying to invoke the LockFileEx function.
fd = File.open("locked.file", File::RDWR|File::CREAT, 0644)
fd.write "this file to be locked"
import_array = %w(p i i i i i)
wapi = Win32API.new('kernel32', 'LockFileEx', import_array, "i")
wapi.call(fd, 1, 0, 0, 0, 0)
The wapi.call fails with a TypeError exception "Can't convert File into String".
What should I use as first item in import_array to represent the file handle ?
How do I pack the file descriptor into a String ? Where do I get the file descriptor structure ?
I'm using Ruby 1.9.3.
The Ruby file locking mechanism is co-operative and relies on all parties knowing the convention of the Ruby lock file. Microsoft Office will not co-operate.
Instead I suggest that you get the file system to enforce the lock. Simply open the file with an exclusive lock using standard Ruby file handling mechanisms.
First, you have to map the Ruby file descriptor to a C runtime file descriptor. I'm not familiar enough with Ruby source to know how to do that; it may be an identity transform.
Second, you have to map the C runtime descriptor to a Win32 file handle. For this, you need _get_osfhandle.
Third, you need to fix your call to LockFileEx so that you actually pass a valid OVERLAPPED structure; NULL will NOT work.

Resources