Writing to popen and reading back several files in Ruby - ruby

I need to run some shell commands on a number of files and sometimes I get back more than one file in response. The question is: How can I read back several files from IO.popen in Ruby?
For instance, imagine the following case:
file = grid.get(record['_id']) # fetch a file from database
IO.popen('tar -Oxmz', 'ab') {|pipe| pipe.write(file.read)} # pass to tar and extract
This necessitates that I reread all the extracted files from the filesystem. I figured out this is the speed bottleneck of my script and I wonder if I can accomplish the same task in-memroy. I tried the following:
file = grid.get(record['_id'])
IO.popen('tar -Oxmz', 'w+b') do |pipe|
pipe.write(file.read)
pipe.close_write
output = pipe.read
end
It works, but I get the whole response, here including several extracted files, in one piece (in variable output). I need the files separate from each other and possibly with their names. Is there any way to do this?
By the way, the resulting files are most of the time text, but sometimes binary. Running a pipe for each output file is not a solution, because the actual overhead of running the commands for each file outweights the benefits of doing the transformation in-memory.
P.S. The actual use case does not rely on tar only. I use software that do not have Ruby wrappers.

Related

Temp file not being deleted

I'm trying to create a temporary file in my pipeline, then use that file in another rule.
For example, I have two rules in a .smk file:
#Unzip adapter trimmed fastq file
rule unzip_fastq:
input:
'{sample}.adapterTrim.round2.fastq.gz',
output:
temp('{sample}.adapterTrim.round2.fastq')
conda:
'../envs/rep_element.yaml'
shell:
'gunzip -c {input[0]} > {output[0]}'
#Run bowtie2 to align to rep elements and parse output
rule parse_bowtie2_output_realtime:
input:
'{sample}.adapterTrim.round2.fastq'
output:
'rep_element_pipeline/{sample}.fastq.gz.mapped_vs_' + config["ref"]["bt2_index"] + '.sam'
params:
bt2=config["ref"]["bt2_index_path"], eid=config["ref"]["enst2id"]
conda:
'../envs/rep_element.yaml'
shell:
'perl ../scripts/parse_bowtie2_output_realtime_includemultifamily.pl '
'{input[0]} {params.bt2} {output[0]} {params.eid}'
{sample}.adapterTrim.round2.fastq is used once and should ultimately be deleted upon completion. However, I'm finding that this file is uploaded to Amazon S3, even with the addition of temp(). I'm also finding that this file is removed locally, but still persists on S3.
Am I doing this correctly? '{sample}.adapterTrim.round2.fastq' is not currently written in the rule-all of the Snakefile.
We ultimately need to prevent this file from being uploaded to S3, so if there is a way to specify not to upload this file in the rule, that would be useful.
It seems that the snippet in the question is not consistent with actual use, since for S3 files one would need to wrap file names in remote.
However, as a general solution, documentation contains the following:
The remote() wrapper is mutually-exclusive with the temp() and protected() wrappers.
Hence, if you intend to use a temp file, make sure it's not wrapped in remote, or explicitly wrap the file in local.

Shell script to verify data packages

I need to make shell script to check my algorithms with loads of data(tests packages saved in .in files, every package contains folder with .in file and the other one with .out file where supposed to be correct result)
Sometimes It's about 1000 files in one packages so there's no point of doing it manually. I need some kind of loop which opens this .in file then redirect input of my c++ program and also redirect output of this program(save result to .out files) But the point is I can't get this language as quick as I need.
And I would like this script to compare results of my algorithm to .out files from packages
for f in ExternalIn/*.in; do//part of code which opens process with my algorithm and compare its .out file to .out file from package
Skipping checks for missing files, whitespace-safety, etc., you probably need something like:
for f in ExternalIn/*.in; do
# diff the result of my_cpp_app eating file.in with file.out
# and store the comparison result in file.diff
diff ${f/.in/.out} <(my_cpp_app <$f 2>/dev/null) > ${f/.in/.diff}
done
Although I would probably do it with find / xargs pipeline which is not only safer but also allows parallel execution.
Or even write a Makefile for this and use make, which after all is a tool for exactly this kind of work.

How to find text file in same directory

I am trying to read a list of baby names from the year 1880 in CSV format. My program, when run in the terminal on OS X returns an error indicating yob1880.txt doesnt exist.
No such file or directory # rb_sysopen - /names/yob1880.txt (Errno::ENOENT)
from names.rb:2:in `<main>'
The location of both the script and the text file is /Users/*****/names.
lines = []
File.expand_path('../yob1880.txt', __FILE__)
IO.foreach('../yob1880.txt') do |line|
lines << line
if lines.size >= 1000
lines = FasterCSV.parse(lines.join) rescue next
store lines
lines = []
end
end
store lines
If you're running the script from the /Users/*****/names directory, and the files also exist there, you should simply remove the "../" from your pathnames to prevent looking in /Users/***** for the files.
Use this approach to referencing your files, instead:
File.expand_path('yob1880.txt', __FILE__)
IO.foreach('yob1880.txt') do |line|
Note that the File.expand_path is doing nothing at the moment, as the return value is not captured or used for any purpose; it simply consumes resources when it executes. Depending on your actual intent, it could realistically be removed.
Going deeper on this topic, it may be better for the script to be explicit about which directory in which it locates files. Consider these approaches:
Change to the directory in which the script exists, prior to opening files
Dir.chdir(File.dirname(File.expand_path(__FILE__)))
IO.foreach('yob1880.txt') do |line|
This explicitly requires that the script and the data be stored relative to one another; in this case, they would be stored in the same directory.
Provide a specific path to the files
# do not use Dir.chdir or File.expand_path
IO.foreach('/Users/****/yob1880.txt') do |line|
This can work if the script is used in a small, contained environment, such as your own machine, but will be brittle if it data is moved to another directory or to another machine. Generally, this approach is not useful, except for short-lived scripts for personal use.
Never put a script using this approach into production use.
Work only with files in the current directory
# do not use Dir.chdir or File.expand_path
IO.foreach('yob1880.txt') do |line|
This will work if you run the script from the directory in which the data exists, but will fail if run from another directory. This approach typically works better when the script detects the contents of the directory, rather than requiring certain files to already exist there.
Many Linux/Unix utilities, such as cat and grep use this approach, if the command-line options do not override such behavior.
Accept a command-line option to find data files
require 'optparse'
base_directory = "."
OptionParser.new do |opts|
opts.banner = "Usage: example.rb [options]"
opts.on('-d', '--dir NAME', 'Directory name') {|v| base_directory = Dir.chdir(File.dirname(File.expand_path(v))) }
end
IO.foreach(File.join(base_directory, 'yob1880.txt')) do |line|
# do lines
end
This will give your script a -d or --dir option in which to specify the directory in which to find files.
Use a configuration file to find data files
This code would allow you to use a YAML configuration file to define where the files are located:
require 'yaml'
config_filename = File.expand_path("~/yob/config.yml")
config = {}
name = nil
config = YAML.load_file(config_filename)
base_directory = config["base"]
IO.foreach(File.join(base_directory, 'yob1880.txt')) do |line|
# do lines
end
This doesn't include any error handling related to finding and loading the config file, but it gets the point across. For additional information on using a YAML config file with error handling, see my answer on Asking user for information, and never having to ask again.
Final thoughts
You have the tools to establish ways to locate your data files. You can even mix-and-match solutions for a more sophisticated solution. For instance, you could default to the current directory (or the script directory) when no config file exists, and allow the command-line option to manually override the directory, when necessary.
Here's a technique I always use when I want to normalize the current working directory for my scripts. This is a good idea because in most cases you code your script and place the supporting files in the same folder, or in a sub-folder of the main script.
This resets the current working directory to the same folder as where the script is situated in. After that it's much easier to figure out the paths to everything:
# Reset working directory to same folder as current script file
Dir.chdir(File.dirname(File.expand_path(__FILE__)))
After that you can open your data file with just:
IO.foreach('yob1880.txt')

Ruby: FileUtils.cp truncates file; FileUtils.mv it does not?

This is weird… and I can't figure out for the life of me why it's doing it this way.
I've got a folder full of various CoffeeScript, SASS, HTML, and XML files.
I've got a Ruby script that's taking them all, compiling them, and minifying them into one master XML file (it's for iGoogle Gadget development).
This script takes command line args using trollop (I only state this to clarify my code below).
I want this script to copy this file from the current directory where it's created to a destination directory where it will be run.
So far, the building/compiling/minifying step runs like magic. It's #3 that's borked to Twilight Zone-level.
#!/usr/bin/ruby
…
if opts[:deploy_local]
FileUtils.cp 'build.xml', '/path/to/destination/'
puts "Copied #{written_file_name} to #{output_destination}." if opts[:verbose]
end
When this copies the file, the destination file is truncated about 3/4 of the way through it. The source file is just fine. However, moving the file works like a charm, for some strange reason.
FileUtils.mv 'build.xml', '/path/to/destination/'
To add another level of weirdness, if I just do a system copy, it also gets truncated.
system("cp build.xml /path/to/destination")
FWIW, I'm running this script from zsh and not bash. In both instances (copying and moving) the source and destination files are not in use by any other process.
Can anybody explain this freaky behavior?
A few things:
Are you moving to the same disk volume? If so, then, yeah, cam's comment about atomicity is definitely true; the OS is probably just messing with the inode table during a move, as opposed to writing out the data. IF you're moving the data between volumes, then it wouldn't be so simple.
Have you tried passing
:verbose => true
to the FileUtils.cp command? That might give a diagnostic about the failure.

What is the fastest way to unzip textfiles in Matlab during a function?

I would like to scan text of textfiles in Matlab with the textscan function. Before I can open the textfile with fid = fopen('C:\path'), I need to unzip the files first. The files have the extension: *.gz
There are thousands of files which I need to analyze and high performance is important.
I have two ideas:
(1) Use an external program an call it from the command line in Matlab
(2) Use a Matlab 'zip'toolbox. I have heard of gunzip, but don't know about its performance.
Does anyone knows a way to unzip these files as quick as possible from within Matlab?
Thanks!
You could always try the Matlab unzip() function:
unzip
Extract contents of zip file
Syntax
unzip(zipfilename)
unzip(zipfilename, outputdir)
unzip(url, ...)
filenames = unzip(...)
Description
unzip(zipfilename) extracts the archived contents of zipfilename into the current folder and sets the files' attributes, preserving the timestamps. It overwrites any existing files with the same names as those in the archive if the existing files' attributes and ownerships permit it. For example, files from rerunning unzip on the same zip filename do not overwrite any of those files that have a read-only attribute; instead, unzip issues a warning for such files.
Internally, this uses Java's zip library org.apache.tools.zip. If your zip archives each contain many text files it might be faster to drop down into Java and extract them entry by entry, without explicitly unzipped files. look at the source of unzip.m to get some ideas, and also the Java documentation.
I've found 7zip-commandline(Windows) / p7zip(Unix) to be somewhat speedier for this.
[edit]From some quick testing, it seems making a system call to gunzip is faster than using MATLAB's native gunzip. You could give that a try as well.
Just write a new function that imitates basic MATLAB gunzip functionality:
function [] = sunzip(fullfilename,output_dir)
if ~exist('output_dir','var'), output_dir = fileparts(fullfilename); end
app_path = '/usr/bin/7za';
switches = ' e'; %extract files ignoring directory structure
options = [' -o' output_dir];
system([app_path switches options '_' fullfilename]);
Then use it as you would use gunzip:
sunzip('/data/time_1000.out.gz',tmp_dir);
With MATLAB's toc timer, I get the following extraction times with 6 uncompressed 114MB ASCII files:
gunzip: 10.15s
sunzip: 7.84s
worked well, just needed a minor change to Max's syntax calling the executable.
system([app_path switches ' ' fullfilename options ]);

Resources