I'm using a lambda function to untar files. The lambda is supposed to untar files and once it's done it moves the package to an archive folder.
Code below
def untar_file(zip_key, source_bucket, source_path, file):
    zip_obj = s3_resource.Object(bucket_name=source_bucket, key=zip_key)
    buffer = BytesIO(zip_obj.get()["Body"].read())
    with tarfile.open(fileobj=buffer, mode='r:gz') as z:
        for filename in z.getmembers():
            s3_resource.meta.client.upload_fileobj(
                z.extractfile(filename),
                Bucket=source_bucket,
                Key=source_path+f'/{d1}/{filename}.csv'
            )
    copy_objects(zip_key, source_bucket, source_path, file)
I only want to untar specific files in the package. Can I specify which files not to untar, so that the Lambda doesn't time out?
Figured it out with a simple if statement.
zip_obj = s3_resource.Object(bucket_name=source_bucket, key=zip_key)
buffer = BytesIO(zip_obj.get()["Body"].read())
with tarfile.open(fileobj=buffer, mode='r:gz') as z:
    for filename in z.getmembers():
        # skip any member whose name contains one of the excluded words
        if not any(word in filename.name for word in ['text']):
            print(filename.name)
            s3_resource.meta.client.upload_fileobj(
                z.extractfile(filename),
                Bucket=source_bucket,
                # use the member's name; formatting the TarInfo object itself gives its repr
                Key=source_path+f'/{d1}/{filename.name}.csv'
            )
            print('uploaded')
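The same filtering idea can be tried locally without S3. The following is a minimal sketch, not the code from the post: `select_members` and `build_sample_archive` are hypothetical helpers, and the sample file names are made up for illustration. It builds a small .tar.gz in memory and keeps only the regular files whose names contain none of the excluded words.

```python
import io
import tarfile


def select_members(tar, skip_words):
    """Yield regular-file members whose names contain none of skip_words."""
    for member in tar.getmembers():
        if not member.isfile():
            continue  # directories and links have no file content to upload
        if any(word in member.name for word in skip_words):
            continue
        yield member


def build_sample_archive():
    """Create a tiny gzip-compressed tar archive in memory (illustrative data)."""
    buffer = io.BytesIO()
    samples = [
        ("data/a.csv", b"1,2\n"),
        ("data/text_notes.csv", b"x\n"),
        ("b.csv", b"3,4\n"),
    ]
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, payload in samples:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


with tarfile.open(fileobj=build_sample_archive(), mode="r:gz") as tar:
    kept = {m.name: tar.extractfile(m).read() for m in select_members(tar, ["text"])}

print(sorted(kept))
```

Note that this only skips the upload step; the whole archive is still downloaded and read into memory first, so how much time it saves depends on where the Lambda actually spends it.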
I have a Nextflow pipeline that has two channels.
The first channel runs and outputs 6 .tsv files to a folder called 'results'.
The second channel is supposed to use all of these 6 .tsv files and create a .pdf report using knitr in R in a process called 'createReport'.
My workflow code looks like this:
workflow {
    inputFileChannel = Channel.fromPath(params.pathOfInputFile, type: 'file') // | collect | createReport // creating channel to pass in input file
    findNumOfProteins(inputFileChannel) // passing in the channel to the process
    findAminoAcidFrequency(inputFileChannel)
    getProteinDescriptions(inputFileChannel)
    getNumberOfLines(inputFileChannel)
    getNumberOfLinesWithoutSpaces(inputFileChannel)
    getLengthFreq(inputFileChannel)
    outputFileChannel = Channel.fromPath("$params.outdir.main/*.tsv", type: 'file').buffer(size:6)
    createReport(outputFileChannel)
}
My 'createReport' process currently looks like this:
process createReport {
    module 'R/4.2.2'
    publishDir params.outdir.output, mode: 'copy'

    output:
    path 'report.pdf'

    script:
    """
    R -e "rmarkdown::render('./createReport.Rmd')"
    """
}
And my 'createReport.Rmd' looks like this (tested in RStudio, where it gives the correct .pdf output):
---
title: "R Markdown Practice"
author: "-"
date: "2022-12-08"
output: pdf_document
---
{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readr)
dataSet <- list.files(path="/Users/-/Desktop/code/nextflow_practice/results/", pattern="*.tsv")
print(dataSet)
for (data in dataSet) {
  print(paste("Showing the table for:", data))
  targetData <- read.table(file=paste("/Users/-/Desktop/code/nextflow_practice/results/", data, sep=""),
                           head=TRUE,
                           nrows=5,
                           sep="\t")
  print(targetData)
  if (data == "length_data.tsv") {
    data_to_graph <- read_tsv(paste("/Users/-/Desktop/code/nextflow_practice/results/", data, sep=""), show_col_types = FALSE)
    plot(x = data_to_graph$LENGTH,y = data_to_graph$FREQ, xlab = "x-axis", ylab = "y-axis", main = "P")
  }
  writeLines("-----------------------------------------------------------------")
}
What would be the correct way to write the createReport process and the workflow sections so as to be able to pass the 6 .tsv outputs from the first channel into the second channel to create the report?
Sorry I am very new to Nextflow and the documentation doesn't help me as much as I would like it to!
Your outputFileChannel looks like it is trying to access files in the publishDir. The problem with accessing files in this directory (i.e. 'results') is that:
Files are copied into the specified directory in an asynchronous
manner, thus they may not be immediately available in the published
directory at the end of the process execution. For this reason files
published by a process must not be accessed by other downstream
processes.
Assuming your inputFileChannel is intended to be a value channel, you could use the following. This requires the outputs of the six processes to be declared in their output blocks (using the path qualifier). We can then just mix and collect these files, and pass your Rmd file and the list of TSV files to your createReport process. Note that if you move your Rmd into the base directory of your pipeline project (i.e. the same directory as your main.nf script), you can distribute it with your workflow. Providing the Rmd over a channel also ensures it is staged into the process working directory when the job is run. For example:
workflow {

    inputFile = file( params.pathOfInputFile )

    findNumOfProteins( inputFile )
    findAminoAcidFrequency( inputFile )
    getProteinDescriptions( inputFile )
    getNumberOfLines( inputFile )
    getNumberOfLinesWithoutSpaces( inputFile )
    getLengthFreq( inputFile )

    Channel.empty() \
        | mix( findNumOfProteins.out ) \
        | mix( findAminoAcidFrequency.out ) \
        | mix( getProteinDescriptions.out ) \
        | mix( getNumberOfLines.out ) \
        | mix( getNumberOfLinesWithoutSpaces.out ) \
        | mix( getLengthFreq.out ) \
        | collect \
        | set { outputs }

    rmd = file("${baseDir}/createReport.Rmd")

    createReport( outputs, rmd )
}
process createReport {

    module 'R/4.2.2'

    publishDir "${params.outdir}/report", mode: 'copy'

    input:
    path 'input_dir/*'
    path rmd

    output:
    path 'report.pdf'

    """
    Rscript -e "rmarkdown::render('${rmd}')"
    """
}
Note that the createReport process above will stage the input TSV files under a folder called 'input_dir' in the process working directory. You could change this if you want to, but I think this keeps the working directory neat and tidy. Just be sure to modify your Rmd script to point to this folder. For example, you might choose to use something like:
dataSet <- list.files(path="./input_dir", pattern="*.tsv")
Or perhaps even:
dataSet <- list.files(pattern="*.tsv", recursive=TRUE)
I have a set of files in a folder. I would like to pass an array of the files in a folder to some function. I saw the following example
$files= ["C:/dir/file1", "C:/dir/file2", "C:/dir/file3",
         "C:/dir/file4", "C:/dir/file5"]
# function call with lambda:
$binaries.each |String $binary| {
  file {"/usr/bin/$binary":
    ensure => file,
  }
}
but instead of declaring files manually, can I read all the files from a directory and pass it to some function?
You can use Dir to fetch all files using some pattern. For example:
[1] pry(main)> Dir["/Users/smefju/tmp/*"]
=> ["/Users/smefju/tmp/a.rb",
"/Users/smefju/tmp/asd",
"/Users/smefju/tmp/bm.rb",
"/Users/smefju/tmp/cert",
"/Users/smefju/tmp/gc",
"/Users/smefju/tmp/qq"]
Silly question, but I want to do some processing on a dataset and put them into different CSVs, like UDID1.csv, UDID2.csv, ..., UDID1000.csv. So this is my code:
for i in 1..1000
  logfile = File.new('C:\Users\hp1\Desktop\Datasets\New File\UDID#{i}\.csv',"a")
  #I'll do some processing here
end
But the program throws an error when running because of the UDID#{i} part. So, how to overcome this issue? Thanks.
Edit: This is the error:
in `initialize': No such file or directory @ rb_sysopen - C:\Users\hp1\Desktop\Datasets\New File\udid#{1}\.csv (Errno::ENOENT) from C:/Ruby21/bin/hashedUDID.rb:38:in `new' from C:/Ruby21/bin/hashedUDID.rb:38:in `<main>'
The single quote (') is one problem; the other is the path.
As posted, New File must exist as a directory, and inside it there must be further directories such as UDID1, each of which would then get a file named .csv.
The correct version is (I don't use the un-Ruby-like for loop):
1.upto(1000) do |i|
  logfile = File.new("C:\\Users\\hp1\\Desktop\\Datasets\\UDID#{i}.csv", "a")
  #I'll do some processing here
  logfile.close #Don't forget to close the file
end
Inside double quotes the backslash must be escaped (\\). Alternatively, you can use /:
logfile = File.new("C:/Users/hp1/Desktop/Datasets/New File/UDID#{i}.csv", "a")
Another possibility is to use %i to insert the number:
logfile = File.new("C:/Users/hp1/Desktop/Datasets/New File/UDID%02i.csv" % i, "a")
I prefer to use open, because then the file is closed at the end of the block:
File.open("C:/Users/hp1/Desktop/Datasets/New File/UDID%04i.csv" % i, "a") do |logfile|
  #I'll do some processing here
end #closes the file
Warning:
I'm not sure whether you really want to create 1000 log files (the file is opened inside the loop, so each iteration creates a file).
If so, the %04i version has the advantage that all file names get the same number of digits (starting with 0001 and ending with 1000).
(1..10).each { |i| logfile = File.new("/base/path/UDID#{i}.csv", "a") }
You must use double quotes (") when you need string interpolation.
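#{} can only be used in strings with double quotes ". Inside double quotes the backslash is an escape character, so it has to be doubled (or replaced with /). So change your code to:
for i in 1..1000
  logfile = File.new("C:\\Users\\hp1\\Desktop\\Datasets\\New File\\UDID#{i}.csv","a")
  # other stuff
end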
I am trying to edit particular html files that I download in python. I am running into a problem where I run my code to edit the file and my python context locks up. I checked the file it's writing to and found that there are two files. The html file and a .bak file.
The html file starts out at 0 KB and the .bak file grows steadily to a point, maybe 12 MB or so; then the .html file grows to a larger size, then the .bak file grows again. This seems to cycle endlessly. The html file I am editing is 22 KB. I once watched the output file grow to a gigabyte just to see if it would stop... It doesn't.
Here is the function I am using to edit the file:
def replace(self, search_str, replace_str):
    f = open(self.path,'r+')
    content = f.readlines()
    for i, line in enumerate(content):
        content[i] = line.replace(search_str, replace_str)
    f.writelines(content)
    f.close()
The issue, I imagine, relates to the fact that the html file, as downloaded, is mostly a single line of about 21,000 characters. Any ideas?
edit:
I have also tried another function, but get the same result:
def replace(self, search_str, replace_str):
    assert self.path != None, 'No file path provided.'
    fi = fileinput.FileInput(self.path,inplace=1)
    for line in fi:
        if search_str in line:
            line=line.replace(search_str,replace_str)
        print line
    fi.close()
Try using a generator. That's the way to go if you need to read a large file:
for line in open(self.path,'r+'):
    # do stuff with line
I re-wrote the function to write everything out to a new file and it works.
def replace(self, search_str, replace_str):
    f = open(self.path,'r+')
    new_path = self.path.split('.')[0]+'.TEMP'
    new_f = open(new_path,'w')
    new_lines = [x.replace(search_str, replace_str) for x in f]
    new_f.writelines(new_lines)
    f.close()
    new_f.close()
    os.remove(self.path)
    os.rename(new_path, self.path)
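A slightly more defensive variant of the same write-to-a-new-file approach is sketched below. It is only an illustration under some assumptions: it targets Python 3 (the code above is Python 2), it assumes the file is UTF-8 text, and `replace_in_file` is a hypothetical standalone function rather than the method from the post. It creates the temporary file with `tempfile.mkstemp` instead of deriving a name by splitting the path on '.', and swaps it into place with `os.replace`.

```python
import os
import tempfile


def replace_in_file(path, search_str, replace_str, encoding="utf-8"):
    """Rewrite the file at `path` with search_str replaced by replace_str."""
    directory = os.path.dirname(os.path.abspath(path))
    # Put the temporary file next to the target so the final swap
    # does not have to cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding, newline="") as dst, \
                open(path, "r", encoding=encoding, newline="") as src:
            for line in src:  # the file object is read lazily, line by line
                dst.write(line.replace(search_str, replace_str))
        # Both files are closed here; replace the original with the new copy.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Small demonstration on a throwaway file (the file name is made up).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "page.v2.html")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("<p>old text</p><p>old again</p>")

replace_in_file(demo_path, "old", "new")

with open(demo_path, encoding="utf-8") as handle:
    result = handle.read()
print(result)
```

Because the original is never opened for writing, reading and writing cannot interfere with each other, and a failure partway through leaves the original file untouched.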
Anyone out there know of a graceful way to install the "Image::OCR::Tesseract" module on Windows? The module fails to install on Windows via CPAN due to a *NIX only module dependency called "LEOCHARRE::CLI". This module does not seem to be required to run "Image::OCR::Tesseract" itself.
I've managed to get the module working by first manually installing the dependency modules listed in the makefile.pl (except for "LEOCHARRE::CLI") and then by moving the module file to the correct directory structure under "C:\Perl\site\lib\Image\OCR". The final part of getting it to work was to alter the section of code that calls the ImageMagick and Tesseract executables from the command line to put quotes around the program names when the executables are called by module.
This works, but I'd really feel better about doing a PPM or CPAN install on a production system from a repo that works on Windows.
Never mind, I got it, though I can't decide what is the better solution.
Getting the installer to work on Windows via the traditional "perl Makefile.pl, make, make test, make install" routine requires an edit to the Makefile.pl script, adding the missing Windows install module (Devel::AssertOS::MSWin32), and a patch to AssertEXE.pm so that it uses "File::Which" rather than the built-in shell "which" command that Windows lacks. All this still requires that "Image::OCR::Tesseract" be patched to put quotes around program names when executing "convert" and "tesseract" from the command line.
Given the number of steps involved to make the installer work on Windows, and the fact the module does not create a binary component for the module to link to, I'd say the best option for installing and getting the Tesseract module working on windows would be to first install the following binary packages:
ImageMagick
Link
Tesseract
http://code.google.com/p/tesseract-ocr/downloads/list
Next, locate your Perl module directory - on my system it is "C:\Perl\site\lib". Create a folder "Image", if you don't have one. Next, open the Image folder and create a folder called "OCR". Open the OCR folder. At this point, your path should be something along the lines of "C:\Perl\site\lib\Image\OCR". Create a new text file called "Tesseract.pm", and copy in the following content...
package Image::OCR::Tesseract;
use strict;
use Carp;
use Cwd;
use String::ShellQuote 'shell_quote';
use Exporter;
use vars qw(@EXPORT_OK @ISA $VERSION $DEBUG $WHICH_TESSERACT $WHICH_CONVERT %EXPORT_TAGS @TRASH);
@ISA = qw(Exporter);
@EXPORT_OK = qw(get_ocr get_hocr _tesseract convert_8bpp_tif tesseract);
$VERSION = sprintf "%d.%02d", q$Revision: 1.24 $ =~ /(\d+)/g;
%EXPORT_TAGS = ( all => \@EXPORT_OK );
BEGIN {
    use File::Which 'which';
    $WHICH_TESSERACT = which('tesseract');
    $WHICH_CONVERT = which('convert');
    # check before quoting: once wrapped in quotes the value is never empty
    $WHICH_TESSERACT or die("Is tesseract installed? Cannot find bin path to tesseract.");
    $WHICH_CONVERT or die("Is convert installed? Cannot find bin path to convert.");
    if($^O=~m/MSWin/) {
        $WHICH_TESSERACT='"'.$WHICH_TESSERACT.'"';
        $WHICH_CONVERT='"'.$WHICH_CONVERT.'"';
    }
}
END {
    scalar @TRASH or return;
    if ( $DEBUG ){
        print STDERR "Debug on, these are trash files:\n".join("\n",@TRASH) ;
    }
    else {
        unlink @TRASH;
    }
}
sub DEBUG { Carp::cluck("Image::OCR::Tesseract::DEBUG() deprecated") }
sub get_hocr {
    my ($abs_image,$abs_tmp_dir,$lang)= @_;
    -f $abs_image or croak("$abs_image is not a file on disk");
    my $hocr="hocr";
    if(defined $abs_tmp_dir){
        -d $abs_tmp_dir or die("tmp dir arg $abs_tmp_dir not a dir on disk.");
        $abs_image=~/([^\/]+)$/ or die("cant match filename in path arg '$abs_image'");
        my $abs_copy = "$abs_tmp_dir/$1";
        # TODO, what if source and dest are same, i want it to die
        require File::Copy;
        File::Copy::copy($abs_image, $abs_copy)
            or die("cant make copy of $abs_image to $abs_copy, $!");
        # change the image to get ocr from to be the copy
        $abs_image = $abs_copy;
        # since it's a copy. erase that on exit
        push @TRASH, $abs_image;
    }
    my $tmp_tif = convert_8bpp_tif($abs_image);
    push @TRASH, $tmp_tif; # for later delete
    _tesseract($tmp_tif,$lang,$hocr) || '';
}
sub get_ocr {
    my ($abs_image,$abs_tmp_dir,$lang)= @_;
    -f $abs_image or croak("$abs_image is not a file on disk");
    if(defined $abs_tmp_dir){
        -d $abs_tmp_dir or die("tmp dir arg $abs_tmp_dir not a dir on disk.");
        $abs_image=~/([^\/]+)$/ or die("cant match filename in path arg '$abs_image'");
        my $abs_copy = "$abs_tmp_dir/$1";
        # TODO, what if source and dest are same, i want it to die
        require File::Copy;
        File::Copy::copy($abs_image, $abs_copy)
            or die("cant make copy of $abs_image to $abs_copy, $!");
        # change the image to get ocr from to be the copy
        $abs_image = $abs_copy;
        # since it's a copy. erase that on exit
        push @TRASH, $abs_image;
    }
    my $tmp_tif = convert_8bpp_tif($abs_image);
    push @TRASH, $tmp_tif; # for later delete
    _tesseract($tmp_tif,$lang) || '';
}
sub convert_8bpp_tif {
    my ($abs_img,$abs_out) = (shift,shift);
    defined $abs_img or die('missing image arg');
    $abs_out ||= $abs_img.'.tmp.'.time().(int rand(9000)).'.tif';
    my @arg = ( $WHICH_CONVERT, $abs_img, '-compress','none','+matte', $abs_out );
    #die (join(" ", @arg));
    system(@arg) == 0 or die("convert $abs_img error.. $?");
    $DEBUG and warn("made $abs_out 8bpp tiff.");
    $abs_out;
}
# people expect tesseract to automatically convert
*tesseract = \&_tesseract;
sub _tesseract {
    my ($abs_image,$lang,$hocr) = @_;
    defined $abs_image or croak('missing image path arg');
    $abs_image=~/\.tif+$/i or warn("Are you sure '$abs_image' is a tif image? This operation may fail.");
    #my @arg = (
    #   $WHICH_TESSERACT, shell_quote($abs_image), shell_quote($abs_image),
    #   (defined $lang and ('-l', $lang) ), '2>/dev/null'
    #);
    my $cmd =
        ( sprintf '%s %s %s',
            $WHICH_TESSERACT,
            shell_quote($abs_image),
            shell_quote($abs_image)
        ) .
        ( defined $lang ? " -l $lang" : '' ) .
        ( defined $hocr ? " hocr" : '' ) .
        " 2>/dev/null";
    $DEBUG and warn "command: $cmd";
    system($cmd); # hard to check ==0
    my $txt = $abs_image.($hocr?".html":".txt");
    unless( -f $txt ){
        Carp::cluck("no text output for image '$abs_image'. (No text file '$txt' found on disk)");
        return;
    }
    $DEBUG and warn "Found text file '$txt'";
    my $content = (_slurp($txt) || '');
    $DEBUG and warn("content length of text in '$txt' from image '$abs_image' is ". length $content );
    push @TRASH, $txt;
    $content;
}
sub _slurp {
    my $abs = shift;
    open(FILE,'<', $abs) or die("can't open file for reading '$abs', $!");
    local $/;
    my $txt = <FILE>;
    close FILE;
    $txt;
}
1;
__END__
#sub _force_imgtype {
# my $img = shift;
# my $type = shift;
# my $delete_original = shift;
# $delete_original ||=0;
#
#
# if($img=~/\.$type$/i){
# return $img;
# }
#
# my $img_out= $img;
# $img_out=~s/\.\w{1,5}$/\.$type/ or die("cant get file ext for $img");
#
#
#
#}
Save and close. Close the command line session and open a new one if you've had one open from before you did the ImageMagick and Tesseract binary installs. Test the module with the following script:
use Image::OCR::Tesseract;
my $image = 'SomeImageFileThatContainsText.jpg';
my $text = Image::OCR::Tesseract::get_ocr($image);
print "Text...\n";
print $text."\n";
print "Normal Exit\n";
exit;
That's it. Messy, I know, but there's no good way around the fact that the module installer really needs to be updated to support Windows (and other) systems even though the actual module code almost runs without modification. Really, if Tesseract and ImageMagick were installed to paths without spaces then the "Image::OCR::Tesseract" module code would not need any changes, but this minor tweak lets the supporting executables be installed anywhere, including the default locations.