In Tcl, open a file from the tail - performance

I have this issue:
I sometimes need to process huge text files.
In some cases, I know that the data I care about is in the tail of the file. With a normal open, I have to read through every line before it. Is there a way to "start from the end" and save time, maybe by reading the file backwards?

You can use the seek command:
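set linesz 256 ; # assume an approximate line size
set fname myfile.txt
set fh [open $fname]
# Never seek back past the start of the file; seek would raise an error.
set back [expr {min([file size $fname], $linesz * 5)}]
seek $fh -$back end
# Unless we landed at the very start, the first line is probably partial: skip it.
if {[tell $fh] > 0} { gets $fh line }
while { [gets $fh line] >= 0 } {
    # process
}
close $fh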
References: seek

Related

How to remove small contigs from fasta files?

## removesmalls.pl
#!/usr/bin/perl
use strict;
use warnings;
my $minlen = shift or die "Error: `minlen` parameter not provided\n";
{
local $/=">";
while(<>) {
chomp;
next unless /\w/;
s/>$//gs;
my @chunk = split /\n/;
my $header = shift @chunk;
my $seqlen = length join "", @chunk;
print ">$_" if($seqlen >= $minlen);
}
local $/="\n";
}
Executing the script as follows:
perl removesmalls.pl 1000 contigs.fasta > contigs-1000.fasta
The above script works for me, but there is a problem.
I have 109 different FASTA files with different file names.
I can run the script on an individual file, but I want to run it on all files at once, with a separate result file for each.
The file names are like SRR8224532.fasta, SRR8224533.fasta, SRR8224534.fasta, and so on.
I want the result files, after removing the small contigs (for me, those shorter than 1000), to be named something like SRR8224532-out.fasta,
SRR8224533-out.fasta, and so on.
Any help or suggestion would be helpful.
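One way to handle all the files in a single run is to move the filtering into a subroutine and call it once per input file, deriving each output name from the input name. The following is a minimal sketch rather than a tested drop-in replacement: the script name removesmalls_batch.pl and the helper filter_fasta are hypothetical, and it assumes that every input name ends in .fasta and that the shell expands the wildcard.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy the records of $in whose sequence length is >= $minlen to $out.
# (filter_fasta is a hypothetical helper name, not part of the original script.)
sub filter_fasta {
    my ($in, $out, $minlen) = @_;
    open my $src, '<', $in  or die "Can't open '$in': $!\n";
    open my $dst, '>', $out or die "Can't create '$out': $!\n";
    local $/ = '>';                      # read one FASTA record per iteration
    while (my $record = <$src>) {
        chomp $record;                   # strip the trailing '>' separator
        next unless $record =~ /\w/;     # skip the empty chunk before the first '>'
        my ($header, @seq) = split /\n/, $record;
        my $seqlen = length join '', @seq;
        print {$dst} ">$record" if $seqlen >= $minlen;
    }
    close $dst or die "Can't close '$out': $!\n";
    close $src;
    return;
}

# Usage: perl removesmalls_batch.pl 1000 SRR*.fasta
my ($minlen, @files) = @ARGV;
for my $in (@files) {
    next if $in =~ /-out\.fasta\z/;      # do not reprocess earlier results
    (my $out = $in) =~ s/\.fasta\z/-out.fasta/
        or next;                         # assumption: inputs end in .fasta
    filter_fasta($in, $out, $minlen);
}
```

Run this way, SRR8224532.fasta would produce SRR8224532-out.fasta next to it, and the original files are left untouched.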

Delete a specific line from a 12GB file

I'm trying to delete a specific line from a 12GB text file.
I do not have the sed -i option available on HP-UX, and other options like saving to a temporary file aren't working because I have only 20GB space available with 12 GB already used by the text file.
Considering the space requirement I'm trying to do this using Perl.
This solution works to delete last 9 lines from a file of 12 GB.
#!/usr/bin/env perl
use strict;
use warnings;
use Tie::File;
tie my @lines, 'Tie::File', 'test.txt' or die "$!\n";
$#lines -= 9;
untie @lines;
I want to modify the above code to delete any specific line number.
Tie::File is never the answer.
It's insanely slow.
It can use up more memory than just slurping the entire file into memory, even if you limit the size of its buffer.
You are encountering both of those problems. You encounter every line of the file, so Tie::File will read the entire file and store the index of every line in memory. This takes 28 bytes per line on a 64-bit build of Perl (not counting any overhead in the memory allocator).
To delete the last 9 lines of the file, you can use the following:
use File::ReadBackwards qw( );
my $qfn = '...';
my $pos;
{
my $bw = File::ReadBackwards->new($qfn)
or die("Can't open \"$qfn\": $!\n");
for (1..9) {
defined( my $line = $bw->readline() )
or last;
}
$pos = $bw->tell();
}
# Can't use $bw->get_handle because it's a read-only handle.
truncate($qfn, $pos)
or die("Can't truncate \"$qfn\": $!\n");
To delete an arbitrary line, you can use the following:
my $qfn = '...';
open(my $fh_src, '<:raw', $qfn)
or die("Can't open \"$qfn\": $!\n");
open(my $fh_dst, '+<:raw', $qfn)
or die("Can't open \"$qfn\": $!\n");
while (<$fh_src>) {
next if $. == 9; # Or "if /keyword/", or whatever condition you want.
print($fh_dst $_)
or die($!);
}
truncate($fh_dst, tell($fh_dst))
or die($!);
The following optimized version assumes there's only one line (or block of lines) to remove:
use Fcntl qw( SEEK_CUR SEEK_SET );
use constant BLOCK_SIZE => 4*1024*1024;
my $qfn = 'file';
open(my $fh_src, '<:raw', $qfn)
or die("Can't open \"$qfn\": $!\n");
open(my $fh_dst, '+<:raw', $qfn)
or die("Can't open \"$qfn\": $!\n");
my $dst_pos;
while (1) {
$dst_pos = tell($fh_src);
defined( my $line = <$fh_src> )
or do {
$dst_pos = undef;
last;
};
last if $. == 9; # Or "if /keyword/", or whatever condition you want.
}
if (defined($dst_pos)) {
# We're switching from buffered I/O to unbuffered I/O,
# so we need to move the system file pointer from where the
# buffered read left off to where we actually finished reading.
sysseek($fh_src, tell($fh_src), SEEK_SET)
or die($!);
sysseek($fh_dst, $dst_pos, SEEK_SET)
or die($!);
while (1) {
my $rv = sysread($fh_src, my $buf, BLOCK_SIZE);
die($!) if !defined($rv);
last if !$rv;
my $written = 0;
while ($written < length($buf)) {
my $rv = syswrite($fh_dst, $buf, length($buf)-$written, $written);
die($!) if !defined($rv);
$written += $rv;
}
}
# Must use sysseek instead of tell with sysread/syswrite.
truncate($fh_dst, sysseek($fh_dst, 0, SEEK_CUR))
or die($!);
}

How can I reduce this to a single file open?

Using Strawberry Perl 5.22.0 in Windows 7. Is there a more "perlish" way to write this snippet of code? I hate the duplication of file open sections, but cannot think of a way to make it only open once because of the requirement to test the creation time.
...
my $x;
my $fh;
my $sentinelfile = "Logging.yes"; #if this file exists then enable logging
my $logfile = "transfers.log";
my $log = 0; #default to NO logging
$log = 1 if -e $sentinelfile; #enable logging if sentinel file exists
if($log){
#logfile remains open after this so remember to close at end of program!
if (-e $logfile) { #file exists
open($fh, "<", $logfile); #open for read will NOT create if not exist
chomp ($x = <$fh>); #grab first row
close $fh;
if (((scalar time - $x)/3600/24) > 30) { #when ~30 days since created
rename($logfile, $logfile . time); #rename existing logfile
open($fh, ">", $logfile); #open for write and truncate
print $fh time,"\n"; #save create date
print $fh "--------------------------------------------------\n";
} else { #file is not older than 30 days
open($fh, ">>", $logfile); #open for append
}
} else { #file not exist
open($fh, ">", $logfile); #open new for write
print $fh time,"\n"; #save create date
print $fh "--------------------------------------------------\n";
}
} #if $log
...
To recap: logfile logs stuff. First row of file contains the logfile creation date. Second row contains horizontal rule. Rest of file contains text. Around 30 days after file was created, rename file and start a new one. After the above chunk of code the logfile is open and ready for logging stuff. It gets closed at the end of the rest of the program.
There are other, non-cosmetic problems with your code: a) You do not ever check if your calls to open succeeded; b) You are creating a race condition. The file can come into existence after the -e check has failed. The subsequent open $fh, '>' ... would then clobber it; c) You don't check if your rename call succeeded etc.
The following would be a partial improvement on your existing code:
if ($log) {
if (open $fh, '<', $logfile) { #file exists
chomp ($x = <$fh>);
close $fh
or die "Failed to close '$logfile': $!";
if (((time - $x)/3600/24) > 30) {
my $rotated_logfile = join '.', $logfile, time;
rename $logfile => $rotated_logfile
or die "Failed to rename '$logfile' to '$rotated_logfile': $!";
open $fh, '>', $logfile
or die "Failed to create '$logfile': $!";
print $fh time, "\n", '-' x 50, "\n";
}
else {
open $fh, '>>', $logfile
or die "Cannot open '$logfile' for appending: $!";
}
}
else {
open $fh, '>', $logfile
or die "Cannot create '$logfile': $!";
print $fh time, "\n", '-' x 50, "\n";
}
}
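The race condition mentioned above can be narrowed by letting the operating system perform the existence check and the creation in a single step. The following is a minimal sketch, assuming the same two-line header as the code above; the helper names create_log_exclusively and open_log_safely are hypothetical.

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

# Create $file only if it does not exist yet and write the header.
# Returns the handle, or nothing if the file was already there.
sub create_log_exclusively {
    my ($file) = @_;
    my $fh;
    unless (sysopen($fh, $file, O_WRONLY | O_CREAT | O_EXCL)) {
        return if $!{EEXIST};            # somebody else created it first
        die "Cannot create '$file': $!\n";
    }
    print {$fh} time, "\n", '-' x 50, "\n"
        or die "Cannot write header to '$file': $!\n";
    return $fh;
}

# Open the log for writing without ever clobbering an existing file.
sub open_log_safely {
    my ($file) = @_;
    my $fh = create_log_exclusively($file);
    return $fh if $fh;
    open $fh, '>>', $file
        or die "Cannot open '$file' for appending: $!\n";
    return $fh;
}
```

With O_EXCL the create fails instead of truncating when the file appears between the check and the open, so an existing log is only ever appended to.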
It would be better to abstract every bit of discrete functionality to suitably named functions.
For example, here is a completely untested re-write:
use autouse Carp => qw( croak );
use constant SENTINEL_FILE => 'Logging.yes';
use constant ENABLE_LOG => -e SENTINEL_FILE;
use constant HEADER_SEPARATOR => '-' x 50;
use constant SECONDS_PER_DAY => 24 * 60 * 60;
use constant ROTATE_AFTER => 30 * SECONDS_PER_DAY;
my $fh;
if (ENABLE_LOG) {
if (my $age = read_age( $logfile )) {
if ( is_time_to_rotate( $age ) ) {
rotate_log( $logfile );
}
else {
$fh = open_log( $logfile );
}
}
unless ($fh) {
$fh = create_log( $logfile );
}
}
sub is_time_to_rotate {
my $age = shift;
return $age > ROTATE_AFTER;
}
sub rotate_log {
my $file = shift;
my $saved_file = join '.', $file, time;
rename $file => $saved_file
or croak "Failed to rename '$file' to '$saved_file': $!";
return;
}
sub create_log {
my $file = shift;
open my $fh, '>', $file
or croak "Failed to create '$file': $!";
print $fh time, "\n", HEADER_SEPARATOR, "\n"
or croak "Failed to write header to '$file': $!";
return $fh;
}
sub open_log {
my $file = shift;
open my $fh, '>>', $file
or croak "Failed to open '$file': $!";
return $fh;
}
sub read_age {
my $file = shift;
open my $fh, '<', $file
or return;
defined (my $creation_time = <$fh>)
or croak "Failed to read creation time from '$file': $!";
return time - $creation_time;
}
If you need to read a line of a file, rename it and then work with it, you have to open it twice.
However, you can also do away with using that first line.
On Windows, according to perlport (Files and Filesystems), the inode change time time-stamp (ctime) "may really" mark the file creation time. This is likely to be completely suitable for a log file that doesn't get manipulated and moved around. It can be obtained with the -C file-test operator
my $days_float = -C $filename;
Now you can numerically test this against 30. Then there is no need to print the file's creation time to its first line (but you may as well if it is useful for viewing or other tools).
Also, there is the module Win32API::File::Time, with the purpose to
provide maximal access to the file creation, modification, and access times under MSWin32
Please do read the docs for some caveats. I haven't used it but it seems tailored for your need.
A good point is raised in a comment: apparently the OS retains the original time-stamp when the file is renamed. In that case, when the file is too old, copy it to a new file (with the new name) and delete the original instead of using rename. Then open the log file anew, so that it gets a new time-stamp.
Here is a complete example
archive_log($logfile) if -f $logfile and -C $logfile > 30;
open my $fh_log, '>>', $logfile or die "Can't open $logfile: $!";
say $fh_log "Log a line";
sub archive_log {
my ($file) = @_;
require POSIX; POSIX->import('strftime');
my $ts = strftime("%Y%m%d_%H%M%S", localtime); # e.g. 20170629_124410 (no ':', which Windows file names disallow)
require File::Copy; File::Copy->import('copy');
my $archive = $file . "_$ts";
copy ($file, $archive) or die "Can't copy $file to $archive: $!";
unlink $file or die "Can't unlink $file: $!";
}
The archive_log archives the current log by copying it and then removes it.
So after that we can just open for append, which creates the file if not there.
The -C tests for file existence but since its output is used in a numerical test we need -f first.
Since this happens once a month, I load the modules at runtime, with require and import, only when the log actually needs to be rotated. If you already use File::Copy then there is no need for this. As for the time-stamp, I threw in something to make this a working example.
I tested this on UNIX, by changing -C to -M and tweaking the timestamp with touch -t -c.
Better yet, to reduce the caller's code further, move the tests into the sub as well:
my $fh_log = open_log($logfile);
say $fh_log "Log a line";
sub open_log {
my ($file) = @_;
if (-f $file and -C $file > 30) {
# code from archive_log() above, to copy and unlink $file
}
open my $fh_log, '>>', $file or die "Can't open $file: $!";
return $fh_log;
}
Note. On UNIX the file's creation time is not kept anywhere. The closest notion is the ctime above, but this is of course different. For one thing, it changes with many operations, for instance mv, ln, chmod, chown, chgrp (and probably others).

Search for specific lines from a file

I have an array that contains the data from a text file.
I want to filter the array and copy some information to another array. grep seems to not work.
Here's what I have
$file = 'files.txt';
open (FH, "< $file") or die "Can't open $file for read: $!";
@lines = <FH>;
close FH or die "Cannot close $file: $!";
chomp(@lines);
foreach $y (@lines){
if ( $y =~ /(?:[^\\]*\\|^)[^\\]*$/g ) {
print $1, pos $y, "\n";
}
}
files.txt
public_html
Trainings and Events
General Office\Resources
General Office\Travel
General Office\Office Opperations\Contacts
General Office\Office Opperations\Coordinator Operations
public_html\Accordion\dependencies\.svn\tmp\prop-base
public_html\Accordion\dependencies\.svn\tmp\props
public_html\Accordion\dependencies\.svn\tmp\text-base
The regular expression should take the last one or two folders and put them into their own array for printing.
A regex can get very picky for this. It is far easier to split the path into components and then count off as many as you need. And there is a tool for this exact purpose, the core module File::Spec, as mentioned by xxfelixxx in a comment.
You can use its splitdir to break up the path, and catdir to compose one.
use warnings 'all';
use strict;
use feature 'say';
use File::Spec::Functions qw(splitdir catdir);
my $file = 'files.txt';
open my $fh, '<', $file or die "Can't open $file: $!";
my @dirs;
while (<$fh>) {
next if /^\s*$/; # skip empty lines
chomp;
my @path = splitdir $_;
push @dirs, (@path >= 2 ? catdir @path[-2,-1] : @path);
}
close $fh;
say for @dirs;
I use the module's functional interface while for heavier work you want its object oriented one. Reading the whole file into an array has its uses but in general process line by line. The list manipulations can be done more elegantly but I went for simplicity.
I'd like to add a few general comments
Always start your programs with use strict and use warnings
Use lexical filehandles, my $fh instead of FH
Being aware of (at least) a dozen-or-two of most used modules is really helpful. For example, in the above code we never had to even mention the separator \.
I can't write a full answer because I'm using my phone. In any case zdim has mostly answered you. But my solution would look like this
use strict;
use warnings 'all';
use feature 'say';
use File::Spec::Functions qw/ splitdir catdir /;
my $file = 'files.txt';
open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};
my @results;
while ( <$fh> ) {
next unless /\S/;
chomp;
my @path = splitdir($_);
shift @path while @path > 2;
push @results, catdir @path;
}
print "$_\n" for @results;
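One caveat about both answers: File::Spec::Functions follows the path rules of the platform the script runs on, so the backslash-separated entries in files.txt are split as intended on Windows but not on Unix-like systems, where only / is a separator. The sketch below is an illustration rather than a replacement for the code above. It calls the core File::Spec::Win32 class directly so that the result no longer depends on the platform; the helper name last_two_dirs is hypothetical.

```perl
use strict;
use warnings;
use File::Spec::Win32;

# Return the last one or two components of a backslash-separated path.
# (last_two_dirs is a hypothetical helper name.)
sub last_two_dirs {
    my ($path) = @_;
    my @parts = File::Spec::Win32->splitdir($path);
    @parts = @parts[-2, -1] if @parts > 2;   # keep only the tail
    return File::Spec::Win32->catdir(@parts);
}

# A few entries taken from files.txt in the question.
print last_two_dirs($_), "\n"
    for 'public_html',
        'General Office\Travel',
        'public_html\Accordion\dependencies\.svn\tmp\props';
```

The rest of either answer (reading the file line by line, skipping blank lines) can stay as it is; only the calls to splitdir and catdir change.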

How do you find the paths of linked images in Adobe Illustrator 9? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I have an Illustrator file with linked images. I'd actually like to instead embed the images. I first have to know which files they are. How do I find out? I'm using Illustrator 9.
The first Illustrator version I ever used was 10, but is there a links palette in version 9? Try the Window menu and look for "Links". It has a few options there to search for the image you want, relink, open the original, etc.
Great to know the poster is set, but this doesn't answer the question. In CS3, if you double-click on the image in the links palette, it will bring up Link Info for the linked element, which shows the path to the file (provided it isn't longer than the window).
Maybe the older versions allow you to do this too.
Unfortunately, if you're dealing with a missing link element (which you neglected to fix upon opening the file), this field is blank. Illustrator sucks in comparison to InDesign for file packaging and linking. It would be nice if it could package files like InDesign, and store relative references to the external resources.
With newer versions of Illustrator you might be able to replace links to broken images using this script:
http://techblog.willshouse.com/2011/01/16/update-illustrator-linked-files-script/
I just had this problem in Illustrator CS4; a lot of my stuff was archived recently.
Click on the "missing" image in the artboard.
In the top left you will see the file name displayed.
Click "edit original" in the subsequent drop down menu. Illustrator will attempt to find the file, and flash a warning window "windows cannot find file" etc giving you the full file location.
This is useful as edit original is greyed out in the links window. And very useful for people like me who have a vast library of files.
I use the following perl script to keep track of linked images in Illustrator files. This is especially helpful for broken links, because it will still tell you the full path to the linked image by peeking inside the Illustrator file. It obviously does more than anyone here needs, but perhaps it will be useful. The help should explain how to use it. On my machine I have called it ailinkedfiles.pl and I have put it in ~/bin which is in my PATH.
#!/usr/bin/perl
# program to find the linked files inside an Adobe Illustrator file
require 5.004;
use File::Basename; # to extract a filename from a full path
use File::Find; # to build a list of files
use File::Spec; # Platform independent way to build paths
use vars qw/ %opt /; # for command line options - see init()
use strict;
init(); # process command line options
# Autoflush stdout
$|=1;
if ($opt{p}){
die "Did you really mean to call your script ".$opt{p}."!\n" if($opt{p} =~ /\.ai$/i);
print "Generating script file $opt{p}\n" if $opt{v};
open SCRIPT, "> $opt{p}";
}
die "No input specified; use ".basename($0)." -h for help\n" if(@ARGV==0);
my $arg; foreach $arg (@ARGV){
if(-d $arg){
# nb it is necessary to convert the directory specification
# to an absolute path to ensure that the open in &findLinkedFiles
# works properly during multi directory traversal
my $InDir=File::Spec->rel2abs($arg);
find(\&handleFind,$InDir);
} elsif (-f $arg) {
my $InDir=File::Spec->rel2abs(dirname($arg)); # use $arg, not $ARGV[0], so every file argument is processed
&findLinkedFiles(File::Spec->rel2abs($arg),$InDir) ;
# &findLinkedFiles(File::Spec->rel2abs($arg)) ;
}
}
sub init()
# copied from: http://www.cs.mcgill.ca/~abatko/computers/programming/perl/howto/getopts
{
use Getopt::Std; # to handle command line options
my $opt_string = 'hvlzdsftnp:ux:';
getopts( "$opt_string", \%opt ) or usage();
usage() if $opt{h};
}
# Print out usage information
sub usage()
{
print STDERR << "EOF";
Usage: $0 [OPTIONS] <AIFILE/DIR>
Parse an Adobe Illustrator file or (recursively) parse a directory of ai files
and print a list of the linked files to STDOUT. These could be piped to xargs eg:
$0 aifile.ai | xargs -I {} ln -vs
-h print this help
-v verbose output
-s print file names with short path
-d print current directory on each line
-n no-print (suppresses printing of linked file names)
-x <regex> exclude files whose full path matches regex
-l symlink in current directory if file linked from Illustrator file exists somewhere else
-f force symlink to overwrite existing target file
-t test run
-p <file> write commands to a script file
-u status of src and target
- doesn't exist
F plain file
L symbolic link
E exists (unknown file type)
Note that src is the link contained in the Illustrator file and
target is a file of the same name in the same directory as the Illustrator file
If the status is -- you will have problems in Illustrator
If the status is -F Illustrator will substitute the local file for the unavailable linked file
If the status is F- you can run this script with the -l option to make a symlink
If the status is FF then Illustrator will be happy
EOF
exit();
}
sub mysymlink{
my ($src,$targetdir)=@_;
my $target=File::Spec->catdir($targetdir,basename($src));
if(File::Spec->rel2abs($src) eq File::Spec->rel2abs($target)){
print "src and target identical for src=$src\n" if $opt{v};
return;
}
if(-e $src){
my $opts=$opt{f}?"-fsv":"-sv";
my $cmd="ln $opts \"$src\" \"$target\"";
myexec("$cmd");
} else {
print "No link made: $src doesn't exist\n" if $opt{v};
}
}
sub myexec {
my ($cmd) = @_;
if ($opt{t}){
print STDERR "test: $cmd\n";
} elsif ($opt{p}){
print SCRIPT $cmd,"\n";
} else {
# should get to see output with system
print STDERR "run: $cmd\n" if $opt{v};
return system $cmd;
}
}
sub mystatus{
my ($src,$targetdir)=@_;
my $target=File::Spec->catdir($targetdir,basename($src));
my ($ss,$ts)=("-","-");
$ss = "E" if(-e $src);
$ss = "F" if(-f $src);
$ss = "L" if(-l $src);
$ts = "E" if(-e $target);
$ts = "F" if(-f $target);
$ts = "L" if(-l $target);
return ($ss.$ts);
}
# This extracts the file info from the header
sub handleFind{
# get the file name
my $FullFoundFile = $File::Find::name;
#print $FullFoundFile,"\n";
return if ($opt{x} and $FullFoundFile =~ /$opt{x}/i);
# parse if it ends in ai
findLinkedFiles($FullFoundFile, $File::Find::dir) if ($FullFoundFile =~ /\.ai$/i);
}
# This does the actual parsing of the Illustrator Files
sub findLinkedFiles{
my ($InFile,$InDir)=@_;
# protect with escaped quotes for shell if non-empty
my $ProtectedInDir=$InDir?"\"$InDir\"":$InDir;
die "Can't open $InFile \: $!\n" unless open(AIFILE, "<$InFile");
binmode(AIFILE);
# %%DocumentFiles is the starting point
$/="%%";
my @lines = <AIFILE>;
if(@lines==0){
print STDERR "can't read header of $InFile\n" if $opt{v} ; # the header length
return;
}
print "################\n";
if ($opt{s}){
print "# FILE = ",basename($InFile),"\n";
} else {
print "# FILE = ",$InFile,"\n";
}
for my $i ( 0 .. $#lines ){
# if ( $lines[$i]=~/^DocumentFiles\:(.*?)\W+%%/){
# not sure why we need two % signs here
if ( $lines[$i]=~/^DocumentFiles\:(.*?)\W+%/){
print mystatus($1,$InDir)," " if $opt{u} and not $opt{n};
print "\"$1\" ",$opt{d}?$ProtectedInDir:"","\n" unless $opt{n};
$i++;
mysymlink($1,$InDir) if $opt{l};
while($lines[$i]=~/^[+](.*?)\W\%.*$/){
# print "\"$1\" $InDir\n"; $i++;
print mystatus($1,$InDir)," " if $opt{u} and not $opt{n};
print "\"$1\" ",$opt{d}?$ProtectedInDir:"","\n"unless $opt{n};
$i++;
mysymlink($1,$InDir) if $opt{l};
}
}
}
}

Resources