Delete a specific line from a 12GB file - bash

I'm trying to delete a specific line from a 12GB text file.
I do not have the sed -i option available on HP-UX, and other options like saving to a temporary file aren't working because I have only 20GB space available with 12 GB already used by the text file.
Considering the space requirement I'm trying to do this using Perl.
This solution works to delete last 9 lines from a file of 12 GB.
#!/usr/bin/env perl
use strict;
use warnings;
use Tie::File;
tie my #lines, 'Tie::File', 'test.txt' or die "$!\n";
$#lines -= 9;
untie #lines;
I want to modify the above code to delete any specific line number.

Tie::File is never the answer.
It's insanely slow.
It can use up more memory than just slurping the entire file into memory, even if you limit the size of its buffer.
You are encountering both of those problems. You encounter every line of the file, so Tie::File will read the entire file and store the index of every line in memory. This takes 28 bytes per line on a 64-bit build of Perl (not counting any overhead in the memory allocator).
To delete the last 9 lines of the file, you can use the following:
use File::ReadBackwards qw( );
my $qfn = '...';
my $pos;
{
my $bw = File::ReadBackwards->new($qfn)
or die("Can't open \"$qfn\": $!\n");
for (1..9) {
defined( my $line = $bw->readline() )
or last;
}
$pos = $bw->tell();
}
# Can't use $bw->get_handle because it's a read-only handle.
truncate($qfn, $pos)
or die("Can't truncate \"$qfn\": $!\n");
To delete an arbitrary line, you can use the following:
my $qfn = '...';
open(my $fh_src, '<:raw', $qfn)
or die("Can't open \"$qfn\": $!\n");
open(my $fh_dst, '+<:raw', $qfn)
or die("Can't open \"$qfn\": $!\n");
while (<$fh_src>) {
next if $. == 9; # Or "if /keyword/", or whatever condition you want.
print($fh_dst $_)
or die($!);
}
truncate($fh_dst, tell($fh_dst))
or die($!);
The following optimized version assumes there's only one line (or block of lines) to remove:
use Fcntl qw( SEEK_CUR SEEK_SET );
use constant BLOCK_SIZE => 4*1024*1024;
my $qfn = 'file';
open(my $fh_src, '<:raw', $qfn)
or die("Can't open \"$qfn\": $!\n");
open(my $fh_dst, '+<:raw', $qfn)
or die("Can't open \"$qfn\": $!\n");
my $dst_pos;
while (1) {
$dst_pos = tell($fh_src);
defined( my $line = <$fh_src> )
or do {
$dst_pos = undef;
last;
};
last if $. == 9; # Or "if /keyword/", or whatever condition you want.
}
if (defined($dst_pos)) {
# We're switching from buffered I/O to unbuffered I/O,
# so we need to move the system file pointer from where the
# buffered read left off to where we actually finished reading.
sysseek($fh_src, tell($fh_src), SEEK_SET)
or die($!);
sysseek($fh_dst, $dst_pos, SEEK_SET)
or die($!);
while (1) {
my $rv = sysread($fh_src, my $buf, BLOCK_SIZE);
die($!) if !defined($rv);
last if !$rv;
my $written = 0;
while ($written < length($buf)) {
my $rv = syswrite($fh_dst, $buf, length($buf)-$written, $written);
die($!) if !defined($rv);
$written += $rv;
}
}
# Must use sysseek instead of tell with sysread/syswrite.
truncate($fh_dst, sysseek($fh_dst, 0, SEEK_CUR))
or die($!);
}

Related

Executing Perl script from windows-command line with 2 entry

this is my Perl script
use strict;
use warnings;
use XML::Twig;
use Data::Dumper;
sub xml2array{
my $path = shift;
my $twig = XML::Twig->new->parsefile($path);
return map { $_ -> att('VirtualPath') } $twig -> get_xpath('//Signals');
}
sub compareMappingToArray {
my $mapping = shift;
my $signalsRef = shift;
my $i = 1;
print "In file : $mapping\n";
open(my $fh, $mapping);
while (my $r = <$fh>) {
chomp $r;
if ($r =~ /\'(ModelSpecific.*)\'/) {
my $s = $1;
my #matches = grep { /^$s$/ } #{$signalsRef};
print "line $i : not found - $s\n" if scalar #matches ==0;
print "line $i : multiple $s\n" if scalar #matches > 1;
}
$i = $i + 1 # keep line index
}
}
my $mapping = "C:/Users/HOR1DY/Desktop/Global/TA_Mapping/CAN/CAN_ESP_002_mapping.pm";
my #virtualpath = xml2array("SignalModel.xml");
compareMappingToArray($mapping, \#virtualpath);
The script works well, the aim of it is to compare the file "SignalModel.xml" and "CAN_ESP_002_mapping.pm" and putting the lines that didn't matches in a .TXT file. Here is how the .TXT file looks like:
In file : C:/Users/HOR1DY/Desktop/Global/TA_Mapping/CAN/CAN_ESP_002_mapping.pm
line 331 : not found - ModelSpecific.EID.NET.CAN_Engine.VCU.Transmit.VCU_202.R2B_VCU_202__byte_3
line 348 : not found - ModelSpecific.EID.NET.CAN_Engine.CMM_WX.Transmit.CMM_HYB_208.R2B_CMM_HYB_208__byte_2
line 368 : not found - ModelSpecific.EID.NET.CAN_Engine.VCU.Transmit.VCU_222.R2B_VCU_222__byte_0
But for this script, I put the two files that need to be compare inside of the code and instead of doing that, I would like to run the script in windows cmd line and having something like:
C:\Users>perl CANMappingChecker.pl -'file 1' 'file 2'
All the files are in .zip file so if I can execute the script that he goes inside and take the 2 files that I need for comparison, it should be perfect.
I really don't know how to do and what to put inside my script to make that in the cmd windows. Thanks for your help !
Program (or script) parameters are stored in the #ARGV array. shift and pop without any parameter will work on #ARGV when used outside of a sub, in a sub they operate on #_.
See Archive::Zip for zip file handling.

How can I reduce this to a single file open?

Using Strawberry Perl 5.22.0 in Windows 7. Is there a more "perlish" way to write this snippet of code? I hate the duplication of file open sections, but cannot think of a way to make it only open once because of the requirement to test the creation time.
...
my $x;
my $fh;
my $sentinelfile = "Logging.yes"; #if this file exists then enable logging
my $logfile = "transfers.log";
my $log = 0; #default to NO logging
$log = 1 if -e $sentinelfile; #enable logging if sentinel file exists
if($log){
#logfile remains open after this so remember to close at end of program!
if (-e $logfile) { #file exists
open($fh, "<", $logfile); #open for read will NOT create if not exist
chomp ($x = <$fh>); #grab first row
close $fh;
if (((scalar time - $x)/3600/24) > 30) { #when ~30 days since created
rename($logfile, $logfile . time); #rename existing logfile
open($fh, ">", $logfile); #open for write and truncate
print $fh time,"\n"; #save create date
print $fh "--------------------------------------------------\n";
} else { #file is not older than 30 days
open($fh, ">>", $logfile); #open for append
}
} else { #file not exist
open($fh, ">", $logfile); #open new for write
print $fh time,"\n"; #save create date
print $fh "--------------------------------------------------\n";
}
} #if $log
...
To recap: logfile logs stuff. First row of file contains the logfile creation date. Second row contains horizontal rule. Rest of file contains text. Around 30 days after file was created, rename file and start a new one. After the above chunk of code the logfile is open and ready for logging stuff. It gets closed at the end of the rest of the program.
There are other, non-cosmetic problems with your code: a) You do not ever check if your calls to open succeeded; b) You are creating a race condition. The file can come into existence after the -e check has failed. The subsequent open $fh, '>' ... would then clobber it; c) You don't check if your rename call succeeded etc.
The following would be a partial improvement on your existing code:
if ($log) {
if (open $fh, '<', $logfile) { #file exists
chomp ($x = <$fh>);
close $fh
or die "Failed to close '$logfile': $!";
if (((time - $x)/3600/24) > 30) {
my $rotated_logfile = join '.', $logfile, time;
rename $logfile => $rotated_logfile
or die "Failed to rename '$logfile' to '$rotated_logfile': $!";
open $fh, '>', $logfile
or die "Failed to create '$logfile'";
print $fh time, "\n", '-' x 50, "\n";
}
else {
open $fh, '>>', $logfile
or die "Cannot open '$logfile' for appending: $!";
}
}
else {
open $fh, '>', $logfile
or die "Cannot to create '$logfile': $!";
print $fh time, "\n", '-' x 50, "\n";
}
}
It would be better to abstract every bit of discrete functionality to suitably named functions.
For example, here is a completely untested re-write:
use autouse Carp => qw( croak );
use constant SENTINEL_FILE => 'Logging.yes';
use constant ENABLE_LOG => -e SENTINEL_FILE;
use constant HEADER_SEPARATOR => '-' x 50;
use constant SECONDS_PER_DAY => 24 * 60 * 60;
use constant ROTATE_AFTER => 30 * SECONDS_PER_DAY;
my $fh;
if (ENABLE_LOG) {
if (my $age = read_age( $logfile )) {
if ( is_time_to_rotate( $age ) ) {
rotate_log( $logfile );
}
else {
$fh = open_log( $logfile );
}
}
unless ($fh) {
$fh = create_log( $logfile );
}
}
sub is_time_to_rotate {
my $age = shift;
return $age > ROTATE_AFTER;
}
sub rotate_log {
my $file = shift;
my $saved_file = join '.', $file, time;
rename $file => $saved_file
or croak "Failed to rename '$file' to '$saved_file': $!"
return;
}
sub create_log {
my $file = shift;
open my $fh, '>', $file
or croak "Failed to create '$file': $!";
print $fh time, "\n", HEADER_SEPARATOR, "\n"
or croak "Failed to write header to '$file': $!";
return $fh;
}
sub open_log {
my $file = shift;
open my $fh, '>>', $file
or croak "Failed to open '$file': $!";
return $fh;
}
sub read_age {
my $file = shift;
open my $fh, '<', $file
or return;
defined (my $creation_time = <$fh>)
or croak "Failed to read creation time from '$file': $!";
return time - $creation_time;
}
If you need to read a line of a file, rename it and then work with it, you have to open it twice.
However, you can also do away with using that first line.
On Windows, according to perlport (Files and Filesystems), the inode change time time-stamp (ctime) "may really" mark the file creation time. This is likely to be completely suitable for a log file that doesn't get manipulated and moved around. It can be obtained with the -C file-test operator
my $days_float = -C $filename;
Now you can numerically test this against 30. Then there is no need to print the file's creation time to its first line (but you may as well if it is useful for viewing or other tools).
Also, there is the module Win32API::File::Time, with the purpose to
provide maximal access to the file creation, modification, and access times under MSWin32
Plese do read the docs for some caveats. I haven't used it but it seems tailored for your need.
A good point is raised in a comment: apparently the OS retains the original time-stamp as the file is being renamed. In that case, when the file's too old copy it into a new one (with the new name) and delete it, instead of using rename. Then open that log file anew, so with a new time-stamp.
Here is a complete example
archive_log($logfile) if -f $logfile and -C $logfile > 30;
open my $fh_log, '>>', $logfile or die "Can't open $logfile: $!";
say $fh_log "Log a line";
sub archive_log {
my ($file) = #_;
require POSIX; POSIX->import('strftime');
my $ts = strftime("%Y%m%d_%H:%M:%S", localtime); # 20170629_12:44:10
require File::Copy; File::Copy->import('copy');
my $archive = $file . "_$ts";
copy ($file, $archive) or die "Can't copy $file to $archive: $!";
unlink $file or die "Can't unlink $file: $!";
}
The archive_log archives the current log by copying it and then removes it.
So after that we can just open for append, which creates the file if not there.
The -C tests for file existence but since its output is used in a numerical test we need -f first.
Since this happens once a month I load modules at runtime, with require and import, once the log actually need be rotated. If you already use File::Copy then there is no need for this. As for the time-stamp, I threw in something to make this a working example.
I tested this on UNIX, by changing -C to -M and tweaking the timestamp by touch -t -c.
Better yet, to reduce the caller's code fully move the tests into the sub as well, for
my $fh_log = open_log($logfile);
say $fh_log "Log a line";
sub open_log {
my ($file) = #_;
if (-f $file and -C $file > 30) {
# code from archive_log() above, to copy and unlink $file
}
open my $fh_log, '>>', $file or die "Can't open $file: $!";
return $fh_log;
}
Note. On UNIX the file's creation time is not kept anywhere. The closest notion is the ctime above, but this is of course different. For one thing, it changes with many operations, for instance mv, ln, chmod, chown, chgrp (and probably others).

Keep shared entries among many files

I have hundreds of files, each with different number of entries (>xxxx) and want to keep only shared entries among all files, separately. I'm not sure what is the best method to do this, maybe perl! I used sort, uniq of bash, but I didn't get the correct answer. The format of IDs start with > and follows 4 characters among all files.
1.fa
>abcd
CTGAATGCC
2.fa
>abcd
AAATGCGCG
>efgh
CGTAC
3.fa
>abcd
ATGCAATA
>efgh
TAACGTAA
>ijkl
TGCAA
Final results, of this example would be:
1.fa
>abcd
CTGAATGCC
2.fa
>abcd
AAATGCGCG
3.fa
>abcd
ATGCAATA
This Perl program will do as you ask. It uses Perl's built-in edit in place functionality and renames the original files to 1.fa.bak etc. It shouldn't have a problem with blank lines in your data as long as the sequence is always on one line immediately following the ID
use strict;
use warnings 'all';
my #files = glob '*.fa';
printf "Processing %d file%s\n", scalar #files, #files == 1 ? "" : "s";
exit if #files < 2;
my %ids;
{
local #ARGV = #files;
while ( <> ) {
++$ids{$1} if /^>(\S+)/;
}
}
# remove keys that aren't in all files
delete #ids{ grep { $ids{$_} < #files } keys %ids };
my $n = keys %ids;
printf "%d ID%s common to all files\n", $n, $n == 1 ? '' : "s";
exit unless $n;
{
local #ARGV = #files;
local $^I = '.bak';
while ( <> ) {
next unless /^>(\S+)/ and $ids{$1};
print;
print scalar <>;
}
}
Here is Perl solution, that may help you:
use feature qw(say);
use strict;
use warnings;
my $file_dir = 'files';
chdir $file_dir;
my #files = <*.fa>;
my $num_files = scalar #files;
my %ids;
for my $file (#files) {
open ( my $fh, '<', $file) or die "Could not open file '$file': $!";
while (my $id = <$fh>) {
chomp $id;
chomp (my $sequence = <$fh>);
$ids{$id}++;
}
close $fh;
}
for my $file (#files) {
open ( my $fh, '<', $file) or die "Could not open file '$file': $!";
my $new_name = $file . '.new';
open ( my $fh_write, '>', $new_name ) or die "Could not open file '$new_name': $!";
while (my $id = <$fh>) {
chomp $id;
chomp (my $sequence = <$fh>);
if ( $ids{$id} == $num_files ) {
say $fh_write $id;
say $fh_write $sequence;
}
}
close $fh_write;
close $fh;
}
It assumes that all the .fa files are located in the directory named $file_dir, and it writes the new sequences to new files in the same directory. The new file names get the .new extension.

Search for specific lines from a file

I have an array that contains the data from a text file.
I want to filter the array and copy some information to another array. grep seems to not work.
Here's what I have
$file = 'files.txt';
open (FH, "< $file") or die "Can't open $file for read: $!";
#lines = <FH>;
close FH or die "Cannot close $file: $!";
chomp(#lines);
foreach $y (#lines){
if ( $y =~ /(?:[^\\]*\\|^)[^\\]*$/g ) {
print $1, pos $y, "\n";
}
}
files.txt
public_html
Trainings and Events
General Office\Resources
General Office\Travel
General Office\Office Opperations\Contacts
General Office\Office Opperations\Coordinator Operations
public_html\Accordion\dependencies\.svn\tmp\prop-base
public_html\Accordion\dependencies\.svn\tmp\props
public_html\Accordion\dependencies\.svn\tmp\text-base
The regular expression should take the last one or two folders and put them into their own array for printing.
A regex can get very picky for this. It is far easier to split the path into components and then count off as many as you need. And there is a tool for this exact purpose, the core module File::Spec, as mentioned by xxfelixxx in a comment.
You can use its splitdir to break up the path, and catdir to compose one.
use warnings 'all';
use strict;
use feature 'say';
use File::Spec::Functions qw(splitdir catdir);
my $file = 'files.txt';
open my $fh, '<', $file or die "Can't open $file: $!";
my #dirs;
while (<$fh>) {
next if /^\s*$/; # skip empty lines
chomp;
my #path = splitdir $_;
push #dirs, (#path >= 2 ? catdir #path[-2,-1] : #path);
}
close $fh;
say for #dirs;
I use the module's functional interface while for heavier work you want its object oriented one. Reading the whole file into an array has its uses but in general process line by line. The list manipulations can be done more elegantly but I went for simplicity.
I'd like to add a few general comments
Always start your programs with use strict and use warnings
Use lexical filehandles, my $fh instead of FH
Being aware of (at least) a dozen-or-two of most used modules is really helpful. For example, in the above code we never had to even mention the separator \.
I can't write a full answer because I'm using my phone. In any case zdim has mostly answered you. But my solution would look like this
use strict;
use warnings 'all';
use feature 'say';
use File::Spec::Functions qw/ splitdir catdir /;
my $file = 'files.txt';
open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};
my #results;
while ( <$fh> ) {
next unless /\S/;
chomp;
my #path = splitdir($_);
shift #path while #path > 2;
push #results, catdir #path;
}
print "$_\n" for #results;

Why can't I use more than 20 files with my Perl script and Windows's SendTo?

I'm trying to emulate RapidCRC's ability to check crc32 values within filenames on Windows Vista Ultimate 64-bit. However, I seem to be running into some kind of argument limitation.
I wrote a quick Perl script, created a batch file to call it, then placed a shortcut to the batch file in %APPDATA%\Microsoft\Windows\SendTo
This works great when I select about 20 files or less, right-click and "send to" my batch file script. However, nothing happens at all when I select more than that. I suspect there's a character or number of arguments limit somewhere.
Hopefully I'm missing something simple and that the solution or a workaround isn't too painful.
References:
batch file (crc32_inline.bat):
crc32_inline.pl %*
Perl notes:
I'm using (strawberry) perl v5.10.0
I have C:\strawberry\perl\bin in my path, which is where crc32.bat exists.
perl script (crc32_inline.pl):
#!/usr/bin/env perl
use strict;
use warnings;
use Cwd;
use English qw( -no_match_vars );
use File::Basename;
$OUTPUT_AUTOFLUSH = 1;
my $crc32_cmd = 'crc32.bat';
my $failure_report_basename = 'crc32_failures.txt';
my %failures = ();
print "\n";
foreach my $arg (#ARGV) {
# if the file has a crc, check to see if it matches the calculated
# crc.
if (-f $arg and $arg =~ /\[([0-9a-f]{8})\]/i) {
my $crc = uc $1;
my $basename = basename($arg);
print "checking ${basename}... ";
my $calculated_crc = uc `${crc32_cmd} "${arg}"`;
chomp($calculated_crc);
if ($crc eq $calculated_crc) {
print "passed.\n";
}
else {
print "FAILED (calculated ${calculated_crc})\n";
my $dirname = dirname($arg);
$failures{$dirname}{$basename} = $calculated_crc;
}
}
}
print "\nReport Summary:\n";
if (scalar keys %failures == 0) {
print " All files OK\n";
}
else {
print sprintf(" %d / %d files failed crc32 validation.\n" .
" See %s for details.\n",
scalar keys %failures,
scalar #ARGV,
$failure_report_basename);
my $failure_report_fullname = $failure_report_basename;
if (defined -f $ARGV[0]) {
$failure_report_fullname
= dirname($ARGV[0]) . '/' . $failure_report_basename;
}
$OUTPUT_AUTOFLUSH = 0;
open my $fh, '>' . $failure_report_fullname or die $!;
foreach my $dirname (sort keys %failures) {
print {$fh} $dirname . "\n";
foreach my $basename (sort keys %{$failures{$dirname}}) {
print {$fh} sprintf(" crc32(%s) basename(%s)\n",
$failures{$dirname}{$basename},
$basename);
}
}
close $fh;
$OUTPUT_AUTOFLUSH = 1;
}
print sprintf("\n%s done! (%d seconds elapsed)\n" .
"Press enter to exit.\n",
basename($0),
time() - $BASETIME);
<STDIN>;
I will recommend just putting a shortcut to your script in the "Send To" directory instead of doing it via a batch file (which is subject to cmd.exes limits on command line length).

Resources