How to compare the content of multiple txt file in bash shell and delete the one (file) which is duplicate - bash

I am trying to achieve this is Mac OS, tried to achieve similar by using fdupes but didn't work. Here is what I am trying to achieve:
There are 100 files in directory 'alpha'
Pick one file A and compare it with each remaining file in the directory 'alpha'
If content of file A matches any file (duplicate), delete the duplicate file
Move to file B, and compare with the remaining file, and do the same (check for duplicate)
Repeat the same until all files are checked for duplicates. Remaining files should be unique
Update
I modified a bit something similar I found here, but I have to run it multiple times to take out the duplicates. It is not detecting duplicates in a single run (have to run it multiple times to detect duplicate). Not sure if it is working correctly
use Digest::MD5;
%check = ();
while (<*>) {
-d and next;
$fname = "$_";
print "checking .. $fname\n";
$md5 = getmd5($fname) . "\n";
if ( !defined( $check{$md5} ) ) {
$check{$md5} = "$fname";
}
else {
print "Found duplicate files: $fname and $check{$md5}\n";
print "Deleting duplicate $check{$md5}\n";
unlink $check{$md5};
}
}
sub getmd5 {
my $file = "$_";
open( FH, "<", $file ) or die "Cannot open file: $!\n";
binmode(FH);
my $md5 = Digest::MD5->new;
$md5->addfile(FH);
close(FH);
return $md5->hexdigest;
}

You should limit the number of times that you have to read each file's contents:
Inventory the files using Path::Class or some similar method.
a. Build a hash relating file sizes and MD5::Digest to a list of file names.
Compare likely duplicates only. Matching file size and digest.
The following is untested:
use strict;
use warnings;
use Path::Class;
use Digest::MD5;
my $dir = dir('.');
my %files_per_digest;
# Inventory Directory
while ( my $file = $dir->next ) {
my $size = $file->stat->size;
my $digest = do {
my $md5 = Digest::MD5->new;
$md5->addfile( $file->openr );
$md5->hexdigest;
};
push #{ $files_per_digest{"$size - $digest"} }, $file;
}
# Compare likely duplicates only
for my $files ( grep { #$_ > 1 } values %files_per_digest ) {
# Sort by alpha
#$files = sort #$files;
print "Comparing: #files\n";
for my $i ( reverse 0 .. $#files ) {
for my $j ( 0 .. $i - 1 ) {
my $fh1 = $files->[$i]->openr;
my $fh2 = $files->[$j]->openr;
my $diff = 0;
while ( !eof($fh1) && !eof($fh2) ) {
$diff = 1, last if scalar(<$fh1>) ne scalar(<$fh2>);
}
if ( $diff or !eof($fh1) or !eof($fh2) ) {
print " $files->[$i] ($i) is duplicate of $files->[$j] ($j)\n";
$files->[$i]->remove();
splice #$files, $i, 1;
}
}
}
}

I've used rdfind in the past with very good success. It's very accurate, fast, and seems to run leaner than fdupes. According to RDFind's web site (http://rdfind.pauldreik.se/), it can be installed using MacPorts.

Related

Executing Perl script from windows-command line with 2 entry

this is my Perl script
use strict;
use warnings;
use XML::Twig;
use Data::Dumper;
sub xml2array{
my $path = shift;
my $twig = XML::Twig->new->parsefile($path);
return map { $_ -> att('VirtualPath') } $twig -> get_xpath('//Signals');
}
sub compareMappingToArray {
my $mapping = shift;
my $signalsRef = shift;
my $i = 1;
print "In file : $mapping\n";
open(my $fh, $mapping);
while (my $r = <$fh>) {
chomp $r;
if ($r =~ /\'(ModelSpecific.*)\'/) {
my $s = $1;
my #matches = grep { /^$s$/ } #{$signalsRef};
print "line $i : not found - $s\n" if scalar #matches ==0;
print "line $i : multiple $s\n" if scalar #matches > 1;
}
$i = $i + 1 # keep line index
}
}
my $mapping = "C:/Users/HOR1DY/Desktop/Global/TA_Mapping/CAN/CAN_ESP_002_mapping.pm";
my #virtualpath = xml2array("SignalModel.xml");
compareMappingToArray($mapping, \#virtualpath);
The script works well, the aim of it is to compare the file "SignalModel.xml" and "CAN_ESP_002_mapping.pm" and putting the lines that didn't matches in a .TXT file. Here is how the .TXT file looks like:
In file : C:/Users/HOR1DY/Desktop/Global/TA_Mapping/CAN/CAN_ESP_002_mapping.pm
line 331 : not found - ModelSpecific.EID.NET.CAN_Engine.VCU.Transmit.VCU_202.R2B_VCU_202__byte_3
line 348 : not found - ModelSpecific.EID.NET.CAN_Engine.CMM_WX.Transmit.CMM_HYB_208.R2B_CMM_HYB_208__byte_2
line 368 : not found - ModelSpecific.EID.NET.CAN_Engine.VCU.Transmit.VCU_222.R2B_VCU_222__byte_0
But for this script, I put the two files that need to be compare inside of the code and instead of doing that, I would like to run the script in windows cmd line and having something like:
C:\Users>perl CANMappingChecker.pl -'file 1' 'file 2'
All the files are in .zip file so if I can execute the script that he goes inside and take the 2 files that I need for comparison, it should be perfect.
I really don't know how to do and what to put inside my script to make that in the cmd windows. Thanks for your help !
Program (or script) parameters are stored in the #ARGV array. shift and pop without any parameter will work on #ARGV when used outside of a sub, in a sub they operate on #_.
See Archive::Zip for zip file handling.

Kaldi librispeech data preparation error

I'm trying to do ASR system. Im using kaldi manual and librispeech corpus.
In data preparation step i get this error
utils/data/get_utt2dur.sh: segments file does not exist so getting durations
from wave files
utils/data/get_utt2dur.sh: could not get utterance lengths from sphere-file
headers, using wav-to-duration
utils/data/get_utt2dur.sh: line 99: wav-to-duration: command not found
And here the piece of code where this error occures
if cat $data/wav.scp | perl -e '
while (<>) { s/\|\s*$/ |/; # make sure final | is preceded by space.
#A = split;
if (!($#A == 5 && $A[1] =~ m/sph2pipe$/ &&
$A[2] eq "-f" && $A[3] eq "wav" && $A[5] eq "|")) { exit (1); }
$utt = $A[0]; $sphere_file = $A[4];
if (!open(F, "<$sphere_file")) { die "Error opening sphere file $sphere_file"; }
$sample_rate = -1; $sample_count = -1;
for ($n = 0; $n <= 30; $n++) {
$line = <F>;
if ($line =~ m/sample_rate -i (\d+)/) { $sample_rate = $1; }
if ($line =~ m/sample_count -i (\d+)/) { $sample_count = $1;
}
if ($line =~ m/end_head/) { break; }
}
close(F);
if ($sample_rate == -1 || $sample_count == -1) {
die "could not parse sphere header from $sphere_file";
}
$duration = $sample_count * 1.0 / $sample_rate;
print "$utt $duration\n";
} ' > $data/utt2dur; then
echo "$0: successfully obtained utterance lengths from sphere-file headers"
else
echo "$0: could not get utterance lengths from sphere-file headers,
using wav-to-duration"
if command -v wav-to-duration >/dev/null; then
echo "$0: wav-to-duration is not on your path"
exit 1;
fi
In file wav.scp i got such lines:
6295-64301-0002 flac -c -d -s /home/tinin/kaldi/egs/librispeech/s5/LibriSpeech/dev-clean/6295/64301/6295-64301-0002.flac |
In this dataset i have only flac files(they downloaded via provided script) and i dont understand why we search wav-files? And how run data preparation correctly(i didnt change source code in this manual.
Also, if you explain to me what is happening in this code, then I will be very grateful to you, because i'm not familiar with bash and perl.
Thank you a lot!
The problem I see from this line
utils/data/get_utt2dur.sh: line 99: wav-to-duration: command not found
is that you have not added the kaldi tools in your path.
Check the file path.sh and see if the directories that it adds to your path are correct (because it has ../../.. inside and it might not match your current folder setup)
As for the perl script, it counts the samples of the sound file and then it divides with the sample rate in order to get the duration. Don't worry about the 'wav' word, your files might be on another format, it's just the name of the kaldi functions.

Keep shared entries among many files

I have hundreds of files, each with different number of entries (>xxxx) and want to keep only shared entries among all files, separately. I'm not sure what is the best method to do this, maybe perl! I used sort, uniq of bash, but I didn't get the correct answer. The format of IDs start with > and follows 4 characters among all files.
1.fa
>abcd
CTGAATGCC
2.fa
>abcd
AAATGCGCG
>efgh
CGTAC
3.fa
>abcd
ATGCAATA
>efgh
TAACGTAA
>ijkl
TGCAA
Final results, of this example would be:
1.fa
>abcd
CTGAATGCC
2.fa
>abcd
AAATGCGCG
3.fa
>abcd
ATGCAATA
This Perl program will do as you ask. It uses Perl's built-in edit in place functionality and renames the original files to 1.fa.bak etc. It shouldn't have a problem with blank lines in your data as long as the sequence is always on one line immediately following the ID
use strict;
use warnings 'all';
my #files = glob '*.fa';
printf "Processing %d file%s\n", scalar #files, #files == 1 ? "" : "s";
exit if #files < 2;
my %ids;
{
local #ARGV = #files;
while ( <> ) {
++$ids{$1} if /^>(\S+)/;
}
}
# remove keys that aren't in all files
delete #ids{ grep { $ids{$_} < #files } keys %ids };
my $n = keys %ids;
printf "%d ID%s common to all files\n", $n, $n == 1 ? '' : "s";
exit unless $n;
{
local #ARGV = #files;
local $^I = '.bak';
while ( <> ) {
next unless /^>(\S+)/ and $ids{$1};
print;
print scalar <>;
}
}
Here is Perl solution, that may help you:
use feature qw(say);
use strict;
use warnings;
my $file_dir = 'files';
chdir $file_dir;
my #files = <*.fa>;
my $num_files = scalar #files;
my %ids;
for my $file (#files) {
open ( my $fh, '<', $file) or die "Could not open file '$file': $!";
while (my $id = <$fh>) {
chomp $id;
chomp (my $sequence = <$fh>);
$ids{$id}++;
}
close $fh;
}
for my $file (#files) {
open ( my $fh, '<', $file) or die "Could not open file '$file': $!";
my $new_name = $file . '.new';
open ( my $fh_write, '>', $new_name ) or die "Could not open file '$new_name': $!";
while (my $id = <$fh>) {
chomp $id;
chomp (my $sequence = <$fh>);
if ( $ids{$id} == $num_files ) {
say $fh_write $id;
say $fh_write $sequence;
}
}
close $fh_write;
close $fh;
}
It assumes that all the .fa files are located in the directory named $file_dir, and it writes the new sequences to new files in the same directory. The new file names get the .new extension.

Split very large files on record border

I have to split a very large file into N smaller files with the following constraints:
I have to split on record border
Record separator can be any character
the number of records in the resulting N files should be the same (+/- 1 record)
I can just use bash and standard coreutils (I have a working solution in Perl but we're not allowed to install Perl/Python/etc)
This is not a real constraint but - if possible - I'd like to scan the original (large) file just once.
Sort order of the resulting files is not important.
My working solution in Perl reads the original file and writes...
- the 1st record to the first file
- ...
- the Nth record to the Nth file
- the N+1 record back to the first file
- etc
So - at the end - with a single scan of the initial file I do get several smaller files with the same number of records (+/- 1).
For example, assume this is the input file:
1,1,1,1A2,2,2,2A3,
3,3,3A4,4,4,4A5,5,
5,5A6,6,6,6A7,7,7,
7,A8,8,8,8A9,9,9,9
A0,0,0,0
With record separator = 'A' and N = 3 I should get three files:
# First file:
1,1,1,1A2,2,2,2A3,
3,3,3
# Second file
4,4,4,4A5,5,
5,5A6,6,6,6
# Third file:
7,7,7,
7,A8,8,8,8A9,9,9,9
A0,0,0,0
UPDATE
Here you have the perl code. I tried to make it as simple and readable as I can:
#!/usr/bin/perl
use warnings;
use strict;
use locale;
use Getopt::Std;
#-----------------------------------------------------------------------------
# Declaring variables
#-----------------------------------------------------------------------------
my %op = (); # Command line parameters hash
my $line = 0; # Output file line number
my $fnum = 0; # Output file number
my #fout = (); # Output file names array
my #fhnd = (); # Output file handles array
my #ifiles = (); # Input file names
my $i = 0; # Loop variable
#-----------------------------------------------------------------------------
# Handling command line arguments
#-----------------------------------------------------------------------------
getopts("o:n:hvr:", \%op);
die "Usage: lfsplit [-h] -n number_of_files",
" [-o outfile_prefix] [-r rec_sep_decimal] [-v] input_file(s)\n"
if $op{h} ;
if ( #ARGV ) {
#ifiles = #ARGV ;
} else {
die "No input files...\n" ;
}
$/ = chr($op{r}) if $op{r} ;
#-----------------------------------------------------------------------------
# Setting Default values
#-----------------------------------------------------------------------------
$op{o} |= 'out_' ;
#-----------------------------------------------------------------------------
# Body - split in round-robin to $op{n} files
#-----------------------------------------------------------------------------
for ( $i = 0 ; $i < $op{n} ; $i++ ) {
local *OUT ; # Localize file glob
$fout[$i] = sprintf "%s_%04d.out", $op{o}, $i ;
open ( OUT, "> $fout[$i]" ) or
die "[lfsplit] Error writing to $fout[$i]: $!\n";
push ( #fhnd , *OUT ) ;
}
$i = 0 ;
foreach ( #ifiles ) {
print "Now reading $_ ..." if $op{v} ;
open ( IN, "< $_" ) or
die "[lfsplit] Error reading $op{i}: $!\n" ;
while ( <IN> ) {
print { $fhnd[$i] } $_ ;
$i = 0 if ++$i >= $op{n} ;
}
close IN ;
}
for ( $i = 0 ; $i < $op{n} ; $i++ ) {
close $fhnd[$i] ;
}
#-----------------------------------------------------------------------------
# Exit
#-----------------------------------------------------------------------------
exit 0 ;
Just for kicks, a pure bash solution, no external programs and no forking (I think):
#!/bin/bash
input=$1
separator=$2
outputs=$3
i=0
while read -r -d"$separator" record; do
out=$((i % outputs)).txt
if ((i < outputs)); then
: > $out
else
echo -n "$separator" >> $out
fi
echo -n "$record" >> $out
((i++))
done < $input
Sadly this will reopen every file for every output operation. I'm sure it's possible to fix this, using <> to open a file descriptor and keep it open, but using that with non-literal file descriptors is a bit of a pain.

Why can't I use more than 20 files with my Perl script and Windows's SendTo?

I'm trying to emulate RapidCRC's ability to check crc32 values within filenames on Windows Vista Ultimate 64-bit. However, I seem to be running into some kind of argument limitation.
I wrote a quick Perl script, created a batch file to call it, then placed a shortcut to the batch file in %APPDATA%\Microsoft\Windows\SendTo
This works great when I select about 20 files or less, right-click and "send to" my batch file script. However, nothing happens at all when I select more than that. I suspect there's a character or number of arguments limit somewhere.
Hopefully I'm missing something simple and that the solution or a workaround isn't too painful.
References:
batch file (crc32_inline.bat):
crc32_inline.pl %*
Perl notes:
I'm using (strawberry) perl v5.10.0
I have C:\strawberry\perl\bin in my path, which is where crc32.bat exists.
perl script (crc32_inline.pl):
#!/usr/bin/env perl
use strict;
use warnings;
use Cwd;
use English qw( -no_match_vars );
use File::Basename;
$OUTPUT_AUTOFLUSH = 1;
my $crc32_cmd = 'crc32.bat';
my $failure_report_basename = 'crc32_failures.txt';
my %failures = ();
print "\n";
foreach my $arg (#ARGV) {
# if the file has a crc, check to see if it matches the calculated
# crc.
if (-f $arg and $arg =~ /\[([0-9a-f]{8})\]/i) {
my $crc = uc $1;
my $basename = basename($arg);
print "checking ${basename}... ";
my $calculated_crc = uc `${crc32_cmd} "${arg}"`;
chomp($calculated_crc);
if ($crc eq $calculated_crc) {
print "passed.\n";
}
else {
print "FAILED (calculated ${calculated_crc})\n";
my $dirname = dirname($arg);
$failures{$dirname}{$basename} = $calculated_crc;
}
}
}
print "\nReport Summary:\n";
if (scalar keys %failures == 0) {
print " All files OK\n";
}
else {
print sprintf(" %d / %d files failed crc32 validation.\n" .
" See %s for details.\n",
scalar keys %failures,
scalar #ARGV,
$failure_report_basename);
my $failure_report_fullname = $failure_report_basename;
if (defined -f $ARGV[0]) {
$failure_report_fullname
= dirname($ARGV[0]) . '/' . $failure_report_basename;
}
$OUTPUT_AUTOFLUSH = 0;
open my $fh, '>' . $failure_report_fullname or die $!;
foreach my $dirname (sort keys %failures) {
print {$fh} $dirname . "\n";
foreach my $basename (sort keys %{$failures{$dirname}}) {
print {$fh} sprintf(" crc32(%s) basename(%s)\n",
$failures{$dirname}{$basename},
$basename);
}
}
close $fh;
$OUTPUT_AUTOFLUSH = 1;
}
print sprintf("\n%s done! (%d seconds elapsed)\n" .
"Press enter to exit.\n",
basename($0),
time() - $BASETIME);
<STDIN>;
I will recommend just putting a shortcut to your script in the "Send To" directory instead of doing it via a batch file (which is subject to cmd.exes limits on command line length).

Resources