Split very large files on record border - bash

I have to split a very large file into N smaller files with the following constraints:
I have to split on record border
Record separator can be any character
the number of records in the resulting N files should be the same (+/- 1 record)
I can just use bash and standard coreutils (I have a working solution in Perl but we're not allowed to install Perl/Python/etc)
This is not a real constraint but - if possible - I'd like to scan the original (large) file just once.
Sort order of the resulting files is not important.
My working solution in Perl reads the original file and writes...
- the 1st record to the first file
- ...
- the Nth record to the Nth file
- the N+1 record back to the first file
- etc
So - at the end - with a single scan of the initial file I do get several smaller files with the same number of records (+/- 1).
For example, assume this is the input file:
1,1,1,1A2,2,2,2A3,
3,3,3A4,4,4,4A5,5,
5,5A6,6,6,6A7,7,7,
7,A8,8,8,8A9,9,9,9
A0,0,0,0
With record separator = 'A' and N = 3 I should get three files:
# First file:
1,1,1,1A2,2,2,2A3,
3,3,3
# Second file
4,4,4,4A5,5,
5,5A6,6,6,6
# Third file:
7,7,7,
7,A8,8,8,8A9,9,9,9
A0,0,0,0
UPDATE
Here you have the perl code. I tried to make it as simple and readable as I can:
#!/usr/bin/perl
use warnings;
use strict;
use locale;
use Getopt::Std;
#-----------------------------------------------------------------------------
# Declaring variables
#-----------------------------------------------------------------------------
my %op = (); # Command line parameters hash
my $line = 0; # Output file line number
my $fnum = 0; # Output file number
my #fout = (); # Output file names array
my #fhnd = (); # Output file handles array
my #ifiles = (); # Input file names
my $i = 0; # Loop variable
#-----------------------------------------------------------------------------
# Handling command line arguments
#-----------------------------------------------------------------------------
getopts("o:n:hvr:", \%op);
die "Usage: lfsplit [-h] -n number_of_files",
" [-o outfile_prefix] [-r rec_sep_decimal] [-v] input_file(s)\n"
if $op{h} ;
if ( #ARGV ) {
#ifiles = #ARGV ;
} else {
die "No input files...\n" ;
}
$/ = chr($op{r}) if $op{r} ;
#-----------------------------------------------------------------------------
# Setting Default values
#-----------------------------------------------------------------------------
$op{o} |= 'out_' ;
#-----------------------------------------------------------------------------
# Body - split in round-robin to $op{n} files
#-----------------------------------------------------------------------------
for ( $i = 0 ; $i < $op{n} ; $i++ ) {
local *OUT ; # Localize file glob
$fout[$i] = sprintf "%s_%04d.out", $op{o}, $i ;
open ( OUT, "> $fout[$i]" ) or
die "[lfsplit] Error writing to $fout[$i]: $!\n";
push ( #fhnd , *OUT ) ;
}
$i = 0 ;
foreach ( #ifiles ) {
print "Now reading $_ ..." if $op{v} ;
open ( IN, "< $_" ) or
die "[lfsplit] Error reading $op{i}: $!\n" ;
while ( <IN> ) {
print { $fhnd[$i] } $_ ;
$i = 0 if ++$i >= $op{n} ;
}
close IN ;
}
for ( $i = 0 ; $i < $op{n} ; $i++ ) {
close $fhnd[$i] ;
}
#-----------------------------------------------------------------------------
# Exit
#-----------------------------------------------------------------------------
exit 0 ;

Just for kicks, a pure bash solution, no external programs and no forking (I think):
#!/bin/bash
input=$1
separator=$2
outputs=$3
i=0
while read -r -d"$separator" record; do
out=$((i % outputs)).txt
if ((i < outputs)); then
: > $out
else
echo -n "$separator" >> $out
fi
echo -n "$record" >> $out
((i++))
done < $input
Sadly this will reopen every file for every output operation. I'm sure it's possible to fix this, using <> to open a file descriptor and keep it open, but using that with non-literal file descriptors is a bit of a pain.

Related

Local `commit-msg` doesn't work in conjunction with global `prepare-commit-msg`

When I have my global git hooks directory (which contains just a prepare-commit-msg hook) set up in config, my local commit-msg doesn't run (although the global hook does). However, when I disable the global prepare-commit-msg hook (by commenting out core.hookspath in gitconfig), the local commit-msg hook works just fine.
~/dotfiles/git-hooks/prepare-commit-msg
#!/usr/bin/env bash
pcregrep -Mv '(# Please.*|# with.*|^#$\n(?!#)|^#$(?=\n# On))' $1 > /tmp/msg && cat /tmp/msg > $1
./.git/hooks/commit-msg (Gerrit's hook to add change-id's if necessary, trimmed to remove license comments)
#!/bin/sh
...
unset GREP_OPTIONS
CHANGE_ID_AFTER="Bug|Depends-On|Issue|Test|Feature|Fixes|Fixed"
MSG="$1"
# Check for, and add if missing, a unique Change-Id
#
add_ChangeId() {
clean_message=`sed -e '
/^diff --git .*/{
s///
q
}
/^Signed-off-by:/d
/^#/d
' "$MSG" | git stripspace`
if test -z "$clean_message"
then
return
fi
# Do not add Change-Id to temp commits
if echo "$clean_message" | head -1 | grep -q '^\(fixup\|squash\)!'
then
return
fi
if test "false" = "`git config --bool --get gerrit.createChangeId`"
then
return
fi
# Does Change-Id: already exist? if so, exit (no change).
if grep -i '^Change-Id:' "$MSG" >/dev/null
then
return
fi
id=`_gen_ChangeId`
T="$MSG.tmp.$$"
AWK=awk
if [ -x /usr/xpg4/bin/awk ]; then
# Solaris AWK is just too broken
AWK=/usr/xpg4/bin/awk
fi
# Get core.commentChar from git config or use default symbol
commentChar=`git config --get core.commentChar`
commentChar=${commentChar:-#}
# How this works:
# - parse the commit message as (textLine+ blankLine*)*
# - assume textLine+ to be a footer until proven otherwise
# - exception: the first block is not footer (as it is the title)
# - read textLine+ into a variable
# - then count blankLines
# - once the next textLine appears, print textLine+ blankLine* as these
# aren't footer
# - in END, the last textLine+ block is available for footer parsing
$AWK '
BEGIN {
if (match(ENVIRON["OS"], "Windows")) {
RS="\r?\n" # Required on recent Cygwin
}
# while we start with the assumption that textLine+
# is a footer, the first block is not.
isFooter = 0
footerComment = 0
blankLines = 0
}
# Skip lines starting with commentChar without any spaces before it.
/^'"$commentChar"'/ { next }
# Skip the line starting with the diff command and everything after it,
# up to the end of the file, assuming it is only patch data.
# If more than one line before the diff was empty, strip all but one.
/^diff --git / {
blankLines = 0
while (getline) { }
next
}
# Count blank lines outside footer comments
/^$/ && (footerComment == 0) {
blankLines++
next
}
# Catch footer comment
/^\[[a-zA-Z0-9-]+:/ && (isFooter == 1) {
footerComment = 1
}
/]$/ && (footerComment == 1) {
footerComment = 2
}
# We have a non-blank line after blank lines. Handle this.
(blankLines > 0) {
print lines
for (i = 0; i < blankLines; i++) {
print ""
}
lines = ""
blankLines = 0
isFooter = 1
footerComment = 0
}
# Detect that the current block is not the footer
(footerComment == 0) && (!/^\[?[a-zA-Z0-9-]+:/ || /^[a-zA-Z0-9-]+:\/\//) {
isFooter = 0
}
{
# We need this information about the current last comment line
if (footerComment == 2) {
footerComment = 0
}
if (lines != "") {
lines = lines "\n";
}
lines = lines $0
}
# Footer handling:
# If the last block is considered a footer, splice in the Change-Id at the
# right place.
# Look for the right place to inject Change-Id by considering
# CHANGE_ID_AFTER. Keys listed in it (case insensitive) come first,
# then Change-Id, then everything else (eg. Signed-off-by:).
#
# Otherwise just print the last block, a new line and the Change-Id as a
# block of its own.
END {
unprinted = 1
if (isFooter == 0) {
print lines "\n"
lines = ""
}
changeIdAfter = "^(" tolower("'"$CHANGE_ID_AFTER"'") "):"
numlines = split(lines, footer, "\n")
for (line = 1; line <= numlines; line++) {
if (unprinted && match(tolower(footer[line]), changeIdAfter) != 1) {
unprinted = 0
print "Change-Id: I'"$id"'"
}
print footer[line]
}
if (unprinted) {
print "Change-Id: I'"$id"'"
}
}' "$MSG" > "$T" && mv "$T" "$MSG" || rm -f "$T"
}
_gen_ChangeIdInput() {
echo "tree `git write-tree`"
if parent=`git rev-parse "HEAD^0" 2>/dev/null`
then
echo "parent $parent"
fi
echo "author `git var GIT_AUTHOR_IDENT`"
echo "committer `git var GIT_COMMITTER_IDENT`"
echo
printf '%s' "$clean_message"
}
_gen_ChangeId() {
_gen_ChangeIdInput |
git hash-object -t commit --stdin
}
add_ChangeId
~/dotfiles/gitconfig (trimmed) when global hook is enabled.
...
[core]
editor = code -rwg $1:2
excludesfile = /Users/shreyasminocha/.gitignore
compression = 0
hookspath = /Users/shreyasminocha/dotfiles/git-hooks
...
Git version: 2.18.0
Edit: As #phd pointed out in the comments, "The problem is that global hookspath completely takes over local hooks. If the global hookspath is defined local hooks are never consulted. [I had] to create global /Users/shreyasminocha/dotfiles/git-hooks/commit-msg that [would] run local .git/hooks/commit-msg". Confirming duplicate.
Eventually solved this by adding some code to the global hook, just as #phd suggested. It looks for a local hook and runs it if it exists.
Global prepare-commit-msg hook:
#!/usr/bin/env bash
# if a local hook exists, run it
if [ -e ./.git/hooks/prepare-commit-msg ]; then
./.git/hooks/prepare-commit-msg "$#"
fi
if [ -e ./.git/hooks/commit-msg ]; then
./.git/hooks/commit-msg "$#"
fi
# global_hook-related things follow
pcregrep -Mv '(# Please.*|# with.*|^#$\n(?!#)|^#$(?=\n# On))' $1 > /tmp/msg && cat /tmp/msg > $1

Executing Perl script from windows-command line with 2 entry

this is my Perl script
use strict;
use warnings;
use XML::Twig;
use Data::Dumper;
sub xml2array{
my $path = shift;
my $twig = XML::Twig->new->parsefile($path);
return map { $_ -> att('VirtualPath') } $twig -> get_xpath('//Signals');
}
sub compareMappingToArray {
my $mapping = shift;
my $signalsRef = shift;
my $i = 1;
print "In file : $mapping\n";
open(my $fh, $mapping);
while (my $r = <$fh>) {
chomp $r;
if ($r =~ /\'(ModelSpecific.*)\'/) {
my $s = $1;
my #matches = grep { /^$s$/ } #{$signalsRef};
print "line $i : not found - $s\n" if scalar #matches ==0;
print "line $i : multiple $s\n" if scalar #matches > 1;
}
$i = $i + 1 # keep line index
}
}
my $mapping = "C:/Users/HOR1DY/Desktop/Global/TA_Mapping/CAN/CAN_ESP_002_mapping.pm";
my #virtualpath = xml2array("SignalModel.xml");
compareMappingToArray($mapping, \#virtualpath);
The script works well, the aim of it is to compare the file "SignalModel.xml" and "CAN_ESP_002_mapping.pm" and putting the lines that didn't matches in a .TXT file. Here is how the .TXT file looks like:
In file : C:/Users/HOR1DY/Desktop/Global/TA_Mapping/CAN/CAN_ESP_002_mapping.pm
line 331 : not found - ModelSpecific.EID.NET.CAN_Engine.VCU.Transmit.VCU_202.R2B_VCU_202__byte_3
line 348 : not found - ModelSpecific.EID.NET.CAN_Engine.CMM_WX.Transmit.CMM_HYB_208.R2B_CMM_HYB_208__byte_2
line 368 : not found - ModelSpecific.EID.NET.CAN_Engine.VCU.Transmit.VCU_222.R2B_VCU_222__byte_0
But for this script, I put the two files that need to be compare inside of the code and instead of doing that, I would like to run the script in windows cmd line and having something like:
C:\Users>perl CANMappingChecker.pl -'file 1' 'file 2'
All the files are in .zip file so if I can execute the script that he goes inside and take the 2 files that I need for comparison, it should be perfect.
I really don't know how to do and what to put inside my script to make that in the cmd windows. Thanks for your help !
Program (or script) parameters are stored in the #ARGV array. shift and pop without any parameter will work on #ARGV when used outside of a sub, in a sub they operate on #_.
See Archive::Zip for zip file handling.

Kaldi librispeech data preparation error

I'm trying to do ASR system. Im using kaldi manual and librispeech corpus.
In data preparation step i get this error
utils/data/get_utt2dur.sh: segments file does not exist so getting durations
from wave files
utils/data/get_utt2dur.sh: could not get utterance lengths from sphere-file
headers, using wav-to-duration
utils/data/get_utt2dur.sh: line 99: wav-to-duration: command not found
And here the piece of code where this error occures
if cat $data/wav.scp | perl -e '
while (<>) { s/\|\s*$/ |/; # make sure final | is preceded by space.
#A = split;
if (!($#A == 5 && $A[1] =~ m/sph2pipe$/ &&
$A[2] eq "-f" && $A[3] eq "wav" && $A[5] eq "|")) { exit (1); }
$utt = $A[0]; $sphere_file = $A[4];
if (!open(F, "<$sphere_file")) { die "Error opening sphere file $sphere_file"; }
$sample_rate = -1; $sample_count = -1;
for ($n = 0; $n <= 30; $n++) {
$line = <F>;
if ($line =~ m/sample_rate -i (\d+)/) { $sample_rate = $1; }
if ($line =~ m/sample_count -i (\d+)/) { $sample_count = $1;
}
if ($line =~ m/end_head/) { break; }
}
close(F);
if ($sample_rate == -1 || $sample_count == -1) {
die "could not parse sphere header from $sphere_file";
}
$duration = $sample_count * 1.0 / $sample_rate;
print "$utt $duration\n";
} ' > $data/utt2dur; then
echo "$0: successfully obtained utterance lengths from sphere-file headers"
else
echo "$0: could not get utterance lengths from sphere-file headers,
using wav-to-duration"
if command -v wav-to-duration >/dev/null; then
echo "$0: wav-to-duration is not on your path"
exit 1;
fi
In file wav.scp i got such lines:
6295-64301-0002 flac -c -d -s /home/tinin/kaldi/egs/librispeech/s5/LibriSpeech/dev-clean/6295/64301/6295-64301-0002.flac |
In this dataset i have only flac files(they downloaded via provided script) and i dont understand why we search wav-files? And how run data preparation correctly(i didnt change source code in this manual.
Also, if you explain to me what is happening in this code, then I will be very grateful to you, because i'm not familiar with bash and perl.
Thank you a lot!
The problem I see from this line
utils/data/get_utt2dur.sh: line 99: wav-to-duration: command not found
is that you have not added the kaldi tools in your path.
Check the file path.sh and see if the directories that it adds to your path are correct (because it has ../../.. inside and it might not match your current folder setup)
As for the perl script, it counts the samples of the sound file and then it divides with the sample rate in order to get the duration. Don't worry about the 'wav' word, your files might be on another format, it's just the name of the kaldi functions.

How to compare the content of multiple txt file in bash shell and delete the one (file) which is duplicate

I am trying to achieve this is Mac OS, tried to achieve similar by using fdupes but didn't work. Here is what I am trying to achieve:
There are 100 files in directory 'alpha'
Pick one file A and compare it with each remaining file in the directory 'alpha'
If content of file A matches any file (duplicate), delete the duplicate file
Move to file B, and compare with the remaining file, and do the same (check for duplicate)
Repeat the same until all files are checked for duplicates. Remaining files should be unique
Update
I modified a bit something similar I found here, but I have to run it multiple times to take out the duplicates. It is not detecting duplicates in a single run (have to run it multiple times to detect duplicate). Not sure if it is working correctly
use Digest::MD5;
%check = ();
while (<*>) {
-d and next;
$fname = "$_";
print "checking .. $fname\n";
$md5 = getmd5($fname) . "\n";
if ( !defined( $check{$md5} ) ) {
$check{$md5} = "$fname";
}
else {
print "Found duplicate files: $fname and $check{$md5}\n";
print "Deleting duplicate $check{$md5}\n";
unlink $check{$md5};
}
}
sub getmd5 {
my $file = "$_";
open( FH, "<", $file ) or die "Cannot open file: $!\n";
binmode(FH);
my $md5 = Digest::MD5->new;
$md5->addfile(FH);
close(FH);
return $md5->hexdigest;
}
You should limit the number of times that you have to read each file's contents:
Inventory the files using Path::Class or some similar method.
a. Build a hash relating file sizes and MD5::Digest to a list of file names.
Compare likely duplicates only. Matching file size and digest.
The following is untested:
use strict;
use warnings;
use Path::Class;
use Digest::MD5;
my $dir = dir('.');
my %files_per_digest;
# Inventory Directory
while ( my $file = $dir->next ) {
my $size = $file->stat->size;
my $digest = do {
my $md5 = Digest::MD5->new;
$md5->addfile( $file->openr );
$md5->hexdigest;
};
push #{ $files_per_digest{"$size - $digest"} }, $file;
}
# Compare likely duplicates only
for my $files ( grep { #$_ > 1 } values %files_per_digest ) {
# Sort by alpha
#$files = sort #$files;
print "Comparing: #files\n";
for my $i ( reverse 0 .. $#files ) {
for my $j ( 0 .. $i - 1 ) {
my $fh1 = $files->[$i]->openr;
my $fh2 = $files->[$j]->openr;
my $diff = 0;
while ( !eof($fh1) && !eof($fh2) ) {
$diff = 1, last if scalar(<$fh1>) ne scalar(<$fh2>);
}
if ( $diff or !eof($fh1) or !eof($fh2) ) {
print " $files->[$i] ($i) is duplicate of $files->[$j] ($j)\n";
$files->[$i]->remove();
splice #$files, $i, 1;
}
}
}
}
I've used rdfind in the past with very good success. It's very accurate, fast, and seems to run leaner than fdupes. According to RDFind's web site (http://rdfind.pauldreik.se/), it can be installed using MacPorts.

Why can't I use more than 20 files with my Perl script and Windows's SendTo?

I'm trying to emulate RapidCRC's ability to check crc32 values within filenames on Windows Vista Ultimate 64-bit. However, I seem to be running into some kind of argument limitation.
I wrote a quick Perl script, created a batch file to call it, then placed a shortcut to the batch file in %APPDATA%\Microsoft\Windows\SendTo
This works great when I select about 20 files or less, right-click and "send to" my batch file script. However, nothing happens at all when I select more than that. I suspect there's a character or number of arguments limit somewhere.
Hopefully I'm missing something simple and that the solution or a workaround isn't too painful.
References:
batch file (crc32_inline.bat):
crc32_inline.pl %*
Perl notes:
I'm using (strawberry) perl v5.10.0
I have C:\strawberry\perl\bin in my path, which is where crc32.bat exists.
perl script (crc32_inline.pl):
#!/usr/bin/env perl
use strict;
use warnings;
use Cwd;
use English qw( -no_match_vars );
use File::Basename;
$OUTPUT_AUTOFLUSH = 1;
my $crc32_cmd = 'crc32.bat';
my $failure_report_basename = 'crc32_failures.txt';
my %failures = ();
print "\n";
foreach my $arg (#ARGV) {
# if the file has a crc, check to see if it matches the calculated
# crc.
if (-f $arg and $arg =~ /\[([0-9a-f]{8})\]/i) {
my $crc = uc $1;
my $basename = basename($arg);
print "checking ${basename}... ";
my $calculated_crc = uc `${crc32_cmd} "${arg}"`;
chomp($calculated_crc);
if ($crc eq $calculated_crc) {
print "passed.\n";
}
else {
print "FAILED (calculated ${calculated_crc})\n";
my $dirname = dirname($arg);
$failures{$dirname}{$basename} = $calculated_crc;
}
}
}
print "\nReport Summary:\n";
if (scalar keys %failures == 0) {
print " All files OK\n";
}
else {
print sprintf(" %d / %d files failed crc32 validation.\n" .
" See %s for details.\n",
scalar keys %failures,
scalar #ARGV,
$failure_report_basename);
my $failure_report_fullname = $failure_report_basename;
if (defined -f $ARGV[0]) {
$failure_report_fullname
= dirname($ARGV[0]) . '/' . $failure_report_basename;
}
$OUTPUT_AUTOFLUSH = 0;
open my $fh, '>' . $failure_report_fullname or die $!;
foreach my $dirname (sort keys %failures) {
print {$fh} $dirname . "\n";
foreach my $basename (sort keys %{$failures{$dirname}}) {
print {$fh} sprintf(" crc32(%s) basename(%s)\n",
$failures{$dirname}{$basename},
$basename);
}
}
close $fh;
$OUTPUT_AUTOFLUSH = 1;
}
print sprintf("\n%s done! (%d seconds elapsed)\n" .
"Press enter to exit.\n",
basename($0),
time() - $BASETIME);
<STDIN>;
I will recommend just putting a shortcut to your script in the "Send To" directory instead of doing it via a batch file (which is subject to cmd.exes limits on command line length).

Resources