Kaldi librispeech data preparation error - bash

I'm trying to build an ASR system, following the Kaldi manual with the LibriSpeech corpus.
In the data preparation step I get this error:
utils/data/get_utt2dur.sh: segments file does not exist so getting durations from wave files
utils/data/get_utt2dur.sh: could not get utterance lengths from sphere-file headers, using wav-to-duration
utils/data/get_utt2dur.sh: line 99: wav-to-duration: command not found
Here is the piece of code where this error occurs:
if cat $data/wav.scp | perl -e '
while (<>) { s/\|\s*$/ |/; # make sure final | is preceded by space.
@A = split;
if (!($#A == 5 && $A[1] =~ m/sph2pipe$/ &&
$A[2] eq "-f" && $A[3] eq "wav" && $A[5] eq "|")) { exit (1); }
$utt = $A[0]; $sphere_file = $A[4];
if (!open(F, "<$sphere_file")) { die "Error opening sphere file $sphere_file"; }
$sample_rate = -1; $sample_count = -1;
for ($n = 0; $n <= 30; $n++) {
$line = <F>;
if ($line =~ m/sample_rate -i (\d+)/) { $sample_rate = $1; }
if ($line =~ m/sample_count -i (\d+)/) { $sample_count = $1;
}
if ($line =~ m/end_head/) { break; }
}
close(F);
if ($sample_rate == -1 || $sample_count == -1) {
die "could not parse sphere header from $sphere_file";
}
$duration = $sample_count * 1.0 / $sample_rate;
print "$utt $duration\n";
} ' > $data/utt2dur; then
echo "$0: successfully obtained utterance lengths from sphere-file headers"
else
echo "$0: could not get utterance lengths from sphere-file headers, using wav-to-duration"
if ! command -v wav-to-duration >/dev/null; then
echo "$0: wav-to-duration is not on your path"
exit 1;
fi
In the wav.scp file I have lines like this:
6295-64301-0002 flac -c -d -s /home/tinin/kaldi/egs/librispeech/s5/LibriSpeech/dev-clean/6295/64301/6295-64301-0002.flac |
This dataset contains only FLAC files (downloaded via the provided script), so I don't understand why the script looks for WAV files. How do I run data preparation correctly? (I didn't change any source code from the manual.)
Also, I would be very grateful if you could explain what this code does, because I'm not familiar with bash and Perl.
Thank you a lot!

The problem, judging from this line
utils/data/get_utt2dur.sh: line 99: wav-to-duration: command not found
is that the Kaldi tools have not been added to your PATH.
Check the file path.sh and see whether the directories it adds to your PATH are correct (it contains ../../.. and that might not match your current folder setup).
As for the Perl script, it reads the sample count and the sample rate from the sound file's header and divides the former by the latter to get the duration. It only accepts wav.scp lines that use sph2pipe, so with your flac lines it exits and the shell script falls back to wav-to-duration. Don't worry about the word 'wav': your files may be in another format; it's just the name of the Kaldi functions.
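The two points above can be checked from the shell. The following is a minimal sketch, not part of Kaldi: the helper names have_tool and duration_seconds are made up for illustration, and the sample values are invented. It tests whether wav-to-duration is reachable and repeats the division the Perl snippet performs.

```shell
#!/bin/sh
# have_tool NAME: succeed if NAME resolves to a command in the current shell.
# (Hypothetical helper, not a Kaldi utility.)
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# duration_seconds SAMPLE_COUNT SAMPLE_RATE: the same division the Perl
# snippet performs on the values read from the sphere header.
duration_seconds() {
    awk -v n="$1" -v r="$2" 'BEGIN { printf "%.2f\n", n / r }'
}

if have_tool wav-to-duration; then
    echo "wav-to-duration found: $(command -v wav-to-duration)"
else
    echo "wav-to-duration is not on PATH; run '. ./path.sh' in the recipe directory and check again"
fi

# Invented example values: 160000 samples at 16 kHz.
duration_seconds 160000 16000
```

The last line prints 10.00. If the first check still fails after sourcing path.sh, the directories that path.sh adds are probably wrong for your checkout, or the Kaldi binaries have not been built.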

Related

Executing Perl script from windows-command line with 2 entry

This is my Perl script:
use strict;
use warnings;
use XML::Twig;
use Data::Dumper;
sub xml2array{
my $path = shift;
my $twig = XML::Twig->new->parsefile($path);
return map { $_ -> att('VirtualPath') } $twig -> get_xpath('//Signals');
}
sub compareMappingToArray {
my $mapping = shift;
my $signalsRef = shift;
my $i = 1;
print "In file : $mapping\n";
open(my $fh, $mapping);
while (my $r = <$fh>) {
chomp $r;
if ($r =~ /\'(ModelSpecific.*)\'/) {
my $s = $1;
my @matches = grep { /^$s$/ } @{$signalsRef};
print "line $i : not found - $s\n" if scalar @matches ==0;
print "line $i : multiple $s\n" if scalar @matches > 1;
}
$i = $i + 1 # keep line index
}
}
my $mapping = "C:/Users/HOR1DY/Desktop/Global/TA_Mapping/CAN/CAN_ESP_002_mapping.pm";
my @virtualpath = xml2array("SignalModel.xml");
compareMappingToArray($mapping, \@virtualpath);
The script works well. Its aim is to compare the files "SignalModel.xml" and "CAN_ESP_002_mapping.pm" and write the lines that don't match to a .txt file. Here is what the .txt file looks like:
In file : C:/Users/HOR1DY/Desktop/Global/TA_Mapping/CAN/CAN_ESP_002_mapping.pm
line 331 : not found - ModelSpecific.EID.NET.CAN_Engine.VCU.Transmit.VCU_202.R2B_VCU_202__byte_3
line 348 : not found - ModelSpecific.EID.NET.CAN_Engine.CMM_WX.Transmit.CMM_HYB_208.R2B_CMM_HYB_208__byte_2
line 368 : not found - ModelSpecific.EID.NET.CAN_Engine.VCU.Transmit.VCU_222.R2B_VCU_222__byte_0
But in this script the two files to compare are hard-coded. Instead, I would like to run the script from the Windows command line with something like:
C:\Users>perl CANMappingChecker.pl -'file 1' 'file 2'
All the files are inside a .zip file, so ideally the script would open the archive and pick out the two files I need for the comparison.
I really don't know how to do this or what to put in my script to make it work from the Windows command line. Thanks for your help!
Program (or script) parameters are stored in the @ARGV array. shift and pop without an argument operate on @ARGV when used outside a sub; inside a sub they operate on @_.
See Archive::Zip for zip file handling.

Recursively convert media directory from HEVC to h.264 with ffmpeg

I have media server with two directories: Movies and TV Shows. Within each of those directories, each entry exists in a sub-directory which contains the video file and subtitle files.
I've scoured the web and have found an excellent perl script from Michelle Sullivan, posted here:
#!/usr/bin/perl
use strict;
use warnings;
open DIR, "ls -1 |";
while (<DIR>)
{
chomp;
next if ( -d "$_"); # skip directories
next unless ( -r "$_"); # if it's not readable skip it!
my $file = $_;
open PROBE, "ffprobe -show_streams -of csv '$file' 2>/dev/null|" or die ("Unable to launch ffmpeg for $file! ($!)");
my ($v, $a, $s, @c) = (0,0,0);
while (<PROBE>)
{
my @streaminfo = split(/,/, $_);
push(@c, $streaminfo[2]) if ($streaminfo[5] eq "video");
$a++ if ($streaminfo[5] eq "audio");
$s++ if ($streaminfo[5] eq "subtitle");
}
close PROBE;
$v = scalar @c;
if (scalar @c eq 1 and $c[0] eq "ansi")
{
warn("Text file detected, skipping...\n");
next;
}
warn("$file: Video Streams: $v, Audio Streams: $a, Subtitle Streams: $s, Video Codec(s): " . join (", ", @c) . "\n");
if (scalar @c > 1)
{
warn("$file has more than one video stream, bailing!\n");
next;
}
if ($c[0] eq "hevc")
{
warn("HEVC detected for $file ...converting to AVC...\n");
system("mkdir -p h265");
my @params = ("-hide_banner", "-threads 2");
push(@params, "-map 0") if ($a > 1 or $s > 1 or $v > 1);
push(@params, "-c:a copy") if ($a);
push(@params, "-c:s copy") if ($s);
push(@params, "-c:v libx264 -pix_fmt yuv420p") if ($v);
if (system("mv '$file' 'h265/$file'"))
{
warn("Error moving $file -> h265/$file\n");
next;
}
if (system("ffmpeg -xerror -i 'h265/$file' " . join(" ", @params) . " '$file' 2>/dev/null"))
{
warn("FFMPEG ERROR. Cannot convert $file restoring original...\n");
system("mv 'h265/$file' '$file'");
next;
}
} else {
warn("$file doesn't appear to need converting... Skipping...\n");
}
}
close DIR;
The script performs perfectly - as long as it is run from within the directory containing the media.
My question: Can this script be modified to run recursively from the root directory? How?
Thanks in advance.
(Michelle's script can be seen here: http://www.michellesullivan.org/blog/1636)
Why do you want to run recursively? Do you mean that you want to run it on all the files under a particular directory?
For problems like this, I'd rather separate the part that generates the list of files to process from the part that processes them. With a long list of files, I might read the names from standard input instead:
while( <> ) {
...
}
Pipe the list into the script:
$ find ... | script
Or take it from a file:
$ script list_of_files.txt
With a short list, I might use a favorite xargs trick:
$ find ... -print0 | xargs -0 script
In that case I go through the command-line arguments:
foreach ( @ARGV ) {
...
}
If you want to do all of this in the program, you can use File::Find.
Beyond that, it sounds like you are asking someone to do the work for you.
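Because the script in the question lists only the current directory and moves originals into an h265 subdirectory there, one way to apply the find idea is to run it once inside every directory below the root. The following is a sketch under those assumptions; run_in_each_dir is a made-up helper name, and it uses find -exec instead of the -print0 | xargs -0 pipeline so that the directory change can happen per entry.

```shell
#!/bin/sh
# run_in_each_dir ROOT COMMAND [ARGS...]
# Runs COMMAND once with each directory below ROOT (ROOT included) as the
# working directory. Directories named h265 are pruned so that the backup
# folders the conversion script creates are not processed again.
run_in_each_dir() {
    root=$1
    shift
    find "$root" -name h265 -prune -o -type d \
        -exec sh -c 'cd "$1" && shift && "$@"' sh {} "$@" \;
}
```

A call might look like run_in_each_dir "/srv/media/Movies" perl /usr/local/bin/convert.pl, where both paths are placeholders. The command has to be given by absolute path or be on PATH, since the working directory changes for every invocation.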

How to find out if a command exists in a POSIX compliant manner?

See the discussion at Is `command -v` option required in a POSIX shell? Is posh compliant with POSIX?. It explains that type, as well as the -v option of command, is optional in POSIX.1-2004.
The answer marked correct at Check if a program exists from a Bash script doesn't help either. Just like type, hash is also marked as XSI in POSIX.1-2004. See http://pubs.opengroup.org/onlinepubs/009695399/utilities/hash.html.
Then what would be a POSIX compliant way to write a shell script to find if a command exists on the system or not?
How do you want to go about it? You can look for the command in the directories in the current value of $PATH; you could look in the directories specified by default for the system PATH (getconf PATH, as long as getconf exists on PATH).
Which implementation language are you going to use? (For example: I have a Perl implementation that does a decent job finding executables on $PATH — but Perl is not part of POSIX; is it remotely relevant to you?)
Why not simply try running it? If you're going to deal with Busybox-based systems, lots of the executables can't be found by searching — they're built into the shell. The major caveat is if a command does something dangerous when run with no arguments — but very few POSIX commands, if any, do that. You might also need to determine what command exit statuses indicate that the command is not found versus the command objecting to not being called with appropriate arguments. And there's little guarantee that all systems will be consistent on that. It's a fraught process, in case you hadn't gathered.
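The first option, walking the directories in $PATH, can be sketched in plain sh without type, hash, or command -v. The function name find_in_path is invented for this example. As noted above, a search like this cannot see commands that are built into the shell.

```shell
#!/bin/sh
# find_in_path NAME: print the first executable regular file called NAME
# found in the directories of $PATH; exit status 1 if there is none.
# The body is a subshell, so the IFS and 'set -f' changes do not leak
# into the caller and 'exit' only ends the function.
find_in_path() (
    name=$1
    case $name in
        */*)
            # A name containing a slash is checked as given, not searched.
            [ -f "$name" ] && [ -x "$name" ] && printf '%s\n' "$name" && exit 0
            exit 1
            ;;
    esac
    IFS=:
    set -f    # no globbing while $PATH is split on colons
    for dir in $PATH; do
        candidate=${dir:-.}/$name    # an empty PATH element means "."
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            exit 0
        fi
    done
    exit 1
)
```

Typical use is a guard such as: if find_in_path ffmpeg >/dev/null; then ... fi.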
Perl implementation pathfile
#!/usr/bin/env perl
#
# @(#)$Id: pathfile.pl,v 3.4 2015/10/16 19:39:23 jleffler Exp $
#
# Which command is executed
# Loosely based on 'which' from Kernighan & Pike "The UNIX Programming Environment"
#use v5.10.0; # Uses // defined-or operator; not in Perl 5.8.x
use strict;
use warnings;
use Getopt::Std;
use Cwd 'realpath';
use File::Basename;
my $arg0 = basename($0, '.pl');
my $usestr = "Usage: $arg0 [-AafhqrsVwx] [-p path] command ...\n";
my $hlpstr = <<EOS;
-A Absolute pathname (determined by realpath)
-a Print all possible matches
-f Print names of files (as opposed to symlinks, directories, etc)
-h Print this help message and exit
-q Quiet mode (don't print messages about files not found)
-r Print names of files that are readable
-s Print names of files that are not empty
-V Print version information and exit
-w Print names of files that are writable
-x Print names of files that are executable
-p path Use PATH
EOS
sub usage
{
print STDERR $usestr;
exit 1;
}
sub help
{
print $usestr;
print $hlpstr;
exit 0;
}
sub version
{
my $version = 'PATHFILE Version $Revision: 3.4 $ ($Date: 2015/10/16 19:39:23 $)';
# Beware of RCS hacking at RCS keywords!
# Convert date field to ISO 8601 (ISO 9075) notation
$version =~ s%\$(Date:) (\d\d\d\d)/(\d\d)/(\d\d) (\d\d:\d\d:\d\d) \$%\$$1 $2-$3-$4 $5 \$%go;
# Remove keywords
$version =~ s/\$([A-Z][a-z]+|RCSfile): ([^\$]+) \$/$2/go;
print "$version\n";
exit 0;
}
my %opts;
usage unless getopts('AafhqrsVwxp:', \%opts);
version if ($opts{V});
help if ($opts{h});
usage unless scalar(@ARGV);
# Establish test and generate test subroutine.
my $chk = 0;
my $test = "-x";
my $optlist = "";
foreach my $opt ('f', 'r', 's', 'w', 'x')
{
if ($opts{$opt})
{
$chk++;
$test = "-$opt";
$optlist .= " -$opt";
}
}
if ($chk > 1)
{
$optlist =~ s/^ //;
$optlist =~ s/ /, /g;
print STDERR "$arg0: mutually exclusive arguments ($optlist) given\n";
usage;
}
my $chk_ref = eval "sub { my(\$cmd) = \@_; return -f \$cmd && $test \$cmd; }";
my @PATHDIRS;
my %pathdirs;
my $path = defined($opts{p}) ? $opts{p} : $ENV{PATH};
#foreach my $element (split /:/, $opts{p} // $ENV{PATH})
foreach my $element (split /:/, $path)
{
$element = "." if $element eq "";
push @PATHDIRS, $element if $pathdirs{$element}++ == 0;
}
my $estat = 0;
CMD:
foreach my $cmd (@ARGV)
{
if ($cmd =~ m%/%)
{
if (&$chk_ref($cmd))
{
print "$cmd\n" unless $opts{q};
next CMD;
}
print STDERR "$arg0: $cmd: not found\n" unless $opts{q};
$estat = 1;
}
else
{
my $found = 0;
foreach my $directory (@PATHDIRS)
{
my $file = "$directory/$cmd";
if (&$chk_ref($file))
{
$file = realpath($file) if $opts{A};
print "$file\n" unless $opts{q};
next CMD unless defined($opts{a});
$found = 1;
}
}
print STDERR "$arg0: $cmd: not found\n" unless $found || $opts{q};
$estat = 1;
}
}
exit $estat;

How to compare the content of multiple txt file in bash shell and delete the one (file) which is duplicate

I am trying to achieve this on macOS. I tried using fdupes, but it didn't work. Here is what I am trying to achieve:
There are 100 files in directory 'alpha'
Pick one file A and compare it with each remaining file in the directory 'alpha'
If content of file A matches any file (duplicate), delete the duplicate file
Move to file B, and compare with the remaining file, and do the same (check for duplicate)
Repeat the same until all files are checked for duplicates. Remaining files should be unique
Update
I slightly modified something similar that I found here, but I have to run it multiple times to remove all the duplicates; it does not detect every duplicate in a single run. I'm not sure whether it is working correctly.
use Digest::MD5;
%check = ();
while (<*>) {
-d and next;
$fname = "$_";
print "checking .. $fname\n";
$md5 = getmd5($fname) . "\n";
if ( !defined( $check{$md5} ) ) {
$check{$md5} = "$fname";
}
else {
print "Found duplicate files: $fname and $check{$md5}\n";
print "Deleting duplicate $check{$md5}\n";
unlink $check{$md5};
}
}
sub getmd5 {
my $file = "$_";
open( FH, "<", $file ) or die "Cannot open file: $!\n";
binmode(FH);
my $md5 = Digest::MD5->new;
$md5->addfile(FH);
close(FH);
return $md5->hexdigest;
}
You should limit the number of times that you have to read each file's contents:
1. Inventory the files using Path::Class or some similar method.
   a. Build a hash relating file size and Digest::MD5 digest to a list of file names.
2. Compare likely duplicates only, i.e. files with matching size and digest.
The following is untested:
use strict;
use warnings;
use Path::Class;
use Digest::MD5;
my $dir = dir('.');
my %files_per_digest;
# Inventory Directory
while ( my $file = $dir->next ) {
next if $file->is_dir; # skip ., .. and subdirectories
my $size = $file->stat->size;
my $digest = do {
my $md5 = Digest::MD5->new;
$md5->addfile( $file->openr );
$md5->hexdigest;
};
push @{ $files_per_digest{"$size - $digest"} }, $file;
}
# Compare likely duplicates only
for my $files ( grep { @$_ > 1 } values %files_per_digest ) {
# Sort by alpha
@$files = sort @$files;
print "Comparing: @$files\n";
for my $i ( reverse 0 .. $#$files ) {
for my $j ( 0 .. $i - 1 ) {
my $fh1 = $files->[$i]->openr;
my $fh2 = $files->[$j]->openr;
my $diff = 0;
while ( !eof($fh1) && !eof($fh2) ) {
$diff = 1, last if scalar(<$fh1>) ne scalar(<$fh2>);
}
# Identical only if no line differed and both files ended together
unless ( $diff or !eof($fh1) or !eof($fh2) ) {
print " $files->[$i] ($i) is duplicate of $files->[$j] ($j)\n";
$files->[$i]->remove();
splice @$files, $i, 1;
last; # file $i is gone; move on to the next candidate
}
}
}
}
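The same inventory-then-compare idea can be sketched in plain shell. This is an illustration, not a tested replacement for the Perl above. It uses cksum (CRC plus size) in place of MD5 as the cheap first filter and confirms every match with cmp. The function name list_duplicates is invented, and it only prints the duplicates instead of deleting them.

```shell
#!/bin/sh
# list_duplicates DIR: print each regular file in DIR whose content is
# byte-identical to a file that sorts earlier in the directory listing.
# Limitations: file names containing tabs or newlines are not handled,
# and if two different files share a CRC and size, only the first of
# them is used for later comparisons.
list_duplicates() (
    cd "$1" || exit 1
    seen=$(mktemp "${TMPDIR:-/tmp}/dups.XXXXXX") || exit 1
    for f in *; do
        [ -f "$f" ] || continue
        # Reading from stdin keeps the file name out of the key: "CRC SIZE".
        key=$(cksum < "$f")
        first=$(awk -F '\t' -v k="$key" '$1 == k { print $2; exit }' "$seen")
        if [ -n "$first" ] && cmp -s "$first" "$f"; then
            printf '%s\n' "$f"
        else
            printf '%s\t%s\n' "$key" "$f" >> "$seen"
        fi
    done
    rm -f "$seen"
)
```

Once the printed list looks right, each name could be removed with rm from inside the same directory. Because the first file of every group is never printed, one run is enough.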
I've used rdfind in the past with very good success. It's very accurate, fast, and seems to run leaner than fdupes. According to RDFind's web site (http://rdfind.pauldreik.se/), it can be installed using MacPorts.

Why can't I use more than 20 files with my Perl script and Windows's SendTo?

I'm trying to emulate RapidCRC's ability to check crc32 values within filenames on Windows Vista Ultimate 64-bit. However, I seem to be running into some kind of argument limitation.
I wrote a quick Perl script, created a batch file to call it, then placed a shortcut to the batch file in %APPDATA%\Microsoft\Windows\SendTo
This works great when I select about 20 files or less, right-click and "send to" my batch file script. However, nothing happens at all when I select more than that. I suspect there's a character or number of arguments limit somewhere.
Hopefully I'm missing something simple and that the solution or a workaround isn't too painful.
References:
batch file (crc32_inline.bat):
crc32_inline.pl %*
Perl notes:
I'm using (strawberry) perl v5.10.0
I have C:\strawberry\perl\bin in my path, which is where crc32.bat exists.
perl script (crc32_inline.pl):
#!/usr/bin/env perl
use strict;
use warnings;
use Cwd;
use English qw( -no_match_vars );
use File::Basename;
$OUTPUT_AUTOFLUSH = 1;
my $crc32_cmd = 'crc32.bat';
my $failure_report_basename = 'crc32_failures.txt';
my %failures = ();
print "\n";
foreach my $arg (@ARGV) {
# if the file has a crc, check to see if it matches the calculated
# crc.
if (-f $arg and $arg =~ /\[([0-9a-f]{8})\]/i) {
my $crc = uc $1;
my $basename = basename($arg);
print "checking ${basename}... ";
my $calculated_crc = uc `${crc32_cmd} "${arg}"`;
chomp($calculated_crc);
if ($crc eq $calculated_crc) {
print "passed.\n";
}
else {
print "FAILED (calculated ${calculated_crc})\n";
my $dirname = dirname($arg);
$failures{$dirname}{$basename} = $calculated_crc;
}
}
}
print "\nReport Summary:\n";
if (scalar keys %failures == 0) {
print " All files OK\n";
}
else {
print sprintf(" %d / %d files failed crc32 validation.\n" .
" See %s for details.\n",
scalar keys %failures,
scalar @ARGV,
$failure_report_basename);
my $failure_report_fullname = $failure_report_basename;
if (defined -f $ARGV[0]) {
$failure_report_fullname
= dirname($ARGV[0]) . '/' . $failure_report_basename;
}
$OUTPUT_AUTOFLUSH = 0;
open my $fh, '>' . $failure_report_fullname or die $!;
foreach my $dirname (sort keys %failures) {
print {$fh} $dirname . "\n";
foreach my $basename (sort keys %{$failures{$dirname}}) {
print {$fh} sprintf(" crc32(%s) basename(%s)\n",
$failures{$dirname}{$basename},
$basename);
}
}
close $fh;
$OUTPUT_AUTOFLUSH = 1;
}
print sprintf("\n%s done! (%d seconds elapsed)\n" .
"Press enter to exit.\n",
basename($0),
time() - $BASETIME);
<STDIN>;
I would recommend putting a shortcut to your script directly in the "Send To" directory instead of going through a batch file (which is subject to cmd.exe's limits on command-line length).

Resources