What does "ERROR: fstsymbols: Saving osymbols but there are no output symbols" means and how can i solve it? - openfst

I wanted to update the language model of the Kaldi model used in Vosk. I was following the "Updating the language model" section of the Vosk adaptation documentation.
I got ERROR: fstsymbols: Saving osymbols but there are no output symbols when I tried to run fstsymbols --save_osymbols=words.txt Gr.fst > /dev/null .
I guess words.txt is the new text which I want to recognize, right?

What this command does is save the output symbol table of the current finite state transducer (the language model, i.e. the .fst file) into a text file called words.txt.
$ fstsymbols --help
...
--save_osymbols: type = string, default = ""
Save fst file's output symbol table to file
...
This file is then used in the next command, where text.txt is a file containing your custom training text (the sentences you want to recognize).
farcompilestrings --fst_type=compact --symbols=words.txt --keep_symbols text.txt | \
ngramcount | ngrammake | \
fstconvert --fst_type=ngram > Gr.new.fst
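As for the error itself: it means that Gr.fst simply has no output symbol table stored inside it, so there is nothing for --save_osymbols to extract. A quick check, sketched under the assumption that fstinfo's property listing includes the symbol-table fields:

# Does Gr.fst carry symbol tables at all? (assumes fstinfo reports
# "input symbol table" / "output symbol table" lines)
fstinfo Gr.fst | grep -i 'symbol table'

# If no output symbol table is attached, words.txt cannot be pulled out of
# Gr.fst; take it from the model instead. The path below is a placeholder;
# Vosk models usually ship a words.txt alongside the graph.
cp /path/to/model/graph/words.txt words.txt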

Related

save stream output as multiple files

I have a program (pull) which downloads files and emits their contents (JSON) to stdout; the input of the program is the id of each document I want to download, like so:
pull one two three
>
> { ...one }
> {
...two
}
> { ...three }
However, I would now like to pipe that output to a different file for each document it emits, ideally being able to reference the filename by the order of the args initially used: one two three.
So the outcome I am looking for would be something like the below.
pull one two three | > $1.json
>
> saved one.json
> saved two.json
> saved three.json
Is there any way to achieve this or something similar at all?
Update
I would just like to clarify how the program works and why it may not be ideal to loop through the arguments and execute the program once per argument.
Whenever pull gets executed, it performs two operations:
A: Expensive operation (takes a long time to resolve): this retrieves all documents available in a database, in which we can look up items by the argument names provided when invoking pull.
B: Operation specific to the provided argument: after A resolves, we use its response to retrieve each individual document.
This means that having A+B called once per argument wouldn't be ideal, as A is an expensive operation.
So instead of AB AB AB AB, I would like to have ABBBB.
You're doing it the hard way.
for f in one two three; do pull "$f" > "$f.json" & done
Unless something in the script is not compatible with multiple simultaneous copies, this will make the process faster as well. If it is, just change the & to ;.
Update
Try just always writing the individual files. If you also need to be able to send them to stdout, just cat the file afterwards, or use tee when writing it.
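For instance, a minimal sketch of the tee variant, where emit_document and fetch_document are hypothetical names standing in for whatever pull does per document:

# Inside pull: write the file and still stream the JSON to stdout.
emit_document() {
    fetch_document "$1" | tee "$1.json"
}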
If that's not ok, then you will need to clearly identify and parse the data blocks. For example, if the start of a section is THE ONLY place { appears as the first character on a line, that's a decent sentinel value. Split your output to files using that.
For example, throw this into another script:
awk 'NR==FNR { ndx=1; split($0,fn); name=""; next; } /^{/ { name=fn[ndx++]; } { if (length(name)) print $0 > (name ".json"); }' <( echo "$@" ) <( pull "$@" )
call that script with one two three and it should do what you want.
Explanation
awk '...' <( echo "$@" ) <( pull "$@" )
This executes two commands and returns their outputs as "files", streams of input for awk to process. The first just puts the list of arguments provided on one line for awk to load into an array. The second executes your pull script with those args, which provides the streaming output you already get.
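You can see the mechanism in isolation with a toy command; bash exposes each stream under a /dev/fd name:

awk '{ print FILENAME ": " $0 }' <( echo first ) <( echo second )
# prints something like:
# /dev/fd/63: first
# /dev/fd/62: second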
NR==FNR { ndx=1; split($0,fn); name=""; next; }
This tells awk to initialize a file-controlling index, read the single line from the echo command (the args), and split it into an array of the desired filename bases, then skip the rest of processing for that record (it isn't "data", it's metadata, and we're done with it). We initialize name to an empty string so that we can check its length later; otherwise the leading blank lines would end up in .json, which probably isn't what you want.
/^{/ { name=fn[ndx++]; }
This tells awk that each time it sees { as the very first character on a line, it should set the output filename base to the array entry at the current index (which we initialized to 1 above) and increment the index for the next time.
{ if (length(name)) print $0 > (name ".json"); }
This tells awk to print each line to a file named after the current filename base, with ".json" appended. The if (length(name)) guard throws away the leading blank line(s) before the first block of JSON.
The result is that each new set will trigger a new filename from your given arguments.
That work for you?
In Use
$: ls *.json
ls: cannot access '*.json': No such file or directory
$: pull one two three # my script to simulate output
{ ...one... }
{
...two...
}
{ ...three... }
$: splitstream one two three # the above command in a file to receive args
$: grep . one* two* three* # now they exist
one.json:{ ...one... }
two.json:{
two.json: ...two...
two.json:}
three.json:{ ...three... }

converting output from string to integer

I am trying to program a small tool which merges some files on a Unix server. I now have to merge 20 files into one. All of those files contain a header and trailer, which need to be removed, and the newly created file needs to get its own header and trailer. The header and trailer are a bit tricky for me to create (they have to be exactly 334 chars, none more, none less). I was able to create everything but the trailer. The special thing is that the trailer should contain the number of lines.
I have set up my small tool like this:
# loop to cat all 20 files (removing header and trailer)
# generate header from date
# execute wc -l on the generated file and add +1 (because the trailer is missing)
# append trailer with the wc -l result in it
I have tried several commands to add +1 to my line count, but none of them worked properly. This is what I have worked out up to now:
lineCount=echo more someFile.dat | wc -l
echo $lineCount
//echo "$((lineCount + 1))" -> 1
//echo "$(($lineCount + 1))" -> 1
//let "lineCount+=1" -> 1
//$lineCount=lineCount+1 -> won't work
//$lineCount=$lineCount+1 -> won't work
//x=$lineCount+1 -> won't work
This was the output of echo $lineCount (without any changes or anything); there seem to be two empty spaces before the number:
163108
My goal was that instead of 163108 the number should be 163109.
edit:
my input files look something like this:
HFFP20190 *
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
YYYYYYYYYYYYYYYYYYYYYXXXXXXXXXXXXXX YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
ABCDEFGHIJKLMNOPQWERSTUVWXYZ ASDFASDFASDFASDFASDFASDFASDFASDFASDFASD
TFFP2019000031795 *
where HFFP marks the header and TFFP the trailer. The main difference between header and trailer is the last number (31795), which contains the number of rows in the file. The output after this merge should be something like this:
HFFP20190 *
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
YYYYYYYYYYYYYYYYYYYYYXXXXXXXXXXXXXX YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
ABCDEFGHIJKLMNOPQWERSTUVWXYZ ASDFASDFASDFASDFASDFASDFASDFASDFASDFASD
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
YYYYYYYYYYYYYYYYYYYYYXXXXXXXXXXXXXX YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
ABCDEFGHIJKLMNOPQWERSTUVWXYZ ASDFASDFASDFASDFASDFASDFASDFASDFASDFASD
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
YYYYYYYYYYYYYYYYYYYYYXXXXXXXXXXXXXX YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
ABCDEFGHIJKLMNOPQWERSTUVWXYZ ASDFASDFASDFASDFASDFASDFASDFASDFASDFASD
...
TFFP2019000163109 *
You can try this:
# get only the lines from wc command
lineCount=$(wc -l someFile.dat | cut -d' ' -f1)
# add 1
((lineCount++))
echo "$lineCount"
Because
lineCount=echo
assigns the string "echo" to the variable. In full,
lineCount=echo more someFile.dat | wc -l
runs the command
more someFile.dat | wc -l
with output to standard output, not to your variable, while temporarily assigning lineCount="echo" for just the duration of this single command.
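You can verify that temporary-assignment behaviour in isolation; x here is just a throwaway variable:

$ x=hello sh -c 'echo "$x"'   # x is set only in the environment of this one command
hello
$ echo "x is '$x'"            # still unset in the calling shell
x is ''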
For the record, the syntax to capture standard output to the variable is
lineCount=$(wc -l <someFile.dat)
where I have factored out the useless more and the even more useless echo.
On the whole, a much better solution is probably to refactor all of this into an Awk script. You haven't described the header and footer logic in enough detail, but something like
awk '
# Skip the first line (header) in all files except the first
FNR==1 && NR>1 { next }
# Print and count, excluding trailers
!/^TFFP/ { print; c++ }
# Add back the last trailer with an updated count
# (no $ anchor: the sample trailer ends with spaces and a *)
END { sub(/000[1-9][0-9]*/, "000" 1+c); print }' someFiles*
The wildcard someFiles* will need to be replaced with something which actually matches your input files in the right order; perhaps *.dat?
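For completeness, the asker's original outline can also be written out in plain bash. This is only a sketch: the exact 334-character header/trailer layout, the input pattern input*.dat, and the nine-digit zero-padded count are assumptions read off the samples above.

#!/bin/bash
out=merged.dat   # hypothetical output name

# 1) Header generated from the date, padded to exactly 334 characters
#    (333 left-justified characters plus the trailing '*').
printf '%-333s*\n' "HFFP$(date +%Y)0" > "$out"

# 2) Body: concatenate the inputs, dropping each file's first (header)
#    and last (trailer) line.
for f in input*.dat; do
    sed '1d;$d' "$f" >> "$out"
done

# 3) Trailer: current line count plus 1 for the trailer itself,
#    zero-padded to nine digits as in the sample trailer.
lines=$(( $(wc -l < "$out") + 1 ))
printf '%-333s*\n' "TFFP$(date +%Y)$(printf '%09d' "$lines")" >> "$out"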

How to make and name multiple text files after using the cut command?

I have about 50 data text files that I need to remove several columns from.
I have been using the cut command to remove the columns and rename the files individually, but I will have many more of these files and need a way to do it at scale.
Currently I have been using:
cut -f1,6,7,8 filename.txt >> filename_Fixed.txt
And I am able to remove the columns from all the files using:
cut -f1,6,7,8 *.txt
But I'm only able to get all the output in the terminal, or to write it all to a single text file.
What I want is to edit several files using cut to remove the required columns:
filename1.txt
filename2.txt
filename3.txt
filename4.txt
.
.
.
And get the edited output to write to individual files:
filename_Fixed1.txt
filename_Fixed2.txt
filename_Fixed3.txt
filename_Fixed4.txt
.
.
.
But I haven't been able to find a way to write the output to new text files. I'm new to using the command line and not much of a coder, so maybe I don't know which terms to search for? I haven't even been able to find anything through Google searches that has helped me. It seems like it should be simple, but I am struggling.
In desperation, I did try this bit of code, knowing it wouldn't work:
cut -f1,6,7,8 *.txt >> ( FILENAME ".fixed" )
I found the portion after ">>" nested in an awk command that output multiple files.
I also tried (again, knowing it wouldn't work) to wildcard the output files, but got an ambiguous redirect error.
Did you try a for loop?
for f in *.txt ; do
cut -f 1,6,7,8 "$f" > $(basename "$f" .txt)_fixed.txt
done
(N.B. I can't test the basename right now; you can replace it with "${f}_fixed" if needed.)
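If basename is a concern, bash's own suffix-stripping parameter expansion does the same job; a minimal sketch:

for f in *.txt ; do
cut -f 1,6,7,8 "$f" > "${f%.txt}_fixed.txt"   # ${f%.txt} strips the extension
done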
You can also process it all in awk itself, which would make the process much more efficient, especially for large numbers of files. For example:
awk '
NF < 8 {
print "contains less than 8 fields: ", FILENAME
next
}
{ fn=FILENAME
idx=match(fn, /[0-9]+.*$/)
if (idx == 0) {
print "no numeric suffix for file: ", fn
next;
}
newfn=substr(fn,1,idx-1) "_Fixed" substr(fn,idx)
print $1,$6,$7,$8 > newfn
}
' *.txt
The script contains two rules (the expressions between {...}). The first:
NF < 8 {
print "contains less than 8 fields: ", FILENAME
next
}
simply checks that the line contains at least 8 fields (since you want field 8 as your last field). If a line contains fewer than 8 fields, it prints the warning and next skips to the next input line.
The second rule:
{ fn=FILENAME
idx=match(fn, /[0-9]+.*$/)
if (idx == 0) {
print "no numeric suffix for file: ", fn
next;
}
newfn=substr(fn,1,idx-1) "_Fixed" substr(fn,idx)
print $1,$6,$7,$8 > newfn
}
fn=FILENAME stores the current filename as fn to cut down on typing,
idx=match(fn, /[0-9]+.*$/) locates the index where the numeric suffix of the filename begins (e.g. where "3.txt" starts),
if (idx == 0) means a numeric suffix was not found, so warn and move on to the next record,
newfn=substr(fn,1,idx-1) "_Fixed" substr(fn,idx) forms the new filename from the non-numeric prefix (e.g. "filename"), adds "_Fixed" with string concatenation, and then adds the numeric suffix back, and finally
print $1,$6,$7,$8 > newfn prints fields (columns) 1, 6, 7 and 8, redirecting output to the new filename.
For more information on each of the string-functions used above, see the GNU awk User's Guide - 9.1.3 String-Manipulation Functions
If I understand what you were attempting, this should be able to handle as many files as you have, so long as each file has a numeric suffix to place "_Fixed" before in the filename and contains at least 8 fields (columns). You can just copy/middle-mouse-paste the entire command at the command line to test.

Shell Script not appending, instead it's erasing contents

My goal is to curl my newly created API with a list of usernames from a .txt file, receive the API response, save it to a .json, and then create a .csv at the end (to make it easier to read).
This is my script:
echo "$result" | jq 'del(.id, .Time, .username)' | jq '{andres:[.[]]}' > newresult
Input: sudo bash script.sh usernames.txt
Usernames.txt:
test1
test2
test3
test4
Result:
"id","username"
4,"test4"
Desired Result:
"id","username"
1,"test1"
2,"test2"
3,"test3"
4,"test4"
It creates the files as required and even saves the result. However, it only outputs one result. I can open the CSV/JSON while it's running and see it querying the different usernames, but when it starts another query, rather than appending everything to the same file, it deletes newresult, result.json and results.csv and creates new ones, meaning in the end I only end up with the result for one username rather than a list of, say, five. Can someone tell me what I'm doing wrong?
Thanks!
Use >> to append to file. Try:
: >results.csv
for ligne in $(seq 1 "$nbrlignes");
do
...
jq -r '
["id", "username"] as $headers
| $headers, (.andres[][] | [.[$headers[]]]) | @csv
' < result.json >> results.csv
done
By using > you overwrite the file each time the loop runs.
Also your script looks like it should be largely rewritten and simplified.
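A minimal illustration of the difference:

# '>' truncates the file on every iteration, so only the last write survives:
for i in 1 2 3; do echo "$i" > out.txt; done
cat out.txt    # prints just: 3

# '>>' appends each write instead:
: > out.txt    # start from an empty file
for i in 1 2 3; do echo "$i" >> out.txt; done
cat out.txt    # prints: 1 2 3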

Adding an extra value into CSV data, according to filename

Let's say I have the following type of filename format:
CO#ATH2000.dat, CO#MAR2000.dat
Each of these has data like the following:
....
"12-02-1984",3.8,4.1,3.8,3.8,3.8,3.7,4.1,4.3,3.8,4.1,5.0,4.8,4.5,4.3,4.3,4.3,4.1,4.5,4.3,4.3,4.3,4.5,4.3,4.1
"13-02-1984",3.7,4.3,4.3,4.3,4.1,4.3,4.5,4.8,4.8,5.0,5.2,5.0,5.2,5.2,5.2,4.8,4.8,4.8,4.8,4.8,4.8,4.8,4.5,4.3
"14-02-1984",3.8,4.1,3.8,3.8,3.8,3.8,3.8,4.2,4.5,4.5,4.1,3.6,3.6,3.4,3.4,3.2,3.4,3.2,3.2,3.2,2.9,2.7,2.5,2.2
"15-02-1984",2.2,2.2,2.0,2.0,2.0,1.8,2.1,2.6,2.6,2.5,2.4,2.4,2.4,2.5,2.7,2.7,2.6,2.6,2.7,2.6,2.8,2.8,2.8,2.8
..........
Now I also have the following .sh file that can merge ALL those .dat files into one single output .dat file.
for filename in CO#*; do
    cat "$filename" >> CO#combined.dat
done
Now here is the problem. Inside CO#combined.dat I want each line, before the start of the values, to have a 'standard' value according to the filename parameter. For example, I want each file with ATH in its filename to have 3, at the start of each line, and each file with MAR in its filename to have 22,.
So the CO#combined.dat should be something like this:
....
3,"12-02-1984",3.8,4.1,3.8,3.8,3.8,3.7,4.1,4.3,3.8,4.1,5.0,4.8,4.5,4.3,4.3,4.3,4.1,4.5,4.3,4.3,4.3,4.5,4.3,4.1
3,"13-02-1984",3.7,4.3,4.3,4.3,4.1,4.3,4.5,4.8,4.8,5.0,5.2,5.0,5.2,5.2,5.2,4.8,4.8,4.8,4.8,4.8,4.8,4.8,4.5,4.3
20,"14-02-1984",3.8,4.1,3.8,3.8,3.8,3.8,3.8,4.2,4.5,4.5,4.1,3.6,3.6,3.4,3.4,3.2,3.4,3.2,3.2,3.2,2.9,2.7,2.5,2.2
20,"15-02-1984",2.2,2.2,2.0,2.0,2.0,1.8,2.1,2.6,2.6,2.5,2.4,2.4,2.4,2.5,2.7,2.7,2.6,2.6,2.7,2.6,2.8,2.8,2.8,2.8
..........
So in conclusion I want the script to do the above procedure!
Thanks in advance!
With awk you can take advantage of the built-in FILENAME variable along with the fact that you can supply multiple files to a given invocation. awk processes each file in turn, setting FILENAME to the name of the file whose records are currently being read.
With that you can set your prefix according to whatever pattern you wish to search for in the file name. Finally you can print the prefix and the original record.
Here's a demonstration on simplified versions of your sample input:
$ cat CO\#ATH2000.dat
1
2
3
$ cat CO\#MAR2000.dat
A
B
C
$ awk 'FILENAME ~ /MAR/ {pre=22} FILENAME ~ /ATH/ {pre=3} { print pre "," $0 }' CO*.dat
3,1
3,2
3,3
22,A
22,B
22,C
It can also be done simply with a case statement and sed:
for f in CO#*; do
    case ${f:3:3} in
        ATH) k=3 ;;
        *) k=22 ;;
    esac
    sed "s/^/$k,/" "$f" >> all
done
${f:3:3} extracts the code (ATH or MAR) from the filename using bash's substring expansion; case converts the code to its numerical counterpart; sed inserts the numerical value and a comma at the beginning of each line.
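The substring expansion is easy to check on its own:

$ f='CO#ATH2000.dat'
$ echo "${f:3:3}"   # offset 3, length 3 (zero-indexed)
ATH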
