Writing a Bash script that takes a text file as input and pipes the text file through several commands - bash

I keep text files with definitions in a folder. I like to convert them to spoken word so I can listen to them. I already do this manually by running a few commands to insert some pre-processing codes into the text files and then convert the text to spoken word like so:
sed 's/\..*$/[[slnc 2000]]/' input.txt (inserts a control code after the first period)
sed 's/$/[[slnc 2000]]/' input.txt (inserts a control code at the end of each line)
cat input.txt | say -v Alex -o input.aiff
Instead of having to retype these each time, I would like to create a Bash script that pipes the output of these commands to the final product. I want to call the script with the script name, followed by an input file argument for the text file. I want to preserve the original text file so that if I open it again, none of the control codes are actually inserted, as the only purpose of the control codes is to insert pauses in the audio file.
I've tried writing
#!/bin/bash
FILE=$1
sed 's/$/ [[slnc 2000]]/' FILE -o FILE
But I get hung up immediately as it says sed: -o: No such file or directory. Can anyone help out?

If you just want to use foo.txt to generate foo.aiff with control characters, you can do:
#!/bin/sh
for file; do
    test "${file%.txt}" = "${file}" && continue
    sed -e 's/\..*$/[[slnc 2000]]/' "$file" |
        sed -e 's/$/[[slnc 2000]]/' |
        say -v Alex -o "${file%.txt}.aiff"
done
Call the script with your .txt files as arguments (e.g., ./myscript *.txt) and it will generate the .aiff files. Be warned: if say overwrites files, then this will as well. You don't really need two sed invocations, and the sed that you're calling can be cleaned up, but I don't want to distract from the core issue here, so I'm leaving that as you have it.
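For the curious, the two invocations could be folded into one like this (a sketch, same behaviour as above):
#!/bin/sh
for file; do
    test "${file%.txt}" = "${file}" && continue
    sed -e 's/\..*$/[[slnc 2000]]/' -e 's/$/[[slnc 2000]]/' "$file" |
        say -v Alex -o "${file%.txt}.aiff"
done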

This will:
a} Make a list of your text files to process in the current directory, with find.
b} Apply your sed commands to each text file in the list, but only for the current run, leaving the originals intact.
c} Call say with the edited text.
I don't have say, so I can't test that or the control codes; but as long as you have ed, the loop works. I've used it many times. I learned it as a result of exposure to FORTH, a language that still permits unterminated loops. I used to have problems remembering to invoke next at the end of the script in order to start it, but I got over that by defining my words (functions) first, FORTH style, and then always placing my single-use commands at the end.
#!/bin/sh
next() {
    [ -s stack ] && main
    end
}
main() {
    line=$(ed -s stack < edprint+.txt)
    infile=$(sed 's/\..*$/[[slnc 2000]]/' "${line}" | sed 's/$/[[slnc 2000]]/')
    # options must precede the text; name the output after the input file
    say -v Alex -o "${line%.txt}.aiff" "${infile}"
    ed -s stack < edpop+.txt
    next
}
end() {
    rm -v ./stack
    rm -v ./edprint+.txt
    rm -v ./edpop+.txt
    exit 0
}
find *.txt -type f > stack
cat > edprint+.txt << EOF
1
q
EOF
cat > edpop+.txt << EOF
1d
wq
EOF
next
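For comparison, the same stack-driven loop can be sketched with head and tail instead of ed (untested with say, and the per-file .aiff name is my assumption):
while [ -s stack ]; do
    line=$(head -n 1 stack)
    tail -n +2 stack > stack.tmp && mv stack.tmp stack
    sed 's/\..*$/[[slnc 2000]]/' "$line" | sed 's/$/[[slnc 2000]]/' |
        say -v Alex -o "${line%.txt}.aiff"
done
rm -f stack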

Related

Search text and append to each end of line of text file - OSX

I'm new to OSX command line tools.
I am trying to find a block of text in a file and append this text at the end of all lines in another text file. At run time I don't know what this text will be, I just know it will be located within "BEGINHMM" and "ENDHMM". Also, I don't know the makeup of the destination file, except that it will not be an empty text file.
The command which finds the block of text of interest is:
sed -n '/<BEGINHMM>/,/<ENDHMM>/p' proto
where "proto" is a text file containing the text of interest.
I've been trying to pipe the output of the above command to another 'sed' command, in the following manner:
xargs -I '{}' sed -i .bak 's/$/{}/' monophones0.txt
but I am getting some bizarre results; for example, I see the literal "{}" inserted in the text.
I've also tried piping to:
xargs -0 sed -i .bak 's/$/&/' monophones0.txt
but I just get the printout (similar to terminal echo) of the text I am trying to grab.
Ultimately I want to loop over several 'proto' files in multiple directories and copy the text between the "BEGINHMM", "ENDHMM" block in each directory, and append the selected text to that directory's monophones.txt lines.
I am running the commands in the terminal, bash, OSX 10.12.2
Any help would be appreciated.
(1) Your sed command is of the form sed -n '/A/,/B/p'; this will include the lines on which A and B occur, even if these strings do not appear at the beginning of the line. This form may have other surprises in store for you as well (what do you expect will happen if B is missing or repeated?), but the remainder of this post assumes that's what you want.
(2) It's not clear how you intend to specify the "proto" files, but you do indicate they might be in several directories, so for the remainder of this post, I'll assume they are listed, one per line, in a file named proto.txt in each directory. This will ensure that you don't run into any limitations on command-line length, but the following can easily be modified if you don't want to create such a file.
(3) Here is a script which will use the sed command you've mentioned to copy segments from each of the "proto" files specified in a directory to monophones0.txt in the directory in which the script is executed.
#!/bin/bash
OUT=monophones0.txt
cat proto.txt | while read -r file
do
    if [ -r "$file" ] ; then
        sed -n '/<BEGINHMM>/,/<ENDHMM>/p' "$file" >> "$OUT"
    elif [ -n "$file" ] ; then
        echo "NOT FOUND: $file" >&2
    fi
done
Just like what you did before. tmpfile=$(mktemp); sed -n '/<BEGINHMM>/,/<ENDHMM>/p' proto >$tmpfile; sed -i .bak "r $tmpfile" monophones0.txt; rm $tmpfile. This is the basic idea; there are other checks you need to perform to make this a robust script.
– 4ae1e1
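To extend that comment's idea to the multi-directory case the question describes, something along these lines might work (a sketch: the */ glob and the per-directory file names are assumptions, and sed -i .bak is the BSD/macOS form):
for d in */ ; do
    [ -r "${d}proto" ] || continue
    tmpfile=$(mktemp)
    sed -n '/<BEGINHMM>/,/<ENDHMM>/p' "${d}proto" > "$tmpfile"
    # with no address, the r command appends the block after every line
    sed -i .bak "r $tmpfile" "${d}monophones0.txt"
    rm "$tmpfile"
done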

how to make a winmerge equivalent in linux

My friend recently asked how to compare two folders in linux and then run meld against any text files that are different. I'm slowly catching on to the linux philosophy of piping many granular utilities together, and I put together the following solution. My question is, how could I improve this script. There seems to be quite a bit of redundancy and I'd appreciate learning better ways to script unix.
#!/bin/bash
dir1=$1
dir2=$2
# show files that are different only
cmd="diff -rq $dir1 $dir2"
eval $cmd # print this out to the user too
filenames_str=`$cmd`
# remove lines that represent only one file, keep lines that have
# files in both dirs, but are just different
tmp1=`echo "$filenames_str" | sed -n '/ differ$/p'`
# grab just the first filename for the lines of output
tmp2=`echo "$tmp1" | awk '{ print $2 }'`
# convert newlines sep to space
fs=$(echo "$tmp2")
# convert string to array
fa=($fs)
for file in "${fa[#]}"
do
# drop first directory in path to get relative filename
rel=`echo $file | sed "s#${dir1}/##"`
# determine the type of file
file_type=`file -i $file | awk '{print $2}' | awk -F"/" '{print $1}'`
# if it's a text file send it to meld
if [ $file_type == "text" ]
then
# throw out error messages with &> /dev/null
meld $dir1/$rel $dir2/$rel &> /dev/null
fi
done
please preserve/promote readability in your answers. An answer that is shorter but harder to understand won't qualify as an answer.
It's an old question, but let's work on it a bit just for fun, without thinking about the final goal (maybe SCM) or about tools that already do this better. Let's just focus on the script itself.
In the OP's script, there is a lot of string processing inside bash, using tools like sed and awk, sometimes more than once in the same command line or inside a loop executing n times (once per file).
That's OK, but it's necessary to remember that:
Each time the script calls one of those programs, a new process is created in the OS, and that is expensive in time and resources. So the fewer programs the script calls, the better its performance. In this script we have:
diff 2 times (1 just to print to user)
sed 1 time processing diff result and 1 time for each file
awk 1 time processing sed result and 2 times for each file (processing file result)
file 1 time for each file
That doesn't apply to echo, read, test and others that are builtin commands of bash, so no external program is executed.
meld is the final command that will display the files to user, so it doesn't count.
Even with builtin commands, a redirection pipeline | has a cost too, because the shell has to create pipes, duplicate handles, and maybe even fork copies of itself (each a process in its own right). So again: less is better.
The messages of the diff command are locale dependent, so if the system is not in English, the whole script won't work.
With that in mind, let's clean up the original script a bit, maintaining the OP's logic:
#!/bin/bash
dir1=$1
dir2=$2
# Set english as current language
LANG=en_US.UTF-8
# (1) show files that are different only
diff -rq $dir1 $dir2 |
# (2) remove lines that represent only one file, keep lines that have
# files in both dirs, but are just different, delete all but left filename
sed '/ differ$/!d; s/^Files //; s/ and .*//' |
# (3) determine the type of file
file -i -f - |
# (4) for each file
while IFS=":" read file file_type
do
# (5) drop first directory in path to get relative filename
rel=${file#$dir1}
# (6) if it's a text file send it to meld
if [[ "$file_type" =~ "text/" ]]
then
# throw out error messages with &> /dev/null
meld ${dir1}${rel} ${dir2}${rel} &> /dev/null
fi
done
A little explaining:
A single chain of commands cmd1 | cmd2 | ... where the output (stdout) of the previous one is the input (stdin) of the next one.
Execute sed just once to perform 3 operations (separated with ;) on the diff output:
Deleting lines not ending with " differ"
Deleting "Files " at the beginning of the remaining lines
Deleting from " and " to the end of the remaining lines
Execute the file command once to process the file list on stdin (option -f -)
Use the bash while statement to read two values separated by : from each line of stdin
Use bash variable substitution to extract filename from a variable
Use bash test to compare a file type with a regular expression
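As a quick illustration of those last two points (the values here are hypothetical):
file="dirA/sub/readme.txt"
echo "${file#dirA}"    # prints /sub/readme.txt (prefix stripped)
[[ "text/plain; charset=us-ascii" =~ "text/" ]] && echo "is text"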
For clarity, I didn't consider that file and directory names may have spaces. In such cases, both scripts will fail. To avoid that, it is necessary to enclose in double quotes any reference to a file/dir name variable.
I didn't use awk, because it is powerful enough that it could replace almost the entire script ;-)
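For instance, the sed stage of the pipeline above could be written in awk like this (just a sketch of the idea, not a full rewrite):
awk '/ differ$/ { sub(/^Files /, ""); sub(/ and .*/, ""); print }'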

BASH shell scripting file parsing [newbie]

I am trying to write a bash script that goes through a file line by line (ignoring the header), extracts a file name from the beginning of each line, and then finds a file by this name in one directory and moves it to another directory. I will be processing hundreds of these files in a loop and moving over a million individual files. A sample of the file is:
ImageFileName Left_Edge_Longitude Right_Edge_Longitude Top_Edge_Latitude Bottom_Edge_Latitude
21088_82092.jpg: -122.08007812500000 -122.07733154296875 41.33763821961143 41.33557596965434
21088_82093.jpg: -122.08007812500000 -122.07733154296875 41.33970040427444 41.33763821961143
21088_82094.jpg: -122.08007812500000 -122.07733154296875 41.34176252364274 41.33970040427444
I would like to ignore the first line and then grab 21088_82092.jpg as a variable. File names may not always be the same length, but they will always have the format digits_digits.jpg
Any help for an efficient approach is much appreciated.
This should get you started:
$ tail -n +2 input | cut -f 1 -d: | while read -r file; do test -f "$dir/$file" && mv -v "$dir/$file" "$destination"; done
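Spelled out as a script, with placeholders for the two directories (dir and destination are assumptions you would fill in):
#!/bin/bash
dir=/path/to/images         # placeholder: where the .jpg files live
destination=/path/to/dest   # placeholder: where they should go
tail -n +2 input | cut -f 1 -d: | while read -r file
do
    test -f "$dir/$file" && mv -v "$dir/$file" "$destination"
done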
You can construct a script that will do something like this, then simply run the script. The following command will give you a script which will copy the files from one place to another, but you can make the script generation more complex simply by changing the awk output:
pax:~$ cat qq.in
ImageFileName Left_Edge_Longitude Right_Edge_Longitude
21088_82092.jpg: -122.08007812500000 -122.07733154296875
21088_82093.jpg: -122.08007812500000 -122.07733154296875
21088_82094.jpg: -122.08007812500000 -122.07733154296875
pax:~$ awk -F: '/^[0-9]+_[0-9]+\.jpg:/ {
    printf "cp /srcdir/%s /dstdir\n",$1
} {}' qq.in
cp /srcdir/21088_82092.jpg /dstdir
cp /srcdir/21088_82093.jpg /dstdir
cp /srcdir/21088_82094.jpg /dstdir
You capture the output of that script (the last three lines) to another file; that file is then your script for doing the actual copies.
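For example, one way to capture and run it (the generated file name is arbitrary):
pax:~$ awk -F: '/^[0-9]+_[0-9]+\.jpg:/ {
    printf "cp /srcdir/%s /dstdir\n",$1
} {}' qq.in > docopy.sh
pax:~$ sh docopy.sh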

Merging, then splitting files

Using a for loop, I can merge all of the files in a directory that end with *.txt:
for filename in *.txt; do
cat "${filename}"
echo
done > output.txt
After doing this, I will run output.txt through various scripts, in which the text will be changed considerably. After that, I want to split the files, at the same places at which they were merged, into different files (output01.txt, output02.txt, etc.).
How can I split the files at the same place they were merged?
This cannot be based on line number, because the scripts will add \t in places.
I think a solution that might work is to place "#########" at the end of each of the initial *.txt files before merging them, but I don't know how to get BASH to split the files again at that mark.
Instead of that for loop for concatenating, you can just use cat *.txt.
Anyway, why don't you just perform the scripts on each file independently within the for loop?
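For example (a sketch; process_one stands in for whatever your scripts do):
for filename in *.txt; do
    process_one < "$filename" > "processed_${filename}"
done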
If you really want to combine and re-segregate, you can use:
for filename in *.txt; do
cat "${filename}"
echo "#####"
done > output.txt
# Pass output.txt through whatever
awk 'BEGIN { fileno = 1; file = sprintf("output%02d.txt", fileno) }
{
    if ($1 ~ /#####/) {
        fileno++
        file = sprintf("output%02d.txt", fileno)
        next
    }
    else
        print > file
}' output.txt
The canonical answer would be:
tar c *.txt > output.txt
You could split/unmerge them exactly by doing
tar xf output.txt # in the current directory
tar x -C /tmp/splitfiles/ -f output.txt
Now if you really want to do stuff like that in a loop and extract to stdout/a pipe, you could:
while read fname
do
    # extract named file to a pipe
    tar -xOf output.txt "$fname" | myprogram "$fname"
done < <(tar tf output.txt)
However, that would possibly not be very efficient. You could consider just doing
while read fname
do
    # handle extracted file
    myprogram "/tmp/splitfiles/$fname"
    unlink "/tmp/splitfiles/$fname" # drop the temp file
done < <(tar x -v -C /tmp/splitfiles/ -f output.txt)
This will be completely asynchronous (so if extraction or even the transmission of the archive is slow, the first files can already be processed while waiting for more data to arrive).
See also my other answer https://stackoverflow.com/a/8341221/85371 (look for the older answer part, since that question was changed to be very specific later)
As Fredrik wrote here you can use csplit to split your merged file.
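For instance, with GNU csplit and the "#####" markers from the earlier answer (a sketch; note the marker lines end up at the start of each later piece and would still need stripping):
csplit -z -f output -b '%02d.txt' output.txt '/^#####$/' '{*}'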

Unix: How can I prepend output to a file?

Specifically, I'm using a combination of >> and tee in a custom alias to store new Homebrew updates in a text file, as well as output on screen:
alias bu="echo `date "+%Y-%m-%d at %H:%M"` \
>> ~/Documents/Homebrew\ Updates.txt && \
brew update | tee -a ~/Documents/Homebrew\ Updates.txt"
Question: What if I wish to prepend this output in my textfile, i.e. placed at the beginning of the file as opposed to appending it to the end?
Edit 1: As someone reported in the answers below, the use of temp files might be a good approach, which at least helped me partially:
targetLog="~/Documents/Homebrew\ Updates.txt"
alias bu="(brew update | cat - $targetLog \
> /tmp/out1 && mv /tmp/out1 $targetLog \
&& echo `date "+%Y-%m-%d at %H:%M":%S` | \
cat - $targetLog > /tmp/out2 \
&& mv /tmp/out2 $targetLog)"
But the problem is the output to STDOUT (previously made possible by tee), which I'm not sure can be incorporated in this tempfile approach …?
sed will happily do that for you, using -i to edit in place, e.g.
sed -i -e "1i `date "+%Y-%m-%d at %H:%M"`" some_file
This works by creating an output file:
Let's say we have the initial contents on file.txt
echo "first line" > file.txt
echo "second line" >> file.txt
So, file.txt is our 'bottom' text file. Now prepend into a new 'output' file
echo "add new first line" | cat - file.txt > output.txt # <--- Just this command
Now, output has the contents the way we want. If you need your old name:
mv output.txt file.txt
cat file.txt
The only simple and safe way to modify an input file using bash tools is to use a temp file; e.g., sed -i uses a temp file behind the scenes (though to be robust, sed needs more).
Some of the methods used here have a subtle "can break things" trap: when, rather than running your command on the real data file, you run it on a symbolic link to the file you intend to modify. Unless catered for correctly, this can break the link and convert it into a real file which receives the mods, leaving the original real file without the intended mods and without the symlink (and with no error exit-code).
To avoid this with sed, you need to use the --follow-symlinks option.
For other methods, just be aware that it needs to follow symlinks (when you act on such a link)
Using a temp file and then rm'ing the temp file works only if "file" is not a symlink.
One safe way is to use sponge from package moreutils
Unlike a shell redirect, sponge soaks up all its input before opening the output file. This allows for constructing pipelines that read from and write to the same file.
sponge is a good general way to handle this type of situation.
Here is an example, using sponge
hbu=~/'Documents/Homebrew Updates.txt'
{ date "+%Y-%m-%d at %H:%M"; cat "$hbu"; } | sponge "$hbu"
Simplest way IMO would be to use echo and cat:
echo "Prepend" | cat - inputfile > outputfile
Or for your example basically replace the tee -a ~/Documents/Homebrew\ Updates.txt with cat - ~/Documents/Homebrew\ Updates.txt > ~/Documents/Homebrew\ Updates.txt
Edit: As stated by hasturkun this won't work, try:
echo "Prepend" | cat - file | tee file
But this is unreliable (tee truncates the file while cat may still be reading it), and it isn't an efficient way of doing it anyway...
Similar to the accepted answer; however, if you are coming here because you want to prepend to the first line, rather than prepend an entirely new line, then use this command:
sed -i "1 s/^/string_replacement/" some_file
The -i flag will do the replacement within the file (rather than creating a new file).
The 1 restricts the replacement to line 1.
Finally, the s command is used, which has the syntax s/find/replacement/flags.
In our case we don't need any flags. The ^ is called a caret, and it matches the very start of the line.
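A quick demonstration on a throwaway file (GNU sed shown; on BSD/macOS sed use -i '' instead):
$ printf 'one\ntwo\n' > demo.txt
$ sed -i "1 s/^/PREFIX: /" demo.txt
$ cat demo.txt
PREFIX: one
two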
Try this http://www.unix.com/shell-programming-scripting/42200-add-text-beginning-file.html
There is no direct operator or command, AFAIK. You use echo, cat, and mv to get the effect.
{ date; brew update | tee /dev/tty; cat updates.txt; } >updates.txt.new
mv updates.txt.new updates.txt
I've no idea why you want to do this. It's pretty standard that logs like this have later entries appearing, well, later in the file.
