Bash: iterative approach in place of process substitution not working as expected

Complete bash noob here. I had the following command (1.) and it worked as expected, but it seemed a bit naive for what I needed: essentially generating a wordlist from a messy, tab-delimited input file.
cat users.txt | tee >(cut -f 1 >> cut_out.txt) >(cut -f 2 >> cut_out.txt) >(cut -f 3 >> cut_out.txt) >(cut -f 4 >> cut_out.txt)
Output:
W Humphrey
SummersW
FoxxR
noreply
DaibaN
PeanutbutterM
PetersJ
DaviesJ
BlaireJ
GongoH
MurphyF
JeffersD
HorsemanB
...
I thought I could cut down on the ridiculous command above with the following (2.):
cat users.txt | for i in {1..4}; do cut -f $i >> cut_out.txt; done
Output:
HumphreyW
The command above only returned a single word from the list and some whitespace.
The solution (3.): I knew that I could get it working by simply looping the entire command instead. This did exactly what I wanted, but I still want to know why command (2.) above produced an almost empty file.
for i in {1..4}; do cat users.txt | cut -f $i >> cut_out.txt; done
I have a solution; I'm more after an explanation, because I'm still learning about I/O in bash. Cheers.

Just a remark
awk -F '[\t]' '{for(i = 1; i <= 4; i++) print $i}' users.txt > cut_out.txt
Is basically what your cat ... | tee >(cut ...) ... does.

If the order of the output is unimportant, and there are only four columns in the file, simply
tr '\t' '\n' <users.txt >cut_out.txt
If you only want the first four columns in any order,
cut -f1-4 users.txt |
tr '\t' '\n' >cut_out.txt
(Thanks to @KamilCuk for raising this in a comment.)
Otherwise your third attempt is basically fine, though you want to avoid the useless cat and redirect only once:
for i in {1..4}; do
cut -f "$i" users.txt
done > cut_out.txt
This is obviously less efficient than only reading the file once. If the file is small enough to fit into memory, you could write a simple Awk script to read it once and split it up into variables, and then write out these variables in the order you want.
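A minimal sketch of that read-once idea, assuming exactly four tab-separated columns and a file small enough that each column fits in memory (each column's values are written out in turn):
awk -F '\t' '
    { for (i = 1; i <= 4; i++) col[i] = col[i] $i "\n" }    # accumulate each column as one string
    END { for (i = 1; i <= 4; i++) printf "%s", col[i] }    # then write column 1, then 2, etc.
' users.txt > cut_out.txt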
The second attempt is wrong because cat only supplies a single instance of the data to the pipe, and the first iteration of the loop consumes it all.
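You can see the same effect in isolation (just an illustration): the first command reading from the pipe drains it, so the second one hits end-of-file immediately and reads nothing.
printf '1\n2\n3\n' | { wc -l; wc -l; }
3
0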

Related

In bash, how to ingest lines from a file, and then treat that set of lines as a file itself

In a bash script, is there a way to ingest specific lines from a file, and then treat that set of lines as a file itself? For example, let's say file input_data looks like
boring line 1 A
boring line 2 B
boring line 3 C
start of interesting block
line 1 A
line 2 B
line 3 C
end of interesting block
boring line 4 D
There's not much I can assume about the structure of the file except that 1) it will indeed have "start of interesting block" and "end of interesting block" and 2) it will generally be much larger and more complicated than this example. And let's say I want to do a great deal of processing in the "interesting block".
So I need to do something sort of like this:
interesting_lines=$(tail -n 6 input_file.txt | head -n 5)
process_1 interesting_lines #(Maybe process_1 is a "grep" or something more complicated.)
process_2 interesting_lines
etc.
Of course, that won't work, which is why I'm asking this question, but that's the idea that I need.
I guess one thing that would work would be
tail -n 6 input_file.txt | head -n 5 > tmpfile
process_1 tmpfile
process_2 tmpfile
etc.
rm tmpfile
but I am trying to avoid temporary files.
You can use process substitution.
interesting_lines=$(tail -n 6 input_file.txt | head -n 5)
process_1 <(printf '%s\n' "$interesting_lines")
process_2 <(printf '%s\n' "$interesting_lines")
Don't use head or tail. You can give a range with sed or awk:
process_1 <(sed -n '/start of interesting block/,/end of interesting block/p' input_file.txt)
process_2 <(awk '/start of interesting block/,/end of interesting block/' input_file.txt)
When you need better control over your boundaries, use awk more cleverly. The next solution only seems more verbose, but now you can add all kinds of conditions:
process_1 <(awk '/start of interesting block/ {interesting=1}
/end of interesting block/ {interesting=0}
interesting {print}
' input_file.txt)
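For example (a variant of the same idea, not from the original answer), simply reordering the rules drops the marker lines themselves from the output:
process_1 <(awk '/end of interesting block/ {interesting=0}
                 interesting {print}
                 /start of interesting block/ {interesting=1}
                ' input_file.txt)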
When you have to look in a very long file for only a few interesting lines, you can do
sed -n '/start of interesting block/,/end of interesting block/p' input_file.txt |
tee >(process_1) | process_2
Demo:
printf "%s\n" {1..10} | tee >(sed 's/4/xxx/p') | sed 's/^/ /'

Splitting and looping over live command output in Bash

I am archiving and using split to produce several parts while also printing the output files (from split on STDERR, which I am redirecting to STDOUT). However the loop over the output data doesn't happen until after the command returns.
Is there any way to actively loop over the STDOUT output of a command before it returns?
The following is what I currently have, but it only prints the list of filenames after the command returns:
export IFS=$'\n'
for line in `data_producing_command | split -d -b $CHUNK_SIZE --verbose - $ARCHIVE_PREFIX 2>&1`; do
FILENAME=`echo $line | awk '{ print $3 }'`
echo " - $FILENAME"
done
Try this:
data_producing_command | split -d -b $CHUNK_SIZE --verbose - $ARCHIVE_PREFIX 2>&1 | while read -r line
do
FILENAME=`echo $line | awk '{ print $3 }'`
echo " - $FILENAME"
done
Note however that any variables set in the while loop will not preserve their values after the loop (the while loop runs in a subshell).
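If you do need those variables after the loop, one common workaround (a sketch, not part of the original answer) is to feed the loop through process substitution so it runs in the current shell; bash 4.2+ also offers shopt -s lastpipe when job control is off.
while read -r line; do
    FILENAME=$(echo "$line" | awk '{ print $3 }')
    echo " - $FILENAME"
done < <(data_producing_command | split -d -b "$CHUNK_SIZE" --verbose - "$ARCHIVE_PREFIX" 2>&1)
echo "last file: $FILENAME"   # FILENAME is still visible here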
There's no reason for the for loop or the read or the echo. Just pipe the stream to awk:
... | split -d -b $CHUNK_SIZE --verbose - test 2>&1 |
awk '{printf " - %s\n", $3 }'
You are going to see some delay from buffering, but unless your system is very slow or you are very perceptive, you're not likely to notice it.
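If that delay does matter in practice, one common mitigation (an addition here, not something the answer above depends on) is to flush awk's output after each line; gawk, mawk and busybox awk all provide fflush() for this.
... | split -d -b $CHUNK_SIZE --verbose - $ARCHIVE_PREFIX 2>&1 |
awk '{ printf " - %s\n", $3; fflush() }'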
The command substitution needs¹ to run before the for loop can start.
for item in $(command which produces items); do ...
whereas a while read -r can start consuming output as soon as the first line is produced (or, more realistically, as soon as the output buffer is full):
command which produces items |
while read -r item; do ...
¹ Well, it doesn't absolutely need to, from a design point of view, I suppose, but that's how it currently works.
As William Pursell already noted, there is no particular reason to run Awk inside a while read loop, because that's something Awk does quite well on its own, actually.
command which produces items |
awk '{ print " - " $3 }'
Of course, with a reasonably recent GNU Coreutils split, you could simply do
split --filter='printf " - %s\n" "$FILE"; cat >"$FILE"' ... options
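Spelled out with the options from the question (a sketch, assuming GNU coreutils split; --filter runs the quoted command once per chunk, with $FILE set to that chunk's file name):
data_producing_command |
split -d -b "$CHUNK_SIZE" --filter='printf " - %s\n" "$FILE"; cat >"$FILE"' - "$ARCHIVE_PREFIX"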

Echo something while piping stdout

I know how to pipe stdout:
./myScript | grep 'important'
Example output of the above command:
Very important output.
Only important stuff here.
But while grepping, I would also like to echo something for each line so it looks like this:
1) Very important output.
2) Only important stuff here.
How can I do that?
Edit: Apparently, I haven't specified well enough what I want to do. Numbering the lines is just an example; I want to know in general how to add text (any text, including variables and whatnot) to pipe output. I see one can achieve that using awk, printing extra text around $0, which is the kind of solution I'm looking for.
Are there any other ways to achieve this?
This will number the hits starting from 1:
./myScript | grep 'important' | awk '{printf("%d) %s\n", NR, $0)}'
1) Very important output.
2) Only important stuff here.
This will give you the line number of the hit
./myScript | grep -n 'important'
3:Very important output.
47:Only important stuff here.
If you want line numbers on the new output running from 1..n where n is number of lines in the new output:
./myScript | awk '/important/{printf("%d) %s\n", ++i, $0)}'
# ^ Grep part ^ Number starting at 1
A solution with a while loop is not suited for large files, so you should only use this solution when you do not have a lot of important stuff:
i=0
while read -r line; do
    ((i++))
    printf "(%s) Look out: %s\n" "$i" "${line}"
done < <(./myScript | grep 'important')
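If all you want is to prepend text (fixed or a running number) to each line of a pipe, sed and nl are two more lightweight options (extra suggestions, not from the answers above):
./myScript | grep 'important' | sed 's/^/Look out: /'    # fixed prefix on every match
./myScript | grep 'important' | nl -w1 -s') '            # numbers the matches: 1) ..., 2) ...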

Optimize shell script for multiple sed replacements

I have a file containing a list of replacement pairs (about 100 of them) which are used by sed to replace strings in files.
The pairs go like:
old|new
tobereplaced|replacement
(stuffiwant).*(too)|\1\2
and my current code is:
cat replacement_list | while read i
do
old=$(echo "$i" | awk -F'|' '{print $1}') #due to the need for extended regex
new=$(echo "$i" | awk -F'|' '{print $2}')
sed -r "s/`echo "$old"`/`echo "$new"`/g" -i file
done
I cannot help but think that there is a more optimal way of performing the replacements. I tried turning the loop around to run through lines of the file first but that turned out to be much more expensive.
Are there any other ways of speeding up this script?
EDIT
Thanks for all the quick responses. Let me try out the various suggestions before choosing an answer.
One thing to clear up: I also need subexpressions/groups functionality. For example, one replacement I might need is:
([0-9])U|\10 #the extra brackets and escapes were required for my original code
Some details on the improvements (to be updated):
Method: processing time
Original script: 0.85s
cut instead of awk: 0.71s
anubhava's method: 0.18s
chthonicdaemon's method: 0.01s
You can use sed to produce correctly formatted sed input:
sed -e 's/^/s|/; s/$/|g/' replacement_list | sed -r -f - file
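For illustration, here is the intermediate sed script that the first command generates from the replacement_list shown in the question:
$ sed -e 's/^/s|/; s/$/|g/' replacement_list
s|old|new|g
s|tobereplaced|replacement|g
s|(stuffiwant).*(too)|\1\2|g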
I recently benchmarked various string replacement methods, among them a custom program, sed -e, perl -lnpe, and a probably not that widely known MySQL command-line utility, replace. Being optimized for string replacements, replace was almost an order of magnitude faster than sed. The results looked something like this (slowest first):
custom program > sed > LANG=C sed > perl > LANG=C perl > replace
If you want performance, use replace. To have it available on your system, you'll need to install some MySQL distribution, though.
From replace.c:
Replace strings in textfile
This program replaces strings in files or from stdin to stdout. It accepts a list of from-string/to-string pairs and replaces each occurrence of a from-string with the corresponding to-string. The first occurrence of a found string is matched. If there is more than one possibility for the string to replace, longer matches are preferred before shorter matches.
...
The programs make a DFA-state-machine of the strings and the speed isn't dependent on the count of replace-strings (only of the number of replaces). A line is assumed ending with \n or \0. There are no limit exept memory on length of strings.
More on sed. You can utilize multiple cores with sed, by splitting your replacements into #cpus groups and then pipe them through sed commands, something like this:
$ sed -e 's/A/B/g; ...' file.txt | \
sed -e 's/B/C/g; ...' | \
sed -e 's/C/D/g; ...' | \
sed -e 's/D/E/g; ...' > out
Also, if you use sed or perl and your system has a UTF-8 setup, then it also boosts performance to place LANG=C in front of the commands:
$ LANG=C sed ...
You can cut down on the unnecessary awk invocations and use Bash itself to break up the name-value pairs:
while IFS='|' read -r old new; do
# echo "$old :: $new"
sed -i "s~$old~$new~g" file
done < replacement_list
IFS='|' will enable read to populate the name and value into two different shell variables, old and new.
This is assuming ~ is not present in your name-value pairs. If that is not the case then feel free to use an alternate sed delimiter.
Here is what I would try:
store your sed search-replace pairs in a Bash array, as below;
build your sed command based on this array using parameter expansion;
run the command.
patterns=(
    old new
    tobereplaced replacement
)
pattern_count=${#patterns[*]}   # number of entries in the array
sedArgs=()                      # will hold the list of sed arguments
for (( i=0 ; i<$pattern_count ; i=i+2 )); do   # don't need to loop on the replacement…
    search=${patterns[i]}
    replace=${patterns[i+1]}                   # … here we get the replacement part
    sedArgs+=(-e "s/$search/$replace/g")       # append as array elements, not as one string
done
sed "${sedArgs[@]}" file
This results in this command:
sed -e s/old/new/g -e s/tobereplaced/replacement/g file
You can try this.
pattern=''
while read -r i
do
    old=$(echo "$i" | awk -F'|' '{print $1}')   # due to the need for extended regex
    new=$(echo "$i" | awk -F'|' '{print $2}')
    pattern=${pattern}"s/${old}/${new}/g;"
done < replacement_list   # redirect instead of cat | while, so $pattern survives the loop
sed -r "${pattern}" -i file
This will run the sed command only once on the file, with all the replacements. You may also want to replace awk with cut. cut may be more optimized than awk, though I am not sure about that.
old=`echo $i | cut -d"|" -f1`
new=`echo $i | cut -d"|" -f2`
You might want to do the whole thing in awk:
awk -F\| 'NR==FNR{old[++n]=$1;new[n]=$2;next}{for(i=1;i<=n;++i)gsub(old[i],new[i])}1' replacement_list file
Build up a list of old and new words from the first file. The next ensures that the rest of the script isn't run on the first file. For the second file, loop through the list of replacements and perform them each one by one. The 1 at the end means that the line is printed.
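A tiny worked example of that approach, with hypothetical file contents, just to show the mechanics:
$ cat replacement_list
old|new
tobereplaced|replacement
$ cat file
some old text that will tobereplaced
$ awk -F\| 'NR==FNR{old[++n]=$1;new[n]=$2;next}{for(i=1;i<=n;++i)gsub(old[i],new[i])}1' replacement_list file
some new text that will replacement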
{ cat replacement_list;echo "-End-"; cat YourFile; } | sed -n '1,/-End-/ s/$/³/;1h;1!H;$ {g
t again
:again
/^-End-³\n/ {s///;b done
}
s/^\([^|]*\)|\([^³]*\)³\(\n\)\(.*\)\1/\1|\2³\3\4\2/
t again
s/^[^³]*³\n//
t again
:done
p
}'
This one is more for fun, coded in sed. It may be worth timing, because it starts only one sed that works recursively.
It is written for POSIX sed (so use --posix with GNU sed).
Explanation:
Copy the replacement list in front of the file content, with delimiters (³ for lines and -End- for the list itself) to make the sed handling easier (it is hard to use \n inside a character class in POSIX sed).
Place all lines in the buffer (adding the line delimiter to the replacement-list lines, and -End- before the file content).
If the current line is -End-³, remove it and go to the final print.
Replace each occurrence of the first pattern (group 1) found in the text with the second pattern (group 2).
If something was replaced, restart (t again).
Otherwise remove the first line.
Restart the process (t again). t is needed because b does not reset the substitution flag, so the next t would otherwise always be true.
Thanks to @miku above;
I have a 100MB file with a list of 80k replacement strings.
I tried various combinations of seds running sequentially or in parallel, but didn't see estimated runtimes getting shorter than about 20 hours.
Instead I put my list into a sequence of scripts like "cat in | replace aold anew bold bnew cold cnew ... > out ; rm in ; mv out in".
I randomly picked 1000 replacements per file, so it all went like this:
# first, split my replace-list into manageable chunks (89 files in this case)
split -a 4 -l 1000 80kReplacePairs rep_
# next, make a 'replace' script out of each chunk
for F in rep_* ; do \
echo "create and make executable a scriptfile" ; \
echo '#!/bin/sh' > run_$F.sh ; chmod +x run_$F.sh ; \
echo "for each chunk-file line, strip line-ends," ; \
echo "then with sed, turn '{long list}' into 'cat in | {long list}' > out" ; \
cat $F | tr '\n' ' ' | sed 's/^/cat in | replace /;s/$/ > out/' >> run_$F.sh ;
echo "and append commands to switch in and out files, for next script" ; \
echo -e " && \\\\ \nrm in && mv out in\n" >> run_$F.sh ; \
done
# put all the replace-scripts in sequence into a main script
ls ./run_rep_aa* > allrun.sh
# make it executable
chmod +x allrun.sh
# run it
nohup ./allrun.sh &
.. which ran in under 5 mins, a lot less than 20 hours !
Looking back, I could have used more pairs per script, by finding how many lines would make up the limit.
xargs --show-limits </dev/null 2>&1 | grep --color=always "actually use:"
Maximum length of command we could actually use: 2090490
So just under 2MB; how many pairs would that be for my script ?
head -c 2090490 80kReplacePairs | wc -l
76923
So it seems I could have used 2 * 40000-line chunks
To expand on chthonicdaemon's solution:
live demo
#! /bin/sh
# build regex from text file
REGEX_FILE=some-patch.regex.diff
# test
# set these with "export key=val"
SOME_VAR_NAME=hello
ANOTHER_VAR_NAME=world
escape_b() {
echo "$1" | sed 's,/,\\/,g'
}
regex="$(
(echo; cat "$REGEX_FILE"; echo) \
| perl -p -0 -e '
s/\n#[^\n]*/\n/g;
s/\(\(SOME_VAR_NAME\)\)/'"$(escape_b "$SOME_VAR_NAME")"'/g;
s/\(\(ANOTHER_VAR_NAME\)\)/'"$(escape_b "$ANOTHER_VAR_NAME")"'/g;
s/([^\n])\//\1\\\//g;
s/\n-([^\n]+)\n\+([^\n]*)(?:\n\/([^\n]+))?\n/s\/\1\/\2\/\3;\n/g;
'
)"
echo "regex:"; echo "$regex" # debug
exec perl -00 -p -i -e "$regex" "$@"
prefixing lines with -+/ allows empty "plus" values, and protects leading whitespace from buggy text editors
sample input: some-patch.regex.diff
# file format is similar to diff/patch
# this is a comment
# replace all "a/a" with "b/b"
-a/a
+b/b
/g
-a1|a2
+b1|b2
/sg
# this is another comment
-(a1).*(a2)
+b\1b\2b
-a\na\na
+b
-a1-((SOME_VAR_NAME))-a2
+b1-((ANOTHER_VAR_NAME))-b2
sample output
s/a\/a/b\/b/g;
s/a1|a2/b1|b2/;;
s/(a1).*(a2)/b\1b\2b/;
s/a\na\na/b/;
s/a1-hello-a2/b1-world-b2/;
this regex format is compatible with sed and perl
since miku mentioned mysql replace:
replacing fixed strings with regex is non-trivial,
since you must escape all regex chars,
but you also must handle backslash escapes ...
naive escaper:
echo '\(\n' | perl -p -e 's/([.+*?()\[\]])/\\\1/g'
\\(\n

Passing input to sed, and sed info to a string

I have a list of files (~1000), one file per line, in a text file named 'files.txt'.
I have a macro that looks something like the following:
#!/bin/sh
b=$(sed "${1}q;d" files.txt)
cat > MyMacro_${1}.C << +EOF
myFile = new TFile("/MYPATHNAME/$b");
+EOF
and I use this input script by doing
./MakeMacro.sh 1
and later I want to do
./MakeMacro.sh 2
./MakeMacro.sh 3
...etc
So that it reads the n'th line of my files.txt and feeds that string to my created .C macro.
So that it reads the n'th line of my files.txt and feeds that string to my created .C macro.
Given this statement and your tags, I'm going to answer using shell tools and not really address the issue of the .c macro.
The first line of your script contains a sed script. There are numerous ways to get the Nth line from a text file. The simplest might be to use head and tail.
$ head -n "${i}" files.txt | tail -n 1
This takes the first $i lines of files.txt, and shows you the last line of that set.
$ sed -ne "${i}p" files.txt
This use of sed uses -n to avoid printing by default, then prints the $ith line. For better performance, try:
$ sed -ne "${i}{p;q;}" files.txt
This does the same, but quits after printing the line, so that sed doesn't bother traversing the rest of the file.
$ awk -v i="$i" 'NR==i' files.txt
This passes the shell variable $i into awk, then evaluates an expression that tests whether the number of records processed is the same as that variable. If the expression evaluates true, awk prints the line. For better performance, try:
$ awk -v i="$i" 'NR==i{print;exit}' files.txt
Like the second sed script above, this will quit after printing the line, so as to avoid traversing the rest of the file.
Plenty of ways you could do this by loading the file into an array as well, but those ways would take more memory and perform less well. I'd use one-liners if you can. :)
To take any of these one-liners and put it into your script, you already have the notation:
if expr "$i" : '[0-9][0-9]*$' >/dev/null; then
b=$(sed -ne "${i}{p;q;}" files.txt)
else
echo "ERROR: invalid line number" >&2; exit 1
fi
If I am understanding you correctly, you can do a for loop in bash to call the script multiple times with different arguments.
for i in `seq 1 n`; do ./MakeMacro.sh $i; done
Based on the OP's comment, it seems that he wants to submit the generated files to Condor. You can modify the loop above to include the condor submission.
for i in `seq 1 n`; do ./MakeMacro.sh $i; condor_submit <OutputFile> ; done
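If n is simply the number of lines in files.txt, you can derive it rather than hard-code it (a small variation on the loop above):
n=$(wc -l < files.txt)
for i in `seq 1 $n`; do ./MakeMacro.sh $i; done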
i=0
while read file
do
((i++))
cat > MyMacro_${i}.C <<-EOF
myFile = new TFile("$file");
EOF
done < files.txt
Beware: you need tab indents on the EOF line.
I'm puzzled about why this is the way you want to do the job. You could have your C++ code read files.txt at runtime and it would likely be more efficient in most ways.
If you want to get the Nth line of files.txt into MyMacro_N.C, then:
{
echo
sed -n -e "${1}{s/.*/myFile = new TFile(\"&\");/p;q;}" files.txt
echo
} > MyMacro_${1}.C
Good grief. The entire script should just be (untested):
awk -v nr="$1" 'NR==nr{printf "\nmyFile = new TFile(\"/MYPATHNAME/%s\");\n\n",$0 > ("MyMacro_"nr".C")}' files.txt
You can throw in a ;exit before the } if performance is an issue but I doubt if it will be.
