bash for with awk command inside - bash

I have this piece of code in a bash script:
for file in "$(ls | grep .*.c)"
do
cat $file |awk '/.*open/{print $0}'|awk -v nomeprog=$file 'BEGIN{FS="(";printf "the file e %s with the open call:", nameprog}//{ print $2}'
done
This gives me this error:
awk: cmd. line:1: file.c
awk: cmd. line:1: ^ syntax error
I get this error when there is more than one .c file in the folder; with just one file it works.

Overall, you should probably follow Charles Duffy's recommendation to use more appropriate tools for the task. But I'd like to go over why the current script isn't working and how to fix it, as a learning exercise.
Also, two quick recommendations for shell script checking & troubleshooting: run your scripts through shellcheck.net to point out common mistakes, and when debugging put set -x before the problem section (and set +x after), so the shell will print out what it thinks is going on as the script runs.
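For example, a minimal sketch of the set -x approach (wrap it around whatever part of the script is misbehaving):
set -x                              # from here on, the shell prints each command as it expands and runs it
for file in *.c; do
    printf 'processing %s\n' "$file"
done
set +x                              # stop tracing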
The problem is due to how you're using the file variable. Let's look at what this does:
for file in "$(ls | grep .*.c)"
First, ls prints a list of files in the current directory, one per line. ls is really intended for interactive use, and its output can be ambiguous and hard to parse correctly; in a script, there are almost always better ways to get lists of filenames (and I'll show you one in a bit).
The output of ls is piped to grep .*.c, which is wrong in a number of ways. First, since that pattern contains a wildcard character ("*"), the shell will try to expand it into a list of matching filenames. If the directory contains any hidden (leading-".") .c files, it'll replace the pattern with a list of those, and nothing is going to work right at all. Always quote the pattern argument to grep to prevent this.
But the pattern itself (".*.c") is also wrong; it searches for any number of arbitrary characters (".*"), followed by a single arbitrary character ("." -- this is in a regex, so "." is not treated literally), followed by a "c". And it searches for this anywhere in the line, so any filename that contains a "c" somewhere other than the first position will match. The pattern you want would be something like '[.]c$' (note that I wrapped it in single-quotes, so the shell won't try to treat $ as a variable reference like it would in double-quotes).
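To make the difference concrete, here's a small demonstration (feeding grep a few hypothetical names directly, so shell globbing isn't a factor):
$ printf '%s\n' main.c notes.txt script.sh | grep '.*.c'
main.c
script.sh
$ printf '%s\n' main.c notes.txt script.sh | grep '[.]c$'
main.c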
Then there's another problem, which is (part of) the problem you're actually experiencing: the output of that ls | grep is expanded in double-quotes. The double-quotes around it tell the shell not to do its usual word-split-and-wildcard-expand thing on the result. The common (but still wrong) thing to do here is to leave off the double-quotes, because word-splitting will probably break the list of filenames up into individual filenames, so you can iterate over them one-by-one. (Unless any filenames contain funny characters, in which case it can give weird results.) But with double-quotes it doesn't split them, it just treats the whole thing as one big item, so your loop runs once with file set to "src1.c\nsrc2.c\nsrc3.c" (where the \n's represent actual newlines).
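You can see that single-big-item behavior directly (a sketch, assuming a directory containing just src1.c, src2.c and src3.c):
$ for file in "$(ls)"; do echo "one iteration, file=$file"; done
one iteration, file=src1.c
src2.c
src3.c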
This is the sort of trouble you can get into by parsing ls. Don't do it, just use a shell wildcard directly:
for file in *.c
This is much simpler, avoids all the confusion about regex pattern syntax vs wildcard pattern syntax, ambiguity in ls's output, etc. It's simple, clear, and it just works.
That's probably enough to get it to work for you, but there are a couple of other things you really should fix if you're doing something like this. First, you should double-quote variable references (i.e. use "$file" instead of just $file). This is another part of the error you're getting; look at the second awk command:
awk -v nomeprog=$file 'BEGIN{FS="(";printf "the file e %s with the open call:", nameprog}//{ print $2}'
With file set to "src1.c\nsrc2.c\nsrc3.c", the shell will do its word-split-and-wildcard-expand thing on it, giving:
awk -v nomeprog=src1.c src2.c src3.c 'BEGIN{FS="(";printf "the file e %s with the open call:", nameprog}//{ print $2}'
awk will thus set its nomeprog variable to "src1.c", and then try to run "src2.c" as an awk program (on input files named "src3.c" and "BEGIN{FS=..."). "src2.c" is, of course, not a valid awk program, so you get a syntax error.
This sort of confusion is typical of the chaos that can result from unquoted variable references. Double-quote your variable references.
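A tiny self-contained illustration of what the double-quotes change (not from the original script, just for reference):
val=$'src1.c\nsrc2.c'   # a value containing an embedded newline
echo $val               # unquoted: word-split, prints "src1.c src2.c" on one line
echo "$val"             # quoted: passed as one argument, prints the two names on separate lines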
The other thing, which is much less important, is that you have a useless use of cat. Anytime you have the pattern:
cat somefile | somecommand
(and it's just a single file, not several that need to be catenated together), you should just use:
somecommand <somefile
and in some cases like awk and grep, the command itself can take input filename(s) directly as arguments, so you can just use:
somecommand somefile
so in your case, rather than
cat "$file" | awk '/.*open/{print $0}' | awk -v nomeprog="$file" 'BEGIN{FS="(";printf "the file e %s with the open call:", nameprog}//{ print $2}'
I'd just use:
awk '/.*open/{print $0}' "$file" | awk -v nomeprog="$file" 'BEGIN{FS="(";printf "the file e %s with the open call:", nameprog}//{ print $2}'
(Although, as Charles Duffy pointed out, even that can be simplified quite a lot.)
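For the record, one possible single-command version along those lines (a sketch, not necessarily what Charles Duffy had in mind): let awk read all the .c files itself and report the current filename via its built-in FILENAME variable.
awk -F'(' '/open/ { printf "the file %s with the open call: %s\n", FILENAME, $2 }' *.c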

Related

Can I do a Bash wildcard expansion (*) on an entire pipeline of commands?

I am using Linux. I have a directory of many files, and I want to use grep, tail and wildcard expansion * in tandem to print the last occurrence of <pattern> in each file:
Input: <some command>
Expected Output:
<last occurrence of pattern in file 1>
<last occurrence of pattern in file 2>
...
<last occurrence of pattern in file N>
What I am trying now is grep "pattern" * | tail -n 1, but the output contains only one line, which is the last occurrence of pattern in the last file. I assume the reason is that the * wildcard expansion happens before pipelining of commands, so tail runs only once.
Does there exist some Bash syntax so that I can achieve the expected outcome, i.e. let tail run for each file?
I know I can always use a for-loop to solve the problem. I'm just curious if the problem can be solved with a more condensed command.
I've also tried grep -m1 "pattern" <(tac *), and it seems like the aforementioned reasoning still applies: wildcard expansion applies only to the immediate command it is associated with, and the "outer" command runs only once.
Wildcards are expanded on the command line before any command runs. For example if you have files foo and bar in your directory and run grep pattern * | tail -n1 then bash transforms this into grep pattern foo bar | tail -n1 and runs that. Since there's only one stream of output from grep, there's only one stream of input to tail and it prints the last line of that stream.
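You can see the expansion for yourself by sticking echo in front of the command (here assuming the directory contains just foo and bar):
$ echo grep pattern *
grep pattern bar foo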
If you want to search each file and print the last line of grep's output separately you can use a loop:
for file in * ; do
grep pattern "${file}" | tail -n1
done
The problem with non-loop solutions is that tail doesn't inherently know where the output of one file ends and the output of another file begins, or indeed that there are even files involved on the other end of the pipe. It just knows input is coming in from somewhere and it has to print the last line of that input. If you didn't want a loop, you'd have to use a more powerful tool like awk and perhaps use the fact that grep prepends the names of matched files (when multiple files are matched, or with -H) to delimit the start and end of output from each file. But the work of writing an awk program that keeps track of the current file, so it knows when that file's output ends and can print its last line, is probably more effort than it's worth when the loop solution is so simple.
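For completeness, here's roughly what that non-loop idea could look like (a sketch only; it assumes no filename contains a colon, since grep's filename: prefix is what delimits each file's output):
grep -H "pattern" * | awk -F: '
    NR > 1 && $1 != prev { print last }              # filename changed: emit the previous file's last match
    { prev = $1; sub(/^[^:]*:/, ""); last = $0 }     # remember the file and the match text (prefix stripped)
    END { if (NR) print last }                       # emit the final file's last match
'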
You can achieve what you want using xargs. For your example it would be:
ls * | xargs -n 1 sh -c 'grep "pattern" "$0" | tail -n 1'
Can save you from having to write a loop.
You can do this with awk, although (as tjm3772 pointed out in their answer) it's actually more complicated than the shell for loop. For the record, here's what I came up with:
awk -v pattern="YourPatternHere" '(FNR==1 && line!="") {print line; line=""}; $0~pattern {line=$0}; END {if (line!="") print line}' *
Explanation: when it finds a matching line ($0~pattern), it stores that line in the line variable ({line=$0}); this means that at the end of each file, line will hold that file's last matching line.
(Note: if you want to just include a literal pattern in the program, remove the -v pattern="YourPatternHere" part and replace $0~pattern with just /YourPatternHere/)
There's no simple trigger to print a match at the end of each file, so that part's split into two pieces: if it's the first line of a file AND line is set because of a match in the previous file ((FNR==1 && line!="")), print line and then clear it so it's not mistaken for a match in the current file ({print line; line=""}). Finally, at the end of the final file (END), print a match found in that last file if there was one ({if (line!="") print line}).
Also, note that the print-at-beginning-of-new-file test must be before the check for a matching line, or else it'll get very confused if the first line of the new file matches.
So... yeah, a shell for loop is simpler (and much easier to get right).

How do I trim whitespace, but not newlines, from subshell output in bash?

There are many tens, maybe a hundred or more previous questions that seem "identical" to this already here, but after extensive search, I found NOTHING that even came close to working - though I did learn quite a lot - and so I decided to just RTFM and figure this out on my own.
The Problem
I wanted to search the output of a ps auxwww command to find processes of interest, and the issue was that I can't just simply use cut to find the exact data from them that I wanted. ps, it turns out, tries to columnate the output, adding either extra spaces or tabs that get in the way of using cut to get the correct data.
So, since I'm not a master at bash, I did a search... The answers I found were all focused on either variables - a "backup strategy" from my point of view that itself didn't solve the whole problem - or they only trimmed leading or trailing space or all "whitespace" including newlines. NOPE, Won't Work For Cut! And, neither will removing trailing newlines and so forth.
So, restated, the question is, how do we efficiently end up with the white space defined as simply a single space between other characters without eliminating newlines?
Below, I will give my answer, but I welcome others to give theirs - who knows, maybe someone has a better answer?!
Answer:
At least MY answer - please leave your own, too! - was to do this:
ps auxwww | grep <program> | tr -s [:blank:] | cut -d ' ' -f <field_of_interest>
This worked great!
Obviously, there are many ways to adapt this to other needs.
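For instance (a hypothetical adaptation, not part of the original answer), to pull just the PID of a firefox process with the same kind of pipeline, using the common [f]irefox trick so grep doesn't match its own process:
ps auxwww | grep '[f]irefox' | tr -s '[:blank:]' ' ' | cut -d ' ' -f 2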
As an alternative to all of the pipes and grep with cut, you could simply use awk. The benefit of using awk with the default field separator (FS), which breaks on whitespace, is that it treats any amount of whitespace between fields as a single separator.
So using awk does away with needing tr -s to "squeeze" whitespace to define fields. Further, awk gives far greater control over field matching using regular expressions, rather than having to rely on grep of a full line and cut to locate a pre-determined field number. (Though to some extent you will still have to tell awk which field of the ps output you are interested in.)
Using bash, you can also eliminate the pipe | by using process substitution to send the output of ps auxwww to awk on stdin using redirection, e.g. awk ... < <(ps auxwww) for a single tidy command line.
To get your "program" and "field_of_interest" into awk you have two options. You can initialize awk variables using the -v var=value option (there can be multiple -v options given), or you can use the BEGIN rule to initialize the variables. The only differences are that with -v you can provide a shell variable for value and no whitespace is allowed around the = sign, while within BEGIN any whitespace is ignored.
So in your case a couple of examples to get the virtual memory size for firefox processes, you could use:
awk -v prog="firefox" -v fnum="5" '
$11 ~ prog {print $fnum}
' < <(ps auxwww)
(above if you had myprog=firefox as a shell variable, you could use -v prog="$myprog" to initialize the prog variable for awk)
or using the BEGIN rule, you could do:
awk 'BEGIN {prog = "firefox"; fnum = "5"}
$11 ~ prog {print $fnum }
' < <(ps auxwww)
In each command above, it looks at the COMMAND field from ps (field 11), checks whether it contains firefox, and if so outputs field no. 5, the virtual memory size used by that process.
Both work fine as one-liners as well, e.g.
awk -v prog="firefox" -v fnum="5" '$11 ~ prog {print $fnum}' < <(ps auxwww)
Don't get me wrong, the pipeline is perfectly fine, it will just be slow. For short commands with limited output there won't be much difference, but when the output is large, awk will provide orders of magnitude improvement over having to tr and grep and cut reading over the same records three times.
The reason is that each pipe, and the process on each side of it, requires a separate process to be spawned by the shell, so minimizing their use improves the efficiency of your script. If the data and the processes are small, there isn't much of a difference. However, if you are reading a 3G file three times over, the difference is orders of magnitude: hours versus minutes or seconds.
I had to use single quotes on CentOS Linux to get tr working as described above:
ps -o ppid= $$ | tr -d '[:space:]'
You can reduce the number of pipes using this Perl one-liner, which uses Perl regexes instead of a separate grep process. This combines grep, tr and cut in a single command, with an easy way to manipulate the output (@F is the array of fields, 0-indexed):
Examples:
# Start an example process to provide the input for `ps` in the next commands:
/Applications/Emacs.app/Contents/MacOS/Emacs-x86_64-10_14 --geometry 109x65 /tmp/foo &
# Print single space-delimited output of `ps` for all emacs processes:
ps auxwww | perl -lane 'print "@F" if $F[10] =~ /emacs/i'
# Prints:
# bar 72144 0.0 0.5 4610272 82320 s006 SN 11:15AM 0:01.31 /Applications/Emacs.app/Contents/MacOS/Emacs-x86_64-10_14 --geometry 109x65 /tmp/foo
# Print emacs PID and file name opened with emacs:
ps auxwww | perl -lane 'print join "\t", @F[1, -1] if $F[10] =~ /emacs/i'
# Prints:
# 72144 /tmp/foo
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)

Bash comparing values

I'm getting the size of a file from a remote webserver and saving the result to a var called remote. I get this using:
remote=`curl -sI $FILE | grep -i Length | awk '/Content/{print $(NF-0)}'`
Once I've downloaded the file I'm getting the local files size with:
local=`stat --print="%s" $file`
If I echo remote and local they contain the same value.
I'm trying to run an if statement for this
if [ "$local" -ne "$remote" ]; then
But it always shows the error message, and never advises they match.
Can someone advise what I'm doing wrong.
Thanks
curl's output uses the network format for text, meaning that lines are terminated by a carriage return followed by linefeed; unix tools (like the shell) expect lines to end with just linefeed, so they treat the CR as part of the content of the line, and often get confused. In this case, what's happening is that the remote variable is getting the content length and a CR, which isn't valid in a numeric expression, hence errors. There are many ways to strip the CR, but in this case it's probably easiest to have awk do it along with the field extraction:
remote=$(curl -sI "$remotefile" | grep -i Length | awk '/Content/{sub("\r","",$NF); print $NF}')
BTW, I also took the liberty of replacing backticks with $( ) -- this is easier to read, and doesn't have some oddities with escapes that backticks have, so it's the preferred syntax for capturing command output. Oh, and (NF-0) is equivalent to just NF, so I simplified that. As @Jason pointed out in a comment, it's safest to use lower- or mixed-case for variable names, and put double-quotes around references to them, so I did that by changing $FILE to "$remotefile". You should do the same with the local filename variable.
You could also drop the grep command and have awk search for /^Content-Length:/ to simplify it even further.
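A sketch of that grep-less version (keeping the case-insensitive matching that grep -i provided):
remote=$(curl -sI "$remotefile" | awk 'tolower($1) == "content-length:" {sub("\r", "", $2); print $2}')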

Removing duplicate entries from files on the basis of substring postfixes

Let's say that I have the following text in a file:
foo.bar.baz
bar.baz
123.foo.bar.baz
pqr.abc.def
xyz.abc.def
abc.def.ghi.jkl
def.ghi.jkl
How would I remove duplicates from the file, on the basis of postfixes? The expected output without duplicates would be:
bar.baz
pqr.abc.def
xyz.abc.def
def.ghi.jkl
(Consider foo.bar.baz and bar.baz. The latter is a substring postfix of the former, so only bar.baz remains. However, neither of pqr.abc.def and xyz.abc.def is a substring postfix of the other, so both remain.)
Try this:
#!/bin/bash
INPUT_FILE="$1"
in="$(cat "$INPUT_FILE")"
out="$in"
for line in $in; do
out=$(echo "$out" | grep -v "\.$line\$")
done
echo "$out"
You need to save it to a script (e.g. bashor.sh), make it executable (chmod +x bashor.sh) and call it with your input file as the first argument:
./bashor.sh path/to/input.txt
Use sed to escape each line for use in a regular expression, prefix it with . and suffix it with $, and pipe the resulting pattern list into GNU grep (-f - doesn't work with BSD grep, e.g. on a Mac).
sed 's/[^-A-Za-z0-9_]/\\&/g; s/^/./; s/$/$/' test.txt |grep -vf - test.txt
I just used the regular-expression escaping from another answer and didn't think hard about whether it is exactly right. At first sight it seems fine; it escapes more characters than strictly necessary, but that shouldn't cause problems.
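For reference, this is roughly the pattern list the sed command generates for the sample input above (one regex per input line; the leading unescaped . requires at least one character before the suffix, which is what keeps each line from filtering itself out):
.foo\.bar\.baz$
.bar\.baz$
.123\.foo\.bar\.baz$
.pqr\.abc\.def$
.xyz\.abc\.def$
.abc\.def\.ghi\.jkl$
.def\.ghi\.jkl$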

File names stacking during for loop

I am new to shell script, so this might be a dumb question. I haven't found an answer online though. I am taking a coworkers script and changing it so that it works for my data. Right now I am running a test that only uses three of my data files. The code hits a spot in the script where it goes through a for loop and it is supposed to run through the loop once for each of the different files (three times).
listtumor=`cat /Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun/MatchedTumorTest.txt`
for i in $listtumor
do
lst=`ls /Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun/freshstart/${i}*.txt | awk -F'/' '{print $9}'`
MatchedTumorTest.txt just contains the three different file names that I am using for the test, without '.txt'. As far as I can tell, this code should just run through the loop three times, once for each file. Instead I am getting this error:
ls: /Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun/freshstart/TCGA-04-1514-01A-01D-0500-02_S01_CGH_105_Dec08\rTCGA-04-1530-01A-02D-0500-02_S01_CGH_105_Dec08\rTCGA-04-1542-01A-01D-0500-02_S01_CGH_105_Dec08*.txt: No such file or directory
For some reason all of the file names are stacked on top of each other instead of the loop going to each one individually. Any ideas why this is happening?
Thanks,
T.J.
It looks like the lines in your text file may be separated by carriage returns instead of newlines. Since none of the file names in your example have spaces, the for loop should work just fine if you initialize your listtumor like this:
listtumor=`tr '\r' '\n' < /Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun/MatchedTumorTest.txt`
The tr command will translate the carriage returns into newlines, which is what most text-processing tools (like the shell's own for loop) expect, and write the result to standard output.
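If you want to confirm that diagnosis first, a quick check (with your own path filled in) is:
file /Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun/MatchedTumorTest.txt
# typically reports something like "ASCII text, with CR line terminators" for carriage-return-separated files
od -c /Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun/MatchedTumorTest.txt | head
# shows \r wherever a carriage return appears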
The for loop doesn't do too well with some kinds of separators. Try this instead:
while read line; do
lst=`ls /Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun/freshstart/${line}*.txt | awk -F'/' '{print $9}'`
...
done < /Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun/MatchedTumorTest.txt
I'm assuming here that you're separating MatchedTumorTest.txt with newlines.
So combined all together:
dir="/Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun"
file="$dir/MatchedTumorTest.txt"
< "$file" tr '\r' '\n' | while read tumor
do
ls "$dir/freshstart" | grep "$tumor.*\.txt$"
done
will print all .txt file names in the directory $dir/freshstart that contain a name from the file MatchedTumorTest.txt.
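One small robustness note: IFS= read -r keeps read from interpreting backslashes or trimming whitespace in the names, so a slightly safer version of that loop is:
< "$file" tr '\r' '\n' | while IFS= read -r tumor
do
    ls "$dir/freshstart" | grep "$tumor.*\.txt$"
done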

Resources