How do I write an awk print command in a loop? - shell

I would like to write a loop that, for each input file, creates an output file containing its first column.
So I wrote
for i in $(\ls -d /home/*paired.isoforms.results)
do
awk -F"\t" {print $1}' $i > $i.transcript_ids.txt
done
As an example if there were 5 files in the home directory named
A_paired.isoforms.results
B_paired.isoforms.results
C_paired.isoforms.results
D_paired.isoforms.results
E_paired.isoforms.results
I would like to print the first column of each of these files into a separate output file, i.e. I would like to have 5 output files called
A.transcript_ids.txt
B.transcript_ids.txt
C.transcript_ids.txt
D.transcript_ids.txt
E.transcript_ids.txt
or any other name as long as it is 5 different names and I can still link them back to the original files.
I understand that there is a problem with the double use of $ in both the awk command and the loop, but I don't know how to change that.
Is it possible to write a command like this in a loop?

This should do the job:
for file in /home/*paired.isoforms.results
do
base=${file##*/}
base=${base%%_*}
awk -F"\t" '{print $1}' $file > $base.transcript_ids.txt
done
I assume that there can be spaces in the first field, since you set the delimiter explicitly to tab. This runs awk once per file. There are ways to do it running awk once for all files, but I'm not convinced the benefit is significant. You could consider using cut instead of awk '{print $1}', too. Note that using ls as you did is less satisfactory than using globbing directly; it falls afoul of file names with oddball characters (spaces, tabs, etc.) in the name.
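For example, a minimal sketch of the cut variant (cut splits on tab by default, matching the -F"\t" above):
for file in /home/*paired.isoforms.results
do
base=${file##*/}
# cut -f1 selects the first tab-separated field
cut -f1 "$file" > "${base%%_*}.transcript_ids.txt"
done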

You can do that entirely in awk:
awk -F"\t" '{split(FILENAME,a,"_"); out=a[1]".transcript_ids.txt"; print $1 > out}' *_paired.isoforms.results
If your input files don't have names as indicated in the question, you'd have to split on something else (as well as use a different pattern to match the input files).
My original answer is actually doing extra name resolution every time something is printed. Here's a version that only updates the output filename when FILENAME changes:
awk -F"\t" 'FILENAME!=lf{split(FILENAME,a,"_"); out=a[1]".transcript_ids.txt"; lf=FILENAME} {print $1 > out}' *_paired.isoforms.results

Related

Join two files with AWK, one file from console [duplicate]

I was wondering: how do I get awk to take a string from pipe output as well as from a file?
I basically have a chain of commands that eventually spits out a string. I want to check this string against a CSV file (columns separated by commas): find the first row whose 7th column matches the string, and print the contents of the 5th column of that line. Also, I don't know Linux command-line utilities/awk too well, so feel free to suggest completely different methods. :)
CSV file contents look like this:
col1,col2,col3,col4,col5,etc...
col1,col2,col3,col4,col5,etc...
etc...
My general line of thought:
(rest of commands that will give a string) | awk -F ',' 'if($5 == string){print $7;exit}' filename.txt
Can this be done? If so, how do I tell awk to compare against that string?
I've found some stuff about using a - symbol with ARGV[] before the filename, but couldn't get it working.
As Karoly suggests,
str=$( rest of commands that will give a string )
awk -v s="$str" -F, '$7==s {print $5; exit}' file
If you want to feed awk with a pipe:
cmds | awk -F, 'NR==FNR {str=$0; next}; $7==str {print $5}' - file
I think the first option is more readable.
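A quick sanity check of the first option with toy data (the file name and contents here are made up for illustration):
$ printf 'c1,c2,c3,c4,value,c6,needle\n' > file
$ str=needle
$ awk -v s="$str" -F, '$7==s {print $5; exit}' file
value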

List of extensions of filenames in bash script in one line

I currently have the following line of code:
ls /some/dir/prefix* | sed -e 's/prefix.//' | tr '\n' ' '
Which does achieve what I want it to do:
Get list of files starting with prefix
Remove path and prefix from each string
Remove newlines and replace with spaces for later processing.
For example:
/some/dir/prefix.hello
/some/dir/prefix.world
Should become
hello world
But I feel like there's a nicer way of doing this. Is there a better way to do this in one line?
Here is a two-liner using just built-ins that does it:
fnames=(some/dir/prefix*)
echo "${fnames[#]##*.}"
And here's how this works:
fnames=(some/dir/prefix*) creates an array with all the files starting with prefix and avoids all the problems that come with parsing ls
echo "${fnames[#]##*.}" is a combination of two parameter expansions: ${fnames[#]} prints all array elements, and the ##*. part removes the longest match of anything that ends with . from each array element, leaving just the extension
If you're hell-bent on a one-liner, just join the two commands with &&.
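For instance, the joined form would be:
fnames=(some/dir/prefix*) && echo "${fnames[@]##*.}"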
Passing ls output to external programs is not recommended; the following bash solution may help you here.
for file in prefix*; do echo ${file##*.}; done
And here is the same solution in non-one-liner form:
for file in prefix*
do
echo ${file##*.}
done
Here is a very simple awk one-liner to achieve this:
awk -F. '{$0=FILENAME; printf $NF" "; nextfile}' /some/dir/prefix*
It essentially does the following:
-F.: sets the field separator FS to a period (.). This way $NF represents the extension.
$0=FILENAME: ignores the current record and sets $0 to FILENAME, which is then re-split into fields.
printf $NF" "; nextfile: prints the extension (plus a trailing space) and goes on to the next file.
The problem with this is that awk still reads one record from the current file first; if that file is empty, this will fail.
To make this work with empty files, you could use the gawk extension BEGINFILE:
awk -F. 'BEGINFILE{$0=FILENAME; printf $NF" "; nextfile}' /some/dir/prefix*
Or you can loop over all the arguments:
awk -F. 'BEGIN{for(i=1; i<ARGC; i++){$0=ARGV[i]; printf $NF" "}; exit}' /some/dir/prefix*
One approach with awk:
ls /some/dir/prefix* | awk -F"." '{printf "%s ", $2} END {print ""}'
It might qualify as "nicer" because the output is piped through only one command.
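One caveat: $2 assumes exactly one dot per path. If the names can contain more dots (e.g. prefix.tar.gz) and you only want the last component, a sketch using $NF instead:
ls /some/dir/prefix* | awk -F"." '{printf "%s ", $NF} END {print ""}'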

Find string in col 1, print col 2 in awk

I'm on a Mac, and I want to find a field in a CSV file adjacent to a search string.
This is going to be a single file with a hard path; here's a sample of it:
84:a5:7e:6c:a6:b0, AP-ATC-151g84
84:a5:7e:6c:a6:b1, AP-A88-131g84
84:a5:7e:73:10:32, AP-AG7-133g56
84:a5:7e:73:10:30, AP-ADC-152g81
84:a5:7e:73:10:31, AP-D78-152e80
so if my search string is "84:a5:7e:73:10:32"
I want to get returned "AP-AG7-133g56"
I had been working within an AppleScript, but maybe a shell script will do.
I just need the proper syntax for opening the file and having awk search it. Again, I'm weak conceptually on how shell commands run, how they must be executed, etc.
This errors out, giving me "command not found":
set the_file to "/Users/Paw/Desktop/AP-Decoder 3.app/Contents/Resources/BSSIDtable.csv"
set the_val to "70:56:81:cb:a2:dc"
do shell script "'awk $1 ~ the_val {print $2} the_file'"
Thank you for coddling me...
This is relatively simple:
awk '$1 == "70:56:81:cb:a2:dc," {print "The answer is "$2}' 'BSSIDtable.csv'
(the "The answer is " text can be omitted if you only wish to see only the data, but this shows you how to get more user-friendly output if desired).
The comma is included since awk uses white space for separators so the comma becomes part of column 1.
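Alternatively, a sketch that makes the comma (and any following blanks) part of the separator, assuming every line uses a comma plus optional spaces between the two fields:
awk -F', *' '$1 == "84:a5:7e:73:10:32" {print $2}' BSSIDtable.csv
With the sample data above, this prints AP-AG7-133g56.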
If the thing you're looking for is in a shell variable, you can use -v to provide that to awk as an awk variable:
lookfor="70:56:81:cb:a2:dc,"
awk -v mac="$lookfor" '$1 == mac {print "The answer is "$2}' 'BSSIDtable.csv'
As an aside, your AppleScript solution is probably not working because the $1/$2 are being interpreted as shell variables rather than awk variables. If you insist on using AppleScript, you will have to figure out how to construct a shell command that quotes the awk commands correctly.
My advice is to just use the shell directly, the number of people proficient in that almost certainly far outnumber those proficient in AppleScript :-)
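If you do go the plain-shell route, a hypothetical wrapper script (the name lookup.sh is mine; the path is from your question) keeps all the quoting in one place:
#!/bin/sh
# usage: ./lookup.sh 70:56:81:cb:a2:dc
# the trailing comma is appended so the comparison matches awk's whitespace-split $1
awk -v mac="$1," '$1 == mac {print $2}' '/Users/Paw/Desktop/AP-Decoder 3.app/Contents/Resources/BSSIDtable.csv'
Note the single quotes around the path: it contains a space, so it must be quoted.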
If sed is available (it normally is on a Mac, even if not tagged in the OP):
Simple, but reads the whole file:
sed -n 's/84:a5:7e:73:10:32,[[:blank:]]*//p' YourFile
Quit after the first occurrence (so on average 50% faster on a huge file):
sed -n -e '/84:a5:7e:73:10:32,[[:blank:]]*/!b' -e 's///p;q' YourFile
awk
awk '/^84:a5:7e:73:10:32/ {print $2}' YourFile
# OR using a variable for batch interaction (append the comma, since with the
# default whitespace separator it remains part of $1)
awk -v Src='84:a5:7e:73:10:32' '$1 == Src"," {print $2}' YourFile
# OR if the case is unknown (tolower is portable; gawk's IGNORECASE does not affect == tests)
awk -v Src='84:a5:7e:73:10:32' 'tolower($1) == tolower(Src)"," {print $2}' YourFile
By default a bare regex is tested against the whole record ($0); the leading ^ anchors it to the start of the line, which here amounts to matching the first field.
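With the sample data above saved in YourFile, the variable form behaves like this:
$ awk -v Src='84:a5:7e:73:10:32' '$1 == Src"," {print $2}' YourFile
AP-AG7-133g56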

awk overwriting files in a loop

I am trying to look through a set of files. There are 4-5 files for each month in a 2 year period with 1000+ stations in them. I am trying to separate them so that I have one file per station_no (station_no = $1).
I thought this was easy and simply went with;
awk -F, '{ print > $1".txt" }' *.csv
which I've tested with one file and it works fine. However, when I run this it creates the .txt files, but there is nothing in the files.
I've now tried to put it in a loop and see if that works;
#!/bin/bash
#program to extract stations from orig files
for file in $(ls *.csv)
do
awk -F, '{print > $1".txt" }' $file
done
It works as it loops through the files etc., but it keeps overwriting the output files when it moves to the next month.
How do I stop it overwriting and just adding to the end of the .txt with that name?
You are saying print > file, which truncates on every new call. Use >> instead, so that it appends to the previous content.
Also, there is no need to loop through all the files and then call awk for each one. Instead, provide the set of files to awk like this:
awk -F, '{print >> ($1".txt")}' *.csv
Note, however, that we need to talk a little about how awk keeps files opened for writing. If you say awk '{print > "hello.txt"}' file, awk opens hello.txt once (truncating it) and keeps it open until it finishes processing. In your loop a fresh awk starts for every input file, so the output files are truncated on each iteration; in my suggested approach they stay open until the last input file has been processed. Thus, in this case a single > suffices:
awk -F, '{print > $1".txt"}' *.csv
For the detail on the parenthesized ($1".txt"), see the comments below by Ed Morton; I cannot explain it better than he does :)
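One more caveat: with 1000+ stations, some awk implementations run into a limit on simultaneously open files. A sketch that trades speed for safety by closing each output after every write (the >> now matters, since files are reopened):
awk -F, '{out = $1".txt"; print >> out; close(out)}' *.csv
Since this always appends, remove any stale .txt files before rerunning.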

How do I print a field from a pipe-separated file?

I have a file with fields separated by pipe characters and I want to print only the second field. This attempt fails:
$ cat file | awk -F| '{print $2}'
awk: syntax error near line 1
awk: bailing out near line 1
bash: {print $2}: command not found
Is there a way to do this?
Or just use one command:
cut -d '|' -f FIELDNUMBER
The key point here is that the pipe character (|) must be hidden from the shell. Use "\|" or "'|'" to protect it from shell interpretation and allow it to be passed to awk on the command line.
Reading the comments I see that the original poster presents a simplified version of the original problem which involved filtering file before selecting and printing the fields. A pass through grep was used and the result piped into awk for field selection. That accounts for the wholly unnecessary cat file that appears in the question (it replaces the grep <pattern> file).
Fine, that will work. However, awk is largely a pattern matching tool on its own, and can be trusted to find and work on the matching lines without needing to invoke grep. Use something like:
awk -F\| '/<pattern>/{print $2;}{next;}' file
The /<pattern>/ bit tells awk to perform the action that follows on lines that match <pattern>.
The lost-looking {next;} is a default action skipping to the next line in the input. It does not seem to be necessary, but I have this habit from long ago...
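A quick demonstration with toy data:
$ printf 'a|b|c\nfoo|bar|baz\n' > file
$ awk -F\| '/foo/{print $2;}{next;}' file
bar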
The pipe character needs to be escaped so that the shell doesn't interpret it. A simple solution:
$ awk -F\| '{print $2}' file
Another choice would be to quote the character:
$ awk -F'|' '{print $2}' file
Another way using awk
awk 'BEGIN { FS = "|" } ; { print $2 }'
Note that as given this command names no input file, so it reads from standard input and prints nothing when run on its own. You should either use cat file to pipe the data in, or simply list the file after the awk program.
