Paste columns side by side on many files using awk - bash

I've been struggling with this problem for quite a while (please note I'm not a really good bash coder, let alone awk).
I have about 10000 files, each formatted the same way (quite heavy as well, about 3 MB). I would like to take the 3rd column of each file and paste the columns side by side into a new file.
I found many solutions using paste, awk, or cut, but none of them worked when working with wildcards. For instance,
paste <(awk '{print $3}' file1 ) <(awk '{print $3}' file2 ) <(awk '{print $3}' file3) > output
would work great if I only had 3 files, but I won't type that for 10000 of them. So I gave it a try with wildcards:
paste <(awk '{print $3}' file* ) > output
And it does extract the 3rd columns, but concatenated one after another rather than side by side. I tried some other codes, but eventually always end up with the same result. Is there a way to paste them side by side using wildcards?
Thank you very much for your help!
Baptiste G.
EDIT 1: With the help of schorsch312, I found a solution that works
for me. Instead of getting the columns and pasting them side by side,
I print each column as a line and append them one after the other:
for i in files*; do
    awk '{printf "%s ", $3} END {print ""}' "$i" >> output
done
It works, but 1/ it's quite slow, and 2/ it's not exactly what I asked
for in the title, as my output file is the "transpose". It doesn't really
matter to me because it's only floats and I can transpose it later
with python if needed.

I know that you said awk alone, but I don't know how to do it that way. Here is a simple bash script which does what you want to do.
# do a loop over all your files
for i in file*; do
    # use awk to get the 3rd column of each file and save the output
    awk '{print $3}' "$i" > "row_$i"
done
# now paste your rows together.
paste row_* > output
# cleanup
rm row_*

Related

Join two files with AWK, one file from console [duplicate]

I was wondering how do I get awk to take a string from the pipe output and a file?
I basically have a chain of commands that eventually spits out a string. I want to check this string against a csv file (columns separated by commas). Then, I want to find the first row in the file that contains the string in the 7th column and print out the contents of the 5th column of that line. Also, I don't know linux command line utilities/awk too well, so feel free to suggest completely different methods. :)
CSV file contents look like this:
col1,col2,col3,col4,col5,etc...
col1,col2,col3,col4,col5,etc...
etc...
My general line of thought:
(rest of commands that will give a string) | awk -F ',' 'if($5 == string){print $7;exit}' filename.txt
Can this be done? If so, how do I tell awk to compare against that string?
I've found some stuff about using a - symbol with ARGV[] before the filename, but couldn't get it working.
As Karoly suggests,
str=$( rest of commands that will give a string )
awk -v s="$str" -F, '$7==s {print $5; exit}' file
If you want to feed awk with a pipe:
cmds | awk -F, 'NR==FNR {str=$0; next}; $7==str {print $5}' - file
I think the first option is more readable.
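A quick end-to-end check of the -v approach, with a hypothetical file name and made-up sample rows:

```shell
# build a tiny 7-column csv (hypothetical data), then look up the row
# whose 7th column matches $str and print its 5th column
printf 'a,b,c,d,v1,f,k1\na,b,c,d,v2,f,k2\n' > lookup.csv
str=k2
awk -v s="$str" -F, '$7==s {print $5; exit}' lookup.csv
# prints: v2
```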

awk overwriting files in a loop

I am trying to look through a set of files. There are 4-5 files for each month in a 2 year period with 1000+ stations in them. I am trying to separate them so that I have one file per station_no (station_no = $1).
I thought this was easy and simply went with;
awk -F, '{ print > $1".txt" }' *.csv
which I've tested with one file and it works fine. However, when I run this it creates the .txt files, but there is nothing in the files.
I've now tried to put it in a loop and see if that works;
#!/bin/bash
#program to extract stations from orig files
for file in $(ls *.csv)
do
awk -F, '{print > $1".txt" }' $file
done
It works as it loops through the files etc, but it keeps overwriting the output files when it moves to the next month.
How do I stop it overwriting and just adding to the end of the .txt with that name?
You are saying print > file, which truncates on every new call. Use >> instead, so that it appends to the previous content.
Also, there is no need to loop through all the files and then call awk for each one. Instead, provide the set of files to awk like this:
awk -F, '{print >> ($1".txt")}' *.csv
Note, however, that we need to talk a little about how awk keeps files open for writing. If you say awk '{print > "hello.txt"}' file, awk keeps hello.txt open until it finishes processing. In your loop, a fresh awk starts for every input file, so > truncates what the previous run wrote; in my suggested approach, awk runs once and each output file stays open until the last input file is processed. Thus, in this case a single > suffices:
awk -F, '{print > $1".txt"}' *.csv
For the detail on why the parentheses around ($1".txt") matter, see the comments below by Ed Morton, I cannot explain it better than him :)
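One caveat worth knowing: if there are more distinct stations than your awk allows simultaneously open files (gawk raises the limit transparently; many other awks don't), you can append and close each output file explicitly. A sketch:

```shell
# append to each station's file and close it immediately, so only one
# output file is ever open at a time
awk -F, '{ out = $1 ".txt"; print >> out; close(out) }' *.csv
```

This trades some speed (reopening files) for portability across awk implementations.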

Move a column using sed

My Awk script generates this output:
1396.0893854748604 jdbc:mysql 192.168.0.8:3306/ycsb 3
I need to put the final column at the start, but do not wish to swap its position with the first. I need to do this using sed, or another pipe that is not awk.
I have tried variants of this command, but with no luck. My output just stays the same.
sed 's#\(.*\),\(.*\),\(.*\)#\4,\1,\2,\3#g'
Just for clarity my desired output would look like this:
3 1396.0893854748604 jdbc:mysql 192.168.0.8:3306/ycsb
You should use awk for this. It's totes better:
awk '{print $4, $1, $2, $3}' yourfilename
Updated: Oh right... Now I see that you require not using awk again... that's a weird requirement. Leaving this here because it's an otherwise outstanding answer...
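Since awk is ruled out, here is a sed sketch, assuming whitespace-separated fields (note the original attempt used commas, but the sample line is space-separated): capture everything before the last field and the last field itself, then swap them.

```shell
echo '1396.0893854748604 jdbc:mysql 192.168.0.8:3306/ycsb 3' |
  sed -E 's/^(.*) ([^ ]+)$/\2 \1/'
# prints: 3 1396.0893854748604 jdbc:mysql 192.168.0.8:3306/ycsb
```

The greedy `(.*)` grabs as much as possible, so `([^ ]+)$` is left matching exactly the final field.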

How do I write an awk print command in a loop?

I would like to write a loop creating various output files with the first column of each input file, respectively.
So I wrote
for i in $(\ls -d /home/*paired.isoforms.results)
do
awk -F"\t" '{print $1}' $i > $i.transcript_ids.txt
done
As an example if there were 5 files in the home directory named
A_paired.isoforms.results
B_paired.isoforms.results
C_paired.isoforms.results
D_paired.isoforms.results
E_paired.isoforms.results
I would like to print the first column of each of these files into a separate output file, i.e. I would like to have 5 output files called
A.transcript_ids.txt
B.transcript_ids.txt
C.transcript_ids.txt
D.transcript_ids.txt
E.transcript_ids.txt
or any other name as long as it is 5 different names and I can still link them back to the original files.
I understand, that there is a problem with the double usage of $ in both the awk and the loop command, but I don't know how to change that.
Is it possible to write a command like this in a loop?
This should do the job:
for file in /home/*paired.isoforms.results
do
base=${file##*/}
base=${base%%_*}
awk -F"\t" '{print $1}' $file > $base.transcript_ids.txt
done
I assume that there can be spaces in the first field since you set the delimiter explicitly to tab. This runs awk once per file. There are ways to do it running awk once for all files, but I'm not convinced the benefit is significant. You could consider using cut instead of awk '{print $1}', too. Note that using ls as you did is less satisfactory than using globbing directly; it falls foul of file names with oddball characters (spaces, tabs, etc.) in the name.
You can do that entirely in awk:
awk -F"\t" '{split(FILENAME,a,"_"); out=a[1]".transcript_ids.txt"; print $1 > out}' *_paired.isoforms.results
If your input files don't have names as indicated in the question, you'd have to split on something else ( as well as use a different pattern match for the input files ).
My original answer is actually doing extra name resolution every time something is printed. Here's a version that only updates the output filename when FILENAME changes:
awk -F"\t" 'FILENAME!=lf{split(FILENAME,a,"_"); out=a[1]".transcript_ids.txt"; lf=FILENAME} {print $1 > out}' *_paired.isoforms.results
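A quick check of the FILENAME-based variant with made-up sample files matching the naming pattern from the question:

```shell
# two tiny tab-separated input files (hypothetical contents)
printf 'id1\tx\nid2\ty\n' > A_paired.isoforms.results
printf 'id3\tz\n'         > B_paired.isoforms.results
awk -F"\t" 'FILENAME!=lf{split(FILENAME,a,"_"); out=a[1]".transcript_ids.txt"; lf=FILENAME} {print $1 > out}' *_paired.isoforms.results
# A.transcript_ids.txt now holds id1 and id2; B.transcript_ids.txt holds id3
```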

prevent duplicate variable and print using awk statement

I am iterating through a file and printing a set of values using awk:
echo $value | awk '{print $4}' >> 'some location'
The command works fine, but I want to prevent duplicate values from being stored in the file.
Thanks in advance.
Instead of processing the file line by line, you should use a single awk command for the entire file
For example:
awk '!a[$4]++{print $4}' file >> 'some location'
This will keep only the unique values of the fourth column.
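A minimal check of the dedup idiom: the first time a given 4th field is seen, a[$4]++ evaluates to 0 (falsy), so !a[$4]++ is true and the field prints; on repeats the counter is already nonzero and the line is suppressed.

```shell
printf '1 2 3 x\n1 2 3 y\n1 2 3 x\n' | awk '!a[$4]++ {print $4}'
# prints:
# x
# y
```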
Using only one instance of awk as suggested by user000001 is certainly the right thing to do, and since very little detail is given in the question this is pure speculation, but the simplest solution may be a trivial refactor of your loop. For example, if the current code is:
while ...; do
...
echo $value | awk ...
...
done
You can simply change it to:
while ...; do
...
echo $value >&5
...
done 5>&1 | awk '!a[$4]++{print $4}' >> /p/a/t/h
Note that although this is a "simple" fix in terms of code to change, it is almost certainly not the correct fix! Removing the while loop completely and just using awk is the right thing to do.
