awk overwriting files in a loop - bash

I am trying to look through a set of files. There are 4-5 files for each month in a 2 year period with 1000+ stations in them. I am trying to separate them so that I have one file per station_no (station_no = $1).
I thought this would be easy and simply went with:
awk -F, '{ print > $1".txt" }' *.csv
which I've tested with one file and it works fine. However, when I run this it creates the .txt files, but there is nothing in the files.
I've now tried putting it in a loop to see if that works:
#!/bin/bash
#program to extract stations from orig files
for file in $(ls *.csv)
do
awk -F, '{print > $1".txt" }' $file
done
It loops through the files as expected, but it keeps overwriting the .txt files when it moves on to the next month.
How do I stop it overwriting and just adding to the end of the .txt with that name?

You are saying print > file, which truncates the output file each time a new awk process opens it. Use >> instead, so that it appends to the previous content.
Also, there is no need to loop through all the files and then call awk for each one. Instead, provide the set of files to awk like this:
awk -F, '{print >> ($1".txt")}' *.csv
Note, however, that we need to talk a little about how awk keeps files open for writing. If you say awk '{print > "hello.txt"}' file, awk keeps hello.txt open until it finishes processing, and > truncates only when the file is first opened. In your current approach, a fresh awk process starts (and truncates the output files) for every input file; in my suggested approach, the output files stay open until the last input file has been processed. Thus, in this case a single > suffices:
awk -F, '{print > $1".txt"}' *.csv
For the details on why the parentheses in ($1".txt") matter, see the comments below by Ed Morton; I cannot explain it better than he does :)
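One more practical note: if the number of distinct stations exceeds your system's open-file limit, some awk implementations (unlike gawk, which manages its own pool of descriptors) will abort with a "too many open files" error. A minimal defensive sketch, closing each output file after every write; >> is required here, because after close() a plain > would truncate the file on reopen:
awk -F, '{ out = $1 ".txt"; print >> out; close(out) }' *.csv
This opens and closes a file per input line, so it is slower, but it never holds more than one output file open at a time.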

Related

Paste columns side by side on many files using awk

I've been struggling with this problem for quite a while (please note I'm not a really good bash coder, let alone an awk one).
I have about 10000 files, each formatted the same way (quite heavy as well, about 3 MB each). I would like to get the 3rd column of each file and paste them side by side into a new file.
I found many solutions using paste, awk, or cut, but none of them worked when working with wildcards. For instance,
paste <(awk '{print $3}' file1 ) <(awk '{print $3}' file2 ) <(awk '{print $3}' file3) > output
would work great if I only had 3 files, but I won't type that for 10000 of them. So I gave it a try with wildcards:
paste <(awk '{print $3}' file* ) > output
And it does extract the 3rd columns, but stacked into one single column rather than side by side. I tried some other commands, but eventually always ended up with the same result. Is there a way to paste them side by side using wildcards?
Thank you very much for your help!
Baptiste G.
EDIT 1: With the help of schorsch312, I found a solution that works for me. Instead of getting the columns and pasting them side by side, I print each column as a line and add the lines one after the other:
for i in files*; do
awk '{printf "%s ", $3} END {print ""}' "$i" >> output
done
It works, but (1) it's quite slow, and (2) it's not exactly what I asked for in the title, as my output file is the "transpose". That doesn't really matter to me, because it's only floats and I can transpose it later with python if needed.
I know that you said awk alone, but I don't know how to do that. Here is a simple bash script which does what you'd like to do.
# do a loop over all your files
for i in file*; do
# use awk to get the 3rd column of each file and save the output
awk '{print $3}' "$i" > "row_$i"
done
# now paste your columns together.
paste row_* > output
# cleanup
rm row_*
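For the record, this can also be done in awk alone: accumulate the 3rd field of every file per line number, then print the assembled rows at the end. A minimal sketch, assuming the files all have the same number of lines (it holds one string per output row in memory, so very tall inputs cost RAM):
awk '{ row[FNR] = (FNR in row ? row[FNR] " " : "") $3; if (FNR > max) max = FNR }
END { for (i = 1; i <= max; i++) print row[i] }' file* > output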

Using awk to extract specific line from all text files in a directory

I have a folder with 50 text files and I want to extract the first line from each of them at the command line and output this to a result.txt file.
I'm using the following command within the directory that contains the files I'm working with:
for files in *; do awk '{if(NR==1) print NR, $0}' *.txt; done > result.txt
When I run the command, the result.txt file contains 50 lines, but they're all from a single file in the directory rather than one line per file. The command appears to be looping over a single file 50 times rather than over each of the 50 files.
I'd be grateful if someone could help me understand where I'm going wrong with this.
try this -
for i in *.txt; do head -1 "$i"; done > result.txt
OR
for i in *.txt; do awk 'NR==1 {print $0}' "$i"; done > result.txt
Your code has two problems:
You have an outer loop that iterates over *, but your loop body doesn't use $files. That is, you're invoking awk '...' *.txt 50 times. This is why any output from awk is repeated 50 times in result.txt.
Your awk code checks NR (the number of lines read so far), not FNR (the number of lines read within the current file). NR==1 is true only at the beginning of the very first file.
There's another problem: result.txt is created first, so it is included among *.txt. To avoid this, give it a different name (one that doesn't end in .txt) or put it in a different directory.
A possible fix:
awk 'FNR==1 {print NR, $0}' *.txt > result
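If you also want to record which file each first line came from, awk's built-in FILENAME variable can be prepended, for example:
awk 'FNR==1 {print FILENAME ":", $0}' *.txt > result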
Why not use head? For example, with find:
find mydir/ -type f -exec head -1 {} \; >> result.txt
If you want to follow your approach, you need to pass the loop variable to awk instead of the wildcard:
for files in *; do awk '{if(NR==1) print NR, $0}' "$files"; done > result.txt

Awk only works on final file

I am attempting to process many .csv files using the following loop:
for i in *.csv
do
dos2unix $i
awk '{print $0, FILENAME}' $i>new_edit.csv
done
This script should append the file name to the end of each line, and it works. However, looking at the output, new_edit.csv only contains data from one of the .csv files entered.
wc -l new_edit.csv
Indicates that my awk is only processing lines from one of my csv files. How can I make my awk process every file?
Instead of using > you should use >> as the appending redirection operator. You could also replace the whole loop with the one-liner below (note that new_edit.csv itself matches *.csv, so on a re-run it would be picked up as input; give the output a name the glob doesn't match if you plan to run it repeatedly):
$ awk '{sub(/\r/,"",$NF); print $0, FILENAME}' *.csv > new_edit.csv
The following program should help you. You were using the redirection operator >, which overwrites the file's contents on every iteration; if we replace it with the append operator >>, the loop processes all the files and appends each one's content to the new file:
#!/bin/bash
for i in *.csv
do
awk '{print $0, FILENAME}' "$i" >> new_edit.csv
done

Bash Shell Scripting assigning new variables for output of a grep search

EDIT 2:
I've decided to re-write this in order to better describe my desired outcome.
I'm currently using this code to output a list of files within various directories:
for file in /directoryX/*.txt
do
grep -rl "Annual Compensation" $file
done
The output shows all files containing the table I'm trying to extract, in a layout like this:
txtfile1.txt
txtfile2.txt
txtfile3.txt
I have been using this awk command on each individual .txt file to extract the table and then send it to a .csv:
awk '/Annual Compensation/{f=1} f{print; if (/<\/TABLE>/) exit}' txtfile1.txt > txtfile1.csv
My goal is to find a command that will run my awk command against each file in the list all at once. Thank you to those that have provided suggestions already.
If I understand what you're asking, I think what you want to do is add a line after the grep, or instead of the grep, that says:
awk '/Annual Compensation/{f=1} f{print; if (/<\/TABLE>/) exit}' "$file" > "${file}_new.csv"
When you say ${file}_new.csv, it expands the file variable, then adds the string "_new.csv" to it. That's what you're shooting for, right?
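A small side note (my illustration, not part of the answer above): ${file}_new.csv keeps the original .txt extension in the middle of the new name. If you would rather drop it, the %.txt suffix-removal expansion does that:
file=txtfile1.txt
echo "${file}_new.csv"        # prints txtfile1.txt_new.csv
echo "${file%.txt}_new.csv"   # prints txtfile1_new.csv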
Modifying your code:
for file in /directoryX/*.txt
do
files+=($(grep -rl "Annual Compensation" "$file"))
done
for f in "${files[#]}";do
awk '/Annual Compensation/{f=1} f{print; if (/<\/TABLE>/) exit}' "$f" > "$f"_new.csv
done
Alternative code:
files+=($(grep -rl "Annual Compensation" /directoryX/*))
for f in "${files[@]}"; do
awk '/Annual Compensation/{f=1} f{print; if (/<\/TABLE>/) exit}' "$f" > "${f}_new.csv"
done
In both cases, the grep results and awk results are not verified by me; it is just a copy-paste of your code.
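As a further variant (my sketch, assuming the file names contain no newlines), you can pipe the grep list straight into a while read loop and skip the intermediate array entirely:
grep -l "Annual Compensation" /directoryX/*.txt |
while IFS= read -r f; do
awk '/Annual Compensation/{p=1} p{print; if (/<\/TABLE>/) exit}' "$f" > "${f}_new.csv"
done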

How do I write an awk print command in a loop?

I would like to write a loop that creates one output file per input file, each containing the first column of the corresponding input file.
So I wrote
for i in $(\ls -d /home/*paired.isoforms.results)
do
awk -F"\t" {print $1}' $i > $i.transcript_ids.txt
done
As an example, if there were 5 files in the home directory named
A_paired.isoforms.results
B_paired.isoforms.results
C_paired.isoforms.results
D_paired.isoforms.results
E_paired.isoforms.results
I would like to print the first column of each of these files into a separate output file, i.e. I would like to have 5 output files called
A.transcript_ids.txt
B.transcript_ids.txt
C.transcript_ids.txt
D.transcript_ids.txt
E.transcript_ids.txt
or any other name as long as it is 5 different names and I can still link them back to the original files.
I understand that there is a problem with the double usage of $ in both the awk command and the loop, but I don't know how to change that.
Is it possible to write a command like this in a loop?
This should do the job:
for file in /home/*paired.isoforms.results
do
base=${file##*/}
base=${base%%_*}
awk -F"\t" '{print $1}' $file > $base.transcript_ids.txt
done
I assume that there can be spaces in the first field, since you set the delimiter explicitly to tab. This runs awk once per file. There are ways to do it running awk once for all files, but I'm not convinced the benefit is significant. You could consider using cut instead of awk '{print $1}', too. Note that using ls as you did is less satisfactory than using globbing directly; it falls foul of file names with oddball characters (spaces, tabs, etc.) in the name.
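For example, since cut's default field delimiter is already a tab, the awk invocation inside the loop above could be swapped for:
cut -f1 "$file" > "$base.transcript_ids.txt"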
You can do that entirely in awk:
awk -F"\t" '{split(FILENAME,a,"_"); out=a[1]".transcript_ids.txt"; print $1 > out}' *_paired.isoforms.results
If your input files don't have names as indicated in the question, you'd have to split on something else (as well as use a different glob pattern for the input files).
My original answer is actually doing extra name resolution every time something is printed. Here's a version that only updates the output filename when FILENAME changes:
awk -F"\t" 'FILENAME!=lf{split(FILENAME,a,"_"); out=a[1]".transcript_ids.txt"; lf=FILENAME} {print $1 > out}' *_paired.isoforms.results
