Using awk to extract specific line from all text files in a directory - bash

I have a folder with 50 text files and I want to extract the first line from each of them at the command line and output this to a result.txt file.
I'm using the following command within the directory that contains the files I'm working with:
for files in *; do awk '{if(NR==1) print NR, $0}' *.txt; done > result.txt
When I run the command, the result.txt file contains 50 lines but they're all from a single file in the directory rather than one line per file. The common appears to be looping over a single 50 times rather than over each of the 50 files.
I'd be grateful if someone could help me understand where I'm going wrong with this.

try this -
for i in *.txt;do head -1 $i;done > result.txt
OR
for files in *.txt;do awk 'NR==1 {print $0}' $i;done > result.txt

Your code has two problems:
You have an outer loop that iterates over *, but your loop body doesn't use $files. That is, you're invoking awk '...' *.txt 50 times. This is why any output from awk is repeated 50 times in result.txt.
Your awk code checks NR (the number of lines read so far), not FNR (the number of lines read within the current file). NR==1 is true only at the beginning of the very first file.
There's another problem: result.txt is created first, so it is included among *.txt. To avoid this, give it a different name (one that doesn't end in .txt) or put it in a different directory.
A possible fix:
awk 'FNR==1 {print NR, $0}' *.txt > result

Why not use head? For example with find:
find midir/ -type f -exec head -1 {} \; >> result.txt
If you want to follow your approach you need to specify the file and not use the wildcard with awk:
for files in *; do awk '{if(NR==1) print NR, $0}' "$files"; done > result.txt

Related

AWK remove blank lines and append empty columns to all csv files in the directory

Hi I am looking for a way to combine all the below commands together.
Remove blank lines in the csv file (comma delimited)
Add multiple empty columns to each line up to 100th column
Perform action 1 & 2 on all the files in the folder
I am still learning and this is the best I could get:
awk '!/^[[:space:]]*$/' x.csv > tmp && mv tmp x.csv
awk -F"," '($100="")1' OFS="," x.csv > tmp && mv tmp x.csv
They work out individually but I don't know how how to put them together and I am looking for ways to have it run through all the files under the directory.
Looking for concrete AWK code or shell script calling AWK.
Thank you!
An example input would be:
a,b,c
x,y,z
Expected output would be:
a,b,c,,,,,,,,,,
x,y,z,,,,,,,,,,
you can combine in one script without any loops
$ awk 'BEGIN{FS=OFS=","} FNR==1{close(f); f=FILENAME".updated"} NF{$100=""; print > f}' files...
it won't overwrite the original files.
You can pipe the output of the first to the other:
awk '!/^[[:space:]]*$/' x.csv | awk -F"," '($100="")1' OFS="," > new_x.csv
If you wanted to run the above on all the files in your directory, you would do:
shopt -s nullglob
for f in yourdirectory/*.csv; do
awk '!/^[[:space:]]*$/' "${f}" | awk -F"," '($100="")1' OFS="," > new_"${f}"
done
The shopt -s nullglob is so that an empty directory won't give you a literal *. Quoted from a good source for about looping through files
With recent enough GNU awk you could:
$ gawk -i inplace 'BEGIN{FS=OFS=","}/\S/{NF=100;$1=$1;print}' *
Explained:
$ gawk -i inplace ' # using GNU awk and in-place file editing
BEGIN {
FS=OFS="," # set delimiters to a comma
}
/\S/ { # gawk specific regex operator that matches any character that is not a space
NF=100 # set the field count to 100 which truncates fields above it
$1=$1 # edit the first field to rebuild the record to actually get the extra commas
print # output records
}' *
Some test data (the first empty record is empty, the second empty record has a space and a tab, trust me bro):
$ cat file
1,2,3
1,2,3,4,5,6,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101
Output of cat file after the execution of the GNU awk program:
1,2,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,3,4,5,6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100

Awk only works on final file

I am attempting to process many .csv files using the following loop:
for i in *.csv
do
dos2unix $i
awk '{print $0, FILENAME}' $i>new_edit.csv
done
This script shoud append the file name, to the end of each file, and it works. However, looking at the output new_edit.csv only contains data from one of the .csv files entered.
wc -l new_edit.csv
Indicates that my awk is only processing lines from one of my csv files. How can I make my awk process every file?
Instead of using > you should use >> as appending redirector. You could also replace the whole code with:
$ awk '{sub(/\r/,"",$NF); print $0, FILENAME}' *.csv > new_edit.csv
Following program should help you:
since you were using the redirect operator > , which was always overriding the content in file. if we replace it with append redirect oprerator >>, it would process all files and append the content in new file
#!/bin/bash
for i in *.csv
do
awk '{print $0, FILENAME}' $i>>new_edit.csv
done

awk overwriting files in a loop

I am trying to look through a set of files. There are 4-5 files for each month in a 2 year period with 1000+ stations in them. I am trying to separate them so that I have one file per station_no (station_no = $1).
I thought this was easy and simply went with;
awk -F, '{ print > $1".txt" }' *.csv
which I've tested with one file and it works fine. However, when I run this it creates the .txt files, but there is nothing in the files.
I've now tried to put it in a loop and see if that works;
#!/bin/bash
#program to extract stations from orig files
for file in $(ls *.csv)
do
awk -F, '{print > $1".txt" }' $file
done
It works as it loops through the files etc, but it keeps overwriting the when it moves to the next month.
How do I stop it overwriting and just adding to the end of the .txt with that name?
You are saying print > file, which truncates on every new call. Use >> instead, so that it appends to the previous content.
Also, there is no need to loop through all the files and then call awk for each one. Instead, provide the set of files to awk like this:
awk -F, '{print >> ($1".txt")}' *.csv
Note, however, that we need to talk a little about how awk keeps files opened for writing. If you say awk '{print > "hello.txt"}' file, awk will keep hello.txt file opened until it finishes processing. In your current approach, awk stops on every file; however, in my current suggested approach the file is open until the last file is processed. Thus, in this case a single > suffices:
awk -F, '{print > $1".txt"}' *.csv
For the detail on ( file ), see below comments by Ed Morton, I cannot explain it better than him :)

How do I write an awk print command in a loop?

I would like to write a loop creating various output files with the first column of each input file, respectively.
So I wrote
for i in $(\ls -d /home/*paired.isoforms.results)
do
awk -F"\t" {print $1}' $i > $i.transcript_ids.txt
done
As an example if there were 5 files in the home directory named
A_paired.isoforms.results
B_paired.isoforms.results
C_paired.isoforms.results
D_paired.isoforms.results
E_paired.isoforms.results
I would like to print the first column of each of these files into a seperate output file, i.e. I would like to have 5 output files called
A.transcript_ids.txt
B.transcript_ids.txt
C.transcript_ids.txt
D.transcript_ids.txt
E.transcript_ids.txt
or any other name as long as it is 5 different names and I can still link them back to the original files.
I understand, that there is a problem with the double usage of $ in both the awk and the loop command, but I don't know how to change that.
Is it possible to write a command like this in a loop?
This should do the job:
for file in /home/*paired.isoforms.results
do
base=${file##*/}
base=${base%%_*}
awk -F"\t" '{print $1}' $file > $base.transcript_ids.txt
done
I assume that there can be spaces in the first field since you set the delimiter explicitly to tab. This runs awk once per file. There are ways to do it running awk once for all files, but I'm not convinced the benefit is significant. You could consider using cut instead of awk '{print $1}', too. Note that using ls as you did is less satisfactory than using globbing directly; it runs foul of file names with oddball characters (spaces, tabs, etc) in the name.
You can do that entirely in awk:
awk -F"\t" '{split(FILENAME,a,"_"); out=a[1]".transcript_ids.txt"; print $1 > out}' *_paired.isoforms.results
If your input files don't have names as indicated in the question, you'd have to split on something else ( as well as use a different pattern match for the input files ).
My original answer is actually doing extra name resolution every time something is printed. Here's a version that only updates the output filename when FILENAME changes:
awk -F"\t" 'FILENAME!=lf{split(FILENAME,a,"_"); out=a[1]".transcript_ids.txt"; lf=FILENAME} {print $1 > out}' *_paired.isoforms.results

How to copy a .c file to a numbered listing

I simply want to copy my .c file into a line-numbered listing file. Basically generate a .prn file from my .c file. I'm having a hard time finding the right bash command to do so.
Do you mean nl?
nl -ba filename.c
The -ba means to number all lines, not just non-empty ones.
awk '{print FNR ":" $0}' file1 file2 ...
is one way.
FNR is FileNumberRecord (the current line number per file).
You can change the ":" per your needs.
$0 means "the-whole-line-of-input"
Or you can do
cat -n file1 file2 ....
IHTH
On my linux system, I occasionally use pr -tn to prefix line numbers for listings. The -t option suppresses headers and footers; -n says to prefix line numbers. -n allows optional format and digit specifiers; see man page. Anyhow, to print file xyz.c to xyz.prn with line numbering, use:
pr -tn xyz.c > xyz.prn
Note, this is not as compact and handy as cat -n xyz.c > xyz.prn (using cat -n as suggested in a previous answer); but pr has numerous other options, and I most often use it when I want to both number the lines and put them into multiple columns or print multiple files side by side. Eg for a 2-column numbered listing use:
pr -2 -tn xyz.c > xyz.prn
I think shellter has the right idea. However, if your require output written to files with prn extensions, here's one way:
awk '{ sub(/\.c$/, "", FILENAME); print FNR ":" $0 > FILENAME ".prn" }' file1.c file2.c ...
To perform this on all files in the present working directory:
for i in *.c; do awk '{ sub(/\.c$/, "", FILENAME); print FNR ":" $0 > FILENAME ".prn" }' "$i"; done

Resources