Awk argument too long when merging csv files - bash

I have more than 10000 csv files in a folder and I'm trying to merge them by line using awk but if I run this command:
printf '%s\n' *.csv | xargs cat | awk 'FNR==1 && NR!=1{next;}{print}' *.csv > master.csv
I get the following errors:
/usr/bin/awk: Argument list too long
printf: write error: Broken pipe

With the printf and xargs parts you are sending the contents of the csv files to awk on standard input, but you are also passing the filenames to awk as arguments. Pick one or the other; I'd suggest:
{ printf '%s\n' *.csv | xargs awk 'FNR==1 && NR!=1{next;}{print}'; } > master.csv

If your file names don't contain newlines then you could do:
printf '%s\n' *.csv | awk 'NR==FNR{ARGV[ARGC++]=$0; next} !c++ || FNR>1' -
or if they can contain newlines then:
printf '%s\0' *.csv | awk 'NR==FNR{ARGV[ARGC++]=$0; next} !c++ || FNR>1' RS='\0' - RS='\n'
i.e. have awk read the list of CSV file names as input rather than the shell passing it to awk as arguments. That would work even if you had millions of CSV files.
For example, given this input:
$ head file*.csv
==> file1.csv <==
Number
1
2
==> file2.csv <==
Number
10
11
12
the above would produce this output:
$ printf '%s\n' *.csv | awk 'NR==FNR{ARGV[ARGC++]=$0; next} !c++ || FNR>1' -
Number
1
2
10
11
12
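If you want the merged result written to master.csv as in the original attempt, the same brace-group redirection shown earlier applies; a minimal sketch (note that master.csv will itself match *.csv on later runs if it lives in the same folder):
{ printf '%s\n' *.csv | awk 'NR==FNR{ARGV[ARGC++]=$0; next} !c++ || FNR>1' -; } > master.csv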

Related

xargs and cut: getting `cut` fields of a csv to bash variable

I am using xargs in conjunction with cut, but I am unsure how to get the output of cut into a variable which I can then use for further processing.
So, I have a text file like so:
test.txt:
/some/path/to/dir,filename.jpg
/some/path/to/dir2,filename2.jpg
...
I do this:
cat test.txt | xargs -L1 | cut -d, -f 1,2
/some/path/to/dir,filename.jpg
but what I'd like to do is:
cat test.txt | xargs -L1 | cut -d, -f 1,2 | echo $1 $2
where $1 and $2 are /some/path/to/dir and filename.jpg
I am stumped that I cannot seem to be able to achieve this.
You may want to say something like:
#!/bin/bash
while IFS=, read -r f1 f2; do
    echo ./mypgm -i "$f1" -o "$f2"
done < test.txt
IFS=, read -r f1 f2 reads lines from test.txt one by one,
splits each line on the comma, then assigns the fields to the
variables f1 and f2.
The echo line is there for demonstration purposes. Replace it
with your desired command using $f1 and $f2.
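With the test.txt shown above, the demonstration loop just prints the command lines it would run; for example (assuming the script is saved as run.sh, a placeholder name):
$ bash run.sh
./mypgm -i /some/path/to/dir -o filename.jpg
./mypgm -i /some/path/to/dir2 -o filename2.jpg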
Try this:
cat test.txt | awk -F, '{print $1, $2}'
From man xargs:
xargs [-L number] [utility [argument ...]]
-L number
Call utility for every number non-empty lines read.
From man awk:
Awk scans each input file for lines that match any of a set of patterns specified literally in prog or in one or more files specified as -f progfile.
So you don't need xargs -L1 here, since you aren't passing a utility for xargs to call.
Also from man awk:
The -F fs option defines the input field separator to be the regular expression fs.
So awk -F, can replace the cut -d, part.
The fields are denoted $1, $2, ..., while $0 refers to the entire line.
So $1 is for the first column, $2 is for the second one.
An action is a sequence of statements. A statement can be one of the following:
print [ expression-list ] [ > expression ]
An empty expression-list stands for $0.
The print statement prints its argument on the standard output (or on a file if > file or >> file is present or on a pipe if | cmd is present), separated by the current output field separator, and terminated by the output record separator.
Put all these together, cat test.txt | awk -F, '{print $1, $2}' would achieve that you want.
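For example, with the test.txt from the question this prints the two fields separated by a space (awk's default output field separator):
$ cat test.txt | awk -F, '{print $1, $2}'
/some/path/to/dir filename.jpg
/some/path/to/dir2 filename2.jpg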

Print lines whose first column is not in the list

I have a list of numbers in a file
cat to_delete.txt
2
3
6
9
11
and many txt files in one folder. Each file has tab delimited lines (can be more lines than this).
3 0.55667 0.66778 0.54321 0.12345
6 0.99999 0.44444 0.55555 0.66666
7 0.33333 0.34567 0.56789 0.34543
I want to remove the lines whose first number ($1 for awk) is in to_delete.txt and print only the lines whose first number is not in to_delete.txt. The change should replace the old file.
Expected output
7 0.33333 0.34567 0.56789 0.34543
This is what I got so far, which doesn't remove anything;
for file in *.txt; do awk '$1 != /2|3|6|9|11/' "$file" > "$tmp" && mv "$tmp" "$file"; done
I've looked through so many similar questions here but still cannot make it work. I also tried grep -v -f to_delete.txt and sed -n -i '/$to_delete/!p'
Any help is appreciated. Thanks!
In awk:
$ awk 'NR==FNR{a[$1];next}!($1 in a)' delete file
Output:
7 0.33333 0.34567 0.56789 0.34543
Explained:
$ awk '
NR==FNR { # hash records in delete file to a hash
a[$1]
next
}
!($1 in a)    # for records in files after the first: if $1 is not in the hash, output the record
' delete files* # mind the file order
My first idea was this:
printf "%s\n" *.txt | xargs -n1 sed -i "$(sed 's!.*!/& /d!' to_delete.txt)"
printf "%s\n" *.txt - outputs the *.txt files each on separate lines
| xargs -n1 execute the following command for each line passing the line content as the input
sed -i - edit file in place
$( ... ) - command substitution
sed 's!.*!/^& /d!' to_delete.txt - for each line in to_delete.txt, append the line with /^ and suffix with /d. That way from the list of numbers I get a list of regexes to delete, like:
/^2 /d
/^3 /d
/^6 /d
and so on. Which tells sed to delete lines matching the regex - line starting with the number followed by a space.
But I think awk would be simpler. You could do:
awk '$1 != 2 && $1 != 3 && $1 != 6 ... and so on ...'
but that would be longish, unreadable. It's easier to read the map from the file and then check if the number is in the array:
awk 'FNR==NR{ map[$1] } FNR!=NR && !($1 in map)' to_delete.txt "$file"
FNR==NR is true only for the first file. So while reading it, we create map[$1] (we merely reference the element so that it exists). Then FNR!=NR is true for the second file, for which we check whether the first field is a key in the map. If it is not, the expression is true and the line gets printed.
all together:
tmp=$(mktemp); for file in *.txt; do awk 'FNR==NR{ map[$1] } FNR!=NR && !($1 in map)' to_delete.txt "$file" > "$tmp" && mv "$tmp" "$file"; done
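For example, running the awk part alone against to_delete.txt and one of the data files from the question (called data.txt here, a placeholder name) keeps only the line whose first field is not in the list:
$ awk 'FNR==NR{ map[$1] } FNR!=NR && !($1 in map)' to_delete.txt data.txt
7 0.33333 0.34567 0.56789 0.34543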

How to get output of awk into a tab-delimited file merging two lines to a line every time?

I have multiple files in gz format and used this script which counts lines in each file and prints 1/4 of lines for each file:
for file in *.gz;
do echo $file;
gunzip -c $file | wc -l | awk '{print $1/4}';
done
STDOUT:
AB.gz
12
CD.gz
4
How can I pipe the output of awk into a tab-delimited file like this, merging two lines each time:
AB.gz 12
CD.gz 4
I tried paste by piping | paste -sd '\t' > output.txt in the script but it didn't work.
You can use a script like this:
for file in *.gz; do
gzcat "$file" | awk -v fn="$file" -v OFS='\t' 'END{print fn, int(NR/4)}'
done
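If you want those lines collected into a tab-delimited file, e.g. output.txt as mentioned in the question, you can redirect the whole loop once rather than each iteration:
for file in *.gz; do
    gzcat "$file" | awk -v fn="$file" -v OFS='\t' 'END{print fn, int(NR/4)}'
done > output.txt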
Do not echo a newline after the file:
for file in *.gz; do
    printf "%s\t" "${file}"
    gunzip -c "$file" | wc -l | awk '{print $1/4}'
done

Print all lines in "file2" which have line number stored in "file1" $2

File1:
count line_num
xy 55
ab 67
File2:
a|b|c
d|e|f
I want to print line numbers 55 and 67 of file2.
I am trying:
#!/usr/bin/ksh
while read file_name; do
    line_num=`echo $file_name | awk '{print $2}'`
    awk 'NR==$line_num{print;exit}' file2 >> file3.txt
done < file1
but it's not working!
Using awk you can do:
awk 'NR==FNR{line[$2]; next} FNR in line' file1 file2
We iterate over the first file and store its second column in a map called line (we could skip the header line, which is the first line, with NR>1, but since it doesn't contain numbers we don't need to). Once the first file is loaded into the map, we iterate over the second file and print the lines whose line numbers are in our map. NR and FNR are awk variables that track the line numbers.
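As a small illustration of the mechanics (with made-up line numbers 1 and 2 instead of 55 and 67, so that the two-line File2 from the question actually produces output):
$ cat file1
count line_num
xy 1
ab 2
$ awk 'NR==FNR{line[$2]; next} FNR in line' file1 file2
a|b|c
d|e|f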
You can use awk to read the line numbers in a loop and sed to print out the specific lines:
while read a; do sed -n ${a}p f2.txt; done < <(awk 'NR>1{print$2}' f1.txt)
If you have a bigger file, performance can be an issue as Ed pointed out, in that case you can use awk alone:
awk 'NR==FNR{if(NR>1)l[$2]=1;next}{if(l[FNR])print $0}' f1.txt f2.txt
Another way, is to use xargs:
awk 'NR>1{print $2}' f1.txt | xargs -n1 -I {} sed -n {}p f2.txt
Use sed to construct a sed one-liner (for the file1 above it effectively runs sed -n "55p;67p;" file2):
sed -n "$(sed -n '2~1{s/.* //;s/.*/&p/p}' file1)" file2
A good advertisement for awk, alas!

unix command to get lines from in between first and last occurence of a word and write to a file

I want a unix command to find the lines between first & last occurence of a word
For example:
let's imagine we have 1000 lines. The tenth line contains the word "stackoverflow", and the thirty-fifth line also contains the word "stackoverflow".
I want to print lines 10 through 35 and write them to a new file.
You can do it in two steps. The basic idea is to:
1) get the line numbers of the first and last match.
2) print the range of lines between those two line numbers.
$ read first last <<< $(grep -n stackoverflow your_file | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ awk -v f=$first -v l=$last 'NR>=f && NR<=l' your_file
Explanation
read first last reads two values and stores them in $first and $last.
grep -n stackoverflow your_file greps and shows the output like this: number_of_line:output
awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}' prints the line numbers of the first and last matches of stackoverflow in the file.
And
awk -v f=$first -v l=$last 'NR>=f && NR<=l' your_file prints all lines from $first line number till $last line number.
Test
$ cat a
here we
have some text
stackoverflow
and other things
bla
bla
bla bla
stackoverflow
and whatever else
stackoverflow
to make more fun
blablabla
$ read first last <<< $(grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ awk -v f=$first -v l=$last 'NR>=f && NR<=l' a
stackoverflow
and other things
bla
bla
bla bla
stackoverflow
and whatever else
stackoverflow
By steps:
$ grep -n stackoverflow a
3:stackoverflow
9:stackoverflow
11:stackoverflow
$ grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}'
3 11
$ read first last <<< $(grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ echo "first=$first, last=$last"
first=3, last=11
If you know an upper bound on how many lines there can be (say, a million), then you can use this simple abusive script:
(grep -A 1000000 stackoverflow | grep -B 1000000 stackoverflow) < file
You can append | tail -n +2 | head -n -1 to strip the border lines as well:
(grep -A 1000000 stackoverflow | grep -B 1000000 stackoverflow |
  tail -n +2 | head -n -1) < file
I'm not 100% sure from the question whether the output should be inclusive of the first and last matching lines, so I'm assuming it is. But this can be easily changed if we want exclusive instead.
This pure-bash solution does it all in one step - i.e. the file (or pipe) is only read once:
#!/bin/bash
function midgrep {
  while read ln; do
    [ "$saveline" ] && linea[$((i++))]=$ln
    if [[ $ln =~ $1 ]]; then
      if [ "$saveline" ]; then
        for ((j=0; j<i; j++)); do echo ${linea[$j]}; done
        i=0
      else
        saveline=1
        linea[$((i++))]=$ln
      fi
    fi
  done
}
midgrep "$1"
Save this as a script (e.g. midgrep.sh) and pipe whatever output you like to it as follows:
$ cat input.txt | ./midgrep.sh stackoverflow
This works as follows:
find the first matching line and buffer in the first element of an array
continue reading lines until the next match, buffering to the array as we go
on each subsequent match, flush the buffer array to output
continue reading file to the end. If there are no more matches, then the last buffer is simply discarded.
The advantage of this approach is that we read through the input only once. The disadvantage is that we buffer everything between each pair of matches: if there are many lines between matches, they are all held in memory until we hit the next match.
Also this uses the bash =~ regular expression operator to keep this pure bash. But you could replace this with a grep instead, if you are more comfortable with that.
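For example, the [[ $ln =~ $1 ]] test could be swapped for a grep call along these lines (a sketch; grep -Eq keeps the semantics closest to bash's ERE-based =~ operator):
if printf '%s\n' "$ln" | grep -Eq -- "$1"; then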
Using perl :
perl -00 -lne '
chomp(my @arr = split /stackoverflow/);
print join "\nstackoverflow", @arr[1 .. $#arr - 1]
' file.txt | tee newfile.txt
The idea behind this is to split the whole input file into chunks on the string "stackoverflow", then print the chunks from the second up to the last-but-one, joined back together with "stackoverflow".
