Add filename to output of an xargs and awk command - shell

I have a directory full of .txt files, each of which has two columns and many rows (>10000). For each of these files, I am trying to find the maximum value in the second column, and print the corresponding entry in columns 1 and 2 to an output file. For this, I have a working awk command.
find ./ -name "*.txt" | xargs -I FILE awk '{if(max<$2){max=$2;datum=$1}}END{print datum, max}' FILE >> out.txt
However, I would also like to print the name of the corresponding input file with each pair of numbers. The output would look something like:
file1.txt datum1 max1
file2.txt datum2 max2
For this, I tried to draw inspiration from this similar question:
add filename to beginning of file using find and sed,
but I couldn't quite get a working solution. My best effort so far looks something like this
find ./ -name "*.txt" | xargs -I FILE echo FILE | awk '{if(max<$2){max=$2;datum=$1}}END{print datum, max}' FILE >> out.txt
but I get the error:
awk: can't open file FILE
source line number 1
I tried various other approaches which are probably a few characters away from being correct:
(1)
find ./ -name "*.txt" | xargs -I FILE -c "echo FILE ; awk '{if(max<$2){max=$2;datum=$1}}END{print datum, max}' FILE" >> out.txt
(2)
find ./ -name "*.txt" -exec sh -c "echo {} && awk '{if(max<$2){max=$2;datum=$1}}END{print datum, max}' {}" \; >> out.txt
I don't mind what command is used (xargs or exec or whatever), I only really care about the output.

If all the .txt files are in the current directory, try (GNU awk):
awk '{if(max=="" || max<$2+0){max=$2;datum=$1}}ENDFILE{print FILENAME, datum, max; max=""}' *.txt
If you want to search both the current directory and all its subdirectories for .txt files, then try:
find . -name '*.txt' -exec awk '{if(max=="" || max<$2+0){max=$2;datum=$1}}ENDFILE{print FILENAME, datum, max; max=""}' {} +
Because modern find has an -exec action (and the {} + form batches file names much like xargs does), the xargs command is rarely needed anymore.
How it works
{if(max=="" || max<$2+0){max=$2;datum=$1}}
This finds the line with the maximum value in column 2 and saves that value together with the corresponding value from column 1.
ENDFILE{print FILENAME, datum, max; max=""}
After the end of each file is reached, this prints the filename along with columns 1 and 2 from the line that had the largest column 2.
It also resets max to an empty string so that the next file starts fresh.
Example
Consider a directory with these three files:
$ cat file1.txt
1 1
2 2
$ cat file2.txt
3 12
5 14
4 13
$ cat file3.txt
1 0
2 1
Our command produces:
$ awk '{if(max=="" || max<$2+0){max=$2;datum=$1}}ENDFILE{print FILENAME, datum, max; max=""}' *.txt
file1.txt 2 2
file2.txt 5 14
file3.txt 2 1
BSD awk
If we cannot use ENDFILE, try:
$ awk 'FNR==1 && NR>1{print f, datum, max; max=""} max=="" || max<$2+0{max=$2;datum=$1;f=FILENAME} END{print f, datum, max}' *.txt
file1.txt 2 2
file2.txt 5 14
file3.txt 2 1
Because one awk process can analyze many files, this approach should be fast.
FNR==1 && NR>1{print f, datum, max; max=""}
Every time we start a new file, we print the maximum from the previous file.
In awk, FNR is the line number within the current file and NR is the total number of lines read so far. When FNR==1 && NR>1, we have finished at least one file and have started on the next.
max=="" || max<$2+0{max=$2;datum=$1;f=FILENAME}
Like before, we capture the maximum of column 2 and the corresponding datum from column 1. We also record the filename as variable f.
END{print f, datum, max}
After we finish reading the last file, we print its maximum line.
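To see why the FNR==1 && NR>1 test detects file boundaries, you can print both counters over two of the example files above (just an illustration; any multi-line files will do):
$ awk '{print FILENAME, NR, FNR}' file1.txt file3.txt
file1.txt 1 1
file1.txt 2 2
file3.txt 3 1
file3.txt 4 2
FNR restarts at 1 for each file while NR keeps counting, so FNR==1 && NR>1 is true exactly on the first line of every file after the first.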

If you have 10,000 files of 100,000 lines each, you will be waiting quite a long time if you start a new invocation of awk for each and every file like this, because you will have to create 10,000 processes:
find . -name \*.txt -exec awk ....
I created some test files and found that the above takes just over 5 minutes on my iMac.
So, I decided to see what all those lovely Intel cores and all that lovely flash disk that I paid Apple so dearly for might be able to do using GNU Parallel.
Basically, it will run as many jobs in parallel as your CPU has cores - probably 4 or 8 on a decent Mac, and it can tag output lines with the parameters it supplied to the command:
parallel --tag -q awk 'BEGIN{max=$2;d=$1} $2>max {max=$2;d=$1} END{print d,max}' ::: *.txt
That produces the same results and now runs in 1 minute 22 seconds, nearly a 4x speedup - not bad! But we can do better... As it stands above, we are still invoking a new awk for every file, so 10,000 awks, just in parallel, 8 at a time. It would be better to pass as many files as the OS permits to each of the 8 awks that run in parallel. Luckily, GNU Parallel will work out how many that is for us, with the -X option:
parallel -X -q gawk 'BEGINFILE{max=$2;d=$1} $2>max {max=$2;d=$1} ENDFILE{print FILENAME,d,max}' ::: *.txt
That now takes 49 seconds, but note that I am using gawk for ENDFILE/BEGINFILE and not the --tag option because each awk invocation is now receiving many hundreds of files rather than just one.
GNU Parallel and gawk can be installed easily on a Mac with Homebrew. You just go to the Homebrew website and copy and paste the one-liner into your terminal. Then you have a proper package manager on macOS and access to thousands of quality, useful, well-managed packages.
Once you have homebrew installed, you can install GNU Parallel with:
brew install parallel
and you can install gawk with:
brew install gawk
If you don't want a package manager, it's worth noting that GNU Parallel is just a Perl script and macOS ships with Perl anyway. So, you can also install it very simply with:
(wget -O - pi.dk/3 || curl pi.dk/3/ ) | bash
Note that if your filenames are longer than about 25 characters, you will hit the limit of 262,144 characters on the argument length and get an error message telling you the argument list is too long. If that happens, just feed the names on stdin like this:
find . -name \*.txt -print0 | parallel -0 -X -q gawk 'BEGINFILE{max=$2;d=$1} $2>max {max=$2;d=$1} ENDFILE{print FILENAME,d,max}'

find . -name '*.txt' | xargs -n 1 -I FILE awk '(FNR==1) || (max<$2){max=$2;datum=$1} END{print FILENAME, datum, max}' FILE >> out.txt
find . -name '*.txt' -exec awk '(FNR==1) || (max<$2){max=$2;datum=$1} END{print FILENAME, datum, max}' {} \; >> out.txt
(edited by OP for typo)

Related

Merge huge number of files into one file by reading the files in ascending order

I want to merge a large number of files into a single file, and the merge should happen in ascending order of the file names. I have tried the command below and it works as intended, but the problem is that after the merge, output.txt contains all the data on a single line, because every input file has only one line of data and no trailing newline.
Is there any way to merge each file's data into output.txt as a separate line, rather than having everything end up on one line?
My list of files has the naming format of 9999_xyz_1.json, 9999_xyz_2.json, 9999_xyz_3.json, ....., 9999_xyz_12000.json.
Example:
$ cat 9999_xyz_1.json
abcdef
$ cat 9999_xyz_2.json
12345
$ cat 9999_xyz_3.json
Hello
Expected output.txt:
abcdef
12345
Hello
Actual output:
$ ls -d -1 -v "$PWD/"9999_xyz_*.json | xargs cat
abcdef12345Hello
EDIT:
Since my input files won't contain any spaces or special characters like backslashes or quotes, I decided to use the command below, which works for me as expected.
find . -name '9999_xyz_*.json' -type f | sort -V | xargs awk 1 > output.txt
I also tried with a file name containing a space; below are the results with 2 different commands.
Example:
$ cat 9999_xyz_1.json
abcdef
$ cat 9999_ xyz_2.json -- This File name contains a space
12345
$ cat 9999_xyz_3.json
Hello
Expected output.txt:
abcdef
12345
Hello
Command:
find . -name '9999_xyz_*.json' -print0 -type f | sort -V | xargs -0 awk 1 > output.txt
Output:
Successfully completed the merge as expected, but with an error at the end.
abcdef
12345
hello
awk: cmd. line:1: fatal: cannot open file `
' for reading (No such file or directory)
Command:
Here I have used sort with the -zV options to avoid the error that occurred with the above command.
find . -name '9999_xyz_*.json' -print0 -type f | sort -zV | xargs -0 awk 1 > output.txt
Output:
The command completed successfully, but the results are not as expected. Here the file name containing a space is treated as the last file after the sort. The expectation is that the file name with the space should be in second position after the sort.
abcdef
hello
12345
I would approach this with a for loop, and use echo to add the newline between each file:
for x in `ls -v -1 -d "$PWD/"9999_xyz_*.json`; do
cat $x
echo
done > output.txt
Now, someone will invariably comment that you should never parse the output of ls, but I'm not sure how else to sort the files in the right order, so I kept your original ls command to enumerate the files, which worked according to your question.
EDIT
You can optimize this a lot by using awk 1 as @oguzismail did in his answer:
ls -d -1 -v "$PWD/"9999_xyz_*.json | xargs awk 1 > output.txt
This solution finishes in 4 seconds on my machine with 12000 files, as in your question, while the for loop takes 13 minutes to run. The difference is that the for loop launches 12000 cat processes, while xargs needs only a handful of awk processes, which is a lot more efficient.
Note: if you want to upvote this, make sure to upvote @oguzismail's answer too, since using awk 1 is his idea. But his answer with printf and sort -V is safer, so you probably want to use that solution anyway.
Don't parse the output of ls, use an array instead.
for fname in 9999_xyz_*.json; do
index="${fname##*_}"
index="${index%.json}"
files[index]="$fname"
done && awk 1 "${files[@]}" > output.txt
Another approach that relies on GNU extensions:
printf '%s\0' 9999_xyz_*.json | sort -zV | xargs -0 awk 1 > output.txt
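Why awk 1 helps here: awk terminates every record it prints with a newline (ORS), even when the source file lacks a trailing one. A quick check with a throwaway file (the name nonl.json is just for illustration):
$ printf 'abcdef' > nonl.json          # no trailing newline
$ cat nonl.json nonl.json
abcdefabcdef
$ awk 1 nonl.json nonl.json
abcdef
abcdef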

Applying awk pattern to all files with same name, outputting each to a new file

I'm trying to recursively find all files with the same name in a directory, apply an awk pattern to them, and then output to the directory where each of those files lives a new updated version of the file.
I thought it was better to use a for loop than xargs, but I don't know exactly how to make this work...
for f in $(find . -name FILENAME.txt );
do awk -F"\(corr\)" '{print $1,$2,$3,$4}' ./FILENAME.txt > ./newFILENAME.txt $f;
done
Ultimately I would like to be able to remove multiple strings from the file at once using -F, but I'm also not sure how to do that using awk.
Also, is there a way to remove "(cor*)" where the * represents a wildcard? I'm not sure how to do that while keeping the escape sequences for the parentheses.
Thanks!
To use (corr*) as a field separator where * is a glob-style wildcard, try:
awk -F'[(]corr[^)]*[)]' '{print $1,$2,$3,$4}'
For example:
$ echo '1(corr)2(corrTwo)3(corrThree)4' | awk -F'[(]corr[^)]*[)]' '{print $1,$2,$3,$4}'
1 2 3 4
To apply this command to every file under the current directory named FILENAME.txt, use:
find . -name FILENAME.txt -execdir sh -c 'awk -F'\''[(]corr[^)]*[)]'\'' '\''{print $1,$2,$3,$4}'\'' "$1" > ./newFILENAME.txt' Awk {} \;
Notes
Don't use:
for f in $(find . -name FILENAME.txt ); do
If any file or directory has whitespace or other shell-active characters in it, the results will be an unpleasant surprise.
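To see the word-splitting problem concretely, here is a throwaway illustration (the directory name is hypothetical; run it in an otherwise empty scratch directory):
$ mkdir -p 'my dir' && touch 'my dir/FILENAME.txt'
$ for f in $(find . -name FILENAME.txt); do echo "got: $f"; done
got: ./my
got: dir/FILENAME.txt
The single path is split into two words at the space, so the loop never sees the real file name. The find -execdir form above passes each path as a single argument and avoids this.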
Handling both parens and square brackets as field separators
Consider this test file:
$ cat file.txt
1(corr)2(corrTwo)3[some]4
To eliminate both types of separators and print the first four columns:
$ awk -F'[(]corr[^)]*[)]|[[][^]]*[]]' '{print $1,$2,$3,$4}' file.txt
1 2 3 4

Count how many files contain a string in the last line

I want to count how many files in the current directory have the string "A" in the last line.
First solution: tail -n 1 * | grep \"A\"| wc -l
This works fine, but when there are more files it fails with bash: /usr/bin/tail: Argument list too long.
Is there a way to get around it?
Bonus points if I can also optionally get which files contains it.
EDIT: my folder contains 343729 files
EDIT2: @tso usefully pointed, in his comment, to the article "I'm getting 'Argument list too long'. How can I process a large list in chunks?"
RESULTS:
@tso's solution for f in $(find . -type f); do tail -1 $f|grep \"A\"; done|wc -l takes about 20 minutes
@lars's solution grep -P "\"A\"*\Z" -r . | wc -l takes about 20 minutes
@mklement0's solution printf '%s\0' * | xargs -0 sh -c 'tail -q -n 1 "$@" | grep \"A\"' - | wc -l takes about 10 minutes
@james's solution (in the comments) for i in * ; do awk 'END{if(/a/)print FILENAME}' "$i" ; done takes about 25 minutes
@codeforester's find . -type f -exec tail -n 1 -- {} + | grep -EB 1 '^[^=]+A' | grep -c '^==>' takes >20 minutes.
@mklement0's and @codeforester's solutions also have the advantage that, if I want to change the grep pattern, the second time I run them they take almost no time; I guess it's due to some sort of caching.
I've accepted @mklement0's answer as it seems to be the fastest, but I'd still like to mention @tso and @lars for their contributions, which, based on my personal knowledge, are easier and more adaptable solutions.
xargs is able to overcome the max. command-line length limitation by efficiently batching the invocations into as few calls as possible.
The shell's builtins, such as printf, are not subject to the max. command-line length.
Knowing this, you can use the following approach (which assumes that your xargs implementation supports the -0 option for NUL-terminated input, and that your tail implementation supports multiple file operands and the -q option for suppressing filename headers; both assumptions hold for the GNU/Linux and BSD/macOS implementations of these utilities):
printf '%s\0' * | xargs -0 sh -c 'tail -q -n 1 "$@" | grep \"A\"' - | wc -l
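If you are curious how many batches (and therefore how many tail invocations) that amounts to, a rough sanity check is to have each batch report its size instead of running tail (a sketch, not part of the solution):
printf '%s\0' * | xargs -0 sh -c 'echo "$# files in this batch"' -
Each output line corresponds to one invocation spawned by xargs, so even with hundreds of thousands of files you should see far fewer lines than files.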
How about using find, tail, and grep this way? This will be more efficient than having to loop through each file. Also, tail -1 will just read the last line of each file and hence is very I/O efficient.
find . -maxdepth 1 -type f -exec tail -n 1 -- {} + | grep -EB 1 '^[^=]+A' | grep -c '^==>'
find will invoke tail -1 in batches, passing as many file names at a time as fit within the ARG_MAX limit
tail will print the last line of each file, preceding it with a header of the form "==> file_name <=="
grep -EB 1 '^[^=]+A' will look for pattern A and fetch the previous line as well (it will exclude the file_name lines while looking for the match)
grep -c '^==>' will count the number of files with matching pattern
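To see the mechanics, here is a sketch with two throwaway files (f1 and f2 are hypothetical names; the last line of f1 contains "A", as in the question):
$ printf 'x\n1 "A"\n' > f1; printf 'x\ny\n' > f2
$ tail -n 1 -- f1 f2
==> f1 <==
1 "A"

==> f2 <==
y
$ tail -n 1 -- f1 f2 | grep -EB 1 '^[^=]+A' | grep -c '^==>'
1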
If you don't need to know the name of the files having a match, but just get the count of files, you could do this:
find . -maxdepth 1 -type f -exec tail -q -n 1 -- {} + | grep -c 'A'
Using GNU awk:
$ cat foo
b
a
$ cat bar
b
b
$ awk 'ENDFILE{if(/a/){c++; print FILENAME}}END{print c}' *
foo
1
try with find:
for f in $(find . -type f); do tail -1 $f|grep PATTERN; done|wc -l
If grep supports the -P option, this might work:
grep -P "A\Z" -r . | wc -l
See man pcrepattern. In short:
\Z matches at the end of the subject, and also before a newline at the end of the subject
\z matches only at the end of the subject
Try \Z and \z.
To see which files match, you would use only the grep part without the pipe to wc.
This will return the number of files:
grep -rlP "A\z" | wc -l
If you want to get the names then simply:
grep -rlP "A\Z"

Pad/Fill missing columns in CSV file (using tabs)

I have some CSV files with TAB as the separator. The lines have a variable number of columns and I want to normalize that.
I need exactly, say, 10 columns, so effectively I want to add empty columns up to the 10th column whenever a line has fewer.
Also, I would like to loop over all files in a folder and update each file in place, rather than just printing the output or writing to a new file.
I can manage to do it with commas like this:
awk -F, '{$10=""}1' OFS=',' file.txt
But when I change it to \t it breaks and adds too many columns:
awk -F, '{$10=""}1' OFS='\t' file.txt
Any inputs?
If you have GNU awk (sometimes called gawk), this will make sure that you have ten columns and it won't erase tenth if it is already there:
awk -F'\t' -v OFS='\t' '{NF=10}1' file >file.tmp && mv file.tmp file
Awk users value brevity and a further simplification, as suggested by JID, is possible. Since, under awk, NF=10 evaluates to true, we can set NF to 10 at the same time that we cause the line to be printed:
awk -F'\t' -v OFS='\t' 'NF=10' file >file.tmp && mv file.tmp file
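As a quick sanity check (assuming awk here is GNU awk, as required above), you can feed a three-column line on stdin and count the resulting fields:
$ printf 'a\tb\tc\n' | awk -F'\t' -v OFS='\t' 'NF=10' | awk -F'\t' '{print NF}'
10
The record is rebuilt with seven empty fields appended, giving exactly ten tab-separated columns.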
MacOS: On a Mac, the default awk is BSD but GNU awk (gawk) can be installed using brew install gawk.
find /YourFolder -name "*.csv" -exec sed -i 's/$/\t\t\t\t\t\t\t\t\t/;s/^\(\([^\t]*\t\)\{9\}[^\t]*\).*/\1/' {} \;
The find selects all your CSV files.
The sed does the editing:
-i edits the file in place, avoiding a temporary file
the script appends 9 tabs to each line, then keeps only the first 10 fields (separated by 9 tabs)
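A quick check of that padding on a single short line, with the tabs made visible as commas (GNU sed is assumed, since the \t escapes rely on it):
$ printf 'a\tb\n' | sed 's/$/\t\t\t\t\t\t\t\t\t/;s/^\(\([^\t]*\t\)\{9\}[^\t]*\).*/\1/' | tr '\t' ','
a,b,,,,,,,,
The two original fields are kept and the line is padded out to ten fields.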
A version that only changes lines that are not already compliant:
find /YourFolder -name "*.csv" -exec sed -i '/^\([^\t]*\t\)\{9\}[^\t]*$/ ! {
s/$/\t\t\t\t\t\t\t\t\t/
s/^\(\([^\t]*\t\)\{9\}[^\t]*\).*/\1/
}' {} \;
Auto-adapting the column number:
# change the 2 occurrences of "9" to the wanted number of columns minus 1
find /YourFolder -name "*.csv" -exec sed -i ':cycle
/^\([^\t]*\t\)\{9\}[^\t]*$/ ! {
# optimize with the number of \t on the line below
s/$/\t/
s/^\(\([^\t]*\t\)\{9\}[^\t]*\).*/\1/
b cycle
}' {} \;
You can optimize for your case by adding several \t per cycle instead of 1 (the best value is around the average number of missing columns).

moving files that contain part of a line from a file

I have a file in which each line is a string of numbers, such as
1234
2345
...
I need to move files that contain that number in their name, followed by other stuff, to a directory. Examples being:
1234_hello_other_stuff_2334.pdf
2345_more_stuff_3343.pdf
I tried using xargs to do this, but my bash scripting isn't the best. Can anyone share the proper command to accomplish what I want to do?
for i in `cat numbers.txt`; do
mv ${i}_* examples
done
or (look ma, no cat!)
while read i; do
mv ${i}_* examples
done < numbers.txt
You could use a for loop, but that could make for a really long command line. If you have 20000 lines in numbers.txt, you might hit shell limits. Instead, you could use a pipe:
cat numbers.txt | while read number; do
mv ${number}_*.pdf /path/to/examples/
done
or:
sed 's/.*/mv -v &_*.pdf examples/' numbers.txt | sh
You can leave off the | sh for testing. If there are other lines in the file and you only want to match lines with 4 digits, you could restrict your match:
sed -r '/^[0-9]{4}$/s//mv -v &_*.pdf examples/' numbers.txt | sh
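For example, leaving off the | sh and assuming numbers.txt contains just the two sample numbers from the question, the generated commands would be:
$ sed 's/.*/mv -v &_*.pdf examples/' numbers.txt
mv -v 1234_*.pdf examples
mv -v 2345_*.pdf examples
Once the output looks right, add | sh back to actually run the moves.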
cat numbers.txt | xargs -n1 -I % find . -name '%*.pdf' -exec mv {} /path/to \;
% is your number (-n1 means one at a time), and '%*.pdf' passed to find means it will match all files whose names begin with that number; then it just moves each match to /path/to ({} is the actual file name).
