Bash script to add numbers from all files (each containing an integer) in a directory

I have many .txt files in a directory. Each file contains only an integer.
How can I write a bash script to add these integers and save the result to a file?

Just loop through the files, extract their integers, and sum them:
grep -ho '[0-9]*' files* | awk '{sum+=$1} END {print sum}'
Explanation
grep -ho '[0-9]*' files* extracts the numbers from the files whose names match files*. We use -h to suppress printing the file name with each match and -o to print only the match itself, not the whole line. Note that the pattern matches only digits, so a minus sign is dropped and negative integers would be summed as positive.
awk '{sum+=$1} END {print sum}' loops through the values coming from grep and sums them. Finally, it prints the result.
Test
$ tail a*
==> a1 <==
hello 23 asd
asdfasfd
==> a2 <==
asdfasfd
is 15
==> a3 <==
$ grep -ho '[0-9]*' a* | awk '{sum+=$1} END {print sum}'
38

You can cat your files and then sum them up using awk:
cat *.txt | awk '{x+=$0}END{print x}' > test.txt
test.txt should contain the sum.

Create some test files:
$ for f in {a,b,c,d}.txt; do
> echo $RANDOM > "$f"
> done
$ cat *.txt
18419
25511
31919
28810
Sum it using Bash:
$ i=0;
$ for f in *.txt; do
> ((i+=$(<"$f")));
> done
$ echo $i
104659
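Since the question asked for the result to be saved to a file, the loop can be wrapped up as a small script (a minimal sketch; sum.txt is an assumed output name):
#!/bin/bash
# Sum the integer stored in each .txt file and write the total
# to sum.txt (assumed output name)
total=0
for f in *.txt; do
    ((total += $(<"$f")))   # $(<file) expands to the file's contents
done
echo "$total" > sum.txt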

Related

Estimate number of lines in a file and insert that value as first line

I have many files for which I have to count the number of lines and add that value as the first line. To count the lines, I used something like this:
wc -l 000600.txt | awk '{ print $1 }'
However, I have had no success doing this for all files and then adding the corresponding value as the first line of each file.
An example:
With files a.txt, b.txt and c.txt:
>> print a
15
>> print b
22
>> print c
56
Then 15, 22 and 56 should be prepended to a.txt, b.txt and c.txt respectively.
I appreciate the help.
You can add a placeholder, for example LINENUM, as the first line of the file and then use the following script:
wc -l a.txt | awk 'BEGIN {FS =" ";} { print $1;}' | xargs -I {} sed -i 's/LINENUM/LINENUM:{}/' a.txt
Or, without a placeholder, just use this script:
wc -l a.txt | awk 'BEGIN {FS =" ";} { print $1;}' | xargs -I {} sed -i '1s/^/LINENUM:{}\n/' a.txt
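A shorter equivalent is possible (a sketch, assuming GNU sed, whose 1i accepts text on the same line; the command substitution computes the count before sed runs, so the inserted line does not affect it):
sed -i "1i LINENUM:$(wc -l < a.txt)" a.txt   # GNU sed; count is computed first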
This way you can add the line count as the first line of all *.txt files in the current directory. The group command used here is also faster than in-place editing commands in the case of large files. Do not change the spaces or the semicolons inside the grouping.
for f in *.txt; do
{ wc -l < "$f"; cat "$f"; } > "${f}.tmp" && mv "${f}.tmp" "$f"
done
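For example, on a hypothetical two-line file:
$ printf 'foo\nbar\n' > a.txt   # create a hypothetical test file
$ { wc -l < a.txt; cat a.txt; } > a.txt.tmp && mv a.txt.tmp a.txt
$ cat a.txt
2
foo
bar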
To iterate over all the files, you can use this script:
for f in *; do if [ -f "$f" ]; then wc -l < "$f" | xargs -I {} sed -i '1s/^/LINENUM:{}\n/' "$f"; fi; done
This might work for you (GNU sed):
sed -i '1h;1!H;$!d;=;x' file1 file2 file3 etc ...
This stores each file in memory and inserts the last line's line number, i.e. the line count, at the top.
Alternative:
sed -i ':a;$!{N;ba};=' file?
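For example, running the alternative without -i so the result goes to stdout (hypothetical three-line file):
$ printf 'one\ntwo\nthree\n' > file1   # create a hypothetical test file
$ sed ':a;$!{N;ba};=' file1
3
one
two
three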

Awk argument too long when merging csv files

I have more than 10000 csv files in a folder and I'm trying to merge them by line using awk, but if I run this command:
printf '%s\n' *.csv | xargs cat | awk 'FNR==1 && NR!=1{next;}{print}' *.csv > master.csv
I get the following errors:
/usr/bin/awk: Argument list too long and printf: write error: Broken pipe
With the printf and xargs parts, you are sending the contents of the csv files into awk, but you also provide the filenames to awk as arguments. Pick one or the other; I'd suggest:
{ printf '%s\n' *.csv | xargs awk 'FNR==1 && NR!=1{next;}{print}'; } > master.csv
Note, though, that if the file list is long enough for xargs to split it across several awk invocations, NR resets for each invocation and the header of the first file in every later batch will slip through.
If your file names don't contain newlines then you could do:
printf '%s\n' *.csv | awk 'NR==FNR{ARGV[ARGC++]=$0; next} !c++ || FNR>1' -
or if they can contain newlines then:
printf '%s\0' *.csv | awk -v RS='\0' 'NR==FNR{ARGV[ARGC++]=$0; next} !c++ || FNR>1' - RS='\n'
i.e. have awk read the list of CSV file names as input rather than the shell passing it to awk as arguments. That would work even if you had millions of CSV files.
For example, given this input:
$ head file*.csv
==> file1.csv <==
Number
1
2
==> file2.csv <==
Number
10
11
12
the above would produce this output:
$ printf '%s\n' *.csv | awk 'NR==FNR{ARGV[ARGC++]=$0; next} !c++ || FNR>1' -
Number
1
2
10
11
12

Bash shell: Count occurrences of pattern (in one file) listed in arrays (array elements loaded from different file)

Hi, I have loaded the patterns from the file pattern.txt into an array, and now I would like to grep the count of each array element in a second file (named count.csv).
pattern.txt
abc
def
ghi
count.csv
1234,abc,joseph
5678,ramson,abc
2231,sam,def
1123,abc,richard
2521,ghi,albert
7371,jackson,def
The bash shell script is given below:
declare -a myArray
myArray=( $(awk '{print $1}' ./pattern.txt))
for ((i=0; i < ${#myArray[*]}; i++))
do
var1=$(grep -c "${myArray[i]}" count.csv)
echo $var1
done
But, when I run the script, instead of giving below output
3
2
1
It gives output as
0
0
1
i.e. it only gives the correct count for the last array element.
grep + sort + uniq pipeline solution:
grep -o -w -f pattern.txt count.csv | sort | uniq -c
The output:
3 abc
2 def
1 ghi
grep options:
-f - obtain pattern(s) from file
-o - print only the matched parts of matching lines
-w - select only those lines containing matches that form whole words
The alternative awk approach:
awk 'NR==FNR{p[$0]; next}{ for(i=1;i<=NF;i++){ if($i in p) {p[$i]++; break} }}
END {for(i in p) print p[i],i}' pattern.txt FS="," count.csv
The output:
2 def
3 abc
1 ghi
p[$0] - accumulating patterns from the 1st input file (pattern.txt)
for(i=1;i<=NF;i++) - iterating through the fields of the line of the 2nd file (count.csv)
if($i in p) {p[$i]++; break} - incrementing counter for each matched pattern
It is better to use awk for processing text files line by line:
awk -F, 'NR==FNR {wrd[$1]; next} $2 in wrd{wrd[$2]++} $3 in wrd{wrd[$3]++}
END{for (w in wrd) print w, wrd[w]}' pattern.txt count.csv
def 2
abc 3
ghi 1
Reference: Effective AWK Programming
You could also skip the array and just loop over the patterns:
while read -r pattern; do
[[ -n $pattern ]] && grep -c "$pattern" count.csv
done < pattern.txt
grep -c outputs just the count of matching lines (not the number of matches)
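If you want the number of whole-word occurrences rather than matching lines, a variant of the same loop (a sketch; -o prints each match on its own line and wc -l counts them):
while read -r pattern; do
  # count every whole-word occurrence, not just matching lines
  [[ -n $pattern ]] && grep -ow "$pattern" count.csv | wc -l
done < pattern.txt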
Try using this command instead:
mapfile -t myArray < pattern.txt
for pattern in "${myArray[@]}"; do
    grep -o "$pattern" count.csv | wc -l
done
Output:
3
2
1
mapfile will store every line of pattern.txt as an element of myArray
The for loop will then iterate through each pattern in myArray and print the number of occurrences of that pattern in count.csv

In loop cat file - echo name of file - count

I'm trying to make a one-line command for the following:
in the folder "data" I have 570 files; each file contains some lines of text; the files are named 1.txt to 570.txt.
I want to cat each file, grep for a word and count how many times that word occurs.
At the moment I am trying to get this using for:
for FILES in $(find /home/my/data/ -type f -print -exec cat {} \;) ; do echo $FILES; cat $FILES |grep word ; done |wc -l
but while this counts correctly, it does not display which file each count came from.
I would like it to look :
----> 1.txt <----
210
----> 2.txt <----
15
etc, etc, etc..
How can I get that?
grep -o word * | uniq -c
is practically all you need.
grep -o word * gives a line for each hit, but prints only the match, in this case "word". Each line is prefixed with the name of the file it was found in.
uniq -c then collapses these into one line per file, prefixed with the count.
You can further format it to your needs with awk or whatever, though, for example like this:
grep -o word * | uniq -c | cut -f1 -d':' | awk '{print "File: " $2 " Count: " $1}'
You can try this :
for file in /path/to/folder/data/* ; do echo "----> $file <----" ; grep -c "word_to_count" "$file" ; done
The for loop will iterate over the files inside the folder "data".
For each of these files, it prints the name and searches for "word_to_count" (grep -c directly outputs a count of matching lines).
Be careful: if your search word appears more than once on a line, this solution will still count that line only once. See the variant below.
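If every occurrence should be counted, a variant of the same loop (a sketch; grep -o emits one line per match, which wc -l then counts):
for file in /path/to/folder/data/* ; do echo "----> $file <----" ; grep -o "word_to_count" "$file" | wc -l ; done   # counts every match, not just matching lines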
A bit of awk should do it:
awk '{s+=$1} END {print s}' mydatafile
Note: some versions of awk behave oddly if the sum exceeds 2^31 (2147483647). See the comments for more background. One suggestion is to use printf rather than print:
awk '{s+=$1} END {printf "%.0f", s}' mydatafile
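For example, summing the numbers 1 through 100 as a quick sanity check:
$ seq 1 100 | awk '{s+=$1} END {print s}'   # 1+2+...+100
5050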
A Python alternative, summing integers read from stdin:
$ python -c "import sys; print(sum(int(l) for l in sys.stdin))"
If you only want the total number of lines, you could use
find /home/my/data/ -type f -exec cat {} + | wc -l

How to apply 'awk' for all files in folder?

I am new to awk, please pardon my ignorance. I am using awk to extract tag values from a file. The following code works for a single file:
awk -F"<NAME>|</NAME>" '{print $2; exit;}' file.txt
but I am not sure how I can run it for all files in a folder.
A sample file is as follows:
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>
#!/bin/bash
STRING=ABC
DATE=$(date +%Y/%m/%d | tr '/' '-')
changedate(){
for a in /root/Working/awk/*
do
for b in $(awk -F"<NAME>|</NAME>" '{print $2;}' "$a")
do
if [ "$b" == "$STRING" ]; then
for c in $(awk -F"<DATE>|</DATE>" '{print $2;}' "$a")
do
sed "s/$c/$DATE/g" "$a";
done
else
echo "Strings are not a match";
fi
done
done
}
changedate
When you run it:
root@revolt:~# cat /root/Working/awk/*
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>DEF</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>GHI</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>JKL</NAME><DATE>2015-12-11</DATE></BODY>
String in code is set to ABC
root@revolt:~# ./ANSWER
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-07-24</DATE></BODY>
Strings are not a match
Strings are not a match
Strings are not a match
String in code is set to DEF
root@revolt:~# ./ANSWER
Strings are not a match
<HEADER><H1></H1></HEADER><BODY><NAME>DEF</NAME><DATE>2015-07-24</DATE></BODY>
Strings are not a match
Strings are not a match
Alright. In this script you would set STRING=ABC, or whatever your desired string is. You could also adapt it to check a list of strings.
The DATE variable holds the date in the same Y-m-d format as in your files: date prints it as Y/m/d and tr then replaces the forward slashes with hyphens.
First we create a function called "changedate". Within this function we nest a few for loops. The first for loop iterates over each file in /root/Working/awk/, with $a holding the file's path on each pass.
The next for loop grabs whatever is between the NAME tags of each file and prints it; notice we're still using $a as the file because it holds the file path. Then an if statement checks for your string. If it matches, another for loop substitutes the date in file $a (note that sed without -i only prints the result to stdout; add -i to edit the file in place). If it doesn't match, we echo that the strings are not a match.
Lastly, we call our "changedate" function, which runs the entire looping sequence above.
To answer your question somewhat generically about running awk on multiple files, imagine we have these files:
$ cat file1.txt
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>
$ cat file2.txt
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-12-11</DATE></BODY>
$ cat file3.txt
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>
One thing you can do is simply supply awk with multiple files as with almost any command (like ls *.txt):
$ awk -F"<NAME>|</NAME>" '{print $2}' *.txt
XYZ
ABC
123
Awk just reads lines from each file in turn. As mentioned in the comments, be careful with exit because it will stop processing altogether after the first match:
$ awk -F"<NAME>|</NAME>" '{print $2; exit}' *.txt
XYZ
However, if for efficiency or some other reason you want to stop processing the current file and move on immediately to the next one, you can use the gawk-only nextfile:
$ # GAWK ONLY!
$ gawk -F"<NAME>|</NAME>" '{print $2; nextfile}' *.txt
XYZ
ABC
123
Sometimes the results on multiple files are not useful without knowing which lines came from which file. For that you can use the built-in FILENAME variable:
$ awk -F"<NAME>|</NAME>" '{print FILENAME, $2}' *.txt
file1.txt XYZ
file2.txt ABC
file3.txt 123
Things get trickier when you want to modify the files you are working
on. Imagine you want to convert the name to lower case:
$ awk -F"<NAME>|</NAME>" '{print tolower($2)}' *.txt
xyz
abc
123
With traditional awk, the usual pattern is to save to a temp file and copy the temp file back over the original (obviously you want to be careful with this, keeping copies of the originals!):
$ cat file1.txt
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>
$ awk -F"<NAME>|</NAME>" '{ sub($2,tolower($2)); print }' file1.txt > tmp && mv tmp file1.txt
$ cat file1.txt
<HEADER><H1></H1></HEADER><BODY><NAME>xyz</NAME><DATE>2015-12-11</DATE></BODY>
To use this style on multiple files, it's probably easier to drop back to
the shell and run awk in a loop on single files:
$ cat file1.txt file2.txt file3.txt
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>
$ for f in file*.txt; do
> awk -F"<NAME>|</NAME>" '{ sub($2,tolower($2)); print }' $f > tmp && mv tmp $f
> done
$ cat file1.txt file2.txt file3.txt
<HEADER><H1></H1></HEADER><BODY><NAME>xyz</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>abc</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>
Finally, with gawk (4.1 or later) you have the option of in-place editing (much like sed -i):
$ cat file1.txt file2.txt file3.txt
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>
$ # GAWK ONLY!
$ gawk -v INPLACE_SUFFIX=.sav -i inplace -F"<NAME>|</NAME>" '{ sub($2,tolower($2)); print }' *.txt
$ cat file1.txt file2.txt file3.txt
<HEADER><H1></H1></HEADER><BODY><NAME>xyz</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>abc</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>
The recommended INPLACE_SUFFIX variable tells gawk to make a backup of each file with that extension:
$ cat file1.txt.sav file2.txt.sav file3.txt.sav
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>
