How to take the average column count from multiple files using shell? - shell

I have 3 files, each having 4 columns. I want to take the column count of each file, sum these counts, and divide by the total number of files, thus getting the average column count across multiple files.
Note: the column count of each file may not be the same each time, and the number of files can increase.
Kindly help.
e.g.
file1 = 3 columns
file2 = 4 columns
file3 = 5 columns
sum(3+4+5) / 3 (file count) = avg column count for a directory holding multiple files.

You can use the code snippet below if you want. It is a bit verbose so that everything is explained in detail.
The location contains 3 files.
#!/bin/bash

fileCount=$(ls file*.txt | wc -l)   ## count the number of files in the location
columnCount=0

for file in ./file*.txt; do         ## loop over the files; "file" represents each file in turn
    temp=$(awk -F"," '{print NF}' "${file}" | uniq)   ## column count of this file; assumes every row has the same number of fields
    columnCount=$(( columnCount + temp ))             ## add the column count of each file to the running total
done

avgColumn=$(( columnCount / fileCount ))  ## calculate the average (integer division)
echo "Files in directory $(pwd) have ${avgColumn} columns on average!"  ## print the average
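For comparison, the same average can be computed in a single awk pass without a shell loop. This is only a sketch under the same assumptions (comma-separated files matching file*.txt, with the column count taken from the first line of each file):
awk -F"," 'FNR == 1 { total += NF; files++ }            # first line of each file: add its field count
           END      { if (files) print total / files }  # divide the total by the number of files
' file*.txt
Note that this prints a floating-point average, whereas the $(( )) arithmetic above does integer division.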

Related

How to display some information from csv file in Bash Ubuntu

I have a csv file like this:
,2019 - October,,2019 - September,,2019 - August,
,Agricultural,Industrial,Agricultural,Industrial,Agricultural,Industrial
Toronto,86746,382958,55833,348182,49313,355977
Montreal,70718,605909,22084,549823,23428,641181
Calgary,231493,1420226,114937,1249378,114243,1189979
TOTAL,388957,2409093,192853,2147384,186984,2187137
I have to do a script to display some information from the file in this way:
2 388957 388957
3 2409093 2409093
.. ...... ....
For example, in the first line the number 2 corresponds to the column number (the display begins in column B), the next number corresponds to the sum of rows 3 to 5 of column B, and the last one corresponds to the number in the TOTAL line. The second line is the same but done with column C, and so on from column B to column G.
How could I do this in Bash Ubuntu?
I've tried to do this, but it isn't working:
#!/bin/bash
INPUTFILE="/media/exports.csv"
FIELD=2
COUNT=`sed 's/[^,]//g' $INPUTFILE | wc -c`; let "COUNT+=1"
while [ "$FIELD" -lt "$COUNT" ]; do
awk 'BEGIN{FS=",";sum=0}{if(NR>=3&&NR<=5)sum=sum+$FIELD}END{print$FIELD,sum, (tail -1), "\n"
}' /media/exports.csv
let "FIELD+=1"
done
Running this script, the information displays like this:
TOTAL,388957,2409093,192853,2147384,186984,2187137 0 -1
This result is repeated many times. What am I doing wrong?
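Two things stand out in that attempt: $FIELD sits inside a single-quoted awk program, so awk never sees its value, and (tail -1) is not valid awk. One possible single-awk approach, offered only as a sketch (it hard-codes columns 2 to 7 and the city rows 3 to 5 from the sample file):
awk -F, '
NR >= 3 && NR <= 5 { for (i = 2; i <= 7; i++) sum[i] += $i }    # sum the city rows per column
$1 == "TOTAL"      { for (i = 2; i <= 7; i++) total[i] = $i }   # remember the TOTAL row values
END                { for (i = 2; i <= 7; i++) print i, sum[i], total[i] }
' /media/exports.csv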

How to average the values of different files and save them in a new file

I have about 140 files with data which I would like to process with a script.
The files have two types of names:
sys-time-4-16-80-15-1-1.txt
known-ratio-4-16-80-15-1-1.txt
where the last two numbers vary. The penultimate number takes the values 1, 50, 100, 150, ..., 300, and the last number ranges over 1, 2, 3, 4, 5, ..., 10. A sample of these files is in this link.
I would like to write a new file with 3 columns as follows:
A 1st column with the penultimate number of the file, i.e., 1, 50, 100, ...
A 2nd column with the mean value of the second column in each sys-time-.. file.
A 3rd column with the mean value of the second column in each known-ratio-.. file.
The result should have a row for each pair of averaged 2nd columns from the sys and known files:
1 mean-sys-1 mean-know-1
1 mean-sys-2 mean-know-2
.
.
1 mean-sys-10 mean-know-10
50 mean-sys-1 mean-know-1
50 mean-sys-2 mean-know-2
.
.
50 mean-sys-10 mean-know-10
100 mean-sys-1 mean-know-1
100 mean-sys-2 mean-know-2
.
.
100 mean-sys-10 mean-know-10
....
....
300 mean-sys-10 mean-know-10
where each row corresponds to the sys and known files that share the same last two numbers.
Besides, I would like the first column to contain the penultimate number of those files.
I know how to compute the mean value of the second column of a file with awk:
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n; }' sys-time-4-16-80-15-1-5.txt
but I do not know how to iterate on all the files and build a result file with the three columns as above.
Here's a shell script that uses GNU datamash to compute the averages (though you can easily swap it out for awk if desired; I prefer datamash for calculating stats):
#!/bin/sh
nums=$(mktemp)
sysmeans=$(mktemp)
knownmeans=$(mktemp)
for systime in sys-time-*.txt
do
    # derive the matching known-ratio file name from the sys-time file name
    knownratio=$(echo -n "$systime" | sed -e 's/sys-time/known-ratio/')
    # extract the penultimate number from the file name (first output column)
    echo "$systime" | sed -E 's/.*-([0-9]+)-[0-9]+\.txt/\1/' >> "$nums"
    # mean of the 2nd (whitespace-separated) column of each file
    datamash -W mean 2 < "$systime" >> "$sysmeans"
    datamash -W mean 2 < "$knownratio" >> "$knownmeans"
done
paste "$nums" "$sysmeans" "$knownmeans"
rm -f "$nums" "$sysmeans" "$knownmeans"
It creates three temporary files, one per output column; after populating them with the data from each pair of input files (one pair per line), it uses paste to combine them and print the result to standard output.
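If datamash is not installed, a rough awk stand-in for each "datamash -W mean 2" call (an assumption, not part of the original script) is the same averaging one-liner shown in the question:
# mean of the 2nd whitespace-separated column
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n }' < "$systime" >> "$sysmeans"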
I've used GNU Awk for easy, per-file operations. This is untested; please let me know how it runs. You might want to look into printf() for pretty-printed output.
mapfile -t Files < <(find . -type f -name "*-4-16-80-15-*" | sort -t\- -k7,7 -k8,8) #1
gawk '
BEGINFILE {n=split(FILENAME, f, "-"); sub(/^\.\//, "", f[1]); type=f[1]; a[type]=0; c=0} #2
{a[type] = ($2 + a[type] * c++) / c} #3
ENDFILE {if(type=="sys") print f[n], a["sys"], a["known"]} #4
' "${Files[@]}"
Create a Bash array with matching files sorted by the last two "keys". We will feed this array to Awk later. Notice how we alternate between "sys" and "known" files in this sample:
./known-ratio-4-16-80-15-2-150
./sys-time-4-16-80-15-2-150
./known-ratio-4-16-80-15-3-1
./sys-time-4-16-80-15-3-1
./known-ratio-4-16-80-15-3-50
./sys-time-4-16-80-15-3-50
At the beginning of every file, reset the running average and line counter, and save the type as either "sys" or "known".
On every line, update the cumulative moving average of the 2nd column.
At the end of every file, check the file type. If we just handled a "sys" file, print the last part of the filename followed by our averages.
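To follow up on the printf() suggestion, the ENDFILE rule could be written like this instead (the %.4f precision is only an assumed formatting choice):
ENDFILE {if(type=="sys") printf "%s %.4f %.4f\n", f[n], a["sys"], a["known"]}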

Bash script to Split a file into n files with each file containing x number of records

I have a requirement where I need to write a bash script to split a single input file into 'n' files, where each file should not contain more than 'x' records (except the last file, which will hold everything remaining). The values of 'n' and 'x' will be passed to the script as arguments by the user.
n should be the total number of split files
x should be the maximum number of records in a split file (except the last file).
Suppose if the input file has 5000 records and the user passes argument values of n and x as 3 and 1000 then, file 1 and 2 should contain 1000 records each and file 3 should contain 3000 records.
Another example will be if the input file has 4000 records and the user passes argument values of n and x as 2 and 3000 then, file 1 should contain 3000 records and file 2 should contain 1000 records.
I tried the below command:
split -n$maxBatch -l$batchSize --numeric-suffixes $fileDir/$nzbnListFileName $splitFileName
But it throws an error saying that the split cannot be done in more than one way.
Please advise.
You need to give either the -n parameter or the -l parameter, not both of them together.
split -l1000 --numeric-suffixes yourFile.txt
Sounds like split isn't enough for your requirements then - it can do either files of X lines each, or N files, but not the combination. Try something like this:
awk -v prefix="$splitFileName" -v lines="$x" -v maxfiles="$n" '
(NR - 1) % lines == 0 && fileno < maxfiles { fileno += 1 }
{ print >> (prefix fileno) }' input.txt
That increments a counter every X lines up to N times, and writes lines to a file whose name depends on the counter.
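Since the question passes n and x to the script as arguments, a thin wrapper around that awk might look like the following sketch (the script name, argument order and output-name prefix are assumptions):
#!/bin/bash
# Assumed usage: ./splitfile.sh <input file> <n: number of split files> <x: max records per file>
inputFile=$1
n=$2
x=$3
awk -v prefix="${inputFile}.part" -v lines="$x" -v maxfiles="$n" '
(NR - 1) % lines == 0 && fileno < maxfiles { fileno += 1 }
{ print >> (prefix fileno) }' "$inputFile"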

Calculate the average over a number of columns

I am trying to create a script which calculates the average over a number of rows.
This number would depend on the number of samples that I have, which varies.
An example of these files is here:
24 1 2.505
24 2 0.728
24 3 0.681
48 1 2.856
48 2 2.839
48 3 2.942
96 1 13.040
96 2 12.922
96 3 13.130
192 1 50.629
192 2 51.506
192 3 51.016
The average is calculated on the 3rd column, and
the second column indicates the sample number (3 samples in this particular case).
Therefore, I should obtain 4 values here:
one average value per group of 3 rows.
I have tried something like:
count=3;
total=0;
for i in $( awk '{ print $3; }' ${file} )
do
for j in 1 2 3
do
total=$(echo $total+$i | bc )
done
echo "scale=2; $total / $count" | bc
done
But it is not giving me the right answer; it does not seem to calculate one average per group of three rows.
Expected output
24 1.3046
48 2.879
96 13.0306
192 51.0503
You can use the following awk script:
awk '{t[$1]+=$3;n[$1]++}END{for(i in t){print i,t[i]/n[i]}}' file
Output:
24 1.30467
48 2.879
96 13.0307
192 51.0503
This is better explained as a multiline script with comments in it:
# On every line of input
{
# sum up the value of the 3rd column in an array t
# which is indexed by the 1st column (the group key)
t[$1]+=$3
# Increment the number of lines having the same value of
# the 1st column
n[$1]++
}
# At the end of input
END {
# Iterate through the array t
for(i in t){
# Print the group key along with its average
print i,t[i]/n[i]
}
}
Apparently I brought a third view to the problem. In awk:
$ awk 'NR>1 && $1!=p{print p, s/c; c=s=0} {s+=$3;c++;p=$1} END {print p, s/c}' file
24 1.30467
48 2.879
96 13.0307
192 51.0503
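Read as a commented sketch, that one-liner works like this (same logic, just spread out):
awk '
NR>1 && $1!=p { print p, s/c; c=s=0 }   # group key changed: print the previous group average and reset
{ s+=$3; c++; p=$1 }                    # accumulate sum and count, remember the current key
END { print p, s/c }                    # print the final group average
' file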

Average number of rows in 10000 text files

I have a set of 10000 text files (file1.txt, file2.txt, ..., file10000.txt). Each one has a different number of rows. I'd like to know what the average number of rows is across these 10000 files, excluding the last row of each file. For example:
File1:
a
b
c
d
last
File2:
a
b
c
last
File3:
a
b
c
d
e
last
Here I should obtain 4 as the result. I tried with Python but it requires too much time to read all the files. How could I do it with a shell script?
Here's one way:
touch file{1..3}.txt
file 1 has 1 line, file 2 has two lines, and so on...
$ for i in {1..3}; do wc -l file${i}.txt; done | awk '{sum+=$1}END{print sum/NR}'
2
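Note that this averages the full line counts; to exclude the last row of each file, as the question asks, one possible adjustment (an untested sketch along the same lines) is to subtract one line per file before averaging:
for i in {1..3}; do wc -l < file${i}.txt; done | awk '{sum+=$1-1}END{print sum/NR}'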
