I have about 1000 data files named in the format file_1000.txt, file_1100.txt, etc.
Each of these files contains data in 2 columns and more than 2k rows (this is an example):
1.270000e-01 1.003580e+00
6.270000e-01 1.003582e+00
1.126000e+00 1.003582e+00
1.626000e+00 1.003584e+00
2.125000e+00 1.003584e+00
2.625000e+00 1.003586e+00
...
I want to find the maximum value of the 2nd column in each data file and store these numbers somewhere (in particular, to plot them in gnuplot). I tried to use the script:
cat file_1*00.txt | awk '{if ($2 > max) max=$2}END{print max}'
But it reads all the files matching file_1*00.txt as a single stream and outputs only 1 number: the maximum value across all of them.
How can I change the script to output the maximums from ALL the files it matches?
Thanks!
awk '{if(a[FILENAME]<$2)a[FILENAME]=$2}END{for(i in a)print i,a[i]}' file_1*00.txt
This keeps a separate maximum per input file (keyed by the FILENAME built-in) and prints a filename/maximum pair for each one.
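If the goal is the gnuplot plot you mentioned, redirect that output to a file first. A minimal sketch, where maxima.dat and maxima.png are hypothetical names; note that for (i in a) visits the files in unspecified order, hence the sort:
awk '{if(a[FILENAME]<$2)a[FILENAME]=$2}END{for(i in a)print i,a[i]}' file_1*00.txt | sort > maxima.dat
gnuplot -e "set terminal png; set output 'maxima.png'; plot 'maxima.dat' using 0:2 with points title 'per-file max'"
Here using 0:2 plots column 2 against the line index, since column 1 holds filenames rather than numbers.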
I have a tab-delimited text file with two columns and no header. I want to take the mean of each column within blocks of 10 rows: take the first 10 rows, compute the mean of the 10 numbers in each column, and write the two means as one line to another text file. Then take the next 10 rows and do the same, until the end of the file. If fewer than 10 rows are left at the end, just take the mean of the remaining rows.
Input file:
0.32832977 3.50941E-10
0.31647876 3.38274E-10
0.31482627 3.36508E-10
0.31447645 3.36134E-10
0.31447645 3.36134E-10
0.31396809 3.35591E-10
0.31281157 3.34354E-10
0.312004 3.33491E-10
0.31102326 3.32443E-10
0.30771822 3.2891E-10
0.30560062 3.26647E-10
0.30413213 3.25077E-10
0.30373717 3.24655E-10
0.29636685 3.16777E-10
0.29622422 3.16625E-10
0.29590765 3.16286E-10
0.2949896 3.15305E-10
0.29414582 3.14403E-10
0.28841901 3.08282E-10
0.28820667 3.08055E-10
0.28291832 3.02403E-10
0.28243792 3.01889E-10
0.28156429 3.00955E-10
0.28043638 2.9975E-10
0.27872239 2.97918E-10
0.27833349 2.97502E-10
0.27825573 2.97419E-10
0.27669023 2.95746E-10
0.27645657 2.95496E-10
Expected output text file:
0.314611284 3.36278E-10
0.296772974 3.172112E-10
0.279535036 2.987864E-10
I tried this code, but I don't know how to make it loop over each block of 10 rows:
awk '{x+=$1;next}END{print x/NR}' file
Here is an awk to do this:
awk -v m=10 -v OFS="\t" '
FNR%m==1{sum1=0;sum2=0}                        # first line of a block: reset the sums
{sum1+=$1;sum2+=$2}
FNR%m==0{print sum1/m,sum2/m; lfnr=FNR; next}  # last line of a full block: print its means
END{if(FNR>lfnr)print sum1/(FNR-lfnr),sum2/(FNR-lfnr)}' file
Prints:
0.314611 3.36278e-10
0.296773 3.17211e-10
0.279535 2.98786e-10
Or, if you want the same number of decimals as your input, you can use printf:
awk -v m=10 -v OFS="\t" '
FNR%m==1{sum1=0;sum2=0}
{sum1+=$1;sum2+=$2}
FNR%m==0{printf("%0.9G%s%0.9G\n",sum1/m,OFS,sum2/m); lfnr=FNR; next}
END{if(FNR>lfnr)printf("%0.9G%s%0.9G\n",sum1/(FNR-lfnr),OFS,sum2/(FNR-lfnr))}' file
Prints:
0.314611284 3.36278E-10
0.296772974 3.172112E-10
0.279535036 2.98786444E-10
Your superpower here is the % modulo operator, which lets you detect every mth step -- in this case every 10th. Your x-ray vision is the FNR awk special variable, which is the line number of the file you are reading.
FNR%10 is always less than 10; when it is 0 you are on the 10th line of a block and it is time to print, and when it is 1 you are on the first line of a block and it is time to reset the sums. (The END guard handles a partial block at the end of the file and avoids a division by zero when the line count is an exact multiple of m.)
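A quick way to see that pattern on your own file:
awk -v m=10 '{ print FNR, FNR%m }' file
The second column cycles 1, 2, ..., 9, 0: the 1s mark block starts and the 0s mark completed blocks.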
I have about 140 files with data which I would like to process with a script.
The files have two types of names:
sys-time-4-16-80-15-1-1.txt
known-ratio-4-16-80-15-1-1.txt
where the two last numbers vary. The penultimate number takes the values 1, 50, 100, 150, ..., 300, and the last number ranges over 1, 2, 3, 4, 5, ..., 10. A sample of these files is in this link.
I would like to write a new file with 3 columns as follows:
A 1st column with the penultimate number of the file, i.e., 1, 50, 100, ...
A 2nd column with the mean value of the second column in each sys-time-.. file.
A 3rd column with the mean value of the second column in each known-ratio-.. file.
The result should have a row for each pair of averaged 2nd columns of sys and known files:
1 mean-sys-1 mean-know-1
1 mean-sys-2 mean-know-2
.
.
1 mean-sys-10 mean-know-10
50 mean-sys-1 mean-know-1
50 mean-sys-2 mean-know-2
.
.
50 mean-sys-10 mean-know-10
100 mean-sys-1 mean-know-1
100 mean-sys-2 mean-know-2
.
.
100 mean-sys-10 mean-know-10
....
....
300 mean-sys-10 mean-know-10
where each row corresponds to the sys and known files that share the same two last numbers. As noted, the first column should carry the penultimate number of those files.
I know how to compute the mean value of the second column of a file with awk:
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n; }' sys-time-4-16-80-15-1-5.txt
but I do not know how to iterate on all the files and build a result file with the three columns as above.
Here's a shell script that uses GNU datamash to compute the averages (though you can easily swap in awk if desired; I prefer datamash for calculating stats):
#!/bin/sh
nums=$(mktemp)
sysmeans=$(mktemp)
knownmeans=$(mktemp)
for systime in sys-time-*.txt
do
    knownratio=$(echo -n "$systime" | sed -e 's/sys-time/known-ratio/')
    echo "$systime" | sed -E 's/.*-([0-9]+)-[0-9]+\.txt/\1/' >> "$nums"
    datamash -W mean 2 < "$systime" >> "$sysmeans"
    datamash -W mean 2 < "$knownratio" >> "$knownmeans"
done
paste "$nums" "$sysmeans" "$knownmeans"
rm -f "$nums" "$sysmeans" "$knownmeans"
It creates three temporary files, one per column, and after populating them with the data from each pair of files, one pair per line of each, uses paste to combine them all and print the result to standard output.
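If datamash isn't available, the same per-file mean can be computed with the awk one-liner from the question. A sketch of the swap, where mean2 is a hypothetical helper name and only the two datamash lines change:
# mean2 FILE: print the mean of column 2 of FILE (same awk as in the question)
mean2() { awk '{ sum += $2; n++ } END { if (n > 0) print sum / n }' "$1"; }
mean2 "$systime"    >> "$sysmeans"
mean2 "$knownratio" >> "$knownmeans"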
I've used GNU Awk for easy, per-file operations. This is untested; please let me know how it runs. You might want to look into printf() for pretty-printed output.
mapfile -t Files < <(find . -type f -name "*-4-16-80-15-*" | sort -t- -k7,7n -k8,8n) #1
gawk '
BEGINFILE {n=split(FILENAME, f, "-"); sub(/^\.\//, "", f[1]); type=f[1]; a[type]=0; c=0} #2
{a[type] = ($2 + a[type] * c++) / c} #3
ENDFILE {if (type=="sys") {sub(/\.txt$/, "", f[n]); print f[n], a["sys"], a["known"]}} #4
' "${Files[@]}"
Create a Bash array with the matching files sorted numerically by the last two "keys". We will feed this array to Awk later. Notice how the "known" and "sys" files alternate in pairs in this sample:
./known-ratio-4-16-80-15-2-150
./sys-time-4-16-80-15-2-150
./known-ratio-4-16-80-15-3-1
./sys-time-4-16-80-15-3-1
./known-ratio-4-16-80-15-3-50
./sys-time-4-16-80-15-3-50
At the beginning of every file, clear the running average and line counter, and save the type as either "sys" or "known".
On every line, update the Cumulative Moving Average: new_avg = (value + old_avg * count) / (count + 1).
At the end of every file, check the file type. If we just handled a "sys" file, print the last part of the filename followed by our averages.
I have a requirement where I need to write a bash script that splits a single input file into 'n' files, with no file containing more than 'x' records (except the last file, which gets everything remaining). The values of 'n' and 'x' are passed to the script as arguments by the user.
n should be the total number of split files
x should be the maximum number of records in a split file (except the last file).
Suppose the input file has 5000 records and the user passes n=3 and x=1000; then files 1 and 2 should contain 1000 records each and file 3 should contain the remaining 3000 records.
Another example: if the input file has 4000 records and the user passes n=2 and x=3000, then file 1 should contain 3000 records and file 2 should contain 1000 records.
I tried the below command:
split -n$maxBatch -l$batchSize --numeric-suffixes $fileDir/$nzbnListFileName $splitFileName
But it fails with the error "split: cannot split in more than one way".
Please advise.
You need to give either the -n parameter or the -l parameter, not both of them together.
split -l1000 --numeric-suffixes yourFile.txt
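For example, on a 5000-line file this produces five 1000-line pieces (part_ is a hypothetical prefix):
seq 5000 > yourFile.txt
split -l1000 --numeric-suffixes yourFile.txt part_
wc -l part_*    # part_00 .. part_04, 1000 lines each
Since -l alone creates as many files as it needs, it cannot enforce your n by itself.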
Sounds like split isn't enough for your requirements then - it can do either files of X lines each, or N files, but not the combination. Try something like this:
awk -v prefix="$splitFileName" -v lines="$x" -v maxfiles="$n" '
# start a new output file every "lines" records, at most "maxfiles" times
(NR - 1) % lines == 0 && fileno < maxfiles { fileno += 1 }
# append the current record to the current output file
{ print >> (prefix fileno) }' input.txt
That increments a counter every X lines up to N times, and writes lines to a file whose name depends on the counter.
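A minimal usage sketch for the 5000-record example from the question (out_ is a hypothetical prefix):
awk -v prefix=out_ -v lines=1000 -v maxfiles=3 '
(NR - 1) % lines == 0 && fileno < maxfiles { fileno += 1 }
{ print >> (prefix fileno) }' input.txt
wc -l out_1 out_2 out_3    # expect 1000, 1000, and 3000 lines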
I have a file with two columns like this:
73697695 1111111100110100211010101100000110100111
73697715 0100001010100022000020000000000200200000
73698148 0000000000200000000200220001100100010111
73698210 1111111211011012001011000001111000001110
73698229 1111111100110111110020101000000111210011
736985237658373533 0000000110100011101010000001100100100000
73698858 1111111210010011101010000001100100100111
73698887 2222222200020000202000000010001100110000
73699163 2222222200020100110110211100000100100100
7369929986423 2222222200020100110110211100000011110111
I am trying to create a bash file to format its columns. First I need to find the maximum length of the first column. So, I did this:
lengthID=$(awk '{ if (length($1) > max) max = length($1) } END { print max }' file)
After that, I would like to use that "lengthID" to right-align the first column of the file. The second column must be left-aligned. I tried the following command line, but it did not work.
awk '{printf("%"lengthID"s%1s%" "s\n",$1," ",$2)}' file > temp && mv temp file
I know that the maximum length is 18 (lengthID = 18). So, the following command would work:
awk '{printf("%18s%1s%" "s\n",$1," ",$2)}' file > temp && mv temp file
And I would get a file like this (exactly what I need). However, I'd like to find that length automatically and use it.
          73697695 1111111100110100211010101100000110100111
          73697715 0100001010100022000020000000000200200000
          73698148 0000000000200000000200220001100100010111
          73698210 1111111211011012001011000001111000001110
          73698229 1111111100110111110020101000000111210011
736985237658373533 0000000110100011101010000001100100100000
          73698858 1111111210010011101010000001100100100111
          73698887 2222222200020000202000000010001100110000
          73699163 2222222200020100110110211100000100100100
     7369929986423 2222222200020100110110211100000011110111
Does anyone know a way to do it?
Thank you.
$ awk 'NR==FNR{c=length($1); w=(w>c?w:c); next} {printf "%*s %s\n", w, $1, $2}' file file
          73697695 1111111100110100211010101100000110100111
          73697715 0100001010100022000020000000000200200000
          73698148 0000000000200000000200220001100100010111
          73698210 1111111211011012001011000001111000001110
          73698229 1111111100110111110020101000000111210011
736985237658373533 0000000110100011101010000001100100100000
          73698858 1111111210010011101010000001100100100111
          73698887 2222222200020000202000000010001100110000
          73699163 2222222200020100110110211100000100100100
     7369929986423 2222222200020100110110211100000011110111
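Your original two-step approach also works once the shell variable is actually passed into awk with -v; in your attempt, lengthID was never visible inside the awk script. A sketch:
lengthID=$(awk '{ if (length($1) > max) max = length($1) } END { print max }' file)
awk -v w="$lengthID" '{ printf "%*s %s\n", w, $1, $2 }' file > temp && mv temp file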
You sure you can't just use column -t file though?
$ column -t file
73697695            1111111100110100211010101100000110100111
73697715            0100001010100022000020000000000200200000
73698148            0000000000200000000200220001100100010111
73698210            1111111211011012001011000001111000001110
73698229            1111111100110111110020101000000111210011
736985237658373533  0000000110100011101010000001100100100000
73698858            1111111210010011101010000001100100100111
73698887            2222222200020000202000000010001100110000
73699163            2222222200020100110110211100000100100100
7369929986423       2222222200020100110110211100000011110111
The alignment is different, but if all you need are visual columns, maybe that's OK.
I have many files with this structure: two columns of numbers. I want to add up the values in the second column, line by line, across all of my files, so I'll end up with only one file. Can anyone help? Hope the question was clear enough. Thanks.
The following is based on the information the OP provided in the comments above:
We have multiple files and we have to sum the second column of each of these files; as far as we know, there could be hundreds or thousands of them.
The first column in each file seems unimportant, and I'm going to assume (based on the OP's sample data) that every input file has the same first column.
The basic idea is to start with an empty summary file (tot), then repeatedly paste the next input file against tot and sum columns 2 and 4 (when both are present) into the second column of a new tot file.
In other words...
$ touch tot ; for f in * ; do paste tot "${f}" | awk '{ if ( NF > 3 ) { print $1, $2+$4 } else { print $1, $2 } }' > tmp ; mv tmp tot ; done
I tested it with 8 different files and it seems to work as expected.
Of course, for f in * has to be changed so that it captures ALL and ONLY the files we want to sum (as written, the glob would also pick up tot itself).
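An awk-only sketch of the same idea, under the same assumptions (every file has the same number of lines and the same first column); file1 file2 ... stands for your real file list:
awk '{ key[FNR] = $1; sum[FNR] += $2 }                      # accumulate column 2 per line number
END { for (i = 1; i <= FNR; i++) print key[i], sum[i] }' file1 file2 ... > tot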
Assuming what you want is instead the grand total of all the values in the second column across every file, it looks like a simple enough job for awk:
awk '{ sum += $2 } END { print sum }' file1 file2 ...