I have about 140 files with data which I would like to process with a script.
The files have two types of names:
sys-time-4-16-80-15-1-1.txt
known-ratio-4-16-80-15-1-1.txt
where the last two numbers vary. The penultimate number takes the values 1, 50, 100, 150, ..., 300, and the last number ranges from 1 to 10. A sample of these files is in this link.
I would like to write a new file with 3 columns as follows:
A 1st column with the penultimate number of the file, i.e., 1, 50, 100, ...
A 2nd column with the mean value of the second column in each sys-time-.. file.
A 3rd column with the mean value of the second column in each known-ratio-.. file.
The result should have one row for each pair of sys and known files, holding the averages of their second columns:
1 mean-sys-1 mean-know-1
1 mean-sys-2 mean-know-2
.
.
1 mean-sys-10 mean-know-10
50 mean-sys-1 mean-know-1
50 mean-sys-2 mean-know-2
.
.
50 mean-sys-10 mean-know-10
100 mean-sys-1 mean-know-1
100 mean-sys-2 mean-know-2
.
.
100 mean-sys-10 mean-know-10
....
....
300 mean-sys-10 mean-know-10
where each row corresponds to the sys and known files that share the same last two numbers.
In addition, the first column should repeat the penultimate number from the file names.
I know how to compute the mean value of the second column of a file with awk:
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n; }' sys-time-4-16-80-15-1-5.txt
but I do not know how to iterate on all the files and build a result file with the three columns as above.
Here's a shell script that uses GNU datamash to compute the averages (though you can easily swap in awk if desired; I prefer datamash for calculating stats):
#!/bin/sh
# one temporary file per output column
nums=$(mktemp)
sysmeans=$(mktemp)
knownmeans=$(mktemp)
for systime in sys-time-*.txt
do
    # the known-ratio file that pairs with this sys-time file
    knownratio=$(echo -n "$systime" | sed -e 's/sys-time/known-ratio/')
    # extract the penultimate number from the file name
    echo "$systime" | sed -E 's/.*-([0-9]+)-[0-9]+\.txt/\1/' >> "$nums"
    # mean of column 2 (-W: fields are whitespace-separated)
    datamash -W mean 2 < "$systime" >> "$sysmeans"
    datamash -W mean 2 < "$knownratio" >> "$knownmeans"
done
paste "$nums" "$sysmeans" "$knownmeans"
rm -f "$nums" "$sysmeans" "$knownmeans"
It creates three temporary files, one per output column. After populating them with the data from each pair of files (one pair per line), it uses paste to combine them and print the result to standard output.
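If datamash is not available, the two datamash calls can be swapped for the awk one-liner from the question; a minimal sketch of the replacement lines inside the loop (same mean of column 2):
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n }' "$systime"    >> "$sysmeans"
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n }' "$knownratio" >> "$knownmeans"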
I've used GNU Awk for easy, per-file operations. This is untested; please let me know how it runs. You might want to look into printf() for pretty-printed output.
mapfile -t Files < <(find . -type f -name "*-4-16-80-15-*" | sort -t\- -k7,7n -k8,8n) #1
gawk '
BEGINFILE {n=split(FILENAME, f, "-"); type=(f[1] ~ /sys$/ ? "sys" : "known"); a[type]=0; c=0} #2
{a[type] = ($2 + a[type] * c++) / c} #3
ENDFILE {if (type=="sys") print f[n], a["sys"], a["known"]} #4
' "${Files[@]}"
Create a Bash array with the matching files sorted numerically by the last two "keys". We will feed this array to Awk later. Notice how we alternate between "sys" and "known" files in this sample:
./known-ratio-4-16-80-15-2-150
./sys-time-4-16-80-15-2-150
./known-ratio-4-16-80-15-3-1
./sys-time-4-16-80-15-3-1
./known-ratio-4-16-80-15-3-50
./sys-time-4-16-80-15-3-50
At the beginning of every file, reset the running average and the line counter, and save the type as either "sys" or "known".
On every line, update the cumulative moving average of the second column.
At the end of every file, check the file type. If we just handled a "sys" file, print the last part of the filename followed by our averages.
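For reference, the per-line update in #3 is just the standard cumulative moving average, new_avg = (x + old_avg * n) / (n + 1); in isolation (a sketch, not part of the script above) it reads:
awk '{ avg = ($2 + avg * c++) / c } END { print avg }' one-file.txt
where one-file.txt stands in for any single data file.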
I have many files with this structure, each with two columns of numbers. I want to add up, line by line, the values of the second column across all of my files, so that I end up with only one file. Can anyone help? I hope the question was clear enough. Thanks.
The following is based on the information the OP provided in the comments above:
We have multiple files and we have to sum the second column of each of these files. As far as we know, there could be hundreds or thousands of different files.
The first column in each file seems unimportant, and I'm going to assume (based on the OP's sample data) that the first column is the same in every input file.
The basic idea is to start with an empty summary file (tot), paste each input file in turn next to tot, and sum columns 2 and 4 (when present) into the second column of the new tot file.
In other words...
$ touch tot ; for f in * ; do paste tot "$f" | awk '{ if ( NF > 3 ) { print $1, $2+$4 } else { print $1, $2 } }' > tmp ; mv tmp tot ; done
I did test it with 8 different files and it seems to work as expected.
Of course for f in * has to be changed in order to capture ALL and ONLY the files we want to sum.
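For the record, the same per-line sum can also be done in a single awk pass without the paste/tot shuffle; a minimal sketch, assuming (as above) that every file carries the same first column and that file1, file2, file3 stand in for the real names:
awk '{ key[FNR] = $1; sum[FNR] += $2; if (FNR > n) n = FNR } END { for (i = 1; i <= n; i++) print key[i], sum[i] }' file1 file2 file3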
Assuming what you want is the sum of all values in the second column across all files, it looks like a simple enough job for awk:
cat files | awk '{ sum += $2 } END { print sum }'
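The same total without the extra cat, letting awk read the files directly (file1 file2 file3 stand in for your actual file names):
awk '{ sum += $2 } END { print sum }' file1 file2 file3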
I have about 1000 data files with names in the format file_1000.txt, file_1100.txt, etc.
Each of these files contains data in 2 columns and more than 2k rows (this is an example):
1.270000e-01 1.003580e+00
6.270000e-01 1.003582e+00
1.126000e+00 1.003582e+00
1.626000e+00 1.003584e+00
2.125000e+00 1.003584e+00
2.625000e+00 1.003586e+00
...
I want to find the maximum value of the 2nd column in each data file and store these numbers somewhere (in particular, to plot them in gnuplot). I tried to use the script:
cat file_1*00.txt | awk '{if ($2 > max) max=$2}END{print max}'
But it searches all files matching file_1*00.txt and outputs only 1 number: the maximum value across all these files.
How can I change the script to output the maximums from ALL the files I mentioned in the script?
Thanks!
awk '{if(a[FILENAME]<$2)a[FILENAME]=$2}END{for(i in a)print i,a[i]}' file_1*00.txt
Each file's max?
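Broken out with comments, the same one-liner reads:
awk '{
    # keep a running maximum of column 2, keyed by the file it came from
    if (a[FILENAME] < $2) a[FILENAME] = $2
}
END {
    # after all files are read, print each filename with its maximum
    for (i in a) print i, a[i]
}' file_1*00.txt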
I have a file with 25 million rows. I want to extract a specific 10 million lines from this file.
I have the indices of these lines in another file. How can I do it efficiently?
Assuming that the list of lines is in a file list-of-lines and the data is in data-file, and that the numbers in list-of-lines are in ascending order, then you could write:
current=0
while IFS= read -r wanted
do
    # advance through the data file (fd 3) until we reach the wanted line number
    while ((current < wanted))
    do
        if IFS= read -r -u 3 line
        then ((current++))
        else break 2    # data file exhausted
        fi
    done
    echo "$line"
done < list-of-lines 3< data-file
This uses the Bash extension that allows you to specify which file descriptor read should read from (read -u 3 to read from file descriptor 3). The list of line numbers to be printed is read from standard input; the data file is read from file descriptor 3. This makes one pass through each of the two files, which is within a constant factor of optimal.
If the list-of-lines is not sorted, replace the last line with the following, which uses the Bash extension called process substitution:
done < <(sort -n list-of-lines) 3< data-file
Assume that the file containing line indices is called "no.txt" and the data file is "input.txt".
awk '{printf "%08d\n", $1}' no.txt > no.1.txt
nl -n rz -w 8 input.txt | join - no.1.txt | cut -d " " -f1 --complement > output.txt
output.txt will have the wanted lines. I am not sure whether this is efficient enough, but it seems to be faster than this script (https://stackoverflow.com/a/22926494/3264368) in my environment.
Some explanations:
The 1st command preprocesses the indices file so that the numbers are right-justified with leading zeroes and width 8 (since the number of rows in input.txt is known to be 25M, 8 digits is enough).
The 2nd command numbers the rows of input.txt in exactly the same zero-padded format, then joins that stream with the preprocessed index file to get the wanted rows (cut removes the line numbers).
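One caveat: join expects both of its inputs to be sorted on the join field. The nl output already is; if no.txt is not in ascending order, sorting the zero-padded index file (where lexical order equals numeric order) takes care of it:
sort -o no.1.txt no.1.txt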
Since you said the file with the line numbers you're looking for is sorted, you can loop through the two files in awk:
awk 'BEGIN{getline nl < "line_numbers.txt"} NR == nl {print; getline nl < "line_numbers.txt"}' big_file.txt
This will read each line in each file precisely once.
Say your index file is index.txt and your data file is data.txt; then you can do it using sed as follows:
#!/bin/bash
while read -r line_no
do
    sed "${line_no}q;d" data.txt
done < index.txt
You could run a loop that reads from the 25-million-line file and, when the loop counter reaches a line number that you want, write out that line. For example:
BufferedReader br = new BufferedReader(new FileReader("data-file")); // placeholder name for the 25M-line file
String line;
int count = 0;
while ((line = br.readLine()) != null)
{
    count++;                      // current line number (1-based)
    if (count == indice)          // `indice` holds the wanted line number
    {
        System.out.println(line); // or file write
    }
}
br.close();
I have 15 files like
file1.csv
a,cg2,0,0,0,21,0
a,cq1,10,0,0,0,0
a,cm2,0,19,0,0,0
...
a,ad10,0,0,0,37,0
file2.csv
d,cm1,0,3,0,0,0
d,cs2,0,32,0,0,0
d,cg2,0,0,9,0,0
...
d,az2,0,0,0,21,0
.
.
.
.
file15.csv
s,sd1,0,23,0,0,0
s,cw1,0,0,7,0,0
s,c23,0,0,90,0,0
...
s,cg2,0,45,0,0,0
The files have different numbers of lines, and I want to compare the second field of all 15 files and extract the entries that are common to the second field of all 15 files.
In the above case the output is:
cg2
(taking cg2 to be common to the second field of all 15 files)
I am a little new to Unix and shell scripting; please help.
Do you want the full lines from each of the fifteen files where field 2 appears in all fifteen files? Or do you only want a list of the field 2 values that appear in all fifteen files?
The former:
a,cg2,0,0,0,21,0
d,cg2,0,0,9,0,0
. . .
s,cg2,0,45,0,0,0
. . .
The latter:
cg2
. . .
If the latter, then this should work
awk -F, '{arr[$2]++; if (FILENAME != prevfile) {c++; prevfile = FILENAME}} END {for (i in arr) {if (arr[i] >= c) {print i}}}' file*.csv
Broken out on multiple lines:
awk -F, '{
arr[$2]++;
if (FILENAME != prevfile) {
c++;
prevfile = FILENAME
}
}
END {
for (i in arr) {
if (arr[i] >= c) {
print i
}
}
}' file*.csv
Explanation:
increment the count of the number of times a field 2 value occurs
if the filename changes, increment the count of files (for the first file, prevfile changes from a null string to that filename, so the count goes from 0 to 1)
save the current filename
once all the counting is done, iterate over the array by its keys
if the count contained in the array is greater than or equal to the number of files, then the field 2 value appeared in all the files (by checking for >= instead of == this will work in case a value appears more than once in a single file)
so print the key (which is a field 2 value)
a glob is used to get all the files, but you could list them explicitly
Edit:
Here's a way to print the full matching lines using a two-pass technique. It's a modification of the version above. Make sure to list the files twice.
awk -F, '
FILENAME == first && flag {
exit
}
! first {
first = FILENAME
}
FILENAME != first {
flag = 1
}
{
arr[$2]++;
if (FILENAME != prevfile) {
c++;
prevfile = FILENAME
}
}
END {
# print the matching lines
do {
if (arr[$2] >= c) {
print;
}
} while (getline);
# print the list of words
for (i in arr) {
if (arr[i] >= c) {
print i
}
}
}' file*.csv file*.csv
It depends on the first file in the first group having the same name as the first file in the second group. Using globbing similar to what I've shown will take care of that requirement.
It prints the matching lines (not grouped, though), then it prints the list of words. If you want only one or the other, comment out or remove the loop that you don't want (do/while or for).
If you print only the full lines, you can pipe the output to:
sort -t , -k2,2
to have them grouped.
Piping only the list of words to:
sort
will put them in the same order for easier comparison.
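As a cross-check, the list of common field-2 values can also be produced with cut, sort and uniq; a minimal sketch, assuming there are exactly 15 input files and deduplicating within each file first so that repeats inside a single file don't inflate the count:
for f in file*.csv; do cut -d, -f2 "$f" | sort -u; done | sort | uniq -c | awk '$1 == 15 {print $2}'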
Fun problem.
One way to do it, entirely in Bash, is as follows.
One thing you will need to invoke is join -t ',' -1 2 -2 2 file1 file2 to join on the second column of two files. Before you can join, though, you must sort on the second column.
Do successive joins in a for-loop, because join takes only two files as arguments.
ADDENDUM
Here is a little transcript showing successive joins. You can adapt it fairly easily, I think.
$ cat 1.csv
a,b,c,d
e,f,g,h
i,j,k,l
$ cat 2.csv
7,5,4,3
3,b,s,e
2,f,5,5
$ cat 3.csv
4,5,6,7
0,0,0,0
1,b,4,4
$ join -t ',' -1 2 -2 2 1.csv 2.csv | cut -f 1 -d ',' > temp
$ cat temp
b
f
$ join -t ',' -2 2 temp 3.csv | cut -f 1 -d ','
b
The first join (on the first two files) produces the joined value in the first column of the result. So as you join to file3, file4, file5, etc., you will be using the first column of the result you are generating, which is why you only need the -2 option. To keep things very efficient, always cut out all but the first column each time you do the join.
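The loop itself is not shown above; here is one possible shape for it (a sketch under assumptions: the 15 files are named file1.csv through file15.csv, the join key is field 2, and each file is sorted on that field before joining):
#!/bin/sh
# seed the running result with the field-2 values of the first file, sorted
sort -t, -k2,2 file1.csv | cut -d, -f2 > common
# join the running result against each remaining file in turn
for f in file[2-9].csv file1[0-5].csv
do
    sort -t, -k2,2 "$f" > sorted.csv
    join -t, -2 2 common sorted.csv | cut -d, -f1 > tmp
    mv tmp common
done
cat common     # the field-2 values shared by all 15 files
rm -f sorted.csv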