How can a shell script merge three files line by line and calculate a value when a condition is met - shell

while read line1
do
    while read line2
    do
        while read line3
        do
            echo "$line1, $line2, $line3" | awk -F , '$1==$5 && $6==$11 && $10==$12 {print $1,",",$2,",",$3,",",$4,",",$6,",",$7,",",$8,",",$9,",",$10,",",$13,",",$14,",",$15}' >> out.txt
        done < grades.csv
    done < subjects.csv
done < students.csv
In this code I am merging the three files line by line (a cross product), and whenever a merged line meets the condition $1==$5 && $6==$11 && $10==$12, I print it to the output file.
Now my problem is that I want to keep adding up the $13 field values across iterations that meet the condition.
How can I do this? Please help.
Here are the sample files.
grades.csv contains lines:
1,ARCH,1,90,very good,80
1,ARCH,2,70,good,85
1,PLNG,1,89,very good,85
subjects.csv contains lines:
1,ARCH,Computer Architecture,A,K.Gose
1,PLNG,Programming Languages,A,P.Yang
1,OS,Operating System,B,K.Gopalan
2,ARCH,Computer Architecture,A,K.Gose
students.csv contains lines:
1,pankaj,vestal,986-654-32
2,satadisha,binghamton,879-876-54
5,pankaj,vestal,986-654-32
6,pankaj,vestal,986-654-31
This is the expected output:
ARCH 1 pankaj vestal 986-654-32 Computer Architecture A K.Gose 1 1 90 very good 80
ARCH 1 pankaj vestal 986-654-32 Computer Architecture A K.Gose 1 2 70 good 85
ARCH 2 satadisha binghamton 879-876-54 Computer Architecture A K.Gose 1 1 90 very good 80
ARCH 2 satadisha binghamton 879-876-54 Computer Architecture A K.Gose 1 2 70 good 85
PLNG 1 pankaj vestal 986-654-32 Programming Languages A P.Yang 1 1 89 very good 85
Also, I need the sum (90+70+90+70+89) in another shell variable, which can then be written to a file.

Assuming you have joined the columns to form a TSV (tab-separated values) file or stream, that columns $k1, $k2, and $k3 in that file or stream form the key, and that you want to sum column $s of the join, here is an awk command you can use to produce a TSV listing of the keys and their sums:
awk -F'\t' -v k1=$k1 -v k2=$k2 -v k3=$k3 -v s=$s '
BEGIN { t = OFS = "\t" }
{ key = $k1 t $k2 t $k3; sum[key] += $s }
END { for (key in sum) print key, sum[key] }'
(Using awk to process CSV files that might contain commas is asking for trouble, so I've illustrated how to use awk with tabs.)
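For example, with the 13-column merged layout shown in the expected output above (subject in column 1, student id in column 2, section in column 10, grade in column 11), converted from commas to tabs, a hypothetical invocation could look like this. Here joined.csv is a stand-in for whatever file or stream holds the merged rows, and the tr step is only safe because none of the sample fields contain commas:
k1=1 k2=2 k3=10 s=11                # illustrative column choices, not part of the answer above
tr ',' '\t' < joined.csv |
  awk -F'\t' -v k1=$k1 -v k2=$k2 -v k3=$k3 -v s=$s '
  BEGIN { t = OFS = "\t" }
  { key = $k1 t $k2 t $k3; sum[key] += $s }
  END { for (key in sum) print key, sum[key] }' > sums.tsv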

You can use join to create your expanded data and then operate on it with awk.
$ join -t, -1 5 -2 2 <(join -t, -j 1 file3 file2 | sort -t, -k5,5) file1 | column -s, -t
ARCH 1 pankaj vestal 986-654-32 Computer Architecture A K.Gose 1 1 90 very good 80
ARCH 1 pankaj vestal 986-654-32 Computer Architecture A K.Gose 1 2 70 good 85
ARCH 2 satadisha binghamton 879-876-54 Computer Architecture A K.Gose 1 1 90 very good 80
ARCH 2 satadisha binghamton 879-876-54 Computer Architecture A K.Gose 1 2 70 good 85
PLNG 1 pankaj vestal 986-654-32 Programming Languages A P.Yang 1 1 89 very good 85
Alternatively, you can do the join in awk as well, eliminating the while loops.
If you want to add up the values in $11:
$ join -t, -1 5 -2 2 <(join -t, -j 1 file3 file2 | sort -t, -k5,5) file1 |
  awk -F, '{sum+=$11} END{print sum}'
To assign the result to a shell variable
$ sum=$(join ... )
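Putting it together for the original requirement of saving the total to a file (here file1, file2 and file3 stand for grades.csv, subjects.csv and students.csv, and sum.txt is just an example name):
$ sum=$(join -t, -1 5 -2 2 <(join -t, -j 1 file3 file2 | sort -t, -k5,5) file1 |
        awk -F, '{sum+=$11} END{print sum}')
$ echo "$sum" > sum.txt        # 409 for the sample data (90+70+90+70+89)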

Related

Sorting tab delimited numbers by column with pure bash script.

I'm stuck on some homework. The requirements of the assignment are to accept an input file and perform some statistics on the values. The user may specify whether to calculate the statistics by row or by column. The shell script must be pure bash, so I can't use awk, sed, perl, python etc.
sample input:
1 1 1 1 1 1 1
39 43 4 3225 5 2 2
6 57 8 9 7 3 4
3 36 8 9 14 4 3
3 4 2 1 4 5 5
6 4 4814 7 7 6 6
I can't figure out how to sort and process the data by column. My code for processing the rows works fine.
# CODE FOR ROWS
while read -r line
do
    echo $(printf "%d\n" $line | sort -n) | tr ' ' '\t' > sorted.txt
    ....
    # I perform the stats calculations for each row
    # by working with the temp file sorted.txt
done < input.txt    # whatever the input file is
How could I process this data by column? I've never worked with shell script so I've been staring at this for hours.
If you want to analyze by columns you'll need the number of columns first. head -n 1 gives you the first row, and awk's NF counts its fields, giving us the column count.
cols=$(head -n 1 input.txt | awk '{print NF}')
Then you can use cut with the '\t' delimiter to grab every column from input.txt, and run it through sort -n, as you did in your original post.
$ for i in `seq 1 $cols`; do cut -f$i -d$'\t' input.txt; done | sort -n > output.txt
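If you need the statistics per column rather than over one pooled list, the same pieces can be arranged into a per-column loop; a sketch (the col_$i.txt names are only illustrative):
$ for i in `seq 1 $cols`; do
      cut -f$i -d$'\t' input.txt | sort -n > "col_$i.txt"
      # ...run the min/max/sum/mean/median steps below on col_$i.txt...
  done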
For rows, you can use the shell built-in printf with the format modifier %d for integers. The sort command works on lines of input, so we replace spaces ' ' with newlines \n using the tr command:
$ cat input.txt | while read line; do echo $(printf "%d\n" $line); done | tr ' ' '\n' | sort -n > output.txt
Now take the output file to gather our statistics:
Min: cat output.txt | head -n 1
Max: cat output.txt | tail -n 1
Sum: (courtesy of Dimitre Radoulov): cat output.txt | paste -sd+ - | bc
Mean: (courtesy of porges): cat output.txt | awk '{ total += $1 } END { print total/NR }' (a pure-bash take on Sum and Mean follows the histogram)
Median: (courtesy of maxschlepzig): cat output.txt | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'
Histogram: cat output.txt | uniq -c
8 1
3 2
4 3
6 4
3 5
4 6
3 7
2 8
2 9
1 14
1 36
1 39
1 43
1 57
1 3225
1 4814
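Since the assignment rules out awk, sed, perl and python, the Sum and Mean above can also be computed with pure bash arithmetic; a minimal sketch (integer mean only, since bash has no floating point):
$ sum=0; count=0
$ while read -r n; do (( sum += n, count++ )); done < output.txt
$ echo "Sum: $sum, Mean: $(( sum / count ))"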

Subtract corresponding lines

I have two files, file1.csv
3 1009
7 1012
2 1013
8 1014
and file2.csv
5 1009
3 1010
1 1013
In the shell, I want to subtract the count in the first column of the second file from that in the first file, matching rows on the identifier in the second column. If an identifier is missing from one of the files, its count is assumed to be 0.
The result would be
-2 1009
-3 1010
7 1012
1 1013
8 1014
The files are huge (several GB). The second columns are sorted.
How would I do this efficiently in the shell?
Assuming that both files are sorted on second column:
$ join -j2 -a1 -a2 -oauto -e0 file1 file2 | awk '{print $2 - $3, $1}'
-2 1009
-3 1010
7 1012
1 1013
8 1014
join will join sorted files.
-j2 will join on the second column.
-a1 will print records from file1 even if there is no corresponding row in file2.
-a2 Same as -a1 but applied for file2.
-oauto is in this case the same as -o1.2,1.1,2.1 which will print the joined column, and then the remaining columns from file1 and file2.
-e0 will insert 0 instead of an empty column. This works with -a1 and -a2.
The output from join is three columns like:
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
This is piped to awk, which subtracts column three from column two and reformats the output.
$ awk 'NR==FNR { a[$2]=$1; next }
{ a[$2]-=$1 }
END { for(i in a) print a[i],i }' file1 file2
7 1012
1 1013
8 1014
-2 1009
-3 1010
It reads the first file in memory so you should have enough memory available. If you don't have the memory, I would maybe sort -k2 the files first, then sort -m (merge) them and continue with that output:
$ sort -m -k2 -k3 <(sed 's/$/ 1/' file1|sort -k2) <(sed 's/$/ 2/' file2|sort -k2) # | awk ...
3 1009 1
5 1009 2 # previous $2 = current $2 -> subtract
3 1010 2 # previous $2 =/= current and current $3=2 print -$3
7 1012 1
2 1013 1 # previous $2 =/= current and current $3=1 print prev $2
1 1013 2
8 1014 1
(I'm out of time for now, maybe I'll finish it later)
EDIT by Ed Morton
Hope you don't mind me adding what I was working on rather than posting my own extremely similar answer, feel free to modify or delete it:
$ cat tst.awk
{ split(prev,p) }
$2 == p[2] {
    print p[1] - $1, p[2]
    prev = ""
    next
}
p[2] != "" {
    print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
{ prev = $0 }
END {
    split(prev,p)
    print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
$ sort -m -k2 <(sed 's/$/ 1/' file1) <(sed 's/$/ 2/' file2) | awk -f tst.awk
-2 1009
-3 1010
7 1012
1 1013
8 1014
Since the files are sorted¹, you can merge them line-by-line with the join utility in coreutils:
$ join -j2 -o auto -e 0 -a 1 -a 2 41144043-a 41144043-b
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
All those options are required:
-j2 says to join based on the second column of each file
-o auto says to make every row have the same format, beginning with the join key
-e 0 says that missing values should be substituted with zero
-a 1 and -a 2 include rows that are absent from one file or another
the filenames (I've used names based on the question number here)
Now that we have a stream of output in that format, we can do the subtraction on each line. I used this GNU sed command to transform the above output into a dc program:
sed -e 's/.*/c& -n[ ]np/'
This takes the three values on each line and rearranges them into a dc command that performs the subtraction. For example, the first line becomes (with spaces added for clarity)
c 1009 3 5 -n [ ]n p
which subtracts 5 from 3, prints it, then prints a space, then prints 1009 and a newline, giving
-2 1009
as required.
We can then pipe all these lines into dc, giving us the output file that we want:
$ join -o auto -j2 -e 0 -a 1 -a 2 41144043-a 41144043-b \
> | sed -e 's/.*/c& -n[ ]np/' \
> | dc
-2 1009
-3 1010
7 1012
1 1013
8 1014
¹ The sorting needs to be consistent with LC_COLLATE locale setting. That's unlikely to be an issue if the fields are always numeric.
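If there is any doubt, forcing a single collation for both the sorting and the join keeps them consistent; for example (a sketch using the C locale, with the .sorted names chosen here only for illustration):
$ LC_ALL=C sort -k2,2 41144043-a > a.sorted
$ LC_ALL=C sort -k2,2 41144043-b > b.sorted
$ LC_ALL=C join -j2 -o auto -e 0 -a 1 -a 2 a.sorted b.sorted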
TL;DR
The full command is:
join -o auto -j2 -e 0 -a 1 -a 2 "$file1" "$file2" | sed -e 's/.*/c& -n[ ]np/' | dc
It works a line at a time, and starts only the three processes you see, so should be reasonably efficient in both memory and CPU.
Assuming this is a CSV with blank (whitespace) separation; if the separator is ",", use the argument -F ','.
awk 'FNR==NR {Inits[$2]=$1; ids[$2]++; next}
     {Discounts[$2]=$1; ids[$2]++}
     END { for (id in ids) print Inits[id] - Discounts[id] " " id }
' file1.csv file2.csv
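Since for (id in ids) iterates in no particular order, the rows may come out unsorted; if that matters, the same command can simply be piped through sort, e.g.:
awk 'FNR==NR {Inits[$2]=$1; ids[$2]++; next}
     {Discounts[$2]=$1; ids[$2]++}
     END { for (id in ids) print Inits[id] - Discounts[id] " " id }
' file1.csv file2.csv | sort -k2,2n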
If memory is an issue (this could be done in a single pipeline, but I prefer to use a temporary file):
awk 'FNR==NR {print; next} {print -1 * $1 " " $2}' file1 file2 \
  | sort -k2 \
  > file.tmp
awk 'Last != $2 {
         if (NR != 1) print Result " " Last
         Last = $2; Result = $1
         next
     }
     Last == $2 { Result += $1; next }
     END { print Result " " Last }
' file.tmp
rm file.tmp

awk: Count number of occurrences

I made these awk commands in a shell script to count the total occurrences of the $4 and $5 pairs.
awk -F" " '{if($4=="A" && $5=="G") {print NR"\t"$0}}' file.txt > ag.txt && cat ag.txt | wc -l
awk -F" " '{if($4=="C" && $5=="T") {print NR"\t"$0}}' file.txt > ct.txt && cat ct.txt | wc -l
awk -F" " '{if($4=="T" && $5=="C") {print NR"\t"$0}}' file.txt > tc.txt && cat ta.txt | wc -l
awk -F" " '{if($4=="T" && $5=="A") {print NR"\t"$0}}' file.txt > ta.txt && cat ta.txt | wc -l
The output is #### (number) in shell. But I want to get rid of > ag.txt && cat ag.txt | wc -l and instead get output in shell like AG = ####.
This is input format:
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 185 185 T - 24 100 10 14 10 14
>seq1 194 194 T C 24 100 12 12 12 12
>seq1 185 185 T AAA 24 100 10 14 10 14
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
I want output like this, in the shell or in a file, counting only these specific pairs and not other patterns.
AG 2
CT 1
TC 1
TA 1
Yes, everything you're trying to do can likely be done within the awk script. Here's how I'd count lines based on a condition:
awk -F" " '$4=="A" && $5=="G" {n++} END {printf("AG = %d\n", n)}' file.txt
Awk scripts consist of condition { statement } pairs, so you can do away with the if entirely -- it's implicit.
n++ increments a counter whenever the condition is matched.
The magic condition END is true after the last line of input has been processed.
Is this what you're after? Why were you adding NR to your output if all you wanted was the line count?
Oh, and you might want to confirm whether you really need -F" ". By default, awk splits on whitespace. This option would only be required if your fields contain embedded tabs, I think.
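If you literally want all four pairs reported in the AG = #### form you mentioned, the same pattern extends to a single pass with four counters (a sketch):
awk '$4=="A" && $5=="G" {ag++}
     $4=="C" && $5=="T" {ct++}
     $4=="T" && $5=="C" {tc++}
     $4=="T" && $5=="A" {ta++}
     END {printf("AG = %d\nCT = %d\nTC = %d\nTA = %d\n", ag, ct, tc, ta)}' file.txt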
UPDATE #1 based on the edited question...
If what you're really after is a pair counter, an awk array may be the way to go. Something like this:
awk '{a[$4 $5]++} END {for (pair in a) printf("%s %d\n", pair, a[pair])}' file.txt
Here's the breakdown.
The first statement runs on every line and increments a counter in an array (a[]) whose key is built from $4 and $5.
In the END block, we step through the array in a for loop, and for each index, print the index name and the value.
The output will not be in any particular order, as awk does not guarantee array order. If that's fine with you, then this should be sufficient. It should also be pretty efficient, because its max memory usage is based on the total number of combinations available, which is a limited set.
Example:
$ cat file
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 227 227 T C 25 100 13 12 13 12
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
$ awk '/^>seq/ {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' file
CT 1
TA 1
TC 1
AG 2
UPDATE #2 based on the revised input data and previously undocumented requirements.
With the extra data, you can still do this with a single run of awk, but of course the awk script is getting more complex with each new requirement. Let's try this as a longer one-liner:
$ awk 'BEGIN{v["G"]; v["A"]; v["C"]; v["T"]} $4 in v && $5 in v {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' i
CT 1
TA 1
TC 1
AG 2
This works by first (in the magic BEGIN block) defining an array, v[], to record "valid" records. The condition on the counter simply verifies that both $4 and $5 contain members of the array. All else works the same.
At this point, with the script running onto multiple lines anyway, I'd probably separate this into a small file. It could even be a stand-alone script.
#!/usr/bin/awk -f
BEGIN {
    v["G"]; v["A"]; v["C"]; v["T"]
}
$4 in v && $5 in v {
    a[$4 $5]++
}
END {
    for (p in a)
        printf("%s %d\n", p, a[p])
}
Much easier to read that way.
And if your goal is to count ONLY the combinations you mentioned in your question, you can handle the array slightly differently.
#!/usr/bin/awk -f
BEGIN {
    a["AG"]; a["TA"]; a["CT"]; a["TC"]
}
($4 $5) in a {
    a[$4 $5]++
}
END {
    for (p in a)
        printf("%s %d\n", p, a[p])
}
This only validates things that already have array indices, which are NULL per BEGIN.
The parentheses in the increment condition are not required, and are included only for clarity.
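Saved as, say, count_pairs.awk (the name is arbitrary) and made executable, either script can then be run directly on the input file; the for (p in a) output order is still unspecified:
$ chmod +x count_pairs.awk
$ ./count_pairs.awk file.txt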
Just count them all then print the ones you care about:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
Note that this will produce a count of zero for any of your target pairs that don't appear in your input, e.g. if you want a count of "XY"s too:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA XY",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
XY 0
If that's desirable, check if other solutions do the same.
Actually, this might be what you REALLY want, just to make sure $4 and $5 are single upper case letters:
$ awk '$4$5 ~ /^[[:upper:]]{2}$/{cnt[$4$5]++} END{for (i in cnt) print i, cnt[i]}' file
TA 1
AG 2
TC 1
CT 1

Divide column values of different files by a constant then output one minus the other

I have two files of the form
file1:
#fileheader1
0 123
1 456
2 789
3 999
4 112
5 131
6 415
etc.
file2:
#fileheader2
0 442
1 232
2 542
3 559
4 888
5 231
6 322
etc.
How can I take the second column of each, divide it by a value, subtract one result from the other, and then output a new third file with the new values?
I want the output file to have the form
#outputheader
0 123/c-442/k
1 456/c-232/k
2 789/c-542/k
etc.
where c and k are numbers I can plug into the script
I have seen this question: subtract columns from different files with awk
But I don't know how to do this with awk by myself. Does anyone know how, or could you explain what is going on in the linked question so I can try to modify it?
I'd write:
awk -v c=10 -v k=20 ' ;# pass values to awk variables
/^#/ {next} ;# skip headers
FNR==NR {val[$1]=$2; next} ;# store values from file1
$1 in val {print $1, (val[$1]/c - $2/k)} ;# perform the calc and print
' file1 file2
output
0 -9.8
1 34
2 51.8
3 71.95
4 -33.2
5 1.55
6 25.4
etc. 0
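To also get the header line and the new third file the question asks for, the same command can be wrapped in a group and redirected; a sketch (file3 is an arbitrary output name):
{ echo "#outputheader"
  awk -v c=10 -v k=20 '
  /^#/ {next}
  FNR==NR {val[$1]=$2; next}
  $1 in val {print $1, (val[$1]/c - $2/k)}
  ' file1 file2
} > file3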

Mapping ids for 10 million records [duplicate]

I have two text files,
File 1 with data like
User game count
A Rugby 2
A Football 2
B Volleyball 1
C TT 2
...
File 2
1 Basketball
2 Football
3 Rugby
...
90 TT
91 Volleyball
...
Now what I want to do is add another column to File 1 containing the corresponding index of the game from File 2.
I have 2 million entries in File 1, so I want to add a column specifying the index (basically the line number or order) of the game from File 2. How can I do this efficiently?
Right now I am doing it line by line: reading a line from File 1, grepping File 2 for the corresponding game to get its line number, and writing that to a file.
This will take me ages. How can I speed this up if I have 10 million rows in file 2 and 3000 rows in file 1?
With awk, read field 1 from File2 into an array indexed by field 2, then look up the array using field 2 from File1 as you iterate through it:
awk 'NR == FNR{a[$2]=$1; next}; {print $0, a[$2]}' File2 File1
A Rugby 2 3
A Football 2 2
B Volleyball 1 91
C TT 2 90
You can construct an associative array from the second file, with game names as keys and the game indexes as values, then for each line in File 1 look up the game name in the array to get the wanted id and write it out.
Associative array lookups are O(1).
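A minimal sketch of that idea, which is essentially the awk shown above (the output file name is just illustrative):
awk 'NR==FNR { idx[$2] = $1; next }   # File 2: game name -> index
     { print $0, idx[$2] }            # File 1: append the looked-up index
' File2 File1 > file1_with_index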
Use the join command:
$ cat file1
A Rugby 2
A Football 2
B Volleyball 1
C TT 2
$ cat file2
1 Basketball
2 Football
3 Rugby
90 TT
91 Volleyball
$ join -1 2 -2 2 -o 1.1,1.2,1.3,2.1 \
<(sort -k 2 file1) <(sort -k 2 file2)
A Football 2 2
A Rugby 2 3
C TT 2 90
B Volleyball 1 91
Here's another approach: only read the small file into memory, and then read the bigger file line-by-line. Once each ID has been found, bail out:
awk '
NR == FNR {
    f1[$2] = $0
    n++
    next
}
($2 in f1) {
    print f1[$2], $1
    delete f1[$2]
    if (--n == 0) exit
}
' file1 file2
Rereading your question, I don't know if I've answered the question: do you want an extra column appended to file1 or file2?
