Sorting dates by groups - bash

Here is a sample of my data, with 4 columns and a comma delimiter.
1,A,2009-01-01,2009-07-15
1,A,2009-07-10,2009-07-12
2,B,2009-01-01,2009-07-15
2,B,2009-07-10,2010-12-15
3,C,2009-01-01,2009-07-15
3,C,2009-07-15,2010-12-15
3,C,2010-12-15,2014-07-07
4,D,2009-06-01,2009-07-15
4,D,2009-07-21,2012-12-15
5,E,2011-04-23,2012-10-19
The first 2 columns define the groups. For each group I want the minimum date from the third column and the maximum date from the fourth column.
Then I will pick a single line for each combination of the first 2 columns.
Desired output:
1,A,2009-01-01,2009-07-15
2,B,2009-01-01,2010-12-15
3,C,2009-01-01,2014-07-07
4,D,2009-06-01,2012-12-15
5,E,2011-04-23,2012-10-19
I have tried the following code, but it is not working. I get close, but I don't get the max date.
cat exam |sort -t, -nk1 -k2,3 -k4,4r |sort -t, -uk1,2
I would prefer an easy one-liner like the one above.

sort datafile |
awk -F, -v OFS=, '
{key = $1 FS $2}
key != prev {prev = key; min[key] = $3}
{max[key] = ($4 > max[key]) ? $4 : max[key]}
END {for (key in min) print key, min[key], max[key]}
' |
sort
1,A,2009-01-01,2009-07-15
2,B,2009-01-01,2010-12-15
3,C,2009-01-01,2014-07-07
4,D,2009-06-01,2012-12-15
5,E,2011-04-23,2012-10-19
When you pre-sort, you are guaranteed that the minimum col3 date will occur on the first line of a new group. Then you just need to find the maximum col4 date.
The final sort is required because iterating over the keys of an awk hash is unordered. You can do this sorting in (g)awk with:
END {
n = asorti(min, sortedkeys)
for (i=1; i<=n; i++)
print sortedkeys[i], min[sortedkeys[i]], max[sortedkeys[i]]
}
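If you want the whole thing as one pipeline without the trailing sort, the two snippets above combine like this (gawk only, because of asorti):
sort datafile |
gawk -F, -v OFS=, '
{key = $1 FS $2}
key != prev {prev = key; min[key] = $3}
{max[key] = ($4 > max[key]) ? $4 : max[key]}
END {
    n = asorti(min, sortedkeys)
    for (i = 1; i <= n; i++)
        print sortedkeys[i], min[sortedkeys[i]], max[sortedkeys[i]]
}'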

#!/usr/bin/awk -f
BEGIN { FS = OFS = "," }
{
key = $1 FS $2
if (!a[key]) {
a[key] = $3
b[key] = $4
keys[++k] = key
} else {
# a later line can supply both a new minimum and a new maximum
if ($3 < a[key]) a[key] = $3
if ($4 > b[key]) b[key] = $4
}
}
END {
for (i = 1; i <= k; ++i) {
key = keys[i]
print key, a[key], b[key]
}
}
Usage:
awk -f script.awk file
Output:
1,A,2009-01-01,2009-07-15
2,B,2009-01-01,2010-12-15
3,C,2009-01-01,2014-07-07
4,D,2009-06-01,2012-12-15
5,E,2011-04-23,2012-10-19

You want a one-liner?
paste -d, \
<(cat exam|sort -t, -nk1,2 -k4 |cut -d, -f1-3) \
<(cat exam|sort -t, -nk1,2 -k4r |cut -d, -f4 ) |
uniq -w4
The key idea is to sort the data once by field 3 ascending and, independently, by field 4 descending. Then you just have to merge the corresponding lines (cut and paste). Finally, uniq is used to keep only the first row for each pair of identical first two columns. This is the weak point here, as I assume at most 4 characters for the comparison. You will either have to adjust that to your needs, or somehow normalize those two columns to a fixed width when using your actual data.
EDIT: A probably better option is to replace uniq with a simple awk filter:
paste -d, \
<(cat exam|sort -t, -nk1,2 -k4 |cut -d, -f1-3) \
<(cat exam|sort -t, -nk1,2 -k4r |cut -d, -f4 ) |
awk -F , '$1","$2 != last { print; last=$1","$2 }'
On my system (GNU Linux Debian Wheezy), both produce the same result:
1,A,2009-01-01,2009-07-15
2,B,2009-01-01,2010-12-15
3,C,2009-01-01,2014-07-07
4,D,2009-06-01,2012-12-15
5,E,2011-04-23,2012-10-19

Related

How can I use awk to remove duplicate entries in the same field with data separated with commas?

I am trying to call awk from a bash script to remove duplicate data entries of a field in a file.
Data Example in file1
data1 a,b,c,d,d,d,c,e
data2 a,b,b,c
Desired Output:
data1 a,b,c,d,e
data2 a,b,c
First I removed the first column to only have the second remaining.
cut --complement -d$'\t' -f1 file1 &> file2
This worked fine, and now I just have the following in file2:
a,b,c,d,d,d,c,e
a,b,b,c
So then I tried this code that I found but do not understand well:
awk '{
for(i=1; i<=NF; i++)
printf "%s", (!seen[$1]++? (i==1?"":FS) $i: "" )
delete seen; print ""
}' file2
The problem is that this code was written for a space delimiter, and mine is now a comma delimiter with a variable number of values on each row. The code just prints the file as is and I can see no difference. I also tried to make the FS a comma like this, to no avail:
printf "%s", (!seen[$1]++? (i==1?"":FS=",") $i: ""
This is similar to the code you found.
awk -F'[ ,]' '
{
s = $1 " " $2
seen[$2]++
for (i=3; i<=NF; i++)
if (!seen[$i]++) s = s "," $i
print s
delete seen
}
' data-file
-F'[ ,]' - split input lines on spaces and commas
s = ... - we could use printf like the code you found, but building a string is less typing
!seen[x]++ is a common idiom - it returns true only the first time x is seen
to avoid special-casing when to print a comma (as your sample code does with spaces), we simply add $2 to the print string and set seen[$2]
then for the remaining columns (3 .. NF), we add comma and column if it hasn't been seen before
delete seen - clear the array for the next line (a quick run is sketched below)
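A quick run on the two sample lines (a sketch; it assumes the label and the list are separated by a space, as they appear in the question):
$ printf 'data1 a,b,c,d,d,d,c,e\ndata2 a,b,b,c\n' |
awk -F'[ ,]' '{
    s = $1 " " $2
    seen[$2]++
    for (i=3; i<=NF; i++)
        if (!seen[$i]++) s = s "," $i
    print s
    delete seen
}'
data1 a,b,c,d,e
data2 a,b,c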
That code is right; you just need to specify the delimiter and change $1 to $i.
$ awk -F ',' '{
for(i=1; i<=NF; i++)
printf "%s", (!seen[$i]++? (i==1?"":FS) $i: "" )
delete seen; print ""
}' /tmp/file1
data1 a,b,c,d,e
data2 a,b,c
Using GNU sed, if applicable:
$ sed -E ':a;s/((\<[^,]*\>).*),\2/\1/;ta' input_file
data1 a,b,c,d,e
data2 a,b,c
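Roughly: each pass of the s/// deletes one later repeat of an earlier comma-delimited token (\< and \> are GNU word boundaries), and ta loops back to the :a label until no more substitutions happen. A quick way to watch it work on a single line:
$ echo 'data1 a,b,c,d,d,d,c,e' | sed -E ':a;s/((\<[^,]*\>).*),\2/\1/;ta'
data1 a,b,c,d,e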
So I did something similar lately: sanitizing the output of the GNU prime factoring program when it prints out every single copy of a bunch of small primes:
gawk -Mbe '
BEGIN {
__+=__+=__+=(__+=___=_+=__=____=_^=_<_)-+-++_
__+=__^=!(___=__-=_+=_++)
for (_; _<=___; _+=__) {
if ((_%++__)*(_%(__+--__))) {
print ____*=_^_
}
}
}' | gfactor | sanitize_gnu_factor
58870952193946852435332666506835273111444209706677713:
7^7
11^11
13^13
17^17
116471448967943114621777995869564336419122830800496825559417754612566153180027:
7^7
11^11
13^13
17^17
19^19
2431978363071055324951111475877083878108827552605151765803537946846931963403343871776360412541253748541645309:
7^7
11^11
13^13
17^17
19^19
23^23
6244557167645217304114386952069758950402417741892127946837837979333340639740318438767128131418285303492993082345658543853142417309747238004933649896921:
7^7
11^11
13^13
17^17
19^19
23^23
29^29
823543:
7^7
234966429149994773:
7^7
11^11
71165482274405729335192792293569:
7^7
11^11
13^13
And the core sanitizer does basically the same thing, intra-row duplicate removal:
sanitize_gnu_factor() # i implemented it as a shell function
{
mawk -Wi -- '
BEGIN {
______ = "[ ]+"
___= _+= _^=__*=____ = FS
_______ = FS = "[ \v"(OFS = "\f\r\t")"]+"
FS = ____
} {
if (/ is prime$/) {
print; next
} else if (___==NF) {
$NF = " - - - - - - - \140\140\140"\
"PRIME\140\140\140 - - - - - - - "
} else {
split("",_____)
_ = NF
do { _____[$_]++ } while(--_<(_*_))
delete _____[""]
sub("$"," ")
_^=_<_
for (__ in _____) {
if (+_<+(___=_____[__])) {
sub(" "(__)"( "(__)")+ ",
sprintf(" %\47.f^%\47.f ",__,___))
} }
___ = _+=_^=__*=_<_
FS = _______
$__ = $__
FS = ____ } } NF = NF' |
mawk -Wi -- '
/ is prime$/ { print
next } /[=]/ { gsub("="," ")
} $(_^=(_<_)) = \
(___=length(__=$_))<(_+=_++)^(_+--_) \
?__: sprintf("%.*s......%s } %\47.f dgts ",
_^=++_,__, substr(__,++___-_),--___)' FS='[:]' OFS=':'
}

how to find out common columns and their records from two files using awk

I have two files:
File 1:
id|name|address|country
1|abc|efg|xyz
2|asd|dfg|uio
File 2 (only headers):
id|name|country
Now, I want an output like:
OUTPUT:
id|name|country
1|abc|xyz
2|asd|uio
Basically, I have a user record file (file1) and a header file (file2). Now, I want to extract from file1 only those columns that match the ones in the header file.
I want to do this using awk or bash.
I tried using:
awk 'BEGIN { OFS="..."} FNR==NR { a[(FNR"")] = $0; next } { print a[(FNR"")], $0 > "test.txt"}' header.txt file.txt
and have no idea what to do next.
Thank You
The following awk may help you with this.
awk -F"|" 'FNR==NR{for(i=1;i<=NF;i++){a[$i]};next} FNR==1 && FNR!=NR{for(j=1;j<=NF;j++){if($j in a){b[++p]=j}}} {for(o=1;o<=p;o++){printf("%s%s",$b[o],o==p?ORS:OFS)}}' OFS="|" File2 File1
Adding a non-one-liner form of the solution too now.
awk -F"|" '
FNR==NR{
for(i=1;i<=NF;i++){
a[$i]};
next}
FNR==1 && FNR!=NR{
for(j=1;j<=NF;j++){
if($j in a){ b[++p]=j }}
}
{
for(o=1;o<=p;o++){
printf("%s%s",$b[o],o==p?ORS:OFS)}
}
' OFS="|" File2 File1
Edit by Ed Morton: FWIW here's the same script written with normal indenting/spacing and a couple of more meaningful variable names:
BEGIN { FS=OFS="|" }
NR==FNR {
for (i=1; i<=NF; i++) {
names[$i]
}
next
}
FNR==1 {
for (i=1; i<=NF; i++) {
if ($i in names) {
f[++numFlds] = i
}
}
}
{
for (i=1; i<=numFlds; i++) {
printf "%s%s", $(f[i]), (i<numFlds ? OFS : ORS)
}
}
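Saved as, say, pick_cols.awk (the file name is mine), it is run the same way as the one-liner, header file first:
$ awk -f pick_cols.awk File2 File1
id|name|country
1|abc|xyz
2|asd|uio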
with (lots of) unix pipes, as Doug McIlroy intended...
$ function p() { sed 1q "$1" | tr '|' '\n' | cat -n | sort -k2; }
$ cut -d'|' -f"$(join -j2 <(p header) <(p file) | sort -k2n | cut -d' ' -f3 | paste -sd,)" file
id|name|country
1|abc|xyz
2|asd|uio
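Here p prints each header name with its column number (cat -n), sorted by name so that join can pair the shared names; join then emits each name with its position in both files, sort -k2n restores the header file's order, and cut/paste turn the positions into a field list. For the sample files the inner $( ... ) evaluates to 1,2,4, so the whole thing reduces to:
cut -d'|' -f1,2,4 file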
Solution using bash>4:
IFS='|' headers1=($(head -n1 $file1))
IFS='|' headers2=($(head -n1 $file2))
IFS=$'\n'
# find idxes we want to output, ie. mapping of headers1 to headers2
idx=()
for i in $(seq 0 $((${#headers2[@]}-1))); do
for j in $(seq 0 $((${#headers1[@]}-1))); do
if [ "${headers2[$i]}" == "${headers1[$j]}" ]; then
idx+=($j)
break
fi
done
done
# idx=(0 1 3) for example
# simple join output function from https://stackoverflow.com/questions/1527049/join-elements-of-an-array
join_by() { local IFS="$1"; shift; echo "$*"; }
# first line - output headers
join_by '|' "${headers2[@]}"
isfirst=true
while IFS='|' read -a vals; do
# ignore first (header line)
if $isfirst; then
isfirst=false
continue;
fi;
# filter from line only columns with idx indices
tmp=()
for i in ${idx[@]}; do
tmp+=("${vals[$i]}")
done
# join output with '|'
join_by '|' "${tmp[@]}"
done < $file1
This one respects the order of the columns in file1; I changed the order:
$ cat file1
id|country|name
The awk:
$ awk '
BEGIN { FS=OFS="|" }
NR==1 { # file1
n=split($0,a)
next
}
NR==2 { # file2 header
for(i=1;i<=NF;i++)
b[$i]=i
}
{ # output part
for(i=1;i<=n;i++)
printf "%s%s", $b[a[i]], (i==n?ORS:OFS)
}' file1 file2
id|country|name
1|xyz|abc
2|uio|asd
(Another version, using cut for the output, is in the revisions.)
This is similar to RavinderSingh13's solution, in that it first reads the headers from the shorter file and then decides which columns to keep from the longer file based on the headers on its first line.
It does the output differently, however. Instead of constructing a string, it shifts the columns to the left when it does not want to include a particular field.
BEGIN { FS = OFS = "|" }
# read headers from first file
NR == FNR { for (i = 1; i <= NF; ++i) header[$i]; next }
# mark fields in second file as "selected" if the header corresponds
# to a header in the first file
FNR == 1 {
for (i = 1; i <= NF; ++i)
select[i] = ($i in header)
}
{
skip = 0
pos = 1
for (i = 1; i <= NF; ++i)
if (!select[i]) { # we don't want this field
++skip
$pos = $(pos + skip) # shift fields left
} else
++pos
NF -= skip # adjust number of fields
print
}
Running this:
$ mawk -f script.awk file2 file1
id|name|country
1|abc|xyz
2|asd|uio

shell script: count lines and sum numbers from multiple files

I want to count the number of lines in a .CSV file and sum the last numeric column. I need to run this from a crontab, so it has to be a script.
This code counts the number of lines:
egrep -i String file_name_201611* | \
egrep -i "cdr,20161115" | \
awk -F"," '{print $4}' | sort | uniq | wc -l
This code sums the last column:
egrep -i String file_name_201611* | \
egrep -i ".cdr,20161115"| \
awk -F"," '{print $8}' | paste -s -d"+" | bc
Lines look like:
COMGPRS,CGSCO05,COMGPRS_CGSCO05_400594.dat,processed_cdr_20161117100941_00627727.cdr,20161117095940,20161117,18,46521
The expected output:
CGSCO05,sum_#_lines, Sum_$8
CGSCO05, 225, 1500
This should work...
#!/usr/bin/awk -f
BEGIN{
k=0;
FS=","
}
{
if ($2 in counter){
counter[$2] = counter[$2] + 1;
sum_8[$2] = sum_8[$2] + $8;
}else{
k = k + 1;
counter[$2] = 1;
sum_8[$2] = $8;
name[k] = $2;
}
}
END{
for (i=1; i<=k; i++)
printf "%s, %i, %i\n", name[i], counter[name[i]], sum_8[name[i]];
}
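A quick sanity check on the single sample line from the question, with the script saved as, say, sum.awk (on the real files the counts would be larger, like the 225 in the expected output):
$ printf '%s\n' 'COMGPRS,CGSCO05,COMGPRS_CGSCO05_400594.dat,processed_cdr_20161117100941_00627727.cdr,20161117095940,20161117,18,46521' | awk -f sum.awk
CGSCO05, 1, 46521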
Sort the CSV file by field #2, keeping only unique lines, then print the number of unique lines and the sum of column #8 from those lines (this uses GNU datamash):
sort -t, -k2 -u foo.csv | datamash -t, count 2 sum 8
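If the count and sum should be reported per group, as in the expected output, GNU datamash can also group on the second column (a sketch, assuming a reasonably recent datamash; -g needs input sorted on that column, which the sort already provides):
sort -t, -k2 -u foo.csv | datamash -t, -g 2 count 2 sum 8
This prints the group value followed by the count and the sum, e.g. something like CGSCO05,1,46521 for the single sample line.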

Unix/Bash: Uniq on a cell

I have a tab-separated fileA where the 12th column (counting from 1) contains several comma-separated identifiers. Some of them, however, can occur more than once in the same row:
GO:0042302, GO:0042302, GO:0042302
GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281
....
....
(some with a whitespace after the comma, some without).
I would like to keep only the unique identifiers and remove the duplicates within the 12th column of each row:
GO:0042302
GO:0004386,GO:0005524,GO:0006281
....
....
Here is what I have so far:
for row in `fileA`
do
cut -f12 $row | sed "s/,/\n/" | sort | uniq | paste fileA - | \
awk 'BEGIN {OFS=FS="\t"}{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $13}'
done > out
The idea was to go over one row at a time, cut out the 12th column, replace all commas with newlines, then sort and uniq to get rid of duplicates, paste it back, and print the columns in the right order, skipping the original identifier column.
However, this does not seem to work. Any ideas?
Just for completeness, and because I personally prefer Perl over Awk for this sort of thing, here's a Perl one-liner solution:
perl -F'\t' -le '%u=();@k=split/,/,$F[11];@u{@k}=@k;$F[11]=join",",sort keys%u;print join"\t",@F'
Explanation:
-F'\t' Loop over input lines, splitting each one into fields at tabs
-l automatically remove newlines from input and append on output
-e get code to execute from the next argument instead of standard input
%u = (); # clear out the hash variable %u
@k = split /,/, $F[11]; # Split 12th field (1st is 0) on comma into array @k
@u{@k} = @k; # Copy the contents of @k into %u as key/value pairs
Because hash keys are unique, that last step means that the keys of %u are now a deduplicated copy of @k.
$F[11] = join ",", sort keys %u; # replace the 12th field with the sorted unique list
print join "\t", @F; # and print out the modified line
If I understand you correctly, then with awk:
awk -F '\t' 'BEGIN { OFS = FS } { delete b; n = split($12, a, /, */); $12 = ""; for(i = 1; i <= n; ++i) { if(!(a[i] in b)) { b[a[i]]; $12 = $12 a[i] "," } } sub(/,$/, "", $12); print }' filename
This works as follows:
BEGIN { OFS = FS } # output FS same as input FS
{
delete b # clear dirty table from last pass
n = split($12, a, /, */) # split 12th field into tokens,
$12 = "" # then clear it out for reassembly
for(i = 1; i <= n; ++i) { # wade through those tokens
if(!(a[i] in b)) { # those that haven't been seen yet:
b[a[i]] # remember that they were seen
$12 = $12 a[i] "," # append to result
}
}
sub(/,$/, "", $12) # remove trailing comma from resulting field
print # print the transformed line
}
The delete b; has been POSIX-conforming for only a short while, so if you're working with an old, old awk and it fails for you, see @MarkReed's comment for another way that ancient awks should accept.
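One portable way that even very old awks accept (the same trick another answer below uses for its seen array) is to clear the array with split against an empty string:
split("", b)   # empties the array b without the whole-array delete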
Using field 2 instead of field 12:
$ cat tst.awk
BEGIN{ FS=OFS="\t" }
{
split($2,f,/ *, */)
$2 = ""
delete seen
for (i=1;i in f;i++) {
if ( !seen[f[i]]++ ) {
$2 = $2 (i>1?",":"") f[i]
}
}
print
}
$ cat file
a,a,a GO:0042302, GO:0042302, GO:0042302 b,b,b
c,c,c GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281 d,d,d
$ awk -f tst.awk file
a,a,a GO:0042302 b,b,b
c,c,c GO:0004386,GO:0005524,GO:0006281 d,d,d
If your awk doesn't support delete seen you can use split("",seen).
Using this awk:
awk -F '\t' -v OFS='\t' '{
delete seen; s="";   # reset both per line (otherwise s accumulates across lines)
split($12, a, /[,; ]+/);
for (i=1; i<=length(a); i++) {
if (!(a[i] in seen)) {
seen[a[i]];
s=sprintf("%s%s,", s, a[i])
}
}
$12=s} 1' file
GO:0042302,
GO:0004386,GO:0005524,GO:0006281,
In your example data, a comma followed by a space acts as the delimiter within the 12th field, and every subfield after the first is merely a repeat of the first subfield. The subfields also appear to already be in sorted order.
GO:0042302, GO:0042302, GO:0042302
^^^dup1^^^ ^^^dup2^^^
GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281
^^^^^^^^^^^^^^^dup1^^^^^^^^^^^^^
Based on that, you could simply keep the first of the subfields and toss the rest:
awk -F"\t" -v OFS="\t" '{sub(/, .*/, "", $12)} 1' fileA
If, instead, you can have different sets of repeated subfields, where the keys are not sorted, like this:
GO:0042302, GO:0042302, GO:0042302, GO:0062122,GO:0055000, GO:0055001, GO:0062122,GO:0055000
GO:0004386,GO:0005524,GO:0006281, GO:0005525, GO:0004386,GO:0005524,GO:0006281
If you were stuck with the default macOS awk, you could introduce sort/uniq functions in an executable awk script:
#!/usr/bin/awk -f
BEGIN {FS=OFS="\t"}
{
c = uniq(a, split($12, a, /, |,/))
sort(a, c)
s = a[1]
for(i=2; i<=c; i++) { s = s "," a[i] }
$12 = s
}
47 # any non-zero pattern is true, so print out the modified line
# take an indexed arr as from split and de-dup it
function uniq(arr, len, i, uarr) {
for(i=len; i>=1; i--) { uarr[arr[i]] }
delete arr
for(k in uarr) { arr[++i] = k }
return( i )
}
# slightly modified from
# http://rosettacode.org/wiki/Sorting_algorithms/Bubble_sort#AWK
function sort(arr, len, haschanged, tmp, i)
{
haschanged = 1
while( haschanged==1 ) {
haschanged = 0
for(i=1; i<=(len-1); i++) {
if( arr[i] > arr[i+1] ) {
tmp = arr[i]
arr[i] = arr[i + 1]
arr[i + 1] = tmp
haschanged = 1
}
}
}
}
If you had GNU-awk, I think you could swap out the sort(a, c) call with asort(a), and drop the bubble-sort local function completely.
I get the following for the 12th field:
GO:0042302,GO:0055000,GO:0055001,GO:0062122
GO:0004386,GO:0005524,GO:0005525,GO:0006281

how to sum each column in a file using bash

I have a file in the following format:
id_1,1,0,2,3,lable1
id_2,3,2,2,1,lable1
id_3,5,1,7,6,lable1
and I want the sum of each column (I have over 300 columns):
9,3,11,10,lable1
How can I do that using bash?
I tried using what is described here, but it didn't work.
Using awk:
$ awk -F, '{for (i=2;i<NF;i++)a[i]+=$i}END{for (i=2;i<NF;i++) printf a[i]",";print $NF}' file
9,3,11,10,lable1
This will print the sum of each column (from i=2 .. i=n-1) as comma-separated values, followed by the value of the last column from the last row (i.e. lable1).
If the totals would need to be grouped by the label in the last column, you could try this:
awk -F, '
{
L[$NF]
for(i=2; i<NF; i++) T[$NF,i]+=$i
}
END{
for(i in L){
s=i
for(j=NF-1; j>1; j--) s=T[i,j] FS s
print s
}
}
' file
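For example, if the file also contained rows with a second label (two extra lines made up here for illustration):
id_1,1,0,2,3,lable1
id_2,3,2,2,1,lable1
id_4,4,4,1,1,lable2
then the script above would print one summed row per label, in whatever order for(i in L) happens to yield:
4,2,4,4,lable1
4,4,1,1,lable2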
If the labels in the last column are sorted then you could try without arrays and save memory:
awk -F, '
function labelsum(){
s=p
for(i=NF-1; i>1; i--) s=T[i] FS s
print s
split(x,T)
}
p!=$NF{
if(p) labelsum()
p=$NF
}
{
for(i=2; i<NF; i++) T[i]+=$i
}
END {
labelsum()
}
' file
Here's a Perl one-liner:
<file perl -lanF, -E 'for ( 0 .. $#F ) { $sums{ $_ } += $F[ $_ ]; } END { say join ",", map { $sums{ $_ } } sort keys %sums; }'
It will only do sums, so the first and last column in your example will be 0.
This version will follow your example output:
<file perl -lanF, -E 'for ( 1 .. $#F - 1 ) { $sums{ $_ } += $F[ $_ ]; } END { $sums{ $#F } = $F[ -1 ]; say join ",", map { $sums{ $_ } } sort keys %sums; }'
A modified version based on the solution you linked:
#!/bin/bash
colnum=6
filename="temp"
for ((i=2;i<$colnum;++i))
do
sum=$(cut -d ',' -f $i $filename | paste -sd+ | bc)
echo -n $sum','
done
head -1 $filename | cut -d ',' -f $colnum
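With the question's sample saved as temp (the script hard-codes that file name and colnum=6) and the script saved as, say, sumcols.sh, running it prints:
$ bash sumcols.sh
9,3,11,10,lable1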
Pure bash solution:
#!/usr/bin/bash
while IFS=, read -a arr
do
for((i=1;i<${#arr[*]}-1;i++))
do
((farr[$i]=${farr[$i]}+${arr[$i]}))
done
farr[$i]=${arr[$i]}
done < file
(IFS=,;echo "${farr[*]}")
