ksh sort first column and avg second column - sorting

Looking for an awk or similar command that can sort a file with 2 columns, then produce output of the unique column-one names with the average of their column-two values.
So, for example:
aaaa 11.5
aaaa 1.01
aaaa 5.50
bbbb 12.5
bbbb 1.10
bbbb 9.5
Looking for output like:
aaaa 6.00
bbbb 7.7

Like this with awk:
awk '{a[$1]+=$2;b[$1]++} END{for(i in a)print i,a[i]/b[i]}' File
bbbb 7.7
aaaa 6.00333
If you want to round off, use printf.
awk '{a[$1]+=$2;b[$1]++} END{for(i in a)printf("%s %.2f\n",i,a[i]/b[i])}' File
Using the first field as the index, update array a (adding up the 2nd fields), and keep a counter b[first field] updated as well. At the end, print every index of a together with the average. Hope it's clear.
For a sorted result, pipe the output to sort (awk '{a[$1]+=$2;b[$1]++} END{for(i in a)print i,a[i]/b[i]}' File | sort)
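Combining the rounding and the sorted output on the sample data above should give:
awk '{a[$1]+=$2;b[$1]++} END{for(i in a)printf("%s %.2f\n",i,a[i]/b[i])}' File | sort
aaaa 6.00
bbbb 7.70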

Related

How to compare parts of lines of two txt files?

I am given two txt files, each one of them with information aligned in several columns separated by a tab. What I want to do is look for lines in both files where one of these columns matches -- not the whole lines, only their first-column parts should be identical. How do I do that in a bash script?
I have tried using grep -Fwf.
So this is what the files look like:
aaaa bbbb
cccc dddd
and
aaaa eeee
ffff gggg
The output I'd like to get is something like:
bbbb and eeee match
I really haven't found a command that does both line-wise and word-wise comparison at the same time.
Sorry for not providing any code of my own, I'm new to programming and couldn't come up with anything reasonable so far. Thanks in advance!
Have you seen the join command? This, in combination with sort, may be what you are looking for: https://shapeshed.com/unix-join/
For example:
$ cat a
aaaa bbbb
cccc dddd
$ cat b
aaaa eeee
ffff gggg
$ join a b
aaaa bbbb eeee
If the values in the first column are not sorted, then you have to sort them first; otherwise join will not work.
join <(sort a) <(sort b)
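To get output in the "bbbb and eeee match" form, the joined line can be piped through awk (assuming two columns per file, as in the example):
join <(sort a) <(sort b) | awk '{print $2" and "$3" match"}'
bbbb and eeee match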
Assuming your tab separated file maintains the correct file structure, this should work:
diff <(awk '{print $2}' f1) <(awk '{print $2}' f2)
# File names: f1, f2
# Column: 2nd column.
The output when there is something different:
2c2
< dx
---
> ldx
No output when the column is the same.
I tried @Wiimm's answer and it didn't work for me.
There are different kinds of comparison and different tools for it:
diff
cmp
comm
...
All commands have options to vary the comparison.
For each command, you can specify filters. E.g.
# remove comments before comparison
diff <( grep -v ^# file1) <( grep -v ^# file2)
Without concrete examples, it is impossible to be more exact.
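For example, to list only the first-column values that occur in both files (a sketch using comm, which needs sorted input; file1 and file2 stand for your two files):
comm -12 <(awk '{print $1}' file1 | sort) <(awk '{print $1}' file2 | sort)
For the sample files shown earlier, this should print only aaaa.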
You can use awk, like this:
awk 'NR==FNR{a[NR]=$1;b[NR]=$2;next}
a[FNR]==$1{printf "%s and %s match\n", b[FNR], $2}' file1 file2
Output:
bbbb and eeee match
Explanation (the same code broken into multiple lines):
# As long as we are reading file1, the overall record
# number NR is the same as the record number in the
# current input file FNR
NR==FNR {
    # Store column 1 and 2 in arrays called a and b
    # indexed by the record number
    a[NR]=$1
    b[NR]=$2
    next # Do not process more actions for file1
}
# The following code gets only executed when we read
# file2 because of the above _next_ statement
# Check if column 1 in file1 is the same as in file2
# for this line
a[FNR]==$1 {
    printf "%s and %s match\n", b[FNR], $2
}

How to keep the last occurrence of duplicate lines in a text file?

I have a text file with contents that may be duplicated. Below is a simplified representation of my txt file (text means a unique character, word, or phrase). Note that the separator ---------- may not be present. Also, the whole content of the file consists of Unicode Japanese and Chinese characters.
sometext1
sometext2
sometext3
aaaa
sometext4
aaaa
aaaa
bbbb
bbbb
cccc
dddd
eeee
ffff
gggg
----------
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10
What I want to achieve is to keep only the line with the last occurrence of each duplicate, like so:
sometext1
sometext2
sometext3
sometext4
aaaa
bbbb
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10
The closest I found online is How to remove only the first occurrence of a line in a file using sed, but this requires that you know which matching pattern(s) to delete. The suggested topics provided when writing the title give Duplicating characters using sed and last occurrence of date, but they didn't work.
I am on a Mac with Sierra. I am writing my executable commands in a script.sh file to execute commands line by line. I'm using sed and gsed as my primary stream editors.
I am not sure if your intent is to preserve the original order of the lines. If that is the case, you could do this:
export LC_ALL=en_US.utf8 # to handle unicode characters in file
nl -n rz -ba file | sort -t$'\t' -k2,2 -k1,1r | uniq -f1 | sort -k1,1 | cut -f2
nl -n rz -ba file adds zero-padded line numbers to the file
sort -t$'\t' -k2,2 -k1,1r sorts the output of nl by the second field, i.e. the content (note that nl puts a tab after the line number); the reversed secondary key puts the last occurrence of each duplicate first
uniq -f1 removes the duplicates while ignoring the line number field (-f1), so the line that is kept is the last occurrence
the next sort restores the original order of the lines, with duplicates removed
cut -f2 removes the line number field, restoring the content to its original format
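An alternative sketch that keeps the last occurrence of each line while preserving relative order is to reverse the file, keep first occurrences, and reverse back (tac is from GNU coreutils; on macOS, tail -r can usually stand in for it):
tac file | awk '!seen[$0]++' | tac
Reversing the file makes each last occurrence the first one awk sees, so the !seen[$0]++ filter keeps it; reversing again restores the original order. Note that this leaves the ---------- separator line untouched.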
This awk is very close.
Given:
$ cat file
sometext1
sometext2
sometext3
aaaa
sometext4
aaaa
aaaa
bbbb
bbbb
cccc
dddd
eeee
ffff
gggg
----------
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10
You can do:
$ awk 'BEGIN{FS=":"}
FNR==NR {for (i=1; i<=NF; i++) {dup[$i]++; last[$i]=NR;} next}
/^$/ {next}
{for (i=1; i<=NF; i++)
if (dup[$i] && FNR==last[$i]) {print $0; next}}
' file file
sometext1
sometext2
sometext3
sometext4
aaaa
bbbb
----------
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10
This might work for you (GNU sed):
sed -r '1h;1!H;x;s/([^\n]+)\n(.*\1)$/\2/;s/\n-+$//;x;$!d;x' file
Store the first line in the hold space (HS) and append every subsequent line. Swap to the HS and remove any duplicate line that matches the last line. Also delete any separator lines and then swap back to the pattern space (PS). Delete all but the last line, which is swapped with the HS and printed out.
I found a simpler solution, but it sorts the file in the process. So if you don't mind the output being sorted, you can use the following:
$ sort -u input.txt > output.txt
Note: the -u flag makes sort list only the unique lines of the file.
Or, as in the uniq manual:
cat input.txt | uniq -d

How to print columns one after the other in bash?

Are there any better methods to print two or more columns as one column? For example:
input.file
AAA 111
BBB 222
CCC 333
output:
AAA
BBB
CCC
111
222
333
I can only think of:
cut -f1 input.file >output.file;cut -f2 input.file >>output.file
But it's not good if there are many columns, or when I want to pipe the output to other commands like sort.
Any other suggestions? Thank you very much!
With awk
awk '{if(maxc<NF)maxc=NF;
for(i=1;i<=NF;i++){(a[i]!=""?a[i]=a[i]RS$i:a[i]=$i)}
}
END{
for(i=1;i<=maxc;i++)print a[i]
}' input.file
You can use a GNU awk array of arrays to store all the data and print it later on.
If the number of columns is constant, this works for any number of columns:
gawk '{for (i=1; i<=NF; i++)     # loop over columns
           data[i][NR]=$i        # store in data[column][line]
      }
      END {for (i=1;i<=NF;i++)       # loop over columns
               for (j=1;j<=NR;j++)   # loop over lines
                   print data[i][j]  # print data[column][line]
      }' file
Note NR stands for number of records (that is, number of lines here) and NF stands for number of fields (that is, the number of fields in a given line).
If the number of columns changes over rows, then we should use yet another array, in this case to store the number of columns for each row. But in the question I don't see a request for this, so I am leaving it for now.
See a sample with three columns:
$ cat a
AAA 111 123
BBB 222 234
CCC 333 345
$ gawk '{for (i=1; i<=NF; i++) data[i][NR]=$i} END {for (i=1;i<=NF;i++) for (j=1;j<=NR;j++) print data[i][j]}' a
AAA
BBB
CCC
111
222
333
123
234
345
If the number of columns is not constant, using an array to store the number of columns for each row helps to keep track of it:
$ cat sc.wk
{for (i=1; i<=NF; i++)
     data[i][NR]=$i
 columns[NR]=NF
 if (NF>maxnf) maxnf=NF
}
END {for (i=1;i<=maxnf;i++)
         for (j=1;j<=NR;j++)
             print (i<=columns[j] ? data[i][j] : "-")
}
$ cat a
AAA 111 123
BBB 222
CCC 333 345
$ awk -f sc.wk a
AAA
BBB
CCC
111
222
333
123
-
345
awk '{print $1;list[i++]=$2}END{for(j=0;j<i;j++){print list[j];}}' input.file
Output
AAA
BBB
CCC
111
222
333
A simpler solution (note that this prints the fields in row order rather than column by column) would be:
awk -v RS="[[:blank:]\t\n]+" '1' input.file
Expects tab as delimiter:
$ cat <(cut -f 1 asd) <(cut -f 2 asd)
AAA
BBB
CCC
111
222
333
Since the order is of no importance:
$ awk 'BEGIN {RS="[ \t\n]+"} 1' file
AAA
111
BBB
222
CCC
333
Ugly, but it works:
for i in {1..2} ; do awk -v p="$i" '{print $p}' input.file ; done
Change the {1..2} to {1..n}, where n is the number of columns in the input file.
Explanation:
We define an awk variable p set to the shell loop variable i; i varies from 1 to n, and at each step we print the i'th column of the file.
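If you don't want to hard-code n, a small sketch that reads the column count from the first line (assuming every row has the same number of columns):
n=$(awk 'NR==1{print NF; exit}' input.file)
for i in $(seq 1 "$n") ; do awk -v p="$i" '{print $p}' input.file ; done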
This will work for an arbitrary number of space-separated columns:
awk '{for (A=1;A<=NF;A++) printf("%s\n",$A);}' input.file | sort -u > output.file
If space is not the separator (let's suppose ":" is the separator):
awk -F: '{for (A=1;A<=NF;A++) printf("%s\n",$A);}' input.file | sort -u > output.file

Print lines where first column matches, second column different

In a text file, how do I print out only the lines where the first column is duplicated but the 2nd column is different? I want to reconcile these differences. Possibly using awk/sed/bash?
Input:
Jon AAA
Jon BBB
Ellen CCC
Ellen CCC
Output:
Jon AAA
Jon BBB
Note that the real file is not sorted.
Thanks for any help.
This should do it (I broke the one-liner into 3 lines for better readability):
awk '!($1 in a) {a[$1]=$2;next}
$1 in a && $2!=a[$1]{p[$1 FS $2];p[$1 FS a[$1]]}
END{for(x in p)print x}' file
Line 1 saves $1 and $2 into the array when $1 is seen for the first time.
Line 2: for an already-seen $1 with a different $2, put both lines into array p, so that the same $1,$2 combination won't be printed multiple times.
Line 3: print the indices of array p.
sort file | uniq -u
Will only print the unique lines.
This might work for you:
sort file | uniq -u | rev | uniq -Df1 | rev
This sorts the file, removes fully duplicated lines, reverses each line, keeps only the lines whose key (the original first column, now the last field of the reversed line) occurs more than once, and then reverses the lines back to their original form.
This will drop duplicated lines and lines with singleton keys.
Just a normal unique sort should work
awk '!a[$0]++' test

awk compare two files - erase row from second file based on condition from first file

I need some help.
first file
0.5
0.4
0.1
0.6
0.9
second file .bam
(I have to use samtools view)
aaaa bbbb cccc
aaab bbaa ccaa
hoho jojo toto
sese rere baba
jouj douj trou
And I need output:
aaaa bbbb cccc
aaab bbaa ccaa
sese rere baba
Condition: if $1 from the first file is in <0.3;0.6>, print the same row from the second file; if it is not, erase it. I want to filter the second file based on the condition from the first file. I prefer awk or bash code, but it is not important.
condition for the first file:
awk '{if($1>0.3 && $1<0.6) {print $0}}'
Please could you help me?
Thanks a lot
Another way:
paste file1 file2 | awk '$1<=0.6&&$1>=0.3{$1="";print substr($0,2) }'
Here is one awk solution:
awk 'FNR==NR {a[NR]=$1;next} a[FNR]>0.3 && a[FNR]<0.6' firstfile secondfile
aaaa bbbb cccc
aaab bbaa ccaa
sese rere baba is not printed since you say <0.6 and not <=0.6
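If the endpoints should be included, so that 0.6 also qualifies (which matches the requested output), the inclusive variant is:
awk 'FNR==NR {a[NR]=$1;next} a[FNR]>=0.3 && a[FNR]<=0.6' firstfile secondfile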
You can use awk and its getline function. It reads lines from the second file, and for each one uses getline to read a line from the first one, compares the number, and prints the line if it matches:
awk '
BEGIN { f = ARGV[2]; --ARGC }
{
    getline n <f
    if ( (n >= 0.3) && (n <= 0.6) ) {
        print $0
    }
}
' file2 file1
It yields:
aaaa bbbb cccc
aaab bbaa ccaa
sese rere baba
