Linux sort -u does not work - sorting

I tried to sort a big file with more than 10000 lines. I want to find the unique species (like Fe, La, etc.) in the file (17x200.o_neighbors.raw.dat). Ideally, I should get a result like the one below (see the 4th column):
FRAME 0 9194 Fe 6330SI
FRAME 11 9194 La 12858H 6330SI
However, I got results like this
FRAME 0 9194 Fe 6330SI
FRAME 11 9194 La 12858H 6330SI
FRAME 19 9194 La 13537H 6330SI
There are two "La" species. How can I remove the duplicate?
Here is my command:
grep FRAME 17x200.o_neighbors.raw.dat | grep 9194 |sort -k 2 -n |sort -k 4 -u
The first sort -k 2 -n puts the lines in time-series order.
The second sort -k 4 -u is meant to keep one line per species.
Any suggestions would be appreciated.

Per the sort manpage, the -k flag works like so:
-k, --key=POS1[,POS2]
start a key at POS1 (origin 1), end it at POS2 (default end of line)
So -k 4 defines a key that runs from field 4 to the end of the line. In your example the keys are { Fe 6330SI, La 12858H 6330SI, La 13537H 6330SI }, which are all distinct, so -u removes nothing.
To fix this, you need to define a key that both starts and ends at field 4:
... | sort -k 4,4 -u
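The difference between the two key definitions can be checked on a few lines. This is only a sketch: the variable names (sample, open_key, closed_key) are made up for illustration, and the input is the three lines quoted in the question rather than the real file.

```shell
# Three lines modelled on the question's output (not the real data file).
sample='FRAME 0 9194 Fe 6330SI
FRAME 11 9194 La 12858H 6330SI
FRAME 19 9194 La 13537H 6330SI'

# Key runs from field 4 to the end of the line: all three keys differ,
# so -u has nothing to drop.
open_key=$(printf '%s\n' "$sample" | sort -k 4 -u)

# Key is field 4 only: the two La lines compare equal and one is dropped.
closed_key=$(printf '%s\n' "$sample" | sort -k 4,4 -u)

printf '%s\n' "$open_key" | wc -l     # number of lines kept with -k 4
printf '%s\n' "$closed_key" | wc -l   # number of lines kept with -k 4,4
```

With -k 4 all three lines survive; with -k 4,4 only one line per species is left.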

Using awk:
$ awk '!($4 in a) { # if $4 has not been seen yet, ...
a[$4]=$0 # store the whole record under that key
}
END { # after all records have been processed
for(i in a) # iterate over the stored records (order is unspecified)
print a[i] # output
} ' file
FRAME 0 9194 Fe 6330SI
FRAME 11 9194 La 12858H 6330SI
Now you can sort that output.
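If the input is already in time order, a shorter awk idiom may avoid the second sort altogether. The sketch below is not the answer's code: it uses the common "!seen[$4]++" pattern, where seen is just an illustrative array name, to print the first line for each value of field 4 in input order.

```shell
# Three lines modelled on the question's output, already in time order.
sample='FRAME 0 9194 Fe 6330SI
FRAME 11 9194 La 12858H 6330SI
FRAME 19 9194 La 13537H 6330SI'

# seen[$4]++ is 0 (false) the first time a species appears, so the
# negated pattern is true only for that first line, which awk prints.
first_per_species=$(printf '%s\n' "$sample" | awk '!seen[$4]++')

printf '%s\n' "$first_per_species"
```

Because nothing is buffered until END, the lines come out in the order they were read.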

I tested it. The command below also filters out the duplicate, but I do not understand why.
grep FRAME 17x200.o_neighbors.raw.dat | grep 9194 |sort -k 2 -n |sort -k 3,4 -u
Any explanation would be appreciated.
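One possible explanation, sketched on the three sample lines from the question: -k 3,4 makes the key run from field 3 through field 4. Since grep 9194 leaves field 3 identical on every line, only field 4 can make two keys differ, so the result should match -k 4,4 for this data. The variable names are illustrative only.

```shell
# Three lines modelled on the question's output; field 3 is 9194 on all.
sample='FRAME 0 9194 Fe 6330SI
FRAME 11 9194 La 12858H 6330SI
FRAME 19 9194 La 13537H 6330SI'

# Key = fields 3 through 4, i.e. "9194 Fe" or "9194 La".
by_3_4=$(printf '%s\n' "$sample" | sort -k 3,4 -u)

# Key = field 4 only, i.e. "Fe" or "La".
by_4_4=$(printf '%s\n' "$sample" | sort -k 4,4 -u)

# Show which species survive under each key definition.
printf '%s\n' "$by_3_4" | cut -d' ' -f4
printf '%s\n' "$by_4_4" | cut -d' ' -f4
```

If field 3 varied between lines, -k 3,4 would treat the same species under different field-3 values as distinct, so -k 4,4 is the safer choice.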

Related

Bash Ordering csv by column not as expected with numbers and spaces at the end of the string [duplicate]

I have a very simple text file with 3 fields, separated by spaces, like the following:
123 15 0
123 14 0
345 12 0
345 11 0
I issued a sort command to sort by the first column: sort -k 1 myfile. But it does not sort by just the first column. It sorts by the whole line, and I get the following result:
123 14 0
123 15 0
345 11 0
345 12 0
Is there anything wrong on my command or file?
You need to use:
sort -k 1,1 -s myfile
if you want to sort only on the first field. This syntax specifies the start and end field for sorting. sort -k 1 means to sort starting with the first field through to the end of the line. To ensure the lines are kept in the same order with respect to the input where the sort key is the same, you need to use a stable sort with the -s flag (GNU).
See this from the sort(1) man page:
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where
F is a field number and C a character position in the field; both are
origin 1, and the stop position defaults to the line's end.
and the info page:
The --stable (-s) option disables this last-resort comparison so that
lines in which all fields compare equal are left in their original relative
order.
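The effect of the end position and of -s can be seen on the question's four lines. This is a small sketch with made-up variable names (data, whole_line, first_field), not output copied from the original post.

```shell
# The four lines from the question, in their original order.
data='123 15 0
123 14 0
345 12 0
345 11 0'

# Key runs from field 1 to the end of the line, so the second column
# takes part in the comparison and the lines are reordered.
whole_line=$(printf '%s\n' "$data" | sort -k 1)

# Key is field 1 only, and -s disables the last-resort whole-line
# comparison, so lines with equal keys keep their input order.
first_field=$(printf '%s\n' "$data" | sort -s -k 1,1)

printf '%s\n' "$whole_line"
printf '%s\n' "$first_field"
```

The second result is identical to the input here, because the input is already grouped by its first field.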

How to sort lines based on specific part of their value?

When I run the following command:
command list -r machine-a-.* | sort -nr
It gives me the following result:
machine-a-9
machine-a-8
machine-a-72
machine-a-71
machine-a-70
I wish to sort these lines based on the number at the end, in descending order.
( Clearly sort -nr doesn't work as expected. )
You just need the -t and -k options in the sort.
command list -r machine-a-.* | sort -t '-' -k 3 -nr
-t is the separator used to separate the fields.
By giving it the value of '-', sort will see given text as:
Field 1   Field 2   Field 3
machine   a         9
machine   a         8
machine   a         72
machine   a         71
machine   a         70
-k specifies the field at which the comparison key starts.
With the value 3, the key runs from Field 3 to the end of the line; Field 3 is the last field here, so sort compares the lines by the values in Field 3 alone.
Namely, these strings will be compared:
9
8
72
71
70
-n makes sort compare the keys as numbers instead of strings.
-r makes sort output the lines in reverse (descending) order.
Therefore, by sorting the numbers from Field 3 in reverse order, this will be the output:
machine-a-72
machine-a-71
machine-a-70
machine-a-9
machine-a-8
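The same options can be tried without the original command, which is not available here. This sketch feeds the five names from the question through sort; names and sorted are illustrative variable names, and the key is written as -k 3,3nr so that it explicitly ends at field 3.

```shell
# The five names from the question, in the order the command printed them.
names='machine-a-9
machine-a-8
machine-a-72
machine-a-71
machine-a-70'

# -t '-' splits on hyphens; -k 3,3nr compares field 3 only,
# numerically (n) and in descending order (r).
sorted=$(printf '%s\n' "$names" | sort -t '-' -k 3,3nr)

printf '%s\n' "$sorted"
```

Attaching n and r to the key keeps the options local to that key, which matters once a second -k is added.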
Here is an example of input to sort:
$ cat 1.txt
machine-a-9
machine-a-8
machine-a-72
machine-a-71
machine-a-70
Here is our short program:
$ cat 1.txt | ( IFS=-; while read A B C ; do echo $C $A-$B-$C; done ) | sort -rn | cut -d' ' -f 2
Here is its output:
machine-a-72
machine-a-71
machine-a-70
machine-a-9
machine-a-8
Explanation:
cat 1.txt |              # feed the contents of the file into the pipeline
( IFS=-                  # group some commands; set the field separator to "-" for read
while read A B C         # read each line into the 3 variables A, B and C
do echo $C $A-$B-$C      # write the line with $C added at the beginning
done
) |                      # end of the group
sort -rn |               # numeric sort in reverse order
cut -d' ' -f 2           # cut off the first field, which is no longer needed

Convert Mainframe SORT to Shell Script

Is there any easy way to convert JCL SORT to Shell Script?
Here is the JCL SORT:
OPTION ZDPRINT
SORT FIELDS=(15,1,CH,A)
SUM FIELDS=(16,8,25,8,34,8,43,8,52,8,61,8),FORMAT=ZD
OUTREC BUILD=(14X,15,54,13X)
Only the 54 bytes starting at byte 15 are relevant from the input data: they hold the key and the source values for the summation. Other bytes from the input are not important.
Assuming the data is printable.
The data is sorted on the one-byte key, and for records with the same key each of the six numbers is summed separately. A single record is written per key, with the summed values and with the other data (the single bytes in between and at the end) taken from the first record. The sort is "unstable" (meaning that the order of records presented to the summation is not reproducible from one execution to the next), so those byte values should theoretically be the same on all records, or be irrelevant.
The output, for each key, is presented as a record containing 14 blanks (14X), then the 54 bytes starting at position 15 (which is the one-byte key), followed by 13 blanks (13X). The numbers should be right-aligned and left-zero-filled [OP to confirm, and amend sample data and expected output].
Assuming the sums will only be positive and unsigned, and that any number less than 999999990 has leading zeros in its unused positions (numbers are character, right-aligned and left-zero-filled).
Assuming the one-byte key will only be alphabetic.
The data has already been converted to ASCII from EBCDIC.
Sample Input:
00000000000000A11111111A11111111A11111111A11111111A11111111A111111110000000000000
00000000000000B22222222A22222222A22222222A22222222A22222222A222222220000000000000
00000000000000C33333333A33333333A33333333A33333333A33333333A333333330000000000000
00000000000000A44444444B44444444B44444444B44444444B44444444B444444440000000000000
Expected Output:
A55555555A55555555A55555555A55555555A55555555A55555555
B22222222A22222222A22222222A22222222A22222222A22222222
C33333333A33333333A33333333A33333333A33333333A33333333
(14 preceding blanks and 13 trailing blanks)
Expected Volume: tens of thousands
I have figured out an answer (FIELDWIDTHS is a gawk extension):
awk -v FIELDWIDTHS="14 1 8 1 8 1 8 1 8 1 8 1 8 13" \
'{if(!($2 in a)) {a[$2]=$2; c[$2]=$4; e[$2]=$6; g[$2]=$8; i[$2]=$10; k[$2]=$12} \
b[$2]+=$3; d[$2]+=$5; f[$2]+=$7; h[$2]+=$9; j[$2]+=$11; l[$2]+=$13;} END \
{for(id in a) printf("%14s%s%08d%s%08d%s%08d%s%08d%s%08d%s%08d%13s\n","",a[id],b[id],c[id],d[id],e[id],f[id],g[id],h[id],i[id],j[id],k[id],l[id],"");}' input | sort
Explanation:
1) Split each record into fixed-width fields
awk -v FIELDWIDTHS="14 1 8 1 8 1 8 1 8 1 8 1 8 13"
2) Use $2 as the key; the separator bytes $4, $6, $8, $10 and $12 are stored only the first time a key is seen
{if(!($2 in a)) {a[$2]=$2; c[$2]=$4; e[$2]=$6; g[$2]=$8; i[$2]=$10; k[$2]=$12}
3) The six numbers are summed per key
b[$2]+=$3; d[$2]+=$5; f[$2]+=$7; h[$2]+=$9; j[$2]+=$11; l[$2]+=$13;} END
4) Print one record per key, zero-filling each sum to 8 digits with %08d. for(id in a) visits the keys in no particular order, so the output is piped through sort
{for(id in a) printf("%14s%s%08d%s%08d%s%08d%s%08d%s%08d%s%08d%13s\n","",a[id],b[id],c[id],d[id],e[id],f[id],g[id],h[id],i[id],j[id],k[id],l[id],"");}' input | sort
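FIELDWIDTHS is specific to gawk. As a hedged alternative, the sketch below does the same summation with substr(), which should run under any POSIX awk. The byte offsets are derived from the SORT and SUM statements in the question; the names sample, summed, seen, sep and sum are invented for illustration, and the input is the question's sample data.

```shell
# Sample records from the question (81 bytes each).
sample='00000000000000A11111111A11111111A11111111A11111111A11111111A111111110000000000000
00000000000000B22222222A22222222A22222222A22222222A22222222A222222220000000000000
00000000000000C33333333A33333333A33333333A33333333A33333333A333333330000000000000
00000000000000A44444444B44444444B44444444B44444444B44444444B444444440000000000000'

summed=$(printf '%s\n' "$sample" | awk '
{
  key = substr($0, 15, 1)              # one-byte sort key at byte 15
  if (!(key in seen)) {                # first record for this key
    seen[key] = 1
    for (n = 1; n <= 5; n++)           # separator bytes at 24, 33, 42, 51, 60
      sep[key, n] = substr($0, 15 + 9 * n, 1)
  }
  for (n = 0; n <= 5; n++)             # numbers start at 16, 25, 34, 43, 52, 61
    sum[key, n] += substr($0, 16 + 9 * n, 8)
}
END {
  for (key in seen) {                  # key order is unspecified, hence the sort below
    line = sprintf("%14s%s", "", key)  # 14 blanks, then the key
    for (n = 0; n <= 5; n++) {
      line = line sprintf("%08d", sum[key, n])   # zero-filled 8-digit sum
      if (n < 5) line = line sep[key, n + 1]
    }
    printf "%s%13s\n", line, ""        # 13 trailing blanks
  }
}' | sort)

printf '%s\n' "$summed"
```

The separators are taken from the first record seen for each key, which mirrors the question's description; sums that exceed 8 digits are not handled.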
Okay, here is what I have tried:
1) Extract the duplicate keys from the file and store them in a file named duplicates.
awk '{k=substr($0,1,15);a[k]++}END{for(i in a)if(a[i]>1)print i}' sample > duplicates
OR
awk '{k=substr($0,1,15);print k}' sample | sort | uniq -c | awk '$1>1{print $2}' > duplicates
2) For the duplicates, do the calculation and create newfile in the specified format. The redirection sits after done so that each key adds a line instead of overwriting newfile, and substr($i,1,8) keeps the trailing padding out of the last number. Note that this reuses the key byte as every separator.
while read -r line
do
grep "^$line" sample | awk -F'[A-Z]' -v key="$line" '{for(i=2;i<=7;i++)f[i]+=substr($i,1,8)}END{printf("%14s"," ");for(i=2;i<=7;i++)printf("%s%08d",substr(key,15,1),f[i]);printf("%13s\n"," ")}'
done < duplicates > newfile
3) For the unique ones, format and append to newfile
grep -v -f duplicates sample | sed 's/0/ /g' >> newfile ## wrong if a 0 occurs within the data rather than only in the padding at the start and end of a row.
OR
grep -v -f duplicates sample | awk '{printf("%14s%s%13s\n"," ",substr($0,15,54)," ")}' >> newfile
If anything is unclear, let me know.
