shell sort command : How to sort by the last column (the number of columns is uncertain)? - bash

If the data is like the follow:
a,b,3
c,d,e,f,2
g,1
I want sort by the last column. the result should be:
g,1
c,d,e,f,2
a,b,3

if the last field is single digit
$ rev file | sort | rev
you may need to add -t, -n to sort for numerical ordering but single digits it doesn't matter.
or, for the general case with awk
$ awk -F, '{a[$NF]=$0} END{n=asorti(a,d); for(k=1;k<=n;k++) print a[d[k]]}' file
g,1
c,d,e,f,2
a,b,3
This will fail if the last field is not unique. Using decorate/sort/undecorate idiom you can write instead (as you found yourself)
$ awk -F, '{print $NF FS $0}' file | sort -n | cut -d, -f2-
it's safer to use the field delimiter between the key and the record since you want to ensure the FS doesn't appear in the key itself.

I have a stupid but simple way to do it :)
// if original data in the file : ~/Desktop/1.log
$ awk -F, '{print $NF, $0}' ~/Desktop/1.log | sort -n | awk '{print $2}'
g,1
c,d,e,f,2
a,b,3

Here is my solution using bash script -- i named it uncertain.sh.
# Set here the size of the largest item to sort.
# In our case it is c,d,e,f,2 which is size 5.
max_n=5
# This function 'pads' array with P's before last element
# to force it to grow to max_n size.
# For example, (a b 3) will be transformed into (a b P P 3).
pad () {
local arr=("$#")
local l=${#arr[#]}
local diff_l=$((max_n-l))
local padding=""
# construct padding
for i in `seq 1 $diff_l`; do
padding+="P "
done
local l_minus=$((l-1))
arr=(${arr[#]:0:$l_minus} "$padding"${arr[#]:$l_minus})
echo "${arr[#]}"
}
################################################
# Provide A,B,C here to sort by last item
################################################
A="a,b,3"
B="c,d,e,f,2"
C="g,1"
A=$(echo "$A" | tr ',' ' ')
B=$(echo "$B" | tr ',' ' ')
C=$(echo "$C" | tr ',' ' ')
a=(`echo "$A"`)
b=(`echo "$B"`)
c=(`echo "$C"`)
# Get padded arrays.
a=$(pad "${a[#]}")
b=$(pad "${b[#]}")
c=$(pad "${c[#]}")
# Here, we sort by the last field (we can do this since
# padded arrays are all same size 5).
# Then we remove 'P's from strings.
feed=$(printf "%s\n" "$a" "$b" "$c" | sort -k5,5n | tr -d 'P')
# Lastly, we change spaces with commas ','.
while read line; do
echo "$line" | tr -s ' ' | tr ' ' ','
done < <(echo "$feed")
Here's the output
$ ./uncertain.sh
g,1
c,d,e,f,2
a,b,3
Here's how I did it:
We start with
a,b,3
c,d,e,f,2
g,1
We convert this to
a,b,P,P,3
c,d,e,f,2
g,P,P,P,1
Then we can sort by the 5th column since they are all of same size 5.
So this becomes
g,P,P,P,1
c,d,e,f,2
a,b,P,P,3
We can now remove P's.
g,1
c,d,e,f,2
a,b,3
Hope you found this useful.

Related

check if column has more than one value in unix [duplicate]

I have a text file with a large amount of data which is tab delimited. I want to have a look at the data such that I can see the unique values in a column. For example,
Red Ball 1 Sold
Blue Bat 5 OnSale
...............
So, its like the first column has colors, so I want to know how many different unique values are there in that column and I want to be able to do that for each column.
I need to do this in a Linux command line, so probably using some bash script, sed, awk or something.
What if I wanted a count of these unique values as well?
Update: I guess I didn't put the second part clearly enough. What I wanted to do is to have a count of "each" of these unique values not know how many unique values are there. For instance, in the first column I want to know how many Red, Blue, Green etc coloured objects are there.
You can make use of cut, sort and uniq commands as follows:
cat input_file | cut -f 1 | sort | uniq
gets unique values in field 1, replacing 1 by 2 will give you unique values in field 2.
Avoiding UUOC :)
cut -f 1 input_file | sort | uniq
EDIT:
To count the number of unique occurences you can make use of wc command in the chain as:
cut -f 1 input_file | sort | uniq | wc -l
awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] } ' test.csv
You can use awk, sort & uniq to do this, for example to list all the unique values in the first column
awk < test.txt '{print $1}' | sort | uniq
As posted elsewhere, if you want to count the number of instances of something you can pipe the unique list into wc -l
Assuming the data file is actually Tab separated, not space aligned:
<test.tsv awk '{print $4}' | sort | uniq
Where $4 will be:
$1 - Red
$2 - Ball
$3 - 1
$4 - Sold
# COLUMN is integer column number
# INPUT_FILE is input file name
cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l
Here is a bash script that fully answers the (revised) original question. That is, given any .tsv file, it provides the synopsis for each of the columns in turn. Apart from bash itself, it only uses standard *ix/Mac tools: sed tr wc cut sort uniq.
#!/bin/bash
# Syntax: $0 filename
# The input is assumed to be a .tsv file
FILE="$1"
cols=$(sed -n 1p $FILE | tr -cd '\t' | wc -c)
cols=$((cols + 2 ))
i=0
for ((i=1; i < $cols; i++))
do
echo Column $i ::
cut -f $i < "$FILE" | sort | uniq -c
echo
done
This script outputs the number of unique values in each column of a given file. It assumes that first line of given file is header line. There is no need for defining number of fields. Simply save the script in a bash file (.sh) and provide the tab delimited file as a parameter to this script.
Code
#!/bin/bash
awk '
(NR==1){
for(fi=1; fi<=NF; fi++)
fname[fi]=$fi;
}
(NR!=1){
for(fi=1; fi<=NF; fi++)
arr[fname[fi]][$fi]++;
}
END{
for(fi=1; fi<=NF; fi++){
out=fname[fi];
for (item in arr[fname[fi]])
out=out"\t"item"_"arr[fname[fi]][item];
print(out);
}
}
' $1
Execution Example:
bash> ./script.sh <path to tab-delimited file>
Output Example
isRef A_15 C_42 G_24 T_18
isCar YEA_10 NO_40 NA_50
isTv FALSE_33 TRUE_66

how to use values with sed in shell scripting?

i am trying te write a shell script in alphametic ,
i have 5 parameters like this
$alphametic 5790813 BEAR RARE ERE RHYME
to get
ABEHMRY -> 5790813
i tried this :
#!/bin/bash
echo "$2 $3 $4 $5" | sed 's/ //g ' | sed 's/./&\n/g' | sort -n | sed '/^$/d' | uniq -i > testing
paste -sd '' testing > testing2
sed "s|^\(.*\)$|\1 -> ${1}|" testing2
but i get error (with the last command sed), i dont know where is the problem .
Another approach:
chars=$(printf '%s' "${#:2}" | fold -w1 | sort -u | paste -sd '')
echo "$chars -> $1"
sort's -n does't make sense here: these are letters, not numbers.
One idea using awk for the whole thing:
arg1="$1"
shift
others="$#"
awk -v arg1="${arg1}" -v others="${others}" '
BEGIN { n=split(others,arr,"") # split into into array of single characters
for (i=1;i<=n;i++) # loop through indices of arr[] array
letters[arr[i]] # assign characters as indices of letters[] array; eliminates duplicates
delete letters[" "] # delete array index for "<space>"
PROCINFO["sorted_in"]="#ind_str_asc" # sort array by index
for (i in letters) # loop through indices
printf "%s", i # print index to stdout
printf " -> %s\n", arg1 # finish off line with final string
}
'
NOTE: requires GNU awk for the PROCINFO["sorted_in"] (to sort the indices of the letters[] array)
This generates:
ABEHMRY -> 5790813

Error in bash script: arithmetic error

I am wrote a simple script to extract text from a bunch of files (*.out) and add two lines at the beginning and a line at the end. Then I add the extracted text with another file to create a new file. The script is here.
#!/usr/bin/env bash
#A simple bash script to extract text from *.out and create another file
for f in *.out; do
#In the following line, n is a number which is extracted from the file name
n=$(echo $f | cut -d_ -f6)
t=$((2 * $n ))
#To extract the necessary text/data
grep " B " $f | tail -${t} | awk 'BEGIN {OFS=" ";} {print $1, $4, $5, $6}' | rev | column -t | rev > xyz.xyz
#To add some text as the first, second and last lines.
sed -i '1i -1 2' xyz.xyz
sed -i '1i $molecule' xyz.xyz
echo '$end' >> xyz.xyz
#To combine the extracted info with another file (ea_input.in)
cat xyz.xyz ./input_ea.in > "${f/abc.out/pqr.in}"
done
./script.sh: line 4: (ls file*.out | cut -d_ -f6: syntax error: invalid arithmetic operator (error token is ".out) | cut -d_ -f6")
How I can correct this error?
In bash, when you use:
$(( ... ))
it treats the contents of the brackets as an arithmetic expression, returning the result of the calculation, and when you use:
$( ... )
it executed the contents of the brackets and returns the output.
So, to fix your issue, it should be as simple as to replace line 4 with:
n=$(ls $f | cut -d_ -f6)
This replaces the outer double brackets with single, and removes the additional brackets around ls $f which should be unnecessary.
The arithmetic error can be avoided by adding spaces between parentheses. You are already using var=$((arithmetic expression)) correctly elsewhere in your script, so it should be easy to see why $( ((ls "$f") | cut -d_ -f6)) needs a space. But the subshells are completely superfluous too; you want $(ls "$f" | cut -d_ -f6). Except ls isn't doing anything useful here, either; use $(echo "$f" | cut -d_ -f6). Except the shell can easily, albeit somewhat clumsily, extract a substring with parameter substitution; "${f#*_*_*_*_*_}". Except if you're using Awk in your script anyway, it makes more sense to do this - and much more - in Awk as well.
Here is an at empt at refactoring most of the processing into Awk.
for f in *.out; do
awk 'BEGIN {OFS=" " }
# Extract 6th _-separated field from input filename
FNR==1 { split(FILENAME, f, "_"); t=2*f[6] }
# If input matches regex, add to array b
/ B / { b[++i] = $1 OFS $4 OFS $5 OFS $6 }
# If array size reaches t, start overwriting old values
i==t { i=0; m=t }
END {
# Print two prefix lines
print "$molecule"; print -1, 2;
# Handle array smaller than t
if (!m) m=i
# Print starting from oldest values (index i + 1)
for(j=1; j<=m; j++) {
# Wrap to beginning of array at end
if(i+j > t) i-=t
print b[i+j]; }
print "$end" }' "$f" |
rev | column -t | rev |
cat - ./input_ea.in > "${f/foo.out/bar.in}"
done
Notice also how we avoid using a temporary file (this would certainly have been avoidable without the Awk refactoring, too) and how we take care to quote all filename variables in double quotes.
The array b contains (up to) the latest t values from matching lines; we collect these into an array which is constrained to never contain more than t values by wrapping the index i back to the beginning of the array when we reach index t. This "circular array" avoids keeping too many values in memory, which would make the script slow if the input file contains many matches.

Using awk to extract two separate strings

MacOS, Unix
So I have a file in the following stockholm format:
# STOCKHOLM 1.0
#=GS WP_002855993.1/5-168 DE [subseq from] MULTISPECIES: AAC(3) family N-acetyltransferase [Campylobacter]
#=GS WP_002856586.1/5-166 DE [subseq from] MULTISPECIES: aminoglycoside N(3)-acetyltransferase [Campylobacter]
WP_002855993.1/5-168 ------LEHNGKKYSDKDLIDAFYQLGIKRGDILCVHTELmkfgKALLT.K...NDFLKTLLECFFKVLGKEGTLLMP-TF---TYSF------CKNE------VYDKVHSKG--KVGVLNEFFRTSGgGVRRTSDPIFSFAVKGAKADIFLKEN--SSCFGKDSVYEILTREGGKFMLLGLNYG-HALTHYAEE-----
#=GR WP_002855993.1/5-168 PP ......6788899999***********************9333344455.6...8999********************.33...3544......4555......799999975..68********98626999****************999865..689*********************9875.456799996.....
WP_002856586.1/5-166 ------LEFENKKYSTYDFIETFYKLGLQKGDTLCVHTEL....FNFGFpLlsrNEFLQTILDCFFEVIGKEGTLIMP-TF---TYSF------CKNE------VYDKINSKT--KMGALNEYFRKQT.GVKRTNDPIFSFAIKGAKEELFLKDT--TSCFGENCVYEVLTKENGKYMTFGGQG--HTLTHYAEE-----
#=GR WP_002856586.1/5-166 PP ......5566677788889999******************....**9953422246679*******************.33...3544......4455......799998876..589**********.******************99999886..689******************999765..5666***96.....
#=GC PP_cons ......6677788899999999*****************9....77675.5...68889*******************.33...3544......4455......799999976..689*******998.8999**************99999876..689******************9998765.466699996.....
#=GC RF xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx....xxxxx.x...xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
WP_002855993.1/5-168 -----------------------------------------------------------------------------------------------------
#=GR WP_002855993.1/5-168 PP .....................................................................................................
WP_002856586.1/5-166 -----------------------------------------------------------------------------------------------------
#=GR WP_002856586.1/5-166 PP .....................................................................................................
#=GC PP_cons .....................................................................................................
#=GC RF xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
//
And I've created a script to extract the IDs I want, in this case, WP_002855993.1 and WP_002856586.1, and search through another file to extract DNA sequences with the appropriate IDs. The script is as follows:
#!/bin/bash
for fileName in *.sto;
do
protID=$(grep -o "WP_.\{0,11\}" $fileName | sort | uniq)
echo $protID
file=$(echo $fileName | cut -d '_' -f 1,2,3)
file=$(echo $file'_protein.faa')
echo $file
if [ -n "$protID" ]; then
gawk "/^>/{N=0}/^.*$protID/{N=1} {if(N)print}" $file >>
sequence_protein.file
fi
done
And here's an example of the type of file I'm looking through:
>WP_002855993.1 MULTISPECIES: AAC(3) family N-acetyltransferase [Campylobacter]
MKYFLEHNGKKYSDKDLIDAFYQLGIKRGDILCVHTELMKFGKALLTKNDFLKTLLECFFKVLGKEGTLLMPTFT
>WP_002856586.1 MULTISPECIES: aminoglycoside N(3)-acetyltransferase [Campylobacter]
MKYLLEFENKKYSTYDFIETFYKLGLQKGDTLCVHTELFNFGFPLLSRNEFLQTILDCFFEVIGKEGTLIMPTFT
YSFCKNEVYDKINSKTKMGALNEYFRKQTGVKRTNDPIFSFAIKGAKEELFLKDTTSCFGENCVYEVLTKENGKY
>WP_002856595.1 MULTISPECIES: acetyl-CoA carboxylase biotin carboxylase subunit [Campylobacter]
MNQIHKILIANRAEIAVRVIRACRDLHIKSVAVFTEPDRECLHVKIADEAYRIGTDAIRGYLDVARIVEIAKACG
This script works if I have one ID, but in some cases I get two IDs, and I get an error, because I think it's looking for an ID like "WP_002855993.1 WP_002856586.1". Is there a way to modify this script so it looks for two separate occurrences? I guess it's something with the gawk command, but I'm not sure what exactly. Thanks in advance!
an update to the original script:
#!/usr/bin/env bash
for file_sto in *.sto; do
file_faa=$(echo $file_sto | cut -d '_' -f 1,2,3)
file_faa=${file_faa}"_protein.faa"
awk '(NR==FNR) { match($0,/WP_.\{0,11\}/);
if (RSTART > 0) a[substr($0,RSTART,RLENGTH)]++
next; }
($1 in a){ print RS $0 }' $file_sto RS=">" $file_faa >> sequence_protein.file
done
The awk part can probably even be reduced to :
awk '(NR==FNR) { if ($0 ~ /^WP_/) a[$1]++; next }
($1 in a) { print RS $0 }' FS='/' $file_sto FS=" " RS=">" $file_faa
This awk script does the following:
Set the field separator FS to / and read file $file_sto.
When reading $file_sto the record number NR is the same as the file record number FNR.
(NR==FNR) { if ($0 ~ /^WP_/) a[$1]++; next }: this line works only one $file_sto due to the condition in the front. It checks if the line starts with WP_. If it does, it stores the first field $1 (separated by FS which is a /) in an array a; it then skips to the next record in the file (next).
If we finished reading file $file_sto, we set the field separator back to a single space FS=" " (see section Regular expression) and the record separator RS to > and start reading file $file_faa The latter implies that $0 will contain all lines between > and the first field $1 is the protID.
Reading $file_faa, the file record number FNR is restarted from 1 while NR is not reset. Hence the first awk line is skipped.
($1 in a){ print RS $0 } if the first field is in the array a, print the record with the record separator in front of it.
fixing the original script:
If you want to keep your original script, you could store the protID in a list and then loop the list :
#!/bin/bash
for fileName in *.sto; do
protID_list=( $(grep -o "WP_.\{0,11\}" $fileName | sort | uniq) )
echo ${protID_list[#]}
file=$(echo $fileName | cut -d '_' -f 1,2,3)
file=$(echo $file'_protein.faa')
echo $file
for protID in ${protID_list[#]}; do
if [ -n "$protID" ]; then
gawk "/^>/{N=0}/^.*$protID/{N=1} {if(N)print}" $file >>
sequence_protein.file
fi
done
done
Considering your output file is test.
Using following command gives you only file names:
>>cat text | awk '{print $1}' | grep -R 'WP*' | cut -d":" -f2
gives me output:
WP_002855993.1/5-168
WP_002856586.1/5-166
WP_002855993.1/5-168
WP_002856586.1/5-166
Do you want output like that?

Extract specific text out of bash variable containing TSV string

I have the following TSV and newlines string assigned to a variable in bash:
TAGS Product3 qwerty text Desc3
TAGS Product1 qwerty text Desc1
TAGS Product2 qwerty text Desc2
I would like to extract the last column to a new string, and it has to be product ordered by my product input, for example:
Product1,Product2,Product3 will have to output: Desc1,Desc2,Desc3
What would be the best approach to accomplish this?
echo "$tsv_data" | awk '{print $2 " " $5}' | sort | awk '{print $2}' | paste -sd ',' -
This does the following steps in order:
Print the second and 5th argument (Product and Description) with a space between them.
Sort the input with sort (use gnu-sort if it can contain numbers)
Print only the description (in each line)
Join the lines together with paste
which will produce the following output:
Desc1,Desc2,Desc3
Here's a function that I suppose should do it:
get_descriptions() {
local tsvstring="$1"
local prodnames="$2"
local result=()
# read tsv line by line, splitting into variables
while IFS=$'\t' read -r tags prodname val1 val2 desc || [[ -n ${prodname} && -n ${desc} ]]; do
# check if the line matches the query, and if, append to array
if grep -iq "${prodname}" <<< "${prodnames}"; then
result+=("${desc}")
fi
done <<< "${tsvstring}"
# echo the result-array with field separator set to comma
echo $(IFS=,; echo "${result[*]}")
}
Then you can just use it like:
get_descriptions "${tsv_string_var}" "product1,product2"
echo "$var" | sort -k2 tags | cut -f5 | paste -sd,
sort + awk + paste pipeline:
echo "$tsv" | sort -nk2 | awk '{print $5}' | paste -sd',' -
The output:
Desc1,Desc2,Desc3
sort -nk2 - sorts the input by the second column numerically
awk '{print $5}' - prints out each fifth column
paste -sd',' - merge lines with ,

Resources