I have a fasta file such as:
>sequence1_CP [seq virus]
>sequence2 [virus]
>sequence4_CP hypothetical protein [another virus]
>sequence5 hypothetical protein [another virus]
>sequence6 |hypothetical protein[virus]
>sequence7 |hypothetical protein[virus]
And in this file I would like to remove duplicated sequence and get:
>sequence1_CP [seq virus]
>sequence4_CP hypothetical protein [another virus]
>sequence6 |hypothetical protein[virus]
Here, as you can see, the content after the > name for sequence1_CP, sequence2 and sequence3 is the same, so I want to keep only one of the 3. But if one of the 3 sequences has _CP in its name, then I want to keep that one in particular. If none of them has _CP in its name, it does not matter which one I keep.
So for the first duplicates between Sequence1_CP, Sequence2 and Sequence3 I keep sequence1_CP
For the second duplicates between sequence4_CP and sequence5 I keep sequence4_CP
And for the third duplicates between sequence6 and sequence7 I keep the first one sequence6
Does someone have an idea using biopython or a bash method?
In a fasta file, identical sequences are not necessarily wrapped at the same position, so it is paramount to merge the lines of each sequence before comparing. Furthermore, sequences can be written in upper case or lower case, but they are case insensitive.
The following awk will do exactly that:
$ awk 'BEGIN{RS="";ORS="\n\n"; FS="\n"}
{seq="";for(i=2;i<=NF;++i) seq=seq toupper($i)}
!(seq in a){print; a[seq]}' file.fasta
There exists actually a version of awk which has been upgraded to process fasta files:
$ bioawk -c fastx '!(seq in a){print; a[seq]}' file.fasta
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.
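The merge-then-compare idea can also be sketched in plain Python. This is only an illustration, not the code of any answer here: the function name dedupe_fasta is hypothetical, and it assumes that a record counts as a _CP record when the first word of its header ends in _CP.

```python
def dedupe_fasta(text):
    """Return FASTA text with one record per distinct sequence.

    Sequence lines are joined and upper-cased before comparison, and a
    header whose name ends in _CP replaces a previously kept header.
    """
    kept = {}                  # normalised sequence -> header line
    header, parts = None, []

    def flush():
        # Store the record collected so far (if any).
        if header is None:
            return
        seq = "".join(parts).upper()
        fields = header[1:].split()
        name = fields[0] if fields else ""
        if seq not in kept or name.endswith("_CP"):
            kept[seq] = header

    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            flush()            # a new header closes the previous record
            header, parts = line, []
        elif line:
            parts.append(line)
    flush()                    # the last record has no following header
    return "".join(f"{h}\n{s}\n" for s, h in kept.items())


sample = ">a x\nMK\nTL\n>b_CP y\nmktl\n>c z\nGG\n"
print(dedupe_fasta(sample), end="")
```

The sequences are written back on a single line each, in the order in which each distinct sequence was first seen.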
You could use this awk one-liner:
$ awk 'BEGIN{FS="\n";RS=""}{if(!seen[$2,$3]++)print}' file
>sequence1_CP [seq virus]
>sequence4_CP hypothetical protein [another virus]
>sequence6 |hypothetical protein[virus]
The above relies on the observation that the sequences are ordered so that the _CP records come before the others, as in the sample. If that is not in fact the case, use the following. It stores the first instance of each sequence, which is overwritten if a _CP record with the same sequence is found:
$ awk 'BEGIN{FS="\n";RS=""}{if(!($2,$3) in seen||$1~/^[^ ]+_CP /)seen[$2,$3]=$0}END{for(i in seen)print (++j>1?ORS:"") seen[i]}' file
Or in pretty-print:
$ awk '
BEGIN { FS="\n"; RS="" }
{
    if(!(($2,$3) in seen)||$1~/^[^ ]+_CP /)
        seen[$2,$3]=$0
}
END {
    for(i in seen)
        print (++j>1?ORS:"") seen[i]
}' file
The output order is awk's default for a for(i in seen) loop, i.e. it appears random.
Update: If BOTH of @kvantour's comments are valid in this case, use this awk:
$ awk '
BEGIN { FS="\n"; RS="" }
{
    for(i=2;i<=NF;i++)
        k=(i==2?"":k) $i
    if(!(k in seen)||$1~/^[^ ]+_CP /)
        seen[k]=$0
}
END {
    for(i in seen)
        print (++j>1?ORS:"") seen[i]
}' file
Output now:
>sequence1_CP [seq virus]
>sequence4_CP hypothetical protein [another virus]
Or a pure-bash solution (following the same logic as the separate perl solution):
#! /bin/bash
declare -A p
# Read inbound data into associative array 'p'
while read id ; do
    read s1 ; read s2 ; read s3
    key="$s1:$s2"
    prev=${p[$key]}
    if [[ -z "$prev" || "$id" = *_CP* ]] ; then p[$key]=$id ; fi
done
# Print all data
for k in "${!p[@]}" ; do
    echo -e "${p[$k]}\n${k/:/\\n}\n"
done
Here's a python program that will provide you with results you are looking for:
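import fileinput
import re

seq = ""
id = ""
nameseq = {}   # header -> sequence
seqnames = {}  # sequence -> header to keep

def store(id, seq):
    nameseq[id] = seq
    if seq in seqnames:
        if re.search("_CP", id):
            seqnames[seq] = id  # a _CP header replaces the stored one
    else:
        seqnames[seq] = id

for line in fileinput.input():
    line = line.rstrip()
    if re.search("^>", line):
        if seq:
            store(id, seq)
        seq = ""
        id = line
    else:
        seq += line
if seq:
    store(id, seq)  # the last record has no following header

for k, v in seqnames.items():
    print(v)
    print(k)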
Or with perl, assuming the code is in m.pl (it can also be wrapped in a bash script).
Hopefully, the code will help find medicines, and not develop new viruses :-)
perl m.pl < input-file
#! /usr/bin/perl
use strict ;
my %to_id ;
local $/ = "\n\n";
while ( <> ) {
chomp ;
my ($id, $s1, $s2 ) = split("\n") ;
my $key = "$s1\n$s2" ;
my $prev_id = $to_id{$key} ;
$to_id{$key} = $id if !defined($prev_id) || $id =~ /_CP/ ;
} ;
print "$to_id{$_}\n$_\n\n" foreach(keys(%to_id)) ;
It's not clear what the expected order is. The Perl code prints directly from the hash, so the order is arbitrary. This can be customized, if needed.
Here's a Biopython answer. Be aware that you only have two unique sequences in your example (sequence 6 and 7 only show a character more in the first line but are essentially the same protein sequence as 1).
from Bio import SeqIO
seen = []
records = []
# examples are in sequences.fasta
for record in SeqIO.parse("sequences.fasta", "fasta"):
    if str(record.seq) not in seen:
        seen.append(str(record.seq))
        records.append(record)
# printing to console
for record in records:
    print(record.name)
    print(record.seq)
# writing to a fasta file
SeqIO.write(records, "unique_sequences.fasta", "fasta")
For more info you can try the biopython cookbook
I am new to shell scripting.
I have a huge csv file which contains more than 100k rows. I need to find a column and sort it and write it to another file and later I need to process this new file.
below is the sample data
Now you can see that field 4 has data which contains comma as well. now I need the data in which the field 4 is sorted out as below:
To get this result I have written the script below, but it does not seem to be efficient: for 100k records it took 20 minutes, so I am looking for a more efficient solution.
#this command replaces the comma inside "" with | so that I can split the line based on ','(comma)
awk -F"\"" 'BEGIN{OFS="\""}{for(i=1;i<=NF;++i){ if(i%2==0) gsub(/,/, "|", $i)}} {print $0}' $FEED_FILE > temp.csv
while read line
do
#break the line on comma ',' and get the array of strings.
IFS=',' read -ra data <<< "$line" #'data' is the array of the record of full line.
#take the 8th column, which is the reportable jurisdiction.
echo "REPORTABLE_JURISDICTION is : " ${data[4]}
#brake the data based on pipe '|' and sort the data
IFS='|' read -ra REPORTABLE_JURISDICTION_ARR <<< "${data[4]}"
#Sort this array
IFS=$'\n' sorted=($(sort <<<"${REPORTABLE_JURISDICTION_ARR[*]}"))
#printf "[%s]\n" "${sorted[@]}"
separator="|" # e.g. constructing regex, pray it does not contain %s
regex="$( printf "${separator}%s" "${sorted[@]}" )"
regex="${regex:${#separator}}" # remove leading separator
echo "${regex}"
echo "$data[68]"
#here we are building the whole line which will be written to the output file.
separator="," # e.g. constructing regex, pray it does not contain %s
regex="$( printf "${separator}%s" "${data[@]}" )"
regex="${regex:${#separator}}" # remove leading separator
echo "${regex}" >> temp2.csv
echo $count
done < temp.csv
#remove the '|' from the and put the comma back
awk -F\| 'BEGIN{OFS=","} {$1=$1; print}' temp2.csv > temp3.csv
# to remove the tailing , if any
sed 's/,$//' temp3.csv > $OUT_FILE
How to make it faster?
You're using the wrong tools for the task. CSV seems so simple that you can easily process it with shell tools, but your code will break for cells that contain newlines. Also, bash isn't very fast when processing lots of data.
Try a tool which understands CSV directly like http://csvkit.rtfd.org/ or use a programming language like Python. That allows you to do the task without starting external processes, the syntax is much more readable and the result will be much more maintainable. Note: I'm suggesting Python because of the low initial cost.
With python and the csv module, the code above would look like this:
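import csv
FEED_FILE = '...'
OUT_FILE = '...'
with open(OUT_FILE, 'w', newline='') as out:
    with open(FEED_FILE, newline='') as in_:  # "in" is a reserved word
        reader = csv.reader(in_, delimiter=',', quotechar='"')
        writer = csv.writer(out, delimiter=',', quotechar='"')
        for row in reader:
            # sort the comma-separated values inside the 4th field
            row[3] = ','.join(sorted(row[3].split(',')))
            writer.writerow(row)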
That said, there is nothing obviously wrong with your code. There is not much that you can do to speed up awk and sed and the main bash loop doesn't spawn many external processes as far as I can see.
With a single awk (GNU awk, for asort):
awk 'BEGIN{ FS=OFS="\042,\042"}{ n=split($4,a,","); asort(a); sf=a[1];
for(i=2;i<=n;i++) { sf=sf","a[i] } $4=sf; print $0 }' file > output.csv
output.csv contents:
FS=OFS="\042,\042" - considering "," as field separator
split($4,a,",") - split the 4th field into array by separator ,
asort(a) - sort the array by values
Try pandas in python3. The only limitation: the data needs to fit into memory, and that can take a bit more space than your actual data does. I sorted CSV files with 30,000,000 rows without any problem using this script, which I quickly wrote:
import pandas as pd
import os, datetime, traceback
L1_DIR = '/mnt/ssd/ASCII/'
suffix = '.csv'
for fname in sorted(os.listdir(L1_DIR)):
    if not fname.endswith(suffix):
        continue
    print("Start processing %s" % fname)
    s = datetime.datetime.now()
    fin_path = os.path.join(L1_DIR, fname)
    fname_out = fname.split('.')[0] + '.csv_sorted'
    fpath_out = os.path.join(L1_DIR, fname_out)
    df = pd.read_csv(fin_path)
    e = datetime.datetime.now()
    print("Read %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))
    s = datetime.datetime.now()
    df.set_index('ts', inplace=True)
    e = datetime.datetime.now()
    print("set_index %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))
    s = datetime.datetime.now()
    df.sort_index(inplace=True)
    e = datetime.datetime.now()
    print("sort_index %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e-s)))
    s = datetime.datetime.now()
    # This one saves at ~10MB per second to disk.. One day is 7.5GB --> 750 seconds or 12.5 minutes
    # The index is written out as well, because the 'ts' column now lives in it.
    df.to_csv(fpath_out)
    e = datetime.datetime.now()
    print("to_csv %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e - s)))
I have a large file that contains 2 IPs per line - and there's about 3 million lines total.
Here's an example of the file:
I need to convert each IP to an IP Decimal, like this:
I'd prefer a way to do this strictly via command line. I'm okay with perl or python being used, as long as it doesn't require extra modules to be installed.
I thought I had come across a way that someone converted IPs like this using sed but can't seem to find that tutorial anymore. Any help would be appreciated.
If you have gnu awk installed (for the RT variable), you could use this one-liner:
awk -F. -v RS='[\n,]' '{printf "%d%s", (($1*256+$2)*256+$3)*256+$4, RT}' file
Here is a python solution that uses only standard modules (re, sys):
import re
import sys

def multiplier_generator():
    """ Cyclic generator of powers of 256 (from 256**3 down to 256**0)
    The multipliers tuple could be replaced by inline calculation
    of power, but this approach has better performance.
    """
    multipliers = (
        256**3,
        256**2,
        256**1,
        256**0,
    )
    idx = 0
    while 1 == 1:
        yield multipliers[idx]
        idx = (idx + 1) % 4

def replacer(match_object):
    """re.sub replacer for ip group"""
    multiplier = multiplier_generator()
    res = 0
    for i in range(1,5):
        res += next(multiplier)*int(match_object.group(i))
    return str(res)

if __name__ == "__main__":
    std_in = ""
    if len(sys.argv) > 1:
        with open(sys.argv[1],'r') as f:
            std_in = f.read()
    else:
        std_in = sys.stdin.read()
    print(re.sub(r"([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)", replacer, std_in ))
This solution replaces every ip address that can be found in the text from standard input or from the file passed as the first parameter, i.e.:
python convert.py < input_file.txt, or
python convert.py file.txt, or
echo "," | python convert.py.
With bash:
ip2dec() {
  set -- ${1//./ } # split $1 with "." to $1 $2 $3 $4
  declare -i dec # set integer attribute
  dec=$1*256*256*256+$2*256*256+$3*256+$4
  echo -n $dec
}
while IFS=, read -r a b; do ip2dec $a; echo -n ,; ip2dec $b; echo; done < file
With bash, using bit shifts instead of multiplications (each shift needs its own parentheses, because + binds tighter than << in shell arithmetic):
ip2dec() { local IFS=.
  set -- $1 # split $1 with "." to $1 $2 $3 $4
  printf '%s' "$(( ($1<<24) + ($2<<16) + ($3<<8) + $4 ))"
}
while IFS=, read -r a b; do
  printf '%s,%s\n' "$(ip2dec $a)" "$(ip2dec $b)"
done < file
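As a cross-check of the arithmetic, here is a small Python sketch that computes the same value both ways. The function names are hypothetical and the address 192.0.2.1 is just a documentation example, not data from the question. Python has the same operator precedence pitfall as shell arithmetic: + binds tighter than <<, so the shifts have to be parenthesised.

```python
def ip_to_int_shift(ip):
    """Dotted quad to integer using shifts; each shift is parenthesised."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) + (b << 16) + (c << 8) + d


def ip_to_int_mul(ip):
    """Same conversion using repeated multiplication by 256."""
    value = 0
    for part in ip.split("."):
        value = value * 256 + int(part)
    return value


print(ip_to_int_shift("192.0.2.1"))   # 3221225985
print(ip_to_int_mul("192.0.2.1"))     # 3221225985
# Without parentheses, 1 << 24 + 2 is parsed as 1 << (24 + 2).
print((1 << 24 + 2) == (1 << 26))     # True
```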
Reading a text file into an array, extracting elements and sorting them is taking a very long time.
The text file is ffmpeg console output for R128 audio analysis. I need to get the highest M and S values. Example:
[Parsed_ebur128_0 @ 0x7fd32a60caa0] t: 4.49998 M: -22.2 S: -29.9 I: -27.0 LUFS LRA: 9.8 LU FTPK: -12.4 dBFS TPK: -9.7 dBFS
[Parsed_ebur128_0 @ 0x7fd32a60caa0] t: 4.69998 M: -22.5 S: -28.6 I: -25.9 LUFS LRA: 11.3 LU FTPK: -12.7 dBFS TPK: -9.7 dBFS
The text file can be hundreds or thousands of lines long depending on the duration of the audio file being analysed
I want to find the highest M (-22.2) and S Values (-28.6) and assign them to variables M and S
This is what I am using currently:
while read LINE
do
    ARRAY+=("$LINE")
done < $tempDir/text.txt
for LINE in "${ARRAY[@]}"
do
    echo "$LINE" | sed -n '/B:/p' | sed 's/S:.*//' | sed -n -e 's/^.*M://p' | sed -n -e 's/-//p' >>$tempDir/R128M.txt
done
for LINE in "${ARRAY[@]}"
do
    echo "$LINE" | sed -n '/M:/p' | sed 's/I:.*//' | sed -n -e 's/^.*S://p' | sed -n -e 's/-//p' >>$tempDir/R128S.txt
done
cat $tempDir/R128M.txt
M=( $(sort $tempDir/R128M.txt) )
cat $tempDir/R128S.txt
S=( $(sort $tempDir/R128S.txt) )
Is there a faster way of doing this?
Rather than reading in the whole file in memory, writing bits of it out to separate file, and reading those in again, just parse it and pick out the largest values:
$ awk '$7 > m || m == "" { m = $7 } $9 > s || s == "" { s = $9 } END { print m, s }' data
-22.2 -28.6
In your data, fields 7 and 9 contain the values of M and S. The awk script updates its m and s variables whenever it finds larger values in these fields, and prints the largest ones found at the end. The m == "" and s == "" tests are needed to trigger initialization of the values if no value has been read yet.
Another way with awk, which may look cleaner:
$ awk 'FNR == 1 { m = $7; s = $9; next } $7 > m { m = $7 } $9 > s { s = $9 } END { print m, s }' data
To assign them to M and S in the shell:
$ declare $( awk 'FNR == 1 { m = $7; s = $9; next } $7 > m { m = $7 } $9 > s { s = $9 } END { printf("M=%f S=%f\n", m, s) }' data )
$ echo $M $S
-22.200000 -28.600000
Adjust the printf() format to use %s instead of %f if you want the original strings instead of float values, or set the number of decimals you might want with, e.g., %.2f in place of %f.
First of all, three-process pipe is a bit redundant for a single value extraction, especially taking into account you reinstantiate it anew for every line.
Next, you save all the values into a file and then sort that file, while all you need is the maximum value. You can easily find it during the very first (value extraction) loop, for additional O(N) running time, instead of I/O and sorting with all the I/O overhead and O(NlogN) sorting expenses. See ARITHMETIC EXPANSION and conditional expressions in bash manual.
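The single-pass idea, tracking the running maximum while extracting the values instead of sorting afterwards, can be sketched in Python as follows. The regular expression and the function name max_m_s are assumptions made for this illustration; the pattern only expects an "M: value S: value" pair as in the sample lines.

```python
import re

# Matches "M: -22.2 S: -29.9" and captures both numbers.
PATTERN = re.compile(r"\bM:\s*(-?\d+(?:\.\d+)?)\s+S:\s*(-?\d+(?:\.\d+)?)")


def max_m_s(lines):
    """Return (max M, max S) over all matching lines in one pass."""
    best_m = best_s = None
    for line in lines:
        match = PATTERN.search(line)
        if not match:
            continue               # skip lines without M/S values
        m, s = float(match.group(1)), float(match.group(2))
        if best_m is None or m > best_m:
            best_m = m
        if best_s is None or s > best_s:
            best_s = s
    return best_m, best_s


sample = [
    "[Parsed_ebur128_0 @ 0x7fd32a60caa0] t: 4.49998 M: -22.2 S: -29.9 I: -27.0 LUFS",
    "[Parsed_ebur128_0 @ 0x7fd32a60caa0] t: 4.69998 M: -22.5 S: -28.6 I: -25.9 LUFS",
]
print(max_m_s(sample))   # (-22.2, -28.6)
```

Nothing is stored per line, so memory use stays constant no matter how long the ffmpeg log is.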
I could format the data using a perl script (hash). I am wondering if it can be done with a shell one-liner, so that I don't need to write a perl script every time there is some change in the input format.
Example Input:
rinku a
rinku b
rinku c
rrs d
rrs e
abc f
abc g
abc h
abc i
xyz j
example Output:
rinku a,b,c
rrs d,e
abc f,g,h,i
xyz j
Please help me with a command using shell/awk/sed to format the input.
How about
$ awk '{arr[$1]=arr[$1]?arr[$1]","$2:$2} END{for (i in arr) print i, arr[i]}' input
rinku a,b,c
abc f,g,h,i
rrs d,e
xyz j
The awk program also has associative arrays, similar to Perl:
awk '{v[$1]=v[$1]","$2}END{for(k in v)print k" "substr(v[k],2)}' inputFile
For each line X Y (key of X, value of Y), it basically just appends ,Y to the array element indexed by X, taking advantage of the fact that all elements start as empty strings.
Then, since your values are then of the form ,x,y,z, you just strip off the first character when outputting.
This generates, for your input data (in inputFile):
rinku a,b,c
abc f,g,h,i
rrs d,e
xyz j
As an aside, if you want it as nicely formatted as the original, you can create a program.awk file:
{
    val[$1] = val[$1]","$2
    if (length ($1) > maxlen) {
        maxlen = length ($1)
    }
}
END {
    for (key in val)
        printf "%-*s %s\n", maxlen, key, substr(val[key],2)
}
and run that with:
awk -f program.awk inputFile
and you'll get:
rinku a,b,c
abc   f,g,h,i
rrs   d,e
xyz   j
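For comparison, the same associative-array grouping can be sketched in Python. The function names below are hypothetical. Unlike awk's for (key in val) loop, a Python dict keeps the keys in the order they were first seen, so the output follows the input order.

```python
def group_pairs(lines):
    """Collect the second field of each line under its first field."""
    groups = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue                      # ignore blank or short lines
        groups.setdefault(fields[0], []).append(fields[1])
    return groups


def format_groups(groups):
    """Render 'key v1,v2,...' lines with the keys padded to equal width."""
    width = max((len(key) for key in groups), default=0)
    return [f"{key:<{width}} {','.join(vals)}" for key, vals in groups.items()]


data = ["rinku a", "rinku b", "rinku c", "rrs d", "rrs e",
        "abc f", "abc g", "abc h", "abc i", "xyz j"]
for row in format_groups(group_pairs(data)):
    print(row)
```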
sed -n ':cycle
$!N
s/^\([^[:blank:]]*\)\([[:blank:]]\{1,\}.*\)\n\1[[:blank:]]\{1,\}/\1\2,/;t cycle
P
s/.*\n//;t cycle' YourFile
This tries not to use the hold buffer (and does not load the full file in memory):
- load the next line
- if the first word is the same as the one after the newline, replace the newline and that first word with a ,
- if so, restart at line loading
- if not, print the first line
- delete the current buffer up to the first \n
- if so, restart at line loading
POSIX version, so use --posix on GNU sed
Okay, I have two files: one is baseline and the other is a generated report. I have to validate a specific string in both the files match, it is not just a single word see example below:
name os ksd
some text..................
some text..................
My search criterion here is to find a unique number such as "56633223223" and retrieve the 1 line above it and the 3 lines below it. I can do that on both the base file and the report, and then compare whether they match. Overall, I need a shell script for this.
Since the strings above and below are unique but the line count varies, I had put it in a file called "actlist":
56633223223 1 5
56633223224 1 6
56633223225 1 3
Now from below "Rcount" I get how many iterations to be performed, and in each iteration i have to get ith row and see if the word count is 3, if it is then take those values into variable form and use something like this
I'm stuck at the below, which command to be used. I'm thinking of using AWK but if there is anything better please advise. Here's some pseudo-code showing what I'm trying to do:
Rcount=`wc -l $xxxxx | awk -F " " '{print $1}'`
while ((i <= Rcount))
record=_________________'(Awk command to retrieve ith(1st) record (of $xxxx),
wcount=_________________'(Awk command to count the number of words in $record)
(( i=i+1 ))
Note: record, wcount values are later printed to a log file.
Sounds like you're looking for something like this:
while read -r word1 word2 word3 junk; do
    if [[ -n "$word1" && -n "$word2" && -n "$word3" && -z "$junk" ]]; then
        echo "all good"
    else
        echo "error"
    fi
done < /root/shravan/actlist
This will go through each line of your input file, assigning the three columns to word1, word2 and word3. The -n tests that read hasn't assigned an empty value to each variable. The -z checks that there are only three columns, so $junk is empty.
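The same three-column check can be sketched in Python. This is only an illustration: parse_actlist is a hypothetical name, and the assumption that the second and third columns are line counts (non-negative integers) comes from the "actlist" sample in the question.

```python
def parse_actlist(lines):
    """Split lines into valid (marker, before, after) entries and bad line numbers."""
    entries, errors = [], []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        # Exactly three columns, the last two being non-negative integers.
        if len(fields) == 3 and fields[1].isdigit() and fields[2].isdigit():
            entries.append((fields[0], int(fields[1]), int(fields[2])))
        else:
            errors.append(number)
    return entries, errors


entries, errors = parse_actlist(["56633223223 1 5", "56633223224 1", "56633223225 1 3"])
print(entries)   # [('56633223223', 1, 5), ('56633223225', 1, 3)]
print(errors)    # [2]
```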
I PROMISE you you are going about this all wrong. To find words in file1 and search for those words in file2 and file3 is just:
awk '
NR==FNR{ for (i=1;i<=NF;i++) words[$i]; next }
{ for (word in words) if ($0 ~ word) print FILENAME, word }
' file1 file2 file3
or similar (assuming a simple grep -f file1 file2 file3 isn't adequate). It DOES NOT involve shell loops to call awk to pull out strings to save in shell variables to pass to other shell commands, etc, etc.
So far all you're doing is asking us to help you implement part of what you think is the solution to your problem, but we're struggling to do that because what you're asking for doesn't make sense as part of any kind of reasonable solution to what it sounds like your problem is so it's hard to suggest anything sensible.
If you tell us what you are trying to do AS A WHOLE, with sample input and expected output for your whole process, then we can help you.
We don't seem to be getting anywhere so let's try a stab at the kind of solution I think you might want and then take it from there.
Look at these 2 files "old" and "new" side by side (line numbers added by the cat -n):
$ paste old new | cat -n
1 a b
2 b 56633223223
3 56633223223 c
4 c d
5 d h
6 e 56633223225
7 f i
8 g Z
9 h k
10 56633223225 l
11 i
12 j
13 k
14 l
Now let's take this "actlist":
$ cat actlist
56633223223 1 2
56633223225 1 3
and run this awk command on all 3 of the above files (yes, I know it could be briefer, more efficient, etc. but favoring simplicity and clarity for now):
$ cat tst.awk
ARGIND == 1 {
    numPre[$1] = $2
    numSuc[$1] = $3
}
ARGIND == 2 {
    oldLine[FNR] = $0
    if ($0 in numPre) {
        oldHitFnr[$0] = FNR
    }
}
ARGIND == 3 {
    newLine[FNR] = $0
    if ($0 in numPre) {
        newHitFnr[$0] = FNR
    }
}
END {
    for (str in numPre) {
        if ( str in oldHitFnr ) {
            if ( str in newHitFnr ) {
                for (i=-numPre[str]; i<=numSuc[str]; i++) {
                    oldFnr = oldHitFnr[str] + i
                    newFnr = newHitFnr[str] + i
                    if (oldLine[oldFnr] != newLine[newFnr]) {
                        print str, "mismatch at old line", oldFnr, "new line", newFnr
                        print "\t" oldLine[oldFnr], "vs", newLine[newFnr]
                    }
                }
            }
            else {
                print str, "is present in old file but not new file"
            }
        }
        else if (str in newHitFnr) {
            print str, "is present in new file but not old file"
        }
    }
}
$ awk -f tst.awk actlist old new
56633223225 mismatch at old line 12 new line 8
j vs Z
It's outputting that result because the 2nd line after 56633223225 is j in file "old" but Z in file "new", and the file "actlist" said the 2 files had to be common from one line before until 3 lines after that pattern.
Is that what you're trying to do? The above uses GNU awk for ARGIND but the workaround is trivial for other awks.
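For readers without GNU awk, the same window comparison can be sketched in Python. This is a rough equivalent rather than a translation: compare_windows is a hypothetical name, the files are assumed to be already read into lists of lines, and the first occurrence of each marker is used, whereas the awk script remembers the last one.

```python
def compare_windows(actlist, old_lines, new_lines):
    """actlist: iterable of (marker, lines_before, lines_after)."""
    def find(lines, marker):
        # 0-based position of the first line equal to marker, or None.
        return lines.index(marker) if marker in lines else None

    messages = []
    for marker, before, after in actlist:
        old_hit, new_hit = find(old_lines, marker), find(new_lines, marker)
        if old_hit is None and new_hit is None:
            continue
        if old_hit is None or new_hit is None:
            where = "new" if old_hit is None else "old"
            messages.append(f"{marker} is present only in the {where} file")
            continue
        for offset in range(-before, after + 1):
            o, n = old_hit + offset, new_hit + offset
            # Positions outside a file compare as empty lines.
            old_text = old_lines[o] if 0 <= o < len(old_lines) else ""
            new_text = new_lines[n] if 0 <= n < len(new_lines) else ""
            if old_text != new_text:
                messages.append(
                    f"{marker} mismatch at old line {o + 1} new line {n + 1}: "
                    f"{old_text} vs {new_text}"
                )
    return messages


old = ["a", "b", "56633223223", "c", "d", "e", "f", "g", "h",
       "56633223225", "i", "j", "k", "l"]
new = ["b", "56633223223", "c", "d", "h", "56633223225", "i", "Z", "k", "l"]
acts = [("56633223223", 1, 2), ("56633223225", 1, 3)]
for message in compare_windows(acts, old, new):
    print(message)   # 56633223225 mismatch at old line 12 new line 8: j vs Z
```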
Use the below code:
awk '{if (NF == 3) { word1=$1; word2=$2; word3=$3; print "Words are:" word1, word2, word3} else {print "Line", NR, "is having", NF, "Words" }}' filename.txt
I have given the solution as per the requirement.
awk '{ # awk starts from here and read a file line by line
if (NF == 3) # It will check if current line is having 3 fields. NF represents number of fields in current line
{ word1=$1; # If current line is having exact 3 fields then 1st field will be assigned to word1 variable
word2=$2; # 2nd field will be assigned to word2 variable
word3=$3; # 3rd field will be assigned to word3 variable
print word1, word2, word3} # It will print all 3 fields
}' filename.txt >> output.txt # These 3 fields will be redirected to a file which can be used for further processing.
This follows the requirement as stated. There are many other ways of doing this, but awk was asked for.