Sorting a multiFASTA file by DNA length

I'm trying to sort a multiFASTA file by length. I have the alphabetical sort figured out, but I can't seem to get the numerical sort. The output should be a sorted multiFASTA file. This sub is an option in a larger program. Here is the code:
sub sort {
    my $length;
    my $key;
    my $id;
    my %seqs;
    my $seq;
    my $action = shift;
    my $match  = $opts{$action};
    $match =~ /[l|id]/ || die "not the right parameters\n";
    my $in = Bio::SeqIO->new(-file=>"$filename", -format=>'fasta');
    while (my $seqobj = $in->next_seq()) {
        my $id     = $seqobj->display_id();
        my $length = $seqobj->length();
        #$seq =~ s/.{1,60}\K/\n/sg;
        $seqs{$id} = $seqobj unless $match eq 'l';
        $seqs{$length} = $seqobj unless $match eq 'id';
    }
    if ($match eq 'id') {
        foreach my $id (sort keys %seqs) {
            printf ">%-9s \n%-s\n", $id, $seqs{$id}->seq;
        }
    }
    elsif ($match eq 'l') {
        foreach my $length (sort keys %seqs) {
            printf "%-10s\n%-s\n", $length, $seqs{$length}->seq;
        }
    }
}

To sort numerically, you must provide a comparison subroutine:
sort { $a <=> $b } keys %seqs
Are you sure no two sequences can have the same length? $seqs{$length}=$seqobj overwrites the previously stored value.
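For example, here's a minimal sketch (assuming the same Bio::SeqIO setup and a $filename variable, not your exact sub) that keys the hash by id, so duplicate lengths cannot clobber each other, and sorts the ids numerically by sequence length:
use Bio::SeqIO;

my $in = Bio::SeqIO->new(-file => $filename, -format => 'fasta');
my %seqs;
while (my $seqobj = $in->next_seq()) {
    # the id is unique in a well-formed FASTA file; the length may not be
    $seqs{ $seqobj->display_id() } = $seqobj;
}
for my $id (sort { $seqs{$a}->length() <=> $seqs{$b}->length() } keys %seqs) {
    printf ">%s\n%s\n", $id, $seqs{$id}->seq;
}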

A one-liner: use awk to linearize, a second awk to add a column containing the length, sort on this column, remove the column, and restore the FASTA sequence.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' |\
sort -t $'\t' -k1,1n |\
cut -f 2- |\
tr "\t" "\n"
PS: for bioinformatics questions, you should use https://www.biostars.org/, or https://bioinformatics.stackexchange.com/, etc.

You can use pyfaidx or just take a look at Jim Hester's repos. But as @Pierre said above, you should ask your question on Biostars, for example. The answer on Biostars can be found here.

Related

Sort directory list by subfolder number

Is there a fast and smart way in bash (maybe with awk/sed/sort???) to sort the result of a find command by the number of subfolders in the path and then alphabetically?
I mean something like
./a/
./b/
./c/
./a/a/
./a/python-script.py
./a/test.txt
./b/a/
./b/b/
./c/a/
./c/c/
./a/a/a/
./a/a/file.txt
./a/a/t/
...
...
I want to take the output of the find command and see first the filenames in the current folder, then the files in the first level of subfolders, then the files in the second level, and so on (if possible sorted alphabetically within each level).
You can use find's -printf action and ask it to return the depth of the file with %d. Then use sort on that and cut to remove the depth field:
$ find . -printf '%d\t%p\n' | sort -n | cut -f2-
I suppose this is much less elegant than @kvantour's answer, but how about a Schwartzian transform in Perl:
find . -print0 | perl -0ne '
    push(@list, $_);
    END {
        @sorted = map  { $_->[0] }
                  sort { $a->[1] <=> $b->[1] or $a->[0] cmp $b->[0] }
                  map  { [$_, tr#/#/#] } @list;
        print join("\n", @sorted), "\n";
    }'

Unscramble words Challenge - improve my bash solution

There is a Capture the Flag challenge.
I have two files: one with scrambled text like this, with about 550 entries:
dnaoyt
cinuertdso
bda
haey
tolpap
...
The second file is a dictionary with about 9,000 entries
radar
ccd
gcc
fcc
historical
...
The goal is to find the right, unscrambled version of the word, which is contained in the dictionary file.
My approach is to sort the characters of the first word from the first file and then check whether the first word from the second file has the same length. If so, sort that too and compare them.
This is my fully functional bash script, but it is very slow.
#!/bin/bash
while IFS="" read -r p || [ -n "$p" ]
do
    var=0
    ro=$(echo $p | perl -F -lane 'print sort @F')
    len_ro=${#ro}
    while IFS="" read -r o || [ -n "$o" ]
    do
        ro2=$(echo $o | perl -F -lane 'print sort @F')
        len_ro2=${#ro2}
        let "var+=1"
        if [ $len_ro == $len_ro2 ]; then
            if [ $ro == $ro2 ]; then
                echo $o >> new.txt
                echo $var >> whichline.txt
            fi
        fi
    done < dictionary.txt
done < scrambled-words.txt
I have also tried converting all characters to ASCII integers and summing each word, but while comparing I realized that different character patterns may produce the same sum.
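For instance, a quick check (with hypothetical words, not from the challenge files) shows two different letter patterns sharing the same sum, so the sum cannot be used as a key:
perl -E 'for my $w ("ad", "bc") { my $s = 0; $s += ord for split //, $w; say "$w -> $s" }'
# prints: ad -> 197
#         bc -> 197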
[edit]
For the record:
- no anagrams contained in the dictionary
- to get the flag, you need to export the unscrambled words as one blob and make a SHA hash out of it (that's the flag)
- link to the CTF for the guy who wanted the files: https://challenges.reply.com/tamtamy/user/login.action
You're better off creating a lookup dictionary (keyed by the sorted word) from the dictionary file.
Your loop body is executed 550 * 9,000 = 4,950,000 times (O(N*M)).
The solution I propose executes two loops of at most 9,000 passes each (O(N+M)).
Bonus: It finds all possible solutions at no cost.
#!/usr/bin/perl
use strict;
use warnings qw( all );
use feature qw( say );

my $dict_qfn      = "dictionary.txt";
my $scrambled_qfn = "scrambled-words.txt";

sub key { join "", sort split //, $_[0] }

my %dict;
{
    open(my $fh, "<", $dict_qfn)
        or die("Can't open \"$dict_qfn\": $!\n");
    while (<$fh>) {
        chomp;
        push @{ $dict{key($_)} }, $_;
    }
}

{
    open(my $fh, "<", $scrambled_qfn)
        or die("Can't open \"$scrambled_qfn\": $!\n");
    while (<$fh>) {
        chomp;
        my $matches = $dict{key($_)};
        say "$_ matches @$matches" if $matches;
    }
}
I wouldn't be surprised if this only takes one millionth of the time of your solution for the sizes you provided (and it scales so much better than yours if you were to increase the sizes).
I would do something like this with gawk
gawk '
    NR == FNR {
        dict[csort()] = $0
        next
    }
    {
        print dict[csort()]
    }
    function csort(    chars, sorted) {
        split($0, chars, "")
        asort(chars)
        for (i in chars)
            sorted = sorted chars[i]
        return sorted
    }
' dictionary.txt scrambled-words.txt
Here's a perl-free solution I came up with using sort and join:
sort_letters() {
    # Splits each letter onto a line, sorts the letters, then joins them
    # e.g. "hello" becomes "ehllo"
    echo "${1}" | fold -b1 | sort | tr -d '\n'
}

# For each input file...
for input in "dict.txt" "words.txt"; do
    # Convert each line to [sorted] [original]
    # then sort and save the results with a .sorted extension
    while read -r original; do
        sorted=$(sort_letters "${original}")
        echo "${sorted} ${original}"
    done < "${input}" | sort > "${input}.sorted"
done

# Join the two files on the [sorted] word
# outputting the scrambled and unscrambled words
join -j 1 -o 1.2,2.2 "words.txt.sorted" "dict.txt.sorted"
I tried something very similar, but a bit different.
#!/bin/bash
exec 3<scrambled-words.txt
while read -r line <&3; do
    printf "%s" ${line} | perl -F -lane 'print sort @F'
done > scrambled-words_sorted.txt
exec 3>&-

exec 3<dictionary.txt
while read -r line <&3; do
    printf "%s" ${line} | perl -F -lane 'print sort @F'
done > dictionary_sorted.txt
exec 3>&-

printf "" > whichline.txt
exec 3<scrambled-words_sorted.txt
while read -r line <&3; do
    counter="$((++counter))"
    grep -n -e "^${line}$" dictionary_sorted.txt | cut -d ':' -f 1 | tr -d '\n' >>whichline.txt
    printf "\n" >>whichline.txt
done
exec 3>&-
As you can see I don't create a new.txt file; instead I only create whichline.txt with a blank line where the word doesn't match. You can easily paste them up to create new.txt.
The logic behind the script is nearly the logic behind yours, with the exception that I call perl fewer times and I save two support files.
I think (but I am not sure) that creating them and cycling over only one file will be better than ~5 million calls of perl. This way perl is called "only" ~10k times.
Finally, I decided to use grep because it's (maybe) the fastest regex matcher, and by searching for the entire line the length check is intrinsic in the regex.
Please note that what @benjamin-w said is still valid and, in that case, grep will behave badly and I did not handle it!
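As a sketch of that "paste them up" step (assuming whichline.txt holds one dictionary line number per scrambled word, with blank lines where nothing matched; I haven't tested it against the real files), the unscrambled words could be pulled back out of dictionary.txt like this:
perl -e '
    open my $d, "<", "dictionary.txt" or die $!;
    chomp(my @dict = <$d>);
    open my $w, "<", "whichline.txt" or die $!;
    while (<$w>) {
        chomp;
        print $_ ne "" ? "$dict[$_ - 1]\n" : "\n";   # blank line = no match
    }
' > new.txt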
I hope this could help [:

Append data to another column in a CSV if duplicate is found in first column

I have a CSV with data such as:
somename1,value1
somename1,value2
somename1,value3
anothername1,anothervalue1
anothername1,anothervalue2
anothername1,anothervalue3
I would like to rewrite the CSV so that when a duplicate in column 1 is found, the data is appended as a new column on the first entry.
For instance, the desired output would be :
somename1,value1,value2,value3
anothername1,anothervalue1,anothervalue2,anothervalue3
How can I do this in a shell script?
TIA
You need much more than just removing duplicated lines with Awk; you need logic like the one below to build an array of elements for each unique entry in $1.
The solution creates a hash map, with the unique values of $1 working as indices and, as elements, the $2 values appended with a , separator.
awk 'BEGIN{FS=OFS=","; prev="";}{ if (prev != $1) {unique[$1]=$2;} else {unique[$1]=(unique[$1]","$2)} prev=$1; }END{for (i in unique) print i,unique[i]}' file
anothername1,anothervalue1,anothervalue2,anothervalue3
somename1,value1,value2,value3
A more readable version would be something like this:
BEGIN {
    # set input and output field separator to ',' and initialize
    # variable holding last instance of $1 to empty
    FS=OFS=","
    prev=""
}
{
    # Update the value of $2 directly in the hash array only when new
    # unique elements are found in $1
    if (prev != $1) {
        unique[$1]=$2
    }
    else {
        unique[$1]=(unique[$1]","$2)
    }
    # Update the current $1
    prev=$1
}
END {
    for (i in unique) {
        print i,unique[i]
    }
}

Another option is a plain bash script:
FILE=$1
NAMES=`cut -d',' -f 1 $FILE | sort -u`
for NAME in $NAMES; do
    echo -n "$NAME"
    VALUES=`grep "$NAME" $FILE | cut -d',' -f2`
    for VAL in $VALUES; do
        echo -n ",$VAL"
    done
    echo ""
done
Running it with your data generates:
>bash script.sh data1.txt
anothername1,anothervalue1,anothervalue2,anothervalue3
somename1,value1,value2,value3
The filename of your data has to be passed as a parameter. The output can be written to a new file by redirecting:
>bash script.sh data1.txt > data_new.txt
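If perl is also acceptable, here is a one-liner sketch of the same grouping (a sketch, not tested beyond the sample shown; it preserves the first-seen order of the names):
perl -F',' -lane '
    push @{ $vals{$F[0]} }, @F[1 .. $#F];
    push @order, $F[0] unless $seen{$F[0]}++;
    END { print join ",", $_, @{ $vals{$_} } for @order }
' file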

Search / Replace using key:values from file in bash?

So I have a file of "keys", for example:
key1
key2
key3
and I have a file of key:value pairs:
key1:value1
key2:value2
key3:value3
I want to replace the keys in my file of keys with their corresponding values in the key:value file. So the file of keys will look like this when complete:
value1
value2
value3
...
What is the best way to do this in bash? Note that a key may appear more than once in the keys file, but should only appear once in the key:values file.
If the join command is available in your environment, the following should work. The addition of an index via the awk command is needed to restore the original key order afterwards (a Schwartzian transform).
join -o 1.1,2.2 -t':' -1 2 -2 1 <(awk '{print(NR":"$0)}' key_file | sort -k2,2 -t':') <(sort -k1,1 -t':' key_values_file) | sort -k1,1 -t':' | cut -f2 -d':'
I know you want "bash", but this is very simply solved with a quick perl script. Assume you have the files pairs.txt and keys.txt:
use strict;

my %keys2values;
# read through the pairs file to get the key:value mapping
open PAIRS, "cat pairs.txt |";
while (<PAIRS>) {
    chomp $_;
    my ($key, $value) = split(":", $_);
    $keys2values{$key} = $value;
}

open KEYS, "cat keys.txt |";
while (<KEYS>) {
    chomp $_;
    my $key = $_;
    if (defined $keys2values{$key}) {
        print "$keys2values{$key}\n";
    }
    # if a key:value pair isn't defined, just print the key
    else {
        print "$key\n";
    }
}
Since I have a thing for pure-bash solutions, I'll just post this solution. It will only work in bash 4+, because it uses associative arrays.
#!/bin/bash
declare -A hash
while IFS=: read key value; do
    hash[$key]=$value
done < pairfile

while read key; do
    printf '%s\n' "${hash[$key]}"
done < keyfile

Bash script optimisation

This is the script in question:
for file in `ls products`
do
echo -n `cat products/$file \
| grep '<td>.*</td>' | grep -v 'img' | grep -v 'href' | grep -v 'input' \
| head -1 | sed -e 's/^ *<td>//g' -e 's/<.*//g'`
done
I'm going to run it on 50000+ files, which would take about 12 hours with this script.
The algorithm is as follows:
Find only lines containing table cells (<td>) that do not contain any of 'img', 'href', or 'input'.
Select the first of them, then extract the data between the tags.
The usual bash text filters (sed, grep, awk, etc.) are available, as well as perl.
Looks like that can all be replaced by one gawk command:
gawk '
    /<td>.*<\/td>/ && !(/img/ || /href/ || /input/) {
        sub(/^ *<td>/,""); sub(/<.*/,"")
        print
        nextfile
    }
' products/*
This uses the gawk extension nextfile.
If the wildcard expansion is too big, then
find products -type f -print | xargs gawk '...'
Here's some quick perl to do the whole thing; it should be a lot faster.
#!/usr/bin/perl

process_files($ARGV[0]);

# process each file in the supplied directory
sub process_files($)
{
    my $dirpath = shift;
    my $dh;
    opendir($dh, $dirpath) or die "Cant readdir $dirpath. $!";
    # get a list of files
    my @files;
    do {
        @files = readdir($dh);
        foreach my $ent ( @files ){
            if ( -f "$dirpath/$ent" ){
                get_first_text_cell("$dirpath/$ent");
            }
        }
    } while ($#files > 0);
    closedir($dh);
}

# return the content of the first html table cell
# that does not contain img, href or input tags
sub get_first_text_cell($)
{
    my $filename = shift;
    my $fh;
    open($fh, "<$filename") or die "Cant open $filename. $!";
    my $found = 0;
    while ( ( my $line = <$fh> ) && ( $found == 0 ) ){
        ## capture html and text inside a table cell
        if ( $line =~ /<td>([&;\d\w\s"'<>]+)<\/td>/i ){
            my $cell = $1;
            ## omit anything with the following tags
            if ( $cell !~ /<(img|href|input)/ ){
                $found++;
                print "$cell\n";
            }
        }
    }
    close($fh);
}
Simply invoke it by passing the directory to be searched as the first argument:
$ perl parse.pl /html/documents/
What about this (should be much faster and clearer):
for file in products/*; do
    grep -P -o '(?<=<td>).*(?=<\/td>)' $file | grep -vP -m 1 '(img|input|href)'
done
The for loop iterates over every file in products. Note the difference from your syntax.
The first grep outputs just the text between <td> and </td>, without those tags, for every cell, as long as each cell is on a single line.
Finally, the second grep outputs just the first of those lines which doesn't contain img, href or input (which is what I believe you wanted to achieve with that head -1), and it exits right then, reducing the overall time and allowing the next file to be processed faster.
I would have loved to use just a single grep, but then the regex would be really awful. :-)
Disclaimer: of course I haven't tested it
