How to count sequences in a fasta file using Bioperl - bioinformatics

Good evening, I have a BioPerl script that counts the number of sequences in a FASTA file, but I am trying to modify it to count sequences shorter than 20 or longer than 120 in any given FASTA file. The code is below:
use strict;
use Bio::SeqIO;

my $seqfile = 'sequences.fa';
my $in = Bio::SeqIO->new(
    -format => 'fasta',
    -file   => $seqfile,
);

my $count = 0;
while (my $seq = $in->next_seq)
{
    $count++;
}
print "There are $count sequences\n";

You can use the length method of the sequence object to build an if statement inside your while loop:
my $len = $seq->length();
if ($len < 20 || $len > 120) {
    $count++;
}
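Putting it together, a minimal sketch of the modified script might look like this (keeping separate counters for the total and the filtered sequences; adjust the thresholds as needed):
use strict;
use warnings;
use Bio::SeqIO;

my $seqfile = 'sequences.fa';
my $in = Bio::SeqIO->new(
    -format => 'fasta',
    -file   => $seqfile,
);

my $total    = 0;
my $filtered = 0;   # sequences shorter than 20 or longer than 120

while (my $seq = $in->next_seq) {
    $total++;
    my $len = $seq->length();
    $filtered++ if $len < 20 || $len > 120;
}

print "There are $total sequences, $filtered of them shorter than 20 or longer than 120\n";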

Related

How do I increment a number in middle of the string

I am trying to increment a number in the middle of a string. I tried many ways but didn't find a solution. Any ideas in shell script?
Ex: I have the string sam_2.0_protected_dev_branch. I want to increment the number in the middle of the string, so the output should look like:
sam_2.0_kumar_dev_branch
sam_2.1_kumar_dev_branch
sam_2.2_kumar_dev_branch
...
Maybe instead of trying to change the string, divide it into 3 parts: text, number, and text. You can increment the number and then join those 3 parts back into one string.
Sorry if the terminology isn't right.
Here is the relevant documentation: https://ss64.com/ps/syntax-concat.html
Code:
$version = 1.0
$nameOne = "sam_"
$nametwo = "_protected_dev_branch"
for ($i = 1; $i -lt 5; $i++)
{
$fullName = $nameOne + $version + $nametwo
Write-Host $fullName
$version = $version + 0.1
}
Output :
sam_1_protected_dev_branch
sam_1.1_protected_dev_branch
sam_1.2_protected_dev_branch
sam_1.3_protected_dev_branch
There's a way to force a number to always show a fixed number of digits after the decimal point, but I will let you deal with that.
$version = 1
for ($i = 1; $i -lt 5; $i++)
{
$fullName = "sam_2." + $version + "_protected_dev_branch"
Write-Host $fullName
$version = $version + 1
}

Compare 4 files line by line to see if they match or don't match

I'm trying to compare 4 text files for counts in each line:
file1.txt:
32
44
75
22
88
file2.txt
32
44
75
22
88
file3.txt
11
44
75
22
77
file4.txt
32
44
75
22
88
Each line represents a particular count:
line1 = customerID count
line2 = employeeID count
line3 = active_users
line4 = inactive_users
line5 = deleted_users
I'm trying to compare file2.txt, file3.txt and file4.txt with file1.txt; file1.txt will always have the correct counts.
Example: Since file2.txt matches file1.txt exactly, line by line, in the example above, I'm trying to output "file2.txt is good"; but since line1 and line5 of file3.txt do not match file1.txt, I'm trying to output "customerID for file3.txt does not match by 21 records" (i.e. 32 - 11 = 21) and "deleted_users in file3.txt does not match by 11 records" (88 - 77 = 11).
If shell is easier, then that is fine too.
One way is to process the files line by line in parallel:
use warnings;
use strict;
use feature 'say';

my @files = @ARGV;
#my @files = map { $_ . '.txt' } qw(f1 f2 f3 f4);  # my test files' names

# Open all files, filehandles in @fhs
my @fhs = map { open my $fh, '<', $_ or die "Can't open $_: $!"; $fh } @files;

# For reporting, enumerate file names
my %files = map { $_ => $files[$_] } 0..$#files;

# Process (compare) the same line from all files
my $line_cnt;

LINE: while ( my @line = map { my $line = <$_>; $line } @fhs )
{
    defined || last LINE for @line;
    ++$line_cnt;
    s/(?:^\s+|\s+$)//g for @line;

    for my $i (1..$#line) {
        if ($line[0] != $line[$i]) {
            say "File $files[$i] differs at line $line_cnt";
        }
    }
}
This compares each whole line numerically (after leading and trailing spaces are stripped), since it is given that each line carries a single number that needs to be compared.
It prints, with my test files named f1.txt, f2.txt, ...
File f3.txt differs at line 1
File f3.txt differs at line 5
Store the line names in one array and the correct values in another. Then loop over the files, and for each of them read its lines and compare them to the stored correct values. You can use the special variable $., which holds the line number of the most recently read filehandle, as an index into the arrays. Lines are 1-based and arrays are 0-based, so we need to subtract 1 to get the correct index.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my @line_names = ('customerID count',
                  'employeeID count',
                  'active_users',
                  'inactive_users',
                  'deleted_users');

my @correct;
open my $in, '<', shift or die $!;
while (<$in>) {
    chomp;
    push @correct, $_;
}

while (my $file = shift) {
    open my $in, '<', $file or die $!;
    while (<$in>) {
        chomp;
        if ($_ != $correct[$. - 1]) {
            say "$line_names[$. - 1] in $file does not match by ",
                $correct[$. - 1] - $_, ' records';
        }
    }
}
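Assuming the script is saved as compare.pl (the file name is mine), the reference file is passed first and the files to check follow:
perl compare.pl file1.txt file2.txt file3.txt file4.txt
For the example data this should report that customerID count in file3.txt is off by 21 records and deleted_users by 11, and stay silent for the matching files.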
Read the first file into an array, then loop over the other files using the same function to read each of them into an array. Within this loop, consider every line, calculate the difference, and print a message with the text from @names if the difference is not zero.
#!/usr/bin/perl
use strict;
use warnings;

my @names = qw(customerID_count employeeID_count active_users inactive_users deleted_users);
my @files = qw(file1.txt file2.txt file3.txt file4.txt);

my @first = readfile($files[0]);
for (my $i = 1; $i <= $#files; $i++) {
    print "\n$files[0] <=> $files[$i]:\n";
    my @second = readfile($files[$i]);
    for (my $j = 0; $j <= $#names; $j++) {
        my $diff = $first[$j] - $second[$j];
        $diff = -$diff if $diff < 0;
        if ($diff > 0) {
            print "$names[$j] does not match by $diff records\n";
        }
    }
}

sub readfile {
    my ($file) = @_;
    open my $handle, '<', $file;
    chomp(my @lines = <$handle>);
    close $handle;
    return grep(s/\s*//g, @lines);
}
Output is:
file1.txt <=> file2.txt:
file1.txt <=> file3.txt:
customerID_count does not match by 21 records
deleted_users does not match by 11 records
file1.txt <=> file4.txt:
A mash-up of bash, and mostly the GNU versions of standard utils like diff, sdiff, sed, et al, plus the ifne util, and even an eval:
f=("" "customerID count" "employeeID count" \
"active_users" "inactive_users" "deleted_users")
for n in file{2..4}.txt ; do
diff -qws file1.txt $n ||
$(sdiff file1.txt $n | ifne -n exit | nl |
sed -n '/|/{s/[1-5]/${f[&]}/;s/\s*|\s*/-/;s/\([0-9-]*\)$/$((&))/;p}' |
xargs printf 'eval echo "%s for '"$n"' does not match by %s records.";\n') ;
done
Output:
Files file1.txt and file2.txt are identical
Files file1.txt and file3.txt differ
customerID count for file3.txt does not match by 21 records.
deleted_users for file3.txt does not match by 11 records.
Files file1.txt and file4.txt are identical
The same code, tweaked for prettier output:
f=("" "customerID count" "employeeID count" \
"active_users" "inactive_users" "deleted_users")
for n in file{2..4}.txt ; do
diff -qws file1.txt $n ||
$(sdiff file1.txt $n | ifne -n exit | nl |
sed -n '/|/{s/[1-5]/${f[&]}/;s/\s*|\s*/-/;s/\([0-9-]*\)$/$((&))/;p}' |
xargs printf 'eval echo "%s does not match by %s records.";\n') ;
done |
sed '/^Files/!s/^/\t/;/^Files/{s/.* and //;s/ are .*/ is good/;s/ differ$/:/}'
Output:
file2.txt is good
file3.txt:
customerID count does not match by 21 records.
deleted_users does not match by 11 records.
file4.txt is good
Here is an example in Perl:
use feature qw(say);
use strict;
use warnings;

{
    my $ref = read_file('file1.txt');
    my $N = 3;
    my @value_info;
    for my $i (1..$N) {
        my $fn = 'file'.($i+1).'.txt';
        my $values = read_file( $fn );
        push @value_info, [ $fn, $values ];
    }
    my @labels = qw(customerID employeeID active_users inactive_users deleted_users);
    for my $info (@value_info) {
        my ( $fn, $values ) = @$info;
        my $all_ok = 1;
        my $j = 0;
        for my $value (@$values) {
            if ( $value != $ref->[$j] ) {
                printf "%s: %s does not match by %d records\n",
                    $fn, $labels[$j], abs( $value - $ref->[$j] );
                $all_ok = 0;
            }
            $j++;
        }
        say "$fn: is good" if $all_ok;
    }
}

sub read_file {
    my ( $fn ) = @_;
    my @values;
    open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
    while ( my $line = <$fh> ) {
        if ( $line =~ /(\d+)/ ) {
            push @values, $1;
        }
    }
    close $fh;
    return \@values;
}
Output:
file2.txt: is good
file3.txt: customerID does not match by 21 records
file3.txt: deleted_users does not match by 11 records
file4.txt: is good

Using nested loops in bash to process huge datasets

I am currently working on big datasets (typically 10 Gb for each) that prevent me from using R (RStudio) and dealing with data frames as I used to.
In order to deal with a restricted amount of memory (and CPU power), I've tried Julia and Bash (Shell Script) to process those files.
My question is the following: I've concatenated my files (I have more or less 1 million individual files merged into one big file) and I would like to process those big files in this way: Let's say that I have something like:
id,latitude,longitude,value
18,1,2,100
18,1,2,200
23,3,5,132
23,3,5,144
23,3,5,150
I would like to process my file so that, for id = 18, it computes the max (200), the min (100), or some other properties, then goes to the next id and does the same. I guess some sort of nested loop in bash would work, but I'm having trouble doing it in an elegant way, and the answers found on the Internet so far were not really helpful. I cannot process it in Julia because it's too big/heavy; that's why I'm looking for answers mostly in bash.
However, I wanted to do this because I thought it would be faster to process one huge file rather than open a file, calculate, close it, and go to the next one again and again. I'm not sure at all, though!
Finally, which one would be better to use? Julia or Bash? Or something else?
Thank you !
Julia or Bash?
If you are talking about using plain bash and not some commands that could be executed in any other shell, then the answer is obviously Julia. Plain bash is orders of magnitude slower than Julia.
However, I would recommend using an existing tool instead of writing your own.
GNU datamash could be what you need. You can call it from bash or any other shell.
for id = 18, compute the max (200), the min (100) [...] then go to next id and do the same
With datamash you could use the following bash command
< input.csv datamash -Ht, -g 1 min 4 max 4
Which would print
GroupBy(id),min(value),max(value)
18,100,200
23,132,150
Loops in bash are slow; I think Julia is a much better fit in this case. Here is what I would do:
(Ideally) convert your data into a binary format, like NetCDF or HDF5.
Load a chunk of data (e.g. 100 000 rows, not all, unless all the data fits into RAM) and perform min/max per id as you propose.
Go to the next chunk and update the min/max for every id.
Do not load all the data into memory at once if you can avoid it. For computing simple statistics like the minimum, maximum, sum, mean, standard deviation, ... this can be done, as in the sketch below.
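A rough sketch of that streaming idea (in Perl only because most examples in this thread are Perl; the same pattern carries over directly to Julia): running min/max/mean per id are kept in a small hash while the file is read one line at a time, so only one line and one small hash are ever in memory.
use strict;
use warnings;

my %stat;
while (my $line = <>) {
    chomp $line;
    next if $line =~ /^id,/;                  # skip the header line
    my ($id, undef, undef, $value) = split /,/, $line;
    # create the per-id record on first sight, then update it in place
    my $s = $stat{$id} ||= { min => $value, max => $value, sum => 0, n => 0 };
    $s->{min} = $value if $value < $s->{min};
    $s->{max} = $value if $value > $s->{max};
    $s->{sum} += $value;
    $s->{n}++;
}

for my $id (sort { $a <=> $b } keys %stat) {
    my $s = $stat{$id};
    printf "%s: min=%s max=%s mean=%.2f\n",
        $id, $s->{min}, $s->{max}, $s->{sum} / $s->{n};
}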
In my opinion, the memory overhead of Julia (versus bash) is probably quite small given the size of the problem.
Be sure to read the performance tips for Julia, and in particular place hot loops inside functions, not in global scope.
https://docs.julialang.org/en/v1/manual/performance-tips/index.html
Alternatively, such operations can also be done with specific queries in a SQL database.
Bash is definitely not the best option. (Fortran, baby!)
Anyway, the following can be translated to any language you want.
#!/bin/bash

function postprocess(){
    # Do whatever statistics you want on the arrays.
    echo "id: $last_id"
    echo "lats: ${lat[@]}"
    echo "lons: ${lon[@]}"
    echo "vals: ${val[@]}"
}

# Set dummy start variable
last_id="not a valid id"
count=0

while read line; do
    id=$( echo $line | cut -d, -f1 )
    # Ignore first line
    [ "$id" == "id" ] && continue
    # If this is a new id, post-process the old one
    if [ $id -ne $last_id -a $count -ne 0 ] 2> /dev/null; then
        # Do post processing of data
        postprocess
        # Reset counter
        count=0
        # Reset value arrays
        unset lat
        unset lon
        unset val
    fi
    # Increment counter
    (( count++ ))
    # Set last_id
    last_id=$id
    # Get values into arrays
    lat+=($( echo $line | cut -d, -f2 ))
    lon+=($( echo $line | cut -d, -f3 ))
    val+=($( echo $line | cut -d, -f4 ))
done < test.txt

[ $count -gt 0 ] && postprocess
For this kind of problem, I'd be wary of using bash, because it isn't suited to line-by-line processing. And awk is too line-oriented for this kind of job, making the code complicated.
Something like this in perl might do the job, with a loop of loops grouping lines together by their id field.
IT070137 ~/tmp $ cat foo.pl
#!/usr/bin/perl -w
use strict;

my ($id, $latitude, $longitude, $value) = read_data();
while (defined($id)) {
    my $group_id = $id;
    my $min = $value;
    my $max = $value;
    ($id, $latitude, $longitude, $value) = read_data();
    while (defined($id) && $id eq $group_id) {
        if ($value < $min) {
            $min = $value;
        }
        if ($value > $max) {
            $max = $value;
        }
        ($id, $latitude, $longitude, $value) = read_data();
    }
    print $group_id, " ", $min, " ", $max, "\n";
}

sub read_data {
    my $line = <>;
    if (!defined($line)) {
        return (undef, undef, undef, undef);
    }
    chomp($line);
    my ($id, $latitude, $longitude, $value) = split(/,/, $line);
    return ($id, $latitude, $longitude, $value);
}
IT070137 ~/tmp $ cat foo.txt
id,latitude,longitude,value
18,1,2,100
18,1,2,200
23,3,5,132
23,3,5,144
23,3,5,150
IT070137 ~/tmp $ perl -w foo.pl foo.txt
id value value
18 100 200
23 132 150
Or if you prefer Python:
#!/usr/bin/python -tt
from __future__ import print_function
import fileinput

def main():
    data = fileinput.input()
    (id, lattitude, longitude, value) = read(data)
    while id:
        group_id = id
        min = value
        (id, lattitude, longitude, value) = read(data)
        while id and group_id == id:
            if value < min:
                min = value
            (id, lattitude, longitude, value) = read(data)
        print(group_id, min)

def read(data):
    line = data.readline()
    if line == '':
        return (None, None, None, None)
    line = line.rstrip()
    (id, lattitude, longitude, value) = line.split(',')
    return (id, lattitude, longitude, value)

main()

From the concatenated fasta file, how to find individual range of locations in each protein sequence

Maybe this question is too general, but I am completely stuck on this. Any help is appreciated:
I have a protein fasta file (protein.txt) like:
>a
mnspq
>b
rstuvw
>c
mnqa
Note that the lengths of proteins a, b, and c are 5, 6, and 4 respectively (total length = 15).
Now I have extracted some random ranges (the calculation is based on the total length) and saved them in file1.txt as:
2-3
4-10
11-14
The position range of each protein (within the total length), as seen in the protein file, is saved in another file (file2.txt) as:
a 1-5
b 6-11
c 12-15
Now, using the file1 values, I want to map them onto the file2 ranges and calculate the individual range(s) within each protein sequence. For the above input, the output would be:
a 2-3,4-5
b 1-5, 6
c 2-5
In other words, if I first concatenate all my sequences and determine some ranges on the concatenated coordinates, how can I find the corresponding ranges of locations in each individual protein sequence?
Thanks
I guess the last line of the expected output should be c 1-3:
global position:  1  2  3  4  5 | 6  7  8  9 10 11 | 12 13 14 15
protein:          a  a  a  a  a | b  b  b  b  b  b |  c  c  c  c
local position:   1  2  3  4  5 | 1  2  3  4  5  6 |  1  2  3  4
The range 2-3 falls inside a (local 2-3), 4-10 spans a (local 4-5) and b (local 1-5), and 11-14 spans b (local 6) and c (local 1-3).
Perl to the rescue! First, the ranges from file1 are read into an array. Then, the proteins are read from file2, and for each stored range that overlaps a protein's range, the local "start" and "end" are computed and printed.
#!/usr/bin/perl
use warnings;
use strict;

my @ranges;
open my $f1, '<', 'file1.txt' or die $!;
while (<$f1>) {
    chomp;
    push @ranges, [ split /-/ ];
}

open my $f2, '<', 'file2.txt' or die $!;
while (<$f2>) {
    my ($protein, $range) = split;
    print "$protein";
    my $separator = ' ';
    my ($from, $to) = split /-/, $range;
    shift @ranges while @ranges && $ranges[0][1] < $from;
    last unless @ranges;

    while (@ranges && $ranges[0][0] <= $to) {
        my $start = $ranges[0][0];
        $start = $from if $from > $start;
        my $end = $ranges[0][1];
        $end = $to if $end > $to;
        $_ -= $from - 1 for $start, $end;
        print $separator, $start == $end ? $start : "$start-$end";
        $separator = ',';
        if ($ranges[0][1] < $to) {
            shift @ranges;
        } else {
            $ranges[0][0] = $to + 1;
        }
    }
    print "\n";
}
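With the example file1.txt and file2.txt from the question in the current directory, running the script (e.g. perl ranges.pl, the script name is mine) should print:
a 2-3,4-5
b 1-5,6
c 1-3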

generate text file from set of character

I want to generate one text file containing all possible combinations from a restricted character set, in bash or maybe Python.
For example
I have
aAbBc01+
and I want all combinations 9 and 10 characters long, starting with
aaaaaaaaa
finishing with
++++++++++
passing through
+++++++++
aaaaaaaaaa
Already discussed in the forum
For python:
python -c "from itertools import permutations as p ; print('\n'.join([''.join(item) for line in open('File') for item in p(line[:-1])]))"
where File contains your input string
For bash -- Much slower
perm() {
    items="$1"
    out="$2"
    [[ "$items" == "" ]] && echo "$out" && return
    for (( i=0; i<${#items}; i++ )) ; do
        ( perm "${items:0:i}${items:i+1}" "$out${items:i:1}" )
    done
}
while read line ; do perm $line ; done < File
Here is a Python solution:
def combinations(chars, length, result="", place=0):
    # Once the result has reached the desired length, print it.
    if place >= length:
        print(result)
        return
    # Otherwise extend the result with each character and recurse.
    for i in range(len(chars)):
        combinations(chars, length, result + chars[i], place + 1)
This function takes a string and a desired result length, and prints all combinations of its characters that have the specified length.
If you want the combinations of length 9 or 10, just call
combinations("aAbBc01+",9)
combinations("aAbBc01+",10)
and redirect the output to a text file.
