Determining All Possibilities for a Random String?

I was hoping someone with better math skills could help me figure out the total number of possibilities for a string, given its length and character set.
e.g. [a-f0-9]{6}
How many possible strings match this pattern of random characters?

It is equal to the number of characters in the set raised to the 6th power (the length of the string).
In Python (3.x) interpreter:
>>> len("0123456789abcdef")
16
>>> 16**6
16777216
>>>
EDIT 1:
Why 16.7 million? Well, 000000 ... 999999 = 10^6 = 1M, 16/10 = 1.6 and
>>> 1.6**6
16.77721600000000
EDIT 2:
To create a list in Python, do: print(['{0:06x}'.format(i) for i in range(16**6)])
However, that list is huge (16,777,216 entries). Here is a shorter example:
>>> ['{0:06x}'.format(i) for i in range(100)]
['000000', '000001', '000002', '000003', '000004', '000005', '000006', '000007', '000008', '000009', '00000a', '00000b', '00000c', '00000d', '00000e', '00000f', '000010', '000011', '000012', '000013', '000014', '000015', '000016', '000017', '000018', '000019', '00001a', '00001b', '00001c', '00001d', '00001e', '00001f', '000020', '000021', '000022', '000023', '000024', '000025', '000026', '000027', '000028', '000029', '00002a', '00002b', '00002c', '00002d', '00002e', '00002f', '000030', '000031', '000032', '000033', '000034', '000035', '000036', '000037', '000038', '000039', '00003a', '00003b', '00003c', '00003d', '00003e', '00003f', '000040', '000041', '000042', '000043', '000044', '000045', '000046', '000047', '000048', '000049', '00004a', '00004b', '00004c', '00004d', '00004e', '00004f', '000050', '000051', '000052', '000053', '000054', '000055', '000056', '000057', '000058', '000059', '00005a', '00005b', '00005c', '00005d', '00005e', '00005f', '000060', '000061', '000062', '000063']
>>>
EDIT 3:
As a function:
def generateAllHex(numDigits):
    assert(numDigits > 0)
    ceiling = 16**numDigits
    for i in range(ceiling):
        formatStr = '{0:0' + str(numDigits) + 'x}'
        print(formatStr.format(i))
This will take a while to print at numDigits = 6.
I recommend dumping this to file instead like so:
def generateAllHex(numDigits, fileName):
    assert(numDigits > 0)
    ceiling = 16**numDigits
    with open(fileName, 'w') as fout:
        for i in range(ceiling):
            formatStr = '{0:0' + str(numDigits) + 'x}'
            # write one value per line; without '\n' all values run together
            fout.write(formatStr.format(i) + '\n')

If you are just looking for the number of possibilities, the answer is (charset.length)^(length). If you need to actually generate a list of the possibilities, just loop through each character, recursively generating the remainder of the string.
e.g.
void generate(char[] charset, int length)
{
    generate("", charset, length);
}
void generate(String prefix, char[] charset, int length)
{
    for (int i = 0; i < charset.length; i++)
    {
        if (length == 1)
            System.out.println(prefix + charset[i]);
        else
            // append the character itself, not the loop index
            generate(prefix + charset[i], charset, length - 1);
    }
}
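For comparison, the same enumeration can be sketched in Python with itertools.product, which yields each combination lazily instead of building the whole list in memory. The function name all_strings and the sample charset are illustrative only and are not part of the answer above.

```python
from itertools import product


def all_strings(charset, length):
    """Yield every string of the given length over charset, in charset order."""
    for chars in product(charset, repeat=length):
        yield "".join(chars)


# Small demonstration: 3 characters, length 2 gives 3**2 = 9 strings.
for s in all_strings("abc", 2):
    print(s)
```

Because this is a generator, iterating over all_strings("0123456789abcdef", 6) never holds more than one string at a time.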

The number of possibilities is the size of your alphabet raised to the power of the length of your string (in the general case, of course).
Say your string length is 4: _ _ _ _ and your alphabet = { 0 , 1 }:
there are 2 possibilities (0 or 1) for the first place, 2 for the second place, and so on.
The choices multiply, so the total is 2 * 2 * 2 * 2 = 16, i.e. alphabet_size^string_size.
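The counting rule above can be checked on the binary example with a short Python sketch. The helper name count_strings is made up for illustration; it multiplies the number of choices position by position and compares the result with a brute-force enumeration.

```python
from itertools import product


def count_strings(alphabet_size, length):
    """Multiply the number of choices available at each position."""
    total = 1
    for _ in range(length):
        total *= alphabet_size
    return total


alphabet = "01"
length = 4
# Brute force: list every string of that length over the alphabet.
enumerated = ["".join(p) for p in product(alphabet, repeat=length)]

assert len(enumerated) == count_strings(len(alphabet), length) == 2 ** 4
print(count_strings(16, 6))  # the [a-f0-9]{6} case: 16777216
```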

first: 000000
last: ffffff
This matches hexadecimal numbers.

For any given set of possible values, the number of possible sequences (with repetition allowed) is the number of values raised to the power of the number of positions.
In this case, that would be 16 to the 6th power, or 16777216 possibilities.

Related

How to build an empirical codon substitution matrix from a multiple sequence alignment

I have been trying to build an empirical codon substitution matrix given a multiple sequence alignment in fasta format using Biopython.
It appears to be relatively straightforward for single-nucleotide substitution matrices using the AlignInfo module when the aligned sequences have the same length. Here is what I managed to do using Python 2.7:
#!/usr/bin/env python
import os
import argparse
from Bio import AlignIO
from Bio.Align import AlignInfo
from Bio import SubsMat
import sys
version = "0.0.1 (23.04.20)"
name = "Aln2SubMatrix.py"
parser=argparse.ArgumentParser(description="Outputs a codon substitution matrix given a multi-alignment in FastaFormat. Will raise error if alignments contain dots (\".\"), so replace those with dashes (\"-\") beforehand (e.g. using sed)")
parser.add_argument('-i','--input', action = "store", dest = "input", required = True, help = "(aligned) input fasta")
parser.add_argument('-o','--output', action = "store", dest = "output", help = "Output filename (default = <Input-file>.codonSubmatrix)")
args=parser.parse_args()
if not args.output:
    args.output = args.input + ".codonSubmatrix" #if no outputname was specified set outputname based on inputname
def main():
    infile = open(args.input, "r")
    outfile = open(args.output, "w")
    align = AlignIO.read(infile, "fasta")
    summary_align = AlignInfo.SummaryInfo(align)
    replace_info = summary_align.replacement_dictionary()
    mat = SubsMat.SeqMat(replace_info)
    print >> outfile, mat
    infile.close()
    outfile.close()
    sys.stderr.write("\nfinished\n")
main()
Using a multiple sequence alignment file in fasta format with sequences of the same length (aln.fa), the output is a half-matrix corresponding to the number of nucleotide substitutions observed in the alignment (note that gaps (-) are allowed):
python Aln2SubMatrix.py -i aln.fa
- 0
a 860 232
c 596 75 129
g 571 186 75 173
t 892 58 146 59 141
- a c g t
What I am aiming to do is compute a similar empirical substitution matrix, but for all nucleotide triplets (codons) present in a multiple sequence alignment.
I have tried to tweak the _pair_replacement function of the AlignInfo module in order to accept nucleotide triplets by changing:
lines 305 to 308
    for residue_num in range(len(seq1)):
        residue1 = seq1[residue_num]
        try:
            residue2 = seq2[residue_num]
to
    for residue_num in range(0, len(seq1), 3):
        residue1 = seq1[residue_num:residue_num+3]
        try:
            residue2 = seq2[residue_num:residue_num+3]
At this stage it can retrieve the codons from the alignment, but it complains about the alphabet (does the module only accept single-character alphabets?).
Note that I would like to get a substitution matrix that accounts for the three possible reading frames.
Any help is highly appreciated.
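One way to sidestep the alphabet check is to count codon pairs directly in plain Python. The following is only a sketch of that idea and does not use or reproduce Biopython's API: the function name codon_pair_counts, the frame parameter, and the choice to skip any codon containing a gap are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def codon_pair_counts(seqs, frame=0, skip_gaps=True):
    """Count unordered codon pairs seen at the same alignment position.

    seqs  : aligned sequences as plain strings
    frame : reading-frame offset (0, 1 or 2)
    """
    counts = Counter()
    for s1, s2 in combinations(seqs, 2):
        n = min(len(s1), len(s2))
        # Step through complete triplets only, starting at the frame offset.
        for i in range(frame, n - 2, 3):
            c1 = s1[i:i + 3].lower()
            c2 = s2[i:i + 3].lower()
            if skip_gaps and ("-" in c1 or "-" in c2):
                continue  # assumption: gapped codons are ignored
            counts[tuple(sorted((c1, c2)))] += 1
    return counts


# Toy alignment; real input would be the sequences read from aln.fa.
aln = ["atgaaa", "atgaag", "atg---"]
for frame in range(3):
    print(frame, dict(codon_pair_counts(aln, frame)))
```

Summing the three Counter objects (one per frame) would give a single table covering all reading frames, if that is the intended meaning of accounting for them.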

Why are my byte arrays not different even though print() says they are?

I am new to Python, so please forgive me if I'm asking a dumb question. In my function I generate a random byte array of a given number of bytes, called "input_data"; then I add some bit errors byte by byte and store the result in another byte array called "output_data". The print function shows that it works exactly as expected: the bytes differ. But if I compare the byte arrays afterwards, they seem to be identical!
def simulate_ber(packet_length, ber, verbose=False):
    # generate input data
    input_data = bytearray(random.getrandbits(8) for _ in xrange(packet_length))
    if(verbose):
        print(binascii.hexlify(input_data)+" <-- simulated input vector")
    output_data = input_data
    #add bit errors
    num_errors = 0
    for byte in range(len(input_data)):
        error_mask = 0
        for bit in range(0,7,1):
            if(random.uniform(0, 1)*100 < ber):
                error_mask |= 1 << bit
                num_errors += 1
        output_data[byte] = input_data[byte] ^ error_mask
    if(verbose):
        print(binascii.hexlify(output_data)+" <-- output vector")
        print("number of simulated bit errors: " + str(num_errors))
    if(input_data == output_data):
        print ("data identical")
number of packets: 1
bytes per packet: 16
simulated bit error rate: 5
start simulation...
0d3e896d61d50645e4e3fa648346091a <-- simulated input vector
0d3e896f61d51647e4e3fe648346001a <-- output vector
number of simulated bit errors: 6
data identical
Where is the bug? I am sure the problem is somewhere between my ears...
Thank you in advance for your help!
output_data = input_data
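Assignment in Python does not copy an object; it binds another name to it. After the line above, both variables refer to the same bytearray in memory, so every change made through output_data is also visible through input_data. For example: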
>>> y=['Hello']
>>> x=y
>>> x.append('World!')
>>> x
['Hello', 'World!']
>>> y
['Hello', 'World!']
Make output_data a new bytearray (a copy of input_data) and you should be good:
output_data = bytearray(input_data)
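The difference between a second name and a real copy can be seen in a small Python 3 sketch. The variable names alias and copy and the sample bytes are made up for illustration.

```python
input_data = bytearray(b"\x0d\x3e\x89")

alias = input_data            # second name for the same object
copy = bytearray(input_data)  # new object with the same contents

alias[0] ^= 0x02  # flips a bit in input_data as well
copy[1] ^= 0x10   # leaves input_data untouched

print(alias is input_data, alias == input_data)  # True True
print(copy is input_data, copy == input_data)    # False False
```

With the alias, the comparison can never report a difference, which is exactly what the "data identical" message in the question shows.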

How to improve the running time of my binary search code in its peripheral parts?

I am taking this great Coursera course: https://www.coursera.org/learn/algorithmic-toolbox . In the fourth week, we have an assignment on binary search.
I think I did a good job. I wrote a recursive binary search in Python 3 that solves the problem. Here is my code:
#python3
data_in_sequence = list(map(int,(input().split())))
data_in_keys = list(map(int,(input()).split()))
original_array = data_in_sequence[1:]
data_in_sequence = data_in_sequence[1:]
data_in_keys = data_in_keys[1:]
def binary_search(data_in_sequence,target):
    answer = 0
    sub_array = data_in_sequence
    #print("sub_array",sub_array)
    if not sub_array:
        # print("sub_array",sub_array)
        answer = -1
        return answer
    #print("target",target)
    mid_point_index = (len(sub_array)//2)
    #print("mid_point", sub_array[mid_point_index])
    beg_point_index = 0
    #print("beg_point_index",beg_point_index)
    end_point_index = len(sub_array)-1
    #print("end_point_index",end_point_index)
    if sub_array[mid_point_index]==target:
        #print ("final midpoint, ", sub_array[mid_point_index])
        #print ("original_array",original_array)
        #print("sub_array[mid_point_index]",sub_array[mid_point_index])
        #print ("answer",answer)
        answer = original_array.index(sub_array[mid_point_index])
        return answer
    elif target>sub_array[mid_point_index]:
        #print("target num higher than current midpoint")
        beg_point_index = mid_point_index+1
        sub_array=sub_array[beg_point_index:]
        end_point_index = len(sub_array)-1
        #print("sub_array",sub_array)
        return binary_search(sub_array,target)
    elif target<sub_array[mid_point_index]:
        #print("target num smaller than current midpoint")
        sub_array = sub_array[:mid_point_index]
        return binary_search(sub_array,target)
    else:
        return None
def bin_search_over_seq(data_in_sequence,data_in_keys):
    final_output = ""
    for key in data_in_keys:
        final_output = final_output + " " + str(binary_search(data_in_sequence,key))
    return final_output
print (bin_search_over_seq(data_in_sequence,data_in_keys))
I usually get the correct output. For instance, if I input:
5 1 5 8 12 13
5 8 1 23 1 11
I get the correct index of each key in the sequence (the first line), or -1 if the key is not in the sequence:
2 0 -1 0 -1
However, my code does not finish within the time limit:
Failed case #4/22: time limit exceeded (Time used: 13.47/10.00, memory used: 36696064/536870912.)
I don't think this is caused by the implementation of my binary search itself (I think it is right). I suspect it is caused by some inefficiency in a peripheral part of the code, such as the way I build the final output. However, the way I present the final answer does not seem to be really "heavy"... I am lost.
Am I missing something? Is there another inefficiency I am not seeing? How can I solve this? Should I just try to present the final result in a faster way?
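One likely cause, offered as a sketch rather than a verified fix for the grader: every slice such as sub_array[beg_point_index:] copies part of the list, and original_array.index(...) scans the list from the start, so each lookup costs time proportional to the list length instead of its logarithm. Tracking two indices into the original list avoids both. The function names below are illustrative, and the sketch assumes the sequence is sorted with distinct elements (with duplicates it may return any matching index).

```python
def binary_search(sorted_seq, target):
    """Return the index of target in sorted_seq, or -1 if it is absent."""
    lo, hi = 0, len(sorted_seq) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_seq[mid] == target:
            return mid
        if sorted_seq[mid] < target:
            lo = mid + 1   # target can only be in the right half
        else:
            hi = mid - 1   # target can only be in the left half
    return -1


def search_all(sorted_seq, keys):
    # join builds the output in one pass instead of repeated concatenation
    return " ".join(str(binary_search(sorted_seq, k)) for k in keys)


print(search_all([1, 5, 8, 12, 13], [8, 1, 23, 1, 11]))  # 2 0 -1 0 -1
```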

What do the overflow_xxxx.bin files mean while training GloVe?

I'm training a word embedding model based on the GloVe method. While the algorithm runs, it logs output like this:
$ build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 -window-size 8 < /home/ignacio/data/GUsDany/corpus/GUs_regulon_pubMed.txt > cooccurrence.bin
COUNTING COOCCURRENCES
window size: 8
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 145223095 words.
Building lookup table...table contains 228170143 elements.
Processing token: 5478600000
The GloVe home directory is filling up with files with names like overflow_0534.bin. Can someone tell me whether everything is going well?
Thanks
Everything is OK.
You can view the source code of the GloVe cooccur program on GitHub.
At line 57 of the file:
long long overflow_length; // Number of cooccurrence records whose product exceeds max_product to store in memory before writing to disk
If your corpus produces more co-occurrence records than fit in that buffer, the excess is written to temporary .bin files on disk:
while (1) {
    if (ind >= overflow_length - window_size) { // If overflow buffer is (almost) full, sort it and write it to temporary file
        qsort(cr, ind, sizeof(CREC), compare_crec);
        write_chunk(cr,ind,foverflow);
        fclose(foverflow);
        fidcounter++;
        sprintf(filename,"%s_%04d.bin",file_head,fidcounter);
        foverflow = fopen(filename,"w");
        ind = 0;
    }
The variable overflow_length depends on your memory settings.
Line 463:
if ((i = find_arg((char *)"-memory", argc, argv)) > 0) memory_limit = atof(argv[i + 1]);
Line 467:
rlimit = 0.85 * (real)memory_limit * 1073741824/(sizeof(CREC));
Line 470:
overflow_length = (long long) rlimit/6; // 0.85 + 1/6 ~= 1
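As a rough check of the arithmetic in the quoted lines, the calculation can be replayed in Python. The record size of 16 bytes is an assumption (two 4-byte ints plus one 8-byte double for CREC); with it, -memory 4.0 reproduces the "overflow length: 38028356" shown in the question's log.

```python
GIB = 1073741824  # bytes per gigabyte, the constant used in the quoted line
CREC_SIZE = 16    # ASSUMED sizeof(CREC): two 4-byte ints + one 8-byte double


def overflow_length(memory_gb):
    """Replay the quoted computation of overflow_length for a -memory value."""
    rlimit = 0.85 * memory_gb * GIB / CREC_SIZE
    # (long long) rlimit/6 casts first, then performs integer division
    return int(rlimit) // 6


print(overflow_length(4.0))  # 38028356
```

A larger -memory value therefore means a larger buffer and fewer overflow files.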

Awk Calc Avg Rows Below Certain Line

I'm having trouble calculating an average of specific numbers in a column BELOW a specific text identifier using awk. I have two columns of data, and I'm trying to start the average at a common identifier that repeats, which is 01/1991. So awk should calculate the average starting at each line beginning with 01/1991 and including the next 21 lines, for a total of 22 rows per average (one for each year from 1991 to 2012). The desired output is the average of the January (01) values over 1991-2012 for each TextID/Name entry, as shown below:
TextID/Name 1
Avg: 50.34
TextID/Name 2
Avg: 45.67
TextID/Name 3
Avg: 39.97
...
sample data:
TextID/Name 1
01/1991, 57.67
01/1992, 56.43
01/1993, 49.41
..
01/2012, 39.88
TextID/Name 2
01/1991, 45.66
01/1992, 34.77
01/1993, 56.21
..
01/2012, 42.11
TextID/Name 3
01/1991, 32.22
01/1992, 23.71
01/1993, 29.55
..
01/2012, 35.10
continues with the same data for TextID/Name 4
I'm getting an answer using the code shown below, but the average starts being calculated BEFORE the specific identifier line rather than on and below that line (01/1991).
awk '$1="01/1991" {sum+=$2} (NR%22==0){avg=sum/22;print"Average: "avg;sum=0;next}' myfile
Thanks; an explanation of the solution would be greatly appreciated! I have edited the original post with more description. Thank you again.
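If you look at your file, the first field is "01/1991," with a comma at the end, not "01/1991". In addition, $1="01/1991" is an assignment rather than a comparison (that would be ==), so the condition is true for every line. Finally, NR%22==0 matches line numbers divisible by 22, not the 22nd line after the point you care about.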
You can do something like this instead:
awk '
BEGIN { l=-1; }
$1 == "01/1991," {
    l=22;
    s=0;
}
l > 0 { s+=$2; l--; }
l == 0 { print s/22; l--; }'
It has a counter l that it sets to the number of lines to count, then it sums up that number of lines.
You may want to consider simply summing all lines from one 01/1991 to the next though, which might be more robust.
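The more robust grouping idea (collect values from one header to the next) can be sketched in Python. This is an illustration under assumptions, not a drop-in replacement for the awk above: the function name january_averages is made up, any non-empty line that is not a "MM/YYYY, value" row is treated as a TextID/Name header, and only rows with month 01 are averaged.

```python
import re

# A data row looks like "01/1991, 57.67": month, year, comma, value.
DATA_ROW = re.compile(r"^(\d{2})/(\d{4}),\s*([\d.]+)$")


def january_averages(lines):
    """Return {header: average of its month-01 values}, in file order."""
    groups = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = DATA_ROW.match(line)
        if match is None:
            current = line  # assumption: every other line is a header
            groups[current] = []
        elif current is not None and match.group(1) == "01":
            groups[current].append(float(match.group(3)))
    return {name: sum(v) / len(v) for name, v in groups.items() if v}


sample = [
    "TextID/Name 1", "01/1991, 57.67", "01/1992, 56.43",
    "TextID/Name 2", "01/1991, 45.66", "01/1992, 34.77",
]
for name, avg in january_averages(sample).items():
    print(name)
    print("Avg: %.2f" % avg)
```

To run it on the real data, pass an open file object instead of the sample list.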
If you're allowed to use Perl instead of Awk, you could do:
#!/usr/bin/env perl
use strict;
use warnings;
my $have_started = 0;
my $count = 0;
my $sum = 0;
while (my $line = <>) {
    # Start summing at the first 01/1991 line
    $have_started = 1 if $line =~ /^01\/1991,/;
    # Grab the value after the date and comma; skip lines without one
    next unless $line =~ /\d+\/\d+,\s+([\d.]+)/;
    # If we have started counting, add the captured value
    if ($have_started) {
        $count++;
        $sum += $1;
    }
}
print "Average of all values = " . $sum / $count . "\n" if $count;
Run it like so:
$ perl above-perl-script.pl your-text-file.txt
