Making repetitions in a sequence explicit: how to turn AACCCC into 2A4C?

I am looking for a way to quantify the repetitiveness of a DNA sequence. My question is: how are tandem repeats of a single nucleotide distributed within a given DNA sequence?
To answer that I would need a simple way to "compress" a sequence where there are identical letters repeated several times.
For instance:
AAAATTCGCATTTTTTAGGTA --> 4A2T1C1G1C1A6T1A2G1T1A
From this I would be able to extract the numbers to study the distribution of the repetitions (probably a Poisson distribution, I would say), like:
4A2T1C1G1C1A6T1A2G1T1A --> 4 2 1 1 1 1 6 1 2 1 1
The limiting step for me is the first one. There are some existing topics that answer similar questions, but I am looking for a bash solution using regular expressions.
how to match dna sequence pattern (solution in C++)
Analyze tandem repeat motifs in DNA sequences (solution in python)
Sequence Compression? (solution in Javascript)
So if my question inspires some regex kings, it would help me a lot.
If there is a software that does this I would take it for sure as well!
Thanks all, I hope I was clear enough
Egill

As others mentioned, Bash might not be ideal for data crunching. That being said, the compression part is not that difficult to implement:
#!/usr/bin/env bash
# Compress DNA sequence [$1: sequence string, $2: name of output variable]
function compress_sequence() {
    local input="$1"
    local -n output="$2"; output=""
    local curr_char="" last_char="${input:0:1}" char_count=1 i
    # The loop deliberately runs one index past the end of the string: the final
    # empty curr_char differs from last_char, which flushes the last run.
    for ((i=1; i <= ${#input}; i++)); do
        curr_char="${input:i:1}"
        if [[ "${curr_char}" != "${last_char}" ]]; then
            output+="${char_count}${last_char}"
            last_char="${curr_char}"
            char_count=1
        else
            char_count=$((char_count + 1))
        fi
    done
}
compress_sequence "AAAATTCGCATTTTTTAGGTA" compressed
echo "${compressed}"
This algorithm processes the sequence string character by character, counts runs of identical characters, and appends <count><char> to the output whenever the character changes. I did not use regular expressions here and I'm pretty sure there wouldn't be any benefit in doing so.
I might as well add the number-extracting part, as it is trivial:
numbers_string="${compressed//[^0-9]/ }"
numbers_array=(${numbers_string})
This replaces everything that is not a digit with a space. The array is just a suggestion for further processing.
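If the downstream statistics turn out to be easier outside of bash, a minimal Python sketch of the same extraction (my own addition, working from the raw sequence rather than the compressed string) could look like this:
# Run lengths straight from the raw sequence, plus a tally of how often each length occurs.
from itertools import groupby
from collections import Counter

seq = "AAAATTCGCATTTTTTAGGTA"
run_lengths = [len(list(run)) for _, run in groupby(seq)]
print(run_lengths)           # [4, 2, 1, 1, 1, 1, 6, 1, 2, 1, 1]
print(Counter(run_lengths))  # e.g. Counter({1: 7, 2: 2, 4: 1, 6: 1})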

Related

Word Pattern Finder

Problem: find all words that follow a pattern (independently of the actual symbols used to define the pattern).
Almost identical to what this site does: http://design215.com/toolbox/wordpattern.php
Enter patterns like: ABCCDE
This will find words like "bloody", "kitten", and "valley". The above pattern will NOT find words like "fennel" or "hippie" because that would require the pattern to be ABCCBE.
Please note: I need a version of that algorithm that does find words like "fennel" or "hippie" even with an ABCCDE pattern.
To complicate things further, there is the possibility of adding known characters anywhere in the search pattern; for example, cBBX (where c is the known character) will yield cees, coof, cook, cool ...
What I've done so far: I found this answer (Pattern matching for strings independent from symbols) that solves my problem almost perfectly, but if I assign an integer to every word I need to compare, I will encounter two problems.
The first is the number of unique digits I can use. For example, if the pattern is XYZABCDEFG, the equivalent digit pattern will be 1 2 3 4 5 6 7 8 9 and then? 10? Consider that I would use the digit 0 to indicate a known character (for example, aBe --> 010 --> 10). Using hexadecimal digits will move the problem further, but will not solve it.
The second problem is the maximum length of the pattern: a Long in Java is 19 digits long, and I need no restriction on my patterns (although I don't think there exists a word with 20 different characters).
To solve those problems, I could store each digit of the pattern in an array, but then it becomes an array-to-array comparison instead of an integer comparison, thus taking a lot more time to compute.
As a side note: depending on the algorithm used, what data structure would be best suited for storing the dictionary? I was thinking about using a hash map, converting each word into its digit-pattern equivalent (assuming no known character) and using this number as a hash (of course, there would be a lot of collisions). In that way, searching would first require matching the numeric pattern, and then scanning the results to find all the words that have the known characters in the right places (if present in the original search pattern).
Also, the dictionary is not static: words can be added and deleted.
EDIT:
This answer (https://stackoverflow.com/a/44604329/4452829) works fairly well and it's fast (testing for equal lengths before matching the patterns). The only problem is that I need a version of that algorithm that find words like "fennel" or "hippie" even with an ABCCDE pattern.
I've already implemented a way to check for known characters.
EDIT 2:
Ok, by checking if each character in the pattern is greater than or equal to the corresponding character in the current word (normalized as a temporary pattern) I am almost done: it correctly matches the search pattern ABCA with the word ABBA and it correctly ignores the word ABAC. The last remaining problem is that if (for example) the pattern is ABBA it will match the word ABAA, and that's not correct.
EDIT 3:
Meh, not pretty but it seems to be working (I'm using Python because it's fast to code with it). Also, the search pattern can be any sequence of symbols, using lowercase letters as fixed characters and everything else as wildcards; there is also no need to convert each word in an abstract pattern.
def match(pattern, fixed_chars, word):
    d = dict()
    if len(pattern) != len(word):
        return False
    if check_fixed_char(word, fixed_chars) is False:  # check_fixed_char is implemented separately (see above)
        return False
    for i in range(0, len(pattern)):
        cp = pattern[i]
        cw = word[i]
        if cp in d:
            if d[cp] != cw:
                return False
        else:
            d[cp] = cw
            if cp > cw:
                return False
    return True
A long time ago I wrote a program for solving cryptograms which was based on the same concept (generating word patterns such that "kitten" and "valley" both map to "abccde").
My technique did involve generating a sort of index of words by pattern.
The core abstraction function looks like:
#!/usr/bin/env python
import string

def abstract(word):
    '''find abstract word pattern
    dog or cat -> abc, book or feel -> abbc
    '''
    al = list(string.ascii_lowercase)
    d = dict()
    for i in word:
        if i not in d:
            d[i] = al.pop(0)
    return ''.join([d[i] for i in word])
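A quick sanity check on the examples mentioned above (my own addition):
print(abstract('book'))     # -> 'abbc'
print(abstract('kitten'))   # -> 'abccde'
print(abstract('valley'))   # -> 'abccde'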
From there building our index is pretty easy. Assume we have a file like /usr/share/dict/words (commonly found on Unix-like systems including MacOS X and Linux):
#!/usr/bin/python
words_by_pattern = dict()
words = set()
with open('/usr/share/dict/words') as f:
    for each in f:
        words.add(each.strip().lower())
for each in sorted(words):
    pattern = abstract(each)
    if pattern not in words_by_pattern:
        words_by_pattern[pattern] = list()
    words_by_pattern[pattern].append(each)
... that takes less than two seconds on my laptop for about 234,000 "words" (Although you might want to use a more refined or constrained word list for your application).
Another interesting trick at this point is to find the patterns which are most unique (returns the fewest possible words). We can create a histogram of patterns thus:
histogram = [(len(words_by_pattern[x]),x) for x in words_by_pattern.keys()]
histogram.sort()
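The sort is ascending, so the ten most common patterns can be printed with something like this (my reconstruction; only the output was shown originally):
for count, pattern in sorted(histogram, reverse=True)[:10]:
    print("%d %s" % (count, pattern))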
I find that this gives me:
8077 abcdef
7882 abcdefg
6373 abcde
6074 abcdefgh
3835 abcd
3765 abcdefghi
1794 abcdefghij
1175 abc
1159 abccde
925 abccdef
Note that abc, abcd, and abcde are all in the top ten. In other words the most common letter patterns for words include all of those with no repeats among 3 to 10 characters.
You can also look at the histogram of the histogram. In other words how many patterns only show one word: for example aabca only matches "eerie" and aabcb only matches "llama". There are over 48,000 patterns with only a single matching word and almost six thousand with just two words and so on.
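A sketch of that second histogram, building on words_by_pattern from above (the exact counts will depend on your word list):
from collections import Counter

pattern_sizes = Counter(len(wordlist) for wordlist in words_by_pattern.values())
print("%d patterns match exactly one word" % pattern_sizes[1])
print("%d patterns match exactly two words" % pattern_sizes[2])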
Note: I don't use digits; I use letters to create the pattern mappings.
I don't know if this helps with your project at all, but these are very simple snippets of code. (They're intentionally verbose.)
This can easily be achieved by using regular expressions.
For example, the pattern below matches any word that has the ABCCDE pattern:
(?:([A-z])(?!\1)([A-z])(?!\1|\2)([A-z])(?=\3)([A-z])(?!\1|\2|\3|\5)([A-z])(?!\1|\2|\3|\5|\6)([A-z]))
And this one matches ABCCBE:
(?:([A-z])(?!\1)([A-z])(?!\1|\2)([A-z])(?=\3)([A-z])(?=\2)([A-z])(?!\1|\2|\3|\5|\6)([A-z]))
To cover both of the above patterns, you can use:
(?:([A-z])(?!\1)([A-z])(?!\1|\2)([A-z])(?=\3)([A-z])(?(?=\2)|(?!\1\2\3\5))([A-z])(?!\1|\2|\3|\5|\6)([A-z]))
Going down this path, your challenge would be generating the above regex pattern from the alphabetic notation you used.
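A rough sketch of such a generator (my own simplification: it uses plain backreferences for repeated symbols instead of the lookahead style above, assumes lowercase words, and treats lowercase letters in the pattern as fixed characters as in the question):
import re

def pattern_to_regex(pattern):
    """Build a regex enforcing a strict one-to-one symbol-to-letter mapping."""
    groups = {}   # symbol -> capture group number
    taken = []    # backreferences a new symbol must not repeat
    parts = []
    for ch in pattern:
        if ch.islower():                  # fixed, known character
            parts.append(re.escape(ch))
        elif ch in groups:                # repeated symbol: same letter as before
            parts.append('\\%d' % groups[ch])
        else:                             # new symbol: any letter not used so far
            groups[ch] = len(groups) + 1
            lookahead = '(?!%s)' % '|'.join(taken) if taken else ''
            parts.append('%s([a-z])' % lookahead)
            taken.append('\\%d' % groups[ch])
    return '^%s$' % ''.join(parts)

print(pattern_to_regex('ABCCDE'))
# ^([a-z])(?!\1)([a-z])(?!\1|\2)([a-z])\3(?!\1|\2|\3)([a-z])(?!\1|\2|\3|\4)([a-z])$
print(bool(re.match(pattern_to_regex('ABCCDE'), 'kitten')))   # True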
And please note that you may want to use the i Regex flag when using these if case-insensitivity is a requirement.
For more Regex info, take a look at:
Look-around
Back-referencing

Diffing many texts against each other to derive template and data (finding common subsequences)

Suppose there are many texts that are known to be made from a single template (for example, many HTML pages, rendered from a template backed by data from some sort of database). A very simple example:
id:937 name=alice;
id:28 name=bob;
id:925931 name=charlie;
Given only these 3 texts, I'd like to get original template that looks like this:
"id:" + $1 + " name=" + $2 + ";"
and 3 sets of strings that were used with this template:
$1 = 937, $2 = alice
$1 = 28, $2 = bob
$1 = 925931, $2 = charlie
In other words, "template" is a list of the common subsequences encountered in all given texts always in a certain order and everything else except these subsequences should be considered "data".
I guess the general algorithm would be very similar to any LCS (longest common subsequence) algorithm, albeit with different backtracking code, that would somehow separate "template" (characters common for all given texts) and "data strings" (different characters).
Bonus question: are there ready-made solutions to do so?
I agree with the comments about the question being ill-defined. It seems likely that the format is much more specific than your general question indicates.
Having said that, something like RecordBreaker might be a help. You could also Google "wrapper induction" to see if you find some useful leads.
Perform a global multiple sequence alignment, and then call every resulting column that has a constant value part of the template:
id:   937 name=alice  ;
id:    28 name=bob    ;
id:925931 name=charlie;
Inferred template:
XXX      XXXXXX       X
Most tools that I'm aware of for multiple sequence alignment require smaller alphabets -- DNA or protein -- but hopefully you can find a tool that works on the alphabet you're using (which presumably is at least all printable ASCII characters). In the worst case, you can of course implement the DP yourself: to align 2 sequences (strings) globally you use the Needleman-Wunsch algorithm, while for more than two sequences there are several approaches, the most common being sum-of-pairs scoring. The exact algorithm for k > 2 sequences unfortunately takes time exponential in k, but the heuristics employed in bioinformatics tools such as MUSCLE are much faster, and produce alignments that are very nearly as good. If they can be persuaded to work with the alphabet you're using, they would be the natural choice.
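If a full multiple alignment is overkill, a greedy approximation with Python's difflib already recovers the template for the toy example above (a sketch only; joining the pieces between rounds is lossy, so it is not a substitute for a proper alignment):
# Keep only the pieces of the running template that also occur, in order, in the next text.
from difflib import SequenceMatcher

def common_pieces(a, b):
    sm = SequenceMatcher(None, a, b, autojunk=False)
    return [a[m.a:m.a + m.size] for m in sm.get_matching_blocks() if m.size]

texts = ["id:937 name=alice;", "id:28 name=bob;", "id:925931 name=charlie;"]
template = [texts[0]]
for t in texts[1:]:
    template = common_pieces("".join(template), t)
print(template)   # ['id:', ' name=', ';']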

Is there a way to split a large file into chunks with random sizes?

I know you can split a file with split, but for test purposes I would like to split a large file into chunks whose sizes differ. Is this possible?
Alternatively, if the above-mentioned file is a zip, is there a way to split it into volumes of unequal sizes?
Any suggestions welcome! Thanks!
So the general question that you're asking is: how can I compute N random integers that sum to S? Specifically, S is the size of your file and N is how many smaller files that you want to break it into.
For example, assume that you want to split your file into 4 parts. If a, b, c, and d are four random numbers, then:
a + b + c + d = X
a/X + b/X + c/X + d/X = 1
S*a/X + S*b/X + S*c/X + S*d/X = S
This gives us four scaled numbers, S*a/X, S*b/X, S*c/X, and S*d/X, that sum to S, the size of your file.
Which means you'd want to write a script that:
Computes N random numbers (any random numbers).
Computes X as the sum of those random numbers.
Multiplies each of those random numbers by S/X (and makes sure you're left with integers greater than 0 that sum to S)
Splits the original file into pieces using the generated random numbers as sizes, using whatever tool you want.
This is a little much for a shell script, but would be pretty straightforward in something like Perl.
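For illustration, a sketch of those steps in Python (the file naming and the rounding fix are my own choices; it assumes the file is much larger than the number of chunks):
# Split a file into n chunks of random sizes that sum to the original size.
import os
import random
import sys

def random_sizes(total, n):
    """Return n positive integers that sum to total (assumes total >> n)."""
    weights = [random.random() for _ in range(n)]
    scale = float(total) / sum(weights)
    sizes = [max(1, int(w * scale)) for w in weights]
    sizes[-1] += total - sum(sizes)   # absorb the rounding error in the last chunk
    return sizes

def split_random(path, n):
    with open(path, 'rb') as src:
        for i, size in enumerate(random_sizes(os.path.getsize(path), n), 1):
            with open('%s.part%03d' % (path, i), 'wb') as out:
                out.write(src.read(size))

if __name__ == '__main__':
    split_random(sys.argv[1], int(sys.argv[2]))   # e.g.: python split_random.py big.zip 4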
Since you tagged the question only with shell, I suppose you want to handle it with shell scripts and the common Linux commands/tools.
As far as I know there is no existing tool/command that can split a file randomly. To split a file, we can consider using split or dd.
Both tools support options such as how big each split file should be, or how many files to split into. Let's say we use dd/split to first split your file into 500 parts, each of the same size. So we have:
foo.zip.001
foo.zip.002
foo.zip.003
...
foo.zip.500
Then we take this file list as input and merge (cat) groups of them together. This step could be done with awk or a shell script.
For example, we can build a set of cat statements like:
cat foo.zip.001 foo.zip.002 > part1
cat foo.zip.003 foo.zip.004 foo.zip.005 > part2
cat foo.zip.006 foo.zip.007 foo.zip.008 foo.zip.009 > part3
....
Run the generated cat statements and you get the final part1..partN, each with a different size.
For example:
kent$ seq -f'foo.zip.%g' 20|awk 'BEGIN{i=k=2}NR<i{s=s sprintf("%s ",$0);next}{k++;i=(NR+k);print "cat "s$0" >part"k-2;s=""}'
cat foo.zip.1 foo.zip.2 >part1
cat foo.zip.3 foo.zip.4 foo.zip.5 >part2
cat foo.zip.6 foo.zip.7 foo.zip.8 foo.zip.9 >part3
cat foo.zip.10 foo.zip.11 foo.zip.12 foo.zip.13 foo.zip.14 >part4
cat foo.zip.15 foo.zip.16 foo.zip.17 foo.zip.18 foo.zip.19 foo.zip.20 >part5
How the performance is, you will have to test on your own... but at least this should work for your requirement.

decoding algorithm wanted

I receive encoded PDF files regularly. The encoding works like this:
the PDFs can be displayed correctly in Acrobat Reader
selecting all and copying the text via Acrobat Reader
and pasting it into a text editor
will show that the content is encoded
so, examples are:
13579 -> 3579;
hello -> jgnnq
it's basically an offset (maybe swap) of ASCII characters.
The question is how I can find the offset automatically when I have access to only a few samples. I cannot be sure whether the encoding offset changes from file to file. All I know is that some text will usually (if not always) show up, e.g. "Name:", "Summary:", "Total:", inside the PDF.
Thank you!
edit: thanks for the feedback. I'd try to break the question into smaller questions:
Part 1: How to detect identical part(s) inside string?
You need to brute-force it.
If the pattern is a simple character-code shift, like the +2 offset in your examples:
h i j
e f g
l m n
l m n
o p q
1 2 3
3 4 5
5 6 7
7 8 9
9 : ;
You could easily implement something like this to check against known words:
>>> text='jgnnq'
>>> knowns=['hello', '13579']
>>>
>>> for i in range(-5,+5): #check -5 to +5 char code range
...     rot=''.join(chr(ord(j)+i) for j in text)
...     for x in knowns:
...         if x in rot:
...             print rot
...
hello
Is the PDF going to contain symbolic (like math or proofs) or natural language text (English, French, etc)?
If the latter, you can use a frequency chart for letters (digraphs, trigraphs and a small dictionary of words if you want to go the distance). I think there are probably a few of these online. Here's a start. And more specifically letter frequencies.
Then, if you're sure it's a Caesar shift, you can grab the first 1000 characters or so and shift them forward by increasing amounts up to (I would guess) 127 or so. Take the resulting texts and calculate how close the frequencies match the average ones you found above. Here is information on that.
The letter-frequencies page linked above shows only letters, so you may want to exclude non-letter characters from your calculation, or better, find a chart that includes them. You may also want to transform the entire resulting text into lowercase or uppercase (your preference) to treat letters the same regardless of case.
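A sketch of that shift-and-score loop (the frequency table below is abbreviated and the junk penalty is my own rough heuristic, so treat the exact numbers as placeholders):
# Score candidate shifts by how "English" the decoded text looks.
ENGLISH = {'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
           's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'u': 2.8}

def englishness(text):
    """Reward common English letters, punish characters that look like junk."""
    letter_score = sum(ENGLISH.get(c, 0.0) for c in text.lower() if c.isalpha())
    junk = sum(1 for c in text if not (c.isalnum() or c.isspace() or c in '.,;:!?'))
    return (letter_score - 10.0 * junk) / max(len(text), 1)

def best_shift(ciphertext, max_shift=127):
    """Try every backward shift and keep the one that scores highest."""
    scored = ((englishness(''.join(chr((ord(c) - s) % 128) for c in ciphertext)), s)
              for s in range(max_shift + 1))
    return max(scored)[1]

# Demo: encode a plausible snippet with +2, then recover the shift.
sample = "Name: Alice Summary: quarterly totals attached Total: 1375"
cipher = ''.join(chr(ord(c) + 2) for c in sample)
print(best_shift(cipher))   # -> 2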
Edit - saw comment about character swapping
In this case, it's a substitution cipher, which can still be broken automatically, though this time you will probably want to have a digraph chart handy to do extra analysis. This is useful because there will quite possibly be a substitution that is "closer" to average language in terms of letter analysis than the correct one, but comparing digraph frequencies will let you rule it out.
Also, I suggested shifting the characters, then seeing how close the frequencies matched the average language frequencies. You can actually just calculate the frequencies in your ciphertext first, then try to line them up with the good values. I'm not sure which is better.
Hmmm, that's a tough one.
The only thing I can suggest is that using a dictionary (along with some substitution-cipher algorithms) may help in decoding some of the text.
But I cannot see a solution that will decode everything for you with the scenario you describe.
Why don't you paste some sample input and we can have a go at decoding it.
It's only possible when you have a lot of examples (enough examples that you can get all the combinations, or at least spot a linear dependency between the values, or form an idea of the scheme).
Also, this question: How would I reverse engineer a cryptographic algorithm? has some advice.
Do the encoded files open correctly in PDF readers other than Acrobat Reader? If so, you could just use a PDF library (e.g. PDF Clown) and use it to programmatically extract the text you need.

First-Occurrence Parallel String Matching Algorithm

To be up front, this is homework. That being said, it's extremely open ended and we've had almost zero guidance as to how to even begin thinking about this problem (or parallel algorithms in general). I'd like pointers in the right direction and not a full solution. Any reading that could help would be excellent as well.
I'm working on an efficient way to match the first occurrence of a pattern in a large amount of text using a parallel algorithm. The pattern is simple character matching, no regex involved. I've managed to come up with a possible way of finding all of the matches, but that then requires that I look through all of the matches and find the first one.
So the question is, will I have more success breaking the text up between processes and scanning that way? Or would it be best to have process-synchronized searching of some sort where the j'th process searches for the j'th character of the pattern? If all processes then return true for their match, the processes would change their position in matching said pattern and move up again, continuing until all characters have been matched and then returning the index of the first match.
What I have so far is extremely basic, and more than likely does not work. I won't be implementing this, but any pointers would be appreciated.
With p processors, a text of length t, and a pattern of length L, and a ceiling of L processors used:
for i = 0 to t - L:
    for j = 0 to p:
        processor j compares text[i+j] to pattern[j]
    On a false match:
        all processors terminate the current comparison, i++
    On a true match by all processors:
        iterate p characters at a time until L characters have been compared
        if all L comparisons return true:
            return i (position of pattern)
        else:
            i++
I am afraid that having the j'th process match the j'th character of the pattern will not do.
Generally speaking, early escaping is difficult, so you'd be better off breaking the text into chunks.
But let's ask Herb Sutter to explain searching with parallel algorithms first on Dr Dobbs. The idea is to use the non-uniformity of the distribution to have an early return. Of course Sutter is interested in any match, which is not the problem at hand, so let's adapt.
Here is my idea, let's say we have:
Text of length N
p Processors
heuristic: max is the maximum number of characters a chunk should contain, probably an order of magnitude greater than M the length of the pattern.
Now, what you want is to split your text into k equal chunks, where k is minimal and size(chunk) is maximal yet less than max.
Then, we have a classical Producer-Consumer pattern: the p processes are fed the chunks of text, each process looking for the pattern in the chunk it receives.
The early escape is done by having a flag. You can either set the index of the chunk in which you found the pattern (and its position), or you can just set a boolean, and store the result in the processes themselves (in which case you'll have to go through all the processes once they have stopped). The point is that each time a chunk is requested, the producer checks the flag, and stops feeding the processes if a match has been found (since the processes have been given the chunks in order).
Let's have an example, with 3 processors:
[ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ]
                      x       x
The chunks 6 and 8 both contain the string.
The producer will first feed 1, 2 and 3 to the processes, then each process will advance at its own rhythm (it depends on the similarity of the text searched and the pattern).
Let's say we find the pattern in 8 before we find it in 6. Then the process that was working on 7 ends and tries to get another chunk, the producer stops it --> it would be irrelevant. Then the process working on 6 ends, with a result, and thus we know that the first occurrence was in 6, and we have its position.
The key idea is that you don't want to look at the whole text! It's wasteful!
Given a pattern of length L, and searching in a string of length N over P processors, I would just split the string over the processors. Each processor would take a chunk of length N/P + L-1, with the last L-1 characters overlapping the string belonging to the next processor. Then each processor would perform Boyer-Moore (the two pre-processing tables would be shared). When each finishes, it returns the result to the first processor, which maintains a table:
Process Index
1 -1
2 2
3 23
After all processes have responded (or with a bit of thought you can have an early escape), you return the first match. This should be on average O(N/(L*P) + P).
The approach of having the i'th processor matching the i'th character would require too much inter process communication overhead.
EDIT: I realize you already have a solution, and are figuring out a way without having to find all solutions. Well I don't really think this approach is necessary. You can come up with some early escape conditions, they aren't that difficult, but I don't think they'll improve your performance that much in general (unless you have some additional knowledge the distribution of matches in your text).
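For what it's worth, a sketch of the chunk-splitting idea in Python (plain str.find per chunk instead of Boyer-Moore, and the whole text is shipped to every worker for brevity):
# Each worker scans one slice; slices overlap by L-1 characters so a match
# straddling a boundary is still found. The smallest hit position wins.
from concurrent.futures import ProcessPoolExecutor

def search_chunk(args):
    text, pattern, start, end = args
    return text.find(pattern, start, end)

def parallel_first_match(text, pattern, workers=4):
    n, l = len(text), len(pattern)
    step = -(-n // workers)                     # ceil(n / workers)
    jobs = [(text, pattern, i, min(n, i + step + l - 1)) for i in range(0, n, step)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        hits = [pos for pos in pool.map(search_chunk, jobs) if pos != -1]
    return min(hits) if hits else -1

if __name__ == '__main__':
    print(parallel_first_match("xxxzabcayyyabca", "abca"))   # -> 4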
