remove scaffold and other unplaced sequence before mapping?

remove scaffold and other unplaced sequence before mapping? - bioinformatics

I downloaded reference genomes from Ensembl (fasta format). But there are lots of sequences with name "dna:scaffold":
https://github.com/CTLife/TEMP/tree/master/RefGenomes
Such as Mouse_GRCm38 (mm10), except chromosome 1-19, Mt, X and Y; others should be removed before mapping ?
Such as Human_GRCh38.p5 (hg38), https://github.com/CTLife/TEMP/blob/master/RefGenomes/Human_GRCh38.p5.txt, there are 516 sequences. In addition to chromosome 1-22, Mt, X and Y; others (such as CHR_HG2241_PATCH and KI270728.1) should be removed before mapping ?

Related

Hashing a long integer ID into a smaller string

Here is the problem, where I need to transform an ID (defined as a long integer) to a smaller alfanumeric identifier. The details are the following:
Each individual on the problem as an unique ID, a long integer of size 13 (something like 123123412341234).
I need to generate a smaller representation of this unique ID, a alfanumeric string, something like A1CB3X. The problem is that 5 or 6 character length will not be enough to represent such a large integer.
The new ID (eg A1CB3X) should be valid in a context where we know that only a small number of individuals are present (less than 500). The new ID should be unique within that small set of individuals.
The new ID (eg A1CB3X) should be the result of a calculation made over the original ID. This means that taking the original ID elsewhere and applying the same calculation, we should get the same new ID (eg A1CB3X).
This calculation should occur when the individual is added to the set, meaning that not all individuals belonging to that set will be know at that time.
Any directions on how to solve such a problem?

Assuming that you don't need a formula that goes in both directions (which is impossible if you are reducing a 13-digit number to a 5 or 6-character alphanum string):
If you can have up to 6 alphanumeric characters that gives you 366 = 2,176,782,336 possibilities, assuming only numbers and uppercase letters.
To map your larger 13-digit number onto this space, you can take a modulo of some prime number slightly smaller than that, for example 2,176,782,317, the encode it with base-36 encoding.
alphanum_id = base36encode(longnumber_id % 2176782317)
For a set of 500, this gives you a
2176782317P500 / 2176782317500 chance of a collision
(P is permutation)

Best option is to change the base to 62 using case sensitive characters
If you want it to be shorter, you can add unicode characters. See below.
Here is javascript code for you: https://jsfiddle.net/vewmdt85/1/
function compress(n) {
var symbols = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïð'.split('');
var d = n;
var compressed = '';
while (d >= 1) {
compressed = symbols[(d - (symbols.length * Math.floor(d / symbols.length)))] + compressed;
d = Math.floor(d / symbols.length);
}
return compressed;
}
$('input').keyup(function() {
$('span').html(compress($(this).val()))
})
$('span').html(compress($('input').val()))

How about using some base-X conversion, for example 123123412341234 becomes 17N644R7CI in base-36 and 9999999999999 becomes 3JLXPT2PR?

If you need a mapping that works both directions, you can simply go for a larger base.
Meaning: using base 16, you can reduce 1 to 16 to a single character.
So, base36 is the "maximum" that allows for shorter strings (when 1-1 mapping is required)!

how bytes are used to store information in protobuf

i am trying to understand the protocol buffer here is the sample , what i am not be able to understand is how bytes are being used in following messages. i dont know what this number
1 2 3 is used for.
message Point {
required int32 x = 1;
required int32 y = 2;
optional string label = 3;
}
message Line {
required Point start = 1;
required Point end = 2;
optional string label = 3;
}
message Polyline {
repeated Point point = 1;
optional string label = 2;
}
i read following paragraph in google protobuf but not able to understand what is being said here , can anyone help me in understanding how bytes are being used to store info.
The " = 1", " = 2" markers on each element identify the unique "tag" that field uses in the binary encoding. Tag numbers 1-15 require one less byte to encode than higher numbers, so as an optimization you can decide to use those tags for the commonly used or repeated elements, leaving tags 16 and higher for less-commonly used optional element.

The general form of a protobuf message is that it is a sequence of pairs of the form:
field header
payload
For your question, we can largely forget about the payload - that isn't the bit that relates to the 1/2/3 and the <=16 restriction - all of that is in the field header. The field header is a "varint" encoded integer; "varint" uses the most-significant-bit as an optional continuation bit, so small values (<=127, assuming unsigned and not zig-zag) require one byte to encode - larger values require multiple bytes. Or in other words, you get 7 useful bits to play with before you need to set the continuation bit, requiring at least 2 bytes.
However! The field header itself is composed of two things:
the wire-type
the field-number / "tag"
The wire-type is the first 3 bits, and indicates the fundamental format of the payload - "length-delimited", "64-bit", "32-bit", "varint", "start-group", "end-group". That means that of the 7 useful bits we had, only 4 are left; 4 bits is enough to encode numbers <= 16. This is why field-numbers <= 16 are suggested (as an optimisation) for your most common elements.
In your question, the 1 / 2 / 3 is the field-number; at the time of encoding this is left-shifted by 3 and composed with the payload's wire-type; then this composed value is varint-encoded.

Protobuf stores the messages like a map from an id (the =1, =2 which they call tags) to the actual value. This is to be able to more easily extend it than if it would transfer data more like a struct with fixed offsets. So a message Point for instance would look something like this on a high level:
1 -> 100,
2 -> 500
Which then is interpreted as x=100, y=500 and label=not set. On a lower level, protobuf serializes this tag-value mapping in a highly compact format, which among other things, stores integers with variable-length encoding. The paragraph you quoted just highlights exactly this in the case of tags, which can be stored more compactly if they are < 16, but the same for instance holds for integer values in your protobuf definition.

How do I make a function use the altered version of a list in Mathematica?

I want to make a list with its elements representing the logic map given by
x_{n+1} = a*x_n(1-x_n)
I tried the following code (which adds stuff manually instead of a For loop):
x0 = Input["Enter x0"]
a = Input["a"]
M = {x0}
L[n_] := If[n < 1, x0, a*M[[n]]*(1 - M[[n]])]
Print[L[1]]
Append[M, L[1]]
Print[M]
Append[M, L[2]]
Print[M]
The output is as follows:
0.3
2
{0.3}
0.42
{0.3,0.42}
{0.3}
Part::partw: Part 2 of {0.3`} does not exist. >>
Part::partw: Part 2 of {0.3`} does not exist. >>
{0.3, 2 (1 - {0.3}[[2]]) {0.3}[[2]]}
{0.3}
It seems that, when the function definition is being called in Append[M,L[2]], L[2] is calling M[[2]] in the older definition of M, which clearly does not exist.
How can I make L use the newer, bigger version of M?
After doing this I could use a For loop to generate the entire list up to a certain index.
P.S. I apologise for the poor formatting but I could find out how to make Latex code work here.
Other minor question: What are the allowed names for functions and lists? Are underscores allowed in names?

It looks to me as if you are trying to compute the result of
FixedPointList[a*#*(1-#)&, x0]
Note:
Building lists element-by-element, whether you use a loop or some other construct, is almost always a bad idea in Mathematica. To use the system productively you need to learn some of the basic functional constructs, of which FixedPointList is one.
I'm not providing any explanation of the function I've used, nor of the interpretation of symbols such as # and &. This is all covered in the documentation which explains matters better than I can and with which you ought to become familiar.
Mathematica allows alphanumeric (only) names and they must start with a letter. Of course, Mathematic recognises many Unicode characters other than the 26 letters in the English alphabet as alphabetic. By convention (only) intrinsic names start with an upper-case letter and your own with a lower-case.
The underscore is most definitely not allowed in Mathematica names, it has a specific and widely-used interpretation as a short form of the Blank symbol.
Oh, LaTeX formatting doesn't work hereabouts, but Mathematica code is plenty readable enough.

It seems that, when the function definition is being called in
Append[M,L2], L2 is calling M[2] in the older definition of M,
which clearly does not exist.
How can I make L use the newer, bigger version of M?
M is never getting updated here. Append does not modify the parameters you pass to it; it returns the concatenated value of the arrays.
So, the following code:
A={1,2,3}
B=Append[A,5]
Will end up with B={1,2,3,5} and A={1,2,3}. A is not modfied.
To analyse your output,
0.3 // Output of x0 = Input["Enter x0"]. Note that the assignment operator returns the the assignment value.
2 // Output of a= Input["a"]
{0.3} // Output of M = {x0}
0.42 // Output of Print[L[1]]
{0.3,0.42} // Output of Append[M, L[1]]. This is the *return value*, not the new value of M
{0.3} // Output of Print[M]
Part::partw: Part 2 of {0.3`} does not exist. >> // M has only one element, so M[[2]] doesn't make sense
Part::partw: Part 2 of {0.3`} does not exist. >> // ditto
{0.3, 2 (1 - {0.3}[[2]]) {0.3}[[2]]} (* Output of Append[M, L[2]]. Again, *not* the new value of M *)
{0.3} // Output of Print[M]
The simple fix here is to use M=Append[M, L[1]].
To do it in a single for loop:
xn=x0;
For[i = 0, i < n, i++,
M = Append[M, xn];
xn = A*xn (1 - xn)
];
A faster method would be to use NestList[a*#*(1-#)&, x0,n] as a variation of the method mentioned by Mark above.
Here, the expression a*#*(1-#)& is basically an anonymous function (# is its parameter, the & is a shorthand for enclosing it in Function[]). The NestList method takes a function as one argument and recursively applies it starting with x0, for n iterations.
Other minor question: What are the allowed names for functions and lists? Are underscores allowed in names?
No underscores, they're used for pattern matching. Otherwise a variable can contain alphabets and special characters (like theta and all), but no characters that have a meaning in mathematica (parentheses/braces/brackets, the at symbol, the hash symbol, an ampersand, a period, arithmetic symbols, underscores, etc). They may contain a dollar sign but preferably not start with one (these are usually reserved for system variables and all, though you can define a variable starting with a dollar sign without breaking anything).

Last byte in Huffman compression

I am wondering about what is the best way to handle the last byte in Huffman Copression. I have some nice code in C++, that can compress text files very well, but currently I must write to my coded file also number of coded chars (well, it equal to input file size), because of no idea how to handle last byte better.
For example, last char to compress is 'a', which code is 011 and I am just starting new byte to write, so the last byte will look like:
011 + some 5 bits of trash, I am making them zeros for example at the end.
And when I am encoding this coded file, it may happen that code 00000 (or with less zeros) is code for some char, so I will have some trash char at the end of my encoded file.
As I wrote in first paragraph, I am avoiding this by saving numbers of chars of input file in coded file, and while encoding, I am reading the coded file to reach that number (not to EndOfFile, to don't get to those example 5 zeros).
It's not really efficient, size of coded file is increased for long number.
How can I handle this in better way?

Your approach (write the number of encoded bytes the to the file) is a perfectly reasonable approach. If you want to try a different avenue, you could consider inventing a new "pseudo-EOF" character that marks the end of the input (I'll denote it as &square;). Whenever you want to compress a string s, you instead compress the string s&square;. This means that when you build up your encoding tree, you would include one copy of the &square; character so that you have a unique encoding for &square;. Then, when you write out the string to the file, you would write out the bits characters of the string as normal, then write out the bit pattern for &square;. If there are leftover bits, you can just leave them set arbitrarily.
The advantage to this approach is that as you decode the file, if at any point you find the &square; character, you can immediately stop decoding bits because you know that you have hit the end of the file. This does not require you to store the number of bytes that were written out anywhere - the encoding implicitly marks its own endpoint.
The disadvantage to this setup is that it might increase the length of the bit patterns used by certain characters, since you will need to assign a bit pattern to &square; in addition to all the other characters.
I teach an introductory programming course and we use Huffman encoding as one of our assignments. We have students use the above approach, since it's a bit easier than having to write out the number of bits or bytes before the file contents. For more details, you could take a look at this handout or these lecture slides from the course.
Hope this helps!

I know this is an old question, but still, there's an alternate, so it might help someone.
When you're writing your compressed file to output, you probably have some integer keeping track of where you are in the current byte (for bit shifting).
char c, p;
p = '\0';
int curr = 7;
while (infile.get(c))
{
std::string trav = GetTraversal(c);
for (int i = 0; i < trav.size(); i++)
{
if (trav[i] == '1')
p += (1 << curr);
if (--curr < 0)
{
outfile.put(p);
p = '\0';
curr = 7;
}
}
}
if (curr < 7)
outfile.put(p);
At the end of this block, (curr+1)%8 equals the number of trash bits in the last data byte. You can then store it at the end as a single extra byte, and just keep it in mind when you're decompressing.

Convert Enum to Binary (via Integer or something similar)

I have an Ada enum with 2 values type Polarity is (Normal, Reversed), and I would like to convert them to 0, 1 (or True, False--as Boolean seems to implicitly play nice as binary) respectively, so I can store their values as specific bits in a byte. How can I accomplish this?

An easy way is a lookup table:
Bool_Polarity : constant Array(Polarity) of Boolean
:= (Normal=>False, Reversed => True);
then use it as
B Boolean := Bool_Polarity(P);
Of course there is nothing wrong with using the 'Pos attribute, but the LUT makes the mapping readable and very obvious.
As it is constant, you'd like to hope it optimises away during the constant folding stage, and it seems to: I have used similar tricks compiling for AVR with very acceptable executable sizes (down to 0.6k to independently drive 2 stepper motors)

3.5.5 Operations of Discrete Types include the function S'Pos(Arg : S'Base), which "returns the position number of the value of Arg, as a value of type universal integer." Hence,
Polarity'Pos(Normal) = 0
Polarity'Pos(Reversed) = 1
You can change the numbering using 13.4 Enumeration Representation Clauses.
...and, of course:
Boolean'Val(Polarity'Pos(Normal)) = False
Boolean'Val(Polarity'Pos(Reversed)) = True

I think what you are looking for is a record type with a representation clause:
procedure Main is
type Byte_T is mod 2**8-1;
for Byte_T'Size use 8;
type Filler7_T is mod 2**7-1;
for Filler7_T'Size use 7;
type Polarity_T is (Normal,Reversed);
for Polarity_T use (Normal => 0, Reversed => 1);
for Polarity_T'Size use 1;
type Byte_As_Record_T is record
Filler : Filler7_T;
Polarity : Polarity_T;
end record;
for Byte_As_Record_T use record
Filler at 0 range 0 .. 6;
Polarity at 0 range 7 .. 7;
end record;
for Byte_As_Record_T'Size use 8;
function Convert is new Ada.Unchecked_Conversion
(Source => Byte_As_Record_T,
Target => Byte_T);
function Convert is new Ada.Unchecked_Conversion
(Source => Byte_T,
Target => Byte_As_Record_T);
begin
-- TBC
null;
end Main;
As Byte_As_Record_T & Byte_T are the same size, you can use unchecked conversion to convert between the types safely.
The representation clause for Byte_As_Record_T allows you to specify which bits/bytes to place your polarity_t in. (i chose the 8th bit)
My definition of Byte_T might not be what you want, but as long as it is 8 bits long the principle should still be workable. From Byte_T you can also safely upcast to Integer or Natural or Positive. You can also use the same technique to go directly to/from a 32 bit Integer to/from a 32 bit record type.

Two points here:
1) Enumerations are already stored as binary. Everything is. In particular, your enumeration, as defined above, will be stored as a 0 for Normal and a 1 for Reversed, unless you go out of your way to tell the compiler to use other values.
If you want to get that value out of the enumeration as an Integer rather than an enumeration value, you have two options. The 'pos() attribute will return a 0-based number for that enumeration's position in the enumeration, and Unchecked_Conversion will return the actual value the computer stores for it. (There is no difference in the value, unless an enumeration representation clause was used).
2) Enumerations are nice, but don't reinvent Boolean. If your enumeration can only ever have two values, you don't gain anything useful by making a custom enumeration, and you lose a lot of useful properties that Boolean has. Booleans can be directly selected off of in loops and if checks. Booleans have and, or, xor, etc. defined for them. Booleans can be put into packed arrays, and then those same operators are defined bitwise across the whole array.
A particular pet peeve of mine is when people end up defining themselves a custom boolean with the logic reversed (so its true condition is 0). If you do this, the ghost of Ada Lovelace will come back from the grave and force you to listen to an exhaustive explanation of how to calculate Bernoulli sequences with a Difference Engine. Don't let this happen to you!
So if it would never make sense to have a third enumeration value, you just name objects something appropriate describing the True condition (eg: Reversed_Polarity : Boolean;), and go on your merry way.

It seems all I needed to do was pragma Pack([type name]); (in which 'type name' is the type composed of Polarity) to compress the value down to a single bit.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

remove scaffold and other unplaced sequence before mapping? - bioinformatics

Related

Hashing a long integer ID into a smaller string

how bytes are used to store information in protobuf

How do I make a function use the altered version of a list in Mathematica?

Last byte in Huffman compression

Convert Enum to Binary (via Integer or something similar)

Categories

Resources