Calculate (mean) sequence divergence for many sequences - bioinformatics

I have ~13K sequences of ~120 bases each, and I want to compare them to find things like conserved regions, the mean divergence between them, or highly divergent outliers.
The problem is that with this number of sequences, the approaches I've tried aren't feasible.
Has anyone done something similar at this scale who can give me some hints on how to achieve it? Or maybe just some pointers on where to look?

Use the dnadist program from the PHYLIP package. The Biopython library offers some help for dealing with the PHYLIP alignment format here.
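As a rough sketch of how this could look in Python with Biopython and NumPy (assuming the sequences are already aligned and stored in a hypothetical PHYLIP file called alignment.phy), you could load the alignment into a character matrix, score per-column conservation, and estimate the mean pairwise divergence from a random sample of pairs instead of the full 13K x 13K distance matrix:

import numpy as np
from Bio import AlignIO

# Assumed input: an aligned PHYLIP file (hypothetical file name).
alignment = AlignIO.read("alignment.phy", "phylip")

# Turn the alignment into a (num_sequences x alignment_length) character matrix.
seqs = np.array([list(str(rec.seq)) for rec in alignment])
n, length = seqs.shape

# Per-column conservation: fraction of sequences sharing the most common base.
conservation = np.array([
    np.max(np.unique(seqs[:, col], return_counts=True)[1]) / n
    for col in range(length)
])
print("Most conserved columns:", np.argsort(conservation)[-10:])

# Mean pairwise divergence, estimated from a random sample of sequence pairs.
rng = np.random.default_rng(0)
samples = 100_000
i = rng.integers(0, n, samples)
j = rng.integers(0, n, samples)
keep = i != j
divergence = (seqs[i[keep]] != seqs[j[keep]]).mean(axis=1)
print("Estimated mean divergence:", divergence.mean())

# Per-sequence mean divergence against a random subset, to flag outliers.
subset = seqs[rng.choice(n, size=min(200, n), replace=False)]
per_seq = np.array([(subset != s).mean() for s in seqs])
print("Most divergent sequences:", np.argsort(per_seq)[-10:])

Sampling pairs keeps the cost proportional to the number of samples drawn, which is what makes this feasible at ~13K sequences; a full dnadist-style distance matrix would have roughly 169 million entries.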

Related

Detecting the presence of small details in an image

I'd like to detect regions in an image which contain a comparatively large amount of small detail, but equally I need to ignore strong edges. For example, I would like to (approximately) identify regions of small text on a poster attached to a building, while ignoring the strong edges of the building itself.
I guess I'm probably looking for specific frequency bands, so approaches that spring to mind include: hand-tuning convolution kernels until I hit what I need, using specific DCT coefficients, or applying a histogram to directional filter responses. But perhaps I'm missing something more obvious?
To answer a question in the comments below: I'm developing in MATLAB.
I'm open to any suggestions for how to achieve this - thanks!
Here is something unscientific, but maybe not bad to get folks talking. I start with this image.
and use the excellent, free ImageMagick to divide it up into 400x400-pixel tiles, like this:
convert -crop 400x400 cinema.jpg tile%d.jpg
Now I measure the entropy of each tile, and sort by increasing entropy:
for f in tile*.jpg; do
    convert "$f" -print '%[entropy] %f\n' null:
done | sort -n
and I get this output:
0.142574 tile0.jpg
0.316096 tile15.jpg
0.412495 tile9.jpg
0.482801 tile5.jpg
0.515268 tile4.jpg
0.534078 tile18.jpg
0.613911 tile12.jpg
0.629857 tile14.jpg
0.636475 tile11.jpg
0.689776 tile17.jpg
0.709307 tile10.jpg
0.710495 tile16.jpg
0.824499 tile6.jpg
0.826688 tile3.jpg
0.849991 tile8.jpg
0.851871 tile1.jpg
0.863232 tile13.jpg
0.917552 tile7.jpg
0.971176 tile2.jpg
So, if I look at the last 3 (i.e. those with the most entropy), I get:
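The same tiling-and-entropy idea can be sketched in Python with Pillow and NumPy (assuming the same cinema.jpg input; note that the entropy here is in bits, so the scale differs from ImageMagick's normalized %[entropy] values):

import numpy as np
from PIL import Image

TILE = 400  # tile size in pixels, matching the ImageMagick example

def shannon_entropy(gray_tile):
    # Shannon entropy of an 8-bit grayscale tile, in bits (0..8).
    hist, _ = np.histogram(gray_tile, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

img = np.asarray(Image.open("cinema.jpg").convert("L"))
h, w = img.shape

scores = []
for y in range(0, h, TILE):
    for x in range(0, w, TILE):
        tile = img[y:y + TILE, x:x + TILE]
        scores.append((shannon_entropy(tile), (x, y)))

# The highest-entropy tiles are the most "detailed" candidates.
for entropy, (x, y) in sorted(scores, reverse=True)[:3]:
    print(f"tile at ({x},{y}): entropy {entropy:.3f} bits")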
The question itself is too broad for me to answer properly without writing a paper. That being said, I can offer you some advice on narrowing it down.
First off, go to Google Scholar and search for the keywords your work revolves around. In your case, one of them would probably be edge detection.
Look through the most recent papers (no more than 5 years old) for work that satisfies your needs. If you don't find anything, expand the search criteria or try different terms.
If you have something more specific, please edit your question and let me know.
Always remember to split the big question into smaller chunks and then split them into even smaller chunks, until you have a plate of delicious, manageable bites.
EDIT: From what I've gathered, you're interested in an edge detection and feature selection algorithm? Here are a couple of links which might prove useful:
-MATLAB feature detection
-MATLAB edge detection
Also, this MATLAB edge detection write-up, which is part of their extensive documentation, will hopefully prove useful enough to get you digging through the MATLAB Image Processing Toolbox documentation for specific answers to your question.
You'll find Maximally Stable Extremal Regions (MSER) useful for this. You should be able to impose an area constraint to filter out large MSERs and then calculate an MSER density, for example by dividing the image into tiles as Mark did in his answer.
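A rough sketch of that idea with OpenCV's Python bindings might look like this (the image name, area bound, and tile size are placeholders to tune):

import cv2
import numpy as np

# Hypothetical input image; MAX_AREA and TILE below are placeholder values.
gray = cv2.imread("poster.jpg", cv2.IMREAD_GRAYSCALE)

mser = cv2.MSER_create()
regions, _ = mser.detectRegions(gray)

# Keep only small regions; large MSERs (e.g. building edges) are discarded.
MAX_AREA = 200
small = [r for r in regions if len(r) <= MAX_AREA]

# Count small MSERs per tile to get a density map, as in the tiling answer above.
TILE = 100
h, w = gray.shape
density = np.zeros((h // TILE + 1, w // TILE + 1), dtype=int)
for r in small:
    cx, cy = r.mean(axis=0).astype(int)  # region centroid (x, y)
    density[cy // TILE, cx // TILE] += 1

print("Tile with most small regions:", np.unravel_index(density.argmax(), density.shape))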

Estimate good parameters for Algorithms with lots of arguments (Like for MSER in OpenCV)

I was wondering if there is a better way to estimate a good set of parameters for algorithms with lots of arguments than just picking them randomly. Specifically, I am trying to find good parameters for the MSER feature detector, which takes nine numeric parameters, so there is a huge space to search. I was thinking of alternately picking smaller and larger values around each default parameter value, with exponentially growing distance. Are there any ideas that could help me?
Thanks!
First, you must define an objective function you want to optimize - what makes one set of parameters "better" than another? In your case, I'd suggest using the number of correct matches found, or something similar.
Second, you need an efficient way of searching over the virtually uncountable possibilities. Here it probably helps that there is a minimal step size below which the results don't meaningfully change. Since the objective function is not necessarily differentiable, I'd use a method similar to golden-section search in each dimension separately, and then repeat, until hopefully a global "good enough" maximum is reached.
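A minimal Python sketch of that coordinate-wise golden-section idea (the objective function, bounds, and starting values below are hypothetical placeholders; in practice the objective would run MSER with the candidate parameters and count correct matches):

import math

GOLDEN = (math.sqrt(5) - 1) / 2  # ~0.618

def golden_section_max(f, lo, hi, tol=1e-2):
    # Maximize a 1-D function on [lo, hi] by golden-section search.
    a, b = lo, hi
    while b - a > tol:
        c = b - GOLDEN * (b - a)
        d = a + GOLDEN * (b - a)
        if f(c) < f(d):
            a = c
        else:
            b = d
    return (a + b) / 2

def coordinate_search(objective, params, bounds, sweeps=3):
    # Optimize one parameter at a time, repeating a few sweeps over all of them.
    params = list(params)
    for _ in range(sweeps):
        for idx, (lo, hi) in enumerate(bounds):
            def f(x, idx=idx):
                trial = params.copy()
                trial[idx] = x
                return objective(trial)
            params[idx] = golden_section_max(f, lo, hi)
    return params

# Hypothetical objective: in practice, run MSER with these parameters and
# return e.g. the number of correct matches on a labelled data set.
def objective(p):
    return -sum((x - 1.0) ** 2 for x in p)

bounds = [(0.0, 5.0)] * 3  # placeholder bounds, one (lo, hi) pair per parameter
print(coordinate_search(objective, [2.5, 2.5, 2.5], bounds))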

Re-arrange the picture

This question was asked in a recent interview. Please suggest something:
A 16x16 picture is divided into 4x4 pieces (16 pieces) and shuffled. Suggest an algorithm to rearrange it back.
If it's a software-engineering type of problem and you divide it yourself, you can cheat and store each location with each piece. ;)
They're probably looking for some pattern-matching solution, though. Perhaps compare the outermost row of pixels on each side (top/bottom/left/right) of a piece with the corresponding sides of the other pieces (with a certain tolerance). Each side gets a score against the others, and pieces are progressively matched until all are placed.
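Here is a minimal Python/NumPy sketch of that border-scoring idea (the pieces below are randomly generated placeholders; a real solution would add an assembly step and also score vertical neighbours):

import numpy as np

def border(piece, side):
    # Return the strip of border pixels on the given side of a piece.
    return {"top": piece[0, :], "bottom": piece[-1, :],
            "left": piece[:, 0], "right": piece[:, -1]}[side]

def edge_cost(piece_a, piece_b):
    # Cost of placing piece_b immediately to the right of piece_a
    # (sum of squared differences along the touching borders).
    a = border(piece_a, "right").astype(float)
    b = border(piece_b, "left").astype(float)
    return float(((a - b) ** 2).sum())

# Hypothetical shuffled pieces: 16 grayscale tiles of size 4x4.
rng = np.random.default_rng(0)
pieces = [rng.integers(0, 256, (4, 4)) for _ in range(16)]

# Score every ordered pair; a low cost means "likely adjacent horizontally".
n = len(pieces)
cost = np.full((n, n), np.inf)
for i in range(n):
    for j in range(n):
        if i != j:
            cost[i, j] = edge_cost(pieces[i], pieces[j])

best_i, best_j = np.unravel_index(np.argmin(cost), cost.shape)
print(f"Most compatible horizontal pair: {best_i} -> {best_j}")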
Without going into the pixel-matching algorithms, I think I would take a bottom-up dynamic programming approach here: first find 8 pairs of pieces that are most likely adjacent, then try to build the whole picture from those smaller subsets.
I hope each of these pieces has an identifier (like a number used to order/rearrange them). Then I can think of this problem as analogous to the reception of UDP packets (UDP packets may be received out of order, and then we need to reorder them).
So any sorting algorithm should work.
Please correct me if I have misunderstood the question.
Assuming nothing is available except the pixels of the pieces, this is a great approach to solving it probabilistically:
http://people.csail.mit.edu/taegsang/JigsawPuzzle.html

How to calculate minimal waste when cutting tubes

I have a rather mathematical problem I need to solve:
The task is to cut a predefined number of tubes out of fixed-length stock tubes with a minimum amount of waste material.
So let's say I want to cut 10 tubes of 1 m and 20 tubes of 2.5 m out of stock tubes with a standardized length of 6 m.
I'm not sure what an algorithm for this kind of problem would look like.
I was thinking of creating a list of combinations of the different-sized tubes, fitting them into the standard-sized tubes, and then choosing the combination with the minimal waste.
First, I'm not sure whether there are other, better ways to attack the problem.
Second, I haven't found out HOW I would create such a list of combinations.
Any help is greatly appreciated, thanks!
I believe you are describing the cutting stock problem. Some additional information can be found here.
This is known as the Cutting Stock problem. Wikipedia has a number of references that might help you find clues to an algorithm that works.
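As an illustration of the "list of combinations" idea from the question, here is a Python sketch that enumerates every way to cut the 1 m and 2.5 m pieces from one 6 m stock tube, together with the waste of each pattern (numbers taken from the example above; a real solver would then pick a set of patterns covering the demand of 10 and 20 pieces, e.g. with an integer-programming or first-fit-decreasing step):

STOCK = 6.0
PIECES = [1.0, 2.5]  # piece lengths from the example, in metres

def patterns(remaining, idx=0, current=None):
    # Enumerate every way to cut the given piece lengths from one stock tube.
    current = current if current is not None else [0] * len(PIECES)
    if idx == len(PIECES):
        yield tuple(current), remaining  # (count per piece length, leftover waste)
        return
    max_count = int(remaining // PIECES[idx] + 1e-9)
    for count in range(max_count + 1):
        nxt = current.copy()
        nxt[idx] = count
        yield from patterns(remaining - count * PIECES[idx], idx + 1, nxt)

for counts, waste in sorted(patterns(STOCK), key=lambda p: p[1]):
    print(f"pattern {counts}: waste {waste:.2f} m")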

How to quantify the quality of a pseudorandom number generator?

This is based on this question. A number of answers were proposed that generate non-uniform distributions, and I started wondering how to quantify the non-uniformity of the output. I'm not looking for patterning issues, just single-value aspects.
What are the accepted procedures?
My current thinking is to compute the average Shannon entropy per call by computing the entropy of each value and taking a weighted average. This can then be compared to the expected value.
My concerns are:
1. Is this correct?
2. How do I compute these values without losing precision?
For #1, I'm wondering if I've got it correct.
For #2, the concern is that I would be processing numbers with magnitudes like 1/7 +/- 1e-18, and I'm worried that floating-point errors will kill me for anything but the smallest problems. The exact form of the computation could make a major difference here, and I seem to recall that there are some ASM options for special log cases, but I can't seem to find the docs about this.
In this case the use case is to take a "good" PRNG for the range [1,n] and generate an SRNG for the range [1,m]. The question is how much worse the results are than the input.
What I have is expected occurrence rates for each output value.
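For illustration, here is a minimal Python sketch of the entropy comparison described above, using math.fsum to limit accumulation error (the occurrence rates are placeholder values):

import math

def shannon_entropy(probs):
    # Shannon entropy in bits; math.fsum reduces floating-point accumulation error.
    return -math.fsum(p * math.log2(p) for p in probs if p > 0)

# Placeholder occurrence rates for an m-value output, e.g. m = 7 with tiny biases.
m = 7
rates = [1/7 + 1e-18, 1/7 - 1e-18] + [1/7] * 5

actual = shannon_entropy(rates)
ideal = math.log2(m)  # entropy of a perfectly uniform generator over [1, m]
print("actual :", actual)
print("ideal  :", ideal)
print("deficit:", ideal - actual)
# For deficits near 1e-18, double precision is not enough; an arbitrary-precision
# library such as mpmath may be needed for the logarithms and the sum.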
NIST has a set of documents and tools for statistically analyzing random number generators cross a variety of metrics.
http://csrc.nist.gov/groups/ST/toolkit/rng/index.html
Many of these tests are also incorporated into the Dieharder PRNG test suite.
http://www.phy.duke.edu/~rgb/General/rand_rate.php
There are a ton of different metrics, because there are many, many different ways to use PRNGs. You can't analyze a PRNG in a vacuum - you have to understand the use case. These tools and documents provide a lot of information to help you with this, but at the end of the day you'll still have to understand what you actually need before you can determine whether the algorithm is suitable. The NIST documentation is thorough, if somewhat dense.
-Adam
This page discusses one way of checking if you are getting a bad distribution: plotting the pseudo-random values in a field and then just looking at them.
TestU01 has an even more exacting test set than Dieharder. The largest test set is called "BigCrush", but it takes a long time to execute, so there are also subsets called just "Crush" and "SmallCrush". The idea is to first try SmallCrush, and if the PRNG passes that, try Crush, and if it passes that, BigCrush. If it passes that too, it should be good enough.
You can get TestU01 here.

Resources