Should you re-combine split fastq files after genome assembly? [closed]

I've split a large fastq file into 6 or 7 smaller, more 'manageable' files for genome assembly.
Would it be 'biologically correct' to now re-combine the output files (contigs.fasta)? Is there a more meaningful way to do this?

The best practice is to use an assembler that can handle large fastq files, on adequate hardware, that is, with plenty of RAM and fast I/O. Let the assembler software itself parallelize the assembly process if the input is larger than fits in RAM. Prefer this single-input approach to splitting the input into parts, assembling each part separately, and then "assembling" the partial outputs/contigs.
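As a concrete illustration, here is a minimal sketch (Python, with hypothetical file names and parameters) of that single-input approach: recombine the split FASTQ parts, assuming the split was made on 4-line record boundaries, and run the assembler once on the full dataset. The SPAdes invocation is just one possible example; adjust it to your read layout and hardware.

    import glob
    import shutil
    import subprocess

    # Recombine the split parts back into one input file
    parts = sorted(glob.glob("reads_part_*.fastq"))  # hypothetical names of the 6-7 parts
    with open("reads_all.fastq", "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # plain concatenation; records must not span files

    # One possible single-run invocation (-s unpaired reads, -t threads, -m RAM cap in GB)
    subprocess.run(
        ["spades.py", "-s", "reads_all.fastq", "-o", "assembly_out", "-t", "16", "-m", "128"],
        check=True,
    )

This way the assembler sees every read at once and can build its graph over the whole dataset, instead of producing partial contig sets that would then have to be merged.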
References:
Dominguez Del Angel, V., Hjerde, E., Sterck, L., Capella-Gutierrez, S., Notredame, C., Vinnere Pettersson, O., Amselem, J., Bouri, L., Bocs, S., Klopp, C., Gibrat, J. F., Vlasova, A., Leskosek, B. L., Soler, L., Binzer-Panchal, M., & Lantz, H. (2018). Ten steps to get started in Genome Assembly and Annotation. F1000Research, 7, ELIXIR-148. https://doi.org/10.12688/f1000research.13598.1 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5850084/)
For genome assembly, running times and memory requirements will increase with the amount of data. As more data is needed for large genomes, there is thus also a correlation between genome size and running time/memory requirements. Only a small subset of available assembly programs can distribute the assembly into several processes and run them in parallel on several compute nodes. Tools that cannot do this tend to require a lot of memory on a single node, while programs that can split the process need less memory in each individual node, but do on the other hand work most efficiently when several nodes are available. It is therefore important to select the proper assembly tools early in a project, and make sure that there are enough available compute resources of the right type to run these tools.

Related

Real running time calculation of matrix multiplication [closed]

I would like to estimate, approximately, the running time of a matrix multiplication problem. Below are my assumptions:
No parallel programming
A 2 GHz CPU
A square matrix of size n
An O(n^3) algorithm
For example, suppose that n = 1000. Approximately how much time should I expect squaring this matrix to take under the above assumptions?
This depends terribly on the algorithm and the CPU. Even without parallelization, there is a lot of freedom in how the same steps are represented on a CPU, and there are differences (in the clock cycles needed for various operations) between different CPUs of the same family, too. Don't forget, either, that modern CPUs add some parallelization of instructions on their own.
Optimization done by the compiler also makes a difference: it reorders memory accesses and branches, and will likely convert instructions to vectorized ones even if you didn't ask for that. Depending on further factors, it may also matter whether your matrices sit at a fixed location in memory or are accessed through a pointer, and whether they are allocated with a fixed size or each row/column dynamically.
And don't forget about memory caching, page invalidations, and operating system scheduling, as I did in previous versions of my answer.
If this is for your own rough estimate or for a "typical" case, you won't go far wrong by simply writing the program, running it under your specific conditions (as discussed above) in many repetitions for n = 1000, and calculating the average.
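For instance, a minimal Python sketch of such a benchmark (note that NumPy's @ operator calls an optimized BLAS routine, so this times the fast library path; to time the textbook O(n^3) algorithm you would write the triple loop yourself):

    import time
    import numpy as np

    n = 1000
    a = np.random.rand(n, n)

    a @ a  # warm-up run, so one-time costs don't skew the measurement

    runs = 10
    start = time.perf_counter()
    for _ in range(runs):
        a @ a
    elapsed = (time.perf_counter() - start) / runs
    print(f"average time to square a {n}x{n} matrix: {elapsed:.4f} s")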
If you want a lot of hard work for a worse result, you can actually do yourself what you probably meant in your original question:
see what instructions your specific compiler produces for your specific algorithm under your specific conditions and with specific optimization settings (like here)
pick your specific processor and find its latency table for every instruction that's there,
add them up per iteration and multiply by 1000^3,
divide by the clock frequency.
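As a ballpark illustration (the figures are made up): the naive algorithm performs n^3 = 10^9 multiply-add iterations; if each iteration cost, say, 4 cycles on the 2 GHz CPU, that would be 4 * 10^9 / (2 * 10^9) = 2 seconds. The true per-iteration cost can easily be off by an order of magnitude in either direction, for all the reasons above.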
Seriously, it's not worth the effort; a benchmark is faster, clearer, and more precise anyway, since the manual count does not account for what happens in the branch predictor, hyperthreading, memory caching, and other architectural details. If you want an exercise, I'll leave that to you.

Alternative algorithms to generate random values in VHDL? [closed]

I have been using an LFSR implemented according to a primitive polynomial, but as you know, an LFSR steps through a fixed set of values in a repeating order, which means it is not truly random!
One solution that keeps the LFSR, yet assures a truly random value, would be some dynamic way of reading the values output from the LFSR, but I can't figure out how to do this in hardware (VHDL)!
I am therefore after an alternative way of producing a truly random, unpredictable, non-repeating value of a defined length, i.e. 10 bits.
Any suggestions? I am planning to implement them in VHDL!
Generating TRUE random numbers is actually a field of research in its own right. Basically, you need to gather information about some seemingly random natural phenomenon via some kind of sensor. Hardware and software are, for the moment, deterministic, so the same input will always produce the same output. Gathering external sensor information can "randomize" your input.
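To make the premise concrete, here is a minimal sketch (in Python rather than VHDL, purely as illustration) showing that a maximal-length 10-bit LFSR repeats after exactly 2**10 - 1 = 1023 steps, whatever the non-zero seed:

    def lfsr_period(taps, seed, nbits):
        # Count how many shifts a Fibonacci LFSR needs to return to its seed state
        state, period = seed, 0
        while True:
            feedback = 0
            for t in taps:
                feedback ^= (state >> t) & 1  # XOR the tapped bits
            state = ((state << 1) | feedback) & ((1 << nbits) - 1)
            period += 1
            if state == seed:
                return period

    # Taps (9, 6) correspond to a primitive degree-10 trinomial,
    # so the period is maximal: 2**10 - 1 = 1023
    print(lfsr_period(taps=(9, 6), seed=1, nbits=10))

However long the register, the sequence always comes back around, which is exactly why a true entropy source is needed.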
Here is some reading: https://en.wikipedia.org/wiki/Pseudorandom_number_generator
Also, here is a practical example of using external sensor input, from a peer-reviewed journal article titled Random Number Generated from White Noise of Webcam, with a short nugget of info from the abstract:
Random number generators play a very import role in modern cryptography, especially in personal information security. For example, to generate random number from white noise of webcam is a new approach for personal device. Through our algorithms, 91% IPcam generating sequences pass at least four statistical tests, 87% pass all five ones has been approved. Compared with webcam and video respectively, on the contrary, the possibility for both generating sequences to pass all five statistical tests is roughly 80%. The result implies improvement by algorithm on personal devices such as laptop, for instance, is necessary to generate qualified random number to protect private information.

Clustering of a graph of file names to regroup files in folders [closed]

I'm looking for known good algorithms for (fuzzy) clustering of similar file names found in a hierarchy of folders.
To remain within SO rules and spirit, let me explain the context in detail, so that your answers can be concise rather than generic.
Context
My goal is to develop an application which:
takes a set of files (content and names)
compares filenames to identify clusters
compares contents to find duplicates (this is off scope)
suggests file deletions and regrouping based on the identified clusters and identical contents.
For example, given 3 folders:
folder 1: file_1, file_7, file_23, ...
folder 2: duplicate of file_1, ...
folder 3: file_5, ...
The application would suggest:
delete the duplicate of file_1 in folder 2, rather than in folder 1, because there is a larger part of the cluster in folder 1.
move file_5 from folder 3 to folder 1, because it would extend the existing cluster.
I've read about two concepts:
String metric, and various distances between two strings.
Cluster analysis.
I assume I'm able to create a graph where nodes are file names and edges are distances (I've posted a separate question for the distance calculation).
It seems this kind of algorithm would be able to find clusters in such a graph.
Question
Being a programmer, not a mathematician, I would appreciate recommendations on the best directions to look for efficient clustering algorithms applicable to this specific case of clustering file names (based on existing projects with comparable goals).
Since you are looking for good clustering algorithms, I won't go into similarity scores for text and documents. However, you may find research materials in Natural Language Processing helpful, and you can even do Topic Modeling when the context of the documents is involved.
It sounds like you do not want to dig into too much math in the algorithms, so I will suggest a simple approach (below).
Assuming you have obtained a thresholded similarity graph, the graph can be expressed as a matrix or a dictionary of lists. The graph can be sparse or dense after thresholding.
If it is quite dense, try Spectral Clustering.
If it is sparse, try Affinity Propagation.
They are both well documented and implemented in most programming languages used in data science. For example, in Python you have Scikit-Learn; in R, you have This.
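For instance, a minimal Python/Scikit-Learn sketch, assuming you have already computed a symmetric similarity matrix S between file names (the matrix below is random toy data; in practice you would derive it from a string metric such as normalized Levenshtein similarity):

    import numpy as np
    from sklearn.cluster import AffinityPropagation, SpectralClustering

    names = ["file_1", "file_7", "file_23", "file_1 copy", "file_5", "report.doc"]
    rng = np.random.default_rng(0)
    S = rng.random((len(names), len(names)))  # toy stand-in for name similarities
    S = (S + S.T) / 2                         # a similarity matrix must be symmetric
    np.fill_diagonal(S, 1.0)                  # each name is maximally similar to itself

    # Dense graph: spectral clustering on the precomputed affinity matrix
    spectral = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
    print(spectral.fit_predict(S))

    # Sparse graph: affinity propagation, which also chooses the number of clusters
    ap = AffinityPropagation(affinity="precomputed", random_state=0)
    print(ap.fit_predict(S))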
Interesting concept you proposed. Good luck!

Why do so many things run in 'human observable time'? [closed]

I've studied complexity theory and come from a solid programming background, and it has always seemed odd to me that so many things run in times that are intrinsic to humans. I'm wondering if anyone has ideas as to why this is.
I'm generally speaking of times in the range of 1 second to 1 hour. If you consider how narrow that span of time is compared to the billions of operations per second a computer can handle, it seems odd that such a large number of things fall into that category.
A few examples:
Encoding video: 20 minutes
Checking for updates: 5 seconds
Starting a computer: 45 seconds
You get the idea...
Don't you think most things should fall into one of two categories: instantaneous / millions of years?
Probably because it signifies the cut-off where people consider further optimization not worth the effort.
And clearly, having a computer that takes millions of years to boot wouldn't be very useful (or maybe it would, but you just wouldn't know yet, because it's still booting :P).
Given that computers are tools, and tools are meant to be set up, used, and have their results analyzed by humans (mostly), it makes sense that the majority of operations are designed in a way that doesn't take longer than the lifespan of a typical human.
I would argue that most single operations are effectively "instantaneous" (they run in less than perceptible time), but they are rarely used as single operations. Humans are capable of creating complexity, and given that many computational operations intrinsically balance speed against some other factor (quality, memory usage, etc.), it actually makes sense that many operations are designed so that this balance places them into "times that are intrinsic to humans". Personally, though, I'd word that as "a time assumed to be acceptable to a human user, given the result generated."

What are the standard data structures that can be used to efficiently represent the world of minecraft? [closed]

I am thinking of something like a 3D matrix indexed by the x, y, z coordinates. But that would be a waste of memory, since a lot of block spaces are empty. Another solution would be a hashmap ((x, y, z) -> BlockObject), but that doesn't seem too efficient either.
When I say efficient, I do not mean optimal; it simply means it would run smoothly on a modern computer. Keep in mind that the worlds Minecraft generates are quite huge, so efficiency is important regardless. There is also tons of metadata that needs to be stored.
As noted in my comment, I have no idea how Minecraft does this, but a common, efficient way of representing this sort of data is an octree; see http://en.wikipedia.org/wiki/Octree. The general idea is that it's like a binary tree, but in three dimensions. You recursively divide each block of space in each dimension to get eight smaller blocks, and each block contains pointers to the smaller blocks and a pointer to its parent block.
This allows you to be efficient about storing large blocks of the same material (e.g., "empty space"), because you can terminate the recursion whenever you get to a block that is made up of all the same thing, even if you haven't recursed down to the level of individual "cube" units.
Also, this means that you can efficiently find all the cubes in a given region by taking your current block and going up the tree just far enough to get to a block that contains all you can see -- and that way, you can very easily ignore all the cubes that are somewhere else.
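A minimal sketch of that idea (hypothetical Python, not Minecraft's actual storage): a node either describes a uniform region with a single material, or splits into eight child octants, so large runs of "air" stay collapsed:

    class Octree:
        def __init__(self, size, material="air"):
            self.size = size          # edge length of this cubic region (a power of two)
            self.material = material  # set while the whole region is uniform
            self.children = None      # eight sub-octants, created only when needed

        def set_block(self, x, y, z, material):
            if self.size == 1:
                self.material = material
                return
            half = self.size // 2
            if self.children is None:
                if material == self.material:
                    return  # region is already uniformly this material
                # Split the uniform region into eight octants of the old material
                self.children = [Octree(half, self.material) for _ in range(8)]
                self.material = None
            i = (x >= half) | ((y >= half) << 1) | ((z >= half) << 2)
            self.children[i].set_block(x % half, y % half, z % half, material)

        def get_block(self, x, y, z):
            if self.children is None:
                return self.material  # uniform region: no need to recurse further
            half = self.size // 2
            i = (x >= half) | ((y >= half) << 1) | ((z >= half) << 2)
            return self.children[i].get_block(x % half, y % half, z % half)

    world = Octree(16)
    world.set_block(3, 0, 7, "stone")
    print(world.get_block(3, 0, 7), world.get_block(0, 0, 0))  # stone air

A production version would also merge children back into one node when they become uniform again, but the collapsing idea is the same.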
If you're interested in exploring alternative ways to represent Minecraft world (chunk) data, you can also look into the idea of bitstrings. Each 'chunk' is composed of a 16*16*128 volume, where each 16*16 layer can adequately be represented by single-byte characters and consolidated into a binary string.
As this approach is highly specific to the goal of trading client-side computation for highly optimized storage and transfer time, it seems imprudent to attempt to explain all the details here, but I have created a specification for just this purpose, if you're interested.
Using this method, the storage cost is drastically different from the current 1 byte per block; it is instead "variable-bit-rate": ((1 bit per block, rounded up to a multiple of 8) * (number of unique layers a block type appears in within a chunk) + 2 bytes), summed over the number of unique block types in that chunk.
Pretty much only in deliberate edge cases can this be more expensive than a normally structured chunk; over 99% of Minecraft chunks are naturally generated and would benefit from this variable-bit representation by a ratio of 8:1 or more in many of my tests.
Your best bet is to decompile Minecraft and look at the source. Modifying Minecraft: The Source Code is a nice walkthrough on how to do that.
Minecraft is very far from efficient. It just stores "chunks" of data.
Check out the "Map formats" in the Development Resources at Minecraft Wiki. AFAIK, the internal representation is exactly the same.
