Do I need Proper Pairs in BAM file for gene quantification? - rna-seq

I have human RNA-seq reads that I aligned against the human reference genome (GRCh38) using BWA-MEM and TopHat2. I now want to count reads per gene with HTSeq-count. Do I need to filter out the "non-proper pairs" beforehand, so that I only pass proper pairs to HTSeq-count? If so, how can I do that?
samtools flagstat shows me that all BAM files have ~100% mapped reads, and the percentage of properly paired reads is between 75% and 80%.
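If you do decide to restrict counting to properly paired reads, a minimal sketch with samtools would be something like the following (file names are placeholders; -f 2 keeps only alignments with the "mapped in proper pair" flag set):

samtools view -b -f 2 aligned.bam > proper_pairs.bam
samtools flagstat proper_pairs.bam   # sanity check: proper pairs should now be ~100%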

Related

How to create a .GTF file?

I am new to bioinformatics and programming, and I would greatly appreciate some help with step-by-step instructions on how to create a .GTF file.

I have two cancer cell lines with a different green fluorescent protein (GFP) variant knocked in to the genome of each cell line. The idea is that the expression of GFP can be used to distinguish cancer cells from non-cancer cells. I would like to count GFP reads in all cancer cells in a single-cell RNA-seq experiment. The single-cell experiment was performed on the 10X Chromium platform, on organoids composed of a mix of these cancer cells and non-cancer cells. Next-generation sequencing was then performed, and the reference genome is the human genome sequence, GRCh38.

To 'map' and count GFP reads I was told to create a .GTF file which holds the location information; this file will then be used to add GFP to the human genome sequence. I have the FASTA sequences for both GFP variants, which I can upload if requested.

Where do I start with creation of the .GTF file? Do I create this file in Excel, or with, for example, a Bash script in a terminal? I have a link to a Wellcome Trust genome website (https://www.ensembl.org/info/website/upload/gff.html?redirect=no), but it is not clear what practical/programming steps are needed. From my reading it seems a GFF (GFF3?) file is needed as an intermediate step. Step-by-step instructions to create the .GTF file would be very welcome. Thanks in advance.
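For what it's worth, a GTF file is just 9-column, tab-separated plain text, so you don't need Excel; a text editor or a couple of shell commands are enough. A hypothetical sketch for one GFP variant added as its own extra contig (the contig name GFP_variant1, the length 720 and the gene_id are placeholders that must match your actual GFP FASTA):

# add the GFP sequence as an extra contig to the reference FASTA
cat GFP_variant1.fa >> GRCh38_plus_GFP.fa
# append a one-exon gene record to the reference GTF
# (columns: seqname, source, feature, start, end, score, strand, frame, attributes)
printf 'GFP_variant1\tcustom\texon\t1\t720\t.\t+\t.\tgene_id "GFP1"; transcript_id "GFP1"; gene_name "GFP1";\n' >> GRCh38_plus_GFP.gtf

The combined FASTA and GTF would then be used to build the reference for your 10X pipeline (e.g. cellranger mkref); check that pipeline's documentation for its exact GTF requirements.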

Mapping microarray probes

I have analysed some microarray data that had been floating around for a few years. I then noticed in the limma output file that there were many probes whose gene IDs did not match the gene IDs in any known database. I have managed to get the probe sequences from Agilent, but I am having problems assigning them to genes in the genome.
I do not have coordinates for the probes, so I will need to map them first. Do you know of any Bioconductor package that will take the probes in FASTA format and, using a GFF or GenBank (gbk) file of the genome, output the genes where they are located? I am working with Acaryochloris marina.
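I'm not aware of a single Bioconductor package that does all of this in one step, but one command-line sketch (assuming BLAST+ and bedtools are installed; all file names are placeholders) would be to align the probe FASTA to the genome and then intersect the hits with the GFF annotation:

makeblastdb -in genome.fasta -dbtype nucl -out genome_db
blastn -query probes.fasta -db genome_db -outfmt 6 > hits.tsv
# outfmt 6 columns: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
awk 'BEGIN{OFS="\t"} {s=$9; e=$10; if (s > e) {t=s; s=e; e=t} print $2, s-1, e, $1}' hits.tsv > probes.bed
bedtools intersect -wa -wb -a probes.bed -b annotation.gff > probe_to_gene.tsv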

Matching CLOSEST file in given ASCII Text Files [duplicate]

This question already has answers here:
Algorithm to match one input file with given numbers of file
(3 answers)
Closed 5 years ago.
PROBLEM:
I have around 20 ASCII text files, each less than 10^9 bytes in size. Another ASCII text file (say FOO) is given. The program is to strategically match the contents of FOO against the given 20 files and print the name of the CLOSEST matching file. The contents of FOO might only match partially.
Since the files are so large, I'm wondering:
1. How to use information retrieval (since I don't know much about IR)
2. Which data structure I should use to store such information
3. What the best algorithm would be to implement it
I know I'm asking a lot, but I'm really stuck on this problem and can't figure out how to approach it. Any help would be appreciated. Thanks!
I assume each file contains some text, so we can treat each file as one big string. Make 20 vectors or arrays, go through each file, and put each of its words into the corresponding vector. Create a word vector for the given file (FOO) as well, and a vector of size 20 to store the match score of each file. Then loop through the word vectors: whenever a word of FOO matches a word in one of the 20 vectors, increase the value for the corresponding file in the score vector. At the end, the highest value in the score vector indicates the file with the best match.
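A rough shell sketch of this idea, counting distinct shared words rather than per-index matches (file names are placeholders; it is only meant to illustrate the scoring):

tr -s '[:space:]' '\n' < FOO | sort -u > foo.words
for f in file01 file02 file03; do
  echo "$(tr -s '[:space:]' '\n' < "$f" | sort -u | comm -12 foo.words - | wc -l) $f"
done | sort -rn | head -1   # highest shared-word count = closest match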
The solution by Vampire Coder assumes that the documents are bags of words, meaning the ordering of words doesn't matter. But if by "match partially" you meant some of the sentences matching, then that won't do any good.
You could divide each document into overlapping subsets and take the hash of each subset, transforming each document into a set of hashes that you can then compare. This is one way you could do what you want to do.
For each document, once you have narrowed down potential matches, you can increase the resolution at which you divide your documents. Say you initially divided them into two, now you can divide them into 10. This is to minimize the running time.
Also, you could use a locality-sensitive hashing algorithm such as Nilsimsa: http://en.wikipedia.org/wiki/Nilsimsa_Hash
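A minimal sketch of the overlapping-subsets idea using word 5-gram "shingles" (the shingle length is arbitrary, and the locality-sensitive hashing step is skipped here; the shingle sets are simply compared directly):

# print every overlapping 5-word shingle of a file, one per line, deduplicated
shingles() { tr -s '[:space:]' '\n' < "$1" |
  awk 'NF {w[++n]=$0} END {for (i=1; i+4<=n; i++) {s=w[i]; for (j=1; j<=4; j++) s=s" "w[i+j]; print s}}' |
  sort -u; }
shingles FOO > foo.shingles
for f in file01 file02 file03; do
  echo "$(shingles "$f" | comm -12 foo.shingles - | wc -l) $f"
done | sort -rn | head -1   # most shared shingles = closest match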
My guess at "closest" is the file with the smallest diff against FOO.
I would look for a diff algorithm, or the longest common subsequence: https://en.m.wikipedia.org/wiki/Longest_common_subsequence_problem
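A quick way to try that from the shell, treating the size of the diff output as the distance (file names are placeholders):

for f in file01 file02 file03; do
  echo "$(diff FOO "$f" | wc -l) $f"
done | sort -n | head -1   # smallest diff = closest match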

Removing Duplicate Words Across Multiple and Large Dictionary Files

I have roughly 600 GB of dictionaries I've accumulated over the years, and I decided I want to clean them up and sort them.
First of all, each file is on average very large, anywhere from 500 MB to 9 GB in size. A prerequisite for what I want to do is that I sort each dictionary. My end goal is to entirely remove duplicate words within and across all dictionary files.
The reason for this is that most of my dictionaries are sorted and organized by categories, but duplicates still often exist.
Load file
Read each line and put into data structure
Sort and remove any and all duplicates
Load next file and repeat
Once all files are individually unique, compare against each other and remove duplicates
For Dictionaries D{1} to D{N}:
1) Sort D{1} through D{N} individually.
2) Check uniqueness of each word in D{i}
3) For each word in D{i}, check ALL words across D{i+1} to D{N} and delete every later occurrence, keeping the first occurrence (the one in D{i}).
I am considering using a sort of "hash" to improve this algorithm, possibly by only checking the first one or two characters, since the lists will be sorted (e.g. store the line where the words starting with a, b, etc. begin).
4) Save and exit.
Example before (but far smaller):
Dictionary 1 Dictionary 2 Dictionary 3
]a 0u3TGNdB 2 KLOCK
all avisskriveri 4BZ32nKEMiqEaT7z
ast chorion 4BZ5
astn chowders bebotch
apiala chroma bebotch
apiales louts bebotch
avisskriveri lowlander chorion
avisskriverier namely PC-Based
avisskriverierne silking PC-Based
avisskriving underwater PC-Based
So it would see that avisskriveri, chorion, bebotch and PC-Based are words that repeat both within and among each of the three dictionaries. I see avisskriveri in D{1} first, so I remove all of its later occurrences. Then I see chorion in D{2} first, and remove its later occurrences, and so forth. In D{3}, bebotch and PC-Based are replicated, so I want to delete all but one entry of each (unless I've seen it before). Then save all files and close.
Example after:
Dictionary 1 Dictionary 2 Dictionary 3
]a 0u3TGNdB 2 KLOCK
all chorion 4BZ32nKEMiqEaT7z
ast chowders 4BZ5
astn chroma bebotch
apiala louts PC-Based
apiales lowlander
avisskriveri namely
avisskriverier silking
avisskriverierne underwater
avisskriving
Remember: I do NOT want to create any new dictionaries, only remove duplicates across all dictionaries.
Options:
"Hash" the amount of unique words for each file, allowing the program to estimate the computation time.
Specify a way to give the location of the first word beginning with the desired first letter, so that the search may "jump" to a line and skip unnecessary computation.
Run on a GPU for high-performance parallel computing. (This is an issue because getting the data off of the GPU is tricky.)
Goal: Reduce computational time and space consumption so that the method is affordable on a standard machine or server with limited abilities. Or devise a method for running it remotely on a GPU cluster.
tl;dr - Sorting unique words across hundreds of files, where each file is 1-9GB in size.
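A minimal sketch of steps 1)-4) with standard tools, assuming one word per line and that the set of distinct words fits in RAM (which may well not hold at 600 GB; file names are placeholders):

for f in dict*.txt; do sort -u "$f" -o "$f"; done        # 1)+2) sort each file, drop its internal duplicates
awk 'FNR == 1 { if (out) close(out); out = FILENAME ".dedup" }
     !seen[$0]++ { print > out }' dict*.txt              # 3) keep only the first occurrence across all files
# 4) once the .dedup files look right, replace the originals:
# for f in dict*.txt; do mv "$f.dedup" "$f"; done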
Assuming the dictionaries are in alphabetical order and line by line, one word per line (as are most dictionaries), you could do something like this:
Open a file stream to each file.
Open a file stream to the compiled list file.
Read 1 entry from each file and put it onto a heap, priority queue, or other sorted data structure.
while you still have entries
    find & remove the first entry, storing the word (it is not necessary to store the file)
    read in the next entry from that file, if one exists
    find & remove any duplicates of the stored entry
    read in the next entry for each of those files, if one exists
    write the stored word to your compiled list file
Close all of the streams
The efficiency of this is something like O(n*m*log(n)) and the space efficiency is O(n), where n is the number of files and m is the average number of entries.
Note that you'll want to create a data type that pairs entries (strings) with file pointers/references, and that sorts by the stored string. You'll also need a data structure that allows you to peek before you pop.
If you have questions about the implementation, ask me.
A more thorough analysis of the efficiency:
Space efficiency is pretty easy. You fill the data structure, and for every item you put on, you take one off, so it stays at O(n).
Computational efficiency is more complex. The looping itself is O(n*m), because you will consider each entry, and there are n*m entries. Some c percent of those will be valid, but that's a constant, so we don't care.
Next, adding and removing from a priority queue is log(n) both ways, so to find & remove is 2*log(n).
Because we add and remove each entry, we get n*m add and removes, so O(n*m*log(n)). I think it might actually be a theta in this case, but meh.
As far as I understand, there is no pattern to exploit in a clever way. So we want to do raw sorting.
Let us assume that no cluster farm is available (we could do other things then).
Then I would start with the easiest approach possible, the command line tool sort:
sort -u inp1 inp2 -o sorted
This will sort inp1 and inp2 together into the output file sorted without duplicates (u = unique). sort typically uses a customized mergesort algorithm, which can handle a limited amount of memory, so you should not run into memory problems.
You should have at least 600 GB (double the input size) of free disk space.
You should first test with only 2 input files to see how long it takes and what happens. My tests did not show any problems, but they used different data and an AFS server (which is rather slow, but a better emulation of some HPC filesystem providers):
$ ll
2147483646 big1
2147483646 big2
$ time sort -u big1 big2 -o bigsorted
1009.674u 6.290s 28:01.63 60.4% 0+0k 0+0io 0pf+0w
$ ll
2147483646 big1
2147483646 big2
117440512 bigsorted
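If that test behaves, the same approach extends to all of the inputs in a single invocation, and GNU sort lets you point its temporary files at a large scratch disk and give it a bigger in-memory buffer (paths, sizes and file names below are placeholders):

sort -u -S 16G -T /scratch/sorttmp --parallel=8 dict*.txt -o all_words_sorted.txt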
I'd start with something like:
#include <iostream>
#include <set>
#include <string>

int main()
{
    typedef std::set<std::string> Words;   // kept sorted, duplicates ignored
    Words words;
    std::string word;
    while (std::cin >> word)
        words.insert(word);                // insert is a no-op if the word was seen before
    for (Words::const_iterator i = words.begin(); i != words.end(); ++i)
        std::cout << *i << '\n';           // one word per line, in sorted order
}
Then just:
cat file1 file2... | ./this_wonderful_program > greatest_dictionary.txt
This should be fine assuming the number of distinct words fits in memory (likely on any modern PC, especially a 64-bit one with more than 4 GB of RAM); it will probably be I/O bound anyway, so there is no point fussing over an unordered (hash) set versus a (binary-tree) set, etc. You may want to convert to lower case, strip spurious characters, etc. before inserting into the set.
EDIT:
If the unique words don't fit in memory, or you're just stubbornly determined to sort each individual input and then merge them, you can use the Unix sort command on each file, then sort -m to efficiently merge the pre-sorted files. If you're not on UNIX/Linux, you can probably still find a port of sort (e.g. from Cygwin for Windows), your OS may have an equivalent program, or you could try compiling the sort source code. Note that this approach is a little different from tb-'s suggestion of asking one invocation of sort to sort everything (presumably in memory) - I'm not sure how well that would work, so it is best to try both and compare.
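A sketch of that sort-then-merge variant with GNU sort (file names are placeholders):

for f in dict*.txt; do sort -u "$f" -o "$f.sorted"; done   # sort each input on its own
sort -m -u dict*.txt.sorted -o merged_unique.txt           # merge the pre-sorted files, dropping duplicates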
At that scale of 300 GB+, you may want to consider using Hadoop or some other scalable store - otherwise, you will have to deal with memory issues through your own coding. You can try other, more direct methods (UNIX scripting, small C/C++ programs, etc.), but you will likely run out of memory unless you have a ton of duplicate words in your data.
Addendum
I just came across memcached, which seems very close to what you are trying to accomplish, but you may have to tweak it not to throw away the oldest values. I don't have time to check right now, but you should do a search on distributed hash tables.

Sorting algorithm: Big text file with variable-length lines (comma-separated values)

What's a good algorithm for sorting text files that are larger than available memory (many 10s of gigabytes) and contain variable-length records? All the algorithms I've seen assume 1) data fits in memory, or 2) records are fixed-length. But imagine a big CSV file that I wanted to sort by the "BirthDate" field (the 4th field):
Id,UserId,Name,BirthDate
1,psmith,"Peter Smith","1984/01/01"
2,dmehta,"Divya Mehta","1985/11/23"
3,scohen,"Saul Cohen","1984/08/19"
...
99999999,swright,"Shaun Wright","1986/04/12"
100000000,amarkov,"Anya Markov","1984/10/31"
I know that:
This would run on one machine (not distributed).
The machine that I'd be running this on would have several processors.
The files I'd be sorting could be larger than the physical memory of the machine.
A file contains variable-length lines. Each line would consist of a fixed number of columns (delimiter-separated values). A file would be sorted by a specific field (i.e. the 4th field in the file).
An ideal solution would probably be "use this existing sort utility", but I'm looking for the best algorithm.
I don't expect a fully-coded, working answer; something more along the lines of "check this out, here's kind of how it works, or here's why it works well for this problem." I just don't know where to look...
This isn't homework!
Thanks! ♥
This class of algorithms is called external sorting. I would start by checking out the Wikipedia entry. It contains some discussion and pointers.
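As a concrete example of an existing utility: GNU sort already implements external (on-disk) merge sort, so something like the following would sort the example above by the 4th comma-separated field, assuming no embedded commas inside quoted fields (buffer size, temp directory and file names are placeholders):

head -n 1 big.csv > sorted.csv                      # keep the header line first
tail -n +2 big.csv | sort -t, -k4,4 -S 4G -T /scratch --parallel=4 >> sorted.csv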
Suggest the following resources:
Merge Sort: http://en.wikipedia.org/wiki/Merge_sort
Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. Addison-Wesley. ISBN 0-201-03822-6 (v. 2).
A standard merge sort approach will work. The common schema is
Split the file into N parts of roughly equal size
Sort each part (in memory if it's small enough, otherwise recursively apply the same algorithm)
Merge the sorted parts
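A rough illustration of that schema with standard tools (the chunk size is arbitrary, header handling is omitted, and GNU sort is assumed):

split -l 10000000 big.csv chunk_                            # 1) split into parts
for c in chunk_*; do sort -t, -k4,4 "$c" -o "$c"; done      # 2) sort each part on field 4
sort -m -t, -k4,4 chunk_* -o sorted.csv                     # 3) merge the sorted parts
rm chunk_*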
No need to sort. Read the file ALL.CSV and append each line read to a file per day, like 19841231.CSV. For each existing day with data, in numerical order, read that CSV file and append its lines to a new file. Optimizations are possible by, for example, processing the original file more than once or by recording the days actually occurring in the file ALL.CSV.
So a line containing "1985/02/28" should be added to the file 19850228.CSV. The file 19850228.CSV should be appended to NEW.CSV after the file 19850227.CSV was appended to NEW.CSV. The numerical order avoids the need for any sort algorithm, although it could torture the file system.
In reality the file ALL.CSV could be split in a file per, for example, year. 1984.CSV, 1985.CSV, and so on.
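A sketch of this bucketing idea (gawk is assumed, since it can keep many output files open at once; file and field names mirror the example above):

tail -n +2 ALL.CSV | awk -F, '{d = $4; gsub(/[^0-9]/, "", d); print > (d ".CSV")}'   # one file per BirthDate, e.g. 19850228.CSV
cat $(ls [0-9]*.CSV | sort -n) > NEW.CSV                                             # concatenate the per-day files in date order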
