How to create a .GTF file? - bioinformatics

I am new to bioinformatics and programming, and I would greatly appreciate step-by-step instructions on how to create a .GTF file.

I have two cancer cell lines, each with a different green fluorescent protein (GFP) variant knocked into its genome. The idea is that GFP expression can be used to distinguish cancer cells from non-cancer cells. I would like to count GFP reads per cancer cell in a single-cell RNA-seq experiment. The experiment was performed on the 10x Chromium platform, on organoids composed of a mix of these cancer cells and non-cancer cells. Next-generation sequencing was then performed, and the reference is the human genome assembly GRCh38.

To map and count GFP reads, I was told to create a .GTF file which holds the location information; this file will then be used to add GFP to the human reference sequence. I have the FASTA sequences for both GFP variants, which I can upload if requested.

Where do I start with creating the .GTF file? Do I create this file in Excel, or with, for example, a Bash script in a terminal? I have a link to a Wellcome Trust genome website (https://www.ensembl.org/info/website/upload/gff.html?redirect=no), but it is not clear what practical/programming steps are needed. From my reading it seems a GFF (GFF3?) file may be needed as an intermediate step. Step-by-step instructions to create the .GTF file would be very welcome. Thanks in advance.
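A GTF file is just plain text with nine tab-separated columns per feature, so it can be produced with any script rather than Excel. A minimal sketch for one hypothetical GFP contig follows; the contig name and the 720 bp length are made-up placeholders and must match the FASTA record you append to the GRCh38 reference:

```python
# A minimal, hypothetical single-entry GTF file for a custom GFP contig.
# The name "GFP_variant1" and length 720 are placeholders, not from the
# question; they must agree with the header and length of your GFP FASTA.
fields = [
    "GFP_variant1",   # 1. seqname: must equal the FASTA header of the contig
    "custom",         # 2. source
    "exon",           # 3. feature: read counters typically count over exons
    "1",              # 4. start (1-based, inclusive)
    "720",            # 5. end: length of the GFP sequence
    ".",              # 6. score (unused)
    "+",              # 7. strand
    ".",              # 8. frame
    'gene_id "GFP_variant1"; transcript_id "GFP_variant1";',  # 9. attributes
]
with open("gfp.gtf", "w") as out:
    out.write("\t".join(fields) + "\n")
```

The same file could equally be written by hand in a text editor; the only strict requirements are the tab separators and the gene_id/transcript_id attributes in the ninth column.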

Related

Compression Formats and Delimiter Sequences

My question is: Are there any standard compression formats which can ensure that a certain delimiter sequence does not occur in the compressed data stream?
We want to design a binary file format, containing chunks of sequential data (3D coordinates + other data, not really important for the question). Each chunk should be compressed using a standard compression format, like GZIP, ZIP, ...
So, the file structure will be like:
FileHeader
ChunkDelimiter Chunk1_Header compress(Chunk1_Data)
ChunkDelimiter Chunk2_Header compress(Chunk2_Data)
...
Use case is the following: The files should be read in splits in Hadoop, so we want to be able to start at an arbitrary byte position in the file, and find the start of the next chunk by looking for the delimiter sequence. -> The delimiter sequence should not occur within the chunks.
I know that we could post-process the compressed data, "escaping" the delimiter sequence in case it occurs in the compressed output. But we would rather avoid this, since the "reverse escaping" would be required in the decoder, adding complexity.
Some more facts why we chose this file format:
Should be easily readable by third parties -> standard compression algorithm preferred.
Large files; streaming operation: amount of data and number of chunks is not known when starting to write the file -> Difficult to write start-of-chunk byte positions in the header.
I won't answer your question with the name of a compression scheme, but I will give you a hint at how others solved the same issue.
Let's take a look at Avro. Basically, they have similar requirements: files must be splittable, and each data block can be compressed (you can even choose your compression scheme).
From the Avro Specification we learn that splittability is achieved with the help of a synchronization marker ("Objects are stored in blocks that may be compressed. Synchronization markers are used between blocks to permit efficient splitting of files for MapReduce processing."). We also discover that the synchronization marker is a 16-byte randomly-generated value ("The 16-byte, randomly-generated sync marker for this file.").
How does this solve your issue? Well, since Martin Kleppmann provided a great answer to this question a few years ago, I will just copy-paste his message:
On 23 January 2013 21:09, Josh Spiegel wrote:
As I understand it, Avro container files contain synchronization markers
every so often to support splitting the file. See:
https://cwiki.apache.org/AVRO/faq.html#FAQ-Whatisthepurposeofthesyncmarkerintheobjectfileformat%3F
(1) Why isn't the synchronization marker the same for every container file?
(i.e. what is the point of generating it randomly every time)
(2) Is it possible, at least in theory, for naturally occurring data to
contain bytes that match the sync marker? If so, would this break
synchronization?
Thanks,
Josh
Because if it was predictable, it would inevitably appear in the actual data sometimes (e.g. imagine the Avro documentation, stating
what the sync marker is, is downloaded by a web crawler and stored in
an Avro data file; then the sync marker will appear in the actual
data). Data may come from malicious sources; making the marker random
makes it unfeasible to exploit.
Possibly, but extremely unlikely. The probability of a given random 16-byte string appearing in a petabyte of (uniformly distributed) data
is about 10^-23. It's more likely that your data center is wiped out
by a meteorite
(http://preshing.com/20110504/hash-collision-probabilities).
If the sync marker appears in your data, it only breaks reading the file if you happen to also seek to that place in the file. If you just
read over it sequentially, nothing happens.
Martin
Link to the Avro mailing list archive
If it works for Avro, it will work for you too.
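The scheme is small enough to sketch. A minimal Python illustration of the idea follows; the container layout, the "FILEHDR" header, and the 4-byte chunk-size field are invented for the example and are not Avro's actual format:

```python
import io
import os
import zlib

# Per-file random sync marker, Avro-style. Sixteen random bytes are
# vanishingly unlikely to appear by accident in compressed chunk data.
marker = os.urandom(16)

def write_file(buf, chunks):
    buf.write(b"FILEHDR" + marker)               # header carries the marker
    for data in chunks:
        comp = zlib.compress(data)
        buf.write(marker)                        # delimiter before each chunk
        buf.write(len(comp).to_bytes(4, "big"))  # chunk header: compressed size
        buf.write(comp)

def next_chunk(blob, pos):
    """From an arbitrary byte position, locate the next chunk boundary."""
    i = blob.find(marker, pos)
    if i < 0:
        return None
    start = i + len(marker)
    size = int.from_bytes(blob[start:start + 4], "big")
    return start + 4, size

buf = io.BytesIO()
write_file(buf, [b"north" * 50, b"south" * 50])
blob = buf.getvalue()

# A Hadoop-style reader dropped at an arbitrary offset scans forward to
# the next marker and resumes from there.
data_at, size = next_chunk(blob, len(b"FILEHDR") + len(marker))
print(zlib.decompress(blob[data_at:data_at + size])[:10])  # b'northnorth'
```

A real reader would also store the marker in the file header (as Avro does), so that any process can pick it up before seeking.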
No. I know of no standard compression format that does not allow any sequence of bits to occur somewhere within. To do otherwise would (slightly) degrade compression, going against the original purpose of a compression format.
The solutions are a) post-process the sequence to use a specified break pattern and insert escapes if the break pattern accidentally appears in the compressed data -- this is guaranteed to work, but you don't like this solution, or b) trust that the universe is not conspiring against you and use a long break pattern whose length assures that it is incredibly unlikely to appear accidentally in all the sequences this is applied to anytime from now until the heat death of the universe.
For b) you can protect somewhat against the universe conspiring against you by selecting a random pattern for each file, and providing the random pattern at the start of the file. For the truly paranoid, you could go even further and generate a new random pattern for each successive break, from the previous pattern.
Note that the universe can conspire against you for a fixed pattern. If you make one of these compressed files with a fixed break pattern, and then you include that file in another compressed archive also using that break pattern, that archive will likely not be able to compress this already compressed file and will simply store it, leaving exposed the same fixed break pattern as is being used by the archive.
Another protection for b) would be to detect the decompression failure of an incorrect break by seeing that the piece before the break does not terminate, and handle that special case by putting that piece and the following piece back together and trying the decompression again. You would also very likely detect this on the following piece as well, with that decompression failing.
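For completeness, option (a) is also only a few lines. A sketch using the classic SLIP byte-stuffing values; the particular delimiter and escape bytes are arbitrary choices, not mandated by any of the formats discussed here:

```python
# SLIP-style byte stuffing: after escaping, the delimiter byte can never
# appear inside a chunk, so it is safe to use between chunks. The decoder
# reverses the two substitutions.
DELIM, ESC = b"\xc0", b"\xdb"
ESC_DELIM, ESC_ESC = b"\xdc", b"\xdd"

def escape(data: bytes) -> bytes:
    # Escape ESC first, so the DELIM substitution cannot be misread later.
    data = data.replace(ESC, ESC + ESC_ESC)
    return data.replace(DELIM, ESC + ESC_DELIM)

def unescape(data: bytes) -> bytes:
    data = data.replace(ESC + ESC_DELIM, DELIM)
    return data.replace(ESC + ESC_ESC, ESC)

payload = bytes(range(256))       # contains both DELIM and ESC
wire = escape(payload)
assert DELIM not in wire          # delimiter is now safe to use between chunks
assert unescape(wire) == payload  # the round trip is lossless
```

This is exactly the decoder-side complexity the question wants to avoid, but it is the only approach with a hard guarantee rather than a probabilistic one.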

How to extract features from plain text?

I am writing a text parser which should extract features from product descriptions.
Eg:
text = "Canon EOS 7D Mark II Digital SLR Camera with 18-135mm IS STM Lens"
features = extract(text)
print features
Brand: Canon
Model: EOS 7D
....
The way I do this is by training the system with structured data and coming up with an inverted index which can map a term to a feature. This works mostly well.
When the text contains measurements like 50ml or 2kg, the inverted index will map them to a feature, e.g. 2kg -> Size and 50ml -> Size.
The problem is that when I get a value I haven't seen before, like 13ml, it won't be processed. But since the pattern matches a size, we could tag it as a size.
I was thinking of solving this by preprocessing the tokens from the text and looking for patterns I already know. When new patterns are identified, they have to be added to the preprocessing step.
I was wondering, is this the best way to go about this? Or is there a better way of doing this?
The age-old problem of unseen cases. You could train your scraper to grab any number-like characters preceding certain suffixes (ml, kg, etc.) and treat those as a size. The problem with this is that typos and other poorly formatted text could enter your structured data. There is no right answer for how to handle values you haven't seen before: you'll either have to QC them individually or have rules around them. This depends on your dataset.
As far as identifying patterns, you'll either have to manually enter them, or manually classify a lot of records and let the algorithm learn them. Not sure that's very helpful, but a lot of this is very dependent on your data.
If you have a training data like this:
word label
10ml size-volume
20kg size-weight
etc...
you could train a classifier based on character n-grams, and it would detect that ml means size-volume even if it sees an 11-ml or ml11, etc. You should also normalize numbers to a single placeholder (e.g. 0), so that 11-ml is seen as 0-ml before feature extraction.
For that you'll need a preprocessing module and also a large training sample. For feature extraction you can use scikit-learn's character n-gram vectorizer together with an SVM.
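The number-normalization step above is what handles unseen values, and it can be sketched without any machine learning at all; the tiny training table here is made up for the example:

```python
import re

# Toy training table, following the word/label example above.
training = {"10ml": "size-volume", "20kg": "size-weight"}

def normalize(token):
    # Collapse every digit run to "0": "13ml" and "10ml" share one pattern.
    return re.sub(r"\d+", "0", token.lower())

# Index patterns rather than raw tokens.
patterns = {normalize(word): label for word, label in training.items()}

def tag(token):
    return patterns.get(normalize(token))  # None for unknown patterns

print(tag("13ml"))   # size-volume, even though 13ml was never seen
print(tag("500kg"))  # size-weight
```

A character n-gram classifier generalizes further (e.g. to ml11 or misspelled units), but this normalization alone already covers the 13ml case from the question.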

Retrieving DNA sequences from a database of protein sequences?

I have thousands of protein sequences in FASTA format, along with their accession numbers. I want to go back into the whole-genome shotgun database and retrieve all DNA sequences that encode a protein identical to one in my initial list.
I've tried running tblastn limited to fewer than 10 results per sequence, one alignment per query, with an e-value threshold of 1e-100 (or even zero), and I'm not getting any results. I would like to automate this entire process.
Is this something that can be done by running blast from the command line and a batch script?
You should get at least one result: the sequence that encodes the original protein. The others, if any, would be pseudogenes, if I follow you.
Anyway, a bit of programming may help; check out Biopython. BioPerl or BioRuby should have similar features.
In particular, you can run BLAST using Biopython.
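To address the batch-script part of the question, here is a minimal sketch that generates one command-line tblastn invocation per query file. The database name wgs_db and the file names are placeholders; it assumes the NCBI BLAST+ suite is installed and that a local database was built with makeblastdb:

```python
import shlex

# Build one tblastn command per query FASTA against a local BLAST database.
# "wgs_db" and the FASTA file names are hypothetical placeholders.
def tblastn_cmd(query_fasta, db="wgs_db", evalue="1e-100", max_hits=10):
    args = [
        "tblastn",
        "-query", query_fasta,             # protein FASTA (may hold many queries)
        "-db", db,
        "-evalue", evalue,                 # strict threshold, as in the question
        "-max_target_seqs", str(max_hits),
        "-outfmt", "6",                    # tabular output, easy to parse later
        "-out", query_fasta + ".tsv",
    ]
    return " ".join(shlex.quote(a) for a in args)

for fasta in ["proteins_batch1.fasta", "proteins_batch2.fasta"]:
    print(tblastn_cmd(fasta))
```

The printed lines can go straight into a shell script, or each command can be run directly with subprocess.run. Note that an e-value cutoff of 1e-100 is very strict for short proteins; relaxing it is the first thing to try when no hits come back.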
You might find this link useful:
https://www.biostars.org/p/5403/
A similar question has been asked there, and some reasonable solutions have been posted.

Sorting algorithm: Big text file with variable-length lines (comma-separated values)

What's a good algorithm for sorting text files that are larger than available memory (many 10s of gigabytes) and contain variable-length records? All the algorithms I've seen assume 1) data fits in memory, or 2) records are fixed-length. But imagine a big CSV file that I wanted to sort by the "BirthDate" field (the 4th field):
Id,UserId,Name,BirthDate
1,psmith,"Peter Smith","1984/01/01"
2,dmehta,"Divya Mehta","1985/11/23"
3,scohen,"Saul Cohen","1984/08/19"
...
99999999,swright,"Shaun Wright","1986/04/12"
100000000,amarkov,"Anya Markov","1984/10/31"
I know that:
This would run on one machine (not distributed).
The machine that I'd be running this on would have several processors.
The files I'd be sorting could be larger than the physical memory of the machine.
A file contains variable-length lines. Each line consists of a fixed number of delimiter-separated columns, and the file would be sorted by a specific field (i.e. the 4th field in the file).
An ideal solution would probably be "use this existing sort utility", but I'm looking for the best algorithm.
I don't expect a fully-coded, working answer; something more along the lines of "check this out, here's kind of how it works, or here's why it works well for this problem." I just don't know where to look...
This isn't homework!
Thanks! ♥
This class of algorithms is called external sorting. I would start by checking out the Wikipedia entry. It contains some discussion and pointers.
Suggest the following resources:
Merge Sort: http://en.wikipedia.org/wiki/Merge_sort
Sorting and Searching, vol. 3 of The Art of Computer Programming: Knuth: Addison-Wesley (external sorting is covered in section 5.4)
A standard merge sort approach will work. The common scheme is:
Split the file into N parts of roughly equal size
Sort each part (in memory if it's small enough, otherwise recursively apply the same algorithm)
Merge the sorted parts
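A minimal Python sketch of those three steps, using the question's own sample rows; the tiny run size exists only to force several on-disk runs, and heapq.merge performs the k-way merge while holding just one row per run in memory:

```python
import csv
import heapq
import os
import tempfile

def external_sort(rows, key_index, run_size=2):
    """Sort rows by one field using sorted on-disk runs plus a k-way merge.
    run_size is tiny here only to force several runs."""
    runs = []
    # 1. Split into parts and sort each part "in memory".
    for i in range(0, len(rows), run_size):
        part = sorted(rows[i:i + run_size], key=lambda r: r[key_index])
        f = tempfile.NamedTemporaryFile("w+", newline="", delete=False)
        csv.writer(f).writerows(part)
        f.seek(0)
        runs.append(f)
    # 2. K-way merge: heapq.merge streams lazily, reading one row per run.
    merged = list(heapq.merge(*(csv.reader(f) for f in runs),
                              key=lambda r: r[key_index]))
    for f in runs:
        f.close()
        os.unlink(f.name)
    return merged

rows = [
    ["1", "psmith", "Peter Smith", "1984/01/01"],
    ["2", "dmehta", "Divya Mehta", "1985/11/23"],
    ["3", "scohen", "Saul Cohen", "1984/08/19"],
    ["99999999", "swright", "Shaun Wright", "1986/04/12"],
    ["100000000", "amarkov", "Anya Markov", "1984/10/31"],
]
for row in external_sort(rows, key_index=3):   # sort by BirthDate
    print(row)
```

In a real run, step 1 would read the big file in fixed-size batches instead of holding all rows in a list, and the YYYY/MM/DD date format conveniently sorts correctly as a plain string.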
No need to sort. Read the file ALL.CSV and append each line to a per-day file, like 19841231.CSV. Then, for each day that has data, in numerical order, read that day's CSV file and append its lines to a new file. Optimizations are possible, for example by processing the original file more than once, or by recording which days actually occur in ALL.CSV.
So a line containing "1985/02/28" would be appended to the file 19850228.CSV, and 19850228.CSV would be appended to NEW.CSV right after 19850227.CSV. The numerical order avoids any comparison sort, although it may torture the file system.
In practice, ALL.CSV could first be split into one file per year, for example: 1984.CSV, 1985.CSV, and so on.
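An in-memory sketch of this bucketing idea, again on the question's sample rows; on disk, each bucket would be its own YYYYMMDD.CSV file as described:

```python
from collections import defaultdict

# Distribution ("bucket") approach: group rows by the date key, then emit
# the buckets in key order. Only the distinct day keys ever get ordered,
# never the individual rows.
def bucket_by_day(rows, key_index=3):
    buckets = defaultdict(list)
    for row in rows:
        # "1984/01/01" -> "19840101", the per-day file name stem
        buckets[row[key_index].replace("/", "")].append(row)
    out = []
    for day in sorted(buckets):   # YYYYMMDD sorts chronologically as text
        out.extend(buckets[day])
    return out

rows = [
    ["1", "psmith", "Peter Smith", "1984/01/01"],
    ["2", "dmehta", "Divya Mehta", "1985/11/23"],
    ["3", "scohen", "Saul Cohen", "1984/08/19"],
    ["99999999", "swright", "Shaun Wright", "1986/04/12"],
    ["100000000", "amarkov", "Anya Markov", "1984/10/31"],
]
for row in bucket_by_day(rows):
    print(row)
```

The number of buckets stays small (one per distinct day) even for a file of tens of gigabytes, which is what makes this competitive with a comparison sort for this particular key.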

I want to merge two PostScript Documents, pagewise. How?

I have a tricky question, so I need to describe my problem:
I need to print two-sided booklets (each page a third of a sheet) on normal paper (German A4, but Letter is fine too) and cut the paper afterwards.
The pages are in a PostScript Level 2 file (generated by an ancient printer driver, so there is no chance to patch that, except with ps2ps), produced by me with the ancient OS's printing facilities (GpiMove, GpiLine, GpiText, etc.).
I do not want to throw away two thirds of the paper, so my idea is: take files one, two, and three, merge them (how?) onto new double-sided pages by translating/moving files two and three down by one and two thirds of the page respectively, and print the resulting new pages.
If it helps, I can manage to print one page of the booklet per file.
I cannot "speak" PostScript natively, but I am capable of parsing, merging, and manipulating files programmatically. Maybe you can point me to a web page. I've read through the manuals on Adobe's site and followed the links on www.inkguides.com/postscript.asp
Maybe there are techniques with PDF that would help? I can convert with ps2pdf.
Thanks for help
Peter Miehle
PS:
my current solution, e.g. for 8 pages: print pages 1, 4, and 7 on sheet one, 2, 5, and 8 on sheet two, and 3, 6, and a blank on sheet three, then cut the paper and restack. But I want to use an electric cutting machine, which works better with thicker stacks of paper.
Try psbook or psnup from the psutils package. For instance, at http://www.tardis.ed.ac.uk/~ajcd/psutils/
