Find TopHat accepted_hits.bam mapped to transcriptome and genome separately - bioinformatics

I ran a TopHat 2.0.13 alignment with both genome and transcriptome inputs, and would like to know how to find the percentage of reads mapped to the transcriptome and the percentage mapped to the genome in my accepted_hits.bam output file.
Is there a straightforward way of doing this? samtools flagstat only gives me the overall mapping percentage and does not distinguish between genomic and transcriptomic alignments.
Below is the command I used for the TopHat alignment:
tophat -G GRCm38_p4.gtf --min-anchor 8 --min-isoform-fraction 0.15 \
--library-type fr-unstranded --transcriptome-index known_mm10 -p 12 \
-o ${stdout} /Bowtie2Index/genome /R1.fastq /R2.fastq
Any resources/suggestions would be appreciated.
Thank you,

Related

Image Conversion - RAW to png/raw for game (Pac The Man X)

I have a raw image and am curious whether I can edit it and save it as an RGB-32 packed, transparent, interlaced raw file, and which program I could use. Here is the specification:
Format of RAW image
I have tried using Photoshop, but then the game crashes. Is it even possible? I should get a file without a thumbnail. I also tried GIMP, free converters and a raw viewer, but no luck. Any suggestions?
Edit:
I used Photoshop (interleaved with transparency format); the game starts, but the images are just a bunch of pixels.
File that I am trying to prepare (221bits)
We are still not getting a handle on what output format you are really trying to achieve. Let's try generating a file from scratch, to see if we can get there.
So, let's just use simple commands that are available on a Mac and generate some test images from first principles. Start with exactly the same ghost.raw image you shared in your question. We will take the first 12 bytes as the header, and then generate a file full of red pixels and see if that works:
# Grab first 12 bytes from "ghost.raw" and start a new file "red.raw"
head -c 12 ghost.raw > red.raw
# Now generate 512x108 pixels, where red=ff, green=00, blue=01, alpha=fe and append to "red.raw"
perl -E 'say "ff0001fe" x (512*108)' | xxd -r -p >> red.raw
So you can try using red.raw in place of ghost.raw and tell me what happens.
Now try generating a blue file just the same:
# Grab first 12 bytes from "ghost.raw" and start a new file "blue.raw"
head -c 12 ghost.raw > blue.raw
# Now generate 512x108 pixels, where red=00, green=01, blue=ff, alpha=fe and append to "blue.raw"
perl -E 'say "0001fffe" x (512*108)' | xxd -r -p >> blue.raw
And then try blue.raw.
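For readers without perl or xxd, the same test files can be sketched in Python. This is only an illustration of the steps above: the 512x108 size and the 12-byte header length come from this answer, while the function name solid_raw and the all-zero placeholder header are assumptions made so the sketch runs without ghost.raw.

```python
WIDTH, HEIGHT, HEADER_LEN = 512, 108, 12  # values taken from the answer above


def solid_raw(header: bytes, rgba_hex: str,
              width: int = WIDTH, height: int = HEIGHT) -> bytes:
    """Return the header followed by width*height copies of one 4-byte pixel."""
    if len(header) != HEADER_LEN:
        raise ValueError("expected a 12-byte header")
    pixel = bytes.fromhex(rgba_hex)
    if len(pixel) != 4:
        raise ValueError("expected 4 bytes per pixel (R, G, B, A)")
    return header + pixel * (width * height)


# Placeholder header (assumption) so the sketch is self-contained.
# With the real file, use: open("ghost.raw", "rb").read(HEADER_LEN)
placeholder_header = bytes(HEADER_LEN)

red = solid_raw(placeholder_header, "ff0001fe")   # same bytes as the perl one-liner
blue = solid_raw(placeholder_header, "0001fffe")
```

Writing the result with open("red.raw", "wb").write(red) would then give a file to try in place of ghost.raw.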
Original Answer
AFAIK, your image is actually 512 pixels wide by 108 pixels tall in RGBA8888 format with a 12-byte header at the start, making 12 + 4*(512 * 108) = 221,196 bytes.
You can convert it to PNG or JPEG with ImageMagick like this:
magick -size 512x108+12 -depth 8 RGBA:ghost.raw result.png
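The layout assumed here (a 12-byte header followed by 512x108 pixels of 4 bytes each) can be checked with a few lines of Python. This is a sketch under that assumption only; the helper names expected_size and pixel_at are made up for illustration, and the sample data is synthetic rather than the real ghost.raw.

```python
def expected_size(width: int, height: int,
                  header_len: int = 12, bytes_per_pixel: int = 4) -> int:
    """File size implied by a fixed header plus tightly packed pixels."""
    return header_len + bytes_per_pixel * width * height


def pixel_at(data: bytes, x: int, y: int,
             width: int = 512, header_len: int = 12) -> tuple:
    """Return the (R, G, B, A) bytes of the pixel at column x, row y."""
    offset = header_len + 4 * (y * width + x)
    return tuple(data[offset:offset + 4])


# Synthetic stand-in for ghost.raw: zeroed header, every pixel ff 00 01 fe.
sample = bytes(12) + bytes([0xFF, 0x00, 0x01, 0xFE]) * (512 * 108)

size_matches = len(sample) == expected_size(512, 108)
first_pixel = pixel_at(sample, 0, 0)
last_pixel = pixel_at(sample, 511, 107)
```

If the real file's length differs from expected_size(512, 108), the assumed dimensions or header length are probably wrong.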
I still don't understand from your question or comments what format you actually want - so if you clarify that, I am hopeful we can get you answered.
Try using online converters; they help most of the time.
Websites like these may help:
https://www.freeconvert.com/raw-to-png
https://cloudconvert.com/raw-to-png
https://www.zamzar.com/convert/raw-to-png/
Some of these sites ask you for details about the file, and some are straightforward conversions.

Input reads are 55 million but only 1 million were used for alignment

I ran this command using TopHat (v2.1.0) to align reads (with Bowtie2 v2.2.6.0) from my RNA-seq fastq file using the bowtie2 genomes.bt2 indexes from iGenomes (Homo_sapiens_UCSC_hg19):
tophat2 -p 8 -G /home/ajsn6c/Desktop/Kumar_RNA-seq/Homo_sapiens_UCSC_hg19 /Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/hg19.gtf /home/ajsn6c/Desktop/Kumar_RNA-seq/Homo_sapiens_UCSC_hg19/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome HPDE_S11_L002_R1_001.fastq
My fastq file is around 13 GB. However, after alignment my accepted hits file is only 50 MB.
Here's the alignment output saying I have around 55 million kept reads:
[2018-02-21 13:58:33] Beginning TopHat run (v2.1.0)
[2018-02-21 13:58:33] Checking for Bowtie
Bowtie version: 2.2.6.0
[2018-02-21 13:58:33] Checking for Bowtie index files (genome)..
[2018-02-21 13:58:33] Checking for reference FASTA file
[2018-02-21 13:58:33] Generating SAM header for /home/ajsn6c/Desktop /Kumar_RNA-seq/Homo_sapiens_UCSC_hg19/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome
[2018-02-21 13:58:35] Reading known junctions from GTF file
[2018-02-21 13:58:39] Preparing reads
left reads: min. length=12, max. length=101, 55970267 kept reads (45104 discarded)
Warning: short reads (<20bp) will make TopHat quite slow and take large amount of memory because they are likely to be mapped in too many places
[2018-02-21 14:17:45] Building transcriptome data files Panc1/tmp/genes
[2018-02-21 14:17:59] Building Bowtie index from genes.fa
[2018-02-21 14:32:14] Mapping left_kept_reads to transcriptome genes with Bowtie2
[2018-02-21 15:38:44] Resuming TopHat pipeline with unmapped reads
[2018-02-21 15:38:44] Mapping left_kept_reads.m2g_um to genome genome with Bowtie2
[2018-02-21 16:17:07] Mapping left_kept_reads.m2g_um_seg1 to genome genome with Bowtie2 (1/4)
[2018-02-21 16:18:13] Mapping left_kept_reads.m2g_um_seg2 to genome genome with Bowtie2 (2/4)
[2018-02-21 16:19:32] Mapping left_kept_reads.m2g_um_seg3 to genome genome with Bowtie2 (3/4)
[2018-02-21 16:20:46] Mapping left_kept_reads.m2g_um_seg4 to genome genome with Bowtie2 (4/4)
[2018-02-21 16:21:59] Searching for junctions via segment mapping
[2018-02-21 16:25:24] Retrieving sequences for splices
[2018-02-21 16:27:18] Indexing splices
Building a SMALL index
[2018-02-21 16:27:37] Mapping left_kept_reads.m2g_um_seg1 to genome segment_juncs with Bowtie2 (1/4)
[2018-02-21 16:27:50] Mapping left_kept_reads.m2g_um_seg2 to genome segment_juncs with Bowtie2 (2/4)
[2018-02-21 16:28:03] Mapping left_kept_reads.m2g_um_seg3 to genome segment_juncs with Bowtie2 (3/4)
[2018-02-21 16:28:17] Mapping left_kept_reads.m2g_um_seg4 to genome segment_juncs with Bowtie2 (4/4)
[2018-02-21 16:28:31] Joining segment hits
[2018-02-21 16:31:02] Reporting output tracks
[2018-02-22 19:21:42] A summary of the alignment counts can be found in ./tophat_out/align_summary.txt
[2018-02-22 19:21:42] Run complete: 02:08:37 elapse
This is the alignment summary from the align_summary file:
reads:
Input : 926337
Mapped : 898584 (97.0% of input)
of these: 14621 ( 1.6%) have multiple alignments (14 have >20)
97.0% overall read mapping rate.
Why is the input only 900K when it kept 55 million reads? The reads have excellent Phred quality scores too. Any ideas would be greatly appreciated!
Thanks
Alex
These entries in your log file are odd:
[2018-02-21 14:17:45] Building transcriptome data files Panc1/tmp/genes
[2018-02-21 14:17:59] Building Bowtie index from genes.fa
Here is your tophat2 command (I have restructured it to help with readability):
./tophat2 \
-p 8 \
-G /home/ajsn6c/Desktop/Kumar_RNA-seq/Homo_sapiens_UCSC_hg19 /Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/hg19.gtf \
/home/ajsn6c/Desktop/Kumar_RNA-seq/Homo_sapiens_UCSC_hg19/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome \
HPDE_S11_L002_R1_001.fastq
There seem to be some erroneous whitespaces (e.g. [...]Homo_sapiens_UCSC_hg19 /Homo_sapiens[...]); not sure if that is the issue.
Based on your command, the transcriptome should be built from the file [...]/UCSC/hg19/Sequence/Bowtie2Index/hg19.gtf; I have no idea where Panc1/tmp/genes is coming from, but clearly this file is being used to build a reference transcriptome, and not [...]/hg19.gtf.
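To see why the stray whitespace noted above would matter if it is really present on the command line (it may just be a copy-and-paste artifact of the post), here is a small Python sketch that splits the posted command the way a POSIX shell would. It only shows how the arguments are tokenized; what TopHat then does with them is not established here.

```python
import shlex

# The command as posted, including the space inside the -G path.
command = (
    "tophat2 -p 8 "
    "-G /home/ajsn6c/Desktop/Kumar_RNA-seq/Homo_sapiens_UCSC_hg19 "
    "/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/hg19.gtf "
    "/home/ajsn6c/Desktop/Kumar_RNA-seq/Homo_sapiens_UCSC_hg19"
    "/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome "
    "HPDE_S11_L002_R1_001.fastq"
)

args = shlex.split(command)                # shell-style tokenization
g_index = args.index("-G")
gtf_value = args[g_index + 1]              # what -G actually receives
positional = args[g_index + 2:]            # everything left over

print("-G value:  ", gtf_value)
print("positional:", positional)
```

With the space in place, -G receives only the directory part of the path, and the rest of the GTF path becomes an extra positional argument ahead of the index and the reads file.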

Vowpal Wabbit difference between raw predictions (-r) and predictions (-p)

I am trying to classify binary data. In the data file, class [0,1] is converted to [-1,1]. The data has 21 features, all of them categorical. I am using a neural network for training. The training command is:
vw -d train.vw --cache_file data --passes 5 -q sd -q ad -q do -q fd --binary -f model --nn 22
I create raw prediction file as:
vw -d test.vw -t -i neuralmodel -r raw.txt
And normal prediction file as:
vw -d test.vw -t -i neuralmodel -p out.txt
First five lines of raw file are:
0:-0.861075,-0.696812 1:-0.841357,-0.686527 2:0.796014,0.661809 3:1.06953,0.789289 4:-1.23823,-0.844951 5:0.886767,0.709793 6:2.02206,0.965555 7:-2.40753,-0.983917 8:-1.09056,-0.797075 9:1.22141,0.84007 10:2.69466,0.990912 11:2.64134,0.989894 12:-2.33309,-0.981359 13:-1.61462,-0.923839 14:1.54888,0.913601 15:3.26275,0.995055 16:2.17991,0.974762 17:0.750114,0.635229 18:2.91698,0.994164 19:1.15909,0.820746 20:-0.485593,-0.450708 21:2.00432,0.964333 -0.496912
0:-1.36519,-0.877588 1:-2.83699,-0.993155 2:-0.257558,-0.251996 3:-2.12969,-0.97213 4:-2.29878,-0.980048 5:2.70791,0.991148 6:1.31337,0.865131 7:-2.00127,-0.964116 8:-2.14167,-0.972782 9:2.50633,0.986782 10:-1.09253,-0.797788 11:2.29477,0.97989 12:-1.67385,-0.932057 13:-0.740598,-0.629493 14:0.829695,0.680313 15:3.31954,0.995055 16:3.44069,0.995055 17:2.48612,0.986241 18:1.32241,0.867388 19:1.97189,0.961987 20:1.19584,0.832381 21:1.65151,0.929067 -0.588528
0:0.908454,0.72039 1:-2.48134,-0.986108 2:-0.557337,-0.505996 3:-2.15072,-0.973263 4:-1.77706,-0.944375 5:0.202272,0.199557 6:2.37479,0.982839 7:-1.97478,-0.962201 8:-1.78124,-0.944825 9:1.94016,0.959547 10:-1.67845,-0.932657 11:2.54895,0.987855 12:-1.60502,-0.92242 13:-2.32369,-0.981008 14:1.59895,0.921511 15:2.02658,0.96586 16:2.55443,0.987987 17:3.47049,0.995055 18:1.92482,0.958313 19:1.47773,0.901044 20:-3.60913,-0.995055 21:3.56413,0.995055 -0.809399
0:-2.11677,-0.971411 1:-1.32759,-0.868656 2:2.59003,0.988807 3:-0.198721,-0.196146 4:-2.51631,-0.987041 5:0.258549,0.252956 6:1.60134,0.921871 7:-2.28731,-0.97959 8:-2.89953,-0.993958 9:-0.0972349,-0.0969177 10:3.1409,0.995055 11:1.62083,0.924746 12:-2.30097,-0.980134 13:-2.05674,-0.967824 14:1.6744,0.932135 15:1.85612,0.952319 16:2.7231,0.991412 17:1.97199,0.961995 18:3.47125,0.995055 19:0.603527,0.539567 20:1.25539,0.84979 21:2.15267,0.973368 -0.494474
0:-2.21583,-0.97649 1:-2.16823,-0.974171 2:2.00711,0.964528 3:-1.84079,-0.95087 4:-1.27159,-0.854227 5:-0.0841799,-0.0839635 6:2.24566,0.977836 7:-2.19458,-0.975482 8:-2.42779,-0.98455 9:0.39883,0.378965 10:1.32133,0.86712 11:1.87572,0.95411 12:-2.22585,-0.976951 13:-2.04512,-0.96708 14:1.52652,0.909827 15:1.98228,0.962755 16:2.37265,0.982766 17:1.73726,0.939908 18:2.315,0.980679 19:-0.08135,-0.081154 20:1.39248,0.883717 21:1.5889,0.919981 -0.389856
First five lines of (normal) prediction file are:
-0.496912
-0.588528
-0.809399
-0.494474
-0.389856
I have compared this (normal) output with the raw output and notice that the final float value on each of the five raw lines is the same as the corresponding value above.
I would like to understand both the raw output and the normal output. Does the fact that each line holds 22 pairs of values have something to do with the 22 neurons? How should I interpret the output as [-1,1], and why is a sigmoid function needed to convert either of the above to probabilities? I would be grateful for help.
For binary classification, you should use a suitable loss function (--loss_function=logistic or --loss_function=hinge). The --binary switch just makes sure that the reported loss is the 0/1 loss (but you cannot optimize for 0/1 loss directly, the default loss function is --loss_function=squared).
I recommend trying --nn as one of the last steps when tuning the VW parameters. Usually, it improves the results only a little, and the optimal number of units in the hidden layer is quite small (--nn 1, --nn 2 or --nn 3). You can also try adding direct connections between the input and output layer with --inpass.
Note that --nn always uses tanh as the sigmoid function for the hidden layer, and only one hidden layer is possible (it is hardcoded in nn.cc).
If you want to get probabilities (real numbers in [0,1]), use vw -d test.vw -t -i neuralmodel --link=logistic -p probabilities.txt. If you want the output to be a real number in [-1,1], use --link=glf1.
Without --link and --binary, the --pred output are the internal predictions (in range [-50, 50] when logistic or hinge loss function is used).
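As a rough illustration of what a link function does to an internal prediction, here is a Python sketch of the two squashing functions mentioned above. The formulas are my understanding of the logistic and glf1 links and should be treated as assumptions rather than a copy of VW's code; the input values are the five final predictions from the question.

```python
import math


def logistic_link(raw: float) -> float:
    """Map an internal prediction to (0, 1): 1 / (1 + exp(-raw))."""
    return 1.0 / (1.0 + math.exp(-raw))


def glf1_link(raw: float) -> float:
    """Map an internal prediction to (-1, 1): 2 / (1 + exp(-raw)) - 1."""
    return 2.0 / (1.0 + math.exp(-raw)) - 1.0


# Final predictions taken from the question's out.txt
final_predictions = [-0.496912, -0.588528, -0.809399, -0.494474, -0.389856]

probabilities = [logistic_link(x) for x in final_predictions]
for raw, p in zip(final_predictions, probabilities):
    print(f"{raw:+.6f} -> {p:.4f}")
```

All five values are negative, so each maps to a probability below 0.5. Note that such numbers are only meaningful as probabilities if the model was trained with a loss that supports that reading, such as the logistic loss recommended above.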
As for the --nn --raw question, your guess is correct:
The 22 pairs of numbers correspond to the 22 neurons and the last number is the final (internal) prediction. My guess is that each pair corresponds to the bias and output of each unit on the hidden layer.
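One way to probe what the pairs mean is to compare the second number of each pair with tanh of the first. The Python sketch below does this for the first five pairs of the first raw line in the question; the parsing helper is hypothetical and the interpretation is an observation about these sample values, not something confirmed from VW's source.

```python
import math

# First five "index:a,b" pairs of the first raw line in the question
raw_line_prefix = (
    "0:-0.861075,-0.696812 1:-0.841357,-0.686527 2:0.796014,0.661809 "
    "3:1.06953,0.789289 4:-1.23823,-0.844951"
)


def parse_pairs(text: str) -> list:
    """Split 'i:a,b' tokens into (i, a, b) tuples."""
    pairs = []
    for token in text.split():
        index, values = token.split(":")
        first, second = (float(v) for v in values.split(","))
        pairs.append((int(index), first, second))
    return pairs


pairs = parse_pairs(raw_line_prefix)

# Largest difference between tanh(first) and the reported second value
max_gap = max(abs(math.tanh(first) - second) for _, first, second in pairs)
print(f"largest |tanh(a) - b| over {len(pairs)} pairs: {max_gap:.2e}")
```

For these pairs the gap is tiny, which suggests that the first number is the unit's input (pre-activation) and the second is its tanh output, rather than a bias and an output. In the posted data, large inputs all report 0.995055 (for example 3.26275), so the output appears to be capped; the reason for that cap is not established here.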

Issue in creating Vectors from text in Mahout

I'm using Mahout 0.9 (installed on HDP 2.2) for topic discovery (Latent Dirichlet Allocation algorithm). I have my text file stored in the directory inputraw and executed the following commands in order:
command #1:
mahout seqdirectory -i inputraw -o output-directory -c UTF-8
command #2:
mahout seq2sparse -i output-directory -o output-vector-str -wt tf -ng 3 --maxDFPercent 40 -ow -nv
command #3:
mahout rowid -i output-vector-str/tf-vectors/ -o output-vector-int
command #4:
mahout cvb -i output-vector-int/matrix -o output-topics -k 1 -mt output-tmp -x 10 -dict output-vector-str/dictionary.file-0
After executing the second command, as expected, it creates a bunch of subfolders and files under output-vector-str (named df-count, dictionary.file-0, frequency.file-0, tf-vectors, tokenized-documents and wordcount). The sizes of these files all look OK considering the size of my input file; however, the file under tf-vectors is very small (in fact, only 118 bytes).
Since tf-vectors is the input to the third command, the third command also generates a small file. Does anyone know:
What is the reason for the file under the tf-vectors folder being that small? There must be something wrong.
Starting from the first command, all the generated files have a strange encoding and are not human readable. Is this expected?
Your answers are as follows:
what is the reason of the file under tf-vectors folder to be that small?
The vectors are small because you set --maxDFPercent to only 40, which means that only terms with a document frequency (the percentage of documents in which the term occurs) of 40% or less are taken into consideration when generating the vectors. Terms that occur in more than 40% of the documents are dropped.
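The effect of a maximum document-frequency cutoff can be sketched in a few lines of Python. This is not Mahout's implementation: the function name, the toy corpus and the choice of "less than or equal" as the comparison are all assumptions made for illustration.

```python
from collections import Counter


def filter_by_max_df(docs: list, max_df_percent: float) -> set:
    """Keep terms that occur in at most max_df_percent of the documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))       # count each term once per document
    return {
        term for term, count in doc_freq.items()
        if 100.0 * count / n_docs <= max_df_percent
    }


corpus = [
    ["topic", "model", "text"],
    ["topic", "vector"],
    ["topic", "lda", "text"],
    ["mahout", "vector"],
    ["hadoop"],
]

kept_many = filter_by_max_df(corpus, 40)                          # "topic" (60%) is dropped
kept_single = filter_by_max_df([["topic", "model", "text"]], 40)  # one document only
print(sorted(kept_many))
print(sorted(kept_single))
```

With a single input document, every term has a document frequency of 100%, so a 40% cutoff leaves nothing. If inputraw really holds just one text file, that may be why tf-vectors is nearly empty, though this is a guess from the question's wording.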
all the generated files have a strange coding and are not human readable. Is this something expected?
Yes, this is expected: the files are stored in Hadoop's binary SequenceFile format. Mahout provides a command, mahout seqdumper, that dumps such files into human-readable text.
Good Luck!!

How to zgrep the last line of a gz file without tail

Here is my problem: I have a set of big gz log files, and the very first piece of information on each line is a datetime text, e.g. 2014-03-20 05:32:00.
I need to check which set of log files holds specific data.
For the first line I simply do:
'-query-data-'
zgrep -m 1 '^20140320-04' 20140320-0{3,4}*gz
But how can I do the same with the last line without processing the whole file, as would be done with zcat (too heavy):
zcat foo.gz | tail -1
Additional info: those logs are created with the date and time of their initial record, so if I want to query logs at 14:00:00 I also have to search in files created BEFORE 14:00:00, as a file could be created at 13:50:00 and closed at 14:10:00.
The easiest solution would be to alter your log rotation to create smaller files.
The second easiest solution would be to use a compression tool that supports random access.
Projects like dictzip, BGZF, and csio each add sync flush points at various intervals within gzip-compressed data, which allow a program aware of that extra information to seek to them. While this exists in the standard, vanilla gzip does not add such markers either by default or by option.
Files compressed by these random-access-friendly utilities are slightly larger (by perhaps 2-20%) due to the markers themselves, but fully support decompression with gzip or another utility that is unaware of these markers.
You can learn more at this question about random access in various compression formats.
There's also a "Blasted Bioinformatics" blog by Peter Cock with several posts on this topic, including:
BGZF - Blocked, Bigger & Better GZIP! – gzip with random access (like dictzip)
Random access to BZIP2? – An investigation (result: can't be done, though I do it below)
Random access to blocked XZ format (BXZF) – xz with improved random access support
Experiments with xz
xz (an LZMA compression format) actually has random access support on a per-block level, but you will only get a single block with the defaults.
File creation
xz can concatenate multiple archives together, in which case each archive would have its own block. The GNU split can do this easily:
split -b 50M --filter 'xz -c' big.log > big.log.sp.xz
This tells split to break big.log into 50MB chunks (before compression) and run each one through xz -c, which outputs the compressed chunk to standard output. We then collect that standard output into a single file named big.log.sp.xz.
To do this without GNU, you'd need a loop:
split -b 50M big.log big.log-part
for p in big.log-part*; do xz -c $p; done > big.log.sp.xz
rm big.log-part*
Parsing
You can get the list of block offsets with xz --verbose --list FILE.xz. If you want the last block, you need its compressed size (column 5) plus 36 bytes for overhead (found by comparing the size to hd big.log.sp0.xz |grep 7zXZ). Fetch that block using tail -c and pipe that through xz. Since the above question wants the last line of the file, I then pipe that through tail -n1:
SIZE=$(xz --verbose --list big.log.sp.xz |awk 'END { print $5 + 36 }')
tail -c $SIZE big.log.sp.xz |unxz -c |tail -n1
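The same idea can be sketched in Python without calling xz --list, by reading the footer and index of the final stream directly. This is a minimal sketch that assumes the archive is a plain concatenation of xz streams with no stream padding between them; the helper names are hypothetical, the tiny log chunks are made up, and a real script would seek to the end of the file instead of holding it all in memory.

```python
import lzma


def read_varint(buf: bytes, pos: int) -> tuple:
    """Decode one xz multibyte integer at buf[pos]; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def last_stream_offset(data: bytes) -> int:
    """Return the offset at which the final xz stream in data begins."""
    footer = data[-12:]                          # stream footer is 12 bytes
    if footer[-2:] != b"YZ":
        raise ValueError("no xz stream footer at end of data")
    index_size = (int.from_bytes(footer[4:8], "little") + 1) * 4
    index = data[-12 - index_size:-12]
    if index[0] != 0:
        raise ValueError("index indicator byte not found")
    records, pos = read_varint(index, 1)
    blocks_size = 0
    for _ in range(records):
        unpadded, pos = read_varint(index, pos)
        _uncompressed, pos = read_varint(index, pos)
        blocks_size += (unpadded + 3) // 4 * 4   # blocks are padded to 4 bytes
    # 12-byte stream header + blocks + index + 12-byte stream footer
    return len(data) - (12 + blocks_size + index_size + 12)


# Stand-in for big.log.sp.xz: three chunks compressed separately, then joined.
chunks = [
    b"2014-03-20 05:32:00 first\n2014-03-20 05:40:00 second\n",
    b"2014-03-20 13:50:00 third\n",
    b"2014-03-20 14:05:00 fourth\n2014-03-20 14:10:00 last record\n",
]
archive = b"".join(lzma.compress(chunk) for chunk in chunks)

start = last_stream_offset(archive)
last_line = lzma.decompress(archive[start:]).splitlines()[-1]
print(last_line.decode())
```

Only the bytes from start onward are decompressed. For a single-block stream, the header, index and footer typically add up to the 36 bytes of overhead mentioned above.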
Side note
Version 5.1.1 introduced support for the --block-size flag:
xz --block-size=50M big.log
However, I have not been able to extract a specific block since it doesn't include full headers between blocks. I suspect this is nontrivial to do from the command line.
Experiments with gzip
gzip also supports concatenation. I (briefly) tried mimicking this process for gzip without any luck. gzip --verbose --list doesn't give enough information and it appears the headers are too variable to find.
This would require adding sync flush points, and since their size depends on the size of the last buffer in the previous compression, that's too hard to do on the command line (use dictzip or another of the previously discussed tools).
I did apt-get install dictzip and played with dictzip, but just a little. It doesn't work without arguments, creating a (massive!) .dz archive that neither dictunzip nor gunzip could understand.
Experiments with bzip2
bzip2 has headers we can find. This is still a bit messy, but it works.
Creation
This is just like the xz procedure above:
split -b 50M --filter 'bzip2 -c' big.log > big.log.sp.bz2
I should note that this is considerably slower than xz (48 min for bzip2 vs 17 min for xz vs 1 min for xz -0) as well as considerably larger (97M for bzip2 vs 25M for xz -0 vs 15M for xz), at least for my test log file.
Parsing
This is a little harder because we don't have the nice index. We have to guess at where to go, and we have to err on the side of scanning too much, but with a massive file, we'd still save I/O.
My guess for this test was 50000000 (out of the original 52428800, a pessimistic guess that isn't pessimistic enough for e.g. an H.264 movie.)
GUESS=50000000
LAST=$(tail -c$GUESS big.log.sp.bz2 \
|grep -abo 'BZh91AY&SY' |awk -F: 'END { print '$GUESS'-$1 }')
tail -c $LAST big.log.sp.bz2 |bunzip2 -c |tail -n1
This takes just the last 50 million bytes, finds the binary offset of the last BZIP2 header, subtracts that from the guess size, and pulls that many bytes off of the end of the file. Just that part is decompressed and thrown into tail.
Because this has to query the compressed file twice and has an extra scan (the grep call seeking the header, which examines the whole guessed space), this is a suboptimal solution. See also the below section on how slow bzip2 really is.
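The header search described above can also be sketched in Python, which avoids reading the tail of the file twice. This is an illustration only: it assumes every chunk was compressed separately at level 9 (so each stream starts with the bytes BZh91AY&SY), the function name is hypothetical, and the sample log chunks are made up. The magic bytes could in principle also occur by chance inside compressed data.

```python
import bz2

# "BZh9" stream header followed by the first block's magic number
STREAM_MAGIC = b"BZh91AY&SY"


def last_line_from_tail(archive: bytes, guess: int) -> bytes:
    """Decompress only the last bzip2 stream found in the final `guess` bytes."""
    tail = archive[-guess:]
    start = tail.rfind(STREAM_MAGIC)
    if start < 0:
        raise ValueError("no stream header in the guessed tail; increase guess")
    return bz2.decompress(tail[start:]).splitlines()[-1]


# Stand-in for big.log.sp.bz2: three chunks compressed separately, then joined.
chunks = [
    b"2014-03-20 05:32:00 first\n2014-03-20 05:40:00 second\n",
    b"2014-03-20 13:50:00 third\n",
    b"2014-03-20 14:05:00 fourth\n2014-03-20 14:10:00 last record\n",
]
archive = b"".join(bz2.compress(chunk) for chunk in chunks)  # default level is 9

last_line = last_line_from_tail(archive, guess=200)
print(last_line.decode())
```

As with the shell version, the guess has to be large enough to contain the whole final stream, otherwise no header is found and the guess must be increased.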
Perspective
Given how fast xz is, it's easily the best bet; using its fastest option (xz -0) is quite fast to compress or decompress and creates a smaller file than gzip or bzip2 on the log file I was testing with. Other tests (as well as various sources online) suggest that xz -0 is preferable to bzip2 in all scenarios.
             ----- No Random Access -----     ------- Random Access --------
FORMAT        SIZE   RATIO   WRITE   READ      SIZE   RATIO   WRITE    SEEK
----------   ----------------------------     ------------------------------
(original)   7211M  1.0000       -   0:06     7211M  1.0000       -    0:00
bzip2          96M  0.0133   48:31   3:15       97M  0.0134   47:39    0:00
gzip           79M  0.0109    0:59   0:22
dictzip                                        605M  0.0839    1:36  (fail)
xz -0          25M  0.0034    1:14   0:12       25M  0.0035    1:08    0:00
xz             14M  0.0019   16:32   0:11       14M  0.0020   16:44    0:00
Timing tests were not comprehensive: I did not average anything, and disk caching was in use. Still, they look correct; there is a very small amount of overhead from split plus launching 145 compression instances rather than just one (this may even be a net gain if it allows an otherwise non-multithreaded utility to consume multiple threads).
Well, you can randomly access a gzipped file if you have previously created an index for it ...
I've developed a command line tool which creates indexes for gzip files which allow for very quick random access inside them:
https://github.com/circulosmeos/gztool
The tool has two options that may be of interest for you:
-S supervises a still-growing file and creates an index for it as it grows. This can be useful for gzipped rsyslog files, as it reduces the index creation time to practically zero.
-t tails a gzip file, so you can do: $ gztool -t foo.gz | tail -1
Please note that if the index doesn't exist, this will take as long as a complete decompression; but since the index is reusable, subsequent searches will be much faster!
This tool is based on zran.c demonstration code from original zlib, so there's no out-of-the-rules magic!

Resources