I have a RNA-seq bam file and there are few reads that are puzzling me.
According to the bam header, this bam file is sorted by coordinate, created using tophat and markduplicate step is not done. But some reads are marked for being duplicate in the samflag. What is worse is when I run picard markduplicate, these reads' pcr duplicate flag is toggled, marking them not a duplicate. Also I manually found the duplicate of this read (identical reads with same start positions and mates start position) so initial marking looks true.
So my questions are:
Any idea why would this happen?
Does Tophat mark reads for being duplicate? ( I don't think so)
And does picard markduplicate will toggle if the reads are already marked for being duplicate?
Here is how the read looks before and after mark duplicate step.
Before:
C0RTF 1187 17 7579880 255 61M10754N40M = 7579927 10902 CTC...
0UNP1 163 17 7579880 255 61M10754N40M = 7579927 10902 CTC...
After Markduplicate
C0RTF 163 17 7579880 255 61M10754N40M = 7579927 10902 CTC...
0UNP1 163 17 7579880 255 61M10754N40M = 7579927 10902 CTC...
Thanks
Related
I have two 100% identical empty .sh shell script files on Mac:
encrypt.sh: 299 bytes
decrypt.sh: 13 bytes (Actually this size is correct, since I have 13 bytes: 11 character + two new line)
The contents of encrypt.sh and its hexdump:
The contents of decrypt.sh and its hexdump:
The file info window of encrypt.sh:
The file info window of decrypt.sh:
They have the exact same hexdump, then how is it possible that they have different sizes?
Mac OS X file system is implementing forks, so the larger one is likely having something specific stored in its resource fork.
Use ls -l# to get more details.
My script is perfectly fine and produce a file. The file is in plain text and is formatted like how (My expect results should look like this.) is formatted. However when I try to send my file to my email the formatting is completly wrong.
The line of code I am using to send my email.
cat ReportEmail | mail -s 'Report' bob#aol.com
The result I am getting on my email.
30129 22.65 253
96187 72.32 294
109525 82.35 295
10235 7.7 105
5906 4.44 106
76096 57.22 251
My expect results should look like this.
30129 22.65 253
96187 72.32 294
109525 82.35 295
10235 7.7 105
5906 4.44 106
76096 57.22 251
Your source file achieves the column alignment by using a combination of tabs and spaces. The width assigned to a tab, however, can vary from program to program. Widths of 4, 5, or 8 spaces, for example, are common. If you want consistent formatting in plain text from one viewer to the next, use only spaces.
As a workaround, you can expand the the tabs to spaces before passing the file to mail using the expand utility:
expand -t 8 ReportEmail.txt | mail -s 'Report' bob#aol.com
The option -t 8 tells expand to treat tabs as 8 spaces wide. Change the 8 to whatever number consistently makes the format in ReportEmail.txt work properly.
Need to recruit the help of any budding bioinformaticians that are lurking in the shadows here.
I am currently in the process of formatting some .fasta files for use in a set of grouping programs but I cannot for the life of me get them to work. First things first, all the files have to have a 3 or 4 character name such as the following:
PP41.fasta
PP59.fasta
PPBD.fasta
...etc...
The files must have headers for each gene sequence that look like so: >xxxx|yyyyyyyyyy where xxxx is the same 3 or 4 letter 'taxon' identifier as the file names I put above and yyyyyyy is a numerical identifier for each of the proteins within each of the taxons (the pipe symbol can also be replaced with an _ as below). I then cat all of these in to one file which has a header that looks correct like so:
>PP49_00001
MIENFNENNDMSDMFWEVEKGTGEVINLVPNTSNTVQPVVLMRLGLFVPTLKSTKRGHQG
EMSSMDATAELRQLAIVKTEGYENIHITGARLDMDNDFKTWVGIIHSFAKHKVIGDAVTL
SFVDFIKLCGIPSSRSSKRLRERLGASLRRIATNTLSFSSQNKSYHTHLVQSAYYDMVKD
TVTIQADPKIFELYQFDRKVLLQLRAINELGRKESAQALYTYIESLPPSPAPISLARLRA
RLNLRSRVTTQNAIVRKAMEQLKGIGYLDYTEIKRGSSVYFIVHARRPKLKALKSSKSSF
KRKKETQEESILTELTREELELLEIIRAEKIIKVTRNHRRKKQTLLTFAEDESQ*
>PP49_00002
MQNDIILPINKLHGLKLLNSLELSDIELGELLSLEGDIKQVSTGNNGIVVHRIDMSEIGS
FLIIDSGESRFVIKAS*
Next step is to construct a blast database which I do as follows, using the formatdb tool of NCBI Blast:
formatdb -i allproteins.fasta -p T -o T
This produces a set of files for the database. Next I conduct an all-vs-all BLAST of the concatenated proteins against the database that I made of them like so, which outputs a tabular file which I suspect is where my issues are beginning to arise:
blastall -p blastp -d allproteins.fasta -i allproteins.fasta -a 6 -F '0 S' -v 100000 -b 100000 -e 1e-5 -m 8 -o plasmid_allvall_blastout
These files have 12 columns and look like the below. It appears correct to me, but my supervisor suspects the error is in the blast file - I don't know what I'm doing wrong however.
PP49_00001 PP51_00025 100.00 354 0 0 1 354 1 354 0.0 552
PP49_00001 PP49_00001 100.00 354 0 0 1 354 1 354 0.0 552
PP49_00001 PPTI_00026 90.28 288 28 0 1 288 1 288 3e-172 476
PP49_00001 PPNP_00026 90.28 288 28 0 1 288 1 288 3e-172 476
PP49_00001 PPKC_00016 89.93 288 29 0 1 288 1 288 2e-170 472
PP49_00001 PPBD_00021 89.93 288 29 0 1 288 1 288 2e-170 472
PP49_00001 PPJN_00003 91.14 79 7 0 145 223 2 80 8e-47 147
PP49_00002 PPTI_00024 100.00 76 0 0 1 76 1 76 3e-50 146
PP49_00002 PPNP_00024 100.00 76 0 0 1 76 1 76 3e-50 146
PP49_00002 PPKC_00018 100.00 76 0 0 1 76 1 76 3e-50 146
SO, this is where the problems really begin. I now pass the above file to a program called orthAgogue which analyses the paired sequences I have above using parameters laid out in the manual (still no idea if I'm doing anything wrong) - all I know is the several output files that are produced are all just nonsense/empty.
Command looks like so:
orthAgogue -i plasmid_allvsall_blastout -t 0 -p 1 -e 5 -O .
Any and all ideas welcome! (Hope I've covered everything - sorry about the long post!)
EDIT Never did manage to find a solution to this. Had to use an alternative piece of software. If admins wish to close this please do, unless it is worth having open for someone else (though I suspect its a pretty niche issue).
Discovered this issue (of orthAgogue) first today:
though my reply may be old, I hope it may help future users;
issue is due to a missing parameter: seems like you forgot to specify the separator: -s '_', ie, the following set of command-line parameters should do the trick*:
orthAgogue -i plasmid_allvsall_blastout -t 0 -p 1 -e 5 -O -s '_'
(* Under the assumption that your input-file is a tabular-seperated file of columns.)
A brief update after comment made by Joe:
In brief, the problem described in the intiail error report (by Joe) is (in most cases) not a bug. Instead it is one of the core properties of the Inparanoid algorithm which orthAgogue implements: if your ortholog-result-file is empty (though constructed), this (in most cases) implies that there are no reciprocal best match between a protein-pair from two different taxa/species.
One (of many) explanations for this could be that your blastp-scores are too similar, a case where I would suggest a combined tree-based/homology clustering as in TREEFAM.
Therefore, when I receive your data, I'll send it to one of the biologists I'm working with, with goal of identifying the tool proper for your data: hope my last comment makes your day ;)
Ole Kristian Ekseth, developer of orthAgogue
what I want is to add eps file into temporary ps file which has text written, then I convert my ps file to eps file using ghostscript, but when I see my eps file in AI outline mode, I see extra square around my eps file which is size box, which should not be there, It should be part of single compound box
Ghostscript version is 9.05 and before I include eps into ps I need to resize it. So resized eps file shows page border into outline mode. Which is actually not there, but when it goes to machine it will cut out that path which should not be case.
Alright, I think I have some understanding of what you're doing and where the trouble may be creeping in. As I commented, you're running the file through ghostscript multiple times. Each time it has to interpret the postscript code and construct an internal display-list representation and then recreate appropriate postscript code on the output end. So it's the clone of a clone of a clone problem. Any little hiccup can cause cascade failures.
So, enough of being preachy. The alternative is to manipulate the eps file as text.
So, if we want the image to be scaled to fill a 500x500 square, we will be guided by the number in the BoundingBox comment. I'll quote this silly file from the linked question as an example:
%!PS-Adobe-2.0 EPSF-2.0
%%BoundingBox: 72 700 127 708 %<-- modify this
%%HiResBoundingBox: 72.000000 700.000000 127.000000 707.500000 %<-- delete this
%%EndComments
% EPSF created by ps2eps 1.68
%%BeginProlog
save
countdictstack
mark
newpath
/showpage {} def
/setpagedevice {pop} def
%%EndProlog
%%Page 1 1 %<-- insert translate and scale after this line
/Times-Roman findfont
11 scalefont setfont
72 700 moveto
(This is a test)show
%%Trailer
cleartomark
countdictstack
exch sub { end } repeat
restore
%%EOF
So, the BoundingBox was %%BoundingBox: 72 700 127 708 and it needs to be 0 0 500 500 or rather (to preserve the aspect ratio) 0 0 x 500 or 0 0 500 y where x or y (whichever it happens to be) is < 500. The existing size is 127-72 x 708 - 700 = 55 x 8. So our scaling factor is 500/55. But we also want to translate the lower-left corner to the origin, and its simplest to do that first, so the scaling doesn't affect the interpretation of the numbers.
So, to take 72 700 127 708 to 0 0 500 y, first we add -72 -700 translate to the file, and modify the bounding box to 0 0 55 8, and delete that silly HiRes line: we don't really need it.
Then, we add 500 55 div dup scale (let the interpreter do the math, hee hee). So the maximum x will now be 500, but, oh, what to put for the y? A quick calculation yields 72!
So, this awk program will modify an eps file to be 500 points wide, with y scaled appropriately.
/%%BoundingBox: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*)/{x=$2;y=$3;w=$4-x;h=$5-y;print $1,0,0,500,(500/w)*h}
!/%%BoundingBox:/&&!/%%HiRes/{print}
/%%Page /{print -x,-y,"translate"; print 500,w,"div dup scale"}
Usage:
$ awk -f epsscale.awk etest.eps
Hi I am having an error saying "Train dataset for temp stage can not filled. Branch training terminated. Cascade Classifier can't be trained. check the used training parameters" when I am trying to do training. I used 50 positive and 100 negative images. I saw a similar question here . My bg.txt file is already in the form mentioned in that solution but still having the error.
my console output was as follows-
C:\Users\Administrator\Documents\Visual Studio 2010\Projects\cv_traincascade
\Debug>cv_traincascade.exe -data test -vec positives.vec -bg infofile.txt -numPos 50 -
numNeg 100 -numStages 20 -precalcValBufSize 1024 -precalcIdxBufSize 1024 -w 24 -h 24
PARAMETERS:
cascadeDirName: test
vecFileName: positives.vec
bgFileName: infofile.txt
numPos: 50
numNeg: 100
numStages: 20
precalcValBufSize[Mb] :1024
precalcIdxBufSize[Mb] :1024
stageType: BOOST
featureType: HAAR
sampleWidth: 24
sampleHeight: 24
boostType: GAB
minHitRate: 0.995
maxFalseAlarmRate: 0.5
weightTrimRate: 0.95
maxDepth: 1
maxWeakCount: 100
mode: BASIC
===== TRAINING 0-stage =====
<BEGIN
POS count : consumed 50 : 50
Train dataset for temp stage can not be filled. Branch training terminated.
Cascade classifier can't be trained. Check the used training parameters.
Can anyone please say what is wrong in my command? any help will be appreciated. thank you.
You have a problem with your Background dataset, otherwise it would move through the Bg Count and start the classifier. Check the file locations are all correct in your background file "infofile.txt"
EDIT: Upload a portion of your bg file.