orthAgogue incorrectly processing BLAST files - terminal

Need to recruit the help of any budding bioinformaticians that are lurking in the shadows here.
I am currently formatting some .fasta files for use in a set of grouping programs, but I cannot for the life of me get them to work. First things first: all the files have to have a 3- or 4-character name, such as the following:
PP41.fasta
PP59.fasta
PPBD.fasta
...etc...
The files must have headers for each gene sequence that look like so: >xxxx|yyyyyyyyyy, where xxxx is the same 3- or 4-letter 'taxon' identifier as in the file names above and yyyyyyyyyy is a numerical identifier for each of the proteins within each taxon (the pipe symbol can also be replaced with an _, as below). I then cat all of these into one file, whose headers look like so:
>PP49_00001
MIENFNENNDMSDMFWEVEKGTGEVINLVPNTSNTVQPVVLMRLGLFVPTLKSTKRGHQG
EMSSMDATAELRQLAIVKTEGYENIHITGARLDMDNDFKTWVGIIHSFAKHKVIGDAVTL
SFVDFIKLCGIPSSRSSKRLRERLGASLRRIATNTLSFSSQNKSYHTHLVQSAYYDMVKD
TVTIQADPKIFELYQFDRKVLLQLRAINELGRKESAQALYTYIESLPPSPAPISLARLRA
RLNLRSRVTTQNAIVRKAMEQLKGIGYLDYTEIKRGSSVYFIVHARRPKLKALKSSKSSF
KRKKETQEESILTELTREELELLEIIRAEKIIKVTRNHRRKKQTLLTFAEDESQ*
>PP49_00002
MQNDIILPINKLHGLKLLNSLELSDIELGELLSLEGDIKQVSTGNNGIVVHRIDMSEIGS
FLIIDSGESRFVIKAS*
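(For reference, the concatenation step is just the following; a sketch, assuming all the per-taxon .fasta files sit in one directory and share the PP prefix shown above:)
cat PP*.fasta > allproteins.fasta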
The next step is to construct a BLAST database, which I do as follows using the formatdb tool of NCBI BLAST:
formatdb -i allproteins.fasta -p T -o T
This produces a set of files for the database. Next I conduct an all-vs-all BLAST of the concatenated proteins against the database I made from them, like so. This outputs a tabular file, which is where I suspect my issues are beginning to arise:
blastall -p blastp -d allproteins.fasta -i allproteins.fasta -a 6 -F '0 S' -v 100000 -b 100000 -e 1e-5 -m 8 -o plasmid_allvsall_blastout
This file has 12 columns and looks like the below. It appears correct to me, but my supervisor suspects the error is in the BLAST file; I don't know what I'm doing wrong, however.
PP49_00001 PP51_00025 100.00 354 0 0 1 354 1 354 0.0 552
PP49_00001 PP49_00001 100.00 354 0 0 1 354 1 354 0.0 552
PP49_00001 PPTI_00026 90.28 288 28 0 1 288 1 288 3e-172 476
PP49_00001 PPNP_00026 90.28 288 28 0 1 288 1 288 3e-172 476
PP49_00001 PPKC_00016 89.93 288 29 0 1 288 1 288 2e-170 472
PP49_00001 PPBD_00021 89.93 288 29 0 1 288 1 288 2e-170 472
PP49_00001 PPJN_00003 91.14 79 7 0 145 223 2 80 8e-47 147
PP49_00002 PPTI_00024 100.00 76 0 0 1 76 1 76 3e-50 146
PP49_00002 PPNP_00024 100.00 76 0 0 1 76 1 76 3e-50 146
PP49_00002 PPKC_00018 100.00 76 0 0 1 76 1 76 3e-50 146
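(As a quick sanity check, not part of the original commands: -m 8 output should be exactly 12 tab-separated columns on every line, which awk can verify; the following should print nothing.)
awk -F'\t' 'NF != 12' plasmid_allvsall_blastout | head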
So, this is where the problems really begin. I now pass the above file to a program called orthAgogue, which analyses the paired sequences above using parameters laid out in the manual (still no idea if I'm doing anything wrong). All I know is that the several output files produced are all just nonsense/empty.
Command looks like so:
orthAgogue -i plasmid_allvsall_blastout -t 0 -p 1 -e 5 -O .
Any and all ideas welcome! (Hope I've covered everything - sorry about the long post!)
EDIT: Never did manage to find a solution to this. Had to use an alternative piece of software. If admins wish to close this, please do, unless it is worth keeping open for someone else (though I suspect it's a pretty niche issue).

I first discovered this issue (with orthAgogue) today:
though my reply may be old, I hope it may help future users;
the issue is due to a missing parameter: it seems you forgot to specify the separator with -s '_', i.e., the following set of command-line parameters should do the trick*:
orthAgogue -i plasmid_allvsall_blastout -t 0 -p 1 -e 5 -O . -s '_'
(* Under the assumption that your input file is a tab-separated file of columns.)
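(A hypothetical quick check, not from the original answer, to confirm the separator assumption: every query and subject ID should split into exactly two parts, taxon and protein number, on '_'. The following should print nothing.)
awk -F'\t' 'split($1, a, "_") != 2 || split($2, b, "_") != 2 { print $1, $2 }' plasmid_allvsall_blastout | head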
A brief update after a comment made by Joe:
In brief, the problem described in the initial error report (by Joe) is (in most cases) not a bug. Instead it is one of the core properties of the InParanoid algorithm which orthAgogue implements: if your ortholog result file is empty (though constructed), this (in most cases) implies that there are no reciprocal best matches between protein pairs from two different taxa/species.
One (of many) explanations for this could be that your blastp scores are too similar, a case where I would suggest a combined tree-based/homology clustering, as in TREEFAM.
Therefore, when I receive your data, I'll send it to one of the biologists I'm working with, with the goal of identifying the proper tool for your data: hope my last comment makes your day ;)
Ole Kristian Ekseth, developer of orthAgogue

Related

Can a large number of arguments deteriorate the performance of a ksh or bash script?

I'm running a KornShell script which originally has 61 input arguments:
./runOS.ksh 2.8409 24 40 0.350 0.62917 8 1 2 1.00000 4.00000 0.50000 0.00 1 1 4900.00 1.500 -0.00800 1.500 -0.00800 1 100.00000 20.00000 4 1.0 0.0 0.0 0.0 1 90 2 0.10000 0.10000 0.10000 1.500 -0.008 3.00000 0.34744 1.500 -0.008 1.500 -0.008 0.15000 0.21715 1.500 -0.008 0.00000 1 1.334 0 0.243 0.073 0.642 0.0229 38.0 0.03071 2 0 15 -1 20 1
I only vary 6 of them. Would it make a difference in performance if I fixed the remaining 55 arguments inside the script and passed only the variable ones, say:
./runOS.ksh 2.8409 24 40 0.350 0.62917 8
If anyone has a quick/general answer to this, it will be highly appreciated, since it might take me a long time to fix the 55 extra arguments inside the script and I'm afraid it won't change anything.
There's no performance impact from what you're asking about, but I see other issues:
What is the command-line length limit on your system? You mention 61 input parameters, some of them 8 characters long. If the number of input parameters increases, you might run into the maximum command-line length.
Are you really running the script 440 million times? That's too much, far too much. You need to reconsider why you're doing this: you mention needing to wait ±153 days for the executions to finish, which is far too long (and unpredictable).
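(Two asides, not from the original answer: the per-exec limit on combined argument and environment size can be queried with getconf, and fixing the constants inside the script is mechanical. The variable names below are hypothetical placeholders.)
# check the system's limit on command-line plus environment size
getconf ARG_MAX
# inside runOS.ksh: read only the six varying values from the command line,
# and turn the former positional constants into plain variables
var1=$1 var2=$2 var3=$3 var4=$4 var5=$5 var6=$6
const7=1 const8=2 const9=1.00000   # ...and so on for the remaining constants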

strace'ing/profiling a bash script

I'm currently trying to benchmark a bash script in 4 different versions. Each one does a giant rsync job and usually takes a very long time to finish. There are many steps in the bash script which involve setting up and tearing down the environment to rsync into.
However, when I ran strace on the bash scripts, I got surprisingly short results, which leads me to believe that strace is not actually tracing the time spent waiting for a command like rsync (which might be spawned in a subshell and so not be recorded by strace at all), or that it is waking up intermittently and sleeping for stretches of time that strace is not counting. Here's a snippet:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.98   12.972555      120116       108        52 wait4
  0.01    0.000751          13        56           clone
  0.00    0.000380           1       553           rt_sigprocmask
  0.00    0.000303           2       197        85 stat
  0.00    0.000274           2       134           read
  0.00    0.000223          19        12           open
  0.00    0.000190          48         4           getdents
  0.00    0.000110           1        82         8 close
  0.00    0.000110           1       153           rt_sigaction
  0.00    0.000084           1        61           getegid
  0.00    0.000074           4        19           write
So what tools can I use that are similar to strace? Or am I missing some kind of recursive flag in strace that would correctly show what my bash script is waiting on?
I would like something along the lines of:
% time command
------ --------
... rsync
... ls
Any suggestions would be appreciated. Thank you!
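(One hedged suggestion, not part of the original post: by default strace traces only the shell process itself, so time spent in children like rsync shows up only as wait4, as in the table above. The -f flag follows forked children, and combined with -c it aggregates per-syscall counts across all of them. The script name below is a placeholder.)
strace -f -c ./myscript.sh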

Having trouble creating and saving a text file in shell scripting

I am attempting to save a few lines into an output file called Temporal_regions.txt in my folder, so I can continue with further processing.
I took a look at some other posts, and they suggested doing it this way:
echo "some file content" > /path/to/outputfile
So my returned output looks like this:
1 3 13579 586 Right-Temporal 72 73 66 54
2 5 24680 587 Left-Temporal 89 44 65 56
3 7 34552 599 Right-Temporal 72 75 66 54
4 8 24451 480 Left-Temporal 68 57 47 66
*All of these lines were individually returned as output using the grep command on another file (call it TR.stats).
If I want to store these outputs in the .txt file (call it TemperolRegion.txt), I would have to first create the output file, right? What would be the command to do this?
Then I can just use the way suggested above to store them in the .txt file?
grep "Left-Temporal" TR.stats > /path/to/TemperolRegion.txt
I can't seem to get the commands right.
It sounds like you are missing the directories along the path. In Linux, you have to create the parent directory before you create a file in it. You can create every directory along the path by using mkdir with the -p option.
mkdir -p /path/to
grep "Left-Temporal" TR.stats > /path/to/TemperolRegion.txt
Another problem you might have is that you are not in the right directory relative to TR.stats. In that case you should use the absolute path for that file as well.
grep "Left-Temporal" /path/to/TR.stats > /path/to/TemperolRegion.txt

Otool - Get file size only

I'm using Otool to look into a compiled library (.a) and I want to see what the file size of each component in the binary is. I see that
otool -l [lib.a]
will show me this information but there is also a LOT of other information I do not need. Is there a way I can just see the file size and not everything else? I can't seem to find it if there is.
The size command does that, e.g.,
size lib.a
will show the size of each object stored in the lib.a archive. For example:
$ size libasprintf.a
   text    data     bss     dec     hex filename
      0       0       0       0       0 lib-asprintf.o (ex libasprintf.a)
    639       8       1     648     288 autosprintf.o (ex libasprintf.a)
on most systems. OS X format is a little different:
$ size libl.a
__TEXT  __DATA  __OBJC  others  dec  hex
86      0       0       32      118  76   libl.a(libmain.o)
75      0       0       32      107  6b   libl.a(libyywrap.o)
Oddly (though "everyone" implements it), I do not see size on the POSIX site. OS X has a manual page for it.

PDF: extracted images are sliced / tiled

Image extraction with pdfimages and mupdf/mutool works fine so far. However, images in PDFs produced with FreePDF are always sliced, so one image results in multiple image files.
Is there a trick to avoid this? How can I use the results of pdfshow? Are there coordinates that give the position, height, and width, so I can cut/crop the image after converting the PDF to a PNG or JPEG?
The most likely reason why your images are "sliced" after extracting them is this: they were "sliced" already before extracting them -- that is simply how they live within the PDF file itself.
Don't ask me why some PDF generating software does this.
MS Powerpoint is infamous for this -- background images showing some gradient often get sliced up into tens of thousands of 1x1, 1x2 or 1x8 pixels and similarly-sized mini images inside the PDF.
Update
1. Identify the scope of the problem
The image fragments of the sample PDF can be identified with the pdfimages -list command (this requires a recent version of pdfimages based on the Poppler fork, not the xpdf one!):
pdfimages -list so-28023312-test1.pdf
page num  type  width height color comp bpc  enc   interp object ID x-ppi y-ppi size  ratio
--------------------------------------------------------------------------------------------
   1   0  image   271    271 rgb      3   8  jpeg  no         18  0   163   163 26.7K   12%
   1   1  image   271    271 rgb      3   8  jpeg  no         19  0   163   163 21.7K   10%
   1   2  image   271    271 rgb      3   8  jpeg  no         30  0   163   163 22.9K   11%
   1   3  image   271    271 rgb      3   8  jpeg  no         31  0   163   163 21.8K   10%
   1   4  image   132    271 rgb      3   8  jpeg  no         32  0   162   163 9895B  9.2%
   1   5  image   271    271 rgb      3   8  jpeg  no         33  0   163   163 22.5K   10%
   1   6  image   271    271 rgb      3   8  jpeg  no         34  0   163   163 16.5K  7.7%
   1   7  image   271    271 rgb      3   8  jpeg  no         35  0   163   163 16.9K  7.9%
   1   8  image   271    271 rgb      3   8  jpeg  no         36  0   163   163 20.3K  9.4%
   1   9  image   132    271 rgb      3   8  jpeg  no         37  0   162   163 14.5K   14%
   1  10  image   271    271 rgb      3   8  jpeg  no         20  0   163   163 17.1K  8.0%
   1  11  image   271    271 rgb      3   8  image no         21  0   163   163  107K   50%
   1  12  image   271    271 rgb      3   8  image no         22  0   163   163 96.7K   45%
   1  13  image   271    271 rgb      3   8  image no         23  0   163   163  119K   56%
   1  14  image   132    271 rgb      3   8  jpeg  no         24  0   162   163 10.7K   10%
   1  15  image   271     99 rgb      3   8  jpeg  no         25  0   163   161 7789B  9.7%
   1  16  image   271     99 rgb      3   8  jpeg  no         26  0   163   161 6456B  8.0%
   1  17  image   271     99 rgb      3   8  jpeg  no         27  0   163   161 7202B  8.9%
   1  18  image   271     99 rgb      3   8  jpeg  no         28  0   163   161 8241B   10%
   1  19  image   132     99 rgb      3   8  jpeg  no         29  0   162   161 5905B   15%
Because there are only 20 different fragments on 1 page, it is easy to...
...first extract them all and convert them to JPEGs, and
...then stitch them together again.
2. Extract the fragments as JPEGs
The following command will extract the fragments and try to save them as JPEGs (-j), using 28023312 as the output file name prefix:
pdfimages -j so-28023312-test1.pdf 28023312
There are 3 images which came out as PPM (the three with enc 'image' in the listing above). Use ImageMagick's convert to make JPEGs from them (not strictly required, but it simplifies the 'stitching' command line):
for i in 11 12 13; do
convert 28023312-0${i}.ppm 28023312-0${i}.jpg
done
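(Not in the original answer, but ImageMagick's identify can confirm that the fragment dimensions match the listing:)
identify 28023312-0*.jpg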
Here are the first three fragments, 28023312-000.jpg, 28023312-001.jpg and 28023312-002.jpg:
3. Stitch the 20 fragments together again
ImageMagick can stitch the 20 images together again. When looking at the PDF page as well as at the 20 JPEGs it is easy to determine the order they need to be put together:
convert \
\( 28023312-0{00,01,02,03,04}.jpg +append \) \
\( 28023312-0{05,06,07,08,09}.jpg +append \) \
\( 28023312-0{10,11,12,13,14}.jpg +append \) \
\( 28023312-0{15,16,17,18,19}.jpg +append \) \
-append \
complete.jpg
Dissecting the command:
The +append image operator appends the listed images horizontally.
The \( ... \) lines indicate 'aside' processing of the respective part of the image stack (which needs to be set off by the escaped parentheses). The result of each horizontal appending operation then replaces the individual fragments inside the current image stack.
The final -append image operator appends the current images vertically.
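(A minimal standalone illustration of the same pattern, with hypothetical filenames: two rows of two images each, appended horizontally and then stacked vertically.)
convert \( a.jpg b.jpg +append \) \( c.jpg d.jpg +append \) -append grid.jpg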
Here is the resulting JPEG, fully stitched together again:
Could this be automated?
In theory we could automate this process. For this we would have to analyse the PDF source code. However, that is rather difficult, because the content stream may be compressed.
In order to uncompress all or most of the content streams and to get a nicer representation of the PDF file structure, we could use mutool clean -d, podofouncompress or qpdf --qdf.
I prefer qpdf, the 'structural, content-preserving PDF file transformer'. Here is the command:
qpdf --qdf --object-streams=disable so-28023312-test1.pdf qdf.pdf
The resulting PDF file, qdf.pdf, is easier to analyse, because most (but not all) of the previously binary sections are now in ASCII. When you search for occurrences of Do inside this file, you will see where images are inserted (however, I cannot give you a complete PDF-analysing tutorial here, sorry...).
The following command prints all lines where Do occurs, plus the preceding line (-B 1):
grep -a -B 1 " Do" qdf.pdf
1002 0 0 1002 236 5776.67 cm
/Im0 Do
--
1001 0 0 1002 1237 5776.67 cm
/Im1 Do
--
120.12 0 0 120.24 268.44 693.2004 cm
/Im2 Do
--
[...skipping 15 other output segments...]
--
1002 0 0 369 3237 3406.67 cm
/Im18 Do
--
490 0 0 369 4238 3406.67 cm
/Im19 Do
--
1 0 0 1 204.9037018 508.5130005 cm
/Fm0 Do
All the /ImNN Do lines insert images (the /Fm0 Do line refers to a form object not an image).
The preceding lines, for example 490 0 0 369 4238 3406.67 cm, set up the current transformation matrix. From this line alone, one can sometimes infer the position of the image and its size. In the case of this file it is not enough -- the contents of more preceding lines would be required in order to determine the current 'drawing position'.
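(A short worked example of that inference, added for clarity: the operator a b c d e f cm concatenates the matrix with rows [a b] and [c d] plus a translation (e, f) onto the CTM. An image is always drawn into the unit square, so if b = c = 0 and the CTM was the identity beforehand, a and d give the image's width and height in user-space units and (e, f) its lower-left corner. For 490 0 0 369 4238 3406.67 cm that would mean a 490x369-unit image placed at (4238, 3406.67). The caveat 'from this line alone' applies precisely because any earlier cm operators still in effect change that interpretation.)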
FreePDF uses Ghostscript and creates a 'virtual printer'. When you 'print to PDF' what actually happens is that your application prints to the Windows print pipeline, which sends the graphics primitives to the Windows PostScript printer driver, which sends the PostScript to the Port Monitor. The FreePDF Port Monitor stores this PostScript program on disk. When the output is complete, it starts up Ghostscript which interprets the PostScript and produces a PDF file.
Now, unless you are using a spectacularly old version of Ghostscript (which is possible, you should check!) this will take whatever was in the input and put it in the output. It won't slice up images.
Which means, as Kurt and David said above, that the real reason for the problem is that the PostScript program had the sliced-up images in it before Ghostscript ever saw it.
Now I know that's not generally the case, but it depends heavily on what PostScript printer driver you have installed, how it's configured, what version of Windows you are using, and what application is driving the printer.
As David rightly says, Microsoft Office applications have a bad habit of drawing certain kinds of patterns this way (to get a 'translucent' effect they use a pattern whose cell is an imagemask, with the 'white' pixels transparent).
Also if you have large photographs (for example) and the PostScript printer is configured with minimal memory, the driver might split up the image in order not to exhaust the printer's memory. Obviously that's a configuration problem because on a desktop PC you would have to be using monster images to overwhelm Ghostscript.
So basically, we need a lot more information from you before we can answer this fully, but the principle is that the damage was done before it got to FreePDF. The version of Ghostscript used to create the PDF file will be in the PDF file metadata, unless FreePDF chose to erase/overwrite it.
Finally, as Kurt pointed out, you should post a link to the PDF file, and ideally the application file and intermediate PostScript file which was used to produce the PDF.
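(A hedged pointer for checking that metadata: Poppler's pdfinfo prints the Producer field, which is where Ghostscript normally records its version, e.g.:)
pdfinfo so-28023312-test1.pdf | grep -i producer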
