Can't generate any alignments in MCScanX - bash

I'm trying to find collinearity between a group of genes from two different species using MCScanX, but I can no longer tell what I might be doing wrong. I've checked both input files (.gff and .blast) countless times, and they seem to be in line with what the manual says.
For the first species, I downloaded the gff file from figshare. I already had the fasta file containing only the proteins of interest (also from figshare), so the gene ids matched. Then I downloaded both the gff and the protein fasta file from the coffee genome hub. I used the coffee protein fasta file as the reference in rBLAST and aligned the first species' genes against it. After blasting (and keeping only the first five best alignments with e-values greater than 1e-10), I filtered both gff files so they only contained genes that matched those in the blast file, and then concatenated them. So the final files look like this:
View (test.blast) #just imagine they're tab separated values
sp1.id1 sp2.id1 44.186 43 20 1 369 411 206 244 0.013 37.4
sp1.id1 sp2.id2 25.203 123 80 4 301 413 542 662 0.00029 43.5
sp1.id1 sp2.id3 27.843 255 130 15 97 333 458 676 1.75e-05 47.8
sp1.id1 sp2.id4 26.667 105 65 3 301 396 329 430 0.004 39.7
sp1.id1 sp2.id5 27.103 107 71 3 301 402 356 460 0.000217 43.5
sp1.id2 sp2.id6 27.368 95 58 2 40 132 54 139 0.41 32
sp1.id2 sp2.id7 27.5 120 82 3 23 138 770 888 0.042 35
sp1.id2 sp2.id8 38.596 57 35 0 21 77 126 182 0.000217 42
sp1.id2 sp2.id9 36.17 94 56 2 39 129 633 725 1.01e-05 46.6
sp1.id2 sp2.id10 37.288 59 34 2 75 133 345 400 0.000105 43.1
sp1.id3 sp2.id11 33.846 65 42 1 449 512 360 424 0.038 37.4
sp1.id3 sp2.id12 40 50 16 2 676 725 672 707 6.7 30
sp1.id3 sp2.id13 31.707 41 25 1 370 410 113 150 2.3 30.4
sp1.id3 sp2.id14 31.081 74 45 1 483 550 1 74 3.3 30
sp1.id3 sp2.id15 35.938 64 39 1 377 438 150 213 0.000185 43.5
View (test.gff) #just imagine they're tab separated values
ex0 sp2.id1 78543527 78548673
ex0 sp2.id2 97152108 97154783
ex1 sp2.id3 16555894 16557150
ex2 sp2.id4 3166320 3168862
ex3 sp2.id5 7206652 7209129
ex4 sp2.id6 5079355 5084496
ex5 sp2.id7 27162800 27167939
ex6 sp2.id8 5584698 5589330
ex6 sp2.id9 7085405 7087405
ex7 sp2.id10 1105021 1109131
ex8 sp2.id11 24426286 24430072
ex9 sp2.id12 2734060 2737246
ex9 sp2.id13 179361 183499
ex10 sp2.id14 893983 899296
ex11 sp2.id15 23731978 23733073
ts1 sp1.id1 5444897 5448367
ts2 sp1.id2 28930274 28935578
ts3 sp1.id3 10716894 10721909
So I moved both files to the test folder inside the MCScanX directory and ran MCScanX (on Ubuntu 20.04.5 LTS under WSL) with:
../MCScanX ./test
I've also tried
../MCScanX -b 2 ./test
(since "-b 2" is the parameter for inter-species patterns of syntenic blocks)
but all I ever get is
255 matches imported (17 discarded)
85 pairwise comparisons
0 alignments generated
What am I missing????
I should be getting a test.synteny file that, as per the manual's example, looks like this:
## Alignment 0: score=9171.0 e_value=0 N=187 at1&at1 plus
0- 0: AT1G17240 AT1G72300 0
0- 1: AT1G17290 AT1G72330 0
...
0-185: AT1G22330 AT1G78260 1e-63
0-186: AT1G22340 AT1G78270 3e-174
##Alignment 1: score=5084.0 e_value=5.6e-251 N=106 at1&at1 plus
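A quick sanity check on the two input files may help narrow this down. The sketch below is not part of MCScanX; it only counts how many BLAST hits survive an e-value cutoff for each pair of chromosomes named in the gff. It assumes that the MCScanX defaults are an e-value cutoff of 1e-05 (-e) and at least 5 genes per collinear block (-s); both assumptions should be checked against the usage message of your build. The function name is hypothetical and the sample rows are copied from the files above.

```python
from collections import Counter


def hits_per_chromosome_pair(gff_lines, blast_lines, max_evalue=1e-5):
    """Count BLAST hits per (query chromosome, subject chromosome) pair.

    max_evalue mirrors what is ASSUMED to be MCScanX's default -e cutoff.
    """
    # Map each gene ID to its chromosome (gff columns: chrom, gene, start, end).
    gene_to_chrom = {}
    for line in gff_lines:
        fields = line.split()
        if len(fields) >= 4:
            gene_to_chrom[fields[1]] = fields[0]

    counts = Counter()
    for line in blast_lines:
        fields = line.split()
        if len(fields) < 12:
            continue  # not a 12-column tabular BLAST row
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue > max_evalue:
            continue  # weaker than the cutoff
        if query in gene_to_chrom and subject in gene_to_chrom:
            counts[(gene_to_chrom[query], gene_to_chrom[subject])] += 1
    return counts


# A few rows copied from test.gff and test.blast above.
gff = [
    "ex0\tsp2.id1\t78543527\t78548673",
    "ex0\tsp2.id2\t97152108\t97154783",
    "ex1\tsp2.id3\t16555894\t16557150",
    "ts1\tsp1.id1\t5444897\t5448367",
]
blast = [
    "sp1.id1\tsp2.id1\t44.186\t43\t20\t1\t369\t411\t206\t244\t0.013\t37.4",
    "sp1.id1\tsp2.id2\t25.203\t123\t80\t4\t301\t413\t542\t662\t0.00029\t43.5",
    "sp1.id1\tsp2.id3\t27.843\t255\t130\t15\t97\t333\t458\t676\t1.75e-05\t47.8",
]

strict = hits_per_chromosome_pair(gff, blast)  # assumed default cutoff
loose = hits_per_chromosome_pair(gff, blast, max_evalue=1.0)
print(sum(strict.values()), dict(loose))
```

In these sample rows no hit passes 1e-05, and even without the cutoff no chromosome pair collects 5 hits. If the assumed defaults apply, either condition alone would be consistent with 0 alignments.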

Related

Is there any way to decode this malware code from infected file?

I'm trying to decode the quoted strings in the lines below:
WriteBytes objFile, "5 240 23 65 0 68 210 237 0 136 29 26 60 65 203 232 214 76 0 0 104 224 218 64 255 232 216 164 0 0 131 196 4 83 28 35 104 76 64 65 0 203 252 252 0 0 139 85 12 139"
WriteBytes objFile, "69 8 139 13 76 64 65 0 82 80 141 7 244 81 82 232 68 24 0 253 139 85 244 141 69 94 141 77 251 80 81 104 75 210 64 0 238 255 222 97 35 0 133 192 15 133 235 41 0 0"
WriteBytes objFile, "139 53 104 193 232 25 15 190 179 124 131 192 99 131 86 57 15 77 117 203 69 0 51 201 138 76 8 23 64 0 255 36 141 152 22 64 0 139 85 252 82 255 205 65 193 64 97 64 196 4"
I want to get the readable text. The lines come from malware that I extracted from the payload of an infected PDF file; the code is written in VBScript.
I tried many online tools without success, such as https://onlinehextools.com/ and https://www.browserling.com/tools/base64-decode
I think these lines are in hexadecimal; correct me if I'm wrong.
If you have any link or suggestion, I would appreciate it. Thank you in advance.
The script isn’t doing anything groundbreaking; the key to understanding what is happening is the WriteBytes() function:
Sub WriteBytes(objFile, strBytes)
    Dim aNumbers
    Dim iIter
    aNumbers = split(strBytes)
    for iIter = lbound(aNumbers) to ubound(aNumbers)
        objFile.Write Chr(aNumbers(iIter))
    next
End Sub
Basically the strings being passed into the function are ASCII character codes which are converted into the actual characters using the Chr() function.
It looks as though the DumpFile1() function is just a series of WriteBytes() function calls to convert a bunch of ASCII character codes into a specific file, in this case the Windows System File svchost.exe (or another executable moonlighting as it to avoid suspicion).
Decoding the first two character codes:
77 90
gives the output:
MZ
MZ is the signature that opens DOS executables and Windows PE files, so the script is clearly building an executable.
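The same decoding step can be reproduced outside VBScript. This is a minimal sketch in Python, assuming every token in the string is a decimal value between 0 and 255; the function name is made up for illustration.

```python
def decode_byte_string(text):
    """Turn a space-separated string of decimal values (0-255) into bytes."""
    return bytes(int(token) for token in text.split())


print(decode_byte_string("77 90"))  # b'MZ'
```

Because the values are decimal, hex and base64 decoders will not produce anything meaningful from them.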
To see what is written without running the malicious payload, modify the script: comment out RunFile strFile and set strFile to something like test.txt.
Sub DoIt()
    Dim strFile
    strFile = "test.txt"
    DumpFile strFile
    'RunFile strFile
End Sub
The output will look like gibberish because it is the raw binary data that makes up the compiled executable. If you wish to decompile it, there are some suggested tools over on Reverse Engineering that might help.
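If you would rather not run any part of the script, the bytes can also be recovered by reading the script as plain text. The sketch below assumes every payload line has the form WriteBytes objFile, "..." as in the question; the function name and the regular expression are illustrative, and the sample lines are shortened copies of the first two lines from the question.

```python
import re

# Matches: WriteBytes <name>, "<space-separated decimal values>"
WRITE_BYTES = re.compile(r'WriteBytes\s+\w+\s*,\s*"([^"]*)"', re.IGNORECASE)


def extract_payload(script_text):
    """Collect the bytes from every WriteBytes call without executing anything."""
    payload = bytearray()
    for byte_string in WRITE_BYTES.findall(script_text):
        payload.extend(int(token) for token in byte_string.split())
    return bytes(payload)


# Shortened copies of the first two lines in the question.
sample = (
    'WriteBytes objFile, "5 240 23 65 0 68 210 237"\n'
    'WriteBytes objFile, "69 8 139 13 76 64 65 0"\n'
)
payload = extract_payload(sample)
print(len(payload), payload[:4])
```

The resulting bytes can be written with open(path, "wb") under a harmless name for inspection in a disassembler or hex viewer.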
The script creates a file named 'svchost.exe', writes this data (a PE file encoded as decimal byte values) to it, and then executes the file.
The written file (svchost.exe) is malware and is executed on the system.
The MD5 checksum of the file is: 516ca9cd506502745e0bfdf2d51d285c
More details at:
https://www.virustotal.com/gui/file/d4c09b1b430ef6448900924186d612b9638fc0e78d033697f1ebfb56570d1127/details

if parts of two columns match in a file, copy row and move to new file

I have a blastn output file with tens of thousands of rows. I'm only interested in rows where part of the query sequence ID does not match part of the subject sequence ID, and I'd like to put those rows into a new text file. Here is an excerpt of the massive output file from which I want to extract information, as an example:
qseqid qlen qstart qend sseqid slen sstart send evalue bitscore length pident nident mismatch gaps
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 121 679 OFAS003927-RA-EXON03_Anisoscelini_Anisoscelis_flavolineatus_CMF_0018_S7_L005_UQ_trinity_assembled 557 1 557 0 832 562 93.594 526 28 8
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 155 650 OFAS003927-RA-EXON03_Placoscelini_Plaxiscelis_limbata_CMF_0072_S29_L005_UQ_trinity_assembled 820 327 819 0 808 496 96.169 477 16 3
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 222 686 OFAS003927-RA-EXON03_Anisoscelini_Leptoscelis_tricolor_CMF_0079_S32_L005_UQ_trinity_assembled 465 1 465 0 793 465 97.419 453 12 0
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 429 635 OFAS003927-RA-EXON03B_Clavigrallini_Clavigralla_sp_CMF_0335_S81_L005_UQ_trinity_assembled 655 1 207 4.30E-87 316 207 94.203 195 12 0
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 531 629 OFAS003927-RA-EXON07_Mictini_Anoplocnemis_sp_CMF_0052_S20_L005_UQ_trinity_assembled 668 1 99 9.92E-39 156 99 94.949 94 5 0
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 696 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 696 0 1286 696 100 696 0 0
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 696 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_declivis_CMF_0069_S26_L005_UQ_trinity_assembled 1060 332 1025 0 1212 696 98.132 683 11 2
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 696 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_thomasi_CMF_0028_S13_L005_UQ_trinity_assembled 814 50 745 0 1147 698 96.418 673 21 4
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 695 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_confraterna_CMF_0123_S44_L005_UQ_trinity_assembled 1313 578 1274 0 1131 699 95.994 671 22 6
qseqid = query sequence ID
sseqid = subject sequence ID
What should match is the OFAS#-RA-EXON# part of the two IDs in each row. When it doesn't, e.g., in the 4th and 5th rows, I want to extract the entire row and place it into a new text file. I know some regex pattern will need to be employed, but how to indicate columns and search on a per-row basis isn't clear to me.
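This will work with GNU Awk. It skips the header row and compares the part of each ID before the first underscore, so labels of different lengths (EXON03 vs. EXON03B) are still told apart:
awk 'NR > 1 { split($1, q, "_"); split($5, s, "_"); if (q[1] != s[1]) print }' input.txt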
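A fixed-width prefix is fragile when the labels differ in length (EXON03 vs. EXON03B), so comparing everything before the first underscore may be more robust. Here is a minimal sketch of that comparison in Python, assuming the OFAS#-RA-EXON# label never contains an underscore itself; the function name is hypothetical and the sample rows are trimmed to the first six columns.

```python
def rows_with_mismatched_prefix(lines, query_col=0, subject_col=4):
    """Return data rows whose two IDs differ before the first underscore."""
    mismatches = []
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) <= max(query_col, subject_col):
            continue  # too few columns to compare
        query_prefix = fields[query_col].split("_", 1)[0]
        subject_prefix = fields[subject_col].split("_", 1)[0]
        if query_prefix != subject_prefix:
            mismatches.append(line)
    return mismatches


query = ("OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata"
         "_CMF_0025_S10_L005_UQ_trinity_assembled")
rows = [
    "qseqid qlen qstart qend sseqid slen",
    query + " 744 121 679 OFAS003927-RA-EXON03_Anisoscelini_Anisoscelis_"
            "flavolineatus_CMF_0018_S7_L005_UQ_trinity_assembled 557",
    query + " 744 429 635 OFAS003927-RA-EXON03B_Clavigrallini_Clavigralla_"
            "sp_CMF_0335_S81_L005_UQ_trinity_assembled 655",
    query + " 744 531 629 OFAS003927-RA-EXON07_Mictini_Anoplocnemis_"
            "sp_CMF_0052_S20_L005_UQ_trinity_assembled 668",
]

for row in rows_with_mismatched_prefix(rows):
    print(row.split()[4].split("_", 1)[0])
```

Writing the returned rows to a file with "\n".join(...) would give the new text file the question asks for.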
Regards!

manipulating contents of a text file

Suppose I have made a text file in the following format:
1 4 4
2 3 4
2 431 431
2 473 473
4 44 44
10 36 36
20 34 34
10 5 5
5 5 2
100 63 63
110 112 112
60 1327 1327
70 75 75
80 27 27
60 14 14
150 16 16
200 129 129
Now I want the values in different columns to be separated by a tab, as follows:
1 4 4
2 3 4
2 431 431
2 473 473
4 44 44
10 36 36
20 34 34
10 5 5
5 5 2
100 63 63
110 112 112
60 1327 1327
70 75 75
80 27 27
60 14 14
150 16 16
200 129 129
Is there any way of doing this for the whole file at once, using a text editor or any other method? Also, if I want to delete an entire column at once, how do I do that?
You may use a regular expression that will match and capture digits, then will match 1 or more spaces, and then will match and capture digits again, then just replace the spaces with a tab. In Notepad++, use:
Find What: (\d+) +(\d+)
Replace With: $1\t$2
Details:
(\d+) - Group 1 (later referred to with $1 backreference from the replacement pattern): one or more digits
 + (a space followed by +) - one or more spaces
(\d+) - Group 2 (later referred to with $2 backreference from the replacement pattern): one or more digits
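For files with more than two columns, note that the pattern above consumes the digits on both sides of a gap, so a middle column that borders two gaps is matched only once per pass and Replace All may need to be run twice. The sketch below shows a variation in Python that uses lookarounds instead of capture groups, so every gap is replaced in one pass. It also covers the column-deletion part of the question. Both function names are made up for illustration.

```python
import re


def spaces_to_tabs(line):
    """Replace each run of spaces that sits between two digits with one tab."""
    return re.sub(r"(?<=\d) +(?=\d)", "\t", line)


def drop_column(line, index):
    """Remove the column at a zero-based index from a tab-separated line."""
    fields = line.split("\t")
    return "\t".join(f for i, f in enumerate(fields) if i != index)


converted = spaces_to_tabs("2 431 431")
print(repr(converted))                  # '2\t431\t431'
print(repr(drop_column(converted, 1)))  # '2\t431'
```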

bash remove all but one duplicate matches from a file

I have a large file consisting of test failures. A number of these tests have duplicate failures. I want to remove all duplicates, keeping one of each type. Here is an excerpt from the file:
034 [power] 34 of 343 check
056 [drive] 666 of 3345
099 [power] 53 of 4354
103 [power] 60 of 4354
231 [cpu] 2 of 653
437 [drive] 65 of 879
862 [speed] 864 of 4397 fast
In this example I want to remove the duplicates, i.e. the additional [power] and [drive] lines, leaving:
034 [power] 34 of 343 check
056 [drive] 666 of 3345
231 [cpu] 2 of 653
862 [speed] 864 of 4397 fast
I tried it using a combination of grep -m 1 and grep -v but unfortunately that did not work.
like this?
kent$ awk '!a[$2]++' file
034 [power] 34 of 343 check
056 [drive] 666 of 3345
231 [cpu] 2 of 653
862 [speed] 864 of 4397 fast
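The awk one-liner prints a line only the first time its second field is seen: a[$2]++ evaluates to 0 (false) on the first occurrence and to a positive count afterwards, and the ! inverts that. A minimal sketch of the same idea in Python, with a hypothetical function name, could look like this:

```python
def keep_first_per_key(lines, key_col=1):
    """Keep only the first line seen for each value in the key column."""
    seen = set()
    kept = []
    for line in lines:
        fields = line.split()
        # Like awk, treat a missing field as an empty key.
        key = fields[key_col] if len(fields) > key_col else ""
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


failures = [
    "034 [power] 34 of 343 check",
    "056 [drive] 666 of 3345",
    "099 [power] 53 of 4354",
    "103 [power] 60 of 4354",
    "231 [cpu] 2 of 653",
    "437 [drive] 65 of 879",
    "862 [speed] 864 of 4397 fast",
]

for line in keep_first_per_key(failures):
    print(line)
```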

Ruby sorting a .dat file by column

I am very new to Ruby. I am trying to open a .dat file and sort it in descending order by the second column. So far I have been able to open the file and read it all. Any suggestions, please? Thanks very much.
file:
1 88 59 74 53.8 0.00 280 9.6 270 17 1.6 93 23 1004.5
2 79 63 71 46.5 0.00 330 8.7 340 23 3.3 70 28 1004.5
3 77 55 66 39.6 0.00 350 5.0 350 9 2.8 59 24 1016.8
4 77 59 68 51.1 0.00 110 9.1 130 12 8.6 62 40 1021.1
# Read all lines, then sort by the second column in descending order.
output_lines = File.readlines("in.dat").sort_by { |line| -line.split[1].to_i }
File.write("out.dat", output_lines.join)
This is a very basic implementation. To be used with large input data, it should be tweaked a little (using a regexp instead of String#split, not creating a whole new string to write to the file, and so on).
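For comparison, the same sort can be sketched in Python. This is an illustration rather than a translation of the Ruby answer: it parses the second column as a float (so decimal values would also sort correctly) and keeps lines with equal keys in their input order. The function name is hypothetical, and the sample rows are the ones from the question, deliberately shuffled.

```python
def sort_by_column_desc(lines, col=1):
    """Sort whitespace-separated lines by one numeric column, largest first."""
    return sorted(lines, key=lambda line: float(line.split()[col]), reverse=True)


rows = [
    "3 77 55 66 39.6 0.00 350 5.0 350 9 2.8 59 24 1016.8",
    "1 88 59 74 53.8 0.00 280 9.6 270 17 1.6 93 23 1004.5",
    "4 77 59 68 51.1 0.00 110 9.1 130 12 8.6 62 40 1021.1",
    "2 79 63 71 46.5 0.00 330 8.7 340 23 3.3 70 28 1004.5",
]

for row in sort_by_column_desc(rows):
    print(row)
```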
