Ruby sorting a .dat file by column

I am very new to Ruby. I am trying to open a .dat file and sort it in descending order by the second column. So far I have been able to open the file and read it all. Any suggestions? Thanks very much.
file:
1 88 59 74 53.8 0.00 280 9.6 270 17 1.6 93 23 1004.5
2 79 63 71 46.5 0.00 330 8.7 340 23 3.3 70 28 1004.5
3 77 55 66 39.6 0.00 350 5.0 350 9 2.8 59 24 1016.8
4 77 59 68 51.1 0.00 110 9.1 130 12 8.6 62 40 1021.1

output_lines = File.readlines("in.dat").sort_by { |line| -line.split[1].to_i }
File.write("out.dat", output_lines.join)
This is a very basic implementation, to be used with large input data it should be tweaked a little bit (using a regexp instead of String#split, not creating a whole new string to write to the file, and so on).
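Along those lines, a minimal sketch of a tweaked variant (the sample data from the question is inlined so the snippet is self-contained; a regexp pulls out just the second column instead of splitting every field, and the sorted lines are streamed to the output file one at a time):

```ruby
# Sample data from the question, written out so the sketch is runnable.
File.write("in.dat", <<~DAT)
  1 88 59 74 53.8 0.00 280 9.6 270 17 1.6 93 23 1004.5
  2 79 63 71 46.5 0.00 330 8.7 340 23 3.3 70 28 1004.5
  3 77 55 66 39.6 0.00 350 5.0 350 9 2.8 59 24 1016.8
  4 77 59 68 51.1 0.00 110 9.1 130 12 8.6 62 40 1021.1
DAT

# Extract the second column with a regexp rather than String#split,
# and stream each sorted line to the file instead of joining them first.
sorted = File.readlines("in.dat").sort_by { |line| -line[/\A\S+\s+(\d+)/, 1].to_i }
File.open("out.dat", "w") { |f| sorted.each { |line| f.write(line) } }
```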


Can't generate any alignments in MCScanX

I'm trying to find collinearity between a group of genes from two different species using MCScanX, but I can no longer figure out what I could possibly be doing wrong. I've checked both input files countless times (.gff and .blast), and they seem to be in line with what the manual says.
For the first species, I downloaded the gff file from figshare. I already had the fasta file containing only the proteins of interest (also from figshare), so the gene ids matched. Then I downloaded both the gff and the protein fasta file from the coffee genome hub. I used the coffee protein fasta file as the reference genome in rBLAST to align the first species' genes against it. After blasting (and keeping only the five best alignments with e-values greater than 1e-10), I filtered both gff files so they only contained genes that matched those in the blast file, and then concatenated them. The final files look like this:
View (test.blast) #just imagine they're tab separated values
sp1.id1 sp2.id1 44.186 43 20 1 369 411 206 244 0.013 37.4
sp1.id1 sp2.id2 25.203 123 80 4 301 413 542 662 0.00029 43.5
sp1.id1 sp2.id3 27.843 255 130 15 97 333 458 676 1.75e-05 47.8
sp1.id1 sp2.id4 26.667 105 65 3 301 396 329 430 0.004 39.7
sp1.id1 sp2.id5 27.103 107 71 3 301 402 356 460 0.000217 43.5
sp1.id2 sp2.id6 27.368 95 58 2 40 132 54 139 0.41 32
sp1.id2 sp2.id7 27.5 120 82 3 23 138 770 888 0.042 35
sp1.id2 sp2.id8 38.596 57 35 0 21 77 126 182 0.000217 42
sp1.id2 sp2.id9 36.17 94 56 2 39 129 633 725 1.01e-05 46.6
sp1.id2 sp2.id10 37.288 59 34 2 75 133 345 400 0.000105 43.1
sp1.id3 sp2.id11 33.846 65 42 1 449 512 360 424 0.038 37.4
sp1.id3 sp2.id12 40 50 16 2 676 725 672 707 6.7 30
sp1.id3 sp2.id13 31.707 41 25 1 370 410 113 150 2.3 30.4
sp1.id3 sp2.id14 31.081 74 45 1 483 550 1 74 3.3 30
sp1.id3 sp2.id15 35.938 64 39 1 377 438 150 213 0.000185 43.5
View (test.gff) #just imagine they're tab separated values
ex0 sp2.id1 78543527 78548673
ex0 sp2.id2 97152108 97154783
ex1 sp2.id3 16555894 16557150
ex2 sp2.id4 3166320 3168862
ex3 sp2.id5 7206652 7209129
ex4 sp2.id6 5079355 5084496
ex5 sp2.id7 27162800 27167939
ex6 sp2.id8 5584698 5589330
ex6 sp2.id9 7085405 7087405
ex7 sp2.id10 1105021 1109131
ex8 sp2.id11 24426286 24430072
ex9 sp2.id12 2734060 2737246
ex9 sp2.id13 179361 183499
ex10 sp2.id14 893983 899296
ex11 sp2.id15 23731978 23733073
ts1 sp1.id1 5444897 5448367
ts2 sp1.id2 28930274 28935578
ts3 sp1.id3 10716894 10721909
So I moved both files to the test folder inside MCScanX directory and ran MCScan (using Ubuntu 20.04.5 LTS, the WSL feature) with:
../MCScanX ./test
I've also tried
../MCScanX -b 2 ./test
(since "-b 2" is the parameter for inter-species patterns of syntenic blocks)
but all I ever get is
255 matches imported (17 discarded)
85 pairwise comparisons
0 alignments generated
What am I missing?
I should be getting a test.synteny file that, as per the manual's example, looks like this:
## Alignment 0: score=9171.0 e_value=0 N=187 at1&at1 plus
0- 0: AT1G17240 AT1G72300 0
0- 1: AT1G17290 AT1G72330 0
...
0-185: AT1G22330 AT1G78260 1e-63
0-186: AT1G22340 AT1G78270 3e-174
## Alignment 1: score=5084.0 e_value=5.6e-251 N=106 at1&at1 plus

Is there any way to decode this malware code from infected file?

I'm trying to decode the lines below, inside the quotes:
WriteBytes objFile, "5 240 23 65 0 68 210 237 0 136 29 26 60 65 203 232 214 76 0 0 104 224 218 64 255 232 216 164 0 0 131 196 4 83 28 35 104 76 64 65 0 203 252 252 0 0 139 85 12 139"
WriteBytes objFile, "69 8 139 13 76 64 65 0 82 80 141 7 244 81 82 232 68 24 0 253 139 85 244 141 69 94 141 77 251 80 81 104 75 210 64 0 238 255 222 97 35 0 133 192 15 133 235 41 0 0"
WriteBytes objFile, "139 53 104 193 232 25 15 190 179 124 131 192 99 131 86 57 15 77 117 203 69 0 51 201 138 76 8 23 64 0 255 36 141 152 22 64 0 139 85 252 82 255 205 65 193 64 97 64 196 4" ```
I want to get the readable text. It's from malware that I got from an infected PDF file after extracting the payload; the code is written in VBScript.
I tried many online tools without success, like https://onlinehextools.com/ and https://www.browserling.com/tools/base64-decode
I think these lines are in hexadecimal, correct me if I'm wrong.
If you have any link or suggestion, I will appreciate it. Thank you in advance.
The script isn't doing anything groundbreaking; the key to understanding what is happening is in the WriteBytes() function:
Sub WriteBytes(objFile, strBytes)
    Dim aNumbers
    Dim iIter
    aNumbers = Split(strBytes)
    For iIter = LBound(aNumbers) To UBound(aNumbers)
        objFile.Write Chr(aNumbers(iIter))
    Next
End Sub
Basically the strings being passed into the function are ASCII character codes which are converted into the actual characters using the Chr() function.
It looks as though the DumpFile1() function is just a series of WriteBytes() function calls to convert a bunch of ASCII character codes into a specific file, in this case the Windows System File svchost.exe (or another executable moonlighting as it to avoid suspicion).
From decoding the first two character codes:
77 90
we get the output:
MZ
It's clear the script is building a DOS executable.
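The decoding is easy to reproduce outside VBScript as well. A small Ruby sketch of the same idea (the helper name decode_bytes is made up here):

```ruby
# Split a string of decimal character codes and map each code to its
# character -- the same thing the VBScript WriteBytes routine does.
def decode_bytes(str_bytes)
  str_bytes.split.map { |n| n.to_i.chr }.join
end

decode_bytes("77 90")  # => "MZ", the magic number of a DOS/PE executable
```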
If you want to see what is output without running the malicious payload, just modify the script: comment out RunFile strFile and rename strFile to something like test.txt.
Sub DoIt()
    Dim strFile
    strFile = "test.txt"
    DumpFile strFile
    'RunFile strFile
End Sub
The output will appear as gibberish and not make readable sense; this is because it is the raw binary data that makes up the compiled executable. If you wish to decompile it, there are some suggested tools over on Reverse Engineering that might help.
The script is creating a file named 'svchost.exe', writing this data (a PE file encoded as decimal byte values) to it, and executing the file after the data is written.
The written file (svchost.exe) is malware and is executed on the system.
The MD5 checksum of the file is: 516ca9cd506502745e0bfdf2d51d285c
More details at:
https://www.virustotal.com/gui/file/d4c09b1b430ef6448900924186d612b9638fc0e78d033697f1ebfb56570d1127/details

What's the meaning of "?]0;"

When I connect to an SSH session in PowerShell, I get strings like this:
?]0;wany#wany02: ~?[01;32mwany#wany02?[00m:?[01;34m~?[00m$
I print the bytes of the string
[27 93 48 59 119 97 110 121 64 119 97 110 121 48 50 58 32 126 7 27 91 48 49 59 51 50 109 119 97 110 121 64 119 97 110 121 48 50 27 91 48 48 109 58 27 91 48 49 59 51 52 109 126 27 91 48 48 109 36 32]
I've used the ansicolor package (https://github.com/shiena/ansicolor) to convert the colors,
but what is the meaning of "?]0;wany#wany02: ~?"
I can't see it on a Linux terminal.
Thanks a lot.
ESC]0; is the start of an escape code, used by xterm and compatible terminals that implement VT100 control sequences, to change the window's title and icon name. The byte with the value 7 (ASCII BEL) ends the sequence; everything in between is used as the title.
Using 2 instead of 0 changes just the title and 1 just the icon name. See the list of operating system controls for what other numbers do.
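To make the structure concrete, here is a small Ruby sketch that builds such a title-setting sequence and strips it back out of raw terminal output (the helper names are made up for illustration):

```ruby
ESC = 27.chr  # 0x1b, starts the escape sequence
BEL = 7.chr   # 0x07, terminates the OSC sequence

# Build an "OSC 0" sequence: set both window title and icon name.
def set_title_sequence(title)
  "#{ESC}]0;#{title}#{BEL}"
end

# Pull the title text back out of raw terminal bytes.
def extract_title(raw)
  raw[/\x1b\]0;(.*?)\x07/, 1]
end

seq = set_title_sequence("wany@wany02: ~")
extract_title(seq)  # => "wany@wany02: ~"
```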

Get frame / pattern of an image without loop MATLAB

I would like to extract certain part of an image. Let's say, only those parts that are indexed by ones, in some kind of template or frame.
GRAYPIC = reshape(randperm(169), 13, 13);
FRAME = ones(13);
FRAME(5:9, 5:9) = 0;
FRAME_OF_GRAYPIC = []; % the new pic that only shows the frame extracted
I can achieve this using a for loop:
for X = 1:13
    for Y = 1:13
        value = FRAME(Y, X);
        switch value
            case 1
                FRAME_OF_GRAYPIC(Y, X) = GRAYPIC(Y, X);
            case 0
                FRAME_OF_GRAYPIC(Y, X) = 0;
        end
    end
end
imshow(mat2gray(FRAME_OF_GRAYPIC));
However, is it possible to use it with some kind of vector operation, i.e.:
FRAME_OF_GRAYPIC = GRAYPIC(FRAME==1);
This doesn't work, unfortunately.
Any suggestions?
Thanks a lot for your answers,
best,
Clemens
Too long for a comment...
GRAYPIC = reshape(randperm(169), 13, 13);
FRAME = ones(13);
FRAME(5:9, 5:9) = 0;
FRAME_OF_GRAYPIC = zeros(size(GRAYPIC)); % MUST preallocate the new pic at the right size
FRAME = logical(FRAME); % ... FRAME = (FRAME == 1)
FRAME_OF_GRAYPIC(FRAME) = GRAYPIC(FRAME);
Three things to note here:
FRAME must be a logical array. Create it with true()/false(), or cast it using logical(), or select a value to be true using FRAME = (FRAME == true_value);
You must preallocate your final image to the proper dimensions, otherwise it will turn into a vector.
You need the image indices on both sides of the assignment:
FRAME_OF_GRAYPIC(FRAME) = GRAYPIC(FRAME);
Output:
FRAME_OF_GRAYPIC =
38 64 107 63 27 132 148 160 88 59 102 69 81
14 108 76 58 49 55 51 19 158 52 100 153 39
79 139 12 115 147 154 96 112 82 73 159 146 93
169 2 71 25 33 149 138 150 129 117 65 97 17
43 111 37 142 0 0 0 0 0 128 84 86 22
9 137 127 45 0 0 0 0 0 68 28 46 163
42 11 31 29 0 0 0 0 0 152 3 85 36
50 110 165 18 0 0 0 0 0 144 143 44 109
114 133 1 122 0 0 0 0 0 80 167 157 145
24 116 60 130 53 77 156 35 6 78 90 30 140
74 120 40 26 106 166 121 34 98 57 56 13 48
8 155 4 16 124 75 123 23 105 66 7 141 70
89 113 99 101 54 20 94 72 83 168 61 5 10

Vowpal Wabbit model works badly on multiclass classification of images using pixel RGB values

I am using Vowpal Wabbit to classify multi-class images. My data set is similar to http://www.cs.toronto.edu/~kriz/cifar.html, consisting of 3000 training samples and 500 testing samples. The features are the RGB values of 32*32 images. I used the Vowpal Wabbit logistic loss function to train the model with 100 iterations. During the training process the average loss is below 0.02 (I assume this number is pretty good, right?). Then I predicted the labels of the training set with the output model, and found that the predictions are very bad. Nearly all of them are of category six. I really don't know what happened, because it seems to me that during the training process the predictions were mostly correct, but after I predict with the model they suddenly all become 6.
Here is a sample line of feature.
1 | 211 174 171 165 161 161 162 163 163 163 163 163 163 163 163 163
162 161 162 163 163 163 163 164 165 167 168 167 168 163 160 187 153
102 96 90 89 90 91 92 92 92 92 92 92 92 92 92 92 92 91 90 90 90 90 91
92 94 95 96 99 97 98 127 111 71 71 64 66 68 69 69 69 69 69 69 70 70 69
69 70 71 71 69 68 68 68 68 70 72 73 75 78 78 81 96 111 69 68 61 64 67
67 67 67 67 67 67 68 67 67 66 67 68 69 68 68 67 66 66 67 69 69 69 71
70 77 89 116 74 76 71 72 74 74 72 73 74 74 74 74 74 74 74 72 72 74 76
76 75 74 74 74 73 73 72 73 74 85 92 123 83 86 83 82 83 83 82 83 83 82
82 82 82 82 82 81 80 82 85 85 84 83 83 83 85 85 85 85 86 94 95 127 92
96 93 93 92 91 91 91 91 91 90 89 89 86 86 86 86 87 89 89 88 88 88 92
92 93 98 100 96 98 96 132 99 101 98 98 97 95 93 93 94 93 93 95 96 97
95 96 96 96 96 95 94 100 103 98 93 95 100 105 103 103 96 139 106 108
105 102 100 98 98 98 99 99 100 100 95 98 93 81 78 79 77 76 76 79 98
107 102 97 98 103 107 108 99 145 115 118 115 115 115 113 ......
Here is my training script:
./vw train.vw --oaa 6 --passes 100 --loss_function logistic -c
--holdout_off -f image_classification.model
Here is my predicting script (on the training data set):
./vw -i image_classification.model -t train.vw -p train.predict --quiet
Here is the statistics during training:
final_regressor = image_classification.model
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = train.vw.cache
ignoring text input in favor of cache input
num sources = 1
average    since    example   example   current  current  current
loss       last     counter   weight    label    predict  features
0.000000 0.000000 1 1.0 1 1 3073
0.000000 0.000000 2 2.0 1 1 3073
0.000000 0.000000 4 4.0 1 1 3073
0.000000 0.000000 8 8.0 1 1 3073
0.000000 0.000000 16 16.0 1 1 3073
0.000000 0.000000 32 32.0 1 1 3073
0.000000 0.000000 64 64.0 1 1 3073
0.000000 0.000000 128 128.0 1 1 3073
0.000000 0.000000 256 256.0 1 1 3073
0.001953 0.003906 512 512.0 2 2 3073
0.002930 0.003906 1024 1024.0 3 3 3073
0.002930 0.002930 2048 2048.0 5 5 3073
0.006836 0.010742 4096 4096.0 3 3 3073
0.012573 0.018311 8192 8192.0 5 5 3073
0.014465 0.016357 16384 16384.0 3 3 3073
0.017029 0.019592 32768 32768.0 6 6 3073
0.017731 0.018433 65536 65536.0 6 6 3073
0.017891 0.018051 131072 131072.0 5 5 3073
0.017975 0.018059 262144 262144.0 3 3 3073
finished run
number of examples per pass = 3000
passes used = 100
weighted example sum = 300000.000000
weighted label sum = 0.000000
average loss = 0.017887
total feature number = 921900000
It seems to me that it predicts perfectly during training, but after I use the output model suddenly everything becomes category 6. I really have no idea what has gone wrong.
There are several problems in your approach.
1) I guess the training set contains first all images with label 1, then all examples with label 2, and so on, with the last label being 6. You need to shuffle such training data if you want to use online learning (which is the default learning algorithm in VW).
2) VW uses sparse feature format. The order of features on one line is not important (unless you use --ngram). So if feature number 1 (red channel of the top left pixel) has value 211 and feature number 2 (red channel of the second pixel) has value 174, you need to use:
1 | 1:211 2:174 ...
3) To get good results in image recognition you need something better than a linear model on the raw pixel values. Unfortunately, VW has no deep learning (multi-layer neural net), no convolutional nets. You can try --nn X to get neural net with one hidden layer with X units (and tanh activation function), but this is just a poor substitute for the state-of-the-art approaches to CIFAR etc. You can also try other non-linear reductions available in VW (-q, --cubic, --lrq, --ksvm, --stage_poly). In general, I think VW is not suitable for such tasks (image recognition), unless you apply some preprocessing which generates (a lot of) features (e.g. SIFT).
4) You are overfitting.
average loss is below 0.02 (I assume this number is pretty good right?)
No. You used --holdout_off, so the reported loss is the train loss. It is easy to get almost zero train loss by simply memorizing all examples, i.e. overfitting. However, you want the test (or holdout) loss to be low.
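Points 1) and 2) can be handled with a few lines of any scripting language before training. A Ruby sketch (the helper name to_vw_line is hypothetical) that adds explicit 1-based feature indices to one dense row of pixel values:

```ruby
# Convert "label p1 p2 p3 ..." into VW's sparse "label | 1:p1 2:p2 ..." form,
# so each pixel keeps its position even though VW treats features as a set.
def to_vw_line(line)
  label, *pixels = line.split
  features = pixels.each_with_index.map { |v, i| "#{i + 1}:#{v}" }
  "#{label} | #{features.join(' ')}"
end

to_vw_line("1 211 174 171")  # => "1 | 1:211 2:174 3:171"
```

For a whole file, mapping every line through this helper and calling Array#shuffle on the result before writing the training file would take care of the ordering problem as well.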
