Joining two matrices, one with numbers and the other percentages - matrix

I have two matrices, cases and percent (percentages). I want to combine them so that the columns alternate between the two, i.e. cases[c1], percent[c1], cases[c2], percent[c2], ...
tab year region if sex==1, matcell(cases)
tab year region, matcell(total)
mata:st_matrix("percent", 100 * st_matrix("cases"):/st_matrix("total"))
matrix list cases
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 1313 1289 1121 1176 1176 1150 1190 1184 1042 940
r2 340 359 357 366 383 332 406 367 352 272
r3 260 246 266 265 270 259 309 306 266 283
r4 271 267 293 277 317 312 296 285 265 253
r5 218 249 246 213 264 255 247 221 229 220
r6 215 202 157 202 200 204 220 183 176 180
r7 178 193 218 199 194 195 201 187 172 159
r8 127 111 107 130 133 99 142 143 131 114
r9 64 68 85 74 70 60 59 70 76 61
. matrix list percent, format(%2.1f)
percent[9,10]
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 70.1 71.2 67.3 67.2 66.9 71.5 72.6 72.5 74.9 73.2
r2 65.3 65.2 69.1 64.4 68.0 70.5 72.0 64.8 66.4 64.9
r3 74.7 73.7 74.7 69.2 68.9 67.6 70.5 72.3 79.4 80.9
r4 66.3 72.6 72.9 74.9 72.7 73.8 72.2 73.3 74.9 71.7
r5 68.8 67.1 66.0 63.6 67.2 67.1 65.2 67.4 68.6 73.8
r6 73.1 72.9 69.2 63.7 67.6 68.0 72.4 68.8 74.9 78.9
r7 64.5 60.3 69.9 70.6 69.3 78.3 72.3 65.8 71.4 71.3
r8 66.1 64.2 63.3 74.7 69.3 56.9 70.6 70.1 63.9 57.9
r9 77.1 73.9 70.2 74.0 71.4 73.2 81.9 72.9 87.4 74.4
How do I combine the two matrices?
Currently I have tried matrix final = cases, percent, but that just puts them side by side. I want each column to alternate between cases and percent.
I will then use the putexcel command to put them into an already formatted table with columns of cases and percentages.

Let me start by supporting Nick Cox's comments.
The problem is that there is no simple solution for combining matrices the way you describe. Nevertheless, it is simple to achieve the results you want by taking a very different path from the one you outlined. It's no fun to write an essay describing the technique in natural language; it's much simpler to demonstrate it with code, as I do below, and as I expect Nick might have been inclined to do.
By not providing a Minimal, Complete, and Verifiable example, as described in the link Nick provided to you, you've discouraged others from showing you where you've gone off the tracks.
// create a minimal amount of sample data hopefully similar to actual data
clear
input year region sex
2001 1 1
2001 1 2
2001 1 2
2002 1 1
2002 1 2
2001 2 1
2002 2 1
2002 2 2
end
list, clean noobs
// use collapse to generate summaries equivalent to two tabs
generate male = sex==1
collapse (count) total=male (sum) cases=male, by(year region)
list, clean noobs
generate percent = 100*cases/total
keep year region total percent
// flatten and interleave the columns
reshape wide total percent, i(year) j(region)
drop year
list, clean noobs
// now use export excel to output,
// or use mkmat to load into a matrix and use putexcel to output
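As an aside, the column interleaving itself is easy to express outside Stata. Here is a small NumPy sketch of the reshuffle, purely as an illustration (it assumes cases and percent are equally sized 2-D arrays; it is not part of the Stata workflow above):
import numpy as np

def interleave_columns(cases, percent):
    # columns come out as cases c1, percent c1, cases c2, percent c2, ...
    rows, cols = cases.shape
    out = np.empty((rows, 2 * cols))
    out[:, 0::2] = cases    # even columns take the counts
    out[:, 1::2] = percent  # odd columns take the percentages
    return out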

Related

reformulating for loop with vectorization or other approach - octave

Is there any way to vectorize (or reformulate) each body of the loop in this code:
col=load('col-deau'); %load data
h=col(:,8); % corresponding water column
dates=col(:,3); % and its dates
%removing out-of-bound data
days=days(h~=9999.000);
h=h(h~=9999.000);
dates=sort(dates(h~=9999.000));
[k,hcat]=hist(h,nbin); %making classes (k) and boundaries of classes (hcat) of water column automatically
dcat=1:15; % make boundaries for dates
for k=1:length(dcat)-1 % Loop over each date class
  ii=find(dates>=dcat(k)&dates<dcat(k+1)); % indices of dates falling within this date class
  for j=1:length(hcat)-1 % Loop over each water-column class
    ij=find(h>=hcat(j)&h<hcat(j+1)); % indices of water-column values falling within this class
    obs(k,j)=length(intersect(ii,ij)); % size of the intersection = count for this (date, water-column) cell
  end
end
I've tried using vectorization, for example, to change this part:
for k=1:length(dcat)-1
ii=find(dates>=dcat(k)&dates<dcat(k+1))
endfor
with this:
nk=1:length(dcat)-1;
ii2=find(dates>=dcat(nk)&dates<dcat(nk+1));
and also using bsxfun:
ii2=find(bsxfun(@and,bsxfun(@ge,dates,nk),bsxfun(@lt,dates,nk+1)));
but to no avail. Both approaches produce identical output, which does not correspond to the output of the for loop (in terms of elements and vector size).
For information: h is a vector containing the water column in meters, and dates is a vector (two-digit integers) containing the dates on which each water-column measurement was taken.
The input file can be found here: https://drive.google.com/open?id=1EomLGYleaNtiGG2iV_9LRt425blxdIsm
As for the output, I want to have ii like this:
ii =
   1177
   1178
   1179
   ...
   1272
(the column vector of the 96 consecutive indices 1177 through 1272)
Instead, with the first approach I get an ii2 that is very different in both values and vector size (I can't post the result because the vector is too big).
Can someone help a desperate newbie here? I just need to reformulate the loop part into a better, more concise version.
If more details need to be added, please feel free to ask me.
You can use hist3:
pkg load statistics
[obs, ~] = hist3([dates(:) h(:)], 'Edges', {dcat, hcat});
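The same joint-binning idea in NumPy, for readers outside Octave (a sketch only, assuming dates, h, dcat and hcat are the arrays from the question; the last bin of histogram2d includes its right edge, otherwise the intervals are half-open like the loop's):
import numpy as np

# counts the pairs (dates[i], h[i]) jointly over the date and water-column bin
# edges, which is what the nested loop computes cell by cell
obs, date_edges, h_edges = np.histogram2d(dates, h, bins=[dcat, hcat])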

Identifying DEFLATE Algorithm Variant Being Used in Proprietary File Format

Disclaimer: This problem requires a very good knowledge of the DEFLATE algorithm.
I am hoping to solicit some ideas for identifying the compression algorithm being used in a particular file format. This is a legacy proprietary format that my application needs to support, so we are trying to reverse engineer it. (Going to the original creator is not an option, for reasons I won't get into.)
I'm extremely close to cracking it, but I feel like I'm living Zeno's paradox, because every day I seem to get halfway closer to the finish line but never quite there!
Here's what I know so far:
It is definitely using something extremely similar to the DEFLATE algorithm. The similarities:
The compressed data is represented by canonical Huffman codes (usually starting with 000, but I'm not sure that is always the case).
The data is preceded (I believe immediately) by a header table which identifies the bit lengths of each of the actual codes. Like DEFLATE, this table ALSO comprises canonical Huffman codes (starting either at 0 or 00). These codes provide the bit lengths of each character in the 0-255+ alphabet plus whatever distance codes might be used.
Finally, again like DEFLATE, the header table with the bit lengths for the main codes is itself preceded (I think immediately) by a series of 3-bit codes used to derive the header table codes (I'll call this the "pre-header").
At this point the similarities seem to end though.
The 3-bit codes in the pre-header do not appear to go in the optimal 16, 17, 18, 0, 8, ... order specified by DEFLATE, but rather seem to go sequentially, like 6, 7, 8, 9, ....
Another difference is that each 3-bit code is not necessarily a literal bit length. For example, here's a header that I've mostly deciphered (I'm 99.99% confident it is correct):
00000001011 100 010 110 010 010 011 010 110 101 100 011 010 010 011 100 010 111
*0* skA *3* *4* *5* *6* *7* *8* *9* skB
Ignoring the unmarked bits, this results in the following code table:
00 7-bits
01 8-bits
100 6-bits
101 9-bits
1100 0-bits (skip code)
1101 skA = skip 3 + value of next 4 bits
1110 5-bits
11110 4-bits
111110 skB = skip 11? + value of next 9 bits
111111 3-bits
The most glaring problem is that there are additional bit lengths in the header table that are unused. And, in fact, they would not be usable at all, as there cannot be any additional 2-bit or 3-bit codes, for example, if the codes are to be canonical (right?).
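To make that canonical-code constraint concrete, here is a small Python sketch of canonical code assignment from bit lengths, following the same procedure as RFC 1951: given the lengths, the codes are fully determined, which is why no extra 2- or 3-bit codes could be squeezed into the table above. The symbol names are just labels for this example.
def canonical_codes(lengths):
    # lengths: list of (symbol, bit_length) pairs in alphabet order; returns {symbol: code string}
    max_len = max(l for _, l in lengths if l)
    bl_count = [0] * (max_len + 1)          # how many codes there are of each length
    for _, l in lengths:
        if l:
            bl_count[l] += 1
    next_code = [0] * (max_len + 1)         # smallest code value for each length
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    codes = {}
    for sym, l in lengths:                  # assign codes in alphabet order
        if l:
            codes[sym] = format(next_code[l], "0{}b".format(l))
            next_code[l] += 1
    return codes

# reproduces the first header table above:
canonical_codes([("7-bit", 2), ("8-bit", 2), ("6-bit", 3), ("9-bit", 3),
                 ("skip", 4), ("skA", 4), ("5-bit", 4),
                 ("4-bit", 5), ("skB", 6), ("3-bit", 6)])
# -> {'7-bit': '00', '8-bit': '01', '6-bit': '100', '9-bit': '101',
#     'skip': '1100', 'skA': '1101', '5-bit': '1110', '4-bit': '11110',
#     'skB': '111110', '3-bit': '111111'}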
The author is also using non-standard codes for 16+. They don't seem to use the copy code (16 in DEFLATE) at all; the main headers all have huge strings of identical length codes (terribly inefficient...), and the skip codes use the next 4 and 9 bits to determine the number of skips, respectively, rather than 3 and 7 as in DEFLATE.
Yet another key difference is in the very first bits of the header. In DEFLATE the first bits are HLIT(5), HDIST(5), and HCLEN(4). If I interpreted the above header that way using LSB packing, I'd get HLIT = 257 (correct), HDIST = 21 (unsure if correct) and HCLEN = 7 (definitely not correct). If I use MSB packing instead, I'd get HLIT=257, HDIST = 6 (more likely correct) and HCLEN = 16 (appears correct). BUT, I don't think there are actually intended to be 14 bits in the prefix because I appear to need the "100" (see above) for the bit count of the 0-bit (skip) code. And in other examples, bits 10-13 don't appear to correlate to the length of the pre-header at all.
Speaking of other examples, not every file appears to follow the same header format. Here's another header:
00000001000 100 100 110 011 010 111 010 111 011 101 010 110 100 011 101 000 100 011
In this second example, I again happen to know that the code table for the header is:
0 8-bits
10 7-bits
110 6-bits
11100 skA
11101 5-bits
111100 0-bits (skip)
111101 skB
111110 9-bits
1111110 3-bits
1111111 4-bits
However, as you can see, many of the required code lengths are not in the header at all. For example, there's no "001" to represent the 8-bit code, and they are not even close to being in sequence (neither consecutively nor in the optimal 16, 17, 18, ... order).
And yet, if we shift the bits left by 1:
skA *0* *5* *6* *7* *8* *9*
0000000100 010 010 011 001 101 011 101 011 101 110 101 011 010 001 110 100 010 001 1
This is much better, but we still can't correctly derive the code for skB (110), or 3 or 4 (111). Shifting by another bit does not improve the situation.
Incidentally, if you're wondering how I am confident that I know the code tables in these two examples, the answer is a LOT of painstaking reverse engineering, i.e., looking at the bits in files that differ slightly or have discernible patterns, and deriving the canonical code table being used. These code tables are almost certainly (99+%) correct.
To summarize, then, we appear to have an extremely close variant of DEFLATE, but for inexplicable reasons one that uses some kind of non-standard pre-header. Where I am getting tripped up, of course, is identifying which pre-header bits correspond to the code bit-lengths for the main header. If I had that, everything would fall into place.
I have a couple of other examples I could post, but rather than ask people to do pattern matching for me, what I'm really praying for is that someone will recognize the algorithm being used and be able to point me to it. I find it unlikely that the author, rather than use an existing standard, would have gone to the trouble of coding his own algorithm from scratch that was 99% like DEFLATE but then changed the pre-header structure only slightly. It makes no sense; if they simply wanted to obfuscate the data to prevent what I'm trying to do, there are much easier and more effective ways.
The software dates back to the late 90s, early 2000s, by the way, so consider what was being done back then. This is not "middle out" or anything new and crazy. It's something old and probably obscure. I'm guessing some variant of DEFLATE that was in use in some semi-popular library around that time, but I've not been having much luck finding information on anything that isn't actually DEFLATE.
Many, many thanks for any input.
Peter
PS - As requested, here is the complete data block from the first example in the post. I don't know if it'll be of much use, but here goes. BTW, the first four bytes are the uncompressed output size. The fifth byte begins the pre-header.
B0 01 00 00 01 71 64 9A D6 34 9C 5F C0 A8 B6 D4 D0 76 6E 7A 57 92 80 00 54 51 16 A1 68 AA AA EC B9 8E 22 B6 42 48 48 10 9C 11 FE 10 84 A1 7E 36 74 73 7E D4 90 06 94 73 CA 61 7C C8 E6 4D D8 D9 DA 9D B7 B8 65 35 50 3E 85 B0 46 46 B7 DB 7D 1C 14 3E F4 69 53 A9 56 B5 7B 1F 8E 1B 3C 5C 76 B9 2D F2 F3 7E 79 EE 5D FD 7E CB 64 B7 8A F7 47 4F 57 5F 67 6F 77 7F 87 8F 97 9D FF 4F 5F 62 DA 51 AF E2 EC 60 65 A6 F0 B8 EE 2C 6F 64 7D 39 73 41 EE 21 CF 16 88 F4 C9 FD D5 AF FC 53 89 62 0E 34 79 A1 77 06 3A A6 C4 06 98 9F 36 D3 A0 F1 43 93 2B 4C 9A 73 B5 01 6D 97 07 C0 57 97 D3 19 C9 23 29 C3 A8 E8 1C 4D 3E 0C 24 E5 93 7C D8 5C 39 58 B7 14 9F 02 53 93 9C D8 84 1E B7 5B 3B 47 72 E9 D1 B6 75 0E CD 23 5D F6 4D 65 8B E4 5F 59 53 DF 38 D3 09 C4 EB CF 57 52 61 C4 BA 93 DE 48 F7 34 B7 2D 0B 20 B8 60 60 0C 86 83 63 08 70 3A 31 0C 61 E1 90 3E 12 32 AA 8F A8 26 61 00 57 D4 19 C4 43 40 8C 69 1C 22 C8 E2 1C 62 D0 E4 16 CB 76 50 8B 04 0D F1 44 52 14 C5 41 54 56 15 C5 81 CA 39 91 EC 8B C8 F5 29 EA 70 45 84 48 8D 48 A2 85 8A 5C 9A AE CC FF E8
Edit 7/11/2015
I've managed to decipher quite a bit of additional information. The algorithm is definitely using LZ77 and Huffman coding. The length codes and extra bits seem to all match those used in DEFLATE.
I was able to learn a lot more detail about the pre-header as well. It has the following structure:
HLEN 0 SkS SkL ?? 3 4 5 6 7 8 9 HLIT
00000 00101110 001 0 1100 100 100 110 10 110 101 100 011 010 010 011 100010111
HLEN = the last bit-length in the pre-header - 3 (e.g. 1100 (12) means 9 is the last bit-length code)
HLIT = the number of literal codes in the main dictionary
SkS = "skip short" - skips a # of codes determined by the next 4-bits
SkL = "skip long" - skips a # of codes determined by the next 9-bits
0 - 9 = the number of bits in the dictionary codes for the respective bit lengths
The unmarked bits I'm still unable to decipher. Also, what I'm now seeing is that the pre-header codes themselves appear to have some extra bits thrown in (note the ?? between SkL and 3, above). They're not all straight 3-bit codes.
So, the only essential information that's now missing is:
How to parse the pre-header for extra bits and whatnot; and
How many distance codes follow the literal codes
If I had that information, I could actually feed the remaining data to zlib by manually supplying the code length dictionary along with the correct number of literal vs. distance codes. Everything after this header follows DEFLATE to the letter.
Here are some more example headers, with the bit-length codes indicated along with the number of literal and length codes. Note that in each one I was able to reverse engineer the answers, but I remain unable to match the undeciphered bits to those statistics.
Sample 1
(273 literals, 35 length, 308 total)
????? ???????? ??? ? HLEN 0 SkS SkL ?? 3 ? 4 ? 5 6 7 8 9 HLIT
00000 00100010 010 0 1100 110 101 110 10 111 0 111 0 101 011 010 001 110 100010001
Sample 2
(325 literal, 23 length, 348 total)
????? ???????? ??? ? HLEN 0 SkS SkL ?? 3 4 5 6 7 8 9 HLIT
00000 00110110 001 0 1100 101 101 110 10 110 000 101 000 011 010 001 101000101
Sample 3
(317 literal, 23 length, 340 total)
????? ???????? ??? ? HLEN 0 SkS SkL ??? 4 5 ? 6 7 8 9 HLIT
00000 01000100 111 0 1100 000 101 111 011 110 111 0 100 011 010 001 100111101
Sample 4
(279 literals, 18 length, 297 total)
????? ???????? ??? ? HLEN 0 SkS SkL ?? 3 4 5 6 7 8 9 HLIT
00000 00101110 001 0 1100 100 100 110 10 110 101 100 011 010 010 011 100010111
Sample 5
(258 literals, 12 length, 270 total)
????? ???????? ??? ? HLEN 0 SkS SkL ?? 2 3 4 HLIT
00000 00000010 000 0 0111 011 000 011 01 010 000 001 100000010
I'm still hoping someone has seen a non-standard DEFLATE-style header like this before. Or maybe you'll see a pattern I'm failing to see... Many thanks for any further input.
Well I finally managed to fully crack it. It was indeed using an implementation of LZ77 and Huffman coding, but very much a non-standard DEFLATE-like method for storing and deriving the codes.
As it turns out the pre-header codes were themselves fixed-dictionary Huffman codes and not literal bit lengths. Figuring out the distance codes was similarly tricky because unlike DEFLATE, they were not using the same bit-length codes as the literals, but rather were using yet another fixed-Huffman dictionary.
The takeaway for anyone interested is that apparently, there are old file formats out there using DEFLATE-derivatives. They CAN be reverse engineered with determination. In this case, I probably spent about 100 hours total, most of which was manually reconstructing compressed data from the known decompressed samples in order to find the code patterns. Once I knew enough about what they were doing to automate that process, I was able to make a few dozen example headers and thereby find the patterns.
I still fail to understand why they did this rather than use a standard format. It must have been a fair amount of work deriving a new compression format versus just using ZLib. If they were trying to obfuscate the data, they could have done so much more effectively by encrypting it, xor'ing with other values, etc. Nope, none of that. They just decided to show off their genius to their bosses, I suppose, by coming up with something "new" even if the differences from the standard were trivial and added no value other than to make MY life difficult. :)
Thanks to those who offered their input.

How to find out if Prolog performs Tail Call Optimization

Using the development version of SWI-Prolog (Win x64),
I wrote a DCG predicate for a deterministic lexer (hosted on GitHub), so all external predicates leave no choice points:
read_token(parser(Grammar, Tables),
           lexer(dfa-DFAIndex, last_accept-LastAccept, chars-Chars0),
           Token) -->
    (   [Input],
        {
            dfa:current(Tables, DFAIndex, DFA),
            char_and_code(Input, Char, Code),
            dfa:find_edge(Tables, DFA, Code, TargetIndex)
        }
    ->  { table:item(dfa_table, Tables, TargetIndex, TargetDFA),
          dfa:accept(TargetDFA, Accept),
          atom_concat(Chars0, Char, Chars),
          NewState = lexer(dfa-TargetIndex,
                           last_accept-Accept,
                           chars-Chars)
        },
        read_token(parser(Grammar, Tables), NewState, Token)
    ;   {
            ( LastAccept \= none
            -> Token = LastAccept-Chars0
            ;  ( ground(Input)
               -> once(symbol:by_type_name(Tables, error, Index, _)),
                  try_restore_input(Input, FailedInput, InputR),
                  Input = [FailedInput | InputR],
                  format(atom(Error), '~w', [FailedInput]),
                  Token = Index-Error
               ;  once(symbol:by_type_name(Tables, eof, Index, _)),
                  Token = Index-''
               )
            )
        }
    ).
Now, since I use (;)/2 and (->)/2 a lot, I was wondering whether SWI-Prolog can optimize the recursive call read_token(parser(Grammar, Tables), NewState, Token) using tail-call optimization,
or whether I have to split the predicate into several clauses manually.
I just don't know how to find out what the interpreter does, especially since TCO is disabled when running the debugger.
To answer your question, I first looked for "trivial" goals that might prevent last call optimization. I found some:
    ;   ( ground(Input)
        ->  once(symbol:by_type_name(Tables, error, Index, _)),
            try_restore_input(Input, FailedInput, InputR),
            Input = [FailedInput | InputR],
            format(atom(Error), '~w', [FailedInput]),
            Token = Index-Error
        ;   once(symbol:by_type_name(Tables, eof, Index, _)),
            Token = Index-''
        )
In these two cases, LCO is prevented by those goals alone.
Now, I compiled your rule and looked at the expansion with listing:
?- listing(read_token).
read_token(parser(O, B), lexer(dfa-C, last_accept-T, chars-J), Q, A, S) :-
    (   A=[D|G],
        dfa:current(B, C, E),
        char_and_code(D, K, F),
        dfa:find_edge(B, E, F, H),
        N=G
    ->  table:item(dfa_table, B, H, I),
        dfa:accept(I, L),
        atom_concat(J, K, M),
        P=lexer(dfa-H, last_accept-L, chars-M),
        R=N,
        read_token(parser(O, B),
                   P,
                   Q,
                   R,
                   S)           % 1: looks nice!
    ;   (   T\=none
        ->  Q=T-J
        ;   ground(D)
        ->  once(symbol:by_type_name(B, error, W, _)),
            try_restore_input(D, U, V),
            D=[U|V],
            format(atom(X), '~w', [U]),
            Q=W-X               % 2: prevents LCO
        ;   once(symbol:by_type_name(B, eof, W, _)),
            Q=W-''              % 3: prevents LCO
        ),
        S=A                     % 4: prevents LCO
    ).
ad 1) This is the recursive case you are most probably looking for. Here, everything looks nice.
ad 2,3) Already discussed above; maybe you want to exchange those goals.
ad 4) This is an effect of the precise, steadfast way {}//1 is handled in DCGs. As a rule of thumb, implementers prefer to be steadfast rather than to strive for LCO-ness. Please refer to: DCG Expansion: Is Steadfastness ignored?
Please note also that there is a lot more to this than the simple reuse of the call frame. There is a lot of interaction with garbage collection. To overcome all those problems in SWI, an additional GC phase was necessary.
For more, refer to the tiny benchmarks in Precise Garbage Collection in Prolog
So, to finally answer your question: your rule may be optimized, provided there is no choicepoint left before the recursive goal.
There is also a real low-level approach to this, which I never use for code development: vm_list. The listing shows you definitively whether or not SWI might consider LCO (again, provided no choicepoint is there).
i_call and i_callm will never perform LCO; only i_depart will. See instruction 142: i_depart(read_token/5).
?- vm_list(read_token).
========================================================================
read_token/5
========================================================================
0 s_virgin
1 i_exit
----------------------------------------
clause 1 ((0x1cc4710)):
----------------------------------------
0 h_functor(parser/2)
2 h_firstvar(5)
4 h_firstvar(6)
6 h_pop
7 h_functor(lexer/3)
9 h_functor((-)/2)
11 h_const(dfa)
13 h_firstvar(7)
15 h_pop
16 h_functor((-)/2)
18 h_const(last_accept)
20 h_firstvar(8)
22 h_pop
23 h_rfunctor((-)/2)
25 h_const(chars)
27 h_firstvar(9)
29 h_pop
30 i_enter
31 c_ifthenelse(26,118)
34 b_unify_var(3)
36 h_list_ff(10,11)
39 b_unify_exit
40 b_var(6)
42 b_var(7)
44 b_firstvar(12)
46 i_callm(dfa,dfa:current/3)
49 b_var(10)
51 b_firstvar(13)
53 b_firstvar(14)
55 i_call(char_and_code/3)
57 b_var(6)
59 b_var(12)
61 b_var(14)
63 b_firstvar(15)
65 i_callm(dfa,dfa:find_edge/4)
68 b_unify_fv(16,11)
71 c_cut(26)
73 b_const(dfa_table)
75 b_var(6)
77 b_var(15)
79 b_firstvar(17)
81 i_callm(table,table:item/4)
84 b_var(17)
86 b_firstvar(18)
88 i_callm(dfa,dfa:accept/2)
91 b_var(9)
93 b_var(13)
95 b_firstvar(19)
97 i_call(atom_concat/3)
99 b_unify_firstvar(20)
101 b_functor(lexer/3)
103 b_functor((-)/2)
105 b_const(dfa)
107 b_argvar(15)
109 b_pop
110 b_functor((-)/2)
112 b_const(last_accept)
114 b_argvar(18)
116 b_pop
117 b_rfunctor((-)/2)
119 b_const(chars)
121 b_argvar(19)
123 b_pop
124 b_unify_exit
125 b_unify_fv(21,16)
128 b_functor(parser/2)
130 b_argvar(5)
132 b_argvar(6)
134 b_pop
135 b_var(20)
137 b_var2
138 b_var(21)
140 b_var(4)
142 i_depart(read_token/5)
144 c_var_n(22,2)
147 c_var_n(24,2)
150 c_jmp(152)
152 c_ifthenelse(27,28)
155 b_var(8)
157 b_const(none)
159 i_call((\=)/2)
161 c_cut(27)
163 b_unify_var(2)
165 h_functor((-)/2)
167 h_var(8)
169 h_var(9)
171 h_pop
172 b_unify_exit
173 c_var(10)
175 c_var_n(22,2)
178 c_var_n(24,2)
181 c_jmp(101)
183 c_ifthenelse(28,65)
186 b_firstvar(10)
188 i_call(ground/1)
190 c_cut(28)
192 b_functor((:)/2)
194 b_const(symbol)
196 b_rfunctor(by_type_name/4)
198 b_argvar(6)
200 b_const(error)
202 b_argfirstvar(22)
204 b_void
205 b_pop
206 i_call(once/1)
208 b_var(10)
210 b_firstvar(23)
212 b_firstvar(24)
214 i_call(try_restore_input/3)
216 b_unify_var(10)
218 h_list
219 h_var(23)
221 h_var(24)
223 h_pop
224 b_unify_exit
225 b_functor(atom/1)
227 b_argfirstvar(25)
229 b_pop
230 b_const('~w')
232 b_list
233 b_argvar(23)
235 b_nil
236 b_pop
237 i_call(format/3)
239 b_unify_var(2)
241 h_functor((-)/2)
243 h_var(22)
245 h_var(25)
247 h_pop
248 b_unify_exit
249 c_jmp(33)
251 b_functor((:)/2)
253 b_const(symbol)
255 b_rfunctor(by_type_name/4)
257 b_argvar(6)
259 b_const(eof)
261 b_argfirstvar(22)
263 b_void
264 b_pop
265 i_call(once/1)
267 b_unify_var(2)
269 h_functor((-)/2)
271 h_var(22)
273 h_const('')
275 h_pop
276 b_unify_exit
277 c_var(10)
279 c_var_n(23,2)
282 c_var(25)
284 b_unify_vv(4,3)
287 c_var_n(11,2)
290 c_var_n(13,2)
293 c_var_n(15,2)
296 c_var_n(17,2)
299 c_var_n(19,2)
302 c_var(21)
304 i_exit

How to calculate classification error rate

Alright. This question is pretty hard, so I am going to give you an example.
The left numbers are my algorithm's classifications and the right numbers are the original class numbers:
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 89
177 89
177 89
177 89
177 89
177 89
177 89
So here my algorithm merged two different classes into one: as you can see, it merged classes 86 and 89. What would the error be in this example?
Here is another example:
203 7
203 7
203 7
203 7
16 7
203 7
17 7
16 7
203 7
In the example above, the left numbers are again my algorithm's classifications and the right numbers are the original class ids. As you can see, it misclassified 3 products (I am classifying the same commercial products). What would the error rate be in this example, and how would you calculate it?
This question is pretty hard and complex. We have finished the classification, but we are not able to find the correct method for calculating the success rate. :D
Here's a longish example: a real confusion matrix with 10 input classes "0" - "9" (handwritten digits) and 10 output clusters labelled A - J.
Confusion matrix for 5620 optdigits:
True 0 - 9 down, clusters A - J across
-----------------------------------------------------
A B C D E F G H I J
-----------------------------------------------------
0: 2 4 1 546 1
1: 71 249 11 1 6 228 5
2: 13 5 64 1 13 1 460
3: 29 2 507 20 5 9
4: 33 483 4 38 5 3 2
5: 1 1 2 58 3 480 13
6: 2 1 2 294 1 1 257
7: 1 5 1 546 6 7
8: 415 15 2 5 3 12 13 87 2
9: 46 72 2 357 35 1 47 2
----------------------------------------------------
580 383 496 1002 307 670 549 557 810 266 estimates in each cluster
y class sizes: [554 571 557 572 568 558 558 566 554 562]
kmeans cluster sizes: [ 580 383 496 1002 307 670 549 557 810 266]
For example, cluster A has 580 data points, 415 of which are "8"s;
cluster B has 383 data points, 249 of which are "1"s; and so on.
The problem is that the output classes are scrambled, permuted;
they correspond in this order, with counts:
A B C D E F G H I J
8 1 4 3 6 7 0 5 2 6
415 249 483 507 294 546 546 480 460 257
One could say that the "success rate" is
75 % = (415 + 249 + 483 + 507 + 294 + 546 + 546 + 480 + 460 + 257) / 5620
but this throws away useful information —
here, that E and J both say "6", and no cluster says "9".
So, add up the biggest numbers in each column of the confusion matrix
and divide by the total.
But, how to count overlapping / missing clusters,
like the 2 "6"s, no "9"s here ?
I don't know of a commonly agreed-upon way
(doubt that the Hungarian algorithm
is used in practice).
Bottom line: don't throw away information; look at the whole confusion matrix.
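As a minimal Python sketch of the rate described above (assuming conf holds the confusion matrix as a NumPy integer array, true classes down and clusters across, with zeros in the blank cells):
import numpy as np

def column_max_rate(conf):
    # biggest count in each cluster column, summed, divided by the total number of points
    return conf.max(axis=0).sum() / conf.sum()

# for the optdigits table above this gives
# (415 + 249 + 483 + 507 + 294 + 546 + 546 + 480 + 460 + 257) / 5620 ~= 0.75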
NB such a "success rate" will be optimistic for new data !
It's customary to split the data into say 2/3 "training set" and 1/3 "test set",
train e.g. k-means on the 2/3 alone,
then measure confusion / success rate on the test set — generally worse than on the training set alone.
Much more can be said; see e.g.
Cross-validation.
You have to define the error criteria if you want to evaluate the performance of an algorithm, so I'm not sure exactly what you're asking. In some clustering and machine learning algorithms you define the error metric and it minimizes it.
Take a look at this
https://en.wikipedia.org/wiki/Confusion_matrix
to get some ideas
You have to define an error metric yourself. In your case, a simple method would be to find a property mapping for your products, such as
p = properties(id)
where id is the product id and p is likely a vector with one entry per property. Then you can define the error function e (or distance) between two products as
e = d(p1, p2)
Of course, each property must be mapped to a number for this function. This error function can then be used in the classification algorithm and in learning.
In your second example, it seems that you treat the pair (203 7) as a successful classification, so I think you already have a metric of your own. You may want to be more specific to get a better answer.
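A toy Python illustration of that property-mapping idea (the ids come from the question, but the property values below are entirely made up for the example):
import math

# hypothetical property vectors keyed by product id
properties = {
    7:   (1.0, 0.2),
    203: (0.9, 0.3),
    16:  (4.0, 2.0),
    17:  (4.1, 1.9),
}

def d(id1, id2):
    # error/"distance" between two products via their property vectors
    return math.dist(properties[id1], properties[id2])

# d(203, 7) is small (likely the same product), d(16, 7) is large (a misclassification)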
Classification Error Rate (CER) is 1 - Purity (see http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html).
ClusterPurity <- function(clusters, classes) {
sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}
Code from @john-colby.
Or
CER <- function(clusters, classes) {
1- sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}

Algorithm to find average of group of numbers

I have a fairly small list of numbers (a few hundred max), for example this one:
117 99 91 93 95 95 91 97 89 99 89 99
91 95 89 99 89 99 89 95 95 95 89 948
189 99 89 189 189 95 186 95 93 189 95
189 89 193 189 93 91 193 89 193 185 95
89 194 185 99 89 189 95 189 189 95 89
189 189 95 189 95 89 193 101 180 189
95 89 195 185 95 89 193 89 193 185 99
185 95 189 95 89 193 91 190 94 190 185
99 89 189 95 189 189 95 185 95 185 99
89 189 95 189 186 99 89 189 191 95 185
99 89 189 189 96 89 193 189 95 185 95
89 193 95 189 185 95 93 189 189 95 186
97 185 95 189 95 185 99 185 95 185 99
185 95 190 95 185 95 95 189 185 95 189
2451
If you create a graph with X = the number and Y = the number of times we see that number, we'd get something like this (graph omitted):
What I want is the average of each group of numbers. In this example there are 4 groups, and the resulting numbers are 92, 187, 948 and 2451.
The number of groups is not known in advance.
Do you have any idea how to create a (simple, if possible) algorithm to extract these numbers (in C, pseudocode, or plain English, if possible)?
What you want to do is called clustering. If the data you've shown is typical, a greedy approach, such as neighbor joining, should be sufficient. So the procedure is:
1) Apply neighbor joining
2) Apply an (empirically identified) threshold to define the clusters
3) Calculate average of each cluster
Using a package that already has clustering algorithms, such as R, would probably be the easiest course, though neighbor joining is not a particularly hard algorithm.
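As a concrete Python sketch of steps 2 and 3 (this is a simple greedy gap split rather than neighbor joining proper; the threshold is the empirically chosen knob):
def group_averages(numbers, threshold=30):
    values = sorted(numbers)
    groups = [[values[0]]]
    for x in values[1:]:
        if x - groups[-1][-1] > threshold:   # gap too large: start a new group
            groups.append([x])
        else:
            groups[-1].append(x)
    return [sum(g) / len(g) for g in groups]

# on the list in the question this produces four groups whose averages are
# roughly the 92, 187, 948 and 2451 the question expects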
I think std::map<int,int> can easily solve this problem. The key of the map would be the number, and the value would be the number of times/frequency the number occurs.
So the average can be calculated as
int average = (m[key] * key) / count;
where count is the total number of numbers, so it calculates each group's average over all numbers, since you didn't clearly say what you mean by average. I'm also assuming that each distinct number forms its own group!
Here's a way:
Decide what width your bins will be; let's say 10 (e.g. numbers > -5 and <= 5 go into bin 0, numbers > 5 and <= 15 go into bin 1, ...).
Create a structure that holds the list of numbers in each bin. I'd go with something like map<unsigned int, vector<unsigned int> *> in C++.
Now iterate over the numbers and decide which bin each belongs to. Check whether there's already a vector for that bin in your map; if not, create one. Then add the number to the vector.
After iterating over all the numbers, simply calculate the average of each vector.
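A Python sketch of the same binning approach (a dict of lists standing in for the C++ map; bin width 10 with the > -5 and <= 5 convention above):
from collections import defaultdict

def bin_averages(numbers, width=10):
    bins = defaultdict(list)
    for n in numbers:
        # e.g. width 10: -4..5 -> bin 0, 6..15 -> bin 1, ...
        bins[(n + width // 2 - 1) // width].append(n)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}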
So you are looking for "spikes" in the graph. I'm guessing you are interested in the size and position of each group?
You might use something like this:
Sort the numbers
Loop:
Take the highest number you have
Investigate more numbers until you find a number that is too small to belong to the group (maybe 5% smaller)
Calculate the average of the selected numbers
Let the discarded number be the last number
End loop
In PHP you could do it like this:
$array = array(//an array of numbers);
$average = array_sum($array) / count($array);
With multiple groups of numbers you can do something like:
$array = array(
array(array of numbers, group1),
array(array of numbers, group2),
//etc.
);
foreach($array as $numbers)
{
$average[] = array_sum($numbers) / count($numbers);
}
Unless you're looking for the median or mode.
Ah, I see what you're asking now: you're not asking how to find the average, you're asking how to group the numbers and find the average of each group.
Let's see, you'd have to find the mode. $counts = array_count_values($array); array_keys($counts, max($counts)); will do that, and the keys in $counts will be the values of the original array, with the values in $counts being the number of times each number shows up. Then you need to figure out where the bigger gaps in the keys of $counts are. You could also array_unique() the original array and find the gaps in the values.
I wish my statistics teacher had done a bit more than play poker with us, or I could probably figure out the exact statistical method for determining how big the range used to form the groups should be.
