ELKI - input distance matrix - outliers

I'm trying to use ELKI for outlier detection; I have a custom distance matrix and I'm trying to feed it to ELKI to run LOF on it (as a first step).
I try to follow http://elki.dbs.ifi.lmu.de/wiki/HowTo/PrecomputedDistances but it is not very clear to me. What I do:
I don't want to load data from database so I use:
-dbc DBIDRangeDatabaseConnection -idgen.count 100
(where 100 is the number of objects I'll be analyzing)
I use the LOF algorithm and point it at the external distance file:
-algorithm outlier.LOF
-algorithm.distancefunction external.FileBasedDoubleDistanceFunction
-distance.matrix testData.ascii -lof.k 3
My distance file is as follows (very simple, for testing purposes):
0 0 0
0 1 1
0 2 0.2
0 3 0.1
1 1 0
1 2 0.9
1 3 0.9
2 2 0
2 3 0.2
3 3 0
4 0 0.23
4 1 0.97
4 2 0.15
4 3 0.07
4 4 0
5 0 0.1
5 1 0.85
5 2 0.02
5 3 0.15
5 4 0.1
5 5 0
6 0 1
6 1 1
6 2 1
6 3 1
etc
The results say "all in one trivial clustering", but this is not a clustering task, and there definitely are outliers in my data.
Am I doing this right, or what am I missing?

When using DBIDRangeDatabaseConnection, and not giving ELKI any actual data, the visualization cannot produce a particularly useful result (because it doesn't have the actual data, after all). Nor can the data be evaluated automatically.
The "all in one trivial clustering" is an artifact from the automatic attempts to visualize the data, but for the reasons discussed above this cannot work. This clustering is automatically added for unlabeled data, to allow some visualizations to work.
There are two things for you to do:
Set an output handler. For example -resulthandler ResultWriter, which will produce output similar to this:
ID=0 lof-outlier=1.0
Where ID= is the object number, and lof-outlier= is the LOF outlier score.
Alternatively, you can implement your own output handler. An example is found here:
http://elki.dbs.ifi.lmu.de/browser/elki/trunk/src/tutorial/outlier/SimpleScoreDumper.java
Fix DBIDRangeDatabaseConnection. You are, however, bitten by a bug in ELKI 0.6.0~beta1: the DBIDRangeDatabaseConnection doesn't initialize its parameters correctly.
The trivial bug fix (parameters not initialized correctly in the constructor) is here:
http://elki.dbs.ifi.lmu.de/changeset/11027/elki
Alternatively, you can create a dummy input file and use the regular text input. A file containing
0
1
2
...
should do the trick. Use -dbc.in numbers100.txt -dbc.filter FixedDBIDsFilter -dbc.startid 0. The latter arguments are to have your IDs start at 0, not 1 (default).
This workaround will produce a slightly different output format:
ID=0 0.0 lof-outlier=1.0
where the additional column is from the dummy file. The dummy values will not affect the LOF result when an external distance function is used, but this approach will use some additional memory.
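For illustration, a small plain-Python sketch (not part of ELKI) that writes such a dummy file for 100 objects, reusing the numbers100.txt name from above:

# write numbers100.txt containing one line per object: 0, 1, 2, ..., 99
with open("numbers100.txt", "w") as f:
    for i in range(100):
        f.write("%d\n" % i)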

Related

Algorithm strategy to prevent values from bouncing between 2 values when on edge of two 'buckets'

I'm tracking various colored balls in OpenCV (Python) in real time. The tracking is very stable; i.e. when stationary, the values do not change by more than 1 / 2 pixels for the center of the circle.
However, I'm running into what must surely be a well-researched issue: I now need to place the positions of the balls into a rougher grid, essentially just dividing (and rounding) the x,y positions.
e.g.
input range is 0 -> 9
target range is 0 -> 1 (two buckets)
so I do: floor(input / 5)
input: [0 1 2 3 4 5 6 7 8 9]
output: [0 0 0 0 0 1 1 1 1 1]
This is fine, but the problem occurs when a small change in the input value makes the output flip rapidly, i.e. when the value sits at the 'edge' of a division, in a 'sensitive' area.
input: [4 5 4 5 4 5 5 4 ...]
output:[0 1 0 1 0 1 1 0 ...]
i.e. values 4 and 5 (which fall within the 1 pixel error/'noise' margin) cause rapid changes in the output.
What are some of the stratagems / algorithms that deal with this and could help me further?
I searched, but it seems I do not know how to express the issue correctly for Google (or StackOverflow).
I tried adding 'deadzones', i.e. rather than purely dividing, I leave 'gaps' in my output range, which means a value sometimes has no output (i.e. it falls between 'buckets'). This somewhat works, but it means a lot of the screen (i.e. the range of the fluctuation) is not used...
i.e.
input = [0 1 2 3 4 5 6 7 8 9]
output = [0 0 0 0 x x 1 1 1 1]
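For illustration, a minimal Python sketch of the deadzone idea described above (the bucket size, gap width and boundary values are illustrative only):

def bucket_with_deadzone(value, bucket_size=5, gap=1, boundaries=(5,)):
    # give no output when the value lies within `gap` of an interior
    # bucket boundary; otherwise divide as before
    for b in boundaries:
        if b - gap <= value < b + gap:
            return None   # inside the deadzone: no output
    return value // bucket_size

# matches the example above: inputs 4 and 5 fall into the deadzone
print([bucket_with_deadzone(v) for v in range(10)])
# [0, 0, 0, 0, None, None, 1, 1, 1, 1]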
Temporal averaging is not ideal (and doesn't work too well either), and it increases latency.
I just have a 'hunch' that there is a whole body of computer science / signal processing about this.

Algorithm for reading matrices

I need an algorithm that processes an n x m matrix and that scales.
E.g. I have a time series of 3 seconds containing the values 2, 1, 4.
I need to decompose it into a 3 x 4 matrix, where 3 is the number of elements of the time series and 4 is the maximum value. The resulting matrix would look like this:
1 1 1
1 0 1
0 0 1
0 0 1
Is this a bad solution, or is it only considered a data-entry problem?
The question is:
do I need to distribute the information from each row of the matrix across the various elements without losing the values?
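If I understand the intended encoding correctly (this is an assumption about what is being asked), a minimal Python sketch that builds the matrix from the example above, with one column per time step and the value written as ones from the top:

def unary_matrix(values):
    # one column per element; column i holds values[i] ones from the top,
    # padded with zeros down to max(values) rows
    rows = max(values)
    return [[1 if r < v else 0 for v in values] for r in range(rows)]

for row in unary_matrix([2, 1, 4]):
    print(*row)
# 1 1 1
# 1 0 1
# 0 0 1
# 0 0 1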

How to solve the 5 * 5 cube puzzle in an efficient, easy way

There is a 5*5 cube puzzle named the Happy Cube problem, where for a given set of mats you need to assemble a cube.
http://www.mathematische-basteleien.de/cube_its.htm#top
It's like this: 6 blue mats are given.
From the following mats, I need to derive a cube.
This way, it has 3 more solutions.
So, like the first cube.
For such a problem, the easiest approach I could imagine was recursion-based: for each cube I have 6 positions, and for each position I try all the other mats to check which fit, then I recurse to solve the rest. It is like finding all permutations of each of the mats and then finding which combination fits best, so a dynamic-programming-like approach.
But I am making loads of mistakes in the recursion, so is there any better, easier approach I could use to solve this?
I made a matrix out of each mat or diagram provided, then I rotated each one 90 degrees clockwise 4 times and anticlockwise 4 times. I flipped the array and did the same; now for each of the above iterations I have to repeat the step for the other mats, so again recursion.
0 0 1 0 1
1 1 1 1 1
0 1 1 1 0
1 1 1 1 1
0 1 0 1 1
-------------
0 1 0 1 0
1 1 1 1 0
0 1 1 1 1
1 1 1 1 0
1 1 0 1 1
-------------
1 1 0 1 1
0 1 1 1 1
1 1 1 1 0
0 1 1 1 1
0 1 0 1 0
-------------
1 0 1 0 0
1 1 1 1 1
0 1 1 1 0
1 1 1 1 1
1 1 0 1 0
-------------
1st block - the diagram
2nd - rotated clockwise
3rd - rotated anticlockwise
4th - flipped
I'm still struggling to sort out the logic.
I can't believe this, but I actually wrote a set of scripts back in 2009 to brute-force solutions to this exact problem, for the simple cube case. I just put the code on Github: https://github.com/niklasb/3d-puzzle
Unfortunately the documentation is in German because that's the only language my team understood, but source code comments are in English. In particular, check out the file puzzle_lib.rb.
The approach is indeed just a straightforward backtracking algorithm, which I think is the way to go. I can't really say it's easy though; as far as I remember, the 3-d aspect is a bit challenging. I implemented one optimization: find all symmetries beforehand and only try each unique orientation of a piece. The idea is that the more distinctive the pieces are, the fewer options exist for placing them, so we can prune early. In the case of many symmetries, there might be lots of possibilities, and we want to inspect only the ones that are unique up to symmetry.
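As an illustration of that orientation step (a Python sketch, not code from the linked repository): generate the 4 rotations of each piece and of its mirror image, and keep only the distinct ones.

def rotate_cw(m):
    # rotate a square 0/1 matrix 90 degrees clockwise
    return [list(row) for row in zip(*m[::-1])]

def unique_orientations(piece):
    # 4 rotations of the piece and of its mirror image, deduplicated,
    # so symmetric pieces produce fewer candidates to try per slot
    seen, result = set(), []
    for p in (piece, [row[::-1] for row in piece]):
        for _ in range(4):
            p = rotate_cw(p)
            key = tuple(map(tuple, p))
            if key not in seen:
                seen.add(key)
                result.append(p)
    return result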
Basically the algorithm works as follows: First, assign a fixed order to the sides of the cube, let's number them 0 to 5 for example. Then execute the following algorithm:
def check_slots():
    for each edge e:
        if the slots adjacent to e are filled:
            if the 1-0 patterns of the piece edges (excluding the corners)
               have XOR != 0:
                return false
            if the corners are not "consistent":
                return false
    return true

def backtrack(slot_idx, pieces_left):
    if slot_idx == 6:
        # finished, we found a solution, output it or whatever
        return
    for each piece in pieces_left:
        for each orientation o of piece:
            fill slot slot_idx with piece in orientation o
            if check_slots():
                backtrack(slot_idx + 1, pieces_left \ {piece})
            empty slot slot_idx
The corner consistency is a bit tricky: Either the corner must be filled by exactly one of the adjacent pieces or it must be accessible from a yet unfilled slot, i.e. not cut off by the already assigned pieces.
Of course you can drop some or all of the consistency checks and only check at the end, seeing as there are only 8^6 * 6! possible configurations overall. If you have more than 6 pieces, it becomes more important to prune early.

File sharding between servers algorithm

I want to distribute files across multiple servers and have them available with very little overhead. So I was thinking of the following naive algorithm:
Provided that each file has a unique ID number (e.g. 120151), I'm thinking of segmenting the files using the modulo (%) operator. This works if I know the number of servers in advance:
Example with 2 servers (the same idea works for n servers):
server 1 : ID % 2 = 0 (contains even IDs)
server 2 : ID % 2 = 1 (contains odd IDs)
However, when I need to scale this and add more servers, I will have to re-shuffle the files to obey the new algorithm's rules, and we don't want that.
Example:
Say I add server 3 into the mix because I cannot handle the load. Server 3 will contain files that respect the following criteria:
server 3 : ID%3 = 2
Step 1 is to move the files from server 1 and server 2 where ID%3 = 2.
However, I'll have to move some files between server 1 and server 2 so that the following occurs:
server 1 : ID%3 = 0
server 2 : ID%3 = 1
What's the optimal way to achieve this?
My approach would be to use consistent hashing. From Wikipedia:
Consistent hashing is a special kind of hashing such that when a hash
table is resized and consistent hashing is used, only K/n keys need to
be remapped on average, where K is the number of keys, and n is the
number of slots.
The general idea is this:
Think of your servers as arranged on a ring, ordered by their server_id
Each server is assigned a uniformly distributed (random) id, e.g. server_id = SHA(node_name).
Each file is likewise assigned a uniformly distributed id, e.g. file_id = SHA(ID), where ID is as given in your example.
Choose the server that is 'closest' to the file_id, i.e. the first server with server_id > file_id (start searching from the smallest server_id).
If there is no such node, wrap around the ring.
Note: you can use any hash function that generates uniformly distributed hashes, so long as you use the same hash function for both servers and files.
This way, you get to keep O(1) access, and adding/removing is straight forward and does not require reshuffling all files:
a) when adding a new server, the new node takes over from the next node on the ring all the files with ids lower than the new server's id
b) when removing a server, all of its files are handed to the next node on the ring
Tom White's graphically illustrated overview explains in more detail.
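A minimal Python sketch of this ring lookup, assuming SHA-1 for both server and file hashes (the server names here are made up):

import bisect
import hashlib

def sha(key):
    # uniformly distributed integer hash, used for both servers and files
    return int(hashlib.sha1(str(key).encode()).hexdigest(), 16)

class Ring:
    def __init__(self, node_names):
        # servers placed on the ring, ordered by server_id = SHA(node_name)
        self.ring = sorted((sha(name), name) for name in node_names)
        self.keys = [k for k, _ in self.ring]

    def server_for(self, file_id):
        # first server with server_id >= SHA(file_id), wrapping around the ring
        i = bisect.bisect_left(self.keys, sha(file_id))
        return self.ring[i % len(self.ring)][1]

ring = Ring(["server-1", "server-2", "server-3"])
print(ring.server_for(120151))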
To summarize your requirements:
Each server should store an (almost) equal amount of files.
You should be able to determine which server holds a given file - based only on the file's ID, in O(1).
When adding a file, requirements 1 and 2 should hold.
When adding a server, you want to move some files to it from all existing servers, such that requirements 1 and 2 would hold.
Your strategy when adding a 3rd server (x is the file's ID):
x%6 Old New
0 0 0
1 1 1
2 0 --> 2
3 1 --> 0
4 0 --> 1
5 1 --> 2
Alternative strategy:
x%6 Old New
0 0 0
1 1 1
2 0 0
3 1 1
4 0 --> 2
5 1 --> 2
To locate a server after the change:
0: x%6 in [0,2]
1: x%6 in [1,3]
2: x%6 in [4,5]
Adding a 4th server:
x%12 Old New
0 0 0
1 1 1
2 0 0
3 1 1
4 2 2
5 2 2
6 0 0
7 1 1
8 0 --> 3
9 1 --> 3
10 2 2
11 2 --> 3
To locate a server after the change:
0: x%12 in [0,2, 6]
1: x%12 in [1,3, 7]
2: x%12 in [4,5,10]
3: x%12 in [8,9,11]
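A minimal Python sketch of this lookup for the 4-server case (the residue table is copied from above; the divisor 12 is lcm(1,2,3,4)):

# residue of x % 12 -> server, taken from the table above
SERVER_BY_RESIDUE = {0: 0, 1: 1, 2: 0, 3: 1, 4: 2, 5: 2,
                     6: 0, 7: 1, 8: 3, 9: 3, 10: 2, 11: 3}

def server_for(file_id):
    return SERVER_BY_RESIDUE[file_id % 12]

print(server_for(120151))   # 120151 % 12 == 7 -> server 1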
When you add a server, you can always build a new function like this (actually, several alternative functions). The divisor for n servers equals lcm(1, 2, ..., n), so it grows very fast.
Note that you didn't mention whether files are ever removed, and whether you plan to handle that.

Multiple Inputs for Backpropagation Neural Network

I've been working on this for about a week. There are no errors in my code; I just need to get the algorithm and concept right. I've implemented a neural network with one hidden layer, and I use the backpropagation algorithm to correct the weights.
My problem is that the network can only learn one pattern. If I train it with the same training data over and over again, it produces the desired outputs when given input that is numerically close to the training data.
training_input:1, 2, 3
training_output: 0.6, 0.25
after 300 epochs....
input: 1, 2, 3
output: 0.6, 0.25
input 1, 1, 2
output: 0.5853, 0.213245
But if I use multiple, varying training sets, it only learns the last pattern. Aren't neural networks supposed to learn multiple patterns? Is this a common beginner mistake? If so, point me in the right direction. I've looked at many online guides, but I've never seen one that goes into detail about dealing with multiple inputs. I'm using sigmoid for the hidden layer and tanh for the output layer.
Example training arrays:
13 tcp telnet SF 118 2425 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 26 10 0.38 0.12 0.04 0 0 0 0.12 0.3 anomaly
0 udp private SF 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 3 0 0 0 0 0.75 0.5 0 255 254 1 0.01 0.01 0 0 0 0 0 anomaly
0 tcp telnet S3 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 255 79 0.31 0.61 0 0 0.21 0.68 0.6 0 anomaly
The last column (anomaly/normal) is the expected output. I turn everything into numbers, so each word is represented by a unique integer.
I give the network one array at a time, then I use the last column as the expected output to adjust the weights. I have around 300 arrays like these.
As for the hidden neurons, I tried 3, 6 and 20, but nothing changed.
To update the weights, I calculate the gradient for the output and hidden layers. Then I calculate the deltas and add them to their associated weights. I don't understand how that is ever going to learn to map multiple inputs to multiple outputs. It looks linear.
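For reference, here is a minimal sketch (plain Python, not the poster's code, with biases omitted for brevity) of one such gradient/delta update for a single hidden layer with sigmoid hidden units and a tanh output layer, under squared-error loss:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(x, target, w_hid, w_out, lr=0.1):
    # forward pass: sigmoid hidden layer, tanh output layer
    h = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hid]
    o = [math.tanh(sum(wi * hi for wi, hi in zip(w, h))) for w in w_out]
    # output deltas: error times tanh derivative
    d_out = [(t - oi) * (1.0 - oi * oi) for t, oi in zip(target, o)]
    # hidden deltas: back-propagated error times sigmoid derivative
    d_hid = [hi * (1.0 - hi) * sum(d * w_out[k][j] for k, d in enumerate(d_out))
             for j, hi in enumerate(h)]
    # weight updates: add lr * delta * input activation
    for k in range(len(w_out)):
        for j in range(len(h)):
            w_out[k][j] += lr * d_out[k] * h[j]
    for j in range(len(w_hid)):
        for i in range(len(x)):
            w_hid[j][i] += lr * d_hid[j] * x[i]
    return o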
If you train a neural network too much on one data set (in terms of the number of iterations through the back-propagation algorithm), the weights will eventually converge to a state that gives the best outcome for that specific training set (overtraining, in machine-learning terms). It will only learn the relationships between input and target data for that specific training set, not the broader, more general relationship you might be looking for. It's better to merge some distinctive sets and train your network on the full set.
Without seeing the code for the back-propagation algorithm, I cannot give you any advice on whether it's working correctly. One problem I had when implementing back-propagation was not properly calculating the derivative of the activation function around the input value. This website was very helpful for me.
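To make the "merge the sets and train on the full set" advice concrete: shuffle the combined examples and present each of them once per epoch, instead of running many epochs on a single example before moving on. A sketch, where network.train_step is a placeholder for one backpropagation update:

import random

def train(network, examples, epochs=300):
    # examples: list of (inputs, targets) pairs from ALL training sets merged
    for _ in range(epochs):
        random.shuffle(examples)            # interleave patterns every epoch
        for inputs, targets in examples:
            network.train_step(inputs, targets)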
No, neural networks are not supposed to know multiple tricks.
You train them for a specific task.
Yes, they can be trained for other tasks as well,
but then they get optimized for that other task.
That's why you should create load and save functions for your network, so that you can easily switch brains and perform other tasks, if required.
If you're not sure which task it currently is, train a neural network to find the difference between the tasks.