Identifying outliers based on Y values - outliers

Is it possible to identify outliers based on the Y values rather than the X values?
I mean, I want to remove data points whose Y values differ greatly from the rest, not their X values.
Thanks

Related

How do I combine two properties (each has opposite impact) of the data points to filter out the best data point?

This question is more about logic than programming. My dataset has several data points (you can think of the dataset as an array and the data points as its elements). Each data point is defined by two properties. For example, if x is one of the data points in the dataset X, then a and b are the properties or characteristics of x. The larger the value of a (which ranges from 0 to 1; think of it as a probability), the better the chance that x is selected. Conversely, the larger the value of b (think of it as any number larger than 1), the smaller the chance that x is selected. From the data points in X, I need to select the one that has the maximum a value and the minimum b value. Note that a single data point may not satisfy both conditions at the same time. For example, x may have the largest a value but not the smallest b value, and vice versa. Hence, I want to combine a and b into another meaningful weight value that helps me filter out the right data point from X.
Is there any mathematical solution to my problem?

Set membership query for "fuzzy" set

I have a set Sx of integers over which I want to answer set membership queries in as little space as possible. (edited to make more clear in response to Niklas' comment)
However, I am allowed some "fuzziness" in the numbers, i.e., it is ok if instead of a number x in Sx, I store any other number y in the range [x-k, x+k]. Here k is a given "fuzziness" constant.
Now, we have modified the set Sx to another set Sy formed by all the y values, which are fuzzed versions of the x values. I would like to have a data structure that can answer set membership queries over Sy, possibly with some error probability e (note that the elements in Sx are no longer relevant to the set membership query; it is OK even if all the elements in Sx are changed to different values).
A simple answer would be to create a Bloom filter consisting of all elements in Sx, which will consume O(|Sx| log(1/e)) bits. I would like to know if this bound can be improved upon in my specific scenario.
In practice, the number k is around 3, while the numbers in set S are spaced by around 30.

Difference between Rand and Jaccard similarity index?

What is the theoretical difference between Rand and Jaccard similarity/validation index?
I'm not interested in equations, but the interpretation of their difference.
I know Jaccard index neglects true negatives, but why? And what kind of impact does this have?
Thanks
I worked with these in my Master's thesis in computational biology so hopefully I should be able to answer this in a way which helps you-
The shorter version -
J=TP/(TP+FP+FN) while R=(TP+TN)/(TP+TN+FP+FN)
Naturally, TN are neglected by Jaccard by definition. For very large datasets, the number of TN can be pretty huge, which was the case in my thesis. So, that term was driving all the analysis. When I shifted from rand index to Jaccard Index, I neglected the contribution of TN and was able to understand things better.
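To see the effect of a large TN count in numbers, here is a minimal Python sketch. The counts are made up for illustration and are not from the thesis; only the two formulas come from the text above.

```python
def rand_index(tp, tn, fp, fn):
    # R = (TP + TN) / (TP + TN + FP + FN)
    return (tp + tn) / (tp + tn + fp + fn)


def jaccard_index(tp, tn, fp, fn):
    # J = TP / (TP + FP + FN); tn is accepted but deliberately ignored
    return tp / (tp + fp + fn)


# Hypothetical counts: TP, FP and FN stay fixed while TN grows.
for tn in (10, 1_000, 100_000):
    r = rand_index(50, tn, 25, 25)
    j = jaccard_index(50, tn, 25, 25)
    print(f"TN={tn:>6}  Rand={r:.4f}  Jaccard={j:.4f}")
```

With these inputs the Jaccard value stays at 0.5 while the Rand value climbs towards 1 as TN grows, which is the "TN was driving all the analysis" effect described above.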
The longer version-
Rand and Jaccard indices are more often used to compare partitionings/clusterings than usual response-characteristic statistics like sensitivity/specificity etc. But they can in some sense be extended to the idea of a true positive or a true negative. Let's go over this in greater detail-
For a set of elements S={a1,a2....an}, we can define two different clustering algos X and Y which divide them into r clusters each - X1,X2...Xr clusters and Y1,Y2....Yr clusters. Combine all X clusters or all Y clusters and you will get your complete S set again.
Now, we define:-
A= the number of pairs of elements in S that are in the same set in X and in the same set in Y
B= the number of pairs of elements in S that are in different sets in X and in different sets in Y
C= the number of pairs of elements in S that are in the same set in X and in different sets in Y
D= the number of pairs of elements in S that are in different sets in X and in the same set in Y
Rand Index is defined as - R=(A+B)/(A+B+C+D)
Now look at things this way - Let X be your results from a diagnostic test, while Y are the actual labels on the data points. So, A,B,C,D then reduce to TP,TN,FP,FN (in that order). Basically, R reduces to the definition I gave above.
Now, Jaccard Index-
The Jaccard index disregards the pairs of elements that are in different sets under both clusterings X and Y, i.e. it neglects B, which is the true negatives.
J = (A)/(A+C+D) which reduces to J=(TP)/(TP+FP+FN).
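The pair counting above can be sketched in Python as follows. This is only an illustration under my own assumptions: each clustering is given as a list of cluster labels (one label per element of S), and the function names and sample labels are hypothetical.

```python
from itertools import combinations


def pair_counts(labels_x, labels_y):
    """Return (A, B, C, D) as defined above for two label lists."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            a += 1      # same set in X, same set in Y
        elif not same_x and not same_y:
            b += 1      # different sets in X, different sets in Y
        elif same_x:
            c += 1      # same set in X, different sets in Y
        else:
            d += 1      # different sets in X, same set in Y
    return a, b, c, d


def rand_and_jaccard(labels_x, labels_y):
    a, b, c, d = pair_counts(labels_x, labels_y)
    rand = (a + b) / (a + b + c + d)
    # If A + C + D is 0, no pair is ever grouped together; returning 1.0
    # there is a convention picked for this sketch, not part of the answer.
    jaccard = a / (a + c + d) if (a + c + d) else 1.0
    return rand, jaccard


# Hypothetical example: six elements, X has two clusters, Y has three.
labels_x = [0, 0, 0, 1, 1, 1]
labels_y = [0, 0, 1, 1, 2, 2]
print(pair_counts(labels_x, labels_y))       # (A, B, C, D)
print(rand_and_jaccard(labels_x, labels_y))  # (Rand, Jaccard)
```

In this example B accounts for more than half of the 15 pairs, so the Rand value comes out much higher than the Jaccard value for the same two clusterings.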
And that's how the two statistics are fundamentally different. If you want more info on these, here's a pretty good paper, and a website which might be of use to you -
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.164.6189&rep=rep1&type=pdf
http://clusteval.sdu.dk/313/clustering_quality_measures/542
Hope this helps!

Robot moving in a grid algorithm possible paths and time complexity ?

I'm not able to understand why, for the problem below, the number of paths is (x+y)!/x!y!. I understand it comes from choosing x items out of a path of x+y items, but why is it not choosing x items out of x+y plus choosing y items out of x+y? Why does it have to be only x?
A robot is located at the top-left corner of a m x n grid (marked
‘Start’ in the diagram below). The robot can only move either down or
right at any point in time. The robot is trying to reach the
bottom-right corner of the grid (marked ‘Finish’ in the diagram
below). How many possible paths are there?
Are all of these paths unique ?
How do I determine that ?
And what would be the time complexity for the backtracking algorithm ?
This is somewhat based on Mukul Joshi's answer, but hopefully a little clearer.
To go from 0,0 to x,y, you need to move right exactly x times and down exactly y times.
Let each right movement be represented by a 0 and a down movement by a 1.
Let a string of 0s and 1s then indicate a path from 0,0 to x,y. This path will contain x 0s and y 1s.
Now we want to count all such strings. This is equivalent to counting the number of permutations of any string containing x 0s and y 1s. This string is a multiset (each element can appear more than once), thus we want a multiset permutation, which can be calculated as n!/(m1!m2!...mk!) where n is the total number of characters, k is the number of unique characters and mi is the number of times the ith unique character is repeated. Since there are x+y characters in total, and 0 is repeated x times and 1 is repeated y times, we get to (x+y)!/x!y!.
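As a quick sanity check of the counting argument, here is a small Python sketch that compares the formula with a brute-force count of distinct 0/1 strings. The function names are my own, and the brute-force version is only practical for small x and y.

```python
from itertools import permutations
from math import comb, factorial


def count_paths_formula(x, y):
    # Multiset permutation: (x + y)! / (x! * y!)
    return factorial(x + y) // (factorial(x) * factorial(y))


def count_paths_bruteforce(x, y):
    # Every distinct ordering of x zeros (right) and y ones (down) is a path.
    moves = "0" * x + "1" * y
    return len(set(permutations(moves)))


for x, y in [(1, 1), (2, 2), (3, 2)]:
    print(x, y, count_paths_formula(x, y), count_paths_bruteforce(x, y))
```

The formula is the binomial coefficient, so `math.comb(x + y, x)` gives the same number.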
Time Complexity:
The time complexity of backtracking / brute force would involve having to explore all of these paths. Think of it as a tree with (x+y)!/x!y! leaves, one per path. If every internal node had at least two children, the number of nodes would be O(number of leaves). Here that does not quite hold, because a node on the right or bottom edge of the grid has only one child. A safe bound is that every leaf sits at depth x+y, so the tree has at most (x+y+1) * (x+y)!/x!y! nodes, and the time complexity is O((x+y) * (x+y)!/x!y!).
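One way to check the size of the search tree is to count the recursive calls directly. The sketch below is my own instrumentation of a plain backtracking search (the function name and arguments are hypothetical); it returns both the number of paths (leaves) and the number of calls (nodes).

```python
def explore(right_left, down_left):
    """Return (paths, calls) for a plain backtracking search.

    right_left / down_left are the right and down moves still required.
    """
    if right_left == 0 and down_left == 0:
        return 1, 1                      # reached the finish: one leaf
    paths, calls = 0, 1                  # count this call itself
    if right_left > 0:
        p, c = explore(right_left - 1, down_left)
        paths += p
        calls += c
    if down_left > 0:
        p, c = explore(right_left, down_left - 1)
        paths += p
        calls += c
    return paths, calls


for x, y in [(1, 1), (2, 1), (10, 1), (6, 6)]:
    paths, calls = explore(x, y)
    print(f"x={x} y={y} paths={paths} calls={calls}")
```

For a long, thin grid such as x=10, y=1 the number of calls is several times the number of paths, because most nodes have a single child. The call count never exceeds (x+y+1) times the number of paths.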
OK, here is a solution to the problem that should make it easier to grasp.
First of all, let us decide on an algorithm. For every cell, we count all possible paths from that cell to the finish. The algorithm visits each cell and writes into it the sum of the values of the cell to its right and the cell below it. We do this because the robot can either move down and follow any of the paths from the cell below, or move right and follow any of the paths from the cell to the right, so the two counts add up. That these paths are all distinct seems fairly obvious to me; if you want, I can prove it in the comments.
The initial value is 1 for the bottom-right cell (finish), because there is only 1 way to get to the finish from that cell (not moving at all). If a cell doesn't exist (e.g. the cell below a cell in the bottom row), it has a value of 0.
Building the cell values one by one results in Pascal's triangle, whose value in cell (x, y) is (x + y)! / (x! y!), where x is the horizontal distance from the finish and y is the vertical one.
As for complexity, we have x * y iterations over the grid cells, and each iteration takes constant time. If you don't want to use this algorithm, you can use the formula mentioned above and get O(x + y) instead of O(x * y).
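The cell-filling algorithm described in this answer can be sketched in Python like this. It is a minimal sketch under my own conventions: x and y are the numbers of right and down moves needed (so an m x n grid corresponds to x = n - 1, y = m - 1), and the function name is hypothetical.

```python
def count_paths_dp(x, y):
    """Count paths by filling each cell with (value to the right + value below)."""
    # One extra row and column of zeros stand in for cells that do not exist.
    ways = [[0] * (y + 2) for _ in range(x + 2)]
    for i in range(x, -1, -1):
        for j in range(y, -1, -1):
            if i == x and j == y:
                ways[i][j] = 1           # finish: one way, which is not moving
            else:
                ways[i][j] = ways[i + 1][j] + ways[i][j + 1]
    return ways[0][0]


print(count_paths_dp(3, 2))   # 3 moves one way, 2 the other
print(count_paths_dp(2, 6))   # e.g. a 3 x 7 grid
```

The two nested loops touch (x + 1) * (y + 1) cells and do constant work per cell, which matches the x * y estimate in the answer.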
Well here is the explanation.
To reach the destination, no matter how you go, the path has to cover m rows and n columns.
Suppose you represent a row move by 1 and a column move by 0. Your path is then a string of m+n characters, but it can contain only m 1s and n 0s.
If you have m+n different characters, the number of permutations is (m+n)!, but when characters repeat, it is (m+n)!/(m!n!). Refer to this
Of course these paths are unique. Test it on a 4*3 grid and you can see it.
You don't add "How many ways can I distribute my X moves?" to "How many ways can I distribute my Y moves?" for two reasons:
1. The distribution of X moves and Y moves are not independent. For each configuration of X moves, there is only 1 possible configuration of Y moves.
2. If they were independent, you wouldn't add them, you would multiply them. For example, if I have X different color shirts and Y different color pants, there are X * Y different combinations of shirts and pants.
Note that for #1 there is nothing special about X - I could just have easily chosen Y and said: "The distribution of Y moves and X moves are not independent. For each configuration of Y moves, there is only 1 possible configuration of X moves." Which is why, as others have pointed out, counting the number of ways to distribute your Y moves gives the same result as counting the number of ways to distribute your X moves.
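To make the "each configuration of X moves fixes the Y moves" point concrete, here is a short Python sketch. The function name and the R/D letters are my own choices for illustration: it builds every path by choosing only the positions of the right moves, and the down moves fill whatever is left.

```python
from itertools import combinations
from math import comb


def paths_from_right_positions(x, y):
    """Build each path by choosing which of the x + y steps are right moves."""
    total = x + y
    paths = []
    for right_steps in combinations(range(total), x):
        chosen = set(right_steps)
        # Every step not chosen as a right move is forced to be a down move.
        paths.append("".join("R" if k in chosen else "D" for k in range(total)))
    return paths


paths = paths_from_right_positions(2, 2)
print(paths)
print(len(paths), comb(4, 2))
```

Choosing the positions of the down moves instead would produce the same set of strings, which is why C(x+y, x) and C(x+y, y) are equal.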

Which algorithm will be required to do this?

I have data of this form:
for x=1, y is one of {1,4,6,7,9,18,16,19}
for x=2, y is one of {1,5,7,4}
for x=3, y is one of {2,6,4,8,2}
....
for x=100, y is one of {2,7,89,4,5}
Only one of the values in each set is the correct value, the rest is random noise.
I know that the correct values describe a sinusoid function whose parameters are unknown. How can I find the correct combination of values, one from each set?
I am looking for something like a "travelling salesman" combinatorial optimization algorithm.
You're trying to do curve fitting, for which there are several algorithms depending on the type of curve you want to fit your data to (linear, polynomial, etc.). I have no idea whether there is a specific algorithm for sinusoidal curves (Fourier approximations), but my first idea would be to use a polynomial fitting algorithm with a polynomial approximation of the sine.
I wonder whether you need to do this as part of a larger program, or whether you are trying to do this task on its own. If it is a task on its own, you'd be much better off using a statistical package, my preferred one being R. It allows you to import your data, fit curves and draw graphs in just a few lines, and you can also run R in batch mode to call it from a script or even a program (this is what I tend to do).
It depends on what you mean by "exactly", and what you know beforehand. If you know the frequency w, and that the sinusoid is unbiased, you have an equation
a cos(w * x) + b sin(w * x)
with two (x,y) points at different x values you can find a and b, and then check the generated curve against all the other points. Choose the two x values with the smallest number of y observations and try it for all the y's. If there is a bias, i.e. your equation is
a cos(w * x) + b sin(w * x) + c
You need to look at three x values.
If you do not know the frequency, you can try the same technique. Unfortunately, the solutions may not be unique: there may be more than one w that fits.
Edit As I understand your problem, you have a real y value for each x and a bunch of incorrect ones. You want to find the real values. The best way to do this is to fit curves through a small number of points and check to see if the curve fits some y value in the other sets.
If not all the x values have valid y values then the same technique applies, but you need to look at a much larger set of pairs, triples or quadruples (essentially every pair, triple, or quad of points with different y values)
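For the simplest case described above (known frequency w, no bias), the idea can be sketched in Python roughly as follows. Everything here is an assumption made for illustration: the function names `solve_ab` and `pick_values`, the tolerance `tol`, and the synthetic data. The sketch only tries the two x values with the fewest candidates, so it gives up if that particular pair leads to a singular system.

```python
import math
from itertools import product


def solve_ab(w, p1, p2):
    """Solve a*cos(w*x) + b*sin(w*x) = y through two points (Cramer's rule)."""
    (x1, y1), (x2, y2) = p1, p2
    c1, s1 = math.cos(w * x1), math.sin(w * x1)
    c2, s2 = math.cos(w * x2), math.sin(w * x2)
    det = c1 * s2 - c2 * s1
    if abs(det) < 1e-9:
        return None                      # the two points do not determine a, b
    return (y1 * s2 - y2 * s1) / det, (c1 * y2 - c2 * y1) / det


def pick_values(candidates, w, tol=1e-6):
    """candidates maps x -> list of y values; return the best x -> y choice."""
    # Start from the two x values with the fewest candidate y values.
    x1, x2 = sorted(candidates, key=lambda x: len(candidates[x]))[:2]
    best = None
    for y1, y2 in product(candidates[x1], candidates[x2]):
        ab = solve_ab(w, (x1, y1), (x2, y2))
        if ab is None:
            continue
        a, b = ab
        chosen = {}
        for x, ys in candidates.items():
            pred = a * math.cos(w * x) + b * math.sin(w * x)
            y = min(ys, key=lambda v: abs(v - pred))
            if abs(y - pred) <= tol:     # this set has a value on the curve
                chosen[x] = y
        if best is None or len(chosen) > len(best):
            best = chosen
    return best


# Synthetic data: the true curve is 2*cos(0.3*x) - sin(0.3*x), plus fixed offsets.
w = 0.3
truth = {x: 2 * math.cos(w * x) - math.sin(w * x) for x in range(1, 9)}
candidates = {}
for x, y in truth.items():
    noise = [y + 1.5] if x <= 2 else [y + 1.5, y - 2.25]
    candidates[x] = noise[:1] + [y] + noise[1:]

result = pick_values(candidates, w)
print(all(result[x] == truth[x] for x in truth))
```

With real measurements the tolerance would have to reflect the measurement error, and with an unknown frequency the same loop would need an outer search over w, where more than one w may fit.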
If your problem is something else, and I suspect it is, please specify it.
1. Define sinusoid. Most people take that to mean a function of the form a cos(w * x) + b sin(w * x) + c. If you mean something different, specify it.
2. Specify exactly what success looks like. An example with, say, 10 points instead of 100 would be nice.
It is extremely unclear what this has to do with combinatorial optimization.
Sinusoidal functions are so general that if you take any arbitrary choice of y values, those values can be fitted by some sinusoidal function. Unless you impose conditions (e.g. frequency < 100, or all parameters are integers), it is theoretically not possible to differentiate noise from data, so first work on finding such conditions from your data source/experiment.
By sinusoidal, do you mean a function that is increasing for n steps, then decreasing for n steps, etc.? If so, you can model your data as a sequence of nodes connected by up-links and down-links. For each node (possible value of y), record the length and end value of the chains of only ascending or only descending links (there will be multiple chains per node). Then you scan for consecutive runs of equal length and opposite direction, modulo some initial offset.

Resources