ELKI's LOF implementation for heavily duplicated data - probability

Does ELKI fail for data which has many duplicate values in it? I have files with more than 2 million observations(1D), but it contains only a few hundred unique values. The rest are duplicates. When I run this file in ELKI, for LOF or LoOP calculations, it returns NAN as outlier scores for any k less than the number of occurrences of a value with highest frequency. I can imagine the LRD calculation must be causing this problem if duplicates are taken as nearest neighbours. But should'nt it NOT be doing this? Can we rely on the results ELKI is producing for such cases?

It is not so much a matter of ELKI, but of the algorithms.
Most outlier detection algorithms use the k nearest neighbors.
If these are identical, the values can be problematic. In LOF, the neighbors of duplicated points can obtain an outlier score of infinity. Similarly, the outlier scores of LoOP probably reach NaN due to a division by 0 if there are too many duplicates.
But that is not a matter of ELKI, but of the definition of these methods. Any implementation that sticks to these definition should exhibit these effects. There are some methods to avoid/reduce the effects:
add jitter to the data set
remove duplicates (but never consider highly dupilcated values outliers!)
increase the neighborhood size
It is easy to prove that such results do arise in LOF/LoOP equations if the data has duplicates.
This limitation of these algorithms can most probably be "fixed", but we want the implementations in ELKI to be close to the original publication, so we avoid doing unpublished changes. But if a "LOFdup" method is published and contributed to ELKI, we would add that obviously.
Note that neither LOF nor LoOP is meant to be used with 1-dimensional data. For 1-dimensional data, I recommend focusing on "traditional" statistical literature instead, such as kernel density estimation. 1-dimensional numerical data is special, because it is ordered - this allows for both optimizations and much more advanced statistics that would be infeasible or require too much observations on multivariate data. LOF and similar methods are very basic statistics (so basic that many statisticians would outright reject them as "stupid" or "naive") - with the key benefit that they easily scale to large, multivariate data sets. Sometimes naive methods such as naive bayes can work very well in practise; the same holds for LOF and LoOP: there are some questionable decisions in the algorithms. But they work, and scale. Just as with naive bayes - the independence assumption in naive bayes is questionable, but naive bayes classification often works well, and scales very well.
In other words, this is not a bug in ELKI. The implementation does what is published.

Related

Algorithm for incomplete ranking with imprecise comparisons

SUMMARY
I'm looking for an algorithm to rank objects. Two objects can be compared. However, the comparisons are real world comparisons that may be flawed. Also, I care more about finding out the very best object than which ones are the worst.
TO MOTIVATE:
Think that I'm scientifically evaluating materials. I combine two materials. I want to find the best working material for in-depth testing. So, I don't care about materials that are unpromising. However, each test can be a false positive or have anomalies between those particular two materials.
PRECISE PROBLEM:
There is an unlimited pool of objects.
Two objects can be compared to each other. It is resource expensive to compare two objects.
It's resource expensive to consider an additional object. So, an object should only be included in the evaluation if it can be fully ranked.
It is very important to find the very best object in the pool of the tested ones. If an object is in the bottom half, it doesn't matter to find out where in the bottom half it is. The importance of finding out the exact rank is a gradient with the top much more important.
Most of the time, if A > B and B > C, it is safe to assume that A > C. Sometimes, there are false positives. Sometimes A > B and B > C and C > A. This is not an abstract math space but real world measurements.
At the start, it is not known how many comparisons are allowed to be taken. The algorithm is granted permission to do another comparison until it isn't. Thus, the decision on including an additional object or testing more already tested objects has to be made.
TO MOTIVATE MORE IN-DEPTH:
Imagine that you are tasked with hiring a team of boxers. You know nothing about evaluating boxers but can ask two boxers to fight each other. There is an unlimited number of boxers in the world. But it's expensive to fly them in. Ideally, you want to hire the n best boxers. Realistically, you don't know if the boxers are going to accept your offer. Plus, you don't know how competitively the other boxing clubs bid. You are going to make offers to only the best n boxers, but have to be prepared to know which next n boxers to send offers to. That you only get the worst boxers is very unlikely.
SOME APPROACHES
I could think of the following approaches. However, they all have drawbacks. I feel like there should be a much better approach.
USE TRADITIONAL SORTING ALGORITHMS
Traditional sorting algorithms could be used.
Drawback:
- A false positive could serious throw of the correctness of the algorithm.
- A sorting algorithm would spend half the time sorting the bottom half of the pack, which is unimportant.
- Sorting algorithms start with all items. With this problem, we are allowed to do the first test, not knowing if we are allowed to do a second test. We may end up only being allowed to do two test. Or we may be allowed to do a million tests.
USE TOURNAMENT ALGORITHMS
There are algorithms for tournaments. E.g., everyone gets a first match. The winner of the first match moves on to the next round. There is a variety of tournament strategies that accounts for people having a bad day or being paired with the champion in their first match.
Drawback:
- This seems pretty promising. The difficulty is to find one that allows adding one more player at a time as we are allowed more comparisons. It seems that there should be a highly specialized solution that's better than a standard tournament algorithm.
BINARY SEARCH
We could start with two objects. Each time an object is added, we could use a binary search to find its spot in the ranking. Because the top is more important, we could use a weighted binary search. E.g. instead of testing the mid point, it tests the point at the top 1/3.
Drawback:
- The algorithm doesn't correct for false positives. If there is a false positive at the top early on, it could skew the whole rest of the tests.
COUNT WINS AND LOSSES
The wins and losses could be counted. The algorithm would choose test subjects by a priority of the least losses and second priority of the most wins. This would focus on testing the best objects. If an object has zero losses, it would get the focus of the testing. It would either quickly get a loss and drop in priority, or it would get a lot more tests because it's the likely top candidate.
DRAWBACK:
- The approach is very nice in that it corrects for false positives. It also allows adding more objects to the test pool easily. However, it does not consider that a win against a top object counts a lot more than a win against a bottom object. Thus, comparisons are wasted.
GRAPH
All the objects could be added to a graph. The graph could be flattened.
DRAWBACK:
- I don't know how to flatten such a messy graph that could have cycles and ambiguous end nodes. There could be multiple objects that are undefeated. How would one pick a winner in such a messy graph? How would one know which comparison would be the most valuable?
SCORING
As a win depends on the rank of the loser, a win could be given a score. Say A > B, means that A gets 1 point. if C > A, C gets 2 points because A has 1 point. In the end, objects are ranked by how many points they have.
DRAWBACK
- The approach seems promising in that it is easy to add new objects to the pool of tested objects. It also takes into account that wins against top objects should count for more. I can't think of a good way to determine the points. That first comparison, was awarded 1 point. Once 10,000 objects are in the pool, an average win would be worth 5,000 points. The award of both tests should be roughly equal. Later comparisons overpower the earlier comparisons and make them be ignored when they shouldn't.
Does anyone have a good idea on tackling this problem?
I would search for an easily computable value for an object, that could be compared between objects to give a good enough approximation of order. You could compare each new object with the current best accurately, then insertion sort the loser into a list of the rest using its computed value.
The best will always be accurate. The ordering of the rest depending on your "value".
I would suggest looking into Elo Rating systems and its derivatives. (like Glicko, BayesElo, WHR, TrueSkill etc.)
So you assign each object a preliminary rating, and then update that value according to the matches/comparisons you make. (with bigger changes to the ratings the more unexpected the outcome was)
This still leaves open the question of how to decide which object to compare to which other object to gain most information. For that I suggest looking into tournament systems and playoff formats. Though I suspect that an optimal solution will be decidedly more ad-hoc than that.

Sorting in Computer Science vs. sorting in the 'real' world

I was thinking about sorting algorithms in software, and possible ways one could surmount the O(nlogn) roadblock. I don't think it IS possible to sort faster in a practical sense, so please don't think that I do.
With that said, it seems with almost all sorting algorithms, the software must know the position of each element. Which makes sense, otherwise, how would it know where to place each element according to some sorting criteria?
But when I crossed this thinking with the real world, a centrifuge has no idea what position each molecule is in when it 'sorts' the molecules by density. In fact, it doesn't care about the position of each molecule. However it can sort trillions upon trillions of items in a relatively short period of time, due to the fact that each molecule follows density and gravitational laws - which got me thinking.
Would it be possible with some overhead on each node (some value or method tacked on to each of the nodes) to 'force' the order of the list? Something like a centrifuge, where only each element cares about its relative position in space (in relation to other nodes). Or, does this violate some rule in computation?
I think one of the big points brought up here is the quantum mechanical effects of nature and how they apply in parallel to all particles simultaneously.
Perhaps classical computers inherently restrict sorting to the domain of O(nlogn), where as quantum computers may be able to cross that threshold into O(logn) algorithms that act in parallel.
The point that a centrifuge being basically a parallel bubble sort seems to be correct, which has a time complexity of O(n).
I guess the next thought is that if nature can sort in O(n), why can't computers?
EDIT: I had misunderstood the mechanism of a centrifuge and it appears that it does a comparison, a massively-parallel one at that. However there are physical processes that operate on a property of the entity being sorted rather than comparing two properties. This answer covers algorithms that are of that nature.
A centrifuge applies a sorting mechanism that doesn't really work by means of comparisons between elements, but actually by a property ('centrifugal force') on each individual element in isolation.Some sorting algorithms fall into this theme, especially Radix Sort. When this sorting algorithm is parallelized it should approach the example of a centrifuge.
Some other non-comparative sorting algorithms are Bucket sort and Counting Sort. You may find that Bucket sort also fits into the general idea of a centrifuge (the radius could correspond to a bin).
Another so-called 'sorting algorithm' where each element is considered in isolation is the Sleep Sort. Here time rather than the centrifugal force acts as the magnitude used for sorting.
Computational complexity is always defined with respect to some computational model. For example, an algorithm that's O(n) on a typical computer might be O(2n) if implemented in Brainfuck.
The centrifuge computational model has some interesting properties; for example:
it supports arbitrary parallelism; no matter how many particles are in the solution, they can all be sorted simultaneously.
it doesn't give a strict linear sort of particles by mass, but rather a very close (low-energy) approximation.
it's not feasible to examine the individual particles in the result.
it's not possible to sort particles by different properties; only mass is supported.
Given that we don't have the ability to implement something like this in general-purpose computing hardware, the model may not have practical relevance; but it can still be worth examining, to see if there's anything to be learned from it. Nondeterministic algorithms and quantum algorithms have both been active areas of research, for example, even though neither is actually implementable today.
The trick is there, that you only have a probability of sorting your list using a centrifuge. As with other real-world sorts [citation needed], you can change the probability that your have sorted your list, but never be certain without checking all the values (atoms).
Consider the question: "How long should you run your centrifuge for?"
If you only ran it for a picosecond, your sample may be less sorted than the initial state.. or if you ran it for a few days, it may be completely sorted. However, you wouldn't know without actually checking the contents.
A real world example of a computer based "ordering" would be autonomous drones that cooperatively work with each other, known as "drone swarms". The drones act and communicate both as individuals and as a group, and can track multiple targets. The drones collectively decide which drones will follow which targets and the obvious need to avoid collisions between drones. The early versions of this were drones that moved through way points while staying in formation, but the formation could change.
For a "sort", the drones could be programmed to form a line or pattern in a specific order, initially released in any permutation or shape, and collectively and in parallel they would quickly form the ordered line or pattern.
Getting back to a computer based sort, one issue is that there's one main memory bus, and there's no way for a large number of objects to move about in memory in parallel.
know the position of each element
In the case of a tape sort, the position of each element (record) is only "known" to the "tape", not to the computer. A tape based sort only needs to work with two elements at a time, and a way to denote run boundaries on a tape (file mark, or a record of different size).
IMHO, people overthink log(n). O(nlog(n)) IS practically O(n). And you need O(n) just to read the data.
Many algorithms such as quicksort do provide a very fast way to sort elements. You could implement variations of quicksort that would be very fast in practice.
Inherently all physical systems are infinitely parallel. You might have a buttload of atoms in a grain of sand, nature has enough computational power to figure out where each electron in each atom should be. So if you had enough computational resources (O(n) processors) you could sort n numbers in log(n) time.
From comments:
Given a physical processor that has k number of elements, it can achieve a parallelness of at most O(k). If you process n numbers arbitrarily, it would still process it at a rate related to k. Also, you could formulate this problem physically. You could create n steel balls with weights proportional to the number you want to encode, which could be solved by a centrifuge in a theory. But here the amount of atoms you are using is proportional to n. Whereas in a standard case you have a limited number of atoms in a processor.
Another way to think about this is, say you have a small processor attached to each number and each processor can communicate with its neighbors, you could sort all those numbers in O(log(n)) time.
I worked in an office summers after high school when I started college. I had studied in AP Computer Science, among other things, sorting and searching.
I applied this knowledge in several physical systems that I can recall:
Natural merge sort to start…
A system printed multipart forms including a file-card-sized tear off, which needed to be filed in a bank of drawers.
I started with a pile of them and sorted the pile to begin with. The first step is picking up 5 or so, few enough to be easily placed in order in your hand. Place the sorted packet down, criss-crossing each stack to keep them separate.
Then, merge each pair of stacks, producing a larger stack. Repeat until there is only one stack.
…Insertion sort to complete
It is easier to file the sorted cards, as each next one is a little farther down the same open drawer.
Radix sort
This one nobody else understood how I did it so fast, despite repeated tries to teach it.
A large box of check stubs (the size of punch cards) needs to be sorted. It looks like playing solitaire on a large table—deal out, stack up, repeat.
In general
30 years ago, I did notice what you’re asking about: the ideas transfer to physical systems quite directly because there are relative costs of comparisons and handling records, and levels of caching.
Going beyond well-understood equivalents
I recall an essay about your topic, and it brought up the spaghetti sort. You trim a length of dried noodle to indicate the key value, and label it with the record ID. This is O(n), simply processing each item once.
Then you grab the bundle and tap one end on the table. They align on the bottom edges, and they are now sorted. You can trivially take off the longest one, and repeat. The read-out is also O(n).
There are two things going on here in the “real world” that don’t correspond to algorithms. First, aligning the edges is a parallel operation. Every data item is also a processor (the laws of physics apply to it). So, in general, you scale the available processing with n, essentially dividing your classic complexity by a factor on n.
Second, how does aligning the edges accomplish a sort? The real sorting is in the read-out which lets you find the longest in one step, even though you did compare all of them to find the longest. Again, divide by a factor of n, so finding the largest is now O(1).
Another example is using analog computing: a physical model solves the problem “instantly” and the prep work is O(n). In principle the computation is scaling with the number of interacting components, not the number of prepped items. So the computation scales with n². The example I'm thinking of is a weighted multi-factor computation, which was done by drilling holes in a map, hanging weights from strings passing through the holes, and gathering all the strings on a ring.
Sorting is still O(n) total time. That it is faster than that is because of Parallelization.
You could view a centrifuge as a Bucketsort of n atoms, parallelized over n cores(each atom acts as a processor).
You can make sorting faster by parallelization but only by a constant factor because the number of processors is limited, O(n/C) is still O(n) (CPUs have usually < 10 cores and GPUs < 6000)
The centrifuge is not sorting the nodes, it applies applies a force to them then they react in parallel to it.
So if you were to implement a bubble sort where each node is moving itself in parallel up or down based on it's "density", you'd have a centrifuge implementation.
Keep in mind that in the real world you can run a very large amount of parallel tasks where in a computer you can have a maximum of real parallel tasks equals to the number of physical processing units.
In the end, you would also be limited with the access to the list of elements because it cannot be modified simultaneously by two nodes...
Would it be possible with some overhead on each node (some value or
method tacked on to each of the nodes) to 'force' the order of the
list?
When we sort using computer programs we select a property of the values being sorted. That's commonly magnitude of the number or the alphabetical order.
Something like a centrifuge, where only each element cares about its
relative position in space (in relation to other nodes)
This analogy aptly reminds me of simple bubble sort. How smaller numbers bubble up in each iteration. Like your centrifuge logic.
So to answer this, don't we actually do something of that sort in software based sorting?
First of all, you are comparing two different contexts, one is logic(computer) and the other is physics which (so far) is proven that we can model some parts of it using mathematical formulas and we as programmers can use this formulas to simulate (some parts of) physics in the logic work (e.g physics engine in game engine).
Second We have some possibilities in the computer (logic) world that is nearly impossible in physics for example we can access memory and find the exact location of each entity at each time but in physics that is a huge problem Heisenberg's uncertainty principle.
Third If you want to map centrifuges and its operation in real world, to computer world, it is like someone (The God) has given you a super-computer with all the rules of physics applied and you are doing your small sorting in it (using centrifuge) and by saying that your sorting problem was solved in o(n) you are ignoring the huge physics simulation going on in background...
Consider: is "centrifuge sort" really scaling better? Think about what happens as you scale up.
The test tubes have to get longer and longer.
The heavy stuff has to travel further and further to get to the bottom.
The moment of inertia increases, requiring more power and longer times to accelerate up to sorting speed.
It's also worth considering other problems with centrifuge sort. For example, you can only operate on a narrow size scale. A computer sorting algorithm can handle integers from 1 to 2^1024 and beyond, no sweat. Put something that weighs 2^1024 times as much as a hydrogen atom into a centrifuge and, well, that's a black hole and the galaxy has been destroyed. The algorithm failed.
Of course the real answer here is that computational complexity is relative to some computational model, as mentioned in other answer. And "centrifuge sort" doesn't make sense in the context of common computational models, such as the RAM model or the IO model or multitape Turing machines.
Another perspective is that what you're describing with the centrifuge is analogous to what's been called the "spaghetti sort" (https://en.wikipedia.org/wiki/Spaghetti_sort). Say you have a box of uncooked spaghetti rods of varying lengths. Hold them in your fist, and loosen your hand to lower them vertically so the ends are all resting on a horizontal table. Boom! They're sorted by height. O(constant) time. (Or O(n) if you include picking the rods out by height and putting them in a . . . spaghetti rack, I guess?)
You can note there that it's O(constant) in the number of pieces of spaghetti, but, due to the finite speed of sound in spaghetti, it's O(n) in the length of the longest strand. So nothing comes for free.

Comparison of the runtime of Nearest Neighbor queries on different data structures

Given n points in d-dimensional space, there are several data structures, such as Kd-Trees, Quadtrees, etc. to index the points. On these data structures it is possible to implement straight-forward algorithm for nearest neighbor queries around a given input point.
Is there a book, paper, survey, ... that compares the theoretical (mostly expected) runtime of the nearest neighbor query on different data structures?
The data I am looking at is composed of fairly small point clouds, so it can all be processed in main memory. For the sake of simplicity, I assume the data to be uniformly distributed. That is, im am not interested in real-world performance, but rather theoretical results
You let the dimension of the points undefined and you just give an approximation for the number of points. What does small means? It's relative what one person means by small.
What you search, of course, doesn't exist. Your question is pretty much this:
Question:
For a small (whatever does small means to you) dataset, of any dimension with data that follow a uniform distribution, what's the optimal data structure to use?
Answer:
There is no such data structure.
Wouldn't it be too strange to have an answer on that? A false analogy would be to put as a synonym of this question, the "Which is the optimal programming language?" question that most of the first year undergrads have. Your question is not that naive, but it's walking on the same track.
Why there is no such data structure?
Because, the dimension of the dataset is variable. This means, that you might have a dataset in 2 dimensions, but it could also mean that you have a dataset in 1000 dimensions, or even better a dataset in 1000 dimensions, with an intrinsic dimension that is much less than 1000. Think about it, could one propose a data structure that would behave equally good for the three datasets I mentioned it? I doubt it.
In fact, there are some data structures that behave really nicely in low dimensions (quadtrees and KD-trees for example), while others are doing much better in higher dimensions (RKD-tree forest for instance).
Moreover, the algorithms and the expectations used for Nearest Neighbour search are heavily dependent on the dimension of the dataset (as well as the size of the dataset and the nature of the queries (for example a query that is too far from the dataset or equidistant from the points of the dataset will probably result in a slow search performance)).
In lower dimensions, one would perform a k-Nearest Neighbour(k-NN) search. In higher dimensions, it would be more wise to perform k-Approximate NN search. In this case, we follow the following trade-off:
Speed VS accuracy
What happens is that we achieve faster execution of the program, by sacrificing the correctness of our result. In other words, our search routine will be relatively fast, but it may (the possibility of this depends on many parameters, such as the nature of your problem and the library you are using) not return the true NN, but an approximation of the exact NN. For example it might not find the exact NN, but the third NN to the query point. You could also check the approximate-nn-searching wiki tag.
Why not always searching for the exact NN? Because of the curse of dimensionality, which results in the solutions provided in the lower dimensions to behave as good as the brute force would do (search all the points in the dataset for every query).
You see my answer already got big, so I should stop here. Your question is too broad, but interesting, I must admit. :)
In conclusion, which would be the optimal data structure (and algorithm) to use depends on your problem. The size of the dataset you are handling, the dimension and the intrinsic dimension of the points play a key role. The number and the nature of the queries also play an important role.
For nearest neighbor searches of potentially non-uniform point data I think a kd-tree will give you the best performance in general. As far as broad overviews and theoretical cost analysis I think Wikipedia is an OK place to start, but keep in mind it does not cover much real-world optimization:
http://en.wikipedia.org/wiki/Nearest_neighbor_search
http://en.wikipedia.org/wiki/Space_partitioning
Theoretical performance is one thing but real world performance is something else entirely. Real world performance depends as much on the details of the data structure implementation as it does on the type of data structure. For example, a pointer-less (compact array) implementation can be many times faster than a pointer-based implementation because of improved cache coherence and faster data allocation. Wider branching may be slower in theory but faster in practice if you leverage SIMD to test several branches simultaneously.
Also the exact nature of your point data can have a big impact on performance. Uniform distributions are less demanding and can be handled quickly with simpler data structures. Non-uniform distributions require more care. (Kd-trees work well for both uniform and non-uniform data.) Also, if your data is too large to process in-core then you will need to take an entirely different approach compared to smaller data sets.

Best algorithm to minimize an output value by varying input data

I have an incoming stream of data and a set of transformations, which can be applied to the stream in various combinations to get a numerical output value. I need to find which subset of the transformation minimizes the number.
The data is an ordered list of numbers with metadata attached to each one.
The transformations are quasi-linear: they are technically executable code in a Turing-complete language, but they are known to belong to a restricted subset which always halts, and they transform the input number to output number with arithmetic operations, whose flow is dependent on metadata attached. Moreover, the operations are almost all the time linear (but they are not bound to be—meaning this may be a place for optimization, but not restriction).
Basically, a brute-force approach involving 2n steps (where n is a number of transformations) would work, but it is woefully ineffective, and I'm almost absolutely sure this would not scale in production. Are there any algorithms to solve this task faster?
If almost all operations are linear, can't you use linear programming as heuristics?
And maybe in between do checks whether some transformations are particularly slow, in which case you can still switch to brute force.
Do you need to find the optimal output?

Algorithm to optimize parameters based on imprecise fitness function

I am looking for a general algorithm to help in situations with similar constraints as this example :
I am thinking of a system where images are constructed based on a set of operations. Each operation has a set of parameters. The total "gene" of the image is then the sequential application of the operations with the corresponding parameters. The finished image is then given a vote by one or more real humans according to how "beautiful" it is.
The question is what kind of algorithm would be able to do better than simply random search if you want to find the most beautiful image? (and hopefully improve the confidence over time as votes tick in and improve the fitness function)
Given that the operations will probably be correlated, it should be possible to do better than random search. So for example operation A with parameters a1 and a2 followed by B with parameters b1 could generally be vastly superior to B followed by A. The order of operations will matter.
I have tried googling for research papers on random walk and markov chains as that is my best guesses about where to look, but so far have found no scenarios similar enough. I would really appreciate even just a hint of where to look for such an algorithm.
I think what you are looking for fall in a broad research area called metaheuristics (which include many non-linear optimization algorithms such as genetic algorithms, simulated annealing or tabu search).
Then if your raw fitness function is just giving a statistical value somehow approximating a real (but unknown) fitness function, you can probably still use most metaheuristics by (somehow) smoothing your fitness function (averaging results would do that).
Do you mean the Metropolis algorithm?
This approach uses a random walk, weighted by the fitness function. It is useful for locating local extrema in complicated fitness landscapes, but is generally slower than deterministic approaches where those will work.
You're pretty much describing a genetic algorithm in which the sequence of operations represents the "gene" ("chromosome" would be a better term for this, where the parameter[s] passed to each operation represents a single "gene", and multiple genes make up a chromosome), the image produced represents the phenotypic expression of the gene, and the votes from the real humans represent the fitness function.
If I understand your question, you're looking for an alternative algorithm of some sort that will evaluate the operations and produce a "beauty" score similar to what the real humans produce. Good luck with that - I don't think there really is any such thing, and I'm not surprised that you didn't find anything. Human brains, and correspondingly human evaluations of aesthetics, are much too staggeringly complex to be reducible to a simplistic algorithm.
Interestingly, your question seems to encapsulate the bias against using real human responses as the fitness function in genetic-algorithm-based software. This is a subject of relevance to me, since my namesake software is specifically designed to use human responses (or "votes") to evaluate music produced via a genetic process.
Simple Markov Chain
Markov chains, which you mention, aren't a bad way to go. A Markov chain is just a state machine, represented as a graph with edge weights which are transition probabilities. In your case, each of your operations is a node in the graph, and the edges between the nodes represent allowable sequences of operations. Since order matters, your edges are directed. You then need three components:
A generator function to construct the graph of allowed transitions (which operations are allowed to follow one another). If any operation is allowed to follow any other, then this is easy to write: all nodes are connected, and your graph is said to be complete. You can initially set all the edge weights to 1.
A function to traverse the graph, crossing N nodes, where N is your 'gene-length'. At each node, your choice is made randomly, but proportionally weighted by the values of the edges (so better edges have a higher chance of being selected).
A weighting update function which can be used to adjust the weightings of the edges when you get feedback about an image. For example, a simple update function might be to give each edge involved in a 'pleasing' image a positive vote each time that image is nominated by a human. The weighting of each edge is then normalised, with the currently highest voted edge set to 1, and all the others correspondingly reduced.
This graph is then a simple learning network which will be refined by subsequent voting. Over time as votes accumulate, successive traversals will tend to favour the more highly rated sequences of operations, but will still occasionally explore other possibilities.
Advantages
The main advantage of this approach is that it's easy to understand and code, and makes very few assumptions about the problem space. This is good news if you don't know much about the search space (e.g. which sequences of operations are likely to be favourable).
It's also easy to analyse and debug - you can inspect the weightings at any time and very easily calculate things like the top 10 best sequences known so far, etc. This is a big advantage - other approaches are typically much harder to investigate ("why did it do that?") because of their increased abstraction. Although very efficient, you can easily melt your brain trying to follow and debug the convergence steps of a simplex crawler!
Even if you implement a more sophisticated production algorithm, having a simple baseline algorithm is crucial for sanity checking and efficiency comparisons. It's also easy to tinker with, by messing with the update function. For example, an even more baseline approach is pure random walk, which is just a null weighting function (no weighting updates) - whatever algorithm you produce should perform significantly better than this if its existence is to be justified.
This idea of baselining is very important if you want to evaluate the quality of your algorithm's output empirically. In climate modelling, for example, a simple test is "does my fancy simulation do any better at predicting the weather than one where I simply predict today's weather will be the same as yesterday's?" Since weather is often correlated on a timescale of several days, this baseline can give surprisingly good predictions!
Limitations
One disadvantage of the approach is that it is slow to converge. A more agressive choice of update function will push promising results faster (for example, weighting new results according to a power law, rather than the simple linear normalisation), at the cost of giving alternatives less credence.
This is equivalent to fiddling with the mutation rate and gene pool size in a genetic algorithm, or the cooling rate of a simulated annealing approach. The tradeoff between 'climbing hills or exploring the landscape' is an inescapable "twiddly knob" (free parameter) which all search algorithms must deal with, either directly or indirectly. You are trying to find the highest point in some fitness search space. Your algorithm is trying to do that in less tries than random inspection, by looking at the shape of the space and trying to infer something about it. If you think you're going up a hill, you can take a guess and jump further. But if it turns out to be a small hill in a bumpy landscape, then you've just missed the peak entirely.
Also note that since your fitness function is based on human responses, you are limited to a relatively small number of iterations regardless of your choice of algorithmic approach. For example, you would see the same issue with a genetic algorithm approach (fitness function limits the number of individuals and generations) or a neural network (limited training set).
A final potential limitation is that if your "gene-lengths" are long, there are many nodes, and many transitions are allowed, then the size of the graph will become prohibitive, and the algorithm impractical.

Resources