Sorting algorithm speeds

I have made a sorting algorithm that works like no other I know of, and I would like to have it tested. Has anyone come across a site where you can compare how your algorithm stands beside already known ones?
It would need a large set of data and edge cases thrown at it, with timing results.

Related

What is the difference between Genetic Algorithm and Iterated Local Search Algorithm?

I'm basically trying to use a Genetic Algorithm or an Iterated Local Search Algorithm to get an optimal solution for a problem. Can someone please explain the basic difference between these two algorithms, and are there situations where one of them is better than the other?
Let me start with the second question. I believe there is no way to determine the better algorithm for a given problem without trials and tests. The behavior of an algorithm depends heavily on the problem's properties. If we are talking about complex problems with hundreds or thousands of variables, it's just too difficult to predict anything. I'm leaving aside engineering intuition, deep problem understanding, previous experience, and so on, since they are not really measurable.
The main difference between global and local search is quite straightforward: local search considers just one or a few possible solutions at a single point in time and tries to improve them with small modifications. Thus, at each iteration it considers only a small portion of the search space (the local neighborhood). Global search tries to take the whole problem with all its parameters into account at the same time. For example, PSO samples a huge number of candidate solutions and tries to move all of them toward the global optimum using a simple update formula.
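To make the local-search side concrete, here is a minimal iterated local search sketch in Python; the bit-flip objective, the neighbor and perturbation functions, and the iteration count are placeholders I chose for illustration, not anything from the question.

    import random

    def iterated_local_search(objective, initial, neighbors, perturb, iterations=100):
        """Minimal iterated local search: hill-climb, perturb, repeat, keep the best."""
        best = current = initial
        for _ in range(iterations):
            # Local search: move to the first improving neighbor until none is left.
            improved = True
            while improved:
                improved = False
                for candidate in neighbors(current):
                    if objective(candidate) < objective(current):
                        current = candidate
                        improved = True
                        break  # restart from the new solution's neighborhood
            if objective(current) < objective(best):
                best = current
            # Perturb the local optimum to escape its neighborhood.
            current = perturb(best)
        return best

    # Toy usage: minimize the number of 1-bits in a random bit string.
    def neighbors(bits):
        return [bits[:i] + (1 - bits[i],) + bits[i + 1:] for i in range(len(bits))]

    def perturb(bits):
        i = random.randrange(len(bits))
        return bits[:i] + (1 - bits[i],) + bits[i + 1:]

    start = tuple(random.randint(0, 1) for _ in range(16))
    print(iterated_local_search(sum, start, neighbors, perturb))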

Is there a way to make StoogeSort more curve-like?

I'm currently studying analysis of algorithms and their respective runtimes, and I came across a sorting algorithm called Stooge sort, and the weird way it behaves really caught my attention. I'm trying to determine the runtime using a program created by a professor of mine, but the number of data points I have is very small, because the runtime grows very quickly and I can't let my computer execute a program for an entire day.
My question is: is there a way to make the algorithm behave more like a curve without changing its complexity? So far I've calculated 5 points that would be useful (these points are the first real number after the Stooge sort "ladder" graph changes, referring to the size of the array getting sorted), but that's not as much as I need.
I'm using the algorithm provided on the Wikipedia page for Stooge sort.
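For reference, a direct Python transcription of that textbook Stooge sort (this is just the standard algorithm, not the professor's measurement program):

    def stooge_sort(arr, lo=0, hi=None):
        """Stooge sort: runtime O(n^(log 3 / log 1.5)) ~ O(n^2.71)."""
        if hi is None:
            hi = len(arr) - 1
        if arr[lo] > arr[hi]:
            arr[lo], arr[hi] = arr[hi], arr[lo]
        if hi - lo + 1 > 2:
            t = (hi - lo + 1) // 3
            stooge_sort(arr, lo, hi - t)   # sort the first two thirds
            stooge_sort(arr, lo + t, hi)   # sort the last two thirds
            stooge_sort(arr, lo, hi - t)   # sort the first two thirds again
        return arr

    print(stooge_sort([5, 1, 4, 2, 8, 3]))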
Five points is too little data to say it doesn't behave like a curve.
In fact, you can find a pretty accurate curve fit for your data:
source: http://mycurvefit.com/index.html?action=openshare&id=7b237893-c52c-49db-bcf6-e29ccf391b7c
But, again, there is very little data to conclude anything.
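If it helps, here is one hedged way to fit a power law to a handful of (array size, runtime) points with SciPy; the sample measurements below are made up for illustration, and for Stooge sort the exponent should come out near log 3 / log 1.5 ≈ 2.71.

    import numpy as np
    from scipy.optimize import curve_fit

    def power_law(n, a, b):
        # Model runtime as a * n**b; for Stooge sort b should land near 2.71.
        return a * np.power(n, b)

    # Placeholder measurements (array size, seconds). Replace with your own points.
    sizes = np.array([100, 200, 400, 800, 1600], dtype=float)
    times = np.array([0.02, 0.13, 0.85, 5.6, 37.0])

    (a, b), _ = curve_fit(power_law, sizes, times, p0=(1e-6, 2.7))
    print(f"fitted runtime ~ {a:.3g} * n^{b:.2f}")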

Custom sorting algorithm needed

I have a need for an unusual sorting algorithm which would be massively useful to a lot of people, but I would prefer to leave the specific application vague as I have not found particularly good solutions in my research and was wondering if folks here could bring new ideas to the table. This is a real-world sort, so it has some restrictions which are different from many algorithms. Here are the requirements.
The lists to be sorted do not have a uniform number of elements.
The values by which elements are sorted are not directly observable.
The comparison operation of two elements is expensive.
You may run as many comparison operations in parallel as you wish, with no increase in expense.
Each element may only participate in one comparison operation at a time.
The result of a comparison operation only gives greater than, less than, or equal.
There is a probability that the comparison operation returns an incorrect result, and that probability varies with the difference between the hidden values of the elements.
We have no indication when the comparison gives an incorrect value.
We may assume that the dynamic error rate of comparison is normally distributed.
Elements might intermittently be unavailable for comparison.
So, shot in the dark, hoping for somebody with an itch. The general gist is that you want to find the best way to set up a set of parallel comparisons to reveal as much information about the proper sort order as possible. A good answer would be able to describe the probability of error after n groups of actions. I'm sure some folks will be able to figure out what is being sorted based on this information, but for those who can't, believe me, there are many, many people who would benefit from this algorithm.
I'd look at comparator networks. One of the assumptions there is the ability to do multiple comparisons in parallel, and the usual goal is to minimize the number of "layers" of comparisons. A so-called AKS network can achieve O(log n) depth (and hence parallel time) this way.
But they work under the assumption that all comparisons are done correctly. I guess that handling errors could be done afterwards, by adding an extra layer of comparators that compares every two consecutive items after the main sort; see the sketch below.
Starting point: Wikipedia
Anyway, this looks more like a scientific research topic.
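To make the layered structure concrete, here is a toy Python sketch (my own illustration, not the AKS construction): a network is a list of layers, each layer holds comparators on disjoint indices so it could run in parallel, and the extra consecutive-item check suggested above is split into two parallel-safe layers.

    # A hand-written 4-input sorting network; comparators within a layer touch
    # disjoint indices, so each layer could be executed in parallel.
    FOUR_INPUT_NETWORK = [
        [(0, 1), (2, 3)],
        [(0, 2), (1, 3)],
        [(1, 2)],
    ]

    def apply_network(values, network):
        values = list(values)
        for layer in network:
            for i, j in layer:  # conceptually simultaneous within a layer
                if values[i] > values[j]:
                    values[i], values[j] = values[j], values[i]
        return values

    def verification_layers(n):
        # The extra consecutive-item pass mentioned above, split into two layers
        # so that no element takes part in two comparisons at the same time.
        return [
            [(i, i + 1) for i in range(0, n - 1, 2)],
            [(i, i + 1) for i in range(1, n - 1, 2)],
        ]

    data = [3, 1, 4, 2]
    once = apply_network(data, FOUR_INPUT_NETWORK)
    print(apply_network(once, verification_layers(len(once))))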

Detect duplicated/similar text among large datasets?

I have a large database with thousands of records. Every time a user posts his information, I need to know whether there is already a same or similar record. Are there any algorithms or open source implementations that solve this problem?
The records are in Chinese, and 'similar' means the records have mostly identical content; perhaps 80%-100% might be the same. Each record is not too big, about 2k-6k bytes.
http://d3s.mff.cuni.cz/~holub/sw/shash/
http://matpalm.com/resemblance/simhash/
This answer is of a very high complexity class (worst case it's quintic; expected case it's quartic to verify your database the first time, then quartic/cubic to add a record), so it doesn't scale well. Unfortunately there isn't a much better answer that I can think of right now.
The algorithm is called the Ratcliff/Obershelp algorithm, and it's implemented in Python's difflib. The algorithm itself is cubic time in the worst case and quadratic in the expected case. Then you have to do that for each possible pair of records, which is quadratic. When adding a record, of course, this is only linear.
EDIT: Sorry, I misread the documentation; difflib is only quadratic, rather than cubic. Use it rather than the other algorithm.
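Since difflib was mentioned, here is a minimal sketch of using its SequenceMatcher to flag near-duplicates; the 0.8 threshold simply mirrors the 80% figure above, and the sample records are invented.

    from difflib import SequenceMatcher

    def is_near_duplicate(new_text, existing_texts, threshold=0.8):
        """Return the first stored text whose similarity ratio exceeds the threshold."""
        for old_text in existing_texts:
            matcher = SequenceMatcher(None, new_text, old_text)
            # quick_ratio() is a cheap upper bound; only compute the real ratio
            # if the quick estimate already clears the threshold.
            if matcher.quick_ratio() >= threshold and matcher.ratio() >= threshold:
                return old_text
        return None

    records = ["今天天气很好，我们去公园散步。", "今天天气不好，我们待在家里。"]
    print(is_near_duplicate("今天天气很好，我们去公园散步吧。", records))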
Look at shngle-min-hash techniques. Here is a presentation that could help you.
One approach I have used to do something similar is to construct a search index in the usual way based on word statistics, and then use the new item as if it were a search against that index; if the score for the top item in the search is too high, then the new item is too similar. No doubt some of the standard text search libraries could be used for this, although if it is only a few thousand records it is pretty trivial to build your own, as in the rough sketch below.
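A very small 'build your own' version of that index, using character bigram counts and cosine similarity; the bigram choice and the 0.8 cutoff are my assumptions for the sketch, not part of the original answer.

    from collections import Counter
    from math import sqrt

    def bigram_counts(text):
        # Character bigrams avoid needing a Chinese word segmenter.
        return Counter(text[i:i + 2] for i in range(len(text) - 1))

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    class SimilarityIndex:
        def __init__(self):
            self.docs = []  # list of (text, bigram counts)

        def add(self, text):
            self.docs.append((text, bigram_counts(text)))

        def most_similar(self, text):
            query = bigram_counts(text)
            return max(((cosine(query, counts), stored) for stored, counts in self.docs),
                       default=(0.0, None))

    index = SimilarityIndex()
    index.add("今天天气很好，我们去公园散步。")
    score, match = index.most_similar("今天天气很好，我们一起去公园散步。")
    print(score, match, score > 0.8)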

String search algorithms

I have a project on benchmarking string matching algorithms, and I would like to know if there is a standard for every algorithm so that I can get fair results from my experimentation. I am planning to use Java's System.nanoTime to measure the running time of each algorithm. Any comments or reactions regarding my problem are very much appreciated. Thanks!
I am not entirely sure what you're asking. However, I am guessing you are asking how to get the most realistic results. You need to run your algorithm hundreds, or even thousands of iterations to get an average. It is also very important to turn off any caching that your language may do, and don't reuse objects, unless it is part of your algorithm.
I am not entirely sure what you're asking. However, another interpretation of what you are asking can be answered by trying to work out how a given algorithm performs as you increase the size of the problem. Using raw time to compare algorithms at a given string size does not necessarily allow for accurate comparison. Instead, you could try each algorithm with different string sizes and see how the algorithm behaves as string size varies.
And Mark's advice is good too. So you are running repeated trials for many different string lengths to get a picture of how one algorithm works, then repeating that for the next algorithm.
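To make that methodology concrete, here is a hedged Python sketch (repeated trials over several input sizes, averaged); the same structure carries over to Java with System.nanoTime. The naive_search placeholder, the pattern, and the sizes stand in for whatever algorithms and inputs you are actually benchmarking.

    import random
    import string
    import time

    def naive_search(text, pattern):
        # Placeholder algorithm; substitute each string matching algorithm under test.
        return next((i for i in range(len(text) - len(pattern) + 1)
                     if text[i:i + len(pattern)] == pattern), -1)

    def benchmark(algorithm, text_size, trials=1000, pattern="needle"):
        total = 0.0
        for _ in range(trials):
            text = "".join(random.choices(string.ascii_lowercase, k=text_size))
            start = time.perf_counter()          # System.nanoTime() in the Java version
            algorithm(text, pattern)
            total += time.perf_counter() - start
        return total / trials

    for size in (1_000, 10_000, 100_000):
        print(size, benchmark(naive_search, size))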
Again, it's not clear what you're asking, but here's another thought in addition to what Tony and Mark said:
Be very careful about testing only "real" input or only "random" input. Some algorithms are tuned to do well on typical input (searching for a word in English text), while others are tuned for working well on pathologically hard cases. You'll need a huge mix of possible inputs of all different types and sizes to do a truly good benchmark.
