I'm trying to design a data structure that supports efficient insertion, removal and search.
The catch is that the search function takes a similar object as input, with 2 attributes, and I need to find an object in my dataset such that both its 1st and 2nd attributes are equal to or bigger than the corresponding attributes of the search input.
So for example, if I send as input, the following object:
object[a] = 9; object[b] = 14
Then a valid found object could be:
object[a] = 9; object[b] = 79
but not:
object[a] = 8; object[b] = 28
Is there any way to store the data such that the search complexity is better than linear?
EDIT:
I forgot to include this in my original question: the search has to return the smallest matching object in the dataset, measured by the product of the 2 attributes.
That is, object[a]*object[b] of the returned object must be no larger than that of any other object in the dataset that also fits the original condition.
You may want to use a k-d tree, which is typically used to index k-dimensional points. A search like the one you describe takes O(log n) on average.
This post may help when the attributes are hierarchically linked, like name and forename. For points in a 2D space, a k-d tree is better suited, as explained by fajarkoe.
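To make the k-d tree suggestion concrete, here is a minimal sketch, assuming the objects are (a, b) tuples and the tree is built once from a static list (insertion, removal and rebalancing are not shown). The names Node, build and best_dominating are made up for this illustration. The search visits only subtrees that can still contain a point with both attributes >= the query, and keeps the one with the smallest product.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[int, int]


@dataclass
class Node:
    point: Point
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Build a 2-d tree by splitting on attribute a at even depths, b at odd depths."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid],
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))


def best_dominating(node: Optional[Node], query: Point,
                    depth: int = 0, best: Optional[Point] = None) -> Optional[Point]:
    """Return the point with a >= query a and b >= query b that has the smallest a*b."""
    if node is None:
        return best
    axis = depth % 2
    p = node.point
    if p[0] >= query[0] and p[1] >= query[1]:
        if best is None or p[0] * p[1] < best[0] * best[1]:
            best = p
    # The right subtree holds values >= p[axis], so it may always contain matches.
    best = best_dominating(node.right, query, depth + 1, best)
    # The left subtree holds values <= p[axis]; skip it if the query is already larger.
    if query[axis] <= p[axis]:
        best = best_dominating(node.left, query, depth + 1, best)
    return best


tree = build([(9, 79), (8, 28), (10, 20), (12, 15), (30, 1)])
print(best_dominating(tree, (9, 14)))
```

Pruning here only uses the "equal or bigger" condition; a further cut based on the best product found so far is possible but left out to keep the sketch short.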
class Person {
string name;
string forename;
... other non key attributes
}
You have to write a comparator function which takes two objects of the class (here Person) as input and returns -1, 0 or +1 for the <, = and > cases.
Libraries like glibc, with qsort() and bsearch(), or higher-level languages like Java, with its java.util.Comparator interface and java.util.SortedMap (implemented by java.util.TreeMap) as containers, use comparators.
Other languages use equivalent concepts.
The comparator may be written following your spec like this:
int compare( Person left, Person right ) {
if( left.name < right.name ) {
return -1;
}
if( left.name > right.name ) {
return +1;
}
if( left.forename < right.forename ) {
return -1;
}
if( left.forename > right.forename ) {
return +1;
}
return 0;
}
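For comparison, the same name-then-forename comparator can be sketched in Python, where functools.cmp_to_key adapts a -1/0/+1 function for use with sorted(). The Person dataclass and the sample names are assumptions made up for this illustration.

```python
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass
class Person:
    name: str
    forename: str


def compare(left: Person, right: Person) -> int:
    """Order by name first, then by forename; return -1, 0 or +1."""
    if left.name != right.name:
        return -1 if left.name < right.name else 1
    if left.forename != right.forename:
        return -1 if left.forename < right.forename else 1
    return 0


people = [Person("Smith", "Zoe"), Person("Jones", "Bob"), Person("Smith", "Al")]
ordered = sorted(people, key=cmp_to_key(compare))
for person in ordered:
    print(person.name, person.forename)
```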
Complexity of qsort()
Quicksort, or partition-exchange sort, is a sorting algorithm
developed by Tony Hoare that, on average, makes O(n log n) comparisons
to sort n items. In the worst case, it makes O(n^2) comparisons, though
this behavior is rare. Quicksort is often faster in practice than
other O(n log n) algorithms. Additionally, quicksort's sequential
and localized memory references work well with a cache. Quicksort is a
comparison sort and, in efficient implementations, is not a stable
sort. Quicksort can be implemented with an in-place partitioning
algorithm, so the entire sort can be done with only O(log n)
additional space used by the stack during the recursion.
Complexity of bsearch()
If the list to be searched contains more than a few items (a dozen,
say) a binary search will require far fewer comparisons than a linear
search, but it imposes the requirement that the list be sorted.
Similarly, a hash search can be faster than a binary search but
imposes still greater requirements. If the contents of the array are
modified between searches, maintaining these requirements may even
take more time than the searches. And if it is known that some items
will be searched for much more often than others, and it can be
arranged so that these items are at the start of the list, then a
linear search may be the best.
I was reading this question. The selected answer contains the following two algorithms. I couldn't understand why the first one's time complexity is O(ln(n)). In the worst case, if the array doesn't contain any duplicates, it will loop n times, and so will the second one. Am I wrong, or am I missing something? Thank you.
1) A faster (in the limit) way
Here's a hash based approach. You gotta pay for the autoboxing, but it's O(ln(n)) instead of O(n^2). An enterprising soul would go find a primitive int-based hash set (Apache or Google Collections has such a thing, methinks.)
boolean duplicates(final int[] zipcodelist)
{
Set<Integer> lump = new HashSet<Integer>();
for (int i : zipcodelist)
{
if (lump.contains(i)) return true;
lump.add(i);
}
return false;
}
2)Bow to HuyLe
See HuyLe's answer for a more or less O(n) solution, which I think needs a couple of add'l steps:
static boolean duplicates(final int[] zipcodelist) {
    final int MAXZIP = 99999;
    boolean[] bitmap = new boolean[MAXZIP+1];
    java.util.Arrays.fill(bitmap, false);
    for (int item : zipcodelist) {
        if (!bitmap[item]) bitmap[item] = true;
        else return true;
    }
    return false;
}
The first solution should have expected complexity of O(n), since the whole zip code list must be traversed, and processing each zip code is O(1) expected time complexity.
Even taking into consideration that insertion into a HashMap may trigger a re-hash, the amortized complexity is still O(1). This is a bit of a non sequitur, since there may be no relation between Java's HashMap and the assumption in the link, but it is there to show that it is possible.
From HashSet documentation:
This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets.
It's the same for the second solution, which is correctly analyzed: O(n).
(Just an off-topic note: a BitSet can be faster than a boolean array, as seen in the original post, since 8 flags are packed into 1 byte, which uses less memory.)
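As a rough illustration of why both approaches are linear, here is a Python sketch of the same two ideas. Each element is visited once and handled in expected O(1) (set) or O(1) (array index) time. The function names and the max_zip default are assumptions mirroring the Java code above.

```python
def has_duplicates(zipcodes):
    """Hash-based check: one pass, expected O(1) work per element."""
    seen = set()
    for code in zipcodes:
        if code in seen:
            return True
        seen.add(code)
    return False


def has_duplicates_bitmap(zipcodes, max_zip=99999):
    """Bitmap-based check: assumes every code is an int in 0..max_zip."""
    seen = bytearray(max_zip + 1)  # one flag per possible zip code
    for code in zipcodes:
        if seen[code]:
            return True
        seen[code] = 1
    return False


print(has_duplicates([90210, 10001, 90210]))
print(has_duplicates_bitmap([90210, 10001, 60601]))
```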
I have an assignment to create an algorithm that finds duplicates in an array of number values, but it does not say which kind of numbers, integers or floats. I have written the following pseudocode:
FindingDuplicateAlgorithm(A) // A is the array
mergeSort(A);
for int i <- 0 to i<A.length
if A[i] == A[i+1]
i++
return A[i]
else
i++
Have I created an efficient algorithm?
I think there is a problem in my algorithm: it returns duplicate numbers several times. For example, if the array includes 2 at several indexes, I will have ...2, 2,... in the output. How can I change it to return each duplicate only once?
I think it is a good algorithm for integers, but does it work well for float numbers too?
To handle duplicates, you can do the following:
if A[i] == A[i+1]:
    result.append(A[i]) # collect found duplicates in a list
    while A[i] == A[i+1]: # skip the entire range of duplicates
        i++ # until a new value is found
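Spelled out as runnable Python, the skip-the-range idea might look like the sketch below. It assumes a sorted copy is acceptable (the built-in sorted() stands in for mergeSort) and adds the bounds checks the pseudocode leaves out; the function name is made up. For floats it relies on exact equality, so values that differ only by rounding error are not treated as duplicates.

```python
def duplicates_once(values):
    """Return each duplicated value exactly once, in ascending order."""
    a = sorted(values)  # O(n log n); stands in for mergeSort(A)
    result = []
    i = 0
    while i < len(a) - 1:
        if a[i] == a[i + 1]:
            result.append(a[i])  # collect the duplicate once
            # skip the entire run of equal values
            while i < len(a) - 1 and a[i] == a[i + 1]:
                i += 1
        i += 1
    return result


print(duplicates_once([5, 2, 3, 2, 5, 2]))
```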
Do you want to find Duplicates in Java?
You may use a HashSet.
HashSet<Object> h = new HashSet<Object>();
for (Object a : A) {
    boolean b = h.add(a);
    boolean duplicate = !b;
    if (duplicate) {
        // do something with a
    }
}
The return-Value of add() is defined as:
true if the set did not already
contain the specified element.
EDIT:
I know HashSet is optimized for insert and contains operations, but I'm not sure if it's fast enough for your concerns.
EDIT2:
I've seen you recently added the homework tag. I would not recommend my answer if it's homework, because it may be too "high-level" for an algorithm lesson.
http://download.oracle.com/javase/1.4.2/docs/api/java/util/HashSet.html#add%28java.lang.Object%29
Your answer seems pretty good. First sorting and then simply checking neighboring values gives you O(n log(n)) complexity, which is quite efficient.
Merge sort is O(n log(n)) while checking neighboring values is simply O(n).
One thing though (as mentioned in one of the comments): you are going to read past the end of the array with your pseudocode. The loop should be (in Java):
for (int i = 0; i < array.length - 1; i++) {
...
}
Then also, if you actually want to display which numbers (and or indexes) are the duplicates, you will need to store them in a separate list.
I'm not sure what language you need to write the algorithm in, but there are some really good C++ solutions in response to my question here. Should be of use to you.
O(n) algorithm: traverse the array and try to insert each element into a hashtable/set with the number as the hash key. If you cannot insert it, then that's a duplicate.
Your algorithm contains a buffer overrun. i starts with 0, so I assume the indexes into array A are zero-based, i.e. the first element is A[0], the last is A[A.length-1]. Now i counts up to A.length-1, and in the loop body accesses A[i+1], which is out of the array for the last iteration. Or, simply put: If you're comparing each element with the next element, you can only do length-1 comparisons.
If you only want to report duplicates once, I'd use a bool variable firstDuplicate, that's set to false when you find a duplicate and true when the number is different from the next. Then you'd only report the first duplicate by only reporting the duplicate numbers if firstDuplicate is true.
// Assumes every value lies in 1..length-1 so that it can be used as an index.
// The array is modified in place: visited indexes are marked by negation.
public void printDuplicates(int[] inputArray) {
    if (inputArray == null) {
        throw new IllegalArgumentException("Input array can not be null");
    }
    int length = inputArray.length;
    if (length <= 1) {
        return; // fewer than two elements cannot contain a duplicate
    }
    for (int i = 0; i < length; i++) {
        if (inputArray[Math.abs(inputArray[i])] >= 0) {
            inputArray[Math.abs(inputArray[i])] = -inputArray[Math.abs(inputArray[i])];
        } else {
            System.out.print(Math.abs(inputArray[i]) + " ");
        }
    }
}
Two arrays are given, for example:
A = [20,4,21,6,3]
B = [748,32,48,92,23......]
Assume B is very large and can hold all the elements of array A.
Find a way to get array B into sorted order, containing all the elements of array A as well.
Design the algorithm in the most efficient way.
This sounds like merge sort algorithm. You will find tons of examples here. You can then modify to suit.
Given that your array is an integer array, you can use the radix sort algorithm to sort B in linear time, O(n). Wikipedia has a nice write-up and sample Python code.
Radix sort is linear with respect to the number of elements. While it also has a dependence on the size of the integers, you take it as a constant, just like you take the comparison operator to be constant as well. When sorting bignums, for instance, the comparison operator would also depend on the integer size!
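A minimal least-significant-digit radix sort might look like the following sketch. It assumes non-negative integers; the function name and the base of 256 are arbitrary choices for illustration. The number of passes depends on the size of the largest integer, which is the "constant" mentioned above.

```python
def radix_sort(values, base=256):
    """LSD radix sort for non-negative integers; returns a new sorted list."""
    if not values:
        return []
    out = list(values)
    max_value = max(out)
    place = 1
    # One stable bucketing pass per digit of the largest value.
    while place <= max_value:
        buckets = [[] for _ in range(base)]
        for v in out:
            buckets[(v // place) % base].append(v)
        out = [v for bucket in buckets for v in bucket]
        place *= base
    return out


print(radix_sort([748, 32, 48, 92, 23, 20, 4, 21, 6, 3]))
```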
Smells like homework. Basically, write into array B starting from the end, keeping track of the place you are reading from in both A and B.
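The write-from-the-end idea can be sketched as below. Note the assumptions, which are not stated in the question: both A and the used part of B are already sorted, and B is modelled as a Python list whose first m entries are data and whose remaining slots are free. The name merge_into is made up.

```python
def merge_into(b, m, a):
    """Merge sorted list a into b in place.

    b[:m] must be sorted and b must have at least len(a) free slots after it.
    """
    i = len(a) - 1       # next element to read from a
    j = m - 1            # next element to read from b
    k = m + len(a) - 1   # next slot to write in b
    while i >= 0:
        if j >= 0 and b[j] > a[i]:
            b[k] = b[j]
            j -= 1
        else:
            b[k] = a[i]
            i -= 1
        k -= 1
    return b


b = [23, 32, 48, 92, 748] + [None] * 5
print(merge_into(b, 5, [3, 4, 6, 20, 21]))
```

Writing from the back means no element of B is overwritten before it has been read, so no extra array is needed and the merge takes O(len(A) + len(B)) time.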
Just try it out:
Merge array A into B.
Use the quick sort algorithm.
Before adding elements from A to B, check whether doing so would exceed the size of the array; if not, merge A into B.
Then do a quick sort.
But if you just want to merge both arrays into a new array whose length is the combined length of both, here is a jump start for you; see if you can go forward from here...
public double[] combineArrays(double[] first, double[] second) {
    int totalLength = first.length + second.length;
    double[] newDoubles = new double[totalLength];
    for (int i = 0; i < first.length; i++) {
        newDoubles[i] = first[i];
    }
    for (int j = first.length; j < newDoubles.length; j++) {
        newDoubles[j] = second[j - first.length];
    }
    return newDoubles;
}
Hope this helps, Good Luck.
You can also adapt the insertion sort idea:
0) do all the necessary checks: whether the arrays are null, whether the bigger array has enough space
1) append the small array at the end of the big array
2) do a normal insertion sort, but start it at the beginning of the small array
If you use quick sort or some other "quickest" O(n*log_n) sort here, the problem is that you are not using the fact that both arrays are sorted. With insertion sort you are using the fact that array B is sorted (but not the fact that A is sorted, so maybe we should develop the idea and modify insertion sort to use that fact as well).
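A sketch of steps 1) and 2), assuming B is already sorted and is modelled as a Python list (the helper name insert_tail is made up): the elements of A are appended, and the insertion sort starts at the first appended element instead of at index 1.

```python
def insert_tail(b, m):
    """Insertion sort that assumes b[:m] is already sorted.

    Only the appended elements b[m:] are inserted into place.
    """
    for k in range(m, len(b)):
        item = b[k]
        j = k - 1
        # shift larger elements one slot to the right
        while j >= 0 and b[j] > item:
            b[j + 1] = b[j]
            j -= 1
        b[j + 1] = item
    return b


big = [23, 32, 48, 92, 748]
small = [20, 4, 21, 6, 3]
combined = big + small          # step 1: append the small array
print(insert_tail(combined, len(big)))
```

The cost is O(len(A) * len(B)) in the worst case, so this only pays off when A is small compared to B.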
What's the best algorithm for comparing two arrays to see if they have the same members?
Assume there are no duplicates, the members can be in any order, and that neither is sorted.
compare(
[a, b, c, d],
[b, a, d, c]
) ==> true
compare(
[a, b, e],
[a, b, c]
) ==> false
compare(
[a, b, c],
[a, b]
) ==> false
Obvious answers would be:
Sort both lists, then check each element to see if they're identical
Add the items from one array to a hashtable, then iterate through the other array, checking that each item is in the hash
nickf's iterative search algorithm
Which one you'd use would depend on whether you can sort the lists first, and whether you have a good hash algorithm handy.
You could load one into a hash table, keeping track of how many elements it has. Then, loop over the second one checking to see if every one of its elements is in the hash table, and counting how many elements it has. If every element in the second array is in the hash table, and the two lengths match, they are the same, otherwise they are not. This should be O(N).
To make this work in the presence of duplicates, track how many of each element has been seen. Increment while looping over the first array, and decrement while looping over the second array. During the loop over the second array, if you can't find something in the hash table, or if the counter is already at zero, they are unequal. Also compare total counts.
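The counting variant can be sketched in Python with collections.Counter, which returns 0 for keys it has not seen. The function name is made up, and the elements are assumed to be hashable.

```python
from collections import Counter


def same_members(a, b):
    """True if a and b contain the same elements with the same multiplicities."""
    if len(a) != len(b):
        return False
    counts = Counter(a)          # increment while looping over the first array
    for item in b:
        if counts[item] == 0:    # missing, or already used up
            return False
        counts[item] -= 1        # decrement while looping over the second array
    return True


print(same_members(["a", "b", "c", "d"], ["b", "a", "d", "c"]))
print(same_members(["a", "b", "e"], ["a", "b", "c"]))
```

Because the lengths are compared first and no counter can go below zero, reaching the end of the second loop implies every counter is back at zero.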
Another method that would work in the presence of duplicates is to sort both arrays and do a linear compare. This should be O(N*log(N)).
Assuming you don't want to disturb the original arrays and space is a consideration, another O(n.log(n)) solution that uses less space than sorting both arrays is:
Return FALSE if arrays differ in size
Sort the first array -- O(n.log(n)) time, extra space required is the size of one array
For each element in the 2nd array, check if it's in the sorted copy of
the first array using a binary search -- O(n.log(n)) time
If you use this approach, please use a library routine to do the binary search. Binary search is surprisingly error-prone to hand-code.
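In Python, for instance, the library routine could be the standard bisect module. The sketch below follows the three steps above and assumes, as in the question, that neither array contains duplicates; the function name is made up.

```python
import bisect


def same_members_sorted(a, b):
    """Compare two duplicate-free arrays using one sorted copy and binary search."""
    if len(a) != len(b):
        return False
    sorted_a = sorted(a)  # sorted copy; the originals are left untouched
    for item in b:
        i = bisect.bisect_left(sorted_a, item)
        if i == len(sorted_a) or sorted_a[i] != item:
            return False
    return True


print(same_members_sorted([4, 1, 3, 2], [2, 4, 1, 3]))
```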
[Added after reviewing solutions suggesting dictionary/set/hash lookups:]
In practice I'd use a hash. Several people have asserted O(1) behaviour for hashes, leading them to conclude a hash-based solution is O(N). Typical inserts/lookups may be close to O(1), and some hashing schemes guarantee worst-case O(1) lookup, but worst-case insertion -- in constructing the hash -- isn't O(1). Given any particular hashing data structure, there would be some set of inputs which would produce pathological behaviour. I suspect there exist hashing data structures with the combined worst-case to [insert-N-elements then lookup-N-elements] of O(N.log(N)) time and O(N) space.
You can use a signature (a commutative operation over the array members) to further optimize this in the case where the arrays are usually different, saving the O(n log n) or the memory allocation.
A signature can take the form of a bloom filter, or even a simple commutative operation like addition or xor.
A simple example (assuming a long as the signature size and GetHashCode as a good object identifier; if the objects are, say, ints, then their value is a better identifier; and some signatures will be larger than long):
public bool MatchArrays(object[] array1, object[] array2)
{
    if (array1.Length != array2.Length)
        return false;
    long signature1 = 0;
    long signature2 = 0;
    for (int i = 0; i < array1.Length; i++) {
        signature1 = CommutativeOperation(signature1, array1[i].GetHashCode());
        signature2 = CommutativeOperation(signature2, array2[i].GetHashCode());
    }
    // different signatures prove the arrays differ; equal ones prove nothing
    if (signature1 != signature2)
        return false;
    return MatchArraysTheLongWay(array1, array2);
}
where (using an addition operation; use a different commutative operation if desired, e.g. bloom filters)
public long CommutativeOperation(long oldValue, long newElement) {
return oldValue + newElement;
}
This can be done in different ways:
1 - Brute force: for each element in array1, check that the element exists in array2. Note that this requires tracking the position/index so that duplicates are handled properly. This takes O(n^2) with rather complicated code; don't even think of it...
2 - Sort both lists, then check each element to see if they're identical. O(n log n) for sorting and O(n) for checking, so O(n log n) overall. The sort can be done in place if messing up the arrays is not a problem; if it is, you need 2n extra memory for the sorted copies.
3 - Add the items of one array, with counts, to a hashtable, then iterate through the other array, checking that each item is in the hashtable; if it is, decrement its count, and remove it from the hashtable once the count reaches zero. O(n) to create the hashtable and O(n) to check the other array's items against it, so O(n). This introduces a hashtable with memory for at most n elements.
4 - Best of the best (among the above): subtract the elements at the same index of the two arrays and sum up the differences. E.g. for A1={1,2,3}, A2={3,1,2}, Diff={-2,1,1}; the sum of Diff is 0, which means they have the same set of integers. This approach requires O(n) time with no extra memory. C# code would look as follows:
public static bool ArrayEqual(int[] list1, int[] list2)
{
if (list1 == null || list2 == null)
{
throw new Exception("Invalid input");
}
if (list1.Length != list2.Length)
{
return false;
}
int diff = 0;
for (int i = 0; i < list1.Length; i++)
{
diff += list1[i] - list2[i];
}
return (diff == 0);
}
Approach 4 doesn't work at all, it is the worst: a zero sum of differences does not imply that the arrays have the same members.
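A quick way to see the problem is a counterexample. The sketch below is a Python port of the sum-of-differences check (the function name is made up) applied to two arrays that clearly differ.

```python
def sum_of_differences_equal(list1, list2):
    """Port of the check from approach 4: equal lengths and differences summing to 0."""
    if len(list1) != len(list2):
        return False
    return sum(x - y for x, y in zip(list1, list2)) == 0


a = [1, 4]
b = [2, 3]
print(sum_of_differences_equal(a, b))   # the check accepts the pair
print(sorted(a) == sorted(b))           # yet the members differ
```

Any two arrays of the same length with the same total pass the check, so at best it can serve as a cheap pre-filter before a real comparison.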
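If the elements of each array are given as distinct, you can XOR (bitwise XOR) all the elements of both arrays: if the result is nonzero, the arrays definitely differ. A zero result is necessary but not sufficient, though; for example [1,2,3] and [4,8,12] XOR to zero together without sharing any element, so a full comparison is still needed in that case. The time complexity of the XOR pass is O(n).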
I would suggest sorting both arrays first. Then compare the first element of each array, then the second, and so on.
If you find a mismatch you can stop.
If you sort both arrays first, you'd get O(N log(N)).
What is the "best" solution obviously depends on what constraints you have. If it's a small data set, sorting, hashing, or brute force comparison (like nickf posted) will all be pretty similar. Because you know that you're dealing with integer values, you can get O(n) sort times (e.g. radix sort), and the hash table will also use O(n) time. As always, there are drawbacks to each approach: sorting will require you to either duplicate the data or destructively sort your array (losing the current ordering) if you want to save space. A hash table will obviously have the memory overhead of creating the hash table. If you use nickf's method, you can do it with little-to-no memory overhead, but you have to deal with the O(n^2) runtime. You can choose which is best for your purposes.
Going on deep waters here, but:
Sorted lists
Sorting can be O(nlogn), as pointed out. Just to clarify, it doesn't matter that there are two lists, because O(2*nlogn) == O(nlogn). Comparing the elements pairwise is another O(n), so sorting both and then comparing each element is O(n)+O(nlogn), which is O(nlogn).
Hash-tables:
Converting the first list to a hash table is O(n) for reading plus the cost of storing in the hash table, which I guess can be estimated as O(n), giving O(n). Then you'll have to check the existence of each element of the other list in the produced hash table, which is (at least?) O(n) (assuming that checking the existence of an element in the hash table is constant). All in all, we end up with O(n) for the check.
The Java List interface defines equals as each corresponding element being equal.
Interestingly, the Java Collection interface definition almost discourages implementing the equals() function.
Finally, the Java Set interface, per its documentation, implements this very behaviour. The implementation should be very efficient, but the documentation makes no mention of performance. (Couldn't find a link to the source; it's probably too strictly licensed. Download and look at it yourself. It comes with the JDK.) Looking at the source, HashSet (a commonly used implementation of Set) delegates the equals() implementation to AbstractSet, which uses the containsAll() function of AbstractCollection, which in turn uses the contains() function of HashSet. So HashSet.equals() runs in O(n) as expected (looping through all elements and looking them up in constant time in the hash table).
Please edit if you know better to spare me the embarrassment.
Pseudocode :
A:array
B:array
C:hashtable
if A.length != B.length then return false;
foreach objA in A
{
H = objA;
if H is not found in C.Keys then
C.add(H as key,1 as initial value);
else
C.Val[H as key]++;
}
foreach objB in B
{
H = objB;
if H is not found in C.Keys then
return false;
else
C.Val[H as key]--;
}
if(C contains non-zero value)
return false;
else
return true;
The best way is probably to use hashmaps. Since insertion into a hashmap is O(1), building a hashmap from one array should take O(n). You then have n lookups, which each take O(1), so another O(n) operation. All in all, it's O(n).
In python:
def comparray(a, b):
    sa = set(a)
    return len(sa)==len(b) and all(el in sa for el in b)
Ignoring the built in ways to do this in C#, you could do something like this:
It's O(1) in the best case, O(N) (per list) in the worst case.
public bool MatchArrays(object[] array1, object[] array2)
{
    if (array1.Length != array2.Length)
        return false;
    bool retValue = true;
    Hashtable ht = new Hashtable();
    for (int i = 0; i < array1.Length; i++)
    {
        ht[array1[i]] = true;
    }
    for (int i = 0; i < array2.Length; i++)
    {
        // an element of array2 that is missing from array1 means a mismatch
        if (!ht.ContainsKey(array2[i]))
        {
            retValue = false;
            break;
        }
    }
    return retValue;
}
On collisions, a hashmap lookup is O(n) in most implementations, because a linked list is used to store the colliding entries. However, there are better approaches, and you should hardly have collisions anyway, because if you did the hashmap would be useless. In all regular cases it's simply O(1). Besides that, it's not likely to have more than a small number of collisions in a single hashmap, so performance wouldn't suck that bad; you can safely say that it's O(1) or almost O(1), because that n is so small it can be ignored.
Here is another option, let me know what you guys think. I estimated it at T(n)=2n*log2n -> O(nLogn) in the worst case, but note that runner.indexOf() is a linear scan, so the worst case is really O(n^2).
// needs: import java.util.ArrayList; import java.util.List;
private boolean compare(List<?> listA, List<?> listB) {
    if (listA.isEmpty() && listB.isEmpty()) return true;
    List<MatchingItem> runner = new ArrayList<>();
    List<?> maxList = listA.size() > listB.size() ? listA : listB;
    List<?> minList = listA.size() > listB.size() ? listB : listA;
    int matches = 0;
    List<?> nextList = null;
    int maxLength = maxList.size();
    for (int i = 0; i < maxLength; i++) {
        for (int j = 0; j < 2; j++) {
            // alternate between the two lists, starting with the longer one
            nextList = (nextList == null) ? maxList : (maxList == nextList) ? minList : maxList;
            if (i < nextList.size()) {
                MatchingItem nextItem = new MatchingItem(nextList.get(i), nextList);
                int position = runner.indexOf(nextItem);
                if (position < 0) {
                    runner.add(nextItem);
                } else {
                    MatchingItem itemInBag = runner.get(position);
                    if (itemInBag.getList() != nextList) matches++;
                    runner.remove(position);
                }
            }
        }
    }
    return maxLength == matches;
}
public class MatchingItem {
    private Object item;
    private List<?> itemList;
    public MatchingItem(Object item, List<?> itemList) {
        this.item = item;
        this.itemList = itemList;
    }
    // two items match when they are equal but come from different lists
    @Override
    public boolean equals(Object other) {
        MatchingItem otherItem = (MatchingItem) other;
        return otherItem.item.equals(this.item) && otherItem.itemList != this.itemList;
    }
    public Object getItem() { return this.item; }
    public List<?> getList() { return this.itemList; }
}
The best I can think of is O(n^2), I guess.
function compare($foo, $bar) {
if (count($foo) != count($bar)) return false;
foreach ($foo as $f) {
foreach ($bar as $b) {
if ($f == $b) {
// $f exists in $bar, skip to the next $foo
continue 2;
}
}
return false;
}
return true;
}