Lazy Shuffle Algorithms - algorithm

I have a large list of elements that I want to iterate in random order. However, I cannot modify the list and I don't want to create a copy of it either, because 1) it is large and 2) it can be expected that the iteration is cancelled early.
List<T> data = ...;
Iterator<T> shuffled = shuffle(data);
while (shuffled.hasNext()) {
T t = shuffled.next();
if (System.console().readLine("Do you want %s?", t).startsWith("y")) {
return t;
}
}
System.out.println("That's all");
return t;
I am looking for an algorithm were the code above would run in O(n) (and preferably require only O(log n)space), so caching the elements that were produced earlier is not an option. I don't care if the algorithm is biased (as long as it's not obvious).
(I uses pseudo-Java in my question, but you can use other languages if you wish)
Here is the best I got so far.
Iterator<T> shuffle(final List<T> data) {
int p = data.size();
while ((data.size() % p) == 0) p = randomPrime();
return new Iterator<T>() {
final int prime = p;
int n = 0, i = 0;
public boolean hasNext() { return i < data.size(); }
public T next() {
i++; n += prime;
return data.get(n);
}
}
}
Iterating all elements in O(n), constant space, but obviously biased as it can produce only data.size() permutations.

The easiest shuffling approaches I know of work with indices. If the List is not an ArrayList, you may end up with a very inefficient algorithm if you try to use one of the below (a LinkedList does have a get by ID, but it's O(n), so you'll end up with O(n^2) time).
If O(n) space is fine, which I'm assuming it's not, I'd recommend the Fisher-Yates / Knuth shuffle, it's O(n) time and is easy to implement. You can optimise it so you only need to perform a single operation before being able to get the first element, but you'll need to keep track of the rest of the modified list as you go.
My solution:
Ok, so this is not very random at all, but I can't see a better way if you want less than O(n) space.
It takes O(1) space and O(n) time.
There may be a way to push it up the space usage a little and get more random results, but I haven't figured that out yet.
It has to do with relative primes. The idea is that, given 2 relative primes a (the generator) and b, when you loop through a % b, 2a % b, 3a % b, 4a % b, ..., you will see every integer 0, 1, 2, ..., b-2, b-1, and this will also happen before seeing any integer twice. Unfortunately I don't have a link to a proof (the wikipedia link may mention or imply it, I didn't check in too much detail).
I start off by increasing the length until we get a prime, since this implies that any other number will be a relative prime, which is a whole lot less limiting (and just skip any number greater than the original length), then generate a random number, and use this as the generator.
I'm iterating through and printing out all the values, but it should be easy enough to modify to generate the next one given the current one.
Note I'm skipping 1 and len-1 with my nextInt, since these will produce 1,2,3,... and ...,3,2,1 respectively, but you can include these, but probably not if the length is below a certain threshold.
You may also want to generate a random number to multiply the generator by (mod the length) to start from.
Java code:
static Random gen = new Random();
static void printShuffle(int len)
{
// get first prime >= len
int newLen = len-1;
boolean prime;
do
{
newLen++;
// prime check
prime = true;
for (int i = 2; prime && i < len; i++)
prime &= (newLen % i != 0);
}
while (!prime);
long val = gen.nextInt(len-3) + 2;
long oldVal = val;
do
{
if (val < len)
System.out.println(val);
val = (val + oldVal) % newLen;
}
while (oldVal != val);
}

This is an old thread, but in case anyone comes across this in future, a paper by Andrew Kensler describes a way to do this in constant time and constant space. Essentially, you create a reversible hash function, and then use it (and not an array) to index the list. Kensler describes a method for generating the necessary function, and discusses "cycle walking" as a way to deal with a domain that is not identical to the domain of the hash function. Afnan Enayet's summary of the paper is here: https://afnan.io/posts/2019-04-05-explaining-the-hashed-permutation/.

You may try using a buffer to do this. Iterate through a limited set of data and put it in a buffer. Extract random values from that buffer and send it to output (or wherever you need it). Iterate through the next set and keep overwriting this buffer. Repeat this step.
You'll end up with n + n operations, which is still O(n). Unfortunately, the result will not be actually random. It will be close to random if you choose your buffer size properly.
On a different note, check these two: Python - run through a loop in non linear fashion, random iteration in Python
Perhaps there's a more elegant algorithm to do this better. I'm not sure though. Looking forward to other replies in this thread.

This is not a perfect answer to your question, but perhaps it's useful.
The idea is to use a reversible random number generator and the usual array-based shuffling algorithm done lazily: to get the i'th shuffled item, swap a[i] with and a randomly chosen a[j] where j is in [i..n-1], then return a[i]. This can be done in the iterator.
After you are done iterating, reset the array to original order by "unswapping" using the reverse direction of the RNG.
The unshuffling reset will never take longer than the original iteration, so it doesn't change asymptotic cost. Iteration is still linear in the number of iterations.
How to build a reversible RNG? Just use an encryption algorithm. Encrypt the previously generated pseudo-random value to go forward, and decrypt to go backward. If you have a symmetric encryption algorithm, then you can add a "salt" value at each step forward to prevent a cycle of two and subtract it for each step backward. I mention this because RC4 is simple and fast and symmetric. I've used it before for tasks like this. Encrypting 4-byte values then computing mod to get them in the desired range will be quick indeed.
You can press this into the Java iterator pattern by extending Iterator to allow resets. See below. Usage will look like:
ShuffledList<Integer> lst = new SuffledList<>();
... build the list with the usual operations
ResetableInterator<Integer> i = lst.iterator();
while (i.hasNext()) {
int val = i.next();
... use the randomly selected value
if (anyConditinoAtAll) break;
}
i.reset(); // Unshuffle the array
I know this isn't perfect, but it will be fast and give a good shuffle. Note that if you don't reset, the next iterator will still be a new random shuffle, but the original order will be lost forever. If the loop body can generate an exception, you'd want the reset in a finally block.
class ShuffledList<T> extends ArrayList<T> implements Iterable<T> {
#Override
public Iterator<T> iterator() {
return null;
}
public interface ResetableInterator<T> extends Iterator<T> {
public void reset();
}
class ShufflingIterator<T> implements ResetableInterator<T> {
int mark = 0;
#Override
public boolean hasNext() {
return true;
}
#Override
public T next() {
return null;
}
#Override
public void remove() {
throw new UnsupportedOperationException("Not supported.");
}
#Override
public void reset() {
throw new UnsupportedOperationException("Not supported yet.");
}
}
}

Related

convert to divide and conquer algorithm. Kotlin

convert method "FINAL" to divide and conquer algorithm
the task sounded like this: The buyer has n coins of
H1,...,Hn.
The seller has m
coins in denominations of
B1,...,Bm.
Can the buyer purchase the item
the cost S so that the seller has an exact change (if
necessary).
fun Final(H: ArrayList<Int>, B: ArrayList<Int>, S: Int): Boolean {
var Clon_Price = false;
var Temp: Int;
for (i in H) {
if (i == S)
return true;
}
for (i in H.withIndex()) {
Temp = i.value - S;
for (j in B) {
if (j == Temp)
Clon_Price = true;
}
}
return Clon_Price;
}
fun main(args: Array<String>) {
val H:ArrayList<Int> = ArrayList();
val B:ArrayList<Int> = ArrayList();
println("Enter the number of coins the buyer has:");
var n: Int = readln().toInt();
println("Enter their nominal value:")
while (n > 0){
H.add(readln().toInt());
n--
}
println("Enter the number of coins the seller has:");
var m: Int = readln().toInt();
println("Enter their nominal value:")
while (m > 0){
B.add(readln().toInt());
m--
}
println("Enter the product price:");
val S = readln().toInt();
if(Final(H,B,S)){
println("YES");
}
else{
println("No!");
}
Introduction
Since this is an assignment, I will only give you insights to solve this problem and you will need to do the coding yourself.
The algorithm
Receives two ArrayList<Int> and an Int parameter
if the searched (S) element can be found in H, then the result is true
Otherwise it loops H
Computes the difference between the current element and S
Searches for a match in B and if it's found, then true is being returned
If the method has not returned yet, then return false;
Divide et impera (Divide and conquer)
Divide and conquer is the process of breaking down a complicated task into similar, but simpler subtasks, repeating this breaking down until the subtasks become trivial (this was the divide part) and then, using the results of the trivial subtasks we can solve the slightly more complicated subtasks and go upwards in our layers of unsolved complexities until the problem is solved (this is the conquer part).
A very handy data-structure to use is the Stack. You can use the stack of your memory, which are fancy words for recursion, or, you can solve it iteratively, by managing such a stack yourself.
This specific problem
This algorithm does not seem to necessitate divide and conquer, given the fact that you only have two array lists that can be iterated, so, I guess, this is an early assignment.
To make sure this is divide and conquer, you can add two parameters to your method (which are 0 and length - 1 at the start) that reflect the current problem-space. And upon each call, check whether the starting and ending index (the two new parameters) are equal. If they are, you already have a trivial, simplified subtask and you just iterate the second ArrayList.
If they are not equal, then you still need to divide. You can simply
//... Some code here
return Final(H, B, S, start, end / 2) || Final(H, B, S, end / 2 + 1, end);
(there you go, I couldn't resist writing code, after all)
for your nontrivial cases. This automatically breaks down the problem into sub-problems.
Self-criticism
The idea above is a simplistic solution for you to get the gist. But, in reality, programmers dislike recursion, as it can lead to trouble. So, once you complete the implementation of the above, you are well-advised to convert your algorithm to make sure it's iterative, which should be fairly easy once you succeeded implementing the recursive version.

Find K most frequent words from billions of given words [duplicate]

Input: A positive integer K and a big text. The text can actually be viewed as word sequence. So we don't have to worry about how to break down it into word sequence.
Output: The most frequent K words in the text.
My thinking is like this.
use a Hash table to record all words' frequency while traverse the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.
sort the (word, word-frequency) pair; and the key is "word-frequency". This takes O(n*lg(n)) time with normal sorting algorithm.
After sorting, we just take the first K words. This takes O(K) time.
To summarize, the total time is O(n+nlg(n)+K), Since K is surely smaller than N, so it is actually O(nlg(n)).
We can improve this. Actually, we just want top K words. Other words' frequency is not concern for us. So, we can use "partial Heap sorting". For step 2) and 3), we don't just do sorting. Instead, we change it to be
2') build a heap of (word, word-frequency) pair with "word-frequency" as key. It takes O(n) time to build a heap;
3') extract top K words from the heap. Each extraction is O(lg(n)). So, total time is O(k*lg(n)).
To summarize, this solution cost time O(n+k*lg(n)).
This is just my thought. I haven't find out way to improve step 1).
I Hope some Information Retrieval experts can shed more light on this question.
This can be done in O(n) time
Solution 1:
Steps:
Count words and hash it, which will end up in the structure like this
var hash = {
"I" : 13,
"like" : 3,
"meow" : 3,
"geek" : 3,
"burger" : 2,
"cat" : 1,
"foo" : 100,
...
...
Traverse through the hash and find the most frequently used word (in this case "foo" 100), then create the array of that size
Then we can traverse the hash again and use the number of occurrences of words as array index, if there is nothing in the index, create an array else append it in the array. Then we end up with an array like:
0 1 2 3 100
[[ ],[cat],[burger],[like, meow, geek],[]...[foo]]
Then just traverse the array from the end, and collect the k words.
Solution 2:
Steps:
Same as above
Use min heap and keep the size of min heap to k, and for each word in the hash we compare the occurrences of words with the min, 1) if it's greater than the min value, remove the min (if the size of the min heap is equal to k) and insert the number in the min heap. 2) rest simple conditions.
After traversing through the array, we just convert the min heap to array and return the array.
You're not going to get generally better runtime than the solution you've described. You have to do at least O(n) work to evaluate all the words, and then O(k) extra work to find the top k terms.
If your problem set is really big, you can use a distributed solution such as map/reduce. Have n map workers count frequencies on 1/nth of the text each, and for each word, send it to one of m reducer workers calculated based on the hash of the word. The reducers then sum the counts. Merge sort over the reducers' outputs will give you the most popular words in order of popularity.
A small variation on your solution yields an O(n) algorithm if we don't care about ranking the top K, and a O(n+k*lg(k)) solution if we do. I believe both of these bounds are optimal within a constant factor.
The optimization here comes again after we run through the list, inserting into the hash table. We can use the median of medians algorithm to select the Kth largest element in the list. This algorithm is provably O(n).
After selecting the Kth smallest element, we partition the list around that element just as in quicksort. This is obviously also O(n). Anything on the "left" side of the pivot is in our group of K elements, so we're done (we can simply throw away everything else as we go along).
So this strategy is:
Go through each word and insert it into a hash table: O(n)
Select the Kth smallest element: O(n)
Partition around that element: O(n)
If you want to rank the K elements, simply sort them with any efficient comparison sort in O(k * lg(k)) time, yielding a total run time of O(n+k * lg(k)).
The O(n) time bound is optimal within a constant factor because we must examine each word at least once.
The O(n + k * lg(k)) time bound is also optimal because there is no comparison-based way to sort k elements in less than k * lg(k) time.
If your "big word list" is big enough, you can simply sample and get estimates. Otherwise, I like hash aggregation.
Edit:
By sample I mean choose some subset of pages and calculate the most frequent word in those pages. Provided you select the pages in a reasonable way and select a statistically significant sample, your estimates of the most frequent words should be reasonable.
This approach is really only reasonable if you have so much data that processing it all is just kind of silly. If you only have a few megs, you should be able to tear through the data and calculate an exact answer without breaking a sweat rather than bothering to calculate an estimate.
You can cut down the time further by partitioning using the first letter of words, then partitioning the largest multi-word set using the next character until you have k single-word sets. You would use a sortof 256-way tree with lists of partial/complete words at the leafs. You would need to be very careful to not cause string copies everywhere.
This algorithm is O(m), where m is the number of characters. It avoids that dependence on k, which is very nice for large k [by the way your posted running time is wrong, it should be O(n*lg(k)), and I'm not sure what that is in terms of m].
If you run both algorithms side by side you will get what I'm pretty sure is an asymptotically optimal O(min(m, n*lg(k))) algorithm, but mine should be faster on average because it doesn't involve hashing or sorting.
You have a bug in your description: Counting takes O(n) time, but sorting takes O(m*lg(m)), where m is the number of unique words. This is usually much smaller than the total number of words, so probably should just optimize how the hash is built.
Your problem is same as this-
http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/
Use Trie and min heap to efficieinty solve it.
If what you're after is the list of k most frequent words in your text for any practical k and for any natural langage, then the complexity of your algorithm is not relevant.
Just sample, say, a few million words from your text, process that with any algorithm in a matter of seconds, and the most frequent counts will be very accurate.
As a side note, the complexity of the dummy algorithm (1. count all 2. sort the counts 3. take the best) is O(n+m*log(m)), where m is the number of different words in your text. log(m) is much smaller than (n/m), so it remains O(n).
Practically, the long step is counting.
Utilize memory efficient data structure to store the words
Use MaxHeap, to find the top K frequent words.
Here is the code
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import com.nadeem.app.dsa.adt.Trie;
import com.nadeem.app.dsa.adt.Trie.TrieEntry;
import com.nadeem.app.dsa.adt.impl.TrieImpl;
public class TopKFrequentItems {
private int maxSize;
private Trie trie = new TrieImpl();
private PriorityQueue<TrieEntry> maxHeap;
public TopKFrequentItems(int k) {
this.maxSize = k;
this.maxHeap = new PriorityQueue<TrieEntry>(k, maxHeapComparator());
}
private Comparator<TrieEntry> maxHeapComparator() {
return new Comparator<TrieEntry>() {
#Override
public int compare(TrieEntry o1, TrieEntry o2) {
return o1.frequency - o2.frequency;
}
};
}
public void add(String word) {
this.trie.insert(word);
}
public List<TopK> getItems() {
for (TrieEntry trieEntry : this.trie.getAll()) {
if (this.maxHeap.size() < this.maxSize) {
this.maxHeap.add(trieEntry);
} else if (this.maxHeap.peek().frequency < trieEntry.frequency) {
this.maxHeap.remove();
this.maxHeap.add(trieEntry);
}
}
List<TopK> result = new ArrayList<TopK>();
for (TrieEntry entry : this.maxHeap) {
result.add(new TopK(entry));
}
return result;
}
public static class TopK {
public String item;
public int frequency;
public TopK(String item, int frequency) {
this.item = item;
this.frequency = frequency;
}
public TopK(TrieEntry entry) {
this(entry.word, entry.frequency);
}
#Override
public String toString() {
return String.format("TopK [item=%s, frequency=%s]", item, frequency);
}
#Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + frequency;
result = prime * result + ((item == null) ? 0 : item.hashCode());
return result;
}
#Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
TopK other = (TopK) obj;
if (frequency != other.frequency)
return false;
if (item == null) {
if (other.item != null)
return false;
} else if (!item.equals(other.item))
return false;
return true;
}
}
}
Here is the unit tests
#Test
public void test() {
TopKFrequentItems stream = new TopKFrequentItems(2);
stream.add("hell");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hero");
stream.add("hero");
stream.add("hero");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("home");
stream.add("go");
stream.add("go");
assertThat(stream.getItems()).hasSize(2).contains(new TopK("hero", 3), new TopK("hello", 8));
}
For more details refer this test case
use a Hash table to record all words' frequency while traverse the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.This is same as every one explained above
While insertion itself in hashmap , keep the Treeset(specific to java, there are implementations in every language) of size 10(k=10) to keep the top 10 frequent words. Till size is less than 10, keep adding it. If size equal to 10, if inserted element is greater than minimum element i.e. first element. If yes remove it and insert new element
To restrict the size of treeset see this link
Suppose we have a word sequence "ad" "ad" "boy" "big" "bad" "com" "come" "cold". And K=2.
as you mentioned "partitioning using the first letter of words", we got
("ad", "ad") ("boy", "big", "bad") ("com" "come" "cold")
"then partitioning the largest multi-word set using the next character until you have k single-word sets."
it will partition ("boy", "big", "bad") ("com" "come" "cold"), the first partition ("ad", "ad") is missed, while "ad" is actually the most frequent word.
Perhaps I misunderstand your point. Can you please detail your process about partition?
I believe this problem can be solved by an O(n) algorithm. We could make the sorting on the fly. In other words, the sorting in that case is a sub-problem of the traditional sorting problem since only one counter gets incremented by one every time we access the hash table. Initially, the list is sorted since all counters are zero. As we keep incrementing counters in the hash table, we bookkeep another array of hash values ordered by frequency as follows. Every time we increment a counter, we check its index in the ranked array and check if its count exceeds its predecessor in the list. If so, we swap these two elements. As such we obtain a solution that is at most O(n) where n is the number of words in the original text.
I was struggling with this as well and get inspired by #aly. Instead of sorting afterwards, we can just maintain a presorted list of words (List<Set<String>>) and the word will be in the set at position X where X is the current count of the word. In generally, here's how it works:
for each word, store it as part of map of it's occurrence: Map<String, Integer>.
then, based on the count, remove it from the previous count set, and add it into the new count set.
The drawback of this is the list maybe big - can be optimized by using a TreeMap<Integer, Set<String>> - but this will add some overhead. Ultimately we can use a mix of HashMap or our own data structure.
The code
public class WordFrequencyCounter {
private static final int WORD_SEPARATOR_MAX = 32; // UNICODE 0000-001F: control chars
Map<String, MutableCounter> counters = new HashMap<String, MutableCounter>();
List<Set<String>> reverseCounters = new ArrayList<Set<String>>();
private static class MutableCounter {
int i = 1;
}
public List<String> countMostFrequentWords(String text, int max) {
int lastPosition = 0;
int length = text.length();
for (int i = 0; i < length; i++) {
char c = text.charAt(i);
if (c <= WORD_SEPARATOR_MAX) {
if (i != lastPosition) {
String word = text.substring(lastPosition, i);
MutableCounter counter = counters.get(word);
if (counter == null) {
counter = new MutableCounter();
counters.put(word, counter);
} else {
Set<String> strings = reverseCounters.get(counter.i);
strings.remove(word);
counter.i ++;
}
addToReverseLookup(counter.i, word);
}
lastPosition = i + 1;
}
}
List<String> ret = new ArrayList<String>();
int count = 0;
for (int i = reverseCounters.size() - 1; i >= 0; i--) {
Set<String> strings = reverseCounters.get(i);
for (String s : strings) {
ret.add(s);
System.out.print(s + ":" + i);
count++;
if (count == max) break;
}
if (count == max) break;
}
return ret;
}
private void addToReverseLookup(int count, String word) {
while (count >= reverseCounters.size()) {
reverseCounters.add(new HashSet<String>());
}
Set<String> strings = reverseCounters.get(count);
strings.add(word);
}
}
I just find out the other solution for this problem. But I am not sure it is right.
Solution:
Use a Hash table to record all words' frequency T(n) = O(n)
Choose first k elements of hash table, and restore them in one buffer (whose space = k). T(n) = O(k)
Each time, firstly we need find the current min element of the buffer, and just compare the min element of the buffer with the (n - k) elements of hash table one by one. If the element of hash table is greater than this min element of buffer, then drop the current buffer's min, and add the element of the hash table. So each time we find the min one in the buffer need T(n) = O(k), and traverse the whole hash table need T(n) = O(n - k). So the whole time complexity for this process is T(n) = O((n-k) * k).
After traverse the whole hash table, the result is in this buffer.
The whole time complexity: T(n) = O(n) + O(k) + O(kn - k^2) = O(kn + n - k^2 + k). Since, k is really smaller than n in general. So for this solution, the time complexity is T(n) = O(kn). That is linear time, when k is really small. Is it right? I am really not sure.
Try to think of special data structure to approach this kind of problems. In this case special kind of tree like trie to store strings in specific way, very efficient. Or second way to build your own solution like counting words. I guess this TB of data would be in English then we do have around 600,000 words in general so it'll be possible to store only those words and counting which strings would be repeated + this solution will need regex to eliminate some special characters. First solution will be faster, I'm pretty sure.
http://en.wikipedia.org/wiki/Trie
This is an interesting idea to search and I could find this paper related to Top-K https://icmi.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf
Also there is an implementation of it here.
Simplest code to get the occurrence of most frequently used word.
function strOccurence(str){
var arr = str.split(" ");
var length = arr.length,temp = {},max;
while(length--){
if(temp[arr[length]] == undefined && arr[length].trim().length > 0)
{
temp[arr[length]] = 1;
}
else if(arr[length].trim().length > 0)
{
temp[arr[length]] = temp[arr[length]] + 1;
}
}
console.log(temp);
var max = [];
for(i in temp)
{
max[temp[i]] = i;
}
console.log(max[max.length])
//if you want second highest
console.log(max[max.length - 2])
}
In these situations, I recommend to use Java built-in features. Since, they are already well tested and stable. In this problem, I find the repetitions of the words by using HashMap data structure. Then, I push the results to an array of objects. I sort the object by Arrays.sort() and print the top k words and their repetitions.
import java.io.*;
import java.lang.reflect.Array;
import java.util.*;
public class TopKWordsTextFile {
static class SortObject implements Comparable<SortObject>{
private String key;
private int value;
public SortObject(String key, int value) {
super();
this.key = key;
this.value = value;
}
#Override
public int compareTo(SortObject o) {
//descending order
return o.value - this.value;
}
}
public static void main(String[] args) {
HashMap<String,Integer> hm = new HashMap<>();
int k = 1;
try {
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("words.in")));
String line;
while ((line = br.readLine()) != null) {
// process the line.
//System.out.println(line);
String[] tokens = line.split(" ");
for(int i=0; i<tokens.length; i++){
if(hm.containsKey(tokens[i])){
//If the key already exists
Integer prev = hm.get(tokens[i]);
hm.put(tokens[i],prev+1);
}else{
//If the key doesn't exist
hm.put(tokens[i],1);
}
}
}
//Close the input
br.close();
//Print all words with their repetitions. You can use 3 for printing top 3 words.
k = hm.size();
// Get a set of the entries
Set set = hm.entrySet();
// Get an iterator
Iterator i = set.iterator();
int index = 0;
// Display elements
SortObject[] objects = new SortObject[hm.size()];
while(i.hasNext()) {
Map.Entry e = (Map.Entry)i.next();
//System.out.print("Key: "+e.getKey() + ": ");
//System.out.println(" Value: "+e.getValue());
String tempS = (String) e.getKey();
int tempI = (int) e.getValue();
objects[index] = new SortObject(tempS,tempI);
index++;
}
System.out.println();
//Sort the array
Arrays.sort(objects);
//Print top k
for(int j=0; j<k; j++){
System.out.println(objects[j].key+":"+objects[j].value);
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
For more information, please visit https://github.com/m-vahidalizadeh/foundations/blob/master/src/algorithms/TopKWordsTextFile.java. I hope it helps.
**
C++11 Implementation of the above thought
**
class Solution {
public:
vector<int> topKFrequent(vector<int>& nums, int k) {
unordered_map<int,int> map;
for(int num : nums){
map[num]++;
}
vector<int> res;
// we use the priority queue, like the max-heap , we will keep (size-k) smallest elements in the queue
// pair<first, second>: first is frequency, second is number
priority_queue<pair<int,int>> pq;
for(auto it = map.begin(); it != map.end(); it++){
pq.push(make_pair(it->second, it->first));
// onece the size bigger than size-k, we will pop the value, which is the top k frequent element value
if(pq.size() > (int)map.size() - k){
res.push_back(pq.top().second);
pq.pop();
}
}
return res;
}
};

Algorithm to find duplicate in an array

I have an assignment to create an algorithm to find duplicates in an array which includes number values. but it has not said which kind of numbers, integers or floats. I have written the following pseudocode:
FindingDuplicateAlgorithm(A) // A is the array
mergeSort(A);
for int i <- 0 to i<A.length
if A[i] == A[i+1]
i++
return A[i]
else
i++
have I created an efficient algorithm?
I think there is a problem in my algorithm, it returns duplicate numbers several time. for example if array include 2 in two for two indexes i will have ...2, 2,... in the output. how can i change it to return each duplicat only one time?
I think it is a good algorithm for integers, but does it work good for float numbers too?
To handle duplicates, you can do the following:
if A[i] == A[i+1]:
result.append(A[i]) # collect found duplicates in a list
while A[i] == A[i+1]: # skip the entire range of duplicates
i++ # until a new value is found
Do you want to find Duplicates in Java?
You may use a HashSet.
HashSet h = new HashSet();
for(Object a:A){
boolean b = h.add(a);
boolean duplicate = !b;
if(duplicate)
// do something with a;
}
The return-Value of add() is defined as:
true if the set did not already
contain the specified element.
EDIT:
I know HashSet is optimized for inserts and contains operations. But I'm not sure if its fast enough for your concerns.
EDIT2:
I've seen you recently added the homework-tag. I would not prefer my answer if itf homework, because it may be to "high-level" for an allgorithm-lesson
http://download.oracle.com/javase/1.4.2/docs/api/java/util/HashSet.html#add%28java.lang.Object%29
Your answer seems pretty good. First sorting and them simply checking neighboring values gives you O(n log(n)) complexity which is quite efficient.
Merge sort is O(n log(n)) while checking neighboring values is simply O(n).
One thing though (as mentioned in one of the comments) you are going to get a stack overflow (lol) with your pseudocode. The inner loop should be (in Java):
for (int i = 0; i < array.length - 1; i++) {
...
}
Then also, if you actually want to display which numbers (and or indexes) are the duplicates, you will need to store them in a separate list.
I'm not sure what language you need to write the algorithm in, but there are some really good C++ solutions in response to my question here. Should be of use to you.
O(n) algorithm: traverse the array and try to input each element in a hashtable/set with number as the hash key. if you cannot enter, than that's a duplicate.
Your algorithm contains a buffer overrun. i starts with 0, so I assume the indexes into array A are zero-based, i.e. the first element is A[0], the last is A[A.length-1]. Now i counts up to A.length-1, and in the loop body accesses A[i+1], which is out of the array for the last iteration. Or, simply put: If you're comparing each element with the next element, you can only do length-1 comparisons.
If you only want to report duplicates once, I'd use a bool variable firstDuplicate, that's set to false when you find a duplicate and true when the number is different from the next. Then you'd only report the first duplicate by only reporting the duplicate numbers if firstDuplicate is true.
public void printDuplicates(int[] inputArray) {
if (inputArray == null) {
throw new IllegalArgumentException("Input array can not be null");
}
int length = inputArray.length;
if (length == 1) {
System.out.print(inputArray[0] + " ");
return;
}
for (int i = 0; i < length; i++) {
if (inputArray[Math.abs(inputArray[i])] >= 0) {
inputArray[Math.abs(inputArray[i])] = -inputArray[Math.abs(inputArray[i])];
} else {
System.out.print(Math.abs(inputArray[i]) + " ");
}
}
}

Efficient algorithm to remove any map that is contained in another map from a collection of maps

I have set (s) of unique maps (Java HashMaps currently) and wish to remove from it any maps that are completely contained by some other map in the set (i.e. remove m from s if m.entrySet() is a subset of n.entrySet() for some other n in s.)
I have an n^2 algorithm, but it's too slow. Is there a more efficient way to do this?
Edit:
the set of possible keys is small, if that helps.
Here is an inefficient reference implementation:
public void removeSubmaps(Set<Map> s) {
Set<Map> toRemove = new HashSet<Map>();
for (Map a: s) {
for (Map b : s) {
if (a.entrySet().containsAll(b.entrySet()))
toRemove.add(b);
}
}
s.removeAll(toRemove);
}
Not sure I can make this anything other than an n^2 algorithm, but I have a shortcut that might make it faster. Make a list of your maps with the length of the each map and sort it. A proper subset of a map must be shorter or equal to the map you're comparing - there's never any need to compare to a map higher on the list.
Here's another stab at it.
Decompose all your maps into a list of key,value,map number. Sort the list by key and value. Go through the list, and for each group of key/value matches, create a permutation of all the map number pairs - these are all potential subsets. When you have the final list of pairs, sort by map numbers. Go through this second list, and count the number of occurrences of each pair - if the number matches the size of one of the maps, you've found a subset.
Edit: My original interpretation of the problem was incorrect, here is new answer based on my re-read of the question.
You can create a custom hash function for HashMap which returns the product of all hash value of its entries. Sort the list of hash value and start loop from biggest value and find all divisor from smaller hash values, these are possible subsets of this hashmap, use set.containsAll() to confirm before marking them for removal.
This effectively transforms the problem into a mathematical problem of finding possible divisor from a collection. And you can apply all the common divisor-search optimizations.
Complexity is O(n^2), but if many hashmaps are subsets of others, the actual time spent can be a lot better, approaching O(n) in best-case scenario (if all hashmaps are subset of one). But even in worst case scenario, division calculation would be a lot faster than set.containsAll() which itself is O(n^2) where n is number of items in a hashmap.
You might also want to create a simple hash function for hashmap entry objects to return smaller numbers to increase multiply/division performance.
Here's a subquadratic (O(N**2 / log N)) algorithm for finding maximal sets from a set of sets: An Old Sub-Quadratic Algorithm for Finding Extremal Sets.
But if you know your data distribution, you can do much better in average case.
This what I ended up doing. It works well in my situation as there is usually some value that is only shared by a small number of maps. Kudos to Mark Ransom for pushing me in this direction.
In prose: Index the maps by key/value pair, so that each key/value pair is associated with a set of maps. Then, for each map: Find the smallest set associated with one of it's key/value pairs; this set is typically small for my data. Each of the maps in this set is a potential 'supermap'; no other map could be a 'supermap' as it would not contain this key/value pair. Search this set for a supermap. Finally remove all the identified submaps from the original set.
private <K, V> void removeSubmaps(Set<Map<K, V>> maps) {
// index the maps by key/value
List<Map<K, V>> mapList = toList(maps);
Map<K, Map<V, List<Integer>>> values = LazyMap.create(HashMap.class, ArrayList.class);
for (int i = 0, uniqueRowsSize = mapList.size(); i < uniqueRowsSize; i++) {
Map<K, V> row = mapList.get(i);
Integer idx = i;
for (Map.Entry<K, V> entry : row.entrySet())
values.get(entry.getKey()).get(entry.getValue()).add(idx);
}
// find submaps
Set<Map<K, V>> toRemove = Sets.newHashSet();
for (Map<K, V> submap : mapList) {
// find the smallest set of maps with a matching key/value
List<Integer> smallestList = null;
for (Map.Entry<K, V> entry : submap.entrySet()) {
List<Integer> list = values.get(entry.getKey()).get(entry.getValue());
if (smallestList == null || list.size() < smallestList.size())
smallestList = list;
}
// compare with each of the maps in that set
for (int i : smallestList) {
Map<K, V> map = mapList.get(i);
if (isSubmap(submap, map))
toRemove.add(submap);
}
}
maps.removeAll(toRemove);
}
private <K,V> boolean isSubmap(Map<K, V> submap, Map<K,V> map){
if (submap.size() >= map.size())
return false;
for (Map.Entry<K,V> entry : submap.entrySet()) {
V other = map.get(entry.getKey());
if (other == null)
return false;
if (!other.equals(entry.getValue()))
return false;
}
return true;
}

Algorithm to tell if two arrays have identical members

What's the best algorithm for comparing two arrays to see if they have the same members?
Assume there are no duplicates, the members can be in any order, and that neither is sorted.
compare(
[a, b, c, d],
[b, a, d, c]
) ==> true
compare(
[a, b, e],
[a, b, c]
) ==> false
compare(
[a, b, c],
[a, b]
) ==> false
Obvious answers would be:
Sort both lists, then check each
element to see if they're identical
Add the items from one array to a
hashtable, then iterate through the
other array, checking that each item
is in the hash
nickf's iterative search algorithm
Which one you'd use would depend on whether you can sort the lists first, and whether you have a good hash algorithm handy.
You could load one into a hash table, keeping track of how many elements it has. Then, loop over the second one checking to see if every one of its elements is in the hash table, and counting how many elements it has. If every element in the second array is in the hash table, and the two lengths match, they are the same, otherwise they are not. This should be O(N).
To make this work in the presence of duplicates, track how many of each element has been seen. Increment while looping over the first array, and decrement while looping over the second array. During the loop over the second array, if you can't find something in the hash table, or if the counter is already at zero, they are unequal. Also compare total counts.
Another method that would work in the presence of duplicates is to sort both arrays and do a linear compare. This should be O(N*log(N)).
Assuming you don't want to disturb the original arrays and space is a consideration, another O(n.log(n)) solution that uses less space than sorting both arrays is:
Return FALSE if arrays differ in size
Sort the first array -- O(n.log(n)) time, extra space required is the size of one array
For each element in the 2nd array, check if it's in the sorted copy of
the first array using a binary search -- O(n.log(n)) time
If you use this approach, please use a library routine to do the binary search. Binary search is surprisingly error-prone to hand-code.
[Added after reviewing solutions suggesting dictionary/set/hash lookups:]
In practice I'd use a hash. Several people have asserted O(1) behaviour for hashes, leading them to conclude a hash-based solution is O(N). Typical inserts/lookups may be close to O(1), and some hashing schemes guarantee worst-case O(1) lookup, but worst-case insertion -- in constructing the hash -- isn't O(1). Given any particular hashing data structure, there would be some set of inputs which would produce pathological behaviour. I suspect there exist hashing data structures with the combined worst-case to [insert-N-elements then lookup-N-elements] of O(N.log(N)) time and O(N) space.
You can use a signature (a commutative operation over the array members) to further optimize this in the case where the array are usually different, saving the o(n log n) or the memory allocation.
A signature can be of the form of a bloom filter(s), or even a simple commutative operation like addition or xor.
A simple example (assuming a long as the signature side and gethashcode as a good object identifier; if the objects are, say, ints, then their value is a better identifier; and some signatures will be larger than long)
public bool MatchArrays(object[] array1, object[] array2)
{
if (array1.length != array2.length)
return false;
long signature1 = 0;
long signature2 = 0;
for (i=0;i<array1.length;i++) {
signature1=CommutativeOperation(signature1,array1[i].getHashCode());
signature2=CommutativeOperation(signature2,array2[i].getHashCode());
}
if (signature1 != signature2)
return false;
return MatchArraysTheLongWay(array1, array2);
}
where (using an addition operation; use a different commutative operation if desired, e.g. bloom filters)
public long CommutativeOperation(long oldValue, long newElement) {
return oldValue + newElement;
}
This can be done in different ways:
1 - Brute force: for each element in array1 check that element exists in array2. Note this would require to note the position/index so that duplicates can be handled properly. This requires O(n^2) with much complicated code, don't even think of it at all...
2 - Sort both lists, then check each element to see if they're identical. O(n log n) for sorting and O(n) to check so basically O(n log n), sort can be done in-place if messing up the arrays is not a problem, if not you need to have 2n size memory to copy the sorted list.
3 - Add the items and count from one array to a hashtable, then iterate through the other array, checking that each item is in the hashtable and in that case decrement count if it is not zero otherwise remove it from hashtable. O(n) to create a hashtable, and O(n) to check the other array items in the hashtable, so O(n). This introduces a hashtable with memory at most for n elements.
4 - Best of Best (Among the above): Subtract or take difference of each element in the same index of the two arrays and finally sum up the subtacted values. For eg A1={1,2,3}, A2={3,1,2} the Diff={-2,1,1} now sum-up the Diff = 0 that means they have same set of integers. This approach requires an O(n) with no extra memory. A c# code would look like as follows:
public static bool ArrayEqual(int[] list1, int[] list2)
{
if (list1 == null || list2 == null)
{
throw new Exception("Invalid input");
}
if (list1.Length != list2.Length)
{
return false;
}
int diff = 0;
for (int i = 0; i < list1.Length; i++)
{
diff += list1[i] - list2[i];
}
return (diff == 0);
}
4 doesn't work at all, it is the worst
If the elements of an array are given as distinct, then XOR ( bitwise XOR ) all the elements of both the arrays, if the answer is zero, then both the arrays have the same set of numbers. The time complexity is O(n)
I would suggest using a sort first and sort both first. Then you will compare the first element of each array then the second and so on.
If you find a mismatch you can stop.
If you sort both arrays first, you'd get O(N log(N)).
What is the "best" solution obviously depends on what constraints you have. If it's a small data set, the sorting, hashing, or brute force comparison (like nickf posted) will all be pretty similar. Because you know that you're dealing with integer values, you can get O(n) sort times (e.g. radix sort), and the hash table will also use O(n) time. As always, there are drawbacks to each approach: sorting will either require you to duplicate the data or destructively sort your array (losing the current ordering) if you want to save space. A hash table will obviously have memory overhead to for creating the hash table. If you use nickf's method, you can do it with little-to-no memory overhead, but you have to deal with the O(n2) runtime. You can choose which is best for your purposes.
Going on deep waters here, but:
Sorted lists
sorting can be O(nlogn) as pointed out. just to clarify, it doesn't matter that there is two lists, because: O(2*nlogn) == O(nlogn), then comparing each elements is another O(n), so sorting both then comparing each element is O(n)+O(nlogn) which is: O(nlogn)
Hash-tables:
Converting the first list to a hash table is O(n) for reading + the cost of storing in the hash table, which i guess can be estimated as O(n), gives O(n). Then you'll have to check the existence of each element in the other list in the produced hash table, which is (at least?) O(n) (assuming that checking existance of an element the hash-table is constant). All-in-all, we end up with O(n) for the check.
The Java List interface defines equals as each corresponding element being equal.
Interestingly, the Java Collection interface definition almost discourages implementing the equals() function.
Finally, the Java Set interface per documentation implements this very behaviour. The implementation is should be very efficient, but the documentation makes no mention of performance. (Couldn't find a link to the source, it's probably to strictly licensed. Download and look at it yourself. It comes with the JDK) Looking at the source, the HashSet (which is a commonly used implementation of Set) delegates the equals() implementation to the AbstractSet, which uses the containsAll() function of AbstractCollection using the contains() function again from hashSet. So HashSet.equals() runs in O(n) as expected. (looping through all elements and looking them up in constant time in the hash-table.)
Please edit if you know better to spare me the embarrasment.
Pseudocode :
A:array
B:array
C:hashtable
if A.length != B.length then return false;
foreach objA in A
{
H = objA;
if H is not found in C.Keys then
C.add(H as key,1 as initial value);
else
C.Val[H as key]++;
}
foreach objB in B
{
H = objB;
if H is not found in C.Keys then
return false;
else
C.Val[H as key]--;
}
if(C contains non-zero value)
return false;
else
return true;
The best way is probably to use hashmaps. Since insertion into a hashmap is O(1), building a hashmap from one array should take O(n). You then have n lookups, which each take O(1), so another O(n) operation. All in all, it's O(n).
In python:
def comparray(a, b):
sa = set(a)
return len(sa)==len(b) and all(el in sa for el in b)
Ignoring the built in ways to do this in C#, you could do something like this:
Its O(1) in the best case, O(N) (per list) in worst case.
public bool MatchArrays(object[] array1, object[] array2)
{
if (array1.length != array2.length)
return false;
bool retValue = true;
HashTable ht = new HashTable();
for (int i = 0; i < array1.length; i++)
{
ht.Add(array1[i]);
}
for (int i = 0; i < array2.length; i++)
{
if (ht.Contains(array2[i])
{
retValue = false;
break;
}
}
return retValue;
}
Upon collisions a hashmap is O(n) in most cases because it uses a linked list to store the collisions. However, there are better approaches and you should hardly have collisions anyway because if you did the hashmap would be useless. In all regular cases it's simply O(1). Besides that, it's not likely to have more than a small n of collisions in a single hashmap so performance wouldn't suck that bad; you can safely say that it's O(1) or almost O(1) because the n is so small it's can be ignored.
Here is another option, let me know what you guys think.It should be T(n)=2n*log2n ->O(nLogn) in the worst case.
private boolean compare(List listA, List listB){
if (listA.size()==0||listA.size()==0) return true;
List runner = new ArrayList();
List maxList = listA.size()>listB.size()?listA:listB;
List minList = listA.size()>listB.size()?listB:listA;
int macthes = 0;
List nextList = null;;
int maxLength = maxList.size();
for(int i=0;i<maxLength;i++){
for (int j=0;j<2;j++) {
nextList = (nextList==null)?maxList:(maxList==nextList)?minList:maList;
if (i<= nextList.size()) {
MatchingItem nextItem =new MatchingItem(nextList.get(i),nextList)
int position = runner.indexOf(nextItem);
if (position <0){
runner.add(nextItem);
}else{
MatchingItem itemInBag = runner.get(position);
if (itemInBag.getList != nextList) matches++;
runner.remove(position);
}
}
}
}
return maxLength==macthes;
}
public Class MatchingItem{
private Object item;
private List itemList;
public MatchingItem(Object item,List itemList){
this.item=item
this.itemList = itemList
}
public boolean equals(object other){
MatchingItem otheritem = (MatchingItem)other;
return otheritem.item.equals(this.item) and otheritem.itemlist!=this.itemlist
}
public Object getItem(){ return this.item}
public Object getList(){ return this.itemList}
}
The best I can think of is O(n^2), I guess.
function compare($foo, $bar) {
if (count($foo) != count($bar)) return false;
foreach ($foo as $f) {
foreach ($bar as $b) {
if ($f == $b) {
// $f exists in $bar, skip to the next $foo
continue 2;
}
}
return false;
}
return true;
}

Resources