Find K most frequent words from billions of given words [duplicate] - algorithm

Input: A positive integer K and a big text. The text can actually be viewed as a word sequence, so we don't have to worry about how to break it down into a word sequence.
Output: The most frequent K words in the text.
My thinking is like this.
Use a hash table to record all words' frequencies while traversing the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.
Sort the (word, word-frequency) pairs, with "word-frequency" as the key. This takes O(n*lg(n)) time with a normal sorting algorithm.
After sorting, we just take the first K words. This takes O(K) time.
To summarize, the total time is O(n + n*lg(n) + K). Since K is surely smaller than n, this is actually O(n*lg(n)).
We can improve this. Actually, we just want the top K words; other words' frequencies are of no concern to us. So we can use "partial heap sorting". For steps 2) and 3), instead of sorting, we change them to:
2') Build a heap of (word, word-frequency) pairs with "word-frequency" as the key. It takes O(n) time to build the heap;
3') Extract the top K words from the heap. Each extraction is O(lg(n)), so the total time is O(K*lg(n)).
To summarize, this solution costs O(n + K*lg(n)) time.
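For concreteness, here is a minimal Java sketch of steps 1), 2') and 3'), assuming the text is already tokenized into a list of words. Filling the heap with addAll is O(m*lg(m)) for m distinct words rather than a true O(m) heapify, so this is only a rough illustration of the approach, not an exact match for the bound above.
import java.util.*;

public class TopKWordsSketch {
    public static List<String> topK(List<String> words, int k) {
        // step 1: count word frequencies in O(n)
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) freq.merge(w, 1, Integer::sum);

        // step 2': heap of (word, frequency) entries, highest frequency first
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        heap.addAll(freq.entrySet());

        // step 3': extract the top K words
        List<String> result = new ArrayList<>();
        for (int i = 0; i < k && !heap.isEmpty(); i++) result.add(heap.poll().getKey());
        return result;
    }
}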
This is just my thought. I haven't found a way to improve step 1).
I hope some Information Retrieval experts can shed more light on this question.

This can be done in O(n) time
Solution 1:
Steps:
Count the words and hash them, which will end up in a structure like this:
var hash = {
"I" : 13,
"like" : 3,
"meow" : 3,
"geek" : 3,
"burger" : 2,
"cat" : 1,
"foo" : 100,
...
...
}
Traverse through the hash and find the most frequently used word (in this case "foo", with 100), then create an array of that size.
Then we can traverse the hash again and use the number of occurrences of each word as the array index; if there is nothing at that index, create an array, else append to it. We end up with an array like:
0 1 2 3 100
[[ ],[cat],[burger],[like, meow, geek],[]...[foo]]
Then just traverse the array from the end, and collect the k words.
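A hedged Java sketch of this bucket idea, assuming the counts are already in a map (note the bucket array is as long as the highest count):
import java.util.*;

public class BucketTopK {
    public static List<String> topK(Map<String, Integer> freq, int k) {
        int maxFreq = 0;
        for (int f : freq.values()) maxFreq = Math.max(maxFreq, f);

        // buckets.get(f) holds all words that occur exactly f times
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i <= maxFreq; i++) buckets.add(new ArrayList<>());
        for (Map.Entry<String, Integer> e : freq.entrySet())
            buckets.get(e.getValue()).add(e.getKey());

        // walk from the most frequent bucket downwards until k words are collected
        List<String> result = new ArrayList<>();
        for (int f = maxFreq; f >= 1 && result.size() < k; f--)
            for (String w : buckets.get(f)) {
                if (result.size() == k) break;
                result.add(w);
            }
        return result;
    }
}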
Solution 2:
Steps:
Same as above
Use a min heap and keep its size at k. For each word in the hash, compare its count with the heap's minimum: 1) if it is greater than the min value, remove the min (when the min heap already holds k entries) and insert the new entry; 2) otherwise, skip it.
After traversing the hash, we just convert the min heap to an array and return it.
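A minimal Java sketch of Solution 2, again assuming the counts are already in a hash map. The heap never grows beyond k entries, so this pass is roughly O(m*lg(k)) over m distinct words:
import java.util.*;

public class MinHeapTopK {
    public static List<String> topK(Map<String, Integer> freq, int k) {
        // min-heap ordered by count, capped at k entries
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (heap.size() < k) {
                heap.offer(e);
            } else if (heap.peek().getValue() < e.getValue()) {
                heap.poll();      // evict the least frequent of the current top k
                heap.offer(e);
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        Collections.reverse(result);   // most frequent first
        return result;
    }
}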

You're not going to get generally better runtime than the solution you've described. You have to do at least O(n) work to evaluate all the words, and then O(k) extra work to find the top k terms.
If your problem set is really big, you can use a distributed solution such as map/reduce. Have n map workers count frequencies on 1/nth of the text each, and for each word, send it to one of m reducer workers calculated based on the hash of the word. The reducers then sum the counts. Merge sort over the reducers' outputs will give you the most popular words in order of popularity.
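As a rough single-process illustration of that idea (names and structure here are illustrative only, not a real map/reduce job): route each word to one of m reducer buckets by hash, count within each bucket, then merge the disjoint partial results.
import java.util.*;

public class DistributedWordCountSketch {
    public static Map<String, Long> count(Iterable<String> words, int numReducers) {
        // one "reducer" bucket per partition
        List<Map<String, Long>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) reducers.add(new HashMap<>());

        // "map" phase: each word goes to the reducer chosen by its hash
        for (String w : words) {
            int r = Math.floorMod(w.hashCode(), numReducers);
            reducers.get(r).merge(w, 1L, Long::sum);
        }

        // merge phase: key sets are disjoint across reducers, so a plain putAll suffices
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> part : reducers) total.putAll(part);
        return total;
    }
}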

A small variation on your solution yields an O(n) algorithm if we don't care about ranking the top K, and an O(n + k*lg(k)) solution if we do. I believe both of these bounds are optimal within a constant factor.
The optimization again comes after we run through the list, inserting into the hash table. We can use the median-of-medians algorithm to select the Kth largest element (by frequency). This algorithm is provably O(n).
After selecting the Kth largest element, we partition the list around it just as in quicksort. This is obviously also O(n). Everything on the "more frequent" side of the pivot is in our group of K elements, so we're done (we can simply throw away everything else as we go along).
So this strategy is:
Go through each word and insert it into a hash table: O(n)
Select the Kth largest element (by frequency): O(n)
Partition around that element: O(n)
If you want to rank the K elements, simply sort them with any efficient comparison sort in O(k * lg(k)) time, yielding a total run time of O(n+k * lg(k)).
The O(n) time bound is optimal within a constant factor because we must examine each word at least once.
The O(n + k * lg(k)) time bound is also optimal because there is no comparison-based way to sort k elements in less than k * lg(k) time.
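Here is a hedged Java sketch of this selection idea, using randomized quickselect (expected O(m) over m distinct words) in place of the deterministic median-of-medians described above. After the call, the K most frequent entries sit unordered at the front; sort just those if a ranking is needed.
import java.util.*;

public class QuickselectTopK {
    public static List<Map.Entry<String, Integer>> topK(Map<String, Integer> freq, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        k = Math.min(k, entries.size());
        quickselect(entries, 0, entries.size() - 1, k);
        return new ArrayList<>(entries.subList(0, k));   // top K, in no particular order
    }

    // After this returns, the k entries with the highest counts occupy indices [0, k).
    private static void quickselect(List<Map.Entry<String, Integer>> a, int lo, int hi, int k) {
        Random rnd = new Random();
        while (lo < hi) {
            int p = partition(a, lo, hi, lo + rnd.nextInt(hi - lo + 1));
            if (p == k - 1) return;
            if (p < k - 1) lo = p + 1; else hi = p - 1;
        }
    }

    // Lomuto partition by descending frequency; returns the pivot's final index.
    private static int partition(List<Map.Entry<String, Integer>> a, int lo, int hi, int pivotIdx) {
        int pivot = a.get(pivotIdx).getValue();
        Collections.swap(a, pivotIdx, hi);
        int store = lo;
        for (int i = lo; i < hi; i++)
            if (a.get(i).getValue() > pivot) Collections.swap(a, i, store++);
        Collections.swap(a, store, hi);
        return store;
    }
}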

If your "big word list" is big enough, you can simply sample and get estimates. Otherwise, I like hash aggregation.
Edit:
By sample I mean choose some subset of pages and calculate the most frequent word in those pages. Provided you select the pages in a reasonable way and select a statistically significant sample, your estimates of the most frequent words should be reasonable.
This approach is really only reasonable if you have so much data that processing it all is just kind of silly. If you only have a few megs, you should be able to tear through the data and calculate an exact answer without breaking a sweat rather than bothering to calculate an estimate.

You can cut down the time further by partitioning using the first letter of words, then partitioning the largest multi-word set using the next character, until you have k single-word sets. You would use a sort of 256-way trie with lists of partial/complete words at the leaves. You would need to be very careful not to cause string copies everywhere.
This algorithm is O(m), where m is the number of characters. It avoids the dependence on k, which is very nice for large k [by the way, your posted running time is wrong; it should be O(n*lg(k)), and I'm not sure what that is in terms of m].
If you run both algorithms side by side you will get what I'm pretty sure is an asymptotically optimal O(min(m, n*lg(k))) algorithm, but mine should be faster on average because it doesn't involve hashing or sorting.

You have a bug in your description: counting takes O(n) time, but sorting takes O(m*lg(m)), where m is the number of unique words. This is usually much smaller than the total number of words, so you should probably just optimize how the hash is built.

Your problem is the same as this one:
http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/
Use a trie and a min heap to solve it efficiently.

If what you're after is the list of k most frequent words in your text for any practical k and for any natural language, then the complexity of your algorithm is not relevant.
Just sample, say, a few million words from your text, process that with any algorithm in a matter of seconds, and the most frequent counts will be very accurate.
As a side note, the complexity of the dummy algorithm (1. count all 2. sort the counts 3. take the best) is O(n+m*log(m)), where m is the number of different words in your text. log(m) is much smaller than (n/m), so it remains O(n).
In practice, the slow step is counting.

Utilize a memory-efficient data structure to store the words.
Use a MaxHeap to find the top K frequent words.
Here is the code
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import com.nadeem.app.dsa.adt.Trie;
import com.nadeem.app.dsa.adt.Trie.TrieEntry;
import com.nadeem.app.dsa.adt.impl.TrieImpl;
public class TopKFrequentItems {
private int maxSize;
private Trie trie = new TrieImpl();
private PriorityQueue<TrieEntry> maxHeap;
public TopKFrequentItems(int k) {
this.maxSize = k;
this.maxHeap = new PriorityQueue<TrieEntry>(k, maxHeapComparator());
}
private Comparator<TrieEntry> maxHeapComparator() {
return new Comparator<TrieEntry>() {
@Override
public int compare(TrieEntry o1, TrieEntry o2) {
return o1.frequency - o2.frequency;
}
};
}
public void add(String word) {
this.trie.insert(word);
}
public List<TopK> getItems() {
for (TrieEntry trieEntry : this.trie.getAll()) {
if (this.maxHeap.size() < this.maxSize) {
this.maxHeap.add(trieEntry);
} else if (this.maxHeap.peek().frequency < trieEntry.frequency) {
this.maxHeap.remove();
this.maxHeap.add(trieEntry);
}
}
List<TopK> result = new ArrayList<TopK>();
for (TrieEntry entry : this.maxHeap) {
result.add(new TopK(entry));
}
return result;
}
public static class TopK {
public String item;
public int frequency;
public TopK(String item, int frequency) {
this.item = item;
this.frequency = frequency;
}
public TopK(TrieEntry entry) {
this(entry.word, entry.frequency);
}
@Override
public String toString() {
return String.format("TopK [item=%s, frequency=%s]", item, frequency);
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + frequency;
result = prime * result + ((item == null) ? 0 : item.hashCode());
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
TopK other = (TopK) obj;
if (frequency != other.frequency)
return false;
if (item == null) {
if (other.item != null)
return false;
} else if (!item.equals(other.item))
return false;
return true;
}
}
}
Here is the unit test:
@Test
public void test() {
TopKFrequentItems stream = new TopKFrequentItems(2);
stream.add("hell");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hero");
stream.add("hero");
stream.add("hero");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("home");
stream.add("go");
stream.add("go");
assertThat(stream.getItems()).hasSize(2).contains(new TopK("hero", 3), new TopK("hello", 8));
}
For more details, refer to this test case.

Use a hash table to record all words' frequencies while traversing the whole word sequence; the key is "word" and the value is "word-frequency". This takes O(n) time and is the same as everyone explained above.
While inserting into the hash map, keep a TreeSet (specific to Java; there are implementations in every language) of size 10 (k = 10) holding the top 10 frequent words. While its size is less than 10, keep adding to it. Once the size equals 10, compare the inserted element with the minimum element (i.e., the first element); if it is greater, remove the minimum and insert the new element.
To restrict the size of the TreeSet, see this link. A sketch is given below.
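A small Java sketch of this idea, with k as a parameter instead of the hard-coded 10. Ties are broken by the word itself so that entries with equal counts are not collapsed by the TreeSet:
import java.util.*;

public class TreeSetTopK {
    public static List<String> topK(Map<String, Integer> freq, int k) {
        // ordered by count, then by word, so the first element is always the current minimum
        TreeSet<Map.Entry<String, Integer>> top = new TreeSet<>((a, b) -> {
            int byCount = Integer.compare(a.getValue(), b.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (top.size() < k) {
                top.add(e);
            } else if (top.first().getValue() < e.getValue()) {
                top.pollFirst();   // drop the current minimum
                top.add(e);
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : top.descendingSet()) result.add(e.getKey());
        return result;   // most frequent first
    }
}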

Suppose we have a word sequence "ad" "ad" "boy" "big" "bad" "com" "come" "cold". And K=2.
As you mentioned "partitioning using the first letter of words", we get
("ad", "ad") ("boy", "big", "bad") ("com" "come" "cold")
"then partitioning the largest multi-word set using the next character until you have k single-word sets."
it will partition ("boy", "big", "bad") and ("com", "come", "cold"), but the first partition ("ad", "ad") is missed, even though "ad" is actually the most frequent word.
Perhaps I misunderstand your point. Can you please detail your partitioning process?

I believe this problem can be solved by an O(n) algorithm. We could do the sorting on the fly. In other words, the sorting in that case is a sub-problem of the traditional sorting problem, since only one counter gets incremented by one every time we access the hash table. Initially, the list is sorted since all counters are zero. As we keep incrementing counters in the hash table, we bookkeep another array of hash values ordered by frequency, as follows: every time we increment a counter, we check its index in the ranked array and check whether its count exceeds its predecessor in the list; if so, we swap the two elements. This gives a solution that is at most O(n), where n is the number of words in the original text.

I was struggling with this as well and got inspired by @aly. Instead of sorting afterwards, we can just maintain a presorted list of words (List<Set<String>>), where a word sits in the set at position X, X being the word's current count. In general, here's how it works:
For each word, store it as part of a map of its occurrences: Map<String, Integer>.
Then, based on the count, remove it from the previous count's set and add it to the new count's set.
The drawback is that the list may be big; this can be optimized by using a TreeMap<Integer, Set<String>>, but that adds some overhead. Ultimately we can use a mix of a HashMap or our own data structure.
The code
public class WordFrequencyCounter {
private static final int WORD_SEPARATOR_MAX = 32; // UNICODE 0000-0020: control chars and space
Map<String, MutableCounter> counters = new HashMap<String, MutableCounter>();
List<Set<String>> reverseCounters = new ArrayList<Set<String>>();
private static class MutableCounter {
int i = 1;
}
public List<String> countMostFrequentWords(String text, int max) {
int lastPosition = 0;
int length = text.length();
for (int i = 0; i <= length; i++) {
char c = (i < length) ? text.charAt(i) : ' '; // treat the end of the text as a separator so the last word is counted
if (c <= WORD_SEPARATOR_MAX) {
if (i != lastPosition) {
String word = text.substring(lastPosition, i);
MutableCounter counter = counters.get(word);
if (counter == null) {
counter = new MutableCounter();
counters.put(word, counter);
} else {
Set<String> strings = reverseCounters.get(counter.i);
strings.remove(word);
counter.i ++;
}
addToReverseLookup(counter.i, word);
}
lastPosition = i + 1;
}
}
List<String> ret = new ArrayList<String>();
int count = 0;
for (int i = reverseCounters.size() - 1; i >= 0; i--) {
Set<String> strings = reverseCounters.get(i);
for (String s : strings) {
ret.add(s);
System.out.print(s + ":" + i);
count++;
if (count == max) break;
}
if (count == max) break;
}
return ret;
}
private void addToReverseLookup(int count, String word) {
while (count >= reverseCounters.size()) {
reverseCounters.add(new HashSet<String>());
}
Set<String> strings = reverseCounters.get(count);
strings.add(word);
}
}
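A quick, hypothetical usage example for the class above (words are split on whitespace and control characters; the result lists words from most to least frequent):
import java.util.List;

public class WordFrequencyCounterDemo {
    public static void main(String[] args) {
        WordFrequencyCounter counter = new WordFrequencyCounter();
        // "the" occurs three times, everything else once
        List<String> top = counter.countMostFrequentWords("the cat sat on the mat the end ", 2);
        System.out.println(top);   // expected to start with "the"
    }
}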

I just found another solution to this problem, but I am not sure it is right.
Solution:
Use a hash table to record all words' frequencies: T(n) = O(n)
Choose the first k elements of the hash table and store them in a buffer (whose space = k): T(n) = O(k)
Each time, first find the current minimum element of the buffer, and compare it with the remaining (n - k) elements of the hash table one by one. If an element of the hash table is greater than the buffer's minimum, drop the current minimum and add that element to the buffer. Finding the minimum in the buffer each time takes T(n) = O(k), and traversing the whole hash table takes T(n) = O(n - k), so the whole time complexity for this process is T(n) = O((n - k) * k).
After traversing the whole hash table, the result is in this buffer.
The whole time complexity: T(n) = O(n) + O(k) + O(kn - k^2) = O(kn + n - k^2 + k). Since k is usually much smaller than n, this solution's time complexity is T(n) = O(kn), which is linear when k is really small. Is that right? I am really not sure.

Try to think of special data structures to approach this kind of problem. In this case, a special kind of tree like a trie that stores strings in a specific way is very efficient. A second way is to build your own solution, like counting words. I guess this TB of data would be in English, and we have around 600,000 words in general, so it's possible to store only those words and count which strings are repeated; this solution will need a regex to eliminate some special characters. The first solution will be faster, I'm pretty sure.
http://en.wikipedia.org/wiki/Trie
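A minimal counting-trie sketch along those lines (one node per character, a counter on the node where a word ends; insertion is O(word length) and common prefixes are shared):
import java.util.HashMap;
import java.util.Map;

public class CountingTrie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        int count;   // number of times a word ending at this node was inserted
    }

    private final Node root = new Node();

    public void insert(String word) {
        Node cur = root;
        for (char c : word.toCharArray())
            cur = cur.children.computeIfAbsent(c, x -> new Node());
        cur.count++;
    }

    public int count(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return 0;
        }
        return cur.count;
    }
}
To get the top K you would still walk the trie (or keep a separate heap of size K while counting), as the other trie-based answers here do.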

This is an interesting idea to search and I could find this paper related to Top-K https://icmi.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf
Also there is an implementation of it here.

The simplest code to get the occurrence count of the most frequently used word:
function strOccurence(str){
var arr = str.split(" ");
var length = arr.length, temp = {};
while(length--){
if(temp[arr[length]] == undefined && arr[length].trim().length > 0)
{
temp[arr[length]] = 1;
}
else if(arr[length].trim().length > 0)
{
temp[arr[length]] = temp[arr[length]] + 1;
}
}
console.log(temp);
var max = [];
for(i in temp)
{
max[temp[i]] = i;
}
console.log(max[max.length - 1])
//if you want second highest
console.log(max[max.length - 2])
}

In these situations, I recommend using Java's built-in features, since they are already well tested and stable. In this problem, I find the repetitions of the words by using a HashMap data structure. Then I push the results into an array of objects, sort the objects with Arrays.sort(), and print the top k words and their repetitions.
import java.io.*;
import java.lang.reflect.Array;
import java.util.*;
public class TopKWordsTextFile {
static class SortObject implements Comparable<SortObject>{
private String key;
private int value;
public SortObject(String key, int value) {
super();
this.key = key;
this.value = value;
}
@Override
public int compareTo(SortObject o) {
//descending order
return o.value - this.value;
}
}
public static void main(String[] args) {
HashMap<String,Integer> hm = new HashMap<>();
int k = 1;
try {
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("words.in")));
String line;
while ((line = br.readLine()) != null) {
// process the line.
//System.out.println(line);
String[] tokens = line.split(" ");
for(int i=0; i<tokens.length; i++){
if(hm.containsKey(tokens[i])){
//If the key already exists
Integer prev = hm.get(tokens[i]);
hm.put(tokens[i],prev+1);
}else{
//If the key doesn't exist
hm.put(tokens[i],1);
}
}
}
//Close the input
br.close();
//Print all words with their repetitions. You can use 3 for printing top 3 words.
k = hm.size();
// Get a set of the entries
Set set = hm.entrySet();
// Get an iterator
Iterator i = set.iterator();
int index = 0;
// Display elements
SortObject[] objects = new SortObject[hm.size()];
while(i.hasNext()) {
Map.Entry e = (Map.Entry)i.next();
//System.out.print("Key: "+e.getKey() + ": ");
//System.out.println(" Value: "+e.getValue());
String tempS = (String) e.getKey();
int tempI = (int) e.getValue();
objects[index] = new SortObject(tempS,tempI);
index++;
}
System.out.println();
//Sort the array
Arrays.sort(objects);
//Print top k
for(int j=0; j<k; j++){
System.out.println(objects[j].key+":"+objects[j].value);
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
For more information, please visit https://github.com/m-vahidalizadeh/foundations/blob/master/src/algorithms/TopKWordsTextFile.java. I hope it helps.

C++11 implementation of the above idea:
class Solution {
public:
vector<int> topKFrequent(vector<int>& nums, int k) {
unordered_map<int,int> map;
for(int num : nums){
map[num]++;
}
vector<int> res;
// we use a priority queue as a max-heap; we keep the (size - k) smallest elements in the queue
// pair<first, second>: first is the frequency, second is the number
priority_queue<pair<int,int>> pq;
for(auto it = map.begin(); it != map.end(); it++){
pq.push(make_pair(it->second, it->first));
// once the size grows bigger than size - k, we pop the top, which is one of the top k most frequent values
if(pq.size() > (int)map.size() - k){
res.push_back(pq.top().second);
pq.pop();
}
}
return res;
}
};

Related

Quick Sort Time Complexity Best Case Input

I have to find the time complexity of quicksort for the BEST CASE INPUT in a C program, and I have selected the last element of the array as the pivot.
Now I know what input values I have to enter for the best case, i.e., put the middle element in the last place (the pivot), and the next pivot should be the next middle element.
But I have to generate this kind of best-case input array at very big sizes like 1000, 5000, 100000, ... for quicksort.
I can code, but can anyone please help me understand how to generate that kind of best-case input array for quicksort with the last element as pivot, using C?
I just need the logic for how to generate that kind of array in C.
Basically you need to do a divide & conquer approach akin to quicksort itself. Do it with a function that given a range of indices in the output:
generates the first-half partition by recursively calling itself
generates the second-half partition by recursively calling itself
inserts the pivot value after the second-half partition.
One thing to note is that since you are just generating output not sorting anything, you don't actually have to have any values as input -- you can just represent ranges logically as a start value at some index in the array and a count.
Some C# code is below; this is untested -- don't look if you want to do this yourself.
static int[] GenerateBestCaseQuickSort(int n)
{
var ary = new int[n];
GenerateBestCaseQuickSortAux(ary, 0, n, 1);
return ary;
}
static void GenerateBestCaseQuickSortAux(int[] ary, int start_index, int count, int start_value)
{
if (count == 0)
return;
if (count == 1)
{
ary[start_index] = start_value;
return;
}
int partition1_count = count / 2;
int partition2_count = count - partition1_count - 1; // need to save a spot for the pivot so -1...
int pivot_value_index = start_index + partition1_count;
int pivot_value = start_value + partition1_count;
GenerateBestCaseQuickSortAux(ary, start_index, partition1_count, start_value);
GenerateBestCaseQuickSortAux(ary, pivot_value_index, partition2_count, pivot_value+1);
ary[start_index + count - 1] = pivot_value;
}

Fuse tuples to find equivalence classes

Suppose we have a finite domain D={d1,..dk} containing k elements.
We consider S a subset of D^n, i.e. a set of tuples of the form < a1,..,an >, with ai in D.
We want to represent it (compactly) using S' a subset of 2^D^n, i.e. a set of tuples of the form < A1,..An > with Ai being subsets of D. The implication is that for any tuple s' in S' all elements in the cross product of Ai exist in S.
For instance, consider D={a,b,c} so k=3, n=2 and the tuples S=< a,b >+< a,c >+< b,b >+< b,c >.
We can use S'=<{a,b},{b,c}> to represent S.
This singleton solution is also minimal, S'=<{a},{b,c}>+<{b},{b,c}> is also a solution but it is larger, therefore less desirable.
Some sizes, in concrete instances, that we need to handle : k ~ 1000 elements in the domain D, n <= 10 relatively small (main source of complexity), |S| ranging to large values > 10^6.
A naive approach consists in first plunging S into the domain of S', 2^D^n, then fusing tuples two by two with the following test: two tuples s1, s2 in S' can be fused into a single tuple in S' iff they differ in only one component.
e.g.
< a,b >+< a,c > -> <{a},{b,c}> (differ on second component)
< b,b >+< b,c > -> <{b},{b,c}> (differ on second component)
<{a},{b,c}> + <{b},{b,c}> -> <{a,b},{b,c}> (differ on first component)
Now there could be several minimal S', we are interested in finding any one, and approximations of minimisation of some kind are also ok, provided they don't give wrong results (i.e. even if S' is not as small as it could be, but we get very fast results).
The naive algorithm has to deal with the fact that any newly introduced "fused" tuple could match some other tuple, so it scales really badly on large input sets, even with n remaining low. You need |S'|^2 comparisons to ensure convergence, and every time I fuse two elements, I currently retest every pair (how can I improve that?).
A lot of efficiency is iteration order dependent, so sorting the set in some way(s) could be an option, or perhaps indexing using hashes, but I'm not sure how to do it.
Imperative pseudo code would be ideal, or pointers to a reformulation of the problem to something I can run a solver on would really help.
Here's some pseudo (C# code that I haven't tested) that demonstrates your S'=<{a},{b,c}>+<{b},{b,c}> method. The space requirements, when using an integer index for the elements, are negligible, and the overall efficiency and speed of Add'ing and Test'ing tuples should be extremely fast. If you want a practical solution then you already have one; you just have to use the correct ADTs.
ElementType[] domain = new ElementType[]; // a simple array of domain elements
FillDomain(domain); // insert all domain elements
SortArray(domain); // sort the domain elements K log K time
SortedDictionary<int, HashSet<int>> subsets; // int's are index/ref into domain
subsets = new SortedDictionary<int, HashSet<int>>();
//
void AddTuple(SortedDictionary<int, HashSet<int>> tuples, ElementType[] domain, ElementType first, ElementType second) {
int a = BinarySearch(domain, first); // log K time (binary search)
int b = BinarySearch(domain, second); // log K time (binary search)
if(tuples.ContainsKey(a)) { // log N time (binary search on sorted keys)
if(!tuples[a].Contains(b)) { // constant time (hash lookup)
tuples[a].Add(b); // constant time (hash add)
}
} else { // constant time (instance + hash add)
tuples[a] = new HashSet<int>();
tuples[a].Add(b);
}
}
//
bool ContainsTuple(SortedDictionary<int, HashSet<int>> tuples, ElementType[] domain, ElementType first, ElementType second) {
int a = BinarySearch(domain, first); // log K time (binary search)
int b = BinarySearch(domain, second); // log K time (binary search)
if(tuples.ContainsKey(a)) { // log N time (binary search on sorted keys)
if(tuples[a].Contains(b)) { // constant time (hash test)
return true;
}
}
return false;
}
The space savings from optimizing your tuple subset S' won't outweigh the slowdown of the optimization process itself. For size optimization, if you know your K will be less than 65536 you could use short integers instead of integers in the SortedDictionary and HashSet. But even 50 million integers only take up 4 bytes per 32-bit integer * 50 million ~= 200 MB.
EDIT
Here's another approach: by encoding/mapping your tuples to a string you can take advantage of binary string comparison and the fact that UTF-16 / UTF-8 encoding is very size-efficient. Again, this still doesn't do the merging optimization you want, but speed and efficiency would be pretty good.
Here's some quick pseudo code in JavaScript.
Array.prototype.binarySearch = function(elm) {
var l = 0, h = this.length - 1, i;
while(l <= h) {
i = (l + h) >> 1;
if(this[i] < elm) l = ++i;
else if(this[i] > elm) h = --i;
else return i;
}
return -(++l);
};
// map your ordered domain elements to characters
// For example JavaScript's UTF-16 should be fine
// UTF-8 would work as well
var domain = {
"a": String.fromCharCode(1),
"b": String.fromCharCode(2),
"c": String.fromCharCode(3),
"d": String.fromCharCode(4)
}
var tupleStrings = [];
// map your tuple to the string encoding
function map(tuple) {
var str = "";
for(var i=0; i<tuple.length; i++) {
str += domain[tuple[i]];
}
return str;
}
function add(tuple) {
var str = map(tuple);
// binary search
var index = tupleStrings.binarySearch(str);
if(index < 0) index = ~index;
// insert depends on tupleString's type implementation
tupleStrings.splice(index, 0, str);
}
function contains(tuple) {
var str = map(tuple);
// binary search
return tupleStrings.binarySearch(str) >= 0;
}
add(["a","b"]);
add(["a","c"]);
add(["b","b"]);
add(["b","c"]);
add(["c","c"]);
add(["d","a"]);
alert(contains(["a","a"]));
alert(contains(["d","a"]));
alert(JSON.stringify(tupleStrings, null, "\n"));

Finding the longest sub-string with no repetition in a string. Time Complexity?

I recently interviewed with a company for a software engineering position and was asked the question of the longest unique substring in a string. My algorithm was as follows:
Start from the left-most character, and keep storing each character in a hash table with the character as the key and the index where it last occurred as the value. Add the character to the answer string as long as it's not present in the hash table. If we encounter a stored character again, I stop and note down the length. I empty the hash table and then start again from the index just after the repeated character's previous occurrence, retrieved from the (index_where_it_last_occurred) value. If I ever reach the end of the string, I stop and return the longest length.
For example, say the string was, abcdecfg.
I start with a and store it in the hash table. I store b and so on till e; their indexes are stored as well. When I encounter c again, I stop since it's already hashed and note down the length, which is 5. I empty the hash table and start again from the index just after the repeated character's previous occurrence. The repeated character being c, I start again from position 3, i.e., the character d. I keep doing this until I reach the end of the string.
I am interested in knowing what the time complexity of this algorithm will be. IMO it'll be O(n^2).
This is the code.
import java.util.*;
public class longest
{
static int longest_length = -1;
public static void main(String[] args)
{
Scanner in = new Scanner(System.in);
String str = in.nextLine();
calc(str,0);
System.out.println(longest_length);
}
public static void calc(String str,int index)
{
if(index >= str.length()) return;
int temp_length = 0;
LinkedHashMap<Character,Integer> map = new LinkedHashMap<Character,Integer>();
for (int i = index; i<str.length();i++)
{
if(!map.containsKey(str.charAt(i)))
{
map.put(str.charAt(i),i);
++temp_length;
}
else if(map.containsKey(str.charAt(i)))
{
if(longest_length < temp_length)
{
longest_length = temp_length;
}
int last_index = map.get(str.charAt(i));
// System.out.println(last_index);
calc(str,last_index+1);
break;
}
}
if(longest_length < temp_length)
longest_length = temp_length;
}
}
If the alphabet is of size K, then when you restart counting you jump back at most K-1 places, so you read each character of the string at most K times. So the algorithm is O(nK).
The input string which contains n/K copies of the alphabet exhibits this worst-case behavior. For example if the alphabet is {a, b, c}, strings of the form "abcabcabc...abc" have the property that nearly every character is read 3 times by your algorithm.
You can solve the original problem in O(K+n) time, using O(K) storage space by using dynamic programming.
Let the string be s. We'll keep a number M, which will be the length of the maximum unique-char string ending at i; P, which stores where each character was previously seen; and best, the length of the longest unique-char string found so far.
Start:
Set P[c] = -1 for each c in the alphabet.
M = 0
best = 0
Then, for each i:
M = min(M+1, i-P[s[i]])
best = max(best, M)
P[s[i]] = i
This is trivially O(K) in storage, and O(K+n) in running time.
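A direct Java translation of that recurrence (assuming an ASCII alphabet, so K = 128):
public class LongestUniqueSubstring {
    public static int longest(String s) {
        int[] P = new int[128];                // last index at which each character was seen
        java.util.Arrays.fill(P, -1);
        int M = 0, best = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            M = Math.min(M + 1, i - P[c]);     // extend, or cut at the previous occurrence
            best = Math.max(best, M);
            P[c] = i;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longest("abcdecfg"));   // prints 5, matching the example above
    }
}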

Lazy Shuffle Algorithms

I have a large list of elements that I want to iterate in random order. However, I cannot modify the list and I don't want to create a copy of it either, because 1) it is large and 2) it can be expected that the iteration is cancelled early.
List<T> data = ...;
Iterator<T> shuffled = shuffle(data);
while (shuffled.hasNext()) {
T t = shuffled.next();
if (System.console().readLine("Do you want %s?", t).startsWith("y")) {
return t;
}
}
System.out.println("That's all");
return t;
I am looking for an algorithm where the code above would run in O(n) (and preferably require only O(log n) space), so caching the elements that were produced earlier is not an option. I don't care if the algorithm is biased (as long as it's not obvious).
(I use pseudo-Java in my question, but you can use other languages if you wish.)
Here is the best I got so far.
Iterator<T> shuffle(final List<T> data) {
int p = data.size();
while ((data.size() % p) == 0) p = randomPrime();
return new Iterator<T>() {
final int prime = p;
int n = 0, i = 0;
public boolean hasNext() { return i < data.size(); }
public T next() {
i++; n = (n + prime) % data.size(); // wrap around so the index stays in range
return data.get(n);
}
};
}
Iterating all elements in O(n), constant space, but obviously biased as it can produce only data.size() permutations.
The easiest shuffling approaches I know of work with indices. If the List is not an ArrayList, you may end up with a very inefficient algorithm if you try to use one of the below (a LinkedList does have get by index, but it's O(n), so you'll end up with O(n^2) time).
If O(n) space is fine, which I'm assuming it's not, I'd recommend the Fisher-Yates / Knuth shuffle, it's O(n) time and is easy to implement. You can optimise it so you only need to perform a single operation before being able to get the first element, but you'll need to keep track of the rest of the modified list as you go.
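For reference, one common way to make Fisher-Yates lazy without touching the source list (not from the answer above, just a sketch) is to record the swaps in a map, so space grows only with the number of elements actually drawn:
import java.util.*;

// Element i of the shuffled order is chosen only when next() is called.
class LazyShuffleIterator<T> implements Iterator<T> {
    private final List<T> data;                          // never modified
    private final Map<Integer, Integer> swapped = new HashMap<>();
    private final Random rnd = new Random();
    private int i = 0;

    LazyShuffleIterator(List<T> data) { this.data = data; }

    public boolean hasNext() { return i < data.size(); }

    public T next() {
        int j = i + rnd.nextInt(data.size() - i);        // pick from the untouched suffix
        int vi = swapped.getOrDefault(i, i);
        int vj = swapped.getOrDefault(j, j);
        swapped.put(j, vi);                              // virtual swap; the list stays intact
        swapped.put(i, vj);
        i++;
        return data.get(vj);
    }
}
If the iteration is cancelled early, only the consumed prefix ever takes up space, which is the case the question cares about.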
My solution:
Ok, so this is not very random at all, but I can't see a better way if you want less than O(n) space.
It takes O(1) space and O(n) time.
There may be a way to push it up the space usage a little and get more random results, but I haven't figured that out yet.
It has to do with relative primes. The idea is that, given two relatively prime numbers a (the generator) and b, when you loop through a % b, 2a % b, 3a % b, 4a % b, ..., you will see every integer 0, 1, 2, ..., b-2, b-1, and you will see each of them before seeing any integer twice. Unfortunately I don't have a link to a proof (the Wikipedia link may mention or imply it; I didn't check in too much detail).
I start off by increasing the length until we get a prime, since this implies that any other number will be relatively prime to it, which is a whole lot less limiting (and I just skip any number greater than the original length); then I generate a random number and use it as the generator.
I'm iterating through and printing out all the values, but it should be easy enough to modify to generate the next one given the current one.
Note I'm skipping 1 and len-1 with my nextInt, since these will produce 1,2,3,... and ...,3,2,1 respectively, but you can include these, but probably not if the length is below a certain threshold.
You may also want to generate a random number to multiply the generator by (mod the length) to start from.
Java code:
static Random gen = new Random();
static void printShuffle(int len)
{
// get first prime >= len
int newLen = len-1;
boolean prime;
do
{
newLen++;
// prime check
prime = true;
for (int i = 2; prime && i < len; i++)
prime &= (newLen % i != 0);
}
while (!prime);
long val = gen.nextInt(len-3) + 2;
long oldVal = val;
do
{
if (val < len)
System.out.println(val);
val = (val + oldVal) % newLen;
}
while (oldVal != val);
}
This is an old thread, but in case anyone comes across this in future, a paper by Andrew Kensler describes a way to do this in constant time and constant space. Essentially, you create a reversible hash function, and then use it (and not an array) to index the list. Kensler describes a method for generating the necessary function, and discusses "cycle walking" as a way to deal with a domain that is not identical to the domain of the hash function. Afnan Enayet's summary of the paper is here: https://afnan.io/posts/2019-04-05-explaining-the-hashed-permutation/.
You may try using a buffer to do this. Iterate through a limited set of data and put it in a buffer. Extract random values from that buffer and send it to output (or wherever you need it). Iterate through the next set and keep overwriting this buffer. Repeat this step.
You'll end up with n + n operations, which is still O(n). Unfortunately, the result will not be actually random. It will be close to random if you choose your buffer size properly.
On a different note, check these two: Python - run through a loop in non linear fashion, random iteration in Python
Perhaps there's a more elegant algorithm to do this better. I'm not sure though. Looking forward to other replies in this thread.
This is not a perfect answer to your question, but perhaps it's useful.
The idea is to use a reversible random number generator and the usual array-based shuffling algorithm done lazily: to get the i'th shuffled item, swap a[i] with a randomly chosen a[j] where j is in [i..n-1], then return a[i]. This can be done in the iterator.
After you are done iterating, reset the array to original order by "unswapping" using the reverse direction of the RNG.
The unshuffling reset will never take longer than the original iteration, so it doesn't change asymptotic cost. Iteration is still linear in the number of iterations.
How to build a reversible RNG? Just use an encryption algorithm. Encrypt the previously generated pseudo-random value to go forward, and decrypt to go backward. If you have a symmetric encryption algorithm, then you can add a "salt" value at each step forward to prevent a cycle of two and subtract it for each step backward. I mention this because RC4 is simple and fast and symmetric. I've used it before for tasks like this. Encrypting 4-byte values then computing mod to get them in the desired range will be quick indeed.
You can press this into the Java iterator pattern by extending Iterator to allow resets. See below. Usage will look like:
ShuffledList<Integer> lst = new ShuffledList<>();
... build the list with the usual operations
ResetableInterator<Integer> i = lst.iterator();
while (i.hasNext()) {
int val = i.next();
... use the randomly selected value
if (anyConditionAtAll) break;
}
i.reset(); // Unshuffle the array
I know this isn't perfect, but it will be fast and give a good shuffle. Note that if you don't reset, the next iterator will still be a new random shuffle, but the original order will be lost forever. If the loop body can generate an exception, you'd want the reset in a finally block.
class ShuffledList<T> extends ArrayList<T> implements Iterable<T> {
@Override
public Iterator<T> iterator() {
return null;
}
public interface ResetableInterator<T> extends Iterator<T> {
public void reset();
}
class ShufflingIterator<T> implements ResetableInterator<T> {
int mark = 0;
@Override
public boolean hasNext() {
return true;
}
@Override
public T next() {
return null;
}
@Override
public void remove() {
throw new UnsupportedOperationException("Not supported.");
}
@Override
public void reset() {
throw new UnsupportedOperationException("Not supported yet.");
}
}
}

Reverse-printing single-linked list

Does anybody know how to print a singly-linked list in reverse order (one pass, with RAM fixed and independent of the number of elements)?
My answer: there is no answer that solves this to your specs. It can't be multi-pass, it can't be recursive (which I think would count as a single pass), and it has to use constant memory...
I don't think you will discover a solution, and the people who said you could do it obviously have some kind of trick aspect to the question. They obviously aren't using the same set of definitions that this community is using.
I think you can do that in O(n) time and O(1) space, but technically it's not "one pass".
reverse the linked-list: O(n)
print: O(n)
reverse the linked-list back: O(n)
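A bare-bones Java sketch of that (the Node type is assumed here, not taken from the question): three O(n) passes, O(1) extra space, and the list ends up back in its original order.
public class ReversePrint {
    static class Node {
        int data;
        Node next;
        Node(int data) { this.data = data; }
    }

    // In-place reversal; returns the new head.
    static Node reverse(Node head) {
        Node prev = null;
        while (head != null) {
            Node next = head.next;
            head.next = prev;
            prev = head;
            head = next;
        }
        return prev;
    }

    static void printReversed(Node head) {
        Node reversed = reverse(head);                    // pass 1: reverse
        for (Node n = reversed; n != null; n = n.next)
            System.out.println(n.data);                   // pass 2: print
        reverse(reversed);                                // pass 3: restore the original order
    }
}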
This option assumes that you know the count (if not, that's one pass gone already); failing that, if you must use one pass, just set the count to some reasonably large maximum upper-limit value.
long count = 10000; // set this to whatever the count is, or calculate it
string filename = @"c:\foo.out";
using (StreamWriter writer = new StreamWriter(filename))
{
int index = 0;
long maxLength = 12; // set this to the max length of an item + 2 (for \r\n)
while(values.Next())
{
writer.BaseStream.Seek(maxLength * (count - index - 1), SeekOrigin.Begin);
writer.WriteLine(values[index].ToString().PadLeft(maxLength));
writer.Flush();
index++;
}
}
The output will be in the c:\foo.out file, padded with spaces. The question did not state where you need to output, or what format the output should be in (such as whether to include blank lines beforehand). Given it's a linked list, the length could be very large (> int.MaxValue), so writing the output to a file is quite a reasonable transport format.
This answer meets the O(n) write performance (and is indeed one pass), while also using no additional memory beyond the output stream, which is always going to be O(n), because how else will you fit them all on the screen?
One response to this answer would be that you can't seek backwards in the output stream; in that case just print a \r return character and seek backwards that way. Failing that, reply to the interviewer asking whether identifying or meeting impossible requirements is part of the job description.
String s ="";
for(each element el in list)
s=el.data+", "+s;
println(s);
This is one pass.
Here is a solution with the java.util.LinkedList
The memory stays the same since you remove and add the element to the same list.
I think it is reasonable to assume any decent implementation of a singly linked list will keep track of its size, head and tail.
import java.util.Arrays;
import java.util.LinkedList;
class PrintReverse {
public static void main(String[] args) {
Integer[] array = {1, 2, 3, 4};
LinkedList<Integer> list = new LinkedList<Integer>(Arrays.asList(array));
System.out.println(list);
printListBackward(list);
}
public static void printListBackward(LinkedList<Integer> list) {
int size = list.size();
for (int i = 0; i < size; i++) {
Integer n = list.removeLast();
list.push(n);
System.out.println(n.toString());
}
System.out.println(list);
}
}
Which produces the following output...
[1, 2, 3, 4]
4
3
2
1
[1, 2, 3, 4]
What do you think?
Well, you didn't say that it had to be efficient. (Plus there probably isn't a constant memory implementation that's more efficient.) Also, as commenters have pointed out, this is one-pass only when length(list) == 1.
void printReversedLinkedList(listptr root) {
listptr lastPrinted = null;
while (lastPrinted != root) {
listptr ptr = root; // start from the beginning of the list
// follow until next is EoL or already printed
while (ptr->next != null && ptr->next != lastPrinted)
ptr = ptr->next;
// print current node & update last printed
printf("%s", ptr->data);
lastPrinted = ptr;
}
}
Constant memory, O(n^2) efficiency.
void printList(listItem node) {
if (node.next != null) {
printList(node.next);
}
echoOrSystemOutPrintlnOrPrintfOrWhatever(node.data);
}
printList(rootNode);
Copy into an array, then print the array in reverse order.
Go through the singly linked list and push each item onto a stack. Pop the stack until it is empty, while printing out each element that is popped.
node = head of node list
stack = new Stack()
while ( node is not NULL )
{
stack.push( node.data )
node = node.next
}
while ( !stack.empty() )
{
print( stack.pop() )
}
I am really wondering how a singly linked list can be traversed in reverse.
