Count frequency of items in array - without two for loops - algorithm

I need to know whether there is a way to count the frequency of items in an array without using two loops, and without knowing the size of the array. If I knew the size of the array I could use a switch without looping, but I need something more versatile than that. I think modifying quicksort may give better results.
Array[n];
TwoDArray[n][2];
The first loop will go over Array[], while the second loop is to find the element and increase its count in the two-dimensional array.
int max = 0;                                   // number of distinct elements found so far
for (int i = 0; i < Array.length; i++) {
    boolean found = false;
    for (int j = 0; j < max; j++) {            // scan only the entries filled so far
        if (TwoDArray[j][0] == Array[i]) {
            TwoDArray[j][1]++;                 // element already seen: bump its count
            found = true;
            break;
        }
    }
    if (!found) {
        TwoDArray[max][0] = Array[i];          // new element: record it with count 1
        TwoDArray[max][1] = 1;
        max++;
    }
}
If you can comment or provide a better solution, that would be very helpful.

Use a map or hash table to implement this: insert the array item as the key and its frequency as the value.
Alternatively, you can use a plain array if the range of the array elements is not too large: increment the count at the index corresponding to each array element, as in the sketch below.
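A minimal sketch of the counting-array variant (assuming the elements are small non-negative integers with a known upper bound, here maxValue; the input values are just illustrative):
// Sketch: counting array, assuming elements are small non-negative integers
int[] input = {2, 3, 2, 5, 3, 2};
int maxValue = 5;                      // assumed known bound on the element values
int[] counts = new int[maxValue + 1];
for (int x : input) {
    counts[x]++;                       // index = element value, cell = its frequency
}
// counts[2] == 3, counts[3] == 2, counts[5] == 1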

I would build a map keyed by the item in the array, with a value that is the count of that item. One pass over the array builds the map that contains the counts: for each item, look its count up in the map, increment it, and put the new count back into the map.
The map put and get operations can be constant time (e.g., if you use a hash map implementation with a good hash function and properly sized backing store). This means you can compute the frequencies in time proportional to the number of elements in your array.
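For example, a minimal sketch in Java (the array contents here are just illustrative):
import java.util.HashMap;
import java.util.Map;

// One pass: look each element's count up, increment it, and put it back.
int[] array = {4, 7, 4, 9, 7, 4};
Map<Integer, Integer> counts = new HashMap<>();
for (int x : array) {
    int current = counts.getOrDefault(x, 0);
    counts.put(x, current + 1);
}
// counts: {4=3, 7=2, 9=1}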

I'm not saying this is better than using a map or hash table (especially not when there are lots of duplicates, although in that case you can get close to O(n) sorting with certain techniques, so it's not too bad either); it's just an alternative:
Sort the array
Use a (single) for-loop to iterate through the sorted array
If you find the same element as the previous one, increment the current count
If you find a different element, store the previous element and its count and set the count to 1
At the end of the loop, store the previous element and its count (a sketch follows below)
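A minimal sketch of this approach in Java (hypothetical input array; any comparison sort works for the first step):
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

int[] array = {4, 7, 4, 9, 7, 4};
int[] sorted = array.clone();
Arrays.sort(sorted);                              // O(n log n)

Map<Integer, Integer> counts = new LinkedHashMap<>();
int count = 1;
for (int i = 1; i < sorted.length; i++) {
    if (sorted[i] == sorted[i - 1]) {
        count++;                                  // same element as the previous one
    } else {
        counts.put(sorted[i - 1], count);         // different element: store previous and its count
        count = 1;
    }
}
if (sorted.length > 0) {
    counts.put(sorted[sorted.length - 1], count); // store the last element and its count
}
// counts: {4=3, 7=2, 9=1}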

Related

find endpoints for range given a value within the range

I am trying to solve a simple problem, but at the moment I cannot think of a better solution. I am testing an API that is not documented.
There is an ID used to fetch objects and it has a min and max value with random values missing in-between. I'm trying to test the responses I receive for random objects, but to find objects, I need to have valid IDs.
It would be very inefficient to test random numbers and hope that I get an object back. The best I can do is find a range, get a random number between that range and check if it exists before conducting tests.
A sample list of all of the IDs in the database might look like this:
[1005, 25984, 25986, 29587, 30000, ...]
Assuming the deviation from one value to the next will never exceed C (i.e. the difference between consecutive values is never greater than a pre-defined constant), how would you calculate the min/max of the range given only one value in the range?
Starting from a given value and looping until the last value is found is horrible, but that is how it was implemented by previous devs. Below is pseudocode that more or less covers what they do.
// this can be any valid object ID from the database
// assuming the IDs in the database are [1005, 25984, 25986, 29587, 30000]
// "i" could be any one of these values
var i = givenPredefinedObjectId;
var deviation = 100;
// objectWithIdExists() looks up an object with the ID "i" in the database;
// if there is no object with the ID "i" it returns false,
// otherwise the object gets tested and it returns true
while(objectWithIdExists(i)){
    i++;
}
// "i" no longer exists; probe the next "deviation" IDs for another valid one
for(var limit = i + deviation; i < limit; i++){
    if(objectWithIdExists(i)){
        goto while loop; // a valid ID was found, keep scanning from it
    }
}
endPoint = i - deviation; // back to where the while loop stopped
Assuming there is no knowledge about the possible values except you can check if they exist and you are given one valid value (there is no array with all possible IDs, that was just an example), how would you find the min/max values?
Unbounded binary search is feasible, with a factor-of-C slowdown. Take an algorithm for unbounded binary search that, given access to an oracle less_equal(n) for some unknown natural number n, returns n in time O(log n). Implement the oracle on input k by querying all of the IDs C*k, C*k+1, ..., C*k+C-1 and reporting that k is less than or equal to n if and only if at least one of those IDs is found. The running time is O(C*log((max-min)/C)).
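A sketch of how this could look in code (my own illustration, not the poster's code; the existence check from the question is passed in as a predicate, and only the upper endpoint is searched for, since the lower endpoint is symmetric):
import java.util.function.LongPredicate;

class MaxIdFinder {
    // Find the largest valid ID reachable from a known-good start ID, assuming
    // consecutive valid IDs never differ by more than c. "exists" stands in for
    // objectWithIdExists() from the question.
    static long findMaxId(long start, long c, LongPredicate exists) {
        // Galloping phase: double the block index until an empty block of width c
        // is found. Because gaps never exceed c, an empty block means no valid ID
        // lies above it, so "block k contains a valid ID" is a monotone predicate.
        long lo = 0, hi = 1;
        while (blockHasId(start, c, hi, exists)) {
            lo = hi;
            hi *= 2;
        }
        // Binary search for the last non-empty block between lo (non-empty) and hi (empty).
        while (lo + 1 < hi) {
            long mid = lo + (hi - lo) / 2;
            if (blockHasId(start, c, mid, exists)) lo = mid; else hi = mid;
        }
        // Linear scan of the last non-empty block for the largest existing ID.
        long max = start;
        for (long id = start + lo * c; id < start + (lo + 1) * c; id++) {
            if (exists.test(id)) max = id;
        }
        return max;
    }

    // Does any ID in [start + k*c, start + (k+1)*c - 1] exist?
    static boolean blockHasId(long start, long c, long k, LongPredicate exists) {
        for (long id = start + k * c; id < start + (k + 1) * c; id++) {
            if (exists.test(id)) return true;
        }
        return false;
    }
}
Each block probe costs O(C) existence checks and there are O(log((max-start)/C)) probes, matching the O(C*log((max-min)/C)) bound above.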

how can I get the location for the maximum value in fortran?

I have a 250*2001 matrix. I want to find the location of the maximum value of a(:,i), where i takes 5 different values in steps of 256:
a(:,256)
a(:,512)
a(:,768)
a(:,1024)
a(:,1280)
I tried using MAXLOC, but since I'm new to fortran, I couldn't get it right.
Try this
maxloc(a(:,256:1280:256))
but be warned, this call will return a value in the range 1..5 for the second dimension. The call will return the index of the maxloc in the 250*5 array section that you pass to it, so to get the column index of the location in the original array you'll have to do some multiplication. And note that since the argument in the call to maxloc is a rank-2 array section, the call will return a 2-element vector.
Your question is a little unclear: it could be either of two things you want.
One value for the maximum over the entire 250-by-5 subarray;
One value for the maximum in each of the 5 250-by-1 subarrays.
Your comments suggest you want the latter, and there is already an answer for the former.
So, in case it is the latter:
b(1:5) = MAXLOC(a(:,256:1280:256), DIM=1)

Hadoop : Number of input records for reducer

Is there any way by which each reducer process could determine the number of elements or records it has to process?
Short answer - ahead of time, no; the reducer has no knowledge of how many values the iterable is backed by. The only way you can do this is to count as you iterate, but you can't then re-iterate over the iterable again.
Long answer - backing the iterable is actually a sorted byte array of the serialized key / value pairs. The reducer has two comparators - one to sort the key/value pairs in key order, then a second to determine the boundary between keys (known as the key grouper). Typically the key grouper is the same as the key ordering comparator.
When iterating over the values for a particular key, the underlying context examines the next key in the array and compares it to the previous key using the grouping comparator. If the comparator determines they are equal, then iteration continues. Otherwise iteration for this particular key ends. So you can see that you cannot determine ahead of time how many values you will be passed for any particular key.
You can actually see this in action if you create a composite key, say a Text/IntWritable pair. In the compareTo method, sort first by the Text and then by the IntWritable field. Next create a Comparator to be used as the group comparator, which only considers the Text part of the key. Now as you iterate over the values in the reducer, you should be able to observe the IntWritable part of the key changing with each iteration.
Some code I've used before to demonstrate this scenario can be found on this pastebin.
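For reference, a hypothetical sketch of such a grouping comparator (not the pastebin code; CompositeKey and its getText() accessor are assumed names for a WritableComparable holding a Text and an IntWritable):
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups composite keys by their Text part only, so all composite keys that share
// the same Text go to a single reduce() call while the IntWritable part still varies
// as you iterate the values.
public class TextOnlyGroupingComparator extends WritableComparator {

    protected TextOnlyGroupingComparator() {
        super(CompositeKey.class, true); // true => instantiate keys for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        CompositeKey left = (CompositeKey) a;
        CompositeKey right = (CompositeKey) b;
        return left.getText().compareTo(right.getText()); // ignore the IntWritable part
    }
}

// Wired into the job with, e.g.:
// job.setGroupingComparatorClass(TextOnlyGroupingComparator.class);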
Your reducer class must extend the MapReduce Reducer class:
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
and then must implement the reduce method using the KEYIN/VALUEIN arguments specified in the extended Reducer class:
reduce(KEYIN key, Iterable<VALUEIN> values,
org.apache.hadoop.mapreduce.Reducer.Context context)
The values associated with a given key can be counted via
int count = 0;
Iterator<VALUEIN> it = values.iterator();
while (it.hasNext()) {
    it.next();   // consume the value; we only need the count
    count++;
}
Though I'd propose doing this counting alongside your other processing, so as not to make two passes through your value set.
EDIT
Here's an example vector of vectors that will dynamically grow as you add to it (so you won't have to statically declare your arrays, and hence don't need the size of the values set). This will work best for non-regular data (i.e. the number of columns is not the same for every row in your input CSV file), but will have the most overhead.
Vector<Vector<String>> table = new Vector<>();
Iterator<Text> it = values.iterator();
while (it.hasNext()) {
    Text t = it.next();
    String[] cols = t.toString().split(",");
    int i = 0;
    Vector<String> row = new Vector<>();                     // new vector will be our row
    while (i < cols.length && StringUtils.isNotEmpty(cols[i])) { // StringUtils from Apache Commons Lang
        row.addElement(cols[i++]);                           // add a column for every value in the CSV row
    }
    table.addElement(row);
}
Then you can access the Mth column of the Nth row via
table.get(N).get(M);
Now, if you knew the number of columns was fixed, you could modify this to use a Vector of arrays, which would probably be a little faster / more space-efficient. For example:
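A rough sketch of that variant (assuming every row has the same, known number of columns; it reuses the values iterator from the snippet above):
// Sketch: fixed column count, so each row can be a plain String[] instead of a Vector
Vector<String[]> table = new Vector<>();
Iterator<Text> it = values.iterator();
while (it.hasNext()) {
    String[] cols = it.next().toString().split(",");
    table.addElement(cols);            // one array per CSV row
}
// Mth column of the Nth row:
// table.get(N)[M];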

Store and update huge (and sparse?) multi-dimensional array efficiently to count conditional probabilities

Just for fun I would like to count the conditional probabilities that a word (from a natural language) appears in a text, depending on the last and next-to-last word. I.e. I would take a huge bunch of e.g. English texts and count how often each combination n(i|jk) and n(jk) appears (where j,k,i are successive words).
The naive approach would be to use a 3-D array (for n(i|jk)), using a mapping of words to position in 3 dimensions. The position look-up could be done efficiently using tries (at least that's my best guess), but already for O(1000) words I would run into memory constraints. But I guess that this array would be only sparsely filled, most entries being zero, and I would thus waste lots of memory. So no 3-D array.
What data structure would be suited better for such a use case and still be efficient to do a lot of small updates like I do them when counting the appearances of the words? (Maybe there is a completely different way of doing this?)
(Of course I also need to count n(jk), but that's easy, because it's only 2-D :)
The language of choice is C++ I guess.
C++ code:
#include <map>
using std::map;

struct bigram_key{
    int i, j;   // words - indexes of the words in a dictionary
    // a constructor to be easily constructible
    bigram_key(int a_i, int a_j): i(a_i), j(a_j){}
    // you need to sort keys to be used in a map container
    bool operator<(bigram_key const &other) const{
        return i<other.i || (i==other.i && j<other.j);
    }
};
struct bigram_data{
    int count;                    // n(ij)
    map<int, int> trigram_counts; // n(k|ij) = trigram_counts[k]
};
map<bigram_key, bigram_data> trigrams;
The dictionary could be a vector of all found words like:
vector<string> dictionary;
but for better lookup word->index it could be a map:
map<string, int> dictionary;
When you read a new word, you add it to the dictionary and get its index k. You already have the indexes i and j of the previous two words, so then you just do:
trigrams[bigram_key(i,j)].count++;
trigrams[bigram_key(i,j)].trigram_counts[k]++;
For better performance you may search for bigram only once:
bigram_data &bigram = trigrams[bigram_key(i,j)];
bigram.count++;
bigram.trigram_counts[k]++;
Is it understandable? Do you need more details?

Is there an easy way to have a "mode" function on an array of singles in vb6?

I need to run "mode" (which value occurs most frequently) on an array of singles in VB6. Is there a quick way to do this on large arrays?
Have a look online for a decent implementation of a sort algorithm for VB6 (I can't believe it doesn't have one built in!), sort the array, and then go through it counting the occurrences (which will be straightforward as you've all the same items together in the array) - keep a track of the most frequently occurring item on your way through and you're done. This should be O(n ln(n)) - that is, fast enough - if you've used a decent sort algorithm (quicksort or similar).
You could use a hash table. Hash all of the elements of your array (which is O(n)). You'll need a back-end data structure to hold the unique values that each hash bin contains and the number of occurrences (some sort of associative memory similar to the C++ std::map). As long as you can guarantee that there will be no more than a constant number m of collisions (for dissimilar hash input values) in any given bin, this is O(m log m), but since m is constant, this is really O(1). This assumption may not be reasonable, but the key is to get good enough spread for your input values.
To pull out the mode, examine all of the elements in the hash table, which will be the values that occur in your original input array and the number of times they occur. Find the value with the largest number of occurrences (again O(n)). Total complexity is O(n) if you can find a suitable hash function. Worst-case performance will be O(n log n) if the hash function doesn't provide you with good collision performance.
On another note, .Net provides a large runtime library that might make this easier. If it's feasible, you might want to consider using a new version of VB.
I included a reference to Microsoft Scripting Runtime and used a Dictionary object to keep a tally of the frequencies, then looked for the index of the highest frequency; the corresponding key is the mode. Not the quickest/most elegant solution, but I just needed something up fast that worked.
Function fnModeSingle(ByRef pValues() As Single) As Single
    Dim dict As Dictionary
    Set dict = New Dictionary
    dict.CompareMode = BinaryCompare

    Dim i As Long
    Dim pCurVal As Single
    For i = 0 To UBound(pValues)
        'limit the values that have to be analyzed to desired precision
        pCurVal = Round(pValues(i), 2)
        If (pCurVal > 0) Then
            'this will create a dictionary entry if it doesn't exist
            dict.Item(pCurVal) = dict.Item(pCurVal) + 1
        End If
    Next

    'find index of first largest frequency
    Dim KeyArray, itemArray
    KeyArray = dict.Keys
    itemArray = dict.Items

    Dim pCount As Long
    pCount = 0
    Dim pModeIdx As Integer

    'find index of mode
    For i = 0 To UBound(itemArray)
        If (itemArray(i) > pCount) Then
            pCount = itemArray(i)
            pModeIdx = i
        End If
    Next

    'get value corresponding to selected mode index
    fnModeSingle = KeyArray(pModeIdx)

    Set dict = Nothing
End Function
