Related
I hope this isn't more of a statistics question...
Suppose I have an interface:
public interface PairValidatable<T>
{
public boolean isValidWith(T);
}
Now if I have a large array of PairValidatables, how do I find the largest subset of that array where every pair passes the isValidWith test?
To clarify, if there are three entries in a subset, then elements 0 and 1 should pass isValidWith, elements 1 and 2 should pass isValidWith, and elements 0 and 2 should pass isValidWith.
Example,
public class Point implements PairValidatable<Point>
{
int x;
int y;
public Point(int xIn, int yIn)
{
x = xIn;
y = yIn;
}
public boolean isValidWith(Point other)
{
//whichever has the greater x must have the lesser (or equal) y
return x > other.x != y > other.y;
}
}
The intuitive idea is to keep a vector of Points, add array element 0, and compare each remaining array element to the vector if it passes the validation with every element in the vector, adding it to the vector if so... but the problem is that element 0 might be very restrictive. For example,
Point[] arr = new Point[5];
arr[0] = new Point(1000, 1000);
arr[1] = new Point(10, 10);
arr[2] = new Point(15, 7);
arr[3] = new Point(3, 6);
arr[4] = new Point(18, 6);
Iterating through as above would give us a subset containing only element 0, but the subset of elements 1, 2 and 4 is a larger subset where every pair passes the validation. The algorithm should then return the points stored in elements 1, 2 and 4. Though elements 3 and 4 are valid with each other and elements 1 and 4 are valid with each other, elements 2 and 3 are not, nor elements 1 and 3. The subset containing 1, 2 and 4 is a larger subset than 3 and 4.
I would guess some tree or graph algorithm would be best for solving this but I'm not sure how to set it up.
The solution doesn't have to be Java-specific, and preferably could be implemented in any language instead of relying on Java built-ins. I just used Java-like pseudocode above for familiarity reasons.
Presumably isValidWith is commutative -- that is, if x.isValidWith(y) then y.isValidWith(x). If you know nothing more than that, you have an instance of the maximum clique problem, which is known to be NP-complete:
Skiena, S. S. "Clique and Independent Set" and "Clique." §6.2.3 and 8.5.1 in The Algorithm Design Manual. New York: Springer-Verlag, pp. 144 and 312-314, 1997.
Therefore, if you want an efficient algorithm, you will have to hope that your specific isValidWith function has more structure than mere commutativity, and you will have to exploit that structure.
For your specific problem, you should be able to do the following:
Sort your points in increasing order of x coordinate.
Find the longest decreasing subsequence of the y coordinates in the sorted list.
Each operation can be performed in O(n*log(n)) time, so your particular problem is efficiently solvable.
I try to find an effective random logic algorithm for this scenario. It doesn't matter which programming Language:
Say I have 20 element array filled with numbers
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
From this I need to construct each time 15 size array BUT
each time I set numbers that must be in this new array, and the remaining slots will be filled with random numbers from the master array.
For example:
In the new array the numbers that must be in are: 1,11,13,20,8,9
so the new array will be:
[1,N,N,11,N,20,8,N,9,N,N,N,13,N,N]
Where the Ns are random numbers from ALL 20 elements of the Master array.
Another example:
given 2,18,17,9,5
create new 10 element array:
[2,2,18,2,11,17,20,5,5,9]
No problem with duplicate elements
I'm trying to find some good algorithm for this.
If you want to receive one random number at a time and don't want to create the full result array up front, an alternative to my other answer is this:
Get a random number ranging from 0..requested_number (where requested_number is the total number of elements to fetch).
If this index is between 0 and length(required), print it from the array required; then remove it from the array;
.. else the next index is > length(required) and so pick any random number out of the optional array.
Decrease requested_number and repeat until this reaches 0.
You need 2 calls to random; the first to select an index from total_number - required_number, so you know from which array to pick a value, and the second time for optional, to pick a random number out of the entire available range.
Here is a basic implementation in C (footnote: using mod on rand() does not yield A Good Random Number, but it'll do for this example).
int main()
{
int optional[] = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20 };
int required[] = { 21,22,23,24,25 };
int requested_number = 15;
int take_from_required, optional_size, next;
srand(time(NULL));
if (requested_number < sizeof(required)/sizeof(required[0]))
{
printf ("requested number of elements must be at least as large as required array\n");
return EDOM;
}
/* Use this much from 'required': */
take_from_required = sizeof(required)/sizeof(required[0]);
/* Use this much from 'optional': */
optional_size = sizeof(optional)/sizeof(optional[0]);
while (requested_number > 0)
{
/* Please note this is a fairly bad 'random'!
As discussed many times before on SO. */
next = rand() % requested_number;
/* Take from which array? */
if (next >= take_from_required)
{
printf ("%d\n", optional[rand() % optional_size]);
} else
{
printf ("%d (required)\n", required[next]);
required[next] = required[take_from_required-1];
take_from_required--;
}
requested_number--;
}
return 0;
}
If I understand correctly, this is the issue:
optional [ 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20 ]
required [ 2,18,17,9,5 ]
Now construct a new array containing at least all elements of required, and filled to its capacity with elements taken from optional.
The problem seems to be that you need to take out random numbers from either required or optional and at the same time make sure required is empty at the end. [*]
Create a new array result (which needs to be at least as long as required -- then again, that can be inferred from the question). Copy all elements of required into it; fill the rest with random elements from optional.
At this point, you fulfill the primary condition, but the elements of required always appear first. So, as a last step, shuffle the elements now stored in the result array (for example, with the well-known Fisher-Yates shuffle).
[*] 'Empty', because all numbers in required must be used at least once. Taking them "out" of the array is the easiest way to make sure this happens. Things start to get complicated when (a) you may have duplicates of any number (from both optional and required) and (b) required is not a subset of optional.
What I’d like to do is to convert a string from one "alphabet" to another, much like converting numbers between bases, but more abstract and with arbitrary digits.
For instance, converting "255" from the alphabet "0123456789" to the alphabet "0123456789ABCDEF" would result in "FF". One way to do this is to convert the input string into an integer, and then back again. Like so: (pseudocode)
int decode(string input, string alphabet) {
int value = 0;
for(i = 0; i < input.length; i++) {
int index = alphabet.indexOf(input[i]);
value += index * pow(alphabet.length, input.length - i - 1);
}
return value;
}
string encode(int value, string alphabet) {
string encoded = "";
while(value > 0) {
int index = value % alphabet.length;
encoded = alphabet[index] + encoded;
value = floor(value / alphabet.length);
}
return encoded;
}
Such that decode("255", "0123456789") returns the integer 255, and encode(255, "0123456789ABCDEF") returns "FF".
This works for small alphabets, but I’d like to be able to use base 26 (all the uppercase letters) or base 52 (uppercase and lowercase) or base 62 (uppercase, lowercase and digits), and values that are potentially over a hundred digits. The algorithm above would, theoretically, work for such alphabets, but, in practice, I’m running into integer overflow because the numbers get so big so fast when you start doing 62^100.
What I’m wondering is if there is an algorithm to do a conversion like this without having to keep up with such gigantic integers? Perhaps a way to begin the output of the result before the entire input string has been processed?
My intuition tells me that it might be possible, but my math skills are lacking. Any help would be appreciated.
There are a few similar questions here on StackOverflow, but none seem to be exactly what I'm looking for.
A general way to store numbers in an arbitrary base would be to store it as an array of integers. Minimally, a number would be denoted by a base and array of int (or short or long depending on the range of bases you want) representing different digits in that base.
Next, you need to implement multiplication in that arbitrary base.
After that you can implement conversion (clue: if x is the old base, calculate x, x^2, x^3,..., in the new base. After that, multiply digits from old base accordingly to these numbers and then add them up).
Java-like Pseudocode:
ArbitraryBaseNumber source = new ArbitraryBaseNumber(11,"103A");
ArbitraryBaseNumber target = new ArbitraryBaseNumber(3,"0");
for(int digit : base3Num.getDigitListAsIntegers()) { // [1,0,3,10]
target.incrementBy(digit);
if(not final digit) {
target.multiplyBy(source.base);
}
}
The challenge that remains, of course, is to implement ArbitraryBaseNumber, with incrementBy(int) and multiplyBy(int) methods. Essentially to do that, you do in code exactly what a schoolchild does when doing addition and long-multiplication on paper. Google and you'll find example.
Say there is a word set and I would like to clustering them based on their char bag (multiset). For example
{tea, eat, abba, aabb, hello}
will be clustered into
{{tea, eat}, {abba, aabb}, {hello}}.
abba and aabb are clustered together because they have the same char bag, i.e. two a and two b.
To make it efficient, a naive way I can think of is to covert each word into a char-cnt series, for exmaple, abba and aabb will be both converted to a2b2, tea/eat will be converted to a1e1t1. So that I can build a dictionary and group words with same key.
Two issues here: first I have to sort the chars to build the key; second, the string key looks awkward and performance is not as good as char/int keys.
Is there a more efficient way to solve the problem?
For detecting anagrams you can use a hashing scheme based on the product of prime numbers A->2, B->3, C->5 etc. will give "abba" == "aabb" == 36 (but a different letter to primenumber mapping will be better)
See my answer here.
Since you are going to sort words, I assume all characters ascii values are in the range 0-255. Then you can do a Counting Sort over the words.
The counting sort is going to take the same amount of time as the size of the input word. Reconstruction of the string obtained from counting sort will take O(wordlen). You cannot make this step less than O(wordLen) because you will have to iterate the string at least once ie O(wordLen). There is no predefined order. You cannot make any assumptions about the word without iterating though all the characters in that word. Traditional sorting implementations(ie comparison based ones) will give you O(n * lg n). But non comparison ones give you O(n).
Iterate over all the words of the list and sort them using our counting sort. Keep a map of
sorted words to the list of known words they map. Addition of elements to a list takes constant time. So overall the complexity of the algorithm is O(n * avgWordLength).
Here is a sample implementation
import java.util.ArrayList;
public class ClusterGen {
static String sortWord(String w) {
int freq[] = new int[256];
for (char c : w.toCharArray()) {
freq[c]++;
}
StringBuilder sortedWord = new StringBuilder();
//It is at most O(n)
for (int i = 0; i < freq.length; ++i) {
for (int j = 0; j < freq[i]; ++j) {
sortedWord.append((char)i);
}
}
return sortedWord.toString();
}
static Map<String, List<String>> cluster(List<String> words) {
Map<String, List<String>> allClusters = new HashMap<String, List<String>>();
for (String word : words) {
String sortedWord = sortWord(word);
List<String> cluster = allClusters.get(sortedWord);
if (cluster == null) {
cluster = new ArrayList<String>();
}
cluster.add(word);
allClusters.put(sortedWord, cluster);
}
return allClusters;
}
public static void main(String[] args) {
System.out.println(cluster(Arrays.asList("tea", "eat", "abba", "aabb", "hello")));
System.out.println(cluster(Arrays.asList("moon", "bat", "meal", "tab", "male")));
}
}
Returns
{aabb=[abba, aabb], ehllo=[hello], aet=[tea, eat]}
{abt=[bat, tab], aelm=[meal, male], mnoo=[moon]}
Using an alphabet of x characters and a maximum word length of y, you can create hashes of (x + y) bits such that every anagram has a unique hash. A value of 1 for a bit means there is another of the current letter, a value of 0 means to move on to the next letter. Here's an example showing how this works:
Let's say we have a 7 letter alphabet(abcdefg) and a maximum word length of 4. Every word hash will be 11 bits. Let's hash the word "fade": 10001010100
The first bit is 1, indicating there is an a present. The second bit indicates that there are no more a's. The third bit indicates that there are no more b's, and so on. Another way to think about this is the number of ones in a row represents the number of that letter, and the total zeroes before that string of ones represents which letter it is.
Here is the hash for "dada": 11000110000
It's worth noting that because there is a one-to-one correspondence between possible hashes and possible anagrams, this is the smallest possible hash guaranteed to give unique hashes for any input, which eliminates the need to check everything in your buckets when you are done hashing.
I'm well aware that using large alphabets and long words will result in a large hash size. This solution is geared towards guaranteeing unique hashes in order to avoid comparing strings. If you can design an algorithm to compute this hash in constant time(given you know the values of x and y) then you'll be able to solve the entire grouping problem in O(n).
I would do this in two steps, first sort all your words according to their length and work on each subset separately(this is to avoid lots of overlaps later.)
The next step is harder and there are many ways to do it. One of the simplest would be to assign every letter a number(a = 1, b = 2, etc. for example) and add up all the values for each word, thereby assigning each word to an integer. Then you can sort the words according to this integer value which drastically cuts the number you have to compare.
Depending on your data set you may still have a lot of overlaps("bad" and "cac" would generate the same integer hash) so you may want to set a threshold where if you have too many words in one bucket you repeat the previous step with another hash(just assigning different numbers to the letters) Unless someone has looked at your code and designed a wordlist to mess you up, this should cut the overlaps to almost none.
Keep in mind that this approach will be efficient when you are expecting small numbers of words to be in the same char bag. If your data is a lot of long words that only go into a couple char bags, the number of comparisons you would do in the final step would be astronomical, and in this case you would be better off using an approach like the one you described - one that has no possible overlaps.
One thing I've done that's similar to this, but allows for collisions, is to sort the letters, then get rid of duplicates. So in your example, you'd have buckets for "aet", "ab", and "ehlo".
Now, as I say, this allows for collisions. So "rod" and "door" both end up in the same bucket, which may not be what you want. However, the collisions will be a small set that is easily and quickly searched.
So once you have the string for a bucket, you'll notice you can convert it into a 32-bit integer (at least for ASCII). Each letter in the string becomes a bit in a 32-bit integer. So "a" is the first bit, "b" is the second bit, etc. All (English) words make a bucket with a 26-bit identifier. You can then do very fast integer compares to find the bucket a new words goes into, or find the bucket an existing word is in.
Count the frequency of characters in each of the strings then build a hash table based on the frequency table. so for an example, for string aczda and aacdz we get 20110000000000000000000001. Using hash table we can partition all these strings in buckets in O(N).
26-bit integer as a hash function
If your alphabet isn't too large, for instance, just lower case English letters, you can define this particular hash function for each word: a 26 bit integer where each bit represents whether that English letter exists in the word. Note that two words with the same char set will have the same hash.
Then just add them to a hash table. It will automatically be clustered by hash collisions.
It will take O(max length of the word) to calculate a hash, and insertion into a hash table is constant time. So the overall complexity is O(max length of a word * number of words)
Can be in any language or even pseudocode. I was asked this in an interview question, and was curious what you guys can come up with.
I think this is a trick question - the obvious answer of generating digits using a standard library routine is almost certainly flawed, if you want to generate every possible 10000 digit number with equal probability...
If an algorithmic random number generator maintains n bits of state, then clearly it can generate at most 2n possible different output sequences, because there are only 2n different initial configurations.
233219 < 1010000 < 233220, so if your algorithm uses less than 33220 bits of internal state, it cannot possibly generate some of the 1010000 possible 10000-digit (decimal) numbers.
Typical standard library random number generators won't use anything like this much internal state. Even the Mersenne Twister (the most frequently mentioned generator with a large state that I'm aware of) only keeps 624 32-bit words (= 19968 bits) of state.
Just one of many ways. You can pass in any string of the alphabet of characters you want to use:
public class RandomUtils
{
private static readonly Random random = new Random((int)DateTime.Now.Ticks);
public static string GenerateRandomDigitString(int length)
{
const string digits = "1234567890";
return GenerateRandomString(length, digits);
}
public static string GenerateRandomAlphaString(int length)
{
const string alpha = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
return GenerateRandomString(length, alpha);
}
public static string GenerateRandomString(int length, string alphabet)
{
int maxlen = alphabet.Length;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < length; i++)
{
sb.Append(alphabet[random.Next(0, maxlen)]);
}
return sb.ToString();
}
}
Without additional requirements, this will work:
StringBuilder randomStr = new StringBuilder(10000);
Random rnd = new Random();
for(int i = 0; i<10000;i++)
{
char randomChar = rnd.AsChar();
randomStr[i] = randomChar;
}
This will result in unprintable characters and other unpleasentness. Using an ASCII encoder you can get letters, numbers and punctutaiton by sticking to the range 32 - 126. Or creating a random number between 0 and 94 and adding 32. Not sure which aspect they were looking for in the question.
BTW, No I did not know the visible range off the top of my head, I looked it up on wikipedia.
Generate a number in the range 0..9. Convert it to a digit. Stuff that into a string. Repeat 10000 times.
I always like saying Computer Random Numbers are always only pseudo-random. Anyway, your favourite language will invariably have a random library. Next what is a numeric string ? 0-9 valued for each character ? Well let's start with that assumption. So we can generate bytes between to Ascii codes of 0-9 with offset (48) and (int) random*10 (since random generators typically return floats). Then place these all in a char buffer 10000 long and convert to string.
Return a string containing 10,000 1s -- that's just as random as any other digit string of the same length.
I think the real question was to determine what the interviewer actually wanted. For example, random in what sense? Uncompressable? Random over multiple runs of the same algorithm? Etc.
You can start with a list of seed digits:
seeds = [4,9,3,1,2,5,5,4,4,8,4,3] # This should be relatively large
Then, use a counter to keep track of which digit was last used. This would be system-wide and shouldn't reset with the system:
def next_digit():
counter = 0
while True:
yield counter
counter += 1
pos_it = next_digit()
rand_it = next_digit()
Next, use an algorithm that uses modulus to determine the "next number":
def random_digit():
position = pos_it.next() % len(seeds)
digit = seeds[position] * rand_it.next()
return digit % 10
Last, generate 10,000 of those digits.
output = ""
for i in range(10000):
output = "%s%s" % (output, random_digit())
I believe that an ideal answer would use more prime numbers, but this should be pretty sufficient.