Interview - 7 ways to store and compare millions of data - algorithm

Was just asked this day before.
Could come up with -
1. Arrays
2. Linked Lists
3. Sets/Vectors
4. Min/Max heaps
5. Trees
6. Map
Interviewer said I haven't included one obvious answer which is as good, if not better than trees (assuming, they are balanced). Any ideas?

There are lots of options that can be added. Skip list, graph, btree, log-structured merge-tree, bloom filter...
I consider that a horrible interview question. The interviewer is asking for useless trivia, and believes that how much trivia you know in common with the interviewer is a measure of something useful. Speaking personally, I'd consider that a red flag that would limit my interest in the company.

Related

Sort elements with the fewest comparisons possible

I have a folder full of images. There are too many to just 'rank'. I made a program that shows two at a time and let's the user pick which one of the two is better. At the end I would like all of the photos to be ordered from best to worst.
I am purely trying to optimize for the fewest amount of comparisons possible. I don't care if the program runs in n cubed time. I've read the other questions here with similar questions but I'm looking for something more advanced.
I'm thinking maybe some sort of algorithm that based on what comparisons you've already made, the program chooses two images to compare that will offer the most information. Maybe even an algorithm that makes complex connections to help determine the orders and potential orders.
Like I said I don't care if it is slow just purely trying to minimize comparisons
If total order exists, you need at least nlog2(n) comparisons. It can be easily proved mathematically. No way around. So regular sorting algorithms in nlog(n) will do the job.
What you are trying to do is called 'topological sort'. Google it and read about it in wikipedia. You can achieve partial sorts in less comparisons. Its kind of a graduate sort. The more comparisons you get, the better the result will be.
However, what do you do if no total order exists? Humans are not able to generate a total order for subjective tasks.
For example picture 1 is better than 2, 2 is better than 3 but 3 is better than 1.
In this case no sorting algorithm can produce a permutation which will match all the decisions. During topological sort, you can detect those inconsitent decisions and get rid of them.
You are looking for a sorting algorithm - pick one. Most algorithms just need a comparison function (a < b?). This is when you show the user two pictures and he has to choose the better one.
You might wan't to read trough some of the algorithms and choose the best one for you. E.g. on quicksort, you would pick a random picture and the user have to compare this picture against all other pictures in the first round - might be too boring from the end user perspective.

The number with the biggest number of repetitions within an array

I was asked an interview question which asked me to return the number with the biggest repetition within an array, for example, {1,1,2,3,4} returns 1.
I first proposed a method in hashtable, which requires space complexity O(n).
Then I said sort the array first and then go through it then we can find the number.
Which requires O(NlogN)
The interviewer was still not satisfied.
Any optimization?
Thanks.
Interviewers aren't always looking for solutions per se. Sometimes they're looking to find out your capacity to do other things. You should have asked if there there were any constraints on the data such as:
is it already sorted?
are the values limited to a certain range?
This establishes your ability to think about problems rather than just blindly vomit forth stuff that you've read in textbooks. For example, if it's already sorted, you can do it in O(n) time, O(1) space simply by looking for the run with the largest size.
If it's unsorted but limited to the values 1..100, you can still do it in O(n) time, O(1) space by creating a count of each possible value, initially all set to zero, then incrementing for each item.
Then ask the interviewer other things like:
What sort of behaviour do they want if there are two numbers with the same count?
If they're not satisfied with your provided solutions, try to get a clue as to where they're thinking. Do they think it can be done in O(log N) or O(1)? Interviews are never a one-way street.
There are boundless other "solutions" like stashing the whole thing into a class so that you can perform other optimisations (such as caching the information, or using a different data structures which makes the operation faster). Discussing these with your interviewer will give them a chance to see you in action.
As an aside, I tell my children to always show working out in their school assignments. If they just plop down the answer and it's wrong, they'll get nothing. However, if they show their working out and get the wrong answer, the teacher can at least see that they had the right idea (they probably just made one little mistake along the way).
It's exactly the same thing here. If you simply say "hashtable" and the interviewer has a different idea, that'll be a pretty short interview question.
However, saying "based on unsorted arrays, no possibility of keeping the data in a different data structure, and no limitations on the data values, it would appear hashtables are the most efficient way, BUT, if there was some other information I'm not yet privy to, there might be a better method" will show that you've given it some thought, and possibly open a dialogue with the interviewer that will help you out.
Bottom line, when an interviewer asks you a question, don't always assume it's as straightforward as you initially think. Apart from tech knowledge, they may be looking to see how you approach problem solving, how you handle Kobayashi-Maru-type problems, how you'll work in a team, how you'll treat difficult customers, whether you're a closet psychopath and endless other possibilities.

LCA problem at interview

Sometimes I come across interview questions like this: "Find the common parent of any 2 nodes in a tree". I noticed that they ask LCA questions also at Google, Amazon, etc.
As wikipedia says LCA can be found by intersection of the paths from the given nodes to the root and it takes O(H), where H is the height of the tree. Besides, there are more advanced algorithms that process trees in O(N) and answer LCA queries in O(1).
I wonder what exactly interviewers want to learn about candidates asking this LCA question. The first algorithm of paths intersection seems trivial. Do they expect the candidates to remember pre-processing algorithms ? Do they expect the candidates to invent these algorithms and data structures on the fly ?
They want to see what you're made of. They want to see how you think, how you tackle on problem, and handle stress from deadline. I suppose that if you solve the problem because you already know the solution then they will just pose you another problem. It's not the solution they want, it's your brainz.
With many algorithms questions in an interview, I don't care (or want) you to remember the answer. I want you to be able to derive the answer from the main principles you know.
Realistically: I could give two cares less if you already know the answer to "how to do X", if you are able to quickly construct an answer on the fly. Ultimately: in the real world, I can't necessarily assume you have experience tackling problems of domain X, but if you run across a problem in said domain, I will certainly hope you'll have the analytical ability to figure out a reasonable answer, based on the general knowledge you hold.
A tree is a common data structure it's safe to assume most will know - if you know the answer offhand, you should be able to explain it. If you don't know the answer, but understand the data structure, you should be able to fairy easily derive the answer.

Divide people into teams for most satisfaction

Just a curiosity question. Remember when in class groupwork the professor would divide people up into groups of a certain number (n)?
Some of my professors would take a list of n people one wants to work with and n people one doesn't want to work with from each student, and then magically turn out groups of n where students would be matched up with people they prefer and avoid working with people they don't prefer.
To me this algorithm sounds a lot like a Knapsack problem, but I thought I would ask around about what your approach to this sort of problem would be.
EDIT: Found an ACM article describing something exactly like my question. Read the second paragraph for deja vu.
To me it sounds more like some sort of clique problem.
The way I see the problem, I'd set up the following graph:
Vertices would be the students
Two students would be connected by an edge if both of these following things hold:
At least one of the two students wants to work with the other one.
None of the two students doesn't want to work with the other one.
It is then a matter of partitioning the graph into cliques of size n. (Assuming the number of students is divisible by n)
If this was not possible, I'd probably let the first constraint on the edges slip, and have edges between two people as long as neither of them explicitly says that they don't want to work with the other one.
As for an approach to solving this efficiently, I have no idea, but this should hopefully get you closer to some insight into the problem.
You could model this pretty easily as a clustering problem and you wouldn't even really need to define a space, you could actually just define the distances:
Make two people very close if they both want to work together.
Close if one of them wants to work with the other.
Medium distance if there's just apathy.
Far away if either one doesn't want to work with the other.
Then you could just find clusters, yay. Then split up any clusters of overly large size, with confidence that the people in the clusters would all be fine working together.
This problem can be brute-forced, hence my approach would be first to brute force it, then fix it when I get a better idea.
There are a couple of algorithms you could use. A great example is the so called "stable marriage problem", which has a perfect solution. You can read more about it here:
http://en.wikipedia.org/wiki/Stable_marriage_problem
The stable marriage problem only works with two groups of people (men/women in the marriage case). If you want to form pair you can use a variation, the stable roommate problem. In this case you create pairs but everybody comes from a single pool.
But you asked for a team (which I translate into >2 people per team). In this case you could let everybody fill in their best to worst match and then run the

How do 20 questions AI algorithms work?

Simple online games of 20 questions powered by an eerily accurate AI.
How do they guess so well?
You can think of it as the Binary Search Algorithm.
In each iteration, we ask a question, which should eliminate roughly half of the possible word choices. If there are total of N words, then we can expect to get an answer after log2(N) questions.
With 20 question, we should optimally be able to find a word among 2^20 = 1 million words.
One easy way to eliminate outliers (wrong answers) would be to probably use something like RANSAC. This would mean, instead of taking into account all questions which have been answered, you randomly pick a smaller subset, which is enough to give you a single answer. Now you repeat that a few times with different random subset of questions, till you see that most of the time, you are getting the same result. you then know you have the right answer.
Of course this is just one way of many ways of solving this problem.
I recommend reading about the game here: http://en.wikipedia.org/wiki/Twenty_Questions
In particular the Computers section:
The game suggests that the information
(as measured by Shannon's entropy
statistic) required to identify an
arbitrary object is about 20 bits. The
game is often used as an example when
teaching people about information
theory. Mathematically, if each
question is structured to eliminate
half the objects, 20 questions will
allow the questioner to distinguish
between 220 or 1,048,576 subjects.
Accordingly, the most effective
strategy for Twenty Questions is to
ask questions that will split the
field of remaining possibilities
roughly in half each time. The process
is analogous to a binary search
algorithm in computer science.
A decision tree supports this kind of application directly. Decision trees are commonly used in artificial intelligence.
A decision tree is a binary tree that asks "the best" question at each branch to distinguish between the collections represented by its left and right children. The best question is determined by some learning algorithm that the creators of the 20 questions application use to build the tree. Then, as other posters point out, a tree 20 levels deep gives you a million things.
A simple way to define "the best" question at each point is to look for a property that most evenly divides the collection into half. That way when you get a yes/no answer to that question, you get rid of about half of the collection at each step. This way you can approximate binary search.
Wikipedia gives a more complete example:
http://en.wikipedia.org/wiki/Decision_tree_learning
And some general background:
http://en.wikipedia.org/wiki/Decision_tree
It bills itself as "the neural net on the internet", and therein lies the key. It likely stores the question/answer probabilities in a spare matrix. Using those probabilities, it's able to use a decision tree algorithm to deduce which question to ask that would best narrow down the next question. Once it narrows the number of possible answers to a few dozen, or if it's reached 20 questions already, then it starts reading off the most likely.
The really intriguing aspect of 20q.net is that unlike most decision tree and neural network algorithms I'm aware of, 20q supports a sparse matrix and incremental updates.
Edit: Turns out the answer's been on the net this whole time. Robin Burgener, the inventor, described his algorithm in detail in his 2005 patent filing.
It is using a learning algorithm.
k-NN is a good example of one of these.
Wikipedia: k-Nearest Neighbor Algorithm

Resources