Put elements of similar type together - algorithm

I read this interview question somewhere and was trying to solve it:
Given a fruit stall (at max 8 different types of fruits). Put fruits of similar types together.
Restrictions: a) Fruit Stall is your entire world (i.e. dont use extra space), b) Taking a fruit and knowing its type (getType()) is a costly operation but swapping is a very cheap operation.
Note: You need to write a code to handle all cases keeping in mind the max types of fruit can be 8.
So, the idea which pops in my mind is, we need to call getType() for all the fruits(array elements) and then sort them accordingly based on a particular type. I am not able to get how swapping can be done here without knowing the Type of the fruit and what can be the best solution to this problem?

Since this is an interview question, I'm going to assume that your fruit stall is an array. Divide the array into eight regions, so that each region contains only fruit of a given type, using seven pointers, one to the start of each region except the first. Use an eighth pointer to point at the start of the unsorted area.
Initialize the pointers to point at the start of the array. Getting the definition of the pointers is tricky because you have to cover cases where there are no fruits of a given type. One possible definition is that Pointer i contains the number of fruits sorted so far of types up to and including i, for i = 1..8. Then at the beginning all the pointers are set equal to zero and 1 1 1 2 2 3 4 4 | corresponds to p1=3 p2=5 p3=6 p4=8 p5=p6=p7=p8=8
Repeatedly look at the first fruit at the start of the unsorted region to find out its type. If it should not go in the final region swap it with the element at the start of the final region and advance the pointer to the start of the final region. If it should not go in the second last region swap it with the element at the start of the second last region and advance the pointer to the start of the second last region... and so on until the new fruit is in its correct place. Now advance the pointer to the first unsorted fruit and repeat.
This looks at each fruit once, and I don't think you can sort with fewer calls to getType(). You don't care about the number of swaps, so I think this is optimal.
I will put in lines showing the swaps starting with c1,c2,c1,c3,c2,c1,c4,c4. I won't bother to write in the cs and I will use a | to divide the region on the left where everything is known to be in order from the region on the right where the types are unknown
| 1 2 1 3 2 1 4 4
1 | 2 1 3 2 1 4 4
1 2 | 1 3 2 1 4 4
1 1 | 2 3 2 1 4 4
1 1 2 | 3 2 1 4 4
1 1 2 3 | 2 1 4 4
1 1 2 2 | 3 1 4 4
1 1 2 2 3 | 1 4 4
1 1 2 2 1 | 3 4 4
1 1 1 2 2 | 3 4 4
1 1 1 2 2 3 | 4 4
1 1 1 2 2 3 4 | 4
1 1 1 2 2 3 4 4 |

This can most likely be done as an in place merge sort. As you mentioned check the type of each fruit immediately. This wont use up any extra memory (many guides on how to do an in place merge sort) will only call getType() once, and will result in nlog(n) run time with n memory usage.
Is there any info we know right off the bat? It seems like the question is worded in such a way that they would normally give us an alternative way to avoid having to make the getType() call n times. If this is an in person interview question don't be surprised if the goal of this exercise is supposed to evolve as the interviewer starts going into it. This would explain why they specifically mention the getType() as being expensive

Related

Can you check for duplicates by taking the sum of the array and then the product of the array?

Let's say we have an array of size N with values from 1 to N inside it. We want to check if this array has any duplicates. My friend suggested two ways that I showed him were wrong:
Take the sum of the array and check it against the sum 1+2+3+...+N. I gave the example 1,1,4,4 which proves that this way is wrong since 1+1+4+4 = 1+2+3+4 despite there being duplicates in the array.
Next he suggested the same thing but with multiplication. i.e. check if the product of the elements in the array is equal to N!, but again this fails with an array like 2,2,3,2, where 2x2x3x2 = 1x2x3x4.
Finally, he suggested doing both checks, and if one of them fails, then there is a duplicate in the array. I can't help but feel that this is still incorrect, but I can't prove it to him by giving him an example of an array with duplicates that passes both checks. I understand that the burden of proof lies with him, not me, but I can't help but want to find an example where this doesn't work.
P.S. I understand there are many more efficient ways to solve such a problem, but we are trying to discuss this particular approach.
Is there a way to prove that doing both checks doesn't necessarily mean there are no duplicates?
Here's a counterexample: 1,3,3,3,4,6,7,8,10,10
Found by looking for a pair of composite numbers with factorizations that change the sum & count by the same amount.
I.e., 9 -> 3, 3 reduces the sum by 3 and increases the count by 1, and 10 -> 2, 5 does the same. So by converting 2,5 to 10 and 9 to 3,3, I leave both the sum and count unchanged. Also of course the product, since I'm replacing numbers with their factors & vice versa.
Here's a much longer one.
24 -> 2*3*4 increases the count by 2 and decreases the sum by 15
2*11 -> 22 decreases the count by 1 and increases the sum by 9
2*8 -> 16 decreases the count by 1 and increases the sum by 6.
We have a second 2 available because of the factorization of 24.
This gives us:
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24
Has the same sum, product, and count of elements as
1,3,3,4,4,5,6,7,9,10,12,13,14,15,16,16,17,18,19,20,21,22,22,23
In general you can find these by finding all factorizations of composite numbers, seeing how they change the sum & count (as above), and choosing changes in both directions (composite <-> factors) that cancel out.
I've just wrote a simple not very effective brute-force function. And it shows that there is for example
1 2 4 4 4 5 7 9 9
sequence that has the same sum and product as
1 2 3 4 5 6 7 8 9
For n = 10 there are more such sequences:
1 2 3 4 6 6 6 7 10 10
1 2 4 4 4 5 7 9 9 10
1 3 3 3 4 6 7 8 10 10
1 3 3 4 4 4 7 9 10 10
2 2 2 3 4 6 7 9 10 10
My write-only c++ code is here: https://ideone.com/2oRCbh

count the number of split points

I just got a question about counting the split points in a integer array, to ensure there is at least one duplicated integer on the two sides.
ex:
1 1 4 2 4 2 4 1
we can either split it into:
1 1 4 2 | 4 2 4 1
or
1 1 4 2 4 | 2 4 1
so that there is at least one '1', '2' ,and '4' are in both sides.
The integer can range from 1 to 100,000
The complexity requires O(n). How to solve this question?
Make one pass over the array and build count[i] = how many times the value i appears in the array. The problem is only solvable if count[i] >= 2 for all non-zero values. You can use this array to tell how many distinct values you have in your array.
Next, make another pass and using another array count2[i] (or you can reuse the first one), keep track of when you have visited each value at least once. Then use that position as your split point.
Example:
1 1 4 2 4 2 4 1
count = [3, 2, 0, 4] => 3 distinct values
1 1 4 2 4 2 4 1
^ => 1 distinct value so far
^ => 1 distinct value so far
^ => 2 distinct values so far
^ => 3 distinct values so far => this is your split point
There might be cases for which there is no solution, for example if the last 1 was at the beginning as well. To detect this, you can just make another pass over the rest of the array after you have decided on the split point and see if you still have all the values on that side.
You can avoid this last pass by using the count and count2 arrays to detect when you can no longer have a split point. This is left as an exercise.

Dropping people in Stata from a panel based on their situation in multiple years

I have an unbalanced panel of 7 years with every person interviewed 4 times and I want to drop all the people that reported that they were unemployed/inactive in all 4 periods. However, I do not want to drop the observations of the people that may have been out of the labour market for 1, 2 or 3 out of the 4 periods they were interviewed. How do I tell Stata to drop people based on their situation in multiple years (t to t-3)? When I do drop if ecostatus>3, for example, Stata drops observations that I need, i.e. the people that were inactive for less than the full period of the survey.
// create some example data
clear
input id t unemp
1 1 1
1 2 1
1 3 1
1 4 1
2 1 1
2 2 0
2 3 1
2 4 1
end
// create the total number of unemployment spells
bys id : egen totunemp = total(unemp)
// display the data
sort id t
list, sepby(id)
// keep those observations with at least one
// employment spell
keep if totunemp < 4
// display the data
list

Ascending Cardinal Numbers in APL

In the FinnAPL Idiom Library, the 19th item is described as “Ascending cardinal numbers (ranking, all different) ,” and the code is as follows:
⍋⍋X
I also found a book review of the same library by R. Peschi, in which he said, “'Ascending cardinal numbers (ranking, all different)' How many of us understand why grading the result of Grade Up has that effect?” That's my question too. I searched extensively on the internet and came up with zilch.
Ascending Cardinal Numbers
For the sake of shorthand, I'll call that little code snippet “rank.” It becomes evident what is happening with rank when you start applying it to binary numbers. For example:
X←0 0 1 0 1
⍋⍋X ⍝ output is 1 2 4 3 5
The output indicates the position of the values after sorting. You can see from the output that the two 1s will end up in the last two slots, 4 and 5, and the 0s will end up at positions 1, 2 and 3. Thus, it is assigning rank to each value of the vector. Compare that to grade up:
X←7 8 9 6
⍋X ⍝ output is 4 1 2 3
⍋⍋X ⍝ output is 2 3 4 1
You can think of grade up as this position gets that number and, you can think of rank as this number gets that position:
7 8 9 6 ⍝ values of X
4 1 2 3 ⍝ position 1 gets the number at 4 (6)
⍝ position 2 gets the number at 1 (7) etc.
2 3 4 1 ⍝ 1st number (7) gets the position 2
⍝ 2nd number (8) gets the position 3 etc.
It's interesting to note that grade up and rank are like two sides of the same coin in that you can alternate between the two. In other words, we have the following identities:
⍋X = ⍋⍋⍋X = ⍋⍋⍋⍋⍋X = ...
⍋⍋X = ⍋⍋⍋⍋X = ⍋⍋⍋⍋⍋⍋X = ...
Why?
So far that doesn't really answer Mr Peschi's question as to why it has this effect. If you think in terms of key-value pairs, the answer lies in the fact that the original keys are a set of ascending cardinal numbers: 1 2 3 4. After applying grade up, a new vector is created, whose values are the original keys rearranged as they would be after a sort: 4 1 2 3. Applying grade up a second time is about restoring the original keys to a sequence of ascending cardinal numbers again. However, the values of this third vector aren't the ascending cardinal numbers themselves. Rather they correspond to the keys of the second vector.
It's kind of hard to understand since it's a reference to a reference, but the values of the third vector are referencing the orginal set of numbers as they occurred in their original positions:
7 8 9 6
2 3 4 1
In the example, 2 is referencing 7 from 7's original position. Since the value 2 also corresponds to the key of the second vector, which in turn is the second position, the final message is that after the sort, 7 will be in position 2. 8 will be in position 3, 9 in 4 and 6 in the 1st position.
Ranking and Shareable
In the FinnAPL Idiom Library, the 2nd item is described as “Ascending cardinal numbers (ranking, shareable) ,” and the code is as follows:
⌊.5×(⍋⍋X)+⌽⍋⍋⌽X
The output of this code is the same as its brother, ascending cardinal numbers (ranking, all different) as long as all the values of the input vector are different. However, the shareable version doesn't assign new values for those that are equal:
X←0 0 1 0 1
⌊.5×(⍋⍋X)+⌽⍋⍋⌽X ⍝ output is 2 2 4 2 4
The values of the output should generally be interpreted as relative, i.e. The 2s have a relatively lower rank than the 4s, so they will appear first in the array.

Array problem using if and do loop

This is my code:
data INDAT8; set INDAT6;
Array myarray{24,27};
goodgroups=0;
do i=2 to 24 by 2;
do j=2 to 27;
if myarray[i,j] gt 1 then myarray[i+1,j] = 'bad';
else if myarray[i,j] eq 1 and myarray[i+1,j] = 1 then myarray[i+1,j]= 'good';
end;
end;
run;
proc print data=INDAT8;
run;
Problem:
I have the data in this format- it is just an example: n=2
X Y info
2 1 good
2 4 bad
3 2 good
4 1 bad
4 4 good
6 2 good
6 3 good
Now, the above data is in sorted manner (total 7 rows). I need to make a group of 2 , 3 or 4 rows separately and generate a graph. In the above data, I made a group of 2 rows. The third row is left alone as there is no other column in 3rd row to form a group. A group can be formed only within the same row. NOT with other rows.
Now, I will check if both the rows have “good” in the info column or not. If both rows have “good” – the group formed is also good , otherwise bad. In the above example, 3rd /last group is “good” group. Rest are all bad group. Once I’m done with all the rows, I will calculate the total no. of Good groups formed/Total no. of groups.
In the above example, the output will be: Total no. of good groups/Total no. of groups => 1/3.
This is the case of n=2(size of group)
Now, for n=3, we make group of 3 rows and for n=4, we make a group of 4 rows and find the good /bad groups in a similar way. If all the rows in a group has “good” block—the result is good block, otherwise bad.
Example: n= 3
2 1 good
2 4 bad
2 6 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
In the above case, I left the 4th row and last 2 rows as I can’t make group of 3 rows with them. The first group result is “bad” and last group result is “good”.
Output: 1/ 2
For n= 4:
2 1 good
2 4 good
2 6 good
2 7 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
6 4 good
6 5 good
In this case, I make a group of 4 and finds the result. The 5th,6th,7th,8th row are left behind or ignored. I made 2 groups of 4 rows and both are “good” blocks.
Output: 2/2
So, After getting 3 output values from n=2 , n-3, and n=4 I will plot a graph of these values.
If you can help in any any language using array, if and do loop. it would be great.
I can change my code accordingly.
Update:
The answer for this doesn't have to be in sas. Since it is more algorithm-related than anything, I will accept suggestions in any language as long as they show how to accomplish this using arrays and do.
I am having trouble understanding your problem statement, but from what I can gather here is what I can suggest:
Place data into bins and the process the summary data.
Implementation 1
Assumption: You don't know what the range of the first column will be or distriution will be sparse
Create a hash table. The Key will be the item you are doing your grouping on. The value will be the count seen so far.
Proces each record. If the key already exists, increment the count (value for that key in the hash). Otherwise add the key and set the value to 1.
Continue until you have processed all records
Count the number of keys in the hash table and the number of values that are greater than your threshold.
Implementation 2
Assumption: You know the range of the first column and the distriution is reasonably dense
Create an array of integers with enough elements so the index can match the column value. Initialize all elements to zero. This array will hold your count for each item you are grouping on
Process each record. Examine value of first column. Increment corresponding index in array. (So if you have "2 1 good", do groupCount[2]++)
Continue until you have processed all records
Walk each element in the array. Count how many items are non zero (meaning they appeared at least once) and how many items meet your threshold.
You can use the same approach for gathering the good and bad counts.

Resources