I have a file, say input.dat, like this:
column1 column2
0 0
1.3 1.6
1.8 2.1
2.0
2.6
I need to extract a subset of values from the 1st column that are closest to those in the 2nd column, so that the total number of entries in both columns is equal.
In this example, this is the output I need to obtain:
column1 column2
0 0
1.8 1.6
2.0 2.1
How can I get this?
It's possible to do this with bash scripts if that is what you are limited to, but it would be easier to handle a problem like this in Python / C++ / Java, because this is a version of the optimized bipartite matching problem (in a shell script you would have to read each line repeatedly, or use a lot of helper variables).
If we can assume that the values in both columns are sorted and increasing, a naive solution would be:
For every value in the 2nd column:
Read over the values in the 1st column sequentially until the difference col2_value - col1_value changes sign (from positive to negative)
Then find min( abs(negative_difference), positive_difference ) and pick the col1_value that corresponds to the smaller difference
Remove both entries from col1 and col2 and add them to the result table
Repeat this process until there is nothing left in col2 of the original table
This has a worst-case run time of O(m*n), where m is the number of entries in col1 and n is the number of entries in col2, and an average run time of O(n) if you are clever and do a constant-time alternating check (compare the entries at -1, +1 from the index of the last chosen col1_value, since -2, +2, etc. would of course give bigger differences) instead of a sequential scan to find the minimal difference between the current value in col2 and the values in col1.
This is a naive solution because it does not minimize the overall difference in the system. Computing the globally optimal matching is considerably more expensive, so for large datasets the best you can probably do is use one of the approximation algorithms for graph matching.
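Below is a minimal Python sketch of the greedy approach described above. It assumes both columns are already sorted and increasing, and it uses a bisection lookup in place of the sequential scan; the data is hard-coded from the example rather than read from input.dat.

import bisect

def greedy_match(col1, col2):
    # Greedily pair each col2 value with the closest remaining col1 value.
    # Both lists are assumed to be sorted in increasing order.
    remaining = list(col1)
    pairs = []
    for v in col2:
        # Locate the insertion point, then compare the neighbours on each side.
        i = bisect.bisect_left(remaining, v)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(remaining)]
        best = min(candidates, key=lambda j: abs(remaining[j] - v))
        pairs.append((remaining.pop(best), v))
    return pairs

col1 = [0, 1.3, 1.8, 2.0, 2.6]
col2 = [0, 1.6, 2.1]
print(greedy_match(col1, col2))   # [(0, 0), (1.8, 1.6), (2.0, 2.1)]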
I was asked this question in a coding round:
Given a matrix of 0's and 1's where, in any row, the values are in ascending order, i.e. the 1's always come after the 0's. Consider the example:
0,0,0,1,1
0,0,1,1,1
0,0,0,0,1
1,1,1,1,1
0,0,0,0,0
Find the first column that has a 1 (scanning from left to right).
In this case the first column (in row 4) has a 1.
The answer is 1.
I suggested a column-wise traversal across all rows, exiting as soon as the current column contains a 1 in any of the rows.
Since the worst-case performance is n * n (comparing every element in the matrix), the interviewer wasn't pleased and was looking for a more efficient solution. What is an efficient solution here?
Take advantage of the fact that the rows are sorted, which is evident from "in any row, the values are in ascending order, i.e. 1's are always after the 0's".
Let there be m rows and n columns. Do a binary search on the first row to find the first 1 and store that column in some variable, say index (one may think of a better variable name; I am just focused here on solving the problem optimally). Continue the binary search on every row, updating index whenever the first column containing a 1 in that row is smaller than the current index. After doing a binary search on every row, you'll end up with the result in the index variable.
Time complexity: m rows * log2(n columns), i.e. O(m * log2(n)).
This is the approach I could think of, which is better than the brute-force approach with O(mn) time complexity. I don't think there would be a more optimal approach in terms of time and space complexity, as one has to search for the first 1 in every row.
[I don't think I should add the details of how to do a binary search for the first column containing a 1. In case someone isn't very familiar with binary search, I leave this part as an exercise.]
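A minimal Python sketch of the per-row binary search described above (bisect_left stands in for a hand-written binary search, and the data is the example matrix from the question):

import bisect

def first_column_with_one(matrix):
    # Each row is sorted (all 0's before all 1's), so bisect_left(row, 1)
    # is exactly the index of the first 1 in that row.
    best = None
    for row in matrix:
        i = bisect.bisect_left(row, 1)
        if i < len(row) and (best is None or i < best):
            best = i
    return None if best is None else best + 1   # 1-based; None if there is no 1 at all

matrix = [
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
print(first_column_with_one(matrix))   # 1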
Given a sorted array, I believe an equation can be created to determine the index where any given number would be inserted.
For instance, given the sorted array [ -1, 0, 1 ], the input/output table for my desired function looks like this:
   x  | f(x)
------+------
 -2   | 0
 -1.5 | 0
 -1   | 0, 1
 -0.5 | 1
  0   | 1, 2
  0.5 | 2
  1   | 2, 3
  1.5 | 3
  2   | 3
I have chosen x to be the number I wish to insert into the array, and the function outputs the indices at which an insert function could place x so that the array stays sorted.
What interests me is that given this simplification of the problem, I notice two things:
The output of the function must be an integer
There are cases where the function could return 2 different values
And this is where I leave my thoughts to those who have more experience than me...
My first thought is that the output reminds me of Karnaugh mapping: in some cases there are two values the output can take, but it doesn't matter which one is chosen.
My second thought is of quantum computing. I am not experienced enough to be specific, but if the two possible outputs could be mapped onto a qubit and processed on a quantum computer, what opportunities would that hold? Could a quantum computer help derive the formula I'm looking for?
My example is very simple, but I just wanted to share this here in case anyone was interested.
A polynomial of degree n can be uniquely defined by n+1 points. However, you'll want a polynomial that can be fit to your n+1 points while remaining monotonic. I'm not entirely certain how to accomplish this, but I'm sure curve-fitting libraries have already solved it for us. It probably just means adding a few more degrees of freedom to the polynomial and fitting it subject to some constraints.
Regarding your note on superposition: I doubt it has many implications for the world of quantum computing. Actually, I would argue that f shouldn't map to more than one value, as that would violate the definition of a function. If there are two indices it could map to, it's because the values are equal, and hence the order doesn't matter, so you should just pick one (insert before, or insert after, an equal value), as you'd have to do in the implementation anyway.
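As a side note on the two-valued cases: the standard bisection functions already compute exactly this mapping, and the ambiguous inputs are precisely where the "insert before" and "insert after" variants differ. A small Python check against the table above:

import bisect

arr = [-1, 0, 1]
for x in [-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2]:
    lo = bisect.bisect_left(arr, x)    # insert before any equal values
    hi = bisect.bisect_right(arr, x)   # insert after any equal values
    print(x, sorted({lo, hi}))         # reproduces the f(x) column above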
Let's use the table below. How can I efficiently check whether the value 11 is in the table? Note that the numbers in yellow may not always be consecutive. Looping through all the values is O(n^2), but that's not very efficient.
One possible solution is the following: put all the numbers from either the yellow row or the yellow column into some set, say a hash set. Let's use the row as an example. After that, iterate over the column and for each number x check whether the number A - x is in the hash set (in your case A is 11). This approach leads to linear complexity and linear additional memory. If you know the numbers are sorted, you do not need the hash set to get the same computational complexity.
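A minimal sketch of the hash-set approach, assuming (since the original picture isn't reproduced here) that the yellow row and yellow column hold header values and each inner cell is the sum of its row and column header; the sample numbers below are made up:

def target_in_sum_table(row_values, col_values, target):
    # True if some row header r and column header c satisfy r + c == target.
    seen = set(row_values)                                 # O(n) extra memory
    return any(target - c in seen for c in col_values)     # O(n) lookups

print(target_in_sum_table([1, 3, 5, 7], [2, 4, 6, 8], 11))   # True: 3 + 8, 5 + 6, 7 + 4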
I have an array of N elements (representing the N letters of a given alphabet), and each cell of the array holds an integer value: the number of occurrences of that letter in a given text. Now I want to randomly choose a letter from the alphabet based on its number of appearances, with the given constraints:
If the letter has a positive (nonzero) value, then it can always be chosen by the algorithm (with a bigger or smaller probability, of course).
If a letter A has a higher value than a letter B, then it has to be more likely to be chosen by the algorithm.
Now, taking that into account, I've come up with a simple algorithm that might do the job, but I was wondering if there is a better way. This seems quite fundamental, and I think there might be cleverer ways to accomplish it more efficiently. This is the algorithm I thought of (sketched in code below):
Add up all the frequencies in the array and store the result in SUM.
Choose a random value from 0 to SUM and store it in RAN.
While RAN > 0, visit each cell in the array in order (starting from the first) and subtract the value of that cell from RAN.
The last visited cell is the chosen one.
So, is there a better thing to do than this? Am I missing something?
I'm aware that most modern computers can compute this so fast that I won't even notice if my algorithm is inefficient, so this is more of a theoretical question than a practical one.
I prefer an explained algorithm rather than just code for an answer, but if you're more comfortable providing your answer in code, I have no problem with that.
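For reference, a minimal Python sketch of the linear-scan algorithm described above (it draws RAN from 1..SUM so that zero-frequency letters can never be chosen; the sample frequencies are made up):

import random

def pick_linear(letters, freqs):
    total = sum(freqs)                  # SUM
    ran = random.randint(1, total)      # RAN
    for letter, f in zip(letters, freqs):
        ran -= f                        # subtract each frequency in order
        if ran <= 0:
            return letter               # the last visited cell is the chosen one

print(pick_linear("ABCD", [1, 4, 3, 2]))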
The idea:
Iterate through all the elements and set the value of each element as the cumulative frequency thus far.
Generate a random number between 1 and the sum of all frequencies
Do a binary search on the values for this number (finding the first value greater than or equal to the number).
Example:
Element     A  B  C   D
Frequency   1  4  3   2
Cumulative  1  5  8  10
Generate a random number in the range 1-10 (1+4+3+2 = 10, the same as the last value in the cumulative list), do a binary search, which will return values as follows:
Number Element returned
1 A
2 B
3 B
4 B
5 B
6 C
7 C
8 C
9 D
10 D
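A minimal Python sketch of this cumulative-frequency approach, using the example above (bisect_left finds the first cumulative value greater than or equal to the random number):

import bisect
import random

def build_cumulative(frequencies):
    cumulative, total = [], 0
    for f in frequencies:
        total += f
        cumulative.append(total)
    return cumulative

def pick(elements, cumulative):
    r = random.randint(1, cumulative[-1])                # random number in 1..total
    return elements[bisect.bisect_left(cumulative, r)]   # binary search

elements = ["A", "B", "C", "D"]
cumulative = build_cumulative([1, 4, 3, 2])   # [1, 5, 8, 10]
print(pick(elements, cumulative))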
The Alias Method has amortized O(1) time per value generated, but requires two uniforms per lookup. Basically, you create a table where each column contains one of the values to be generated, a second value called its alias, and a conditional probability of choosing between the value and its alias. Use your first uniform to pick any of the columns with equal likelihood, then choose between the primary value and the alias based on your second uniform. It takes O(n log n) work to initially set up a valid table for n values, but after the table is built, generating values takes constant time. You can download this Ruby gem to see an actual implementation.
Two other very fast methods by Marsaglia et al. are described here. They have provided C implementations.
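A minimal Python sketch of an alias table using Vose's construction (one common way to build the table; the Ruby gem and the Marsaglia methods mentioned above may differ in details):

import random

def build_alias_table(weights):
    n, total = len(weights), sum(weights)
    prob = [w * n / total for w in weights]   # scaled so the average is 1
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                          # top up a light column with a heavy one
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def alias_sample(prob, alias):
    i = random.randrange(len(prob))           # first uniform: pick a column
    return i if random.random() < prob[i] else alias[i]   # second uniform: value or alias

prob, alias = build_alias_table([1, 4, 3, 2])
print("ABCD"[alias_sample(prob, alias)])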
I need to write a function which will compare 2-5 "files" (really 2-5 sets of database rows, but it's a similar concept), and I have no clue how to do it. The resulting diff should present the 2-5 files side by side. The output should show added, removed, changed and unchanged rows, with a column for each file.
What algorithm should I use to traverse rows so as to keep complexity low? The number of rows per file is less than 10,000. I probably won't need External Merge as total data size is in the megabyte range. Simple and readable code would of course also be nice, but it's not a must.
Edit: the files may be derived from some unknown source; there is no "original" to which the other 1-4 files can be compared. All files will have to be compared to the others in their own right somehow.
Edit 2: I, or rather my colleague, realized that the contents may be sorted, since the output order is irrelevant. This solution means applying additional domain knowledge to this part of the application, but also that the diff complexity is O(N) and the code is less complicated. This solution is simple, and I'll disregard any answers to this edit when I close the bounty. However, I'll answer my own question for future reference.
If all of the n files (where 2 <= n <= 5 in the example) have to be compared to the others, then it seems to me that the number of pairwise comparisons will be C(n,2), defined (in Python, for instance) as:

import math

def C(n, k):
    # number of ways to choose k items out of n
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
Thus, you would have 1, 3, 6 or 10 pairwise comparisons for n = 2, 3, 4, 5 respectively.
The time complexity would then be C(n,2) times the complexity of the pairwise diff algorithm that you chose to use, which would be an expected O(ND), in the case of Myers' algorithm, where N is the sum of the lengths of the two sequences to be compared, A and B, and D is the size of the minimum edit script for A and B.
I'm not sure about the environment in which you need this code but difflib in Python, as an example, can be used to find the differences between all sorts of sequences - not just text lines - so it might be useful to you. The difflib documentation doesn't say exactly what algorithm it uses, but its discussion of its time complexity makes me think that it is similar to Myers'.
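For example, a minimal sketch of one pairwise comparison with difflib (the row strings here are made up; in practice they would be whatever serialization of your database rows you choose):

import difflib

rows_a = ["id=1 name=alice", "id=2 name=bob", "id=3 name=carol"]
rows_b = ["id=1 name=alice", "id=2 name=bobby", "id=4 name=dave"]

matcher = difflib.SequenceMatcher(a=rows_a, b=rows_b)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    # tag is one of 'equal', 'replace', 'delete', 'insert'
    print(tag, rows_a[i1:i2], rows_b[j1:j2])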
Pseudo code (for Edit 2):
10: stored cells = <empty list>
    for each file (column):
        cell = current unconsumed row of that file
        if stored cells is <empty> or cell < first of stored cells:
            stored cells = [cell]
        elif cell == first of stored cells:
            stored cells += cell
    if stored cells == <empty>:
        return result
    result += stored cells (one output row across the matching files)
    advance the index of each file whose current row was emitted
    goto 10
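A runnable Python sketch of that sorted-merge idea, assuming each file is already a sorted list of comparable rows; each output row holds the value for the files that contain it and None for the files that don't (the sample rows are made up):

def diff_sorted(files):
    indexes = [0] * len(files)
    result = []
    while True:
        current = [f[i] if i < len(f) else None for f, i in zip(files, indexes)]
        present = [c for c in current if c is not None]
        if not present:                    # every file is exhausted
            return result
        smallest = min(present)
        result.append([c if c == smallest else None for c in current])
        # advance only the files whose current row was emitted
        indexes = [i + 1 if c == smallest else i for i, c in zip(indexes, current)]

a = ["apple", "banana", "cherry"]
b = ["banana", "cherry", "date"]
c = ["apple", "cherry"]
for row in diff_sorted([a, b, c]):
    print(row)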
The case of 2 files can be solved with a standard diff algorithm.
From 3 files on you can use a "majority vote" algorithm:
If more than half of the records are the same (2 out of 3, 3 out of 4, 3 out of 5), then those form the reference against which the other record(s) are considered changed.
This also means quite a speedup for the algorithm if the number of changes is comparatively low.
Pseudocode:
initialize as many line indexes as there are files
while there are still at least 3 incrementable indexes
    if all indexed records are the same
        increment all line indexes
    else
        check majority vote (at least one record is different)
        if there is a majority
            mark minority changes, increment all line indexes
        else
            mark minority additions (maybe deciding randomly, e.g. in a 2:2 vote)
            check addition or removal and set line indexes accordingly
            increment all indexes
        endif
    endif
endwhile
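A minimal Python sketch of just the voting step (not the full diff driver): given the records currently indexed in each file, it returns the majority record and the positions of the files that disagree, or reports that there is no majority:

from collections import Counter

def majority_vote(records):
    counts = Counter(records)
    record, count = counts.most_common(1)[0]
    if count * 2 > len(records):                    # strict majority
        minority = [i for i, r in enumerate(records) if r != record]
        return record, minority
    return None, []                                 # e.g. a 2:2 vote

print(majority_vote(["x", "x", "y"]))        # ('x', [2]) -> the third file is the minority
print(majority_vote(["x", "x", "y", "y"]))   # (None, []) -> tie, no majority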