I'm looking for an algorithm that can find the cheapest and most efficient way to buy resources.
Example data (let's base this on rocks that contain minerals):
Rock A (Contains 300 units of iron, 200 units of copper, 500 units of silver)
Rock B (Contains 150 units of iron, 400 units of copper, 100 units of silver)
Rock C (Contains 180 units of iron, 300 units of copper, 150 units of silver)
Rock D (Contains 200 units of iron, 350 units of copper, 80 units of silver)
Rock E (Contains 220 units of iron, 150 units of copper, 400 units of silver)
Rock F (Contains 30 000 units of iron, 150 units of copper, 400 units of silver)
Each unit costs 1, so a rock's price is the sum of the units inside it.
Cases:
The first case needs 2600 units of copper.
The second case needs 5000 units of iron.
The third case needs 4600 units of silver.
What algorithm could I use to estimate which rocks to buy so that I pay the lowest unit price (i.e. waste as little as possible)?
For that, I came up with an algorithm that would calculate, for each rock, the ratio of wasted to needed material.
Still, the ratio could lead me to buying rock F in the iron case, since that would be the best ratio, but the overall value of that rock is huge, and the requirement could be met with cheaper rocks, as I don't need 30 000 units of iron.
Secondly, and far more complex: combine all 3 cases and find the best combination of rocks that meets all requirements at the lowest price (waste).
This is the unbounded knapsack problem, except that instead of maximization you need minimization: the amount of resource you need is the "weight" and the cost is the "value".
These are the re-written properties:
m[0] = 0
m[w] = min over i of ( v[i] + m[max(0, w - w[i])] )
where m[w] is the cheapest solution for w units of the resource, v[i] is the cost of the i-th rock, and w[i] is the amount of the resource it contains. Using max(0, w - w[i]) lets a rock cover more than what is still needed.
Here is some pseudocode:
m[0] = 0
for w = 1 to W:    # W is your target amount of the resource, i.e. 2600, 5000, 4600
    minv = max_value_possible
    # rocks is the vector with the <cost, resource> pairs of each rock,
    # e.g. <650, 150> for rock B and iron
    for r in rocks:
        rest = max(0, w - r.second)    # a rock may also cover more than what is left
        minv = min(minv, m[rest] + r.first)
    m[w] = minv
Knapsack problem
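For concreteness, here is a minimal Java sketch of that DP for a single resource (class and method names are just for illustration; the rock numbers are taken from the question):

public class CheapestRocks {
    // m[w] = cheapest cost to obtain at least w units of one resource
    static long cheapestCost(int target, int[] amount, int[] cost) {
        final long INF = Long.MAX_VALUE / 2;
        long[] m = new long[target + 1];
        java.util.Arrays.fill(m, INF);
        m[0] = 0;
        for (int w = 1; w <= target; w++) {
            for (int i = 0; i < amount.length; i++) {
                int rest = Math.max(0, w - amount[i]);   // overshooting is allowed
                m[w] = Math.min(m[w], m[rest] + cost[i]);
            }
        }
        return m[target];
    }

    public static void main(String[] args) {
        // copper content and total price of rocks A..F from the question
        int[] copper = {200, 400, 300, 350, 150, 150};
        int[] cost   = {1000, 650, 630, 630, 770, 30550};
        System.out.println(cheapestCost(2600, copper, cost));  // cheapest way to cover 2600 copper
    }
}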
The greedy approach you're talking about will give you a suboptimal solution.
In my opinion, the best way is to follow your first idea: the percentage of a mineral relative to the rock's total content gives you the best result.
For example, if you search for the mineral iron:
Rock A: 300/1000 = 30% iron
Rock F: 30000 / 30550 = 98.2% iron
90 kilograms of rocket fuel is necessary to propel 100 kilograms of mass into Earth's orbit from sea level. However, this becomes tricky, as now the mass of the rocket is (100 + 90) = 190 kilograms, inclusive of the original mass and the mass of the required fuel. This would now mean that we need an additional 81 kg of fuel to send the extra weight of the required fuel, thus requiring 271 kg of total mass.
And the problem goes on and on forever: we keep needing additional fuel for the additional mass of the previous fuel. It seems like an O(∞) problem.
I am confused as to how to design an O(1) constant-time algorithm to compute the rocket mass needed in order to send M kg of mass. Also, please let me know if there are other examples of O(∞).
This is more of a math problem than it is a question about asymptotic notation.
Specifically, the math works out like this. You have an initial mass of m that you want to launch. You need to add 0.9m mass of fuel to move that to space. But that fuel itself requires 0.9(0.9m) = (0.9)^2 m additional fuel, that (0.9)^2 m fuel requires (0.9)^3 m additional fuel, etc. You therefore need to compute the quantity
m + 0.9m + (0.9)^2 m + (0.9)^3 m + (0.9)^4 m + …
= m · ((0.9)^0 + (0.9)^1 + (0.9)^2 + (0.9)^3 + (0.9)^4 + …)
That bit in parentheses is the sum of an infinite geometric series. You can show via a variety of techniques (do a quick Google search for a few examples) that the sum works out to
1 / (1 - 0.9) = 1 / 0.1 = 10,
and so the total mass of the rocket will be 10m, of which 9m is fuel.
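In code this really is O(1); a tiny sketch (names are just for illustration):

public class RocketMass {
    // Total launch mass from the geometric-series result above.
    // fuelPerKg is the fuel needed per kg launched (0.9 in the question).
    static double totalLaunchMass(double payloadMass, double fuelPerKg) {
        return payloadMass / (1.0 - fuelPerKg);
    }

    public static void main(String[] args) {
        System.out.println(totalLaunchMass(100, 0.9));  // 1000.0 kg total, of which 900 kg is fuel
    }
}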
In the paper "When Is 'Nearest Neighbor' Meaningful?" we read that "We show that under certain broad conditions (in terms of data and query distributions, or workload), as dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor. In other words, the contrast in distances to different data points becomes nonexistent. The conditions we have identified in which this happens are much broader than the independent and identically distributed (IID) dimensions assumption that other work assumes."
My question is: how should I generate a dataset that exhibits this effect? I have created three points, each with 1000 dimensions and random values in the range 0-255 per dimension, but the points end up at clearly different distances and do not reproduce what is described above. Changing the number of dimensions (e.g. 10, 100 or 1000) or the range (e.g. [0,1]) does not seem to change anything; I still get distinctly different distances, which would be no problem for e.g. clustering algorithms!
I hadn't heard of this before either, so I am a little defensive, since I have seen that real and synthetic datasets in high dimensions really do not support the claim of the paper in question.
As a result, what I would suggest, as a first, dirty, clumsy and maybe not so good attempt, is to generate a sphere in a dimension of your choice (I do it like this) and then place a query at the center of the sphere.
In that case, every point lies at the same distance from the query point, so the nearest neighbor is exactly as far away as the farthest neighbor.
This, of course, is independent of the dimension, but it's what came to mind after looking at the figures in the paper. It should be enough to get you started, but surely better datasets can be generated, if any.
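One way to generate such a sphere (a sketch, not necessarily what my linked code does): draw a Gaussian vector and normalize it, so every point lands on the unit sphere around the origin.

import java.util.Random;

public class SphereDataset {
    // N points uniformly distributed on the surface of the unit sphere in DIM dimensions.
    // With the query placed at the origin, every point is at distance exactly 1.
    static double[][] generateSphere(int n, int dim, Random rnd) {
        double[][] data = new double[n][dim];
        for (int i = 0; i < n; i++) {
            double norm = 0;
            for (int d = 0; d < dim; d++) {
                data[i][d] = rnd.nextGaussian();
                norm += data[i][d] * data[i][d];
            }
            norm = Math.sqrt(norm);
            for (int d = 0; d < dim; d++) {
                data[i][d] /= norm;
            }
        }
        return data;
    }
}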
Edit, regarding:
distances for each point got bigger with more dimensions!!!!
This is expected, since the higher-dimensional the space, the sparser it is, and thus the greater the distances are. Moreover, this is expected if you think of, for example, the Euclidean distance, which grows as the dimensions grow.
I think the paper is right. First, about your test: one problem may be that you are using too few points. I used 10000 points, and below are my results (points evenly distributed in [0.0 ... 1.0] in all dimensions). For DIM=2, min and max differ by a factor of almost 100000; for DIM=1000 they differ only by a factor of 1.6, and for DIM=10000 by 1.248. So I'd say these results confirm the paper's hypothesis.
DIM/N = 2 / 10000
min/avg/max= 1.0150906548224441E-5 / 0.019347838262624064 / 0.9993862941797146
DIM/N = 10 / 10000.0
min/avg/max= 0.011363500131326938 / 0.9806472676701363 / 1.628460468042207
DIM/N = 100 / 10000
min/avg/max= 0.7701271349716637 / 1.3380320375218808 / 2.1878136533925328
DIM/N = 1000 / 10000
min/avg/max= 2.581913326565635 / 3.2871335447262178 / 4.177669393187736
DIM/N = 10000 / 10000
min/avg/max= 8.704666143050158 / 9.70540814778645 / 10.85760200249862
DIM/N = 100000 / 1000 (N=1000!)
min/avg/max= 30.448610133282717 / 31.14936583713578 / 31.99082677476165
I guess the explanation is as follows: take three randomly generated vectors, A, B and C. The total distance is based on the sum of the differences in each individual dimension of these vectors. The more dimensions the vectors have, the more the total sum of differences approaches a common average. In other words, it is highly unlikely that a vector C has a larger difference to A in every element than another vector B has to A. With increasing dimensions, C and B will have increasingly similar distances to A (and to each other).
My test dataset was created as follows. The dataset is essentially a cube ranging from 0.0 to 1.0 in every dimension; the coordinates were drawn from a uniform distribution in every dimension between 0.0 and 1.0. Example code (N=10000, DIM=[2..10000]):
public double[] generate(int N, int DIM) {
    // R is a java.util.Random instance; points are stored in one flat array of length N*DIM
    double[] data = new double[N*DIM];
    for (int i = 0; i < N; i++) {
        int pos = DIM*i;
        for (int d = 0; d < DIM; d++) {
            data[pos+d] = R.nextDouble();
        }
    }
    return data;
}
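For completeness, a sketch of how min/avg/max figures like the ones above can be computed (here simply the Euclidean distances from the first point to all others; the exact statistic may differ from what I used for the table):

static void printMinAvgMax(double[] data, int N, int DIM) {
    double min = Double.MAX_VALUE, max = 0, sum = 0;
    for (int i = 1; i < N; i++) {
        double d2 = 0;
        for (int d = 0; d < DIM; d++) {
            double diff = data[i*DIM + d] - data[d];   // difference to point 0 in dimension d
            d2 += diff * diff;
        }
        double dist = Math.sqrt(d2);
        min = Math.min(min, dist);
        max = Math.max(max, dist);
        sum += dist;
    }
    System.out.println("min/avg/max= " + min + " / " + sum/(N-1) + " / " + max);
}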
Following the equation given at the bottom of the accepted answer here, we get:
d=2 -> 98460
d=10 -> 142.3
d=100 -> 1.84
d=1,000 -> 0.618
d=10,000 -> 0.247
d=100,000 -> 0.0506 (using N=1000)
I have the following class:
class Person
{
    GenderEnum Gender;
    RaceEnum Race;
    double Salary;
    ...
}
I want to create 1000 instances of this class such that the collection of 1000 Persons follow these 5 demographic statistics:
50% male; 50% female
55% white; 20% black; 15% Hispanic; 5% Asian; 2% Native American; 3% Other;
10% < $10K; 15% $10K-$25K; 35% $25K-$50K; 20% $50K-$100K; 15% $100K-$200K; 5% over $200K
Mean salary for females is 77% of mean salary for males
Mean Salary as a percentage of mean white salary:
white - 100%.
black - 75%.
Hispanic - 83%.
Asian - 115%.
Native American - 94%.
Other - 100%.
The categories above are exactly what I want but the percentages given are just examples. The actual percentages will be inputs to my application and will be based on what district my application is looking at.
How can I accomplish this?
What I've tried:
I can pretty easily create 1000 instances of my Person class and assign the Gender and race to match my demographics. (For my project I'm assuming male/female ratio is independent of race). I can also randomly create a list of salaries based on the specified percentage brackets. Where I run into trouble is figuring out how to assign those salaries to my Person instances in such a way that the mean salaries across gender and mean salaries across race match the specified conditions.
I think you can solve this by assuming that the distribution of income for all categories is the same shape as the one you gave, but scaled by a factor which makes all the values larger or smaller. That is, the income distribution has the same number of bars and the same mass proportion in each bar, but the bars are shifted towards smaller values or towards larger values, and all bars are shifted by the same factor.
If that's reasonable, then this has an easy solution. Note that the mean value of the income distribution over all people is
M = sum(p[i]*c[i], i, 1, #bars)
where p[i] = mass proportion of bar i and c[i] = center of bar i. For each group j, the mean is
sum(s[j]*p[i]*c[i], i, 1, #bars) = s[j]*M
where s[j] is the scale factor for group j. Furthermore, you know that the overall mean is equal to the sum of the group means, each weighted by the proportion of people in that group, i.e.
M = sum(s[j]*M*q[j], j, 1, #groups)
where q[j] is the proportion of people in group j. Finally, you are given specific values for the mean of each group relative to the mean for white people, i.e. you know (s[j]*M)/(s[k]*M) = s[j]/s[k] = some given fraction, where k is the index of the white group. From this much you can solve these equations for s[k] (the scaling factor for the white group) and then get each s[j] from that.
I've spelled this out for the racial groups only. You can repeat the process for men versus women, starting with the distribution you found for each racial group and finding an additional scaling factor. I would guess that if you did it the other way, gender first and then race, you would get the same results, but although it seems obvious I wouldn't be sure unless I worked out a proof of it.
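A minimal sketch of that scale-factor computation for the racial groups (array contents taken from the example percentages in the question; the names are just illustrative):

// relToWhite[j] = mean of group j relative to the white mean (s[j]/s[white]),
// q[j] = proportion of people in group j. From 1 = sum(s[j]*q[j]) we get s[white].
static double[] groupScaleFactors(double[] relToWhite, double[] q) {
    double weighted = 0;
    for (int j = 0; j < q.length; j++) weighted += relToWhite[j] * q[j];
    double sWhite = 1.0 / weighted;                    // scale factor for the white group
    double[] s = new double[q.length];
    for (int j = 0; j < q.length; j++) s[j] = relToWhite[j] * sWhite;
    return s;
}

// e.g. groupScaleFactors(new double[]{1.00, 0.75, 0.83, 1.15, 0.94, 1.00},
//                        new double[]{0.55, 0.20, 0.15, 0.05, 0.02, 0.03})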
This problem is based on a puzzle by Joel Spolsky from 2001.
A guy "gets a job as a street painter, painting the dotted lines down the middle of the road." On the first day he finishes up 300 yards, on the second - 150, and on the 3rd even less so. The boss is furious and demands an explanation.
"I can't help it," says the guy. "Every day I get farther and farther away from the paint can!"
My question is: can you estimate the distance he covered on the 3rd day?
One of the comments in the linked thread does derive a precise solution, but my question is about a good enough estimation -- say, 10% -- that is easy to make from the general principles.
Clarification: this is about a certain method in the analysis of algorithms, not about developing an algorithm, nor about code.
There are a lot of unknowns here: his walking speed, his painting speed, how long the paint in the brush lasts...
But clearly there are two processes going on here. One is quadratic - it's the walking to and fro between the paint can and the painting point. The other is linear - it's the process of painting, itself.
Thinking about the 10th or even the 100th day, it is clear that the linear component becomes negligible, and the process becomes very nearly quadratic - the walking takes almost all the time. During the first few minutes of the first day, on the contrary, it is close to being linear.
We can thus say that the time t as a function of the distance s follows a power law t ~ s^a with a changing coefficient a = 1.0 ... 2.0. This also means that s ~ t^b, b = 1/a.
Applying the empirical orders of growth analysis:
The b coefficient between day 1 and day 2 is approximated as
b(1,2) = log (450/300) / log 2 = 0.585 ;; and so,
a(1,2) = 1/b(1,2) = 1/0.585 = 1.71
Just as expected, the a coefficient is below 2. Going for the time period between day 2 and day 3, we can set it approximately to the middle value between 1.71 and 2.0,
a(2,3) = 1.85 ;; a = 1.0 .... 2.0
b(2,3) = 0.54 ;; b = 1.0 .... 0.5
s(3) = s(2) * (3/2)^b(2,3)
= 450 * (3/2)^0.54
= 560 yards
Thus the distance covered in the third day can be estimated as 560 - 450 = 110 yards.
What if the a coefficient had the maximum possible value, 2.0, already (which is impossible)? Then, 450*(3/2)^0.5 = 551 yards. And for the other extreme, if it were the same 1.71 (which it clearly can't be, either), 450*(3/2)^0.585 = 570.
This means that the estimate of 110 yards is plausible, with an error of less than 10 yards on either side.
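The whole estimate fits in a few lines of Java (the numbers are the ones used above):

public class PainterEstimate {
    public static void main(String[] args) {
        double s1 = 300, s2 = 450;                      // cumulative yards after day 1 and day 2
        double b12 = Math.log(s2 / s1) / Math.log(2);   // ~0.585, so a(1,2) ~ 1.71
        double b23 = 0.54;                              // chosen between 1/1.85 and 0.5
        double s3 = s2 * Math.pow(3.0 / 2.0, b23);      // ~560 yards cumulative
        System.out.println("day 3 alone ~= " + (s3 - s2) + " yards");  // ~110 yards
    }
}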
Considering four assumptions:
painting speed = infinity
walking speed = x
he can paint only an infinitesimally small stretch with one brush stroke
he leaves his can at the starting point
The distance he walks to paint dy of road at distance y is 2y.
Total distance he walks = integral of 2y dy = y^2.
Total time needed to paint up to distance y = y^2 / x.
Time taken to paint 300 yards = 1 day
(300)^2/x = 1
x = 90000 yards/day
Total time he can paint distance y = y^2/90000
(y/300)^2 = 2 after second day
y = 300*2^(1/2) = 424
Day 1 = 300
Day 2 = 424-300 = 124
Day 3 = 300*3^(1/2)-424 = 520 - 424 = 96
Answer: 300 / 124 / 96, assuming 300 yards on the first day.
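Under this all-walking model, each day's yardage is just the difference of consecutive cumulative distances 300*sqrt(k); a quick check in Java:

public class PainterDays {
    public static void main(String[] args) {
        for (int k = 1; k <= 3; k++) {
            double painted = 300 * (Math.sqrt(k) - Math.sqrt(k - 1));
            System.out.printf("Day %d: %.0f yards%n", k, painted);  // ~300 / 124 / 95 (96 with the rounding above)
        }
    }
}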
How do I estimate the SNR from a single audio file containing speech?
I know of two methods:
log power histogram percentile difference (aka the "NIST quick method"), described here: http://labrosa.ee.columbia.edu/~dpwe/tmp/nist/doc/stnr.txt
10*log10( (S-N)/N ), where
S = sum{x[i]^2 * e[i]}
N = sum{x[i]^2 * (1-e[i])}
e[i] is some sort of voice activity detection output (speech/non-speech indicator)
Are there any better methods that do not require stereo data (or both a clean and a noisy version of the signal)? I would also like to avoid the "second method" described in the NIST document (see 1.), which makes strong assumptions about the distributions.
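For reference, a minimal sketch of method 2 as described above, assuming e[i] is a 0/1 speech indicator coming from some VAD that is not shown here:

static double snrDbFromVad(double[] x, double[] e) {
    double speech = 0, noise = 0;
    for (int i = 0; i < x.length; i++) {
        speech += x[i] * x[i] * e[i];         // S: energy weighted by the speech indicator
        noise  += x[i] * x[i] * (1 - e[i]);   // N: energy weighted by the non-speech indicator
    }
    return 10 * Math.log10((speech - noise) / noise);
}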
The human voice uses frequencies from 300 Hz to 3 kHz; this is the band (old) telephone systems use. The voice never occupies all of these frequencies at the same time, which is why we can use a frequency analysis to find the noise floor, without any reference signal or voice activity detection e[i]:
1. Compute an FFT with a frequency resolution of ~10-20 Hz.
   With a samplerate of 48 kHz you would use an FFT length of samplerate/resolution = 4800 samples, which should then be rounded to the nearest power of 2, which is 4096.
2. Identify the bins which hold the results for 300-3000 Hz.
   Bin index k holds the result for frequency k*samplerate/FFT_length. For the above 48 kHz input and FFT length 4096 this is k(300 Hz) = 300 * 4096 / 48000 ≈ 26 and k(3000 Hz) = 3000 * 4096 / 48000 = 256.
3. Calculate the energy in each of these bins: E[k] = FFT[k].re^2 + FFT[k].im^2. It depends on your FFT implementation "where" the real and imaginary parts are stored.
4. N = min{ E[k=26..256] } * number_of_bins (= 256 - 26 + 1)
5. S = sum{ E[k=26..256] }
6. SNR = (S - N) / N. The level in dB is 10*log10(SNR).
7. As the SNR varies over time, go back to step 1 with new samples, probably with some overlap.
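A self-contained sketch of this procedure for one frame (it uses a direct DFT over only the needed bins instead of a real FFT library, just to keep the example short):

import java.util.Arrays;

public class SnrEstimator {
    static double estimateSnrDb(double[] frame, double sampleRate) {
        int n = frame.length;                                  // e.g. 4096 samples at 48 kHz
        int kLo = (int) Math.round(300.0 * n / sampleRate);    // ~26 for the numbers above
        int kHi = (int) Math.round(3000.0 * n / sampleRate);   // 256 for the numbers above
        double[] energy = new double[kHi - kLo + 1];
        for (int k = kLo; k <= kHi; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double phi = 2 * Math.PI * k * t / n;
                re += frame[t] * Math.cos(phi);
                im -= frame[t] * Math.sin(phi);
            }
            energy[k - kLo] = re * re + im * im;               // E[k]
        }
        double noise = Arrays.stream(energy).min().getAsDouble() * energy.length;  // N
        double total = Arrays.stream(energy).sum();                                // S
        return 10 * Math.log10((total - noise) / noise);       // level in dB
    }
}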