Is it possible to get averages from multiple rows in Excel?

How can I produce averages from multiple columns of data in Excel?
If I have something like the following layout, where each sample contributes a Best column and a Mean column, plus AvgBest and AvgMean summary columns at the end:
How would I calculate the average for each row of the table without building the formula by hand for every cell under "AvgBest" and "AvgMean"? If it makes it any easier, I'm happy to compute the Best and Mean averages in separate spreadsheets, as it can all be put back together afterwards.
I'm trying to produce the averages of all of the sample data.
Is this possible in Excel?

To calculate the simple means, you could do the following. For the average best, put
=AVERAGE(A2,D2,G2,J2)
in cell M2 and copy it down to the end of your data.
For the average mean, put
=AVERAGE(B2,E2,H2,K2)
in cell N2 and copy it down to the end of your data.

Is there any option to do a FOR loop in Excel?

I have an Excel sheet in which I'm calculating the average completion of my Scrum tasks. The sheet also has a story point (SP) item for each task. My calculation is:
Result = SP * percentage of completion --> this calculation is done for each row, and after that I sum up all the results and take the total.
But sometimes I add a new task, and for each new task I have to add its calculation to the average result by hand.
Is there any way to use a for loop in Excel?
for(int i=0;i<50;i++){ if(SP!=null && task!=null)(B+i)*(L+i)}
My calculation is like the one below:
AVERAGE((B4*L4+B5*L5+B6*L6+B7*L7+B8*L8+B9*L9+B10*L10)/SUM(B4:B10))
First of all, AVERAGE is not doing anything in your formula, since the argument you pass to it is just one single value. You already do an average calculation by dividing by the sum. That average is in fact a weighted average, and so you could not even achieve that with a plain AVERAGE function.
I see several ways to make this formula more generic, so it keeps working when you add rows:
1. Use SUMPRODUCT
=SUMPRODUCT(B4:B100,L4:L100)/SUM(B4:B100)
The row number 100 is chosen arbitrarily, but should evidently encompass all data rows. If you have no data occurring below your table, then it is safe to add a large margin. You'll want to avoid the situation where you think you add a line to the table, but actually get outside of the range of the formula. Using proper Excel tables can help to avoid this situation.
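For example, if the data were kept in a proper Excel table (hypothetically named Tasks, with columns named SP and Completion), the structured-reference version of the same formula grows automatically as rows are added to the table:
=SUMPRODUCT(Tasks[SP],Tasks[Completion])/SUM(Tasks[SP])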
2. Use an array formula
This would be a second resort for when the formula becomes more complicated and cannot be executed with a "simple" SUMPRODUCT. But the above would translate to this array formula:
=SUM(B4:B100*L4:L100)/SUM(B4:B100)
Once you have typed this in the formula bar, make sure to press Ctrl+Shift+Enter to enter it. Only then will it act as an array formula.
Again, the same remark about row number 100.
3. Use an extra column
Things get easier when you use an extra column to store the product of the B and L values for each row. So you would put the following formula in cell N4:
=B4*L4
...and then copy that relative formula to the other rows. You can hide that column if you want.
Then the overall formula can be:
=SUM(N4:N100)/SUM(B4:B100)
With this solution you must take care to always copy an existing row when inserting a new one, so that the N column of the new row also gets the intermediate product formula.

Ruby: checking array for unique values and also returning minimum distance (haversine formula)

Having a hard time wrapping my head around this.
I have two CSVs that make up a fiber-optic network: one for latitude, one for longitude. These were pulled from a KMZ file, and both CSVs consist of 170k rows each due to the poorly built KMZ.
I have a CSV of potential customers that I want to compare against the fiber network. If the minimum distance (calculated using the Haversine formula) is less than 5,280 feet, the script prints that out to the output CSV file.
I've been successful with this in the past when there weren't so many lat/long pairs: 20k previously, but now we have 170k. The output CSV file gets massive, as you can imagine: 3 million rows and counting.
Then what I have to do is run a check (generally using the MySQL MIN() function, but I'm sure there's a better way) to return the minimum distance per address, grouped by address, since you really only care about the minimum distance per address; you don't want multiple rows for each address.
require 'csv'
require 'haversine'
#this could be put into one file, works as is
fib_lat = CSV.read("swfl_fiber_lat.csv")
fib_long = CSV.read("swfl_fiber_long.csv")
#use zip to read both arrays at the same time
fib_coords = fib_lat.map(&:last).zip(fib_long.map(&:last))
#multiple column CSV with customer data, headers turned on
customers = CSV.read("swfl_1a_geocoded.csv", headers:true)
CSV.open('swfl-output-data-within-1mile.csv','w', :write_headers=> true, :headers => ['First Name','Last Name','Latitude','Longitude','Feet to Fiber','Address','City','State','Zip','County','Company','Title Code Description','PrimarySIC6 Description','Business Status Code Description','Phone Number','Tollfree Phonenumber','EmployeeSize Location Description','Sales Volume Location Decode','Telecommunications Expense','Email Address']) do |csv_object|
  fib_coords.each do |fib_lat, fib_long|
    customers.each do |cust|
      # compute the distance once per pair instead of twice
      feet_to_fiber = Haversine.distance(cust[2].to_f, cust[3].to_f, fib_lat.to_f, fib_long.to_f).to_feet
      if feet_to_fiber < 5280
        # one field per header, rather than a single comma-joined string
        data_out = [cust[0], cust[1], cust[2].to_f, cust[3].to_f, feet_to_fiber.round(2),
                    cust[5], cust[6], cust[7], cust[8], cust[9], cust[10], cust[11], cust[12],
                    cust[13], cust[14], cust[15], cust[16], cust[17], cust[18]]
        csv_object << data_out
      end
    end
  end
end
I'm trying to think of a way to return, for each customer, only the minimum distance (maybe using .uniq or Array#min) without shoving every pair into the output CSV. Then, if there is indeed a distance below 5,280 feet for a customer, put only that row into the output CSV.
In regards to the pseudocode: if the distance is the minimum for that customer, make sure the customer value is unique, then shove that into the output CSV. I'm just not 100% sure how to implement this in my flurry of loops.
Any and all insight is appreciated.
First off, where are your performance issues? I'm going to assume it's not in calculating fib_coords but in looping through the customers. There are a couple of changes I would make:
1) I would not read the whole customers CSV file into memory in one go, but instead loop through it with CSV.foreach. Loading the whole CSV file is probably using a fair bit of memory which could be better used for the fib_coords array. This means reversing the order of the customers and fib_coords loops.
2) Secondly, you can avoid searching the whole fib_coords array. If you sort it by the first column so that it is in latitude order, you can calculate the minimum possible latitude (customer latitude - 5,280 ft), find the first potential match in fib_coords using bsearch (which is much faster than a linear search), and then loop through fib_coords from there until its latitude goes out of range (greater than customer latitude + 5,280 ft); see the sketch below.
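Putting both changes together might look roughly like this. It is a sketch rather than a drop-in replacement: the file names, column positions and haversine gem calls are carried over from the question, the one-mile latitude window is an approximation, and only a few output columns are kept for brevity.
require 'csv'
require 'haversine'

LIMIT_FT   = 5280.0
LAT_WINDOW = 5280.0 / 364_000            # one mile in degrees of latitude (~364,000 ft per degree)

# Build the fiber coordinate list once and sort it by latitude so we can binary-search into it.
fib_lat  = CSV.read('swfl_fiber_lat.csv')
fib_long = CSV.read('swfl_fiber_long.csv')
fib_coords = fib_lat.map(&:last).zip(fib_long.map(&:last))
                    .map { |lat, lng| [lat.to_f, lng.to_f] }
                    .sort_by(&:first)

CSV.open('swfl-output-data-within-1mile.csv', 'w') do |out|
  # Stream customers one row at a time instead of loading the whole file into memory.
  CSV.foreach('swfl_1a_geocoded.csv', headers: true) do |cust|
    lat, lng = cust[2].to_f, cust[3].to_f

    # Jump to the first fiber point whose latitude could possibly be within a mile.
    start = fib_coords.bsearch_index { |(flat, _)| flat >= lat - LAT_WINDOW } || fib_coords.size

    best = nil
    fib_coords[start..-1].each do |flat, flng|
      break if flat > lat + LAT_WINDOW   # past the latitude window, stop scanning
      d = Haversine.distance(lat, lng, flat, flng).to_feet
      best = d if best.nil? || d < best
    end

    # One row per customer, and only if the closest fiber point is within a mile.
    out << [cust[0], cust[1], lat, lng, best.round(2)] if best && best < LIMIT_FT
  end
end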

Building a histogram faster

I am working with a large dataset that I need to build a histogram of. I feel like my method of just going through the entire list and marking the frequency in a second array is a slow approach. Any suggestions on how to speed the process up?
Given that a histogram is a graph containing the counts of all items in each bin, you can't make one without visiting all the items.
However, you can:
1. Create the histogram as you collect the data. Then it takes no time to generate later.
2. Break up the data into N parts and work on each part in parallel. When each part is done counting, just sum the results for each bin. (You can also combine this with #1.) A sketch of this split-and-merge counting is below.
3. Sample the data. In theory, by looking at a fraction of your data you should be able to estimate the rest of it; the sampling error shrinks as the sample grows.
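Here is that split-and-merge counting as a minimal Ruby sketch. The bin width and the synthetic data are made up, and the per-chunk counting runs sequentially here; because the chunks are independent, the same counting function could just as well be handed to separate threads or processes.
BIN_WIDTH = 10

# Count one chunk on its own; this is the piece that can run in parallel.
def count_chunk(chunk)
  counts = Hash.new(0)
  chunk.each { |value| counts[value / BIN_WIDTH] += 1 }   # integer division picks the bin
  counts
end

data = Array.new(1_000_000) { rand(0...500) }             # stand-in for the real dataset

# Split into four parts and count each part separately.
partials = data.each_slice(data.size / 4 + 1).map { |chunk| count_chunk(chunk) }

# Merging is just an element-wise sum of the partial counts.
histogram = partials.reduce(Hash.new(0)) do |total, counts|
  counts.each { |bin, n| total[bin] += n }
  total
end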

How to split a dataset into two random halves in Weka?

I want to split my dataset into two random halves in weka.
How can I do it?
I had the same question and the answer is quite simple. First, you need to randomly shuffle the order of the instances with a Weka filter (unsupervised -> instance -> Randomize) and then split the dataset into two parts. You can find a complete explanation at the link below:
http://cs-people.bu.edu/yingy/intro_to_weka.pdf
You can first use the Randomize filter to shuffle the dataset, then use the RemovePercentage filter: run it so that it keeps the 30% you want for testing and save that, then run it again on the same randomized data with the invert box checked so you get the other 70%, and save that as well.
That way you end up with randomized, properly split testing and training sets.
I have an idea, though it doesn't use the native Weka API. How about using a random number generator? Math.random() generates numbers from 0 to 1.
Suppose that we want to split the dataset into set1 and set2.
for every instance in dataset
{
    if Math.random() < 0.5
        put the instance into set1
    else
        put the instance into set2
}
I think this method will generate a similar number of instances for the two subsets. If you want to generate exactly equal quantities, you can add additional conditions to the if-else.
Hope this may offer you some inspiration.

Random distribution of data

How do I distribute a small amount of data in a random order in a much larger volume of data?
For example, I have several thousand lines of 'real' data, and I want to insert a dozen or two lines of control data in a random order throughout the 'real' data.
Now I am not trying to ask how to use random number generators; I am asking a statistical question. I know how to generate random numbers, but my question is how do I ensure that the data is inserted in a random order while at the same time being fairly evenly scattered through the file.
If I just rely on generating random numbers there is a possibility (albeit a very small one) that all my control data, or at least clumps of it, will be inserted within a fairly narrow selection of 'real' data. What is the best way to stop this from happening?
To phrase it another way, I want to insert control data throughout my real data without there being a way for a third party to calculate which rows are control and which are real.
Update: I have made this a 'community wiki', so if anyone wants to edit my question to make it clearer, go right ahead.
Update: Let me try an example (I do not want to make this language or platform dependent as it is not a coding question, it is a statistical question).
I have 3000 rows of 'real' data (this amount will change from run to run, depending on the amount of data the user has).
I have 20 rows of 'control' data (again, this will change depending on the number of control rows the user wants to use, anything from zero upwards).
I now want to insert these 20 'control' rows roughly after every 150 rows of 'real' data (3000/20 = 150). However, I do not want it to be exactly that regular, as I do not want the control rows to be identifiable simply from their location in the output data.
Therefore I do not mind some of the 'control' rows being clumped together or for there to be some sections with very few or no 'control' rows at all, but generally I want the 'control' rows fairly evenly distributed throughout the data.
There's always a possibility that they end up close to each other if you do it truly randomly :)
But what I would do is this:
You have N rows of real data and x rows of control data.
To get the index of the row where you should insert the i-th control row, I'd use N/(x+1) * i + r, where r is some random number, different for each of the control rows and small compared to N/x. Choose any way of determining r; it can have either a Gaussian or even a flat distribution. i is the index of the control row, so 1 <= i <= x.
This way you can be sure to avoid your control rows condensing in one single place, and you can also be sure that they won't be at regular distances from each other.
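A minimal sketch of that jittered-spacing idea in Ruby, assuming the example numbers from the question (3000 real rows, 20 control rows) and a flat distribution for r:
n = 3000                           # rows of real data
x = 20                             # rows of control data
step = n / (x + 1.0)               # N/(x+1)

insert_at = (1..x).map do |i|
  r = (rand - 0.5) * step          # flat jitter, |r| < step/2, so neighbours cannot swap order
  (step * i + r).round
end
# insert_at now holds 20 row indices: scattered, but never badly clumped.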
Here's my thought: why don't you just loop through the existing rows and "flip a coin" for each row to decide whether you will insert control data there?
for (int i = 0; i < numberOfExistingRows; i++)
{
    double r = random();   // random() assumed to return a value in [0, 1)
    if (r > 0.5)
    {
        InsertRandomData();
    }
}
This should give you a nice random distribution throughout the data.
Going with the 3000 real data rows and 20 control rows for the following example (I'm better with examples than with English):
If you were to spread the 20 control rows as evenly as possible among the 3000 real data rows, you'd insert one at every 150th real data row.
So pick that number, 150, for the next insertion index.
a) Generate a random number between 0 and 150 and subtract it from the insertion index
b) Insert the control row there.
c) Increase insertion index by 150
d) Repeat at step a)
Of course this is a very crude algorithm and it needs a few improvements :)
If the real data is large or much larger than the control data, just generate interarrival intervals for your control data.
So pick a random interval, copy out that many lines of real data, insert control data, repeat until finished. How to pick that random interval?
I'd recommend using a Gaussian deviate with its mean set to the real data size divided by the control data size; the former could be estimated if necessary, rather than measured or assumed known. Set the standard deviation of this Gaussian based on how much "spread" you're willing to tolerate: a smaller stddev means a narrower distribution and tighter adherence to uniform spacing, while a larger stddev means a wider distribution and looser adherence to uniform spacing.
Now what about the first and last sections of the file? That is, what about an insertion of control data at the very beginning or very end? One thing you could do is come up with special-case estimates for these, but a nicer trick is the following: start your "index" into the real data at minus half the Gaussian mean and generate your first deviate. Don't output any real data until your "index" into the real data is legitimate.
A symmetric trick at the end of the data should also work quite well: simply keep generating deviates until you reach an "index" at least half the Gaussian mean beyond the end of the real data. If the index just before this one was off the end, generate the control data at the end.
You'll want to look at more than just statistics: in developing an algorithm for this sort of thing, it helps to look at rudimentary queueing theory. See Wikipedia or The Turing Omnibus, which has a nice, short chapter on the subject titled "Simulation".
Also, in some circumstances non-Gaussian distributions, particularly a Poisson process (whose interarrival times are exponentially distributed), give better, more natural results for this sort of thing. The algorithm outlined above still applies, using half the mean of whatever distribution seems right.
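Putting the interarrival idea into code might look roughly like this Ruby sketch. The input file names are hypothetical, the standard deviation is an arbitrary quarter of the mean, and a Box-Muller transform stands in for a proper random-number library.
# One Gaussian deviate via the Box-Muller transform.
def gaussian(mean, stddev)
  theta = 2 * Math::PI * rand
  rho   = Math.sqrt(-2 * Math.log(1 - rand))
  mean + stddev * rho * Math.cos(theta)
end

real    = File.readlines('real_data.txt')      # hypothetical input files
control = File.readlines('control_data.txt')

mean   = real.size.to_f / control.size         # average gap between control rows
stddev = mean / 4                              # smaller = tighter adherence to uniform spacing

output  = []
emitted = 0                                    # real rows copied out so far
index   = -mean / 2                            # the edge trick described above

control.each do |ctrl_row|
  index += gaussian(mean, stddev)              # next interarrival interval
  target = index.round.clamp(emitted, real.size)
  output.concat(real[emitted...target])        # copy real rows up to the insertion point
  output << ctrl_row
  emitted = target
end
output.concat(real[emitted..-1])               # whatever real data is left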

Resources