Excel 2010 is quick to calculate but slow to start up - performance

I have created an Excel database that takes information from a number of sheets and stores it in a single 'database' sheet, in a layout that makes it easy to build a pivot table capturing all of the data. Every cell in the database tab is either linked directly to a cell in another sheet or pulls its value from another sheet via the OFFSET function.
The database sheet is just over 10,000 rows by just over 100 columns (i.e. about a million linked cells). Calculation time is relatively quick, as the workbook does not use array-style formulas such as SUMIFS or SUMPRODUCT, but I have been unable to find a solution to the slow opening.
Is the database simply too big, or is there a way to avoid the extremely slow start-up (20-30 minutes)?
Thanks!
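A hedged sketch of one possible workaround (not from the original thread): snapshot the 'database' sheet to static values outside Excel, so the workbook no longer has to resolve roughly a million live links on open. This assumes the openpyxl Python library; the file path and sheet name below are hypothetical, and data_only mode only returns the values Excel cached at its last save.

```python
# Hypothetical sketch: copy the cached results of the "database" sheet's
# formulas over the formulas themselves, producing a static snapshot.
from openpyxl import load_workbook

SRC = "workbook_with_links.xlsx"   # hypothetical path
DST = "workbook_snapshot.xlsx"

# data_only=True reads the cached *results* of formulas instead of the formulas.
wb_values = load_workbook(SRC, data_only=True)
wb_out = load_workbook(SRC)        # second copy keeps all other sheets intact

src_ws = wb_values["database"]     # assumed sheet name
dst_ws = wb_out["database"]

for row in src_ws.iter_rows():
    for cell in row:
        dst_ws.cell(row=cell.row, column=cell.column, value=cell.value)

wb_out.save(DST)
```

Within Excel itself, switching to manual calculation and replacing the volatile OFFSET calls with non-volatile INDEX references may also help, though how much it affects the open-time delay depends on what Excel is doing at startup.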

Related

How to read data using Google Query Language and update using the Sheets API, when the query doesn't return row numbers?

What I want is to search for one row in a Google Spreadsheet and update the value of one of its columns, when the sheet has a large number of rows.
Looking at the main requirement, it seems easy, and of course there are easy and straightforward ways to do it. But when the spreadsheet has more data, around 100,000 rows, the usual methods are very slow. I was able to search the data using Google Query Language, and it is a very efficient way (less than 1 second for more than 50,000 records).
Google also offers a batch update mechanism, where we have to set the range in order to update the data.
But if we use the query API to search the data, we do not get the row number, so we don't know where to write the update. Google offers two independent solutions, but how can they be combined efficiently, especially for a large number of records?
Also, are there any alternative solutions I have missed?
The easiest way is to add a column with row numbers. You can then use query to retrieve the rows, which will contain the row numbers as well.
Another way is to use TextFinder, whose performance is comparable to or better than query.
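A minimal sketch of the row-number-column idea above, assuming Python: column A of the sheet holds =ROW(), the gviz/tq endpoint answers the Google Query Language query (the sheet must be readable by the caller), and google-api-python-client performs the batch update. The spreadsheet ID, sheet name, column layout, and function names are all hypothetical.

```python
# Hypothetical sketch: query by value, get row numbers back via column A,
# then write all updates in a single values.batchUpdate call.
import json
import requests
from googleapiclient.discovery import build  # pip install google-api-python-client

SPREADSHEET_ID = "your-spreadsheet-id"   # hypothetical
SHEET_NAME = "Data"

def find_rows(email):
    """Query the gviz endpoint for rows whose column B equals `email`.

    Column A is assumed to contain =ROW(), so each result carries its row number.
    """
    query = "select A, B, C where B = '{}'".format(email)
    url = ("https://docs.google.com/spreadsheets/d/{}/gviz/tq?sheet={}&tq={}"
           .format(SPREADSHEET_ID, SHEET_NAME, requests.utils.quote(query)))
    text = requests.get(url).text
    # The endpoint returns JSON wrapped in a JS callback; strip the wrapper.
    payload = json.loads(text[text.index("(") + 1 : text.rindex(")")])
    return [[cell["v"] if cell else None for cell in row["c"]]
            for row in payload["table"]["rows"]]

def update_status(service, rows, new_status):
    """Write `new_status` into column C of every matched row in one batch call."""
    data = [{"range": "{}!C{}".format(SHEET_NAME, int(row[0])),
             "values": [[new_status]]}
            for row in rows]
    body = {"valueInputOption": "USER_ENTERED", "data": data}
    service.spreadsheets().values().batchUpdate(
        spreadsheetId=SPREADSHEET_ID, body=body).execute()

# Usage (credential setup omitted):
# service = build("sheets", "v4", credentials=creds)
# matched = find_rows("someone@example.com")
# update_status(service, matched, "processed")
```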

Data Types and Indexes

Is there some sort of performance difference for inserting, updating, or deleting data when you use the TEXT data type?
I looked at the PostgreSQL documentation on character types and found this:
Tip: There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
This makes me believe there should not be a performance difference, but my friend, who is much more experienced than I am, says inserts, updates, and deletes are slower for the TEXT data type.
I had a table that was partitioned with a trigger and function, and extremely heavily indexed, but the inserts were not that slow.
Now I have another table with 5 more columns, all of the text data type, the exact same trigger and function, and no indexes, but the inserts are terribly slow.
From my experience, I think he is correct, but what do you guys think?
Edit #1:
I am uploading the same exact data, just the 2nd version has 5 more columns.
Edit #2:
By "Slow" I mean with the first scenario, I was able to insert 500 or more rows per second, but now I can only insert 20 rows per second.
Edit #3: I didn't add the indexes from the 1st scenario to the 2nd scenario because, from my understanding, indexes slow down inserts, updates, and deletes.
Edit #4: I guarantee it is exactly the same data, because I'm the one uploading it. The only difference is, the 2nd scenario has 5 additional columns, all text data type.
Edit #5: Even when I removed all of the indexes on scenario 2 and left all of them on scenario 1, the inserts were still slower on scenario 2.
Edit #6: Both scenarios have the same exact trigger and function.
Edit #7:
I am using an ETL tool, Pentaho, to insert the data, so there is no way for me to show you the code being used to insert the data.
I think I might have had too many transformation steps in the ETL tool. When I tried to insert the data in the same transformation as the steps that actually transform it, it was massively slow, but when I simply loaded the already-transformed data into an empty table and then inserted the data from that table into the actual table I'm using, the inserts were much faster than scenario 1, at 4000 rows per second.
The only difference between scenario 1 and scenario 2, other than the additional columns in scenario 2, is the number of steps in the ETL transformation. Scenario 2 has about 20 or more steps in its transformation; in some cases, there are 50 or more.
I think I can solve my problem by reducing the number of transformation steps, or putting the transformed data into an empty table and then inserting the data from this table into the actual table I'm using.
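A hedged sketch of that staging-table pattern in Python with psycopg2 (connection string, table, and column shapes are hypothetical): bulk-load the already-transformed rows into a bare staging table, then move them into the real table with a single INSERT ... SELECT, so the trigger/partitioning cost is paid in one set-based statement.

```python
# Hypothetical sketch of the staging-table pattern described above.
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb user=me")   # hypothetical connection string

# Already-transformed rows from the ETL step; the tuple shape must match
# target_table's columns (both names here are hypothetical).
rows = [("2015-01-01", "foo", 1.0), ("2015-01-02", "bar", 2.0)]

with conn, conn.cursor() as cur:
    # A temp table with no indexes, triggers, or partition routing.
    cur.execute("CREATE TEMP TABLE staging (LIKE target_table INCLUDING DEFAULTS)")
    # Multi-row inserts into the bare staging table are fast.
    execute_values(cur, "INSERT INTO staging VALUES %s", rows, page_size=1000)
    # One set-based statement moves everything into the real (triggered) table.
    cur.execute("INSERT INTO target_table SELECT * FROM staging")
```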
PostgreSQL text and character varying are the same, with the exception of the (optional) length limit for the latter. They will perform identically.
The only reasons to prefer character varying are
you want to impose a length limit
you want to conform with the SQL standard
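If in doubt, the claim is easy to test directly. A rough micro-benchmark sketch, assuming psycopg2 and throwaway temp tables: insert the same rows into a text column and a varchar(200) column and compare the timings; on PostgreSQL the two should come out essentially identical.

```python
# Hypothetical micro-benchmark: identical bulk inserts into text vs. varchar.
import time
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb user=me")   # hypothetical connection string
rows = [("payload number %d" % i,) for i in range(100000)]

def bench(name, coltype):
    with conn, conn.cursor() as cur:
        cur.execute(f"CREATE TEMP TABLE bench_{name} (payload {coltype})")
        start = time.perf_counter()
        execute_values(cur, f"INSERT INTO bench_{name} VALUES %s", rows)
        return time.perf_counter() - start

print("text        : %.2fs" % bench("text", "text"))
print("varchar(200): %.2fs" % bench("varchar", "varchar(200)"))
```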

How to find close GPS coordinates in a large table at a given time

(This is a theoretical question for a system design I am working on; suggestions for changes are welcome.)
I have a large table of GPS data which contains the following columns:
locID - PK
userID - User ID of the user of the app
lat
long
timestamp - point in UNIX time when this data was recorded
I am trying to design a way for a server to go through this dataset and check whether any users were in a specific place together (e.g. within 50 m of each other) within a specific time range (2 min), e.g. did user1 visit the same vicinity as user2 within that 2 min window.
The only way I can currently think of is to check each row one by one against all the rows in the same time frame using a coordinate distance algorithm. But this runs into a problem: if the users are spread all around the world and there are thousands, maybe millions, of rows in that 5 min time frame, it would not work efficiently.
Also, what if I want to know how long they were in each other's vicinity?
Any ideas/thoughts would be helpful, including which database to use (I am thinking either PostgreSQL or maybe Cassandra) and the table layout. All help appreciated.
Divide the globe into patches, where each patch is small enough to contain only a few thousand people, say 200m by 200m, and add the patchID as an attribute to each entry in the database. Note that two users cannot be in close proximity if they aren't in the same patch or in adjacent patches. Therefore, when checking for two users in the same place at a given time, query the database for a given patchID and the eight surrounding patchIDs, to get a subset of the database that may contain possible collisions.
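A sketch of that patch scheme, assuming Python and a PostgreSQL-style table; here the patch is stored as two integer grid indices rather than a single packed ID, the grid is sized at the equator, and all table/column names are assumptions. Near the poles a fixed-degree grid shrinks east-west, so a production version would scale the longitude cell by cos(latitude) or use a geohash/PostGIS instead.

```python
# Hypothetical sketch: map each GPS fix to a ~200 m grid cell at write time,
# then fetch only the cell and its eight neighbours for the time window.
import math

CELL_DEG = 200 / 111320.0   # ~200 m expressed in degrees (at the equator)

def patch(lat, lon):
    """Integer (x, y) grid cell containing the coordinate."""
    x = int(math.floor((lon + 180.0) / CELL_DEG))
    y = int(math.floor((lat + 90.0) / CELL_DEG))
    return x, y

def neighbour_patches(lat, lon):
    """The containing cell plus its eight surrounding cells."""
    x, y = patch(lat, lon)
    return tuple((x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1))

# Candidate query (psycopg2 style); patch_x/patch_y are stored with each row.
CANDIDATES_SQL = """
    SELECT locID, userID, lat, long, timestamp
    FROM   gps_points
    WHERE  (patch_x, patch_y) IN %(patches)s
    AND    timestamp BETWEEN %(t)s - 120 AND %(t)s + 120   -- 2 min window
"""
# cur.execute(CANDIDATES_SQL, {"patches": neighbour_patches(lat, lon), "t": unix_ts})
# Exact distances (e.g. haversine) then only need checking on this small subset.
```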

MS Access - matching a small data set with a very large data set

I have a huge Excel file with more than a million rows and a bunch of columns (300) which I've imported into an Access database. I'm trying to run an inner join query on it which matches on a numeric field in a relatively small dataset. I would like to capture all the columns of data from the huge dataset if possible. I was able to get the query to run in about half an hour when I selected just one column from the huge dataset. However, when I select all the columns from the larger dataset and have the query write to a table, it just never finishes.
One consideration is that the smaller dataset's join field is a number, while the larger one's is text. To get around this, I created a query on the larger dataset which converts the text field to a number using the Val function. The text field in question is indexed, but I'm thinking I should convert the field on the table itself to a numeric type to match the smaller dataset's type; maybe that would make the lookup more efficient.
Other than that, I could use, and would greatly appreciate, suggestions for a good strategy to get this query to run in a reasonable amount of time.
Access is a relational database. It is designed to work efficiently if your structure respects the relational model. Volume is not the issue.
Step 1: normalize your data. If you don't have a clue what that means, there is a wizard in Access that can help you with this (Database Tools > Analyze Table), or search for 'database normalization'.
Step 2: index the join fields
Step 3: enjoy fast results
Your idea of having both sides of the join in the same type IS a must. If you don't do that, indexes and optimisation won't be able to operate.
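A sketch of that advice applied to the setup described above, via Python/pyodbc (the same statements could be run from the Access query designer): add a numeric copy of the big table's text join key, populate it with Val, index it, and join number to number. The database path, table, and column names are hypothetical, and Val() may be restricted when called over ODBC, in which case the UPDATE should be run inside Access itself.

```python
# Hypothetical sketch: give the big table an indexed numeric join key so the
# inner join runs on matching, indexed types.
import pyodbc

conn = pyodbc.connect(
    r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\data\big.accdb;"              # hypothetical database path
)
cur = conn.cursor()

cur.execute("ALTER TABLE BigTable ADD COLUMN JoinKeyNum LONG")
cur.execute("UPDATE BigTable SET JoinKeyNum = Val(JoinKeyText)")
cur.execute("CREATE INDEX idxJoinKeyNum ON BigTable (JoinKeyNum)")
conn.commit()

# The join query then compares number to number on indexed fields:
# SELECT b.*, s.SomeField
# FROM BigTable AS b INNER JOIN SmallTable AS s
#   ON b.JoinKeyNum = s.JoinKey;
```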

Amortizing the calculation of distribution (and percentile), applicable on App Engine?

This is applicable to Google App Engine, but not necessarily constrained for it.
On Google App Engine, the database isn't relational, so no aggregate functions (such as sum, average, etc.) are available. Each row is independent of the others. To calculate a sum or average, the app simply has to amortize its calculation by recalculating on each individual new write to the database so that it's always up to date.
How would one go about calculating a percentile and a frequency distribution (i.e. density)? I'd like to make a graph of the density of a field of values, and this set of values is probably on the order of millions. It may be feasible to loop through the whole dataset (the limit for each query is 1000 rows returned) and calculate based on that, but I'd rather use a smarter approach.
Is there some algorithm to calculate or approximate density/frequency/percentile distribution that can be calculated over a period of time?
By the way, the range of the data is not known in advance; the maximum and minimum may be all over the place. So the distribution would have to take approximately 95% of the data and compute the density based only on that.
Getting the whole row (with that limit of 1000 at a time...) over and over again just to get a single number per row is surely unappealing. So denormalize the data by recording that single number in a separate entity that holds a list of numbers (up to a limit of, I believe, 1 MB per entity, so with 4-byte numbers no more than about 250,000 numbers per list).
So when adding a number, also fetch the latest 'added data values list' entity; if it is full, make a new one instead; append the new number and save it. There is probably no need to make this transactional if a tiny error in the statistics is no killer, as you appear to imply.
If the data for an item can be changed, have separate entities of the same kind recording the 'deleted' data values; to change one item's value from 23 to 45, add 23 to the latest 'deleted values' list and 45 to the latest 'added values' one -- this covers item deletion as well.
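A sketch of that denormalized 'values list' pattern using the legacy App Engine ndb Python API; the kind names, property names, and per-entity cap are all made up for illustration.

```python
# Hypothetical sketch of the pattern above: append each new value to the
# newest "values list" entity, rolling over to a fresh entity when full,
# with a parallel kind for deleted/replaced values.
from google.appengine.ext import ndb   # legacy App Engine Python runtime

MAX_PER_ENTITY = 50000   # conservative cap to stay under the ~1 MB entity limit

class AddedValues(ndb.Model):
    values = ndb.FloatProperty(repeated=True)
    created = ndb.DateTimeProperty(auto_now_add=True)

class DeletedValues(ndb.Model):
    values = ndb.FloatProperty(repeated=True)
    created = ndb.DateTimeProperty(auto_now_add=True)

def record(kind, value):
    """Append `value` to the newest list entity of `kind`, rolling over if full."""
    latest = kind.query().order(-kind.created).get()
    if latest is None or len(latest.values) >= MAX_PER_ENTITY:
        latest = kind()
    latest.values.append(value)
    latest.put()

def change_value(old, new):
    """Change an item's value: retire the old number, record the new one."""
    record(DeletedValues, old)
    record(AddedValues, new)

def current_values():
    """Added-minus-deleted values, ready for histogram/percentile work."""
    added = [v for e in AddedValues.query() for v in e.values]
    for v in (v for e in DeletedValues.query() for v in e.values):
        added.remove(v)   # assumes every deleted value was previously added
    return added
```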
It may be feasible to loop through the whole dataset (the limit for each query is 1000 rows returned), and calculate based on that, but I'd rather do some smart approach.
This is the most obvious approach to me; why are you trying to avoid it?