I have one thousand Google Form response spreadsheets containing students' answer sheets. I built a spreadsheet that pulls data (timestamps and scores) for each student using Google Sheets formulas (INDEX, MATCH and IMPORTDATA). Each student has their own pages. However, this takes too much time and sometimes makes some of the source sheets unresponsive (I think because of heavy formula usage). My questions:
Is it possible to do the same thing (pulling data from one thousand spreadsheets when the student's name matches) by using Google Script?
If it is possible, which one (Google Spreadsheets with formulas or Google Script) performs better?
Based on your answers, I will decide whether or not to start learning Google Script.
Thanks in advance.
Is it possible to do the same thing (pulling data from one thousand spreadsheets when the student's name matches) by using Google Script?
Yes, it's possible.
NOTE: Bear in mind that Google Sheets has a per-spreadsheet cell limit (5 million when this was written; it has since been raised to 10 million), so if your data exceeds this limit, you should consider using another data repository.
If it is possible, which one (Google Spreadsheets with formulas or Google Script) performs better?
Most Google Sheets formulas are recalculated every time a change is made in the spreadsheet that holds them. So when Google Sheets is used as a database management system, Google Apps Script will very likely perform better, because a script gives you more control over when the data transactions run.
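As a rough illustration of that control, the lookup can be run once, on demand, instead of on every recalculation. The sketch below is plain Python rather than Apps Script, and the in-memory `sheets` data, the column order and the `collect_student_rows` helper are all made up for illustration. In Apps Script, each sheet's rows would typically come from one bulk read such as `getDataRange().getValues()`.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the response spreadsheets:
# sheet id -> rows of (timestamp, student name, score).
Row = Tuple[str, str, float]


def collect_student_rows(
    sheets: Dict[str, List[Row]], student: str
) -> List[Tuple[str, str, float]]:
    """Return (sheet_id, timestamp, score) for every row matching the student."""
    wanted = student.strip().lower()
    matches = []
    for sheet_id, rows in sheets.items():
        for timestamp, name, score in rows:
            # Compare names case-insensitively and ignore stray whitespace.
            if name.strip().lower() == wanted:
                matches.append((sheet_id, timestamp, score))
    return matches


# Made-up sample data, only to show the shape of the input.
sheets = {
    "quiz-01": [
        ("2020-01-06 09:00", "Ada Lovelace", 8.0),
        ("2020-01-06 09:05", "Alan Turing", 7.0),
    ],
    "quiz-02": [("2020-01-13 09:02", "ada lovelace ", 9.5)],
}
result = collect_student_rows(sheets, "Ada Lovelace")
print(result)
```

Each sheet is read once and scanned in memory, so the source sheets are only touched when the script runs.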
Related
Measurement of execution time of built-in functions for Spreadsheet
Why do we use SpreadsheetApp.flush();?
Both do the same thing. Both will be equally intensive on your computer. My advice would be to upgrade your PC!
Our team uses Spotfire to host online analyses and also prepare monthly reports. One pain point that we have is around validation. The reports are all prepared reports, and the process for creating them each month is as simple as 1) refresh the data (through Infolink connected to Oracle) and 2) press a button to export each report. The format of the final product is a PDF.
The issue is that there are a lot of small things that can go wrong with the reports (filter accidentally applied, wrong month selected, data didn't refresh, new department not grouped correctly, etc.) meaning that someone on our team has to manually validate each of the reports. We create almost 20 reports each month and some of them are as many as 100 pages.
We've done a great job automating the creation of the reports, but now we have this weird imbalance where it takes like 25 minutes to create all the reports but 4+ hours to validate each one.
Does anyone know of a good way to automate, or even cut down, the time we have to spend each month validating the reports? I did a brief Google search and all I could find was in the realm of validating reports to meet government regulation standards.
It depends on 2 factors:
Do your reports have the same template (format) each time you extract them? You said that you pull them out automatically so I guess the answer is Yes.
What exactly are you trying to check/validate? You need a clear list of what you are validating. You mentioned month, grouping, and data values (for the refresh). The clearer the picture you have of the validation, the more likely it is that the process can be fully automated.
There are so-called RPA (robotic process automation) tools that can automate complex workflows.
A "data extract" task, which is part of a workflow, can detect and collect data from documents (PDF for example).
A robot that runs on the validating machine can:
batch read all your PDF reports from specified locations on your computer (or on another computer);
based on predefined templates it can read through the documents for specific fields that you specify (through defined anchors on the templates) and collect the exact data from there;
compare the extracted data with the baseline that you set (compare the month to be correct, compare a data field to confirm proper refresh of the data, another data field to confirm grouping, etc.).
It takes a bit of time to dissect the PDF for each report template and correctly set the anchors, but after that it runs seamlessly each time.
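The comparison step itself is simple once the fields have been extracted. Here is a minimal sketch in Python, assuming the extraction step has already produced a dictionary of field values per report. The field names, the sample values and the `validate_report` helper are hypothetical and not part of any RPA tool.

```python
from typing import Dict, List


def validate_report(extracted: Dict[str, str], baseline: Dict[str, str]) -> List[str]:
    """Return a list of mismatches; an empty list means the report passed."""
    problems = []
    for field, expected in baseline.items():
        actual = extracted.get(field)
        if actual is None:
            problems.append(f"{field}: missing from report")
        elif actual != expected:
            problems.append(f"{field}: expected {expected!r}, found {actual!r}")
    return problems


# Hypothetical baseline for this month's run and the values read from one PDF.
baseline = {"report_month": "2019-08", "department_count": "12"}
extracted = {"report_month": "2019-07", "department_count": "12"}

problems = validate_report(extracted, baseline)
for line in problems:
    print(line)
```

Only reports with a non-empty problem list would then need a manual look.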
One such tool I used is called Atomatik. It has a studio environment where you design the robot (or robots) and run the process.
Maybe someone can shed some light, share personal experience, or point to official documentation.
Suppose I have a Google Spreadsheet, which I connected to other Spreadsheets by using IMPORTRANGE. I noticed that my receiving Spreadsheet started loading slower than normal. Are there any tricks for optimizing the loading speed? For example, does it make any difference if I:
Use IMPORTRANGE less frequently by loading the data (let's say, once) to a separate tab, and then query that tab internally from within the same spreadsheet?
or
Use IMPORTRANGE frequently in multiple cells and run Query for each cell individually, and avoid having a large dedicated tab that gets all the info first?
Use IMPORTRANGE less frequently by loading the data (let's say, once) to a separate tab, and then query that tab internally from within the same spreadsheet?
This is definitely the right approach for better performance.
So I will be embarking on designing a dashboard that will display KPIs and other relevant information for my team. Since I am in the early stages of this project and am not very familiar with the technical process behind designing a dashboard, I need some questions answered before I go shopping for solutions, to avoid reinventing the wheel.
Here are some of my questions:
We want a dashboard that can provide live-time information via our data sources (or as close to live-time as possible). What function allows a dashboard to update itself with concurrent datasources? From a conceptual standpoint, I can understand creating a dashboard out of Microsoft Excel, and having the dashboard dependent on the values you may have set within your pivot table.
How do you make a dashboard request information from multiple datasources on its own? As in the Excel example, a user may have to go into the pivot tables to update values, but I want to know how a dashboard would request this by itself, and what the exact method is from a programming standpoint. Does the code execute itself every time you refresh the webpage?
How do you create datasources organically? I know that for some solutions such as SharePoint BI Center, there are pre-supported datasources like an Excel sheet or SharePoint, and it's as easy as uploading your document and letting the design handle the rest. However, I know there are going to be some datasources that will need to be fetched. Do I need to understand something else, like an event recorder, in order to navigate this issue?
Introduction
The dashboard (or a report, respectively) is usually the result of a long chain of steps. Very much simplified it could look like this:
src1
  |------\
src2     |               /---- Dashboards
  |------+---[DWH]-[BR]-+
src n    |     |         \---- Reports etc.
  |------/ [Big Data]
Keep in mind, this is only a very, very simple structure of a data backend / frontend.
DWH means Data Warehouse, where data might be stored temporarily (you referred to this as fetching). This could be a database, could be a Big Data engine, could be a combination of both...
Afterwards, there are Business Rules (BR). Those might be specific rules in how different departments calculate and relate to data, but also simple things like algebra.
Questions
So, the main question should not be about the technology:
What software should we choose?
How can we create a dashboard?
but on the contrary focused on your business processes (see it like a top-down view):
What does our core process look like? Where would I like to measure data?
How does department A calculate sales differently from department B? Should everyone use the same rule?
Where does everyone store the data? Can we access it? Do we need structural data?
And, very easy to forget but sometimes one of the biggest parts: Is the identifier of a business object (say, a sales ID) built and formatted in the same way everywhere?
Conclusion
When those questions are at least in the back of your head and you keep working in this direction, more or less automatically data will spill out at certain points of that process.
Then it won't matter if you use Excel, a small-to-medium app like Tableau, Tibco Spotfire, QlikView or Power BI, or you want to go full scale with a big Hadoop backend, databases and JasperReports, Apache Drill, Pentaho, SSIS on top of it... it will come out eventually.
TL;DR
Focus on the processes first. Make sure to understand them. Draft in Excel. Then proceed in getting the data and the tools you need to help your use cases. It will work out much better from a "top-down" approach than trying to solve your requirements with tools only.
In the past I used to build WebAnalytics using OLAP cubes running on MySQL.
Now an OLAP cube the way I used it is simply a large table (ok, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of measurements. Each measurement has a bunch of dimensions (i.e. which pagename, useragent, ip, etc.) and a bunch of values (i.e. how many pageviews, how many visitors, etc.).
The queries that you run on a table like this are usually of the form (meta-SQL):
SELECT hour, SUM(hits), SUM(bytes)
FROM MyCube
WHERE date='20090914' AND pagename='Homepage' AND browser!='googlebot'
GROUP BY hour
So you get the totals for each hour of the selected day with the mentioned filters.
One snag was that these cubes usually meant a full table scan (various reasons) and this meant a practical limitation on the size (in MiB) you could make these things.
I'm currently learning the ins and outs of Hadoop and the like.
Running the above query as a mapreduce on a BigTable looks easy enough:
Simply make 'hour' the key, filter in the map and reduce by summing the values.
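To make that concrete, here is a minimal in-memory sketch of the same map and reduce steps in plain Python (no Hadoop involved). The record layout and the sample rows are assumptions made up for illustration; the filter values are the ones from the query above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Record = Dict[str, object]


def map_phase(records: Iterable[Record]) -> Iterator[Tuple[int, Tuple[int, int]]]:
    """Apply the WHERE filters and emit (hour, (hits, bytes)) pairs."""
    for r in records:
        if (
            r["date"] == "20090914"
            and r["pagename"] == "Homepage"
            and r["browser"] != "googlebot"
        ):
            yield r["hour"], (r["hits"], r["bytes"])


def reduce_phase(
    pairs: Iterable[Tuple[int, Tuple[int, int]]]
) -> Dict[int, Tuple[int, int]]:
    """Sum hits and bytes per hour, the equivalent of GROUP BY hour."""
    totals = defaultdict(lambda: [0, 0])
    for hour, (hits, nbytes) in pairs:
        totals[hour][0] += hits
        totals[hour][1] += nbytes
    return {hour: (h, b) for hour, (h, b) in sorted(totals.items())}


# Made-up sample rows.
records = [
    {"date": "20090914", "pagename": "Homepage", "browser": "firefox", "hour": 9, "hits": 3, "bytes": 300},
    {"date": "20090914", "pagename": "Homepage", "browser": "googlebot", "hour": 9, "hits": 10, "bytes": 1000},
    {"date": "20090914", "pagename": "Homepage", "browser": "safari", "hour": 9, "hits": 2, "bytes": 150},
    {"date": "20090914", "pagename": "Homepage", "browser": "firefox", "hour": 10, "hits": 1, "bytes": 80},
    {"date": "20090913", "pagename": "Homepage", "browser": "firefox", "hour": 9, "hits": 5, "bytes": 500},
    {"date": "20090914", "pagename": "About", "browser": "firefox", "hour": 10, "hits": 4, "bytes": 400},
]
totals = reduce_phase(map_phase(records))
print(totals)
```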
Can you run a query like the one I showed above (or at least one with the same output) on a BigTable kind of system in 'real time' (i.e. via a user interface, where the user gets their answer ASAP) instead of batch mode?
If not, what is the appropriate technology to do something like this in the realm of BigTable/Hadoop/HBase/Hive and the like?
It's even kind of been done (kind of).
LastFm's aggregation/summary engine: http://github.com/zohmg/zohmg
A Google search turned up a Google Code project "mroll", but it doesn't have anything except contact info (no code, nothing). Still, you might want to reach out to that guy and see what's up. http://code.google.com/p/mroll/
We managed to create low-latency OLAP in HBase by pre-aggregating a SQL query and mapping it into appropriate HBase qualifiers. For more detail, see the site below.
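The general idea of pre-aggregation can be sketched in a few lines of Python: totals are updated at write time, so a read becomes a key lookup instead of a scan. The row-key and qualifier layout below ("date:pagename" rows with one "hour:HH" counter each) is only an assumption for illustration and is not taken from the linked post; a nested dictionary stands in for the HBase table.

```python
from collections import defaultdict
from typing import Dict

# Stand-in for a table: row key -> qualifier -> counter value.
Cube = Dict[str, Dict[str, int]]


def record_hits(cube: Cube, date: str, pagename: str, hour: int, hits: int) -> None:
    """Add hits to the pre-aggregated counter for this date, page and hour."""
    row_key = f"{date}:{pagename}"
    qualifier = f"hour:{hour:02d}"
    cube[row_key][qualifier] += hits


def hits_per_hour(cube: Cube, date: str, pagename: str) -> Dict[str, int]:
    """Read the hourly totals for one row; no scan over raw measurements."""
    return dict(sorted(cube.get(f"{date}:{pagename}", {}).items()))


cube: Cube = defaultdict(lambda: defaultdict(int))
record_hits(cube, "20090914", "Homepage", 9, 3)
record_hits(cube, "20090914", "Homepage", 9, 2)
record_hits(cube, "20090914", "Homepage", 10, 1)
record_hits(cube, "20090914", "About", 10, 4)

hourly = hits_per_hour(cube, "20090914", "Homepage")
print(hourly)
```

The trade-off is that every combination of filters you want to answer quickly has to be chosen up front and maintained on write.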
http://soumyajitswain.blogspot.in/2012/10/hbase-low-latency-olap.html
My answer relates to HBase, but applies equally to BigTable.
Urban Airship open-sourced datacube, which I think is close to what you want. See their presentation here.
Adobe also has a couple of presentations (here and here) on how they do "low-latency OLAP" with HBase.
Andrei Dragomir gave an interesting talk about how Adobe performs OLAP functionality with M/R and HBase.
Video: http://www.youtube.com/watch?v=5U3EnfiKs44
Slides: http://hstack.org/hbasecon-low-latency-olap-with-hbase/
If you are looking for a table-scan approach, have you considered Google BigQuery? BigQuery does automatic scale-out on the back end, which gives interactive response times. There is a good session by Jordan Tigani from the 2012 Google I/O event that explains some of the internals.
http://www.youtube.com/watch?v=QI8623HlYd4
It's not MapReduce but it is geared towards high-speed table scan like what you described.
My little site should be pulling a list of items from a table, using the active user's location as a filter. Think Craigslist, where you search for "dvd" but the results do not come from the whole DB; they are filtered by a location you select. My question has 2 levels:
Should I go a la Craigslist and ask users to pick a city-level location? My problem with this is that you need to generate what seems to me a hard-coded, hand-made list of locations.
Should I go a la zip code? The idea is to just ask the user to type in their zip code, and then pull all items that are in the same zip code or within a certain distance of it.
I prefer the zip code way, as it seems the more elegant solution, but how on earth does one go about creating a DB of all zip codes and implementing the function that, given zip code 12345, gets all zip codes within a 1 mile distance?
This should be a fairly common task, as many sites have a need similar to mine, so I am hoping not to reinvent the wheel here.
Getting a Zip Code database is no problem. You can try this free one:
http://zips.sourceforge.net/
I don't know how current it is, though. Alternatively, you can use one of many providers. We have an annual subscription to ZipCodeDownload.com, and for maybe $100 we get monthly updates with the latest Zip Code data, complete with lat/longs of the centroid of each zip code.
As for querying for all zips within a certain radius, you are going to need a spatial library of some sort. If you just have a table of zips with lats/longs, you will need a database-oriented mechanism. SQL Server 2008 has the capability built in, and there are open source libraries and commercial libraries that will add such capabilities to SQL Server 2005. The open source database PostgreSQL has a project, PostGIS, that adds this capability to that database. It is here: http://postgis.refractions.net/
Other database platforms probably have similar projects, but those are the ones I am aware of. With one of these DB based libraries you should be able to directly query for any zip codes (or any rows of any kind that have lat/long columns) within a given radius.
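For a sense of what such a radius query computes, here is a minimal sketch in plain Python using the haversine great-circle formula on centroid lat/longs. It is a brute-force scan over an in-memory table, not what a spatial database does internally (those use spatial indexes). The zip codes and coordinates below are made-up placeholders, not real centroid data, and `zips_within` is a hypothetical helper.

```python
import math
from typing import Dict, List, Tuple

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two lat/long points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def zips_within(
    centroids: Dict[str, Tuple[float, float]], origin_zip: str, radius_miles: float
) -> List[str]:
    """Return all other zip codes whose centroid lies within the radius."""
    lat0, lon0 = centroids[origin_zip]
    return sorted(
        z
        for z, (lat, lon) in centroids.items()
        if z != origin_zip and haversine_miles(lat0, lon0, lat, lon) <= radius_miles
    )


# Placeholder table: zip code -> (latitude, longitude) of its centroid.
centroids = {
    "00001": (40.00, -75.00),
    "00002": (40.01, -75.00),
    "00003": (40.20, -75.00),
}
nearby = zips_within(centroids, "00001", 1.0)
print(nearby)
```

Because the distances are measured between centroids, a 1 mile radius will miss neighbouring zip codes whose centroids are further apart than that, so a somewhat larger radius is usually more useful in practice.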
If you want to go a different route you can use spatial tools with a mapping library. There are open source options here as well, such as SharpMap and many others (Google can help out) that can use the free TIGER maps for the United States as the data source. However, this route is somewhat more complicated and possibly less performant if all you need is a radius search.
Finally, you may want to look into a web service. This, as you say, is a common need, and I imagine there are any number of web services that you can subscribe to that can provide all zip codes in a given radius from a provided zip code. A quick Google search turned up this:
http://www.zip-codes.com/free-zip-code-tools.asp#radius
But there are MANY resources to be had for the searching on this subject.
how on earth does one [...] implement the function that, given zip code 12345, gets all zip codes within a 1 mile distance?
Here is a sample on how to do that:
http://www.codeproject.com/KB/cs/zipcodeutil.aspx
Just to be technical... PostGIS isn't a project of the Postgres community... it's a stand-alone project that is built on top of Postgres. If you want help or support with PostGIS, you'll want to go to its community instead of Postgres.
You can use PostGIS. Additionally, I've used deCarta's mapping libraries. They have technology which allows you to geokey any arbitrary data type. Then you can query these spatially.
Disclaimer: I work for deCarta.
Wouldn't it be more efficient to just figure out which cities are within a 1 mile radius and store that information in a table? Then you don't have to do calculations in the database all the time.