I was recently asked this in an interview (Software Engineer) and didn't really know how to go about answering the question.
The question was focused on both the algorithm of the spreadsheet and how it would interact with the browser. I was a bit confused on what data structure would be optimal to handle the cells and their values. I guess any form of hash table would work with cells being the unique key and the value being the object in the cell? And then when something gets updated, you'd just update that entry in your table. The interviewer hinted at a graph but I was unsure of how a graph would be useful for a spreadsheet.
Other things I considered were:
Spreadsheet in a browser = auto-save. At any update, send all the data back to the server
Cells that are related to each other, i.e. C1 = C2+C3, C5 = C1-C4. If the value of C2 changes, both C1 and C5 change.
Usage of design patterns? Does one stand out over another for this particular situation?
Any tips on how to tackle this problem? Aside from the algorithm of the spreadsheet itself, what else could the interviewer have wanted? Does the fact that it's in a browser, as opposed to a separate application, add any difficulties?
For an interview this is a good question. If it were an actual task in your job, the simple answer would be to use a third-party component; there are a few good commercial ones.
While we can't say for sure what your interviewer wanted, for me this is a good question precisely because it is so open ended and has so many correct possible answers.
You can talk about the UI and how to implement the kind of dynamic grid you need for a spreadsheet and all the functionality of the cells and rows and columns and selection of cells and ranges and editing of values and formulas. You probably could talk for a while on the UI implications alone.
Alternatively, you can go the data route: talk about data structures to hold a spreadsheet, about the links between cells that formulas create, and about how to detect and deal with circular references. You can mention that in a browser you have less control over memory, so very large spreadsheets run into problems earlier, and how what is available in JavaScript versus a native language affects the data structures and calculations.
Along with data, a big issue with spreadsheets is numerical accuracy. Floating-point numbers are made to be fast, but they are not necessarily accurate at extreme levels of precision, and this leads to a lot of confusing questions. I believe Excel very recently switched to its own fixed-decimal representation, as it's now viable to do spreadsheet-level calculations without using the built-in floating-point calculations.
You can also talk about data structures and calculation and how they affect performance. In a browser you don't have threads (yet), so you can't run all the calculations in the background. If you have 100,000 rows with complex calculations and change one value that cascades across everything, you can get a warning about a slow script. You need to break up the calculation.
Finally, you can come at it from the user experience angle. How is the experience in a browser different from a native application? What are the advantages, and what cool things can you do in a browser that may be difficult in a desktop application? What things are far more complicated or even totally impossible (for example, associating your spreadsheet app with a file type so a user can double-click a file and open it in your online spreadsheet app, although I may be wrong about that still being unsupported)?
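As a rough illustration of "breaking up the calculation", here is a minimal TypeScript sketch, assuming the recalculation stays on the main thread. The names ChunkedRecalc and runInBackground are made up for this example, and chunkSize is assumed to be at least 1. It does a slice of the work, hands control back to the browser, then continues.

```typescript
// Processes a long list of work items one slice at a time, so the caller
// can yield to the browser between slices.
class ChunkedRecalc<T> {
  private next = 0;
  private readonly items: T[];
  private readonly recompute: (item: T) => void;
  private readonly chunkSize: number; // assumed to be >= 1

  constructor(items: T[], recompute: (item: T) => void, chunkSize: number) {
    this.items = items;
    this.recompute = recompute;
    this.chunkSize = chunkSize;
  }

  // Run one slice; returns true while work remains.
  step(): boolean {
    const end = Math.min(this.next + this.chunkSize, this.items.length);
    for (const item of this.items.slice(this.next, end)) {
      this.recompute(item);
    }
    this.next = end;
    return this.next < this.items.length;
  }
}

// Drive the job, handing control back to the event loop between slices.
function runInBackground<T>(job: ChunkedRecalc<T>, onDone: () => void): void {
  if (job.step()) {
    setTimeout(() => runInBackground(job, onDone), 0);
  } else {
    onDone();
  }
}
```

The chunk size is a tuning knob: larger slices finish sooner overall, smaller ones keep the page more responsive.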
Good question, lots of right answers, very open ended.
On the other hand, you could also have had a bad interviewer that is specifically looking for the answer they want and in that case you're pretty much out of luck unless you're telepathic.
You can say hopelessly too much about this. I'd probably start with:
If most of the cells are filled, use a simple 2D array to store it.
Otherwise use a hash table of location to cell
Or perhaps something like a kd-tree, which should allow for more efficient "get everything in the displayed area" queries.
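To make the hash-table option concrete, here is a minimal TypeScript sketch. SparseSheet and its methods are hypothetical names; the sketch assumes zero-based row and column indices and answers the "displayed area" query by probing each position in the window rather than with a kd-tree.

```typescript
// Sparse storage: only cells that hold a value occupy memory.
type CellValue = number | string;

class SparseSheet {
  // Keyed by "row,col"; both are zero-based indices.
  private cells = new Map<string, CellValue>();

  private static key(row: number, col: number): string {
    return `${row},${col}`;
  }

  set(row: number, col: number, value: CellValue): void {
    this.cells.set(SparseSheet.key(row, col), value);
  }

  get(row: number, col: number): CellValue | undefined {
    return this.cells.get(SparseSheet.key(row, col));
  }

  // Collect the filled cells inside a visible window (inclusive bounds).
  // Probing every position is fine for a screenful of cells.
  visible(top: number, left: number, bottom: number, right: number): Array<[number, number, CellValue]> {
    const out: Array<[number, number, CellValue]> = [];
    for (let r = top; r <= bottom; r++) {
      for (let c = left; c <= right; c++) {
        const v = this.get(r, c);
        if (v !== undefined) out.push([r, c, v]);
      }
    }
    return out;
  }
}
```

The cost of visible() depends on the window size, not on how many cells the sheet holds, which is usually what you want for rendering.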
By graph, your interviewer probably meant having each cell be a vertex and each reference to another cell be a directed edge. This would allow you to check for circular references fairly easily, and to efficiently update all cells that need to change.
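A minimal TypeScript sketch of that idea, with hypothetical names (DependencyGraph, addFormula, recalcOrder): edges run from a cell to the cells whose formulas read it, and a depth-first search from the edited cell yields both the recalculation order and a circular-reference check. Formula parsing is assumed to happen elsewhere.

```typescript
// Each cell is a vertex; an edge runs from a cell to every cell whose formula reads it.
class DependencyGraph {
  // dependents.get("C2") is the set of cells whose formulas reference C2.
  private dependents = new Map<string, Set<string>>();

  // Record that `cell`'s formula reads each cell in `reads`.
  addFormula(cell: string, reads: string[]): void {
    for (const source of reads) {
      let set = this.dependents.get(source);
      if (!set) {
        set = new Set<string>();
        this.dependents.set(source, set);
      }
      set.add(cell);
    }
  }

  // Cells to recompute after `changed` is edited, ordered so that every cell
  // comes after the cells it reads. Throws on a circular reference.
  recalcOrder(changed: string): string[] {
    const done = new Set<string>();
    const inProgress = new Set<string>();
    const postOrder: string[] = [];

    const visit = (cell: string): void => {
      if (done.has(cell)) return;
      if (inProgress.has(cell)) throw new Error(`Circular reference at ${cell}`);
      inProgress.add(cell);
      const next = this.dependents.get(cell);
      if (next) next.forEach((n) => visit(n));
      inProgress.delete(cell);
      done.add(cell);
      postOrder.push(cell);
    };

    visit(changed);
    // Reverse post-order of a depth-first search is a topological order.
    return postOrder.reverse().slice(1); // drop the edited cell itself
  }
}
```

With the example from the question (C1 = C2+C3, C5 = C1-C4), editing C2 gives the order C1, then C5, so C5 is never computed from a stale C1.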
"In a browser" (presumably meaning "over a network" - actually "in a browser" doesn't mean all that much by itself - one can write a program that runs in a browser but only runs locally) is significant - you probably need to consider:
What are you storing locally (everything or just the subset of cells that are currently visible)
How are you sending updates to the server (are you sending every change or keeping a collection of changed cells and only sending updates on save, or are you not storing changes separately and just sending the whole grid across during save)
Auto-save should probably be considered as well
Will you have an "undo", will this only be local, if not, how will you handle this on the server and how will you send through the updates
Is only this one user allowed to work with it at a time (or do you have to cater for multi-user, which brings dealing with conflicts, among other things, to the table)
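One of the options above, keeping a collection of changed cells and sending only those, can be sketched in TypeScript as follows. ChangeBuffer is a hypothetical name, and the actual network call is left out; the sketch only shows how edits are gathered into a batch.

```typescript
// Collects edits between saves so only changed cells go over the network.
class ChangeBuffer {
  private pending = new Map<string, string>();

  // Later edits to the same cell overwrite earlier ones.
  record(cell: string, rawValue: string): void {
    this.pending.set(cell, rawValue);
  }

  // Return the batch to send and start a fresh one.
  flush(): Array<{ cell: string; value: string }> {
    const batch: Array<{ cell: string; value: string }> = [];
    this.pending.forEach((value, cell) => {
      batch.push({ cell, value });
    });
    this.pending.clear();
    return batch;
  }
}
```

An auto-save timer or an explicit save button would call flush() and post the result. Undo and multi-user conflict handling need more than this, since the buffer forgets intermediate values.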
Looking at the CSS cursor property just begs for one to create a spreadsheet web application.
HTML table or CSS grid? HTML tables are purpose built for tabular
Resizing cell height and width is achievable with offsetX and
Storing the data is trivial. It can be Mongo, MySQL, Firebase, ...whatever. On blur, send update.
JavaScript/ECMAScript is more than capable of delivering all the Excel built-in functions. Did I mention web workers?
Need to increment letters as in column ID's? I got you covered.
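One possible way to do it, as a TypeScript sketch (columnName and nextColumn are names made up here, and the input is assumed to be uppercase A to Z only): column IDs are a bijective base-26 numbering, so "Z" is followed by "AA".

```typescript
// Convert a zero-based column index to spreadsheet letters: 0 -> "A", 25 -> "Z", 26 -> "AA".
function columnName(index: number): string {
  let n = index + 1; // bijective base-26 works on 1-based values
  let name = "";
  while (n > 0) {
    const rem = (n - 1) % 26;
    name = String.fromCharCode(65 + rem) + name;
    n = Math.floor((n - 1) / 26);
  }
  return name;
}

// The column that follows `name`, e.g. "Z" -> "AA". Assumes uppercase A-Z input.
function nextColumn(name: string): string {
  let oneBased = 0;
  for (let i = 0; i < name.length; i++) {
    oneBased = oneBased * 26 + (name.charCodeAt(i) - 64);
  }
  // The 1-based position of `name` equals the zero-based index of its successor.
  return columnName(oneBased);
}
```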
Most importantly, don't do it. Why? Because it's already been done.
Find a need and work that project.
I've written a small graphics engine for my game that has multiple canvases in a tree (these basically represent layers). Whenever something in a layer changes, the engine marks the affected layers as "soiled", and in the render code the lowest affected layer is copied to its parent via drawImage(), which is then copied to its parent, and so on up to the root layer (the onscreen canvas). This can result in multiple drawImage() calls per frame but also prevents rerendering anything below the affected layer. However, in frames where nothing changes, no rendering or drawImage() calls take place, and in frames where only foreground objects move, rendering and drawImage() calls are minimal.
I'd like to compare this to using multiple onscreen canvases as layers, as described in this article:
In the onscreen canvas approach, we handle rendering on a per-layer basis and let the browser handle displaying the layers on screen properly. From the research I've done and everything I've read, this seems to be generally accepted as likely more efficient than handling it manually with drawImage(). So my question is, can the browser determine what needs to be re-rendered more efficiently than I can, despite my insider knowledge of exactly what has changed each frame?
I already know the answer to this question is "Do it both ways and benchmark." But in order to get accurate data I need real-world application, and that is months away. By then if I have an acceptable approach I will have bigger fish to fry. So I'm hoping someone has been down this road and can provide some insight into this.
The browser cannot determine anything when it comes to the canvas element and the rendering as it is a passive element - everything in it is user rendered by the means of JavaScript. The only thing the browser does is to pipe what's on the canvas to the display (and more annoyingly clear it from time to time when its bitmap needs to be re-allocated).
There is unfortunately no golden rule for what the best optimization is, as this will vary from case to case. There are many techniques that could be mentioned, but they are merely tools; you will still have to figure out which tool, or which combination of tools, is right for your specific case. Perhaps layering is good in one case and brings nothing in another.
Optimization in general is very much an in-depth analysis and break-down of patterns specific to the scenario, which are then isolated and optimized. The process is often experiment, benchmark, re-adjust, experiment, benchmark, re-adjust, experiment, benchmark, re-adjust... Of course experience reduces this process to a minimum, but even with experience the specifics come in a variety of combinations that still require some fine-tuning from case to case (given they are not identical).
Even if you find a good recipe for your current project, it is not a given that it will work optimally for your next project. This is one reason no one can give an exact answer to this question.
However, when it comes to canvas, what you want to achieve is a minimum of clear operations and minimum areas to redraw (drawImage or shapes). The point of layers is to group elements together to enable this goal.
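One common way to keep redraw areas small is to track a "dirty" region per layer. The TypeScript sketch below is only an illustration with made-up names (Rect, DirtyRegion); it merges everything touched during a frame into a single bounding box and leaves the actual canvas calls to the caller.

```typescript
interface Rect { x: number; y: number; width: number; height: number; }

// Accumulates the regions touched during a frame into one bounding box,
// so a single clear-and-redraw covers everything that changed.
class DirtyRegion {
  private box: Rect | null = null;

  mark(r: Rect): void {
    const b = this.box;
    if (b === null) {
      this.box = { x: r.x, y: r.y, width: r.width, height: r.height };
      return;
    }
    const left = Math.min(b.x, r.x);
    const top = Math.min(b.y, r.y);
    const right = Math.max(b.x + b.width, r.x + r.width);
    const bottom = Math.max(b.y + b.height, r.y + r.height);
    this.box = { x: left, y: top, width: right - left, height: bottom - top };
  }

  // Returns the area to redraw this frame (null if nothing changed) and resets.
  take(): Rect | null {
    const result = this.box;
    this.box = null;
    return result;
  }
}
```

A single bounding box overdraws when two small changes are far apart, so whether this beats redrawing the whole layer is exactly the kind of thing that has to be benchmarked per case.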
I am using Bing Maps with Ajax and I have about 80,000 locations to drop pushpins into. The purpose of the feature is to allow a user to search for restaurants in Louisiana and click the pushpin to see the health inspection information.
Obviously it doesn't do much good to have 80,000 pins on the map at one time, but I am struggling to find the best solution to this problem. Another problem is that the distance between these locations is very small (All 80,000 are in Louisiana). I know I could use clustering to keep from cluttering the map, but it seems like that would still cause performance problems.
What I am currently trying to do is to simply not show any pins until a certain zoom level and then only show the pins within the current view. The way I am currently attempting to do that is by using the viewchangeend event to find the zoom level and the boundaries of the map and then querying the database (through a web service) for any points in that range.
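The filtering step described here, whether it runs in the web service or on the client, could look roughly like the TypeScript sketch below. Place, Bounds, pinsForView, and the MIN_PIN_ZOOM threshold are all hypothetical; none of this is Bing Maps API, and the map's zoom level and bounds are assumed to be passed in by the event handler.

```typescript
interface Place { id: number; lat: number; lon: number; }
interface Bounds { south: number; west: number; north: number; east: number; }

// Hypothetical threshold: below this zoom level no pins are shown at all.
const MIN_PIN_ZOOM = 12;

// Return the places to pin for the current view. Assumes the view does not
// cross the antimeridian, which holds for a map confined to Louisiana.
function pinsForView(places: Place[], bounds: Bounds, zoom: number): Place[] {
  if (zoom < MIN_PIN_ZOOM) return [];
  return places.filter(
    (p) =>
      p.lat >= bounds.south && p.lat <= bounds.north &&
      p.lon >= bounds.west && p.lon <= bounds.east
  );
}
```

A linear scan over 80,000 points is cheap enough to try on the client; on the database side the same predicate would normally be backed by a spatial or latitude/longitude index.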
It feels like I am going about this the wrong way. Is there a better way to manage this large amount of data? Would it be better to try to load all points initially and then have the data on hand without having to hit my web service every time the map moves. If so, how would I go about it?
I haven't been able to find answers to my questions, which usually means that I am asking the wrong questions. If anyone could help me figure out the right question it would be greatly appreciated.
Well, I've implemented a slightly different approach to this. It was just a fun exercise, but I'm displaying all my data (about 140,000 points) in Bing Maps using the HTML5 canvas.
I load all the data to the client up front. Then I optimized the drawing process enough that I could attach it to the "Viewchange" event (which fires continuously during the view change process).
I've blogged about this. You can check it here.
My example does not have interaction, but it could easily be added (it should be a nice topic for a blog post). You would have to handle the events manually and search for the corresponding points yourself or, if the number of points to draw and/or the zoom level was below some threshold, show regular pushpins.
Anyway, another option, if you're not restricted to Bing Maps, is to use the likes of Leaflet. It allows you to create a Canvas Layer which is a tile-based layer but rendered in client-side using HTML5 canvas. It opens a new range of possibilities. Check for example this map in GisCloud.
Yet another option, although more suitable for static data, is a technique called UTFGrid. The lads that developed it can certainly explain it better than me, but it scales to as many points as you want with phenomenal performance. It consists of a tile layer with your info and an accompanying JSON file containing something like "ASCII art" that describes the features on the tiles. Then a library called wax provides complete mouse-over and mouse-click events on it, without any performance impact whatsoever.
I've also blogged about it.
I think clustering would be your best bet if you can get away with using it. You say that you tried using clustering but it still caused performance problems? I went to test it out with 80000 data points at the V7 Interactive SDK and it seems to perform fine. Test it out yourself by going to the link and change the line in the Load module - clustering tab:
then hit the Run button. The performance seems acceptable to me with that many data points.
All web developers run into this problem when the amount of data in their project grows, and I have yet to see a definitive, intuitive best practice for solving it. When you start a project, you often create forms with <select> tags to help pick related objects for one-to-many relationships.
For instance, I might have a system with Neighbors, where each Neighbor belongs to a Neighborhood. In version 1 of the application I create an edit-neighbor form with a drop-down for selecting the neighborhood, which simply lists the 5 possible neighborhoods in my geographically limited application.
In the beginning, this works great. So long as I have maybe 100 records or fewer, my select box will load quickly and be fairly easy to use. However, let's say my application takes off and goes national. Instead of 5 neighborhoods I have 10,000. Suddenly my little drop-down takes forever to load, and once it loads, it's hard to find your neighborhood in the massive alphabetically sorted list.
Now, in this particular situation, having hierarchical data and letting users drill down using several dynamically generated drop-downs would probably work okay. However, what is the best solution when the objects/records being selected are not hierarchical in nature? In the past, I've done this with a popup containing a search box and a list, but this seems clunky and dated. In today's web 2.0 world, what is a good way to find one object amongst many for one's forms?
I've considered using an Ajaxified search box, but this seems to work best for free text, and falls apart a little when the data to be saved is just a reference to another object or record.
Feel free to cite specific libraries with generic solutions to this problem, or simply share what you have done in your projects in a more general way.
I think an auto-completing text box is a good approach in this case. Here on SO, they also use an auto-completing box for tags where the entry already needs to exist, i.e. not free-text but a selection. (remember that creating new tags requires reputation!)
I personally prefer this anyways, because I can type faster than select something with the mouse, but that is programmer's disease I guess :)
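To show how auto-complete can still save a reference rather than free text, here is a small TypeScript sketch. Option, suggest, and resolveId are hypothetical names; the sketch assumes the candidate list is already in memory, whereas with 10,000 records the same matching would more likely run on the server.

```typescript
interface Option { id: number; label: string; }

// Case-insensitive prefix match, capped so the suggestion list stays short.
function suggest(options: Option[], typed: string, limit: number): Option[] {
  const needle = typed.trim().toLowerCase();
  if (needle === "") return [];
  const matches: Option[] = [];
  for (const option of options) {
    if (matches.length >= limit) break;
    if (option.label.toLowerCase().indexOf(needle) === 0) matches.push(option);
  }
  return matches;
}

// Resolve what the user typed to an existing record, or null if there is none.
// The form would submit the returned id, never the raw text.
function resolveId(options: Option[], typed: string): number | null {
  const needle = typed.trim().toLowerCase();
  for (const option of options) {
    if (option.label.toLowerCase() === needle) return option.id;
  }
  return null;
}
```

Rejecting the form when resolveId returns null is what turns a free-text box into a selection.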
Auto-complete is usually the best solution in my experience for searches, but only where the user is able to provide text tokens easily, either as part of the object name or taxonomy that contains the object (such as a product category, or postcode).
However this doesn't always work, particularly where 'browse' behavior would be more suitable - to give a real example, I once wrote a page for a community site that allowed a user to send a message to their friends. We used auto-complete there, allowing multiple entries separated by commas.
It works great when you know the names of the people you want to send the message to, but we found during user acceptance that most people didn't really know who was on their friend list and couldn't use the page very well - so we added a list popup with friend icons, and that was more successful.
(this was quite some time ago - everyone just copies Facebook now...)
Different methods of organizing large amounts of data:
Spatial (geography/geometry)
Tags or facets
Different methods of searching large amounts of data:
Filtering (including autocomplete)
Sorting/paging (alphabetically-sorted data can also be paged by first letter)
Drill-down (assuming the data is organized as above)
Free-text search
Hierarchies are easy to understand and (usually) easy to implement. However, they can be difficult to navigate and lead to ambiguities. Spatial visualization is by far the best option if your data is actually spatial or can be represented that way; unfortunately this applies to less than 1% of the data we normally deal with day-to-day. Tags are great, but - as we see here on SO - can often be misused, misunderstood, or otherwise rendered less effective than expected.
If it's possible for you to reorganize your data in some relatively natural way, then that should always be the first step. Whatever best communicates the natural ordering is usually the best answer.
No matter how you organize the data, you'll eventually need to start providing search capabilities, and unlike organization of data, search methods tend to be orthogonal - you can implement more than one. Filtering and sorting/paging are the easiest, and if an autocomplete textbox or paged list (grid) can achieve the desired result, go for that. If you need to provide the ability to search truly massive amounts of data with no coherent organization, then you'll need to provide a full textual search.
If I could point you to some list of "best practices", I would, but HID is rarely so clear-cut. Use the aforementioned options as a starting point and see where that takes you.
I am currently developing a charting application (for the iPhone, although that is largely irrelevant) using their MVC pattern.
One aspect of the application is that you can overlay a number of statistics on the charts. I am a little unsure how I am going to structure these classes.
For each statistic there will be two aspects.
1. The calculation. The function which will take the data and calculate the relevant statistical figures.
2. The display. The statistics then need to be drawn over the top of the graph.
Obviously I want the code to comply with the MVC pattern as closely as possible, but I am planning to develop possibly hundreds of these statistics.
I could create three classes: one for the graphics, one for the logic, and a factory class to tie the two together. This would fit the pattern, but it seems to add a huge overhead in the number of classes in the system, plus additional complexity which I don't feel is necessary.
So, I am very tempted to create a single class for each statistic. But that would mean each class would have logic and graphics mixed in together, which is heavily frowned upon.
Are there any other suggestions as to how I can lay these out in a structured, reusable way without adding unnecessary complexity?
Thanks for the answers. Most useful, but has raised more questions!
MVC does fit the rest of the application perfectly. Also, as it's for the iPhone, I seem to be pushed along this path anyway. This is the only reason I am considering MVC for these statistics.
However, the user will not interact with these statistics; they are purely for display. The statistics paint various lines and symbols directly onto the view canvas. Each statistic paints its information in its own way. There is very little that can be shared between them, and each piece of data can only be usefully represented in one way. I can think of no other useful way that I would want to represent the information.
So it seems MVC is out for these, but I am unsure now what pattern would fit other than my newly invented "Mix logic and graphics" pattern which just feels wrong due to the Single Responsibility Principle (thanks for that link).
First of all, will the user interact with the statistics? If not, then you don't need MVC. (The Controller in MVC deals with user interaction).
You want to keep the number of classes to a minimum, which is good. Let's consider both the calculation and display separately.
How will you be displaying the statistics? Will they typically be text labels, or will there be other graph elements (error bars or things like that)? Try to figure out the different ways you will want to display your statistics.
How many different calculations will you have? Does each calculation map directly to a single graph element, or might it be drawn in a number of different ways? Try to figure out how the calculations relate to the graph elements.
As a concrete example, suppose you have a set of data points that you have plotted. You want to display the mean, the median, and the mode. You could display each of these as separate horizontal lines that cut through the chart at the appropriate Y value. The calculations are all independent, but the display logic can be shared. Or, perhaps you want to display the mean as both a line and as a text label. Here, there is only one calculation, but two different display methods.
MVC design is about separating the underlying data from its presentation. By doing this, you can re-use bits of presentation logic for many different pieces of data. Also, you can display a single piece of data in multiple ways, and they will all stay in sync.
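The mean/median example above can be sketched as follows. This is TypeScript rather than the Objective-C an iPhone app would use, all names (Statistic, HorizontalLine, asHorizontalLine) are made up for illustration, and the data set is assumed to be non-empty. Each statistic is only a calculation, and one shared routine turns any of them into a horizontal line for the view to draw.

```typescript
// A statistic is just a named calculation over the plotted values.
interface Statistic {
  name: string;
  calculate(values: number[]): number;
}

const mean: Statistic = {
  name: "mean",
  calculate: (values) => values.reduce((sum, x) => sum + x, 0) / values.length,
};

const median: Statistic = {
  name: "median",
  calculate: (values) => {
    const sorted = values.slice().sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  },
};

// One shared display description: a labelled horizontal line at the statistic's value.
interface HorizontalLine { label: string; y: number; }

function asHorizontalLine(stat: Statistic, values: number[]): HorizontalLine {
  return { label: stat.name, y: stat.calculate(values) };
}
```

Adding another statistic then means adding one small calculation object, not a new calculation class, display class, and factory.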
It depends on what you call complex. Most people consider methods or classes that are responsible for multiple things complex, and classes or methods that are responsible for one thing simple. This is also known as the Single Responsibility Principle (SRP).
When you use MVC, the view contains the display logic, the model contains the business logic (the calculation in your case), and the controller ties them together. You can implement MVC in different ways. Complex applications need to separate the domain model from a view model and map between the two, but most simple applications can use one model.
If you define complexity in a different way, MVC might not be what you like, and you should try a different approach.