App Design and optimisation: graphing data - xcode

Working in Xcode, Cocoa, Objective C
I m building an app which has an SQLIte database holding the data. The data is a daily summary of about 6 float values over a period of potentially 40 years, although more usually 10 years.
I am writing routines which will graph the data into an NSview. The user has several options in the UI as to whether to draw line or bar graph, the time period to graph, whether the data is weekly or daily etc.
There are two main functions to write here, one for updating the graph settings and one for getting data from the database in a form which the graph can handle (nested array of dates and values)
the question I have is whether it is best to load the full set of graph able data and have the graph 'decide' which slices of data to graph. Or whether to submit multiple requests to the database each time the user selects an option.
For example, if the full set of graph able data were loaded, then if the user selects the weekly option, the graph drawRect method could simply iterate over each 7th entry in the array. Alternatively, I could ask the database to re-submit an array of graph able data.
I hope this makes sense

I think it's best to select only the data you need, rather than letting the graph decide.
There's a lot more to consider than just the quantity of data you'll be reading. There is also the number of memory allocations, how much memory will actually be used, and the memory that is used "behind the scenes" by the allocator. There is also backing store paging.

Related

How do I process a graph that is constantly updating, with low latency?

I am working on a project that involves many clients connecting to a server(servers if need be) that contains a bunch of graph info (node attributes and edges). They will have the option to introduce a new node or edge anytime they want and then request some information from the graph as a whole (shortest distance between two nodes, graph coloring, etc).
This is obviously quite easy to develop the naive algorithm for, but then I am trying to learn to scale this so that it can handle many users updating the graph at the same time, many users requesting information from the graph, and the possibility of handling a very large (500k +) nodes and possibly a very large number of edges as well.
The challenges I can foresee:
with a constantly updating graph, I need to process the whole graph every time someone requests information...which will increase computation time and latency quite a bit
with a very large graph, the computation time and latency will obviously be a lot higher (I read that this was remedied by some companies by batch processing a ton of results and storing them with an index for later use...but then since my graph is being constantly updated and users want the most up to date info, this is not a viable solution)
a large number of users requesting information which will be quite a load on the servers since it has to process the graph that many times
How do I start facing these challenges? I looked at hadoop and spark, but they seem have high latency solutions (with batch processing) or solutions that address problems where the graph is not constantly changing.
I had the idea of maybe processing different parts of the graph and indexing them, then keeping track of where the graph is updated and re-process that section of the graph (a kind of distributed dynamic programming approach), but im not sure how feasible that is.
Thanks!
How do I start facing these challenges?
I'm going to answer this question, because it's the important one. You've enumerated a number of valid concerns, all of which you'll need to deal with and none of which I'll address directly.
In order to start, you need to finish defining your semantics. You might think you're done, but you're not. When you say "users want the most up to date info", does "up to date" mean
"everything in the past", which leads to total serialization of each transaction to the graph, so that answers reflect every possible piece of information?
Or "everything transacted more than X seconds ago", which leads to partial serialization, which multiple database states in the present that are progressively serialized into the past?
If 1. is required, you may well have unavoidable hot spots in your code, depending on the application. You have immediate information for when to roll back a transaction because it of inconsistency.
If 2. is acceptable, you have the possibility for much better performance. There are tradeoffs, though. You'll have situations where you have to roll back a transaction after initial acceptance.
Once you've answered this question, you've started facing your challenges and, I assume, will have further questions.
I don't know much about graphs, but I do understand a bit of networking.
One rule I try to keep in mind is... don't do work on the server side if you can get the client to do it.
All your server needs to do is maintain the raw data, serve raw data to clients, and notify connected clients when data changes.
The clients can have their own copy of raw data and then generate calculations/visualizations based on what they know and the updates they receive.
Clients only need to know if there are new records or if old records have changed.
If, for some reason, you ABSOLUTELY have to process data server side and send it to the client (for example, client is 3rd party software, not something you have control over and it expects processed data, not raw data), THEN, you do have a bit of an issue, so get a bad ass server... or 3 or 30. In this case, I would have to know exactly what the data is and how it's being processed in order to make any kind of suggestions on scaled configuration.

What is the best way to implement a service operation that returns places sorted by distance to the current user?

let's say a service operation like this
api/places/?category=entertainment&geo=123,456&area=30&orderBy=distance
so the user is searching for places of entertainment near the geo location (123,456), no further than 30 kms boundary, and want the results sorted by distance
suppose the search should be paged, say it will have 500 items satisfy the query, but the page size is 50, so it will have 10 pages.
each item in database stores only geo location of the place, and then I will have to fetch all 500 items from db first, calculate the distance of each item, cut the array to the page number and then return.
so every time the user is requesting next page, I will have to query all 500 and then do the same thing again.
is this the right way to implement a service like that, or is there a better strategy?
it seems to be worse when my database don't have the geo location, because I am using a different API provider to give me the geo of a place. It means I will have to query everything and hit another service to get the geo, calculate, and finally able to sort... :(
thank you very much!
If you were developing a single-page app for the search results, you wouldn't need to send another request to the server every time the user presses Next. You could use a pagination library that takes the full set of results and sorts them into pages accordingly.
In this situation, there's a trade between the size of the data you want to store, and the speed and efficiency of your web application. In this sort of situation, you should really be dealing with large data sets. You should ideally have additional databases for each general geographic region (such as Northeast, Southeast) that store the distance between each store and each location the user can enter. You should use a separate server for this, and aggregate the data at intervals (say, every six hours) using an automated database operation, such as running MongoDB scripts.
You should also consider using weighted graphs to store the distances between locations. You could then more easily traverse them with a graphing algorithm.

How to implement a spreadsheet in a browser?

I was recently asked this in an interview (Software Engineer) and didn't really know how to go about answering the question.
The question was focused on both the algorithm of the spreadsheet and how it would interact with the browser. I was a bit confused on what data structure would be optimal to handle the cells and their values. I guess any form of hash table would work with cells being the unique key and the value being the object in the cell? And then when something gets updated, you'd just update that entry in your table. The interviewer hinted at a graph but I was unsure of how a graph would be useful for a spreadsheet.
Other things I considered were:
Spreadsheet in a browser = auto-save. At any update, send all the data back to the server
Cells that are related to each other, i.e. C1 = C2+C3, C5 = C1-C4. If the value of C2 changes, both C1 and C5 change.
Usage of design patterns? Does one stand out over another for this particular situation?
Any tips on how to tackle this problem? Aside from the algorithm of the spreadsheet itself, what else could the interviewer have wanted? Does the fact that its in a browser as compared to a separate application add any difficulties?
Thanks!
For an interview this is a good question. If this was asked as an actual task in your job, then there would be a simple answer of use a third party component, there are a few good commercial ones.
While we can't say for sure what your interviewer wanted, for me this is a good question precisely because it is so open ended and has so many correct possible answers.
You can talk about the UI and how to implement the kind of dynamic grid you need for a spreadsheet and all the functionality of the cells and rows and columns and selection of cells and ranges and editing of values and formulas. You probably could talk for a while on the UI implications alone.
Alternatively you can go the data route, talk about data structures to hold a spreadsheet, talk exactly about links between cells for formulas, talk about how to detect and deal with circular references, talk about how in a browser you have less control over memory and for very large spreadsheets you could run into problems earlier. You can talk about what is available in JavaScript vs a native language and how this impacts the data structures and calculations. Also along with data, a big important issue with spreadsheets is numerical accuracy and floating point number calculations. Floating point numbers are made to be fast but are not necessarily accurate in extreme levels of precision and this leads to a lot of confusing questions. I believe very recently Excel switched to their own representation of a fixed decimal number as it's now viable to due spreadsheet level calculations without using the built-in floating point calculations. You can also talk about data structures and calculation and how they affect performance. In a browser you don't have threads (yet) so you can't run all the calculations in the background. If you have 100,000 rows with complex calculations and change one value that cascades across everything, you can get a warning about a slow script. You need to break up the calculation.
Finally you can run form the user experience angle. How is the experience in a browser different from a native application? What are the advantages and what cool things can you do in a browser that may be difficult in a desktop application? What things are far more complicated or even totally impossible (example, associate your spreadsheet app with a file type so a user can double-click a file and open it in your online spreadsheet app, although I may be wrong about that still being unsupported).
Good question, lots of right answers, very open ended.
On the other hand, you could also have had a bad interviewer that is specifically looking for the answer they want and in that case you're pretty much out of luck unless you're telepathic.
You can say hopelessly too much about this. I'd probably start with:
If most of the cells are filled, use a simply 2D array to store it.
Otherwise use a hash table of location to cell
Or perhaps something like a kd-tree, which should allow for more efficient "get everything in the displayed area" queries.
By graph, your interviewer probably meant have each cell be a vertex and each reference to another cell be a directed edge. This would allow you to do checks for circular references fairly easily, and allow for efficiently updating of all cells that need to change.
"In a browser" (presumably meaning "over a network" - actually "in a browser" doesn't mean all that much by itself - one can write a program that runs in a browser but only runs locally) is significant - you probably need to consider:
What are you storing locally (everything or just the subset of cells that are current visible)
How are you sending updates to the server (are you sending every change or keeping a collection of changed cells and only sending updates on save, or are you not storing changes separately and just sending the whole grid across during save)
Auto-save should probably be considered as well
Will you have an "undo", will this only be local, if not, how will you handle this on the server and how will you send through the updates
Is only this one user allowed to work with it at a time (or do you have to cater for multi-user, which brings dealing with conflicts, among other things, to the table)
Looking at the CSS cursor property just begs for one to create
a spreadsheet web application.
HTML table or CSS grid? HTML tables are purpose built for tabular
data.
Resizing cell height and width is achievable with offsetX and
offsetY.
Storing the data is trivial. It can be Mongo, mySQL, Firebase,
...whatever. On blur, send update.
Javascrip/ECMA is more than capable of delivering all the Excel built-in
functions. Did I mention web workers?
Need to increment letters as in column ID's? I got you covered.
Most importantly, don't do it. Why? Because it's already been done.
Find a need and work that project.

Techniques for handling arrays whose storage requirements exceed RAM

I am author of a scientific application that performs calculations on a gridded basis (think finite difference grid computation). Each grid cell is represented by a data object that holds values of state variables and cell-specific constants. Until now, all grid cell objects have been present in RAM at all times during the simulation.
I am running into situations where the people using my code wish to run it with more grid cells than they have available RAM. I am thinking about reworking my code so that information on only a subset of cells is held in RAM at any given time. Unfortunately the grids (or matrices if you prefer) are not sparse, which eliminates a whole class of possible solutions.
Question: I assume that there are libraries out in the wild designed to facilitate this type of data access (i.e. retrieve constants and variables, update variables, store for future reference, wipe memory, move on...) After several hours of searching Google and Stack Overflow, I have found relatively few libraries of this sort.
I am aware of a few options, such as this one from the HSL mathematical library: http://www.hsl.rl.ac.uk/specs/hsl_of01.pdf. I'd prefer to work with something that is open source and is written in Fortran or C. (my code is mostly Fortran 95/2003, with a little C and Python thrown in for good measure!)
I'd appreciate any suggestions regarding available libraries or advice on how to reformulate my problem. Thanks!
Bite the bullet and roll your own?
I deal with too-large data all the time, such as 30,000+ data series of half-hourly data that span decades. Because of the regularity of the data (daylight savings changeovers a problem though) it proved quite straightforward to devise a scheme involving a random-access disc file and procedures ReadDay and WriteDay that use a series number, and a day number, with further details because series start and stop at different dates. Thus, a day's data in an array might be Array(Run,DayNum) but now is ReturnCode = ReadDay(Run,DayNum,Array) and so forth, the codes indicating presence/absence of that day's data, etc. The key is that a day's data is a convenient size, and a regular (almost) size, and although my prog. allocates a buffer of one record per series, it runs in ~100MB of memory rather than GB.
Because your array is non-sparse, it is regular. Granted that a grid cell's data are of fixed size, you could devise a random-access disc file with each record holding one cell, or, perhaps a row's worth of cells (or a column's worth of cells) or some worthwhile blob size. I choose to have 4,096 bytes/record as that is the disc file allocation size. Let the computer's operating system and disc storage controller do whatever buffering to real memory they feel up to. Typical execution is restricted to the speed of data transfer however, unless the local data's computation is heavy. Thus, I get cpu use of a few percent until data requests start being satisfied from buffers.
Because fortran uses the same syntax for functions as for arrays (unlike say Pascal), instead of declaring DIMENSION ARRAY(Big,Big) you would remove that and devise FUNCTION ARRAY(i,j), and all read references in your source file stay as they are. Alas, in the absence of a "palindromic" function declaration, assignments of values to your array will have to be done with a different syntax and you devise a subroutine or similar. Possibly a scratchpad array could be collated, worked upon with convenient syntax, and then written back if changed.

Database for brute force solving board games

A few years back, researchers announced that they had completed a brute-force comprehensive solution to checkers.
I have been interested in another similar game that should have fewer states, but is still quite impractical to run a complete solver on in any reasonable time frame. I would still like to make an attempt, as even a partial solution could give valuable information.
Conceptually I would like to have a database of game states that has every known position, as well as its succeeding positions. One or more clients can grab unexplored states from the database, calculate possible moves, and insert the new states into the database. Once an endgame state is found, all states leading up to it can be updated with the minimax information to build a decision trees. If intelligent decisions are made to pick probable branches to explore, I can build information for the most important branches, and then gradually build up to completion over time.
Ignoring the merits of this idea, or the feasability of it, what is the best way to implement such a database? I made a quick prototype in sql server that stored a string representation of each state. It worked, but my solver client ran very very slow, as it puled out one state at a time and calculated all moves. I feel like I need to do larger chunks in memory, but the search space is definitely too large to store it all in memory at once.
Is there a database system better suited to this kind of job? I will be doing many many inserts, a lot of reads (to check if states (or equivalent states) already exist), and very few updates.
Also, how can I parallelize it so that many clients can work on solving different branches without duplicating too much work. I'm thinking something along the lines of a program that checks out an assignment, generates a few million states, and submits it back to be integrated into the main database. I'm just not sure if something like that will work well, or if there is prior work on methods to do that kind of thing as well.
In order to solve a game, what you really need to know per a state in your database is what is its game-theoretic value, i.e. if it's win for the player whose turn it is to move, or loss, or forced draw. You need two bits to encode this information per a state.
You then find as compact encoding as possible for that set of game states for which you want to build your end-game database; let's say your encoding takes 20 bits. It's then enough to have an array of 221 bits on your hard disk, i.e. 213 bytes. When you analyze an end-game position, you first check if the corresponding value is already set in the database; if not, calculate all its successors, calculate their game-theoretic values recursively, and then calculate using min/max the game-theoretic value of the original node and store in database. (Note: if you store win/loss/draw data in two bits, you have one bit pattern left to denote 'not known'; e.g. 00=not known, 11 = draw, 10 = player to move wins, 01 = player to move loses).
For example, consider tic-tac-toe. There are nine squares; every one can be empty, "X" or "O". This naive analysis gives you 39 = 214.26 = 15 bits per state, so you would have an array of 216 bits.
You undoubtedly want a task queue service of some sort, such as RabbitMQ - probably in conjunction with a database which can store the data once you've calculated it. Alternately, you could use a hosted service like Amazon's SQS. The client would consume an item from the queue, generate the successors, and enqueue those, as well as adding the outcome of the item it just consumed to the queue. If the state is an end-state, it can propagate scoring information up to parent elements by consulting the database.
Two caveats to bear in mind:
The number of items in the queue will likely grow exponentially as you explore the tree, with each work item causing several more to be enqueued. Be prepared for a very long queue.
Depending on your game, it may be possible for there to be multiple paths to the same game state. You'll need to check for and eliminate duplicates, and your database will need to be structured so that it's a graph (possibly with cycles!), not a tree.
The first thing that popped into my mind is the Linda-style of a shared 'whiteboard', where different processes can consume 'problems' off the whiteboard, add new problems to the whiteboard, and add 'solutions' to the whiteboard.
Perhaps the Cassandra project is the more modern version of Linda.
There have been many attempts to parallelize problems across distributed computer systems; Folding#Home provides a framework that executes binary blob 'cores' to solve protein folding problems. Distributed.net might have started the modern incarnation of distributed problem solving, and might have clients that you can start from.

Resources