I have been attempting to create a Dash app as a companion to a report, which I have deployed to Heroku:
https://ftacv-simulation.herokuapp.com/
This works reasonably well for the simplest form of the simulation. However, upon the introduction of more complex features, the Heroku server often times out (i.e. a single callback goes over the 30-second limit and the process is terminated). The two main features are a more complex simulation, which requires 15-20 runs of the simple simulation, and the saving of older plots for comparison.
I think I have two potential solutions to this. The first is restructuring the code so that the single large task is broken up into multiple callbacks, none of which go over the 30s limit, and potentially storing the data for the older plots in the user's browser. The second is moving to a different provider that can handle more intense computation (such as AWS).
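To make the first option concrete, here is the kind of chunking I have in mind, in plain Python rather than actual Dash API (run_one_simulation and chunked_callback are hypothetical stand-ins): each short callback does one slice of the work and parks the accumulated results in a JSON-serializable store (what a dcc.Store in the browser would hold).

```python
import json

def run_one_simulation(params, i):
    # Placeholder for a single short simulation run (hypothetical).
    return {"run": i, "result": params["scale"] * i}

def chunked_callback(store_json, params, chunk_size=3):
    """One 'callback' worth of work: resume from the stored state,
    do a few runs, and return the updated state for the browser store."""
    store = json.loads(store_json) if store_json else {"done": 0, "results": []}
    total = params["total_runs"]
    for i in range(store["done"], min(store["done"] + chunk_size, total)):
        store["results"].append(run_one_simulation(params, i))
    store["done"] = min(store["done"] + chunk_size, total)
    store["finished"] = store["done"] >= total
    return json.dumps(store)

# Drive it the way a chain of short callbacks would:
state = None
params = {"total_runs": 8, "scale": 2}
while state is None or not json.loads(state)["finished"]:
    state = chunked_callback(state, params)
print(json.loads(state)["done"])  # 8
```

Each invocation stays well under the timeout, and the browser-side store carries the intermediate results between callbacks.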
Which of these approaches would you recommend? Or would you propose a different solution?
My team builds and maintains an API built on top of PySpark SQL. It is meant for production use cases, and does a good job at scaling to large data on clusters. We can also run it locally, which is useful for development, testing, and training people via interactive exercise sessions using Jupyter notebooks.
However, running fairly simple computations on Spark takes a little while, frequently a few dozen seconds, even on a dataframe of about 50k rows. Our library implements differential privacy, which involves some randomization, so training use cases involve running the same analysis multiple times to get average utility metrics. This means that runtimes quickly reach a few minutes, which is annoyingly long when you're trying to run a 1-2h exercise session.
My question is: are there Spark configuration options I could tweak to lower this runtime for small-data, single-machine use cases, and make teaching a little smoother?
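A few standard options do make a noticeable difference for small-data, single-machine runs; the biggest win is usually dropping spark.sql.shuffle.partitions from its default of 200 down to roughly the number of local cores, so tiny shuffles don't spawn hundreds of near-empty tasks. A sketch of the relevant settings (the values are illustrative, not tuned):

```
# spark-defaults.conf (illustrative values for local teaching sessions)
spark.master                   local[*]
spark.sql.shuffle.partitions   8
spark.default.parallelism      8
spark.sql.adaptive.enabled     true
spark.ui.enabled               false
```

Reusing one long-lived SparkSession across all the notebook cells, rather than starting one per exercise, also avoids paying JVM startup repeatedly.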
I am implementing my first syncing code. In my case I will have 2 types of iOS clients per user that will sync records to a server using a lastSyncTimestamp, a 64-bit integer representing the Unix epoch in milliseconds of the last sync. Records can be created on the server or the clients at any time and the records are exchanged as JSON over HTTP.
I am not worried about conflicts as there are few updates and always from the same user. However, I am wondering if there are common things that I need to be aware of that can go wrong with a timestamp-based approach, such as syncing during daylight savings time, syncs conflicting with one another, or other gotchas.
I know that git and some other version control systems eschew timestamp-based syncing in favour of a content-based negotiation approach. I could imagine such an approach for my apps too: using the uuid or hash of the objects, both peers announce which objects they own, and then exchange them until both peers have the same sets.
If anybody knows any advantages or disadvantages of content-based syncing versus timestamp-based syncing in general that would be helpful as well.
Edit - Here are some of the advantages/disadvantages that I have come up with for timestamp and content based syncing. Please challenge/correct.
Note - I am defining content-based syncing as simple negotiation of 2 sets of objects, such as how 2 kids would exchange cards if you gave them each part of a jumbled-up pile of 2 identical sets of baseball cards and told them to announce and hand over, as they look through them, any duplicates they find, until they both have identical sets.
Johnny - "I got this card."
Davey - "I got this bunch of cards. Give me that card."
Johnny - "Here is your card. Gimme that bunch of cards."
Davey - "Here are your bunch of cards."
....
Both - "We are done"
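That baseball-card negotiation can be sketched in a few lines, assuming each record's identity is a hash of its content (content_id and sync here are illustrative helpers, not a real library):

```python
import hashlib
import json

def content_id(record):
    # Use a SHA-1 of the canonical JSON as the record's identity.
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha1(canonical.encode()).hexdigest()

def sync(store_a, store_b):
    """One round of content-based negotiation: each side announces
    the ids it holds, then each sends the other what it is missing."""
    ids_a, ids_b = set(store_a), set(store_b)
    for missing in ids_b - ids_a:   # Johnny takes Davey's extras
        store_a[missing] = store_b[missing]
    for missing in ids_a - ids_b:   # Davey takes Johnny's extras
        store_b[missing] = store_a[missing]

johnny = {content_id(r): r for r in [{"card": "Mantle"}, {"card": "Ruth"}]}
davey  = {content_id(r): r for r in [{"card": "Ruth"}, {"card": "Mays"}]}
sync(johnny, davey)
assert johnny == davey  # well-defined endpoint: identical sets
```

No clocks are involved anywhere; the sets themselves define both the protocol and its termination condition.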
Advantages of timestamp-based syncing
Easy to implement
Single property used for syncing.
Disadvantages of timestamp-based syncing
Time is a concept relative to the observer, and different machines' clocks can be out of sync. There are a couple of ways to solve this: generate the timestamp on a single machine, which doesn't scale well and represents a single point of failure, or use logical clocks such as vector clocks. For the average developer building their own system, vector clocks might be too complex to implement.
Timestamp-based syncing works for client-to-master syncing but doesn't work as well for peer-to-peer syncing or where syncing can occur with 2 masters.
Single point of failure, whatever generates the timestamp.
Time is not really related to the content of what is being synced.
Advantages of content-based syncing
No per-peer timestamp needs to be maintained. 2 peers can start a sync session and start syncing based on the content.
Well defined endpoint to sync - when both parties have identical sets.
Allows a peer-to-peer architecture, where any peer can act as client or server, provided it can host an HTTP server.
Sync works with the content of the sets, not with the abstract concept of time.
Since sync is built around content, sync can be used to do content verification if desired. E.g. a SHA-1 hash can be computed on the content and used as the uuid. It can be compared to what is sent during syncing.
Even further, SHA-1 hashes can be based on previous hashes to maintain a consistent history of content.
Disadvantages of content-based syncing
Extra properties may be needed on your objects to implement it.
More logic on both sides compared to timestamp based syncing.
Slightly more chatty protocol (this could be tuned by syncing content in clusters).
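The content-verification and chained-history points above can be sketched like this; content_hash is a hypothetical helper that chains each record's SHA-1 to its predecessor, git-style, so the latest hash verifies the whole history:

```python
import hashlib
import json

def content_hash(record, parent_hash=""):
    # Hash the canonical JSON together with the previous hash,
    # chaining each record to its predecessor.
    canonical = json.dumps(record, sort_keys=True) + parent_hash
    return hashlib.sha1(canonical.encode()).hexdigest()

history = [{"note": "v1"}, {"note": "v2"}, {"note": "v3"}]
h = ""
for record in history:
    h = content_hash(record, h)

# Any tampering with an earlier record changes the final hash:
h2 = ""
for record in [{"note": "v1-tampered"}, {"note": "v2"}, {"note": "v3"}]:
    h2 = content_hash(record, h2)
assert h != h2
```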
Part of the problem is that time is not an absolute concept. Whether something happens before or after something else is a matter of perspective, not of compliance with a wall clock.
Read up a bit on relativity of simultaneity to understand why people have stopped trying to use wall time for figuring these things out and have moved to constructs that represent actual causality using vector clocks (or at least Lamport clocks).
If you want to use a clock for synchronization, a logical clock will likely suit you best. You will avoid all of the clock sync issues entirely.
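A Lamport clock, the simplest logical clock, is only a few lines; this toy sketch shows how causal order survives even when the peers' wall clocks disagree:

```python
class LamportClock:
    """Minimal logical clock: ticks on local events, and on receive
    jumps ahead of any timestamp it has seen."""
    def __init__(self):
        self.time = 0

    def tick(self):               # local event
        self.time += 1
        return self.time

    def send(self):               # stamp an outgoing message
        return self.tick()

    def receive(self, msg_time):  # merge a peer's timestamp
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t1 = a.send()          # a: 1
t2 = b.receive(t1)     # b: max(0, 1) + 1 = 2
t3 = b.send()          # b: 3
t4 = a.receive(t3)     # a: max(1, 3) + 1 = 4
assert t1 < t2 < t3 < t4   # causal order holds regardless of wall clocks
```

Vector clocks extend the same idea with one counter per peer, which additionally lets you detect concurrent (conflicting) updates rather than just ordering causally related ones.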
I don't know if it applies in your environment, but you might consider whose time is "right", the client or the server (or if it even matters)? If all clients and all servers are not sync'd to the same time source there could be the possibility, however slight, of a client getting an unexpected result when syncing to (or from) the server using the client's "now" time.
Our development organization actually ran into some issues with this several years ago. Developer machines were not all sync'd to the same time source as the server where the SCM resided (and might not have been sync'd to any time source, thus the developer machine time could drift). A developer machine could be several minutes off after a few months. I don't recall all of the issues, but it seems like the build process tried to get all files modified since a certain time (the last build). Files could have been checked in, since the last build, that had modification times (from the client) that occurred BEFORE the last build.
It could be that our SCM procedures were just not very good, or that our SCM system or build process were unduly susceptible to this problem. Even today, all of our development machines are supposed to sync time with the server that has our SCM system on it.
Again, this was several years ago and I can't recall the details, but I wanted to mention it on the chance that it is significant in your case.
You could have a look at unison. It's file-based but you might find some of the ideas interesting.
Have object prevalence mechanisms been used in an actual production system? I'm referring to something like Prevayler or Madeleine.
The only thing I've found is Instiki, a wiki engine, but they have since switched to SQLite. (The actual Instiki page is down.)
A company I used to work for used Prevayler as part of a computer-based student examination/assessment system for about five or six years.
Prevayler was used to store the state of candidates’ tests on a server physically located within a single testing centre. The volume of data stored was fairly low, since at most there would only be a few hundred candidates taking a test at a single testing centre. Therefore it was practical to run Prevayler on commodity hardware in 2004 – the ‘server’ was in most cases just a typical low-end desktop machine temporarily borrowed for the purpose of running an exam.
The idea was that if a candidate’s computer crashed while they were taking the test, then they could quickly resume the test on the same or different computer. It worked pretty well.
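The prevalence idea itself is small enough to sketch; this toy (not the actual Prevayler API) keeps state in memory and journals every mutating command, so the state can be rebuilt by replay after a crash:

```python
class PrevalentSystem:
    """Toy prevalence layer: state lives in memory; every mutating
    command is appended to a journal so state can be rebuilt on restart."""
    def __init__(self, journal=None):
        self.answers = {}                 # in-memory state: candidate -> answers
        self.journal = journal if journal is not None else []
        for cmd in self.journal:          # replay to recover state
            self._apply(cmd)

    def _apply(self, cmd):
        self.answers.setdefault(cmd["candidate"], {})[cmd["q"]] = cmd["answer"]

    def execute(self, cmd):
        self.journal.append(cmd)          # journal first (in the real thing,
        self._apply(cmd)                  # this would be fsync'd to disk)

live = PrevalentSystem()
live.execute({"candidate": "c1", "q": 1, "answer": "B"})
live.execute({"candidate": "c1", "q": 2, "answer": "D"})

# Simulate a crash and a restart from the journal:
recovered = PrevalentSystem(journal=list(live.journal))
assert recovered.answers == live.answers
```

The close coupling mentioned below follows directly from this design: the journal stores commands against a particular object model, so changing the model means migrating or discarding the journal.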
There were occasional difficulties when some new requirement led to a change to the object model, since by default Prevayler close-couples the object model to the representation of data on disk. This wasn’t actually a major problem for us, since changes to the object model occurred between exams at which point we could usually afford to throw old data away (with some exceptions due to bad design on our part).
There are lots of things you can do to make it feasible to change the object model, it’s a matter of what’s best for your application. Throwing old data away was generally the best solution for us.
There was also a back-end system that aggregated candidates’ tests from all testing centres into an SQL database. That stored a higher volume of data than Prevayler could have reasonably coped with at the time. It would probably be feasible to use Prevayler there today, but I don’t think the usage patterns would have suited Prevayler particularly well, since most of the data tended to be written, read once for marking, then forgotten about and treated as archive data unless the result of a test got queried.
That company has since moved away from Prevayler, but the reason was more political than technical.
Well, we're using Prevayler in a project that's aiming towards production by sometime next year, but we're not close enough to give any real scouting report. We think it's going to work...
"LMAX is a new retail financial trading platform. As a result it has to process many trades with low latency. The system is built on the JVM platform and centers on a Business Logic Processor that can handle 6 million orders per second on a single thread. The Business Logic Processor runs entirely in-memory using event sourcing."
http://martinfowler.com/articles/lmax.html
I recently completed development of a mid-trafficked(?) website (peak 60k hits/hour); however, the site only needs to be updated once a minute, and achieving the required performance can be summed up by a single word: "caching".
For a site like SO where the data feeding the site changes all the time, I would imagine a different approach is required.
Page cache times presumably need to be short or non-existent, and updates need to be propagated across all the webservers very rapidly to keep all users up to date.
My guess is that you'd need a distributed cache to control the serving of data and pages that is updated on the order of a few seconds, with perhaps a distributed cache above the database to mediate writes?
Can those more experienced than I outline some of the key architectural/design principles they employ to ensure highly interactive websites like SO are performant?
The vast majority of sites have many more reads than writes. It's not uncommon to have thousands or even millions of reads to every write.
Therefore, any scaling solution depends on separating the scaling of the reads from the scaling of the writes. Typically, scaling reads is really cheap and easy, while scaling writes is complicated and costly.
The most straightforward way to scale reads is to cache entire pages at a time and expire them after a certain number of seconds. If you look at the popular website Slashdot, you can see that this is the way they scale their site. Unfortunately, this caching strategy can result in counter-intuitive behaviour for the end user.
I'm assuming from your question that you don't want this primitive sort of caching. Like you mention, you'll need to update the cache in place.
This is not as scary as it sounds. The key thing to realise is that, from the server's point of view, Stack Overflow does not update all the time. It updates fairly rarely. Maybe once or twice per second. To a computer a second is nearly an eternity.
Moreover, updates tend to occur to items in the cache that do not depend on each other. Consider Stack Overflow as an example. I imagine that each question page is cached separately. Most questions probably get an update per minute on average for the first fifteen minutes, and then perhaps once an hour after that.
Thus, in most applications you barely need to scale your writes. They're so few and far between that you can have one server doing the writes; updating the cache in place is actually a perfectly viable solution. Unless you have extremely high traffic, you're going to get very few concurrent updates to the same cached item.
So how do you set this up? My preferred solution is to cache each page individually to disk and then have many web-heads delivering these static pages from some mutually accessible space.
When a write needs to be done, it is done from exactly one server, and this updates that particular cached html page. Each server owns its own subset of the cache, so there isn't a single point of failure. The update process is carefully crafted so that a transaction ensures no two requests write to the file at exactly the same time.
I've found this design has met all the scaling requirements we have so far required. But it will depend on the nature of the site and the nature of the load as to whether this is the right thing to do for your project.
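A minimal sketch of that setup, assuming a shared directory that all the web heads can read (the paths and names here are illustrative); the atomic os.replace is what keeps readers from ever seeing a half-written page:

```python
import os
import tempfile

CACHE_DIR = tempfile.mkdtemp()  # stand-in for the mutually accessible space

def page_path(page_id):
    return os.path.join(CACHE_DIR, f"{page_id}.html")

def read_page(page_id):
    # Every web head serves the prerendered file straight from disk.
    with open(page_path(page_id)) as f:
        return f.read()

def write_page(page_id, html):
    # The single writer renders to a temp file, then atomically swaps
    # it in, so readers never observe a partially written page.
    fd, tmp = tempfile.mkstemp(dir=CACHE_DIR)
    with os.fdopen(fd, "w") as f:
        f.write(html)
    os.replace(tmp, page_path(page_id))

write_page("question-42", "<html>v1</html>")
write_page("question-42", "<html>v2</html>")
print(read_page("question-42"))  # <html>v2</html>
```

The rename is atomic only within a single filesystem, which is one reason to give each writer its own subset of the cache rather than coordinating writers across machines.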
You might be interested in this article which describes how Wikimedia's servers are structured. Very enlightening!
The article links to this pdf - be sure not to miss it.
I'm about to start testing an intranet web application. Specifically, I have to determine the application's performance.
Could someone please suggest formal/informal standards for how I can judge the application's performance.
Use some tool for stress and load testing. If you're using Java take a look at JMeter. It provides different methods to test your application's performance. You should focus on:
Response time: How fast your application responds to normal requests. Test some read/write use cases.
Load test: How your application behaves at high-traffic times. The tool will submit several requests (you can configure that properly) over a period of time.
Stress test: Can your application operate over a long period of time? This test will push your application to its limits.
Start with these; if you're interested, there are other kinds of tests.
"Specifically, I have to determine the application's performance...."
This comes full circle to the issue of requirements, the captured expectations of your user community for what is considered reasonable and effective. Requirements have a number of components:
General Response time, " Under a load of .... The Site shall have a general response time of less than x, y% of the time..."
Specific Response times, " Under a load of .... Credit Card processing shall take less than z seconds, a% of the time..."
System Capacity items, " Under a load of .... CPU|Network|RAM|DISK shall not exceed n% of capacity.... "
The load profile: the mix of users and transactions under which the specific, objective measures are collected to determine system performance.
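Checking a "y% of the time" requirement against collected measurements is just a percentile computation; a small sketch using made-up response times (the 3-second threshold is an illustrative requirement, not a standard):

```python
import random

def percentile(samples, pct):
    # Nearest-rank percentile: the value below which pct% of samples fall.
    ordered = sorted(samples)
    k = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[k]

# Hypothetical response times (seconds) collected under the target load:
random.seed(1)
response_times = [random.uniform(0.2, 3.5) for _ in range(1000)]

p95 = percentile(response_times, 95)
requirement_met = p95 < 3.0   # "responses under 3s, 95% of the time"
print(f"95th percentile: {p95:.2f}s, requirement met: {requirement_met}")
```

Reporting percentiles rather than averages is what makes the requirement testable: a mean of 1s can hide a long tail of 10s outliers that users will certainly notice.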
You will notice that the response times and other measures are not absolutes. Taking a page from six sigma manufacturing principles, the cost to move from 1 exception in a million to 1 exception in a billion is extraordinary, and the cost to move to zero exceptions is usually not bearable by the average organization. What is considered acceptable response time for a unique application within your organization will likely be entirely different from a highly commoditized, public internet-facing offering. For highly competitive solutions, response time expectations on the internet are trending towards the 2-3 second range, where user abandonment picks up severely. This has dropped over the past decade from 8 seconds, to 4 seconds, and now into the 2-3 second range. Some applications, like Facebook, shoot for almost imperceptible response times in the sub-one-second range for competitive reasons. If you are looking for a hard standard, they just don't exist.
Something that will help your understanding is to read through a couple of industry benchmarks for style, form, function.
TPC-C Database Benchmark Document
SpecWeb2009 Benchmark Design Document
Setting up a solid set of performance tests which represents your needs is a non-trivial matter. You may want to bring in a specialist to handle this phase of your QA efforts.
On your tool selection, make sure it meets four criteria:
It can exercise your interface
It can report against your requirements
You or your team have the skills to use it
You can get training on it, and will attend with management's blessing
Misfire on any of the four elements above and you may as well have purchased the most expensive tool on the market and hired the most expensive firm to deploy it.
Good luck!
To test the front end, YSlow is great for getting statistics on how long your pages take to load from a user perspective. It breaks down into stats for each specific HTTP request, the time it took, etc. Get it at http://developer.yahoo.com/yslow/
Firebug, of course, is also essential. You can profile your JS explicitly, or in real time by hitting the profile button, making optimisations where necessary and seeing how long all your functions take to run. This changed the way I measure the performance of my JS code. http://getfirebug.com/js.html
Really the big thing I would think is response time, but other indicators I would look at are processor and memory usage vs. the number of concurrent users/processes. I would also check to see that everything is performing as expected under normal and then peak load. You might encounter scenarios where higher load causes application errors due to various requests stepping on each other.
If you really want to get detailed information you'll want to run different types of load/stress tests. You'll probably want to look at a step load test (a gradual increase of users on system over time) and a spike test (a significant number of users all accessing at the same time where almost no one was accessing it before). I would also run tests against the server right after it's been rebooted to see how that affects the system.
You'll also probably want to look at a concept called HEAT (Hostile Environment Application Testing). This shows what happens when some part of the system goes offline. Does the system degrade gracefully? This should be a key standard.
My one really big piece of advice is to establish what the system is supposed to do before doing the testing. The main reason is accountability. Get people to admit that the system is supposed to do something, and then test to see if it holds true. This is key because people will immediately see the results, and that will be the base benchmark for what is acceptable.