My team builds and maintains an API built on top of PySpark SQL. It is meant for production use cases, and does a good job at scaling to large data on clusters. We can also run it locally, which is useful for development, testing, and training people via interactive exercise sessions using Jupyter notebooks.
However, running fairly simple computations on Spark takes a little while, frequently a few dozen seconds, even on a dataframe of about 50k rows. Our library implements differential privacy, which involves some randomization, so training use cases involve running the same analysis multiple times to get average utility metrics. This means that runtimes quickly reach a few minutes, which is annoyingly long when you're trying to run a 1-2h exercise session.
My question is: are there Spark configuration options I could tweak to lower this runtime for small-data, single-machine use cases, and make teaching a little smoother?
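For what it's worth, here is a sketch of the local-mode settings that commonly cut per-query overhead on small data. The specific values (4 partitions, disabling the UI) are assumptions to tune for your workload, not a definitive recipe; the single biggest win for ~50k-row dataframes is usually lowering `spark.sql.shuffle.partitions` from its default of 200.

```python
from pyspark.sql import SparkSession

# A sketch of local-mode settings for small-data, single-machine runs.
# The exact values here are assumptions to tune for your own workload.
spark = (
    SparkSession.builder
    .master("local[*]")                              # all local cores, no cluster
    .config("spark.sql.shuffle.partitions", "4")     # default 200 is far too many for ~50k rows
    .config("spark.default.parallelism", "4")        # same idea for RDD operations
    .config("spark.ui.enabled", "false")             # skip starting the web UI
    .config("spark.ui.showConsoleProgress", "false") # quieter notebook output
    .getOrCreate()
)
```

Reusing one `SparkSession` across repeated analyses in the notebook (rather than starting a fresh one per run) also avoids paying JVM startup time on every repetition.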
I have been attempting to create a Dash app as a companion to a report, which I have deployed to Heroku:
https://ftacv-simulation.herokuapp.com/
This works reasonably well for the simplest form of the simulation. However, upon the introduction of more complex features, the Heroku server often times out (i.e., a single callback goes over the 30-second limit, and the process is terminated). The two main features are the introduction of a more complex simulation, which requires 15-20 simple simulation runs, and the saving of older plots for the purposes of comparison.
I think I have two potential solutions to this. The first is restructuring the code so that the single large task is broken up into multiple callbacks, none of which go over the 30s limit, and potentially storing the data for the older plots in the user's browser. The second is moving to a different provider that can handle more intense computation (such as AWS).
Which of these approaches would you recommend? Or would you propose a different solution?
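The first approach can be sketched independently of the Dash specifics. The idea is to break the 15-20-run job into per-run steps, each cheap enough to finish well under the 30-second limit, carrying accumulated state between invocations; in Dash, that state would live in a `dcc.Store` in the user's browser between callbacks. The `run_step` function below is a hypothetical stand-in for one simple simulation run.

```python
# Hypothetical sketch of splitting one long job into short, resumable steps,
# as a Dash app would do across callbacks with state kept in dcc.Store.
def run_step(state):
    """Run one cheap simulation step and return the updated state."""
    if state is None:
        state = {"step": 0, "results": []}
    # Stand-in for one simple simulation run (well under any 30 s limit).
    result = state["step"] ** 2
    state["results"].append(result)
    state["step"] += 1
    return state

def run_all(total_steps):
    # In Dash, each iteration would be a separate callback invocation,
    # with `state` serialized to the browser between calls.
    state = None
    for _ in range(total_steps):
        state = run_step(state)
    return state

final = run_all(5)
```

The design point is that each step is independently restartable: a timeout loses one step, not the whole job.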
I have developed a simple Spark application that analyzes a dataset. The data comes from a CSV of 2 million records with 25 attributes. The analysis involves simple transformations/actions on RDDs, and I also used algorithms from the MLlib library.
Since this is my first experience, I've taken many pieces of code from documentation or examples found online. However, a complete run of even a simple algorithm, such as ALS for user recommendation, takes several minutes.
I use the application on a laptop (i7 2GHz, 12GB RAM).
I would like to know whether I need to run this application on a cluster of computers to increase performance (in terms of speed), and if so, whether it is normal that running a recommendation engine model locally takes so long.
If so, with a good cluster of computers, could I get results in real time?
Thanks in advance!
Are there any tools to generate an automated scenario with a predefined ramp up of user requests (running same map-reduce job) and monitoring some specific metrics of Hadoop cluster under load? I am looking ideally for something like LoadRunner but free/open source tool.
The tool does not have to have a cool UI but rather an ability to record and save scenarios that include a ramp up and a rendezvous point for several users (wait until other users reach some point and do some action simultaneously).
The Hadoop distribution I am going to test is the latest MapR.
Searching the internet did not turn up any good free alternatives to HP LoadRunner. If you have experience with Hadoop (or MapR in particular) load testing, please share which tool you used.
Every solution you look at has both a tool quotient and a labor quotient in the total price. There are many open source tools which take the tool cost to zero, but the labor charge is so high that your total cost to deliver will be higher than purchasing a commercial tool with a lower labor charge. Also, many people look at performance testing tools as load generation alone, ignoring the automated collection of monitoring data and the analysis of the results, where you can pin an increase in response times to a correlated use of resources at the same time. This is a laborious process, made longer when you are using decoupled tools.
As you have mentioned LoadRunner: when evaluating a tool, compare what is available in it against the alternatives you are considering. For instance, LoadRunner provides Java, C, C++ & VB interfaces, so you will find a way to exercise your map and reduce infrastructure. Compare the integrated monitoring capabilities (native/SNMP/terminal user with command line...) as well as analysis and reporting. Where capabilities do not exist, you will either need to build them or acquire them elsewhere.
You have also brought up the concept of Rendezvous. You will want to be careful with its application in any tool. Unless you have a very large population, the odds of a simultaneous collision in the same area of code/action at the same time become quite small. Humans are chaotic instruments, arriving and departing independently of one another. On the other hand, if you are automating an agent which is based upon a clock tick, then rendezvous makes a lot more sense. Taking a look at your job submission logs by IP address can provide an objective model for how many jobs are submitted simultaneously (rendezvous) versus how many are running concurrently. I audit a lot of tests, and rendezvous is the most abused feature across tools, resulting in thousands of lost engineering hours chasing engineering ghosts that would never occur in natural use.
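The simultaneous-versus-concurrent distinction from the log analysis above can be made concrete with a small sketch. The job records here are invented for illustration: each is a hypothetical (submit, finish) timestamp pair in seconds, as you might extract from submission logs. "Simultaneous" counts jobs submitted in the same second (the rendezvous case); "concurrent" counts jobs whose runtimes merely overlap at some instant.

```python
from collections import Counter

# Hypothetical job log: (submit_time, finish_time) pairs in seconds,
# standing in for timestamps parsed from real submission logs.
jobs = [(0, 50), (1, 40), (1, 30), (10, 60), (10, 20), (10, 25), (45, 70)]

# "Simultaneous": jobs submitted in the same second -- the rendezvous case.
submissions = Counter(start for start, _ in jobs)
max_simultaneous = max(submissions.values())

# "Concurrent": jobs whose runtimes overlap at a given instant.
def concurrent_at(t):
    return sum(1 for start, end in jobs if start <= t < end)

max_concurrent = max(concurrent_at(start) for start, _ in jobs)

print(max_simultaneous, max_concurrent)  # peak rendezvous vs. peak concurrency
```

On this invented log, peak concurrency is double the peak simultaneous submission rate, which is exactly the gap that makes rendezvous settings overstate real contention.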
I have 500 directories, each containing 1000 files (each about 3-4k lines). I want to run the same Clojure program (already written) on each of these files. I have 4 octa-core servers. What is a good way to distribute the processes across these cores? Cascalog (Hadoop + Clojure)?
Basically, the program reads a file, uses a 3rd-party Java jar to do computations, and inserts the results into a DB.
Note that: 1. being able to use 3rd-party libraries/jars is mandatory
2. there is no querying of any sort
Because there is no "reduce" stage to your overall process as I understand it, it makes sense to put 125 of the directories on each server and then spend the rest of your time trying to make the program process them faster. Up to the point where you saturate the DB, of course.
Most of the "big-data" tools available (Hadoop, Storm) focus on processes that need both very powerful map and reduce operations, perhaps with multiple stages of each. In your case, all you really need is a decent way to keep track of which jobs passed and which didn't. I'm as bad as anyone (and worse than many) at predicting development times, though in this case I'd say there is an even chance that rewriting your process on one of the map-reduce-esque tools would take longer than adding a monitoring process that keeps track of which jobs finished and which failed, so you can rerun the failed ones later (preferably automatically).
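The track-and-rerun idea above can be sketched with a plain worker pool. This is a hypothetical illustration, not your actual pipeline: `process_file` stands in for invoking the existing Clojure program on one file, and the failure condition is invented. The point is that the bookkeeping the answer recommends is a few lines, versus rewriting on a map-reduce stack.

```python
import concurrent.futures

# Hypothetical stand-in for running the existing Clojure program on one file.
def process_file(path):
    if "bad" in path:                 # invented failure condition for illustration
        raise ValueError(f"cannot parse {path}")
    return f"done:{path}"

def run_jobs(paths, workers=8):
    """Run jobs in a pool, recording which inputs failed for a later rerun."""
    succeeded, failed = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_file, p): p for p in paths}
        for fut in concurrent.futures.as_completed(futures):
            path = futures[fut]
            try:
                fut.result()
                succeeded.append(path)
            except Exception:
                failed.append(path)   # queue for an automatic rerun
    return succeeded, failed

ok, bad = run_jobs(["a.txt", "bad.txt", "c.txt"])
```

Rerunning is then just `run_jobs(bad)` again; persisting the `failed` list to disk between runs makes the whole thing restartable.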
Onyx is a recent pure Clojure alternative to Hadoop/Storm. As long as you're familiar with Clojure, working with Onyx is pretty simple. You should give this data-driven approach a try:
https://github.com/MichaelDrogalis/onyx
I've heard reports that Hadoop is poised to replace data warehousing. So I was wondering if there were actual case studies done with success/failure rates or if some of the developers here had worked on a project where this was done, either totally or partially?
With the advent of "Big Data" there seems to be a lot of hype with it and I'm trying to figure out fact from fiction.
We have a huge database conversion in the works and I'm thinking this may be an alternative solution.
Ok, so there are a lot of success stories out there with Big Data startups, especially in AdTech, though it's not so much that they "replace" the old, expensive proprietary ways; rather, they are using Hadoop from the start. This, I guess, is the benefit of being a startup: no legacy systems. Advertising, although somewhat boring from the outside, is very interesting from a technical and data science point of view. There is a huge amount of data, and the challenge is to segment users and bid for ad space more efficiently. This usually means some machine learning is involved.
It's not just AdTech though, Hadoop is used in banks for fraud detection and various other transactional analysis.
So here are my two cents on why this is happening, summarised as a comparison of my main experience, using HDFS with Spark and Scala, against traditional approaches that use SAS, R & Teradata:
HDFS is a very effective way to store huge amounts of data in an easily accessible, distributed way, without the overhead of first structuring the data.
HDFS does not require custom hardware, it works on commodity hardware and is therefore cheaper per TB.
HDFS & the Hadoop ecosystem go hand in glove with dynamic and flexible cloud architectures. Google Cloud and Amazon AWS have such rich and cheap features that they completely eliminate the need for in-house DCs. There is no need to buy 20 powerful servers and hundreds of TB of storage only to discover it's not enough, or it's too much, or it's only needed for 1 hour a day. Setting up a cluster with cloud services is getting easier and easier; there are even scripts out there that make it possible for those with only a small amount of sysadmin/devops experience.
Hadoop and Spark, particularly when used with a high-level statically typed language like Scala (but Java 8 is also OK-ish), mean data scientists can now do things they could never do with scripting languages like R, Python and SAS. First, they can wire up their modelling code with other production systems, all in one language, all in one virtual environment. Think about all the high-velocity tools written in Scala: Kafka, Akka, Spray, Spark, Spark Streaming, GraphX, etc., and in Java: HDFS, HBase, Cassandra. Now all these tools are highly interoperable. What this means is that, for the first time in history, data analysts can reliably automate analytics and build stable products. They have the high-level functionality they need, but with the predictability and reliability of static typing, FP and unit testing. Try building a large, complicated concurrent system in Python. Try writing unit tests in R or SAS. Try compiling your code, watching the tests pass, and concluding "hey, it works! let's ship it" in a dynamically typed language.
These four points combined mean that A: storing data is now a lot cheaper, B: processing data is now a lot cheaper, and C: human resource costs are much lower, since you no longer need several teams siloed off into analysts, modellers, engineers and developers; you can mash these skills together into hybrids and ultimately employ fewer people.
Things won't change overnight. Currently the labour market is majorly lacking two groups, good Big Data DevOps engineers and Scala engineers/developers, and their rates clearly reflect that: supply is quite low even though demand is very high. Although I still conjecture that Hadoop for warehousing is much cheaper, finding talent can be a big cost that restricts the pace of transition.