I have developed a simple Spark application that analyzes a dataset. The data comes from a CSV file of 2 million records with 25 attributes. The analysis involves simple RDD transformations/actions, and I also used algorithms from the MLlib library.
Since this is my first experience, I've taken many pieces of code from the documentation or from examples found online. However, a complete run of a simple algorithm, for example ALS for user recommendation, takes several minutes.
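For reference, the ALS part follows the standard MLlib pattern, roughly like this (a minimal sketch; the file name and the rank/iterations/lambda values are placeholders, not my actual ones):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("als-sketch").setMaster("local[*]"))

    // "ratings.csv" is a placeholder with one userId,productId,rating triple per line
    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    // rank = 10, 10 iterations, lambda = 0.01: illustrative values, not tuned
    val model = ALS.train(ratings, 10, 10, 0.01)

    // Top 5 product recommendations for user 1
    model.recommendProducts(1, 5).foreach(println)
  }
}
```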
I run the application on a laptop (i7 2 GHz, 12 GB RAM).
I would like to know whether running this application on a cluster of computers is the only way to increase performance (in terms of speed), and if so, whether it is normal that running a recommendation engine model locally takes this long.
If yes, could I get results in real time with a good cluster of computers?
Thanks in advance!
Related
The question is generic and can be extended to other frameworks or contexts beyond Spark & Machine Learning algorithms.
Regardless of the details, from a high-level point of view the code is applied to a large dataset of labeled text documents. It goes through 9 iterations of cross-validation to tune some parameters of a multi-class logistic regression classifier.
It is expected that this kind of machine learning processing will be expensive in terms of time and resources.
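Roughly speaking, the shape of the job is the following (a sketch using the current spark.ml API; the data source, feature pipeline, and parameter grid are placeholders, not the actual ones):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

object CvSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("cv-sketch").getOrCreate()

    // Placeholder input: a "text" column and a numeric "label" column
    val docs = spark.read.parquet("docs.parquet")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression()
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // A 3 x 3 grid: every combination is trained once per fold,
    // which is where most of the running time goes
    val grid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
      .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
      .build()

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new MulticlassClassificationEvaluator())
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)

    val model = cv.fit(docs)
  }
}
```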
I am running the code now and everything seems to be OK, except that I have no idea whether my application is running efficiently or not.
I couldn't find guidelines saying that for a certain type and amount of data, and for a certain type of processing and computing resources, the processing time should be of the approximate order of...
Is there any method that helps in judging whether my application is running slow or fast, or is it purely a matter of experience?
I had the same question and didn't find a real answer/tool/way to test how good my performance was by looking "only inside" my application.
I mean, as far as I know, there's no tool like a speed test for an internet connection :-)
The only way I found is to re-write my app (if possible) with another stack in order to see if the difference (in terms of time) is THAT big.
Otherwise, I found two main resources very interesting, even if they are quite old:
1) A sort of 4-point guide to remember when coding:
Understanding the Performance of Spark Applications, Spark Summit 2013
2) A two-episode article from the Cloudera blog on tuning your jobs as well as possible:
episode1
episode2
Hoping it could help
FF
Your question is pretty generic, so I would also highlight a few general areas where you can look for performance optimizations:
Scheduling Delays - Are there significant delays in scheduling the tasks? If so, analyze the reasons (maybe your cluster needs more resources, etc.).
Utilization of Cluster - Are your jobs utilizing the available cluster resources (CPU, memory)? If not, again look for the reasons. Maybe creating more partitions helps speed up execution, or maybe significant time is spent in serialization, in which case you can switch to Kryo serialization (see the configuration sketch after this list).
JVM Tuning - Consider analyzing your GC logs and tuning if you find anomalies.
Executor Configuration - Analyze the memory/cores provided to your executors. They should be sufficient to hold the data processed by the task/job.
DAG - Review your job's DAG in the Spark UI to spot unnecessary stages and shuffles.
Driver Configuration - As with the executors, the driver should also have enough memory to hold the results of certain functions like collect().
Shuffling - See how much time is spent in shuffling and what kind of data locality your tasks achieve.
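A minimal sketch of where some of these knobs live (the values are only illustrative, and driver memory in particular is usually passed to spark-submit rather than set in code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TunedJob {
  def main(args: Array[String]): Unit = {
    // Illustrative values only; the right numbers depend on your cluster and data
    val conf = new SparkConf()
      .setAppName("tuned-job")
      // Kryo is usually faster and more compact than the default Java serialization
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Executor sizing: enough memory and cores for the data each task processes
      .set("spark.executor.memory", "4g")
      .set("spark.executor.cores", "4")
      // More partitions can keep otherwise idle cores busy
      .set("spark.default.parallelism", "200")
      // Driver memory matters when you collect() large results;
      // in practice it is usually given to spark-submit via --driver-memory
      .set("spark.driver.memory", "2g")

    val sc = new SparkContext(conf)
    // ... job code ...
  }
}
```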
All of the above are needed for a preliminary investigation, and in some cases they can also improve the performance of your jobs to an extent, but there can be more complex issues whose solution will depend on the specific case.
Please also see the Spark Tuning Guide.
I just started learning Hadoop. In the official guide, it is mentioned that doubling the number of cluster nodes lets you query double the amount of data as fast as before.
On the other hand, a traditional RDBMS would still spend twice the amount of time querying the result.
I cannot grasp the relation between the cluster and processing the data. I hope someone can give me some idea.
It's the basic idea of distributed computing.
If you have one server working on data of size X, it will spend time Y on it.
If you have 2X data, the same server will (roughly) spend 2Y time on it.
But if you have 10 servers working in parallel (in a distributed fashion) and they all have the entire data (X), then they will spend Y/10 time on it. You would gain the same effect by having 10 times more resources on the one server, but usually this is not feasible (increasing CPU power 10-fold, for instance, is not very reasonable).
This is of course a very rough simplification, and Hadoop doesn't store the entire dataset on all of the servers - just the needed parts. Hadoop keeps a subset of the data on each server, and the servers work on the data they have to produce one "answer" in the end. This requires communication and different protocols to agree on what data to share, how to share it, how to distribute it, and so on - this is what Hadoop does.
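As a toy Scala sketch of the same idea, with threads standing in for machines (all numbers are made up): splitting the records across ten workers gives each worker a tenth of the work, so the wall-clock time drops roughly tenfold.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSumToy {
  def main(args: Array[String]): Unit = {
    val records = (1L to 1000000L).toVector   // the dataset "X" (size chosen to divide evenly)
    val workers = 10
    val shardSize = records.size / workers

    // Each "worker" processes only its own shard of the data
    val partials: Seq[Future[Long]] = (0 until workers).map { w =>
      Future(records.slice(w * shardSize, (w + 1) * shardSize).sum)
    }

    val total = Await.result(Future.sequence(partials), 1.minute).sum
    println(total)
  }
}
```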
I'm currently investigating how to store and analyze enriched time-based data with up to 1000 columns per row. At the moment, Cassandra together with either Solr, Hadoop, or Spark, as offered by DataStax Enterprise, seems to roughly fulfill my requirements. But the devil is in the details.
Out of the 1000 columns, about 60 are used for real-time-like queries (web front end; the user submits a form and expects a quick response). These queries are more or less GROUP BY statements in which the number of occurrences is counted.
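To make the query shape concrete, such a count would look roughly like this if expressed as a Spark SQL query (the data source and column names are made up):

```scala
import org.apache.spark.sql.SparkSession

object GroupBySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("groupby-sketch").getOrCreate()

    // Placeholder source; in reality the rows have up to 1000 columns
    val rows = spark.read.parquet("events.parquet")

    // Multi-field grouping with occurrence counts
    val counts = rows.groupBy("country", "device_type").count()
    counts.show()
  }
}
```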
As Cassandra itself does not provide the required analytical capabilities (no GROUP BY), I'm left with these alternatives:
Roughly query via Cassandra and filter the resultset within self-written code
Index the data with Solr and run facet.pivot queries
Use either Hadoop or Spark and run the queries
The first approach seems cumbersome and error-prone… Solr does have some analytic features, but without multi-field grouping I'm stuck with pivots. I don't know whether that is a good or performant approach, though… Last but not least there are Hadoop and Spark, the former known not to be the best for real-time queries, the latter pretty new and maybe not production-ready.
So which way to go? There is no one-size-fits-all here, but before I commit to one approach I'd like to get some feedback. Maybe I'm overcomplicating things or my expectations are too high :S
Thanks in advance,
Arman
At the place where I work now, we have a similar set of tech requirements, and the solution is Cassandra-Solr-Spark, exactly in that order.
So if a query can be "covered" by Cassandra indices, good; if not, it's covered by Solr. For testing and less frequent queries, Spark (Scala, no Spark SQL due to its old version -- it's a bank, everything should be tested and matured, from cognac to software, argh).
Generally I agree with the solution, though sometimes I have a feeling that some clients' requests should NOT be taken seriously at all, which would save us from loads of weird queries :)
I would recommend Spark. If you take a look at the list of companies using it, you'll see names such as Amazon, eBay, and Yahoo!. Also, as you noted in the comment, it's becoming a mature tool.
You've given arguments against Cassandra and Solr already, so I'll focus on explaining why Hadoop MapReduce wouldn't do as well as Spark for real-time queries.
Hadoop and MapReduce were designed to leverage the hard disk, under the assumption that for big data the IO cost is negligible. As a result, data are read and written at least twice - in the map stage and in the reduce stage. This allows you to recover from failures, as partial results are persisted, but it's not what you want when aiming for real-time queries.
Spark not only aims to fix MapReduce's shortcomings, it also focuses on interactive data analysis, which is exactly what you want. This goal is achieved mainly by utilizing RAM, and the results are astonishing. Spark jobs will often be 10-100 times faster than their MapReduce equivalents.
The only caveat is the amount of memory you have. Most probably your data is going to fit in the RAM you can provide, or you can rely on sampling. Usually, when working interactively with data, there is no real need for MapReduce, and that seems to be the case for you.
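As a minimal sketch of what that looks like in practice (path and column names are made up): caching the working set in memory is what makes repeated, interactive queries cheap in Spark.

```scala
import org.apache.spark.sql.SparkSession

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

    // Placeholder source and column names
    val events = spark.read.parquet("events.parquet")
    events.cache()   // keep the working set in executor memory

    events.groupBy("status").count().show()    // first action loads and caches the data
    events.groupBy("country").count().show()   // later queries reuse the cached data
  }
}
```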
We are currently interested in evaluating Datameer and have a few questions. Are there any Datameer users who can answer them?
1) Since Datameer works off HDFS, are the querying speeds similar to those of Hive? How does the querying speed compare with columnar databases?
2) Since Hadoop is known for high latency, is it advisable to use Datameer for real-time querying?
Thank you.
Ravi
Regarding 1:
Query speeds are comparable to Hive.
But Datameer is a lot faster in the design phase of your "query". Datameer provides a real-time preview of what the results of your "query" would look like, which happens in memory and not on the cluster. The preview is based on a representative sample of your data. It's only a preview, not the final results, but it gives you constant feedback on whether your analytics make sense while you design them.
To test a Hive query you would have to execute it, which makes the design process very slow.
Datameer's big advantages over Hive are:
Loading data into Hadoop is much easier. No static schema creation, no ETL, etc. Just use a wizard to download data from your database, log files, social media, etc.
Designing analytics or making changes is a lot faster and can even be done by non-technical users.
No need to install anything else, since Datameer includes all you need for importing, analytics, scheduling, security, visualization, etc. in one product.
Regarding 2: If you have real-time requirements, you should not pull data directly out of Datameer, Hive, Impala, etc. Columnar storage makes some processing faster but will still not be low latency. But you can use those tools together with a low-latency database: use Datameer/Hive/Impala for the heavy lifting, to filter and pre-aggregate big data into smaller data, and then export that into a database. In Datameer you can set this up very easily using one of Datameer's wizards.
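For illustration only, the same pre-aggregate-and-export pattern expressed in Spark (Datameer would do this through its wizards instead; the paths, table, column names, and connection details are made up):

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

object ExportSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("preaggregate-export").getOrCreate()

    // Heavy lifting on the cluster: filter and pre-aggregate the big data
    val summary = spark.read.parquet("raw_events.parquet")
      .filter("event_date >= '2015-01-01'")
      .groupBy("customer_id")
      .count()

    // Export the much smaller result into a low-latency database
    val props = new Properties()
    props.setProperty("user", "analytics")
    props.setProperty("password", "secret")
    summary.write.jdbc("jdbc:postgresql://dbhost/analytics", "daily_counts", props)
  }
}
```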
Hope this helps,
Peter Voß (Datameer)
I had always thought that Mongo had excellent performance with its map/reduce functionality, but I am now reading that it is a slow implementation of it. So if I had to pick an alternative to benchmark against, what should it be?
My software will be such that users will often have millions of records, and will often sort and crunch through unpredictable subsets of tens or hundreds of thousands. Most of the analysis that uses the full millions of records can be done with summary tables and the like. I had originally thought Hypertable was a viable alternative, but while researching I saw their documentation mention that Mongo would be the more performant option, while Hypertable had other benefits. But for my application, speed is my number one initial priority.
First of all, it's important to decide what is "fast enough". Undoubtedly there are faster solutions than MongoDB's map/reduce, but in most cases you may be looking at a significantly higher development cost.
That said, MongoDB's map/reduce runs, at the time of writing, on a single thread, which means it will not utilize all the CPU available to it. Also, MongoDB has very little in the way of native aggregation functionality. This will be fixed from version 2.1 onwards, which should improve performance (see https://jira.mongodb.org/browse/SERVER-447 and http://www.slideshare.net/cwestin63/mongodb-aggregation-mongosf-may-2011).
Now, what MongoDB is good at is scaling easily, especially when it comes to reads. And this is important, because the best solution for number crunching on large datasets is definitely a map/reduce cloud, as Augusto suggested. Let such an m/r setup do the number crunching while MongoDB makes the required data available at high speed. If database query throughput is too low, that is easily solved by adding more Mongo shards. If number crunching/aggregation performance is too slow, that is solved by adding more m/r boxes. Basically, performance becomes a function of the number of instances you reserve for the problem, and thus of cost.