We're just trialling Spark, and it's proving really slow. To show what I mean, I've given an example below - it's taking Spark nearly 2 seconds to load in a text file with ten rows from HDFS, and count the number of lines. My questions:
Is this expected? How long does it take on your platform?
Any possible ideas why? Currently I'm using Spark 1.3 on a two-node Hadoop cluster (both with 8 cores and 64 GB RAM). I'm pretty green when it comes to Hadoop and Spark, so I've done little configuration beyond the Ambari/HDP defaults.
Initially I was testing on a hundred million rows - Spark was taking about 10 minutes to simply count it.
Example:
Create text file of 10 numbers, and load it into hadoop:
for i in {1..10}; do echo $i >> numbers.txt; done
hadoop fs -put numbers.txt numbers.txt
Start pyspark (which takes about 20 seconds ...):
pyspark --master yarn-client --executor-memory 4G --executor-cores 1 --driver-memory 4G --conf spark.python.worker.memory=4G
Load the file from HDFS and count it:
sc.textFile('numbers.txt').count()
According to the timing reported in the output, it takes Spark around 1.6 seconds to do that. Even with a terrible configuration, I wouldn't expect it to take that long.
This is definitely too slow (0.3 s on my local machine), even with a bad Spark configuration (moreover, the default Spark configuration is usually fine for most normal uses). Maybe you should double-check your HDFS configuration or network-related configuration.
It has nothing to do with cluster configuration. It is due to lazy evaluation.
There are two types of operations in Spark's API: transformations and actions.
Have a look at the Spark programming guide (http://spark.apache.org/docs/latest/programming-guide.html):
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
sc.textFile('numbers.txt').count() is an action: textFile() only sets up the RDD lazily, and the count() call is what actually triggers the work.
For this reason, even though it took about 2 seconds the first time, it only took a fraction of a second the second time.
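To see the laziness directly, you can time the transformation and the action separately in the pyspark shell (a minimal sketch using the same numbers.txt as above; the exact timings will of course vary):
import time
rdd = sc.textFile('numbers.txt')   # transformation: returns immediately, nothing is read yet
start = time.time()
rdd.count()                        # action: this is where the file is actually read and the job runs
print("first count took %.2f s" % (time.time() - start))
start = time.time()
rdd.count()                        # running the same action again reuses the already-started executors
print("second count took %.2f s" % (time.time() - start))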
Related
I used Scala to build a machine learning project in Spark and used spark-submit to launch it with "--master yarn-cluster" as a parameter. The computing steps are very fast, but the job always gets stuck at the table-writing step for hours, even though the output is only about 3 MB. Has anyone had this problem before?
The Scala code that writes the table is listed below:
mlPredictResult
.select("orderid","prediction")
.write
.mode(SaveMode.Overwrite)
.saveAsTable("tmp_sbu_vadmtestdb.AntiCF_ClickFarming_predicted")
The spark-submit command is listed below:
spark-submit --class Ml_Learning --master yarn-cluster --executor-memory 5G --num-executors 50 AntiCF-1.0-SNAPSHOT.jar
In Spark, there are two types of operations: transformations (which are "lazy", i.e. they are only executed when needed) and actions (which trigger execution immediately).
I assume that:
- The computing steps seem very fast because they are lazy transformations; no real work happens at that point.
- The write/saveAsTable step seems very slow because it is an action, which triggers Spark to perform all the lazy transformations that weren't computed until this point.
==> The reason it takes so long to write to disk is that the whole calculation has to be performed before anything can be written to disk.
http://spark.apache.org/docs/latest/programming-guide.html
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
Note: it is possible that modifying your machine learning Spark code and/or your Spark resources (executors, memory, cores) will reduce the calculation time.
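One way to confirm where the time goes is to force the computation before the write. Below is a rough pyspark sketch of that idea (not your actual code: mlPredictResult stands in for the prediction DataFrame from the Scala snippet above, and the timings are only illustrative):
import time
df = mlPredictResult.select("orderid", "prediction").cache()   # assumed prediction DataFrame
start = time.time()
n = df.count()   # action: forces the whole lazy ML pipeline to run and caches the result
print("computation: %d rows in %.1f s" % (n, time.time() - start))
start = time.time()
df.write.mode("overwrite").saveAsTable("tmp_sbu_vadmtestdb.AntiCF_ClickFarming_predicted")
print("write: %.1f s" % (time.time() - start))
If the first timing dominates, the bottleneck is the computation itself, not the table write.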
I'm new to Hadoop and Big Data. We have hundreds of log files every day; each file is about 78 MB. So we thought we could benefit from a Hadoop job, for which we could write a Pig UDF and submit it to Amazon EMR.
We wrote a really simple Pig UDF:
public class ProcessLog extends EvalFunc<String> {
// Extract IP Address from log file line by line and convert that to JSON format.
}
It works locally with Pig and Hadoop, so we submitted it to Amazon EMR and ran it on 5 x-large instances. It took about 40 minutes to finish. We then thought that doubling the instances (10 x-large) would get us the result faster, but it ended up being slower. What factors do we need to account for when writing a Pig UDF to get results faster?
Hundreds of log files ... Each file is about 78 MB
The problem is that you don't have "Big Data". Unless you are doing seconds of processing for each MB, it will be faster NOT to use Hadoop. (The best definition of big data is "Data so big or streaming so fast that normal tools don't work".)
Hadoop has a lot of overhead, so you should use "normal" tools when your data is that small (a few GB). Your data probably fits into RAM on my phone! Use something like GNU parallel to make sure all your cores are occupied.
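For a sense of scale, hundreds of ~78 MB files is only tens of GB. As a sketch of the "normal tools" approach (hypothetical: it assumes the logs sit in a local logs/ directory and that the goal is the IP-to-JSON extraction described in the question), Python's multiprocessing keeps every core busy with none of Hadoop's overhead:
import glob
import json
import multiprocessing
import re

IP_RE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')  # naive IPv4 matcher

def extract_ips(path):
    """Read one log file and return one JSON record per line that contains an IP."""
    records = []
    with open(path) as f:
        for line in f:
            match = IP_RE.search(line)
            if match:
                records.append(json.dumps({"ip": match.group(0)}))
    return records

if __name__ == "__main__":
    files = glob.glob("logs/*.log")           # hypothetical input directory
    with multiprocessing.Pool() as pool:      # one worker per core by default
        for recs in pool.imap_unordered(extract_ips, files):
            for rec in recs:
                print(rec)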
You need to check the following things when you run the job:
Number of mappers used
Number of reducers used
As you are processing about 7 GB of data, the job should create more than 56 mappers (with a 128 MB split size). In your case you can run it as a map-only job that converts each line to JSON. If it is not a map-only job, check how many reducers are being used; if only a few, increasing the number of reducers for the job might help. But you can also eliminate the reducers completely.
Please paste the progress log of the execution, including the counters; it will help in pinpointing the issue.
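If you do stay on EMR, a map-only job can also be expressed with Hadoop Streaming instead of a Pig UDF. The sketch below is hypothetical (mapper.py, the input/output paths, and the streaming jar location are placeholders); the important part is that setting the number of reduces to 0 removes the reduce phase entirely:
#!/usr/bin/env python
# mapper.py -- hypothetical streaming mapper: emit one JSON record per matching log line.
# A map-only run would look something like (paths and jar location are placeholders):
#   hadoop jar hadoop-streaming.jar -D mapreduce.job.reduces=0 \
#     -files mapper.py -mapper mapper.py -input /logs -output /logs-json
import json
import re
import sys

IP_RE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')

for line in sys.stdin:
    match = IP_RE.search(line)
    if match:
        print(json.dumps({"ip": match.group(0)}))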
I'm actually trying to implement a solution with Hadoop, using Hive on CDH 5.0 with YARN. My architecture is:
1 NameNode
3 DataNodes
I'm querying ~123 million rows with 21 columns.
My nodes are virtualized with 2 vCPUs @ 2.27 GHz and 8 GB of RAM.
So I tried some queries and got some results, and then I ran the same queries against a basic MySQL instance holding the same dataset in order to compare the results.
MySQL turns out to be much faster than Hive, and I'm trying to understand why. I know some of the poor performance comes from my hosts. My main question is: is my cluster sized correctly?
Do I need to add more DataNodes for this amount of data (which is not very big in my opinion)?
And if anyone has run similar queries on approximately the same architecture, you are welcome to share your results.
Thanks!
I'm querying ~123 million rows with 21 columns [...] which is not very big in my opinion
That's exactly the problem: it's not enormous. Hive is a big data solution and is not designed to run on small datasets like the one you're using. It's like using a forklift to take out your kitchen trash. Sure, it will work, but it's probably faster to just take it out by hand.
Now, having said all that, you have a couple of options if you want realtime performance closer to that of a traditional RDBMS.
Hive 0.13+, which uses Tez, ORC, and a number of other optimizations that greatly improve response time
Impala (part of the CDH distribution), which bypasses MapReduce altogether but is more limited in file format support
Edit:
I'm saying that with 2 DataNodes I get the same performance as with 3
That's not surprising at all. Since Hive uses MapReduce to handle query operators (join, group by, ...), it incurs all the costs that come with MapReduce. This cost is more or less constant regardless of the size of the data and the number of DataNodes.
Let's say you have a dataset with 100 rows in it. You might see 98% of your processing time in MapReduce initialization and 2% in actual data processing. As the size of your data increases, the cost associated with MapReduce becomes negligible compared to the total time taken.
Summary: How can I get Hadoop to use more CPUs concurrently on my server?
I'm running Cassandra and Hadoop on a single high-end server with 64GB RAM, SSDs, and 16 CPU cores. The input to my mapreduce job has 50M rows. During the map phase, Hadoop creates seven mappers. Six of those complete very quickly, and the seventh runs for two hours to complete the map phase. I've suggested more mappers like this ...
job.getConfiguration().set("mapred.map.tasks", "12");
but Hadoop continues to create only seven. I'd like to get more mappers running in parallel to take better advantage of the 16 cores in the server. Can someone explain how Hadoop decides how many mappers to create?
I have a similar concern during the reduce phase. I tell Hadoop to create 12 reducers like this ...
job.setNumReduceTasks(12);
Hadoop does create 12 reducers, but 11 complete quickly and the last one runs for hours. My job has 300K keys, so I don't imagine they're all being routed to the same reducer.
Thanks.
The number of map tasks depends on your input data.
For example:
if your data source is HBase, the number of map tasks is the number of regions in your table
if your data source is a file, the number of map tasks is roughly the file size divided by the block size (64 MB or 128 MB)
You cannot force the number of map tasks from code; settings like mapred.map.tasks are only a hint.
The problem of 6 fast mappers and 1 slow one is caused by unbalanced data. I haven't used Cassandra before, so I can't tell you how to fix that.
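To make the file-based rule concrete, here is a tiny illustrative sketch (the numbers are hypothetical; a Cassandra-backed job splits according to its own input format, as noted in the comments):
import math

# Hypothetical illustration of the rule above: one map task per input split,
# where the split size defaults to the HDFS block size for file-based input.
def expected_map_tasks(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    return max(1, int(math.ceil(float(file_size_bytes) / block_size_bytes)))

print(expected_map_tasks(7 * 1024 ** 3))   # a 7 GB file with 128 MB blocks -> 56 map tasks
# Cassandra (and HBase) input formats split by token ranges / regions instead,
# so the mapper count there is driven by the input format, not by this formula.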
I've set up and am testing out a pseudo-distributed Hadoop cluster (with the NameNode, JobTracker, and TaskTracker/DataNode all on the same machine). The box I'm running on has about 4 GB of memory and 2 CPUs, is 32-bit, and is running Red Hat Linux.
I ran the sample grep programs found in the tutorials with various file sizes and numbers of files. I've found that grep takes around 45 seconds for a 1 MB file, 60 seconds for a 100 MB file, and about 2 minutes for a 1 GB file.
I also created my own Map Reduce program which cuts out all the logic entirely; the map and reduce functions are empty. This sample program took 25 seconds to run.
I have tried moving the DataNode to a second machine, as well as adding a second node, but I'm only seeing changes of a few seconds. In particular, I have noticed that setup and cleanup times are always about 3 seconds, no matter what input I give it. This seems to me like a really long time just for setup.
I know that these times will vary greatly depending on my hardware, configuration, inputs, etc. but I was just wondering if anyone can let me know if these are the times I should be expecting or if with major tuning and configuration I can cut it down considerably (for example, grep taking < 5 seconds total).
You have only 2 CPUs, and in pseudo-distributed mode Hadoop spawns many JVMs: one for the NameNode, one for the DataNode, one for the TaskTracker, and one for the JobTracker. For each file in your job's input path Hadoop sets up a map task, and each task spawns a new JVM as well. So your two cores are sharing 4+n applications, and your times are not abnormal. Also, Hadoop won't be as fast on plain-text files as on sequence files: to get a real speedup you have to store the text in a serialized binary form (e.g. SequenceFiles) and let Hadoop stream over it.
A few thoughts:
There is always a fixed time cost for every Hadoop job run: calculating the splits and launching the JVMs on each node to run the map and reduce tasks.
You won't experience any real speedup over UNIX grep unless you start running on multiple nodes with lots of data. With 100 MB to 1 GB files, a lot of the time will be spent setting up the jobs rather than doing actual grepping. If you don't anticipate dealing with more than a gig or two of data, it probably isn't worth using Hadoop.