I have a hive external table with 255 columns which has input data size of around 25 GB. This is a single node cluster set up with Hadoop-1.2.1 and hive-0.11.0.
I am able to create tables, databases etc... But when I try a count(*) query in hive, the mapper succeeds but the reducers never start. They are stuck at 0% forever.
The single node machine has a memory of 1TB. Any inputs here will be greatly appreciated.
My suggestion is to use beeline instead of hive, Hive is deprecated so some issues will not be resolved when it is getting deprecated.
I am working on a use case where I have to transfer data from RDBMS to HDFS. We have done the benchmarking of this case using sqoop and found out that we are able to transfer around 20GB data in 6-7 Mins.
Where as when I try the same with Spark SQL, the performance is very low(1 Gb of records is taking 4 min to transfer from netezza to hdfs). I am trying to do some tuning and increase its performance but its unlikely to tune it to the level of sqoop(around 3 Gb of data in 1 Min).
I agree to the fact that spark is primarily a processing engine but my main question is that both spark and sqoop are using JDBC driver internally so why there is so much difference in the performance(or may be I am missing something). I am posting my code here.
object helloWorld {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("Netezza_Connection").setMaster("local")
val sc= new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.read.format("jdbc").option("url","jdbc:netezza://hostname:port/dbname").option("dbtable","POC_TEST").option("user","user").option("password","password").option("driver","org.netezza.Driver").option("numPartitions","14").option("lowerBound","0").option("upperBound","13").option("partitionColumn", "id").option("fetchSize","100000").load().registerTempTable("POC")
val df2 =sqlContext.sql("select * from POC")
val partitioner= new org.apache.spark.HashPartitioner(14)
val rdd=df2.rdd.map(x=>(String.valueOf(x.get(1)),x)).partitionBy(partitioner).values
rdd.saveAsTextFile("hdfs://Hostname/test")
}
}
I have checked many other post but could not get a clear answer for the internal working and tuning of sqoop nor I got sqoop vs spark sql benchmarking .Kindly help in understanding this issue.
You are using the wrong tools for the job.
Sqoop will launch a slew of processes (on the datanodes) that will each make a connections to your database (see num-mapper) and they will each extract a part of the dataset. I don't think you can achieve kind of read parallelism with Spark.
Get the dataset with Sqoop and then process it with Spark.
you can try the following:-
Read data from netezza without any partitions and with increased fetch_size to a million.
sqlContext.read.format("jdbc").option("url","jdbc:netezza://hostname:port/dbname").option("dbtable","POC_TEST").option("user","user").option("password","password").option("driver","org.netezza.Driver").option("fetchSize","1000000").load().registerTempTable("POC")
repartition the data before writing it to final file.
val df3 = df2.repartition(10) //to reduce the shuffle
ORC formats are more optimized than TEXT. Write the final output to parquet/ORC.
df3.write.format("ORC").save("hdfs://Hostname/test")
#amitabh
Although marked as an answer, I disagree with it.
Once you give the predicate to partition the data while reading from the jdbc, spark will run separate tasks for each partition. In your case no of tasks should be 14 (u can confirm this using spark UI).
I notice that you are using local as master, which would provide only 1 core for executors. Hence there will be no parallelism. Which is what is happening in your case.
Now to get the same throughput as sqoop you need to make sure that these tasks are running in parallel. Theoretically this can be done either by:
1. Using 14 executors with 1 core each
2. Using 1 executor with 14 cores (other end of the spectrum)
Typically, I would go with 4-5 cores per executor. So I test the performance with 15/5= 3 executors (i added 1 to 14 to consider 1 core for the driver running in clustor mode).
Use: executor.cores, executor.instances in sparkConf.set to play with the configs.
If this does not significantly increase performance, the next thing would be to look at the executor memory.
Finally, I would tweak the application logic to look at mapRDD sizes, partition sizes and shuffle sizes.
I had the same problem because the piece of code you are using it's not working for partition.
sqlContext.read.format("jdbc").option("url","jdbc:netezza://hostname:port/dbname").option("dbtable","POC_TEST").option("user","user").option("password","password").option("driver","org.netezza.Driver").option("numPartitions","14").option("lowerBound","0").option("upperBound","13").option("partitionColumn", "id").option("fetchSize","100000").load().registerTempTable("POC")
You can check number of partitions created in you spark job by
df.rdd.partitions.length
you can use following code to connect db:
sqlContext.read.jdbc(url=db_url,
table=tableName,
columnName="ID",
lowerBound=1L,
upperBound=100000L,
numPartitions=numPartitions,
connectionProperties=connectionProperties)
To optimize your spark job following are the parameters:
1. # of partitions
2. --num-executors
3.--executor-cores
4. --executor-memory
5. --driver-memory
6. fetch-size
2,3,4 and 5 options are depends on you cluster configurations
you can monitor your spark job on spark ui.
Sqoop and Spark SQL both use JDBC connectivity to fetch the data from RDBMS engines but Sqoop has an edge here since it is specifically made to migrate the data between RDBMS and HDFS.
Every single option available in Sqoop has been fine-tuned to get the best performance while doing the data ingestions.
You can start with discussing the option -m which control the number of mappers.
This is what you need to do to fetch data in parallel from RDBMS. Can I do it in Spark SQL?
Of course yes but the developer would need to take care of "multithreading" that Sqoop has been taking care automatically.
The below solution helped me
var df=spark.read.format("jdbc").option("url","
"url").option("user","user").option("password","password").option("dbTable","dbTable").option("fetchSize","10000").load()
df.registerTempTable("tempTable")
var dfRepart=spark.sql("select * from tempTable distribute by primary_key") //this will repartition the data evenly
dfRepart.write.format("parquet").save("hdfs_location")
Apache Sqoop is retired now - https://attic.apache.org/projects/sqoop.html
Using Apache Spark is a good option. This link shows how Spark can be used instead of Sqoop - https://medium.com/zaloni-engineering/apache-spark-vs-sqoop-engineering-a-better-data-pipeline-ef2bcb32b745
Else one can choose any cloud services like Azure Data Factory or Amazon Redshift etc.
I am a quite newbie in Cassandra and have following question:
I have a 7-node Cassandra (v2.0.11) cluster and a table with 10k rows.
I run a hadoop job (datanodes reside on cassandra nodes of course) that reads data from that table and I see that only 7k rows is read to map phase.
I checked CqlInputFormat source code and noticed that a CQL query is build to select node-local date and also LIMIT clause is added (1k default). So that 7k read rows can be explained:
7 nodes * 1k limit = 7k rows read total
The limit can be changed using CqlConfigHelper:
CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");
Please help me with questions below:
Is this a desired behavior?
Why CqlInputFormat does not page through the rest of rows?
Is it a bug or should I just increase the InputCQLPageRowSize value?
What if I want to read all data in table and do not know the row count?
My problem was related to a bug in cassandra 2.0.11 that added a strange LIMIT clause in underlying CQL query run to read data to the map task:
I posted that issue to cassandra jira: https://issues.apache.org/jira/browse/CASSANDRA-9074
It turned out that that problem was stricly related to the following bug fixed in cassandra 2.0.12: https://issues.apache.org/jira/browse/CASSANDRA-8166
I am trying to export data using sqoop from files stored in hdfs to vertica. For around 10k's of data the files get loaded within a few minutes. But when I try to run crores of data, it is loading around .5% within 15 mins or so. I have tried to increase the number of mappers, but they are not serving any purpose to improve efficienct. Even setting the chunk size to increase the number the mappers, does not increase the number.
Please help.
Thanks!
As you are using Batch export try increasing the records per transaction and records per statement parameter using the following properties:
sqoop.export.records.per.statement : property will aggregate multiple rows inside one single insert statement.
sqoop.export.records.per.transaction: how many insert statements will be issued per transaction
I hope these will surely solves the issue.
Most MPP/RDBMS have sqoop connectors to exploit the parallelism and increase efficiency in transfer of data between HDFS and MPP/RDBMS. However it seems the vertica has taken this approach: http://www.vertica.com/2012/07/05/teaching-the-elephant-new-tricks/
https://github.com/vertica/Vertica-Hadoop-Connector
Is this a "wide" dataset? It might be a sqoop bug https://issues.apache.org/jira/browse/SQOOP-2920 if number of columns is very high (in hundreds), sqoop starts choking (very high on cpu). When number of fields is small, it's usually other way around - when sqoop is bored and rdbms systems can't keep up.
I work on hive and i am new to it. I am facing some issues regarding the performance in hive query.
Number of mappers allocated to my job is very low even though there
are hundreds of mappers available. I have tried setting
mapred.map.tasks=200. But it takes only 20 to 30 mappers. I
understand, number of mappers depend upon the inputsplit. Is there
any other option to increase the mappers? if no then why is the
parameter(mapred.map.tasks) introduced ?
Is there any resource where i can understand to correlate hive
queries to map-reduce jobs, i.e where the different part of the
query is executed?
For more information about setting map tasks, check this link: http://wiki.apache.org/hadoop/HowManyMapsAndReduces. Basically, mapred.map.tasks is just a hint; it doesn't really control anything usually.
To see how Hive queries are executed, simply preface your query with explain. For example: explain select foo from bar;. If you need even more information, there's also explain extended.
I see this question has been asked long time ago, I'll try to answer it even though some of the suggestions here would not be available at the time when question has been asked.
To optimize Hive performance:
Tuning the number of mappers and reducers used by your Hive request; this could be done by tuning the input size for each mapper mapreduce.input.fileinputformat.split.maxsize, and the input size for each reducer: hive.exec.reducers.bytes.per.reducer
bare in mind that "the more the better" is not always true. So you need to tune those numbers to your needs.
Optimize the joins, convert Joins to map-joins, if one of the table is small table (if possible)...
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization)
Partition your table on columns that are often used in conditions (WHERE).
For example if you are requesting frequently SELECT * from myTable WHERE someColumn = 'someValue' it is recommended to partition your table on the column 'someColumn'
This will let your query search just the partition files someColumn=SomePartition instead of searching the whole table files.
Compressing the intermediate results may enhance the performance in some cases (depending on your hardware configuration, network and CPU / memory). This could be done by setting the property: hive.intermediate.compression.codec
Choosing the right compression codec, for example using Snappy (as in here):
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
Not been available at the time of question:
Using optimized file format to store your table , instead of using Text File, or Sequence File, you could use ORC (hive 0.11 +) for example (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC )
Using another engine to execute your queries on, instead of MapReduce, you could use Tez or even Spark.To use tez for example:
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
For further optimization you could refer here
You can decrease 'mapreduce.input.fileinputformat.split.maxsize' to increase the number of mappers (more splits).