I have the following code, which gets the distinct phone numbers and creates a union of all the calls made.
//Get all the calls for the last 24 hours for each MSISDN in the hour
val sCallsPlaced = (grouped24HourCallsPlaS).join(distinctMSISDNs)
val oCallsPlaced = (grouped24HourCallsPlaO).join(distinctMSISDNs)
val sCallsReceived = grouped24HourCallsRecS.join(distinctMSISDNs)
val oCallsReceived = grouped24HourCallsRecO.join(distinctMSISDNs)
val callsToProcess = sCallsPlaced.union(oCallsPlaced)
  .union(sCallsReceived)
  .union(oCallsReceived)
The spark-defaults.conf file has the following:
spark.driver.memory=16g
spark.driver.cores=1
spark.driver.maxResultSize=2g
spark.executor.memory=24g
spark.executor.cores=10
spark.default.parallelism=256
The question is: will Spark be able to process 50G of data on a 256G machine that also runs the Hadoop services (namenode, datanode, secondary namenode), YARN, and HBase?
The HBase processes (HMaster, HQuorumPeer, and HRegionServers) take up around 20G each.
Also, is there a faster way than using "union" in Spark?
How many partitions are there for each RDD? How does the Serde look for the records?
As for the relational algebra, maybe perform the unions first and then perform the join.
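The suggestion to union first and join once can be illustrated with a small plain-Python sketch (this is not Spark code). The sample records and the helper name inner_join are made up for illustration. It only shows that both orderings produce the same rows, so the four joins in the question could in principle become one; whether that is actually faster depends on the data and the partitioning.

```python
from collections import Counter


def inner_join(left, right):
    """Inner join of two lists of (key, value) pairs, similar in spirit to RDD.join."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]


# Hypothetical sample data: (msisdn, call) pairs and the distinct MSISDNs of the hour.
placed = [("111", "call-a"), ("222", "call-b")]
received = [("111", "call-c"), ("333", "call-d")]
distinct_msisdns = [("111", None), ("222", None)]

# Join each input separately, then union (the shape of the code in the question).
join_then_union = inner_join(placed, distinct_msisdns) + inner_join(received, distinct_msisdns)

# Union first, then a single join (the suggestion above).
union_then_join = inner_join(placed + received, distinct_msisdns)

# Both orderings yield the same multiset of rows; only the number of joins differs.
same_rows = Counter(join_then_union) == Counter(union_then_join)
print(same_rows)
```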
Given a Hive table partitioned by some_field (of int type) with data stored as Avro files, I want to query the table using Spark SQL in such a way that the returned DataFrame is already partitioned by some_field (the column used for partitioning).
The query is simply:
SELECT * FROM some_table
By default Spark doesn't do that; the returned data_frame.rdd.partitioner is None.
One way to get this result is explicit repartitioning after querying, but there is probably a better solution.
HDP 2.6, Spark 2.
Thanks.
First of all, you have to distinguish between the partitioning of a Dataset and the partitioning of the converted RDD[Row]. No matter what the execution plan of the former is, the latter won't have a Partitioner:
scala> val df = spark.range(100).repartition(10, $"id")
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> df.rdd.partitioner
res1: Option[org.apache.spark.Partitioner] = None
However, the internal RDD might have a Partitioner:
scala> df.queryExecution.toRdd.partitioner
res2: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.sql.execution.CoalescedPartitioner@5a05e0f3)
This, however, is unlikely to help you here, because as of today (Spark 2.2) the Data Source API is not aware of the physical storage information (with the exception of simple partition pruning). This should change in the upcoming Data Source API. Please refer to the JIRA ticket (SPARK-15689) and the design document for details.
I want to use BerkeleyDB with Hadoop and Spark. Is there any guide or tutorial available for running Berkeley DB over a cluster of multiple nodes? (I have an 8-node cluster.)
Is Berkeley DB the right choice for storing big data for analytics? I want a tree-structured DB.
Is there a better option?
I found the answer myself.
When we connect to Berkeley DB using
import bsddb3
fileName = '/your/berkeley/file/path'
berkleyObject = bsddb3.btopen(fileName)
it basically gives us a dictionary-like object holding the complete data, from which we can create a DataFrame using pandas
df = pandas.DataFrame(berkleyObject.items(),columns=['Key','value'])
and then we can load this DataFrame into Spark's SQLContext
sparkDF = sparkSql.createDataFrame(df)
I haven't tried creating the DataFrame directly from the bsddb3 object, but I expect it will work too
sparkSql.createDataFrame(berkleyObject.items())
Because a Spark DataFrame is distributed like an RDD, all the SQL queries we make will use Spark's distributed processing, i.e. they will run in parallel on all the worker nodes.
sparkDF.registerTempTable("Data")
result = sparkSql.sql("SELECT * FROM Data WHERE Key == 'xxxx' ")
The only catch is that converting the dictionary object into a DataFrame is too slow. I am still working on it.
I have a Hive external table with 255 columns and an input data size of around 25 GB. This is a single-node cluster set up with Hadoop-1.2.1 and hive-0.11.0.
I am able to create tables, databases etc... But when I try a count(*) query in hive, the mapper succeeds but the reducers never start. They are stuck at 0% forever.
The single node machine has a memory of 1TB. Any inputs here will be greatly appreciated.
My suggestion is to use Beeline instead of the Hive CLI. The Hive CLI is deprecated, so some of its issues are unlikely to be resolved.
I am working on a use case where I have to transfer data from an RDBMS to HDFS. We benchmarked this case using Sqoop and found that we can transfer around 20 GB of data in 6-7 minutes.
Whereas when I try the same with Spark SQL, the performance is very low (1 GB of records takes 4 minutes to transfer from Netezza to HDFS). I am trying to tune it to improve performance, but I am unlikely to reach the level of Sqoop (around 3 GB of data per minute).
I agree that Spark is primarily a processing engine, but my main question is this: both Spark and Sqoop use a JDBC driver internally, so why is there so much difference in performance (or maybe I am missing something)? I am posting my code here.
import org.apache.spark.{SparkConf, SparkContext}

object helloWorld {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Netezza_Connection").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    sqlContext.read.format("jdbc")
      .option("url", "jdbc:netezza://hostname:port/dbname")
      .option("dbtable", "POC_TEST")
      .option("user", "user").option("password", "password")
      .option("driver", "org.netezza.Driver")
      .option("numPartitions", "14")
      .option("lowerBound", "0").option("upperBound", "13")
      .option("partitionColumn", "id")
      .option("fetchSize", "100000")
      .load().registerTempTable("POC")
    val df2 = sqlContext.sql("select * from POC")
    val partitioner = new org.apache.spark.HashPartitioner(14)
    val rdd = df2.rdd.map(x => (String.valueOf(x.get(1)), x)).partitionBy(partitioner).values
    rdd.saveAsTextFile("hdfs://Hostname/test")
  }
}
I have checked many other posts but could not find a clear answer on the internal workings and tuning of Sqoop, nor any Sqoop vs. Spark SQL benchmarks. Kindly help me understand this issue.
You are using the wrong tools for the job.
Sqoop will launch a slew of processes (on the datanodes) that will each make a connection to your database (see --num-mappers), and each will extract a part of the dataset. I don't think you can achieve that kind of read parallelism with Spark.
Get the dataset with Sqoop and then process it with Spark.
You can try the following:
Read the data from Netezza without any partitioning and with the fetch size increased to a million.
sqlContext.read.format("jdbc").option("url","jdbc:netezza://hostname:port/dbname").option("dbtable","POC_TEST").option("user","user").option("password","password").option("driver","org.netezza.Driver").option("fetchSize","1000000").load().registerTempTable("POC")
Repartition the data before writing it to the final file.
val df3 = df2.repartition(10) //to reduce the shuffle
Columnar formats such as ORC and Parquet are more optimized than plain text. Write the final output as Parquet or ORC.
df3.write.format("ORC").save("hdfs://Hostname/test")
@amitabh
Although marked as an answer, I disagree with it.
Once you give the predicate to partition the data while reading over JDBC, Spark will run a separate task for each partition. In your case the number of tasks should be 14 (you can confirm this in the Spark UI).
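As a rough illustration of what "a separate task for each partition" means, the plain-Python sketch below splits a numeric range into one WHERE clause per partition. It is a simplification written for this answer, not Spark's actual implementation (whose details differ between versions), and the function name partition_predicates is hypothetical.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Split the range lower_bound..upper_bound into num_partitions WHERE clauses.

    Simplified sketch: the first and last clauses are left open-ended so that
    rows outside the bounds (and NULLs) are still read by some task.
    """
    if num_partitions <= 1:
        return ["1 = 1"]  # a single task reads everything
    stride = max((upper_bound - lower_bound) // num_partitions, 1)
    predicates = []
    for i in range(num_partitions):
        start = lower_bound + i * stride
        end = start + stride
        if i == 0:
            predicates.append(f"{column} < {end} OR {column} IS NULL")
        elif i == num_partitions - 1:
            predicates.append(f"{column} >= {start}")
        else:
            predicates.append(f"{column} >= {start} AND {column} < {end}")
    return predicates


# Values from the question: partition column "id", bounds 0 and 13, 14 partitions.
for clause in partition_predicates("id", 0, 13, 14):
    print(clause)
```

Each clause would become one query against the database, so the reads only overlap in time if enough cores are available to run the tasks concurrently.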
I notice that you are using local as the master, which provides only 1 core for execution. Hence there is no parallelism, which is what is happening in your case.
Now, to get the same throughput as Sqoop, you need to make sure that these tasks run in parallel. Theoretically this can be done either by:
1. Using 14 executors with 1 core each
2. Using 1 executor with 14 cores (other end of the spectrum)
Typically, I would go with 4-5 cores per executor. So I would test the performance with 15/5 = 3 executors (I added 1 to 14 to account for 1 core for the driver running in cluster mode).
Use spark.executor.cores and spark.executor.instances in sparkConf.set to experiment with the configuration.
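The rule of thumb above can be written down as a tiny calculation. This is only a sketch of the arithmetic in this answer; the function name executors_needed and the default of one driver core are assumptions for illustration.

```python
import math


def executors_needed(num_tasks, cores_per_executor, driver_cores=1):
    """Executors needed to run num_tasks concurrently, counting cores for the driver too."""
    total_cores = num_tasks + driver_cores
    return math.ceil(total_cores / cores_per_executor)


# 14 JDBC partitions plus 1 core for the driver, at 5 cores per executor.
print(executors_needed(14, 5))  # 3
# The same workload at 4 cores per executor.
print(executors_needed(14, 4))  # 4
```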
If this does not significantly increase performance, the next thing would be to look at the executor memory.
Finally, I would tweak the application logic to look at mapRDD sizes, partition sizes and shuffle sizes.
I had the same problem, because the piece of code you are using does not actually partition the read.
sqlContext.read.format("jdbc").option("url","jdbc:netezza://hostname:port/dbname").option("dbtable","POC_TEST").option("user","user").option("password","password").option("driver","org.netezza.Driver").option("numPartitions","14").option("lowerBound","0").option("upperBound","13").option("partitionColumn", "id").option("fetchSize","100000").load().registerTempTable("POC")
You can check the number of partitions created in your Spark job with
df.rdd.partitions.length
You can use the following code to connect to the database:
sqlContext.read.jdbc(url=db_url,
  table=tableName,
  columnName="ID",
  lowerBound=1L,
  upperBound=100000L,
  numPartitions=numPartitions,
  connectionProperties=connectionProperties)
To optimize your Spark job, the following parameters matter:
1. # of partitions
2. --num-executors
3. --executor-cores
4. --executor-memory
5. --driver-memory
6. fetch-size
Options 2, 3, 4 and 5 depend on your cluster configuration.
You can monitor your Spark job in the Spark UI.
Sqoop and Spark SQL both use JDBC connectivity to fetch data from RDBMS engines, but Sqoop has an edge here since it was built specifically to migrate data between an RDBMS and HDFS.
Every option available in Sqoop has been fine-tuned to get the best performance during data ingestion.
You can start with the -m option, which controls the number of mappers.
This is what you need in order to fetch data in parallel from an RDBMS. Can I do it in Spark SQL?
Yes, of course, but the developer has to take care of the "multithreading" that Sqoop handles automatically.
The below solution helped me
var df=spark.read.format("jdbc").option("url","url").option("user","user").option("password","password").option("dbTable","dbTable").option("fetchSize","10000").load()
df.registerTempTable("tempTable")
var dfRepart=spark.sql("select * from tempTable distribute by primary_key") //this will repartition the data evenly
dfRepart.write.format("parquet").save("hdfs_location")
Apache Sqoop is retired now - https://attic.apache.org/projects/sqoop.html
Using Apache Spark is a good option. This link shows how Spark can be used instead of Sqoop - https://medium.com/zaloni-engineering/apache-spark-vs-sqoop-engineering-a-better-data-pipeline-ef2bcb32b745
Otherwise, one can choose a cloud service such as Azure Data Factory or Amazon Redshift.
I'm collecting data from a messaging app. I'm currently using Flume, which sends approximately 50 million records per day.
I wish to use Kafka,
consume from Kafka using Spark Streaming,
and persist the data to Hadoop and query it with Impala.
I'm having issues with each approach I've tried.
Approach 1 - Save RDD as parquet, point an external hive parquet table to the parquet directory
// scala
val ssc = new StreamingContext(sparkConf, Seconds(bucketsize.toInt))
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
lines.foreachRDD(rdd => {
  // 1 - Create a SchemaRDD object from the rdd and specify the schema
  val SchemaRDD1 = sqlContext.jsonRDD(rdd, schema)
  // 2 - Register it as a Spark SQL table
  SchemaRDD1.registerTempTable("sparktable")
  // 3 - Query sparktable to produce another SchemaRDD, 'finalParquet', with the data needed, and persist it as Parquet files
  val finalParquet = sqlContext.sql(sql)
  finalParquet.saveAsParquetFile(dir)
})
The problem is that finalParquet.saveAsParquetFile outputs a huge number of files: the DStream received from Kafka produces over 200 files for a 1-minute batch.
The reason it outputs so many files is that the computation is distributed, as explained in another post: how to make saveAsTextFile NOT split output into multiple file?
However, the proposed solutions don't seem optimal to me. For example, as one user states: having a single output file is only a good idea if you have very little data.
Approach 2 - Use HiveContext and insert the RDD data directly into a Hive table
# python
def sendRecord(rdd):
    sql = "INSERT INTO TABLE table select * from beacon_sparktable"
    # 1 - Apply the schema to the RDD, creating a DataFrame 'beaconDF'
    beaconDF = sqlContext.jsonRDD(rdd, schema)
    # 2 - Register the DataFrame as a Spark SQL table
    beaconDF.registerTempTable("beacon_sparktable")
    # 3 - Insert into Hive directly from a query on the Spark SQL table
    sqlContext.sql(sql)

sqlContext = HiveContext(sc)
ssc = StreamingContext(sc, int(batch_interval))
kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topics: 1})
lines = kvs.map(lambda x: x[1]).persist(StorageLevel.MEMORY_AND_DISK_SER)
lines.foreachRDD(sendRecord)
This works fine and inserts directly into a Parquet table, but there are scheduling delays for the batches because the processing time exceeds the batch interval.
The consumer can't keep up with what is being produced, and the batches to process begin to queue up.
It seems that writing to Hive is slow. I've tried adjusting the batch interval and running more consumer instances.
In summary
What is the best way to persist Big data from Spark Streaming given that there are issues with multiple files and potential latency with writing to hive?
What are other people doing?
A similar question has been asked here, but the asker has an issue with directories as opposed to too many files:
How to make Spark Streaming write its output so that Impala can read it?
Many thanks for any help.
In solution #2, the number of files created can be controlled via the number of partitions of each RDD.
See this example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// create a Hive table (or assume it already exists)
sqlContext.sql("CREATE TABLE test (id int, txt string) STORED AS PARQUET")
// create an RDD with 2 records and only 1 partition
val rdd = sc.parallelize(List( List(1, "hello"), List(2, "world") ), 1)
// create a DataFrame from the RDD
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("txt", StringType, nullable = false)
))
val df = sqlContext.createDataFrame(rdd.map( Row(_:_*) ), schema)
// this creates a single file, because the RDD has 1 partition
df.write.mode("append").saveAsTable("test")
Now, I guess you can play with the frequency at which you pull data from Kafka and with the number of partitions of each RDD (by default, the number of partitions of your Kafka topic, which you can reduce by repartitioning).
I'm using Spark 1.5 from CDH 5.5.1, and I get the same result using either df.write.mode("append").saveAsTable("test") or your SQL string.
I think the small-file problem can be resolved to some extent. You may be getting a large number of files because of the Kafka partitions. In my case, I have a 12-partition Kafka topic and I write using df.write.mode("append").parquet("/location/on/hdfs").
Depending on your requirements, you can add coalesce(1) (or a larger number) to reduce the number of files. Another option is to increase the micro-batch duration. For example, if you can accept a 5-minute delay in writing the data, you can use a micro-batch of 300 seconds.
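To get a feel for how much these two knobs matter, here is a back-of-the-envelope sketch in plain Python. It assumes each partition of each micro-batch writes exactly one file, which is a simplification, and the function name files_per_day is made up for this example.

```python
def files_per_day(partitions_per_batch, batch_seconds):
    """Files written per day, assuming one file per partition per micro-batch."""
    batches_per_day = 24 * 60 * 60 // batch_seconds
    return partitions_per_batch * batches_per_day


# 12 Kafka partitions with 1-minute batches.
print(files_per_day(12, 60))
# 12 Kafka partitions with 5-minute batches.
print(files_per_day(12, 300))
# coalesce(1) with 5-minute batches.
print(files_per_day(1, 300))
```

Under that assumption, lengthening the batch and coalescing together cut the file count from 17280 to 288 per day.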
As for the second issue, the batches queue up only because you don't have back pressure enabled. First, you should verify the maximum you can process in a single batch. Once you have a rough idea of that number, you can set spark.streaming.kafka.maxRatePerPartition and spark.streaming.backpressure.enabled=true to limit the number of records per micro-batch. If you still cannot meet the demand, the only options are to increase the number of partitions on the topic or to increase the resources of the Spark application.
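A quick way to sanity-check a candidate rate limit is to compare the cap it implies with the average incoming load. The sketch below is plain arithmetic, assuming the Kafka direct stream API, where spark.streaming.kafka.maxRatePerPartition is a per-second, per-partition limit. The rate of 100 is an arbitrary example value, and both function names are hypothetical.

```python
def max_records_per_batch(max_rate_per_partition, kafka_partitions, batch_seconds):
    """Upper bound on records in one micro-batch under a per-partition rate limit."""
    return max_rate_per_partition * kafka_partitions * batch_seconds


def average_rate_per_second(records_per_day):
    """Average incoming records per second for a given daily volume."""
    return records_per_day / (24 * 60 * 60)


# 50 million records per day (from the question), 12 partitions, 60-second batches.
incoming_per_batch = average_rate_per_second(50_000_000) * 60
cap_per_batch = max_records_per_batch(100, 12, 60)

# The cap should sit above the average load, otherwise the backlog only grows.
print(round(incoming_per_batch), cap_per_batch)
```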