I'm collecting data from a messaging app; I'm currently using Flume, which sends approximately 50 million records per day.
I wish to use Kafka,
consume from Kafka using Spark Streaming,
and persist the data to Hadoop and query it with Impala.
I'm having issues with each approach I've tried.
Approach 1 - Save the RDD as Parquet and point an external Hive Parquet table at the Parquet directory
// scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sparkConf, Seconds(bucketsize.toInt))
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)

lines.foreachRDD(rdd => {
  // 1 - create a SchemaRDD from the RDD, applying the known schema
  val SchemaRDD1 = sqlContext.jsonRDD(rdd, schema)
  // 2 - register it as a Spark SQL table
  SchemaRDD1.registerTempTable("sparktable")
  // 3 - query sparktable to produce another SchemaRDD 'finalParquet' with only the data needed, and persist it as Parquet files
  val finalParquet = sqlContext.sql(sql)
  finalParquet.saveAsParquetFile(dir)
})

ssc.start()
ssc.awaitTermination()
The problem is that finalParquet.saveAsParquetFile outputs a huge number of files: the DStream received from Kafka produces over 200 files for a one-minute batch size.
The reason it outputs so many files is that the computation is distributed, as explained in another post - how to make saveAsTextFile NOT split output into multiple file?
However, the proposed solutions don't seem optimal to me; for example, as one user states - Having a single output file is only a good idea if you have very little data.
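(The kind of workaround proposed there is essentially to collapse the data to a single partition before writing, roughly the sketch below, which only seems sensible for small volumes.)
// scala - roughly what the linked answers suggest
finalParquet.coalesce(1).saveAsParquetFile(dir)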
Approach 2 - Use HiveContext and insert the RDD data directly into a Hive table
# python
from pyspark import StorageLevel
from pyspark.sql import HiveContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sqlContext = HiveContext(sc)
ssc = StreamingContext(sc, int(batch_interval))

def sendRecord(rdd):
    sql = "INSERT INTO TABLE table SELECT * FROM beacon_sparktable"
    # 1 - apply the schema to the RDD, creating a DataFrame 'beaconDF'
    beaconDF = sqlContext.jsonRDD(rdd, schema)
    # 2 - register the DataFrame as a Spark SQL table
    beaconDF.registerTempTable("beacon_sparktable")
    # 3 - insert into Hive directly from a query on the Spark SQL table
    sqlContext.sql(sql)

kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topics: 1})
lines = kvs.map(lambda x: x[1]).persist(StorageLevel.MEMORY_AND_DISK_SER)
lines.foreachRDD(sendRecord)

ssc.start()
ssc.awaitTermination()
This works fine and inserts directly into a Parquet table, but there are scheduling delays for the batches because the processing time exceeds the batch interval.
The consumer can't keep up with what's being produced, and the batches to be processed begin to queue up.
It seems that writing to Hive is slow. I've tried adjusting the batch interval size and running more consumer instances.
In summary
What is the best way to persist big data from Spark Streaming, given the issues with multiple files and the potential latency of writing to Hive?
What are other people doing?
A similar question has been asked here, but he has an issue with directories as opposed to too many files:
How to make Spark Streaming write its output so that Impala can read it?
Many thanks for any help.
In solution #2, the number of files created can be controlled via the number of partitions of each RDD.
See this example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// create a Hive table (in your case it presumably already exists)
sqlContext.sql("CREATE TABLE test (id int, txt string) STORED AS PARQUET")
// create an RDD with 2 records and only 1 partition
val rdd = sc.parallelize(List( List(1, "hello"), List(2, "world") ), 1)
// create a DataFrame from the RDD, applying an explicit schema
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("txt", StringType, nullable = false)
))
val df = sqlContext.createDataFrame(rdd.map( Row(_:_*) ), schema)
// this creates a single file, because the RDD has 1 partition
df.write.mode("append").saveAsTable("test")
Now, I guess you can play with the frequency at which you pull data from Kafka and with the number of partitions of each RDD (by default, the partitioning of your Kafka topic, which you can reduce by repartitioning).
I'm using Spark 1.5 from CDH 5.5.1, and I get the same result using either df.write.mode("append").saveAsTable("test") or your SQL string.
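If it helps, a rough sketch of wiring that into the streaming loop from your question might look like the following (purely illustrative; the repartition target of 4 is an arbitrary assumption, not a recommendation):
// scala
lines.foreachRDD { rdd =>
  val df = sqlContext.jsonRDD(rdd.repartition(4), schema)  // fewer partitions => fewer output files
  df.write.mode("append").saveAsTable("test")
}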
I think the small-file problem can be resolved somewhat. You may be getting a large number of files based on the number of Kafka partitions. For me, I have a 12-partition Kafka topic and I write using spark.write.mode("append").parquet("/location/on/hdfs").
Now, depending on your requirements, you can either add coalesce(1) or more to reduce the number of files. Another option is to increase the micro-batch duration. For example, if you can accept a 5-minute delay in writing the data, you can use a micro batch of 300 seconds.
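For instance, something like this (a sketch; 'df' stands for the DataFrame produced for the batch, and the coalesce target and path are placeholders):
// scala
df.coalesce(4)
  .write
  .mode("append")
  .parquet("/location/on/hdfs")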
For the second issue, the batches queue up only because you don't have backpressure enabled. First, you should verify the maximum you can process in a single batch. Once you have that number, you can set the spark.streaming.kafka.maxRatePerPartition value and spark.streaming.backpressure.enabled=true to limit the number of records per micro batch. If you still cannot meet the demand, then the only options are to either increase the number of partitions on the topic or to increase the resources of the Spark application.
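A minimal sketch of those settings (the rate value is only an example; tune it against what a single batch can actually handle):
// scala
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setAppName("kafka-to-hive")
  .set("spark.streaming.backpressure.enabled", "true")
  // cap the ingest rate: records per second, per Kafka partition
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")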
Related
I am very new to Spark and recently inherited an application which does the following:
Gets data from a data source (RDBMS, Excel, CSV, etc.) in batches.
Creates a Dataset using the data from the data sources.
Writes it to Parquet files.
These 3 steps happen sequentially: read 100k rows, create a DataFrame (1M rows), write to Parquet.
Currently the system is running on a single box, where we also have Spark running. We have a Spark cluster (4 workers) that we are planning to use instead.
It currently takes us a very long time to write the rows of data to disk (1M records takes about 6 minutes). Given that we are moving to the Spark cluster, how can I optimize this process and improve the performance?
Here is the code:
Dataset<Row> ds = spark.createDataFrame(rows, schema).coalesce(1);
This creates the Dataset from the rows of data.
ds.write().parquet(curFile.toString());
This writes it to disk.
I have a simple Spark job that streams data to a Delta table.
The table is pretty small and is not partitioned.
A lot of small parquet files are created.
As recommended in the documentation (https://docs.delta.io/1.0.0/best-practices.html) I added a compaction job that runs once a day.
val path = "..."
val numFiles = 16
spark.read
.format("delta")
.load(path)
.repartition(numFiles)
.write
.option("dataChange", "false")
.format("delta")
.mode("overwrite")
.save(path)
Every time the compaction job runs the streaming job gets the following exception:
org.apache.spark.sql.delta.ConcurrentAppendException: Files were added to the root of the table by a concurrent update. Please try the operation again.
I tried to add the following config parameters to the streaming job:
spark.databricks.delta.retryWriteConflict.enabled = true # would be false by default
spark.databricks.delta.retryWriteConflict.limit = 3 # optionally limit the maximum amount of retries
It doesn't help.
Any idea how to solve the problem?
When you're streaming the data in, small files are being created (additive) and these files are referenced in your delta log (an update). When you perform your compaction, you're trying to resolve the small-files overhead by collating the data into larger files (currently 16). These large files are created alongside the small ones, but the change only takes effect when the delta log is written. That is, transactions 0-100 make 100 small files, compaction occurs, and your new transaction says to refer to the 16 large files instead. The problem is that transactions 101-110 have already occurred from the streaming job while the compaction was running. After all, you're compacting ALL of your data, so you essentially have a merge conflict.
The solution is to go to the next step in the best practices and only compact selected partitions using:
.option("replaceWhere", partition)
When you compact every day, the partition variable should represent the partition of your data for yesterday. No new files are being written to that partition, and the delta log can identify that the concurrent changes will not apply to currently incoming data for today.
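Put together, the daily compaction job might look roughly like the sketch below. This is only illustrative: it assumes the table has been partitioned by a date column (the question's table currently isn't partitioned), and the column name 'date' is made up.
// scala
import java.time.LocalDate

val path = "..."
val numFiles = 16
val partition = s"date = '${LocalDate.now().minusDays(1)}'"   // yesterday's partition

spark.read
  .format("delta")
  .load(path)
  .where(partition)                      // compact only yesterday's data
  .repartition(numFiles)
  .write
  .option("dataChange", "false")
  .option("replaceWhere", partition)     // restrict the overwrite to that partition
  .format("delta")
  .mode("overwrite")
  .save(path)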
Given a Hive table partitioned by some_field (of int type), with data stored as Avro files, I want to query the table using Spark SQL in such a way that the returned DataFrame is already partitioned by some_field (the field used for partitioning).
The query is just
SELECT * FROM some_table
By default Spark doesn't do that: the returned data_frame.rdd.partitioner is None.
One way to get the result is via explicit repartitioning after querying, but there is probably a better solution.
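For reference, the explicit repartitioning I have in mind is just something along these lines:
// scala
import spark.implicits._
val df = spark.sql("SELECT * FROM some_table").repartition($"some_field")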
HDP 2.6, Spark 2.
Thanks.
First of all, you have to distinguish between the partitioning of a Dataset and the partitioning of the converted RDD[Row]. No matter what the execution plan of the former is, the latter won't have a Partitioner:
scala> val df = spark.range(100).repartition(10, $"id")
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> df.rdd.partitioner
res1: Option[org.apache.spark.Partitioner] = None
However, the internal RDD might have a Partitioner:
scala> df.queryExecution.toRdd.partitioner
res2: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.sql.execution.CoalescedPartitioner@5a05e0f3)
This, however, is unlikely to help you here, because as of today (Spark 2.2) the Data Source API is not aware of the physical storage information (with the exception of simple partition pruning). This should change in the upcoming Data Source API. Please refer to the JIRA ticket (SPARK-15689) and the design document for details.
I have the following code, which gets the distinct phone numbers and creates a union of all the calls made.
//Get all the calls for the last 24 hours for each MSISDN in the hour
val sCallsPlaced = (grouped24HourCallsPlaS).join(distinctMSISDNs)
val oCallsPlaced = (grouped24HourCallsPlaO).join(distinctMSISDNs)
val sCallsReceived = grouped24HourCallsRecS.join(distinctMSISDNs)
val oCallsReceived = grouped24HourCallsRecO.join(distinctMSISDNs)
val callsToProcess = sCallsPlaced.union(oCallsPlaced)
.union(sCallsReceived)
.union(oCallsReceived)
The spark-defaults.conf file has the following:
spark.driver.memory=16g
spark.driver.cores=1
spark.driver.maxResultSize=2g
spark.executor.memory=24g
spark.executor.cores=10
spark.default.parallelism=256
The question is: will Spark be able to process 50G of data on a 256G machine that is also running the Hadoop services (NameNode, DataNode, Secondary NameNode), YARN, and HBase?
The HBase processes (HMaster, HQuorumPeer, and HRegionServers) take up around 20G each.
Also, is there a faster way than using union in Spark?
How many partitions are there for each RDD? How does the SerDe look for the records?
As for the relational algebra, maybe perform the unions first and then perform the join.
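In other words, something along these lines (a sketch using the names from the question):
// scala - union the four call sets first, then do a single join against the distinct MSISDNs
val allCalls = grouped24HourCallsPlaS
  .union(grouped24HourCallsPlaO)
  .union(grouped24HourCallsRecS)
  .union(grouped24HourCallsRecO)

val callsToProcess = allCalls.join(distinctMSISDNs)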
I'm new to HBase. I'm facing a problem when bulk loading data from a text file into HBase. Assume I have the following table:
Key_id | f1:c1 | f2:c2
row1 'a' 'b'
row1 'x' 'y'
When I parse the 2 records and put them into HBase at the same time (with the same timestamp), only the version {row1 'x' 'y'} is kept. Here is the explanation:
When you put data into HBase, a timestamp is required. The timestamp can be generated automatically by the RegionServer or can be supplied by you. The timestamp must be unique per version of a given cell, because the timestamp identifies the version. To modify a previous version of a cell, for instance, you would issue a Put with a different value for the data itself, but the same timestamp.
I'm thinking about specifying the timestamps myself, but I don't know how to set the timestamps automatically for bulk loading, and does it affect the loading performance? I need the fastest and safest import process for big data.
I tried to parse and put each record into the table one at a time, but the speed is very slow... So another question is: how many records, or how much data, should be batched before a put into HBase? (I wrote a simple Java program to do the puts. It's much slower than using the ImportTsv tool from the command line. I don't know exactly what batch size that tool uses...)
Many thanks for your advice!
Q1: HBase maintains versions using timestamps. If you don't provide one, it will take the default provided by the HBase system.
In the Put request you can supply a custom timestamp as well, if you have such a requirement. It does not affect performance.
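For example (a sketch only; the table, family, and values are the ones from your example, and the exact client API may differ between HBase versions):
// scala - supply explicit, distinct timestamps so both versions of row1 are kept
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

val ts = System.currentTimeMillis()

val put1 = new Put(Bytes.toBytes("row1"))
put1.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("c1"), ts, Bytes.toBytes("a"))

val put2 = new Put(Bytes.toBytes("row1"))
put2.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("c1"), ts + 1, Bytes.toBytes("x"))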
Q2: You can do it in 2 ways:
1) A simple Java client with the batching technique shown below.
2) MapReduce with ImportTsv (a batch client).
Ex: #1 Simple Java client with the batching technique.
I used HBase Puts batched in List objects of 100,000 records while parsing JSON (similar to your standalone CSV client).
Below is the code snippet through which I achieved this. The same thing can be done while parsing other formats as well.
You may need to call this method in 2 places (a rough sketch of the calling side follows the method below):
1) with a batch of 100,000 records, and
2) for processing the remainder of your records when fewer than 100,000 are left.
public void addRecord(final ArrayList<Put> puts, final String tableName) throws Exception {
    HTable table = null;
    try {
        table = new HTable(HBaseConnection.getHBaseConfiguration(), getTable(tableName));
        // send the whole batch of Puts in one call
        table.put(puts);
        LOG.info("INSERT record[s] " + puts.size() + " to table " + tableName + " OK.");
    } catch (final Throwable e) {
        e.printStackTrace();
    } finally {
        LOG.info("Processed ---> " + puts.size());
        if (table != null) {
            table.close();
        }
        puts.clear();
    }
}
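The calling side might look roughly like this (a sketch in Scala for brevity; 'records', 'recordToPut', and 'my_table' are made-up placeholders for whatever you are parsing, the conversion of one record into a Put, and your table name):
// scala
val batchSize = 100000
val puts = new java.util.ArrayList[org.apache.hadoop.hbase.client.Put]()

for (record <- records) {
  puts.add(recordToPut(record))        // hypothetical conversion of one parsed record into a Put
  if (puts.size() >= batchSize) {
    addRecord(puts, "my_table")        // addRecord clears the list after flushing
  }
}
if (!puts.isEmpty) {
  addRecord(puts, "my_table")          // flush the remainder (< 100,000 records)
}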
Note: the batch size is internally controlled by hbase.client.write.buffer, set like the following in one of your config XMLs:
<property>
  <name>hbase.client.write.buffer</name>
  <value>20971520</value> <!-- around 20 MB -->
</property>
The default value is 2 MB. Once the buffer is filled, it flushes all the puts to actually insert them into your table.
Furthermore, whether you use the MapReduce client or a standalone client with the batching technique, batching is controlled by the buffer property above.
If you need to overwrite records, you can configure the HBase table to remember only one version.
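For instance, something like this when creating the table (a sketch only, using the old admin API; the table name 'my_table' is made up):
// scala - limit each column family to a single version
import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.HBaseAdmin

val admin = new HBaseAdmin(HBaseConfiguration.create())
val desc = new HTableDescriptor(TableName.valueOf("my_table"))
desc.addFamily(new HColumnDescriptor("f1").setMaxVersions(1))  // keep only the latest cell value
desc.addFamily(new HColumnDescriptor("f2").setMaxVersions(1))
admin.createTable(desc)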
This page explains how to do bulk loading into HBase at the maximum possible speed:
How to use hbase bulk loading and why