Spark Streaming: join DStream batches into a single output folder - hadoop

I am using Spark Streaming to fetch tweets from Twitter, creating a StreamingContext as:
val ssc = new StreamingContext("local[3]", "TwitterFeed", Minutes(1))
and creating a Twitter stream as:
val tweetStream = TwitterUtils.createStream(ssc, Some(new OAuthAuthorization(Util.config)), filters)
then saving it as text files:
tweets.repartition(1).saveAsTextFiles("/tmp/spark_testing/")
The problem is that the tweets are saved in a separate folder per batch time, but I need all the data of every batch in the same folder.
Is there any workaround for this?
Thanks

We can do this using Spark SQL's new DataFrame saving API, which allows appending to existing output. By default, saveAsTextFile won't be able to save to a directory that already contains data (see https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes ). https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations covers how to set up a Spark SQL context for use with Spark Streaming.
Assuming you copy the SQLContextSingleton part from the guide, the resulting code would look something like:
data.foreachRDD { rdd =>
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  // Convert your data to a DataFrame; this depends on the structure of your data
  val df = ....
  df.save("org.apache.spark.sql.json", SaveMode.Append, Map("path" -> path.toString))
}
(Note that the example above saves the result as JSON, but you can use other output formats too.)
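For the Twitter example above, a minimal sketch might look like the following, assuming tweets is a DStream[String] of JSON records, SQLContextSingleton is the helper copied from the streaming guide, and the output path is just a placeholder:

import org.apache.spark.sql.SaveMode

val outputPath = "/tmp/spark_testing/tweets"  // placeholder output directory

tweets.foreachRDD { rdd =>
  // Reuse one SQLContext across batches (the singleton from the streaming guide)
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  // Assumes each record is a JSON string; jsonRDD infers the schema for the batch
  val df = sqlContext.jsonRDD(rdd)
  // Append every batch under the same directory instead of one directory per batch time
  df.save("org.apache.spark.sql.json", SaveMode.Append, Map("path" -> outputPath))
}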

Related

Spark: Saving RDD in an already existing path in HDFS

I am able to save the RDD output to HDFS with the saveAsTextFile method. This method throws an exception if the file path already exists.
I have a use case where I need to save the RDDs to an already existing file path in HDFS. Is there a way to just append the new RDD data to the data that already exists in the same path?
One possible solution, available since Spark 1.6, is to use DataFrames with text format and append mode:
val outputPath: String = ???
rdd.map(_.toString).toDF.write.mode("append").text(outputPath)
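A self-contained sketch of that one-liner, assuming Spark 1.6 (the context setup and output path are placeholders; toDF needs the SQLContext implicits in scope):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("append-example"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // brings toDF into scope

val outputPath = "hdfs:///tmp/append_example"  // placeholder path
val rdd = sc.parallelize(Seq("first batch", "of records"))

// "append" mode adds new part files instead of failing on an existing directory
rdd.map(_.toString).toDF.write.mode("append").text(outputPath)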

Can you copy straight from Parquet/S3 to Redshift using Spark SQL/Hive/Presto?

We have huge amounts of server data stored in S3 (soon to be in Parquet format). The data needs some transformation, so it can't be a straight copy from S3. I'll be using Spark to access the data, but I'm wondering if, instead of manipulating it with Spark, writing back out to S3, and then copying to Redshift, I can just skip a step and run a query to pull/transform the data and then copy it straight to Redshift?
Sure thing, totally possible.
Scala code to read parquet (taken from here)
val people: RDD[Person] = ... // an RDD of case class objects
people.toDF().write.parquet("people.parquet") // toDF needs import sqlContext.implicits._ in scope
val parquetFile = sqlContext.read.parquet("people.parquet") // data frame
Scala code to write to redshift (taken from here)
parquetFile.write
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
.option("dbtable", "my_table_copy")
.option("tempdir", "s3n://path/for/temp/data")
.mode("error")
.save()
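Putting the two together, a rough sketch of doing the read, the transformation, and the Redshift copy in one job could look like this; the S3 paths, table name, and SQL query are placeholders, sc is an existing SparkContext, and the databricks spark-redshift package is assumed to be on the classpath (e.g. via spark-submit --packages com.databricks:spark-redshift_2.10:<version>):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// 1 - read the raw Parquet data straight from S3 (placeholder bucket/path)
val raw = sqlContext.read.parquet("s3n://my-bucket/server-data/")

// 2 - apply the transformation in Spark SQL (placeholder query)
raw.registerTempTable("server_data")
val transformed = sqlContext.sql("SELECT id, ts, payload FROM server_data WHERE payload IS NOT NULL")

// 3 - write the result to Redshift; the connector stages data in tempdir on S3
transformed.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table_copy")
  .option("tempdir", "s3n://path/for/temp/data")
  .mode("error")
  .save()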

Persisting Spark Streaming output

I'm collecting data from a messaging app; I'm currently using Flume and it sends approximately 50 million records per day.
I wish to use Kafka,
consume from Kafka using Spark Streaming,
and persist it to Hadoop and query it with Impala.
I'm having issues with each approach I've tried...
Approach 1 - Save the RDD as Parquet and point an external Hive Parquet table at the Parquet directory
// scala
val ssc = new StreamingContext(sparkConf, Seconds(bucketsize.toInt))
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
lines.foreachRDD(rdd => {
  // 1 - create a SchemaRDD from the RDD, applying the schema
  val SchemaRDD1 = sqlContext.jsonRDD(rdd, schema)
  // 2 - register it as a Spark SQL table
  SchemaRDD1.registerTempTable("sparktable")
  // 3 - query sparktable to produce another SchemaRDD, 'finalParquet', and persist it as Parquet files
  val finalParquet = sqlContext.sql(sql)
  finalParquet.saveAsParquetFile(dir)
})
The problem is that finalParquet.saveAsParquetFile outputs a huge number of files: the DStream received from Kafka outputs over 200 files for a 1-minute batch size.
The reason it outputs many files is that the computation is distributed, as explained in another post: how to make saveAsTextFile NOT split output into multiple file?
However, the proposed solutions don't seem optimal to me; e.g., as one user states - Having a single output file is only a good idea if you have very little data.
Approach 2 - Use HiveContext and insert the RDD data directly into a Hive table
# python
sqlContext = HiveContext(sc)
ssc = StreamingContext(sc, int(batch_interval))
kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topics: 1})
lines = kvs.map(lambda x: x[1]).persist(StorageLevel.MEMORY_AND_DISK_SER)

def sendRecord(rdd):
    sql = "INSERT INTO TABLE table select * from beacon_sparktable"
    # 1 - apply the schema to the RDD, creating a DataFrame 'beaconDF'
    beaconDF = sqlContext.jsonRDD(rdd, schema)
    # 2 - register the DataFrame as a Spark SQL table
    beaconDF.registerTempTable("beacon_sparktable")
    # 3 - insert into Hive directly from a query on the Spark SQL table
    sqlContext.sql(sql)

lines.foreachRDD(sendRecord)
This works fine and inserts directly into a Parquet table, but there are scheduling delays for the batches because the processing time exceeds the batch interval.
The consumer can't keep up with what's being produced, and the batches to process begin to queue up.
It seems writing to Hive is slow. I've tried adjusting the batch interval size and running more consumer instances.
In summary
What is the best way to persist big data from Spark Streaming, given that there are issues with multiple files and potential latency when writing to Hive?
What are other people doing?
A similar question has been asked here, but he has an issue with directories as opposed to too many files:
How to make Spark Streaming write its output so that Impala can read it?
Many thanks for any help.
In solution #2, the number of files created can be controlled via the number of partitions of each RDD.
See this example:
// create a Hive table (assume it's already existing)
sqlContext.sql("CREATE TABLE test (id int, txt string) STORED AS PARQUET")

// create a RDD with 2 records and only 1 partition
val rdd = sc.parallelize(List( List(1, "hello"), List(2, "world") ), 1)

// create a DataFrame from the RDD
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("txt", StringType, nullable = false)
))
val df = sqlContext.createDataFrame(rdd.map( Row(_: _*) ), schema)

// this creates a single file, because the RDD has 1 partition
df.write.mode("append").saveAsTable("test")
Now, I guess you can play with the frequency at which you pull data from Kafka, and with the number of partitions of each RDD (by default, the partitions of your Kafka topic, which you can reduce by repartitioning).
I'm using Spark 1.5 from CDH 5.5.1, and I get the same result using either df.write.mode("append").saveAsTable("test") or your SQL string.
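As a rough sketch of that idea, reusing lines, schema, sqlContext, and the "test" table from the snippets above, you could reduce the partition count of each RDD inside foreachRDD before the write; the target of 4 partitions is arbitrary and only for illustration:

lines.foreachRDD { rdd =>
  // fewer partitions at write time means fewer output files per micro-batch;
  // 4 is an example value, not a recommendation
  val compacted = rdd.repartition(4)
  val df = sqlContext.jsonRDD(compacted, schema)
  df.write.mode("append").saveAsTable("test")
}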
I think the small-file problem could be resolved somewhat. You may be getting a large number of files based on the Kafka partitions. For me, I have a 12-partition Kafka topic and I write using df.write.mode("append").parquet("/location/on/hdfs").
Now, depending on your requirements, you can either add coalesce(1) or more to reduce the number of files. Another option is to increase the micro-batch duration. For example, if you can accept a 5-minute delay in writing data, you can use a micro-batch of 300 seconds.
For the second issue, the batches queue up only because you don't have backpressure enabled. First you should verify the maximum you can process in a single batch. Once you have that number, you can set the spark.streaming.kafka.maxRatePerPartition value and spark.streaming.backpressure.enabled=true to limit the number of records per micro-batch. If you still cannot meet the demand, then the only options are either to increase the partitions on the topic or to increase the resources of the Spark application.
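A sketch of those settings; the rate value is only an example and has to be tuned against what you actually measure per batch, and note that maxRatePerPartition applies to the direct Kafka stream, while the receiver-based createStream is limited by spark.streaming.receiver.maxRate instead:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setAppName("kafka-to-hive")
  // let Spark Streaming adapt the ingestion rate to the observed processing speed
  .set("spark.streaming.backpressure.enabled", "true")
  // hard cap on records per Kafka partition per second (direct stream only); 1000 is an example
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

// e.g. a 300-second micro-batch if a 5-minute write delay is acceptable
val ssc = new StreamingContext(sparkConf, Seconds(300))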

How to read a hive partition into an Apache Crunch pipeline?

I am able to read text files in HDFS into an Apache Crunch pipeline.
But now I need to read Hive partitions.
The problem is that, as per our design, I am not supposed to access the files directly. Hence, I now need some way to access the partitions using something like HCatalog.
You can use the org.apache.hadoop.hive.metastore API or the HCatalog API. Here is a simple example using hive.metastore. You would have to make the call to either one before starting on your Pipeline, unless you want to join to some Hive partition in the mapper/reducer.
HiveMetaStoreClient hiveClient = new HiveMetaStoreClient(hiveConf);
// listPartitions(dbName, tableName, maxParts) returns up to maxParts partitions of the table
List<Partition> partitions = hiveClient.listPartitions("default", "my_hive_table", (short) 1000);
for (Partition partition : partitions) {
    System.out.println("HDFS data location of the partition: " + partition.getSd().getLocation());
}
The only other thing you will need is to export the hive conf dir:
export HIVE_CONF_DIR=/home/mmichalski/hive/conf

Loading protobuf format file into pig script using loadfunc pig UDF

I have very little knowledge of Pig. I have a data file in protobuf format and I need to load it into a Pig script, so I need to write a LoadFunc UDF to load it. Say the function is Protobufloader().
My Pig script would be:
A = LOAD 'abc_protobuf.dat' USING Protobufloader() as (name, phonenumber, email);
All I wish to know is how to get the file input stream. Once I get hold of the file input stream, I can parse the data from protobuf format into Pig tuple format.
PS: thanks in advance
Twitter's open source library elephant bird has many such loaders:
https://github.com/kevinweil/elephant-bird
You can use LzoProtobufB64LinePigLoader and LzoProtobufBlockPigLoader.
https://github.com/kevinweil/elephant-bird/tree/master/src/java/com/twitter/elephantbird/pig/load
To use it, you just need to do:
define ProtoLoader com.twitter.elephantbird.pig.load.LzoProtobufB64LineLoader('your.proto.class.name');
a = load '/your/file' using ProtoLoader;
b = foreach a generate
field1, field2;
After loading, the data will automatically be translated into Pig tuples with the proper schema.
However, they assume you write your data as serialized protocol buffers compressed with LZO.
They have corresponding writers as well, in the package com.twitter.elephantbird.pig.store.
If your data format is a bit different, you can adapt their code to your custom loader.
