I am writing a Spark Streaming application and I need to get the time of my current batch interval. How can I do this? I would also like to know how that time is determined: is it based on the clock of the driver machine in the cluster?
If I understood correctly, you need to specify the batch interval of your Spark Streaming app when you create the StreamingContext,
for example:
val ssc = new StreamingContext(sparkConf, Seconds(2))
In the code above, the batch interval is 2 seconds: every 2 seconds, your Spark Streaming job processes the data received from the input source (Kafka, files, etc.) as a new batch.
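To get the time of the current batch itself, one option is the two-argument form of the foreachRDD callback, which receives the batch time along with the RDD. Below is a minimal Python sketch, assuming PySpark's DStream.foreachRDD, which passes the batch time as a datetime. The names handle_batch and FakeRDD are hypothetical; FakeRDD only stands in for a real RDD so the snippet runs without a cluster. As far as I know, batch times are generated on the driver from its system clock, but treat that as an assumption to verify for your Spark version.

```python
from datetime import datetime


def handle_batch(batch_time, rdd):
    """Callback in the (time, rdd) shape accepted by DStream.foreachRDD."""
    # batch_time is the time of the batch this RDD belongs to,
    # not the wall-clock time at which the callback happens to run.
    label = batch_time.strftime("%Y-%m-%d %H:%M:%S")
    return "batch %s: %d records" % (label, rdd.count())


class FakeRDD(object):
    """Hypothetical stand-in for an RDD so the sketch runs without Spark."""

    def __init__(self, items):
        self.items = list(items)

    def count(self):
        return len(self.items)


# In a real job this would be wired up as: lines.foreachRDD(handle_batch)
print(handle_batch(datetime(2016, 1, 1, 12, 0, 2), FakeRDD(["a", "b", "c"])))
```

In a real job the callback would log or store the label rather than return it; the return value here only makes the sketch easy to check.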
Related
I have two Spring Batch jobs.
1) Fetch data from the DB and create a CSV file of the fetched data. (Runs every 15 minutes.)
2) Index the created CSV file into Solr. (Runs every 10 minutes.)
These two jobs run in parallel as cron jobs. I want to add a partitioner to the CSV-to-Solr job for faster processing.
The problem is that sometimes the CSV-to-Solr job picks up a CSV that the DB-to-CSV job is still writing. So I want to pick up only those CSVs that are fully loaded with data and provide those files to the partitioner.
Please suggest a way forward.
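The thread does not include an answer to this. One common approach, which is an assumption here and not something the posters proposed, is to have the exporting job write to a temporary name and rename the file only once it is complete, so the indexing job never sees a half-written CSV. A minimal sketch of the idea in Python rather than Spring Batch; the function names and the ".part" suffix are hypothetical:

```python
import os
import tempfile


def write_csv_atomically(final_path, rows):
    """Write rows to '<final_path>.part', then rename once the file is complete."""
    part_path = final_path + ".part"
    with open(part_path, "w") as out:
        for row in rows:
            out.write(",".join(str(col) for col in row) + "\n")
    # On POSIX systems a rename within one filesystem is atomic,
    # so readers see either no file or the finished file.
    os.replace(part_path, final_path)


def completed_csv_files(directory):
    """Return only finished files; anything still ending in '.part' is skipped."""
    return sorted(name for name in os.listdir(directory) if name.endswith(".csv"))


work_dir = tempfile.mkdtemp()
write_csv_atomically(os.path.join(work_dir, "export_1.csv"), [(1, "a"), (2, "b")])
# Simulate an export that is still in progress
open(os.path.join(work_dir, "export_2.csv.part"), "w").close()
print(completed_csv_files(work_dir))
```

The same pattern should carry over to the Java side: the writer step renames the file as its last action, and the partitioner lists only files with the final extension.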
I am writing a Spark Streaming app in which online streaming data is compared against base data that I broadcast to each computing node. However, since the base data is updated daily, I need to update the broadcast variable daily too. The base data resides on HDFS.
Is there a way to do this? The update is not related to any online streaming results; it just needs to happen at, say, 12:00 am every day. Moreover, if there is such a way, will the update block the Spark Streaming computing jobs?
Refer to the last answer in the thread you linked. In summary: instead of sending the data, send the caching code that updates the data at the needed interval:
Create a CacheLookup object that refreshes daily at 12 am
Wrap that in a Broadcast variable
Use CacheLookup as part of the streaming logic
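The steps above can be sketched roughly as follows. This is a plain-Python illustration, not the code from the referenced thread: the CacheLookup internals, the loader function and the injectable clock are all assumptions, and the loader stands in for whatever reads the base data from HDFS.

```python
from datetime import date


class CacheLookup(object):
    """Hypothetical lookup that reloads its data the first time it is used on a new day."""

    def __init__(self, loader, today=date.today):
        self._loader = loader      # function that reads the base data (placeholder for an HDFS read)
        self._today = today        # injectable clock, so the refresh can be tested
        self._loaded_on = None
        self._data = {}

    def get(self, key, default=None):
        current = self._today()
        if self._loaded_on != current:
            # First access ever, or first access after midnight: reload
            self._data = self._loader()
            self._loaded_on = current
        return self._data.get(key, default)


# Simulated base data that changes between day 1 and day 2
versions = iter([{"user": "v1"}, {"user": "v2"}])
days = iter([date(2016, 1, 1), date(2016, 1, 1), date(2016, 1, 2)])
lookup = CacheLookup(lambda: next(versions), today=lambda: next(days))
first = lookup.get("user")
second = lookup.get("user")
third = lookup.get("user")
print(first, second, third)
```

In this sketch the reload happens inside whichever get() call first notices the date change, so only that call waits for the loader; the other calls on the same day use the cached data.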
I'm collecting data from a messaging app. I'm currently using Flume, which sends approximately 50 million records per day.
I wish to use Kafka,
consume from Kafka using Spark Streaming,
and persist the data to Hadoop and query it with Impala.
I'm having issues with each approach I've tried.
Approach 1 - Save the RDD as Parquet and point an external Hive Parquet table at the Parquet directory
// scala
val ssc = new StreamingContext(sparkConf, Seconds(bucketsize.toInt))
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
lines.foreachRDD(rdd => {
  // 1 - Create a SchemaRDD object from the RDD and specify the schema
  val SchemaRDD1 = sqlContext.jsonRDD(rdd, schema)
  // 2 - Register it as a Spark SQL table
  SchemaRDD1.registerTempTable("sparktable")
  // 3 - Query sparktable to produce another SchemaRDD, 'finalParquet', with the data needed, and persist it as Parquet files
  val finalParquet = sqlContext.sql(sql)
  finalParquet.saveAsParquetFile(dir)
})
The problem is that finalParquet.saveAsParquetFile outputs a huge number of files: the DStream received from Kafka outputs over 200 files for a 1-minute batch size.
The reason it outputs so many files is that the computation is distributed, as explained in another post: how to make saveAsTextFile NOT split output into multiple file?
However, the proposed solutions don't seem optimal for me. For example, as one user states: having a single output file is only a good idea if you have very little data.
Approach 2 - Use HiveContext and insert the RDD data directly into a Hive table
# python
from pyspark import StorageLevel
from pyspark.sql import HiveContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sqlContext = HiveContext(sc)
ssc = StreamingContext(sc, int(batch_interval))

def sendRecord(rdd):
    sql = "INSERT INTO TABLE table select * from beacon_sparktable"
    # 1 - Apply the schema to the RDD, creating a DataFrame 'beaconDF'
    beaconDF = sqlContext.jsonRDD(rdd, schema)
    # 2 - Register the DataFrame as a Spark SQL table
    beaconDF.registerTempTable("beacon_sparktable")
    # 3 - Insert into Hive directly from a query on the Spark SQL table
    sqlContext.sql(sql)

kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topics: 1})
lines = kvs.map(lambda x: x[1]).persist(StorageLevel.MEMORY_AND_DISK_SER)
# sendRecord is defined above so that the name exists when it is registered here
lines.foreachRDD(sendRecord)
This works fine and inserts directly into a Parquet table, but there are scheduling delays for the batches because the processing time exceeds the batch interval.
The consumer can't keep up with what's being produced, and the batches to process begin to queue up.
It seems writing to Hive is slow. I've tried adjusting the batch interval and running more consumer instances.
In summary
What is the best way to persist big data from Spark Streaming, given that there are issues with multiple files and potential latency when writing to Hive?
What are other people doing?
A similar question has been asked here, but the asker there has an issue with directories as opposed to too many files:
How to make Spark Streaming write its output so that Impala can read it?
Many thanks for any help.
In solution #2, the number of files created can be controlled via the number of partitions of each RDD.
See this example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// create the Hive table (skip this step if it already exists)
sqlContext.sql("CREATE TABLE test (id int, txt string) STORED AS PARQUET")
// create an RDD with 2 records and only 1 partition
val rdd = sc.parallelize(List( List(1, "hello"), List(2, "world") ), 1)
// create a DataFrame from the RDD
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("txt", StringType, nullable = false)
))
val df = sqlContext.createDataFrame(rdd.map( Row(_:_*) ), schema)
// this creates a single file, because the RDD has 1 partition
df.write.mode("append").saveAsTable("test")
Now, I guess you can play with the frequency at which you pull data from Kafka and with the number of partitions of each RDD (by default, the number of partitions of your Kafka topic, which you can reduce by repartitioning).
I'm using Spark 1.5 from CDH 5.5.1, and I get the same result using either df.write.mode("append").saveAsTable("test") or your SQL string.
I think the small-file problem can be mitigated somewhat. You may be getting a large number of files because of the Kafka partitions. In my case, I have a Kafka topic with 12 partitions and I write using spark.write.mode("append").parquet("/location/on/hdfs").
Depending on your requirements, you can add coalesce(1) (or a larger number) to reduce the number of files. Another option is to increase the micro-batch duration. For example, if you can accept a 5-minute delay in writing data, you can use a micro batch of 300 seconds.
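To see how much these two knobs matter, here is a small back-of-the-envelope calculation in Python. It assumes that every batch writes exactly one file per partition, which is a simplification; files_per_day is a hypothetical helper name.

```python
def files_per_day(partitions_per_batch, batch_seconds):
    """Estimate output files per day, assuming one file per partition per batch."""
    batches_per_day = 24 * 60 * 60 // batch_seconds
    return partitions_per_batch * batches_per_day


print(files_per_day(12, 60))    # 12 partitions, 1-minute batches
print(files_per_day(12, 300))   # 12 partitions, 5-minute batches
print(files_per_day(1, 300))    # coalesce(1), 5-minute batches
```

Under that assumption, moving from 1-minute to 5-minute batches cuts the file count by a factor of five, and coalescing to a single partition cuts it by a further factor of twelve.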
For the second issue, the batches queue up only because you don't have backpressure enabled. First, verify the maximum number of records you can process in a single batch. Once you have that number, set spark.streaming.kafka.maxRatePerPartition and spark.streaming.backpressure.enabled=true to limit the number of records per micro batch. If you still cannot keep up with demand, the only options are to increase the number of partitions on the topic or to give the Spark application more resources.
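As a rough illustration of turning "the maximum you can process in a single batch" into a setting, the Python sketch below derives a per-partition rate from a measured batch capacity. The helper name and the figure of 360000 records per batch are made up for the example. Note also that spark.streaming.kafka.maxRatePerPartition is a records-per-second limit that applies to the direct Kafka stream; with the receiver-based createStream used in the question, spark.streaming.receiver.maxRate is the corresponding setting.

```python
def rate_limit_conf(max_records_per_batch, kafka_partitions, batch_seconds):
    """Derive a per-partition rate cap from the largest batch the job can process in time."""
    # maxRatePerPartition is expressed in records per second per partition
    per_partition_rate = max_records_per_batch // (kafka_partitions * batch_seconds)
    return {
        "spark.streaming.backpressure.enabled": "true",
        "spark.streaming.kafka.maxRatePerPartition": str(max(1, per_partition_rate)),
    }


# Hypothetical measurement: the job keeps up with 360000 records per 60-second batch
conf = rate_limit_conf(max_records_per_batch=360000, kafka_partitions=12, batch_seconds=60)
for key in sorted(conf):
    print("--conf %s=%s" % (key, conf[key]))
```

The printed lines have the shape of spark-submit arguments; the same key/value pairs could instead be set on the SparkConf.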
I've been trying to create and maintain a Sequence File on HDFS using the Java API without running a MapReduce job as a setup for a future MapReduce job. I want to store all of my input data for the MapReduce job in a single Sequence File, but the data gets appended over time throughout the day. The problem is, if a SequenceFile exists, the following call will just overwrite the SequenceFile instead of appending to it.
// fs and conf are set up for HDFS, not as a LocalFileSystem
seqWriter = SequenceFile.createWriter(fs, conf, new Path(hdfsPath),
keyClass, valueClass, SequenceFile.CompressionType.NONE);
seqWriter.append(new Text(key), new BytesWritable(value));
seqWriter.close();
Another concern is that I cannot maintain a file of my own format and turn the data into a SequenceFile at the end of the day as a MapReduce job could be launched using that data at any point.
I cannot find any other API call to append to a SequenceFile and maintain its format. I also cannot simply concatenate two SequenceFiles because of their formatting needs.
I also want to avoid running a MapReduce job for this, since it has high overhead for the small amount of data I'm adding to the SequenceFile.
Any thoughts or work-arounds? Thanks.
Support for appending to existing SequenceFiles has been added to Apache Hadoop 2.6.1 and 2.7.2 releases onwards, via enhancement JIRA: https://issues.apache.org/jira/browse/HADOOP-7139
For example usage, the test-case can be read: https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/TestSequenceFileAppend.java#L63-L140
CDH5 users can find the same ability in version CDH 5.7.1 onwards.
Sorry, currently the Hadoop FileSystem does not support appends. But there are plans for it in a future release.
Customers can upload URLs to the database at any time, and the application should process them as soon as possible. So I need either periodically running Hadoop jobs, or a way to start a Hadoop job automatically from another application (some script that detects that new links were added, generates the data for the Hadoop job and runs it). For a PHP or Python script I could set up a cron job, but what is the best practice for running Hadoop jobs periodically (prepare the data for Hadoop, upload it, run the job and move the results back to the database)?
Take a look at Oozie, the new workflow system from Y!, which can run jobs based on different triggers. A good overview is presented by Alejandro here: http://www.slideshare.net/ydn/5-oozie-hadoopsummit2010
If you want URLs to be processed as soon as possible, you'll end up processing them one at a time. My recommendation is to wait until some number of links has accumulated (or some number of MB of links, or a fixed period such as 10 minutes or a day)
and then batch process them (I do my processing daily, but that job takes a few hours).
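The "wait for some number of links or some amount of time" rule can be sketched as a small buffer that releases a batch when either threshold is reached. This is only an illustration in Python: UrlBatcher, its thresholds and the injectable clock are hypothetical, and launching the actual Hadoop job is left out.

```python
import time


class UrlBatcher(object):
    """Hypothetical buffer that releases URLs once enough have arrived or enough time has passed."""

    def __init__(self, max_items, max_age_seconds, clock=time.time):
        self.max_items = max_items
        self.max_age_seconds = max_age_seconds
        self._clock = clock
        self._items = []
        self._oldest = None  # arrival time of the oldest buffered URL

    def add(self, url):
        """Buffer a URL; return a batch to process if a threshold was reached, else None."""
        if not self._items:
            self._oldest = self._clock()
        self._items.append(url)
        too_many = len(self._items) >= self.max_items
        too_old = self._clock() - self._oldest >= self.max_age_seconds
        if too_many or too_old:
            batch, self._items = self._items, []
            return batch
        return None


now = [0]  # fake clock so the example is deterministic
batcher = UrlBatcher(max_items=3, max_age_seconds=600, clock=lambda: now[0])
results = [batcher.add("u1"), batcher.add("u2"), batcher.add("u3")]  # third call hits the size limit
now[0] = 1000
results.append(batcher.add("u4"))  # starts a new batch at t=1000
now[0] = 1700
results.append(batcher.add("u5"))  # 700 seconds after u4: age limit reached
print(results)
```

One limitation of this sketch is that the age check only runs when a new URL arrives, so a real setup would also need a timer (cron, or a scheduler such as Oozie) to flush a batch that has gone quiet.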