Spark: Saving RDD in an already existing path in HDFS - hadoop

I am able to save the RDD output to HDFS with the saveAsTextFile method. This method throws an exception if the file path already exists.
I have a use case where I need to save the RDDs to an already existing file path in HDFS. Is there a way to just append the new RDD data to the data that already exists at the same path?

One possible solution, available since Spark 1.6, is to use DataFrames with text format and append mode:
val outputPath: String = ???
rdd.map(_.toString).toDF.write.mode("append").text(outputPath)
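For completeness, a minimal self-contained sketch of the same approach, assuming the Spark 2.x+ SparkSession API (the output path and sample data below are only illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("append-example").getOrCreate()
import spark.implicits._  // required for the toDF conversion

val outputPath = "hdfs:///user/me/existing-output"  // hypothetical, already existing path
val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c"))

// "append" adds new part files under outputPath instead of throwing because it exists
rdd.map(_.toString).toDF.write.mode("append").text(outputPath)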

Related

appending to ORC file

I'm new to Big Data and related technologies, so I'm unsure whether we can append data to an existing ORC file. I'm writing the ORC file using the Java API, and when I close the Writer I'm unable to open the file again to write new content to it, basically to append new data.
Is there a way I can append data to an existing ORC file, either using the Java API, Hive, or any other means?
One more clarification: when saving a Java util.Date object into an ORC file, the ORC type is stored as:
struct<timestamp:struct<fasttime:bigint,cdate:struct<cachedyear:int,cachedfixeddatejan1:bigint,cachedfixeddatenextjan1:bigint>>,
and for a Java BigDecimal it's:
<margin:struct<intval:struct<signum:int,mag:struct<>,bitcount:int,bitlength:int,lowestsetbit:int,firstnonzerointnum:int>
Are these correct, and is there any info on this?
Update 2017
Yes, now you can! Hive provides new support for ACID, but you can also append data to your table using append mode, mode("append"), with Spark.
Below is an example:
Seq((10, 20)).toDF("a", "b").write.mode("overwrite").saveAsTable("tab1")
Seq((20, 30)).toDF("a", "b").write.mode("append").saveAsTable("tab1")
sql("select * from tab1").show
Or a more complete example with ORC here; below is an extract:
val command = spark.read.format("jdbc").option("url" .... ).load()
command.write.mode("append").format("orc").option("orc.compression","gzip").save("command.orc")
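Since the JDBC part of the extract is truncated, here is a self-contained sketch of the same idea, assuming an existing SparkSession named spark (the DataFrame and output path are only illustrative):

import spark.implicits._  // for toDF on the sample Seq

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")   // stand-in for the JDBC read above
df.write
  .mode("append")                 // adds new ORC files under the path instead of failing
  .format("orc")
  .option("compression", "zlib")  // Spark's ORC writer option; zlib and snappy are common choices
  .save("/data/command.orc")      // hypothetical output path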
No, you cannot append directly to an ORC file. Nor to a Parquet file. Nor to any columnar format with a complex internal structure with metadata interleaved with data.
Quoting the official "Apache Parquet" site...
Metadata is written after the data to allow for single pass writing.
Then quoting the official "Apache ORC" site...
Since HDFS does not support changing the data in a file after it is written, ORC stores the top level index at the end of the file (...) The file's tail consists of 3 parts; the file metadata, file footer and postscript.
Well, technically, nowadays you can append to an HDFS file; you can even truncate it. But these tricks are only useful for some edge cases (e.g. Flume feeding messages into an HDFS "log file", micro-batch-wise, with an hflush from time to time).
For Hive transaction support, a different trick is used: creating a new ORC file on each transaction (i.e. micro-batch), with periodic compaction jobs running in the background, à la HBase.
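To illustrate the raw HDFS append mentioned above, a minimal sketch using the Hadoop FileSystem API (the path is hypothetical, and this appends bytes to a plain file, not to an ORC or Parquet file):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()   // picks up core-site.xml / hdfs-site.xml from the classpath
val fs = FileSystem.get(conf)
val logFile = new Path("/flume/events/app.log")  // hypothetical path

// append() requires the cluster to allow appends (dfs.support.append)
val out = if (fs.exists(logFile)) fs.append(logFile) else fs.create(logFile)
out.write("one more line\n".getBytes("UTF-8"))
out.hflush()  // push the data to the datanodes so readers can see it
out.close()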
Yes, this is possible through Hive, in which you can basically 'concatenate' newer data. From the official Hive documentation: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-WhatisACIDandwhyshouldyouuseit?

How do I get the generated filename when calling the Spark saveAsTextFile method

I'm new to Spark, Hadoop and everything that comes with them. My overall goal is to build a real-time application that gets tweets and stores them on HDFS in order to build a report based on HBase.
I'd like to get the generated filename when calling the saveAsTextFile RDD method, in order to import it into Hive.
Feel free to ask for further information, and thanks in advance.
saveAsTextFile will create a directory of part files (plain text, one file per partition). So if you give it the path "hdfs://user/NAME/saveLocation", a folder called saveLocation will be created, filled with part files. You should be able to load this into HBase simply by passing the directory name to the import (a directory of part files is the standard layout in Hadoop).
I do recommend you look into saving as Parquet though; it is much more useful than plain text files.
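If you need the actual file names, for example to register them with Hive, one hedged approach is to list the part files with the Hadoop FileSystem API after the save completes (assuming an existing SparkContext sc and the RDD being saved):

import org.apache.hadoop.fs.{FileSystem, Path}

val outputDir = "hdfs:///user/NAME/saveLocation"  // same directory passed to saveAsTextFile
rdd.saveAsTextFile(outputDir)

val fs = FileSystem.get(sc.hadoopConfiguration)
val partFiles = fs.listStatus(new Path(outputDir))
  .map(_.getPath.toString)
  .filter(_.contains("part-"))  // skip _SUCCESS and other marker files

partFiles.foreach(println)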
From what I understand, you saved your tweets to HDFS and now want the file names of those saved files. Correct me if I'm wrong.
val filenames = sc.wholeTextFiles("your HDFS location where you saved your tweets").map(_._1)
wholeTextFiles gives you an RDD of (filename, content) pairs, so mapping over the first element gives you the file names, on which you can do your operations. I'm a newbie to Hadoop too, but anyway... hope that helps.

Log data using flume to required format at sink

I have a requirement in my project. I have to collect log data using Flume, and that data has to be fed into a Hive table.
Here my requirement is to collect the files placed in a folder into HDFS, which I am doing using spooldir.
After this I need to process these files and place the output in a Hive folder so the data can be queried immediately.
Can I process the source files using the sink in such a way that the data placed in HDFS is already processed into the required format?
Thanks,
Sathish
Yes, you need to use a serializer (implement this class - http://flume.apache.org/releases/content/1.2.0/apidocs/org/apache/flume/serialization/EventSerializer.html), drop it into plugin.d/ and then add it to the configuration for the HDFS Sink.
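As a rough illustration of what such a serializer could look like, here is a hedged sketch (any JVM language works; all class and field names here are hypothetical). It reshapes each event body into a pipe-delimited line before the HDFS Sink writes it out:

import java.io.OutputStream
import org.apache.flume.{Context, Event}
import org.apache.flume.serialization.EventSerializer

// Hypothetical serializer that rewrites each event body into a pipe-delimited line,
// so the files landing in HDFS already match the Hive table's field delimiter.
class PipeDelimitedSerializer(out: OutputStream) extends EventSerializer {
  override def afterCreate(): Unit = ()
  override def afterReopen(): Unit = ()

  override def write(event: Event): Unit = {
    val line = new String(event.getBody, "UTF-8")
      .split("\\s+")          // assumption: the raw log fields are whitespace-separated
      .mkString("|") + "\n"   // '|' as the Hive field delimiter
    out.write(line.getBytes("UTF-8"))
  }

  override def flush(): Unit = out.flush()
  override def beforeClose(): Unit = ()
  override def supportsReopen(): Boolean = false
}

// Flume instantiates custom serializers through an EventSerializer.Builder.
class PipeDelimitedSerializerBuilder extends EventSerializer.Builder {
  override def build(context: Context, out: OutputStream): EventSerializer =
    new PipeDelimitedSerializer(out)
}

You would then set the sink's serializer to the fully-qualified name of the builder class in the agent configuration, e.g. agent.sinks.hdfsSink.serializer = com.example.PipeDelimitedSerializerBuilder (again, hypothetical names).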
Using the configuration below has served my purpose.
source.type = spooldir
source.spoolDir = ${location}

How to add data to the same file in Apache Pig?

I am new to Pig.
Actually I have a use case in which I have to store data again and again in the same file at regular intervals. But as I went through some tutorials and links, I didn't see anything related to this.
How should I store the data in the same file?
It's impossible. Pig uses Hadoop, and right now there is no "recommended" solution for appending to files.
The other point is that Pig would produce a single file only if a single mapper or a single reducer has been used at the end of the whole data flow.
You can:
1. Give more info about the problem you are trying to solve.
2. Bad solution:
2.1. Process the data in your Pig script.
2.2. Load the data from the existing file.
2.3. Union the relations, where the first relation keeps the new data and the second relation keeps the data from the existing file.
2.4. Store the union result to a new output.
2.5. Replace the old file with the new one.
3. Good solution:
3.1. Create a folder /mydata.
3.2. Create partitions inside the folder; they can be /yyyy/MM/dd/HH if you process data each hour.
3.3. Use globs to read the data:
/mydata/*/*/*/*/*
All files from the hour partitions would be read by Pig/Hive/MR or whatever Hadoop tool.
Make a date folder like /abc/hadoop/20130726/, within which you generate output based on a timestamp, like /abc/hadoop/20130726/201307265465.gz.
Then use the getmerge command to merge all the data into a single file.
Usage: hadoop fs -getmerge <src> <localdst> [addnl]
Hope it will help you.

HDFS: Using HDFS API to append to a SequenceFile

I've been trying to create and maintain a Sequence File on HDFS using the Java API without running a MapReduce job as a setup for a future MapReduce job. I want to store all of my input data for the MapReduce job in a single Sequence File, but the data gets appended over time throughout the day. The problem is, if a SequenceFile exists, the following call will just overwrite the SequenceFile instead of appending to it.
// fs and conf are set up for HDFS, not as a LocalFileSystem
seqWriter = SequenceFile.createWriter(fs, conf, new Path(hdfsPath),
    keyClass, valueClass, SequenceFile.CompressionType.NONE);
seqWriter.append(new Text(key), new BytesWritable(value));
seqWriter.close();
Another concern is that I cannot maintain a file of my own format and turn the data into a SequenceFile at the end of the day as a MapReduce job could be launched using that data at any point.
I cannot find any other API call to append to a SequenceFile and maintain its format. I also cannot simply concatenate two SequenceFiles because of their formatting needs.
I also wanted to avoid running a MapReduce job for this since it has high overhead for the little amount of data I'm adding to the SequenceFile.
Any thoughts or work-arounds? Thanks.
Support for appending to existing SequenceFiles was added in the Apache Hadoop 2.6.1 and 2.7.2 releases onwards, via the enhancement JIRA: https://issues.apache.org/jira/browse/HADOOP-7139
For example usage, the test case can be read here: https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/TestSequenceFileAppend.java#L63-L140
CDH5 users can find the same ability in CDH 5.7.1 onwards.
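A hedged sketch of what that looks like with the option-based Writer API, written here in Scala (the path, key/value types and compression choice are only illustrative; appendIfExists requires Hadoop 2.6.1+/2.7.2+):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, SequenceFile, Text}

val conf = new Configuration()
val path = new Path("/data/events.seq")  // hypothetical HDFS path

// appendIfExists(true) re-opens an existing file and appends to it instead of overwriting
val writer = SequenceFile.createWriter(
  conf,
  SequenceFile.Writer.file(path),
  SequenceFile.Writer.keyClass(classOf[Text]),
  SequenceFile.Writer.valueClass(classOf[BytesWritable]),
  SequenceFile.Writer.appendIfExists(true),
  SequenceFile.Writer.compression(SequenceFile.CompressionType.NONE))

writer.append(new Text("key"), new BytesWritable("value".getBytes("UTF-8")))
writer.close()

Note that the key/value classes (and the compression settings) must match the existing file, otherwise the append is rejected.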
Sorry, currently the Hadoop FileSystem does not support appends. But there are plans for it in a future release.
