I want to configure a flume flow so that it takes in a CSV file as a source, checks the data, and dynamically separates each row of data into folders by year/month in HDFS. Is this possible?
I might suggest you look at using Nifi instead. I feel like it's the natural replacement for Flume.
Having said that it would seem that you might want to consider using a the spooling directory source and a hive sink (instead of hdfs). The hive partitions (Partitions on year/Month) would enable you to land the data in the Manner you are suggesting.
I'm new to Big data and related technologies, so I'm unsure if we can append data to the existing ORC file. I'm writing the ORC file using Java API and when I close the Writer, I'm unable to open the file again to write new content to it, basically to append new data.
Is there a way I can append data to the existing ORC file, either using Java Api or Hive or any other means?
One more clarification, when saving Java util.Date object into ORC file, ORC type is stored as:
struct<timestamp:struct<fasttime:bigint,cdate:struct<cachedyear:int,cachedfixeddatejan1:bigint,cachedfixeddatenextjan1:bigint>>,
and for java BigDecimal it's:
<margin:struct<intval:struct<signum:int,mag:struct<>,bitcount:int,bitlength:int,lowestsetbit:int,firstnonzerointnum:int>
Are these correct and is there any info on this?
Update 2017
Yes now you can! Hive provides a new support for ACID, but you can append data to your table using Append Mode mode("append") with Spark
Below an example
Seq((10, 20)).toDF("a", "b").write.mode("overwrite").saveAsTable("tab1")
Seq((20, 30)).toDF("a", "b").write.mode("append").saveAsTable("tab1")
sql("select * from tab1").show
Or a more complete exmple with ORC here; below an extract:
val command = spark.read.format("jdbc").option("url" .... ).load()
command.write.mode("append").format("orc").option("orc.compression","gzip").save("command.orc")
No, you cannot append directly to an ORC file. Nor to a Parquet file. Nor to any columnar format with a complex internal structure with metadata interleaved with data.
Quoting the official "Apache Parquet" site...
Metadata is written after the data to allow for single pass writing.
Then quoting the official "Apache ORC" site...
Since HDFS does not support changing the data in a file after it is
written, ORC stores the top level index at the end of the file (...)
The file’s tail consists of 3 parts; the file metadata, file footer
and postscript.
Well, technically, nowadays you can append to an HDFS file; you can even truncate it. But these tricks are only useful for some edge cases (e.g. Flume feeding messages into an HDFS "log file", micro-batch-wise, with fflush from time to time).
For Hive transaction support they use a different trick: creating a new ORC file on each transaction (i.e. micro-batch) with periodic compaction jobs running in the background, à la HBase.
Yes this is possible through Hive in which you can basically 'concatenate' newer data. From hive official documentation https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-WhatisACIDandwhyshouldyouuseit?
I have a directory tree with two directories and synchronized files in them:
home/dirMaster/file1.txt
home/dirMaster/file2.txt
home/dirSlave/file1-slave.txt
home/dirSlave/file2-slave.txt
Based on the file name file1-slave.txt have records corresponding to file1.txt
I want to move then to hdfs using flume but based on my reading so far I have the following problems:
flume will not preserve my file name - so I lose the
synchronization
flume does not guaranty that the files from source will match the destination - e.g. source file might get split to several dest files
Is this correct? Can flume support this scenario?
Flume Agent allows to move the data from source to sink. It uses channel to keep this data before rolling into Sink.
Flume's one of the sinks is HDFS Sink. HDFS sink allows rolling the data into HDFS based on the following criteria.
hdfs.rollSize
hdfs.rollInterval
hdfs.rollCount
It rolls the data based on the above parameters combination and file name are having predefined pattern. We can also control the file names using Sink parameters. But this pattern is same for all files which are rolled by this agent. We cannot expect different file path patterns from single flume agent.
agent.sinks.sink.hdfs.path=hdfs://:9000/pattern
Pattern could be static or dynamic path.
Flume also produces n number of files based on rolling criteria.
So Flume is not suitable for your requirement. Flume is best fit for streaming data ingestion.
DistCP: It is a distributed parallel data loading utility in HDFS.
It is a Map only MapReduce program and it will produce n number of part files (= no of maps) in the destination directory.
So DistCP is also not suitable for tour requirement.
So Better to use hadoop fs -put to load the data into HDFS.
hadoop fs -put /home/dirMaster/ /home/dirMaster/ /home/
I'am new to Spark, Hadoop and all what comes with. My global need is to build a real-time application that get tweets and store them on HDFS in order to build a report based on HBase.
I'd like to get the generated filename when calling saveAsTextFile RRD method in order to import it to Hive.
Feel free to ask for further informations and thanks in advance.
saveAsTextFile will create a directory of sequence files. So if you give it path "hdfs://user/NAME/saveLocation", a folder called saveLocation will be created filled with sequence files. You should be able to load this into HBase simply by passing the directory name to HBase (sequenced files are a standard in Hadoop).
I do recommend you look into saving as a parquet though, they are much more useful than standard text files.
From what I understand, You saved your tweets to hdfs and now want the file names of those saved files. Correct me if I'm wrong
val filenames=sc.textfile("Your hdfs location where you saved your tweets").map(_._1)
This gives you an array of rdd's into filenames onto which you could do your operations. Im a newbie too to hadoop, but anyways...hope that helps
I'd uploaded 50GB data on Hadoop cluster. But Now i want to delete first row of data file.
This is time consuming if i remove that data & change manually. Then upload it again on HDFS.
Please reply me.
HDFS files are immutable (for all practical purposes).
You need to upload the modified file(s). You can do the change programatically with a M/R job that does a near-identity transformation, eg. running a streaming shell script that does sed, but the gist of it that you need to create new files, HDFS files cannot be edited.