How to append to an existing file in a Hadoop user program? - hadoop

I have a Hadoop program in which when the mapping and reducing phases are done, I need to append to an existing file (which is already on HDFS). How can I do that?

Appending to a file on HDFS is supported after Hadoop 0.20.2; more information is available here and here.
An append example I found may help you:
// open the file for writing (overwrite), with replication 3 and the given block size
FSDataOutputStream stm = fs.create(path, true,
        conf.getInt("io.file.buffer.size", 4096),
        (short) 3, blocksize);
String a = make(1000);   // make() is a helper from the original example that builds a test string
stm.write(a.getBytes());
stm.sync();              // flush the written bytes to the datanodes
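Note that on some older releases (0.20.x / 1.x) append has to be enabled explicitly. A minimal sketch, assuming the dfs.support.append flag applies to your version and using an illustrative path:
Configuration conf = new Configuration();
// append is disabled by default on some 0.20.x/1.x releases; this flag
// (also settable in hdfs-site.xml) is an assumption about your version
conf.setBoolean("dfs.support.append", true);
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream out = fs.append(new Path("/user/hduser/existing.txt")); // hypothetical path
out.writeBytes("appended line\n");
out.close();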

You can use the append method of HDFS:
check whether the file exists, and if it does, append the new content to the same file.
For example:
Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);
Path file = new Path("/user/hduser/myfile.txt"); // example path
FSDataOutputStream writeInFile;
if (hdfs.exists(file)) {
    System.out.println("file exists");
    writeInFile = hdfs.append(file);
    writeInFile.writeBytes(data);   // data is the String to write
} else {
    System.out.println("new file");
    writeInFile = hdfs.create(file, true);
    writeInFile.writeBytes(data);
}
writeInFile.close();

Related

Spark saveAsTextFile creating directory

I have implemented the following code in Java using Apache Spark, and I am running it on AWS EMR.
It is just the simple word-count example from the documentation, reading a file from HDFS.
public class FileOperations {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("HDFS");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        JavaRDD<String> textFile = sparkContext.textFile("hdfs:/user/hadoop/test.txt");
        System.out.println("Program is started");
        JavaPairRDD<String, Integer> counts = textFile
                .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
        counts.foreach(f -> System.out.println(f.toString()));
        counts.saveAsTextFile("hdfs:/user/hadoop/output.txt");
        System.out.println("Program finished");
    }
}
The issue in the above program is that counts.saveAsTextFile("hdfs:/user/hadoop/output.txt") is not creating a text file; instead, a directory named output.txt is created.
What is wrong in the above code?
This is the first time I am working with Spark and EMR.
This is how it should work. You don't specify a file name, just a path. Spark will create files within that directory. If you look at the method definition for saveAsTextFile you can see that it expects a path:
public void saveAsTextFile(String path)
Within the path you specify it will create a part file for each partition in your data.
Either you .collect() all the data and write your own save method to a single file, or you .repartition(1) the data, which will still result in a directory but with only one part file containing the data (part-00000). Both options are sketched below.
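A minimal Java sketch of both options, reusing the counts pair RDD from the question (the output path is illustrative; List is java.util.List and Tuple2 is scala.Tuple2):
// Option 1: bring the data back to the driver and write a single file yourself
List<Tuple2<String, Integer>> all = counts.collect();
// ... write `all` to one file with java.io or the Hadoop FileSystem API ...

// Option 2: shrink to one partition; the result is still a directory,
// but it contains only a single part file (part-00000)
counts.repartition(1).saveAsTextFile("hdfs:/user/hadoop/output"); // illustrative path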

How to recursively read Hadoop files from directory using Spark?

Inside the given directory I have many different folders and inside each folder I have Hadoop files (part_001, etc.).
directory
  -> folder1
       -> part_001...
       -> part_002...
  -> folder2
       -> part_001...
  ...
Given the directory, how can I recursively read the content of all folders inside this directory and load this content into a single RDD in Spark using Scala?
I found this, but it does not recursively enter the sub-folders (I am using import org.apache.hadoop.mapreduce.lib.input):
var job: Job = null
try {
  job = Job.getInstance()
  FileInputFormat.setInputPaths(job, new Path("s3n://" + bucketNameData + "/" + directoryS3))
  FileInputFormat.setInputDirRecursive(job, true)
} catch {
  case ioe: IOException => ioe.printStackTrace(); System.exit(1)
}
val sourceData = sc.newAPIHadoopRDD(job.getConfiguration(), classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).values
I also found this web page that uses SequenceFile, but again I don't understand how to apply it to my case.
If you are using Spark, you can do this using wildcards as follows:
scala> sc.textFile("path/*/*")
sc is the SparkContext; it is initialized by default if you are using spark-shell, otherwise you will have to instantiate a SparkContext yourself in your own program.
Be careful with the following flag:
scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")
> res6: String = null
You should set this flag to true:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
I have found that the parameters must be set in this way:
.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
connector_output=${basepath}/output/connector/*/*/*/*/*
works for me when I have a directory structure like
${basepath}/output/connector/2019/01/23/23/output*.dat
I didn't have to set any other properties; I just used the following:
sparkSession.read().format("csv").schema(schema)
.option("delimiter", "|")
.load("/user/user1/output/connector/*/*/*/*/*");

hadoop-1.0.3 sequenceFile.Writer overwrites instead of appending images into a sequencefile

I am using Hadoop 1.0.3 (I can't really upgrade right now; that's for later).
I have around 100 images in HDFS and I am trying to combine them into a single SequenceFile (defaults, no compression, etc.).
Here's my code:
FSDataInputStream in = null;
BytesWritable value = new BytesWritable();
Text key = new Text();
Path inpath = new Path(fs.getHomeDirectory(), "/user/hduser/input");
Path seq_path = new Path(fs.getHomeDirectory(), "/user/hduser/output/file.seq");
FileStatus[] files = fs.listStatus(inpath);
SequenceFile.Writer writer = null;

for (FileStatus fileStatus : files) {
    inpath = fileStatus.getPath();
    try {
        in = fs.open(inpath);
        byte buffer[] = new byte[in.available()];
        in.read(buffer);
        writer = SequenceFile.createWriter(fs, conf, seq_path, key.getClass(), value.getClass());
        writer.append(new Text(inpath.getName()), new BytesWritable(buffer));
    } catch (Exception e) {
        System.out.println("Exception MESSAGES = " + e.getMessage());
        e.printStackTrace();
    }
}
This just goes through all the files in input/ and appends them one by one.
HOWEVER, this just overwrites my sequence file instead of appending to it; I see only the last image in the SequenceFile.
NOTE: I am not closing the writer before the for loop ends. Can anyone help me with this, please?
I am not sure how I can append the images.
Your main issue is with the following line :
writer = SequenceFile.createWriter(fs, conf, seq_path, key.getClass(), value.getClass());
which is inside the for loop, so a new writer is created on each pass, replacing the previous file at seq_path. That is why only the last image is available.
Pull it out of the loop, and the problem should vanish.
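A minimal sketch of the fix, reusing fs, conf, seq_path and files from the question and assuming the surrounding method declares IOException (it also swaps in.available()/read() for getLen()/readFully(), which is more reliable for reading whole files):
// create the writer once, append every image, close it at the end
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, seq_path,
        Text.class, BytesWritable.class);
try {
    for (FileStatus fileStatus : files) {
        Path p = fileStatus.getPath();
        byte[] buffer = new byte[(int) fileStatus.getLen()];
        FSDataInputStream in = fs.open(p);
        in.readFully(buffer);
        in.close();
        writer.append(new Text(p.getName()), new BytesWritable(buffer));
    }
} finally {
    writer.close();
}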

Reading Distributed Files in Hadoop

I'm trying to do the following in Hadoop:
I have implemented a map-reduce job that outputs its files to the directory "foo".
The foo files have key=IntWritable, value=IntWritable (I used SequenceFileOutputFormat).
Now I want to start another map-reduce job. The mapper is fine, but each reducer is required to read all of the "foo" files at start-up (I'm using HDFS for sharing data between reducers).
I used this code in public void configure(JobConf conf):
String uri = "out/foo";
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FileStatus[] status = fs.listStatus(new Path(uri));
for (int i=0; i<status.length; ++i) {
Path currFile = status[i].getPath();
System.out.println("status: " + i + " " + currFile.toString());
try {
SequenceFile.Reader reader = null;
reader = new SequenceFile.Reader(fs, currFile, conf);
IntWritable key = (IntWritable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
IntWritable value = (IntWritable ) ReflectionUtils.newInstance(reader.getValueClass(), conf);
while (reader.next(key, value)) {
// do the code for all the pairs.
}
}
}
The code runs well on a single machine, but I'm not sure if it will run on a cluster.
In other words, does this code read files from the current machine or does it read from the distributed file system?
Is there a better solution for what I'm trying to do?
Thanks in advance,
Arik.
The URI passed to FileSystem.get() does not have a scheme defined, so the file system used depends on the configuration parameter fs.defaultFS. If none is set, the default, i.e. the local file system, will be used.
Your program therefore works against the local file system under workingDir/out/foo. It should work on the cluster as well, but it will look at the local file system.
That said, I'm not sure why you need the entire contents of the foo directory; you may want to consider other designs. If needed, these files should be copied to HDFS first and read from the overridden setup method of your reducer (and, needless to say, closed in the overridden cleanup method). While the files can be read in reducers, map/reduce programs are not designed for this kind of functionality.
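For example, a minimal sketch along the lines of that answer, pinning the path to HDFS with an explicit scheme (the namenode address and path are illustrative) and reusing the reader loop and imports from the question:
public void configure(JobConf conf) {
    try {
        // explicit hdfs:// scheme so the data is always read from HDFS,
        // independent of fs.defaultFS (host and port are illustrative)
        String uri = "hdfs://namenode:8020/user/hduser/out/foo";
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        for (FileStatus st : fs.listStatus(new Path(uri))) {
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, st.getPath(), conf);
            IntWritable key = new IntWritable();
            IntWritable value = new IntWritable();
            while (reader.next(key, value)) {
                // use the (key, value) pair
            }
            reader.close();
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}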

Naming MapReduce job's part-0000 file to that of the input file in hadoop

I have developed code that runs a map-reduce job to read files from an FTP server and write them into HDFS. It writes each file from FTP into the specified output directory in HDFS, naming it part-0000. If I have multiple files on the FTP server, they all get written to that one part-0000 file in HDFS.
To avoid this, I plan to pass the name of the file as the key and the data as the value, so that the reducer writes the data to an output file named after the key.
I understand that I have to use an output format that extends MultipleTextOutputFormat. I have written it as follows:
static class MultiFileOutput extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        System.out.println("key is   : " + key.toString());
        System.out.println("value is : " + value.toString());
        System.out.println("name is  : " + name);
        return key.toString();
    }
}
But I fail to pass the name of the input file being processed. How do I get the name of the input file?
map.input.file
and
FileSystem fs = file.getFileSystem(conf);
String fileName=fs.getName();
do not return the name of the input file.
Any pointers?
You can get the input file path through context.
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String inputFilePath = fileSplit.getPath().toString();
This will give the full path. If you want just the filename you can do this:
String inputFileName = fileSplit.getPath().getName();
HTH
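For completeness, a minimal sketch of a new-API mapper that emits the input file name as the key (the class name is hypothetical; adapt the types to your job):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileNameKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        FileSplit split = (FileSplit) context.getInputSplit();
        String inputFileName = split.getPath().getName();
        // key = the file this record came from, value = the record itself
        context.write(new Text(inputFileName), line);
    }
}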
I used a FileStatus object in the following code, as my customised input format would not split the input file. It worked fine for me:
FileSystem fs = file.getFileSystem(conf);
FileStatus status = fs.getFileStatus(file);
String fileName = status.getPath().toString();
