Reading Distributed Files in Hadoop - hadoop

I'm trying to the following in hadoop:
I have implemented a map-reduce job that outputs a file to directory "foo".
the foo files are with a key=IntWriteable, value=IntWriteable format (used a SequenceFileOutputFormat).
Now, I want to start another map-reduce job. the mapper is fine, but each reducer is required to read the entire "foo" files at start-up (I'm using the HDFS for sharing data between reducers).
I used this code on the "public void configure(JobConf conf)":
String uri = "out/foo";
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FileStatus[] status = fs.listStatus(new Path(uri));
for (int i=0; i<status.length; ++i) {
Path currFile = status[i].getPath();
System.out.println("status: " + i + " " + currFile.toString());
try {
SequenceFile.Reader reader = null;
reader = new SequenceFile.Reader(fs, currFile, conf);
IntWritable key = (IntWritable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
IntWritable value = (IntWritable ) ReflectionUtils.newInstance(reader.getValueClass(), conf);
while (reader.next(key, value)) {
// do the code for all the pairs.
}
}
}
The code runs well on a single machine, but I'm notsure if it will run on a cluster.
In other words, does this code reads files from the current machine or does id read from the distributed system?
Is there a better solution for what I'm trying to do?
Thanks in advance,
Arik.

The URI for the FileSystem.get() does not have scheme defined and hence, the File System used depends on the configuration parameter fs.defaultFS. If none set, the default setting i.e LocalFile system will be used.
Your program writes to the Local file system under the workingDir/out/foo. It should work in the cluster as well but looks for the local file system.
With the above said, I'm not sure why you need the entire files from foo directory. You may have consider other designs. If needed, these files should copied to HDFS first and read the files from the overridden setup method of your reducer. Needless to say, to close the files opened in the overridden closeup method of your reducer. While the files can be read in reducers, the map/reduce programs are not designed for this kind of functionality.

Related

How to recursively read Hadoop files from directory using Spark?

Inside the given directory I have many different folders and inside each folder I have Hadoop files (part_001, etc.).
directory
-> folder1
-> part_001...
-> part_002...
-> folder2
-> part_001...
...
Given the directory, how can I recursively read the content of all folders inside this directory and load this content into a single RDD in Spark using Scala?
I found this, but it does not recursively enters into sub-folders (I am using import org.apache.hadoop.mapreduce.lib.input):
var job: Job = null
try {
job = Job.getInstance()
FileInputFormat.setInputPaths(job, new Path("s3n://" + bucketNameData + "/" + directoryS3))
FileInputFormat.setInputDirRecursive(job, true)
} catch {
case ioe: IOException => ioe.printStackTrace(); System.exit(1);
}
val sourceData = sc.newAPIHadoopRDD(job.getConfiguration(), classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).values
I also found this web-page that uses SequenceFile, but again I don't understand how to apply it to my case?
If you are using Spark, you can do this using wilcards as follow:
scala>sc.textFile("path/*/*")
sc is the SparkContext which if you are using spark-shell is initialized by default or if you are creating your own program should will have to instance a SparkContext by yourself.
Be careful with the following flag:
scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")
> res6: String = null
Yo should set this flag to true:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
I have found that the parameters must be set in this way:
.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
connector_output=${basepath}/output/connector/*/*/*/*/*
works for me when I've dir structure like -
${basepath}/output/connector/2019/01/23/23/output*.dat
I didn't have to set any other properties, just used following -
sparkSession.read().format("csv").schema(schema)
.option("delimiter", "|")
.load("/user/user1/output/connector/*/*/*/*/*");

Spark Streaming: HDFS

I can't get my Spark job to stream "old" files from HDFS.
If my Spark job is down for some reason (e.g. demo, deployment) but the writing/moving to HDFS directory is continuous, I might skip those files once I up the Spark Streaming Job.
val hdfsDStream = ssc.textFileStream("hdfs://sandbox.hortonworks.com/user/root/logs")
hdfsDStream.foreachRDD(
rdd => logInfo("Number of records in this batch: " + rdd.count())
)
Output --> Number of records in this batch: 0
Is there a way for Spark Streaming to move the "read" files to a different folder? Or we have to program it manually? So it will avoid reading already "read" files.
Is Spark Streaming the same as running the spark job (sc.textFile) in CRON?
As Dean mentioned, textFileStream uses the default of only using new files.
def textFileStream(directory: String): DStream[String] = {
fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
}
So, all it is doing is calling this variant of fileStream
def fileStream[
K: ClassTag,
V: ClassTag,
F <: NewInputFormat[K, V]: ClassTag
] (directory: String): InputDStream[(K, V)] = {
new FileInputDStream[K, V, F](this, directory)
}
And, looking at the FileInputDStream class we will see that it indeed can look for existing files, but defaults to new only:
newFilesOnly: Boolean = true,
So, going back into the StreamingContext code, we can see that there is and overload we can use by directly calling the fileStream method:
def fileStream[
K: ClassTag,
V: ClassTag,
F <: NewInputFormat[K, V]: ClassTag]
(directory: String, filter: Path => Boolean, newFilesOnly: Boolean):InputDStream[(K, V)] = {
new FileInputDStream[K, V, F](this, directory, filter, newFilesOnly)
}
So, the TL;DR; is
ssc.fileStream[LongWritable, Text, TextInputFormat]
(directory, FileInputDStream.defaultFilter, false).map(_._2.toString)
Are you expecting Spark to read files already in the directory? If so, this is a common misconception, one that took me by surprise. textFileStream watches a directory for new files to appear, then it reads them. It ignores files already in the directory when you start or files it's already read.
The rationale is that you'll have some process writing files to HDFS, then you'll want Spark to read them. Note that these files much appear atomically, e.g., they were slowly written somewhere else, then moved to the watched directory. This is because HDFS doesn't properly handle reading and writing a file simultaneously.
val filterF = new Function[Path, Boolean] {
def apply(x: Path): Boolean = {
println("looking if "+x+" to be consider or not")
val flag = if(x.toString.split("/").last.split("_").last.toLong < System.currentTimeMillis){ println("considered "+x); list += x.toString; true}
else{ false }
return flag
}
}
this filter function is used to determine whether each path is actually the one preferred by you. so the function inside the apply should be customized as per your requirement.
val streamed_rdd = ssc.fileStream[LongWritable, Text, TextInputFormat]("/user/hdpprod/temp/spark_streaming_output",filterF,false).map{case (x, y) => (y.toString)}
now you have to set the third variable of filestream function to false, this is to make sure not only new files but also consider old existing files in the streaming directory.

How to read Hadoop Map intermediate file file.out

I have set property keep.task.files.pattern to ".*" in mapred-site.xml
restarted cluster and executed my test mapreduce program.
I see two file file.out and file.out.index in the folder
/opt/hadoopws/tmp/mapred/local/taskTracker/hduser/jobcache/job_201403260903_0001/attempt_201403260903_0001_m_000000_0/output/
When i attempt to read file.out using below code i get "not a SequenceFile error" message.
I know for sure its a binary file when i try to open file.out with less, it prompts that its a binary file.
I'm running Hadoop 1.2.1. What is the default map output format?
FileSystem fs = FileSystem.get(conf);
Path path = new Path("/opt/hadoopws/tmp/mapred/local/taskTracker/hduser/jobcache
/job_201403260903_0001/attempt_201403260903_0001_m_000000_0/output
/file.out");
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
IntWritable key = new IntWritable();
IntWritable value = new IntWritable();
while (reader.next(key, value)) {
System.out.println(key.get() + " | " + value.get());
}
reader.close();
Error Message:
Exception in thread "main" java.io.IOException: /opt/hadoopws/tmp/mapred/local/taskTracker/hduser/jobcache/job_201403260903_0001/attempt_201403260903_0001_m_000000_0/output/file.out not a SequenceFile
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1517)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1490)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
at HDPConfigRun.run(HDPConfigRun.java:31)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at HDPConfigRun.main(HDPConfigRun.java:45)
The file is similar to a sequential file in that it contains key value pairs persisted in their serialized form, but it's designed to be scanned quickly for sort and partitioning purposes and has an internal-only format.
If you really need to see the set of key value pairs produced by the mapper then look into configuring your reducer to use the MultipleOutputs class - one output would contain every key/value pair read in by the reducer and another output to contain the "real" output of the reducer.

How to append to an existing file in a Hadoop user program?

I have a Hadoop program in which when the mapping and reducing phases are done, I need to append to an existing file (which is already on HDFS). How can I do that?
it's already supported to append a file on hdfs after hadoop 0.20.2, more information is available here1 and here2
An append example i found may help you:
FSDataOutputStream stm = fs.create(path, true,
conf.getInt("io.file.buffer.size", 4096),
(short)3, blocksize);
String a = make(1000);
stm.write(a.getBytes());
stm.sync();
You can use append method of HDFS,
check the file is exists on not, if exists append the new content in the same file.
for example:-
FileSystem hdfs;
FSDataOutputStream writeInFile;
Path file;
if (hdfs.exists(file)) {
System.out.println("file exists");
writeInFile = hdfs.append(file);
writeInFile.writeBytes(data);
}
else {
System.out.println("new file");
writeInFile = hdfs.create(file, true);
writeInFile.writeBytes(data);
}

Migrating Data from HBase to FileSystem. (Writing Reducer output to Local or Hadoop filesystem)

My Purpose is to migrate the data from Hbase Tables to Flat (say csv formatted) files.
I am used
TableMapReduceUtil.initTableMapperJob(tableName, scan,
GetCustomerAccountsMapper.class, Text.class, Result.class,
job);
for scanning through HBase table and TableMapper for Mapper.
My challange is in while forcing Reducer to dump the Row values (which is normalized in flattened format) to local(or Hdfs) file system.
My problem is neither I am able to see logs of Reducer nor I can see the any files at path that I have mentioned in Reducer.
It's my 2nd or 3rd MR job and first serious one. After trying hard for two days, I am still clueless how to achieve my goal.
Would be great if someone could show the right direction.
Here is my reducer code -
public void reduce(Text key, Iterable<Result> rows, Context context)
throws IOException, InterruptedException {
FileSystem fs = LocalFileSystem.getLocal(new Configuration());
Path dir = new Path("/data/HBaseDataMigration/" + tableName+"_Reducer" + "/" + key.toString());
FSDataOutputStream fsOut = fs.create(dir,true);
for (Result row : rows) {
try {
String normRow = NormalizeHBaserow(
Bytes.toString(key.getBytes()), row, tableName);
fsOut.writeBytes(normRow);
//context.write(new Text(key.toString()), new Text(normRow));
} catch (BadHTableResultException ex) {
throw new IOException(ex);
}
}
fsOut.flush();
fsOut.close();
My Configuration for Reducer Output
Path out = new Path(args[0] + "/" + tableName+"Global");
FileOutputFormat.setOutputPath(job, out);
Thanks in Advance - Panks
Why not reduce into HDFS and once finished use hdfs fs to export the file
hadoop fs -get /user/hadoop/file localfile
If you do want to handle it in the reduce phase take a look at this article on OutputFormat on InfoQ

Resources