Spark saveAsTextFile creating directory - hadoop

I have implemented the following code in Java using Apache Spark, and I am running it on AWS EMR.
It is a simple word-count program taken from the examples, and it reads its input file from HDFS.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class FileOperations {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("HDFS");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        JavaRDD<String> textFile = sparkContext.textFile("hdfs:/user/hadoop/test.txt");
        System.out.println("Program is started");
        JavaPairRDD<String, Integer> counts = textFile
                .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
        counts.foreach(f -> System.out.println(f.toString()));
        counts.saveAsTextFile("hdfs:/user/hadoop/output.txt");
        System.out.println("Program finished");
    }
}
The issue is that counts.saveAsTextFile("hdfs:/user/hadoop/output.txt"); does not create a text file; instead a directory named output.txt is created.
What is wrong in the above code?
This is the first time I am working with Spark and EMR.

This is how it should work. You don't specify a file name, just a path. Spark will create files within that directory. If you look at the method definition for saveAsTextFile you can see that it expects a path:
public void saveAsTextFile(String path)
Within the path you specify, it will create one part file for each partition in your data.

Either .collect() all the data to the driver and write it to a single file with your own save method, or .repartition(1) the data, which will still result in a directory, but one containing only a single part file with the data (part-00000).
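For example, a minimal sketch building on the question's counts RDD (the output path below is just for illustration): coalescing to one partition before saving keeps everything in a single part file, while the output location is still a directory.

    // Sketch: write all counts into a single part file; coalesce(1) avoids the
    // full shuffle that repartition(1) would trigger, but either works here.
    counts.coalesce(1).saveAsTextFile("hdfs:/user/hadoop/output");

    // Typical resulting layout on HDFS:
    //   /user/hadoop/output/_SUCCESS
    //   /user/hadoop/output/part-00000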

Related

Bulk loading with LoadIncrementalHFiles and subdirectories

I wrote a Spark application that generates HFiles to be used for bulk loading with the LoadIncrementalHFiles command later. As the source data pool is very big, the input files are split into iterations that are processed one after the other. Each iteration creates its own HFile directory, so my HDFS structure looks like this:
/user/myuser/map_data/hfiles_0
... /hfiles_1
... /hfiles_2
... /hfiles_3
...
There are about 500 files in this map_data directory, therefore I'm searching for a way to call the LoadIncrementalHFiles function automatically, so that these subdirectories are also processed in iterations later.
The corresponding command would be this:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dcreate.table=no /user/myuser/map_data/hfiles_0 mytable
I need to change this into an iterative command, as this command does not work with subdirectories (when I call it with the /user/myuser/map_data directory)!
I tried to use a Java Process instance to execute the command above automatically, but this doesn't seem to do anything (no output to the console and no new rows in my HBase table).
Calling the org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles Java class from my code doesn't work either; it is also not responding!
Does anybody have a working example for me? Or is there a parameter that allows the above hbase command to run on the parent directory? I'm working with HBase 1.1.2 on a Hortonworks Data Platform 2.5 cluster.
EDIT: I tried to run the LoadIncrementalHFiles command from a Hadoop client Java application, but I got an exception related to Snappy compression, see Run LoadIncrementalHFiles from Java client
The solution was to split the hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dcreate.table=no /user/myuser/map_data/hfiles_0 mytable command into its individual arguments (one array element per command part), see this Java code snippet:
TreeSet<String> subDirs = getHFileDirectories(new Path(HDFS_PATH), hadoopConf);
for (String hFileDir : subDirs) {
    try {
        String pathToReadFrom = HDFS_OUTPUT_PATH + "/" + hFileDir;
        String[] execCode = {"hbase", "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles", "-Dcreate.table=no", pathToReadFrom, hbaseTableName};
        ProcessBuilder pb = new ProcessBuilder(execCode);
        pb.redirectErrorStream(true);
        final Process p = pb.start();
        // Write the output of the Process to the console
        new Thread(new Runnable() {
            public void run() {
                BufferedReader input = new BufferedReader(new InputStreamReader(p.getInputStream()));
                String line = null;
                try {
                    while ((line = input.readLine()) != null)
                        System.out.println(line);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }).start();
        // Wait for the end of the execution
        p.waitFor();
        ...
    }
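getHFileDirectories is the poster's own helper and isn't shown; a minimal sketch of what it might look like, assuming it simply collects the names of the immediate subdirectories under the given HDFS path:

    import java.io.IOException;
    import java.util.TreeSet;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical sketch of the helper referenced above: collect the names of the
    // immediate subdirectories (hfiles_0, hfiles_1, ...) under the given path.
    static TreeSet<String> getHFileDirectories(Path basePath, Configuration conf) throws IOException {
        TreeSet<String> dirs = new TreeSet<String>();
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(basePath)) {
            if (status.isDirectory()) {
                dirs.add(status.getPath().getName());
            }
        }
        return dirs;
    }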

How to merge several zip files in one zip file in HDFS hadoop

I am running a MapReduce job that creates a separate zip file for each row key.
I am not able to create one zip file for all records, so is there any way to merge all these zip files into a single zip file in Hadoop MapReduce?
This is how I create the zip files for each row key:
@Override
public void write(K key, V value) throws IOException {
    final String valueStr = value.toString();
    String strFullFileName[] = valueStr.split("\\|\\^\\|");
    String strKey = strFullFileName[0] + strFullFileName[1];
    ZipEntry ze = new ZipEntry(strKey);
    zipOut.closeEntry();
    zipOut.putNextEntry(ze);
    if (value instanceof BytesWritable) {
        zipOut.write(((BytesWritable) value).getBytes(), 0,
                ((BytesWritable) value).getLength());
    } else {
        zipOut.write(value.toString().getBytes());
    }
}
Can we merge several zip files in MapReduce itself? If not, please suggest a command to merge them.
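For reference (not from the original post, and outside of MapReduce): one generic way to merge archives is to copy the entries of every input zip into a single ZipOutputStream. A minimal local-filesystem sketch with java.util.zip; an HDFS variant would obtain its streams from FileSystem.open and FileSystem.create instead, and the sketch assumes entry names do not collide across the inputs:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;
    import java.util.zip.ZipOutputStream;

    public class ZipMerger {
        // Copy every entry of the given input archives into one output archive.
        public static void merge(String[] inputZips, String outputZip) throws IOException {
            try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(outputZip))) {
                byte[] buffer = new byte[8192];
                for (String inputZip : inputZips) {
                    try (ZipInputStream in = new ZipInputStream(new FileInputStream(inputZip))) {
                        ZipEntry entry;
                        while ((entry = in.getNextEntry()) != null) {
                            out.putNextEntry(new ZipEntry(entry.getName()));
                            int read;
                            while ((read = in.read(buffer)) != -1) {
                                out.write(buffer, 0, read);
                            }
                            out.closeEntry();
                        }
                    }
                }
            }
        }
    }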

How to recursively read Hadoop files from directory using Spark?

Inside the given directory I have many different folders and inside each folder I have Hadoop files (part_001, etc.).
directory
-> folder1
-> part_001...
-> part_002...
-> folder2
-> part_001...
...
Given the directory, how can I recursively read the content of all folders inside this directory and load this content into a single RDD in Spark using Scala?
I found this, but it does not recursively enter the sub-folders (I am using import org.apache.hadoop.mapreduce.lib.input):
var job: Job = null
try {
  job = Job.getInstance()
  FileInputFormat.setInputPaths(job, new Path("s3n://" + bucketNameData + "/" + directoryS3))
  FileInputFormat.setInputDirRecursive(job, true)
} catch {
  case ioe: IOException => ioe.printStackTrace(); System.exit(1);
}
val sourceData = sc.newAPIHadoopRDD(job.getConfiguration(), classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).values
I also found this web page that uses SequenceFile, but again I don't understand how to apply it to my case.
If you are using Spark, you can do this using wildcards as follows:
scala> sc.textFile("path/*/*")
sc is the SparkContext, which is initialized by default if you are using spark-shell; if you are creating your own program, you will have to instantiate a SparkContext yourself.
Be careful with the following flag:
scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")
> res6: String = null
You should set this flag to true:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
I have found that the parameters must be set in this way:
.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
connector_output=${basepath}/output/connector/*/*/*/*/*
works for me when I have a directory structure like
${basepath}/output/connector/2019/01/23/23/output*.dat
I didn't have to set any other properties, just used the following:
sparkSession.read().format("csv").schema(schema)
        .option("delimiter", "|")
        .load("/user/user1/output/connector/*/*/*/*/*");

Reading Distributed Files in Hadoop

I'm trying to do the following in Hadoop:
I have implemented a map-reduce job that outputs files to a directory "foo".
The foo files use key=IntWritable, value=IntWritable (written with a SequenceFileOutputFormat).
Now I want to start another map-reduce job. The mapper is fine, but each reducer is required to read all of the "foo" files at start-up (I'm using HDFS for sharing data between reducers).
I used this code in the public void configure(JobConf conf) method:
String uri = "out/foo";
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FileStatus[] status = fs.listStatus(new Path(uri));
for (int i = 0; i < status.length; ++i) {
    Path currFile = status[i].getPath();
    System.out.println("status: " + i + " " + currFile.toString());
    SequenceFile.Reader reader = null;
    try {
        reader = new SequenceFile.Reader(fs, currFile, conf);
        IntWritable key = (IntWritable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        IntWritable value = (IntWritable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
            // do the code for all the pairs.
        }
    } finally {
        IOUtils.closeStream(reader); // org.apache.hadoop.io.IOUtils
    }
}
The code runs well on a single machine, but I'm not sure if it will run on a cluster.
In other words, does this code read files from the current machine or from the distributed file system?
Is there a better solution for what I'm trying to do?
Thanks in advance,
Arik.
The URI passed to FileSystem.get() has no scheme defined, hence the file system used depends on the configuration parameter fs.defaultFS. If none is set, the default, i.e. the local file system, is used.
Your program writes to the local file system under workingDir/out/foo. It should work on the cluster as well, but it will still look at the local file system.
That said, I'm not sure why you need the entire contents of the foo directory in every reducer; you may want to consider other designs. If it is needed, these files should be copied to HDFS first and read from the overridden setup method of your reducer. Needless to say, close the files you opened in the overridden cleanup method of your reducer. While the files can be read in reducers, map/reduce programs are not designed for this kind of functionality.
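To make the scheme point concrete (not from the original answer): qualifying the URI with hdfs:// pins the read to HDFS regardless of fs.defaultFS. A minimal sketch; the namenode address and path are placeholders:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListFooOnHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // An explicit hdfs:// scheme selects HDFS regardless of fs.defaultFS.
            String uri = "hdfs://namenode:8020/user/hadoop/out/foo";
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            for (FileStatus status : fs.listStatus(new Path(uri))) {
                System.out.println(status.getPath());
            }
        }
    }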

Migrating Data from HBase to FileSystem. (Writing Reducer output to Local or Hadoop filesystem)

My purpose is to migrate the data from HBase tables to flat (say, CSV formatted) files.
I used
TableMapReduceUtil.initTableMapperJob(tableName, scan,
        GetCustomerAccountsMapper.class, Text.class, Result.class,
        job);
for scanning through the HBase table, and TableMapper for the Mapper.
My challenge is in forcing the Reducer to dump the row values (normalized in a flattened format) to the local (or HDFS) file system.
My problem is that I can neither see the Reducer's logs nor any files at the path that I have specified in the Reducer.
It's my 2nd or 3rd MR job and my first serious one. After trying hard for two days, I am still clueless about how to achieve my goal.
It would be great if someone could point me in the right direction.
Here is my reducer code -
public void reduce(Text key, Iterable<Result> rows, Context context)
        throws IOException, InterruptedException {
    FileSystem fs = LocalFileSystem.getLocal(new Configuration());
    Path dir = new Path("/data/HBaseDataMigration/" + tableName + "_Reducer" + "/" + key.toString());
    FSDataOutputStream fsOut = fs.create(dir, true);
    for (Result row : rows) {
        try {
            String normRow = NormalizeHBaserow(
                    Bytes.toString(key.getBytes()), row, tableName);
            fsOut.writeBytes(normRow);
            //context.write(new Text(key.toString()), new Text(normRow));
        } catch (BadHTableResultException ex) {
            throw new IOException(ex);
        }
    }
    fsOut.flush();
    fsOut.close();
}
My configuration for the Reducer output:
Path out = new Path(args[0] + "/" + tableName+"Global");
FileOutputFormat.setOutputPath(job, out);
Thanks in Advance - Panks
Why not reduce into HDFS and, once finished, use hadoop fs to export the file:
hadoop fs -get /user/hadoop/file localfile
If you do want to handle it in the reduce phase, take a look at this article on OutputFormat on InfoQ.
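As an illustration of the first suggestion (not from the original answer): let the reducer emit through context.write and let the job's output format write the part files under the HDFS path given to FileOutputFormat.setOutputPath, instead of opening FileSystem streams by hand. A sketch; the class name is assumed and the normalization step is only hinted at:

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical reducer: emit one line per row via context.write; the configured
    // output format (e.g. TextOutputFormat) writes the part files to HDFS.
    public class ExportReducer extends Reducer<Text, Result, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Result> rows, Context context)
                throws IOException, InterruptedException {
            for (Result row : rows) {
                // Normalize the Result into one CSV line here (e.g. with the question's
                // NormalizeHBaserow helper); row.toString() is only a stand-in.
                String csvLine = key.toString() + "," + row.toString();
                context.write(new Text(csvLine), NullWritable.get());
            }
        }
    }

The part files under the job's output path can then be copied out with hadoop fs -get, as suggested above.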
