I wrote a Spark application that generates HFiles to be used for bulk loading with the LoadIncrementalHFiles command later. As the source data pool is very big, the input files are splitted into iterations that are processed one after the other. Each iteration creates its own HFile directory, so my HDFS structure looks like this:
/user/myuser/map_data/hfiles_0
... /hfiles_1
... /hfiles_2
... /hfiles_3
...
There are about 500 files in this map_data directory, therefore I'm searching for a way to automatically call the LoadIncrementalHFiles function, to process these subdirectories also in iterations later.
The corresponding command would be this:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dcreate.table=no /user/myuser/map_data/hfiles_0 mytable
I need to change this into an iterative command, as this command does not work with subdirectories (when I call it with the /user/myuser/map_data directory)!
I tried to use a Java Process instance to execute the command above automatically, but this doesn't seen to do anything (no output to console and also no more rows in my HBase table).
Using the org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles Java class out of my code also doesn't work, it's also not responsing!
Has anybody a working example for me? Or is there a parameter to be able to run the above hbase command on the parent directory? I'm working with HBase 1.1.2 in a Hortonworks Data Platform 2.5 cluster.
EDIT I tried to run the LoadIncrementalHFiles command from a Hadoop client Java application, but I'm getting an exception relating to snappy compression, see Run LoadIncrementalHFiles from Java client
The solution was to split the hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dcreate.table=no /user/myuser/map_data/hfiles_0 mytable command into many parts (one per command part), see this Java code snippet:
TreeSet<String> subDirs = getHFileDirectories(new Path(HDFS_PATH), hadoopConf);
for(String hFileDir : subDirs) {
try {
String pathToReadFrom = HDFS_OUTPUT_PATH + "/" + hFileDir;
==> String[] execCode = {"hbase", "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles", "-Dcreate.table=no", pathToReadFrom, hbaseTableName};
ProcessBuilder pb = new ProcessBuilder(execCode);
pb.redirectErrorStream(true);
final Process p = pb.start();
// Write the output of the Process to the console
new Thread(new Runnable() {
public void run() {
BufferedReader input = new BufferedReader(new InputStreamReader(p.getInputStream()));
String line = null;
try {
while ((line = input.readLine()) != null)
System.out.println(line);
} catch (IOException e) {
e.printStackTrace();
}
}
}).start();
// Wait for the end of the execution
p.waitFor();
...
}
Related
What is the Best performance architecture to read XML in Spring Batch? Each XML is approximately 300 KB size and we are processing 1 Million.
Our Current Approach
30 partitions and 30 Grids and Each slave gets 166 XMLS
Commit Chunk 100
Application Start Memory is 8 GB
Using JAXB in Reader Default Bean Scope
#StepScope
#Qualifier("xmlItemReader")
public IteratorItemReader<BaseDTO> xmlItemReader(
#Value("#{stepExecutionContext['fileName']}") List<String> fileNameList) throws Exception {
String readingFile = "File Not Found";
logger.info("----StaxEventItemReader----fileName--->" + fileNameList.toString());
List<BaseDTO> fileList = new ArrayList<BaseDTO>();
for (String filePath : fileNameList) {
try {
readingFile = filePath.trim();
Invoice bill = (Invoice) getUnMarshaller().unmarshal(new File(filePath));
UnifiedInvoiceDTO unifiedDTO = new UnifiedInvoiceDTO(bill, environment);
unifiedDTO.setFileName(filePath);
BaseDTO baseDTO = new BaseDTO();
baseDTO.setUnifiedDTO(unifiedDTO);
fileList.add(baseDTO);
} catch (Exception e) {
UnifiedInvoiceDTO unifiedDTO = new UnifiedInvoiceDTO();
unifiedDTO.setFileName(readingFile);
unifiedDTO.setErrorMessage(e);
BaseDTO baseDTO = new BaseDTO();
baseDTO.setUnifiedDTO(unifiedDTO);
fileList.add(baseDTO);
}
}
return new IteratorItemReader<>(fileList);
}
Our questions:
Is this Archirecture correct
Is any performance or architecture advantage of using StaxEventItemReader and XStreamMarshaller over JAXB.
How to handle memory properly to avoid slow down
I would create a job per xml file by using the file name as a job parameter. This approach has many benefits:
Restartability: If a job fails, you only restart the failed file (from where it left off)
Scalability: This approach allows you to run multiple jobs in parallel. If a single machine is not enough, you can distribute the load on multiple machines
Logging: Logs are separate by design, you don't need to use an MDC or any other technique to separate logs
We are receiving XML filepath in a *.txt file
You can a create a script that iterates over these lines and launch a job per line (aka per file). Gnu Parallel (or a similar tool) is a good option to launch jobs in parallel.
Im running MR job on EMR master host.
My input file is in S3 and output set to a table in Hive via Hcatalog.
The job is running successful and i do see reducers output rows but looking at the S3 new partitions folder i can only see MR 0 byte SUCCESS file but no actual data files.
note- when reducer stage start i do see files writes to S3 into temp folder, but it seems the last operation throws the files somewhere.
I don't see any errors in MR logs.
Relevant MR driver code:"
Job job = Job.getInstance();
job.setJobName("Build Events");
job.setJarByClass(LoggersApp.class);
job.getConfiguration().set("fs.defaultFS", "s3://my-bucket");
// set input paths Path[] inputPaths = "file on s3";
FileInputFormat.setInputPaths(job, inputPaths); // set input output
format job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(HCatOutputFormat.class);
_configureOutputTable(job);
private void _setReducer(Job job) {
job.setReducerClass(Reducer.class);
job.setOutputValueClass(DefaultHCatRecord.class); }
private void _configureOutputTable(Job job) throws IOException {
OutputJobInfo jobInfo =
OutputJobInfo.create(_cli.getOptionValue("hive-dbname"),
_cli.getOptionValue("output-table"), null); HCatOutputFormat.setOutput(job, jobInfo); HCatSchema schema =
HCatOutputFormat.getTableSchema(job.getConfiguration());
HCatFieldSchema partitionDate = new HCatFieldSchema("date",
TypeInfoFactory.stringTypeInfo, null); HCatFieldSchema
partitionBatchId = new HCatFieldSchema("batch_id",
TypeInfoFactory.stringTypeInfo, null);
schema.append(partitionDate); schema.append(partitionBatchId);
HCatOutputFormat.setSchema(job, schema);
}
Any help?
I have implemented the following code in java using Apache Spark.
I am running this program on AWS EMR.
I have just implemented simple program from the examples for word count in a file.
I am reading file from HDFS.
public class FileOperations {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("HDFS");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaRDD<String> textFile = sparkContext.textFile("hdfs:/user/hadoop/test.txt");
System.out.println("Program is stared");
JavaPairRDD<String, Integer> counts = textFile
.flatMap(s -> Arrays.asList(s.split(" ")).iterator())
.mapToPair(word -> new Tuple2<>(word, 1))
.reduceByKey((a, b) -> a + b);
counts.foreach(f -> System.out.println(f.toString()));
counts.saveAsTextFile("hdfs:/user/hadoop/output.txt");
System.out.println("Program finished");
}
}
The issue in the above program is counts.saveAsTextFile("hdfs:/user/hadoop/output.txt"); is not creating a text file , instead a directory output.txt is created.
What is wrong in the above code.
This is the first time I am working with Spark and EMR.
This is how it should work. You don't specify a file name, just a path. Spark will create files within that directory. If you look at the method definition for saveAsTextFile you can see that it expects a path:
public void saveAsTextFile(String path)
Within the path you specify it will create a part file for each partition in your data.
Either you .collect() all the data and write your own save method to a single file or you .repartition(1) the data which will still result in a directory, but with only one part file with the data (part-00000)
I'm executing some code on docker in my java application using ProcessBuilder to run the code, however i'm having trouble retrieving the output from it. BufferedReader is not reading anything from the InputStream returned from the container. Is there a specific way to retrieve output from Docker??
I've never had trouble getting output from bash executions before, so I'm thinking maybe docker does things differently somehow. Any ideas would be appreciated
Here's a snippet of the code:
Process dockerCommand;
ProcessBuilder builder = new ProcessBuilder("bash","-c","sudo docker images");
builder.redirectErrorStream(true);
builder.redirectOutput(ProcessBuilder.Redirect.INHERIT);
builder.redirectError(ProcessBuilder.Redirect.INHERIT);
dockerCommand = builder.start();
dockerCommand.waitFor();
List<String> result = new ArrayList<>();
try (BufferedReader reader = new BufferedReader(new InputStreamReader(dockerCommand.getInputStream()))
{
String line = reader.readLine();
while (line != null) {
result.add(line);
line = reader.readLine();
}
}
catch (IOException exc)
{}
The line
builder.redirectOutput(ProcessBuilder.Redirect.INHERIT);
causes bash to receive the same standard output as the parent process, which is presumably your terminal window. This produces misleading results because you actually see the Docker image list, but it's being printed by the shell.
If I comment that out and then iterate over the results list, I can see the output from Docker inside the JVM.
I'm trying to the following in hadoop:
I have implemented a map-reduce job that outputs a file to directory "foo".
the foo files are with a key=IntWriteable, value=IntWriteable format (used a SequenceFileOutputFormat).
Now, I want to start another map-reduce job. the mapper is fine, but each reducer is required to read the entire "foo" files at start-up (I'm using the HDFS for sharing data between reducers).
I used this code on the "public void configure(JobConf conf)":
String uri = "out/foo";
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FileStatus[] status = fs.listStatus(new Path(uri));
for (int i=0; i<status.length; ++i) {
Path currFile = status[i].getPath();
System.out.println("status: " + i + " " + currFile.toString());
try {
SequenceFile.Reader reader = null;
reader = new SequenceFile.Reader(fs, currFile, conf);
IntWritable key = (IntWritable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
IntWritable value = (IntWritable ) ReflectionUtils.newInstance(reader.getValueClass(), conf);
while (reader.next(key, value)) {
// do the code for all the pairs.
}
}
}
The code runs well on a single machine, but I'm notsure if it will run on a cluster.
In other words, does this code reads files from the current machine or does id read from the distributed system?
Is there a better solution for what I'm trying to do?
Thanks in advance,
Arik.
The URI for the FileSystem.get() does not have scheme defined and hence, the File System used depends on the configuration parameter fs.defaultFS. If none set, the default setting i.e LocalFile system will be used.
Your program writes to the Local file system under the workingDir/out/foo. It should work in the cluster as well but looks for the local file system.
With the above said, I'm not sure why you need the entire files from foo directory. You may have consider other designs. If needed, these files should copied to HDFS first and read the files from the overridden setup method of your reducer. Needless to say, to close the files opened in the overridden closeup method of your reducer. While the files can be read in reducers, the map/reduce programs are not designed for this kind of functionality.