I am currently writing a MapReduce program to find the difference between two Hive tables.
My Hive tables are partitioned on one or more columns, so the folder names contain the values of the partition columns.
Is there any way to read a partitioned Hive table?
Can it be read in the mapper?
Since the underlying HDFS data of a partitioned Hive table is organised by default as
table/root/folder/x=1/y=1
table/root/folder/x=1/y=2
table/root/folder/x=2/y=1
table/root/folder/x=2/y=2
...
you can build each of these input paths in the driver and add them through multiple calls to FileInputFormat.addInputPath(job, path), one call per folder path that you built.
Sample code is pasted below. Note how the paths are added for MyMapper.class. In this sample I am using the MultipleInputs API. The table is partitioned by 'part' and 'xdate'.
public class MyDriver extends Configured implements Tool {
public int run(String[] args) throws Exception {
Configuration conf = getConf();
conf.set("mapred.compress.map.output", "true");
conf.set("mapred.output.compression.type", "BLOCK");
Job job = new Job(conf);
//set up various job parameters
job.setJarByClass(MyDriver.class);
job.setJobName(conf.get("job.name"));
MultipleInputs.addInputPath(job, new Path(conf.get("root.folder")+"/xdate="+conf.get("start.date")), TextInputFormat.class, OneMapper.class);
for (Path path : getPathList(job,conf)) {
System.out.println("path: "+path.toString());
MultipleInputs.addInputPath(job, path, Class.forName(conf.get("input.format")).asSubclass(FileInputFormat.class).asSubclass(InputFormat.class), MyMapper.class);
}
...
...
return job.waitForCompletion(true) ? 0 : -2;
}
private static ArrayList<Path> getPathList(Job job, Configuration conf) {
String rootdir = conf.get("input.path.rootfolder");
String partlist = conf.get("part.list");
String startdate_s = conf.get("start.date");
String enxdate_s = conf.get("end.date");
ArrayList<Path> pathlist = new ArrayList<Path>();
String[] partlist_split = partlist.split(",");
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
Date startdate_d = null;
Date enxdate_d = null;
Path path = null;
try {
startdate_d = sdf.parse(startdate_s);
enxdate_d = sdf.parse(enxdate_s);
GregorianCalendar gcal = new GregorianCalendar();
gcal.setTime(startdate_d);
Date d = null;
for (String part : partlist_split) {
gcal.setTime(startdate_d);
do {
d = gcal.getTime();
FileSystem fs = FileSystem.get(conf);
path = new Path(rootdir + "/part=" + part + "/xdate="
+ sdf.format(d));
if (fs.exists(path)) {
pathlist.add(path);
}
gcal.add(Calendar.DAY_OF_YEAR, 1);
} while (d.before(enxdate_d));
}
} catch (Exception e) {
e.printStackTrace();
}
return pathlist;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new MyDriver(), args);
System.exit(res);
}
}
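The date-range loop in getPathList above can also be sketched with java.time, which avoids the Calendar arithmetic. This is only an illustration under assumptions: PartitionPathSketch and buildPaths are made-up names, the paths are plain strings, and the fs.exists() check from the original is left to the caller.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop or Hive.
final class PartitionPathSketch {

    // Builds rootDir/part=<part>/xdate=<yyyy-MM-dd> for every part and every
    // day from start to end, both inclusive.
    static List<String> buildPaths(String rootDir, List<String> parts,
                                   LocalDate start, LocalDate end) {
        List<String> paths = new ArrayList<>();
        for (String part : parts) {
            for (LocalDate d = start; !d.isAfter(end); d = d.plusDays(1)) {
                // LocalDate.toString() is ISO-8601, i.e. yyyy-MM-dd
                paths.add(rootDir + "/part=" + part + "/xdate=" + d);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        List<String> paths = buildPaths("/warehouse/mytable", Arrays.asList("a", "b"),
                LocalDate.parse("2014-01-30"), LocalDate.parse("2014-02-01"));
        for (String p : paths) {
            System.out.println(p);
        }
    }
}
```

Each returned string would then be wrapped in a Hadoop Path and passed to MultipleInputs.addInputPath, as in the driver above.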
Yes, it can be read in the Mapper pretty easily. This answer is based on the idea mentioned by @Daniel Koverman.
With the Context object passed to Mapper.map(), you can get the file split path this way:
// this gives you the path plus offsets, e.g. hdfs://.../tablename/partition1=20/partition2=ABC/000001_0:0+12345678
ctx.getInputSplit().toString();
// or this gets you the path only
((FileSplit) ctx.getInputSplit()).getPath();
Here's a more complete solution that parses out the actual partition value:
class MyMapper extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
// regex to parse out the /partitionName=partitionValue/ pairs
private static Pattern partitionRegex = Pattern.compile("(?<=/)(?<name>[_\\-\\w]+)=(?<value>[^/]*)(?=/)");
public static String parsePartitionValue(String path, String partitionName) throws IllegalArgumentException{
Matcher m = partitionRegex.matcher(path);
while(m.find()){
if(m.group("name").equals(partitionName)){
return m.group("value");
}
}
throw new IllegalArgumentException(String.format("Partition [%s] not found", partitionName));
}
@Override
public void map(KEYIN key, VALUEIN v, Context ctx) throws IOException, InterruptedException {
String partitionVal = parsePartitionValue(ctx.getInputSplit().toString(), "my_partition_col");
}
}
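If more than one partition column is needed, the same regex idea can collect every name=value segment in one pass. The sketch below uses only the JDK, so it can be tried without a cluster; PartitionParserSketch and parsePartitions are illustrative names, and the sample path in main is invented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; in a mapper the input would be the split path string.
final class PartitionParserSketch {

    // Matches one /name=value/ path segment. The lookarounds leave the slashes
    // unconsumed, so two adjacent partition segments are both found.
    private static final Pattern SEGMENT =
            Pattern.compile("(?<=/)(?<name>[\\w\\-]+)=(?<value>[^/]*)(?=/)");

    // Returns partition name -> value in the order they appear in the path.
    static Map<String, String> parsePartitions(String splitPath) {
        Map<String, String> partitions = new LinkedHashMap<>();
        Matcher m = SEGMENT.matcher(splitPath);
        while (m.find()) {
            partitions.put(m.group("name"), m.group("value"));
        }
        return partitions;
    }

    public static void main(String[] args) {
        String path = "hdfs://namenode/warehouse/tablename/partition1=20/partition2=ABC/000001_0:0+12345678";
        System.out.println(parsePartitions(path));
    }
}
```

Because the pattern requires a trailing slash after each value, the file name itself is never mistaken for a partition.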
I am attempting to save a file in the main class of a Hadoop application so it can be read later on by the mapper. The file is an encryption key that will be used to encrypt data. My question here is, where will the data end up if I am writing the file to the working directory?
public class HadoopIndexProject {
private static SecretKey generateKey(int size, String Algorithm) throws UnsupportedEncodingException, NoSuchAlgorithmException {
KeyGenerator keyGen = KeyGenerator.getInstance(Algorithm);
keyGen.init(size);
return keyGen.generateKey();
}
private static IvParameterSpec generateIV() {
byte[] b = new byte[16];
new Random().nextBytes(b);
return new IvParameterSpec(b);
}
public static void saveKey(SecretKey key, IvParameterSpec IV, String path) throws IOException {
FileOutputStream stream = new FileOutputStream(path);
//FSDataOutputStream stream = fs.create(new Path(path));
try {
stream.write(key.getEncoded());
stream.write(IV.getIV());
} finally {
stream.close();
}
}
/**
* @param args the command line arguments
* @throws java.lang.Exception
*/
public static void main(String[] args) throws Exception {
// TODO code application logic here
Configuration conf = new Configuration();
//FileSystem fs = FileSystem.getLocal(conf);
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
SecretKey KEY;
IvParameterSpec IV;
if (otherArgs.length != 2) {
System.err.println("Usage: Index <in> <out>");
System.exit(2);
}
try {
if(! new File("key.dat").exists()) {
KEY = generateKey(128, "AES");
IV = generateIV();
saveKey(KEY, IV, "key.dat");
}
} catch (NoSuchAlgorithmException ex) {
Logger.getLogger(HadoopIndexMapper.class.getName()).log(Level.SEVERE, null, ex);
}
conf.set("mapred.textoutputformat.separator", ":");
Job job = Job.getInstance(conf);
job.setJobName("Index creator");
job.setJarByClass(HadoopIndexProject.class);
job.setMapperClass(HadoopIndexMapper.class);
job.setReducerClass(HadoopIndexReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntArrayWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
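HDFS has no working directory in the usual sense. Relative HDFS paths are resolved against /user/<username>, so a file written through the HDFS FileSystem API would be located at /user/<username>/key.dat. Note that the code as posted uses java.io.FileOutputStream, which writes to the local filesystem of the machine running the driver, not to HDFS.
In YARN you also have the distributed cache, so you can ship additional files to your application by adding them with job.addCacheFile.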
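Wherever the file ends up, the mapper has to read it back in the same layout that saveKey wrote: the key bytes followed by the IV bytes. A minimal sketch of that round trip using only JDK classes, assuming a 16-byte AES key and a 16-byte IV as in the question; KeyFileSketch and its method names are illustrative, and it uses SecureRandom for the IV where the question used Random.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper; Path here is java.nio.file.Path, not Hadoop's Path.
final class KeyFileSketch {
    static final int KEY_BYTES = 16; // AES-128, as in generateKey(128, "AES")
    static final int IV_BYTES = 16;

    // Writes key bytes followed by IV bytes, the layout used by saveKey.
    static void save(Path file, SecretKey key, IvParameterSpec iv) throws IOException {
        byte[] k = key.getEncoded();
        byte[] v = iv.getIV();
        byte[] all = Arrays.copyOf(k, k.length + v.length);
        System.arraycopy(v, 0, all, k.length, v.length);
        Files.write(file, all);
    }

    // Reads the first KEY_BYTES bytes back as an AES key.
    static SecretKey loadKey(Path file) throws IOException {
        byte[] all = Files.readAllBytes(file);
        return new SecretKeySpec(all, 0, KEY_BYTES, "AES");
    }

    // Reads the IV_BYTES bytes that follow the key.
    static IvParameterSpec loadIv(Path file) throws IOException {
        byte[] all = Files.readAllBytes(file);
        return new IvParameterSpec(all, KEY_BYTES, IV_BYTES);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(128);
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv);
        Path file = Files.createTempFile("key", ".dat");
        save(file, gen.generateKey(), new IvParameterSpec(iv));
        System.out.println("IV round-trips: " + Arrays.equals(iv, loadIv(file).getIV()));
    }
}
```

In a job, the file would be added with job.addCacheFile in the driver and the load methods would be called from the mapper's setup on the localized copy.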
I am able to rename my reducer output file correctly, but the r-00000 suffix still persists.
I have used MultipleOutputs in my reducer class.
Here are the details. I am not sure what I am missing or what else I have to do.
public class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private Logger logger = Logger.getLogger(MyReducer.class);
private MultipleOutputs<NullWritable, Text> multipleOutputs;
String strName = "";
public void setup(Context context) {
logger.info("Inside Reducer.");
multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
}
@Override
public void reduce(NullWritable Key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
for (Text value : values) {
final String valueStr = value.toString();
StringBuilder sb = new StringBuilder();
sb.append(strArrvalueStr[0] + "|!|");
multipleOutputs.write(NullWritable.get(), new Text(sb.toString()),strName);
}
}
public void cleanup(Context context) throws IOException,
InterruptedException {
multipleOutputs.close();
}
}
I was able to do it explicitly after my job finishes, and that's OK for me. It adds no delay to the job.
if (b){
DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd-HHmm");
Calendar cal = Calendar.getInstance();
String strDate=dateFormat.format(cal.getTime());
FileSystem hdfs = FileSystem.get(getConf());
FileStatus fs[] = hdfs.listStatus(new Path(args[1]));
if (fs != null){
for (FileStatus aFile : fs) {
if (!aFile.isDir()) {
hdfs.rename(aFile.getPath(), new Path(aFile.getPath().toString()+".txt"));
}
}
}
}
A more suitable approach to the problem would be changing the OutputFormat.
For example, if you are using TextOutputFormat, take the source code of the TextOutputFormat class and modify the method below so that it produces the proper filename (without r-00000). Then set the modified output format in the driver.
public synchronized static String getUniqueFile(TaskAttemptContext context, String name, String extension) {
/*TaskID taskId = context.getTaskAttemptID().getTaskID();
int partition = taskId.getId();*/
StringBuilder result = new StringBuilder();
result.append(name);
/*
* result.append('-');
* result.append(TaskID.getRepresentingCharacter(taskId.getTaskType()));
* result.append('-'); result.append(NUMBER_FORMAT.format(partition));
* result.append(extension);
*/
return result.toString();
}
So whatever name is passed through MultipleOutputs, the file will be created with exactly that name.
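To make the effect of the change concrete, the sketch below reproduces the two naming schemes in plain Java: the default one that the commented-out lines assemble, and the bare name that the modified method returns. It is not the Hadoop source; OutputNameSketch is an illustrative name, and the five-digit zero padding is an assumption taken from the r-00000 suffix mentioned in the question.

```java
import java.util.Locale;

// Hypothetical stand-in for the naming logic, independent of Hadoop.
final class OutputNameSketch {

    // Default scheme per the commented-out lines:
    // name, '-', task type character, '-', zero-padded partition, extension.
    static String defaultName(String name, char taskType, int partition, String extension) {
        String padded = String.format(Locale.ROOT, "%05d", partition); // assumed width of 5
        return name + '-' + taskType + '-' + padded + extension;
    }

    // Modified scheme: the base name is used as-is.
    static String bareName(String name) {
        return name;
    }

    public static void main(String[] args) {
        System.out.println(defaultName("report", 'r', 0, ""));
        System.out.println(bareName("report"));
    }
}
```

One thing to keep in mind with the bare name: the partition number is what keeps files from different reduce tasks apart, so with more than one reducer writing the same name the outputs may collide.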
I ran a recursive map/reduce program. Something went wrong and it consumed nearly all the disk space available on my C drive, so I closed the resource manager, node manager, name node and data node consoles.
Now my C drive is almost full and I don't know how to free the disk space and get the drive back to how it was before. What should I do now? Any help is appreciated.
Here is the code:
public class apriori {
public static class CandidateGenMap extends Mapper<LongWritable, Text, Text, Text>
{
private Text word = new Text();
private Text count = new Text();
private int Support = 5;
public void CandidatesGenRecursion(Vector<String> in, Vector<String> out,
int length, int level, int start,
Context context) throws IOException {
int i,size;
for(i=start;i<length;i++) {
if(level==0){
out.add(in.get(i));
} else {
out.add(in.get(i));
int init=1;
StringBuffer current = new StringBuffer();
for(String s:out)
{
if(init==1){
current.append(s);
init=0;
} else {
current.append(" ");
current.append(s);
}
}
word.set(current.toString());
count.set(Integer.toString(1));
try {
context.write(word, count);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
if(i < length-1) {
CandidatesGenRecursion(in, out, length,level+1,i+1, context);
}
size = out.size();
if(size>0){
out.remove(size-1);
}
}
}
@Override
public void map(LongWritable key,Text value,Context context) throws IOException
{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
String[] token=new String[2];
int i=0;
while(tokenizer.hasMoreTokens()){
token[i]= tokenizer.nextToken();
++i;
}
StringTokenizer urlToken = new StringTokenizer(token[1],",");
Vector<String> lst = new Vector<String>();
int loop=0;
while (urlToken.hasMoreTokens()) {
String str = urlToken.nextToken();
lst.add(str);
loop++;
}
Vector<String> combinations = new Vector<String>();
if(!lst.isEmpty()) {
CandidatesGenRecursion(lst, combinations, loop,0,0, context);
}
}
}
public static class CandidateGenReduce extends Reducer<Text, IntWritable, Text, IntWritable>
{
public void reduce(Text key,Iterator<IntWritable> values,Context context) throws IOException
{
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
try {
context.write(key, new IntWritable(sum));
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
public static void main(String[] args) throws Exception
{
Date dt;
long start,end; // Start and end time
//Start Timer
dt = new Date();
start = dt.getTime();
Configuration conf1 = new Configuration();
System.out.println("Starting Job2");
Job job2 = new Job(conf1, "apriori candidate gen");
job2.setJarByClass(apriori.class);
job2.setMapperClass(CandidateGenMap.class);
job2.setCombinerClass(CandidateGenReduce.class); //
job2.setReducerClass(CandidateGenReduce.class);
job2.setMapOutputKeyClass(Text.class);
job2.setMapOutputValueClass(Text.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(IntWritable.class);
job2.setInputFormatClass(TextInputFormat.class);
job2.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job2, new Path(args[0]));
FileOutputFormat.setOutputPath(job2, new Path(args[1]));
job2.waitForCompletion(true);
//End Timer
dt = new Date();
end = dt.getTime();
}
}
Hadoop needs sufficient disk space for its I/O operations at each phase (map, reduce, etc.).
Check your job's output path in HDFS and delete its contents.
List contents:
$ sudo -u hdfs hadoop fs -ls [YourJobOutputPath]
Disk used:
$ sudo -u hdfs hadoop fs -du -h [YourJobOutputPath]
Delete contents (be careful!, it's recursive):
$ sudo -u hdfs hadoop fs -rm -R [YourJobOutputPath]
Deleting the output directory might help in freeing your disk from the files created by the MapReduce job.
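For a sense of why the job wrote so much data in the first place: the recursion in CandidatesGenRecursion emits one record for every combination of two or more items on an input line, which works out to 2^n - n - 1 records for a line with n items. The sketch below replays only the control flow of that recursion, without Hadoop, and counts the writes; CandidateCountSketch and countEmitted are illustrative names.

```java
// Hypothetical replay of the recursion's control flow; no records are built.
final class CandidateCountSketch {

    // Same loop structure as CandidatesGenRecursion: level 0 only seeds the
    // combination, every deeper level performs one context.write per item.
    static long countEmitted(int length, int level, int start) {
        long emitted = 0;
        for (int i = start; i < length; i++) {
            if (level > 0) {
                emitted++; // corresponds to context.write(word, count)
            }
            if (i < length - 1) {
                emitted += countEmitted(length, level + 1, i + 1);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 10, 20}) {
            System.out.println(n + " items -> " + countEmitted(n, 0, 0) + " records");
        }
    }
}
```

Since the count doubles with every extra item, a handful of long input lines is enough to fill a disk, so capping the combination length or filtering infrequent items first is worth considering before rerunning.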
I have chained two MapReduce jobs. Job1 has only one reducer, in which I compute a float value. I want to use this value in the reducer of Job2. This is my main method setup.
public static String GlobalVriable;
public static void main(String[] args) throws Exception {
int runs = 0;
for (; runs < 10; runs++) {
String inputPath = "part-r-000" + nf.format(runs);
String outputPath = "part-r-000" + nf.format(runs + 1);
MyProgram.MR1(inputPath);
MyProgram.MR2(inputPath, outputPath);
}
}
public static void MR1(String inputPath)
throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
conf.set("var1","");
Job job = new Job(conf, "This is job1");
job.setJarByClass(MyProgram.class);
job.setMapperClass(MyMapper1.class);
job.setReducerClass(MyReduce1.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
FileInputFormat.addInputPath(job, new Path(inputPath));
job.waitForCompletion(true);
GlobalVriable = conf.get("var1"); // I am getting NULL here
}
public static void MR2(String inputPath, String outputPath)
throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = new Job(conf, "This is job2");
...
}
public static class MyReduce1 extends
Reducer<Text, FloatWritable, Text, FloatWritable> {
public void reduce(Text key, Iterable<FloatWritable> values, Context context)
throws IOException, InterruptedException {
float s = 0;
for (FloatWritable val : values) {
s += val.get();
}
String sum = Float.toString(s);
context.getConfiguration().set("var1", sum);
}
}
As you can see, I need to iterate the entire program multiple times. Job1 computes a single number from the input. Since it is just a single number and there are a lot of iterations, I don't want to write it to HDFS and read it back. Is there a way to share the value computed in MyReduce1 and use it in the reducer of Job2?
UPDATE: I have tried passing the value using conf.set and conf.get. The value is not being passed.
Here's how to pass back a float value via a counter ...
First, in the first reducer, transform the float value into a long by multiplying by 1000 (to maintain 3 digits of precision, for example) and putting the result into a counter:
public void cleanup(Context context) {
long result = (long) (floatValue * 1000);
context.getCounter("Result","Result").increment(result);
}
In the driver class, retrieve the long value and transform it back to a float:
public static void MR1(String inputPath)
throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = new Job(conf, "This is job1");
job.setJarByClass(MyProgram.class);
job.setMapperClass(MyMapper1.class);
job.setReducerClass(MyReduce1.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
FileInputFormat.addInputPath(job, new Path(inputPath));
job.waitForCompletion(true);
long result = job.getCounters().findCounter("Result","Result").getValue();
float value = ((float)result) / 1000;
}
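The float-to-long conversion can be checked in isolation. The sketch below shows the scaled encoding from this answer next to a lossless alternative that stores the raw IEEE-754 bits; the bit variant is an addition not in the original answer, and it only works if the counter is incremented exactly once, because counters are summed. CounterEncodingSketch and its method names are illustrative.

```java
// Hypothetical helpers for packing a float into a long counter value.
final class CounterEncodingSketch {

    // Scaled encoding as in the answer: keeps log10(scale) decimal digits,
    // truncating the rest.
    static long toScaledLong(float value, int scale) {
        return (long) (value * scale);
    }

    static float fromScaledLong(long counter, int scale) {
        return ((float) counter) / scale;
    }

    // Lossless alternative: store the raw bit pattern of the float.
    static long toBits(float value) {
        return Float.floatToIntBits(value);
    }

    static float fromBits(long counter) {
        return Float.intBitsToFloat((int) counter);
    }

    public static void main(String[] args) {
        float original = 3.14159f;
        System.out.println("scaled: " + fromScaledLong(toScaledLong(original, 1000), 1000));
        System.out.println("bits:   " + fromBits(toBits(original)));
    }
}
```

With the scaled form, the reducer would pass toScaledLong(...) to increment() and the driver would apply fromScaledLong(...) to getValue().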
You could use ZooKeeper for this. It's great for any inter-job coordination or message passing like this.
Can't you just change the return type of MR1 to int (or whatever data type is appropriate) and return the number you computed:
int myNumber = MyProgram.MR1(inputPath);
Then add a parameter to MR2 and call it with your computed number:
MyProgram.MR2(inputPath, outputPath, myNumber);
I am getting a garbage-like value instead of the data from the file I want to use as a distributed cache file.
The Job Configuration is as follows:
Configuration config5 = new Configuration();
JobConf conf5 = new JobConf(config5, Job5.class);
conf5.setJobName("Job5");
conf5.setOutputKeyClass(Text.class);
conf5.setOutputValueClass(Text.class);
conf5.setMapperClass(MapThree4c.class);
conf5.setReducerClass(ReduceThree5.class);
conf5.setInputFormat(TextInputFormat.class);
conf5.setOutputFormat(TextOutputFormat.class);
DistributedCache.addCacheFile(new URI("/home/users/mlakshm/ap1228"), conf5);
FileInputFormat.setInputPaths(conf5, new Path(other_args.get(5)));
FileOutputFormat.setOutputPath(conf5, new Path(other_args.get(6)));
JobClient.runJob(conf5);
In the Mapper, I have the following code:
public class MapThree4c extends MapReduceBase implements Mapper<LongWritable, Text,
Text, Text >{
private Set<String> prefixCandidates = new HashSet<String>();
Text a = new Text();
public void configure(JobConf conf5) {
Path[] dates = new Path[0];
try {
dates = DistributedCache.getLocalCacheFiles(conf5);
System.out.println("candidates: "+candidates);
String astr = dates.toString();
a = new Text(astr);
} catch (IOException ioe) {
System.err.println("Caught exception while getting cached files: " +
StringUtils.stringifyException(ioe));
}
}
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer st = new StringTokenizer(line);
st.nextToken();
String t = st.nextToken();
String uidi = st.nextToken();
String uidj = st.nextToken();
String check = null;
output.collect(new Text(line), a);
}
}
The output value I am getting from this mapper is: [Lorg.apache.hadoop.fs.Path;@786c1a82
instead of the value from the distributed cache file.
That looks like what you get when you call toString() on an array, and if you look at the Javadocs for DistributedCache.getLocalCacheFiles(), an array of Path is what it returns. If you need to read the contents of the files in the cache, you can open and read them with the standard Java APIs.
From your code:
Path[] dates = DistributedCache.getLocalCacheFiles(conf5);
Implies that:
String astr = dates.toString(); // default toString() of the array itself (type name plus hash code), which is what you see in the output as [Lorg.apache.hadoop.fs.Path;@786c1a82
You need to do the following to see the actual paths:
for(Path cacheFile: dates){
output.collect(new Text(line), new Text(cacheFile.getName()));
}
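To get at the file contents rather than the paths, each local path can be opened with ordinary Java I/O, as the first answer suggests. A minimal sketch, assuming the cache file is UTF-8 text with one entry per line and that the string passed in is a plain local filesystem path (for example what cacheFile.toString() yields for an entry of the Path[] array); CacheFileReaderSketch and loadLines are illustrative names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; it would be called once from configure().
final class CacheFileReaderSketch {

    // Reads every non-blank line of the local file into a set.
    static Set<String> loadLines(String localPath) throws IOException {
        Set<String> lines = new HashSet<>();
        for (String line : Files.readAllLines(Paths.get(localPath), StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        java.nio.file.Path tmp = Files.createTempFile("cache", ".txt");
        Files.write(tmp, "alpha\nbeta\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(loadLines(tmp.toString()));
    }
}
```

The resulting set could fill a field such as prefixCandidates in the mapper, so that map() only does lookups.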