Accessing a file in a Mapper through the Distributed Cache - hadoop

I want to access the contents of a distributed cache file in my Mapper. Below is the code I have written, which emits the name of the file in the Distributed Cache. Please help me access the contents of the file.
public class DistCacheExampleMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text >
{
Text a = new Text();
Path[] dates = new Path[0];
public void configure(JobConf conf) {
try {
dates = DistributedCache.getLocalCacheFiles(conf);
String astr = dates.toString();
a = new Text(astr);
} catch (IOException ioe) {
System.err.println("Caught exception while getting cached files: " +
StringUtils.stringifyException(ioe));
}
}
@Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
String line = value.toString();
for(Path cacheFile: dates){
output.collect(new Text(line), new Text(cacheFile.getName()));
}
}
}

Try this instead in your configure() method:
List<String []> lines;
Path[] files = new Path[0];
public void configure(JobConf conf) {
lines = new ArrayList<>();
BufferedReader SW;
try {
files = DistributedCache.getLocalCacheFiles(conf);
SW = new BufferedReader(new FileReader(files[0].toString()));
String line;
while ((line = SW.readLine()) != null) {
lines.add(line.split(",")); //now, each lines entry is a String array, with each element being a column
}
SW.close();
} catch (IOException ioe) {
System.err.println("Caught exception while getting cached files: " +
StringUtils.stringifyException(ioe));
}
}
This way, you will have the contents of the file (in this case the first file in the Distributed Cache) in the variable lines. Each entry of lines is a String array holding one row, split on ','. So the first column of the first row is lines.get(0)[0], the third column of the second row is lines.get(1)[2], etc.
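The indexing described above can be tried outside Hadoop. The following is a minimal sketch in plain Java, assuming a small comma-separated file; the class name CachedLinesDemo and the helper loadLines are hypothetical and only stand in for the loop inside configure().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class CachedLinesDemo {

    // Hypothetical helper: reads a comma-separated file into a list of rows,
    // one String[] per line, mirroring the loop in configure() without Hadoop.
    static List<String[]> loadLines(Path file) throws IOException {
        List<String[]> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line.split(","));
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Temporary file standing in for the localized cache file.
        Path tmp = Files.createTempFile("cache", ".csv");
        Files.write(tmp, List.of("a,b,c", "d,e,f"), StandardCharsets.UTF_8);

        List<String[]> lines = loadLines(tmp);
        System.out.println(lines.get(0)[0]); // first column of the first row
        System.out.println(lines.get(1)[2]); // third column of the second row
        Files.delete(tmp);
    }
}
```

The try-with-resources block closes the reader even if reading fails, which the original configure() does not guarantee.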

Related

How to remove the r-00000 extension from reducer output in mapreduce

I am able to rename my reducer output file correctly, but the r-00000 suffix still persists.
I have used MultipleOutputs in my reducer class.
Here are the details. I am not sure what I am missing or what else I have to do.
public class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private Logger logger = Logger.getLogger(MyReducer.class);
private MultipleOutputs<NullWritable, Text> multipleOutputs;
String strName = "";
public void setup(Context context) {
logger.info("Inside Reducer.");
multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
}
@Override
public void reduce(NullWritable Key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
for (Text value : values) {
final String valueStr = value.toString();
StringBuilder sb = new StringBuilder();
sb.append(strArrvalueStr[0] + "|!|");
multipleOutputs.write(NullWritable.get(), new Text(sb.toString()),strName);
}
}
public void cleanup(Context context) throws IOException,
InterruptedException {
multipleOutputs.close();
}
}
I was able to do it explicitly after my job finishes, and that is fine for me; it adds no delay to the job.
if (b){
DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd-HHmm");
Calendar cal = Calendar.getInstance();
String strDate=dateFormat.format(cal.getTime());
FileSystem hdfs = FileSystem.get(getConf());
FileStatus fs[] = hdfs.listStatus(new Path(args[1]));
if (fs != null){
for (FileStatus aFile : fs) {
if (!aFile.isDir()) {
hdfs.rename(aFile.getPath(), new Path(aFile.getPath().toString()+".txt"));
}
}
}
}
A more suitable approach would be to change the OutputFormat.
For example, if you are using TextOutputFormat, take the source code of the TextOutputFormat class and modify the method below (it is defined in the parent class FileOutputFormat) so that it returns the plain file name, without r-00000. Then set the modified output format in the driver.
public synchronized static String getUniqueFile(TaskAttemptContext context, String name, String extension) {
/*TaskID taskId = context.getTaskAttemptID().getTaskID();
int partition = taskId.getId();*/
StringBuilder result = new StringBuilder();
result.append(name);
/*
* result.append('-');
* result.append(TaskID.getRepresentingCharacter(taskId.getTaskType()));
* result.append('-'); result.append(NUMBER_FORMAT.format(partition));
* result.append(extension);
*/
return result.toString();
}
So whatever name is passed through MultipleOutputs, the file is created with exactly that name.
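To see what the commented-out lines contribute, here is a rough plain-Java approximation of the naming scheme, not Hadoop's own code. The class UniqueFileNameDemo and both method names are hypothetical; the zero-padded five-digit partition number is inferred from the r-00000 suffix in the question.

```java
import java.util.Locale;

class UniqueFileNameDemo {

    // Approximates the default naming: <name>-<taskType>-<partition><extension>.
    // Hypothetical stand-in for the unmodified getUniqueFile().
    static String defaultName(String name, char taskType, int partition, String extension) {
        String paddedPartition = String.format(Locale.ROOT, "%05d", partition);
        return name + '-' + taskType + '-' + paddedPartition + extension;
    }

    // What the modified method in the answer returns: the base name only.
    static String strippedName(String name) {
        return name;
    }

    public static void main(String[] args) {
        System.out.println(defaultName("report", 'r', 0, "")); // report-r-00000
        System.out.println(strippedName("report"));            // report
    }
}
```

One design consequence: the partition number is what keeps the output of different reducers apart. With the suffix removed, two reducers writing the same base name may collide, so this approach fits best when each name is written by a single reducer.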

Bloom Filter in MapReduce

I have to use a Bloom filter in a reduce-side join algorithm to filter one of my inputs, but I have a problem with the readFields function, which deserializes the input stream of a distributed cache file (the Bloom filter) into a Bloom filter object.
public class BloomJoin {
//function map : input transaction.txt
public static class TransactionJoin extends
Mapper<LongWritable, Text, Text, Text> {
private Text CID=new Text();
private Text outValue=new Text();
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String record[] = line.split(",", -1);
CID.set(record[1]);
outValue.set("A"+value);
context.write(CID, outValue);
}
}
//function map : input customer.txt
public static class CustomerJoinMapper extends
Mapper<LongWritable, Text, Text, Text> {
private Text outkey=new Text();
private Text outvalue = new Text();
private BloomFilter bfilter = new BloomFilter();
public void setup(Context context) throws IOException {
URI[] files = DistributedCache
.getCacheFiles(context.getConfiguration());
// if the files in the distributed cache are set
if (files != null) {
System.out.println("Reading Bloom filter from: "
+ files[0].getPath());
// Open local file for read.
DataInputStream strm = new DataInputStream(new FileInputStream(
files[0].toString()));
bfilter.readFields(strm);
strm.close();
// Read into our Bloom filter.
} else {
throw new IOException(
"Bloom filter file not set in the DistributedCache.");
}
};
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String record[] = line.split(",", -1);
outkey.set(record[0]);
if (bfilter.membershipTest(new Key(outkey.getBytes()))) {
outvalue.set("B"+value);
context.write(outkey, outvalue);
}
}
}
//function reducer: join customer with transaction
public static class JoinReducer extends
Reducer<Text, Text, Text, Text> {
private ArrayList<Text> listA = new ArrayList<Text>();
private ArrayList<Text> listB = new ArrayList<Text>();
@Override
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
listA.clear();
listB.clear();
for (Text t : values) {
if (t.charAt(0) == 'A') {
listA.add(new Text(t.toString().substring(1)));
System.out.println("liste A: "+listA);
} else /* if (t.charAt('0') == 'B') */{
listB.add(new Text(t.toString().substring(1)));
System.out.println("listeB :"+listB);
}
}
executeJoinLogic(context);
}
private void executeJoinLogic(Context context) throws IOException,
InterruptedException {
if (!listA.isEmpty() && !listB.isEmpty()) {
for (Text A : listB) {
for (Text B : listA) {
context.write(A, B);
System.out.println("A="+A+",B="+B);
}
}
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Path bloompath=new Path("/user/biadmin/ezzaki/bloomfilter/output/part-00000");
DistributedCache.addCacheFile(bloompath.toUri(),conf);
Job job = new Job(conf, "Bloom Join");
job.setJarByClass(BloomJoin.class);
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 3) {
System.err
.println("ReduceSideJoin <Transaction data> <Customer data> <out> ");
System.exit(1);
}
MultipleInputs.addInputPath(job, new Path(otherArgs[0]),
TextInputFormat.class,TransactionJoin.class);
MultipleInputs.addInputPath(job, new Path(otherArgs[1]),
TextInputFormat.class, CustomerJoinMapper.class);
job.setReducerClass(JoinReducer.class);
FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
//job.setMapOutputKeyClass(Text.class);
//job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
System.exit(job.waitForCompletion(true) ? 0 : 3);
}
}
How can I solve this problem?
Can you try changing
URI[] files = DistributedCache.getCacheFiles(context.getConfiguration());
to
// getLocalCacheFiles() returns the paths of the copies localized on the task node.
Path[] cacheFilePaths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
FileSystem fs = FileSystem.getLocal(context.getConfiguration());
// Read the first cached file into the Bloom filter declared in the mapper.
DataInputStream fileInputStream = fs.open(cacheFilePaths[0]);
bfilter.readFields(fileInputStream);
fileInputStream.close();
Also, I think this is closer to a map-side join than a reduce-side join, since you are using the distributed cache in the Mapper.
You can use a Bloom Filter from here:
https://github.com/odnoklassniki/apache-cassandra/blob/master/src/java/org/apache/cassandra/utils/BloomFilter.java
It comes with a dedicated serializer:
https://github.com/odnoklassniki/apache-cassandra/blob/master/src/java/org/apache/cassandra/utils/BloomFilterSerializer.java
You can serialize like this:
Path file = new Path(bloomFilterPath);
FileSystem hdfs = file.getFileSystem(context.getConfiguration());
OutputStream os = hdfs.create(file);
BloomFilterSerializer serializer = new BloomFilterSerializer();
serializer.serialize(bloomFilter, new DataOutputStream(os));
And deserialize:
Path path = new Path(bloomFilterPath);
InputStream is = path.getFileSystem(context.getConfiguration()).open(path);
BloomFilterSerializer serializer = new BloomFilterSerializer();
BloomFilter bloomFilter = serializer.deserialize(
new DataInputStream(new BufferedInputStream(is)));
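The serialize/deserialize round trip above can be checked without a cluster. The sketch below uses a toy Bloom filter written for this illustration (it is neither Hadoop's BloomFilter nor the Cassandra class linked above), and every name in it is hypothetical. It shows the one rule that matters for readFields-style code: the reader must consume the fields in exactly the order and format the writer produced them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

// Toy Bloom filter for illustration only; not Hadoop's or Cassandra's class.
class TinyBloomFilter {
    private final int numBits;
    private final int numHashes;
    private final BitSet bits;

    TinyBloomFilter(int numBits, int numHashes) {
        this(numBits, numHashes, new BitSet(numBits));
    }

    private TinyBloomFilter(int numBits, int numHashes, BitSet bits) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = bits;
    }

    // Derives the i-th bit position from two simple hashes (double hashing).
    private int position(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = 17;
        for (byte b : key) {
            h2 = h2 * 131 + b;
        }
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    // False means "definitely absent"; true means "possibly present".
    boolean mightContain(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Writes the parameters first, then the bit array, so read() can mirror it.
    void write(DataOutputStream out) throws IOException {
        byte[] raw = bits.toByteArray();
        out.writeInt(numBits);
        out.writeInt(numHashes);
        out.writeInt(raw.length);
        out.write(raw);
    }

    // Reads the fields in exactly the order write() produced them.
    static TinyBloomFilter read(DataInputStream in) throws IOException {
        int numBits = in.readInt();
        int numHashes = in.readInt();
        byte[] raw = new byte[in.readInt()];
        in.readFully(raw);
        return new TinyBloomFilter(numBits, numHashes, BitSet.valueOf(raw));
    }

    // Serializes the filter to memory and reads it back.
    static TinyBloomFilter roundTrip(TinyBloomFilter filter) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            filter.write(out);
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return read(in);
        }
    }

    public static void main(String[] args) throws IOException {
        TinyBloomFilter filter = new TinyBloomFilter(1024, 3);
        filter.add("customer-42".getBytes(StandardCharsets.UTF_8));
        TinyBloomFilter copy = roundTrip(filter);
        // A key that was added is always reported as possibly present.
        System.out.println(copy.mightContain("customer-42".getBytes(StandardCharsets.UTF_8)));
    }
}
```

If a file was produced by one serializer and is read by another (or was written as text rather than binary), the first readInt() already returns nonsense, which is a common cause of deserialization failures like the one described in the question.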

Hadoop Mapreduce: Custom Input Format

I have a file with data having text and "^" in between:
SOME TEXT^GOES HERE^
AND A FEW^MORE
GOES HERE
I am writing a custom input format to delimit the rows using the "^" character, i.e. the output of the mapper should look like:
SOME TEXT
GOES HERE
AND A FEW
MORE GOES HERE
I have written a custom input format that extends FileInputFormat, and a custom record reader that extends RecordReader. The code for my custom record reader is given below. I don't know how to proceed with it: I am having trouble with the while loop in the nextKeyValue() method. How should I read the data from a split and generate my custom key-value pairs? I am using the new mapreduce package throughout, not the old mapred package.
public class MyRecordReader extends RecordReader<LongWritable, Text>
{
long start, current, end;
Text value;
LongWritable key;
LineReader reader;
FileSplit split;
Path path;
FileSystem fs;
FSDataInputStream in;
Configuration conf;
@Override
public void initialize(InputSplit inputSplit, TaskAttemptContext cont) throws IOException, InterruptedException
{
conf = cont.getConfiguration();
split = (FileSplit)inputSplit;
path = split.getPath();
fs = path.getFileSystem(conf);
in = fs.open(path);
reader = new LineReader(in, conf);
start = split.getStart();
current = start;
end = split.getLength() + start;
}
@Override
public boolean nextKeyValue() throws IOException
{
if(key==null)
key = new LongWritable();
key.set(current);
if(value==null)
value = new Text();
long readSize = 0;
while(current<end)
{
Text tmpText = new Text();
readSize = read //here how should i read data from the split, and generate key-value?
if(readSize==0)
break;
current+=readSize;
}
if(readSize==0)
{
key = null;
value = null;
return false;
}
return true;
}
@Override
public float getProgress() throws IOException
{
}
@Override
public LongWritable getCurrentKey() throws IOException
{
}
@Override
public Text getCurrentValue() throws IOException
{
}
@Override
public void close() throws IOException
{
}
}
There is no need to implement that yourself. You can simply set the configuration value textinputformat.record.delimiter to be the circumflex character.
conf.set("textinputformat.record.delimiter", "^");
This should work fine with the normal TextInputFormat.
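To preview what records a "^" delimiter yields for the sample data, here is a plain-Java approximation. It is not Hadoop's record reader; the class DelimiterDemo and the method records are hypothetical, and it assumes the reader splits purely on the delimiter without treating line breaks specially.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class DelimiterDemo {

    // Splits the text on a literal delimiter; Pattern.quote() is needed
    // because "^" is a metacharacter in regular expressions.
    static List<String> records(String data, String delimiter) {
        return new ArrayList<>(Arrays.asList(data.split(Pattern.quote(delimiter))));
    }

    public static void main(String[] args) {
        String data = "SOME TEXT^GOES HERE^\nAND A FEW^MORE\nGOES HERE";
        for (String record : records(data, "^")) {
            // Make embedded line breaks visible in the printed output.
            System.out.println("[" + record.replace("\n", "\\n") + "]");
        }
    }
}
```

Under this assumption the line breaks stay inside the records (the last one is "MORE", a line break, then "GOES HERE"), so the mapper would still need to replace or trim them to obtain the "MORE GOES HERE" form shown in the question.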

Distributed cache not working

I am storing a small amount of data (a few MB) in the distributed cache and using it to perform an anti-join with two big files. With a few lines of data in the cache the functionality works fine, but when the cache holds more data in production the job does not do its work, and it does not throw any error either. Only a few records (around 20%) get joined and the others are ignored. Is there an upper limit on the number of records that can be stored in the distributed cache? Why does it work for some of the records and ignore the rest? Any suggestion will be extremely helpful.
Below is my code:
public class MyMapper extends Mapper<LongWritable, Text, Text, TextPair> {
Text albumKey = new Text();
Text photoKey = new Text();
private HashSet<String> photoDeleted = new HashSet<String>();
private HashSet<String> albDeleted = new HashSet<String>();
Text interKey = new Text();
private TextPair interValue = new TextPair();
private static final Logger LOGGER = Logger.getLogger(SharedStreamsSlMapper.class);
protected void setup(Context context) throws IOException, InterruptedException {
int count=0;
Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
System.out.println(cacheFiles.length);
LOGGER.info(cacheFiles+"****");
try {
if (cacheFiles != null && cacheFiles.length > 0) {
for (Path path : cacheFiles) {
String line;
String[] tokens;
BufferedReader joinReader = new BufferedReader(new FileReader(path.toString()));
System.out.println(path.toString());
// BufferedReader joinReader = new BufferedReader(new FileReader("/Users/Kunal_Basak/Desktop/ss_test/dsitCache/part-m-00000"));
try {
while ((line = joinReader.readLine()) != null) {
count++;
tokens = line.split(SSConstants.TAB, 2);
if(tokens.length<2){
System.out.println("WL");
continue;
}
if (tokens[0].equals("P")) {
photoDeleted.add(tokens[1]);
}
else if (tokens[0].equals("A")) {
albDeleted.add(tokens[1]);
}
}
}
finally {
joinReader.close();
}
}
}
}
catch (IOException e) {
System.out.println("Exception reading DistributedCache: " + e);
}
System.out.println(count);
System.out.println("albdeleted *****"+albDeleted.size());
System.out.println("photo deleted *****"+photoDeleted.size());
LOGGER.info("albdeleted *****"+albDeleted.size());
LOGGER.info("photo deleted *****"+photoDeleted.size());
}
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
try{
//my mapper code
}
}
}
According to this blog article:
The local.cache.size parameter controls the size of the
DistributedCache.
By default, it’s set to 10 GB.
So if you have more than 10GB in the cache, that may be your problem.

Distributed Cache Hadoop not retrieving the file content

I am getting a garbage-like value instead of the data from the file I want to use as a distributed cache file.
The Job Configuration is as follows:
Configuration config5 = new Configuration();
JobConf conf5 = new JobConf(config5, Job5.class);
conf5.setJobName("Job5");
conf5.setOutputKeyClass(Text.class);
conf5.setOutputValueClass(Text.class);
conf5.setMapperClass(MapThree4c.class);
conf5.setReducerClass(ReduceThree5.class);
conf5.setInputFormat(TextInputFormat.class);
conf5.setOutputFormat(TextOutputFormat.class);
DistributedCache.addCacheFile(new URI("/home/users/mlakshm/ap1228"), conf5);
FileInputFormat.setInputPaths(conf5, new Path(other_args.get(5)));
FileOutputFormat.setOutputPath(conf5, new Path(other_args.get(6)));
JobClient.runJob(conf5);
In the Mapper, I have the following code:
public class MapThree4c extends MapReduceBase implements Mapper<LongWritable, Text,
Text, Text >{
private Set<String> prefixCandidates = new HashSet<String>();
Text a = new Text();
public void configure(JobConf conf5) {
Path[] dates = new Path[0];
try {
dates = DistributedCache.getLocalCacheFiles(conf5);
System.out.println("candidates: "+candidates);
String astr = dates.toString();
a = new Text(astr);
} catch (IOException ioe) {
System.err.println("Caught exception while getting cached files: " +
StringUtils.stringifyException(ioe));
}
}
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer st = new StringTokenizer(line);
st.nextToken();
String t = st.nextToken();
String uidi = st.nextToken();
String uidj = st.nextToken();
String check = null;
output.collect(new Text(line), a);
}
}
The output value I am getting from this mapper is: [Lorg.apache.hadoop.fs.Path;@786c1a82
instead of the value from the distributed cache file.
That looks like what you get when you call toString() on an array, and according to the Javadoc, an array of Path is exactly what DistributedCache.getLocalCacheFiles() returns. If you need to actually read the contents of the files in the cache, you can open and read them with the standard Java APIs.
From your code:
Path[] dates = DistributedCache.getLocalCacheFiles(conf5);
Implies that:
String astr = dates.toString(); // string form of the array reference itself (dates), which is what you see in the output as [Lorg.apache.hadoop.fs.Path;@786c1a82
You need to do the following to see the actual paths:
for(Path cacheFile: dates){
output.collect(new Text(line), new Text(cacheFile.getName()));
}
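The same effect can be reproduced with any Java array. This is a small sketch in plain Java; the path string is a made-up placeholder, and the class name ArrayToStringDemo and the method readable are hypothetical.

```java
import java.util.Arrays;

class ArrayToStringDemo {

    // Arrays.toString() formats the elements rather than the array reference.
    static String readable(Object[] items) {
        return Arrays.toString(items);
    }

    public static void main(String[] args) {
        String[] paths = {"/tmp/cache/ap1228"}; // placeholder value

        // Prints the type tag and an identity hash, e.g. [Ljava.lang.String;@1b6d3586
        System.out.println(paths.toString());

        // Prints the elements: [/tmp/cache/ap1228]
        System.out.println(readable(paths));
    }
}
```

Either way, this only yields the file paths. To emit the file contents, the mapper has to open each path and read it, for example with a BufferedReader in configure().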

Resources