I want to store data emitted by a Storm spout in HDFS. I have added Hadoop FS API code to the bolt class, but it is throwing an error with Storm.
Following is the Storm bolt class:
package bolts;
import java.io.*;
import java.util.*;
import java.net.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
public class DataNormalizer extends BaseBasicBolt {
public void execute(Tuple input, BasicOutputCollector collector) {
String sentence = input.getString(0);
String[] process = sentence.split(" ");
int n = 1;
String rec = "";
try {
String source = "/root/data/top_output.csv";
String dest = "hdfs://localhost:9000/user/root/nishu/top_output/top_output_1.csv";
Configuration conf = new Configuration();
FileSystem fileSystem = FileSystem.get(conf);
System.out.println(fileSystem);
Path srcPath = new Path(source);
Path dstPath = new Path(dest);
String filename = source.substring(source.lastIndexOf('/') + 1,
source.length());
try {
if (!(fileSystem.exists(dstPath))) {
FSDataOutputStream out = fileSystem.create(dstPath, true);
InputStream in = new BufferedInputStream(
new FileInputStream(new File(source)));
byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
out.write(b, 0, numBytes);
}
in.close();
out.close();
} else {
fileSystem.copyFromLocalFile(srcPath, dstPath);
}
} catch (Exception e) {
System.err.println("Exception caught! :" + e);
System.exit(1);
} finally {
fileSystem.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
I have also added the Hadoop jars to the CLASSPATH.
Following is the value of the classpath:
$STORM_HOME/storm-0.8.1.jar:$JAVA_HOME/lib/:$HADOOP_HOME/hadoop-core-1.0.4.jar:$HADOOP_HOME/lib/:$STORM_HOME/lib/
I also copied the Hadoop libraries hadoop-core-1.0.4.jar, commons-collections-3.2.1.jar and commons-cli-1.2.jar into the Storm/lib directory.
When I build this project, it throws the following error:
3006 [Thread-16] ERROR backtype.storm.daemon.executor -
java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:466)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1494)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1395)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:123)
at bolts.DataNormalizer.execute(DataNormalizer.java:67)
at backtype.storm.topology.BasicBoltExecutor.execute(BasicBoltExecutor.java:32)
......................
The error message tells you that Apache Commons Configuration is missing. You have to add it to the classpath.
More generally, you should add all Hadoop dependencies to your classpath. You can find them using a dependency manager (Maven, Ivy, Gradle, etc.) or look into /usr/lib/hadoop/lib on a machine on which Hadoop is installed.
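Note that a bare directory entry like $HADOOP_HOME/lib/ on the classpath only picks up .class files, not jars, so the jars have to be listed individually (or via a wildcard such as $HADOOP_HOME/lib/*), or copied next to Storm's own jars. A minimal sketch, assuming the commons-configuration version that ships with Hadoop 1.0.x (adjust the version to whatever your distribution actually contains):
cp $HADOOP_HOME/lib/commons-configuration-1.6.jar $STORM_HOME/lib/
Then restart the topology so the workers pick up the new jar, and repeat for any other commons-* jar that shows up in a later NoClassDefFoundError.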
I have a code module that has supporting jars as content elements. Whenever I change the Java code and want to upload the new jar, I check the code module out and check it back in. But in this process, when I check out, all the supporting jars also need to be added again manually. Is there a way to check out just the jar I want to update, leaving the supporting jars behind?
I wrote some simple code to achieve this. Hope it helps others:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.security.auth.Subject;
import com.filenet.api.admin.CodeModule;
import com.filenet.api.collection.ContentElementList;
import com.filenet.api.constants.AutoClassify;
import com.filenet.api.constants.CheckinType;
import com.filenet.api.constants.RefreshMode;
import com.filenet.api.constants.ReservationType;
import com.filenet.api.core.Connection;
import com.filenet.api.core.ContentTransfer;
import com.filenet.api.core.Domain;
import com.filenet.api.core.Factory;
import com.filenet.api.core.ObjectStore;
import com.filenet.api.util.UserContext;
public class UpdateCodeModule {
public static void main( String[] arg){
com.filenet.api.admin.CodeModule module=null;
try{
Connection conn = Factory.Connection.getConnection("http://server:port/wsi/FNCEWS40MTOM");
Subject subject = UserContext.createSubject(conn, "user_id", "password", "FileNetP8WSI");
UserContext uc = UserContext.get();
uc.pushSubject(subject);
Domain domain = Factory.Domain.fetchInstance(conn, "domain_name", null);
ObjectStore os = Factory.ObjectStore.fetchInstance(domain, "objectstore name", null);
module = Factory.CodeModule.getInstance(os, "/CodeModules/codemodulename");
module.clearPendingActions();
module.checkout(ReservationType.OBJECT_STORE_DEFAULT, null, "CodeModule", null);
module.save(RefreshMode.REFRESH);
ContentElementList element = Factory.ContentElement.createList();
CodeModule reservation = (CodeModule) module.get_Reservation();
File[] jars = new File("path to jars").listFiles();
// keep codemodule.jar as the first content element, followed by the supporting jars
LinkedHashMap<String, File> linkedMap = new LinkedHashMap<String, File>();
for (File file : jars) {
if (file.getName().equalsIgnoreCase("codemodule.jar")) {
linkedMap.put(file.getName(), file);
}
}
for (File file : jars) {
if (!file.getName().equalsIgnoreCase("codemodule.jar")) {
linkedMap.put(file.getName(), file);
}
}
for (Map.Entry<String, File> file : linkedMap.entrySet()) {
ContentTransfer ct = Factory.ContentTransfer.createInstance();
try {
ct.setCaptureSource(new FileInputStream(file.getValue()));
ct.set_RetrievalName(file.getKey());
} catch (FileNotFoundException e) {
e.printStackTrace();
}
element.add(ct);
}
reservation.set_ContentElements(element);
reservation.checkin(AutoClassify.DO_NOT_AUTO_CLASSIFY, CheckinType.MAJOR_VERSION);
reservation.save(RefreshMode.REFRESH);
reservation.promoteVersion();
}catch(Exception e){
e.printStackTrace();
} finally {
// guard against an NPE if fetching the code module failed above
if (module != null) {
module.cancelCheckout();
}
}
}
}
I am trying to load data from a flat file into HBase through the API, but I am getting the following error:
========================================================
java.lang.NumberFormatException
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at org.apache.hadoop.hbase.HServerAddress.<init>(HServerAddress.java:63)
at org.apache.hadoop.hbase.MasterAddressTracker.getMasterAddress(MasterAddressTracker.java:63)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:354)
at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:94)
at Hbase.readFromFile.main(readFromFile.java:16)
Code :
package Hbase;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
public class readFromFile {
public static void main(String[] args) throws IOException{
if(args.length==1)
{
Configuration conf = HBaseConfiguration.create(new Configuration());
HBaseAdmin hba = new HBaseAdmin(conf);
if(!hba.tableExists(args[0])){
HTableDescriptor ht = new HTableDescriptor(args[0]);
ht.addFamily(new HColumnDescriptor("sample"));
ht.addFamily(new HColumnDescriptor("region"));
ht.addFamily(new HColumnDescriptor("time"));
ht.addFamily(new HColumnDescriptor("product"));
ht.addFamily(new HColumnDescriptor("sale"));
ht.addFamily(new HColumnDescriptor("profit"));
hba.createTable(ht);
System.out.println("New Table Created");
HTable table = new HTable(conf,args[0]);
File f = new File("/home/training/Desktop/data");
BufferedReader br = new BufferedReader(new FileReader(f));
String line = br.readLine();
int i =1;
String rowname="row";
while(line!=null && line.length()!=0){
System.out.println("Ok till here");
StringTokenizer tokens = new StringTokenizer(line,",");
rowname = "row"+i;
Put p = new Put(Bytes.toBytes(rowname));
p.add(Bytes.toBytes("sample"),Bytes.toBytes("sampleNo."),
Bytes.toBytes(Integer.parseInt(tokens.nextToken())));
p.add(Bytes.toBytes("region"),Bytes.toBytes("country"),
Bytes.toBytes(tokens.nextToken()));
p.add(Bytes.toBytes("region"),Bytes.toBytes("state"),
Bytes.toBytes(tokens.nextToken()));
p.add(Bytes.toBytes("region"),Bytes.toBytes("city"),
Bytes.toBytes(tokens.nextToken()));
p.add(Bytes.toBytes("time"),Bytes.toBytes("year"),
Bytes.toBytes(Integer.parseInt(tokens.nextToken())));
p.add(Bytes.toBytes("time"),Bytes.toBytes("month"),
Bytes.toBytes(tokens.nextToken()));
p.add(Bytes.toBytes("product"),Bytes.toBytes("productNo."),
Bytes.toBytes(tokens.nextToken()));
p.add(Bytes.toBytes("sale"),Bytes.toBytes("quantity"),
Bytes.toBytes(Integer.parseInt(tokens.nextToken())));
p.add(Bytes.toBytes("profit"),Bytes.toBytes("earnings"),
Bytes.toBytes(tokens.nextToken()));
i++;
table.put(p);
line = br.readLine();
}
br.close();
table.close();
}
else
System.out.println("Table Already exists.
Please enter another table name");
}
else
System.out.println("Please Enter the table
name through command line");
}
}
Please let me know whether we need to add any suitable jars. I am using the Cloudera quickstart VM 5.4.2-0.
Thanks,
VJ
If you read the error, it says that the Integer.parseInt method raised a NumberFormatException. This means that you attempted to convert a String of invalid format into an Integer. In your code, you call that method in this line:
Bytes.toBytes(Integer.parseInt(tokens.nextToken())));
You need to look at the tokens you're passing into this method via tokens.nextToken() and ensure that each one can actually be parsed as an integer.
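For example, a minimal sketch of a defensive parse (illustrative only, not part of the original code; it reuses the tokens, i and p variables and the "sample" column family from the question):
String token = tokens.nextToken().trim();
int sampleNo;
try {
sampleNo = Integer.parseInt(token);
} catch (NumberFormatException nfe) {
System.err.println("Line " + i + ": not a number: '" + token + "'");
throw nfe; // or skip the line instead, depending on how dirty the input is
}
p.add(Bytes.toBytes("sample"), Bytes.toBytes("sampleNo."), Bytes.toBytes(sampleNo));
A header row or stray whitespace in the input file is a common cause of this exception.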
I think the problem is with the Cloudera jar versions used. Please check that; it should work.
I have written a MapReduce program that reads data from a Hive table using HCatalog and writes it into HBase. This is a map-only job with no reducers. I have run the program from the command line and it works as expected (I created a fat jar to avoid jar issues). I want to integrate it with Oozie (with the help of Hue). I have two options to run it:
Use Mapreduce Action
Use Java Action
My MapReduce program has a driver method that holds the code below:
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hive.hcatalog.data.schema.HCatSchema;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hive.hcatalog.mapreduce.HCatOutputFormat;
public class HBaseValdiateInsertDriver {
public static void main(String[] args) throws Exception {
String dbName = "Test";
String tableName = "emp";
Configuration conf = new Configuration();
args = new GenericOptionsParser(conf, args).getRemainingArgs();
Job job = new Job(conf, "HBase Get Put Demo");
job.setInputFormatClass(HCatInputFormat.class);
HCatInputFormat.setInput(job, dbName, tableName, null);
job.setJarByClass(HBaseValdiateInsertDriver.class);
job.setMapperClass(HBaseValdiateInsert.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job, new Path("maprfs:///user/input"));
FileOutputFormat.setOutputPath(job, new Path("maprfs:///user/output"));
job.waitForCompletion(true);
}
}
How do I specify the driver method in Oozie? All I can see is how to specify the mapper and reducer classes. Can someone guide me on how to set the properties?
Using the Java action I can specify my driver class as the main class and get it executed, but I face errors like table not found, HCatalog jars not found, etc. I have included hive-site.xml in the workflow (using Hue), but I feel the system is not able to pick up the properties. Can someone advise me on what I have to take care of? Are there any other configuration properties I need to include?
Also, the sample program I referred to on the Cloudera website uses
HCatInputFormat.setInput(job, InputJobInfo.create(dbName,
inputTableName, null));
whereas I use the below (I don't see a method that accepts the above input):
HCatInputFormat.setInput(job, dbName, tableName, null);
Below is my mapper code
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hive.hcatalog.data.HCatRecord;
public class HBaseValdiateInsert extends Mapper<WritableComparable, HCatRecord, Text, Text> {
static HTableInterface table;
static HTableInterface inserted;
private String hbaseDate = null;
String existigValue=null;
List<Put> putList = new ArrayList<Put>();
@Override
public void setup(Context context) throws IOException {
Configuration conf = context.getConfiguration();
String tablename = "dev_arch186";
Utils.getHBConnection();
table = Utils.getTable(tablename);
table.setAutoFlushTo(false);
}
@Override
public void cleanup(Context context) {
try {
table.put(putList);
table.flushCommits();
table.close();
} catch (IOException e) {
e.printStackTrace();
}
Utils.closeConnection();
}
@Override
public void map(WritableComparable key, HCatRecord value, Context context) throws IOException, InterruptedException {
String name_hive = (String) value.get(0);
String id_hive = (String) value.get(1);
String rec[] = { name_hive, id_hive }; // rec[0] = row key, rec[1] = value to write, taken from the HCatRecord fields read above
Get g = new Get(Bytes.toBytes(name_hive));
existigValue=getOneRecord(Bytes.toBytes("Info"),Bytes.toBytes("name"),name_hive);
if (existigValue.equalsIgnoreCase("NA") || !existigValue.equalsIgnoreCase(id_hive)) {
Put put = new Put(Bytes.toBytes(rec[0]));
put.add(Bytes.toBytes("Info"),
Bytes.toBytes("name"),
Bytes.toBytes(rec[1]));
put.setDurability(Durability.SKIP_WAL);
putList.add(put);
if(putList.size()>25000){
table.put(putList);
table.flushCommits();
}
}
}
public String getOneRecord(byte[] columnFamily, byte[] columnQualifier, String rowKey)
throws IOException {
Get get = new Get(rowKey.getBytes());
get.setMaxVersions(1);
Result rs = table.get(get);
rs.getColumn(columnFamily, columnQualifier);
System.out.println(rs.containsColumn(columnFamily, columnQualifier));
KeyValue result = rs.getColumnLatest(columnFamily,columnQualifier);
if (rs.containsColumn(columnFamily, columnQualifier))
return (Bytes.toString(result.getValue()));
else
return "NA";
}
public boolean columnQualifierExists(String tableName, String ColumnFamily,
String ColumnQualifier, String rowKey) throws IOException {
Get get = new Get(rowKey.getBytes());
Result rs = table.get(get);
return(rs.containsColumn(ColumnFamily.getBytes(),ColumnQualifier.getBytes()));
}
}
Note:
I use a MapR (M3) cluster with Hue as the interface for Oozie.
Hive version: 1.0
HCatalog version: 1.0
I couldn't find any way to initialize HCatInputFormat from the Oozie map-reduce action, but I have a workaround, described below.
I created a LazyHCatInputFormat by extending HCatInputFormat and overrode the getJobInfo method to handle initialization; it is called as part of the getSplits(..) call.
package com.xyz; // adjust to your own package; it must match mapreduce.job.inputformat.class below
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hive.hcatalog.common.HCatUtil;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hive.hcatalog.mapreduce.InputJobInfo;
public class LazyHCatInputFormat extends HCatInputFormat {
private static void lazyInit(Configuration conf){
try{
if(conf==null){
conf = new Configuration(false);
}
// pick up the Oozie action configuration and the Hive configuration shipped with the workflow
conf.addResource(new Path(System.getProperty("oozie.action.conf.xml")));
conf.addResource(new Path("hive-config.xml"));
// the database/table names are passed in as properties of the Oozie action
String databaseName = conf.get("LazyHCatInputFormat.databaseName");
String tableName = conf.get("LazyHCatInputFormat.tableName");
String partitionFilter = conf.get("LazyHCatInputFormat.partitionFilter");
setInput(conf, databaseName, tableName);
//setFilter(partitionFilter);
//System.out.println("After lazyinit : "+conf.get("mapreduce.lib.hcat.job.info"));
}catch(Exception e){
System.out.println("*** LAZY INIT FAILED ***");
//e.printStackTrace();
}
}
public static InputJobInfo getJobInfo(Configuration conf)
throws IOException {
String jobString = conf.get("mapreduce.lib.hcat.job.info");
if (jobString == null) {
// setInput() was never called (Oozie submitted the job directly), so do it lazily here
lazyInit(conf);
jobString = conf.get("mapreduce.lib.hcat.job.info");
if(jobString == null){
throw new IOException("job information not found in JobContext. HCatInputFormat.setInput() not called?");
}
}
return (InputJobInfo) HCatUtil.deserialize(jobString);
}
}
In the Oozie map-reduce action, it is configured as below:
<property>
<name>mapreduce.job.inputformat.class</name>
<value>com.xyz.LazyHCatInputFormat</value>
</property>
<property>
<name>LazyHCatInputFormat.databaseName</name>
<value>HCAT DatabaseNameHere</value>
</property>
<property>
<name>LazyHCatInputFormat.tableName</name>
<value>HCAT TableNameHere</value>
</property>
This might not be the best implementation, but it is a quick hack to make it work.
While I was going through the book Hadoop in Action, I came across an option which states that rather than adding small files to the distributed cache programmatically, this can be done using the -files generic option.
When I tried this in the setup() of my code, I get a FileNotFoundException at fs.open(), and it shows me a path I am not sure about.
The question is:
If I use the -files generic option, where in HDFS does the file get copied to by default?
The code I am trying to execute is below:
public class JoinMapSide2 extends Configured implements Tool{
/* Program : JoinMapSide2.java
Description : Passing the small file via GenericOptionsParser
hadoop jar JoinMapSide2.jar -files orders.txt .........
Input : /data/patent/orders.txt(local file system), /data/patent/customers.txt
Output : /MROut/JoinMapSide2
Date : 23/03/2015
*/
protected static class MapClass extends Mapper <Text,Text,NullWritable,Text>{
// hash table to store the key+value from the distributed file or the background data
private Hashtable <String, String> joinData = new Hashtable <String, String>();
// setup function for filling up the joinData for each each map() call
protected void setup(Context context) throws IOException, InterruptedException {
String line;
String[] tokens;
FileSystem fs;
FSDataInputStream fdis;
LineReader joinReader;
Configuration conf;
Text buffer = new Text();
// get configuration
conf = context.getConfiguration();
// get file system related to the configuration
fs = FileSystem.get(conf);
// get all the local cache files distributed as part of the job
URI[] localFiles = context.getCacheFiles();
System.out.println("Cache File Path:"+localFiles[0].toString());
// check if there are any distributed files
// in our case we are sure we will always one so use that only
if (localFiles.length > 0){
// since the file is now on HDFS FSDataInputStream to read through the file
fdis = fs.open(new Path(localFiles[0].toString()));
joinReader = new LineReader(fdis);
// read local file until EOF
try {
while (joinReader.readLine(buffer) > 0) {
line = buffer.toString();
// apply the split pattern only once
tokens = line.split(",",2);
// add key+value into the Hashtable
joinData.put(tokens[0], tokens[1]);
}
} finally {
joinReader.close();
fdis.close();
}
}
else{
System.err.println("No Cache Files are distributed");
}
}
// map function
protected void map(Text key,Text value, Context context) throws IOException, InterruptedException{
NullWritable kNull = null;
String joinValue = joinData.get(key.toString());
if (joinValue != null){
context.write(kNull, new Text(key.toString() + "," + value.toString() + "," + joinValue));
}
}
}
@Override
public int run(String[] args) throws Exception {
if (args.length < 2){
System.err.println("Usage JoinMapSide -files <smallFile> <inputFile> <outputFile>");
}
Path inFile = new Path(args[0]); // input file(customers.txt)
Path outFile = new Path(args[1]); // output file file
Configuration conf = getConf();
// delimiter for the input file
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
Job job = Job.getInstance(conf, "Map Side Join2");
// this is not used as the small file is distributed to all the nodes in the cluster using
// generic options parser
// job.addCacheFile(disFile.toUri());
FileInputFormat.addInputPath(job, inFile);
FileOutputFormat.setOutputPath(job, outFile);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setJarByClass(JoinMapSide2.class);
job.setMapperClass(MapClass.class);
job.setNumReduceTasks(0);
job.waitForCompletion(true);
return 0;
}
public static void main(String args[]) throws Exception {
int ret = ToolRunner.run(new Configuration(), new JoinMapSide2(), args);
System.exit(ret);
}
}
This is the exception I see in the trace:
Error: java.io.FileNotFoundException: File does not exist: /tmp/hadoop-yarn/staging/shiva/.staging/job_1427126201553_0003/files/orders.txt#orders.txt
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:64)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1795)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1738)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1718)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1690)
I start the job like this:
hadoop jar JoinMapSide2.jar -files orders.txt /data/patent/join/customers.txt /MROut/JoinMapSide2
Any directions would be really helpful. Thanks
First you need to move your orders.txt to HDFS and then you have to use -files.
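If you go that route, the invocation might look like this (the paths are only illustrative, based on the ones used above):
hadoop fs -put orders.txt /user/shiva/orders.txt
hadoop jar JoinMapSide2.jar -files hdfs:///user/shiva/orders.txt /data/patent/join/customers.txt /MROut/JoinMapSide2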
Okay, after some searching around I found out there are two errors in my code above.
I should not be using FSDataInputStream to read the distributed file, since the copy is local to the node running the mapper; I should read it as a plain local File instead.
I should not be using URI.toString(); instead I should use the symbolic link added for my file, which is just orders.txt.
The corrected code is listed below; hope it helps.
package org.samples.hina.training;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.Hashtable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class JoinMapSide2 extends Configured implements Tool{
/* Program : JoinMapSide2.java
Description : To learn Replicated Join using Distributed Cache via Generic Options -files
Input : file:/patent/join/orders1.txt(distributed to all nodes), /data/patent/customers.txt
Output : /MROut/JoinMapSide2
Date : 24/03/2015
*/
protected static class MapClass extends Mapper <Text,Text,NullWritable,Text>{
// hash table to store the key+value from the distributed file or the background data
private Hashtable <String, String> joinData = new Hashtable <String, String>();
// setup function for filling up the joinData for each each map() call
protected void setup(Context context) throws IOException, InterruptedException {
String line;
String[] tokens;
// get all the cache files set in the configuration set in addCacheFile()
URI[] localFiles = context.getCacheFiles();
System.out.println("File1:"+localFiles[0].toString());
// check if there are any distributed files
// in our case we are sure we will always one so use that only
if (localFiles.length > 0){
// read from LOCAL copy
File localFile1 = new File("./orders1.txt");
// created reader to localFile1
BufferedReader joinReader = new BufferedReader(new FileReader(localFile1));
// read local file until EOF
try {
while ((line = joinReader.readLine()) != null){
// apply the split pattern only once
tokens = line.split(",",2);
// add key+value into the Hashtable
joinData.put(tokens[0], tokens[1]);
}
} finally {
joinReader.close();
}
} else{
System.err.println("Local Cache File does not exist");
}
}
// map function
protected void map(Text key,Text value, Context context) throws IOException, InterruptedException{
NullWritable kNull = null;
String joinValue = joinData.get(key.toString());
if (joinValue != null){
context.write(kNull, new Text(key.toString() + "," + value.toString() + "," + joinValue));
}
}
}
@Override
public int run(String[] args) throws Exception {
if (args.length < 2){
System.err.println("Usage JoinMapSide2 <inputFile> <outputFile>");
}
Path inFile = new Path(args[0]); // input file(customers.txt)
Path outFile = new Path(args[1]); // output file file
Configuration conf = getConf();
// delimiter for the input file
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
Job job = Job.getInstance(conf, "Map Side Join2");
// add the files orders1.txt, orders2.txt to distributed cache
// the files added by the Generic Options -files
//job.addCacheFile(disFile1);
FileInputFormat.addInputPath(job, inFile);
FileOutputFormat.setOutputPath(job, outFile);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setJarByClass(JoinMapSide2.class);
job.setMapperClass(MapClass.class);
job.setNumReduceTasks(0);
job.waitForCompletion(true);
return 0;
}
public static void main(String args[]) throws Exception {
int ret = ToolRunner.run(new Configuration(), new JoinMapSide2(), args);
System.exit(ret);
}
}
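With the local-file read in setup(), the job is started the same way as before, for example (the paths are illustrative and mirror the earlier run):
hadoop jar JoinMapSide2.jar -files /data/patent/join/orders1.txt /data/patent/join/customers.txt /MROut/JoinMapSide2
The -files option places a symlink named orders1.txt in each task's working directory, which is exactly what new File("./orders1.txt") resolves to.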
I'm running an EMR activity inside a Data Pipeline to analyze log files, and I get the following error when my pipeline fails:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://10.211.146.177:9000/home/hadoop/temp-output-s3copy-2013-05-24-00 already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:944)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:905)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:905)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:879)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1316)
at com.valtira.datapipeline.stream.CloudFrontStreamLogProcessors.main(CloudFrontStreamLogProcessors.java:216)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
I've tried to delete that folder by adding:
FileSystem fs = FileSystem.get(getConf());
fs.delete(new Path("path/to/file"), true); // delete file, true for recursive
but it does not work. Is there a way to override the FileOutputFormat method from Hadoop in Java? Is there a way to ignore this error in Java?
The path to be deleted changes, as the output directory uses the date in its name.
There are 2 ways to delete it:
Over the shell, try this:
hadoop dfs -rmr hdfs://127.0.0.1:9000/home/hadoop/temp-output-s3copy-*
To do it via Java code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.mortbay.log.Log;
public class FSDeletion {
public static void main(String[] args) {
try {
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
String fsName = conf.get("fs.default.name", "localhost:9000");
String baseDir = "/home/hadoop/";
String outputDirPattern = fsName + baseDir + "temp-output-s3copy-";
Path[] paths = new Path[1];
paths[0] = new Path(baseDir);
FileStatus[] status = fs.listStatus(paths);
Path[] listedPaths = FileUtil.stat2Paths(status);
for (Path p : listedPaths) {
if (p.toString().startsWith(outputDirPattern)) {
Log.info("Attempting to delete : " + p);
boolean result = fs.delete(p, true);
Log.info("Deleted ? : " + result);
}
}
fs.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}