Hadoop setInputPathFilter error - hadoop

I am using Hadoop 0.20.2 (that cannot be changed) and I want to add a filter to my input path. The data looks as follows:
/path1/test_a1
/path1/test_a2
/path1/train_a1
/path1/train_a2
and I only want to process all files with train in them.
A look at the FileInputFormat class suggests using:
FileInputFormat.setInputPathFilter(Job job, Class<? extends PathFilter> filter)
and this is where my problem starts, since PathFilter is an interface - of course, I can extend the interface but then I still do not have an implementation. So instead, I implemented the interface:
class TrainFilter implements PathFilter
{
    public boolean accept(Path path)
    {
        return path.toString().contains("train");
    }
}
When I use TrainFilter as the PathFilter, the code compiles; however, when I run it, I get an exception because the input path gets messed up. Without setting the filter, my code runs through all files below /path1; with the filter set, it throws the error:
InvalidInputException: Input path does not exist hdfs://localhost:9000/path1
Here is how I set it up in the driver code:
job.setMapperClass(....class);
job.setInputFormatClass(....class);
job.setMapOutputKeyClass(...class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPathFilter(job, TrainFilter.class);
FileInputFormat.addInputPath(job, new Path("/path1/"));
FileOutputFormat.setOutputPath(job, new Path("/path2/"));
job.waitForCompletion(true);
Any suggestions of what I am doing wrong here?
EDIT: I found the problem. The first call to the PathFilter is always the directory itself (/path1), and since its name does not contain "train", the directory itself is rejected and thus the exception is thrown. Which brings me to another question: how can I test whether an arbitrary path is a directory? As far as I know, I need a reference to the FileSystem, which is not one of the default parameters of PathFilter.

Alternatively, you may try to loop through all of the files in the given directory and check if the file names begin with "train". For example:
Job job = new Job(conf, "myJob");
List<Path> inputPaths = new ArrayList<Path>();
String basePath = "/user/hadoop/path";
FileSystem fs = FileSystem.get(conf);
FileStatus[] listStatus = fs.globStatus(new Path(basePath + "/train*"));
for (FileStatus fstat : listStatus) {
    inputPaths.add(fstat.getPath());
}
FileInputFormat.setInputPaths(job,
        inputPaths.toArray(new Path[inputPaths.size()]));
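If you prefer to do the matching and filtering in one call, globStatus also has an overload that takes a PathFilter, which is applied to the paths matched by the glob. A small sketch, reusing fs and basePath from above:
FileStatus[] trainFiles = fs.globStatus(new Path(basePath + "/*"),
        new PathFilter() {
            public boolean accept(Path path) {
                // keep only leaf names that start with "train"
                return path.getName().startsWith("train");
            }
        });
The resulting FileStatus array can then be added to the job's input paths the same way as above.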

A quick fix: you can blacklist paths instead of whitelisting them, i.e. return false if the path contains "test".
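A minimal sketch of that variant (the class name is mine); it accepts everything, including the /path1 directory itself, except paths that contain "test":
class TestBlacklistFilter implements PathFilter
{
    public boolean accept(Path path)
    {
        // the directory /path1 does not contain "test", so it passes as well
        return !path.toString().contains("test");
    }
}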

You can get a FileSystem instance by having your filter implement the Configurable interface (or extend the Configured class), and creating a fileSystem instance variable in the setConf method:
class TrainFilter extends Configured implements PathFilter
{
    FileSystem fileSystem;

    public boolean accept(Path path)
    {
        // TODO: use fileSystem here to determine if path is a directory
        return path.toString().contains("train");
    }

    public void setConf(Configuration conf) {
        super.setConf(conf);   // keep the conf so getConf() also works
        if (conf != null) {
            try {
                fileSystem = FileSystem.get(conf);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }
}
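For the directory check itself, the accept method could then look something like this (a sketch; FileStatus.isDir() is the call available in Hadoop 0.20.2):
public boolean accept(Path path)
{
    try {
        // always accept directories so the input path /path1 itself is not filtered out
        if (fileSystem.getFileStatus(path).isDir()) {
            return true;
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    return path.toString().contains("train");
}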

I know this is a very old question, but it helped me find an answer for excluding paths when all the examples I found failed me, such as the one documented here.
I just want to add a warning about the answer that @ChrisWhite gave, because I am working with the Hadoop 3.3.0 API, and using setConf threw a NullPointerException when I tried to pull something out of the configuration.
I found that I can instead keep extends Configured in the PathFilter signature and just call getConf().get(<your configuration parameter name>) whenever I need something from the job configuration.
So, my excluding PathFilter looks like this:
public static class ExcludedPathsFilter extends Configured implements PathFilter {
    public boolean accept(Path includePathGlob) {
        // debugging
        System.out.println("excludedPath parameter is " + getConf().get("excludedPath")
                + ", includePath parameter is " + includePathGlob.toString()
                + " and !includePathGlob.toString().contains(getConf().get(\"excludedPath\")) is "
                + !includePathGlob.toString().contains(getConf().get("excludedPath")));
        return !includePathGlob.toString().contains(getConf().get("excludedPath"));
    }
}
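For completeness, the driver side could look something like this ("excludedPath" is just the arbitrary configuration key the filter above reads, not a built-in Hadoop property):
Configuration conf = job.getConfiguration();
conf.set("excludedPath", "test");
FileInputFormat.setInputPathFilter(job, ExcludedPathsFilter.class);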

Related

Connecting to Accumulo inside a Mapper using Kerberos

I am moving some software from an older Hadoop cluster (which uses username/password authentication) to a newer one, 2.6.0-cdh5.12.0, which has Kerberos authentication enabled.
I have been able to get many of the existing Map/Reduce jobs that use Accumulo for their input and/or output to work fine using a DelegationToken set in the AccumuloInput/OutputFormat classes.
However, I have one job that uses AccumuloInput/OutputFormat for input and output, but also, inside its Mapper.setup() method, connects to Accumulo via Zookeeper so that in the Mapper.map() method it can compare each key/value being processed by Mapper.map() to an entry in another Accumulo table.
I included the relevant code below, which shows the setup() method connecting to Zookeeper using a PasswordToken and then creating an Accumulo table Scanner, which is then used in the map method.
So the question is: how do I replace the use of the PasswordToken with a KerberosToken when setting up the Accumulo scanner in the Mapper.setup() method? I can find no way to "get" the DelegationToken used by the AccumuloInput/OutputFormat classes that I set.
I have tried context.getCredentials().getAllTokens() and looking for a token of type org.apache.accumulo.core.client.security.tokens.AuthenticationToken -- all of the tokens returned here are of type org.apache.hadoop.security.token.Token.
Please note that I typed the code fragments in rather than cut/paste, as the code runs on a network not connected to the internet, so there may be a typo. :)
//****************************
// code in the M/R driver
//****************************
ClientConfiguration accumuloCfg = ClientConfiguration.loadDefault().withInstance("Accumulo1").withZkHosts("zookeeper1");
ZooKeeperInstance inst = new ZooKeeperInstance(accumuloCfg);
AuthenticationToken dt = conn.securityOperations().getDelegationToken(new DelegationTokenConfig());
AccumuloInputFormat.setConnectorInfo(job, username, dt);
AccumuloOutputFormat.setConnectorInfo(job, username, dt);
// other job setup and then
job.waitForCompletion(true)
//****************************
// this is inside the Mapper class of the M/R job
//****************************
private Scanner index_scanner;

public void setup(Context context) {
    Configuration cfg = context.getConfiguration();
    // properties set and passed from M/R Driver program
    String username = cfg.get("UserName");
    String password = cfg.get("Password");
    String accumuloInstName = cfg.get("InstanceName");
    String zookeepers = cfg.get("Zookeepers");
    String tableName = cfg.get("TableName");
    Instance inst = new ZooKeeperInstance(accumuloInstName, zookeepers);
    try {
        AuthenticationToken passwordToken = new PasswordToken(password);
        Connector conn = inst.getConnector(username, passwordToken);
        index_scanner = conn.createScanner(tableName, conn.securityOperations().getUserAuthorizations(username));
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public void map(Key key, Value value, Context context) throws IOException, InterruptedException {
    String uuid = key.getRow().toString();
    index_scanner.clearColumns();
    index_scanner.setRange(Range.exact(uuid));
    for (Entry<Key, Value> entry : index_scanner) {
        // do some processing in here
    }
}
The provided AccumuloInputFormat and AccumuloOutputFormat have a method to set the token in the job configuration: Accumulo*putFormat.setConnectorInfo(job, principal, token). You can also serialize the token to a file in HDFS, using the AuthenticationTokenSerializer, and use the version of the setConnectorInfo method which accepts a file name.
If a KerberosToken is passed in, the job will create a DelegationToken to use, and if a DelegationToken is passed in, it will just use that.
The provided AccumuloInputFormat should handle its own scanner, so normally, you shouldn't have to do that in your Mapper if you've set the configuration properly. However, if you're doing secondary scanning (for something like a join) inside your Mapper, you can inspect the provided AccumuloInputFormat's RecordReader source code for an example of how to retrieve the configuration and construct a Scanner.
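A rough sketch of the token-file route for the secondary scanner (the file path and configuration key names below are placeholders I chose, and the serializer calls assume the Accumulo 1.7+ client API):
// driver side: serialize the delegation token to a file in HDFS
AuthenticationToken dt = conn.securityOperations().getDelegationToken(new DelegationTokenConfig());
byte[] tokenBytes = AuthenticationTokenSerializer.serialize(dt);
Path tokenFile = new Path("/user/myuser/accumulo.token");   // placeholder location
FSDataOutputStream out = FileSystem.get(conf).create(tokenFile);
out.write(tokenBytes);
out.close();
conf.set("accumulo.token.file", tokenFile.toString());      // arbitrary key names
conf.set("accumulo.token.class", dt.getClass().getName());

// Mapper.setup() side: read it back and use it in place of the PasswordToken
Path tokenPath = new Path(cfg.get("accumulo.token.file"));
byte[] raw = new byte[(int) FileSystem.get(cfg).getFileStatus(tokenPath).getLen()];
FSDataInputStream in = FileSystem.get(cfg).open(tokenPath);
in.readFully(raw);
in.close();
AuthenticationToken token = AuthenticationTokenSerializer.deserialize(cfg.get("accumulo.token.class"), raw);
Connector conn = inst.getConnector(username, token);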

Understanding HBase Java Client

I started with HBase a few days back and am going through all the material available online.
I have installed and configured HBase, and the shell commands are working fine.
I got an example of a Java client that gets data from an HBase table, and it executed successfully, but I could not understand how it is working. Nowhere in the code have we mentioned the port or host of the HBase server, so how is it able to fetch the data from the table?
This is my code:
public class RetriveData {
    public static void main(String[] args) throws IOException {
        // Instantiating Configuration class
        Configuration config = HBaseConfiguration.create();
        // Instantiating HTable class
        @SuppressWarnings({ "deprecation", "resource" })
        HTable table = new HTable(config, "emp");
        // Instantiating Get class
        Get g = new Get(Bytes.toBytes("1"));
        // Reading the data
        Result result = table.get(g);
        // Reading values from Result class object
        byte[] value = result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("name"));
        byte[] value1 = result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("city"));
        // Printing the values
        String name = Bytes.toString(value);
        String city = Bytes.toString(value1);
        System.out.println("name: " + name + " city: " + city);
    }
}
The output looks like:
Output:
name: raju city: hyderabad
I agree with Binary Nerd's answer, and am adding some more interesting information for better understanding.
Your question:
I could not understand how it is working. Nowhere in the code have we mentioned the port or host of the HBase server, so how is it able to fetch the data from the table?
Since you are executing this program on the cluster and creating the configuration with
// Instantiating Configuration class
Configuration config = HBaseConfiguration.create()
all the cluster properties will be picked up automatically from inside the cluster, because you are running the HBase Java client program on the cluster itself.
Now try it like below (execute the same program in a different way, from Eclipse on a remote Windows machine, to see the difference between what you did earlier and what you do now).
public static Configuration configuration; // this is a class variable

static { // fill clusternode1,clusternode2,clusternode3 from your cluster
    configuration = HBaseConfiguration.create();
    configuration.set("hbase.zookeeper.property.clientPort", "2181");
    configuration.set("hbase.zookeeper.quorum",
            "clusternode1,clusternode2,clusternode3");
    configuration.set("hbase.master", "clusternode1:600000");
}
Hope this helps you to understand.
If you look at the source code for HBaseConfiguration on GitHub, you can see what it does when you call create():
public static Configuration create() {
    Configuration conf = new Configuration();
    // In case HBaseConfiguration is loaded from a different classloader than
    // Configuration, conf needs to be set with appropriate class loader to resolve
    // HBase resources.
    conf.setClassLoader(HBaseConfiguration.class.getClassLoader());
    return addHbaseResources(conf);
}
Followed by:
public static Configuration addHbaseResources(Configuration conf) {
    conf.addResource("hbase-default.xml");
    conf.addResource("hbase-site.xml");
    checkDefaultsVersion(conf);
    HeapMemorySizeUtil.checkForClusterFreeMemoryLimit(conf);
    return conf;
}
So it's loading the configuration from your HBase configuration files, hbase-default.xml and hbase-site.xml.
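A quick way to see which cluster the client will actually talk to is to print the resolved values after create() (a small check; the property keys below are the standard HBase ones):
Configuration config = HBaseConfiguration.create();
// these values come from hbase-site.xml on the classpath, or the built-in defaults
System.out.println("quorum: " + config.get("hbase.zookeeper.quorum"));
System.out.println("client port: " + config.get("hbase.zookeeper.property.clientPort"));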

Specifying relative path in application.properties in Spring

Is there a way we can look up file resources using a relative path in the application.properties file of a Spring Boot application, as specified below?
spring.datasource.url=jdbc:hsqldb:file:${project.basedir}/db/init
I'm using Spring Boot to build an upload sample and ran into the same problem; I only want to get the project root path (e.g. /spring-boot-upload).
I found out that the code below works:
upload.dir.location=${user.dir}\\uploadFolder
@membersound's answer is just breaking up the hardcoded path into 2 parts, not dynamically resolving the property. I can tell you how to achieve what you're looking for, but you need to understand that there is NO project.basedir when you're running the application as a jar or war. Outside the local workspace, the source code structure doesn't exist.
If you still want to do this for testing, that's feasible and what you need is to manipulate the PropertySources. Your simplest option is as follows:
Define an ApplicationContextInitializer, and set the property there. Something like the following:
public class MyApplicationContextInitializer implements ApplicationContextInitializer<ConfigurableApplicationContext> {

    @Override
    public void initialize(ConfigurableApplicationContext appCtx) {
        try {
            // should be /<path-to-projectBasedir>/build/classes/main/
            File pwd = new File(getClass().getResource("/").toURI());
            String projectDir = pwd.getParentFile().getParentFile().getParent();
            String conf = new File(projectDir, "db/init").getAbsolutePath();
            Map<String, Object> props = new HashMap<>();
            // build the full JDBC URL from the resolved project directory
            props.put("spring.datasource.url", "jdbc:hsqldb:file:" + conf);
            MapPropertySource mapPropertySource = new MapPropertySource("db-props", props);
            appCtx.getEnvironment().getPropertySources().addFirst(mapPropertySource);
        } catch (URISyntaxException e) {
            throw new RuntimeException(e);
        }
    }
}
Looks like you're using Boot, so you can just declare context.initializer.classes=com.example.MyApplicationContextInitializer in your application.properties and Boot will run this class at startup.
Words of caution again:
This will not work outside the local workspace as it depends on the source code structure.
I've assumed a Gradle project structure here /build/classes/main. If necessary, adjust according to your build tool.
If MyApplicationContextInitializer is in the src/test/java, pwd will be <projectBasedir>/build/classes/test/, not <projectBasedir>/build/classes/main/.
Another option is to introduce an intermediate property and inject it where you need it:
your.basedir=${project.basedir}/db/init
spring.datasource.url=jdbc:hsqldb:file:${your.basedir}

@Value("${your.basedir}")
private String file;

new ClassPathResource(file).getURI().toString()

Specify job properties and override properties in hadoop jobs

I have a hadoop (2.2.0) map-reduce job which reads text from a specified path (say INPUT_PATH), and does some processing. I don't want to hardcode the input path (since it comes from some other source which changes each week).
I believe there should be a way in Hadoop to specify an XML properties file when running the job through the command line. How should I do it?
One way I thought was to set an environment variable which points to the location of the properties file and then read this env variable in code and subsequently read the property file. This could work because the value of the env variable can be changed each week without changing the code. But I feel this is an ugly way of loading properties and overrides.
Please let me know the least hacky way of doing this.
There is no inbuilt way to read any configuration file for input/output.
One way I can suggest is to implement a Java M/R Driver program that does the following,
Read the configuration (XML/properties/anything) (Probably generated / updated by the other process)
Set the Job Properties
Submit the Job using your hadoop command (pass the configuration file as argument)
Something like this,
public class SampleMRDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // read the configuration file passed as the first argument
        Properties prop = new Properties();
        prop.loadFromXML(new FileInputStream(args[0]));

        Job job = Job.getInstance(getConf(), "Test Job");
        job.setJarByClass(SampleMRDriver.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(TestMapper.class);
        job.setReducerClass(TestReducer.class);
        FileInputFormat.setInputPaths(job, new Path(prop.getProperty("input_path")));
        FileOutputFormat.setOutputPath(job, new Path(prop.getProperty("output_path")));

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new SampleMRDriver(), args);
    }
}
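You would then run it each week with just the path to that week's properties file, for example (the jar name and package are placeholders):
hadoop jar myjob.jar com.example.SampleMRDriver /configs/week42.xml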

hadoop, how to include third-party jars when running a mapred job

As we know, we need to pack all the needed classes into the job jar and upload it to the server. This is slow, so I would like to know whether there is a way to specify third-party jars when executing a map-reduce job, so that I only have to pack my own classes without the dependencies.
PS: I found there is a "-libjars" option, but I can't figure out how to use it. Here is the link: http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
Those are called generic options.
So, to support those, your job should implement Tool.
Run your job like:
hadoop jar yourfile.jar [mainClass] args -libjars <comma separated list of jars>
Edit:
To implement Tool and extend Configured, you do something like this in your MapReduce application --
public class YourClass extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new YourClass(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        // parse your normal arguments here
        Configuration conf = getConf();
        Job job = new Job(conf, "Name of job");

        // set the mapper/reducer class names etc.
        // set the output data type classes etc.

        // accept the HDFS input and output dirs at run time
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }
}
For me, I had to specify the -libjars option before the job arguments; otherwise it was treated as a plain argument.
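For example (the jar names and paths here are placeholders):
hadoop jar yourfile.jar com.example.YourClass -libjars /local/dep1.jar,/local/dep2.jar /input/path /output/path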
