I'm running an EMR Activity inside a Data Pipeline to analyze log files, and I get the following error when my pipeline fails:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://10.211.146.177:9000/home/hadoop/temp-output-s3copy-2013-05-24-00 already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:944)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:905)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:905)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:879)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1316)
at com.valtira.datapipeline.stream.CloudFrontStreamLogProcessors.main(CloudFrontStreamLogProcessors.java:216)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
I've tried to delete that folder by adding:
FileSystem fs = FileSystem.get(getConf());
fs.delete(new Path("path/to/file"), true); // delete file, true for recursive
but it does not work. Is there a way to override Hadoop's FileOutputFormat check in Java? Is there a way to ignore this error in Java?
The path of the directory to delete changes on every run, because the output directory name includes the date.
There are two ways to delete it.
From the shell, try this:
hadoop dfs -rmr hdfs://127.0.0.1:9000/home/hadoop/temp-output-s3copy-*
To do it via Java code:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.mortbay.log.Log;

public class FSDeletion {

    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            String fsName = conf.get("fs.default.name", "localhost:9000");
            String baseDir = "/home/hadoop/";
            String outputDirPattern = fsName + baseDir + "temp-output-s3copy-";

            Path[] paths = new Path[1];
            paths[0] = new Path(baseDir);

            FileStatus[] status = fs.listStatus(paths);
            Path[] listedPaths = FileUtil.stat2Paths(status);
            for (Path p : listedPaths) {
                if (p.toString().startsWith(outputDirPattern)) {
                    Log.info("Attempting to delete : " + p);
                    boolean result = fs.delete(p, true);
                    Log.info("Deleted ? : " + result);
                }
            }
            fs.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
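If you would rather not track the date-stamped name at all, another option (a minimal sketch, assuming the job is driven through the old org.apache.hadoop.mapred API that appears in the stack trace; runJobWithCleanOutput is a hypothetical helper, not code from the question) is to delete whatever output path the job is configured with right before submitting it:

// Minimal sketch: clear the configured output directory, whatever its
// date-based name is, just before submission so that
// FileOutputFormat.checkOutputSpecs no longer fails.
private static void runJobWithCleanOutput(JobConf jobConf) throws IOException {
    Path outputDir = FileOutputFormat.getOutputPath(jobConf); // the date-stamped path set on the job
    FileSystem fs = outputDir.getFileSystem(jobConf);
    if (fs.exists(outputDir)) {
        fs.delete(outputDir, true); // true = recursive
    }
    JobClient.runJob(jobConf);
}

Whatever path was set with FileOutputFormat.setOutputPath is the one that gets removed, so the date in the directory name never has to be computed a second time.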
I have a Spark job which runs without any issue in spark-shell. I am currently trying to submit this job to YARN using Spark's API.
I am using the class below to run the Spark job:
import java.util.ResourceBundle;

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;

public class SubmitSparkJobToYARNFromJavaCode {

    public static void main(String[] arguments) throws Exception {
        ResourceBundle bundle = ResourceBundle.getBundle("device_compare");
        String accessKey = bundle.getString("accessKey");
        String secretKey = bundle.getString("secretKey");

        String[] args = new String[] {
            // path to my application's JAR file
            // required in yarn-cluster mode
            "--jar",
            "my_s3_path_to_jar",

            // name of my application's main class (required)
            "--class", "com.abc.SampleIdCount",

            // comma separated list of local jars that want
            // SparkContext.addJar to work with
            // "--addJars", arguments[1]
        };

        // create a Hadoop Configuration object
        Configuration config = new Configuration();

        // identify that I will be using Spark as YARN mode
        System.setProperty("SPARK_YARN_MODE", "true");
        System.setProperty("spark.local.dir", "/tmp");

        // create an instance of SparkConf object
        SparkConf sparkConf = new SparkConf();
        sparkConf.set("fs.s3n.awsAccessKeyId", accessKey);
        sparkConf.set("fs.s3n.awsSecretAccessKey", secretKey);
        sparkConf.set("spark.local.dir", "/tmp");

        // create ClientArguments, which will be passed to Client
        ClientArguments cArgs = new ClientArguments(args);

        // create an instance of yarn Client client
        Client client = new Client(cArgs, config, sparkConf);

        // submit Spark job to YARN
        client.run();
    }
}
This is the Spark job I am trying to run:
package com.abc;

import java.util.ResourceBundle;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SampleIdCount {

    private static String accessKey;
    private static String secretKey;

    public SampleIdCount() {
        ResourceBundle bundle = ResourceBundle.getBundle("device_compare");
        accessKey = bundle.getString("accessKey");
        secretKey = bundle.getString("secretKey");
    }

    public static void main(String[] args) {
        System.out.println("Started execution");
        SampleIdCount sample = new SampleIdCount();
        System.setProperty("SPARK_YARN_MODE", "true");
        System.setProperty("spark.local.dir", "/tmp");

        SparkConf conf = new SparkConf();
        {
            conf = new SparkConf().setAppName("SampleIdCount").setMaster("yarn-cluster");
        }

        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", accessKey);
        sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secretKey);

        JavaRDD<String> installedDeviceIdsRDD = sc.emptyRDD();
        installedDeviceIdsRDD = sc.textFile("my_s3_input_path");
        installedDeviceIdsRDD.saveAsTextFile("my_s3_output_path");
        sc.close();
    }
}
When I run my Java code, the Spark job is submitted to YARN, but I face the error below:
Diagnostics: File file:/mnt/tmp/spark-1b86d806-5c8f-4ae6-a486-7b68d46c759a/__spark_libs__8257948728364304288.zip does not exist
java.io.FileNotFoundException: File file:/mnt/tmp/spark-1b86d806-5c8f-4ae6-a486-7b68d46c759a/__spark_libs__8257948728364304288.zip does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:616)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:829)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:606)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:431)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I thought the problem was that the folder /mnt was not available on the slave nodes, so I tried to change the Spark local directory to /tmp by doing the following:
I set the system property spark.local.dir to "/tmp" in my Java code
I set an environment variable with export LOCAL_DIRS=/tmp
I set an environment variable with export SPARK_LOCAL_DIRS=/tmp
None of these had any effect, and I still face the very same error. None of the suggestions in other links helped me either. I am really stuck on this. Any help would be much appreciated. Thanks in advance. Cheers!
I have a huge access log file and I am trying to find the path on the server that is hit the most. It is a traditional word-count problem: count the number of times each path is hit.
However, since a MapReduce job sorts only the keys, not the values, I run a second MR job whose mapper takes the output of the first job as input. I use InverseMapper.java to swap the keys and values, and the identity reducer (Reducer.java) because no aggregation is needed; I just need to sort by the keys (i.e., the values of the first job). Here is my code:
package edu.pitt.cloud.CloudProject;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable.DecreasingComparator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class AccessLogMostHitPath {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        String configPath = "/usr/local/hadoop-2.7.3/etc/hadoop/";
        Path inputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);
        Path finalOutputPath = new Path(args[2]);

        Configuration config = new Configuration(true);
        config.addResource(new Path(configPath + "hdfs-site.xml"));
        config.addResource(new Path(configPath + "core-site.xml"));
        config.addResource(new Path(configPath + "yarn-site.xml"));
        config.addResource(new Path(configPath + "mapred-site.xml"));

        Job job = Job.getInstance(config, "AccessLogMostHitPath");
        job.setJarByClass(AccessLogMostHitPath.class);
        job.setMapperClass(AccessLogMostHitPathMapper.class);
        job.setReducerClass(AccessLogMostHitPathReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        config.set("mapreduce.job.running.map.limit", "2");

        FileInputFormat.addInputPath(job, inputPath);
        job.setInputFormatClass(TextInputFormat.class);
        FileOutputFormat.setOutputPath(job, outputPath);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setNumReduceTasks(1);

        System.out.println("Starting Job Execution ::: AccessLogMostHitPath");
        int code = job.waitForCompletion(true) ? 0 : 1;
        System.out.println("Job Execution Finished ::: AccessLogMostHitPath");
        if (code != 0) {
            System.out.println("First Job Failed");
            System.exit(code);
        }

        FileSystem hdfs = FileSystem.get(config);
        Path successPath = new Path(outputPath + "/_SUCCESS");
        if (hdfs.exists(successPath))
            hdfs.delete(successPath, true);

        Job job2 = Job.getInstance(config, "AccessLogMostHitPathSort");
        job2.setJarByClass(AccessLogMostHitPath.class);
        job2.setMapperClass(InverseMapper.class);
        job2.setReducerClass(Reducer.class);
        //config.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\\t");
        KeyValueTextInputFormat.addInputPath(job2, outputPath);
        job2.setInputFormatClass(KeyValueTextInputFormat.class);
        FileOutputFormat.setOutputPath(job2, finalOutputPath);
        job2.setOutputFormatClass(TextOutputFormat.class);
        job2.setNumReduceTasks(1);
        job2.setMapOutputKeyClass(IntWritable.class);
        job2.setMapOutputValueClass(Text.class);
        job2.setSortComparatorClass(DecreasingComparator.class);
        job2.setOutputKeyClass(IntWritable.class);
        job2.setOutputValueClass(Text.class);
        config.set("mapreduce.job.running.map.limit", "2");

        System.out.println("Starting Job Execution ::: AccessLogMostHitPathSort");
        int code2 = job2.waitForCompletion(true) ? 0 : 1;
        System.out.println("Job Execution Finished ::: AccessLogMostHitPathSort");
        System.exit(code2);
    }
}
I get the exception below when I execute this:
Error: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.IntWritable, received org.apache.hadoop.io.Text
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1072)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:715)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at org.apache.hadoop.mapreduce.lib.map.InverseMapper.map(InverseMapper.java:36)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Where is this going wrong? I can see that somewhere there is a mismatch in the key or value type, but I have cross-checked everything. Please help.
The problem is KeyValueTextInputFormat. It is a text input format: it reads both the key and the value as Text. But you declared that job2's map output types are IntWritable and Text:
job2.setMapOutputKeyClass(IntWritable.class);
job2.setMapOutputValueClass(Text.class);
So you have to provide your own input format that reads the input with the correct types.
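An alternative that avoids writing an input format, shown here only as a hedged sketch, is to keep KeyValueTextInputFormat and replace InverseMapper with a small mapper that swaps key and value while parsing the count into an IntWritable (SwapAndParseMapper is a made-up name, not a Hadoop class):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Reads the (path, count) lines of the first job as Text/Text via
// KeyValueTextInputFormat and emits (count, path) with the count parsed
// into an IntWritable, so the map output actually matches the types
// declared with setMapOutputKeyClass/setMapOutputValueClass.
public class SwapAndParseMapper extends Mapper<Text, Text, IntWritable, Text> {

    private final IntWritable hitCount = new IntWritable();

    @Override
    protected void map(Text path, Text count, Context context)
            throws IOException, InterruptedException {
        hitCount.set(Integer.parseInt(count.toString().trim()));
        context.write(hitCount, path);
    }
}

With that in place, job2.setMapperClass(SwapAndParseMapper.class) replaces InverseMapper.class, and the rest of the job2 configuration (DecreasingComparator, the identity Reducer) can stay as it is.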
I am running a MapReduce program on Hadoop.
The InputFormat passes each file path to the mapper.
I can check the file from the command line like this:
$ hadoop fs -ls hdfs://slave1.kdars.com:8020/user/hadoop/num_5/13.pdf
Found 1 items
-rwxrwxrwx 3 hdfs hdfs 184269 2015-03-31 22:50 hdfs://slave1.kdars.com:8020/user/hadoop/num_5/13.pdf
However, when I try to open that file from the mapper, it does not work:
15/04/01 06:13:04 INFO mapreduce.Job: Task Id : attempt_1427882384950_0025_m_000002_2, Status : FAILED
Error: java.io.FileNotFoundException: hdfs:/slave1.kdars.com:8020/user/hadoop/num_5/13.pdf (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at java.io.FileInputStream.<init>(FileInputStream.java:101)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1111)
I checked that the InputFormat works fine and the mapper gets the right file path.
The mapper code looks like this:
@Override
public void map(Text title, Text file, Context context) throws IOException, InterruptedException {
    long time = System.currentTimeMillis();
    SimpleDateFormat dayTime = new SimpleDateFormat("yyyy-mm-dd hh:mm:ss");
    String str = dayTime.format(new Date(time));
    File temp = new File(file.toString());
    if (temp.exists()) {
        DBManager.getInstance().insertSQL("insert into `plagiarismdb`.`workflow` (`type`) value ('" + temp + " is exists')");
    } else {
        DBManager.getInstance().insertSQL("insert into `plagiarismdb`.`workflow` (`type`) value ('" + temp + " is not exists')");
    }
}
Help me please.
First, import these.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
Then, use them in your mapper method.
FileSystem fs = FileSystem.get(new Configuration());
Path path = new Path(value.toString());
System.out.println(path);
if (fs.exists(path)) {
    context.write(value, one);
} else {
    context.write(value, zero);
}
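The FileNotFoundException in the question comes from handing an hdfs:// URI to java.io.FileInputStream, which only understands local paths. Below is a hedged sketch of reading the PDF through the HDFS client inside the same map method instead; it assumes PDFBox 1.x as in the stack trace, plus the extra imports org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FSDataInputStream and org.apache.pdfbox.pdmodel.PDDocument:

// Sketch only: resolve the path against HDFS and stream it to PDFBox,
// instead of going through java.io.File, which cannot open hdfs:// URIs.
FileSystem fs = FileSystem.get(context.getConfiguration());
Path pdfPath = new Path(file.toString());
if (fs.exists(pdfPath)) {
    FSDataInputStream in = fs.open(pdfPath);
    try {
        PDDocument document = PDDocument.load(in); // PDFBox also accepts an InputStream
        // ... process the document ...
        document.close();
    } finally {
        in.close();
    }
}

This keeps the existence check and the PDF parsing against the same HDFS namespace that the hadoop fs -ls command above is querying.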
I have a simple Java program called PutMerge that I am trying to execute. I have been at it for about six hours and researched many places on the web, but could not find a solution. Basically I try to build the jar with all class libraries with the following command:
javac -classpath *:lib/* -d playground/classes playground/src/PutMerge.java
Then I build the jar with the following command:
jar -cvf playground/putmerge.jar -C playground/classes/ .
And then I try to execute it with the following command:
bin/hadoop jar playground/putmerge.jar org.scd.putmerge "..inputPath.." "..outPath"
..
Exception in thread "main" java.lang.ClassNotFoundException: com.scd.putmerge
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
I have tried every permutation and combination to run this simple jar; however, I always get some kind of exception, as shown above.
My source code:
package org.scd.putmerge;

import java.io.IOException;
import java.util.Scanner;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 *
 * @author Anup V. Saumithri
 *
 */
public class PutMerge
{
    public static void main(String[] args) throws IOException
    {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);

        Path inputDir = new Path(args[0]);
        Path hdfsFile = new Path(args[1]);

        try
        {
            FileStatus[] inputFiles = local.listStatus(inputDir);
            FSDataOutputStream out = hdfs.create(hdfsFile);

            for (int i = 0; i < inputFiles.length; i++)
            {
                System.out.println(inputFiles[i].getPath().getName());
                FSDataInputStream in = local.open(inputFiles[i].getPath());
                byte buffer[] = new byte[256];
                int bytesRead = 0;
                while ((bytesRead = in.read(buffer)) > 0)
                {
                    out.write(buffer, 0, bytesRead);
                }
                in.close();
            }
            out.close();
        }
        catch (IOException ex)
        {
            ex.printStackTrace();
        }
    }
}
The way you are putting your PutMerge class inside the jar may be a little incorrect.
If you run jar tf putmerge.jar, you should see the PutMerge class under the path corresponding to the package declared in your code (org.scd.putmerge), i.e. org/scd/putmerge/.
If not, try the following to achieve that. Make sure you have copied PutMerge.class into the org/scd/putmerge/ directory.
jar -cvf playground/putmerge.jar org/scd/putmerge/PutMerge.class
Next, verify again with jar tf putmerge.jar to check that you now see org/scd/putmerge/PutMerge.class in the output.
If everything's fine, you can try to run hadoop jar again. But looking at the errors, I see that you haven't actually appended the PutMerge class name to the package: you should use org.scd.putmerge.PutMerge. So the correct invocation should be something like:
bin/hadoop jar playground/putmerge.jar org.scd.putmerge.PutMerge "..inputPath.." "..outPath"
I want to store the data emitted by a Storm spout in HDFS. I have added Hadoop FS API code to my bolt class, but it throws an error when I run it with Storm.
Following is the Storm bolt class:
package bolts;

import java.io.*;
import java.util.*;
import java.net.*;

import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class DataNormalizer extends BaseBasicBolt {

    public void execute(Tuple input, BasicOutputCollector collector) {
        String sentence = input.getString(0);
        String[] process = sentence.split(" ");
        int n = 1;
        String rec = "";
        try {
            String source = "/root/data/top_output.csv";
            String dest = "hdfs://localhost:9000/user/root/nishu/top_output/top_output_1.csv";

            Configuration conf = new Configuration();
            FileSystem fileSystem = FileSystem.get(conf);
            System.out.println(fileSystem);

            Path srcPath = new Path(source);
            Path dstPath = new Path(dest);
            String filename = source.substring(source.lastIndexOf('/') + 1,
                    source.length());

            try {
                if (!(fileSystem.exists(dstPath))) {
                    FSDataOutputStream out = fileSystem.create(dstPath, true);
                    InputStream in = new BufferedInputStream(
                            new FileInputStream(new File(source)));
                    byte[] b = new byte[1024];
                    int numBytes = 0;
                    while ((numBytes = in.read(b)) > 0) {
                        out.write(b, 0, numBytes);
                    }
                    in.close();
                    out.close();
                } else {
                    fileSystem.copyFromLocalFile(srcPath, dstPath);
                }
            } catch (Exception e) {
                System.err.println("Exception caught! :" + e);
                System.exit(1);
            } finally {
                fileSystem.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I have also added the Hadoop jars to the CLASSPATH.
Following is the value of the classpath:
$STORM_HOME/storm-0.8.1.jar:$JAVA_HOME/lib/:$HADOOP_HOME/hadoop-core-1.0.4.jar:$HADOOP_HOME/lib/:$STORM_HOME/lib/
I also copied the Hadoop libraries hadoop-core-1.0.4.jar, commons-collection-3.2.1.jar, and commons-cli-1.2.jar into the Storm/lib directory.
When I build and run this project, it throws the following error:
3006 [Thread-16] ERROR backtype.storm.daemon.executor -
java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:466)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1494)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1395)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:123)
at bolts.DataNormalizer.execute(DataNormalizer.java:67)
at backtype.storm.topology.BasicBoltExecutor.execute(BasicBoltExecutor.java:32)
......................
The error message tells you that Apache Commons Configuration is missing. You have to add it to the classpath.
More generally, you should add all Hadoop dependencies to your classpath. You can find them using a dependency manager (Maven, Ivy, Gradle, etc.) or by looking into /usr/lib/hadoop/lib on a machine on which Hadoop is installed.
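For example, here is one hedged way to rewrite the classpath value quoted above so that the jars under the lib directories (commons-configuration ships under $HADOOP_HOME/lib in Hadoop 1.0.4 tarballs; verify on your machine) are actually picked up. Note that a bare directory entry on the Java classpath only adds .class files, so the /* wildcard is needed for the jars it contains:

$STORM_HOME/storm-0.8.1.jar:$HADOOP_HOME/hadoop-core-1.0.4.jar:$HADOOP_HOME/lib/*:$STORM_HOME/lib/*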