unable to connect to hdfs - hadoop

I am trying to create a directory using hdfs, my Hadoop cluster is not running on localhost, rather it's running on a different network. Please tell me how to connect correctly. I am using the following code to connect to hdfs:
import java.io.BufferedInputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class HadoopJava {
public static void main(String[] args) throws IOException {
HadoopJava hadoopJava = new HadoopJava();
hadoopJava.mkdir("/home/anoop/javamkdir");
}
public void mkdir(String directory) throws IOException {
Configuration obj = new Configuration();
obj.set("fs.default.name", "hdfs://10.111.214.124:50070");
FileSystem fs = FileSystem.get(obj);
Path path = new Path(directory);
fs.mkdirs(path);
System.out.println("created");
fs.close();
}
}
The error trace is as follows:
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:783)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:365)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:338)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2086)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2055)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2031)
at HadoopJava.copyfileFromHDFSToLOCAL(HadoopJava.java:74)
at HadoopJava.main(HadoopJava.java:29)

Related

docx4j cant load file with error java.io.FileNotFoundException

in my spring boot project iam using docx4j to load a file from the target folder although the file exists when i use system.out.print("exists) it appears in the console . any solution ? here is the code
public void testDocx4j() throws Docx4JException, FileNotFoundException {
File file = ResourceUtils.getFile("classpath:compare.docx");
if(file.exists()){
System.out.println("exists !!");
}
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(file);
MainDocumentPart mainDocumentPart = wordMLPackage.getMainDocumentPart();
}
i was trying to load the file with docx4j
The following works for me:
import java.io.IOException;
import java.io.InputStream;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
import org.docx4j.utils.ResourceUtils;
public class LoadAsResource {
public static void main(String[] args) throws Docx4JException, IOException {
InputStream is = ResourceUtils.getResource("sample-docxv2.docx");
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage
.load(is);
MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();
System.out.println(documentPart.getXML());
}
}

Hadoop WordCount Tutorial java.lang.ClassNotFoundException

I'm relatively new to hadoop and I'm struggling a little bit to understand the ClassNotFoundException I get when trying to run the job. I'm using the standard tutorial found here and here is my WordCount class (running on ubuntu 16.04 hadoop 2.7.3 distributed cluster mode):
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
To try and remain organized, I added a couple paths to my ~/.bashrc file:
hduser#mynode:~$ cd $HADOOP_CODE
hduser#mynode:/usr/local/hadoop/code$
This is one directory down from the $HADOOP_HOME directory. To compile the WordCount.JAVA file, I ran:
hduser#mynode:/usr/local/hadoop$ hadoop com.sun.tools.javac.Main $HADOOP_CODE/WordCount.java
hduser#mynode:/usr/local/hadoop$ jar cf wc.jar $HADOOP_CODE/WordCount*.class
I then tried:
hduser#mynode:/usr/local/hadoop$ hadoop jar $HADOOP_CODE/wc.jar $HADOOP_CODE/WordCount /home/hduser/input /home/hduser/output/wordcount
which bombed with the following error:
Exception in thread "main" java.lang.ClassNotFoundException: /usr/local/hadoop/code/WordCount
EDIT
This gave me the same error:
hduser#mynode:/usr/local/hadoop/code$ hadoop jar $HADOOP_CODE/wc.jar WordCount /home/hduser/input /home/hduser/output/wordcount
To get it to run without error, I moved the WordCount.Java file up one directory to the default hadoop ($HADOOP_HOME) folder. I also know from here and here that the solution is to add a package to the file.
What I'm trying to understand is why that is the solution. With no package name, where is hadoop looking for the specified package, and why can't I pass it a full path to get it to run correctly? This may be a basic java question (apologies - I'm from the python world), but what is the package name doing during the compile process that makes it so I could run without a path name, but leaving off the package name means it has to be in that default directory? I'd prefer not to have to add a package name to every job I run. An explanation would be greatly appreciated!

malformedURLException : no protocol

I am trying to execute a simple Hadoop program to read the contents of the file and print it on to the console.
I am following Hadoop The definitive guide : URLCat example
I am getting malformed URL Exception : no protocol
When i use -cat with hdfs://localhost/user/training/test.txt i am getting the contents printed out but when i use the same path while executing the jar i am getting the mentioned exception.
I have added static block where it sets the URLStreamHandlerFactory
EDITED :
My Program :
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;
// vv URLCat
public class URLCat {
static {
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}
public static void main(String[] args) throws Exception {
InputStream in = null;
try {
in = new URL(args[0]).openStream();
IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
IOUtils.closeStream(in);
}
}
}

Why does job submission from Java fail?

I submit a Spark job from Java as a RESTful service. I keep getting the following error:
Application application_1446816503326_0098 failed 2 times due to AM
Container for appattempt_1446816503326_0098_000002 exited with
exitCode: -1000 For more detailed output, check application tracking
page:http://ip-172-31-34-
108.us-west-2.compute.internal:8088/proxy/application_1446816503326_0098/Then,
click on links to logs of each attempt. Diagnostics:
java.io.FileNotFoundException: File file:/opt/apache-tomcat-
8.0.28/webapps/RESTfulExample/WEB-INF/lib/spark-yarn_2.10-1.3.0.jar does not exist Failing this attempt. Failing the application.
spark-yarn_2.10-1.3.0.jar file is there in the lib folder.
Here is my program.
package SparkSubmitJava;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import java.io.IOException;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.core.Response;
#Path("/spark")
public class JavaRestService {
#GET
#Path("/{param}/{param2}/{param3}")
public Response getMsg(#PathParam("param") String bedroom,#PathParam("param2") String bathroom,#PathParam("param3")String area) throws IOException {
String[] args = new String[] {
"--name",
"JavaRestService",
"--driver-memory",
"1000M",
"--jar",
"/opt/apache-tomcat-8.0.28/webapps/scalatest-0.0.1-SNAPSHOT.jar",
"--class",
"ScalaTest.ScalaTest.ScalaTest",
"--arg",
bedroom,
"--arg",
bathroom,
"--arg",
area,
"--arg",
"yarn-cluster",
};
Configuration config = new Configuration();
System.setProperty("SPARK_YARN_MODE", "true");
SparkConf sparkConf = new SparkConf();
ClientArguments cArgs = new ClientArguments(args, sparkConf);
Client client = new Client(cArgs, config, sparkConf);
client.run();
return Response.status(200).entity(client).build();
}
}
Any help will be appreciated.

How do I convert my Java Hadoop code to run on EC2?

I wrote a Driver, Mapper, and Reducer class in Java that runs the k-nearest neighbor algorithm on test data, and pulls in the training set using Distributed Cache. I used a Cloudera virtual machine to test the code, and it works in pseudo-distributed mode.
I'm trying to get through Amazon's EC2/EMR documentation ... it seems like there should be a way to easily convert working Java Hadoop code into something that will work in EC2, but I'm seeing a whole bunch of custom amazon import statements and methods that I've never seen before.
Here's my driver code for an example:
import java.net.URI;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class KNNDriverEC2 extends Configured implements Tool {
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.setInt("rows",1000);
conf.setInt("columns",613);
DistributedCache.createSymlink(conf);
// might have to start next line with ./!!!
DistributedCache.addCacheFile(new URI("knn-jg/cache_data/train_sample.csv#train_sample.csv"),conf);
DistributedCache.addCacheFile(new URI("knn-jg/cache_data/train_labels.csv#train_labels.csv"),conf);
//DistributedCache.addCacheFile(new URI("cacheData/train_sample.csv"),conf);
//DistributedCache.addCacheFile(new URI("cacheData/train_labels.csv"),conf);
Job job = new Job(conf);
job.setJarByClass(KNNDriverEC2.class);
job.setJobName("KNN");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(KNNMapperEC2.class);
job.setReducerClass(KNNReducerEC2.class);
// job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Configuration(), new KNNDriverEC2(), args);
System.exit(exitCode);
}
}
I've gotten my instance running, but an exception is thrown at the line "FileInputFormat.setInputPaths(job, new Path(args[0]));". I'm going to try to work through the documentation on handling arguments, but I've run into so many errors so far I'm wondering if I'm far from making this work. Any help appreciated.

Resources