I am running a MapReduce program on Hadoop.
The InputFormat passes each file path to the mapper.
I can check the file from the command line like this:
$ hadoop fs -ls hdfs://slave1.kdars.com:8020/user/hadoop/num_5/13.pdf
Found 1 items
-rwxrwxrwx 3 hdfs hdfs 184269 2015-03-31 22:50 hdfs://slave1.kdars.com:8020/user/hadoop/num_5/13.pdf
However, when I try to open that file from the mapper side, it does not work:
15/04/01 06:13:04 INFO mapreduce.Job: Task Id : attempt_1427882384950_0025_m_000002_2, Status : FAILED
Error: java.io.FileNotFoundException: hdfs:/slave1.kdars.com:8020/user/hadoop/num_5/13.pdf (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at java.io.FileInputStream.<init>(FileInputStream.java:101)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1111)
I checked that the InputFormat works fine and the mapper gets the right file path.
The mapper code looks like this:
@Override
public void map(Text title, Text file, Context context) throws IOException, InterruptedException {
    long time = System.currentTimeMillis();
    SimpleDateFormat dayTime = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    String str = dayTime.format(new Date(time));
    File temp = new File(file.toString());
    if (temp.exists()) {
        DBManager.getInstance().insertSQL("insert into `plagiarismdb`.`workflow` (`type`) value ('" + temp + " is exists')");
    } else {
        DBManager.getInstance().insertSQL("insert into `plagiarismdb`.`workflow` (`type`) value ('" + temp + " is not exists')");
    }
}
Help me please.
First, import these.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
Then, use them in your mapper method.
FileSystem fs = FileSystem.get(new Configuration());
Path path = new Path(value.toString());
System.out.println(path);
if (fs.exists(path)) {
    context.write(value, one);
} else {
    context.write(value, zero);
}
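The underlying problem in the question is that java.io.File (and PDFBox's file-based load) only looks at the task's local file system, so an hdfs:// path can never be found there. Below is a minimal sketch of a mapper that resolves the path through the HDFS client and streams the file into PDFBox instead; the class name and key/value types are illustrative, and it assumes the PDFBox version in use offers the PDDocument.load(InputStream) overload.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.pdfbox.pdmodel.PDDocument;
public class PdfMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    public void map(Text title, Text file, Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        Path path = new Path(file.toString());
        // Resolve the path against HDFS, not the task's local file system.
        FileSystem fs = path.getFileSystem(conf);
        if (!fs.exists(path)) {
            return; // or record the missing file, as in the original mapper
        }
        // Stream the file out of HDFS and hand it to PDFBox.
        try (FSDataInputStream in = fs.open(path)) {
            PDDocument doc = PDDocument.load(in);
            try {
                // ... extract text, compare, write output, etc.
            } finally {
                doc.close();
            }
        }
    }
}
Opening the file with fs.open() keeps the read inside HDFS, which is exactly what the exists() check above verifies.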
Related
I'm running a simple MapReduce program, wordcount, against Apache Hadoop 2.6.0. Hadoop is running in distributed mode (several nodes). However, I'm not able to see any stderr or stdout output in the YARN job history (though I can see the syslog).
The wordcount program is really simple, just for demo purpose.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {

    public static final Log LOG = LogFactory.getLog(WordCount.class);

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            LOG.info("LOG - map function invoked");
            System.out.println("stdout - map function invoded");
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.job.jar", "/space/tmp/jar/wordCount.jar");
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/user/jsun/input"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/user/jsun/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Note that in the map function of the Mapper class I added two statements:
LOG.info("LOG - map function invoked");
System.out.println("stdout - map function invoded");
These two statements are there to test whether I can see logging from the Hadoop server. I can run the program successfully. But if I go to localhost:8088 to see the application history and then "logs", I see nothing in "stdout", and in "stderr" only:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.ipc.Server).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
I think some configuration is needed to get that output, but I am not sure which piece is missing. I searched online as well as on Stack Overflow. Some people mentioned container-log4j.properties, but they are not specific about how to configure that file or where to put it.
One thing to note: I also tried the job with Hortonworks Data Platform 2.2 and Cloudera 5.4, with the same result. I remember that with earlier versions of Hadoop (1.x) I could easily see the logging in the same place, so I guess this is something new in Hadoop 2.x.
=======
As a comparison, if I run Apache Hadoop in local mode (i.e. LocalJobRunner), I can see logging in the console like this:
[2015-09-08 15:57:25,992]org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:998) INFO:kvstart = 26214396; length = 6553600
[2015-09-08 15:57:25,996]org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:402) INFO:Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
[2015-09-08 15:57:26,064]WordCount$TokenizerMapper.map(WordCount.java:28) INFO:LOG - map function invoked
stdout - map function invoded
[2015-09-08 15:57:26,075]org.apache.hadoop.mapred.LocalJobRunner$Job.statusUpdate(LocalJobRunner.java:591) INFO:
[2015-09-08 15:57:26,077]org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1457) INFO:Starting flush of map output
[2015-09-08 15:57:26,077]org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1475) INFO:Spilling map output
This kind of logging ("map function invoked") is what I expected to see in the Hadoop server logs.
The System.out output written in a MapReduce program cannot be seen on the console, because MapReduce runs as multiple parallel copies across the cluster, so there is no single console collecting the output.
However, the System.out.println() output from the map and reduce phases can be seen in the job logs. An easy way to access the logs is:
1. Open the JobTracker web console: http://localhost:50030/jobtracker.jsp
2. Click on the completed job.
3. Click on the map or reduce task.
4. Click on the task number.
5. Go to the task logs.
6. Check the stdout logs.
Please note that if you are not able to locate the URL, just look in the console log for the JobTracker URL.
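Since the question targets Hadoop 2.x, where the JobTracker has been replaced by YARN, the container logs can also be pulled from the command line once the job has finished, assuming log aggregation is enabled (yarn.log-aggregation-enable set to true in yarn-site.xml):
$ yarn logs -applicationId <application_id>
The application id is the one shown on the ResourceManager UI at localhost:8088.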
I have a problem appending to a file that I created myself. I don't have this problem with a file that was uploaded manually to HDFS. What is the difference between an uploaded file and a created one?
To append and create I use the code below.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class Test {

    public static final String hdfs = "hdfs://192.168.15.62:8020";
    public static final String hpath = "/user/horton/wko/test.log";

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", hdfs);
        conf.set("hadoop.job.ugi", "hdfs");
        FileSystem fs = FileSystem.get(conf);
        Path filenamePath = new Path(hpath);
        //FSDataOutputStream out = fs.create(filenamePath);
        FSDataOutputStream out = fs.append(filenamePath);
        out.writeUTF("TEST\n");
        out.close();
    }
}
I get this exception in the append case:
Exception in thread "main" java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.15.62:50010], original=[192.168.15.62:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
I had a similar problem that I fixed by adding conf.set("dfs.replication", "1").
In my case I had only one node in the cluster, and even though dfs.replication was set to 1 in hdfs-site.xml, the client was still using the default value of 3.
Note that Hadoop tries to replicate the blocks of a file as soon as they are written on the first node, and since the default replication factor is 3, it will fail to reach the other (non-existent) nodes if you only have a single-node cluster.
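As a rough sketch of that fix, reusing the hdfs and hpath constants from the Test class above: the two client-side settings below are the ones relevant to the exception; the replace-datanode-on-failure policy is the property named in the error message, and relaxing it to NEVER is only sensible on a single-node or development cluster.
Configuration conf = new Configuration();
conf.set("fs.defaultFS", hdfs);
// Match the actual cluster size so the append pipeline does not look for extra datanodes.
conf.set("dfs.replication", "1");
// Property named in the exception; NEVER is only reasonable on a single-node/dev setup.
conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream out = fs.append(new Path(hpath));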
Hadoop's DistributedCache documentation doesn't seem to sufficiently describe how to use the distributed cache. Here is the example given:
// Setting up the cache for the application
1. Copy the requisite files to the FileSystem:
$ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
$ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
$ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
$ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
$ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
2. Setup the application's JobConf:
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
3. Use the cached files in the Mapper or Reducer:
public static class MapClass extends MapReduceBase implements Mapper<K, V, K, V> {

    private Path[] localArchives;
    private Path[] localFiles;

    public void configure(JobConf job) {
        // Get the cached archives/files
        File f = new File("./map.zip/some/file/in/zip.txt");
    }

    public void map(K key, V value, OutputCollector<K, V> output, Reporter reporter) throws IOException {
        // Use data from the cached archives/files here
        // ...
        // ...
        output.collect(k, v);
    }
}
I've been searching around for over an hour trying to figure out how to use this. After piecing together a few other SO questions, here's what I came up with:
public static void main(String[] args) throws Exception {
    Job job = new Job(new JobConf(), "Job Name");
    Configuration conf = job.getConfiguration();
    DistributedCache.createSymlink(conf);
    DistributedCache.addCacheArchive(new URI("/ProjectDir/LookupTable.zip"), conf);
    // *Rest of configuration code*
}

public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {

    private Path[] localArchives;

    public void configure(JobConf job) {
        // Get the cached archive
        File file1 = new File("./LookupTable.zip/file1.dat");
        BufferedReader br1index = new BufferedReader(new InputStreamReader(new FileInputStream(file1)));
    }

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // *Map code*
    }
}
Where am I supposed to call the void configure(JobConf job) function?
Where do I use the private Path[] localArchives object?
Is my code in the configure() function the correct way to access files within an archive and to link a file with a BufferedReader?
I will answer your questions with respect to the new API and the common practices for using the distributed cache.
Where am I supposed to call the void configure(JobConf job) function?
The framework calls the protected void setup(Context context) method once at the beginning of every map task; the logic associated with using cache files is usually handled here, for example reading a file and storing its data in a variable to be used in the map() function, which is called after setup().
Where do I use the private Path[] localArchives object?
It is typically used in the setup() method to retrieve the paths of the cache files, something like this:
Path[] localArchive = DistributedCache.getLocalCacheFiles(context.getConfiguration());
Is my code in the configure() function the correct way to access files within an archive and to link a file with a BufferedReader?
It is missing a call to the method that retrieves the path where the cache files are stored (shown above). Once the path is retrieved, the file(s) can be read as below:
FSDataInputStream in = fs.open(localArchive);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
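Putting those pieces together, here is a minimal sketch of such a setup() method; it assumes the older DistributedCache API (deprecated in recent releases) and that the cached file is a plain-text lookup file:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    // Paths of the copies the framework has already placed on the task's local disk
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    if (localFiles != null && localFiles.length > 0) {
        // The cached copies live on the local file system, so open them with the local FS
        FileSystem localFs = FileSystem.getLocal(conf);
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(localFs.open(localFiles[0])))) {
            String line;
            while ((line = br.readLine()) != null) {
                // populate an in-memory lookup structure for later use in map()
            }
        }
    }
}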
I'm using Hadoop's DistributedCache, but I'm running into some trouble.
My Hadoop is in pseudo-distributed mode.
From here we can see that in pseudo-distributed mode we use DistributedCache.getLocalCache(xx) to retrieve the cached file.
First I put my file into the DistributedCache:
DistributedCache.addCacheFile(new Path("hdfs://localhost:8022/user/administrator/myfile").toUri(), job.getConfiguration());
Then I retrieve it in the mapper's setup(), but DistributedCache.getLocalCacheFiles() returns null. I can see my cached file through:
System.out.println("Cache: "+context.getConfiguration().get("mapred.cache.files"));
and it prints out:
hdfs://localhost:8022/user/administrator/myfile
Here is my pseudocode:
public static class JoinMapper {

    @Override
    protected void setup(Context context) {
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        System.out.println("Cache: " + context.getConfiguration().get("mapred.cache.files"));
        Path cacheFile;
        if (cacheFiles != null) {}
    }
}
xx....
public static void main(String[] args) {
    Job job = new Job(conf, "Join Test");
    DistributedCache.addCacheFile(new Path("hdfs://localhost:8022/user/administrator/myfile").toUri(), job.getConfiguration());
}
Sorry about the poor typesetting. Can anyone help, please?
By the way, I can get the URIs using:
URI[] uris = DistributedCache.getCacheFiles(context.getConfiguration());
uris returns:
hdfs://localhost:8022/user/administrator/myfile
When I try to read from that URI, I get a file-not-found exception.
The DistributedCache will copy your files from HDFS to the local file system of every TaskTracker.
How are you reading the file? If the file is in HDFS, you will have to get the HDFS FileSystem; otherwise it is going to use the default one (probably the local file system). So to read the file in HDFS, try:
String url = "hdfs://localhost:8022/user/administrator/myfile";
FileSystem fs = FileSystem.get(new Path(url).toUri(), new Configuration());
Path path = new Path(url);
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)));
I have created a jar that runs a MapReduce job and generates output in some directory.
I need to read the data from that output directory from my Java code, which does not run in the Hadoop environment, without copying it into a local directory.
I am using ProcessBuilder to run the jar. Can anyone help me?
You can write the following code to read the output of the job within your MR driver code:
job.waitForCompletion(true);
FileSystem fs = FileSystem.get(conf);
Path[] outputFiles = FileUtil.stat2Paths(fs.listStatus(output, new OutputFilesFilter()));
for (Path file : outputFiles) {
    InputStream is = fs.open(file);
    BufferedReader reader = new BufferedReader(new InputStreamReader(is));
    // ...
    // ...
}
What's the problem in reading HDFS data using the HDFS API?
public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/core-site.xml"));
    conf.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/hdfs-site.xml"));
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream inputStream = fs.open(new Path("/mapout/input.txt"));
    System.out.println(inputStream.readLine());
}
Your program may be running outside of your Hadoop cluster, but the Hadoop daemons must be running.
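For completeness, here is a minimal sketch of the same read for the case where the *-site.xml files are not available on the client's classpath; the NameNode address below is a hypothetical placeholder:
Configuration conf = new Configuration();
// Hypothetical NameNode address; replace with your cluster's fs.defaultFS value.
conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
FileSystem fs = FileSystem.get(conf);
try (FSDataInputStream in = fs.open(new Path("/mapout/input.txt"));
     BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
    System.out.println(reader.readLine());
}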