How to append to created file on HDFS - hadoop

I have a problem appending to a file that I created myself. I don't have this problem with a file that was uploaded manually to HDFS. What is the difference between an uploaded file and a created one?
To append and create I use the code below:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Test {

    public static final String hdfs = "hdfs://192.168.15.62:8020";
    public static final String hpath = "/user/horton/wko/test.log";

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", hdfs);
        conf.set("hadoop.job.ugi", "hdfs");
        FileSystem fs = FileSystem.get(conf);
        Path filenamePath = new Path(hpath);
        //FSDataOutputStream out = fs.create(filenamePath);
        FSDataOutputStream out = fs.append(filenamePath);
        out.writeUTF("TEST\n");
        out.close();
    }
}
I get the following exception when appending:
Exception in thread "main" java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.15.62:50010], original=[192.168.15.62:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.

I had a similar problem that I fixed by adding conf.set("dfs.replication", "1").
In my case I had only one node in the cluster, and even though dfs.replication was set to 1 in hdfs-site.xml, the client was still using the default value of 3.
Note that Hadoop tries to replicate the blocks of the file as soon as they are written to the first node, and since the default replication factor is 3, it will fail to reach other nodes if you only have a single-node cluster.
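For a single-node test cluster, here is a minimal client-side sketch of the relevant settings (the replace-datanode-on-failure properties are the ones named in the exception above; relax them only when the cluster genuinely has fewer datanodes than the replication factor):

Configuration conf = new Configuration();
conf.set("fs.defaultFS", hdfs);
// match the single node's actual replication factor
conf.set("dfs.replication", "1");
// with only one datanode there is nothing to replace a failed node with,
// so tell the client not to look for a replacement during pipeline recovery
conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");
conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", false);
FileSystem fs = FileSystem.get(conf);
try (FSDataOutputStream out = fs.append(new Path(hpath))) {
    out.writeUTF("TEST\n");
}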

Related

Access Data from REST API in HIVE

Is there a way to create a Hive table where the location for that table is an HTTP JSON REST API? I don't want to import the data into HDFS every time.
I encountered a similar situation in a project a couple of years ago. This is a fairly low-key way of ingesting data from a RESTful service into HDFS, after which you use Hive analytics to implement the business logic. I hope you are familiar with core Java and MapReduce (if not, you might look into Hortonworks DataFlow, HDF, which is a Hortonworks product).
Step 1: Your data ingestion workflow should not be tied to the Hive workflow that contains your business logic. It should be executed independently, on a schedule based on your requirements (volume and velocity of the data flow), and monitored regularly. I am writing this code in a text editor. WARN: it's not compiled or tested!
The code below uses a Mapper that takes in a URL; tweak it to accept a list of URLs from the file system if needed. The payload (the requested data) is stored as a text file in the specified job output directory (ignore the structure of the data for now).
Mapper Class:
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HadoopHttpClientMap extends Mapper<LongWritable, Text, Text, Text> {

    private int file = 0;
    private String jobOutDir;
    private String taskId;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        // the job output directory configured by the driver
        jobOutDir = FileOutputFormat.getOutputPath(context).toString();
        // the task attempt id keeps file names unique across parallel map tasks
        taskId = context.getTaskAttemptID().toString();
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        Path httpDest = new Path(jobOutDir, taskId + "_http_" + (file++));
        InputStream is = null;
        OutputStream os = null;
        URLConnection connection;
        try {
            connection = new URL(value.toString()).openConnection();
            //implement connection timeout logic
            //authenticate.. etc
            is = connection.getInputStream();
            // write the payload into the job output directory, using that path's own filesystem
            os = httpDest.getFileSystem(context.getConfiguration()).create(httpDest, true);
            IOUtils.copyBytes(is, os, context.getConfiguration(), true);
        } catch (Throwable t) {
            t.printStackTrace();
        } finally {
            IOUtils.closeStream(is);
            IOUtils.closeStream(os);
        }
        context.write(value, null);
        //context.write(new Text(httpDest.getName()), new Text(os.toString()));
    }
}
Mapper Only Job:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HadoopHttpClientJob {

    private static final String data_input_directory = "YOUR_INPUT_DIR";
    private static final String data_output_directory = "YOUR_OUTPUT_DIR";

    public HadoopHttpClientJob() {
    }

    public static void main(String... args) {
        try {
            Configuration conf = new Configuration();
            Path test_data_in = new Path(data_input_directory, "urls.txt");
            Path test_data_out = new Path(data_output_directory);

            @SuppressWarnings("deprecation")
            Job job = new Job(conf, "HadoopHttpClientMap" + System.currentTimeMillis());
            job.setJarByClass(HadoopHttpClientJob.class);

            // remove any previous output so the job can be rerun
            FileSystem fs = FileSystem.get(conf);
            fs.delete(test_data_out, true);

            job.setMapperClass(HadoopHttpClientMap.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setNumReduceTasks(0);

            FileInputFormat.addInputPath(job, test_data_in);
            FileOutputFormat.setOutputPath(job, test_data_out);
            job.waitForCompletion(true);
        } catch (Throwable t) {
            t.printStackTrace();
        }
    }
}
Step 2: Create an external table in Hive on top of the HDFS output directory. Remember to use a Hive SerDe for the JSON data (in your case); you can then copy the data from the external table into managed master tables. This is the step where you implement your incremental logic, compression, and so on.
Step 3: Point your Hive queries (which you may have already created) at the master table to implement your business needs.
Note: If you are actually referring to real-time analysis or a streaming API, you may have to change your application's architecture. Since you have asked an architectural question, I am using my best educated guess to support you. Please go through this once; if you feel you can implement it in your application, you can then ask specific questions and I will try my best to address them.

Meaning of context.getConfiguration() in Hadoop

I have a doubt about this code for searching by argument.
What does context.getConfiguration().get("Uid2Search") mean?
package SearchTxnByArg;

// This is the Mapper Program for SearchTxnByArg
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMap extends Mapper<LongWritable, Text, NullWritable, Text> {

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String Txn = value.toString();
        String TxnParts[] = Txn.split(",");
        String Uid = TxnParts[2];
        String Uid2Search = context.getConfiguration().get("Uid2Search");
        if (Uid.equals(Uid2Search)) {
            context.write(null, value);
        }
    }
}
Driver Program
package SearchTxnByArg;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("Uid2Search", args[0]);
        Job job = new Job(conf, "Map Reduce Search Txn by Arg");
        job.setJarByClass(MyDriver.class);
        job.setMapperClass(MyMap.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
I don't know how you have written your driver program, but in my experience:
Whatever you put into the job's Configuration, whether with the -D option on the command line (when the driver uses ToolRunner/GenericOptionsParser) or with conf.set(...) in the driver as in your code, ends up in the configuration that the tasks see through context.getConfiguration().
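For illustration only (this is not your driver; the class name below is made up), a sketch of a driver that implements Tool and is launched through ToolRunner, so that -D name=value options from the command line are parsed straight into the job Configuration and can be read in the mapper with context.getConfiguration().get(...):

package SearchTxnByArg;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyToolDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D name=value pairs from the command line,
        // because ToolRunner/GenericOptionsParser parses them before run() is called
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, "Map Reduce Search Txn by Arg");
        job.setJarByClass(MyToolDriver.class);
        job.setMapperClass(MyMap.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // e.g. hadoop jar <your jar> SearchTxnByArg.MyToolDriver -DUid2Search=<uid> <input> <output>
        System.exit(ToolRunner.run(new Configuration(), new MyToolDriver(), args));
    }
}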
As per the documentation:
Configurations are specified by resources. A resource contains a set of name/value pairs as XML data. Each resource is named by either a String or by a Path. If named by a String, then the classpath is examined for a file with that name. If named by a Path, then the local filesystem is examined directly, without referring to the classpath. Unless explicitly turned off, Hadoop by default specifies two resources, loaded in-order from the classpath: core-default.xml (read-only defaults for Hadoop) and core-site.xml (site-specific configuration for a given Hadoop installation). Applications may add additional resources, which are loaded subsequent to these resources in the order they are added.
Please see this answer as well
Context object: allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job as well as interfaces that allow it to emit output.
Applications can use the Context to (see the sketch below):
report progress
set application-level status messages
update counters
indicate that they are alive
get values stored in the job configuration across the map/reduce phases
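A rough mapper sketch of those Context uses, re-using the Uid2Search property from the question above; the counter group and counter names are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ContextDemoMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // read a value the driver placed in the job configuration
        String uid2Search = context.getConfiguration().get("Uid2Search");
        // application-level status message, visible in the web UI
        context.setStatus("searching for uid " + uid2Search);
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String uid2Search = context.getConfiguration().get("Uid2Search");
        String[] parts = value.toString().split(",");
        if (parts.length > 2 && parts[2].equals(uid2Search)) {
            // update an application-defined counter
            context.getCounter("SearchTxn", "MATCHED").increment(1);
            context.write(NullWritable.get(), value);
        } else {
            context.getCounter("SearchTxn", "SKIPPED").increment(1);
        }
        // progress reporting / liveness happens implicitly on every context.write(),
        // and can be forced with context.progress() inside long-running loops
    }
}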

Running a mapreduce job, no output at all. It doesn't even run. Very weird. No error thrown on the terminal

I compiled the mapreduce code (driver, mapper and reducer classes) and created the jar file. When I run it on the dataset, it doesn't seem to run; it just comes back to the prompt as shown in the image. Any suggestions, folks?
Thanks much,
basam
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

//This driver program will bring all the information needed to submit this Map reduce job.
public class MultiLangDictionary {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MultiLangDictionary <input path> <output path>");
            System.exit(-1);
        }

        Configuration conf = new Configuration();
        Job ajob = new Job(conf, "MultiLangDictionary");
        //Assigning the driver class name
        ajob.setJarByClass(MultiLangDictionary.class);

        //first argument is the location of the input dataset
        FileInputFormat.addInputPath(ajob, new Path(args[0]));
        //second argument is the location of the output dataset
        FileOutputFormat.setOutputPath(ajob, new Path(args[1]));

        ajob.setInputFormatClass(TextInputFormat.class);
        ajob.setOutputFormatClass(TextOutputFormat.class);

        //Defining the mapper class name
        ajob.setMapperClass(MultiLangDictionaryMapper.class);
        //Defining the Reducer class name
        ajob.setReducerClass(MultiLangDictionaryReducer.class);

        //setting the second argument as a path in a path variable
        Path outputPath = new Path(args[1]);
        //deleting the output path automatically from hdfs so that we don't have to delete it explicitly
        outputPath.getFileSystem(conf).delete(outputPath);
    }
}
Try including the Java package name with the class name in the command:
hadoop jar MultiLangDictionary.jar [yourpackagename].MultiLangDictionary input output
You could try adding the map and reduce output key types to your driver, something like this (this is just an example):
job2.setMapOutputKeyClass(Text.class);
job2.setMapOutputValueClass(Text.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(Text.class);
In the above, both the Mapper and the Reducer would be writing (Text, Text) in their context.write() methods.

yarn stderr no logger appender and no stdout

I'm running a simple mapreduce wordcount program against Apache Hadoop 2.6.0. Hadoop is running distributed (several nodes). However, I'm not able to see any stderr or stdout from the YARN job history (but I can see the syslog).
The wordcount program is really simple, just for demo purpose.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static final Log LOG = LogFactory.getLog(WordCount.class);

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            LOG.info("LOG - map function invoked");
            System.out.println("stdout - map function invoded");
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.job.jar", "/space/tmp/jar/wordCount.jar");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/user/jsun/input"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/user/jsun/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Note that in the map function of the Mapper class, I added two statements:
LOG.info("LOG - map function invoked");
System.out.println("stdout - map function invoded");
These two statements are there to test whether I can see logging from the Hadoop cluster. I can run the program successfully, but if I go to localhost:8088 to see the application history and then "logs", I see nothing under "stdout", and under "stderr" only:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.ipc.Server).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
I think some configuration is needed to get that output, but I'm not sure which piece is missing. I searched online as well as on stackoverflow. Some people mentioned container-log4j.properties, but they are not specific about how to configure that file and where to put it.
One thing to note: I also tried the job with Hortonworks Data Platform 2.2 and Cloudera 5.4, with the same result. I remember that with earlier versions of Hadoop (1.x) I could easily see the logging in the same place, so I guess this is something new in Hadoop 2.x.
=======
As a comparison, if I run Apache Hadoop in local mode (meaning LocalJobRunner), I can see some logging in the console, like this:
[2015-09-08 15:57:25,992]org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:998) INFO:kvstart = 26214396; length = 6553600
[2015-09-08 15:57:25,996]org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:402) INFO:Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
[2015-09-08 15:57:26,064]WordCount$TokenizerMapper.map(WordCount.java:28) INFO:LOG - map function invoked
stdout - map function invoded
[2015-09-08 15:57:26,075]org.apache.hadoop.mapred.LocalJobRunner$Job.statusUpdate(LocalJobRunner.java:591) INFO:
[2015-09-08 15:57:26,077]org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1457) INFO:Starting flush of map output
[2015-09-08 15:57:26,077]org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1475) INFO:Spilling map output
This kind of logging ("map function is invoked") is what I expected to find in the Hadoop server logs.
None of the sysout output written in a map-reduce program can be seen on the console, because map-reduce runs as multiple parallel copies across the cluster, so there is no concept of a single console with the output.
However, the System.out.println() output from the map and reduce phases can be seen in the job logs. An easy way to access the logs is to:
open the jobtracker web console - http://localhost:50030/jobtracker.jsp
click on the completed job
click on the map or reduce task
click on the task number
go to the task logs
check the stdout logs.
Please note that if you are not able to locate the URL, just look in the console log for the jobtracker URL.
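As a hedged aside for YARN (Hadoop 2.x), where the JobTracker UI above no longer exists: if log aggregation is enabled (yarn.log-aggregation-enable set to true), the aggregated container logs, including stdout and stderr, can also be fetched from the command line with yarn logs -applicationId <application id>, using the application id shown in the ResourceManager UI on port 8088.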

WebHdfsFileSystem local ip vs network ip hadoop

I have a requirement to read HDFS from outside of the HDFS cluster. I stumbled upon WebHdfsFileSystem, and even though I get the idea, I could not make it work with the network address. For example, the code below works fine as long as I use 127.0.0.1 or localhost. But the moment I use the network IP address 192.168.. , I get "Retrying connect to server" messages followed by a ConnectException.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.web.WebHdfsFileSystem;

public class ReadHDFSFile {

    public static void main(String[] args) {
        Path p = new Path("hdfs://127.0.0.1:9000/user/hduser");
        WebHdfsFileSystem web = new WebHdfsFileSystem();
        try {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://127.0.0.1:9000/");
            web.setConf(conf);
            Configuration conf1 = web.getConf();

            FileSystem fs = FileSystem.get(web.getConf());
            System.out.println(fs.exists(p));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I am not sure what I am missing here.
I have a version of this working on Hadoop 2.4. I had to change two things relative to using the regular Hadoop FileSystem API:
the protocol changes from hdfs:// to webhdfs://
the port changes to the HTTP port (which on our Hortonworks cluster is 50070), not the default HDFS RPC port, which is 8020 on our system
Example code that works for me:
Configuration conf = new Configuration();
String conxUrl = String.format("webhdfs://%s:%s", NAMENODE_IP_ADDR, WEBHDFS_PORT);
conf.set("fs.defaultFS", conxUrl);
FileSystem fs = WebHdfsFileSystem.get(conf);
Path path = new Path("/path/to/my/file");
System.out.println(fs.exists(path));
