Not able to append to existing file in HDFS - hadoop

I have a single-node Hadoop 1.2.1 cluster running on a VM.
My hdfs-site.xml looks like this:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication.</description>
  </property>
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
    <description>Does HDFS allow appends to files?</description>
  </property>
</configuration>
Now, when I run the following code from Eclipse, it always returns false:
Configuration config = new Configuration();
config.set("mapred.job.tracker","10.0.0.6:54311");
config.set("fs.default.name","hdfs://10.0.0.6:54310");
FileSystem fs = FileSystem.get(config);
boolean flag = Boolean.getBoolean(fs.getConf().get("dfs.support.append"));
System.out.println("dfs.support.append is set to be " + flag);
And if I try to append to an existing file, I get the following error:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Append is not supported. Please see the dfs.support.append configuration parameter
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1781)
at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:725)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1432)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1428)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1426)
at org.apache.hadoop.ipc.Client.call(Client.java:1113)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
at com.sun.proxy.$Proxy1.append(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
at com.sun.proxy.$Proxy1.append(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:933)
at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:922)
at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:196)
at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:659)
at com.vanilla.hadoop.AppendToHdfsFile.main(AppendToHdfsFile.java:29)
What is wrong? Am I missing something?

You should try a 2.x.x version (or a 0.2x version), because appending to a file on HDFS is no longer supported after Hadoop 0.20.2. See more information here and here.

Append has not been supported since 1.0.3. If you really do need the previous functionality, you can turn appending back on by setting the flag "dfs.support.broken.append" to true.
hadoop.apache.org/docs/r1.2.1/releasenotes.html

Let's now start with configuring the file system:
public FileSystem configureFileSystem(String coreSitePath, String hdfsSitePath) {
    FileSystem fileSystem = null;
    try {
        Configuration conf = new Configuration();
        conf.setBoolean("dfs.support.append", true);
        Path coreSite = new Path(coreSitePath);
        Path hdfsSite = new Path(hdfsSitePath);
        conf.addResource(coreSite);
        conf.addResource(hdfsSite);
        fileSystem = FileSystem.get(conf);
    } catch (IOException ex) {
        System.out.println("Error occurred while configuring FileSystem");
    }
    return fileSystem;
}
Make sure that the property dfs.support.append in hdfs-site.xml is set to true.
You can either set it manually by editing the hdfs-site.xml file or programmatically using:
conf.setBoolean("dfs.support.append", true);
Let's start with appending to a file in HDFS.
public String appendToFile(FileSystem fileSystem, String content, String dest) throws IOException {
    Path destPath = new Path(dest);
    if (!fileSystem.exists(destPath)) {
        System.err.println("File doesn't exist");
        return "Failure";
    }
    Boolean isAppendable = Boolean.valueOf(fileSystem.getConf().get("dfs.support.append"));
    if (isAppendable) {
        FSDataOutputStream fs_append = fileSystem.append(destPath);
        PrintWriter writer = new PrintWriter(fs_append);
        writer.append(content);
        writer.flush();
        fs_append.hflush();
        writer.close();
        fs_append.close();
        return "Success";
    } else {
        System.err.println("Please set the dfs.support.append property to true");
        return "Failure";
    }
}
To see whether the data has been correctly written to HDFS, let's write a method to read from HDFS and return the content as a String.
public String readFromHdfs(FileSystem fileSystem, String hdfsFilePath) {
    Path hdfsPath = new Path(hdfsFilePath);
    StringBuilder fileContent = new StringBuilder("");
    try {
        BufferedReader bfr = new BufferedReader(new InputStreamReader(fileSystem.open(hdfsPath)));
        String str;
        while ((str = bfr.readLine()) != null) {
            fileContent.append(str + "\n");
        }
    } catch (IOException ex) {
        System.out.println("----------Could not read from HDFS---------\n");
    }
    return fileContent.toString();
}
After that, we have successfully written to and read from the file in HDFS. It's time to close the file system.
public void closeFileSystem(FileSystem fileSystem) {
    try {
        fileSystem.close();
    } catch (IOException ex) {
        System.out.println("----------Could not close the FileSystem----------");
    }
}
Before executing the code, you should have Hadoop running on your system.
You just need to go to HADOOP_HOME and run the following command:
./sbin/start-all.sh
For a complete reference, see https://github.com/ksimar/HDFS_AppendAPI
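To tie the pieces together, here is a minimal driver sketch. The class name HdfsAppendExample, the config file paths and the HDFS file path are placeholders for illustration; adjust them to your own setup.
public static void main(String[] args) throws IOException {
    // Placeholder paths -- point these at your own config files and HDFS file
    String coreSitePath = "/usr/local/hadoop/etc/hadoop/core-site.xml";
    String hdfsSitePath = "/usr/local/hadoop/etc/hadoop/hdfs-site.xml";
    String hdfsFilePath = "/user/hduser/test.txt";

    HdfsAppendExample example = new HdfsAppendExample();  // hypothetical class holding the methods above
    FileSystem fileSystem = example.configureFileSystem(coreSitePath, hdfsSitePath);
    System.out.println(example.appendToFile(fileSystem, "appended line\n", hdfsFilePath));
    System.out.println(example.readFromHdfs(fileSystem, hdfsFilePath));
    example.closeFileSystem(fileSystem);
}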

Related

How to fetch data from an HBase table running on a Linux system from a Java program running on Windows - Could not locate executable null\bin\

This is my code to connect:
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "192.168.20.129");
conf.set("hbase.zookeeper.property.clientPort", "2181");
conf.set("hbase.master", "192.168.20.129:60010");
Just add this method and call it before connecting.
private static void workaround() {
    // workaround for: java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    File workaround = new File(".");
    System.getProperties().put("hadoop.home.dir", workaround.getAbsolutePath());
    new File("./bin").mkdirs();
    try {
        new File("./bin/winutils.exe").createNewFile();
    } catch (IOException e) {
        logger.error(e);
    }
}
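A sketch of how this might be wired in (illustrative only; the connection code is the same as in the question):
public static void main(String[] args) {
    workaround();  // must run before Hadoop/HBase classes try to resolve hadoop.home.dir

    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "192.168.20.129");
    conf.set("hbase.zookeeper.property.clientPort", "2181");
    conf.set("hbase.master", "192.168.20.129:60010");
    // ...then create your table/connection objects as before
}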

Append to file in HDFS (CDH 5.4.5)

Brand new to HDFS here.
I've got this small section of code to test out appending to a file:
val path: Path = new Path("/tmp", "myFile")
val config = new Configuration()
val fileSystem: FileSystem = FileSystem.get(config)
val outputStream = fileSystem.append(path)
outputStream.writeChars("what's up")
outputStream.close()
It is failing with this message:
Not supported
java.io.IOException: Not supported
at org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:352)
at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1163)
I looked at the source for ChecksumFileSystem.java, and it seems to be hardcoded to not support appending:
@Override
public FSDataOutputStream append(Path f, int bufferSize,
        Progressable progress) throws IOException {
    throw new IOException("Not supported");
}
How to make this work? Is there some way to change the default file system to some other implementation that does support append?
It turned out that I needed to actually run a real Hadoop namenode and datanode. I am new to Hadoop and did not realize this. Without them, the code uses your local filesystem, which is a ChecksumFileSystem and does not support append. So I followed the blog post here to get HDFS up and running on my system, and now I am able to append.
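A quick way to confirm you are talking to HDFS rather than the local ChecksumFileSystem is to check what FileSystem.get() returns. A Java sketch; the namenode address here is an assumption, adjust it to your cluster:
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://localhost:8020");  // assumption: your namenode address
FileSystem fs = FileSystem.get(conf);
// Prints org.apache.hadoop.hdfs.DistributedFileSystem when HDFS is actually being used
System.out.println(fs.getClass().getName());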
The writing has to be done on the output stream returned by append(), not on the FileSystem itself; FileSystem.get() is just used to connect to your HDFS. First, set dfs.support.append to true in hdfs-site.xml:
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
Stop all your daemon services using stop-all.sh and restart them using start-all.sh. Then put this in your main method:
String fileuri = "hdfs/file/path";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(fileuri), conf);
FSDataOutputStream out = fs.append(new Path(fileuri));
PrintWriter writer = new PrintWriter(out);
writer.append("I am appending this to my file");
writer.close();
fs.close();

FileNotFoundException when using DistributedCache to access MapFile

I am using Hadoop CDH 4.7 running in YARN mode. There is a MapFile at hdfs://test1:9100/user/tagdict_builder_output/part-00000,
and it has two files, index and data.
I used the following code to add it to the DistributedCache:
Configuration conf = new Configuration();
Path tagDictFilePath = new Path("hdfs://test1:9100/user/tagdict_builder_output/part-00000");
DistributedCache.addCacheFile(tagDictFilePath.toUri(), conf);
Job job = new Job(conf);
And initialized a MapFile.Reader in the Mapper's setup():
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (localFiles != null && localFiles.length > 0 && localFiles[0] != null) {
        String mapFileDir = localFiles[0].toString();
        LOG.info("mapFileDir " + mapFileDir);
        FileSystem fs = FileSystem.get(context.getConfiguration());
        reader = new MapFile.Reader(fs, mapFileDir, context.getConfiguration());
    } else {
        throw new IOException("Could not read lexicon file in DistributedCache");
    }
}
But it throws FileNotFoundException:
Error: java.io.FileNotFoundException: File does not exist: /home/mps/cdh/local/usercache/mps/appcache/application_1405497023620_0045/container_1405497023620_0045_01_000012/part-00000/data
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:824)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1704)
at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:452)
at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:426)
at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:396)
at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:405)
at aps.Cdh4MD5TaglistPreprocessor$Vectorizer.setup(Cdh4MD5TaglistPreprocessor.java:61)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:160)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:155)
I've also tried /user/tagdict_builder_output/part-00000 as the path, and using a symlink. But these do not work either. How can I solve this? Many thanks.
As it says here:
Distributed Cache associates the cache files to the current working directory of the mapper and reducer using symlinks.
So you should try to access your files through the File object:
File f = new File("./part-00000");
EDIT1
My last suggestion:
DistributedCache.addCacheFile(new URI(tagDictFilePath.toString() + "#cache-file"), conf);
DistributedCache.createSymlink(conf);
...
// in mapper
File f = new File("cache-file");
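In the mapper, a sketch of opening the symlinked MapFile through the local filesystem might then look like this (an untested assumption; it keeps the same MapFile.Reader constructor the question already uses):
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    // "cache-file" is the symlink name chosen in the addCacheFile() URI fragment above
    FileSystem localFs = FileSystem.getLocal(conf);
    reader = new MapFile.Reader(localFs, "cache-file", conf);
}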

FileNotFoundException on hadoop

Inside my map function, I am trying to read a file from the distributed cache and load its contents into a hash map.
The sys output log of the MapReduce job prints the content of the hashmap. This shows that it has found the file, loaded it into the data structure and performed the needed operation. It iterates through the list and prints its contents, thus proving that the operation was successful.
However, I still get the below error after a few minutes of running the MR job:
13/01/27 18:44:21 INFO mapred.JobClient: Task Id : attempt_201301271841_0001_m_000001_2, Status : FAILED
java.io.FileNotFoundException: File does not exist: /app/hadoop/jobs/nw_single_pred_in/predict
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1843)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1834)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:578)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:67)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:522)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Here's the portion which initializes Path with the location of the file to be placed in the distributed cache:
// inside main, surrounded by try catch block, yet no exception thrown here
Configuration conf = new Configuration();
// rest of the stuff that relates to conf
Path knowledgefilepath = new Path(args[3]); // args[3] = /app/hadoop/jobs/nw_single_pred_in/predict/knowledge.txt
DistributedCache.addCacheFile(knowledgefilepath.toUri(), conf);
job.setJarByClass(NBprediction.class);
// rest of job settings
job.waitForCompletion(true); // kick off load
This one is inside the map function:
try {
    System.out.println("Inside try !!");
    Path files[] = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    Path cfile = new Path(files[0].toString()); // only one file
    System.out.println("File path : " + cfile.toString());
    CSVReader reader = new CSVReader(new FileReader(cfile.toString()), '\t');
    while ((nline = reader.readNext()) != null)
        data.put(nline[0], Double.parseDouble(nline[1])); // load into a hashmap
} catch (Exception e) {
    // handle exception
}
Help appreciated.
Cheers !
Did a fresh installation of Hadoop and ran the job with the same jar; the problem disappeared. It seems to have been a bug rather than a programming error.

Copying files from HDFS to local file system with JAVA

I am trying to copy files from HDFS to the local filesystem for preprocessing. The code below should work according to the documentation. Although it doesn't give any error messages and the MapReduce job runs smoothly, I cannot see any output on my local hard drive. What do you think the problem is? Thanks.
try {
    Path phdfs_input = new Path("hdfs://master:54310/user/hduser/conninput/" + value.toString());
    Path plocal_input = new Path("/home/hduser/Desktop/" + value.toString());
    FileSystem fs = FileSystem.get(context.getConfiguration());
    fs.copyToLocalFile(phdfs_input, plocal_input);
    /* String localoutput_file = "/home/hduser/Destop/output/" + value.toString();
       String cmd1[] = {"mafia", "-mfi", ".5", "-ascii", "~/Desktop/" + value.toString(), localoutput_file};
       File mafia_dir = new File("/home/hduser/");
       ShellCommandExecutor s = new ShellCommandExecutor(cmd1, mafia_dir); */
} catch (Exception e) {
    e.printStackTrace();
}
Try using "/user/hduser/conninput/" + value.toString() in the Path constructor instead of providing the master:54310 part. It should figure out master:54310 from the Configuration.
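A sketch of the suggested change inside the question's map function (value and context are the same objects used in the question):
Path phdfs_input = new Path("/user/hduser/conninput/" + value.toString());
Path plocal_input = new Path("/home/hduser/Desktop/" + value.toString());
FileSystem fs = FileSystem.get(context.getConfiguration());  // resolves master:54310 from the job's Configuration
fs.copyToLocalFile(phdfs_input, plocal_input);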
