FSDataOutputStream.writeUTF() adds extra characters at the start of the data on hdfs. How to avoid this extra data? - hadoop

What I am trying to do is convert a sequence file on HDFS that contains XML data into .xml files on HDFS.
I searched on Google and found the code below. I made modifications according to my needs, and the following is the result.
public class SeqFileWriterCls {
    public static void main(String args[]) throws Exception {
        System.out.println("Reading Sequence File");
        Path path = new Path("seq_file_path/seq_file.seq");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = null;
        SequenceFile.Reader reader = null;
        FSDataOutputStream fwriter = null;
        OutputStream fowriter = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            //writer = new SequenceFile.Writer(fs, conf, out_path, Text.class, Text.class);
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                // I am just editing the path so that the key becomes my file name and the value becomes its content
                Path out_path = new Path("" + key);
                String string_path = out_path.toString();
                String clear_path = string_path.substring(string_path.lastIndexOf("/") + 1);
                Path finalout_path = new Path("path" + clear_path);
                System.out.println("the final path is " + finalout_path);
                fwriter = fs.create(finalout_path);
                fwriter.writeUTF(value.toString());
                fwriter.close();
                FSDataInputStream in = fs.open(finalout_path);
                String s = in.readUTF();
                System.out.println("file has: -" + s);
                //fowriter = fs.create(finalout_path);
                //fowriter.write(value.toString());
                System.out.println(key + " <===> :" + value.toString());
                System.exit(0);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeStream(reader);
            fs.close();
        }
    }
}
I am using FSDataOutputStream to write the data to HDFS, and the method used is writeUTF(). The issue is that when I write to the HDFS file, some additional characters end up at the start of the data, but when I print the data I cannot see those extra characters.
I tried using writeChars(), but even that did not work.
Is there any way to avoid this, or is there another way to write the data to HDFS?
Please help...

The JavaDoc of the writeUTF(String str) method says the following:
Writes a string to the underlying output stream using modified UTF-8 encoding in a machine-independent manner.
First, two bytes are written to the output stream as if by the writeShort method giving the number of bytes to follow. This value is the number of bytes actually written out, not the length of the string. Following the length, each character of the string is output, in sequence, using the modified UTF-8 encoding for the character. (...)
Both the writeBytes(String str) and writeChars(String str) methods should work fine.
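If the goal is a plain .xml file with no length prefix at all, another option is to write the raw bytes yourself. A minimal sketch, reusing the variables from the question and assuming the value holds UTF-8 XML text (java.nio.charset.StandardCharsets import assumed):
fwriter = fs.create(finalout_path);
// plain UTF-8 bytes: no 2-byte length prefix (unlike writeUTF) and no dropped high bytes (unlike writeBytes)
fwriter.write(value.toString().getBytes(StandardCharsets.UTF_8));
fwriter.close();
When reading such a file back, use an ordinary text reader rather than readUTF(), since there is no length header left to consume.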

Related

Text file not getting compressed correct in HDFS

I have a .txt file on my local machine, and I want to compress it into a .gz file and upload it to a location in HDFS.
Below is the code I tried:
String codecClassName = args[1];
String source = args[2];
String dest = args[3];
InputStream in = new BufferedInputStream(new FileInputStream(source));
Class<?> codecClass = Class.forName(codecClassName);
Configuration conf = new Configuration();
CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
FileSystem fs = FileSystem.get(URI.create(dest), conf);
OutputStream out = fs.create(new Path(dest), new Progressable() {
    @Override
    public void progress() {
        System.out.println(".");
    }
});
CompressionOutputStream outStream = codec.createOutputStream(out);
IOUtils.copyBytes(in, outStream, 4096, false);
Below are the values of the arguments passed to this code:
arg1 (name of the compression codec): org.apache.hadoop.io.compress.GzipCodec
arg2 (a location on my local drive): /home/user/Demo.txt
arg3 (a location in HDFS): hdfs://localhost:8020/user/input/Demo.gz
When I run this code, the Demo.gz file is created in the HDFS location mentioned above, but the size of the .gz file is 0 MB.
Please let me know why the file is not getting compressed and uploaded to HDFS correctly.
You did not seem to close the streams.
You have two options:
Close them automatically by passing true as the fourth parameter to copyBytes
Close them manually, e.g. outStream.close()
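For illustration, both variants as a short sketch, reusing the variable names from the question (for gzip it can also help to call finish() so the remaining compressed data and trailer are written out):
// Option 1: let copyBytes close both streams by passing true as the fourth argument
IOUtils.copyBytes(in, outStream, 4096, true);

// Option 2: keep the false flag and close everything explicitly
try {
    IOUtils.copyBytes(in, outStream, 4096, false);
    outStream.finish();   // flush any remaining compressed data without closing the stream
} finally {
    outStream.close();
    in.close();
}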

Memory issues when running Spark job on relatively large input

I am running a Spark cluster with 50 machines. Each machine is a VM with 8 cores and 50 GB of memory (41 GB seems to be available to Spark).
I am running over several input folders; I estimate the total input size to be ~250 GB, gz compressed.
Although the number and configuration of machines seems sufficient to me, the job fails after about 40 minutes of running, and I can see the following errors in the logs:
2558733 [Result resolver thread-2] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 345.0 in stage 1.0 (TID 345, hadoop-w-3.c.taboola-qa-01.internal): java.lang.OutOfMemoryError: Java heap space
java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
java.lang.StringCoding.decode(StringCoding.java:193)
java.lang.String.<init>(String.java:416)
java.lang.String.<init>(String.java:481)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:699)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:660)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
and also:
2653545 [Result resolver thread-2] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 122.1 in stage 1.0 (TID 392, hadoop-w-22.c.taboola-qa-01.internal): java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
java.lang.StringCoding.decode(StringCoding.java:193)
java.lang.String.<init>(String.java:416)
java.lang.String.<init>(String.java:481)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:699)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:660)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
How do I go about debugging such an issue?
EDIT: I found the root cause of the problem. It is this piece of code:
private static final int MAX_FILE_SIZE = 40194304;
....
....
JavaPairRDD<String, List<String>> typedData = filePaths.mapPartitionsToPair(new PairFlatMapFunction<Iterator<String>, String, List<String>>() {
    @Override
    public Iterable<Tuple2<String, List<String>>> call(Iterator<String> filesIterator) throws Exception {
        List<Tuple2<String, List<String>>> res = new ArrayList<>();
        String fileType = null;
        List<String> linesList = null;
        if (filesIterator != null) {
            while (filesIterator.hasNext()) {
                try {
                    Path file = new Path(filesIterator.next());
                    // filter non-trc files
                    if (!file.getName().startsWith("1")) {
                        continue;
                    }
                    fileType = getType(file.getName());
                    Configuration conf = new Configuration();
                    CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(conf);
                    CompressionCodec codec = compressionCodecs.getCodec(file);
                    FileSystem fs = file.getFileSystem(conf);
                    ContentSummary contentSummary = fs.getContentSummary(file);
                    long fileSize = contentSummary.getLength();
                    InputStream in = fs.open(file);
                    if (codec != null) {
                        in = codec.createInputStream(in);
                    } else {
                        throw new IOException();
                    }
                    byte[] buffer = new byte[MAX_FILE_SIZE];
                    BufferedInputStream bis = new BufferedInputStream(in, BUFFER_SIZE);
                    int count = 0;
                    int bytesRead = 0;
                    try {
                        while ((bytesRead = bis.read(buffer, count, BUFFER_SIZE)) != -1) {
                            count += bytesRead;
                        }
                    } catch (Exception e) {
                        log.error("Error reading file: " + file.getName() + ", trying to read " + BUFFER_SIZE + " bytes at offset: " + count);
                        throw e;
                    }
                    Iterable<String> lines = Splitter.on("\n").split(new String(buffer, "UTF-8").trim());
                    linesList = Lists.newArrayList(lines);
                    // get rid of first line in file
                    Iterator<String> it = linesList.iterator();
                    if (it.hasNext()) {
                        it.next();
                        it.remove();
                    }
                    //res.add(new Tuple2<>(fileType, linesList));
                } finally {
                    res.add(new Tuple2<>(fileType, linesList));
                }
            }
        }
        return res;
    }
});
In particular, a buffer of 40 MB is allocated for every file in order to read its content through the BufferedInputStream, and this eventually exhausts memory.
The thing is:
If I read line by line (which does not require such a buffer), the reads will be very inefficient.
If I allocate one buffer and reuse it for each file read, is that possible in terms of parallelism, or will it get overwritten by several threads?
Any suggestions are welcome...
EDIT 2: I fixed the first memory issue by moving the byte array allocation outside the iterator, so it gets reused by all partition elements. But there is still the new String(buffer, "UTF-8").trim(), which is created for the split; that object also gets created every time. I could use a StringBuffer/StringBuilder, but then how would I set the charset encoding without a String object?
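On the charset question: one way to pick an encoding without building a String per file is to decode the reused byte buffer into a CharBuffer with java.nio; Guava's Splitter accepts any CharSequence, so the result can be split directly. A sketch, assuming the buffer and count variables from the code above (java.nio.ByteBuffer, java.nio.CharBuffer and java.nio.charset imports assumed):
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
// decode only the bytes actually read, without allocating an intermediate String
CharBuffer chars = decoder.decode(ByteBuffer.wrap(buffer, 0, count));
// CharBuffer implements CharSequence, so it can be fed straight to Splitter
Iterable<String> lines = Splitter.on("\n").split(chars);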
Eventually I changed the code as follows:
// Transform list of files to list of all files' content in lines grouped by type
JavaPairRDD<String, List<String>> typedData = filePaths.mapToPair(new PairFunction<String, String, List<String>>() {
    @Override
    public Tuple2<String, List<String>> call(String filePath) throws Exception {
        Tuple2<String, List<String>> tuple = null;
        try {
            String fileType = null;
            List<String> linesList = new ArrayList<String>();
            Configuration conf = new Configuration();
            CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(conf);
            Path path = new Path(filePath);
            fileType = getType(path.getName());
            tuple = new Tuple2<String, List<String>>(fileType, linesList);
            // filter non-trc files
            if (!path.getName().startsWith("1")) {
                return tuple;
            }
            CompressionCodec codec = compressionCodecs.getCodec(path);
            FileSystem fs = path.getFileSystem(conf);
            InputStream in = fs.open(path);
            if (codec != null) {
                in = codec.createInputStream(in);
            } else {
                throw new IOException();
            }
            BufferedReader r = new BufferedReader(new InputStreamReader(in, "UTF-8"), BUFFER_SIZE);
            // Get rid of the first line in the file
            r.readLine();
            // Read all lines
            String line;
            while ((line = r.readLine()) != null) {
                linesList.add(line);
            }
        } catch (IOException e) { // Filtering of files whose reading went wrong
            log.error("Reading of the file " + filePath + " went wrong: " + e.getMessage());
        } finally {
            return tuple;
        }
    }
});
So now I do not use a 40 MB buffer but rather build the list of lines dynamically using an ArrayList. This solved my current memory issue, but now I am getting other strange errors that fail the job. I will report those in a different question...

Jmeter value to variable in string

How do I replace a variable defined in a file (a.xml) after the file is read into JMeter?
E.g. a.xml has the content:
<Shipment Action="MODIFY" OrderNo="${vOrderNo}" >
The entire file is read into a string using
str_Input=${__FileToString(/a.xml)}
In the JMX file, an HTTP Request is made to get the output from a web service.
Using an XPath Extractor, the value of OrderNo is read into a variable vOrderNo.
Now I want to use the value of the variable vOrderNo in str_Input. How do I do that?
You can easily achieve this using Beanshell (~Java) code from any JMeter sampler that allows Beanshell code execution, e.g. a BeanShell Sampler.
The following works:
import java.io.*;

try {
    // reading file into buffer
    StringBuilder data = new StringBuilder();
    BufferedReader in = new BufferedReader(new FileReader("d:\\test.xml"));
    char[] buf = new char[1024];
    int numRead = 0;
    while ((numRead = in.read(buf)) != -1) {
        data.append(buf, 0, numRead);
    }
    in.close();
    // replacing stub with actual value
    String vOrderNo = vars.get("vOrderNo");
    String temp = data.toString().replaceAll("\\$\\{vOrderNo\\}", vOrderNo);
    // writing back into file
    Writer out = new BufferedWriter(new FileWriter("d:\\test.xml"));
    out.write(temp);
    out.close();
}
catch (Exception ex) {
    IsSuccess = false;
    log.error(ex.getMessage());
    System.err.println(ex.getMessage());
}
catch (Throwable thex) {
    System.err.println(thex.getMessage());
}
This code doesn't require reading the file into a string via ${__FileToString(...)}.
You can also combine both methods.
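A minimal sketch of the combined approach, run from a BeanShell Sampler or PreProcessor (hedged: it assumes str_Input was already loaded with ${__FileToString(/a.xml)}, that vOrderNo was set by the XPath Extractor, and that str_Request is a hypothetical name for the result):
// take the template read with __FileToString and the extracted order number
String template = vars.get("str_Input");
String orderNo = vars.get("vOrderNo");
// same escaped pattern as above, so JMeter does not pre-expand ${vOrderNo} inside the script text
vars.put("str_Request", template.replaceAll("\\$\\{vOrderNo\\}", orderNo));
The resulting ${str_Request} can then be used, for example, as the body of the next HTTP Request sampler.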

windows phone 7 updating isolated storage

In Windows Phone 7, what is the protocol for updating an isolated storage text file? Say I have 10 words in a text file, arranged one per line. Now suppose the user uses the application and a new word needs to be stored on the fifth line. How do I write to the file, which already contains 10 words, one word per line?
Thanks in advance, you guys are awesome.
The way I have been doing it is:
Read the file from IsolatedStorage into memory
Update the String
Write the file back to storage
Read in File
public static string ReadFromStorage(string filename)
{
    string fileText = "";
    try
    {
        using (IsolatedStorageFile storage = IsolatedStorageFile.GetUserStoreForApplication())
        {
            using (StreamReader sr = new StreamReader(new IsolatedStorageFileStream(filename, FileMode.Open, storage)))
            {
                fileText = sr.ReadToEnd();
            }
        }
    }
    catch
    {
    }
    return fileText;
}
Write to File
public static void WriteToStorage(string filename, string text)
{
    try
    {
        using (IsolatedStorageFile storage = IsolatedStorageFile.GetUserStoreForApplication())
        {
            string directory = Path.GetDirectoryName(filename);
            if (!storage.DirectoryExists(directory))
                storage.CreateDirectory(directory);
            if (storage.FileExists(filename))
            {
                MessageBoxResult result = MessageBox.Show(filename + " Exists\nOverwrite Existing File?", "Question", MessageBoxButton.OKCancel);
                if (result == MessageBoxResult.Cancel)
                    return;
            }
            using (StreamWriter sw = new StreamWriter(storage.CreateFile(filename)))
            {
                sw.Write(text);
            }
        }
    }
    catch
    {
    }
}
So I would do:
string fileName = "Test.txt";
string testFile = IsolatedStorage_Utility.ReadFromStorage(fileName);
testFile = testFile.Replace("a", "b");
IsolatedStorage_Utility.WriteToStorage(fileName, testFile);
Writing to a file in isolated storage is basically a normal file write operation; it works much the same as accessing, reading, and writing a regular file on a desktop operating system. In your scenario, if you are sure you need to update the fifth line out of the ten, you would read the file line by line with a StreamReader and use a StreamWriter to update the specific line you want to change. You do not need to rewrite all of the content again and again.
On the other hand, if you just want to add new content, you can simply append it to the end of the file. You may find this link useful: http://goo.gl/IKii5

Programmatically reading the output of Hadoop Mapreduce Program

This may be a basic question, but I could not find an answer for it on Google.
I have a map-reduce job that creates multiple output files in its output directory.
My Java application executes this job on a remote Hadoop cluster, and after the job is finished it needs to read the output programmatically using the org.apache.hadoop.fs.FileSystem API. Is that possible?
The application knows the output directory, but not the names of the output files generated by the map-reduce job. It seems there is no way to programmatically list the contents of a directory in the Hadoop FileSystem API. How can the output files be read?
It seems such a commonplace scenario that I am sure it has a solution, but I am missing something very obvious.
The method you are looking for is called listStatus(Path).
It simply returns all the files inside a Path as a FileStatus array. You can then loop over them, create a Path object for each, and read it.
FileStatus[] fss = fs.listStatus(new Path("/"));
for (FileStatus status : fss) {
    Path path = status.getPath();
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    IntWritable key = new IntWritable();
    IntWritable value = new IntWritable();
    while (reader.next(key, value)) {
        System.out.println(key.get() + " | " + value.get());
    }
    reader.close();
}
For Hadoop 2.x you can set up the reader like this:
SequenceFile.Reader reader =
        new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
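For example, a complete read loop with the 2.x constructor might look like this (a sketch, assuming the key and value classes stored in the file are Writable, as above; SequenceFile.Reader is Closeable, so try-with-resources works):
try (SequenceFile.Reader reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
    // instantiate the key/value types recorded in the file header
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    while (reader.next(key, value)) {
        System.out.println(key + " | " + value);
    }
}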
You have a few options; here are two that I sometimes use.
Method #1: depending on your data size, make use of the following HDFS commands (found here, Item 6):
hadoop fs -getmerge hdfs-output-dir local-file
// example
hadoop fs -getmerge /user/kenny/mrjob/ /tmp/mrjob_output
// another way
hadoop fs -cat /user/kenny/mrjob/part-r-* > /tmp/mrjob_output
"This concatenates the HDFS files hdfs-output-dir/part-* into a single local file."
Then you can just read in that one single file. (Note that it is in local storage, not HDFS.)
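Reading the merged local file is then plain Java I/O, for instance (a sketch, assuming the output is text):
try (BufferedReader br = new BufferedReader(new FileReader("/tmp/mrjob_output"))) {
    String line;
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
}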
Method #2: create a helper method. (I have a class called HDFS which holds the Configuration and FileSystem instances as well as other helper methods.)
public List<Path> matchFiles(String path, final String filter) {
    List<Path> matches = new LinkedList<Path>();
    try {
        FileStatus[] statuses = fileSystem.listStatus(new Path(path), new PathFilter() {
            public boolean accept(Path path) {
                return path.toString().contains(filter);
            }
        });
        for (FileStatus status : statuses) {
            matches.add(status.getPath());
        }
    } catch (IOException e) {
        LOGGER.error(e.getMessage(), e);
    }
    return matches;
}
You can then call it via a command like this: hdfs.matchFiles("/user/kenny/mrjob/", "part-"), and then open and read each returned Path, for example:
// open one of the matched output files and read it record by record
FSDataInputStream inputStream = fs.open(path);
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
String record;
while ((record = reader.readLine()) != null) {
    // split each record into key and value at the first blank
    int blankPos = record.indexOf(" ");
    System.out.println(record + "blankPos" + blankPos);
    String keyString = record.substring(0, blankPos);
    String valueString = record.substring(blankPos + 1);
    System.out.println(keyString + " | " + valueString);
}
