Memory issues when running Spark job on relatively large input - hadoop

I am running a Spark cluster with 50 machines. Each machine is a VM with 8 cores and 50 GB of memory (about 41 GB seems to be available to Spark).
I am running over several input folders, and I estimate the total input size to be ~250 GB gz-compressed.
Although the number and configuration of machines seems sufficient to me, the job fails after about 40 minutes of running, and I can see the following errors in the logs:
2558733 [Result resolver thread-2] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 345.0 in stage 1.0 (TID 345, hadoop-w-3.c.taboola-qa-01.internal): java.lang.OutOfMemoryError: Java heap space
java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
java.lang.StringCoding.decode(StringCoding.java:193)
java.lang.String.<init>(String.java:416)
java.lang.String.<init>(String.java:481)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:699)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:660)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
and also:
2653545 [Result resolver thread-2] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 122.1 in stage 1.0 (TID 392, hadoop-w-22.c.taboola-qa-01.internal): java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
java.lang.StringCoding.decode(StringCoding.java:193)
java.lang.String.<init>(String.java:416)
java.lang.String.<init>(String.java:481)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:699)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:660)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
How do I go about debugging such an issue?
EDIT: I found the root cause of the problem. It is this piece of code:
private static final int MAX_FILE_SIZE = 40194304;
....
....
JavaPairRDD<String, List<String>> typedData = filePaths.mapPartitionsToPair(new PairFlatMapFunction<Iterator<String>, String, List<String>>() {
@Override
public Iterable<Tuple2<String, List<String>>> call(Iterator<String> filesIterator) throws Exception {
List<Tuple2<String, List<String>>> res = new ArrayList<>();
String fileType = null;
List<String> linesList = null;
if (filesIterator != null) {
while (filesIterator.hasNext()) {
try {
Path file = new Path(filesIterator.next());
// filter non-trc files
if (!file.getName().startsWith("1")) {
continue;
}
fileType = getType(file.getName());
Configuration conf = new Configuration();
CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(conf);
CompressionCodec codec = compressionCodecs.getCodec(file);
FileSystem fs = file.getFileSystem(conf);
ContentSummary contentSummary = fs.getContentSummary(file);
long fileSize = contentSummary.getLength();
InputStream in = fs.open(file);
if (codec != null) {
in = codec.createInputStream(in);
} else {
throw new IOException();
}
byte[] buffer = new byte[MAX_FILE_SIZE];
BufferedInputStream bis = new BufferedInputStream(in, BUFFER_SIZE);
int count = 0;
int bytesRead = 0;
try {
while ((bytesRead = bis.read(buffer, count, BUFFER_SIZE)) != -1) {
count += bytesRead;
}
} catch (Exception e) {
log.error("Error reading file: " + file.getName() + ", trying to read " + BUFFER_SIZE + " bytes at offset: " + count);
throw e;
}
Iterable<String> lines = Splitter.on("\n").split(new String(buffer, "UTF-8").trim());
linesList = Lists.newArrayList(lines);
// get rid of first line in file
Iterator<String> it = linesList.iterator();
if (it.hasNext()) {
it.next();
it.remove();
}
//res.add(new Tuple2<>(fileType,linesList));
} finally {
res.add(new Tuple2<>(fileType, linesList));
}
}
}
return res;
}
});
In particular, it allocates a 40 MB buffer for each file in order to read the file's content through a BufferedInputStream. This causes the heap memory to run out at some point.
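For a rough sense of scale (an estimate based on the code above, not a measurement): with 8 cores per VM, up to 8 tasks may run concurrently in one executor. Each task holds a 40 MB byte[], the String decoded from it (Java chars are 2 bytes each, so up to ~80 MB per file), and the res list, which keeps the lines of every file in the partition until the task finishes. Add the garbage from previously processed files that has not been collected yet, and the heap and GC come under heavy pressure, which matches both the heap-space and GC-overhead errors above.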
The thing is:
If I read line by line (which does not require a buffer), the read will be very inefficient.
If I allocate one buffer and reuse it for each file read - is that possible in terms of parallelism, or will it get overwritten by several threads?
Any suggestions are welcome...
EDIT 2: I fixed the first memory issue by moving the byte array allocation outside the iterator, so it gets reused for all partition elements. But there is still the new String(buffer, "UTF-8").trim(), which is created for the splitting step - that object also gets created every time. I could use a StringBuffer/StringBuilder, but then how would I set the charset encoding without a String object?
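One way to set the charset without going through a big String (a minimal sketch of mine, reusing the buffer and count variables from the code above; needs java.nio.ByteBuffer, java.nio.CharBuffer, java.nio.charset.CharsetDecoder and java.nio.charset.StandardCharsets) is java.nio's CharsetDecoder, which decodes straight into a CharBuffer. CharBuffer implements CharSequence, so Guava's Splitter can consume it directly:
// sketch, not the original code
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
CharBuffer chars = decoder.decode(ByteBuffer.wrap(buffer, 0, count)); // only the bytes actually read
Iterable<String> lines = Splitter.on("\n").split(chars);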

Eventually I changed the code as follows:
// Transform list of files to list of all files' content in lines grouped by type
JavaPairRDD<String,List<String>> typedData = filePaths.mapToPair(new PairFunction<String, String, List<String>>() {
@Override
public Tuple2<String, List<String>> call(String filePath) throws Exception {
Tuple2<String, List<String>> tuple = null;
try {
String fileType = null;
List<String> linesList = new ArrayList<String>();
Configuration conf = new Configuration();
CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(conf);
Path path = new Path(filePath);
fileType = getType(path.getName());
tuple = new Tuple2<String, List<String>>(fileType, linesList);
// filter non-trc files
if (!path.getName().startsWith("1")) {
return tuple;
}
CompressionCodec codec = compressionCodecs.getCodec(path);
FileSystem fs = path.getFileSystem(conf);
InputStream in = fs.open(path);
if (codec != null) {
in = codec.createInputStream(in);
} else {
throw new IOException();
}
BufferedReader r = new BufferedReader(new InputStreamReader(in, "UTF-8"), BUFFER_SIZE);
// Get rid of the first line in the file
r.readLine();
// Read all lines
String line;
while ((line = r.readLine()) != null) {
linesList.add(line);
}
} catch (IOException e) { // Filtering of files whose reading went wrong
log.error("Reading of the file " + filePath + " went wrong: " + e.getMessage());
} finally {
return tuple;
}
}
});
So now I do not use a 40 MB buffer; instead I build the list of lines dynamically using an ArrayList. This solved my current memory issue, but now I am getting other strange errors that fail the job. I will report those in a different question...

Related

Read packed decimal and convert to numeric in spring boot

All,
I am using a Spring Boot application to store data in a DB. I am getting this data from IBM MQ through a Kafka topic.
I am getting the messages in EBCDIC format, so I used a COBOL copybook with the JRecord and cb2xml jars to convert them to a readable format and store them in the DB.
Now I am receiving another file in the same manner, but after conversion the data looks like this:
10020REFUNDONE
10021REFUNDTWO ·" ÷/
10022REFUNDTHREE oú^ "
10023REFUNDFOUR ¨jÄ ò≈
Here is how I am converting from EBCDIC to a readable format:
AbstractLineReader reader = null;
StringBuffer finalBuffer = new StringBuffer();
try {
String copybook = "/ds_header.cbl";
reader = CustomCobolProvider.getInstance().getLineReader(copybook, Convert.FMT_MAINFRAME, new BufferedInputStream(new ByteArrayInputStream(salesData)));
AbstractLine line;
while ((line = reader.read()) != null) {
if (null != line.getFieldValue(REC_TYPE)){
finalBuffer.append(line.getFullLine());
}
}
}
and this is my getLineReader method:
public AbstractLineReader getLineReader(String copybook, int numericType, InputStream fileStream) throws Exception {
String font = "";
if (numericType == 1) {
font = "cp037";
}
InputStream stream = CustomCobolProvider.class.getResourceAsStream(copybook);
if(stream == null ) throw new RuntimeException("Can't Load the Copybook Metadata file from Resource....");
LayoutDetail copyBook = ((ExternalRecord)this.copybookInt.loadCopyBook(stream, copybook, CopybookLoader.SPLIT_REDEFINE, 0, font, CommonBits.getDefaultCobolTextFormat(), Convert.FMT_MAINFRAME, 0, (AbsSSLogger)null).setFileStructure(Constants.IO_FIXED_LENGTH)).asLayoutDetail();
AbstractLineReader ret = LineIOProvider.getInstance().getLineReader(copyBook, (LineProvider)null);
ret.open(fileStream, copyBook);
return ret;
}
I am stuck on the numeric conversion; I learned that the data is coming in packed decimal.
I have no knowledge of COBOL or mainframes; I referred to a few sites and learned how to convert from EBCDIC to a readable format. Please help!
The problem is that the getFullLine() method does not do any field translation; you need to access the individual fields. You can use line.getFieldIterator(0) to get a field iterator for the line.
Also, unless you are using an ancient version of JRecord, you are better off using the JRecordInterface1 class.
Something like the following should work:
StringBuffer finalBuffer = new StringBuffer();
try {
    ICobolIOBuilder iob = JRecordInterface1.COBOL
            .newIOBuilder(copybookName)
            .setFont("cp037")
            .setFileOrganization(Constants.IO_FIXED_LENGTH);
    AbstractLineReader reader = iob.newReader(dataFile);
    AbstractLine line;
    while ((line = reader.read()) != null) {
        String sep = "";
        for (AbstractFieldValue fv : line.getFieldIterator(0)) {
            finalBuffer.append(sep).append(fv);
            sep = "\t";
        }
        finalBuffer.append("\n");
    }
    reader.close();
} catch (Exception e) {
    // whatever ....
}
Other points
With an MQ data source you do not need to create line readers. You can create lines directly from a byte array:
ICobolIOBuilder iob = JRecordInterface1.COBOL
        .newIOBuilder(copybookName)
        .setFont("cp037")
        .setFileOrganization(Constants.IO_FIXED_LENGTH);

AbstractLine line = iob.newLine(byteArrayFromMq);
for (AbstractFieldValue fv : line.getFieldIterator(0)) {
    // whatever
}
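If you only need specific numeric (packed decimal / COMP-3) fields, you should also be able to look a field up by name; the field name below is a placeholder and would come from your ds_header.cbl copybook:
// "REFUND-AMT" is a made-up name; substitute a numeric field from the copybook
AbstractFieldValue amount = line.getFieldValue("REFUND-AMT");
finalBuffer.append(amount.asBigDecimal());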

Minimize the memory impact of generating and downloading a file

For testing purposes only, I created this controller, which generates a 100 MB file and returns it to the client. The method is very fast. The content of the file does not matter. The file is generated on the fly; it is not saved to disk.
Is it possible to reduce the impact on memory, particularly on the Java heap? Thank you
@GetMapping("/testDownload100MB")
public ResponseEntity<Resource> download100MB() throws IOException {
int sizeInBytes = 100 * 1024 * 1024;
ByteArrayOutputStream outStream = new ByteArrayOutputStream(sizeInBytes);
for (int i = 0; i < sizeInBytes; i++) {
outStream.write(0);
}
Resource resource = new ByteArrayResource(outStream.toByteArray());
return ResponseEntity.ok()
.headers(utilities.getGenericHttpHeadersToDownloadFile())
.contentLength(sizeInBytes)
.contentType(MediaType.APPLICATION_OCTET_STREAM)
.body(resource);
}
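For a rough idea of why this version is heavy on the heap (an estimate, not a measurement): the ByteArrayOutputStream is pre-sized to 100 MB, and toByteArray() makes a second full copy that the ByteArrayResource keeps referenced until the response has been written, so a single request needs at least ~200 MB of heap, and concurrent requests multiply that.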
The code of utilities.getGenericHttpHeadersToDownloadFile() is not relevant for the question, however I report it for completeness:
public HttpHeaders getGenericHttpHeadersToDownloadFile() {
HttpHeaders headers = new HttpHeaders();
headers.add(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=myfile");
headers.add("Cache-Control", "no-cache, no-store, must-revalidate");
headers.add("Pragma", "no-cache");
headers.add("Expires", "0");
return headers;
}
Solved.
The code in my question throws a java.lang.OutOfMemoryError: Java heap space on my Spring Boot server. This new code solves the issue:
@GetMapping("/testDownload100MB")
public ResponseEntity<Resource> download100MB() throws IOException {
int size1MB = 1 * 1024 * 1024; // size in bytes
File tempFile = new File(System.getProperty("user.home") + File.separator + "tempFile" + System.currentTimeMillis());
for (int i = 0; i < 100; i++) {
ByteArrayOutputStream outStream = new ByteArrayOutputStream(size1MB);
for (int j = 0; j < size1MB; j++) {
outStream.write(0);
}
FileUtils.writeByteArrayToFile(tempFile, outStream.toByteArray(), true); // append to the temp file
}
Resource resource = new FileSystemResource(tempFile);
Timer timer = new Timer();
timer.schedule(new TimerTask() {
@Override
public void run() {
// we can assume that the file can be safely deleted after five minutes
tempFile.delete();
}
}, 1000 * 60 * 5);
return ResponseEntity.ok()
.headers(utilities.getGenericHttpHeadersToDownloadFile())
.contentLength(tempFile.length())
.contentType(MediaType.APPLICATION_OCTET_STREAM)
.body(resource);
}
Note that FileUtils must be imported with: import org.apache.commons.io.FileUtils;.
In this case, no objects exceeding 1 MB are placed in the Java heap. In addition, FileSystemResource does not need to load the entire file into memory. On the other hand, I had to add a timer to delete the file: since this is a method for testing purposes, I can safely assume that after a few minutes the file is no longer needed.
If there are better solutions that do not require a temporary file, add your answer :)
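One possible alternative without a temporary file (a minimal sketch of mine, assuming Spring MVC's org.springframework.web.servlet.mvc.method.annotation.StreamingResponseBody is on the classpath; the endpoint name is made up, and utilities is the same helper used above) is to stream the bytes to the response in small chunks, so neither a large in-memory buffer nor a file on disk is needed:
@GetMapping("/testDownload100MBStreaming")
public ResponseEntity<StreamingResponseBody> download100MBStreaming() {
    int sizeInBytes = 100 * 1024 * 1024;
    StreamingResponseBody body = outputStream -> {
        byte[] chunk = new byte[8192]; // reused 8 KB buffer of zeros
        int written = 0;
        while (written < sizeInBytes) {
            int toWrite = Math.min(chunk.length, sizeInBytes - written);
            outputStream.write(chunk, 0, toWrite);
            written += toWrite;
        }
    };
    return ResponseEntity.ok()
            .headers(utilities.getGenericHttpHeadersToDownloadFile())
            .contentLength(sizeInBytes)
            .contentType(MediaType.APPLICATION_OCTET_STREAM)
            .body(body);
}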

Create AIScene instance from the file's content

I'm writing a Java web service where it is possible to upload a 3D object, operate on it, and store it.
What I'm trying to do is create an AIScene instance using a byte[] as an input parameter, which is the file itself (its content).
I have found no way to do this in the docs; all import methods require a path.
Right now I'm looking at both the LWJGL Java version of Assimp and the C++ version. It doesn't matter which one is used to solve the issue.
Edit: the code I'm trying to get working:
@Override
public String uploadFile(MultipartFile file) {
AIFileIO fileIo = AIFileIO.create();
AIFileOpenProcI fileOpenProc = new AIFileOpenProc() {
public long invoke(long pFileIO, long fileName, long openMode) {
AIFile aiFile = AIFile.create();
final ByteBuffer data;
try {
data = ByteBuffer.wrap(file.getBytes());
} catch (IOException e) {
throw new RuntimeException();
}
AIFileReadProcI fileReadProc = new AIFileReadProc() {
public long invoke(long pFile, long pBuffer, long size, long count) {
long max = Math.min(data.remaining(), size * count);
memCopy(memAddress(data) + data.position(), pBuffer, max);
return max;
}
};
AIFileSeekI fileSeekProc = new AIFileSeek() {
public int invoke(long pFile, long offset, int origin) {
if (origin == Assimp.aiOrigin_CUR) {
data.position(data.position() + (int) offset);
} else if (origin == Assimp.aiOrigin_SET) {
data.position((int) offset);
} else if (origin == Assimp.aiOrigin_END) {
data.position(data.limit() + (int) offset);
}
return 0;
}
};
AIFileTellProcI fileTellProc = new AIFileTellProc() {
public long invoke(long pFile) {
return data.limit();
}
};
aiFile.ReadProc(fileReadProc);
aiFile.SeekProc(fileSeekProc);
aiFile.FileSizeProc(fileTellProc);
return aiFile.address();
}
};
AIFileCloseProcI fileCloseProc = new AIFileCloseProc() {
public void invoke(long pFileIO, long pFile) {
/* Nothing to do */
}
};
fileIo.set(fileOpenProc, fileCloseProc, NULL);
AIScene scene = aiImportFileEx(file.getName(),
aiProcess_JoinIdenticalVertices | aiProcess_Triangulate, fileIo); // ISSUE HERE. file.getName() is not a path, just a name. so is getOriginalName() in my case.
try{
Long id = scene.mMeshes().get(0);
AIMesh mesh = AIMesh.create(id);
AIVector3D vertex = mesh.mVertices().get(0);
return mesh.mName().toString() + ": " + (vertex.x() + " " + vertex.y() + " " + vertex.z());
}catch(Exception e){
e.printStackTrace();
}
return "fail";
}
When debugging the method I get an access violation in the method that binds to the native:
public static long naiImportFileEx(long pFile, int pFlags, long pFS)
this is the message:
#
A fatal error has been detected by the Java Runtime Environment:
#
EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000007400125d, pid=6400, tid=0x0000000000003058
#
JRE version: Java(TM) SE Runtime Environment (8.0_201-b09) (build 1.8.0_201-b09)
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.201-b09 mixed mode windows-amd64 compressed oops)
Problematic frame:
V [jvm.dll+0x1e125d]
#
Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
An error report file with more information is saved as:
C:\Users\ragos\IdeaProjects\objectstore3d\hs_err_pid6400.log
#
If you would like to submit a bug report, please visit:
http://bugreport.java.com/bugreport/crash.jsp
#
It is possible if we use the aiImportFileFromMemory method.
The approach I wanted to follow was copied from a GitHub demo and actually copies the buffer around unnecessarily.
The reason for the access violation was the use of indirect buffers (for more info on why that is a problem, check this out).
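For illustration (a minimal sketch of mine, not from the original post, using org.lwjgl.BufferUtils): ByteBuffer.wrap(...) yields a heap-backed, indirect buffer whose memory cannot safely be handed to native code, whereas an explicitly allocated direct buffer lives off-heap at a stable address:
byte[] bytes = file.getBytes();                                       // 'file' is the MultipartFile from the question
ByteBuffer heapBuffer = ByteBuffer.wrap(bytes);                       // heapBuffer.isDirect() == false
ByteBuffer directBuffer = BufferUtils.createByteBuffer(bytes.length); // off-heap, directBuffer.isDirect() == true
directBuffer.put(bytes);
directBuffer.flip();                                                  // make the data readable from position 0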
The solution is not nearly as complicated as the code I initially pasted:
@Override
public String uploadFile(MultipartFile file) throws IOException {
ByteBuffer buffer = BufferUtils.createByteBuffer((int) file.getSize());
buffer.put(file.getBytes());
buffer.flip();
AIScene scene = Assimp.aiImportFileFromMemory(buffer,aiProcess_Triangulate, (ByteBuffer) null);
Long id = scene.mMeshes().get(0);
AIMesh mesh = AIMesh.create(id);
AIVector3D vertex = mesh.mVertices().get(0);
return mesh.mName().dataString() + ": " + (vertex.x() + " " + vertex.y() + " " + vertex.z());
}
Here I create a direct buffer of the appropriate size, load the data, and flip it (this part is a must). After that, Assimp does its magic and you get pointers to the structure. With the return statement I just check that I got valid data.
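One caveat worth adding (my note, not from the original post): the returned AIScene wraps natively allocated memory, so once the mesh data has been extracted you would normally free it:
// free the native scene once you are done with it (assuming nothing else still references it)
Assimp.aiReleaseImport(scene);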
Edit:
As was pointed out in the comments, this implementation is limited to a single file upload and assumes it gets everything it needs from that one MultipartFile, so it won't work well with formats that reference external files. See the docs for more detail.
The demo linked in the question's comments, which I used as a base, has a different use case than my original one.

FSDataOutputStream.writeUTF() adds extra characters at the start of the data on HDFS. How can I avoid this extra data?

What I am trying to do is convert a sequence file on HDFS that contains XML data into .xml files on HDFS.
I searched on Google and found the code below. I made modifications according to my needs, and the following is the code:
public class SeqFileWriterCls {
public static void main(String args[]) throws Exception {
System.out.println("Reading Sequence File");
Path path = new Path("seq_file_path/seq_file.seq");
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Writer writer = null;
SequenceFile.Reader reader = null;
FSDataOutputStream fwriter = null;
OutputStream fowriter = null;
try {
reader = new SequenceFile.Reader(fs, path, conf);
//writer = new SequenceFile.Writer(fs, conf,out_path,Text.class,Text.class);
Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
while (reader.next(key, value)) {
//i am just editing the path in such a way that key will be my filename and data in it will be the value
Path out_path = new Path(""+key);
String string_path = out_path.toString();
String clear_path=string_path.substring(string_path.lastIndexOf("/")+1);
Path finalout_path = new Path("path"+clear_path);
System.out.println("the final path is "+finalout_path);
fwriter = fs.create(finalout_path);
fwriter.writeUTF(value.toString());
fwriter.close();
FSDataInputStream in = fs.open(finalout_path);
String s = in.readUTF();
System.out.println("file has: -" + s);
//fowriter = fs.create(finalout_path);
//fowriter.write(value.toString());
System.out.println(key + " <===> :" + value.toString());
System.exit(0);
}
} catch (Exception e) {
e.printStackTrace();
} finally {
IOUtils.closeStream(reader);
fs.close();
}
}
}
I am using "FSDataOutputStream" to write the data to HDFS and the method is used is "writeUTF" The issue is that when i write to the hdfs file some additional characters are getting in the starting of data. But when i print the data i couldnt see the extra characters.
i tried using writeChars() but even taht wont work.
is there any way to avoid this?? or is there any other way to write the data to HDFS???
please help...
The JavaDoc of the writeUTF(String str) method says the following:
Writes a string to the underlying output stream using modified UTF-8 encoding in a machine-independent manner.
First, two bytes are written to the output stream as if by the writeShort method giving the number of bytes to follow. This value is the number of bytes actually written out, not the length of the string. Following the length, each character of the string is output, in sequence, using the modified UTF-8 encoding for the character. (...)
Both the writeBytes(String str) and writeChars(String str) methods should work fine.
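In other words, the extra characters at the start of the file are the two-byte length prefix that writeUTF() emits. If the output file should contain only the XML text, a plain byte write (a sketch against the code above, using java.nio.charset.StandardCharsets) avoids both the prefix and the modified encoding:
fwriter = fs.create(finalout_path);
fwriter.write(value.toString().getBytes(StandardCharsets.UTF_8));
fwriter.close();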

How to write data from MySQL into a file using JDBC and a FileWriter?

String selectTableSQL = "select JobID, MetadataJson from raasjobs join metadata using (JobID) where JobCreatedDate > '2014-07-01';";
File file = new File("/users/t_shetd/file.txt");
try {
dbConnection = getDBConnection();
statement = dbConnection.createStatement();
System.out.println(selectTableSQL);
// execute select SQL stetement
ResultSet rs = statement.executeQuery(selectTableSQL);
if (!file.exists()) {
file.createNewFile();
}
FileWriter fw = new FileWriter(file.getAbsoluteFile());
BufferedWriter bw = new BufferedWriter(fw);
while (rs.next()) {
String JobID = rs.getString("JobID");
String Metadata = rs.getString("MetadataJson");
bw.write(selectTableSQL);
bw.close();
System.out.println("Done");
// Now I am only getting the output "Done"
If I understand your question, then this
while (rs.next()) {
String JobID = rs.getString("JobID");
String Metadata = rs.getString("MetadataJson");
bw.write(selectTableSQL);
bw.close();
System.out.println("Done");
}
should be something like (following Java capitalization conventions):
while (rs.next()) {
String jobId = rs.getString("JobID");
String metaData = rs.getString("MetadataJson");
bw.write(String.format("Job ID: %s, MetaData: %s", jobId, metaData));
}
bw.close(); // <-- finish writing first!
System.out.println("Done");
In your version, you close the writer after processing the first row of the ResultSet. After that, nothing else will be written (because the writer is closed).
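As a side note (a sketch of mine, not part of the original answer), try-with-resources closes the writer automatically once the block ends, which makes this kind of premature close() impossible:
try (BufferedWriter bw = new BufferedWriter(new FileWriter(file))) {
    ResultSet rs = statement.executeQuery(selectTableSQL);
    while (rs.next()) {
        bw.write(String.format("Job ID: %s, MetaData: %s",
                rs.getString("JobID"), rs.getString("MetadataJson")));
        bw.newLine();
    }
}
System.out.println("Done");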
