Read packed decimal and convert to numeric in Spring Boot

All,
I am using a Spring Boot application to store data in a DB. I am getting this data from IBM MQ through a Kafka topic.
I am getting the messages in EBCDIC format, so I used a COBOL copybook with the JRecord and cb2xml jars to convert them to a readable format and store them in the DB.
Now I am getting another file in the same manner, but after conversion the data looks like this:
10020REFUNDONE
10021REFUNDTWO ·" ÷/
10022REFUNDTHREE oú^ "
10023REFUNDFOUR ¨jÄ ò≈
Here is how I am converting from EBCDIC to a readable format:
AbstractLineReader reader = null;
StringBuffer finalBuffer = new StringBuffer();
try {
    String copybook = "/ds_header.cbl";
    reader = CustomCobolProvider.getInstance().getLineReader(copybook, Convert.FMT_MAINFRAME,
            new BufferedInputStream(new ByteArrayInputStream(salesData)));
    AbstractLine line;
    while ((line = reader.read()) != null) {
        if (null != line.getFieldValue(REC_TYPE)) {
            finalBuffer.append(line.getFullLine());
        }
    }
} finally {
    if (reader != null) {
        reader.close();
    }
}
and this is my getLineReader method:
public AbstractLineReader getLineReader(String copybook, int numericType, InputStream fileStream) throws Exception {
    String font = "";
    if (numericType == 1) {
        font = "cp037";
    }
    InputStream stream = CustomCobolProvider.class.getResourceAsStream(copybook);
    if (stream == null) throw new RuntimeException("Can't Load the Copybook Metadata file from Resource....");
    LayoutDetail copyBook = ((ExternalRecord) this.copybookInt.loadCopyBook(stream, copybook,
            CopybookLoader.SPLIT_REDEFINE, 0, font, CommonBits.getDefaultCobolTextFormat(),
            Convert.FMT_MAINFRAME, 0, (AbsSSLogger) null).setFileStructure(Constants.IO_FIXED_LENGTH)).asLayoutDetail();
    AbstractLineReader ret = LineIOProvider.getInstance().getLineReader(copyBook, (LineProvider) null);
    ret.open(fileStream, copyBook);
    return ret;
}
I am stuck on the numeric conversion; I found out the numeric fields are coming in as packed decimal.
I have no knowledge of COBOL or mainframes. I referred to a few sites and learned how to convert from EBCDIC to a readable format. Please help!

The problem is that the getFullLine() method does not do any field translation; you need to access the individual fields. You can use line.getFieldIterator(0) to get a field iterator for the line.
Also, unless you are using an ancient version of JRecord, you are better off using the JRecordInterface1 class.
Something like the following should work:
StringBuffer finalBuffer = new StringBuffer();
try {
    ICobolIOBuilder iob = JRecordInterface1.COBOL
            .newIOBuilder(copybookName)
            .setFont("cp037")
            .setFileOrganization(Constants.IO_FIXED_LENGTH);

    AbstractLineReader reader = iob.newReader(dataFile);
    AbstractLine line;
    while ((line = reader.read()) != null) {
        String sep = "";
        for (AbstractFieldValue fv : line.getFieldIterator(0)) {
            finalBuffer.append(sep).append(fv);
            sep = "\t";
        }
        finalBuffer.append("\n");
    }
    reader.close();
} catch (Exception e) {
    // whatever ....
}
Other points
With an MQ data source you do not need to create line readers; you can create lines directly from a byte array:
ICobolIOBuilder iob = JRecordInterface1.COBOL
        .newIOBuilder(copybookName)
        .setFont("cp037")
        .setFileOrganization(Constants.IO_FIXED_LENGTH);

AbstractLine line = iob.newLine(byteArrayFromMq);
for (AbstractFieldValue fv : line.getFieldIterator(0)) {
    // whatever
}
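To get the numeric (packed decimal) fields as numbers rather than text, you can also pull individual field values by name. A minimal sketch, assuming a hypothetical COMP-3 field named REFUND-AMT (substitute the real name from your copybook) and a JRecord version whose field values expose isNumeric()/asBigDecimal():
AbstractLine line = iob.newLine(byteArrayFromMq);
// "REFUND-AMT" is a hypothetical field name - use the packed-decimal field from your copybook
AbstractFieldValue amt = line.getFieldValue("REFUND-AMT");
if (amt.isNumeric()) {
    // the packed-decimal bytes are decoded into a proper number here
    java.math.BigDecimal value = amt.asBigDecimal();
    // store "value" in the DB instead of appending the raw line
}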

Related

FSDataOutputStream.writeUTF() adds extra characters at the start of the data on hdfs. How to avoid this extra data?

What I am trying to do is convert a sequence file on HDFS, which contains XML data, into .xml files on HDFS.
I searched on Google and found the code below. I made modifications according to my needs, and the following is the code:
public class SeqFileWriterCls {
    public static void main(String args[]) throws Exception {
        System.out.println("Reading Sequence File");
        Path path = new Path("seq_file_path/seq_file.seq");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = null;
        SequenceFile.Reader reader = null;
        FSDataOutputStream fwriter = null;
        OutputStream fowriter = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            //writer = new SequenceFile.Writer(fs, conf, out_path, Text.class, Text.class);
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                // I am just editing the path in such a way that the key will be my filename and the data in it will be the value
                Path out_path = new Path("" + key);
                String string_path = out_path.toString();
                String clear_path = string_path.substring(string_path.lastIndexOf("/") + 1);
                Path finalout_path = new Path("path" + clear_path);
                System.out.println("the final path is " + finalout_path);
                fwriter = fs.create(finalout_path);
                fwriter.writeUTF(value.toString());
                fwriter.close();
                FSDataInputStream in = fs.open(finalout_path);
                String s = in.readUTF();
                System.out.println("file has: -" + s);
                //fowriter = fs.create(finalout_path);
                //fowriter.write(value.toString());
                System.out.println(key + " <===> :" + value.toString());
                System.exit(0);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeStream(reader);
            fs.close();
        }
    }
}
I am using "FSDataOutputStream" to write the data to HDFS and the method is used is "writeUTF" The issue is that when i write to the hdfs file some additional characters are getting in the starting of data. But when i print the data i couldnt see the extra characters.
i tried using writeChars() but even taht wont work.
is there any way to avoid this?? or is there any other way to write the data to HDFS???
please help...
The JavaDoc of the writeUTF(String str) method says the following:
Writes a string to the underlying output stream using modified UTF-8 encoding in a machine-independent manner.
First, two bytes are written to the output stream as if by the writeShort method giving the number of bytes to follow. This value is the number of bytes actually written out, not the length of the string. Following the length, each character of the string is output, in sequence, using the modified UTF-8 encoding for the character. (...)
Both the writeBytes(String str) and writeChars(String str) methods should work fine.
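If you want plain UTF-8 output with no prefix at all, here is a minimal sketch of writing the raw bytes instead (replacing the fwriter.writeUTF(...) call above; assumes Java 7+ for java.nio.charset.StandardCharsets):
FSDataOutputStream out = fs.create(finalout_path);
try {
    // write the raw UTF-8 bytes; nothing is prepended, unlike writeUTF()
    out.write(value.toString().getBytes(StandardCharsets.UTF_8));
} finally {
    out.close();
}
Note that readUTF() on the reading side expects the length prefix, so read the file back as plain text (for example with a BufferedReader) rather than with readUTF().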

Memory issues when running Spark job on relatively large input

I am running a Spark cluster with 50 machines. Each machine is a VM with 8 cores and 50 GB of memory (41 GB seems to be available to Spark).
I am running over several input folders; I estimate the size of the input to be ~250 GB gz-compressed.
Although the number and configuration of the machines seems sufficient to me, the job fails after about 40 minutes of running, and I can see the following errors in the logs:
2558733 [Result resolver thread-2] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 345.0 in stage 1.0 (TID 345, hadoop-w-3.c.taboola-qa-01.internal): java.lang.OutOfMemoryError: Java heap space
java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
java.lang.StringCoding.decode(StringCoding.java:193)
java.lang.String.<init>(String.java:416)
java.lang.String.<init>(String.java:481)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:699)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:660)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
and also:
2653545 [Result resolver thread-2] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 122.1 in stage 1.0 (TID 392, hadoop-w-22.c.taboola-qa-01.internal): java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
java.lang.StringCoding.decode(StringCoding.java:193)
java.lang.String.<init>(String.java:416)
java.lang.String.<init>(String.java:481)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:699)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:660)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
How do I go about debugging such an issue?
EDIT: I found the root cause of the problem. It is this piece of code:
private static final int MAX_FILE_SIZE = 40194304;
....
....
JavaPairRDD<String, List<String>> typedData = filePaths.mapPartitionsToPair(
        new PairFlatMapFunction<Iterator<String>, String, List<String>>() {
    @Override
    public Iterable<Tuple2<String, List<String>>> call(Iterator<String> filesIterator) throws Exception {
        List<Tuple2<String, List<String>>> res = new ArrayList<>();
        String fileType = null;
        List<String> linesList = null;
        if (filesIterator != null) {
            while (filesIterator.hasNext()) {
                try {
                    Path file = new Path(filesIterator.next());
                    // filter non-trc files
                    if (!file.getName().startsWith("1")) {
                        continue;
                    }
                    fileType = getType(file.getName());
                    Configuration conf = new Configuration();
                    CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(conf);
                    CompressionCodec codec = compressionCodecs.getCodec(file);
                    FileSystem fs = file.getFileSystem(conf);
                    ContentSummary contentSummary = fs.getContentSummary(file);
                    long fileSize = contentSummary.getLength();
                    InputStream in = fs.open(file);
                    if (codec != null) {
                        in = codec.createInputStream(in);
                    } else {
                        throw new IOException();
                    }
                    byte[] buffer = new byte[MAX_FILE_SIZE];
                    BufferedInputStream bis = new BufferedInputStream(in, BUFFER_SIZE);
                    int count = 0;
                    int bytesRead = 0;
                    try {
                        while ((bytesRead = bis.read(buffer, count, BUFFER_SIZE)) != -1) {
                            count += bytesRead;
                        }
                    } catch (Exception e) {
                        log.error("Error reading file: " + file.getName()
                                + ", trying to read " + BUFFER_SIZE + " bytes at offset: " + count);
                        throw e;
                    }
                    Iterable<String> lines = Splitter.on("\n").split(new String(buffer, "UTF-8").trim());
                    linesList = Lists.newArrayList(lines);
                    // get rid of first line in file
                    Iterator<String> it = linesList.iterator();
                    if (it.hasNext()) {
                        it.next();
                        it.remove();
                    }
                    //res.add(new Tuple2<>(fileType, linesList));
                } finally {
                    res.add(new Tuple2<>(fileType, linesList));
                }
            }
        }
        return res;
    }
});
In particular, it allocates a 40 MB buffer for each file in order to read the file's content through a BufferedInputStream. This exhausts the heap at some point (the stack traces above show java.lang.OutOfMemoryError: Java heap space).
The thing is:
- If I read line by line (which does not require a buffer), the read will be very inefficient.
- If I allocate one buffer and reuse it for each file read, is that possible in terms of parallelism? Or will it get overwritten by several threads?
Any suggestions are welcome...
EDIT 2: I fixed the first memory issue by moving the byte-array allocation outside the iterator, so it gets reused for all partition elements. But there is still the new String(buffer, "UTF-8").trim(), which gets created for the split; that object is also created every time. I could use a StringBuffer/StringBuilder, but then how would I set the charset encoding without a String object?
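For the charset question: the standard java.nio API can decode bytes with an explicit charset without constructing a String first. A minimal sketch, assuming the buffer and count variables from the snippet above (uses java.nio.ByteBuffer, CharBuffer and StandardCharsets):
// decode only the bytes actually read, straight into a CharBuffer - no intermediate String
CharBuffer chars = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(buffer, 0, count));
// chars can then be scanned for '\n' to split into lines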
Eventually I changed the code as follows:
// Transform list of files to list of all files' content in lines grouped by type
JavaPairRDD<String, List<String>> typedData = filePaths.mapToPair(new PairFunction<String, String, List<String>>() {
    @Override
    public Tuple2<String, List<String>> call(String filePath) throws Exception {
        Tuple2<String, List<String>> tuple = null;
        try {
            String fileType = null;
            List<String> linesList = new ArrayList<String>();
            Configuration conf = new Configuration();
            CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(conf);
            Path path = new Path(filePath);
            fileType = getType(path.getName());
            tuple = new Tuple2<String, List<String>>(fileType, linesList);
            // filter non-trc files
            if (!path.getName().startsWith("1")) {
                return tuple;
            }
            CompressionCodec codec = compressionCodecs.getCodec(path);
            FileSystem fs = path.getFileSystem(conf);
            InputStream in = fs.open(path);
            if (codec != null) {
                in = codec.createInputStream(in);
            } else {
                throw new IOException();
            }
            BufferedReader r = new BufferedReader(new InputStreamReader(in, "UTF-8"), BUFFER_SIZE);
            // Get rid of the first line in the file
            r.readLine();
            // Read all lines
            String line;
            while ((line = r.readLine()) != null) {
                linesList.add(line);
            }
        } catch (IOException e) { // Filtering of files whose reading went wrong
            log.error("Reading of the file " + filePath + " went wrong: " + e.getMessage());
        } finally {
            return tuple;
        }
    }
});
So now I do not use a 40 MB buffer but rather build the line list dynamically using an ArrayList. This solved my current memory issue, but now I am getting other strange errors that fail the job. I will report those in a different question...

Windows 8.1 store xaml save InkManager in a string

I'm trying to save what I have drawn with the pencil as a string. I do this with the SaveAsync() method to put it in an IOutputStream, then convert this IOutputStream to a Stream using the AsStreamForWrite() method. From this point things should go fine; however, I get a lot of problems after this part. If I use, for example, this code block:
using (var stream = new MemoryStream())
{
    byte[] buffer = new byte[2048]; // read in chunks of 2KB
    int bytesRead = (int)size;
    while (bytesRead < 0)
    {
        stream.Write(buffer, 0, bytesRead);
    }
    byte[] result = stream.ToArray();
    // TODO: do something with the result
}
I get this exception:
"Offset and length were out of bounds for the array or count is greater than the number of elements from index to the end of the source collection."
Or, if I try to convert the stream into an image using InMemoryRandomAccessStream like this:
InMemoryRandomAccessStream ras = new InMemoryRandomAccessStream();
await s.CopyToAsync(ras.AsStreamForWrite());
my InMemoryRandomAccessStream variable is always zero in size.
I also tried
StreamReader.ReadToEnd();
but it returns an empty string.
I found the answer here:
http://social.msdn.microsoft.com/Forums/windowsapps/en-US/2359f360-832e-4ce5-8315-7f351f2edf6e/stream-inkmanager-strokes-to-string
private async void ReadInk(string base64)
{
    if (!string.IsNullOrEmpty(base64))
    {
        var bytes = Convert.FromBase64String(base64);
        using (var inMemoryRAS = new InMemoryRandomAccessStream())
        {
            await inMemoryRAS.WriteAsync(bytes.AsBuffer());
            await inMemoryRAS.FlushAsync();
            inMemoryRAS.Seek(0);
            await m_InkManager.LoadAsync(inMemoryRAS);
            if (m_InkManager.GetStrokes().Count > 0)
            {
                // You would do whatever you want with the strokes
                // RenderStrokes();
            }
        }
    }
}

ResultSet.getBlob() Exception

The Code:
ResultSet rs = null;
try {
    conn = getConnection();
    stmt = conn.prepareStatement(sql);
    rs = stmt.executeQuery();
    while (rs.next()) {
        Blob blob = rs.getBlob("text");
        byte[] blobbytes = blob.getBytes(1, (int) blob.length());
        String text = new String(blobbytes);
The result:
java.sql.SQLException: Invalid column type: getBLOB not implemented for class oracle.jdbc.driver.T4CClobAccessor
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:111)
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:145)
at oracle.jdbc.driver.Accessor.unimpl(Accessor.java:357)
at oracle.jdbc.driver.Accessor.getBLOB(Accessor.java:1299)
at oracle.jdbc.driver.OracleResultSetImpl.getBLOB(OracleResultSetImpl.java:1280)
at oracle.jdbc.driver.OracleResultSetImpl.getBlob(OracleResultSetImpl.java:1466)
at oracle.jdbc.driver.OracleResultSet.getBlob(OracleResultSet.java:1978)
I have class12_10g.zip in my classpath. I've googled and found essentially only one site on this particular problem, and it wasn't helpful at all.
Does anyone have any ideas on this?
A little background:
We were converting one of our databases from MySQL to Oracle. In the MySQL DB, one of the fields is a longtext, which is treated as a BLOB in the code. The SQL Developer workbench by default converts longtext to CLOB (makes sense to me), but the code was expecting a Blob. I guess the error message wasn't that clear: oracle.jdbc.driver.T4CClobAccessor (though it does mention Clob).
When I tried the following:
rs = stmt.executeQuery();
while (rs.next()) {
    byte[] blobbytes = rs.getBytes("text");
    String text = new String(blobbytes);
}
it threw an unsupported exception. All I had to do in the first place was compare the types in the newly created Oracle DB with what the code was expecting (unfortunately, I just assumed they would match).
Sorry guys! Not that I've put much thought into it, but now I have to figure out why the original developers used BLOB types for longtext.
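Since the converted column is actually a CLOB, here is a minimal sketch of reading it as character data with the standard JDBC API (no Oracle-specific classes), assuming the same "text" column as above:
rs = stmt.executeQuery();
while (rs.next()) {
    // the column is a CLOB after the MySQL-to-Oracle conversion, so read it as character data
    Clob clob = rs.getClob("text");
    String text = clob.getSubString(1, (int) clob.length());
}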
Not sure about making the Blob object work -- I typically skip the Blob step:
rs = stmt.executeQuery();
while (rs.next()) {
    byte[] blobbytes = rs.getBytes("text");
    String text = new String(blobbytes);
}
Try using the latest version of the drivers (10.2.0.4). Also try the drivers for JDK 1.4/1.5, since classes12 is for JDK 1.2/1.3.
Try...
PreparedStatement stmt = connection.prepareStatement(query);
ResultSet rs = stmt.executeQuery();
rs.next();
InputStream is = rs.getBlob(columnIndex).getBinaryStream();
...instead?
I have a utility method in a DAO superclass of all my DAOs:
protected byte[] readBlob(oracle.sql.BLOB blob) throws SQLException {
    if (blob != null) {
        byte[] buffer = new byte[(int) blob.length()];
        int bufsz = blob.getBufferSize();
        InputStream is = blob.getBinaryStream();
        int len = -1, off = 0;
        try {
            while ((len = is.read(buffer, off, bufsz)) != -1) {
                off += len;
            }
        } catch (IOException ioe) {
            logger.debug("IOException when reading blob", ioe);
        }
        return buffer;
    } else {
        return null;
    }
}

// to get the oracle BLOB object from the result set:
oracle.sql.BLOB blob = (oracle.sql.BLOB) ((OracleResultSet) rs).getBlob("blobd");
Someone will now say "why didn't you just do XYZ", but there was some issue at the time that made the above more reliable.
When JDBC returns a ResultSet from an Oracle database, it always returns an OracleResultSet. If you type it as a ResultSet, Java upcasts it to the standard SQL ResultSet.
OracleResultSet overrides most of the data-type methods, because Oracle datatypes are not standard SQL types.
In other words, that worked because you cast the rs as an OracleResultSet and used its getBlob method.

Downloading files from remote server

I am using C# in a console app, and I am using this script to download files from a remote server. There are a couple of things I want to add. First, when it writes to a file, it doesn't take newlines into consideration; it seems to write a certain number of bytes and then start a new line. I would like it to keep the same format as the file it is reading from. Second, there are multiple .jpg files on the server that I need to download. How can I use this script to download multiple .jpg files?
public static int DownLoadFiles(String remoteUrl, String localFile)
{
    int bytesProcessed = 0;
    // Assign values to these objects here so that they can
    // be referenced in the finally block
    StreamReader remoteStream = null;
    StreamWriter localStream = null;
    WebResponse response = null;
    // Use a try/catch/finally block as both the WebRequest and Stream
    // classes throw exceptions upon error
    try
    {
        // Create a request for the specified remote file name
        WebRequest request = WebRequest.Create(remoteUrl);
        request.PreAuthenticate = true;
        NetworkCredential credentials = new NetworkCredential("id", "pass");
        request.Credentials = credentials;
        if (request != null)
        {
            // Send the request to the server and retrieve the
            // WebResponse object
            response = request.GetResponse();
            if (response != null)
            {
                // Once the WebResponse object has been retrieved,
                // get the stream object associated with the response's data
                remoteStream = new StreamReader(response.GetResponseStream());
                // Create the local file
                localStream = new StreamWriter(File.Create(localFile));
                // Allocate a 1k buffer
                char[] buffer = new char[1024];
                int bytesRead;
                // Simple do/while loop to read from stream until
                // no bytes are returned
                do
                {
                    // Read data (up to 1k) from the stream
                    bytesRead = remoteStream.Read(buffer, 0, buffer.Length);
                    // Write the data to the local file
                    localStream.WriteLine(buffer, 0, bytesRead);
                    // Increment total bytes processed
                    bytesProcessed += bytesRead;
                } while (bytesRead > 0);
            }
        }
    }
    catch (Exception e)
    {
        Console.WriteLine(e.Message);
    }
    finally
    {
        // Close the response and streams objects here
        // to make sure they're closed even if an exception
        // is thrown at some point
        if (response != null) response.Close();
        if (remoteStream != null) remoteStream.Close();
        if (localStream != null) localStream.Close();
    }
    // Return total bytes processed to caller.
    return bytesProcessed;
}
Why don't you use WebClient.DownloadData or WebClient.DownloadFile instead?
WebClient client = new WebClient();
client.Credentials = new NetworkCredential("id", "pass");
client.DownloadFile(remoteUrl, localFile);
By the way, the correct way to copy one stream to another is not what you did. You shouldn't read into a char[] at all, as you might run into encoding and end-of-line issues since you are downloading a binary file. The WriteLine method call is problematic too. The right way to copy the contents of one stream to another is:
void CopyStream(Stream destination, Stream source) {
    int count;
    byte[] buffer = new byte[BUFFER_SIZE];
    while ((count = source.Read(buffer, 0, buffer.Length)) > 0)
        destination.Write(buffer, 0, count);
}
The WebClient class is much easier to use and I suggest using that instead.
The reason you're getting spurious newlines in the result file is that StreamWriter.WriteLine() puts them there. Try using StreamWriter.Write() instead.
Regarding downloading multiple files, can't you just run the function several times, passing it the URLs of the different files you need?
