I am trying to open and parse an x937 file - which I BELIEVE is usually encoded in EBCDIC 0037.
I am using the following library to decode the main bytes of the file :
"github.com/gdumoulin/goebcdic"
and the code I am using is as follows, for now.
// Bytes in file.
b, _ := ioutil.ReadFile("testingFile.x937")
fmt.Println(string(goebcdic.ASCIItoEBCDICofBytes(b)))
But when I dump the output, I still don't get anything that looks like the readable records I expected.
Any ideas on how I can work with this?
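Two things may be worth checking. First, the function name ASCIItoEBCDICofBytes suggests it converts from ASCII to EBCDIC, which is the opposite direction from what decoding the file needs. Second, an x937 file is usually not one continuous run of EBCDIC text. The sketch below assumes (this is an assumption, not something stated in the question) that each record is preceded by a 4-byte big-endian length and starts with a two-digit record type encoded as EBCDIC digits (0xF0 to 0xF9). It uses only the standard library, and recordTypes and ebcdicDigit are hypothetical helper names. Some records can carry binary image data, so converting the whole file in one pass is likely to produce garbage.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// ebcdicDigit maps an EBCDIC digit byte (0xF0..0xF9) to its ASCII digit.
func ebcdicDigit(b byte) (byte, bool) {
	if b >= 0xF0 && b <= 0xF9 {
		return '0' + (b - 0xF0), true
	}
	return 0, false
}

// recordTypes walks length-prefixed records and returns the two-digit
// record type of each one. Assumption: every record is preceded by a
// 4-byte big-endian length and begins with two EBCDIC digits.
func recordTypes(data []byte) ([]string, error) {
	var types []string
	for len(data) > 0 {
		if len(data) < 4 {
			return types, errors.New("truncated length prefix")
		}
		n := int(binary.BigEndian.Uint32(data[:4]))
		data = data[4:]
		if n < 2 || n > len(data) {
			return types, fmt.Errorf("implausible record length %d", n)
		}
		rec := data[:n]
		data = data[n:]
		d0, ok0 := ebcdicDigit(rec[0])
		d1, ok1 := ebcdicDigit(rec[1])
		if !ok0 || !ok1 {
			return types, fmt.Errorf("record does not start with EBCDIC digits: % x", rec[:2])
		}
		types = append(types, string([]byte{d0, d1}))
	}
	return types, nil
}

func main() {
	// Two made-up records: type "01" (4 bytes) and type "99" (3 bytes).
	sample := []byte{
		0, 0, 0, 4, 0xF0, 0xF1, 0x40, 0x40,
		0, 0, 0, 3, 0xF9, 0xF9, 0x40,
	}
	types, err := recordTypes(sample)
	fmt.Println(types, err)
}
```

If the record types come out as plausible two-digit codes, the remaining text fields of each record can then be decoded individually with an EBCDIC 037 table, leaving binary fields untouched.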
I have a task to parse both eml and msg formatted email files using Go. There's a wonderful package for parsing EML files, however, with MSG, no matter what package I research and attempt to implement, I encounter the same error every single time.
malformed MIME header: missing colon:
It isn't the msg file itself. I have the same service in .NET which reads the msg file perfectly (MsgReader library).
Could someone suggest a package I could use in Go to read msg files? I wonder if it's an encoding issue (this wasn't a problem with eml files).
I've tried using these packages:
github.com/veqryn/go-email
net/mail
https://github.com/go-gomail/gomail
github.com/jordan-wright/email
github.com/emersion/go-message
github.com/jpoehls/gophermail
As an example, here is one function I've tried to read an msg file.
func parse_msg_file() {
	var filePath string = "c://messages//kraken.msg"
	var reader io.Reader
	f, err := os.Open(filePath)
	checkerr(err, "file "+filePath+" not found or can not be readed")
	defer f.Close()
	reader = bufio.NewReader(f)
	msg, err := email.ParseMessage(reader)
	checkerr(err, "failed to parse raw msg file")
	if msg == nil {
		checkerr(err, "failed to parse raw msg file")
	}
}
and the output when I call the function is:
malformed MIME header: missing colon: "\xd0\xcf\x11\u0871\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00>\x00\x03\x00\xfe\xff\t\x00\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00\t\x00\x00\x00\x02\x00\x00\x00\xfe\xff\xff\xff\x00\x00\x00\x00\x03\x00\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x
ff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xffR\x00o\x00o\x00t\x00 \x00E\x00n\x00t\x00r\x00y\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x16\x00\x05\x00\xff\xff\xff\xff\xff\xff\xff\xff\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0\t-0r$\xd9\x01"
exit status 255
Just to add to my comment: searching Google for "msg parsers in go" brought up this repository: https://github.com/oucema001/OutlookMessageParser-Go . I don't know whether it actually works. It's fairly old and has no documentation, so it is unlikely to be easy to use, but you can start from there.
Here's the specification for Microsoft's Outlook Item File Format (*.msg).
And here's the specification for Microsoft's Compound File Binary File Format, the basis for the Outlook Item File Format (*.msg).
The Compound File Binary File Format is
a general-purpose file format that provides a file-system-like structure within a file for the storage of arbitrary, application-specific streams of data.
I believe this all came from Microsoft's old OLE/COM technology (Object Linking and Embedding/Component Object Model).
FWIW, here's a parser for the Compound File Binary File Format. No idea if it works, or anything else about it, but it might be, at least, a jumping-off point for you.
https://github.com/richardlehane/mscfb
[Edited to note]
It seems that the above package is a dependency of https://github.com/oucema001/OutlookMessageParser-Go, referenced in this answer by @astax.
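The error output in the question is consistent with this: it starts with \xd0\xcf\x11 followed by \u0871 (which is the bytes E0 A1 B1 read as UTF-8) and \x1a\xe1, and D0 CF 11 E0 A1 B1 1A E1 is the Compound File Binary header signature. A minimal sketch of checking for that signature before handing a file to a MIME parser could look like this (isCompoundFile is a hypothetical helper name):

```go
package main

import (
	"bytes"
	"fmt"
)

// cfbMagic is the 8-byte signature at the start of a Compound File
// Binary file, the container format used by Outlook .msg files.
var cfbMagic = []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}

// isCompoundFile reports whether head (the first bytes of a file)
// starts with the Compound File Binary signature.
func isCompoundFile(head []byte) bool {
	return bytes.HasPrefix(head, cfbMagic)
}

func main() {
	msgHead := []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1, 0x00, 0x00}
	emlHead := []byte("Received: from mail.example.com\r\n")

	fmt.Println(isCompoundFile(msgHead)) // binary container, not MIME text
	fmt.Println(isCompoundFile(emlHead)) // plain RFC 5322 style headers
}
```

A file that matches needs a compound-file reader; MIME parsers such as net/mail will always fail on it with the "missing colon" error, regardless of encoding.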
Golang Code Here: (include 4 files here)
https://gist.github.com/kmahyyg/02a2da2970001de455f847f4e7525aff
With the definitions above, compress a big file (512 MB here, a binary created by dd from /dev/urandom).
If you call SetWriter(out) and pass out as a bufio.Writer, while keeping the struct field declared as io.Writer (and likewise for the Reader part),
then decompressing the result gives an unexpected EOF error.
But if you pass out as a plain io.Writer, everything is fine.
The compress function reports no errors.
Why does using bufio.Writer cause an unexpected EOF?
Note:
After some observation, it seems that files smaller than a specific size (337 MB on my machine) do not trigger the unexpected EOF.
Running the official gunzip on the same gzip file that caused the unexpected EOF extracts only about the first 337 MB of data and then reports a "corrupted file" message.
Edit: 1. Full code attached.
2. The same result occurs with zstd as with gzip.
@leafbebop has the correct answer. A bufio.Writer is not flushed automatically when the underlying writer is closed. So when you pass a bufio.Writer as the io.Writer, you must call Flush on it yourself before closing the file.
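A minimal sketch of the effect, using only the standard library and an in-memory buffer in place of the output file (compress, decompress, and sampleData are hypothetical helpers, not the code from the gist):

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// sampleData returns n pseudo-random (poorly compressible) bytes.
func sampleData(n int) []byte {
	out := make([]byte, n)
	x := uint32(1)
	for i := range out {
		x = x*1664525 + 1013904223
		out[i] = byte(x >> 24)
	}
	return out
}

// compress gzips data into memory through a bufio.Writer.
// With flush == false the bufio.Writer is dropped without Flush,
// so whatever is still in its buffer never reaches the sink.
func compress(data []byte, flush bool) ([]byte, error) {
	var sink bytes.Buffer
	bw := bufio.NewWriter(&sink)
	zw := gzip.NewWriter(bw)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	// Close writes the final gzip block and trailer into bw.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	if flush {
		if err := bw.Flush(); err != nil {
			return nil, err
		}
	}
	return sink.Bytes(), nil
}

// decompress reads a complete gzip stream back into memory.
func decompress(gz []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(gz))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	data := sampleData(1 << 20)

	truncated, _ := compress(data, false)
	_, err := decompress(truncated)
	fmt.Println("without Flush:", len(truncated), "bytes, error:", err)

	complete, _ := compress(data, true)
	out, err := decompress(complete)
	fmt.Println("with Flush:", len(complete), "bytes, round trip ok:", err == nil && bytes.Equal(out, data))
}
```

Note the order: the gzip writer is closed first so its trailer lands in the bufio.Writer, then the bufio.Writer is flushed, and only then should the underlying file be closed.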
I am trying to convert EBCDIC file to ASCII using CobolIoProvider class from JRecord in Apache Beam.
Code that I'm using:
CobolIoProvider ioProvider = CobolIoProvider.getInstance();
AbstractLineReader reader = ioProvider.getLineReader(Constants.IO_FIXED_LENGTH, Convert.FMT_MAINFRAME,CopybookLoader.SPLIT_NONE, copybookname, cobolfilename);
The code reads and converts the file as required. I am able to read the cobolfilename and copybookname only from the local system which are basically paths of the EBCDIC file and the copybook respectively. However, when I try to read the files from GCS, it fails with FileNotFoundException – “The filename, directory name, or volume label syntax is incorrect” .
Is there a way to read Cobol file(EBCDIC) from GCS using CobolIoProvider class ?
If not, is there any other class available to convert Cobol file(EBCDIC) to ASCII and allowing the files to be read from GCS.
Using ICobolIOBuilder:-
Code that I’m using:
ICobolIOBuilder iob = JRecordInterface1.COBOL.newIOBuilder("copybook.cbl")
.setFileOrganization(Constants.IO_FIXED_LENGTH)
.setSplitCopybook(CopybookLoader.SPLIT_NONE);
AbstractLineReader reader = iob.newReader(bs); //bs is an InputStream object of my Cobol file
However, here are a few concerns:-
1) I have to keep my copybook.cbl locally. Is there any way to read copybook file from GCS. I tried the below code, trying to read my copybook from GCS to Stream and pass the stream to LoadCopyBook(). But the code didn’t work.
Sample code below:
InputStream bs2 = new ByteArrayInputStream(copybookfile.toString().getBytes());
LayoutDetail schema = new CobolCopybookLoader()
.loadCopyBook( bs, " copybook.cbl",
CopybookLoader.SPLIT_NONE, 0, "",
Constants.USE_STANDARD_COLUMNS,
Convert.FMT_INTEL, 0, new TextLog())
.asLayoutDetail();
AbstractLineReader reader = LineIOProvider.getInstance().getLineReader(schema);
reader.open(inputStream, schema);
2) Reading the EBCDIC file from stream using newReader didn’t convert my file to ascii.
Thanks.
I do not have a full answer. If you are using a recent version of JRecord, I suggest changing the code to use JRecordInterface1. The IO-Builder is a lot more flexible than the older CobolIoProvider interface.
String encoding = "cp037"; // cp037/IBM037 US ebcdic; cp273 - German ebcdic
ICobolIOBuilder iob = JRecordInterface1.COBOL
.newIOBuilder("CopybookFile.cbl")
.setFileOrganization(Constants.IO_FIXED_LENGTH)
.setFont(encoding); // should set encoding if you can
AbstractLineReader reader = iob.newReader(datastream);
With the IO-Builder interface you can use streams. The question "Stream file from Google Cloud Storage" is about creating a stream from GCS and may be useful. Hopefully someone with more knowledge of GCS can help.
Alternatively you could read from GCS directly and create data-lines(data-records) using the newLine method of a JRecord-IO-Builder:
AbstractLine l = iob.newLine(byteArray);
I will look at creating a basic Read/Write interface to JRecord so that JRecord users can write their own interface to GCS, IBM's Mainframe Access (ZFile), etc. But this will take time.
The easiest way to use Beam/Dataflow with new kinds of file-based sources is to first use FileIO to get a PCollection<ReadableFile> and then use a DoFn to read that file. This will require implementing the code to read from a given channel. Something like the following:
Pipeline p = ...
p.apply(FileIO.match().filepattern("..."))
 .apply(FileIO.readMatches(...))
 .apply(ParDo.of(new DoFn<ReadableFile, String>() {
   @ProcessElement
   public void processElement(ProcessContext c) throws IOException {
     try (ReadableByteChannel channel = c.element().open()) {
       // Use CobolIO to read from the byte channel
     }
   }
 }));
I'm trying to save and load the states of Matrices (using Matrix) during the execution of my program with the functions dump and load from Marshal. I can serialize the matrix and get a ~275 KB file, but when I try to load it back as a string to deserialize it into an object, Ruby gives me only the beginning of it.
# when I want to save
mat_dump = Marshal.dump(@mat) # serialize object - OK
File.open('mat_save', 'w') {|f| f.write(mat_dump)} # write String to file - OK
# somewhere else in the code
mat_dump = File.read('mat_save') # read String from file - only reads like 5%
@mat = Marshal.load(mat_dump) # deserialize object - "ArgumentError: marshal data too short"
I tried to change the arguments for load but didn't find anything yet that doesn't cause an error.
How can I load the entire file into memory? If I could read the file chunk by chunk, then loop to store it in the String and then deserialize, it would work too. The file has basically one big line so I can't even say I'll read it line by line, the problem stays the same.
I saw some questions about the topic:
"Ruby serialize array and deserialize back"
"What's a reasonable way to read an entire text file as a single string?"
"How to read whole file in Ruby?"
but none of them seem to have the answers I'm looking for.
Marshal is a binary format, so you need to read and write the file in binary mode. The easiest way is to use IO.binwrite and IO.binread.
...
IO.binwrite('mat_save', mat_dump)
...
mat_dump = IO.binread('mat_save')
@mat = Marshal.load(mat_dump)
Remember that the Marshal format is Ruby-version dependent and is compatible with other Ruby versions only under specific circumstances. From the documentation:
In normal use, marshaling can only load data written with the same major version number and an equal or lower minor version number.
Is it possible to decompile a string containing Protocol Buffers descriptor back to .proto file?
Say I have a long string like
\n\file.proto\u001a\u000ccommon.proto\"\u00a3\u0001\n\nMsg1Request\u0012\u0017\n\u0006common\u0018\u0001 ... etc.
I need to restore .proto, not necessary exactly as it was but compilable.
In C++, the FileDescriptor interface has a method DebugString() which formats the descriptor contents in .proto syntax -- i.e. exactly what you want. In order to use it, you first need to write code to convert the raw FileDescriptorProto to a FileDescriptor, using the DescriptorPool interface.
Something like this should do it:
#include <google/protobuf/descriptor.h>
#include <google/protobuf/descriptor.pb.h>
#include <iostream>
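int main() {
  // Read the serialized FileDescriptorProto from stdin (file descriptor 0).
  google::protobuf::FileDescriptorProto fileProto;
  if (!fileProto.ParseFromFileDescriptor(0)) {
    std::cerr << "failed to parse FileDescriptorProto" << std::endl;
    return 1;
  }
  google::protobuf::DescriptorPool pool;
  // BuildFile returns a null pointer if the file is invalid or an import is missing.
  const google::protobuf::FileDescriptor* desc = pool.BuildFile(fileProto);
  if (desc == nullptr) {
    std::cerr << "failed to build FileDescriptor" << std::endl;
    return 1;
  }
  std::cout << desc->DebugString() << std::endl;
  return 0;
}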
You need to feed this program the raw bytes of the FileDescriptorProto, which you can get by using Java to encode your string to bytes using the ISO-8859-1 charset.
Also note that the above doesn't work if the file imports any other files -- you would have to load those imports into the DescriptorPool first.
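The ISO-8859-1 step is language-independent: every character of the unescaped string has a code point between U+0000 and U+00FF, and each one becomes exactly one byte. A minimal sketch of that conversion, written in Go rather than Java purely for illustration (latin1Bytes is a hypothetical helper, and the sample string is made up, not the one from the question):

```go
package main

import "fmt"

// latin1Bytes encodes s as ISO-8859-1: each code point in the range
// U+0000..U+00FF becomes one byte. Anything outside that range cannot
// be part of a descriptor string produced this way, so it is an error.
func latin1Bytes(s string) ([]byte, error) {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			return nil, fmt.Errorf("code point %U is outside ISO-8859-1", r)
		}
		out = append(out, byte(r))
	}
	return out, nil
}

func main() {
	// Made-up fragment in the same style as a descriptor string.
	s := "\n\u0007a.proto\"\u00a3\u0001"
	b, err := latin1Bytes(s)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("% x\n", b)
}
```

The resulting bytes are what the C++ program above expects on its standard input.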
Yes, it should be possible to get something close to the original definition. I do not know of any existing code to do it (hopefully someone else will).
Have a look at how Protocol Buffers itself handles the string.
Basically:
1. Convert the string to bytes (using charset "ISO-8859-1" in Java). The result is a Protocol Buffers message in FileDescriptorProto format. The FileDescriptorProto class is built as part of the Protocol Buffers install.
2. Extract the data from that Protocol Buffers message.
Here is a File-Descriptor protocol displayed in the Protocol-Buffer editor