Convert EBCDIC to ASCII in Apache Beam

I am trying to convert an EBCDIC file to ASCII using the CobolIoProvider class from JRecord in Apache Beam.
Code that I'm using:
CobolIoProvider ioProvider = CobolIoProvider.getInstance();
AbstractLineReader reader = ioProvider.getLineReader(Constants.IO_FIXED_LENGTH, Convert.FMT_MAINFRAME,CopybookLoader.SPLIT_NONE, copybookname, cobolfilename);
The code reads and converts the file as required. However, I am able to read cobolfilename and copybookname (the paths to the EBCDIC file and the copybook, respectively) only from the local file system. When I try to read the files from GCS, it fails with a FileNotFoundException: "The filename, directory name, or volume label syntax is incorrect".
Is there a way to read a Cobol (EBCDIC) file from GCS using the CobolIoProvider class?
If not, is there any other class available that converts a Cobol (EBCDIC) file to ASCII and allows the files to be read from GCS?
Using ICobolIOBuilder:
Code that I’m using:
ICobolIOBuilder iob = JRecordInterface1.COBOL.newIOBuilder("copybook.cbl")
.setFileOrganization(Constants.IO_FIXED_LENGTH)
.setSplitCopybook(CopybookLoader.SPLIT_NONE);
AbstractLineReader reader = iob.newReader(bs); //bs is an InputStream object of my Cobol file
However, here are a few concerns:
1) I have to keep my copybook.cbl locally. Is there any way to read the copybook file from GCS? I tried the code below, reading my copybook from GCS into a stream and passing the stream to loadCopyBook(), but it didn't work.
Sample code below:
InputStream bs2 = new ByteArrayInputStream(copybookfile.toString().getBytes());
LayoutDetail schema = new CobolCopybookLoader()
.loadCopyBook(bs2, "copybook.cbl",
CopybookLoader.SPLIT_NONE, 0, "",
Constants.USE_STANDARD_COLUMNS,
Convert.FMT_INTEL, 0, new TextLog())
.asLayoutDetail();
AbstractLineReader reader = LineIOProvider.getInstance().getLineReader(schema);
reader.open(inputStream, schema);
2) Reading the EBCDIC file from the stream using newReader didn't convert my file to ASCII.
Thanks.

I do not have a full answer. If you are using a recent version of JRecord, I suggest changing the JRecord code to use JRecordInterface1. The IO-Builder is a lot more flexible than the older CobolIoProvider interface.
String encoding = "cp037"; // cp037/IBM037 US ebcdic; cp273 - German ebcdic
ICobolIOBuilder iob = JRecordInterface1.COBOL
.newIOBuilder("CopybookFile.cbl")
.setFileOrganization(Constants.IO_FIXED_LENGTH)
.setFont(encoding); // should set encoding if you can
AbstractLineReader reader = iob.newReader(datastream);
With the IO-Builder interface you can use streams. The question Stream file from Google Cloud Storage is about creating a stream from GCS and may be useful. Hopefully someone with more knowledge of GCS can help.
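I have not tried this against GCS myself, but a rough sketch using the google-cloud-storage client library (com.google.cloud.storage) should be close. The bucket and object names are placeholders, and it assumes your JRecord version has the InputStream-based newIOBuilder overload:
Storage storage = StorageOptions.getDefaultInstance().getService();
InputStream copybookStream = Channels.newInputStream(
        storage.reader(BlobId.of("my-bucket", "copybooks/CopybookFile.cbl")));
InputStream dataStream = Channels.newInputStream(
        storage.reader(BlobId.of("my-bucket", "data/ebcdic-file.dat")));
ICobolIOBuilder iob = JRecordInterface1.COBOL
        .newIOBuilder(copybookStream, "CopybookFile")
        .setFileOrganization(Constants.IO_FIXED_LENGTH)
        .setFont("cp037");                       // EBCDIC encoding, as above
AbstractLineReader reader = iob.newReader(dataStream);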
Alternatively, you could read from GCS directly and create data lines (data records) using the newLine method of a JRecord IO-Builder:
AbstractLine l = iob.newLine(byteArray);
I will look at creating a basic Read/Write interface to JRecord so JRecord users can write their own interface to GCS or IBM's Mainframe Access (ZFile) etc. But this will take time.
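As a rough illustration of the newLine approach (untested; the record length and field name below are placeholders), you could read fixed-length records from a GCS input stream yourself and build a Line per record:
int recordLength = 100;                                   // placeholder: the record length defined by your copybook
DataInputStream in = new DataInputStream(gcsInputStream); // gcsInputStream: a stream opened from GCS
while (true) {
    byte[] recordBytes = new byte[recordLength];
    try {
        in.readFully(recordBytes);                        // read one fixed-length EBCDIC record
    } catch (EOFException e) {
        break;                                            // no more records
    }
    AbstractLine l = iob.newLine(recordBytes);
    String value = l.getFieldValue("SOME-FIELD").asString(); // fields are decoded using the copybook's font
}
in.close();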

The easiest way to use Beam/Dataflow with new kinds of file-based sources is to first use FileIO to get a PCollection<ReadableFile> and then use a DoFn to read that file. This will require implementing the code to read from a given channel. Something like the following:
Pipeline p = ...;
p.apply(FileIO.match().filepattern("..."))
 .apply(FileIO.readMatches())
 .apply(ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
     @ProcessElement
     public void processElement(ProcessContext c) throws IOException {
         try (ReadableByteChannel channel = c.element().open()) {
             // Use the JRecord/Cobol IO classes to read from the byte channel
         }
     }
 }));
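To tie this back to the JRecord answer above (again just a sketch, untested), the body of the DoFn could wrap the channel in an InputStream and let an IO-Builder built as shown earlier (copybook, IO_FIXED_LENGTH, cp037 font) decode the EBCDIC records:
try (ReadableByteChannel channel = c.element().open()) {
    InputStream in = Channels.newInputStream(channel);
    AbstractLineReader reader = iob.newReader(in);
    AbstractLine line;
    while ((line = reader.read()) != null) {
        c.output(line.getFullLine());   // or emit individual fields via line.getFieldValue(...)
    }
    reader.close();
}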

Related

X937 file decoding in golang?

I am trying to open and parse an x937 file - which I BELIEVE is usually encoded in EBCDIC 0037.
I am using the following library to decode the main bytes of the file:
"github.com/gdumoulin/goebcdic"
and the code I am using is as follows, for now.
// Bytes in file.
b, _ := ioutil.ReadFile("testingFile.x937")
fmt.Println(string(goebcdic.ASCIItoEBCDICofBytes(b)))
But if I dump the output of my file, I still don't seem to get anything that matches what I would have thought I would be looking for.
Any ideas on how I can work with this?

Stanford CoreNLP: output in CONLL format from Java

I want to parse some German text with Stanford CoreNLP and obtain a CONLL output, so that I can pass the latter to CorZu for coreference resolution.
How can I do that programmatically?
Here is my code so far (which only outputs dependency trees):
Annotation germanAnnotation = new Annotation("Gestern habe ich eine blonde Frau getroffen");
Properties germanProperties = StringUtils.argsToProperties("-props", "StanfordCoreNLP-german.properties");
StanfordCoreNLP pipeline = new StanfordCoreNLP(germanProperties);
pipeline.annotate(germanAnnotation);
StringBuilder trees = new StringBuilder("");
for (CoreMap sentence : germanAnnotation.get(CoreAnnotations.SentencesAnnotation.class)) {
Tree sentenceTree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
trees.append(sentenceTree).append("\n");
}
With the following code I managed to save the parsing output in CONLL format.
OutputStream outputStream = new FileOutputStream(new File("./target/", OUTPUT_FILE_NAME));
CoNLLOutputter.conllPrint(germanAnnotation, outputStream, pipeline);
However, the HEAD field was 0 for all words. I am not sure whether the problem lies in the parsing or only in the CoNLLOutputter. Honestly, I was too annoyed by CoreNLP to investigate further.
I decided to use ParZu instead, and I suggest you do the same. ParZu and CorZu are made to work together seamlessly - and they do.
In my case, I had an already tokenized and POS tagged text. This makes things easier, since you will not need:
A POS-Tagger using the STTS tagset
A tool for morphological analysis
Once you have ParZu and CorZu installed, you will only need to run corzu.sh (included in the CorZu download folder). If your text is tokenized and POS tagged, you can edit the script accordingly:
parzu_cmd="/YourPath/ParZu/parzu -i tagged"
Last note: make sure to convert your tagged text to the following format, with empty lines signifying sentence boundaries: word [tab] tag [newline]
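For the sample sentence used above, the tagged input would look roughly like this (the STTS tags are shown for illustration):
Gestern	ADV
habe	VAFIN
ich	PPER
eine	ART
blonde	ADJA
Frau	NN
getroffen	VVPP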

Write data to sdcard zedboard

I want to write data to the zedboard's SD card. I am able to write data to DRAM. Now I want to read the DRAM's data and write it to the SD card. I have followed this (http://elm-chan.org/fsw/ff/00index_e.html) but it does not fulfill my requirement. I am not able to find any tutorial, example, etc. for this.
Please share any tutorial link or example. Thanks.
If you're using Vivado SDK, which I assume you are, it is really straightforward to use the SD Card.
To include the FAT file system, inside Xilinx SDK, open your Board Support Package (system.mss file) and select Modify this BSP's Settings. Under Overview, you can select xilffs.
Next, you must write the software to access the SD card. This library offers a wide variety of functions. You can look at here_1, here_2 or here_3; the second reference provides a wide variety of more complex functions.
Aside from this, in order to use the SD card, what you should basically do is outlined below. Note that this is just an overview, and you should refer to the references I gave you.
// Flush and disable the cache
Xil_DCacheFlush();
Xil_DCacheDisable();
// Initialize the SD card by registering a FATFS work area (old xilffs API; see the newer syntax below)
f_mount(0, &fatfs);
// Open a file using f_open
// Read and write using f_read or f_write
// Close your file with f_close
// Unmount the SD card with f_mount(0, NULL)
Note that experience teaches me that you need to write to the file in blocks that are multiples of the block size of the file system, which for FAT file systems is typically 512 bytes. Writing less than 512 bytes and then closing the file will leave it zero bytes in length.
In newer versions of the xilffs (FatFs) library the syntax has changed slightly.
The new syntax is:
static FATFS FS_instance; // File System instance
const char *path = "0:/"; // string pointer to the logical drive number
static FIL file1; // File instance
FRESULT result; // FRESULT variable
static char fileName[24] = "FIL"; // name of the log
result = f_mount(&FS_instance, path, 0); //f_mount
result = f_open(&file1, (char *)fileName, FA_OPEN_ALWAYS | FA_WRITE); //f_open
Maybe this can help you.

Silverlight: Encoding a webClient stream

I've been trying to get this to work, but I'm very frustrated at this point. I am a beginner in this field, so maybe I'm just making mistakes.
What I need to do is take in a website's .html and store it in a txt file. Now the problem is that this website is in Russian (encoding windows-1251) and Silverlight only supports 3 encodings. So in order to bypass that limitation, I got my hands on an encoding class that transfers the stream into a byte array and then tries to pull the correctly encoded string from the text. The problems with this are:
1) I try to ensure that the WebClient receives a Unicode-encoded stream, because the other encodings do not seem to produce a retrievable string, but it still doesn't seem to work.
WebClient wc = new WebClient();
wc.Encoding = System.Text.Encoding.Unicode;
wc.DownloadStringCompleted += new DownloadStringCompletedEventHandler(wc_LoadCompleted);
wc.DownloadStringAsync(new Uri(site));
2) I fear that when I store the html into a txt file using streamWriter, the encoding is, yet again, somehow screwed up.
3) The encoding class is not doing its job.
Encoding rus = Encoding.GetEncoding(1251);
Encoding eng = Encoding.Unicode;
byte[] bytes = rus.GetBytes(pageHtml); // pageHtml: the downloaded page text
textBlock1.Text = eng.GetString(bytes);
Can anyone offer any help on this matter? This is a huge detriment to my project. Thanks in advance.
Since you want to handle an encoding alien to Silverlight, you should start by downloading with OpenReadAsync and OpenReadCompleted.
You should then be able to take the Stream provided by the event args' Result property and supply it directly to the encoding component you have acquired to generate the correct string result.

SSIS - Flat file always ANSI never UTF-8 encoded

Have a pretty straightforward SSIS package:
OLE DB Source to get data via a view (all string columns in the db table are nvarchar or nchar).
Derived Column to format existing date and add it on to the dataset, (data type DT_WSTR).
Multicast task to split the dataset between:
OLE DB Command to update rows as "processed".
Flat file destination - the connection manager of which is set to Code Page 65001 UTF-8 and Unicode is unchecked. All string columns map to DT_WSTR.
Every time I run this package and open the flat file in Notepad++, it is ANSI, never UTF-8. If I check the Unicode option, the file is UCS-2 Little Endian.
Am I doing something wrong - how can I get the flat file to be UTF-8 encoded?
Thanks
In Source -> Advanced Editor -> Component Properties:
Set DefaultCodePage to 65001
Set AlwaysUseDefaultCodePage to True
Then in Source -> Advanced Editor -> Input and Output Properties:
Check each column in External Columns and Output Columns and set CodePage to 65001 wherever possible.
That's it.
By the way, Excel cannot define the data inside the file to be UTF-8; Excel is just a file handler. You can create CSV files using Notepad as well - as long as you fill the CSV file with UTF-8 content, you should be fine.
Adding explanation to the answers ...
Setting the CodePage to 65001 (but do NOT check the Unicode checkbox on the file source) should generate a UTF-8 file. (Yes, the data types internally should also be nvarchar, etc.)
But the file that SSIS produces does not have a BOM (Byte Order Mark) header, so some programs will assume it is still ASCII, not UTF-8. I've seen this confirmed by MS employees on MSDN, as well as by my own testing.
The file-append solution is a way around this: by creating a blank file WITH the proper BOM and then appending data from SSIS, the BOM header remains in place. If you tell SSIS to overwrite the file, it also loses the BOM.
Thanks for the hints here; they helped me figure out the above detail.
I recently worked on a problem where we came across the following situation:
You are working on a solution using SQL Server Integration Services (Visual Studio 2005).
You are pulling data from your database and trying to place the results into a flat file (.CSV) in UTF-8 format. The solution exports the data perfectly and keeps the special characters in the file because you have used 65001 as the code page.
However, when you open the text file or try to load it into another process, it says the file is ANSI instead of UTF-8. If you open the file in Notepad, do a Save As and change the encoding to UTF-8, then your external process works, but this is tedious manual work.
What I have found is that when you specify the Code Page property of the flat file connection manager, it does generate a UTF-8 file. However, it generates a version of the UTF-8 file which is missing what is called the Byte Order Mark (BOM).
For example, if a CSV file starts with the characters AA, the UTF-8 BOM would be the bytes 0xEF, 0xBB and 0xBF in front of them. Even though the file has no BOM, it is still UTF-8.
Unfortunately, some old legacy systems search for the BOM to determine the type of the file, and it appears that your process is doing the same.
To work around the problem, you can use the following piece of code in a script task which can be run after the export process.
using System.IO;
using System.Text;
using System.Threading;
using System.Globalization;
static void Main(string[] args)
{
string pattern = "*.csv";
string[] files = Directory.GetFiles(@".\", pattern, SearchOption.AllDirectories);
FileCodePageConverter converter = new FileCodePageConverter();
converter.SetCulture("en-US");
foreach (string file in files)
{
converter.Convert(file, file, "Windows-1252"); // Convert from code page Windows-1252 to UTF-8
}
}
class FileCodePageConverter
{
public void Convert(string path, string path2, string codepage)
{
byte[] buffer = File.ReadAllBytes(path);
if (buffer.Length < 3 || buffer[0] != 0xef || buffer[1] != 0xbb || buffer[2] != 0xbf) // no UTF-8 BOM present
{
byte[] buffer2 = Encoding.Convert(Encoding.GetEncoding(codepage), Encoding.UTF8, buffer);
byte[] utf8 = new byte[] { 0xef, 0xbb, 0xbf };
FileStream fs = File.Create(path2);
fs.Write(utf8, 0, utf8.Length);
fs.Write(buffer2, 0, buffer2.Length);
fs.Close();
}
}
public void SetCulture(string name)
{
Thread.CurrentThread.CurrentCulture = new CultureInfo(name);
Thread.CurrentThread.CurrentUICulture = new CultureInfo(name);
}
}
When you run the package, you will find that all the CSVs in the designated folder have been converted into UTF-8 with a byte order mark.
This way your external process will be able to work with the exported CSV files.
If you are only looking at a particular folder, send that folder path to the script task in a variable and use the following:
string sPath;
sPath=Dts.Variables["User::v_ExtractPath"].Value.ToString();
string pattern = "*.txt";
string[] files = Directory.GetFiles(sPath, pattern);
I hope this helps!!
OK - I seem to have found an acceptable workaround on the SQL Server forums. Essentially I had to create two UTF-8 template files, use a File System Task to copy them to my destination, then make sure I was appending data rather than overwriting.
For very large files, @Prashanthi's in-memory solution will cause out-of-memory exceptions. Here is my implementation, a variation of the code from here.
public static void ConvertFileEncoding(String path,
Encoding sourceEncoding, Encoding destEncoding)
{
// If the source and destination encodings are the same, do nothing.
if (sourceEncoding == destEncoding)
{
return;
}
// otherwise, move file to a temporary path before processing
String tempPath = Path.GetDirectoryName(path) + "\\" + Guid.NewGuid().ToString() + ".csv";
File.Move(path, tempPath);
// Convert the file.
try
{
FileStream fileStream = new FileStream(tempPath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
using (StreamReader sr = new StreamReader(fileStream, sourceEncoding, false))
{
using (StreamWriter sw = new StreamWriter(path, false, destEncoding))
{
//this seems to not work here
//byte[] utf8 = new byte[] { 0xef, 0xbb, 0xbf };
//sw.BaseStream.Write(utf8, 0, utf8.Length);
int charsRead;
char[] buffer = new char[128 * 1024];
while ((charsRead = sr.ReadBlock(buffer, 0, buffer.Length)) > 0)
{
sw.Write(buffer, 0, charsRead);
}
}
}
}
finally
{
File.Delete(tempPath);
}
}
I know this is a very old topic, but here is another answer that may be easier to implement than the ones already posted (take your pick).
I found this utility; you can download the .exe file from this location. (It's free.)
Make sure to follow the instructions in the first link and copy the .exe into your C:\Windows\System32 and C:\Windows\SysWOW64 for easy usage without having to type/remember complicated paths.
In SSIS, add an Execute process task.
Configure the object with convertcp.exe in the Process -> Executable field.
Configure the object with the arguments in the Process -> Arguments field as follows: 0 65001 /b /i "<OriginalFilePath>\<OriginalFile>.csv" /o "<TargetFilePath>\<TargetFile>_UTF-8.csv"
I suggest setting Window style to Hidden.
Done! If you run the package, the Execute Process Task will convert the original ANSI file to UTF-8. You can convert from other code pages to other code pages as well; just find the code page numbers and you are good to go!
Basically, this command-line utility gives SSIS the ability to convert from code page to code page using the Execute Process Task. It worked like a charm for me. (If you deploy to a SQL Server, you will of course have to copy the executable into the same system folders on the server as well.)
Best, Raphael
