I am reading a table in an object and I need to generate a passthrough ebcidic file from it. This is a spring batch step. There was some suggestions to use jrecord to write an aggregator and a FlatFileItemWriter.
Any clues ?
JRecord is possible solution, I can not say whether there is a better solution for you or not as I do not
know anything about Spring-Batch. This is perhaps more of an extended Comment than a pure answer
JRecord reads / writes files using a File-Schema (or File Description).
Normally this file-schema is a Cobol-Copybook although it also can be a Xml~Schema. The file schema can also be defined in the Program if need be. Given you want to write Ebcdic files, I would think a Cobol-Copybook
will be needed at some stage.
JRecord also support for mainframe/Cobol sequential File structures (FB - Fixed-Width files)
which is what you want
JRecord allows access to fields either by Field-Name or Field-Index (or field id). Note Record_Type_index is to handle files with multiple record types (e.g. header-record, detail-record, footer-record files).
outLine.getFieldValue(record_Type_Index, field_Index).set(...)
or
outLine.getFieldValue("Field-Name").set(...)
Bruce Martin (author of JRecord)
Discussions continued at JRecord forum
https://sourceforge.net/p/jrecord/discussion/678634/thread/2709ab72/?limit=25#c009/8287
Related
I have a legacy VB application that still has some life in it, and I am wanting to translate it to another language.
I plan to write a Ruby script, possibly utilising a parser, to extract all strings from the three million lines of source, replace them with constants, and move them to a string resource file that can be used to provide translations.
Is anyone aware of a script/library that could be used to intelligently extract the strings?
I'm not aware of any existing off-the-shelf tool that you could use. We created a tool like this at my work and it worked well. The FRM file format is quite simple (although only briefly documented). We wrote a tool that (1) extracted all strings from control definitions and (2) generated the code to reload them at runtime during Form_Load.
Are there any rules for file extensions? For example, I wrote some code which reads and writes a byte pattern that is only understood by that specific programm. I'm assuming my anti virus programm won't be too happy if I give it the name "pleasetrustme.exe"... Is it gerally allowed to use those extensions? And what about the lesser known ones, like ".arw"?
You can use any file extension you want (or none at all). Using standard extensions that reflect the actual type of the file just makes things more convenient. On Windows, file extensions control stuff like how the files are displayed in Windows Explorer and what happens when you double click on it.
I wrote some code which reads and writes a byte pattern that is only
understood by that specific programm.
A file extension is only an indication of what type of data will be inside, never a guarantee that certain data formatted in a specific way will be inside the file.
For your own specific data structure it is of course always best to choose an extension that is not already in use for other file formats (or use a general extension like .dat or .bin maybe). This also has the advantage of being able to use an own icon without it being overwritten by other software using the same extension - or the other way around.
But maybe even more important when creating a custom (binary?) file format, is to provide a magic number as the first bytes of that file, maybe followed by a file header structure containing a version number etc. That way your own software can first check the header data to make sure it's the right type and version (for example: anyone could rename any file type to your extension, so your program needs to have a way to do some checks inside the file before reading the remaining data).
Is the ReplaceFile Windows API a convenience function only, or does it achieve anything beyond what could be coded using multiple calls to MoveFileEx?
I'm currently in the situation where I need to
write a temporary file and then
rename this temporary file to the original filename, possibly replacing the original file.
I thought about using MoveFileEx with MOVEFILE_REPLACE_EXISTING (since I don't need a backup or anything) but there is also the ReplaceFile API and since it is mentioned under Alternatives to TxF.
This got me thinking: Does ReplaceFile actually do anything special, or is it just a convenience wrapper for MoveFile(Ex)?
I think the key to this can be found in this line from the documentation (my emphasis):
The replacement file assumes the name of the replaced file and its identity.
When you use MoveFileEx, the replacement file has a different identity. Its creation date is not preserved, the creator is not preserved, any ACLs are not preserved and so on. Using ReplaceFile allows you to make it look as though you opened the file, and modified its contents.
The documentation says it like this:
Another advantage is that ReplaceFile not only copies the new file data, but also preserves the following attributes of the original file:
Creation time
Short file name
Object identifier
DACLs
Security resource attributes
Encryption
Compression
Named streams not already in the replacement file
For example, if the replacement file is encrypted, but the
replaced file is not encrypted, the resulting file is not
encrypted.
Any app that wants to update a file by writing to a temp and doing the rename/rename/delete dance (handling all the various failure scenarios correctly), would have to change each time a new non-data attribute was added to the system. Rather than forcing all apps to change, they put in an API that is supposed to do this for you.
So you could "just do it yourself", but why? Do you correctly cover all the failure scenarios? Yes, MS may have a bug, but why try to invent the wheel?
NB, I have a number of issues with the programming model (better to do a "CreateUsingTemplate") but it's better than nothing.
Is there a standard or open format which can be used to describe the formating of a flat file. My company integrates many different customer file formats. With an XML file it's easy to get or create an XSD to describe the XML file format. I'm looking for something similar to describe a flat file format (fixed width, delimited etc). Stylus Studio uses a proprietary .conv format to do this. That .conv format can be used at runtime to transform an arbitrary flat file to an XML file. I was just wondering if there was any more open or standards based method for doing the same thing.
I'm looking for one method of describing a variety of flat file formats whether they are fixed width or delimited, so CSV is not an answer to this question.
XFlat:
http://www.infoloom.com/gcaconfs/WEB/philadelphia99/lyons.HTM#N29
http://www.unidex.com/overview.htm
For complex cases (e.g. log files) you may consider a lexical parser.
About selecting existing flat file formats: There is the Comma-separated values (CSV) format. Or, more generally, DSV. But these are not "fixed-width", since there's a delimiter character (such as a comma) that separates individual cells. Note that though CSV is standardized, not everybody adheres to the standard. Also, CSV may be to simple for your purposes, since it doesn't allow a rich document structure.
In that respect, the standardized and only slightly more complex (but thus more useful) formats JSON and YAML are a better choice. Both are supported out of the box by plenty of languages.
Your best bet is to have a look at all languages listed as non-binary in this overview and then determine which works best for you.
About describing flat file formats: This could be very easy or difficult, depending on the format. Though in most cases easier solutions exist, one way that will work in general is to view the file format as a formal grammar, and write a lexer/parser for it. But I admit, that's quite† heavy machinery.
If you're lucky, a couple of advanced regular expressions may do the trick. Most formats will not lend themselves for that however.‡ If you plan on writing a lexer/parser yourself, I can advise PLY (Python Lex-Yacc). But many other solutions exists, in many different languages, a lot of them more convenient than the old-school Lex & Yacc. For more, see What parser generator do you recommend?
†: Yes, that may be an understatement.
‡: Even properly describing the email address format is not trivial.
COBOL (whether you like it or not) has a standard format for describing fixed-width record formats in files.
Other file formats, however, are somewhat simpler to describe. A CSV file, for example, is just a list of strings. Often the first row of a CSV file is the column names -- that's the description.
There are examples of using JSON to formulate metadata for text files. This can be applied to JSON files, CSV files and fixed-format files.
Look at http://www.projectzero.org/sMash/1.1.x/docs/zero.devguide.doc/zero.resource/declaration.html
This is IBM's sMash (Project Zero) using JSON to encode metadata. You can easily apply this to flat files.
At the end of the day, you will probably have to define your own file standard that caters specifically to your storage needs. What I suggest is using xml, YAML or JSON as your internal container for all of the file types you receive. On top of this, you will have to implement some extra validation logic to maintain meta-data such as the column sizes of the fixed width files (for importing from and exporting to fixed width). Alternatively, you can store or link a set of metadata to each file you convert to the internal format.
There may be a standard out there, but it's too hard to create 'one size fits all' solutions for these problems. There are entity relationship management tools out there (Talend, others) that make creating these mappings easier, but you will still need to spend a lot of time maintaining file format definitions and rules.
As for enforcing column width, xml might be the best solution as you can describe the formats using xml schemas (with the length restriction). For YAML or JSON, you may have to write your own logic for this, although I'm sure someone else has come up with a solution.
See XML vs comma delimited text files for further reference.
I don't know if there is any standard or open format to describe a flat file format. But one industry has done this: the banking industry. Financial institutions are indeed communicating using standardized message over a dedicated network called SWIFT. SWIFT messages were originally positional (before SWIFTML, the XMLified version). I don't know if it's a good suggestion as it's kinda obscure but maybe you could look at the SWIFT Formatting Guide, it may gives you some ideas.
Having that said, check out Flatworm, an humble flat file parser. I've used it to parse positional and/or CSV file and liked its XML descriptor format. It may be a better suggestion than SWIFT :)
CSV
CSV is a delimited data format that has fields/columns separated by the comma character and records/rows separated by newlines. Fields that contain a special character (comma, newline, or double quote), must be enclosed in double quotes. However, if a line contains a single entry which is the empty string, it may be enclosed in double quotes. If a field's value contains a double quote character it is escaped by placing another double quote character next to it. The CSV file format does not require a specific character encoding, byte order, or line terminator format.
The CSV entry on wikipedia allowed me to find a comparison of data serialization formats that is pretty much what you asked for.
The only similar thing I know of is Hachoir, which can currently parse 70 file formats:
http://bitbucket.org/haypo/hachoir/wiki/Home
I'm not sure if it really counts as a declarative language, since it's plugin parser based, but it seems to work, and is extensible, which may meet your needs just fine.
As an aside, there are interesting standardised, extensible flat-file FORMATS, such as IFF (Interchange File Format).
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
EDIT: I would really like to see some general discussion about the formats, their pros and cons!
EDIT2: The 'bounty didn't really help to create the needed discussion, there are a few interesting answers but the comprehensive coverage of the topic is still missing. Six persons marked the question as favourites, which shows me that there is an interest in this discussion.
When deciding about internationalization the toughest part IMO is the choice of storage format.
For example the Zend PHP Framework offers the following adapters which cover pretty much all my options:
Array : no, hard to maintain
CSV : don't know, possible problems with encoding
Gettext : frequently used, poEdit for all platforms available BUT complicated
INI : don't know, possible problems with encoding
TBX : no clue
TMX : too much of a big thing? no editors freely available.
QT : not very widespread, no free tools
XLIFF : the comming standard? BUT no free tools available.
XMLTM : no, not what I need
basically I'm stuck with the 4 'bold' choices. I would like to use INI files but I'm reading about the encoding problems... is it really a problem, if I use strict UTF-8 (files, connections, db, etc.)?
I'm on Windows and I tried to figure out how poEdit functions but just didn't manage. No tutorials on the web either, is gettext still a choice or an endangered species anyways?
What about XLIFF, has anybody worked with it? Any tips on what tools to use?
Any ideas for Eclipse integration of any of these technologies?
POEdit isn't really hard to get a hang of. Just create a new .po file, then tell it to import strings from source files. The program scans your PHP files for any function calls matching _("Text"), gettext("Text"), etc. You can even specify your own functions to look for.
You then enter a translation in the appropriate box. When you save your .po file, a .mo file is automatically generated. That's just a binary version of the translations that gettext can easily parse.
In your PHP script make a call to bindtextdomain() telling it where your .mo file is located. Now any strings passed to gettext (or the underscore function) will be translated.
It makes it really easy to keep your translation files up to date. POEdit also has some neat features like allowing comments, showing changed and dropped strings and allowing fuzzy matches, which means you don't have to re-translate strings that have been slightly modified.
There is always Translate Toolkit which allow translating between I think all mentioned formats, and preferred gettext (po) and XLIFF.
you can use INI if you want, it's just that INI doesn't have a way to tell anyone that it is in UTF8, so if someone opens your INI with an editor, it might corrupt yout file.
So the idea is that, if you can trust the user to edit it with a UTF8 encoding.
You can add a BOM at the start of the file, some editors knows about it.
What do you want it to store ? user generated content or your application ressources ?
I worked with two of these formats on the l18n side: TMX and XLIFF. They are pretty similar. TMX is more popular nowdays, but XLIFF is gaining support quickly. There was at least one free XLIFF editor when I last looked into it: Transolution but it is not being developed now.
I do the data storage myself using a custom design - All displayed text is stored in the DB.
I have two tables.
The first table has an identity value, a 32 character varchar field (indexed on this field)
and a 200 character english description of the phrase.
My second table has the identity value from the first table, a language code (EN_UK,EN_US,etc) and an NVARCHAR column for the text.
I use an nvarchar for the text because it supports other character sets which I don't yet use.
The 32 character varchar in the first table stores something like 'pleaselogin' while the second table actually stores the full "Please enter your login and password below".
I have created a huge list of dynamic values which I replace at runtime. An example would be "You have {[dynamic:passworddaysremain]} days to change your password." - this allows me to work around the word ordering in different languages.
I have only had to deal with Arabic numerals so far but will have to work something out for the first user who requires non arabic numbers.
I actually pull this information out of the database on a 2 hourly interval and cache it to the disk in a file for each language in XML. Extensive use of the CDATA is used.
There are many options available, for performance you could use html templates for each language - My method works well but does use the XML DOM a lot at runtime to create the pages.
One rather simple approach is to just use a resource file and resource script. Programs like MSVC have no problem editing them. They're also reasonably friendly to other systems (and to text editors) as well. You can just create separate string tables (and bitmap tables) for each language, and mark each such table with what language it is in.
None of those choices looks very appetizing to me.
If you're sending files out for translation in multiple languages, then you want to be able to trust that the encodings are correct, especially if you no one in your team speaks those languages. Sometimes it's difficult to spot an encoding problem in a foreign language, and it is just too easy to inadvertantly corrupt file encodings if you let your OS 'guess'.
You really want a format that declares its encoding. Otherwise, translators or their translation tools might select something other than UTF-8. For my money, any kind of simple XML format is best, but it looks like you'd need to roll your own in Zend. XLIFF and TMX are certainly overkill.
A format like Java's XML resources would be ideal.
This might be a little different from what's been posted so far and may not be exactly what you're looking for, but I thought I would add it, if for nothing else but a different approach. I went with an object-oriented approach. What I did was create a system that encapsulates language files into a class by storing them in an array of string=>translation pairs. Access to the translation is through a method called translate with the key string as a parameter. Extending classes inherit the parent's language array and can add to it or overwrite it. Because the classes are extensible, you can change a base class and have the changes propagate through the children, making more maintainable than an array by itself. Plus, you only call the classes you need.
We just store the strings in the DB and have a translator mode built into the application to handle actually adding strings for different languages.
In the application we use various tricks to create text ids, like
£("btn_save")
£(Order.class,"amt")
The translations is loaded from the db when the system boots, or when a reload is manually triggered. The £ method takes care of looking up the translated string according the the language specified in the user session.
You can check my l10n tool called iL10Nz on http://www.myl10n.net
You can upload po/pot files, xliff, ini files , translate, download.
you can also check out this video on youtube
http://www.youtube.com/watch?v=LJLmxMFxaxA
Thanks
Olivier