Modify ASN.1 BER-encoded CDR file - SMS

I am processing some CDRs (call detail records). I don't know exactly what kind of file it is, but I believe it to be an ASN.1 BER-encoded file. My problem is that I want to modify some data in these files, but I don't know which editor or decoder I can use to do that. I have searched a lot and found many ASN.1 decoders as well as ASN.1 BER viewers/editors, but none of them allows what I want to do.
This CDR is supposed to contain customer details, phone numbers, telecom services (telephony, SMS, MMS), etc.
One of the CDR file names is GGSN01_20120105000102_56641-09-12-01-09%3A30
and the file type is reported simply as "File".
No other information is available. When I open one of these files in a text editor it shows some rectangles along with some text data.
Any telecom person could definitely help me; I am new to the telecom domain.
Please ask if you need more information. Thanks.

You would need to know something about ASN.1 and BER to be able to correctly edit your file. BER is a binary format, not ASCII text, which is why you see rectangles in your text editor. Even modifying any embedded plain text is only safe if you do not change the length of the string: BER uses nested structures that encode lengths, so a change in the length of a string value requires adjustments to the encoded lengths of all the enclosing structures. Additionally, to really know what your data is, you would need the ASN.1 schema that describes it (the definitions of the types your data is encoded against).
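To make the length issue concrete, here is a minimal Python sketch (no external libraries) that walks the top-level tag-length-value triplets of a BER file; the file name is just a placeholder. It illustrates why changing a value's length forces you to rewrite the enclosing lengths as well.

    # Minimal BER TLV walker: prints the tag/length header and value length of each top-level element.
    # "cdr.ber" is a placeholder file name.

    def read_tlv(buf, pos):
        tag_start = pos
        first = buf[pos]; pos += 1
        if first & 0x1F == 0x1F:                 # multi-byte tag number
            while buf[pos] & 0x80:
                pos += 1
            pos += 1
        length_byte = buf[pos]; pos += 1
        if length_byte == 0x80:                  # indefinite length (terminated by 00 00)
            raise NotImplementedError("indefinite length not handled in this sketch")
        if length_byte & 0x80:                   # long form: next N bytes hold the length
            n = length_byte & 0x7F
            length = int.from_bytes(buf[pos:pos + n], "big")
            pos += n
        else:                                    # short form
            length = length_byte
        return tag_start, buf[tag_start:pos], length, pos

    with open("cdr.ber", "rb") as f:
        data = f.read()

    pos = 0
    while pos < len(data):
        start, header, length, value_off = read_tlv(data, pos)
        print(f"offset {start}: tag/length header {header.hex()}, value length {length}")
        pos = value_off + length                 # skip the value; recurse here to walk nested TLVs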
You could use a tool such as an ASN.1 editor, but without the requisite background knowledge I don't think it will be very helpful to you. You can follow the various links on this resources page to get more information about ASN.1. (Full disclosure: I am currently an Obj-Sys employee.)

Look for tools like enber and unber; they come as debugging tools with Lev Walkin's free ASN.1 compiler. At least you get a text format from them.
The systematic solution is, of course, to write a program that reads the BER file, applies the changes, and then writes out the altered BER file. To do so you need the ASN.1 specification of your CDR format (usually found in the specifications of the standard you are using, e.g. IMS), an ASN.1 compiler such as Lev's, and some programming skills.
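As a rough illustration of that decode-modify-re-encode workflow, here is a sketch using the Python asn1tools package (one of several ASN.1 compilers). The schema file name, type name, and field name below are placeholders, since the real ones depend on your CDR specification, and a real CDR file may contain many concatenated records rather than a single one.

    import asn1tools

    # Placeholder names: the real schema, top-level type, and field come from your CDR spec.
    spec = asn1tools.compile_files("cdr_schema.asn", codec="ber")

    with open("input.ber", "rb") as f:
        record = spec.decode("CallEventRecord", f.read())

    record["someField"] = "new value"            # placeholder field name and value

    with open("output.ber", "wb") as f:
        f.write(spec.encode("CallEventRecord", record))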

Related

Searching through the protocol buffer file

I'm new to protocol buffers, and I was wondering whether it is possible to search a protocol buffers binary file and read the data back in a structured format. For example, if a message in my .proto file has 4 fields, I would like to serialize the message, write multiple messages into a file, and then search for a particular field in the file. If I find the field, I would like to read the message back in the same structured format it was written in. Is this possible with protocol buffers? If so, any sample code or examples would be very helpful. Thank you.
You should treat the protobuf library as a serialization protocol, not an all-in-one library that supports complex operations (such as querying, indexing, or picking out particular data). Google has various libraries on top of the open-sourced portion of protobuf to do this, but they are not released as open source, as they are tied to Google's own infrastructure. That being said, what you want is certainly possible, yet you need to write some code.
Anyhow, some of your requirements are:
one file contains various serialized binaries.
search a particular field in each serialized binary and extract that chunk.
There are several ways to achieve them.
The most popular way to do serial read/write is for the file to contain a series of [size, type, serialization output] records. That is, each serialized output is prefixed by its size and type (either 4/8 bytes or variable-length) to help reading and parsing. So you just repeat this procedure: 1) read size and type, 2) read the binary blob of the given size, 3) parse it as the given type, 4) go to 1). If you use a union type, or the whole file shares one message type, you may skip the type. You cannot drop the size, as there is no way to know where one output ends by itself. If you want random read/write, another kind of data structure is necessary.
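For example, a minimal Python sketch of that [size, serialized message] framing. MyMessage stands for whatever class protoc generated from your .proto, and the 4-byte big-endian size prefix is just one possible convention:

    import struct

    # MyMessage is a placeholder for a protoc-generated class, e.g. imported from my_messages_pb2.

    def write_message(f, msg):
        data = msg.SerializeToString()
        f.write(struct.pack(">I", len(data)))    # 4-byte big-endian size prefix
        f.write(data)

    def read_messages(f, message_class):
        while True:
            header = f.read(4)
            if len(header) < 4:
                break                            # end of file
            (size,) = struct.unpack(">I", header)
            msg = message_class()
            msg.ParseFromString(f.read(size))
            yield msg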
'Search field' in a binary file is more tricky. One way is to read and parse the outputs one by one and check for the existence of the field with HasField(). It is the most obvious and straightforward way, if slow. If you want to search for a field by number (say, you want to find 'optional string email = 3;') by scanning for its binary encoding (like 0x1A: field number 3, wire type 2), that is not possible. In a serialized binary stream, field information is stored as a mere number. Without the exact context (the .proto schema or the binary file's structure), the number alone doesn't mean anything: there is no guarantee that 0x1A comes from that field rather than from a field of some other message type, or is actually the number 26, or part of some other number, etc. That is, you need to maintain that information yourself. You may create another file or database with the necessary information to fetch a particular message (like the location of the serialized output containing a given field value).
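Continuing the sketch above, the linear HasField() scan would then look roughly like this (again assuming a generated MyMessage class with an optional email field, and a made-up file name):

    # Read every framed message and keep only those where the optional field is set.
    with open("messages.bin", "rb") as f:
        matches = [m for m in read_messages(f, MyMessage) if m.HasField("email")]

    for m in matches:
        print(m.email)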
Long story short, what you ask for is beyond what the open-sourced protobuf library does by itself, yet you can build it to your own requirements.
I hope, this is what you are looking for:
http://temk.github.io/protobuf-utils/
This is a command-line utility for searching within a protobuf file.

Can sorting Japanese kanji words be done programmatically?

I've recently discovered, to my astonishment (having never really thought about it before), that machine-sorting Japanese proper nouns is apparently not possible.
I work on an application that must allow the user to select a hospital from a 3-menu interface. The first menu is Prefecture, the second is City Name, and the third is Hospital. Each menu should be sorted, as you might expect, so the user can find what they want in the menu.
Let me outline what I have found, as preamble to my question:
The expected sort order for Japanese words is based on their pronunciation. Kanji do not have an inherent order (there are tens of thousands of kanji in use), but the Japanese phonetic syllabaries do have an order: あ、い、う、え、お、か、き、く、け、こ... and so on for the fifty traditional distinct sounds (a few of which are obsolete in modern Japanese). This sort order is called 五十音順 (gojūon jun, or '50-sound order').
Therefore, Kanji words should be sorted in the same order as they would be if they were written in hiragana. (You can represent any kanji word in phonetic hiragana in Japanese.)
The kicker: there is no canonical way to determine the pronunciation of a given word written in kanji. You never know. Some kanji have ten or more different pronunciations, depending on the word. Many common words are in the dictionary, and I could probably hack together a way to look them up from one of the free dictionary databases, but proper nouns (e.g. hospital names) are not in the dictionary.
So, in my application, I have a list of every prefecture, city, and hospital in Japan. In order to sort these lists, which is a requirement, I need a matching list of each of these names in phonetic form (kana).
I can't come up with anything other than paying somebody fluent in Japanese (I'm only so-so) to manually transcribe them. Before I do so though:
Is it possible that I am totally high on fire, and there actually is some way to do this sorting without creating my own mappings of kanji words to phonetic readings, that I have somehow overlooked?
Is there a publicly available mapping of prefecture/city names, from the government or something? That would reduce the manual mapping I'd need to do to only hospital names.
Does anybody have any other advice on how to approach this problem? Any programming language is fine--I'm working with Ruby on Rails but I would be delighted if I could just write a program that would take the kanji input (say 40,000 proper nouns) and then output the phonetic representations as data that I could import into my Rails app.
宜しくお願いします。
For data, dig into Google's Japanese IME (Mozc) data files here:
https://github.com/google/mozc/tree/master/src/data
There is lots of interesting data there, including IPA dictionaries.
Edit:
You may also try MeCab; it can use the IPA dictionary and can convert kanji to katakana for most words.
https://taku910.github.io/mecab/
and there are Ruby bindings for it too:
https://taku910.github.io/mecab/bindings.html
and here somebody tested Ruby with MeCab using the -Oyomi tagger option:
http://hirai2.blog129.fc2.com/blog-entry-4.html
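For reference, a rough Python sketch of the same idea using the mecab-python3 bindings (the hospital names below are made up). MeCab's -Oyomi output is katakana, and a plain string sort on katakana is a reasonable approximation of gojūon order:

    import MeCab

    tagger = MeCab.Tagger("-Oyomi")              # katakana reading output (needs the IPA dictionary)

    names = ["高橋病院", "青森市民病院", "沖縄中央病院"]   # made-up example names

    def reading(name):
        return tagger.parse(name).strip()

    for name in sorted(names, key=reading):
        print(name, reading(name))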
Just a quick follow-up to explain the actual solution we eventually used. Thanks to all who recommended MeCab; it appears to have done the trick.
We have a mostly-Rails backend, but in our circumstance we didn't need to solve this problem on the backend. For user-entered data, e.g. creating new entities with Japanese names, we modified the UI to require the user to enter the phonetic yomigana in addition to the kanji name. Users seem accustomed to this. The problem was the large corpus of data that is built into the app--hospital, company, and place names, mainly.
So, what we did is:
We converted all the source data (a list of 4000 hospitals with name, address, etc) into .csv format (encoded as UTF-8, of course).
Then, for developer use, we wrote a ruby script that:
Uses mecab to translate the contents of that file into Japanese phonetic readings
(the precise command used was mecab -Oyomi -o seed_hospitals.converted.csv seed_hospitals.csv, which outputs a new file with the kanji replaced by the phonetic equivalent, expressed in full-width katakana).
Standardizes all yomikata into hiragana (because users tend to enter hiragana when manually entering yomikata, and hiragana and katakana sort differently). Ruby makes this easy once you find the right call: NKF.nkf("-h1 -w", katakana_str) # -h1 means convert to hiragana, -w means output UTF-8. (A rough Python equivalent is sketched after this list.)
Using the awesomely convenient new Ruby 1.9.2 version of CSV, combines the input file with the MeCab-translated file, so that the resulting file has extra columns inserted, along the lines of NAME, NAME_YOMIGANA, ADDRESS, ADDRESS_YOMIGANA, and so on.
We then use the data from the resulting .csv file to seed our Rails app with its built-in values.
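As mentioned above, here is a rough Python equivalent of the katakana-to-hiragana normalization step (it relies on the fixed 0x60 code-point offset between the two kana blocks in Unicode; the prolonged sound mark ー is left untouched):

    # Convert full-width katakana (ァ..ヶ) to hiragana by shifting code points down by 0x60.
    KATA_TO_HIRA = {cp: cp - 0x60 for cp in range(ord("ァ"), ord("ヶ") + 1)}

    def to_hiragana(text):
        return text.translate(KATA_TO_HIRA)

    print(to_hiragana("トウキョウビョウイン"))       # -> とうきょうびょういん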
From time to time the client updates the source data, so we will need to do this whenever that happens.
As far as I can tell, this output is good. My Japanese isn't good enough to be 100% sure, but a few of my Japanese coworkers skimmed it and said it looks all right. I put a slightly obfuscated sample of the converted addresses in this gist so that anybody who cared to read this far can see for themselves.
UPDATE: The results are in... it's pretty good, but not perfect. Still, it looks like it correctly phoneticized 95%+ of the quasi-random addresses in my list.
Many thanks to all who helped me!
Nice to hear people are working with Japanese.
I think you're spot on with your assessment of the problem difficulty. I just asked one of the Japanese guys in my lab, and the way to do it seems to be as you describe:
Take a list of Kanji
Infer (guess) the yomigana
Sort yomigana by gojuon.
The hard part is obviously step two. I have two guys in my lab: 高橋 and 高谷. Naturally, when sorting reports etc. by name they appear nowhere near each other.
EDIT
If you're fluent in Japanese, have a look here: http://mecab.sourceforge.net/
It's a pretty popular tool, so you should be able to find English documentation too (the man page for mecab has English info).
I'm not familiar with MeCab, but I think using MeCab is a good idea.
That said, let me introduce another method.
If your app is written in Microsoft VBA, you can call "GetPhonetic" function. It's easy to use.
see : http://msdn.microsoft.com/en-us/library/aa195745(v=office.11).aspx
Sorting prefectures by their pronunciation is not common. Most Japanese are used to seeing prefectures sorted by 「都道府県コード」 (prefecture code).
e.g. 01:北海道, 02:青森県, …, 13:東京都, …, 27:大阪府, …, 47:沖縄県
These codes are defined in "JIS X 0401" and "ISO 3166-2:JP".
see (Wikipedia Japanese) :
http://ja.wikipedia.org/wiki/%E5%85%A8%E5%9B%BD%E5%9C%B0%E6%96%B9%E5%85%AC%E5%85%B1%E5%9B%A3%E4%BD%93%E3%82%B3%E3%83%BC%E3%83%89
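If you go that route, the sort itself is trivial once you have the code table. A minimal Python sketch, showing only a handful of the 47 JIS X 0401 codes:

    # Partial table; the full JIS X 0401 table has 47 entries (01-47).
    PREF_CODE = {"北海道": 1, "青森県": 2, "東京都": 13, "大阪府": 27, "沖縄県": 47}

    prefectures = ["大阪府", "沖縄県", "北海道", "東京都", "青森県"]
    for p in sorted(prefectures, key=PREF_CODE.get):
        print(f"{PREF_CODE[p]:02d}: {p}")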

Is there a standard format for describing a flat file?

Is there a standard or open format which can be used to describe the formatting of a flat file? My company integrates many different customer file formats. With an XML file it's easy to get or create an XSD to describe the XML file format. I'm looking for something similar to describe a flat file format (fixed width, delimited, etc.). Stylus Studio uses a proprietary .conv format to do this. That .conv format can be used at runtime to transform an arbitrary flat file into an XML file. I was just wondering if there is any more open or standards-based method for doing the same thing.
I'm looking for one method of describing a variety of flat file formats whether they are fixed width or delimited, so CSV is not an answer to this question.
XFlat:
http://www.infoloom.com/gcaconfs/WEB/philadelphia99/lyons.HTM#N29
http://www.unidex.com/overview.htm
For complex cases (e.g. log files) you may consider a lexical parser.
About selecting existing flat file formats: there is the comma-separated values (CSV) format, or, more generally, DSV. But these are not "fixed-width", since there is a delimiter character (such as a comma) that separates individual cells. Note that although CSV is standardized, not everybody adheres to the standard. Also, CSV may be too simple for your purposes, since it doesn't allow a rich document structure.
In that respect, the standardized and only slightly more complex (but thus more useful) formats JSON and YAML are a better choice. Both are supported out of the box by plenty of languages.
Your best bet is to have a look at all languages listed as non-binary in this overview and then determine which works best for you.
About describing flat file formats: This could be very easy or difficult, depending on the format. Though in most cases easier solutions exist, one way that will work in general is to view the file format as a formal grammar, and write a lexer/parser for it. But I admit, that's quite† heavy machinery.
If you're lucky, a couple of advanced regular expressions may do the trick. Most formats will not lend themselves to that, however.‡ If you plan on writing a lexer/parser yourself, I can recommend PLY (Python Lex-Yacc). But many other solutions exist, in many different languages, a lot of them more convenient than old-school Lex & Yacc. For more, see What parser generator do you recommend?
  †: Yes, that may be an understatement.
  ‡: Even properly describing the email address format is not trivial.
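To give a feel for the lexer route, here is a tiny sketch using PLY, tokenizing a made-up pipe-delimited format (the field layout and sample data are invented):

    import ply.lex as lex

    # Invented toy format: pipe-delimited records, one per line.
    tokens = ("FIELD", "DELIM", "NEWLINE")

    t_FIELD = r"[^|\n]+"
    t_DELIM = r"\|"

    def t_NEWLINE(t):
        r"\n+"
        return t

    def t_error(t):
        raise SyntaxError(f"illegal character {t.value[0]!r}")

    lexer = lex.lex()
    lexer.input("JOHN|DOE|1980-01-01\nJANE|ROE|1975-06-30\n")
    for tok in lexer:
        print(tok.type, tok.value)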
COBOL (whether you like it or not) has a standard format for describing fixed-width record formats in files.
Other file formats, however, are somewhat simpler to describe. A CSV file, for example, is just a list of strings. Often the first row of a CSV file is the column names -- that's the description.
There are examples of using JSON to formulate metadata for text files. This can be applied to JSON files, CSV files and fixed-format files.
Look at http://www.projectzero.org/sMash/1.1.x/docs/zero.devguide.doc/zero.resource/declaration.html
This is IBM's sMash (Project Zero) using JSON to encode metadata. You can easily apply this to flat files.
At the end of the day, you will probably have to define your own file standard that caters specifically to your storage needs. What I suggest is using XML, YAML, or JSON as your internal container for all of the file types you receive. On top of this, you will have to implement some extra validation logic to maintain metadata such as the column sizes of the fixed-width files (for importing from and exporting to fixed width). Alternatively, you can store or link a set of metadata to each file you convert to the internal format.
There may be a standard out there, but it's too hard to create 'one size fits all' solutions for these problems. There are entity relationship management tools out there (Talend, others) that make creating these mappings easier, but you will still need to spend a lot of time maintaining file format definitions and rules.
As for enforcing column width, XML might be the best solution, as you can describe the formats using XML schemas (with length restrictions). For YAML or JSON, you may have to write your own logic for this, although I'm sure someone else has come up with a solution.
See XML vs comma delimited text files for further reference.
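As an illustration of the home-grown metadata idea from this answer, a small Python sketch that reads a JSON description of a fixed-width layout and slices records accordingly (the metadata format, field names, and widths are all invented):

    import json

    # Invented metadata format: an ordered list of {name, width} column definitions.
    layout = json.loads("""
    { "fields": [ {"name": "customer_id", "width": 8},
                  {"name": "name",        "width": 20},
                  {"name": "balance",     "width": 10} ] }
    """)

    def parse_fixed_width(line, fields):
        record, pos = {}, 0
        for field in fields:
            record[field["name"]] = line[pos:pos + field["width"]].strip()
            pos += field["width"]
        return record

    line = "00000042" + "John Smith".ljust(20) + "0000123.45"
    print(parse_fixed_width(line, layout["fields"]))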
I don't know if there is any standard or open format to describe a flat file format. But one industry has done this: the banking industry. Financial institutions are indeed communicating using standardized message over a dedicated network called SWIFT. SWIFT messages were originally positional (before SWIFTML, the XMLified version). I don't know if it's a good suggestion as it's kinda obscure but maybe you could look at the SWIFT Formatting Guide, it may gives you some ideas.
Having that said, check out Flatworm, an humble flat file parser. I've used it to parse positional and/or CSV file and liked its XML descriptor format. It may be a better suggestion than SWIFT :)
CSV
CSV is a delimited data format that has fields/columns separated by the comma character and records/rows separated by newlines. Fields that contain a special character (comma, newline, or double quote) must be enclosed in double quotes. However, if a line contains a single entry which is the empty string, it may be enclosed in double quotes. If a field's value contains a double quote character, it is escaped by placing another double quote character next to it. The CSV file format does not require a specific character encoding, byte order, or line terminator format.
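Those quoting rules are exactly what standard-library CSV modules implement; a quick Python illustration:

    import csv, io

    buf = io.StringIO()
    csv.writer(buf).writerow(["plain", 'has "quotes"', "has,comma", "has\nnewline"])
    print(buf.getvalue())          # special fields are quoted, embedded quotes are doubled

    rows = list(csv.reader(io.StringIO(buf.getvalue())))
    print(rows[0])                 # round-trips back to the original four fields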
The CSV entry on wikipedia allowed me to find a comparison of data serialization formats that is pretty much what you asked for.
The only similar thing I know of is Hachoir, which can currently parse 70 file formats:
http://bitbucket.org/haypo/hachoir/wiki/Home
I'm not sure if it really counts as a declarative language, since it's based on plugin parsers, but it seems to work, and is extensible, which may meet your needs just fine.
As an aside, there are interesting standardised, extensible flat-file FORMATS, such as IFF (Interchange File Format).

i18n - best practices for internationalization - XLIFF, gettext, INI, ...? [closed]

EDIT: I would really like to see some general discussion about the formats, their pros and cons!
EDIT2: The bounty didn't really help to create the needed discussion; there are a few interesting answers, but comprehensive coverage of the topic is still missing. Six people marked the question as a favourite, which shows me that there is interest in this discussion.
When deciding about internationalization the toughest part IMO is the choice of storage format.
For example the Zend PHP Framework offers the following adapters which cover pretty much all my options:
Array : no, hard to maintain
CSV : don't know, possible problems with encoding
Gettext : frequently used, poEdit for all platforms available BUT complicated
INI : don't know, possible problems with encoding
TBX : no clue
TMX : too much of a big thing? no editors freely available.
QT : not very widespread, no free tools
XLIFF : the coming standard? BUT no free tools available.
XMLTM : no, not what I need
Basically I'm stuck with the 4 'bold' choices. I would like to use INI files, but I'm reading about encoding problems... is it really a problem if I use strict UTF-8 (files, connections, DB, etc.)?
I'm on Windows and I tried to figure out how poEdit works but just didn't manage. There are no tutorials on the web either; is gettext still a viable choice or an endangered species anyway?
What about XLIFF, has anybody worked with it? Any tips on what tools to use?
Any ideas for Eclipse integration of any of these technologies?
POEdit isn't really hard to get the hang of. Just create a new .po file, then tell it to import strings from your source files. The program scans your PHP files for any function calls matching _("Text"), gettext("Text"), etc. You can even specify your own functions to look for.
You then enter a translation in the appropriate box. When you save your .po file, a .mo file is automatically generated. That's just a binary version of the translations that gettext can easily parse.
In your PHP script make a call to bindtextdomain() telling it where your .mo file is located. Now any strings passed to gettext (or the underscore function) will be translated.
It makes it really easy to keep your translation files up to date. POEdit also has some neat features like allowing comments, showing changed and dropped strings and allowing fuzzy matches, which means you don't have to re-translate strings that have been slightly modified.
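For what it's worth, the same .po/.mo mechanism exists outside PHP as well; here is a minimal sketch of the equivalent flow using Python's standard gettext module (the domain name and locale directory are placeholders):

    import gettext

    # Expects ./locale/de/LC_MESSAGES/myapp.mo, compiled from the .po file (e.g. with msgfmt or POEdit).
    t = gettext.translation("myapp", localedir="locale", languages=["de"], fallback=True)
    _ = t.gettext

    print(_("Please enter your login and password below."))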
There is always the Translate Toolkit, which allows converting between, I believe, all of the mentioned formats, with gettext (PO) and XLIFF preferred.
You can use INI if you want; it's just that INI has no way to tell anyone that it is in UTF-8, so if someone opens your INI file with an editor, it might corrupt your file.
So it comes down to whether you can trust the user to edit it with UTF-8 encoding.
You can add a BOM at the start of the file; some editors know about it.
What do you want to store: user-generated content or your application resources?
I worked with two of these formats on the l10n side: TMX and XLIFF. They are pretty similar. TMX is more popular nowadays, but XLIFF is gaining support quickly. There was at least one free XLIFF editor when I last looked into it, Transolution, but it is no longer being developed.
I do the data storage myself using a custom design: all displayed text is stored in the DB.
I have two tables.
The first table has an identity value, a 32-character varchar field (indexed on this field),
and a 200-character English description of the phrase.
My second table has the identity value from the first table, a language code (EN_UK, EN_US, etc.) and an NVARCHAR column for the text.
I use an nvarchar for the text because it supports other character sets which I don't yet use.
The 32-character varchar in the first table stores something like 'pleaselogin', while the second table actually stores the full "Please enter your login and password below".
I have created a huge list of dynamic values which I replace at runtime. An example would be "You have {[dynamic:passworddaysremain]} days to change your password." - this allows me to work around the word ordering in different languages.
I have only had to deal with Arabic numerals so far, but will have to work something out for the first user who requires non-Arabic numbers.
I actually pull this information out of the database at a two-hour interval and cache it to disk in an XML file for each language. CDATA sections are used extensively.
There are many options available; for performance you could use HTML templates for each language. My method works well, but it does use the XML DOM a lot at runtime to create the pages.
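As a rough illustration of the dynamic-value substitution described above, a small Python sketch (the placeholder syntax is the one from this answer; the lookup table is invented):

    import re

    def render(template, values):
        # Replace {[dynamic:key]} tokens with values looked up at runtime.
        return re.sub(r"\{\[dynamic:(\w+)\]\}",
                      lambda m: str(values[m.group(1)]),
                      template)

    print(render("You have {[dynamic:passworddaysremain]} days to change your password.",
                 {"passworddaysremain": 14}))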
One rather simple approach is to just use a resource file and resource script. Programs like MSVC have no problem editing them. They're also reasonably friendly to other systems (and to text editors) as well. You can just create separate string tables (and bitmap tables) for each language, and mark each such table with what language it is in.
None of those choices looks very appetizing to me.
If you're sending files out for translation in multiple languages, then you want to be able to trust that the encodings are correct, especially if no one on your team speaks those languages. Sometimes it's difficult to spot an encoding problem in a foreign language, and it is just too easy to inadvertently corrupt file encodings if you let your OS 'guess'.
You really want a format that declares its encoding. Otherwise, translators or their translation tools might select something other than UTF-8. For my money, any kind of simple XML format is best, but it looks like you'd need to roll your own in Zend. XLIFF and TMX are certainly overkill.
A format like Java's XML resources would be ideal.
This might be a little different from what's been posted so far and may not be exactly what you're looking for, but I thought I would add it, if for nothing else than a different approach. I went with an object-oriented approach. What I did was create a system that encapsulates language files into a class by storing them in an array of string => translation pairs. Access to a translation is through a method called translate, with the key string as a parameter. Extending classes inherit the parent's language array and can add to it or override it. Because the classes are extensible, you can change a base class and have the changes propagate through the children, making it more maintainable than an array by itself. Plus, you only load the classes you need.
We just store the strings in the DB and have a translator mode built into the application to handle actually adding strings for different languages.
In the application we use various tricks to create text ids, like
£("btn_save")
£(Order.class,"amt")
The translations are loaded from the DB when the system boots, or when a reload is manually triggered. The £ method takes care of looking up the translated string according to the language specified in the user session.
You can check my l10n tool called iL10Nz on http://www.myl10n.net
You can upload PO/POT files, XLIFF files, and INI files, then translate and download.
You can also check out this video on YouTube:
http://www.youtube.com/watch?v=LJLmxMFxaxA
Thanks
Olivier

manually finding the size of a block of text (ASCII format)

Is there an easy way to manually (i.e. not through code) find the size (in bytes, KB, etc.) of a block of selected text? Currently I am taking the text, cutting/pasting it into a new text document, saving it, then clicking "Properties" to get an estimate of the size.
I am developing mainly in visual studio 2008 but I need any sort of simple way to manually do this.
Note: I understand this is not specifically a programming question, but it's related to programming. I need this to compare a few functions and see which one is returning the smallest amount of text. I only need to do it a few times, so I figured writing a method for it would be overkill.
This question isn't meaningful as asked. Text can be encoded in different formats; ASCII, UTF-8, UTF-16, etc. The memory consumed by a block of text depends on which encoding you decide to use for it.
EDIT: To answer the question you've stated now (how do I determine which function is returning a "smaller" block of text) -- given a single encoding, the shorter text will almost always be smaller as well. Why can't you just compare the lengths?
In your comment you mention it's ASCII. In that case, it'll be one byte per character.
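If you do end up checking it in code after all, the comparison is a one-liner per encoding; for example, a quick Python sketch:

    text = "example block of text"

    for enc in ("ascii", "utf-8", "utf-16"):
        print(enc, len(text.encode(enc)), "bytes")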
I don't see the difference between using the code written by the app you're pasting into and using some other code. Being a Python person myself, whenever I want to check the length of some text I just do it in the interactive interpreter. Surely some equivalent solution more suited to your tastes would be appropriate?
I ended up just cutting/pasting the text into MS Word and using the character count feature there.
