RETURNN Switchboard data processing

Could anybody give me pointers on how to process the Switchboard dataset for training with RETURNN? I did see the BlissDataset class, which seems to be designed for Switchboard, but it's not clear to me what I should include in the paths given in the example:
Example:
./tools/dump-dataset.py "
{'class':'BlissDataset',
'path': '/u/tuske/work/ASR/switchboard/corpus/xml/train.corpus.gz',
'bpe_file': '/u/zeyer/setups/switchboard/subwords/swb-bpe-codes',
'vocab_file': '/u/zeyer/setups/switchboard/subwords/swb-vocab'}"
The Switchboard dataset has several folders with audio files, e.g. swb1_d2/data/*.sph, and transcripts under swb1_LDC97S62/swb_ms98_transcriptions/**/*.
I'm not quite sure how to proceed with this to get a dataset that can be used to train RETURNN.

At our group (RWTH Aachen University), we use the config as it was published on GitHub. As you see, this one uses ExternSprintDataset. That dataset uses Sprint (publicly called RWTH ASR (RASR), see here) as an external tool (run in a subprocess) to handle the data (feature extraction, etc.). Sprint gets a Bliss XML file which describes all the segments, with paths to the audio, audio offsets, and transcriptions, and it also gets further configs for the feature extraction and maybe other things. There is an open-source version of RASR which should work, but it might be a bit involved to get it working.
The BlissDataset was planned to be a simpler replacement for that. However, the implementation is incomplete. Also, you would still need to generate the Bliss XML yourself in some way (we used our own internal scripts to prepare that based on the official LDC data).
So, unfortunately, there is no simple way yet. Actually, I think the easiest way would be to come up with yet another custom format, which might be similar to the LibriSpeechDataset implementation, or maybe just the same, and then you could just reuse LibriSpeechDataset, or at least parts of it. That dataset implementation takes the data in a zip format which contains the transcripts in txt files and the audio in ogg or wav files. It uses librosa to do MFCC feature extraction (or other feature types as well). I planned to implement that for Switchboard and then reproduce the results, but I have not had time yet and am not sure when I will get to it. But if you want to try that on your own, I will be happy to help you however I can. The starting point would be to look at LibriSpeechDataset and understand what its format looks like.
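If you go that route, the main preparation step is packing the transcripts (txt) and audio (ogg/wav) into zip files. As a rough sketch only (the directory layout and file naming below are assumptions; the exact structure must be checked against the LibriSpeechDataset implementation), such a packing script could look like this:

import os
import zipfile

# Hypothetical layout inside the zip, mirroring the general idea of the LibriSpeech packaging:
#   <subset>/<speaker>/<recording>/<segment>.wav          (audio)
#   <subset>/<speaker>/<recording>/<recording>.txt        (one "<segment> <transcription>" per line)

def pack_subset(subset_dir, out_zip):
    """Pack all wav/ogg/txt files below subset_dir into out_zip."""
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_STORED) as zf:
        for root, _, files in os.walk(subset_dir):
            for fn in files:
                if fn.endswith((".wav", ".ogg", ".txt")):
                    full = os.path.join(root, fn)
                    # store paths relative to the parent of subset_dir
                    arcname = os.path.relpath(full, os.path.dirname(subset_dir))
                    zf.write(full, arcname)

if __name__ == "__main__":
    pack_subset("train-swbd", "train-swbd.zip")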

Related

How do I reverse-engineer the "import file" feature of an abandoned Pascal application?

This is the first question I've asked, and I'm not sure how to ask it clearly, or if there will be an answer that I want to hear ;)
tl;dr: "I want to import a file into my application at work but I don't know the input format. How can I discover it?"
Forgive any pending wordiness and/or redaction.
In my work I depend on an unsupported (and proprietary) application written in Pascal. I have no experience with Pascal (yet...) and naturally have no source code access. It is an excellent (and very secret/NDA sort of deal I think) application that allows us to deal with inventory and financial issues in my employer's organization. It is quite feature-comprehensive, reasonably stable and robust, and kind of foisted on us by a higher power.
One excellent feature that it has is the ability to load up "schedules" into our corporate system. This feature should be saving us hundreds of hours in data entry.
But it isn't.
The problem is, the schedules we receive are written in a legacy format intended for human eyes. The "new" system can't interpret them.
Our current information (which I have to read and then re-enter into the database by hand) is sent in a sort of rich-text flat-file format, which would be easy to parse with the string library of probably any mainstream language.
So I want to write a converter to convert our data into a format that the new software can interpret.
By feeding certain assorted files into the system, I have learned a little bit about what kind of file it expects:
I "import" a zero-byte file. Nothing happens (same as printing a report with no data)
I "import" an XML file that I guess might look like the system expects. It responds with an exception dialog and a stacktrace. Apparently the string <?xml contains illegal characters or something
I "import" a jpeg image -- similar result to #2.
So I think that my target wants a flat-file itself. The file would need to contain a "document number" along with {entries with "incident IDs" and descriptions and numeric values}.
But I don't know this for certain.
Nobody is able to tell me exactly what these files should look like. Someone in the know said that they have seen the feature demonstrated -- somewhere out there is a utility that creates my importable schedules. But for now, the utility is lost and I am on my own.
What methods can I use to figure out the input file format? I know nothing about debugging Pascal, but I assume that that is probably my best bet. Or do I have to keep on with brute force until I can afford a million monkey-operated typewriters? Do I have to decompile the target application? I don't know if I can get away with that, let alone read the decompiled source.
My google-fu has failed me.
Has anyone done something like this before or could they point me in the right direction? Are there any guides on this subject?
Thanks in advance.
PS: I am certain that I am not breaking any laws at this point, although I will have to check to find out if decompilation would get me into trouble or not, and that might be outside of my technical competence anyway.
If you have an example file, you can take a hexdump utility and try to see if there are things you can identify. Any additional info that you have (what should be in the file) helps with that. Even better, if you know a program that can edit the file, you can use the editor to make minimal changes and then compare the file before and after.
In other words, the standard tricks of binary file format reverse engineering.
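A quick way to do that before/after comparison is a byte-level diff of the two saved files; a generic sketch (not specific to the application in question):

# Compare two binary files byte by byte and report differing offsets with some context.
# Useful after making one minimal change in the editing program and re-saving.
import sys

def diff_files(path_a, path_b, context=8):
    a = open(path_a, "rb").read()
    b = open(path_b, "rb").read()
    if len(a) != len(b):
        print("sizes differ: %d vs %d bytes" % (len(a), len(b)))
    for off in range(min(len(a), len(b))):
        if a[off] != b[off]:
            lo = max(0, off - context)
            hi = off + context
            print("offset 0x%06x: %s -> %s" % (off, a[lo:hi].hex(" "), b[lo:hi].hex(" ")))

if __name__ == "__main__":
    diff_files(sys.argv[1], sys.argv[2])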
...If you have no existing files whatsoever, then reverse engineering the binary is your only option, and that is not pretty. Decompilation of native binaries is a black art that requires considerable time and skill. Read the various decompilation FAQs on the net.
First of all, I would try to contact the authors of the program. Getting the source code is options 1, 2, and 3; you only go with other options if there is really, really, really no hope whatsoever of obtaining the source or getting normal support.

MFC CDocument: How to read contents of database files created by defunct app?

I have almost zero experience coding in Visual Studio, MFC, etc. But I've got several data files that were created in a now-defunct MFC application, which I need to migrate to another format.
Unfortunately there's really no good way, within the application itself, to extract the data (short of copy-pasting hundreds or even thousands of records individually). And viewing the files themselves, e.g. in a hex editor, has proven fruitless; even though the raw data stored by the app is text-based, the database files are encoded in some cryptic binary format.
So far I've been able to determine that the app was written using MFC and that it uses the CDocument class (or a simple derivative thereof) to store the files. I understand that CDocument-based data files have something to do with serializing the data, but I'm not sure how to make sense of the encoding.
Does anyone know enough about MFC to explain to me how CDocument actually works?
Does anyone have any ideas on how I might be able to decode these files to extract the text?
I once faced an almost identical scenario. I eventually worked out the code to deserialize the data, but it wasn't easy.
Write a small MFC application to do the work; that way you can leverage the same serialization code that the original app used. The topic of reverse engineering a data format is way too complex to answer here. It's probably not encrypted; more likely compressed.
If you're an experienced programmer you should be able to read the MFC source code, then apply that knowledge to the raw data. Not everything can be heuristically determined just by observing the raw data, but if you have an independent way of determining the actual content, it's certainly possible with sufficient work.
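If all you ultimately need is the text content, it can also be worth scanning the raw files for readable runs before tackling the full CArchive layout. A simple strings-style sketch (generic, not MFC-specific; it will not recover record structure, only the stored text):

# Extract runs of printable ASCII from a binary file, similar to the Unix `strings` tool.
import re
import sys

def extract_strings(path, min_len=4):
    data = open(path, "rb").read()
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.group().decode("ascii") for m in re.finditer(pattern, data)]

if __name__ == "__main__":
    for s in extract_strings(sys.argv[1]):
        print(s)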

Extracting the serialized data from unknown files

My dearest stackoverflowers,
I want to access the serialized data contained in files with extensions that are strange to me. The bulk of the data seems to be in a .st and an .idt file.
The program is meant to be run on Windows, and the unix file command gives me only false positives. Any ideas on either what these extensions mean or on how to investigate and extract their contents?
Below I provide the entirety of the extensions in a long list, in the hope that somebody recognizes them. Googling also gives me false positives. For example: .st is commonly used for ATARI emulation files.
Thanks in advance!
.cix
.cmp
.cnt
.dam
.das
.drf
.idt
.irc
.lxp
.mp
.mbr
.str
.vlf
.rpf
.st
Some general advice on how to approach this:
One way to approach this is to use a site like http://filext.com/ to try to figure out where the files came from. This can be tough, because it's not like there's a file extension standard anywhere - anyone can use any extension, so you're going to have a lot of conflicts/disambiguation issues to solve.
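Besides the extension, the first few bytes of each file (the "magic number") are often enough to identify the producing program or at least the container format. A small sketch that dumps them for inspection (filenames passed on the command line; nothing here is specific to the extensions listed above):

# Print the first 16 bytes of each file, in hex and ASCII, to look for magic numbers
# (e.g. "PK" for zip containers, "%PDF" for PDF, "SQLite format 3" for SQLite).
import sys

def dump_header(path, n=16):
    head = open(path, "rb").read(n)
    ascii_view = "".join(chr(b) if 0x20 <= b < 0x7f else "." for b in head)
    print("%-20s %s  %s" % (path, head.hex(" "), ascii_view))

if __name__ == "__main__":
    for p in sys.argv[1:]:
        dump_header(p)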
Sometimes you can get lucky, and if you open up the files in a plain text editor you can occasionally see plain string data that is readable, which can help identify the general sort of data contained in a file, and therefore help cut down on the possible number of sources for a file. For example, I have often helped people who received a file as an email attachment with no extension, figure out what file type it was using this technique, adding the file extension, and then opening it in the appropriate program.
There are also sites like http://www.oldversion.com/ that keep old versions of programs that you (typically) can download for free. This is especially helpful if the data you're working with was created 5+ years ago, in a proprietary program, and that program is no longer available/purchasable from the vendor who created it.
Once you have a good idea of what files belong to what programs, then you're probably going to spend a lot of time trying to find online resources describing the structure of the files. If that isn't available, and you can get a copy of the original program but either the program won't open the files you're interested in or you still want raw access to the data, then try generating some sample output files with data that you input, and go Rosetta Stone on it, comparing your known file to the original file.
From there, the additional knowledge you'll probably want, is to try to find out what language/compiler the software was written in, which can give you a lead on what code libraries were used to serialize the data in the first place. Once you know all that, then it's matter of reading through any available documentation on the serialization process, and then writing a deserializer.
The one thing this technique won't solve: if you're dealing with corrupt/truncated data files, it may be very difficult to tell whether the data is bad or whether you simply have the file structure wrong. The "Rosetta Stone" technique might be helpful in that case.
Depending on how many different pieces of source software you're talking about, this sounds like a pretty big project. Good luck!

Is avoiding the T in ETL possible?

ETL is pretty commonplace. Data is out there somewhere, so you go get it. After you get it, it's probably in a weird format, so you transform it into something and then load it somewhere. The only problem I see with this method is that you have to write the transform rules. Of course, I can't think of anything better. I suppose you could load whatever you get into a blob (SQL) or into an object/document (non-SQL), but then I think you're just delaying the parsing. Eventually you'll have to parse it into something structured (assuming you want to). So is there anything better? Does it have a name? Does this problem have a name?
Example
Ok, let me give you an example. I've got a printer, an ATM and a voicemail system. They're all network enabled or I can give you connectivity. How would you collect the state from all these devices? For example, the printer dumps a text file when you type status over port 9000:
> status
===============
has_paper:true
jobs:0
ink:low
The ATM has a CLI after you connect on port whatever and you can type individual commands to get different values:
maint-mode> GET BILLS_1
[$1 bills]: 7
maint-mode> GET BILLS_5
[$5 bills]: 2
etc ...
The voicemail system requires certain key sequences to get any kind of information over a network port:
telnet> 7,9*
0 new messages
telnet> 7,0*
2 total messages
My thoughts
Printer - So this is pretty straight-forward. You can just capture everything after sending "status", split on lines and then split on colons or something. Pretty easy. It's almost like getting a crap-formatted result from a web service or something. I could avoid parsing and just dump the whole conversation from port 9000. But eventually I'll want to get rid of that equal signs line. It doesn't really mean anything.
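For the printer case, the splitting described above really is just a few lines; a sketch (the separator line and field names are taken from the example output earlier in the question):

# Parse the printer's "status" dump into a dict, ignoring the "====" separator line.
def parse_printer_status(raw):
    status = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or set(line) == {"="}:
            continue  # skip blanks and the decorative separator
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    return status

example = """===============
has_paper:true
jobs:0
ink:low"""
print(parse_printer_status(example))  # {'has_paper': 'true', 'jobs': '0', 'ink': 'low'}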
ATM - So this is a bit more of a pain because it's interactive. Now I'm approaching expect or protocol territory. It'd be better if they had a service that I could query these values from, but that's out of scope for this post. So I write a client that gets all the values. But now if I want to collect all the data, I have to define what all the questions are. For example, I know that the ATM has more bills than $1 and $5, so I'd have a complete list like "BILLS_1 BILLS_5 BILLS_10 BILLS_20". If I ask all the questions then I have an inventory of the ATM machine. Of course, I still have to parse out the results and clean up the text if I want to figure out how much money is left in the ATM machine. So I could parse the results and figure out the total at data collection time, or just store it raw and make sense of it later.
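Such an ATM client could be sketched as a small socket program that walks a fixed list of questions and parses the replies (the port number, question list, and reply format here are just assumptions based on the example transcript; a real device would need expect-style prompt and timeout handling):

# Minimal sketch: connect to the ATM maintenance CLI, ask for each bill denomination,
# and parse the "[$N bills]: count" replies into a dict.
import re
import socket

QUESTIONS = ["BILLS_1", "BILLS_5", "BILLS_10", "BILLS_20"]  # assumed complete list

def collect_atm(host, port=2323):
    inventory = {}
    with socket.create_connection((host, port), timeout=10) as sock:
        f = sock.makefile("rw", newline="\n")
        for q in QUESTIONS:
            f.write("GET %s\n" % q)
            f.flush()
            reply = f.readline()              # e.g. "[$5 bills]: 2"
            m = re.search(r"\[\$(\d+) bills\]:\s*(\d+)", reply)
            if m:
                inventory[int(m.group(1))] = int(m.group(2))
    return inventory

# Total cash left would then be: sum(denomination * count)
# total = sum(d * c for d, c in collect_atm("atm.example").items())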
Voicemail - This is similar to the ATM machine where it's interactive. It's just a bit weirder because the key sequences/commands aren't "get key". But essentially it's the same problem and solution.
Future Proof
Now what if I was going to give you an unknown device? Like a refrigerator. Or a toaster. Or anything? You'd have to write "connectors" ahead of time or write a parser afterwards against some raw field you stored earlier. Maybe in the case of these very limited examples there's no alternative. There's no way to future-proof. You just have to understand the new device and parse it at collection or parse it after the fact (your stored blob/object/document).
I was thinking that all these systems are text driven so maybe you could create a line iterator type abstraction layer that simply requires the device to split out lines. Then you could have a text processing piece that parses based on rules. For the ATM device, you'd have to write something that "speaks ATM" and turns it into lines which the iterator would then take care of. At this point, hopefully you'd be able to say "I can handle anything that has lines of text".
But then what will you call these rules for parsing the text? "Printer rules" might as well be called "printer parser" which is the same to me as "printer transform". Is there a better term for all of this?
I apologize for this question being so open ended. :)
When your sources of information are as disparate as what you illustrate, then you have no choice but to implement the Transform in order to bring the items into a common data repository. Usually your data sources won't be this extreme; the data will all be related in some way, but you may be retrieving it from different sources (some might come from a nicely structured database, some from an Excel or XML or text file, some from a web service call, etc.).
When coding up a custom ETL application, a common pattern is the Provider model. This enables you to write a whole bunch of custom providers to load/query and then transform the data. All the providers will implement a common interface with some relatively common function definitions (for example QueryData(), TransformData()), but the implementation of those methods will be wildly different depending on the data source being dealt with - the interface just gives a common way to deal with all the different providers. You can then use an XML configuration file to dictate which providers to run and any other initial settings they may require. Tools like SSIS abstract this stuff away for you by giving you a nice visual designer, but you can still get down and dirty and write your own code which it calls.
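In Python terms, that provider model might be sketched like this (the method names QueryData/TransformData come from the description above; everything else, including the printer provider and the driver, is purely illustrative):

# Sketch of the Provider pattern: a common interface, one concrete provider per data source,
# and a driver that runs whichever providers a config lists.
from abc import ABC, abstractmethod

class Provider(ABC):
    @abstractmethod
    def QueryData(self):
        """Fetch raw data from the source (device, file, web service, ...)."""

    @abstractmethod
    def TransformData(self, raw):
        """Turn the raw data into the common repository format."""

class PrinterProvider(Provider):
    def QueryData(self):
        # in reality: talk to port 9000, send "status", capture the reply
        return "has_paper:true\njobs:0\nink:low"

    def TransformData(self, raw):
        return dict(line.split(":", 1) for line in raw.splitlines() if ":" in line)

def run_etl(providers):
    for p in providers:                  # the provider list could come from a config file
        raw = p.QueryData()
        record = p.TransformData(raw)
        print(type(p).__name__, record)  # in reality: load into the staging area

run_etl([PrinterProvider()])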
Now what if I was going to give you an unknown device? Like a refrigerator. Or a toaster.
No problem, I would just write a new provider, which can sit in its very own assembly (dll), so it can be shipped (or modified, upgraded, etc.) in isolation from any other providers I already have. Or if I were using SSIS, I would write a new DTS package.
I was thinking that all these systems are text driven so maybe you could create a line iterator type abstraction layer ... Then you could have a text processing piece that parses based on rules.
Absolutely - you can have a base class containing common functionality which several different providers can implement, and each provider can use its own set of rules which could be coded into it or they can be contained in an external configuration file.
So I could parse the results and figure out the total at data collection time or just store it raw and make sense of it later.
Use whichever approach makes sense for the data you are grabbing. It is also quite common for an ETL process to dump its data into a staging area (like some staging tables in a database) while the data is all being aggregated and accumulated, and then further process it to link related data and perform calculations. In the case of your ATM it may not be necessary to calculate a cash balance at ETL time because you can easily calculate it at any time in the future.

easy, programmable data plotting

I spend most of my time plotting data, but unfortunately I haven't found a decent solution for my plotting needs. At the moment, the most powerful and pleasant library I found that performs plotting is matplotlib. The results are stunning, but I mostly spend my time fighting with the library when trying to do simple things like getting an arrow exactly the way I want. Similar programs like R and gnuplot produce visually less appealing results, and they are not GUI based.
On the other hand, programs like xmgrace (or better) allow direct manipulation of the plotted objects and direct feedback, but they fail on two important points:
if my dataset (normally stored in csv files) changes for some reason, I have to reimport it and perform the manipulations again, by hand
once I obtain a nice plot setup, the only way I have to recreate the plot is to use a graphical, interactive program. I would like to have the possibility to run a command line utility on my csv files and get the .pdf as a result, with no human intervention.
I still have to find something that gives me the best of both worlds at an affordable price. Ideally, I would need an interactive GUI program (a la Origin) to generate matplotlib-based python scripts.
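For reference, the non-interactive half of that workflow is already just a few lines of matplotlib. A minimal sketch that takes a CSV and writes a PDF with no human intervention (the column names and styling are placeholders):

#!/usr/bin/env python
# Read a CSV and write a PDF plot, suitable for running unattended whenever the data changes.
import sys
import csv
import matplotlib
matplotlib.use("Agg")          # no GUI needed
import matplotlib.pyplot as plt

def plot_csv(csv_path, pdf_path):
    xs, ys = [], []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            xs.append(float(row["x"]))   # placeholder column names
            ys.append(float(row["y"]))
    fig, ax = plt.subplots()
    ax.plot(xs, ys, marker="o")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    fig.savefig(pdf_path)

if __name__ == "__main__":
    plot_csv(sys.argv[1], sys.argv[2])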
Does anyone have any hints on software that could address my needs on OSX (preferably) or Linux ?
You may want to check out Igor Pro. It's quite old and quirky, but it provides the most advanced plotting system I've found yet on the Mac. You can modify anything graphically, at a command line, or in script files. The most powerful feature (IMO) is the ability to automatically generate a script to recreate a figure, or to use a figure to create a script that generates figures like (in style etc.) a particular figure. I use Igor for all publication figures I produce.
Data is stored in "waves" (translation: vectors) which encapsulate data and information about the delta between data points (e.g. time step). Figures reference waves as their data source. When you update a wave (e.g. by re-importing a CSV file and specifying that the data overwrite specific waves), all figures that reference that wave are automatically updated.
You can create "layouts", which are page layouts containing multiple graphs. These layouts are also automatically updated whenever any of the figures in the layout are updated (see above). You can add drawing/text/annotations to either graphs or layouts.
Be warned: Igor Pro's scripting language is something like the bastard child of VB and Matlab. It makes my eyes bleed. It makes me pray to whatever God that the pain just end. But the entire system is so powerful that it's worth it.
I have always used Matlab or R for this sort of thing. While you may not like how the generic plots look, I find that once I familiarize myself with the libraries I can make them as fancy as I want them to be.
R being free, I would try to stick it out with that. It is extremely powerful and perfectly suited to what you need (generate charts on the fly directly from data files). I bet that as you get more comfortable with it, you'll find yourself using R for a wide range of tasks beyond plotting data.
MathGL is a cross-platform GPL library which meets all your criteria. It can produce nice graphics, it can read CSV files, it has a window for displaying graphics (you don't need to know any widget libraries), and it can plot in the console (no window or X needed at all). On top of this, you can use C/C++/Fortran/Python/... for your own code, or MGL scripts for simplicity (see the UDAV front-end in the latter case).
Finally, it can produce bitmap (PNG/JPEG/GIF/...) or vector (EPS/SVG) output, which can easily be converted to PDF later. Or you can create a PDF with U3D directly; you'll need the HPDF and U3D libraries in that case.
