Extracting the serialized data from unknown files - windows

My dearest stackoverflowers,
I want to access the serialized data contained in files with strange, to me, extensions. The bulk of the data seems to be in a .st and an .idt file.
The program is meant to be run on Windows, and the unix file command gives me only false positives. Any ideas on either what these extensions mean or on how to investigate and extract their contents?
Below I provide the entirety of the extensions in a long list in hope somebody recognizes them. Googling also gives me false positives. For example: .st is commonly used for ATARI emulation files.
Thanks in advance!
.cix
.cmp
.cnt
.dam
.das
.drf
.idt
.irc
.lxp
.mp
.mbr
.str
.vlf
.rpf
.st
.st

Some general advice on how to approach this:
One way to approach this is to use a site like http://filext.com/ to try to figure out where the files came from. This can be tough, because it's not like there's a file extension standard anywhere - anyone can use any extension, so you're going to have a lot of conflicts/disambiguation issues to solve.
Sometimes you can get lucky, and if you open up the files in a plain text editor you can occasionally see plain string data that is readable, which can help identify the general sort of data contained in a file, and therefore help cut down on the possible number of sources for a file. For example, I have often helped people who received a file as an email attachment with no extension, figure out what file type it was using this technique, adding the file extension, and then opening it in the appropriate program.
There are also sites like http://www.oldversion.com/ that keep old versions of programs that you (typcially) can download for free. This is especially helpful if the data you're working with was created 5+ years ago, in a proprietary program, and that program is no longer available/purchasable from the vendor who created it.
Once you have a good idea of what files belong to what programs, then you're probably going to spend a lot of time trying to find online resources for what the structure of the files are. If that isn't available, you can get a copy of the original program, but either the program won't open the files you're interested in or you still want raw access to the data, then try generating some sample output files with data that you input, and go Rosetta Stone on it, comparing your known file to the original file.
From there, the additional knowledge you'll probably want, is to try to find out what language/compiler the software was written in, which can give you a lead on what code libraries were used to serialize the data in the first place. Once you know all that, then it's matter of reading through any available documentation on the serialization process, and then writing a deserializer.
The one thing this technique won't solve is, if you're dealing with corrupt/truncated data files, it may be very difficult to tell the difference between that and whether or not you have the file structure correct. The "Rosetta Stone" technique might be helpful in that case.
Depending on how many different pieces of source software you're talking about, sounds like a pretty big project. Good luck!

Related

Returnn Switchboard data processing

Could anybody give me pointers on how to process Switchboard dataset for training with RETURNN? I did see BlissDataset class that seems to be designed for switchboard, but it's not clear to me what I should include in the paths given in the example:
Example:
./tools/dump-dataset.py "
{'class':'BlissDataset',
'path': '/u/tuske/work/ASR/switchboard/corpus/xml/train.corpus.gz',
'bpe_file': '/u/zeyer/setups/switchboard/subwords/swb-bpe-codes',
'vocab_file': '/u/zeyer/setups/switchboard/subwords/swb-vocab'}"
The switchboard dataset has several folders with audios, i.e. swb1_d2/data/*.sph and transcripts swb1_LDC97S62/swb_ms98_transcriptions/**/*
I'm not quite sure how to proceed with this to get a dataset that can be used to train RETURNN.
At our group (RWTH Aachen University), we use the config as it was published on GitHub. As you see, this one uses ExternSprintDataset. That dataset uses
The implementation uses Sprint (publicly called RWTH ASR (RASR), see here) as an external tool (ran in a subprocess) to handle the data (feature extraction, etc). Sprint gets a Bliss XML file which describes all the segments with path to audio and audio offsets and transcriptions, and also it gets further configs for the feature extraction and maybe other things. There is an open source version of RASR which should work but it might be a bit involved to get this to work.
The BlissDataset was planned to be a simpler replacement for that. However, the implementation is incomplete. Also, you still would need to generate the Bliss XML by yourself in some way (we have used some own internal scripts to prepare that based on the official LDC data).
So, unfortunately, there is no simple way yet. Actually, I think the easiest way would be to come up with yet another custom format, which might be similar to the LibriSpeechDataset implementation, or maybe just the same, and then you could just reuse LibriSpeechDataset, or at least parts of that. That dataset implementation takes the data in some zip format which contains the transcripts in txt files and the audio in ogg or wav files. It uses librosa to do MFCC feature extraction (or also other feature types). I planned to implement that for Switchboard, and then reproduce the results, however I did not have time yet and not sure when I will get to that. But if you want to try that on your own, I will be happy to help you however I can. The starting point would be to look at LibriSpeechDataset and understand how the format of that looks like.

How do I reverse-engineer the "import file" feature of an abandoned pascal application?

first question I've asked and I'm not sure how to ask it clearly, or if there will be an answer that I want to hear ;)
tl;dr: "I want to import a file into my application at work but I don't know the input format. How can I discover it?"
Forgive any pending wordiness and/or redaction.
In my work I depend on an unsupported (and proprietary) application written in Pascal. I have no experience with pascal (yet...) and naturally have no source code access. It is an excellent (and very secret/NDA sort of deal I think) application that allows us to deal with inventory and financial issues in my employer's organization. It is quite feature-comprehensive, reasonably stable and robust, and kind of foistered (word?) on us by a higher power.
One excellent feature that it has is the ability to load up "schedules" into our corporate system. This feature should be saving us hundreds of hours in data entry.
But it isn't.
The problem is, the schedules we receive are written in a legacy format intended for human eyes. The "new" system can't interpret them.
Our current information (which I have to read and then re-enter into the database by hand) is send in a sort of rich-text flat-file format, which would be easy to parse with the string library of probably any mainstream language.
So I want to write a converter to convert our data into a format that the new software can interpret.
By feeding certain assorted files into the system, I have learned a little bit about what kind of file it expects:
I "import" a zero-byte file. Nothing happens (same as printing a report with no data)
I "import" an XML file that I guess might look like the system expects. It responds with an exception dialog and a stacktrace. Apparently the string <?xml contains illegal characters or something
I "import" a jpeg image -- similar result to #2.
So I think that my target wants a flat-file itself. The file would need to contain a "document number" along with {entries with "incident IDs" and descriptions and numeric values}.
But I don't know this for certain.
Nobody is able to tell me exactly what these files should look like. Someone in the know said that they have seen the feature demonstrated -- somewhere out there is a utility that creates my importable schedules. But for now, the utility is lost and I am on my own.
What methods can I use to figure out the input file format? I know nothing about debugging pascal, but I assume that that is probably my best bet. Or do I have to keep on with brute force until I can afford a million monkey-operated typewriters? Do I have to decompile the target application? I don't know if I can get away with that, let alone read the decompiled source.
My google-fu has failed me.
Has anyone done something like this before or could they point me in the right direction? Are there any guides on this subject?
Thanks in advance.
PS: I am certain that I am not breaking any laws at this point, although I will have to check to find out if decompilation would get me into trouble or not, and that might be outside of my technical competence anyway.
If you have an example file you can try to take a hexdump utility and try to see if there things you can identify. Any additional info that you have (what should in the file) helps with that. Better even, if you know a program that can edit the file, you can use the editor to make minimal changes and then compare the file before and after.
IOW standard tricks of binary file format reverse engineering.
...If you have no existing files whatsoever, then reverse engineering the binary is your only option, and that is not pretty. Decompilation of native binaries is a black art that requires considerable time and skill. Read the various decompilation FAQs on the net.
First and for all, I would try to contact the authors of the program. Source code are options 1,2,3 and you only go with other options if there is really, really, really no hope whatsoever of obtaining source or getting normal support.

MFC CDocument: How to read contents of database files created by defunct app?

I have almost zero experience coding in Visual Studio, MFC, etc. But I've got several data files that were created in a now-defunct MFC application, which I need to migrate to another format.
Unfortunately there's really no good way, within the application itself, to extract the data (short of copy-pasting hundreds or even thousands of records individually). And viewing the files themselves, i.e. in a Hex Editor, has proven fruitless; even though the raw data stored by the app is text-based, the database files are encoded in some cryptic binary format.
So far I've been able to determine that the app was written using MFC and that it uses the CDocument class (or a simple derivative thereof) to store the files. I understand that CDocument-based data files have something to do with serializing the data, but I'm not sure how to make sense of the encoding.
Does anyone know enough about MFC to explain to me how CDocument actually works?
Does anyone have any ideas on how I might be able to decode these files to extract the text?
I once faced an almost identical scenario. I eventually worked out the code to deserialize the data, but it wasn't easy.
Write a small MFC application to do the work, that way you can leverage the same serialization code that the original app used. The topic of reverse engineering a data format is way too complex to answer here. It's probably not encrypted; more likely compressed.
If you're an experienced programmer you should be able to read the MFC source code, then apply that knowledge to the raw data. Not everything can be heuristically determined just by observing the raw data, but if you have an independent way of determining the actual content, it's certainly possible with sufficient work.

What's the most effective way to insert content into binary file, or join two files

We want insert something into binary file, but want do I/O as less as possible. We don't want read the second part and write it back. Is there any way to do it. We know in theory, file are saved as blocks in file system. Could we just break the chain of blocks and insert a new one between it?
The same idea could also be used in join two files. Attach one after another. Is there any fast way to do this?
The problem comes from write a large file. We want write it to many small files and join them together.
I'd say don't go there...
It would involve low-level, file-system specific coding that could possibly mess up the file system.
If this is really, really important for you(r) business, perhaps you could 'emulate' blocks inside a file. This would however result in files of which the intended structure can not be resolved by other 'parties' (other software using the file).
You most certainly don't want to go messing with the file system directly. Let the OS worry about that. If you have to ask this question, you probably don't have the requisite knowledge to manipulate things at that level with greater efficiency that the OS does.
Rather than adding data in the middle of a huge file, consider a different solution that allows you to append data to the end of the file. (Or is some more efficient way). Also, if you're dealing with a huge amount of data, maybe a database would be more efficient than a gargantuan file.
I would be interested in hearing more about what you're trying to do. Perhaps there's an easier way.

Javascript source code analysis ( specifically duplication checking )

Partial duplicate of this
Notes:
I already use JSLint extensively via a tool I wrote that scans in intervals my current project directory for recently updated/created .js files. It's drastically improved productivity for me and I doubt there is anything as good as JSLint for the price (it's free).
That said, is there any analysis tool out there that can find repetitive or near-duplicate code blocks, the goal being to make it easier to find opportunities to consolidate large files or small/medium sized projects?
May not be exactly what your after, but Google's Javascript optimizer is worth a look.
Our CloneDR is a tool for finding exact and near-miss cloned code blocks for a variety of languages. It will find duplicates in the same file or across literally thousands of files if you have them. You don't have to provide it with any guidance; it can find the cloned code by itself. And it won't be fooled by different indentation or line breaks, or even consistent renaming of identifiers
It does support JavaScript, even if it isn't clear from the website.
You can see sample clone reports for a variety of langauges at the website.
You may want to have a look at jsinspect.
jsinspect ./src
It will print a list of code blocks that are either identical or structurally very similar.
And there's also jscpd.

Resources