Decoding a file compressed with an obsolete language - algorithm

I'm trying to decompress a data file that was originally compressed with an extension for AMOS Pro, the old Amiga BASIC language, that shipped with the AMOS Pro compiler. I've still got the programming language and have access to the compressor and decompressor, but I'm trying to decompress the files using C. I ultimately want to be able to view these files on modern hardware without having to resort to using an Amiga emulator first.
However, there's no documentation as to how the compressor worked, so I'm trying to reverse-engineer it solely from watching its behaviour. Here's what I've got so far.
This is a raw file (ASCII):
AABCDEFGHIJKLMNOPQRSTUVWXYZAABCDEFGHIJKLMNOPQRSTUVWXYZAABCDEFGHIJKLMNOPQRSTUVWXYZ
Here's the compressed version (hex):
D802C6B5
05048584
4544C5C4
2524A5A4
6564E5E4
15149594
5554D5D4
3534B591
00000007
AD763363
00000051
Testing with various files has given me a few insights:
The last 4 bytes are the size of the original file.
The file seems to function as a bit stream, so byte boundaries aren't important (I say this because I've seen ASCII codes appear in a few files and they aren't aligned to byte boundaries).
All of the bits in the file are stored in reverse.
The first byte seems to represent a sequence length. In the above example, the value 0xD8 is 11011000 in binary; mirror it (bits are in reverse) and you'll get 00011011, which is 0x1B in hex or 27 in decimal. That matches the sequence length.
However, I'm not making any more progress. Does this look like a standard compression algorithm? What do I try next?
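In case it helps, here is a minimal C sketch of how I'm currently reading the bit stream under those assumptions (the helper names are my own, and the mirrored-bit interpretation is only a guess based on the observations above); it reproduces the 0xD8 → 27 example:
#include <stdio.h>
#include <stdint.h>

/* Minimal bit reader for the "mirrored bits" interpretation: bits are
 * consumed LSB-first from each byte and shifted into the result
 * MSB-first, which reproduces the 0xD8 -> 27 example above. */
typedef struct {
    const uint8_t *data;
    size_t         size;
    size_t         byte_pos;
    int            bit_pos;   /* 0..7, counted from the LSB */
} bitreader_t;

static int read_bit(bitreader_t *br)
{
    if (br->byte_pos >= br->size)
        return -1;                      /* out of data */
    int bit = (br->data[br->byte_pos] >> br->bit_pos) & 1;
    if (++br->bit_pos == 8) {
        br->bit_pos = 0;
        br->byte_pos++;
    }
    return bit;
}

static uint32_t read_bits(bitreader_t *br, int count)
{
    uint32_t value = 0;
    for (int i = 0; i < count; i++)
        value = (value << 1) | (uint32_t)read_bit(br);
    return value;
}

int main(void)
{
    /* First bytes of the compressed example above. */
    const uint8_t sample[] = { 0xD8, 0x02, 0xC6, 0xB5 };
    bitreader_t br = { sample, sizeof sample, 0, 0 };

    printf("first 8 bits = %u\n", (unsigned)read_bits(&br, 8));   /* prints 27 */
    return 0;
}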

As you've posted, the compression function is called "squash" and is part of AMOS Pro.
As such, my advice would be to try one of the following lines of attack:
Reverse engineer the algorithm by analyzing its output: This is definitely not a viable option. You will only waste time.
Read, annotate, understand the source code of the unsquash function in AMOS Pro
Contact the author of AMOS Pro
Read the source code
The source code for AMOS Pro is apparently in the public domain now and can be found here:
http://www.pianetaamiga.it/downloads/AMOSPro_Sources.zip
It consists of 68000 assembly code and quite a few compiled object files.
The unsquash function can be found in the file +header.s on line 1061 and onwards. It is not documented, except for its entry register values, which is good at least. It doesn't appear to be a very large function so this might be worth a shot.
You will need to have, or obtain, rudimentary knowledge of 68000 assembly. The routine does not appear to call out to system libraries or anything and only seems to operate directly on memory, which would suggest that understanding the code is actually doable. Still, I've never written or read 68000 code in my life, so what do I know.
Contact the author of AMOS Pro
The author of AMOS Pro is François Lionet, as is evident from the User Guide. He founded Clickteam in the mid-90s to make game- and multimedia-making software. He still seems to be at that company, and according to forum posts from others looking into AMOS Pro, he seems willing to answer email. Sadly, I don't know his email address, but the Clickteam website should give you a starting point.

Related

Is there a publicly available list of known good GS1-128 codes for testing?

I just finished writing a function to interpret GS1-128 codes (the data, not the barcode/datamatrix image) in (hopefully) any combination they can come in.
I am now trying to thoroughly test the function.
All manually generated codes I have tried are working fine (first try, obviously), but generating error-free GS1-128 codes by hand is a rather slow process, and it's also flawed methodology, since my understanding of how to create a conformant code and my function's logic are obviously the same, but not necessarily correct.
I have already inquired at the local GS1 organization whether they have a list of known-good codes for testing they are allowed to hand out. The answer was no.
I have also searched the internet for either a list or an automated means of generating codes in bulk (preferably with varied contents and orderings of content), but neither yielded a helpful result.
I don't really know of any better place to ask this question, so I'm asking this community in the hope that someone might have such a list lying around, or has a means of generating a reasonable (or unreasonable) number of codes (the data, not the barcode/datamatrix image, although that would work just as well) without too much effort.
If it's available I'd gladly take a list with decoded contents per code to further automate and scale up the testing, but I'm really not going to turn anything down :P
Hope whoever read this far is having a good day.
I would suggest that you take a look at the unit tests in the GS1 Barcode Syntax Engine, which comprehensively processes GS1 AI syntax messages:
https://github.com/gs1/gs1-syntax-engine/blob/main/src/c-lib/ai.c#L795
It's a shame that your GS1 Member Organisation were unable to refer you to their own tool.
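If you also want to bulk-generate your own syntactically valid test data, the check digit is the only computed part of a GTIN, so something like the following C sketch will churn out as many AI (01)/(10) element strings as you like (the company prefix and the helper name gs1_check_digit are just placeholders I made up):
#include <stdio.h>
#include <string.h>

/* Standard GS1 mod-10 check digit: working right-to-left over the
 * digits without the check digit, weights alternate 3,1,3,1,... */
static int gs1_check_digit(const char *digits)
{
    int sum = 0, weight = 3;
    for (int i = (int)strlen(digits) - 1; i >= 0; i--) {
        sum += (digits[i] - '0') * weight;
        weight = (weight == 3) ? 1 : 3;
    }
    return (10 - sum % 10) % 10;
}

int main(void)
{
    /* Hypothetical company prefix plus generated item references. */
    for (int item = 0; item < 5; item++) {
        char body[14];
        snprintf(body, sizeof body, "0950600012%03d", item);  /* 13 digits */
        /* Element string for AI (01) GTIN-14 plus a batch number (10). */
        printf("(01)%s%d(10)BATCH%03d\n", body, gs1_check_digit(body), item);
    }
    return 0;
}
This only guarantees the numeric parts are well-formed; for exotic AI combinations the Syntax Engine's test data linked above is still the better reference.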

How do I reverse-engineer the "import file" feature of an abandoned pascal application?

This is the first question I've asked, and I'm not sure how to ask it clearly, or if there will be an answer that I want to hear ;)
tl;dr: "I want to import a file into my application at work but I don't know the input format. How can I discover it?"
Forgive any pending wordiness and/or redaction.
In my work I depend on an unsupported (and proprietary) application written in Pascal. I have no experience with Pascal (yet...) and naturally have no source code access. It is an excellent (and very secret/NDA sort of deal, I think) application that allows us to deal with inventory and financial issues in my employer's organization. It is quite feature-comprehensive, reasonably stable and robust, and was kind of foisted on us by a higher power.
One excellent feature that it has is the ability to load up "schedules" into our corporate system. This feature should be saving us hundreds of hours in data entry.
But it isn't.
The problem is, the schedules we receive are written in a legacy format intended for human eyes. The "new" system can't interpret them.
Our current information (which I have to read and then re-enter into the database by hand) is sent in a sort of rich-text flat-file format, which would be easy to parse with the string library of probably any mainstream language.
So I want to write a converter to convert our data into a format that the new software can interpret.
By feeding certain assorted files into the system, I have learned a little bit about what kind of file it expects:
I "import" a zero-byte file. Nothing happens (same as printing a report with no data)
I "import" an XML file that I guess might look like the system expects. It responds with an exception dialog and a stacktrace. Apparently the string <?xml contains illegal characters or something
I "import" a jpeg image -- similar result to #2.
So I think that my target wants a flat-file itself. The file would need to contain a "document number" along with {entries with "incident IDs" and descriptions and numeric values}.
But I don't know this for certain.
Nobody is able to tell me exactly what these files should look like. Someone in the know said that they have seen the feature demonstrated -- somewhere out there is a utility that creates my importable schedules. But for now, the utility is lost and I am on my own.
What methods can I use to figure out the input file format? I know nothing about debugging Pascal, but I assume that that is probably my best bet. Or do I have to keep on with brute force until I can afford a million monkey-operated typewriters? Do I have to decompile the target application? I don't know if I can get away with that, let alone read the decompiled source.
My google-fu has failed me.
Has anyone done something like this before or could they point me in the right direction? Are there any guides on this subject?
Thanks in advance.
PS: I am certain that I am not breaking any laws at this point, although I will have to check to find out if decompilation would get me into trouble or not, and that might be outside of my technical competence anyway.
If you have an example file, you can take a hexdump utility and see if there are things you can identify. Any additional info that you have (what should be in the file) helps with that. Better yet, if you know a program that can edit the file, you can use that editor to make minimal changes and then compare the file before and after.
IOW, the standard tricks of binary file format reverse engineering (a minimal before/after comparison sketch follows below).
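For the before/after comparison, even something as small as this C sketch (just a byte-by-byte diff, nothing specific to your file format) is enough to spot which offsets a minimal edit touches:
#include <stdio.h>

/* Minimal before/after file comparison: print every offset where the
 * two files differ, plus the differing byte values. */
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s before.bin after.bin\n", argv[0]);
        return 1;
    }
    FILE *a = fopen(argv[1], "rb");
    FILE *b = fopen(argv[2], "rb");
    if (!a || !b) {
        perror("fopen");
        return 1;
    }

    long offset = 0;
    for (;;) {
        int ca = fgetc(a);
        int cb = fgetc(b);
        if (ca == EOF && cb == EOF)
            break;
        if (ca == EOF || cb == EOF) {
            printf("files differ in length from offset %08lx on\n", offset);
            break;
        }
        if (ca != cb)
            printf("%08lx: %02x -> %02x\n", offset, ca, cb);
        offset++;
    }
    fclose(a);
    fclose(b);
    return 0;
}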
...If you have no existing files whatsoever, then reverse engineering the binary is your only option, and that is not pretty. Decompilation of native binaries is a black art that requires considerable time and skill. Read the various decompilation FAQs on the net.
First of all, I would try to contact the authors of the program. Getting the source code is options 1, 2, and 3; you only go with other options if there is really, really, really no hope whatsoever of obtaining the source or getting normal support.

Exact Code segment size for a windows process

The Linux file /proc/{pid}/status, as we know, gives some fine-grained memory footprint information for a particular process. One of the fields it reports is 'VmExe', the size of the text segment of the process. I'm particularly interested in this field, but I'm stuck in a Windows environment with no proc file system to help me. Cygwin mimics most of procfs, but the {pid}/* files seem to be one of the parts that Cygwin ignores. I tried playing around with the VMMap tool from Windows Sysinternals, but the closest field I found was 'Private Data Size' on a private working set, and I'm not really sure if this is what I'm looking for.
I would take a look at vmmap.exe from Sysinternals and see if it displays the information you are looking for, for a given process. If the information you are seeking is displayed there, you could take a look at the API calls the application uses, or ask on the Sysinternals forums on MSDN. I know this isn't exactly what you were looking for in an answer, but it hopefully points you in the right direction.
If you are talking about the .text segment in the PE itself, you can get that information from the dbghelp library, and in a number of other ways (there are a few libraries floating around for binary analysis).
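If it is indeed the .text section size you want, here is a rough sketch that parses the PE headers directly with nothing but windows.h (so it avoids depending on dbghelp); treat it as an illustration rather than production code:
#include <windows.h>
#include <stdio.h>
#include <string.h>

/* Minimal sketch: map a PE file and report the size of its .text
 * section, which is roughly what VmExe reports on Linux.
 * Error handling is reduced to the bare minimum. */
int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s some.exe\n", argv[0]);
        return 1;
    }

    HANDLE file = CreateFileA(argv[1], GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, 0, NULL);
    if (file == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "could not open %s\n", argv[1]);
        return 1;
    }
    HANDLE map  = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    BYTE  *base = MapViewOfFile(map, FILE_MAP_READ, 0, 0, 0);
    if (!base) {
        fprintf(stderr, "could not map %s\n", argv[1]);
        return 1;
    }

    IMAGE_DOS_HEADER *dos = (IMAGE_DOS_HEADER *)base;
    IMAGE_NT_HEADERS *nt  = (IMAGE_NT_HEADERS *)(base + dos->e_lfanew);
    IMAGE_SECTION_HEADER *sec = IMAGE_FIRST_SECTION(nt);

    for (WORD i = 0; i < nt->FileHeader.NumberOfSections; i++, sec++) {
        if (strncmp((const char *)sec->Name, ".text", 8) == 0)
            printf(".text: virtual size %lu bytes, raw size %lu bytes\n",
                   (unsigned long)sec->Misc.VirtualSize,
                   (unsigned long)sec->SizeOfRawData);
    }

    UnmapViewOfFile(base);
    CloseHandle(map);
    CloseHandle(file);
    return 0;
}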

HTML entities in file names: Possible pitfalls?

When I thought about resizing images and saving the new sizes alongside the originals on the server, I came to the following question:
// Original size
DSC_18342.jpg
// New size: Use an "x" for "times"
DSC_18342_640x480px.jpg
// New size: Use the real "×" for "times"
DSC_18342_640×480px.jpg
The point is that the name is slightly easier to read with a real × instead of an x, since the unit px already contains an x.
Question: What problems could I run into when using the HTML entity in the filename?
Sidenote: I'm writing an open source, publicly available script, so the target server can be anything - therefore I'm also interested in (and will vote up) edge cases that I'm not aware of.
Thank you all!
You may have noticed that I'm aware I could simply avoid it (which I'll do anyway), but I'm interested in this issue and in learning about it, so please just take the above example as a possible case.
There are file systems that simply don't support Unicode. This may be less of a problem if you make Unicode support a requirement of your application.
Some considerations about different Unicode file systems are given in File Systems, Unicode, and Normalization.
A concluding remark (from the viewpoint of Solaris file systems) is:
Complete compatibility and seamless interoperability with all other existing Unicode file systems appears not 100% possible due to inherent differences.
I can imagine that there will be problems especially when migrating the application. Just storing files is probably no problem but if their names are stored in a database there might be a mismatch after migration.
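To make the mismatch concrete, here is a tiny C illustration (the Latin-1 variant is hypothetical, just to show a second encoding) of how the same visible name ends up as different byte sequences:
#include <stdio.h>
#include <string.h>

/* Why the multiplication sign is riskier than a plain 'x': the same
 * visible file name has different byte representations depending on
 * the encoding the file system or database happens to use. */
int main(void)
{
    const char *ascii  = "DSC_18342_640x480px.jpg";            /* plain 'x' */
    const char *utf8   = "DSC_18342_640\xC3\x97" "480px.jpg";  /* U+00D7 in UTF-8 */
    const char *latin1 = "DSC_18342_640\xD7" "480px.jpg";      /* U+00D7 in Latin-1 */

    printf("ASCII   name: %zu bytes\n", strlen(ascii));
    printf("UTF-8   name: %zu bytes\n", strlen(utf8));
    printf("Latin-1 name: %zu bytes\n", strlen(latin1));

    /* Byte-for-byte these are different names, so a database that stores
     * one encoding and a file system that stores another will no longer
     * agree after a migration. */
    printf("UTF-8 == Latin-1? %s\n",
           strcmp(utf8, latin1) == 0 ? "yes" : "no");
    return 0;
}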

How did Turbo Pascal overlays work?

I'm implementing an assemblinker for the 16-bit DCPU from the game 0x10c.
One technique that somebody suggested to me was using "overlays, like in Turbo Pascal back in the day" in order to swap code around at run time.
I get the basic idea (link overlaid symbols to the same memory, swap before referencing), but what was their implementation?
Was it a function that the compiler inserted before references? Was it a trap? Was the data for the overlay stored at the location of the overlay, or in a big table somewhere? Did it work well, or did it break often? Was there an interface for assembly to link with overlayed Pascal (and vice versa), or was it incompatible?
Google is giving me basically no information (other than it being a no-go on modern Pascal compilers). And I'm just, like, five years too young to have ever needed them when they were current.
A jump table per unit whose elements point to a trap (int 3F) when the overlay is not loaded. But that is for older Turbo Pascal/Borland Pascal versions (5/6); newer ones also support (286) protected mode, and they might employ yet another scheme.
This scheme means that when an overlay is loaded, no trap overhead happens anymore.
I found this link in my references: The Slithy Tove. There are other nice details there, like how call chains are handled that span multiple overlays.
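The actual Turbo Pascal implementation is x86 machine code, but the jump-table idea translates directly; here is a rough C analogue (all names are mine, not Borland's) of the scheme for your DCPU runtime: slots for unloaded overlays point at a stub that loads the overlay, patches the table, and re-dispatches.
#include <stdio.h>

/* C analogue of the jump-table scheme described above (not the actual
 * Turbo Pascal implementation): every overlaid routine is called
 * through a table slot. While the overlay is on disk the slot points
 * at a stub that loads it and patches the table, so subsequent calls
 * go straight to the real code with no trap overhead. */

static void real_report(void);
static void report_stub(void);

/* One slot per overlaid routine; initially all slots point at stubs. */
static void (*overlay_table[1])(void) = { report_stub };

static void load_overlay(int unit)
{
    /* Stand-in for pulling the overlay's code from the .OVR file. */
    printf("loading overlay for unit %d...\n", unit);
    overlay_table[0] = real_report;   /* patch the jump table */
}

static void report_stub(void)
{
    load_overlay(0);       /* "trap": overlay not resident yet */
    overlay_table[0]();    /* re-dispatch to the now-loaded routine */
}

static void real_report(void)
{
    printf("running code that lives in the overlay\n");
}

int main(void)
{
    overlay_table[0]();    /* first call: triggers the load */
    overlay_table[0]();    /* later calls: direct, no overhead */
    return 0;
}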
