I have an image that was given to me in ASCII character format (from an XML document, i.e. sdlf9sdf09e93jr), and I was told that I should be able to convert that into binary and use that data to view it as a .gif image. I've converted all the ASCII characters into binary, but now I'm stuck and not sure what to do from there.
EDIT: Discovered the real problem I'm having: even though I've translated the ASCII to binary, I've only written it to a file as what the binary should be for those ASCII characters. The "binary" characters are just a bunch of 0's and 1's in ASCII format. I'm writing this program in C, and it is supposed to create the GIF, but I don't know how to create a .gif from my original XML parse.
First, you need to know the encoding scheme used to encode the image as a sequence of characters in the XML document.
Are you sure you have ASCII data? It may be, if the image is base-64 encoded. If it is not base-64 encoded, perhaps you are seeing the Latin-1 characters corresponding to each byte.
But you say you have the byte sequence, the "binary" as you call it. If so, then do as MRAB says and save the bytes to a file, slap a .gif extension on that and open it with any viewer.
But again, make sure you know the encoding. Base-64 is common in XML.
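If the text does turn out to be base-64, the decoding step could look roughly like the sketch below. It is in Python rather than C just to keep it short. The function names and the sample string are made up for illustration, and it assumes the XML text node contains nothing but base-64 characters and whitespace.

```python
import base64


def decode_base64_image(text: str) -> bytes:
    """Decode base-64 text (e.g. taken from an XML element) into raw bytes."""
    # XML text nodes often contain line breaks and indentation; drop them.
    compact = "".join(text.split())
    # validate=True rejects characters outside the base-64 alphabet
    # instead of silently skipping them.
    return base64.b64decode(compact, validate=True)


def save_image(path: str, data: bytes) -> None:
    # Binary mode ("wb") writes the bytes unchanged.
    with open(path, "wb") as out:
        out.write(data)


# Hypothetical sample: the base-64 form of the 6-byte GIF signature.
sample = "R0lGODlh"
raw = decode_base64_image(sample)
print(raw)  # b'GIF89a'
```

Calling save_image("picture.gif", raw) on the full decoded data would then produce a file a viewer can open, provided the data really is a GIF.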
If the binary is a GIF image, then just write it to a file with the extension ".gif".
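For the situation described in the question's edit, where the file holds the characters '0' and '1' rather than real bytes, the missing step is to pack every group of eight digits into one byte. A minimal sketch in Python (the same idea carries over to C); bits_to_bytes is a hypothetical helper name, and it assumes the digits are in most-significant-bit-first order:

```python
def bits_to_bytes(bit_text: str) -> bytes:
    """Pack a string of '0'/'1' characters into real bytes, 8 digits per byte."""
    bits = "".join(bit_text.split())  # ignore spaces and newlines
    if set(bits) - {"0", "1"}:
        raise ValueError("input may only contain the characters 0 and 1")
    if len(bits) % 8 != 0:
        raise ValueError("number of bits must be a multiple of 8")
    # int(..., 2) parses eight digits as a base-2 number, MSB first.
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


# The three letters 'G', 'I', 'F' written out as ASCII digits.
bit_text = "01000111 01001001 01000110"
packed = bits_to_bytes(bit_text)
print(packed)  # b'GIF'
```

The packed bytes are what should be written to the .gif file, opened in binary mode.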
Related
I have been trying to examine a JPEG file, known to contain IPTC data, but could notice no strings whatsoever. I tried the well known UNIX strings command, ASCII, 8-bit, 16-bit Unicode --- to no avail: I could not see any strings that I expect to find in IPTC fields.
My question is: How is IPTC data encoded? Is it encrypted? Compressed? Other? Why can't it be viewed using the strings command?
The most probable reason why you cannot view IPTC data using a hex viewer is that the file has no IPTC data.
An image that contains IPTC data like this one:
http://regex.info/exif.cgi?dummy=on&imgurl=https%3A%2F%2Fwww.iptc.org%2Fstd%2Fphotometadata%2Fexamples%2FIPTC-PhotometadataRef-Std2014_large.jpg
has an XML structure and text fields that are viewable in a text editor like Emacs (8-bit, not even Unicode).
For instance, we rely on a third-party lib to parse a file and extract its metadata. The lib always decodes the metadata as UTF-8, even when the metadata was written in another encoding, and returns the resulting string. It offers no way to get the raw bytes so that we could decode them correctly ourselves. Suppose we know that the metadata was originally encoded in, for example, GBK. Is there a way to turn the string that was wrongly decoded as UTF-8 back into the correct GBK text?
No, there isn't. Decoding something as UTF-8 that isn't in UTF-8 is lossy: by the time you get the string from the lib, information has been lost and the original data can no longer be represented as GBK. Change how the lib works, or change the file metadata to UTF-8.
Yes. You should learn about Ruby 1.9's force_encoding and encode methods on the String class. I recommend converting everything to actual UTF-8 as soon as possible, before manipulating it in Ruby.
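To see why the information is gone, here is a small Python sketch. It assumes the lib substitutes U+FFFD for undecodable bytes, which is a common behaviour but only a guess about this particular lib; the sample text is made up.

```python
# Two Chinese characters, encoded the way the file's metadata would be.
original = "中文".encode("gbk")

# What a lenient UTF-8 decoder hands back: every byte that is not valid
# UTF-8 becomes the replacement character U+FFFD.
garbled = original.decode("utf-8", errors="replace")

# Re-encoding the garbled string gives the bytes of U+FFFD,
# not the GBK bytes we started from.
round_trip = garbled.encode("utf-8")

print(original.hex())
print(round_trip.hex())
print(round_trip == original)  # False
```

Once several different input bytes have all been mapped to the same replacement character, nothing in the string says which byte each one used to be.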
I was wondering how Windows interprets characters.
I made a file with a hex editor with the 3 bytes E3 81 81.
Those bytes are the ぁ character in UTF-8.
I opened the file in Notepad and it displayed ぁ. I didn't specify the encoding of the file; I just created the bytes, and Notepad interpreted them correctly.
Is Notepad somehow guessing the encoding?
Or is the hex editor saving those bytes with a specific encoding?
If the file only contains these three bytes, then there is no information at all about which encoding to use.
A byte is just a byte, and there is no way to include any encoding information in it. Besides, the hex editor doesn't even know that you intended to decode the data as text.
Notepad normally uses ANSI encoding, so if it reads the file as UTF-8 then it has to guess the encoding based on the data in the file.
If you save a file as UTF-8, Notepad will put the BOM (byte order mark) EF BB BF at the beginning of the file.
Notepad makes an educated guess. I don't know the details, but loading the first few kilobytes and trying to convert them from UTF-8 is very simple, so it probably does something similar to that.
...and sometimes it gets it wrong...
https://ychittaranjan.wordpress.com/2006/06/20/buggy-notepad/
There is an easy and efficient way to check whether a file is in UTF-8. See Wikipedia: http://en.wikipedia.org/w/index.php?title=UTF-8&oldid=581360767#Advantages, fourth bullet point. Notepad probably uses this.
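The kind of check these answers describe can be sketched in a few lines of Python. This is not Notepad's actual algorithm (nobody here knows it), just an illustration of "look for a BOM, otherwise try to decode as UTF-8"; the function name and the cp1252 fallback are assumptions.

```python
import codecs


def guess_text_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess an encoding from the bytes alone: BOM first, then a UTF-8 trial."""
    if data.startswith(codecs.BOM_UTF8):  # EF BB BF
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        # Random non-UTF-8 data almost never passes a strict UTF-8 decode.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return fallback  # assumed stand-in for the "ANSI" code page
    return "utf-8"


three_bytes = bytes([0xE3, 0x81, 0x81])
print(guess_text_encoding(three_bytes))  # utf-8
print(three_bytes.decode("utf-8"))       # ぁ
```

Note that a pure-ASCII file also passes the UTF-8 test, and a short file in another encoding can pass it by accident, which is how a guess like this sometimes goes wrong.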
Wikipedia claims that Notepad used the IsTextUnicode function, which checks whether a particular text is written in UTF-16 (it may have stopped using it in Windows Vista, which fixed the "Bush hid the facts" bug): http://en.wikipedia.org/wiki/Bush_hid_the_facts.
How can I identify which encoding a file is in?
Go to the file and try to Save As... and you can see the default (current) encoding of the file (in which encoding it is saved).
I'm downloading some files with Chinese names (BIG5 encoded) via FTP, and FileZilla displays those filenames as garbage (since FTP cannot handle any encoding other than ASCII and UTF-8, at least in the standards-compliant implementations).
Given a filename with garbled characters, is it possible for me to repair the encoding and get a proper filename String given that I already know the source encoding? Will the FTP client misinterpreting BIG5 as UTF-8 insert bytes that make conversion back to BIG5 difficult?
My proposed steps (in Java):
1. Get the garbled filename using a File object.
2. Get its bytes using UTF-8.
3. Create a new string from those bytes using BIG5.
4. Write the decoded filename back to the file.
Will the above method work?
Not every sequence of bytes is a valid ASCII or UTF-8 string so it's quite likely that some of the bytes will have been discarded, converted to the replacement character, or otherwise irreversibly mangled. So it looks like you won't be able to retrieve the original filenames if they have been modified by FileZilla to become correctly formed UTF-8 or ASCII.
You might be lucky to be able to get a certain percentage of the original characters back, where they just happened to be both valid BIG5 and valid UTF-8, but I doubt you will be able to recover the entire filename.
You could post a few examples of your garbled filenames (as raw bytes encoded in hex) to get a more definite answer. That way we can see exactly what the damage is.
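Here is a rough Python simulation of the proposed steps, to show where they break. The filename is made up, and how the client actually mangled the bytes is an assumption: the first case assumes invalid bytes were replaced with U+FFFD, the second assumes every byte was kept and merely shown as Latin-1.

```python
name = "中文.txt"                   # hypothetical original filename
wire_bytes = name.encode("big5")   # what the server sent

# Raw bytes in hex, the form suggested above for posting examples.
print(wire_bytes.hex())

# Case 1 (assumed): the client decoded as UTF-8 and replaced bad bytes.
garbled = wire_bytes.decode("utf-8", errors="replace")
# Proposed steps 2 and 3: get the bytes as UTF-8, re-read them as BIG5.
recovered = garbled.encode("utf-8").decode("big5", errors="replace")
print(recovered == name)           # False: the original bytes are gone

# Case 2 (assumed): the client kept every byte and displayed it as Latin-1.
mojibake = wire_bytes.decode("latin-1")
repaired = mojibake.encode("latin-1").decode("big5")
print(repaired == name)            # True: nothing was discarded
```

So whether the name is recoverable depends entirely on whether the original bytes survived, which is why seeing the raw bytes matters.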
Is there a good way to see what format an image is, without having to read the entire file into memory?
Obviously this would vary from format to format (I'm particularly interested in TIFF files) but what sort of procedure would be useful to determine what kind of image format a file is without having to read through the entire file?
BONUS: What if the image is a Base64-encoded string? Any reliable way to infer it before decoding it?
Most image file formats have unique bytes at the start. The unix file command looks at the start of the file to see what type of data it contains. See the Wikipedia article on Magic numbers in files and magicdb.org.
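A minimal Python sketch of that idea, reading only the first few bytes. The function names are made up, and the signature table covers just a handful of common formats:

```python
from typing import Optional

# (format name, signature at offset 0)
SIGNATURES = [
    ("gif", b"GIF87a"),
    ("gif", b"GIF89a"),
    ("jpeg", b"\xff\xd8\xff"),
    ("png", b"\x89PNG\r\n\x1a\n"),
    ("tiff", b"II*\x00"),   # little-endian TIFF
    ("tiff", b"MM\x00*"),   # big-endian TIFF
    ("bmp", b"BM"),
]


def sniff_image_format(header: bytes) -> Optional[str]:
    """Return a format name based on the leading bytes, or None if unknown."""
    for name, magic in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


def sniff_file(path: str) -> Optional[str]:
    # Only the first 8 bytes are read, never the whole file.
    with open(path, "rb") as f:
        return sniff_image_format(f.read(8))


print(sniff_image_format(bytes.fromhex("49492a0008000000")))  # tiff
```

A two-byte signature such as BM is weak evidence on its own, so a match only says what the file claims to be.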
Sure there is. Like the others have mentioned, most images start with some sort of 'Magic', which will always translate to some sort of Base64 data. The following are a couple examples:
A Bitmap will start with Qk (the third character is 0, 1, 2 or 3, depending on the byte that follows the BM signature).
A Jpeg will start with /9j/
A GIF will start with R0l (That's a zero as the second char).
And so on. It's not hard to take the different image types and figure out what they encode to. Just be careful, as some have more than one piece of magic, so you need to account for them in your B64 'translation code'.
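Since every 4 base-64 characters correspond to exactly 3 bytes, the prefixes can be derived rather than memorized, and a string can be checked by decoding only its first few characters. A Python sketch under those assumptions; sniff_base64 is a hypothetical helper, and it expects plain base-64 with no data: URI prefix or leading whitespace.

```python
import base64
from typing import Optional

# Derive the base-64 prefix for the first 3 bytes of some signatures.
for label, magic in [("gif", b"GIF"), ("jpeg", b"\xff\xd8\xff"), ("png", b"\x89PN")]:
    print(label, base64.b64encode(magic).decode("ascii"))


def sniff_base64(text: str) -> Optional[str]:
    """Guess the image type from the first 8 base-64 characters (6 bytes)."""
    chunk = text[:8]
    chunk = chunk[: len(chunk) - len(chunk) % 4]  # keep whole 4-char groups
    head = base64.b64decode(chunk)
    if head.startswith(b"GIF8"):
        return "gif"
    if head.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if head.startswith(b"\x89PNG"):
        return "png"
    if head.startswith((b"II*\x00", b"MM\x00*")):
        return "tiff"
    if head.startswith(b"BM"):
        return "bmp"
    return None


print(sniff_base64("R0lGODlhAQABAAAAACw="))  # gif
```

Decoding a short prefix and comparing bytes avoids having to list every base-64 spelling of a signature whose length is not a multiple of 3.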
Either file on the *nix command-line or reading the initial bytes of the file. Most files come with a unique header in the first few bytes. For example, TIFF's header looks something like this: 0x00000000: 4949 2a00 0800 0000
For more information on the TIFF file format specifically if you'd like to know what those bytes stand for, go here.
TIFFs will begin with either II or MM (Intel or Motorola byte ordering).
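For TIFF specifically, the 8-byte header can be unpacked as in the Python sketch below: a 2-byte byte-order mark, the 16-bit number 42, and the 32-bit offset of the first image file directory. parse_tiff_header is a made-up name, and the sample bytes are the ones from the hex dump quoted above.

```python
import struct
from typing import Tuple


def parse_tiff_header(header: bytes) -> Tuple[str, int]:
    """Return (byte order, offset of the first IFD) from a TIFF header."""
    if len(header) < 8:
        raise ValueError("a TIFF header is 8 bytes long")
    order = header[:2]
    if order == b"II":
        endian = "<"   # Intel, little-endian
    elif order == b"MM":
        endian = ">"   # Motorola, big-endian
    else:
        raise ValueError("not a TIFF: missing II/MM byte-order mark")
    # H = unsigned 16-bit magic number, I = unsigned 32-bit IFD offset.
    magic, ifd_offset = struct.unpack(endian + "HI", header[2:8])
    if magic != 42:
        raise ValueError("not a TIFF: magic number is not 42")
    return ("little" if endian == "<" else "big"), ifd_offset


print(parse_tiff_header(bytes.fromhex("49492a0008000000")))  # ('little', 8)
```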
The TIFF 6 specification can be downloaded here and isn't too hard to follow