How to check whether a Ruby string is actual text or blob data such as an image - ruby

In Ruby, how can I check whether a string is actual text or blob data such as an image? In terms of data type both are Ruby strings, but their contents are very different: one is literal text, the other is binary data such as an image.
Could anyone give me a clue? Thank you in advance.

Bytes are bytes. There is no way to declare that something isn't file data. It'd be fairly easy to construct a valid file in many formats consisting only of printable ASCII. Especially when dealing with Unicode, you're in very murky territory. If possible, I'd suggest modifying the method so that it takes two parameters... use one for passing text and the other for binary data.
One thing you might do is look at the length of the string. Most image formats are at least 500-600 bytes even for a tiny image, and while this is by no means an accurate test, if you get passed, say, a 20k string, it's probably an image. If it were text, it would be quite a lot of it (several thousand words, or thereabouts).

Files like images or sound files have defined blocks that can be "sniffed". Wotsit.org has a lot of info about the key bytes and ways to determine what the files are. By looking at those byte offsets in your data you could figure it out.
Another way is to use some "magic", which is code that sniffs key bytes or byte types in a file to try to figure out what its type is. *nix systems have it built in via the file command. Do a man file or man magic for more info, or check Wikipedia's article on Magic numbers in files.
Ruby Filemagic uses the same technique; it is based on libmagic, the library behind the file command.
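If pulling in a library is not an option, the sniffing idea can be sketched by hand. This is only a minimal illustration: the method name sniff_format and the FILE_SIGNATURES table are made up for the example, and the table covers just a few well-known formats, so a nil result does not prove the data is text.

```ruby
# Hypothetical signature table: a handful of well-known leading byte sequences.
FILE_SIGNATURES = {
  png:  "\x89PNG\r\n\x1A\n".b,
  jpeg: "\xFF\xD8\xFF".b,
  gif:  "GIF8".b,
  pdf:  "%PDF".b
}.freeze

# Returns a format symbol when the data starts with a known signature, else nil.
def sniff_format(data)
  bytes = data.b # binary copy, so the comparison is byte-wise
  FILE_SIGNATURES.each do |format, magic|
    return format if bytes.start_with?(magic)
  end
  nil
end
```

Something like sniff_format(File.binread(path)) would then give :png, :jpeg, :gif, :pdf, or nil.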

What would constitute a string? Are you expecting simple ASCII? UTF-8? Or text encoded some other way?
If you know you're going to get ASCII text or a blob then you can just spin through the first n bytes and see if anything has the eighth (high) bit set; that would tell you that you have binary. OTOH, not finding anything wouldn't guarantee that you have text.
If you're going to get UTF-8 Unicode then you'd do the same thing but look for invalid UTF-8 sequences. Of course, the same caveats apply.
You could scan the first n bytes for control characters, meaning anything below 0x20 other than tab, line feed, and carriage return (which turn up in ordinary text). If you find any bytes like that then you probably have a binary blob of some sort. But maybe not.
As Tyler Eaves said: bytes are bytes. You're starting with a bunch of bytes and trying to find an interpretation of them that makes sense.
Your best bet is to make the caller supply the expected interpretation or take Greg's advice and use a magic number library.
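The heuristics above can be combined into a rough check. This is a sketch only, assuming the text you expect is ASCII or UTF-8; the method name probably_text? and the 512-byte sample size are arbitrary choices for the example, and the same caveats apply: a true result is a guess, not a guarantee.

```ruby
# Heuristic only: returns true when the bytes look like ASCII or UTF-8 text.
def probably_text?(data, sample_size = 512)
  sample = data.byteslice(0, sample_size).b
  return true if sample.empty?
  # Control bytes other than tab, LF and CR suggest binary data.
  return false if sample.match?(/[\x00-\x08\x0B\x0C\x0E-\x1F]/n)
  # Pure 7-bit ASCII in the sample passes.
  return true if sample.ascii_only?
  # High bytes are present: accept only if the whole string is valid UTF-8.
  data.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end
```

Note that text in UTF-16 contains NUL bytes and would be reported as binary by this check, which is one more reason to have the caller state the expected interpretation.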

Related

Using sound to encrypt any file?

First Question: Say I have a random Base64 encoded string. Is it possible to read each character of the string, convert each character to a frequency/sound, and then save the string as a sound?
Second Question: Is it possible to do the opposite? How would I take a sound that was created above and convert back to a base64 string?
If someone clicked on the audio-encrypted file it would just be noise.
Yes, it's possible and actually being used, for example here: http://arstechnica.com/tech-policy/2015/11/beware-of-ads-that-use-inaudible-sound-to-link-your-phone-tv-tablet-and-pc/
A Perl script for this is beyond a single answer, but there are many Sound-related modules on CPAN: http://search.cpan.org/search?query=sound&mode=all You'll probably need some time for research, but it should be more or less easy to build.
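To make the idea concrete, here is a small sketch of just the mapping step, written in Ruby rather than Perl. Everything in it is an assumption made for illustration: one tone per Base64 symbol, starting at 1000 Hz in 50 Hz steps. Turning the tones into actual audio, and recovering them from a recording, is the part that needs the research mentioned above and is not shown.

```ruby
# Assumed scheme: each Base64 symbol (plus "=" padding) gets its own tone.
B64_SYMBOLS = [*"A".."Z", *"a".."z", *"0".."9", "+", "/", "="].freeze
TONE_BASE_HZ = 1000
TONE_STEP_HZ = 50

TONE_FOR = B64_SYMBOLS.each_with_index
                      .map { |ch, i| [ch, TONE_BASE_HZ + i * TONE_STEP_HZ] }
                      .to_h.freeze
CHAR_FOR = TONE_FOR.invert.freeze

# Base64 text -> list of frequencies in Hz.
def base64_to_tones(encoded)
  encoded.each_char.map { |ch| TONE_FOR.fetch(ch) }
end

# List of frequencies in Hz -> Base64 text.
def tones_to_base64(tones)
  tones.map { |hz| CHAR_FOR.fetch(hz) }.join
end
```

For example, base64_to_tones(["secret"].pack("m0")) yields one frequency per Base64 character, and tones_to_base64 reverses it. This is an encoding, not encryption: anyone who knows the table can map the tones back.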

tcl utf-8 characters not displaying properly in ui

Objective : To have multi language characters in the user id in Enovia v6
I am using utf-8 encoding in a tcl script and it seems to save multi language characters properly in the database (after some conversion). But in the UI I literally see the saved information from the database.
While doing the same exercise through Power Web, the saved data somehow gets converted back into the proper multi language characters and displays properly.
Am I missing something in the tcl approach?
Pasting one example to help understand better.
Original Name: Kátai-Pál
Name saved in database as: Kátai-Pál
In UI I see name as: Kátai-Pál
In Tcl I use below syntax
set encoded [encoding convertto utf-8 Kátai-Pál];
Now user name becomes: Kátai-Pál
In UI I see name as “Kátai-Pál”
The trick is to think in terms of characters, not bytes. They're different things. Encodings are ways of representing characters as byte sequences (internally, Tcl's really quite complicated, but you shouldn't ever have to care about that if you're not developing Tcl's implementation itself; suffice to say it's Unicode). Thus, when you use:
encoding convertto utf-8 "Kátai-Pál"
You're taking a sequence of characters and asking for the sequence of bytes (one per result character) that is the encoding of those characters in the given encoding (UTF-8).
What you need to do is to get the database integration layer to understand what encoding the database is using so it can convert back into characters for you (you can only ever communicate using bytes; everything else is just a simplification). There are two ways that can happen: either the information is correctly shared (via metadata or defined convention), or both sides make assumptions which come unstuck occasionally. It sounds like the latter is what's happening, alas.
If you can't handle it any other way, you can take the bytes produced out of the database layer and convert into characters:
encoding convertfrom $theEncoding $theBytes
Working out what $theEncoding should be is in general very tricky, but it sounds like it's utf-8 for you. Once you've got characters, Tcl/Tk will be able to display them correctly; it knows how to transfer them correctly into the guts of the platform's GUI. (And in scripts that you actually write, you're best off replacing non-ASCII characters with their \uXXXX escapes, because platforms don't agree on what encoding is right to use for scripts. Alas.)
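The same bytes-versus-characters distinction can be shown outside Tcl. The sketch below uses Ruby purely as an illustration (the method names are made up for the example); reading UTF-8 bytes under the wrong assumption is what produces garbled names, and reading them as UTF-8 recovers the characters, which is what encoding convertfrom does in Tcl.

```ruby
# Characters -> UTF-8 bytes (a binary string).
def to_utf8_bytes(text)
  text.encode("UTF-8").b
end

# Wrong assumption: every byte is one ISO-8859-1 character.
def misread_as_latin1(bytes)
  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Right assumption: the bytes are UTF-8.
def read_as_utf8(bytes)
  bytes.dup.force_encoding("UTF-8")
end

name  = "K\u00E1tai-P\u00E1l"   # 9 characters
bytes = to_utf8_bytes(name)      # 11 bytes, each accented letter takes two
garbled  = misread_as_latin1(bytes)
restored = read_as_utf8(bytes)
```

Here garbled has 11 characters, because each accented letter has turned into two, while restored equals the original name.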

How do I extract ASCII data from binary file with unknown format, in Windows?

On Windows, what is the best way to convert a binary file whose internal structure is unknown, except that its contents are ASCII in nature, back to plain text?
Ideally the conversion would produce a "human"-readable version. I think the file should contain something like the following:
Date: 10 FEB 2010
House: 345 Dogwood Drive
Exterior: Brick
In Linux/Unix:
$ strings < unknown.dat > ascii-from-unknown.txt
This is of course not so much a "conversion" as a straight-up extraction: it keeps the runs of printable characters and drops everything else. It's useful quite often, though.
In general, without more knowledge of the file's internal format, I don't think you can do much better.
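Since strings is not available on a stock Windows install, the same extraction can be approximated with a few lines of script. This is a minimal sketch in Ruby, not a reimplementation of the real tool; the method name extract_strings is made up, and the minimum run length of 4 is just a default chosen for the example.

```ruby
# Returns every run of at least min_length printable ASCII characters
# (tab included), in the order they appear in the data.
def extract_strings(data, min_length = 4)
  data.b.scan(/[\x20-\x7E\t]{#{min_length},}/n)
end
```

Something like puts extract_strings(File.binread("unknown.dat")) would print one extracted run per line.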
Depending on what exactly you want to achieve, a hex dump might fit the bill: it's a pure ASCII format that represents the entire file without any loss of data (although it is quite wasteful with space).
It is not really human readable, but since you don't explain why you want to do that, it's the best I can offer.
There are several simple tools that produce a hex dump on Windows.
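If no such tool is at hand, a basic dump is easy to script. The following is a rough sketch in Ruby (the method name hex_dump and the 16-bytes-per-row layout are choices made for the example): each row shows the offset, the bytes in hex, and the printable ASCII characters with everything else replaced by a dot.

```ruby
# Formats data as rows of: offset, hex bytes, printable ASCII.
def hex_dump(data, width = 16)
  rows = data.bytes.each_slice(width).each_with_index.map do |chunk, row|
    hex = chunk.map { |byte| format("%02x", byte) }.join(" ")
    ascii = chunk.map { |byte| byte.between?(0x20, 0x7E) ? byte.chr : "." }.join
    # Pad the hex column so the ASCII column lines up on a short last row.
    format("%08x  %-#{width * 3 - 1}s  %s", row * width, hex, ascii)
  end
  rows.join("\n")
end
```

The ASCII column makes embedded text such as "Date: 10 FEB 2010" easy to spot while keeping the surrounding bytes visible.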

manually finding the size of a block of text (ASCII format)

Is there an easy way to manually (i.e. not through code) find the size (in bytes, KB, etc.) of a block of selected text? Currently I am taking the text, cutting/pasting it into a new text document, saving it, then clicking "properties" to get an estimate of the size.
I am developing mainly in Visual Studio 2008 but I need any sort of simple way to manually do this.
Note: I understand this is not specifically a programming question but it's related to programming. I need this to compare a few functions and see which one is returning the smallest amount of text. I only need to do it a few times so figured writing a method for it would be overkill.
This question isn't meaningful as asked. Text can be encoded in different formats; ASCII, UTF-8, UTF-16, etc. The memory consumed by a block of text depends on which encoding you decide to use for it.
EDIT: To answer the question you've stated now (how do I determine which function is returning a "smaller" block of text) -- given a single encoding, the shorter text will almost always be smaller as well. Why can't you just compare the lengths?
In your comment you mention it's ASCII. In that case, it'll be one byte per character.
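To see how much the encoding matters, here is a quick sketch in Ruby (the method name encoded_sizes and the default list of encodings are just for illustration) that reports the byte count of the same text under several encodings.

```ruby
# Returns a hash of encoding name => number of bytes the text occupies in it.
def encoded_sizes(text, encodings = %w[UTF-8 UTF-16LE UTF-32LE])
  encodings.map { |name| [name, text.encode(name).bytesize] }.to_h
end
```

For pure ASCII text the UTF-8 figure equals the character count, the UTF-16LE figure is twice that, and the UTF-32LE figure four times. Once the text contains non-ASCII characters, the UTF-8 count grows beyond the character count.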
I don't see the difference between using the code written by the app you're pasting into, and using some other code. Being a python person myself, whenever I want to check length of some text I just do it in the interactive interpreter. Surely some equivalent solution more suited to your tastes would be appropriate?
Ended up just cutting/pasting the text into MS Word and using the character count feature in there.

Is there a way to infer what image format a file is, without reading the entire file?

Is there a good way to see what format an image is, without having to read the entire file into memory?
Obviously this would vary from format to format (I'm particularly interested in TIFF files) but what sort of procedure would be useful to determine what kind of image format a file is without having to read through the entire file?
BONUS: What if the image is a Base64-encoded string? Any reliable way to infer it before decoding it?
Most image file formats have unique bytes at the start. The unix file command looks at the start of the file to see what type of data it contains. See the Wikipedia article on Magic numbers in files and magicdb.org.
Sure there is. Like the others have mentioned, most images start with some sort of 'Magic', which will always translate to some sort of Base64 data. The following are a couple examples:
A Bitmap will start with Qk, followed by a character from 0 to 3 that depends on the file size (Qk3 is one possibility)
A Jpeg will start with /9j/
A GIF will start with R0l (That's a zero as the second char).
And so on. It's not hard to take the different image types and figure out what they encode to. Just be careful, as some have more than one piece of magic, so you need to account for them in your B64 'translation code'.
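An alternative to memorising Base64 prefixes is to decode only the first few characters and compare the result with the usual magic bytes. The sketch below is in Ruby, with a made-up method name and a signature table that lists only a few formats; it assumes the string is plain Base64 with no data: URI prefix or leading whitespace.

```ruby
# Hypothetical table of leading magic bytes for a few image formats.
B64_IMAGE_MAGIC = {
  jpeg:    "\xFF\xD8\xFF".b,
  gif:     "GIF8".b,
  png:     "\x89PNG".b,
  bmp:     "BM".b,
  tiff_le: "II*\x00".b,
  tiff_be: "MM\x00*".b
}.freeze

# Decodes just the first 8 Base64 characters (6 bytes) and checks the magic.
def sniff_base64_image(encoded)
  head = encoded[0, 8].unpack1("m").b
  B64_IMAGE_MAGIC.each do |format, magic|
    return format if head.start_with?(magic)
  end
  nil
end
```

Because only eight characters are decoded, the cost is the same however large the encoded image is.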
Either file on the *nix command-line or reading the initial bytes of the file. Most files come with a unique header in the first few bytes. For example, TIFF's header looks something like this: 0x00000000: 4949 2a00 0800 0000
For more information on the TIFF file format specifically if you'd like to know what those bytes stand for, go here.
TIFFs will begin with either II or MM (Intel or Motorola byte ordering).
The TIFF 6 specification can be downloaded here and isn't too hard to follow
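For the TIFF case specifically, checking the header needs only the first four bytes of the file. A minimal sketch in Ruby follows; the method name tiff_byte_order is invented for the example, and it checks only the byte-order mark plus the number 42 that follows it, not whether the rest of the file is a valid TIFF.

```ruby
# Reads only the first four bytes and reports the TIFF byte order, or nil
# when the file does not start with a TIFF header.
def tiff_byte_order(path)
  header = File.binread(path, 4).to_s
  case header
  when "II*\x00".b then :little_endian # 49 49 2A 00 (Intel)
  when "MM\x00*".b then :big_endian    # 4D 4D 00 2A (Motorola)
  end
end
```

The header quoted above (4949 2a00 ...) would give :little_endian.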

Resources