Given an HL7 file whose TXA segment contains the byte code of an image, how can I extract that image?
I know my question might be vague, but those are all the details I have.
EDIT: The TXA segment is as follows:
TXA|1|25^PathologyResultsReport|8^HTML|||||||||||||||||||908^מעבדת^פתולוגיה^^^^^^^^^^^^20110710084900|||PCFET0NUWVBFIGh0bWwgUFVCTElDICItLy9XM0MvL0RURCBYSFRNTCAxLjAgU3RyaWN0Ly9FTiIgImh0dHA6Ly93d3cudzMub3JnL1RSL3hodG1sMS9EVEQveGh0bWwxLXN0cmljdC5kdGQiPg0KPGh0bWw+PGhlYWQ+PG1ld...
+PGJyLz48L3RkPjwvdHI+DQo8dHI+PHRkPg0KPC90ZD48L3RyPg0KPC90Ym9keT4NCjwvdGFibGU+DQo8L3RkPjxTb2ZUb3ZOZXdDb2x1bW4gLz48L3RyPjxTb2ZUb3ZOZXdMaW5lIC8+DQo8L3Rib2R5Pg0KPC90YWJsZT4NCjwvYm9keT4NCjwvaHRtbD4NCg==|
Thanks in advance
From reading the documentation it appears that images are stored in this form:
OBX||TX|11490-0^^LN||^IM^TIFF^Base64^
SUkqANQAAABXQU5HIFRJRkYgAQC8AAAAVGl0bGU6AEF1dGhvcjoAU3ViamVjdDoAS2V5d29yZHM6~
AENvbW1lbnRzOgAAAFQAaQB0AGwAZQA6AAAAAABBAHUAdABoAG8AcgA6AAAAAABTAHUAYgBqAGUA~
YwB0ADoAAAAAAEsAZQB5AHcAbwByAGQAcwA6AAAAAABDAG8AbQBtAGUAbgB0AHMAOgAAAAAAAAAA~
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAASAP4ABAABAAAAAAAAAAAB~
(681 lines omitted)
1qqQS/cFpaSVeD1QP1/SX1VJfpPSfXr+tIOKrN2aSrB8OHoH1kfz2tnPLpB/6WkksJ0w5G6WKVNe~
vSisJQdhLdQjODpbznVXXDMPdBNhVtBNpOqqtkY60qYoJxQK17cUoS0v4ijYztCapqqYUKmIUJhJ~
sKqoIO2opiqr7lupIMFBBhNQmtOIzG4naS7XsQuDBLFOP/gAgAgAAKMHAACcBgAACRcAALcYAAC4~
EwAA5RoAALQXAADyBAAAnAMAAD8LAADbEQAA5CgAAJtBAABTVQAAOHAAAOyHAAA=|||||||F
This is a simple structure: the image data is Base64-encoded and stored as a long stream. You know it's an image because of the ^IM component, and you know the image type from the ^TIFF component.
More specifically:
When an image is sent, OBX-2 must contain the value ED, which stands for encapsulated data. The components of OBX-5 must be as described below.
The first component, source application, must be null.
Component 2, type of data, must contain IM, indicating image data.
Component 3, data subtype, must contain TIFF.
Component 4, encoding, must contain Base64.
Base64 encoding of unstructured data, normally in an OBX segment (though it could appear anywhere), is the norm in standard HL7. Older systems may have a 32 KB or 64 KB limit per segment, and when that happens the data will be spread over multiple segments.
The target system will first have to concatenate any continuation segments and then decode the Base64 encoding, as sketched below.
The target system must also know what the expected data type is so that it can be properly displayed or further decoded/interpreted.
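For illustration, here is a minimal Python sketch of that receiving side. It assumes the message is available as a single string, that OBX-5 uses the component layout shown above, and that the ~ characters only wrap the Base64 payload; the output file name is hypothetical.

import base64

def extract_obx_image(hl7_message: str, out_path: str = "image.tiff") -> None:
    """Collect the Base64 payloads of all image OBX segments and decode them."""
    chunks = []
    for segment in hl7_message.replace("\n", "\r").split("\r"):
        fields = segment.split("|")
        if fields[0] != "OBX" or len(fields) < 6:
            continue
        # OBX-5 components: source ^ type ^ subtype ^ encoding ^ data
        components = fields[5].split("^")
        if len(components) >= 5 and components[1] == "IM":
            # Drop the '~' characters used above to wrap long Base64 lines.
            chunks.append(components[4].replace("~", ""))
    with open(out_path, "wb") as f:
        f.write(base64.b64decode("".join(chunks)))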
This would be a great question on our new StackExchange site for IT Healthcare: http://area51.stackexchange.com/proposals/51758/healthcare-it
Related
What information can be derived from an image?
- GPS data
- tags
- time taken
I am trying to do some data engineering and am wondering what information I can extract from an image. Does anyone have any ideas?
That would depend on the file format.
Some image file formats specify a means of storing metadata.
There is no guarantee that these metadata fields are actually used, so even if the file format provides for metadata, a given file may contain no useful meta information.
For most image formats: width, height, and pixel depth, full stop. Sometimes physical pixel size, though you may doubt how up to date it is.
TIFF and JPEG have provisions for including innumerable information tags, most of which are optional and some of which can be user-defined (and will, of course, be unsupported by existing readers).
Image tags are a jungle and are simply ignored by most users and implementors. There is no standardization, because there is no need for it, and any attempt to be "complete" ends up as a ridiculous enumeration.
https://en.wikipedia.org/wiki/Exif#Example
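As a rough illustration, here is a sketch using the Pillow library (assuming it is installed; the file name is hypothetical) that dumps whatever EXIF tags an image happens to carry:

from PIL import Image, ExifTags

def dump_exif(path: str) -> None:
    """Print whatever EXIF tags the image happens to carry, if any."""
    exif = Image.open(path).getexif()
    if not exif:
        print("No EXIF metadata found.")
        return
    for tag_id, value in exif.items():
        # Map the numeric tag id to a readable name where one is known.
        name = ExifTags.TAGS.get(tag_id, f"unknown tag {tag_id}")
        print(f"{name}: {value}")

dump_exif("photo.jpg")  # hypothetical file name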
I have been trying to examine a JPEG file, known to contain IPTC data, but I could find no strings whatsoever. I tried the well-known UNIX strings command with ASCII, 8-bit, and 16-bit Unicode encodings, to no avail: I could not see any of the strings I expected to find in the IPTC fields.
My question is: How is IPTC data encoded? Is it encrypted? Compressed? Other? Why can't it be viewed using the strings command?
The most probable reason why you cannot see IPTC data with the strings command is that the file has no IPTC data.
An image that contains IPTC data like this one:
http://regex.info/exif.cgi?dummy=on&imgurl=https%3A%2F%2Fwww.iptc.org%2Fstd%2Fphotometadata%2Fexamples%2FIPTC-PhotometadataRef-Std2014_large.jpg
has an XML structure and text fields that are viewable through a text editor like Emacs (8-bit, not even Unicode).
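If you just want to check whether such metadata is present at all, a crude sketch in Python is to scan the raw bytes for the standard markers: IPTC-IIM normally travels inside a Photoshop 8BIM resource block (APP13), and XMP inside an APP1 segment. Treat this as a heuristic, not a parser:

def sniff_metadata(path: str) -> None:
    """Crude scan of a JPEG's raw bytes for common metadata markers."""
    with open(path, "rb") as f:
        data = f.read()
    # IPTC-IIM is usually wrapped in a Photoshop resource block (APP13).
    if b"Photoshop 3.0" in data or b"8BIM" in data:
        print("Likely contains IPTC/Photoshop metadata.")
    # XMP is an XML packet embedded in an APP1 segment.
    if b"<x:xmpmeta" in data or b"http://ns.adobe.com/xap/1.0/" in data:
        print("Likely contains XMP metadata.")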
I'm new to protocol buffers and I was wondering whether it is possible to search a protocol buffers binary file and read the data in a structured format. For example, if a message in my .proto file has 4 fields, I would like to serialize the message, write multiple messages into a file, and then search for a particular field in the file. If I find the field, I would like to read the message back in the same structured format in which it was written. Is this possible with protocol buffers? If so, any sample code or examples would be very helpful. Thank you
You should treat the protobuf library as a serialization protocol, not an all-in-one library that supports complex operations (such as querying, indexing, or picking out particular data). Google has various libraries on top of the open-sourced portion of protobuf that do so, but they have not been released as open source, as they are tied to Google's unique infrastructure. That being said, what you want is certainly possible, but you will need to write some code.
Anyhow, some of your requirements are:
- one file contains multiple serialized binaries;
- search for a particular field in each serialized binary and extract that chunk.
There are several ways to achieve them.
The most popular approach for serial read/write is a file containing a series of [size, type, serialization output] records. That is, each serialized output is prefixed by a size and a type (either 4/8-byte or variable-length) to aid reading and parsing. You then repeat this procedure: 1) read size and type, 2) read the binary with the given size, 3) parse with the given type, 4) go to 1). If you use a union type, or the whole file shares a single type, you may skip the type. You cannot drop the size, as there is no way to know where a serialized output ends from the output itself. If you want random read/write, a different data structure is necessary. A minimal sketch of this framing follows.
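Here is that size-prefixed framing sketched in Python, assuming a hypothetical generated class person_pb2.Person (SerializeToString and ParseFromString are the standard protobuf Python API). The type tag is omitted because this file holds a single message type:

import struct
import person_pb2  # hypothetical module generated from your .proto

def write_message(f, msg):
    """Prefix each serialized message with a 4-byte little-endian size.
    (A type tag could be written before the size if the file mixed types.)"""
    payload = msg.SerializeToString()
    f.write(struct.pack("<I", len(payload)))
    f.write(payload)

def read_messages(f):
    """Repeat: 1) read size, 2) read that many bytes, 3) parse, 4) go to 1)."""
    while True:
        header = f.read(4)
        if len(header) < 4:
            return  # clean end of file
        (size,) = struct.unpack("<I", header)
        msg = person_pb2.Person()
        msg.ParseFromString(f.read(size))
        yield msg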
'Search field' in a binary file is trickier. One way is to read and parse the outputs one by one and check for the existence of the field with HasField(). This is the most obvious and slow, yet straightforward, way to do it. If you want to search for a field by number (say, you want to find 'optional string email = 3;'), and thus search for a binary blob (like 0x1A: field number 3, wire type 2), that is not possible. In a serialized binary stream, field information is saved as merely a number. Without the exact context (the .proto scheme or the binary file's structure), the number alone doesn't mean anything: there is no guarantee that a 0x1A byte is field information at all, rather than field information from another message type, the literal number 26, part of another number, and so on. That is, you need to maintain that information yourself. You may create another file or database with the necessary information to fetch a particular message (like the location of the serialized output containing a given field).
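And a sketch of that slow-but-straightforward scan, reusing read_messages from the sketch above and assuming the hypothetical 'optional string email = 3;' field:

def messages_with_email(path):
    """Slow linear scan: parse every message, keep those where the field is set."""
    matches = []
    with open(path, "rb") as f:
        for msg in read_messages(f):  # the reader from the sketch above
            # HasField only answers "is the field present in this parsed
            # message?"; the raw bytes alone cannot be searched reliably.
            if msg.HasField("email"):
                matches.append(msg)
    return matches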
Long story short, what you ask for is beyond what the open-sourced protobuf library does by itself, but you can build it to fit your requirements.
I hope this is what you are looking for:
http://temk.github.io/protobuf-utils/
This is a command-line utility for searching within a protobuf file.
In Ruby, how can I check whether a string is an actual (literal) string or blob data such as an image? From the data-type point of view they are both Ruby strings, but their contents are very different: one is a literal string, the other is blob data such as an image.
Could anyone provide some clues? Thank you in advance.
Bytes are bytes. There is no way to declare that something isn't file data. It'd be fairly easy to construct a valid file in many formats consisting only of printable ASCII. Especially when dealing with Unicode, you're in very murky territory. If possible, I'd suggest modifying the method so that it takes two parameters... use one for passing text and the other for binary data.
One thing you might do is look at the length of the string. Most image formats are at least 500-600 bytes even for a tiny image, and while this is by no means an accurate test, if you get passed, say, a 20 KB string, it's probably an image. If it were text, it would be quite a lot (like a quarter of a typical novel, or thereabouts).
Files like images or sound files have defined blocks that can be "sniffed". Wotsit.org has a lot of info about the key bytes and ways to determine what the files are. By looking at those byte offsets in your data you could figure it out.
Another way is to use some "magic", which is code that sniffs key bytes or byte patterns in a file to try to figure out its type. *nix systems have it built in via the file command. Do a man file or man magic for more info, or check Wikipedia's article on magic numbers in files.
Ruby FileMagic uses the same technique; it is based on libmagic.
What would constitute a string? Are you expecting simple ASCII? UTF-8? Or text encoded some other way?
If you know you're going to get either ASCII text or a blob, then you can just spin through the first n bytes and see whether anything has the eighth bit set; that would tell you that you have binary data. On the other hand, not finding anything wouldn't guarantee that you had text.
If you're going to get UTF-8 Unicode then you'd do the same thing but look for invalid UTF-8 sequences. Of course, the same caveats apply.
You could scan the first n bytes for anything between 0x00 and 0x20. If you find any bytes that low then you probably have a binary blob of some sort. But maybe not.
As Tyler Eaves said: bytes are bytes. You're starting with a bunch of bytes and trying to find an interpretation of them that makes sense.
Your best bet is to make the caller supply the expected interpretation or take Greg's advice and use a magic number library.
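Putting those heuristics together, here is a sketch in Python (the same byte-level checks port directly to Ruby; the sample size is arbitrary, and, per the caveats above, neither test is a guarantee):

def looks_binary(data, sample_size=512):
    """Heuristic only: sniff the first bytes for signs of a binary blob."""
    sample = data[:sample_size]
    # Control bytes below 0x20 other than tab/LF/CR suggest binary data.
    if any(b < 0x20 and b not in (0x09, 0x0A, 0x0D) for b in sample):
        return True
    # Invalid UTF-8 also suggests binary. (A multi-byte character cut off
    # at the sample boundary can cause a false positive here.)
    try:
        sample.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False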
Is there a good way to see what format an image is, without having to read the entire file into memory?
Obviously this would vary from format to format (I'm particularly interested in TIFF files) but what sort of procedure would be useful to determine what kind of image format a file is without having to read through the entire file?
BONUS: What if the image is a Base64-encoded string? Any reliable way to infer it before decoding it?
Most image file formats have unique bytes at the start. The unix file command looks at the start of the file to see what type of data it contains. See the Wikipedia article on Magic numbers in files and magicdb.org.
Sure there is. As others have mentioned, most image files start with a magic number, and a fixed magic number always translates to a fixed Base64 prefix. A couple of examples:
A Bitmap will start with Qk (the third character depends on the byte that follows the magic).
A JPEG will start with /9j/.
A GIF will start with R0l (that's a zero as the second character).
And so on. It's not hard to take the different image types and figure out what they encode to. Just be careful, as some formats have more than one magic number, so you need to account for all of them in your Base64 "translation code", as sketched below.
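A sketch of that translation code in Python: Base64-encode each format's magic bytes and compare only the characters that are fully determined by them (6 bits per character). The table of magic numbers is illustrative, not exhaustive:

import base64

# Magic bytes for a few common formats.
MAGIC = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
    "bmp": b"BM",
    "tiff_le": b"II*\x00",
    "tiff_be": b"MM\x00*",
}

def guess_base64_image(b64):
    """Compare the start of a Base64 string against encoded magic numbers."""
    for fmt, magic in MAGIC.items():
        stable = len(magic) * 8 // 6  # chars fully determined by the magic
        prefix = base64.b64encode(magic).decode("ascii")[:stable]
        if b64.startswith(prefix):
            return fmt
    return None

print(guess_base64_image("/9j/4AAQSkZJRg..."))  # -> "jpeg"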
Either use file on the *nix command line or read the initial bytes of the file. Most files come with a unique header in the first few bytes. For example, TIFF's header looks something like this: 0x00000000: 4949 2a00 0800 0000
For more information on the TIFF file format specifically, if you'd like to know what those bytes stand for, go here.
TIFFs will begin with either II or MM (Intel or Motorola byte ordering).
The TIFF 6 specification can be downloaded here and isn't too hard to follow.
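A sketch of that check in Python, reading only the 8-byte TIFF header (byte order mark, the magic number 42, and the offset of the first image file directory, per the TIFF 6 specification):

import struct

def sniff_tiff(path):
    """Read only the 8-byte TIFF header; no need to load the whole file."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8:
        return None
    if header[:4] == b"II*\x00":
        endian = "<"  # little-endian (Intel)
    elif header[:4] == b"MM\x00*":
        endian = ">"  # big-endian (Motorola)
    else:
        return None  # not a TIFF
    # The next four bytes give the offset of the first image file directory.
    (ifd_offset,) = struct.unpack(endian + "I", header[4:8])
    return endian, ifd_offset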