I am confused about what argument I should pass to the CGPDFDictionaryGetString function for "key". I want to extract text and images from a PDF file.
The function you mention is normally used for extracting a String COS object, and will probably be of little direct use in getting the text off a PDF page. COS objects are stored within the PDF's document catalog tree, and you normally acquire one by using its key. COS objects come in several types (Dictionary, Array, Number, String, Stream, etc.). Each entry in a dictionary is stored under a key, and you retrieve it with the function matching its type, such as:
CGPDFDictionaryGetString(dict, key, &value)
CGPDFDictionaryGetNumber(dict, key, &value)
CGPDFDictionaryGetDictionary(dict, key, &value)
I've never had the need to extract the on-page text myself, but looking over a simple PDF file, the on-page text seems to be in the page's "Contents" stream.
So in your case you probably want to do something like
1) Get the Document Catalog
2) Get the 'Pages' Dictionary
3) Get Page(n) that you are concerned with
4) Get that page's "Contents" stream and parse it for the text.
Images are normally stored under the page's "Resources" dictionary (which resides at the same level as the "Contents" stream).
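The walk described above can be sketched in a few lines. This is only an illustration in Python, not Core Graphics code: the nested dictionary is a hypothetical stand-in for a real COS tree, the content stream is assumed to be already uncompressed, and the regular expression only handles simple literal strings shown with the Tj operator (no escapes, TJ arrays or font encodings).

```python
import re

# Hypothetical in-memory stand-in for a PDF's COS object tree.
# A real file would be read through a PDF library, and the Kids array
# may contain nested page tree nodes rather than pages directly.
catalog = {
    "Type": "Catalog",
    "Pages": {
        "Type": "Pages",
        "Count": 1,
        "Kids": [
            {
                "Type": "Page",
                "Resources": {"XObject": {"Im0": b"<image stream bytes>"}},
                "Contents": b"BT /F1 12 Tf 72 720 Td (Hello) Tj ( world) Tj ET",
            }
        ],
    },
}


def page_text(catalog, page_index):
    """Follow Catalog -> Pages -> Page(n) -> Contents and pull out Tj strings."""
    pages = catalog["Pages"]                      # step 2: the Pages dictionary
    page = pages["Kids"][page_index]              # step 3: the page of interest
    stream = page["Contents"].decode("latin-1")   # step 4: the Contents stream
    # Collect the operand of every "(...) Tj" text-showing operation.
    return "".join(re.findall(r"\(([^()]*)\)\s*Tj", stream))


text = page_text(catalog, 0)
print(text)
```

The image data would be reached the same way, through the page's "Resources" entry instead of "Contents".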
If you want to get a better understanding of the COS object tree and its structure, you can view it for the currently viewed PDF using Acrobat's "Preflight" utility. Under the Advanced menu: Preflight... | options | Browse Internal PDF structure...
And of course, flipping through the official PDF specification is a good idea.
Hope that helps!
I am creating a report with SAS STP and I want to display an image (a logo) on the report. Here is what is happening:
data _null_;
file _webout;
put '<html>';
put '</html>';
run;
I am PUTting HTML because I have complex table formats to display. I am not using %STPBEGIN and %STPEND because they open an ODS stream, which frankly I do not know how to handle and which was causing me problems. Not using %STPBEGIN means writing the code above. This has been a very successful mechanism for me: I can show beautiful reports with CSS and everything. The only problem is images. A client has recently asked for a logo on every report. I thought this was going to be easy, but it has not been. I tried to use an <img src="" /> tag with a relative path, expecting my image to show. This technique both succeeded and failed:
I added an image to a folder location using SAS Management Console
and used its relative path '/Products/SAS Enterprise GRC/****' (didn't work).
I copied an image to the default theme's images folder under Web/Staging/*** and tried to use the relative path (didn't work). So I tried to use other images from the default theme. That worked.
I am stuck. How can I use a custom image here?
If your image is static, you can embed it into your results using a datastep without having to copy files to the server.
The trick to doing this is to encode the image into Base64 encoding, then you can embed the image into an <img src="" /> statement by using this magical notation:
<img src="data:image/png;base64,...." />
You can see that the src= attribute contains metadata to tell the browser that the value contains image data, that represents a png file (I used a png file when testing this post, you may have a JPG/BMP etc...) and that the value is encoded using base64. The 4 periods at the end would be replaced by your image data represented in base64 notation. This would look something like this:
<img src="data:image/png;base64,iVBORw0KGgoAAAAN
... much much more base64 content here ...
HSLyz+h9xy+7HbHRL83L1tv9h8+4d/+Ic/Gf8DiYav3mpqHAMAAAAASUVORK5CYII=" />
Converting your image to base64 is simple. You can simply google for an "online base64 image converter" such as this one. Drag and drop your image and it will produce your base64 code for you.
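If you would rather not paste your image into a website, the same conversion can be done locally. Below is a minimal sketch in Python (not SAS); the function name to_data_uri is made up for illustration, and the 8-byte PNG signature stands in for the contents of a real image file.

```python
import base64


def to_data_uri(image_bytes, mime_type="image/png"):
    """Return a data: URI that embeds image_bytes as base64 text."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Stand-in for open("logo.png", "rb").read(): just the PNG file signature.
png_signature = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

img_tag = f'<img src="{to_data_uri(png_signature)}" />'
print(img_tag)
```

This also shows why every base64-encoded PNG starts with the same "iVBORw0KGgo" characters seen in the examples here: they are the encoded file signature.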
To get this into a datastep in SAS, it's simply a case of:
data _null_;
file _webout;
put '<html>';
put '<img src="data:image/png;base64,iVBORw0KGgoAAAAN......etc..." />';
put '</html>';
run;
If your image is particularly big (say greater than ~32k) you may run into issues trying to output it from a datastep. I probably need to test this to clarify. You can work around this by reading the base64 image from a file in SAS and streaming it directly to _webout, using code similar to below:
data _null_;
file _webout;
infile '\path\to\base64\file.ext';
input;
put _infile_;
run;
If you want to get really tricky, you can take any image you like (such as a chart generated in SAS) and convert it to base64 on the fly, then stream it out. Here is some SAS code that will take an image file and convert it to Base64:
data _null_;
length base64_format $20 base64_string $32767;
infile "\your_sasdir\hi.png" recfm=n;
file "\your_sasdir\hi.base64";
input byte $16000. ;
* FORMAT LENGTH NEEDS TO BE 4n/3 ROUNDED UP TO NEAREST MULTIPLE OF 4;
format_length = 4*(lengthn(byte)/3);
mod = mod(format_length,4);
if mod ne 0 then do;
format_length = format_length - mod + 4;
end;
base64_format = cats("$base64x",format_length,".");
base64_string = putc(cats(byte), base64_format);
put base64_string;
run;
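The length rule in the comment above (4n/3, rounded up to the nearest multiple of 4) can be checked independently. Here is a small Python sketch, with a hypothetical helper name, that computes the encoded length and compares it with what a standard base64 encoder produces:

```python
import base64


def base64_length(n_bytes):
    """Length of the base64 text for n_bytes of input, padding included."""
    # Every group of up to 3 input bytes becomes exactly 4 output characters.
    return 4 * ((n_bytes + 2) // 3)


# Compare the formula with a real encoder for a range of input sizes.
for n in range(0, 64):
    assert len(base64.b64encode(bytes(n))) == base64_length(n)

print(base64_length(16000))
```

Integer arithmetic is used here to avoid the fractional intermediate value that 4*(n/3) produces when n is not a multiple of 3.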
Here is the image I used to test this with:
Once converted, the Base64 representation should look like:
iVBORw0KGgoAAAANSUhEUgAAABQAAAAUCAIAAAAC64paAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAABaSURBVDhP5YxbCsAgDAS9/6XTvJTWNUSIX3ZAYXcdGxW4QW6Khw42Axne81LG0shlRvVVLyeTI2aZ2fcPyXwPdBI8B999NK/gKTaGyxaMX8gTJRkpyREFmegBTt8lFJjOey0AAAAASUVORK5CYII=
I'm going to see if I can find a way to streamline this as this is something we do frequently at work.
EDIT: Interestingly, SAS 9.4 seems to support doing this directly using ODS HTML5 in conjunction with the inline option. See the doc here.
See also this post, by Don Henderson, that provides a similar way to approach this problem. Thanks to Vasilij for the link.
When you define pictures in SAS metadata, they can be accessed via the SAS Content Server.
To get the picture URL, log into 'https://severhost/SASContentServer/repository/default/sasfolders' and search for your picture.
If you defined your picture in the folder /Products/SAS Enterprise GRC/PictureName.gif, it should be accessible at the address 'https://severhost/SASContentServer/repository/default/sasfolders/Products/SAS Enterprise GRC/PictureName.gif(Report)'
Of course, you have to remember that the customer's user needs permission in SAS Metadata to read the picture object.
If this doesn't solve your problem, please say which version of SAS software you are using.
I had a similar problem once. I added the image to our intranet, which happened to be SharePoint at the time. I gave that image a public access level and then referenced it in all my reports.
The idea is that, since the report is only for an internal audience, everyone will have access to the intranet, but not necessarily to the Content Server, so it circumvents the problem that Bagin mentioned.
If you don't have a suitable intranet, you could always reference a logo from your public website, which is probably available to all of your audience even if they are external. But then you don't have control over that logo file, and one day it might change in some undesirable way.
Regards,
Vasilij
Using SASjs you can compile ANY binary content into a SAS web service (Stored Process or Viya Job).
Here's an example using an MP3 file: https://github.com/allanbowe/sasrap
I'm working with iTunes via AppleScript. The artwork element of a track contains image data (or raw data, which appears in practice to return the same thing), which can be retrieved and, say, directly written to a file. (It's a bytestream of, e.g., PNG data.)
But I don't know how to do anything with this thing besides write it to a file. I'd like to ask it how many bytes it contains, or even rummage through it (though the latter may well be out of scope for AppleScript). In Script Debugger, it looks like «data tdtaXXXXXX.....» (hex values where I wrote the XXXs), and the iTunes scripting dictionary doesn't link through to any useful type/class for it.
I'm not really sure what the guillemets mean in AppleScript, or what the nature of this object is, or whether this thing can be interrogated natively. Any references on this would be helpful. Thanks!
See https://books.google.com/books?id=rW5k0w_wC3MC&pg=PA57&lpg=PA57&dq=guillemets+applescript+events+data&source=bl&ots=ogzi9W4jxW&sig=7ct-n0wpzdhBhtHDJtTrZDKgEEk&hl=en&sa=X&ei=-qSYVICZAsjooASo0oKwCg&ved=0CB4Q6AEwAA#v=onepage&q=guillemets%20applescript%20events%20data&f=false for an explanation of raw codes and data and the use of guillemets in AppleScript. See this answer:
Getting artwork from current track in Applescript
for an example of writing image data from iTunes artwork to file.
I am dealing with a few DICOM and DICONDE images (.dcm) and want to add new tags to those images.
I am using DICOM Browser to check the meta information of the image. It allows me to edit the value, but I also want to modify the tag name, for example from Patient ID to Component ID.
I'm wondering if I have to change that in the dictionary. Where can I find the dictionary, and how can I modify it to add or edit tags?
Regards
Vish
The dictionary that links each tag's (group, element) pair to a tag name is not saved in the DICOM data. It is defined privately by the editor/viewer, in accordance with the DICOM standard.
There's no way to change it.
As CharlesB points out, the tag keys carry no semantic value descriptions. They're just numbers as specified in the DICOM standard. Manufacturers often add custom fields, but since the meaning of these values is not explicit in the DICOM files, they have to tell you what the values mean. It's usually frustrating to deal with this custom data. Please don't do this unless you have a good reason.
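To illustrate the point made in both answers, here is a toy Python sketch. It does not use any DICOM library; the TAG_NAMES table and the describe function are hypothetical stand-ins for a viewer's built-in dictionary and its display logic. The file itself only carries the numeric (group, element) pair and the value.

```python
# Hypothetical viewer-side dictionary: (group, element) -> display name.
# (0010,0010) and (0010,0020) are the standard tags for patient name and ID.
TAG_NAMES = {
    (0x0010, 0x0010): "Patient's Name",
    (0x0010, 0x0020): "Patient ID",
}


def describe(tag, names=TAG_NAMES):
    """Format a tag the way a viewer might: numbers from the file, name from its own table."""
    group, element = tag
    label = names.get(tag, "Unknown")
    return f"({group:04X},{element:04X}) {label}"


print(describe((0x0010, 0x0020)))

# Relabelling only changes what this particular program displays;
# the bytes stored in the .dcm file are unaffected.
custom_names = {**TAG_NAMES, (0x0010, 0x0020): "Component ID"}
print(describe((0x0010, 0x0020), custom_names))
```

Any other viewer opening the same file would still show the name from its own dictionary.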
We have a website which should be translated into different languages. Some of the wording is already in message properties files, ready for translation. I now want to add the rest of the text to these files.
What is a good way to name the text blocks?
<view>.<type>.<name>
We mostly have webpages and some of the elements/modules are repeating on some sites.
As far as I know, no "standard" exists, so it is pretty hard to say what is a proper or improper way of naming resource keys. However, based on my experience, I can recommend this approach:
property file name: <module>.properties
resource keys: <view or dialog>[.<sub-context>].<control-type>.<name>
We could debate whether it is right to put all the strings from one module into one property file. It is probably fine if updates don't happen often and there are not too many messages. Otherwise you might think about one file per view.
As for the key naming strategy: it is important for the Translator (sounds like a film with the honorable governor Arnold S., doesn't it?) to have context. The translation may actually depend on it. For example, in Polish you would translate a message one way if it is a page/dialog/whatever title, and in a totally different way if it is the text on a button.
One example of such resource key could be:
preferences.password_area.label.username=User name
It gives enough hints to the Translator about what it actually is, which could result in correct translation...
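If you want to enforce a pattern like this, a small check can flag keys that give the translator no context. The following is only a sketch in Python: the list of control types (label, button, title, message) and the lowercase-with-underscores rule are assumptions made for the example, not part of the recommendation above.

```python
import re

# <view or dialog>[.<sub-context>].<control-type>.<name>
# The control types listed here are an assumed, illustrative set.
KEY_PATTERN = re.compile(
    r"^[a-z0-9_]+(\.[a-z0-9_]+)?\.(label|button|title|message)\.[a-z0-9_]+$"
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        entries[key.strip()] = value.strip()
    return entries


def invalid_keys(entries):
    """Return the keys that do not follow the naming pattern."""
    return [key for key in entries if not KEY_PATTERN.match(key)]


sample = """
# preferences module
preferences.password_area.label.username=User name
preferences.button.save=Save
username=User name
"""

entries = parse_properties(sample)
print(invalid_keys(entries))
```

A check like this could run in the build so that context-free keys such as the last one never reach the translator.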
We have come up with the following key naming convention (Java, btw) using dot notation and camel case:
Label Keys (form labels, page/form/app titles, etc...i.e., not full sentences; used in multiple UI locations):
If the label represents a Java field (i.e., a form field) and matches the form label: label.nameOfField
Else: label.sameAsValue
Examples:
label.firstName = First Name
label.lastName = Last Name
label.applicationTitle = Application Title
label.editADocument = Edit a Document
Content Keys:
projectName.uiPath.messageOrContentType.n.*
Where:
projectName is the short name of the project (or a derived name from the Java package)
uiPath is the UI navigation path to the content key
messageOrContentType (e.g., added, deleted, updated, info, warning, error, title, content, etc.) should be added based on the type of content. Example messages: (1) The page has been updated. (2) There was an error processing your request.
n.* handles the following cases: When there are multiple content areas on a single page (e.g., when the content is separated by, an image, etc), when content is in multiple paragraphs or when content is in an (un)ordered list - a numeric identifier should be appended. Example: ...content.1, ...content.2
When there are multiple content areas on a page and one or more need to be further broken up (based on the HTML example above), a secondary numeric identifier may be appended to the key. Example: ...content.1.1, ...content.1.2
Examples:
training.mySetup.myInfo.content.1 = This is the first sentence of content 1. This is the second sentence of content 1. This content will be surrounded by paragraph tags.
training.mySetup.myInfo.content.2 = This is the first sentence of content 2. This is the second sentence of content 2. This content will also be surrounded by paragraph tags.
training.mySetup.myInfo.title = My Information
training.mySetup.myInfo.updated = Your personal information has been updated.
Advantages / Disadvantages:
+ Label keys can easily be reused; location is irrelevant.
+ For content keys that are not reused, locating the page on the UI will be simple and logical.
- It may not be clear to translators where label keys reside on the UI. This may be a non-issue for translators who do not navigate the pages, but may still be an issue for developers.
- If content keys must be used in more than one location on the UI (which is highly likely), the key name choice will not make sense in the other location(s). In our case, management is not concerned with a duplication of values for content areas, so we will be using different keys (to demonstrate the location on the UI) in this case.
Feedback on this convention - especially feedback that will improve it - would be much appreciated since we are currently revamping our resource bundles! :)
I'd propose the below convention
functionalcontext.subcontext.key
logicalcontext.subcontext.key
This way you can logically group all the common messages in a super context (id in the example below). There are a few things that aren't specific to any functional context (like lastName, etc.), which you can group into a logical context.
order.id=Order Id
order.submission.submit=Submit Order
name.last=Last Name
The method that I have personally used, and that I've liked best so far, is using the sentence to localise as the key (please replace T with the right syntax for your language).
For example:
print(T("Hello world"))
In this case T will search for the key "Hello world". If the key is found, its value is returned; otherwise the key itself is returned.
This way, you do not need to edit the message (in your default language) unless you need to use parameters. It saved me a LOT of dev time.
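A minimal Python sketch of such a T function is shown below. The TRANSLATIONS table, the language codes and the default language argument are assumptions made for the example; a real project would load the catalog from its resource files.

```python
# Hypothetical catalog: language code -> {source sentence: translation}.
TRANSLATIONS = {
    "pl": {"Hello world": "Witaj świecie"},
}


def T(sentence, lang="pl", catalog=TRANSLATIONS):
    """Return the translation of sentence, or the sentence itself if none exists."""
    return catalog.get(lang, {}).get(sentence, sentence)


print(T("Hello world"))    # key found: the translated value is used
print(T("Goodbye"))        # key missing: falls back to the sentence itself
print(T("Hello world", lang="de"))  # unknown language: same fallback
```

The fallback is what makes the default language work without any resource file at all.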
I want to know which approach is better for saving web page content to a database for caching:
Using the ntext data type and saving the content as a flat string
Using ntext, but compressing the content before saving
Using varbinary(MAX) to save the content (how can I convert a flat string to binary? ;-))
Another approach that you would suggest
UPDATE
In more depth: I have many tables (URLs, Caches, ParsedContents, Words, Hits, etc.). For each URL in the URLs table I send a request and save the response into the Caches table. This is the downloader section of my engine (the equivalent of Google's URLResolver). The indexer section then performs the parsing and related tasks. Compression and decompression are performed only when new content is about to be cached or parsed.
The better approach would be to use the built-in caching features in ASP.NET. Searching StackOverflow for [asp.net] [caching] is a good start, and after (or before) that, similar searches on both www.asp.net and Google will get you quite far.
In response to your comment, I would probably save the data as a flat string. It might not be the best option performance-wise when it comes to storage, but if you're going to perform searches on the text content, you don't want to have to compress/decompress or convert to/from binary every time, since there is probably no (easy) way to do this inside SQL Server. Just make sure you have all the indexes and full-text features you need set up correctly.
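For completeness, the compressed/varbinary option from the question comes down to encoding the string to bytes and compressing those bytes before storing them. The sketch below uses Python and zlib purely as an illustration (the question's stack is ASP.NET and SQL Server, which have their own equivalents); the function names are made up for the example.

```python
import zlib


def to_blob(page_text):
    """Encode the page as UTF-8 bytes and compress them for binary storage."""
    return zlib.compress(page_text.encode("utf-8"))


def from_blob(blob):
    """Reverse of to_blob: decompress and decode back to the original string."""
    return zlib.decompress(blob).decode("utf-8")


# Repetitive markup, as web pages tend to be, compresses well.
html = "<p>cached page content</p>" * 200
blob = to_blob(html)

print(len(html.encode("utf-8")), len(blob))
```

The trade-off mentioned above still applies: the stored bytes cannot be searched as text until they have been decompressed again.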