Is there a way to find the PDF version of a file in Xamarin? [duplicate]

I have a Windows .NET application that manages many PDF files. Some of the files are corrupt.
Two issues; I'll try to explain in my imperfect English... sorry.
1.)
How can I detect whether a PDF file is correct?
I want to read the header of the PDF and detect whether it is valid.
var okPDF = PDFCorrect(@"C:\temp\pdfile1.pdf");
2.)
How can I tell whether a byte[] (byte array) of a file is a PDF file or not?
For example, for ZIP files, you could examine the first four bytes and see if they match the local header signature, i.e. in hex
50 4b 03 04
if (buffer[0] == 0x50 && buffer[1] == 0x4b && buffer[2] == 0x03 &&
buffer[3] == 0x04)
If you are loading it into a long, this is 0x04034b50. (by David Pierson)
I want the same for PDF files.
byte[] dataPDF = ...
var okPDF = PDFCorrect(dataPDF);
Any sample source code in .NET?

I check the PDF header like this:
using System.IO;
using System.Text;

public bool IsPDFHeader(string fileName)
{
    // Read the first five bytes; a conforming PDF starts with "%PDF-"
    // (hex: 25 50 44 46 2D).
    byte[] buffer = new byte[5];
    using (var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    {
        if (fs.Read(buffer, 0, buffer.Length) != buffer.Length)
            return false;
    }
    string header = new ASCIIEncoding().GetString(buffer);
    return header.StartsWith("%PDF-");
}

a. Unfortunately, there is no easy way to determine whether a PDF file is corrupt. Usually the problem files have a correct header, so the real causes of corruption are different. A PDF file is effectively a dump of PDF objects. The file contains a cross-reference table giving the exact byte offset of each object from the start of the file. So corrupted files most probably have broken offsets, or some object may be missing.
The best way to detect a corrupted file is to use a specialized PDF library.
There are lots of both free and commercial PDF libraries for .NET. You may simply try to load the PDF file with one of them. iTextSharp would be a good choice.
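For instance, a minimal load test might look like this (a sketch assuming the iTextSharp PdfReader API; any exception while opening the file is taken as a sign of corruption):
using System;
using iTextSharp.text.pdf;

public static class PdfChecker
{
    // Attempts to open the file with iTextSharp; failure suggests corruption.
    public static bool CanLoadPdf(string path)
    {
        try
        {
            PdfReader reader = new PdfReader(path);
            int pages = reader.NumberOfPages; // constructing the reader already parses the xref table
            reader.Close();
            return pages > 0;
        }
        catch (Exception)
        {
            return false;
        }
    }
}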
b. According to the PDF reference, the header of a PDF file usually looks like %PDF-1.X (where X is a digit, currently from 0 to 7), and 99% of PDF files have such a header. However, there are some other kinds of headers that Acrobat Viewer accepts, and even the absence of a header isn't a real problem for PDF viewers. So you shouldn't treat a file as corrupted if it does not contain a header.
E.g., the header may appear somewhere within the first 1024 bytes of the file, or be in the form %!PS-Adobe-N.n PDF-M.m.
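A hedged sketch of such a relaxed check (my own illustration, not from the PDF reference: it scans the first 1024 bytes for either signature instead of only checking offset 0):
using System.IO;
using System.Text;

public static bool HasPdfSignature(string fileName)
{
    // Viewers tolerate the signature anywhere within the first 1024 bytes.
    byte[] buffer = new byte[1024];
    int read;
    using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    {
        read = fs.Read(buffer, 0, buffer.Length);
    }
    string head = Encoding.ASCII.GetString(buffer, 0, read);
    return head.Contains("%PDF-") || head.Contains("%!PS-Adobe-");
}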
Just for your information I am a developer of the Docotic PDF library.

Well-behaving PDFs start with the first nine bytes %PDF-1.x plus a newline (where x is in 0..7). 1.x is supposed to give you the version of the PDF file format. The second line contains some binary bytes that help applications (editors) identify the PDF as a non-ASCII-text file type.
However, you cannot trust this tag at all. There are lots of applications out there that use features from PDF-1.7 but claim to be PDF-1.4, thus misleading some viewers into spitting out invalid error messages. (Most likely these PDFs are the result of a mismanaged conversion of the file from a higher to a lower PDF version.)
There is no section called a "header" in PDF (maybe the initial nine bytes of %PDF-1.x are what you meant by "header"?). There may be a structure embedded in the PDF for holding metadata, giving you info about Author, CreationDate, ModDate, Title and some other items.
My way to reliably check for PDF corruption
There is no other way to check for validity and un-corrupted-ness of a PDF than to render it.
A "cheap" and rather reliable way to check for such validity for me personally is to use Ghostscript.
However: you want this to happen fast and automatically. And you want to use the method programatically or via a scripted approach to check many PDFs.
Here is the trick:
Don't let Ghostscript render the file to a display or to a real (image) file.
Use Ghostscript's nullpage device instead.
Here's an example commandline:
gswin32c.exe ^
-o nul ^
-sDEVICE=nullpage ^
-r36x36 ^
"c:/path to /input.pdf"
This example is for Windows; on Unix use gs instead of gswin32c.exe and -o /dev/null.
Using -o nul -sDEVICE=nullpage will not output any rendering result, but all the stdout and stderr output produced while Ghostscript processes input.pdf will still appear in your console. -r36x36 sets the resolution to 36 dpi to speed up the check.
%errorlevel% (or $? on Linux) will be 0 for an uncorrupted file and non-0 for corrupted files. Any warning or error messages appearing on stdout may help you identify problems with the input.pdf.
There is no other way to check for a PDF file's corruption than to somehow render it...
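To drive this Ghostscript check from .NET code, here is a rough sketch (assuming Ghostscript is installed and gswin32c.exe is on the PATH; only the exit code is inspected):
using System.Diagnostics;

public static bool PdfRendersCleanly(string pdfPath)
{
    var psi = new ProcessStartInfo
    {
        FileName = "gswin32c.exe",   // assumption: Ghostscript is on the PATH
        Arguments = "-o nul -sDEVICE=nullpage -r36x36 \"" + pdfPath + "\"",
        UseShellExecute = false,
        RedirectStandardOutput = true,
        RedirectStandardError = true
    };
    using (Process p = Process.Start(psi))
    {
        p.StandardOutput.ReadToEnd(); // drain output so the process cannot block on a full buffer
        p.StandardError.ReadToEnd();
        p.WaitForExit();
        return p.ExitCode == 0;       // 0 = uncorrupted, non-0 = corrupted
    }
}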
Update: Meanwhile not only %PDF-1.0, %PDF-1.1, %PDF-1.2, %PDF-1.3, %PDF-1.4, %PDF-1.5, %PDF-1.6 and %PDF-1.7 are valid version indicators, but also %PDF-2.0.

The first line of a PDF file is a header identifying the version of the PDF specification to which the file conforms: %PDF-1.0, %PDF-1.1, %PDF-1.2, %PDF-1.3, %PDF-1.4, etc.
You could check this by reading some bytes from the start of the file and seeing whether the header is there for a match as a PDF file. See the PDF Reference from Adobe for more details.
I don't have a .NET example for you (I haven't touched the thing in some years now), but even if I had, I'm not sure you could check that the complete content of the file is valid. The header might be OK but the rest of the file might be messed up (as you said yourself, some files are corrupt).

You could use iTextSharp to open and attempt to parse the file (e.g. try to extract text from it), but that's probably overkill. You should also be aware that it's licensed under the GNU Affero GPL unless you purchase a commercial licence.

Checking the header is tricky. Some of the code above simply won't work, since not all PDFs start with %PDF. Some PDFs that open correctly in a viewer start with a BOM marker; others start like this:
------------e56a47d13b73819f84d36ee6a94183
Content-Disposition: form-data; name="par"
...etc.
So checking for "%PDF" will not work.

What I do is (see the sketch after this list):
1. Validate the extension.
2. Open the PDF file, read the header (first line) and check whether it contains the string "%PDF-".
3. Check whether the file declares a page count by searching for occurrences of "/Page" (a PDF file should always have at least one page).
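A rough C# sketch of those three checks (my own illustration; note that the "/Page" search is only a heuristic, and reading the whole file may be wasteful for very large PDFs):
using System;
using System.IO;
using System.Text;

public static bool LooksLikePdf(string path)
{
    // 1. Validate the extension.
    if (!path.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
        return false;

    // Decode as ASCII: binary bytes become '?', but the ASCII markers we look for survive.
    string content = File.ReadAllText(path, Encoding.ASCII);

    // 2. The header (first line) should start with "%PDF-".
    if (!content.StartsWith("%PDF-"))
        return false;

    // 3. Heuristic: a valid PDF should declare at least one /Page object.
    return content.Contains("/Page");
}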
As suggested earlier, you can also use a library to read the file:
Reading PDF File Using iTextSharp

Related

Changing document Signature (magic numbers)

For my project I'm experimenting with disguising the content of a file, and I thought a good way to do this would be to change the document signature (magic numbers). I think that in order to do this I need to change the starting x bytes of the file, but I'm not sure whether that is possible. I've tried looking at the file I want to change in various hex viewers such as Autopsy, but they strip back all the metadata and only show the content of the file and the corresponding hex. My question: is it possible to change the signature, and if so, what is the best way to go about it? Any program recommendations?
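For what it's worth, overwriting the leading bytes in place is straightforward with any random-access file API. A hedged C# illustration (the replacement signature below is arbitrary, and the change is destructive, so work on a copy):
using System.IO;

public static void OverwriteSignature(string path, byte[] newMagic)
{
    // Open for read/write and replace the first newMagic.Length bytes in place.
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
    {
        fs.Write(newMagic, 0, newMagic.Length);
    }
}

// Example: stamp the four ASCII bytes "%PDF" over the original signature.
// OverwriteSignature(@"C:\temp\copy-of-file.bin", new byte[] { 0x25, 0x50, 0x44, 0x46 });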

Protobuf message - parsing difference between binary and text files

During my implementation of a protocol buffer application, I tried to work with the text pbtxt files to ease my programming. The idea was to switch to the pb binary format afterwards, once I had a clearer understanding of the API. (I am working in C++.)
I made my application work by importing the file with TextFormat::Parse (the content of the file came from TextFormat::Print). I then generated the corresponding binary file, which I tried to import with myMessageVariable.ParseFromCodedStream (file not compressed). But I noticed that only a very small part of the message is imported. myMessageVariable.IsInitialized returns true, so I guess the library "thinks" it has completely imported the file.
So my question: is there something different in the way the files are imported that could make the import "half-fail"? (Besides the obvious reason that one is binary and the other is text.) And what can we do about it?
There are a few differences in reading text data and reading binary data:
Text files sometimes use automatic linefeed conversion (\r\n vs. \n), especially on Windows platforms. This has to be disabled by opening the file in binary mode.
Binary files can contain null bytes at any point. Some text processing functions stop reading at the first null byte.
It would help if you could determine how much of the message gets parsed. Then you can look at what kind of bytes are near the problem point, using e.g. a hex editor.

Include pictures while converting ps1 to exe with PowerGUI

I use the PowerGUI editor to convert a ps1 file to an exe file. The reason is that I don't want people to see my source code. The script includes its own little GUI with a picture on it. My problem is that after converting the script to an exe file, the picture is only shown when it exists at a specific path. If I delete or move the picture from that path, it won't be shown when starting the exe.
How can I include the picture in the exe? I want to have only one file in the end ...
One way you could do this is by converting your image into a Base64 string, using the following:
[convert]::ToBase64String((get-content C:\YourPicture.jpg -encoding byte)) > C:\YourString.txt
With the string that is produced in the text file C:\YourString.txt, you can copy and paste it into your code and load it into a picture object on the form like so:
$logoBytes = [System.Convert]::FromBase64String('
iVBORw0KGgoAAAANSUhEUgAAAfQAAACMCAYAAACK0FuSAAAABGdBTUEAALGPC/xhBQAAAAlwSFlz
AAAXEQAAFxEByibzPwAAAAd0SU1FB98DFA8VLc5RQx4AANRpSURBVHhe7J0FmBTX0oZ7F1jc3d3d
nU+dOtXw4MGDZfft25fmyJEjPu+//76Xqzke8pCHPOQhD3lIQGxxcGSapvHdd9+FF3hHEdjGFHgn
....... Many more lines of string .......
OqelFQzDMAzD/CZoztADbUwhUm0JERoXCNfEQYhyPAQryiBIUQSBiiTwl1sb3skwDMMwzG+KWLVe
jTEBH6U7JvJ0CJQXoaGPQMWBm5W+Va27k14MwzDMfwCA/wfUstOLO+nBIAAAAABJRU5Jggg==')
# FromBase64String returns a byte array; wrap it in a MemoryStream to build an Image.
$ms = New-Object System.IO.MemoryStream (,$logoBytes)
$logo.Image = [System.Drawing.Image]::FromStream($ms)
Doing this means that your image is stored within the code itself and doesn't need to be loaded from anywhere.
Note: Make sure the picture's size on disk is as small as you can get it, as producing the string can take some time and can turn out thousands of lines long. So I would recommend that you only use a picture that is less than 75 kilobytes in size. You could do it with a larger one, but it will take a long time to process.

Methods of Parsing Large PDF Files

I have a very large PDF file (200,000 KB or more) which contains a series of pages containing nothing but tables. I'd like to somehow parse this information using Ruby and import the resulting data into a MySQL database.
Does anyone know of any methods for pulling this data out of the PDF? The data is formatted in the following manner:
Name | Address | Cash Reported | Year Reported | Holder Name
Sometimes the Name field overflows into the address field, in which case the remaining columns are displayed on the following line.
Due to the irregular format, I've been stuck on figuring this out. At the very least, could anyone point me to a Ruby PDF library for this task?
UPDATE: I accidentally provided incorrect information! The actual size of the file is 300 MB, or 300,000 KB. I made the change above to reflect this.
I assume you can copy'n'paste text snippets without problems when your PDF is opened in Acrobat Reader or some other PDF viewer?
Before trying to parse and extract text from such monster files programmatically (even if it's only 200 MB; for simple text in tables that's huge, unless you have 200,000 pages...), I would proceed like this:
Try to sanitize the file first by re-distilling it.
Try with different CLI tools to extract the text into a .txt file.
This is a matter of minutes. Writing a Ruby program to do the same is certainly a matter of hours, days or weeks (depending on your knowledge of the PDF file format internals... I suspect you don't have much experience with that yet).
If "2." works, you may already be halfway done. You will also know that doing it programmatically with Ruby is a job that can, in principle, be solved. If "2." doesn't work, you know it may be extremely hard to achieve programmatically.
Sanitize the 'Monster.pdf':
I suggest using Ghostscript. You can also use Adobe Acrobat Distiller if you have access to it.
gswin32c.exe ^
-o Monster-PDF-sanitized ^
-sDEVICE=pdfwrite ^
-f Monster.pdf
(I'm curious how much that single command will make your output PDF shrink compared to the input.)
Extract text from PDF:
I suggest first trying pdftotext.exe (from the Xpdf folks). There are other, somewhat more inconvenient methods available too, but this might do the job already:
pdftotext.exe ^
-f 1 ^
-l 10 ^
-layout ^
-eol dos ^
-enc Latin1 ^
-nopgbrk ^
Monster-PDF-sanitized.pdf ^
first-10-pages-from-Monster-PDF-sanitized.txt
This will not extract all pages, only pages 1-10 (for proof of concept, to see whether it works at all). To extract from every page, just leave off the -f 1 -l 10 parameters. You may need to tweak the encoding by changing the parameter to -enc ASCII7 (or UTF-8, UCS-2).
If this doesn't work the quick'n'easy way (because, as sometimes happens, some font in the original PDF uses a "custom encoding vector"), you should ask a new question describing the details of your findings so far. Then you will need to resort to bigger calibres to shoot down the problem.
At the very least, could anyone point me to a Ruby PDF library for this task?
If you haven't done so, you should check out the two previous questions: "Ruby: Reading PDF files" and "ruby pdf parsing gem/library". PDF::Reader, PDF::Toolkit and Docsplit are some of the relatively popular suggested libraries. There is even a suggestion of using JRuby and some Java PDF parser library.
I'm not sure whether any of these solutions is actually suitable for your problem, especially as you are dealing with such huge PDF files. So unless someone offers a more informative answer, perhaps you should select a library or two and take them for a test drive.
This will be a difficult task, as rendered PDFs have no concept of tabular layout, just lines and text at predetermined locations. It may not be possible to determine what the rows and columns are, though it may depend on the PDF itself.
The Java libraries are the most robust and may do more than just extract text, so I would look into JRuby with iText or PDFBox.
Check whether there is any structured content in the PDF. I wrote a blog article explaining this at http://www.jpedal.org/PDFblog/?p=410
If not, you will need to build it.
Maybe the Prawn Ruby library?

Is there a way to infer what image format a file is, without reading the entire file?

Is there a good way to see what format an image is without having to read the entire file into memory?
Obviously this varies from format to format (I'm particularly interested in TIFF files), but what sort of procedure would be useful to determine what kind of image format a file is without having to read through the entire file?
BONUS: What if the image is a Base64-encoded string? Any reliable way to infer it before decoding it?
Most image file formats have unique bytes at the start. The Unix file command looks at the start of the file to see what type of data it contains. See the Wikipedia article on magic numbers in files and magicdb.org.
Sure there is. As the others have mentioned, most images start with some sort of "magic", which always translates to a fixed Base64 prefix. A couple of examples:
A bitmap will start with Qk (the third character depends on the following byte)
A JPEG will start with /9j/
A GIF will start with R0l (that's a zero as the second character)
And so on. It's not hard to take the different image types and figure out what they encode to. Just be careful, as some formats have more than one piece of magic, so you need to account for them in your Base64 "translation code".
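You can derive these prefixes yourself by Base64-encoding the known magic bytes; a small C# demonstration (the signatures shown are the standard JPEG and GIF ones):
using System;

class Base64MagicDemo
{
    static void Main()
    {
        // JPEG files begin with FF D8 FF; GIF files begin with the ASCII bytes "GIF".
        Console.WriteLine(Convert.ToBase64String(new byte[] { 0xFF, 0xD8, 0xFF })); // prints "/9j/"
        Console.WriteLine(Convert.ToBase64String(new byte[] { 0x47, 0x49, 0x46 })); // prints "R0lG"
    }
}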
Either use file on the *nix command line or read the initial bytes of the file. Most files come with a unique header in the first few bytes. For example, a TIFF header looks something like this: 0x00000000: 4949 2a00 0800 0000
For more information on the TIFF file format specifically, and what those bytes stand for, go here.
TIFFs will begin with either II or MM (Intel or Motorola byte ordering).
The TIFF 6 specification can be downloaded here and isn't too hard to follow.
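A minimal C# sketch of such sniffing, reading only the first four bytes (the TIFF, JPEG and GIF signatures used here are the documented ones; extend the table as needed):
using System.IO;

public static string SniffImageFormat(string path)
{
    byte[] b = new byte[4];
    using (FileStream fs = File.OpenRead(path))
    {
        if (fs.Read(b, 0, b.Length) < b.Length)
            return "unknown";
    }
    // TIFF: "II*\0" (little-endian) or "MM\0*" (big-endian).
    if ((b[0] == 0x49 && b[1] == 0x49 && b[2] == 0x2A && b[3] == 0x00) ||
        (b[0] == 0x4D && b[1] == 0x4D && b[2] == 0x00 && b[3] == 0x2A))
        return "TIFF";
    if (b[0] == 0xFF && b[1] == 0xD8)                 // JPEG SOI marker
        return "JPEG";
    if (b[0] == 0x47 && b[1] == 0x49 && b[2] == 0x46) // ASCII "GIF"
        return "GIF";
    return "unknown";
}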
