I compress the uploaded .pdf files and save them to the server's file system. Everything works well, but the file gets bigger, from 30 KB to 48 KB. What could I be doing wrong? Here's the part of the code where I compress the uploaded file:
using (FileStream sourceFile = File.OpenRead(filePath))
using (FileStream destFile = File.Create(zipPath))
using (GZipStream compStream = new GZipStream(destFile, CompressionMode.Compress))
{
    int theByte = sourceFile.ReadByte();
    while (theByte != -1)
    {
        compStream.WriteByte((byte)theByte);
        theByte = sourceFile.ReadByte();
    }
}
I guess the problem is with GZipStream here. I used DotNetZip instead, and the file now gets smaller as expected.
The compression algorithms for the System.IO.Compression.DeflateStream and System.IO.Compression.GZipStream classes have improved so that data that is already compressed is no longer inflated. This results in much better compression ratios. Also, the 4-gigabyte size restriction for compressing streams has been removed.
It has been fixed in .NET 4.
The PDF standard already compresses the different parts of the file, so what you get is probably an already compressed file. And you know what happens when you try to compress a compressed file? It's the same as if you were trying to compress a ZIP file: kinda useless effort.
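To see the effect, here is a minimal Python sketch (an illustration, not the original C# code) that gzips incompressible data:

```python
import gzip
import os

# Random bytes stand in for already-compressed content: like the streams
# inside a PDF, they have almost no redundancy left for gzip to exploit.
original = os.urandom(30_000)
recompressed = gzip.compress(original)

# The "compressed" copy ends up slightly larger: gzip adds header and
# trailer overhead but cannot shrink high-entropy data.
print(len(original), len(recompressed))
```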
I have a windows .NET application that manages many PDF Files. Some of the files are corrupt.
Two issues; I'll try to explain in my imperfect English, sorry.
1.)
How can I detect whether a PDF file is valid?
I want to read the header of the PDF and check that it is correct.
var okPDF = PDFCorrect(@"C:\temp\pdfile1.pdf");
2.)
How can I tell whether a byte[] (byte array) contains a PDF file or not?
For example, for ZIP files, you could examine the first four bytes and see if they match the local header signature, i.e. in hex
50 4b 03 04
if (buffer[0] == 0x50 && buffer[1] == 0x4b && buffer[2] == 0x03 &&
buffer[3] == 0x04)
If you are loading it into a long, this is (0x04034b50). by David Pierson
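The "loading it into a long" remark can be illustrated with a short Python sketch (an illustration only, not the .NET code the question asks for):

```python
import struct

# The four-byte ZIP local file header signature quoted above.
zip_signature = bytes([0x50, 0x4B, 0x03, 0x04])  # "PK\x03\x04"

# Interpreted as a little-endian 32-bit integer it reads 0x04034B50,
# which allows the single-constant comparison David Pierson describes.
(value,) = struct.unpack("<I", zip_signature)
print(hex(value))  # 0x4034b50
```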
I want the same for PDF files.
byte[] dataPDF = ...
var okPDF = PDFCorrect(dataPDF);
Any sample source code in .NET?
I check the PDF header like this:
public bool IsPDFHeader(string fileName)
{
    using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    using (BinaryReader br = new BinaryReader(fs))
    {
        // "%PDF-" is five bytes: 0x25 0x50 0x44 0x46 0x2D
        byte[] buffer = br.ReadBytes(5);
        if (buffer.Length < 5)
            return false;

        string header = new ASCIIEncoding().GetString(buffer);
        return header.StartsWith("%PDF-");
    }
}
a. Unfortunately, there is no easy way to determine whether a PDF file is corrupt. Usually the problem files have a correct header, so the real causes of corruption lie elsewhere. A PDF file is effectively a dump of PDF objects. The file contains a cross-reference table giving the exact byte offset of each object from the start of the file. So corrupted files most probably have broken offsets, or some object may be missing.
The best way to detect the corrupted file is to use specialized PDF libraries.
There are lots of free and commercial PDF libraries for .NET. You can simply try to load the PDF file with one of them; iTextSharp would be a good choice.
b. According to the PDF reference, the header of a PDF file usually looks like %PDF-1.X (where X is a digit, at present from 0 to 7), and 99% of PDF files have such a header. However, there are other kinds of headers which Acrobat Viewer accepts, and even the absence of a header isn't a real problem for PDF viewers. So you shouldn't treat a file as corrupted just because it does not contain a header.
E.g., the header may appear anywhere within the first 1024 bytes of the file, or may take the form %!PS-Adobe-N.n PDF-M.m.
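A minimal Python sketch of this relaxed check (the function name is mine; the 1024-byte window and the %!PS-Adobe- form come from the paragraph above):

```python
def looks_like_pdf(data: bytes) -> bool:
    """Accept files whose %PDF- marker appears within the first 1024 bytes,
    or which use the alternative %!PS-Adobe- header form."""
    window = data[:1024]
    return b"%PDF-" in window or window.startswith(b"%!PS-Adobe-")

# A header preceded by junk (e.g. a BOM) is still accepted:
print(looks_like_pdf(b"\xef\xbb\xbf%PDF-1.4\n"))  # True
print(looks_like_pdf(b"plain text, no header"))   # False
```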
Just for your information, I am a developer of the Docotic PDF library.
Well-behaving PDFs start with the nine bytes %PDF-1.x plus a newline (where x is in 0..8). 1.x is supposed to give you the version of the PDF file format. The second line contains some binary bytes to help applications (editors) identify the PDF as a non-ASCII-text file type.
However, you cannot trust this tag at all. There are lots of applications out there which use features from PDF-1.7 but claim to be PDF-1.4, thus misleading some viewers into spitting out invalid error messages. (Most likely these PDFs are the result of a mismanaged conversion of the file from a higher to a lower PDF version.)
There is no such section as a "header" in PDF (maybe the initial nine bytes, %PDF-1.x, are what you meant by "header"?). A structure holding metadata may be embedded inside the PDF, giving you info about Author, CreationDate, ModDate, Title and some other fields.
My way to reliably check for PDF corruption
There is no other way to check for validity and un-corrupted-ness of a PDF than to render it.
A "cheap" and rather reliable way to check for such validity for me personally is to use Ghostscript.
However: you want this to happen fast and automatically, and you want to use the method programmatically or via a scripted approach to check many PDFs.
Here is the trick:
Don't let Ghostscript render the file to a display or to a real (image) file.
Use Ghostscript's nullpage device instead.
Here's an example commandline:
gswin32c.exe ^
-o nul ^
-sDEVICE=nullpage ^
-r36x36 ^
"c:/path to /input.pdf"
This example is for Windows; on Unix use gs instead of gswin32c.exe and -o /dev/null.
Using -o nul -sDEVICE=nullpage will not output any rendering result. But all the stderr and stdout output of Ghostscript's processing the input.pdf will still appear in your console. -r36x36 sets resolution to 36 dpi to speed up the check.
%errorlevel% (or $? on Linux) will be 0 for an uncorrupted file. It will be non-0 for corrupted files. And any warning or error messages appearing on stdout may help you to identify problems with the input.pdf.
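To script this check, a small wrapper can shell out to Ghostscript and inspect the exit status. This is a Python sketch under stated assumptions: the flags are exactly those from the command line above, but the helper names (ghostscript_cmd, pdf_renders_cleanly) are mine, and gs_exe must point at your Ghostscript binary (gs on Unix, gswin32c.exe on Windows):

```python
import subprocess

def ghostscript_cmd(pdf_path, gs_exe="gs"):
    # Same options as the command line above: null output device, 36 dpi.
    return [gs_exe, "-o", "/dev/null", "-sDEVICE=nullpage", "-r36x36", pdf_path]

def pdf_renders_cleanly(pdf_path, gs_exe="gs"):
    # Exit status 0 means Ghostscript processed every page without error.
    result = subprocess.run(ghostscript_cmd(pdf_path, gs_exe), capture_output=True)
    return result.returncode == 0
```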
There is no other way to check for a PDF file's corruption than to somehow render it...
Update: Meanwhile not only %PDF-1.0, %PDF-1.1, %PDF-1.2, %PDF-1.3, %PDF-1.4, %PDF-1.5, %PDF-1.6, %PDF-1.7 and %PDF-1.8 are valid version indicators, but also %PDF-2.0.
The first line of a PDF file is a header identifying the version of the PDF specification
to which the file conforms %PDF-1.0, %PDF-1.1, %PDF-1.2, %PDF-1.3, %PDF-1.4 etc.
You could check this by reading some bytes from the start of the file and seeing whether they match the header of a PDF file. See the PDF Reference from Adobe for more details.
Don't have a .NET example for you (I haven't touched .NET in some years now), but even if I had one, I'm not sure you can check for completely valid content of the file. The header might be OK but the rest of the file might be messed up (as you said yourself, some files are corrupt).
You could use iTextSharp to open and attempt to parse the file (e.g. try and extract text from it) but that's probably overkill. You should also be aware that it's GNU Affero GPL unless you purchase a commercial licence.
Checking the header is tricky. Some of the code above simply won't work, since not all PDFs start with %PDF. Some PDFs that open correctly in a viewer start with a BOM marker; others start like this:
------------e56a47d13b73819f84d36ee6a94183
Content-Disposition: form-data; name="par"
...etc
So checking for "%PDF" will not work.
What I do is:
1. Validate the extension.
2. Open the PDF file, read the header (first line), and check whether it contains the string "%PDF-".
3. Check whether the file specifies a page count by searching for "/Page" occurrences (a PDF file should always have at least one page).
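A rough Python sketch of those three steps (the function name is mine, and reading the whole file into memory is a shortcut; a real implementation would stream large files):

```python
import os

def quick_pdf_check(path: str) -> bool:
    # 1. Validate the extension.
    if os.path.splitext(path)[1].lower() != ".pdf":
        return False
    with open(path, "rb") as f:
        data = f.read()
    # 2. The header (first line) should contain "%PDF-".
    if b"%PDF-" not in data.split(b"\n", 1)[0]:
        return False
    # 3. The file should reference at least one page object.
    return b"/Page" in data
```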
As suggested earlier you can also use a library to read the file:
Reading PDF File Using iTextSharp
I have a .NEF file on my Desktop titled "my_image.nef". If I look at the details of the image, I see a resolution of 4256x2832.
When I try to open this with Julia, I get a two-dimensional array of size 120x160.
How do I get a full-resolution array to load? Why is it loading a much smaller version of the original image?
I'm not an expert on the various RAW file formats, but it's probably loading the thumbnail preview. There's good reason to hope this is fairly easily resolvable: like many RAW formats, NEF appears to be a variation on TIFF, and Julia's TiffImages package is an amazingly good TIFF library. It's possible you'd have to create a "wrapper package" specifically for RAW or NEF, but it might end up being a fairly short exercise in piecing together the correct series of calls to TiffImages internals. I encourage you to file an issue at TiffImages to discuss it.
I ended up using the Julia command prompt to iteratively call ImageMagick, locally converting all the .NEF files to .PNG files and reading in the PNG files as arrays.
using Glob

# Collect all .NEF files in the image directory.
filenames = glob("*.NEF", "<IMAGE DIR>")

for file in filenames
    fname = splitext(basename(file))[1]
    fname1 = joinpath("<IMAGE DIR>", fname * ".NEF")
    fname2 = joinpath("<IMAGE DIR>", fname * ".png")
    # Shell out to ImageMagick to convert NEF -> PNG.
    run(`convert $fname1 $fname2`)
end
Sloppy, but there didn't seem to be a tidy Julia-based package that worked well.
I am looking for a way to retrieve the artwork from a large number of mp3 files and store the artwork as jpg files in a folder. The only gem I am aware of that can read mp3 data is mp3info, and its documentation only says that it can write images to the mp3, not retrieve them. Are there any gems that provide this capability?
So I ended up getting it. As I said above, I ended up using taglib, and used Carrierwave to store the files in my fs.
If you follow the documentation on https://robinst.github.io/taglib-ruby/, you would use:
require 'taglib'

file = TagLib::MPEG::File.new(file_path) # file_path is the mp3's path
id3v2_tag = file.id3v2_tag
cover = id3v2_tag.frame_list('APIC').first
file.close
to grab the picture. Then create a new file to write the data:
# Write the raw image bytes in binary mode; forcing UTF-8 onto JPEG data
# can corrupt it.
File.open('temp.jpg', 'wb') do |newfile|
  newfile.write(cover.picture)
end
Then use Carrierwave, Paperclip, or whatever to store it somewhere. Hope this helps.
I'm trying to determine the algorithm used to compress a series of bytes. I have no idea what algorithm it is or how it works. What I do know is the contents of the data, both before and after it is compressed.
Is there a program I can use to determine this, is the answer obvious from these small samples, or can you redirect me to some pretty good resources to figure this out?
Input = "\x00\x00"
Output = "\x78\xda\x63\x60\x00\x00\x00\x02\x00\x01"
Input = "\x00\x01\x00\x00\x00\x3C\xEA\x00\x05\x68\x65\x6c\x6c\x6f\x03"
Output = "\x78\xda\x63\x60\x64\x60\x60\xb0\x79\xc5\xc0\x9a\x91\x9a\x93\x93\xcf\x0c\x00\x13\x10\x03\x44"
Input = "\x00\x0A\x00\x02\x1a\xec\xEA\x00\x0A\x62\x61\x73\x69\x6c\x61\x64\x65\x31\x32\x02\x00\x02\xe6\x0f\xEA\x00\x0B\x31\x31\x68\x6f\x74\x70\x69\x6e\x6b\x31\x31\x02\x00\x02\xee\x84\xEA\x00\x08\x73\x78\x79\x63\x61\x69\x74\x79\x02\x00\x02\xf3\x6b\xEA\x00\x09\x52\x6f\x62\x6c\x6f\x78\x31\x30\x31\x02\x00\x03\x13\xd3\xEA\x00\x0D\x62\x6c\x75\x65\x5f\x6d\x61\x66\x69\x61\x31\x32\x33\x02\x00\x03\x4c\x94\xEA\x00\x0D\x45\x76\x65\x72\x74\x6f\x6e\x20\x42\x72\x69\x74\x6f\x02\x00\x03\xb3\x96\xEA\x00\x0D\x69\x48\x65\x61\x72\x74\x43\x6f\x6f\x6b\x69\x65\x73\x02\x00\x04\xbf\x25\xEA\x00\x0B\x6a\x61\x6b\x65\x2e\x2e\x2e\x77\x68\x61\x74\x02\x00\x05\x94\x09\xEA\x00\x07\x7e\x5a\x61\x70\x70\x79\x7e\x02\x00\x06\xa9\x97\xEA\x00\x08\x4c\x75\x63\x79\x4c\x75\x63\x79\x02"
Output = "\x78\xda\x63\xe0\x62\x60\x92\x7a\xf3\x8a\x81\x2b\x29\xb1\x38\x33\x27\x31\x25\xd5\xd0\x88\x89\x81\xe9\x19\xff\x2b\x06\x6e\x43\xc3\x8c\xfc\x92\x82\xcc\xbc\x6c\x43\x43\xa0\xd0\xbb\x96\x57\x0c\x1c\xc5\x15\x95\xc9\x89\x99\x25\x95\x40\xfe\xe7\xec\x57\x0c\x9c\x41\xf9\x49\x39\xf9\x15\x86\x06\x40\x05\xcc\xc2\x97\x5f\x31\xf0\x26\xe5\x94\xa6\xc6\xe7\x26\xa6\x65\x26\x1a\x1a\x19\x03\x05\x7d\xa6\x00\x05\x5d\xcb\x52\x8b\x4a\xf2\xf3\x14\x9c\x8a\x32\x4b\xf2\x81\x82\x9b\xa7\x01\x05\x33\x3d\x52\x13\x8b\x4a\x9c\xf3\xf3\xb3\x33\x53\x8b\x99\x18\x58\xf6\xab\x02\xad\xcc\x4a\xcc\x4e\xd5\xd3\xd3\x2b\xcf\x48\x2c\x61\x62\x60\x9d\xc2\xf9\x8a\x81\xbd\x2e\x2a\xb1\xa0\xa0\xb2\x8e\x89\x81\x6d\xe5\x74\xa0\x0b\x7c\x4a\x93\x2b\x41\x98\x09\x00\x28\x9c\x3b\x2f"
That is the zlib format (deflate data with a zlib wrapper, produced by its compress method); it is very common.
https://www.rfc-editor.org/rfc/rfc1950
Related Answer What does a zlib header look like?
Edit
It could of course be something else, but this is the best place to start when trying to decompress it.
http://www.zlib.net
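You can confirm this quickly with Python's zlib module; for instance, the first sample above round-trips cleanly:

```python
import zlib

# First Output sample from the question.
compressed = bytes.fromhex("78da6360000000020001")
print(zlib.decompress(compressed))  # b'\x00\x00' -- matches the first Input

# 0x78 0xDA is a typical zlib header: deflate, 32 KB window, max compression.
```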
I have a folder of image files which have been compressed into .dat files. Since the .dat files are extremely huge (they are microscopic images of an organ), I don't really know what tools I can use to convert them into JPEG files. The best case would be for the whole image to be split up into pieces, so I can get all the pieces of the image.
The ".dat" file suffix is used broadly, so you'll need to specify more details on what format/source software created the original data. As a guess, from a quick search for ".dat" format microscopy, these tools look like they might be applicable to your domain:
http://gwyddion.net/
or
http://www.openmicroscopy.org/site/products/bio-formats
If you can't find a library for the format/languages you are using, then you'll need to find documentation of the file format, and write a converter (at least, the reading portion of the converter - you can use something like libjpeg to handle the writing portion.)