Snapshot testing PDFs [duplicate] - pdf-generation

I am generating PDFs and storing them in a database.
The PDF data is stored in a text field using Convert.ToBase64String(pdf.ByteArray).
If I generate exactly the same PDF that already exists in the database and compare the two Base64 strings, they are not the same. A large portion matches, but about 5-10% of the text differs each time.
What would make two PDFs different if both were generated using the same method?
This is a problem because I can't tell whether the PDF was modified since it was last saved to the database.
Edit: The two PDFs look exactly the same when viewed, but the Base64 strings of their bytes are different.

Two PDFs that look 100% the same visually can be completely different under the covers. PDF producing programs are free to write the word "hello" as a single word or as five individual letters written in any order. They are also free to draw the lines of a table first followed by the cell contents, or the cell contents first, or any combination of these such as one cell at a time.
If you are actually programmatically creating the PDFs and you create two PDFs using completely identical code, you still won't get files that are 100% identical. There are a couple of reasons for this. The most obvious is that PDFs support creation and modification dates, which will change depending on when the file is created. You can override these (and confuse everyone else, so I don't recommend this) using something like this:
var info = writer.Info;
info.Put(PdfName.CREATIONDATE, new PdfDate(new DateTime(2001,01,01)));
info.Put(PdfName.MODDATE, new PdfDate(new DateTime(2001,01,01)));
However, PDFs also support a unique identifier in the trailer's /ID entry. To the best of my knowledge iText has no support for overriding this parameter. You could duplicate your PDF, change this manually and then calculate your differences and you might get closer to a comparison.
Then there's fonts. When subsetting fonts, producers create a unique internal name based on the original name and an arbitrary selection of six uppercase ASCII letters. So for the font Calibri the font's name could be JLXWHD+Calibri one time and SDGDJT+Calibri another time. iText doesn't support overriding of this because you'd probably do more harm than good. These internal names are used to avoid font subset collisions.
So the short answer is that unless you are comparing two files that are physical duplicates of each other you can't perform a direct comparison on their binary contents. The long answer is that you can tweak some of the PDF entries to remove unique parts for comparison only but you'd probably be doing more work than it would take to just re-store the file in the database.
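A minimal Python sketch of the "tweak for comparison only" idea, assuming the volatile entries are stored as plain, uncompressed PDF syntax (if they sit inside compressed object streams, or the file is encrypted, these patterns will not see them). The name normalized_digest and the regular expressions are illustrative and not part of iText or any PDF library. The rewritten bytes are only hashed, never saved, because blanking values shifts byte offsets and would invalidate the cross-reference table.

```python
import hashlib
import re

# Patterns for the volatile entries discussed above. They are deliberately
# simple and only match literal (uncompressed, unencrypted) PDF syntax.
_VOLATILE = [
    # /CreationDate (...) and /ModDate (...) in the Info dictionary
    (re.compile(rb"/(CreationDate|ModDate)\s*\([^)]*\)"), rb"/\1()"),
    # /ID [<hex> <hex>] in the trailer
    (re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]"), rb"/ID[<><>]"),
    # Six-letter subset prefix such as /JLXWHD+Calibri
    (re.compile(rb"/[A-Z]{6}\+"), rb"/AAAAAA+"),
]


def normalized_digest(pdf_bytes: bytes) -> str:
    """Hash the PDF after blanking dates, the /ID pair, and subset prefixes."""
    for pattern, replacement in _VOLATILE:
        pdf_bytes = pattern.sub(replacement, pdf_bytes)
    return hashlib.sha256(pdf_bytes).hexdigest()
```

Two files whose digests match differ at most in the blanked entries. A mismatch does not prove a visual difference, since the producer may still have serialized the same content differently.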

Related

what is the best output format / platform to display different sorts of extracted data?

I am writing a script that extracts different types of data from different kinds of custom log files.
Before I continue, I want to decide which output format or platform to use so that the results are displayed and can be read properly.
Examples:
sometimes it is certain lines of text containing an important word
sometimes it is a block of text between a start and an end phrase
sometimes it is data points, which I then want to visualize in a line chart
....
OR it is a combination of those
At first I thought I would write the output as a Markdown file, so that I could, for instance, create foldable blocks and unfold only the part I want to read.
But Markdown is not versatile enough: I can't create line charts or other kinds of visualizations (thinking about the future).
So now I put the different types of data into different output formats and visualize them in an HTML file.
That means the blocks of text go into a Markdown file, which I then import through a JavaScript Markdown viewer,
and the data points are drawn as a line chart with a JavaScript charting library.
.....
HOWEVER, I am not sure that this is the best or correct way to go.
What is your advice?

PDF - Edit raw text without special paid tool

Is there a way to edit the raw text from a PDF without any special paid software?
So there are PDFs with highlightable text. I assume that the text is stored somewhere in the file.
I tried dragging and dropping a PDF into VS Code, but it mostly showed unknown characters. There was a little metadata text, but if I edit the metadata, the file usually ends up corrupted.
Apart from that, I could not find any of the text contents of my PDF in the VS Code editor.
Does anyone know of a solution, such as inspecting and changing the source somehow, that does not require special software? I want to edit the contents, not the metadata.
(I use macOS)
The text you see on a PDF page can be constructed in dozens of different ways; there are millions of users, potentially using hundreds if not thousands of different methods.
Update
The question is about macOS, but for a natively cross-platform approach you need to work in plain text (mime text/pdf) to be universally useful. By way of example, on Windows it is possible to write a PDF line by line using cmd. Here is a snippet of what was a few dozen lines :-)
echo %%PDF-1.0>demo.pdf
echo %%µ¶µ¶>>demo.pdf
echo/>>demo.pdf
for %%Z in (demo.pdf) do set "FZ1=%%~zZ"
echo 1 0 obj>>demo.pdf
echo ^<^</Type/Catalog/Pages 2 0 R^>^>>>demo.pdf
echo endobj>>demo.pdf
echo/>>demo.pdf
For the fuller, feature-creeping version, now more than 100 lines and counting, see
https://github.com/GitHubRulesOK/MyNotes/raw/master/MAKE-PDF.cmd
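The same line-by-line idea can be sketched in Python, which also runs on macOS. This is a minimal illustration and not a port of the linked script: build_minimal_pdf is a hypothetical helper, it assumes ASCII text without parentheses or backslashes, and it relies on the standard Helvetica font instead of embedding one. Like the cmd snippet, which reads the file size before writing each object (presumably for the same purpose), it records the byte offset of every object for the cross-reference table.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Assemble a one-page, uncompressed PDF showing a single line of text."""
    # Content stream: select font F1 at 24 pt, move to (72, 720), draw the text.
    stream = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode("ascii")
    objects = [
        b"<</Type/Catalog/Pages 2 0 R>>",
        b"<</Type/Pages/Kids[3 0 R]/Count 1>>",
        b"<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]"
        b"/Resources<</Font<</F1 5 0 R>>>>/Contents 4 0 R>>",
        b"<</Length %d>>\nstream\n%s\nendstream" % (len(stream), stream),
        b"<</Type/Font/Subtype/Type1/BaseFont/Helvetica>>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one 20-byte entry per object, plus the free entry 0.
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<</Size %d/Root 1 0 R>>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1,
        xref_pos,
    )
    return bytes(out)
```

Writing the result with open("demo.pdf", "wb") gives a file that can be opened in a text editor and edited by hand, as long as the stream length and the offsets are kept consistent.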
However, although plain text could be the simplest approach, it is rarely used except to prove the conceptual point that it is possible. The rest of the time, "special software" as you call it (a PDF generator or editor) will be used to compress the file objects, most frequently as different optimal binary streams.
So some text may be scanned pixels, whilst other text may be line shapes that look like letters, or at other times plain letters without fonts but with a named style, or even letters with the font included (embedded) in the file (the preferred option).
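Compression is the usual reason the page text cannot be found in a plain text editor. As a rough sketch, assuming the streams use the common Flate (zlib) filter and the file is not encrypted, the following Python helper inflates whatever it can so the drawing operators become readable. inflate_streams is a hypothetical name, and the pattern-based scan is a shortcut that a real parser would replace by following the cross-reference table and each stream's /Length entry.

```python
import re
import zlib
from typing import List

# Naive scan for "stream ... endstream" pairs.
_STREAM = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes) -> List[bytes]:
    """Return the decompressed bodies of all streams that zlib can inflate."""
    results = []
    for match in _STREAM.finditer(pdf_bytes):
        try:
            # decompressobj tolerates stray bytes after the compressed data
            results.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # other filter, image data, or an uncompressed stream
    return results
```

Text that was stored as real text shows up in the output as operators such as Tj or TJ, often in a font-specific encoding. Scanned pixels and outlined shapes, as described above, will not yield readable text at all.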
In many ways each page may be built differently from the others, and thus no two PDFs will generally use the same structure, unless, like a bank statement, they use a format that does not change much from month to month, even if the balance wobbles about.
So in summary, the tool that will work best is the one that covers every single permutation that Adobe dreamed of and still keeps the result a valid Adobe PDF.
Thus Acrobat PRO 3D is on my shelf (even if it goes unused from one year to the next).
There are many cheaper editors; the ones I use more often for small modifications are Tracker Xchange and FreePDF PRO, and both have different limitations.
Your choices for macOS will be more limited, so search for the best one you are willing to pay for.

What does the "Interoperability IFD" EXIF tag mean, and how can it be different for the same image?

I am hunting for duplicates in my photo albums, and I stumbled upon some image pairs for which (seemingly) all visual data is the same, all EXIF tags are the same, except this "Interoperability IFD" field. For one image pair I checked it is 4896 and 4908 for the two files.
I'm pretty sure it is the same photo taken, it was just imported twice in different time/ways. What does this tag mean?
This is a byte offset to the Interoperability IFD "table". EXIF can contain several "tables" (lists of "IDs" and "values"). Some table items are not actual data, but references to further "tables" (e.g. 0x8769 ExifTag, 0x8825 GPSTag or 0xA005 IopTag). This offset can of course vary depending on the order in which an implementer chooses to save the EXIF information and its different "tables".
Since your software seems to simply output this value, it might be safe to assume that it does not follow this offset to read the additional tags inside the Interoperability IFD "table" (tags 0x0001-0x0002 and 0x1000-0x1002).
Source https://www.exiv2.org/tags.html
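To illustrate why the value is only a position and not image data, here is a minimal Python sketch that follows the pointer chain from the first IFD to the Exif "table" and reads the 0xA005 entry. It assumes the input is the TIFF-structured EXIF payload itself (starting with the "II" or "MM" byte-order mark), not a whole JPEG file. The names read_ifd and interop_ifd_offset are made up for this example.

```python
import struct
from typing import Dict, Optional

EXIF_IFD_POINTER = 0x8769     # entry in IFD0 pointing at the Exif IFD
INTEROP_IFD_POINTER = 0xA005  # entry in the Exif IFD pointing at the Interop IFD


def read_ifd(tiff: bytes, offset: int, endian: str) -> Dict[int, int]:
    """Map each tag in one IFD to its raw 32-bit value-or-offset field.

    The raw field is only meaningful as-is for LONG entries such as the
    pointer tags above; other types would need decoding by type and count.
    """
    (count,) = struct.unpack_from(endian + "H", tiff, offset)
    entries = {}
    for i in range(count):
        # Each entry is 12 bytes: tag, type, count, value-or-offset.
        tag, _type, _count, value = struct.unpack_from(
            endian + "HHII", tiff, offset + 2 + 12 * i
        )
        entries[tag] = value
    return entries


def interop_ifd_offset(tiff: bytes) -> Optional[int]:
    """Return the byte offset stored in tag 0xA005, or None if it is absent."""
    endian = "<" if tiff[:2] == b"II" else ">"
    (ifd0_offset,) = struct.unpack_from(endian + "I", tiff, 4)
    exif_offset = read_ifd(tiff, ifd0_offset, endian).get(EXIF_IFD_POINTER)
    if exif_offset is None:
        return None
    return read_ifd(tiff, exif_offset, endian).get(INTEROP_IFD_POINTER)
```

Two files can return different offsets here, such as 4896 and 4908, while the tags stored at those offsets are identical. For duplicate detection it is the contents of the table that should be compared, not the pointer.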

Converting All Blocks to Lines and Text

When I receive a drawing, I want to remove all definitions left by previous drafters, such as blocks, styles, layers, groups, xrefs, etc., in order to retain only primitives (texts, lines and arcs): in short, a single flat drawing.
This is a very routine activity, and I've found many differing answers on the internet, often involving non-standard, non-canonical combinations of the following commands:
LAYMRG, PURGE
AUDIT
SELECTSIMILAR
WBLOCK
EXPLODE, XPLODE
DIMSTYLE, BATTMAN
DXFOUT, WMFOUT, DXFIN, WMFIN
BURST
Unfortunately, after applying most of them, the result sometimes retains many non-purgeable objects, including:
Non-explodable blocks,
Dimensions with their own styles,
Blocks losing their text attributes (by XPLODE),
Changed fonts (by WMFOUT),
Does AutoCAD have a canonical way to do this?
I think it's not so easy. If there is such a command, I don't know it, but...
In the situation you describe, you should attach the drawing you receive as an external reference (XRef). That way you can display the drawing darker or lighter without making so many changes to it. Also, if you receive a new version of the file, for example because the architect made some changes, you don't need to do anything except perhaps reload the file, and the new version is displayed.
You will have two separate files:
base, for example architecture
branch, for example electrical, HVAC, and so on. Your work.
Of course, you could write a script (an SCR file or LISP) that runs all the commands you want from a single command. Creating such a script is not very complicated, but in my opinion using an XRef is easy and flexible enough.

manually finding the size of a block of text (ASCII format)

Is there an easy way to manually (ie. not through code) find the size (in bytes, KB, etc) of a block of selected text? Currently I am taking the text, cutting/pasting into a new text document, saving it, then clicking "properties" to get an estimate of the size.
I am developing mainly in visual studio 2008 but I need any sort of simple way to manually do this.
Note: I understand this is not specifically a programming question but it's related to programming. I need this to compare a few functions and see which one is returning the smallest amount of text. I only need to do it a few times so figured writing a method for it would be overkill.
This question isn't meaningful as asked. Text can be encoded in different formats; ASCII, UTF-8, UTF-16, etc. The memory consumed by a block of text depends on which encoding you decide to use for it.
EDIT: To answer the question you've stated now (how do I determine which function is returning a "smaller" block of text) -- given a single encoding, the shorter text will almost always be smaller as well. Why can't you just compare the lengths?
In your comment you mention it's ASCII. In that case, it'll be one byte per character.
I don't see the difference between using the code written by the app you're pasting into, and using some other code. Being a python person myself, whenever I want to check length of some text I just do it in the interactive interpreter. Surely some equivalent solution more suited to your tastes would be appropriate?
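For instance, a quick check in the Python interactive interpreter could look like the sketch below. encoded_size is just an illustrative helper name; the byte count is the length of the text after encoding, which for ASCII equals the character count.

```python
def encoded_size(text: str, encoding: str = "ascii") -> int:
    """Return the number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


sample = "hello world"
print(encoded_size(sample))               # ASCII: one byte per character
print(encoded_size(sample, "utf-16-le"))  # UTF-16: two bytes per character here
```

Pasting each function's output into the interpreter as a string and comparing the results answers the "which is smallest" question without saving any files.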
I ended up just cutting and pasting the text into MS Word and using the character count feature there.
