Is there a way to edit the raw text from a PDF without any special paid software?
So there are PDFs with highlightable text. I assume that the text is stored somewhere in the file.
I tried to just drag & drop a PDF into vscode but it just showed me unknown characters; even a little of meta text but if I edit the meta-infos, the file gets mostly corrupted.
Apart from that, I could not find any of the text contents of my desired PDF in vscode-editor.
Does someone know if there is a solution like inspecting and changing the source code somehow without a special software? I want to edit the contents; not the meta-infos.
(I use macOS)
The text you see on a pdf page can be constructed in dozens of different ways, actually there are millions of users, using potentially hundreds if not thousands of different methods.
Update
The question is MacOS but for native cross platform you need to work in mime text/pdf to be universally useful. But by way of example how thats possible specifically in windows its possible to write line by line using say cmd here is a snippet of what was a few dozen lines :-)
echo %%PDF-1.0>demo.pdf
echo %%µ¶µ¶>>demo.pdf
echo/>>demo.pdf
for %%Z in (demo.pdf) do set "FZ1=%%~zZ"
echo 1 0 obj>>demo.pdf
echo ^<^</Type/Catalog/Pages 2 0 R^>^>>>demo.pdf
echo endobj>>demo.pdf
echo/>>demo.pdf
For the fuller "Feature Creep"ing of now over more than a 100 lines and counting see
https://github.com/GitHubRulesOK/MyNotes/raw/master/MAKE-PDF.cmd
However although plain text could be the simplest it is rarely used except to prove a conceptual point that it is possible. The rest of the time "Special Software" as you call it (a pdf generator/editor) will be used to compress the file objects, most frequently as different optimal binary streams.
So some text may be scanned pixels whilst other text may be line shapes that look like letters, or at other times plain letters without fonts but a named style, or even letters with the font included (embedded) in the file (the preferred option).
In many ways each page may be built different to the others and thus no two pdfs generally will use the same structure unless like a bank statement using a format that does not change much from month to month, even if the balance wobbles about.
So in summary the tool that will work best is the one that covers every single permutation that Adobe dreamed of, and still keep the result a valid Adobe PDF.
Thus Acrobat PRO 3D is on my shelf (even if not used from one year to the next)
There are many cheaper editors and ones I will use more often for small mods are Tracker Xchange and FreePDF PRO and both have different limitations.
Your choices for MacOS will be more limited thus search for the best you are willing to pay for.
Full disclosure: I'm working on my libui GUI framework's text API. This wraps DirectWrite on Windows, Core Text on OS X, and Pango (which uses HarfBuzz for OpenType shaping) on other Unixes. One of the text formatting attributes I want to specify is a collection of OpenType features to use, which all three provide; DirectWrite's is IDWriteTypography.
Now, when you draw some text with these libraries, by default you'll get a few useful OpenType features enabled, such as the standard ligatures (liga) like the f+i ligature. I thought this was font-specific, but it turns out this is specific to the script of the text being shaped. Microsoft provides guidelines for all the scripts supported by OpenType (under "Script-specific Development"), and I can see rather complex logic for doing it all in HarfBuzz itself to confirm it.
On Core Text and Pango, if I enable other attributes, they'll be added on top of these defaults. But with DirectWrite, in particular IDWriteTextLayout::SetTypography(), doing so removes the defaults:
The program that produces this output is can be found here.
Obviously my first option would be to ask how to get the default features on DirectWrite. Someone did so already on this site, though, and the answer seems to be "no".
I am guessing that DirectWrite is allowing me to be in complete control of the list of features to apply to some text. This is nice, except that I can't do this with the other APIs unless I explicitly disable the default features somehow! Of course, I don't know if this list will ever change, so hardcoding it might not be the best idea.
Even if hardcoding is an option, I could just grab HarfBuzz's list for each script, but a) it's rather complicated b) there are multiple possible shapers for a script, depending on (I think) version compatibility (for instance, Myanmar).
So why not use HarfBuzz's lists to recreate the default list of features for DirectWrite anyway? It seems to want to be accurate to other shapers anyway, so this should work, right? Well I would need to do two things: figure out what script to use, and figure out which attributes to use on which characters for script where the position of a character in the word matters.
DirectWrite provides an interface IDWriteTextAnalyzer that provides facilities to perform shaping. I could use this, but it seems the script data is returned in a DWRITE_SCRIPT_ANALYSIS structure, and the description for the script ID says "The zero-based index representation of writing system script.".
This doesn't help, so I wrote a program to just dump the script numbers for text I type in. Running it on the input string
لللللللللللللاااااااااالا abcd محمد ابن بطوطة Отложения датского яруса
yields the output
0 - 26 script 3 shapes 0
26 - 5 script 49 shapes 0
31 - 14 script 3 shapes 0
45 - 2 script 1 shapes 1
47 - 25 script 22 shapes 0
I cannot match these script numbers to anything in any of the Windows headers: if there is a defined number for Arabic, Latin, or Cyrillic in any API, they don't match these. And even if I did get a mapping between script and script number, that still doesn't give me the data to apply intra-word features.
What about Uniscribe? Well, the documentation for the equivalent SCRIPT_ANALYSIS type says that its script ID is an "[opaque] value" whose "value for this member is undefined and applications should not rely on its value being the same from one release to the next". And while I can get a language code to identify the script by, there's still no defined value other than LANG_ENGLISH for "Western" (Latin?) scripts. Are the DirectWrite values the same as the Uniscribe ones? And it seems like I can at least figure the initial and final states of words by looking at the fLinkBefore and fLinkAfter fields, but is this enough to properly apply attributes per-script?
HarfBuzz does have an experimental DirectWrite backend that isn't intended to be used by real programs; I'm not yet sure whether it has the same feature-clobbering I specified above. If I find out, I'll update this part here.
Finally, if I enter the following equivalent test case to the first one above in something like kaxaml:
<Page
xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml">
<Grid>
<FlowDocumentPageViewer>
<FlowDocument FontFamily="Constantia" FontSize="48">
<Paragraph>
afford afire aflight 1/4<LineBreak/>
<Run Typography.Fraction="1">afford afire aflight 1/4</Run>
</Paragraph>
</FlowDocument>
</FlowDocumentPageViewer>
</Grid>
</Page>
I see the ligatures being applied properly, even in the latter case:
(The fraction at the end is just to prove that that attribute is being applied.) If I assume XAML uses DirectWrite, then that proves my first option (simply overlaying my custom attributes on top of the defaults) should be possible... (I make this assumption based on the idea that XAML provides a strikingly similar API to Direct2D for drawing 2D graphics, and has a lot of holes filled in where I had to manually write a lot of glue code to do the same things with vanilla Direct2D, so I assume whatever is possible in XAML is possible with Direct2D, and by extension DirectWrite since they were technically introduced together...)
At this point I'm completely lost. I want to at least be predictable across platforms, and I'm not sure how programs are even supposed to, let alone going to, use OpenType features directly or not anyway. Am I making bad expectations of text layout APIs? Will I have to drop IDWriteTextLayout and do all the text shaping and layout myself if I want this?
Or do I have to drop vanilla Windows 7 support and upgrade to the Platform Update DirectWrite feature set? Or even Windows 7 entirely?
After some discussions with Peter Sikking and Ebrahim Byagowi, I went and debugged a more general-purpose program I built quickly to test things, and I figured out what's going on internally.
First, however, I will say this applies to Uniscribe and DirectWrite equally.
As it turns out, DirectWrite is always providing a set of default OpenType features, regardless of what feature set I use! The situation is that the list of default features provided differs depending on whether I load my own features or not, and depending on the shaping engine. For the latn script in horizontal writing mode and for English, this is done with the "generic engine".
If I don't provide any features, the generic engine will load script-specific features. For horizontal latn, this list is
locl
ccmp
rlig
rclt
calt
liga
clig
If I do provide features, the generic engine will use the same default list for all scripts:
locl
ccmp
rclt
rlig
mark
mkmk
dist
So I don't know what to do about this. I could probably just provide liga and a few others myself in libui code (marked as a HACK of course), but this is still weird. I'm not sure what the motivation is either. Either way, this explains the behavior I'm seeing.
Supposing your question in general is about programming or at least concerns programming, I will try and give answers to some of your interrogative sentences.
would I have to drop the use of IDWriteTextLayout entirely in my code if I want to be able to add typographical features on top of the defaults?
It depends. If an IDWriteTextLayout interface suits well your project tasks in all ways except ease of variation of DirectWrite default typographic features, learn what you should about typography and create an IDWriteTypography instance suitable for your needs. Developing a custom text layout for the program may require substantial time and effort, especially if the program is supposed to render bidirectional texts, complex scripts, inline objects, etc.
It may happen that the tasks of your project require to develop a text layout engine for reasons other than just controlling typographic features used in rendered text. For example, your manager/customer may ask for implementation of customized linebreaking opportunities or a glyph advance justification algorithm. In this scenario, you will implement an IDWriteTextAnalizer::GetGlyphs method. This method has parameters DWRITE_TYPOGRAPHIC_FEATURES ** features, const UINT32 * featureRangeLengths, UINT32 featureRanges, and this parameters enable you to supersede a set of "default" typography features for a range of the text to be rendered (see my answer to the other question What are the default typography settings used by IDWriteTextLayout?). Only affected features will be altered; the other features has their "default" values. Morever, if you omit this parameters in a GetGlyphs call for the next text range (for example, use values of NULL, NULL, 0), the features altered in the previous GetGlyphs call will not be altered by the call for this next range.
the documentation for the equivalent SCRIPT_ANALYSIS type says that its script ID is an "[opaque] value" whose "value for this member is undefined and applications should not rely on its value being the same from one release to the next". And while I can get a language code to identify the script by, there's still no defined value other than LANG_ENGLISH for "Western" (Latin?) scripts.
Strictly speaking, this is not an interrogative statement, but I guess you are dissatisfied with how these Unicode script IDs are defined and how one can use the API with so vaguely defined structures and constants.
It may be off topic, but I risk to hypothesize on the origin of the "Unicode script ID" values. As of 2010-07-17, the Unicode, Inc. published The Unicode 6.0 version. The standard contained the document
http://www.unicode.org/Public/6.0.0/ucd/PropertyValueAliases.txt, with a section containing a list of scripts. The list went so:
# Script (sc)
sc ; Arab ; Arabic
sc ; Armi ; Imperial_Aramaic
etc.
The Arabic script is #1, the Cyrillic script is #20, the Latin script is #47 in this list. Furthermore, elsewhere I saw this list starting with scripts Common and Inherited. It places the Arabic script to the 3rd, the Cyrillic to the 22nd, and the Latin to the 49th place. These ordinals are familiar to you, aren't they?
Fortunately, we need not rely on the "Unicode script ID" values; we need script properties, not script IDs or abbreviations. The API is self-consistent in that it gives actual script properties for the text range, when we pass to a GetScriptProperties method the number derived from an AnalyzeScript call.
I'm currently building a hash key string (collapsed from a map) where the values that are delimited by the special ASCII unit delimiter 31 (1F).
This nicely solves the problem of trying to guess what ASCII characters won't be used in the string values and I don't need to worry about escaping or quoting values etc.
However reading about the history of this is it appears to be a relic from the 1960s and I haven't seen many examples where strings are built and tokenised using this special character so it all seems too easy.
Are there any issues to using this delimiter in a modern application?
I'm currently doing this in a non-Unicode C++ application, however I'm interested to know how this applies generally in other languages such as Java, C# and with Unicode.
The lower 128 char map of ASCII is fully set in stone into the Unicode standard, this including characters 0->31. The only reason you don't see special ASCII chars in use in strings very often is simply because of human interfacing limitations: they do not visualize well (if at all) when displayed to screen or written to file, and you can't easily type them in from a keyboard either. They're also not allowed in un-escaped form within various popular 'human readable' file formats, such as XML.
For logical processing tasks within a program that do not need end-user interaction, however, they are perfectly suitable for whatever use you can find for them. Your particular use sounds novel and efficient and I think you should definitely run with it.
Your application is free to accept whatever binary format it pleases. However, if you need to embed arbitrary binary data in your input, you need to escape whatever delimiters or other special codes your format uses. This is true regardless of which ones you choose.
I'd also not ignore Unicode. It's 2012, by now it's rather silly to work with an outdated model for dealing with text. If your input data is textual, handle it as such.
The one issue that comes to mind is why invent another format instead of using XML or JSON; or if you need a compact encoding, a "binary" variant of those two (Fast Infoset, msgpack, who knows what else), or ASN.1? There's probably a whole bunch of other issues that you'll encounter when rolling your own that the design and tooling for those formats already solved.
I work with barcodes in a warehouse setting. We use ASCII code 31 as a field-separator so that a single scan can populate multiple data fields with a single scan. So, consider the ramifications if you think your hash key could end up on a barcode.
Open up irb and
type gets. It should work fine.
Then try system("choice /c YN") It should work as expected.
Now try gets again, it behaves oddly.
Can someone tell me why this is?
EDIT: For some clarification on the "odd" behavior, it allows me to type for gets, but doesn't show me the characters and I have to press the enter key twice.
Terminal input-output handling is dark and mysterious art. Anyone trying to make colorized output of bash work in windows PowerShell via ssh knows that. (And various shortcutting habits like Ctrl+Backspace only make things worse.)
One of the possible reasons for your problem is special characters handling. Every terminal out there can type characters in number of different modes, and it parses its own output in search for certain character sequences in order to toggle states.
F.e. here one can find ANSI escape code sequences, one of possible supported standards among different kind of terminals.
See there Esc[5;45m? That will make all the following output to blink on magenta background. And there is significantly more stuff like that out there.
So, the answer to your question taken literally is — your choice command messes something with output modes using special escape sequences, and ruby's gets breaks in that quirk special mode of terminal operation.
But more useful will be the link to HighLine gem documentation. Why one might want to implement platform-specific and obtrusive behavior when it is possible to implement the same with about 12 LOC? All the respect for the Gist goes to botimer, I've only stumbled into his code using search.
Is there an easy way to manually (ie. not through code) find the size (in bytes, KB, etc) of a block of selected text? Currently I am taking the text, cutting/pasting into a new text document, saving it, then clicking "properties" to get an estimate of the size.
I am developing mainly in visual studio 2008 but I need any sort of simple way to manually do this.
Note: I understand this is not specifically a programming question but it's related to programming. I need this to compare a few functions and see which one is returning the smallest amount of text. I only need to do it a few times so figured writing a method for it would be overkill.
This question isn't meaningful as asked. Text can be encoded in different formats; ASCII, UTF-8, UTF-16, etc. The memory consumed by a block of text depends on which encoding you decide to use for it.
EDIT: To answer the question you've stated now (how do I determine which function is returning a "smaller" block of text) -- given a single encoding, the shorter text will almost always be smaller as well. Why can't you just compare the lengths?
In your comment you mention it's ASCII. In that case, it'll be one byte per character.
I don't see the difference between using the code written by the app you're pasting into, and using some other code. Being a python person myself, whenever I want to check length of some text I just do it in the interactive interpreter. Surely some equivalent solution more suited to your tastes would be appropriate?
ended up just cutting/pasting the text into MS Word and using the char count feature in there