I am using CoreText CTFontGetGlyphsForCharactersto get glyphs that correspond to unicode chars , i.e. UniChar. Now, I'd like to retrieve Ligature characters, for instance fi. Is there any way to retrieve ligature glyphs from CoreText (for instance by a series of unichars that should be combined)?
Thanks,
moka
After some useful advice from the Core Text mailing list, here is the solution:
You can't get ligatures directly from CTFont, as no API exists to do that.
What you do instead is the following:
Use CTLineCreateWithAttributedString() to create a text line from the character sequence whose ligature you want to get.
Make sure to set kCTLigatureAttributeName to either 1 or 2 on the attributed string to get ligatures.
If done correctly, with a ligature that exists in the font, you should get a CTLine containing one CTRun containing one CGGlyph which will be the ligated glyph you are looking for!
If you don't want to go down this route, the only other way appears to be parsing the Open - / True Type font tables yourself which I don't recommend.
Hope that helps!
Related
Some characters have ambiguous directionality, like whitespace and punctuation marks. This can lead to text layout situations where there doesn't appear to be single correct layout without access to additional data to resolve the ambiguity. Consider this text:
\u05e9\u05e0\u05d1\u05d2abcd!
That's four Hebrew characters (unambiguously right-to-left), four English characters (unambiguously left-to-right), and one punctuation mark (ambiguous). If I layout that string in an IDWriteTextLayout with DWRITE_READING_DIRECTION_RIGHT_TO_LEFT, I get the following:
The punctuation mark appears to be treated as a right-to-left character which is starting a new right-to-left block to the left of the English, which seems perfectly reasonable, especially considering that right-to-left was the specified reading direction. However, it's also entirely reasonable to expect the punctuation mark to be treated as a left-to-right character associated with the embedded left-to-right English text, which would mean it should appear to the right of the 'd'.
My app knows exactly how it wants this character should be treated. How do I pass that data to IDWriteTextLayout to resolve this ambiguity?
I found the SetLocaleName method and thought that it must be the answer, but I can't seem to get it to affect the result at all. I also found the localeName parameter when creating an IDWriteTextFormat (which is then used to create the IDWriteTextLayout).
If my goal is for this to generally be Hebrew text with a string of embedded US English, I would think I'd want to use locale he on the IDWriteTextFormat and then use SetLocaleName to override that with locale en-US on character range [4-9]. However, doing so has no effect. In fact, I can't get any combination of locales used in those places to have any effect on the layout at all, whether I restrict them to a subrange or apply them to the entire string.
Am I wrong in thinking that these APIs should serve this purpose? If so, what APIs should I be using? Or is there really no way to tell IDWriteTextLayout to resolve this ambiguity differently? Am I maybe using the APIs wrong? Here is the test code I'm using to create this IDWriteTextLayout:
TestTextRenderer::TestTextRenderer(const std::shared_ptr<DX::DeviceResources>& deviceResources) :
m_deviceResources(deviceResources),
m_text(L"\u05e9\u05e0\u05d1\u05d2abcd!"),
m_readingDirection(DWRITE_READING_DIRECTION_RIGHT_TO_LEFT),
m_formatLocale(L"en-US"),
m_layoutLocale(L"en-US")
{
ComPtr<IDWriteTextFormat> textFormat;
DX::ThrowIfFailed(
m_deviceResources->GetDWriteFactory()->CreateTextFormat(
L"Segoe UI",
nullptr,
DWRITE_FONT_WEIGHT_MEDIUM,
DWRITE_FONT_STYLE_NORMAL,
DWRITE_FONT_STRETCH_NORMAL,
24.0f,
m_formatLocale.c_str(),
&textFormat
)
);
DX::ThrowIfFailed(textFormat->SetReadingDirection(m_readingDirection));
DX::ThrowIfFailed(
m_deviceResources->GetDWriteFactory()->CreateTextLayout(
m_text.c_str(),
(uint32) m_text.length(),
textFormat.Get(),
250.0f,
100.0f,
&m_textLayout
)
);
DWRITE_TEXT_RANGE all{0u, m_text.size()};
DX::ThrowIfFailed(m_textLayout->SetLocaleName(m_layoutLocale.c_str(), all));
DX::ThrowIfFailed(m_deviceResources->GetD2DFactory()->CreateDrawingStateBlock(&m_stateBlock));
CreateDeviceDependentResources();
}
I don't think there's any ambiguity from the Unicode BiDi algorithm point of view. Initial direction set to IDWriteTextFormat or IDWriteTextLayout is crucial, but after that run directions will be derived strictly from codepoints.
Setting locale won't change direction, but it will potentially affect shaping, end result depends on particular features run font has.
I think you can accomplish abcd!... output using LRE/PDF controls around this part of the text.
I am trying to write documentation with asciidoctor-pdf and I need to use characters like : ă,â,î,ş,ţ. The pdf output is rendered but the mentioned characters are rendered empty. I am not sure how to handle the issue.
For example:
I wrote this code:
= Document Title
Doc Writer <doc#example.com>
:doctype: book
:source-highlighter: coderay
:listing-caption: Listing
// Uncomment next line to set page size (default is Letter)
//:pdf-page-size: A4
A simple http://asciidoc.org[AsciiDoc] document.
== Introducţie
A paragraph followed by a simple list with square bullets.
And the result was the word Introducţie rendered as Introduc ie and finally the error:
/usr/local/rvm/gems/ruby-2.2.2/gems/pdf-core-0.2.5/lib/pdf/core/pdf_object.rb:55: warning: regexp match /.../n against to UTF-8 string
Can be a system encoding configuration problem?
Do I need to set different encoding configuration in ruby?
Thank you.
I think that if you want to be sure, you can always use the decimal entity references form. For the latin small Letter T with cedilla it is: ţ
Check this table for the complete list:
List of Unicode characters
In addition, if you want to use this special char in a title, there was an issue with it:
Section id with characters outside of Windows-1252 encoding causes warning
It seems to be fixed now, but I did not verify it.
One of possible ways to write such special characters in titles is to declare them in preamble of your asciidoc document, for example,
:t-cedil: ţ
and to call it in the main text
== pass:normal[Test-{t-cedil}]
So your title will look like
Test-ţ
I want to user ghostscript to optimize pdf files.
My files are generated by iText, and there is font which is embeded too many times - 3000+;
I want to repring document with ghostscript, which will remove all embeded and embed it only once in file.
Do you know how to do it ?
And additional question - is there any difference detween ghostscript and ghost4j ?
Thanks
You cannot do that, most likely. Without seeing the file I cannot be certain, but the probability is that each font is embedded as a subset. That is it contains just a few of the glyphs that were originally present in the font.
So if the first instance contains, say a, c, f and g and the second instance contains b, e and h you can see that the two fonts are actually different.
Worse, the text is usually re-encoded, so the character codes are not what you would expect. In the example above, 'a' would not have the character code 0x61 (ASCII for 'a'), it would have the character code 1. c would be 2, f would be 3 and so on. But in the second case, character code 1 would be 'b', character code 2 would be 'e' etc.
There's no easy way to recombine the multiple font subsets, and also re-encode each set of text to the 'combined' font.
Where the pdfwrite device detects multiple subset fonts which have compatible encodings (the character codes used are unique in each font) it will combine them together. It won't attempt to re-encode them again though, so if the two fonts use the same character codes (as per my example above) pdfwrite will just emit two fonts.
Assuming you've already tried running the file through pdfwrite and didn't get the result you wanted, then that's it, you can't achieve the result you want with the current code.
Probably you can tell iText not to subset the fonts, which will solve the problem for you at the source, rather than trying to fix it afterwards.
For coding reasons which would horrify you (I'm too embarrassed to say), I need to store a number of text items in a single string.
I will delimit them using a character.
Which character is best to use for this, i.e. which character is the least likely to appear in the text? Must be printable and probably less than 128 in ASCII to avoid locale issues.
I would choose "Unit Separator" ASCII code "US": ASCII 31 (0x1F)
In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded into ASCII.
ASCII 28 (0x1C) File Separator - Used to indicate separation between files on a data input stream.
ASCII 29 (0x1D) Group Separator - Used to indicate separation between tables on a data input stream (called groups back then).
ASCII 30 (0x1E) Record Separator - Used to indicate separation between records within a table (within a group). These roughly map to a tuple in modern nomenclature.
ASCII 31 (0x1F) Unit Separator - Used to indicate separation between units within a record. The roughly map to fields in modern nomenclature.
Unit Separator is in ASCII, and there is Unicode support for displaying it (typically a "us" in the same glyph) but many fonts don't display it.
If you must display it, I would recommend displaying it in-application, after it was parsed into fields.
Assuming for some embarrassing reason you can't use CSV I'd say go with the data. Take some sample data, and do a simple character count for each value 0-127. Choose one of the ones which doesn't occur. If there is too much choice get a bigger data set. It won't take much time to write, and you'll get the answer best for you.
The answer will be different for different problem domains, so | (pipe) is common in shell scripts, ^ is common in math formulae, and the same is probably true for most other characters.
I personally think I'd go for | (pipe) if given a choice but going with real data is safest.
And whatever you do, make sure you've worked out an escaping scheme!
When using different languages, this symbol: ¬
proved to be the best. However I'm still testing.
Probably | or ^ or ~ you could also combine two characters
You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.
(Interestingly enough the ascii table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)
If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (# or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.
How about you use a CSV style format? Characters can be escaped in a standard CSV format, and there's already a lot of parsers already written.
Can you use a pipe symbol? That's usually the next most common delimiter after comma or tab delimited strings. It's unlikely most text would contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.
For fast escaping I use stuff like this:
say you want to concatinate str1, str2 and str3
what I do is:
delimitedStr=str1.Replace("#","#a").Replace("|","#p")+"|"+str2.Replace("#","#a").Replace("|","#p")+"|"+str3.Replace("#","#a").Replace("|","#p");
then to retrieve original use:
splitStr=delimitedStr.Split("|".ToCharArray());
str1=splitStr[0].Replace("#p","|").Replace("#a","#");
str2=splitStr[1].Replace("#p","|").Replace("#a","#");
str3=splitStr[2].Replace("#p","|").Replace("#a","#");
note: the order of the replace is important
its unbreakable and easy to implement
Pipe for the win! |
We use ascii 0x7f which is pseudo-printable and hardly ever comes up in regular usage.
Well it's going to depend on the nature of your text to some extent but a vertical bar 0x7C doesn't crop up in text very often.
I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be able to know that the delimiter you use will not cause a conflict, then do a loop checking the file for the delimiter you want, and if it exists, then double the string until the file no longer has a match. It doesn't matter if there are similar strings because your program will only look for exact delimiter matches.
This can be good or bad (usually bad) depending on the situation and language, but keep mind mind that you can always Base64 encode the whole thing. You then don't have to worry about escaping and unescaping various patterns on each side, and you can simply seperate and split strings based on a character which isn't used in your Base64 charset.
I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside that without breaking the structure.
CSV is probably a better idea for most situations, though.
Both pipe and caret are the obvious choices. I would note that if users are expected to type the entire response, caret is easier to find on any keyboard than is pipe.
I've used double pipe and double caret before. The idea of a non printable char works if your not hand creating or modifying the file. For quick random access file storage and retrieval field width is used. You don't even have to read the file.. your literally pulling from the file by reference. This is how databases do some storage.. but they also manage the spaces between records and such. And it introduced the problem of max data element width. (Index attach a header which is used to define the width of each element and it's data type in the original old days.. later they introduced compression with remapping chars. This allows for a text file to get about 1/8 the size in transmission.. variable length char encoding for the win
make it dynamic : )
announce your control characters in the file header
for example
delimiter: ~
escape: \
wrapline: $
width: 19
hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text
would give the strings
hello world
this is \just\ a sample text
$someVar$
here is some ~~markdown strikethrough~~ text
i have implemented something similar:
a plaintar text container format,
to escape and wrap utf16 text in ascii,
as an alternative to mime multipart messages.
see https://github.com/milahu/live-diff-html-editor
I need to parse some text from pdfs but the pdf formatting results in extremely unreliable spacing. The result is that I have to ignore the spaces and have a continuous stream of non-space characters.
Any suggestions on how to parse the string and put spaces back into the string by guessing?
I'm using ruby. Or should I say I'musingruby?
Edit: I've pulled the text out using pdf-reader. Some of the pdf files are nicely formatted and some are not. An example of text mixed with positioning:
.7aspe-5.5cts-715.1o0.6f-708.5f-0.4aces-721.4that-716.3are-720.0i-1.8mportant-716.3in-713.9soc-5.5i-1.8alcommunica6.6tion6.3.-711.6Althoug6.3h-708.1m-1.9od6.3els-709.3o6.4f-702.8f5.4ace-707.9proc6.6essing-708.2haveproposed-611.2ways-615.5to-614.7deal-613.2with-613.0these-613.9diff10.4erent-613.7tasks,-611.9it-617.1remainsunclear-448.0how-450.7these-443.2mechanisms-451.7might-446.7be-447.7implemented-447.2in-450.3visualOne-418.9model-418.8of-417.3human-416.4face-421.9processing-417.5proposes-422.7that-419.8informa-tion-584.5is-578.0processed-586.1in-583.1specialised-584.7modules-577.0(Breen-584.4et-582.9al.,-582.32002;Bruce-382.1and-384.0Y92.0oung,-380.21986;-379.2Haxby-379.9et-380.5al.,-
and if I print just string data (I added returns at the end of each line to keep it from
messing up the layout here:
'Distinctrepresentationsforfacialidentityandchangeableaspectsoffacesinthehumantemporal
lobeTimothyJ.Andrews*andMichaelP.EwbankDepartmentofPsychology,WolfsonResearchInstitute,
UniversityofDurham,UKReceived23December2003;revised26March2004;accepted27July2004Availab
leonline14October2004Theneuralsystemunderlyingfaceperceptionmustrepresenttheunchanging
featuresofafacethatspecifyidentity,aswellasthechangeableaspectsofafacethatfacilitates
ocialcommunication.However,thewayinformationaboutfacesisrepresentedinthebrainremainsc
ontroversial.Inthisstudy,weusedfMRadaptation(thereductioninfMRIactivitythatfollowsthe
repeatedpresentationofidenticalimages)toaskhowdifferentface-andobject-selectiveregionsofvisualcortexcontributetospecificaspectsoffaceperception'
The data is spit out by callbacks so if I print each string as it is returned it looks like this:
'The
-571.3
neural
-573.7
system
-577.4
underly
13.9
ing
-577.2
face
-573.0
perc
13.7
eption
-574.9
must
-572.1
repr
20.8
esent
-577.0
the
unchangin
14.4
g
-538.5
featur
16.5
es
-529.5
of
-536.6
a
-531.4
face
'
On examination it looks like the true spaces are large negative numbers < -300 and the false spaces are much smaller positive numbers. Thanks guys. Just getting to the point where i am asking the question clearly helped me answer it!
Hmmmm... I'd have to say that guessing is never a good idea. Looking at the problem root cause and solving that is the answer, anything else is a kludge.
If the spacing is unreliable from the PDF, how is it unreliable? The PDF viewer needs to be able to reliably space the text so the data is there somewhere, you just need to find it.
EDIT following comment:
The idea of parsing the file using a dictionary (your only other option really, apart from randomly inserting spaces and hoping for the best) and inserting spaces at identified word boundaries (a real problem when dealing with punctuation, plurals that don't alter the base word i.e. plural, etc) would, I believe, be a much greater programming challenge than correctly parsing the PDF in the first place. After all, PDF is clearly defined whereas English is somewhat wooly.
Why not look down the route of existing solutions like ps2ascii in linux, call the function from your Ruby and pick up the result.
PDF doesn't only store spaces as space characters, but also uses layout commands for spacing (so it doesn't print a space, but moves the "pen" to the right). Perhaps you should have a look at the PDF reference (the big PDF on the bottom of the site), Chapter 9 "Text" should be what you're looking for.
EDIT: After reading your comment to Lazarus' answer, this doesn't seem to be what you're looking for. I think you should try to get a word list from somewhere and try to split your text using it. A good strategy would be to do that using recursion, because for example:
"meandyou"
The first word could be "me" or "mean", but if you try "mean", "dyou" doesn't make sense, so it will be "me", same for the next word that could be "a" or "an" or "and", only "and" makes sense.
If it were me I'd go back to the source PDFs and try a different method of extracting the text, such as iText (for Java) or maybe some kind of PDF-to-HTML to text conversion software method.