Sphinx search does not find English words with underscores in Japanese documents - full-text-search

I created Japanese and English versions of a document in Sphinx, with the same contents. The Japanese version includes some English words.
If you search both versions for an English word containing an underscore (for example,
"OMP_NUM_THREADS"), the English version returns results, but the Japanese version does not.
Do you know how to fix this?

Related

How do I link and embed a UTF8 encoded text file in an MS-Word document?

I would like to include the contents of a UTF-8 text file in an MS Word document as a link. This works for an ANSI-encoded file using the field:
{INCLUDETEXT "path\file.txt" \c ansitext \* MERGEFORMAT}
Is there a directive akin to \c ansitext for UTF8 files? \c utf8 and \c utf8text do not appear to work.
If I do not give any directive, Word recognizes that the file is UTF8, but a dialog pops up requiring me to confirm this each time the file needs updating, which I want to avoid.
There is a directive ( \c Unicode ), but unfortunately using it does not actually eliminate the character encoding pop-up, even when the Unicode text starts with a BOM (Byte Order Mark), which is in any case discouraged by Unicode.
So although that answers the question actually asked, it doesn't solve the problem. Nor, according to the discussion in comments to the Question, would any of the following solve the problem for the OP, but they might help others.
According to the ISO 29500 standard that describes .docx documents, INCLUDETEXT is supposed to have an \e switch that lets you specify an encoding. But, according to Microsoft's standard document [MS-OI29500].pdf, Word ignores any \e switch.
As far as I am aware the only way to avoid that pop-up when the included text is in Unicode format (UTF-8) is to set a value in the Windows Registry that tells Word the default encoding for text files.
The drawback is that this setting affects every text file opened by Word, whether through the file open dialog or an INCLUDETEXT field.
To create the setting, you need to navigate to the following Registry location, e.g. for Word 2016/2019 it would be
HKEY_CURRENT_USER\Software\Microsoft\Office\16.0\Word\Options
and for Word 2010 it would be
HKEY_CURRENT_USER\Software\Microsoft\Office\14.0\Word\Options
Then add a DWORD value called DefaultCPG and set its value to the code page you want to be the default. For UTF-8, that's decimal 65001.
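As a rough sketch of the step above, the following Python snippet builds the text of a .reg file that sets DefaultCPG. The function name build_reg_file and its parameters are hypothetical, chosen only for illustration; the key path, value name and code page come from the answer. It only generates text and does not touch the Registry itself.

```python
def build_reg_file(office_version: str = "16.0", code_page: int = 65001) -> str:
    """Return .reg file text that sets Word's DefaultCPG value.

    office_version is the Office version segment of the key path
    (e.g. "16.0" for Word 2016/2019, "14.0" for Word 2010).
    """
    key = "\\".join([
        "HKEY_CURRENT_USER", "Software", "Microsoft", "Office",
        office_version, "Word", "Options",
    ])
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        # .reg files store DWORD values as 8 hexadecimal digits.
        f'"DefaultCPG"=dword:{code_page:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_reg_file())
```

Decimal 65001 is written as dword:0000fde9 in the generated file. Review the output before importing it, since the setting affects every text file Word opens.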
If you have control over the format of the file to be included, you could consider using a format that wouldn't trigger the encoding pop-up. That leads to another set of problems, e.g. if you used HTML you would probably have to deal with HTML special characters such as & etc., whitespace, and RTL characters (which Word seems to reverse). But the following HTML "framework" is enough to insert a text chunk without additional paragraph marks and so on:
<html>
<meta charset="UTF-8">
<body>
<a name="x">your text</a>
</body>
</html>
In the INCLUDETEXT field, you then use the "x" to indicate the subset you want to include, e.g.
{INCLUDETEXT "path\file.htm" x \c HTML}
The HTML coding <a name="something"> is deprecated in HTML 5, but Word only understands the earlier HTML convention.
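If the included file is produced by a script, the HTML framework above could be generated along these lines. This is a minimal sketch: the function names wrap_text_as_html and convert_file are hypothetical, and it only escapes the HTML special characters; it does not address the whitespace or RTL issues mentioned above.

```python
import html
from pathlib import Path


def wrap_text_as_html(text: str, bookmark: str = "x") -> str:
    """Wrap plain text in the minimal HTML framework, escaping &, < and >."""
    escaped = html.escape(text, quote=False)
    return (
        "<html>\n"
        '<meta charset="UTF-8">\n'
        "<body>\n"
        f'<a name="{bookmark}">{escaped}</a>\n'
        "</body>\n"
        "</html>\n"
    )


def convert_file(src: Path, dst: Path, bookmark: str = "x") -> None:
    """Read a UTF-8 text file and write the wrapped HTML next to it."""
    text = src.read_text(encoding="utf-8")
    dst.write_text(wrap_text_as_html(text, bookmark), encoding="utf-8")
```

The bookmark argument must match the name used in the INCLUDETEXT field ("x" in the example above).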

Are there cases of editing HTML output by Aspose.Word with CKEditor?

I am having trouble because sentences edited in CKEditor are not output to Word when they inherit the “-aw-import:ignore” attribute.
A tag with this attribute carries information about the original Word document through the HTML-to-Word conversion and, like a meta tag, is not itself output to Word.
If a sentence entered in CKEditor inherits the attribute, it is mistakenly left out of the Word output.
Aspose.Words writes this "-aw-import:ignore" only when it needs to make certain elements visible in HTML that would otherwise be collapsed and hidden by web browsers e.g. empty paragraphs, space sequences, etc.
Currently we mark only the following elements with “-aw-import:ignore”:
Sequences of spaces and non-breaking spaces that are used to simulate padding on native list item (<li>) elements.
Non-breaking spaces that are used to prevent empty paragraphs from collapsing.
However, note that this list is not fixed and we may add more cases to it in the future.
Also, please note that Aspose.Words writes &#xa0; instead of &nbsp; because &nbsp; is not defined in XML. By default, Aspose.Words generates XHTML documents (i.e. HTML documents that comply with XML rules).
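One possible workaround for text typed in CKEditor that has inherited the marker is to strip the declaration from the edited HTML before converting it back to Word. The sketch below is not an Aspose.Words API; strip_aw_import_ignore is a hypothetical helper that uses regular expressions and assumes double-quoted inline style attributes. It should only be applied to fragments known to contain user-entered text, since removing the marker from elements Aspose.Words generated would cause their padding spaces to be imported.

```python
import re

# Matches the "-aw-import:ignore" declaration and an optional trailing semicolon.
_AW_IGNORE = re.compile(r"-aw-import\s*:\s*ignore\s*;?\s*", re.IGNORECASE)
# Matches a double-quoted inline style attribute (assumption: no single quotes).
_STYLE_ATTR = re.compile(r'style="([^"]*)"', re.IGNORECASE)


def strip_aw_import_ignore(html_text: str) -> str:
    """Remove "-aw-import:ignore" from every inline style attribute."""
    def clean(match: re.Match) -> str:
        remaining = _AW_IGNORE.sub("", match.group(1)).strip()
        return f'style="{remaining}"'

    return _STYLE_ATTR.sub(clean, html_text)


if __name__ == "__main__":
    sample = '<span style="font-size:12pt; -aw-import:ignore">text</span>'
    print(strip_aw_import_ignore(sample))
```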
I work with Aspose as Developer Evangelist.
Please find below the list of custom styles that Aspose.Words uses to save extra information in the output HTML; this information is usually used for the Aspose.Words-HTML-Aspose.Words round trip. We will add descriptions of these entities to the documentation as soon as possible.
-aw-comment-author
-aw-comment-datetime
-aw-comment-initial
-aw-comment-start
-aw-comment-end
-aw-footnote-type
-aw-footnote-numberstyle
-aw-footnote-startnumber
-aw-footnote-isauto
-aw-headerfooter-type
-aw-bookmark-start
-aw-bookmark-end
-aw-different-first-page
-aw-tabstop-align
-aw-tabstop-pos
-aw-tabstop-leader
-aw-field-code
-aw-wrap-type
-aw-left-pos
-aw-top-pos
-aw-rel-hpos
-aw-rel-vpos
-aw-revision-author
-aw-revision-datetime

Arabic text is not being displayed in proper format.

We are generating PDF templates from HTML text. My template contains a combination of Arabic and English words.
Arabic text is not displayed correctly in the generated PDF. (It should be displayed from right to left.)
I have loaded a valid pdfCalligraph license key in my code.
Please let me know if I need to write any code for this; I searched the web and found that pdfCalligraph should detect Arabic text automatically and handle it.

Special character in DOMPDF

We are copying content from a Word document into CKEditor in Drupal. In the generated PDF, "?" appears in place of some characters.
In some places the content also overlaps.
When the same content is viewed on the website, it appears fine.
Any help with this?
Thanks

Joomla 2.5 with Sobi Pro and Mandarin characters

My client has a single page on his Joomla! 2.5 site whose content must be translated into Mandarin. In text areas within Sobi Pro, we can easily add the Mandarin text, and all is fine, but in text fields, we are unable to add the text - we get a bright red validation error with the text "The data entered in the "area name" field contains not allowed characters".
How can I make it work so that I can enter Chinese characters in these text fields?
Thanks!
Kobus
Found out by means of a person on the Sobi Pro website; it was a filter that prevented input of the characters. Problem is solved.
Thanks!
Kobus
