reStructuredText and glossary terms translation - internationalization

I'd like to know how I can translate (as in i18n) the terms in a glossary. I use Sphinx 1.1.3.
Let's say I have:
.. glossary::

   term
      definition
After I run make gettext I get the .po files, but I can only translate the definitions, not the terms. I searched the documentation thoroughly but couldn't find any hints. If translating terms is somehow possible, how can I automatically sort them alphabetically in the target language?

It seems that this feature will be available in Sphinx 1.2.
The question about sorting the translated glossary still remains. :sorted: does not work.
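For reference, the gettext workflow is driven by a few conf.py options. A minimal sketch, with illustrative paths and German as the example target language:

# conf.py -- minimal i18n setup for a Sphinx project (paths are illustrative)
language = 'de'            # language to build the translated docs in
locale_dirs = ['locale/']  # where the translated .po/.mo catalogs live
gettext_compact = False    # one catalog per source document, not per directory

make gettext writes .pot templates into the build directory; these are merged into per-language .po files under locale/de/LC_MESSAGES/ before rebuilding (the sphinx-intl tool can automate the merge step).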

Related

Does LibShortText work with other languages too?

LibShortText is an open source tool for short-text classification and analysis.
http://www.csie.ntu.edu.tw/~cjlin/libshorttext/
I have tried to figure out whether it also works with languages other than English (e.g. German), but I didn't find a hint. Does anyone know the answer? Thanks in advance.
I think so (though it may need some extra preprocessing). LIBSVM and LIBLINEAR are both language-agnostic. Since LibShortText is built on top of LIBLINEAR, it should work for all languages too.
According to this paper, it has internal pre-processing methods to extract features.
libshorttext.converter: For given short texts, LibShortText follows the bag-of-word model to generate features. Users apply procedures in this library to pre-process short texts by tokenization, stemming (optional), and stop-word removal (optional). The library also allows users to choose between unigram and bigram features.
However, it looks like its stemming and stop-word removal only support English. So if you want better features extracted from non-English text, you might want to use your own pre-processing methods, for example using NLTK.
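For example, here is a minimal sketch of German pre-processing with NLTK (the punkt and stopwords corpora must be downloaded once; the function itself is illustrative):

import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# one-time setup: nltk.download('punkt'); nltk.download('stopwords')

def preprocess_german(text):
    # Tokenize, drop stop words, and stem, mirroring what
    # libshorttext.converter does for English.
    stemmer = SnowballStemmer('german')
    stop = set(stopwords.words('german'))
    tokens = nltk.word_tokenize(text, language='german')
    return [stemmer.stem(t) for t in tokens
            if t.isalpha() and t.lower() not in stop]

print(preprocess_german('Die Katzen sitzen auf den Matten'))
# ['katz', 'sitz', 'matt']

The stemmed tokens can then be joined back into a string and fed to LibShortText in place of the raw text.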

ElasticSearch multiple languages

I am currently trying to figure out analysis schemes for my ElasticSearch cluster. I am using ES to index PDF, Word, PowerPoint and Excel documents, and I am using Apache Tika to extract the text.
My problem is that I do not know beforehand what languages the file contents will be in. They could be written in any language.
My question is, is there a way to make ES analyze text regardless of the language? Or should I have a pre-defined field for each language with its own tokenizer, analyzer and stopwords?
I suggest taking a look at the ElasticSearch plugin elasticsearch-mapper-attachments. I used it to build document search functionality.
When it comes to supporting multiple languages, we have had the best experience with one index per language. If you can identify the language before indexing, you can insert the document into the appropriate index. This makes it easier to add new languages than a field-per-language approach does.
One thing to remember is the "Don't use Types for Languages" note at the bottom of the "one language per document" page. Doing that can mess up search in a way that is very difficult to debug.
If you need to detect the language, there are two options mentioned at the bottom of the Pitfalls of Mixing Languages page.
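To make that concrete, here is a minimal sketch of the one-index-per-language approach using the official elasticsearch Python client (7.x-style API); the index names, the set of languages, and the use of langdetect for detection are all illustrative assumptions:

from elasticsearch import Elasticsearch
from langdetect import detect  # assumption: any language detector will do

es = Elasticsearch()

# Built-in ElasticSearch language analyzers, one index per language.
LANGS = {'en': 'english', 'de': 'german', 'fr': 'french'}

for code, analyzer in LANGS.items():
    es.indices.create(
        index='docs-' + code,
        body={'mappings': {'properties': {
            'content': {'type': 'text', 'analyzer': analyzer}}}},
        ignore=400)  # ignore "index already exists"

def index_document(doc_id, text):
    lang = detect(text)  # e.g. 'de'
    target = 'docs-' + (lang if lang in LANGS else 'en')
    es.index(index=target, id=doc_id, body={'content': text})

Adding a new language is then a one-line change to LANGS plus a new index, with no remapping of existing data.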

How to get list of figures in Asciidoc

I am using asciidoc for an article. At the end of my document I want to have a list of figures. How do I create a list of figures? I did not find anything useful in the documentation.
Nope, there isn't one at the time of this answer. I checked the docs (which you indicated you did as well) and I also grepped the codebase. There is good news though! You should be able to do this with an extension.
Extensions can be written in any JVM language if you're using AsciidoctorJ, or in Ruby if you're using core Asciidoctor (I'm not sure about JavaScript for Asciidoctor.js). You'll probably need to create two extensions: a TreeProcessor extension to go through the whole AST looking for images and pulling them out into a storage structure, and then either an inline or block macro to actually place the list within the page.
I strongly recommend examining the API for the nodes and functions you'll want to make use of. There are some other examples of processors that may also be helpful to examine.

What simple syntax can be used for rich text?

In an application I want a simple text input, enriched with some marks to include formatting or semantic labeling. I want the syntax to be as easy as possible, and I want to be able to include self-defined labels.
Example:
[bold]Stackoverflow[/bold] is a [tag]good[/tag] resource for programmers.
Tables would be needed too.
HTML/XML and LaTeX are powerful enough to allow this, but they are too complicated. Wiki syntax seems simple, but it uses a different symbol for each kind of markup, has unclear quoting, and every wiki seems to have its own syntax. For tables and similar things, wiki markup becomes very complicated.
Is there a language/syntax that matches my needs, or that can be slightly changed to do so? Or do I have to invent something myself? In that case, do you have suggestions?
Definitely do NOT invent your own. There are plenty of simple markup languages already, and users HATE learning new ones. Trust me on this!
I would suggest using one of the following:
Textile
Markdown
BBCode
Make your decision based on your userbase, as well as what tools and parsers are available in your chosen language. For my site, we went with Textile, but I've found that BBCode tends to be the language that most people already know. However, this will vary with different user demographics.
StackOverflow, along with several other sites, uses Markdown. I think it will give you the best balance between features and simplicity.
Let me add ReStructuredText to the list.
An additional benefit of using it is the availability of the ReStructuredText to Anything service, which makes it extremely easy to create HTML or PDF versions of the document.
As already pointed out, there are a lot of lightweight markup languages (many are listed in the Wikipedia article on the subject), so there should be no need to create your own.
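If your application happens to be in Python, the suggestions above all have ready-made converters; a minimal sketch using the markdown and docutils packages (the choice of packages is an assumption about your stack):

import markdown                           # pip install markdown
from docutils.core import publish_parts   # pip install docutils

text_md = 'Stackoverflow is a **good** resource for programmers.'
html_md = markdown.markdown(text_md)

text_rst = 'Stackoverflow is a *good* resource for programmers.'
html_rst = publish_parts(text_rst, writer_name='html')['html_body']

BBCode and Textile have comparable libraries (e.g. the bbcode and textile packages on PyPI), so the choice of syntax does not lock you into much implementation work.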

Lightweight fuzzy search library

Can you suggest a lightweight fuzzy text search library?
What I want to do is to allow users to find correct data for search terms with typos.
I could use a full-text search engine like Lucene, but I think it's overkill.
Edit:
To make question more clear here is a main scenario for that library:
I have a large list of strings. I want to be able to search in this list (something like MSVS's IntelliSense), but it should also be possible to filter the list by a string which is not present in it but is close enough to some string which is in the list.
Example:
Red
Green
Blue
When I type 'Gren' or 'Geen' in a text box, I want to see 'Green' in the result set.
Main language for indexed data will be English.
I think that Lucene is too heavy for that task.
Update:
I found one product matching my requirements. It's ShuffleText.
Do you know any alternatives?
Lucene is very scalable, which means it's good for little applications too. You can create an index in memory very quickly if that's all you need.
For fuzzy searching, you really need to decide what algorithm you'd like to use. With information retrieval, I use an n-gram technique with Lucene successfully. But that's a special indexing technique, not a "library" in itself.
Without knowing more about your application, it won't be easy to recommend a suitable library. How much data are you searching? What format is the data? How often is the data updated?
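To illustrate the n-gram idea without any Lucene machinery, here is a minimal pure-Python sketch that ranks candidates by trigram overlap; the padding scheme and the 0.3 threshold are arbitrary choices:

def ngrams(s, n=3):
    # Character n-grams, padded so short words still produce grams.
    s = ' ' + s.lower() + ' '
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def fuzzy_search(query, candidates, threshold=0.3):
    # Rank candidates by Jaccard similarity of their trigram sets.
    q = ngrams(query)
    scored = []
    for c in candidates:
        g = ngrams(c)
        score = len(q & g) / len(q | g)
        if score >= threshold:
            scored.append((score, c))
    return [c for score, c in sorted(scored, reverse=True)]

print(fuzzy_search('Gren', ['Red', 'Green', 'Blue']))  # ['Green']

The same trigram sets can be placed in an inverted index, which is essentially what the Lucene n-gram approach does at scale.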
I'm not sure how well Lucene is suited to fuzzy searching; a custom library would be a better choice. For example, this search is done in Java and works pretty fast, but it is custom-made for such a task:
http://www.softcorporation.com/products/people/
Soundex is very 'English' in its encoding; Daitch-Mokotoff works better for many names, especially European (Germanic) and Jewish names. In my UK-centric world, it's what I use.
Wiki here.
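For comparison, plain American Soundex fits in a few lines of Python. This is a simplified sketch that ignores the special 'h'/'w' separator rule, so it can differ from the official algorithm on names like Ashcraft:

def soundex(name):
    # Simplified American Soundex: first letter + three digits.
    codes = {}
    for letters, digit in [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                           ('l', '4'), ('mn', '5'), ('r', '6')]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], '')
    for ch in name[1:]:
        digit = codes.get(ch, '')
        if digit and digit != prev:
            result += digit
        prev = digit  # vowels reset prev, so repeats across vowels count
    return (result + '000')[:4]

print(soundex('Robert'), soundex('Rupert'))  # R163 R163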
You didn't specify your development platform, but if it's PHP then I suggest you look at the Zend Lucene library:
http://ifacethoughts.net/2008/02/07/zend-brings-lucene-to-php/
http://framework.zend.com/manual/en/zend.search.lucene.html
As it is LAMP-based, it's far lighter than Lucene on Java, and it can easily be extended for other file types, provided you can find a conversion library or command-line converter; there are lots of OSS solutions around to do this.
Try Walnutil, which is based on the Lucene API and integrates with SQL Server and Oracle databases. You can create any type of index and then use it. For simple searches you can use some methods from walnutilsoft; for more complicated search cases you can use the Lucene API. See the web-based example that uses indexes created with the Walnutil tools. You can also see some code examples, written in Java and C#, that you can use for creating different types of search.
This tool is free.
http://www.walnutilsoft.com/
If you can choose to use a database, I recommend using PostgreSQL and its fuzzy string matching functions.
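A minimal sketch of what that looks like from Python with the pg_trgm extension (table and column names are illustrative; psycopg2 is assumed as the driver):

import psycopg2

conn = psycopg2.connect('dbname=mydb')  # illustrative connection string
cur = conn.cursor()

# pg_trgm ships with PostgreSQL's contrib modules; enable it once per database.
cur.execute('CREATE EXTENSION IF NOT EXISTS pg_trgm')

# Rank rows by trigram similarity to the misspelled query.
cur.execute(
    'SELECT name FROM colors'
    ' WHERE similarity(name, %s) > 0.3'
    ' ORDER BY similarity(name, %s) DESC',
    ('Gren', 'Gren'))
print(cur.fetchall())  # e.g. [('Green',)]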
If you can use Ruby, I suggest looking into the amatch library.
#aku - links to working soundex libraries are right there at the bottom of the page.
As for Levenshtein distance, the Wikipedia article on that also has implementations listed at the bottom.
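The core algorithm really is tiny; a minimal dynamic-programming sketch in Python:

def levenshtein(a, b):
    # Classic edit-distance recurrence, keeping only the previous row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein('Gren', 'Green'))  # 1

Filtering a word list by a small maximum distance (say, 1 or 2) gives exactly the IntelliSense-style behavior described in the question.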
A powerful, lightweight solution is Sphinx (the search engine, not the documentation tool mentioned above).
It's smaller than Lucene and it supports disambiguation.
It's written in C++; it's fast, battle-tested, has libraries for every environment, and is used by large companies like craigslist.org.
