Word wrap algorithms for Japanese

In a recent web application I built, I was pleasantly surprised when one of our users decided to use it to create something entirely in Japanese. However, the text was wrapped strangely and awkwardly. Apparently browsers don't cope with wrapping Japanese text very well, probably because it contains few spaces, as each character forms a whole word. However, that's not really a safe assumption to make as some words are constructed of several characters, and it is not safe to break some character groups into different lines.
Googling around hasn't really helped me understand the problem any better. It seems to me like one would need a dictionary of unbreakable patterns, and assume that everywhere else is safe to break. But I fear I don't know enough about Japanese to really know all the words, which I understand from some of my searching, are quite complicated.
How would you approach this problem? Are there any libraries or algorithms you are aware of that already exist that deal with this in a satisfactory way?

Japanese word wrap rules are called kinsoku shori and are surprisingly simple. They're actually mostly concerned with punctuation characters and do not try to keep words unbroken at all.
I just checked with a Japanese novel and indeed, both words in the syllabic kana script and those consisting of multiple Chinese ideograms are wrapped mid-word with impunity.
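To make the punctuation-only nature of the rules concrete, here is a minimal Python sketch. It assumes a fixed line width counted in characters and uses a deliberately tiny table of prohibited characters; real kinsoku tables are longer, and the names NO_LINE_START, NO_LINE_END and kinsoku_wrap are made up for this illustration.

```python
# Illustrative subsets only; real kinsoku tables contain more characters.
NO_LINE_START = set("、。，．）」』】！？ー")  # must not begin a line
NO_LINE_END = set("（「『【")                  # must not end a line


def can_break_before(text, i):
    """True if a line break between text[i - 1] and text[i] breaks no rule."""
    return text[i] not in NO_LINE_START and text[i - 1] not in NO_LINE_END


def kinsoku_wrap(text, width):
    """Split text into lines of at most `width` characters.

    Words are broken freely; the break point only moves left when the
    next line would start, or the current line would end, with a
    prohibited character.
    """
    lines = []
    start = 0
    while len(text) - start > width:
        end = start + width  # fallback if no permitted break is found
        for candidate in range(start + width, start, -1):
            if can_break_before(text, candidate):
                end = candidate
                break
        lines.append(text[start:end])
        start = end
    if start < len(text):
        lines.append(text[start:])
    return lines


print(kinsoku_wrap("今日は晴れです。明日は雨です。", 7))
# ['今日は晴れで', 'す。明日は雨で', 'す。']
```

Note that the first line is one character short: breaking after seven characters would have put 。 at the start of the next line, so the break moves left instead, even though that splits です in the middle.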

The projects listed below help with Japanese word wrapping (or word breaking, depending on your point of view).
budou (Python): https://github.com/google/budou
mikan (JS): https://github.com/trkbt10/mikan.js
mikan.sharp (C#): https://github.com/YoungjaeKim/mikan.sharp
mikan takes a regex-based approach, while budou uses natural language processing.

Related

How can I efficiently find all people mentioned in some text, while tolerating spelling mistakes?

I have a list of names of millions of famous people (from Wikidata), and I need to create a system that efficiently finds all people mentioned in a fairly short text: it can be just one word (eg. "Einstein") to a few pages of text (eg. a Wikipedia page).
I need the system to be fairly tolerant to spelling mistakes (eg. Mikael Jackson instead of Michael Jackson), and short forms (eg. M. Jackson). In case of ambiguity, it should return all possible people (eg. "George Bush" should return both father and son, and possibly other homonyms).
This related question has a few interesting answers, including using the Aho-Corasick algorithm. There are libraries in many languages, including in Python. However, it does not seem to support fuzzy search (ie. tolerate misspellings).
I guess I could extend the vocabulary to include all the possible spellings of each name, but that would make the vocabulary too large, so I would rather avoid that if possible (moreover, I may want to extend this solution to more than just people at one point).
I took a quick look at Lucene/ElasticSearch but it does not seem to support this use case (unless I missed it).
Any ideas?
Elasticsearch supports fuzzy matching; see the fuzziness options in its query documentation.
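As a rough sketch of what that looks like, the Python snippet below builds the JSON body of a match query with the fuzziness parameter set to "AUTO". The field name "name" and the helper build_fuzzy_name_query are assumptions for illustration; your index mapping will differ, and this only builds the request body rather than talking to a cluster.

```python
import json


def build_fuzzy_name_query(text, field="name"):
    """Return an Elasticsearch request body for a fuzzy match query.

    `field` is a hypothetical field holding the person's name.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    # "AUTO" picks the allowed edit distance from term length.
                    "fuzziness": "AUTO",
                }
            }
        }
    }


print(json.dumps(build_fuzzy_name_query("Mikael Jackson"), indent=2))
```

This searches an index of names for one candidate string, so for a multi-page text you would still need to extract candidate names first and query for each.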

Best separator character for Hadoop files

I'm writing CSV-style files out of a system to be consumed by Hadoop. What is the best column separator to use within the file? I have tried Ctrl-A, but it's a pain IMO because other programs don't necessarily show it, e.g. when I view the file using vi, Notepad, a web browser, or Excel. Comma is a pain because the data might also contain commas. I was thinking of standardising on tab. Is there a best practice for this with regard to Hadoop, or doesn't it matter? I have done a fair bit of searching and can't find much on this fairly basic question.
There are certainly tradeoffs to each. It really depends what you care most about.
Commas: if you care about interoperability. Every tool works with CSV. Commas in the data are a pain only if the writing system doesn't escape properly, or the reading system doesn't respect the escaping. Hive handles escaping correctly, as far as I know.
Tabs: if you care about interoperability and expect commas in the data but no tabs. You're slightly less likely to have tabs in the data, but it's also slightly less likely that any given tool supports TSV.
Ctrl+A: if you care only about Hadoop-ecosystem functionality. This has definitely become the de facto Hadoop standard, but Hadoop also easily supports commas and tabs. The upside is that you usually don't have to care about escaping.
In the end, I think it's usually a toss-up, assuming you're escaping correctly (and you should be!). There's no best practice. If you find yourself worrying a lot about this kind of thing, you might also want to step up to a more serious serialization format, like Avro, which is very well-supported in Hadoop-world.
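To illustrate the "escaping correctly" point, here is a small Python sketch using the standard csv module. It writes the same rows with each of the three separators and reads them back; the sample rows and the round_trip helper are invented for the example. It only shows that the choice of separator stops mattering once writer and reader agree on a quoting convention, and whatever reads the file on the Hadoop side has to follow that same convention.

```python
import csv
import io

# Sample data containing both a comma and a tab inside field values.
rows = [
    ["id", "comment"],
    ["1", "contains, a comma"],
    ["2", "contains\ta tab"],
]


def round_trip(rows, delimiter):
    """Write rows with the given delimiter, then parse them back."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        lineterminator="\n",
    )
    writer.writerows(rows)
    buffer.seek(0)
    return list(csv.reader(buffer, delimiter=delimiter))


for name, delimiter in [("comma", ","), ("tab", "\t"), ("ctrl-A", "\x01")]:
    print(name, round_trip(rows, delimiter) == rows)
```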

How can I find out the language from a character?

Given a Unicode character, we want to find out what languages include this character, and more importantly, understand whether or not each language is Left-To-Right.
For example, the character A might be both English and Spanish which are both LTR languages.
I want this for my own text editor.
Can anyone help me in finding an API function or something that solves my problem?
Thanks in advance
Unicode-wise, LTR/RTL is a property of characters, not of the languages that use that character. This matters because embedded English in an Arabic text should be displayed left-to-right, even if for simplicity the document as a whole may be marked as Arabic. If you're using JCL, these properties can be obtained using the UnicodeIsLeftToRight and UnicodeIsRightToLeft functions. Note that characters may be neither left-to-right nor right-to-left, and also note that JCL uses a private copy of the Unicode character list that may be a subtly different version from what any specific version of Windows uses.
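If you are not tied to JCL, the same per-character property is available elsewhere. As a sketch, Python's standard unicodedata module exposes the Unicode bidirectional class of a character; the direction helper and its three return labels below are my own simplification, not part of any library.

```python
import unicodedata


def direction(ch):
    """Classify one character by its Unicode bidirectional class."""
    bidi = unicodedata.bidirectional(ch)
    if bidi == "L":
        return "ltr"
    if bidi in ("R", "AL"):  # Hebrew-style and Arabic-style letters
        return "rtl"
    return "other"  # digits, whitespace, punctuation, and so on


for ch in ["A", "\u05d0", "\u0627", "1", " "]:
    print(repr(ch), unicodedata.bidirectional(ch), direction(ch))
```

The "other" bucket is the "neither left-to-right nor right-to-left" case mentioned above; laying out a whole line properly takes the full Unicode bidirectional algorithm, which this sketch does not attempt.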
Regarding the question in the title, you would need to carry out an extensive study of the use of characters in the languages of the world. There are a few thousand languages, though many of them have no regular writing system; on the other hand, some languages have several writing systems. Different variants of a language may have different repertoires of characters.
So it would be a major effort, though some data has been compiled e.g. in the CLDR repertoire – but the concept “characters used in a language” is far from clear. (Are the characters æ, è, and ö used in English? They sure appear in some forms of written English.)
So it would be unrealistic to expect to find a library routine for such purposes.
Apparently your real need was for deciding whether a character is a left-to-right character or a right-to-left character. But for completeness, I have provided an answer to what you actually asked and that might be relevant in some other contexts.

How can I learn to read formulas with greek symbols?

I suppose maybe it's because I don't know the keywords to google for, but I can't find any sources on how to read those formulas you see on wikipedia, like this for instance:
Erlang Distribution
I've searched in the math world and computer science world. It feels like it is assumed that we're supposed to understand it out of thin air. Beginner lessons seem scarce.
So far I know how sigma works. And that upside-down shape that is used as the half-life logo is called lambda. But what the heck is it trying to say?? Why is there a semi-colon in the function, etc..
If there is a book on this stuff I'd buy it in an instant. It is probably very basic stuff but I never had experience in theoretical math or even know where to look.
Does anyone know what this subject is called, and what to google for?
Formulas with these symbols are usually statistics or probability notation.
Greek letters (e.g. θ, β) are commonly used to denote unknown parameters (population parameters).
You can find more information in the Wikipedia articles "Greek letters used in mathematics, science, and engineering" and "Notation in probability and statistics".
I think the colon in "alt.: θ = 1/λ > 0 scale (real)" in the box in Wikipedia is just saying that there is an alternative definition, in which you specify theta rather than lambda, and in that definition what is called theta is the reciprocal of lambda in the other definition.
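It can help to turn the formula into code. The sketch below writes the Erlang density both ways in Python, assuming the usual textbook form f(x; k, λ) = λ^k x^(k-1) e^(-λx) / (k-1)!. The semicolon simply separates the variable x from the parameters k and λ, which in code are just extra function arguments. The function names are mine.

```python
import math


def erlang_pdf_rate(x, k, lam):
    """Erlang density with shape k and rate lam (lambda), for x >= 0."""
    return lam**k * x ** (k - 1) * math.exp(-lam * x) / math.factorial(k - 1)


def erlang_pdf_scale(x, k, theta):
    """The alternative form with scale theta, where theta = 1 / lambda."""
    return x ** (k - 1) * math.exp(-x / theta) / (theta**k * math.factorial(k - 1))


# Same distribution, two parameterizations: lambda = 0.5 means theta = 2.
print(math.isclose(erlang_pdf_rate(2.0, 3, 0.5), erlang_pdf_scale(2.0, 3, 2.0)))
```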
I once complained to a much better mathematician than I was that I came unstuck with formulas with some of the weirder greek letters in because I couldn't write them recognisably in my handwriting (which is bad enough for the latin alphabet). He said a lot of the people he knew simply said "let x be funny-squiggle-thing" and rewrote with sensible letters. I really wish I'd thought of that.
In general, letters in weird alphabets behave pretty much like sensible letters, at least in the sort of thing you are pointing at. It's done as a sort of type-checking - usually all of the letters pinched from some particular foreign language are related in some way - e.g. all parameters. Unfortunately that doesn't hold exactly in the Wikipedia example you quote, where two of the greek letters stand for functions - one is definitely the Gamma function. I suspect the other is http://en.wikipedia.org/wiki/Digamma_function, but I'm not really sure.
Check out the resources list here: http://en.wikipedia.org/wiki/Greek_alphabet
I would say your best bet is still to search Google (or another search engine, whatever floats your boat) for the specific formula you are trying to learn. Sometimes a symbol has different meanings in different formulas.
Anyway, there is a good resource here that explains a lot of math symbols, not just the Greek ones.
Some link that may interest you here and here.
First, find a Greek alphabet (upper and lower case) to refer to, so that you can at least call lambda by its name. No one starts out knowing automatically what the various Greek characters are, not even Greeks.
Second, read the actual article. Usually either the character is defined (as lambda happens to be in the Wikipedia page you reference) or it's standard nomenclature (in which case you've done the right thing by looking for a basic article on the function in question; I do this all the time, so don't feel bad). Or, as a third option, it's a crappy paper. Happens sometimes. It's kind of a pain, though, since you can't just do a text search on the lambda character in a PDF.
(Someone educate me on that if I'm wrong....)
Third, try to pick out which unfamiliar symbols are variables (like lambda) and which are operators (like sigma and its helpers). It's the operators that can sometimes cause real trouble. A variable is just a name for something, but operators come freighted with more meaning, more rules, and more syntax. It's not always obvious which symbols are operators, either.
Finally, and specifically for computer science, a good introductory book (college freshman/sophomore level) on discrete math will hopefully treat most of the basic notations and operators to at least get your feet on the ground. Nowadays, you kids and your newfangled internet might be able to get something similar from Udacity, edX, Coursera, or Khan Academy.
Basically, it's a lot of hard work, especially on your own, but you're already doing most of the right things.

Tips for writing good EBNF grammars

I'm writing some Extended Backus–Naur Form grammars for document parsing. There are lots of excellent guides for the syntax of these definitions, but very little online about how to design and structure them.
Can anyone suggest good articles (or general tips) about how you like to approach writing these as there does seem to be an element of style even if the final parse trees can be equivalent.
e.g. things like:
Deciding if you should explicitly tag newlines, or just treat it as whitespace?
Naming schemes for your nonterminals
Handling optional whitespace in long definitions
When to use bad syntax checks vs just letting those not match
Thanks,
You should work in the direction that you are most comfortable with - either bottom-up, top-down, or "sandwich" (do a little of both, meet somewhere in the middle).
Any "group" that can be derived and has a meaning of its own should get its own non-terminal. So for example, I would use a non-terminal for all newline-related whitespace, one for all other whitespace, and one for all whitespace (which is basically the union of the former two).
Naming conventions in grammars in general are that non-terminals are, or start with, a capital letter, and terminals start with non-capitals (but this of course depends on the language you're designing).
Regarding bad syntax checks, I'm not familiar with the concept. What I know of EBNFs is that you just write everything your language accepts, and only that.
Generally, just look around at some EBNFs of different languages from different websites, get a feeling of how they look, and then do what feels right to you.
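The "one non-terminal per meaningful group" advice above can be tried out quickly before committing to a grammar. The sketch below uses Python regex fragments as stand-ins for EBNF rules; the rule names NEWLINE, INLINE_SPACE and WHITESPACE are hypothetical, and a real EBNF tool would express the same composition in its own syntax.

```python
import re

# Each "rule" is a named fragment, and the last one is built from the others,
# mirroring: Whitespace = Newline | InlineSpace ;
NEWLINE = r"(?:\r\n|\r|\n)"
INLINE_SPACE = r"[ \t]"
WHITESPACE = rf"(?:{NEWLINE}|{INLINE_SPACE})"


def split_lines(text):
    """Use the newline rule on its own, where line structure matters."""
    return re.split(NEWLINE, text)


def collapse_whitespace(text):
    """Use the combined rule where any whitespace is equivalent."""
    return re.sub(rf"{WHITESPACE}+", " ", text)


print(split_lines("a\r\nb\nc"))
print(collapse_whitespace("a \t\r\n b"))
```

Keeping the newline rule separate means you can decide later, rule by rule, whether a newline is significant or just whitespace.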
