How does language detection work? - algorithm

I have been wondering for some time how Google Translate (or maybe a hypothetical translator) detects the language of the string entered in the "from" field. I have been thinking about this, and the only thing I can think of is looking for words that are unique to a language in the input string. Another way could be to check sentence formation or other semantics in addition to keywords, but that seems to be a very difficult task considering the number of languages and their semantics. I did some research and found that there are approaches that use n-gram sequences and statistical models to detect the language. I would appreciate a high-level answer too.

Take the English Wikipedia. Check the probability that the letter 'a' is followed by a 'b' (for example), and do that for every combination of letters; you will end up with a matrix of probabilities.
If you do the same for Wikipedia in different languages, you will get a different matrix for each language.
To detect the language, just use all those matrices and use the probabilities as a score. Let's say that in English you'd get these probabilities:
t->h = 0.3, h->e = 0.2
and in the Spanish matrix you'd get
t->h = 0.01, h->e = 0.3
The word 'the', using the English matrix, would give you a score of 0.3 + 0.2 = 0.5,
and using the Spanish one: 0.01 + 0.3 = 0.31.
The English matrix wins, so the text has to be English.
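A minimal Python sketch of that scoring, using just the two toy bigram values from the example (a real system would estimate full matrices from large corpora and sum log-probabilities rather than raw probabilities):

# Toy bigram matrices; in practice, estimate these from Wikipedia dumps per language.
bigram_scores = {
    "english": {("t", "h"): 0.3, ("h", "e"): 0.2},
    "spanish": {("t", "h"): 0.01, ("h", "e"): 0.3},
}

def score(text, matrix):
    """Sum the bigram scores of consecutive character pairs."""
    return sum(matrix.get((a, b), 0.0) for a, b in zip(text, text[1:]))

def guess_language(text):
    return max(bigram_scores, key=lambda lang: score(text, bigram_scores[lang]))

print(guess_language("the"))   # -> 'english' (0.5 vs 0.31, as in the example)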

If you want to implement a lightweight language guesser in the programming language of your choice, you can use the method of Cavnar and Trenkle '94, 'N-Gram-Based Text Categorization'. You can find the paper on Google Scholar, and it is pretty straightforward.
Their method builds an N-gram statistic for every language it should be able to guess, from some text in that language. Then the same statistic is built for the unknown text and compared to the previously trained statistics with a simple out-of-place measure.
If you use unigrams + bigrams (possibly + trigrams) and compare the 100-200 most frequent N-grams, your hit rate should be over 95% if the text to guess is not too short.
There was a demo available online, but it doesn't seem to work at the moment.
There are other ways of language guessing, including computing the probability of N-grams and more advanced classifiers, but in most cases the Cavnar and Trenkle approach should perform well enough.
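A rough Python sketch of that rank-profile comparison; the profile size of 200 and n <= 3 come from the ranges mentioned above, and the training texts are placeholders you would supply per language:

from collections import Counter

def ngram_profile(text, max_n=3, top_k=200):
    """Return the top_k most frequent 1..max_n character n-grams, ranked."""
    counts = Counter()
    text = " ".join(text.lower().split())
    for n in range(1, max_n + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_k))}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile get the maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(abs(rank - lang_profile.get(gram, max_penalty))
               for gram, rank in doc_profile.items())

def guess(text, lang_profiles):
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# profiles = {lang: ngram_profile(sample) for lang, sample in train_texts.items()}
# guess("ceci n'est pas une pipe", profiles)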

You don't have to do deep analysis of the text to get an idea of what language it's in. Statistics tells us that every language has specific character patterns and frequencies, and that's a pretty good first-order approximation. It gets worse when the text is in multiple languages, but it's still not something extremely complex.
Of course, if the text is too short (e.g. a single word, or worse, a single short word), statistics doesn't work, and you need a dictionary.

An implementation example.
Mathematica is a good fit for implementing this. It recognizes (i.e. has dictionaries for) words in the following languages:
dicts = DictionaryLookup[All]
{"Arabic", "BrazilianPortuguese", "Breton", "BritishEnglish", \
"Catalan", "Croatian", "Danish", "Dutch", "English", "Esperanto", \
"Faroese", "Finnish", "French", "Galician", "German", "Hebrew", \
"Hindi", "Hungarian", "IrishGaelic", "Italian", "Latin", "Polish", \
"Portuguese", "Russian", "ScottishGaelic", "Spanish", "Swedish"}
I built a little and naive function to calculate the probability of a sentence in each of those languages:
f[text_] :=
 SortBy[{#[[1]], #[[2]]/Length@k} & /@ (Tally@(First /@
      Flatten[DictionaryLookup[{All, #}] & /@ (k =
         StringSplit[text]), 1])), -#[[2]] &]
So, just by looking words up in dictionaries, you can get a good approximation, even for short sentences:
f["we the people"]
{{BritishEnglish,1},{English,1},{Polish,2/3},{Dutch,1/3},{Latin,1/3}}
f["sino yo triste y cuitado que vivo en esta prisión"]
{{Spanish,1},{Portuguese,7/10},{Galician,3/5},... }
f["wszyscy ludzie rodzą się wolni"]
{{"Polish", 3/5}}
f["deutsch lernen mit jetzt"]
{{"German", 1}, {"Croatian", 1/4}, {"Danish", 1/4}, ...}

You might be interested in the WiLI benchmark dataset for written language identification. The high-level answer, which you can also find in the paper, is the following:
Clean the text: remove things you don't want / need; make Unicode unambiguous by applying a normalization form.
Feature extraction: count n-grams, create tf-idf features, something like that.
Train a classifier on the features: neural networks, SVMs, Naive Bayes, ... whatever you think could work.
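For those three steps, a minimal scikit-learn sketch; the two training sentences and labels are placeholders, and in reality you would train on a corpus such as WiLI:

import unicodedata
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def clean(text):
    # Step 1: normalize Unicode and lowercase.
    return unicodedata.normalize("NFC", text).lower()

train_texts = ["the quick brown fox", "el rápido zorro marrón"]   # placeholder corpus
train_labels = ["en", "es"]                                       # placeholder labels

model = make_pipeline(
    # Step 2: character n-gram tf-idf features.
    TfidfVectorizer(preprocessor=clean, analyzer="char", ngram_range=(1, 3)),
    # Step 3: a simple classifier.
    MultinomialNB(),
)
model.fit(train_texts, train_labels)
print(model.predict(["zorro marrón"]))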

Related

Convert DNA Sequences into numerical vectors for R / Weka

I would like to use machine learning techniques such as Naive Bayes and SVM in Weka to identify species using DNA sequence data.
The issue is that I have to convert the DNA sequences into numerical vectors.
My sequences look like this:
------------------------------------------------G
------------------------------------------GGAGATG
------------------------------------------GGAGATG
------------------------------------------GGAGATG
TTATTAATTCGAGCAGAATTAGGAAATCCTGGATCTTTAATTGGTGATG
----------------------------------------------ATG
CTATTAATTCGAGCTGAGCTAAGCCAGCCCGGGGCTCTGCTCGGAGATG
-----------------------TCAACCTGGGGCCCTACTCGGAGACG
----TAATCCGAGCAGAATTAAGCCAACCTGGCGCCCTACTAGGGGATG
CTATTAATTCGAGCTGAGCTAAGCCAGCCTGGGGCTCTGCTCGGAGATG
TTATTAATTCGTTTTGAGTTAGGCACTGTTGGAGTTTTATTAG---ATA
How can I do this? Any suggestion of other programs for doing ML with DNA sequences besides Weka?
This answer makes use of R.
You can use R's Biostrings package for this.
Install package first:
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("Biostrings"))
Convert a character string to a DNAString:
dna1 <- DNAString("------------------------------------------------G------------------------------------------GGAGATG")
Alternatively,
dna2 <- DNAStringSet(c("ACGT", "GTCA", "GCTA"))
alphabetFrequency(dna1)       # per-letter counts (A, C, G, T, ambiguity codes, gaps)
letterFrequency(dna1, "GC")   # combined G+C count
....
Then (if you must) you can call Weka functions from R, e.g. Naive Bayes with NB <- make_Weka_classifier("weka/classifiers/bayes/NaiveBayes"); NB(colx ~ ., data = mydata), or convert your data as you wish and/or export to other file types that Weka understands. The foreign::write.arff() function comes to mind. But I wouldn't use Weka for this.
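If you would rather stay outside R and Weka, the same kind of fixed-length numerical vector can be produced with plain k-mer counting; here is a minimal Python sketch, where choosing k=2 and dropping the alignment gaps ('-') are my own assumptions:

from itertools import product

ALPHABET = "ACGT"

def kmer_vector(seq, k=2):
    """Return counts of all 4**k k-mers as a fixed-length feature vector."""
    seq = seq.replace("-", "")                       # drop alignment gaps
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:                         # skip windows with ambiguity codes
            counts[window] += 1
    return [counts[km] for km in kmers]

print(kmer_vector("CTATTAATTCGAGCTGAGCTAAGCCAGCCCGGGGCTCTGCTCGGAGATG"))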
Needless to say, you can simply also enter these sequences into a website performing a BLAST search and get likely species candidates.
For CTATTAATTCGAGCTGAGCTAAGCCAGCCCGGGGCTCTGCTCGGAGATG I get mitochondrial DNA from "banded rock lizard" (Petrosaurus mearnsi) with 91% probability.

GoLang PoS Tagger script taking longer than it should with no output in terminal

This script compiles without errors on play.golang.org: http://play.golang.org/p/Hlr-IAc_1f
But when I run it on my machine, it takes much longer than I expect, with nothing happening in the terminal.
What I am trying to build is a part-of-speech tagger.
I think the longest part is loading lexicon.txt into a map and then comparing each word with every word there to see if it has already been tagged in the lexicon. The lexicon only contains verbs, but doesn't every word need to be checked to see if it is a verb?
The larger problem is that I don't know how to determine if a word is a verb with an easy heuristic, as I can for adverbs, adjectives, etc.
(Quoting):
I don't know how to determine if a word is a verb with an easy heuristic like adverbs, adjectives, etc.
I can't speak to any issues in your Go implementation, but I'll address the larger problem of POS tagging in general. It sounds like you're attempting to build a rule-based unigram tagger. To elaborate a bit on those terms:
"unigram" means you're considering each word in the sentence separately. Note that a unigram tagger is inherently limited, in that it cannot disambiguate words which can take on multiple POS tags. E.g., should you tag 'fish' as a noun or a verb? Is 'last' a verb or an adverb?
"rule-based" means exactly what it sounds like: a set of rules to determine the tag for each word. Rule-based tagging is limited in a different way - it requires considerable development effort to assemble a ruleset that will handle a reasonable portion of the ambiguity in common language. This effort might be appropriate if you're working in a language for which we don't have good training resources, but in most common languages, we now have enough tagged text to train high-accuracy tagging models.
State-of-the-art for POS tagging is above 97% accuracy on well-formed newswire text (accuracy on less formal genres is naturally lower). A rule-based tagger will probably perform considerably worse (you'll have to determine the accuracy level needed to meet your requirements). If you want to continue down the rule-based path, I'd recommend reading this tutorial. The code is based on Haskell, but it will help you learn the concepts and issues in rule-based tagging.
That said, I'd strongly recommend you look at other tagging methods. I mentioned the weaknesses of unigram tagging. Related approaches would be 'bigram', meaning that we consider the previous word when tagging word n, 'trigram' (usually the previous 2 words, or the previous word, the current word, and the following word); more generally, 'n-gram' refers to considering a sequence of n words (often, a sliding window around the word we're currently tagging). That context can help us disambiguate 'fish', 'last', 'flies', etc.
E.g., in
We fish
we probably want to tag fish as a verb, whereas in
ate fish
it's certainly a noun.
The NLTK tutorial might be a good reference here. A solid n-gram tagger should get you above 90% accuracy; likely above 95% (again on newswire text).
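To make that concrete, a small NLTK sketch of a bigram tagger backing off to a unigram tagger; it assumes NLTK is installed and the bundled Treebank sample has been fetched with nltk.download('treebank'), and the 3000-sentence split is arbitrary:

import nltk
from nltk.corpus import treebank

train_sents = treebank.tagged_sents()[:3000]

default = nltk.DefaultTagger("NN")                         # last resort: call it a noun
unigram = nltk.UnigramTagger(train_sents, backoff=default)
bigram = nltk.BigramTagger(train_sents, backoff=unigram)   # uses the previous tag as context

print(bigram.tag("We fish".split()))
print(bigram.tag("We ate fish".split()))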
More sophisticated methods (known as 'structured inference') consider the entire tag sequence as a whole. That is, instead of trying to predict the most probable tag for each word separately, they attempt to predict the most probable sequence of tags for the entire input sequence. Structured inference is of course more difficult to implement and train, but will usually improve accuracy vs. n-gram approaches. If you want to read up on this area, I suggest Sutton and McCallum's excellent introduction.
You've got a large array argument in this function:
func stringInArray(a string, list [214]string) bool {
    for _, b := range list {
        if b == a {
            return true
        }
    }
    return false
}
The array of stopwords gets copied each time you call this function.
In Go, you should use slices rather than arrays most of the time. Change the parameter to list []string and define stopWords as a slice rather than an array:
stopWords := []string{
    "and", "or", ...
}
Probably an even better approach would be to build a map of the stopWords:
isStopWord := map[string]bool{}
for _, sw := range stopWords {
    isStopWord[sw] = true
}
and then you can check if a word is a stopword quickly:
if isStopWord[word] { ... }

How to fuzzily search for a dictionary word?

I have read a lot of threads here discussing edit-distance based fuzzy-searches, which tools like Elasticsearch/Lucene provide out of the box, but my problem is a bit different. Suppose I have a dictionary of words, {'cat', 'cot', 'catalyst'}, and a character similarity relation f(x, y)
f(x, y) = 1, if characters x and y are similar
= 0, otherwise
(These "similarities" can be specified by the programmer)
such that, say,
f('t', 'l') = 1
f('a', 'o') = 1
f('f', 't') = 1
but,
f('a', 'z') = 0
etc.
Now if we have a query 'cofatyst', the algorithm should report the following matches:
('cot', 0)
('cat', 0)
('catalyst', 0)
where the number is the 0-based starting index of the match found. I have tried the Aho-Corasick algorithm, and while it works great for exact matching and in the case where a character has relatively few "similar" characters, its performance drops exponentially as we increase the number of similar characters per character. Can anyone point me to a better way of doing this? Fuzziness is an absolute necessity, and it must take into account character similarities (i.e., not blindly depend on just edit distances).
One thing to note is that in the wild, the dictionary is going to be really large.
I might try to use cosine similarity, using the position of each character as a feature and mapping the product between features with a match function based on your character relations.
Not very specific advice, I know, but I hope it helps you.
edited: Expanded answer.
With cosine similarity, you compute how similar two vectors are. In your case the normalisation might not make sense, so what I would do is something very simple (I might be oversimplifying the problem). First, view the C×C matrix as a dependency matrix holding the probability that two characters are related (e.g., P('t' | 'l') = 1). This also allows you to have partial dependencies, to differentiate between perfect and partial matches. After that, compute for each position the probability that the letters of the two words are not the same (using the complement of P(t_i, t_j)), and then aggregate the results with a sum.
This counts the number of positions that differ for a specific pair of words, and it allows you to define partial dependencies. Furthermore, the implementation is very simple and should scale well. This is why I am not sure whether I misunderstood your question.
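To make that concrete, a small Python sketch of the position-wise scoring; the `similar` pairs are the example relations from the question, the mismatch threshold is a placeholder, and this ignores the scaling question for very large dictionaries:

# Programmer-defined character relation f(x, y); unlisted pairs are dissimilar.
similar = {("t", "l"), ("a", "o"), ("f", "t")}

def char_sim(x, y):
    return 1.0 if x == y or (x, y) in similar or (y, x) in similar else 0.0

def mismatch_score(word, window):
    """Number of positions where the characters are not (even partially) similar."""
    return sum(1.0 - char_sim(a, b) for a, b in zip(word, window))

def fuzzy_find(word, query, max_mismatch=0.0):
    """Yield 0-based start indices where `word` matches a window of `query`."""
    for i in range(len(query) - len(word) + 1):
        if mismatch_score(word, query[i:i + len(word)]) <= max_mismatch:
            yield i

for w in ["cat", "cot", "catalyst"]:
    print(w, list(fuzzy_find(w, "cofatyst")))   # each matches at index 0, as in the question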
I am using the Fuse JavaScript library for a project of mine. It is a JavaScript file which works on a JSON dataset, and it is quite fast. Have a look at it.
It implements a full Bitap algorithm, leveraging a modified version of the Diff, Match & Patch tool by Google (from their site).
The code is simple, and it is easy to understand how the algorithm is implemented.

Data structure for multi-language dictionary?

One-line summary: suggest optimal (lookup-speed/compactness) data structure(s) for a multi-lingual dictionary representing primarily Indo-European languages (list at bottom).
Say you want to build some data structure(s) to implement a multi-language dictionary for let's say the top-N (N~40) European languages on the internet, ranking choice of language by number of webpages (rough list of languages given at bottom of this question).
The aim is to store the working vocabulary of each language (i.e. 25,000 words for English, etc.). Proper nouns are excluded. I'm not sure whether we store plurals, verb conjugations, prefixes etc., or add language-specific rules on how these are formed from noun singulars or verb stems.
Also, it's your choice how we encode and handle accents, diphthongs and language-specific special characters; e.g. maybe where possible we transliterate things (e.g. Romanize German ß as 'ss', then add a rule to convert it). Obviously if you choose to use 40-100 characters and a trie, there are way too many branches and most of them are empty.
Task definition: Whatever data structure(s) you use, you must do both of the following:
The main operation in lookup is to quickly get an indication 'Yes this is a valid word in languages A,B and F but not C,D or E'. So, if N=40 languages, your structure quickly returns 40 Booleans.
The secondary operation is to return some pointer/object for that word (and all its variants) for each language (or null if it was invalid). This pointer/object could be user-defined e.g. the Part-of-Speech and dictionary definition/thesaurus similes/list of translations into the other languages/... It could be language-specific or language-independent e.g. a shared definition of pizza)
And the main metric for efficiency is a tradeoff of a) compactness (across all N languages) and b) lookup speed. Insertion time is not important. The compactness constraint excludes memory-wasteful approaches like "keep a separate hash for each word" or "keep a separate hash for each language, and each word within that language".
So:
What are the possible data structures, and how do they rank on the lookup-speed/compactness curve?
Do you have a unified structure for all N languages, or do you partition e.g. the Germanic languages into one sub-structure, Slavic into another etc., or just N separate structures (which would allow you to Huffman-encode)?
What representation do you use for characters, accents and language-specific special characters?
Ideally, give a link to an algorithm or code, esp. Python or else C.
(I checked SO and there have been related questions but not this exact question. Certainly not looking for a SQL database. One 2000 paper which might be useful: "Estimation of English and non-English Language Use on the WWW" - Grefenstette & Nioche. And one list of multi-language dictionaries)
Resources: two online multi-language dictionaries are Interglot (en/ge/nl/fr/sp/se) and LookWayUp (en<->fr/ge/sp/nl/pt).
Languages to include:
Probably mainly Indo-European languages for simplicity: English, French, Spanish, German, Italian, Swedish + Albanian, Czech, Danish, Dutch, Estonian, Finnish, Hungarian, Icelandic, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbo Croat, Slovak, Slovenian + Breton, Catalan, Corsican, Esperanto, Gaelic, Welsh
Probably include Russian, Slavic, Turkish, exclude Arabic, Hebrew, Iranian, Indian etc. Maybe include Malay family too. Tell me what's achievable.
I will not win points here, but some remarks.
A multi-language dictionary is a large and time-consuming undertaking. You did not talk in detail about the exact uses for which your dictionary is intended: statistical probably, not translating, not grammatical, .... Different usages require different data to be collected, for instance classifying "went" as past tense.
First formulate your requirements in a document, and with a programmed interface prototype. Asking for data structures before the algorithmic conception is something I often see with complex business logic. One would then start out wrong, risking feature creep. Or premature optimisation, like that romanisation, which might have no advantage and would bar going back the other way.
Maybe you can start with some active projects like Reta Vortaro; its XML might not be efficient, but it will give you some ideas for organisation. There are several academic linguistic projects. The most relevant aspect might be stemming: recognising greet/greets/greeted/greeter/greeting/greetings (#smci) as belonging to the same (major) entry. You want to use stemmers that are already programmed; they are often well-tested and already applied in electronic dictionaries. My advice would be to research those projects without losing too much energy or impetus to them; just enough to collect ideas and see where they might be used.
The data structures one can think up are IMHO of secondary importance. I would first collect everything in a well-defined database, and then generate the data structures used by the software. You can then compare and measure alternatives. And it might be the most interesting part for a developer, creating a beautiful data structure & algorithm.
An answer
Requirement:
Map of word to list of [language, definition reference].
List of definitions.
Several words can have the same definition, hence the need for a definition reference.
The definition could consist of a language-bound definition (grammatical properties, declinations), and/or a language-independent definition (description of the notion).
One word can have several definitions (book = (noun) reading material, = (verb) reserve use of location).
Remarks
As single words are handled, this does not consider that a given text is in general mono-lingual. But as a text can be of mixed languages, and I see no special overhead in the O-complexity, that seems irrelevant.
So an over-general abstract data structure would be:
Map<String /*Word*/, List<DefinitionEntry>> wordDefinitions;
Map<String /*Language/Locale/""*/, List<Definition>> definitions;

class Definition {
    String content;
}

class DefinitionEntry {
    String language;
    Ref<Definition> definition;
}
The concrete data structure:
The wordDefinitions are best served with an optimised hash map.
Please let me add:
I did come up with a concrete data structure at last. I started with the following.
Guava's MultiMap is what we have here, but Trove's collections with primitive types are what one needs, if using a compact binary representation in core memory.
One would do something like:
import gnu.trove.map.*;
import gnu.trove.map.hash.*;
import java.io.*;

/**
 * Map of word to DefinitionEntry.
 * Key: word.
 * Value: offset in byte array wordDefinitionEntries;
 * 0 serves as null, which implies a dummy byte (data version?)
 * in the byte array at [0].
 */
TObjectIntMap<String> wordDefinitions = new TObjectIntHashMap<String>();
byte[] wordDefinitionEntries = new byte[...]; // Actually read from file.

void walkEntries(String word) throws IOException {
    int value = wordDefinitions.get(word);
    if (value == 0)
        return;
    DataInputStream in = new DataInputStream(
            new ByteArrayInputStream(wordDefinitionEntries));
    in.skipBytes(value);
    int entriesCount = in.readShort();
    for (int entryno = 0; entryno < entriesCount; ++entryno) {
        int language = in.readByte();
        walkDefinition(in, language); // Index to readUTF8 or gzipped bytes.
    }
}
I'm not sure whether or not this will work for your particular problem, but here's one idea to think about.
A data structure that's often used for fast, compact representations of a language is a minimum-state DFA for the language. You could construct this by creating a trie for the language (which is itself an automaton for recognizing strings in the language), then using one of the canonical algorithms for constructing a minimum-state DFA for the language. This may require an enormous amount of preprocessing time, but once you've constructed the automaton you'll have extremely fast lookup of words. You would just start at the start state and follow the labeled transitions for each of the letters. Each state could encode (perhaps) a 40-bit value encoding for each language whether or not the state corresponds to a word in that language.
Because different languages use different alphabets, it might be a good idea to separate out each language by alphabet so that you can minimize the size of the transition table for the automaton. After all, if you have words using the Latin and Cyrillic alphabets, the state transitions for states representing Greek words would probably all be to the dead state on Latin letters, while the transitions for Greek characters for Latin words would also be to the dead state. Having multiple automata for each of these alphabets thus could eliminate a huge amount of wasted space.
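As an illustration of the "per-state language bit vector" idea (leaving out the DFA-minimisation step), here is a small Python sketch of a trie whose terminal nodes carry a language bitmask; the three-language list is a placeholder for the ~40 languages in the question:

LANGS = ["en", "fr", "de"]          # placeholder; N ~ 40 in the question

class Node:
    __slots__ = ("children", "mask")
    def __init__(self):
        self.children = {}
        self.mask = 0               # bit i set => word exists in LANGS[i]

root = Node()

def insert(word, lang):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, Node())
    node.mask |= 1 << LANGS.index(lang)

def lookup(word):
    """Return one Boolean per language, as required by the main operation."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return [False] * len(LANGS)
    return [bool(node.mask >> i & 1) for i in range(len(LANGS))]

insert("pizza", "en"); insert("pizza", "fr"); insert("pizza", "de")
insert("chien", "fr")
print(lookup("pizza"))   # [True, True, True]
print(lookup("chien"))   # [False, True, False]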
A common solution to this in the field of NLP is finite automata. See http://www.stanford.edu/~laurik/fsmbook/home.html.
Easy.
Construct a minimal perfect hash function for your data (the union of all dictionaries; construct the hash offline), and enjoy O(1) lookup for the rest of eternity.
This takes advantage of the fact that your keys are known statically. It doesn't care about your accents and so on (normalize them prior to hashing if you want).
I had a similar (but not exactly the same) task: implement a four-way mapping between sets, e.g. A, B, C, D.
Each item x has its projections in all of the sets: x.A, x.B, x.C, x.D.
The task was: for each item encountered, determine which set it belongs to and find its projections in other sets.
Using languages analogy: for any word, identify its language and find all translations to other languages.
However: in my case a word can be uniquely identified as belonging to one language only, so there are no false friends such as burro, which in Spanish means donkey while in Italian it means butter (see also https://www.daytranslations.com/blog/different-meanings/).
I implemented the following solution:
Four maps/dictionaries matching the entry to its unique id (integer)
AtoI[x.A] = BtoI[x.B] = CtoI[x.C] = DtoI[x.D] = i
Four maps/dictionaries matching the unique id to the corresponding language
ItoA[i] = x.A;
ItoB[i] = x.B;
ItoC[i] = x.C;
ItoD[i] = x.D;
For each encountered x, I need to do at most 4 searches to get its id (each search is O(log(N))), then 3 access operations, each O(log(N)). All in all, O(log(N)).
I have not implemented this, but I don't see why hash sets cannot be used for either set of dictionaries to make it O(1).
Going back to your problem:
Given N concepts in M languages (so N*M words in total)
My approach adapts in the following way:
M lookup hash maps that give you an integer id for every language (or None/null if the word does not exist in the language).
The overlapping case is covered by the fact that lookups in different languages will yield different ids.
For each word, you do M O(1) lookups in the hash maps corresponding to the languages, yielding K <= M ids, identifying the word as belonging to K languages;
for each id you need to do (M-1) O(1) lookups in the actual dictionaries, mapping K ids to M-1 translations each.
In total, O(M + K*M), which I think is not bad, given your M = 40 and that K will be much smaller than M in most cases (K = 1 for quite a lot of words).
As for storage: N*M words + N*M integers for the id-to-word dictionaries, and the same amount for the reverse lookups (word-to-id).
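A toy Python sketch of that layout (the words and concept ids below are placeholders):

# word -> concept id, per language; ids identify a concept across languages.
word_to_id = {
    "en": {"dog": 1, "butter": 2},
    "es": {"perro": 1, "mantequilla": 2},
    "it": {"cane": 1, "burro": 2},
}
# Reverse maps: concept id -> word, per language.
id_to_word = {lang: {i: w for w, i in d.items()} for lang, d in word_to_id.items()}

def translations(word):
    """Return {language: {other_language: translation}} for every language the word occurs in."""
    result = {}
    for lang, d in word_to_id.items():          # M O(1) lookups
        if word in d:
            concept = d[word]
            result[lang] = {other: id_to_word[other][concept]
                            for other in id_to_word if other != lang}
    return result

print(translations("burro"))   # found only in Italian in this toy data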

Algorithm to estimate number of English translation words from Japanese source

I'm trying to come up with a way to estimate the number of English words a translation from Japanese will turn into. Japanese has three main scripts -- Kanji, Hiragana, and Katakana -- and each has a different average character-to-word ratio (Kanji being the lowest, Katakana the highest).
Examples:
computer: コンピュータ (Katakana, 6 characters); 計算機 (Kanji, 3 characters)
whale: くじら (Hiragana, 3 characters); 鯨 (Kanji, 1 character)
As data, I have a large glossary of Japanese words and their English translations, and a fairly large corpus of matched Japanese source documents and their English translations. I want to come up with a formula that will count numbers of Kanji, Hiragana, and Katakana characters in a source text, and estimate the number of English words this is likely to turn into.
Here's what Borland (now Embarcadero) thinks about English to non-English:
Length of English string (in characters)    Expected increase
1-5                                         100%
6-12                                         80%
13-20                                        60%
21-30                                        40%
31-50                                        20%
over 50                                      10%
I think you can sort of apply this (with some modification) for Japanese to non-Japanese.
Another element you might want to consider is the tone of the language. In English, instructions are phrased as imperatives, as in "Press OK." But in Japanese, imperatives are considered rude, and you must phrase instructions in honorific language (keigo), as in "OKボタンを押してください。"
Watch out for three- and four-character kanji compounds. Many big words translate into compounds such as 国際化 (internationalization: 20 characters in English) or 高可用性 (high availability: 17 characters in English).
I would start with a linear approximation: approx_english_words = a1 * no_chars_in_script1 + a2 * no_chars_in_script2 + a3 * no_chars_in_script3, with the coefficients a1, a2, a3 fitted from your data using linear least squares, as in the sketch below.
If this doesn't approximate very well, then look at the worst cases and the reasons they don't fit (specialized words, etc.).
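A minimal sketch of that fit with NumPy; the count matrix and English word counts below are placeholders standing in for your aligned corpus:

import numpy as np

# One row per document: [kanji count, hiragana count, katakana count] (placeholder numbers).
X = np.array([[120, 340,  60],
              [ 80, 210,  90],
              [200, 500, 120]], dtype=float)
# English word count of each document's translation (placeholder numbers).
y = np.array([310, 220, 530], dtype=float)

coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
a1, a2, a3 = coeffs

def estimate_english_words(kanji, hiragana, katakana):
    return a1 * kanji + a2 * hiragana + a3 * katakana

print(estimate_english_words(150, 400, 70))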
In my experience as a translator and localization specialist, a good rule of thumb is 2 Japanese characters per English word.
As an experienced translator between Japanese and English, I can say that this is extremely difficult to quantify, but typically in my experience English text translated from Japanese is nearly 200% as many characters as the source text. In Japanese there are many culturally specific phrases and nouns that can't be translated literally and need to be explained in English.
When translating, it is not unusual for me to take a single Japanese sentence and make a single English paragraph out of it in order for the meaning to be communicated to the reader. Off the top of my head, here is an example:
「懐かしい」
This literally means nostalgic. However, in Japanese it can be used as a single phrase in an exclamation. Yet, in English in order to convey a feeling of nostalgia we require a lot more context. For instance, you may need to turn that single phrase into a sentence:
"As I walked by my old elementary school, I was flooded with memories of the past."
This is why machine translation between Japanese and English is impossible.
Well, it's a little more complex than just the number of characters in a noun compared to English. For instance, Japanese also has a different grammatical structure compared to English, so certain sentences would use MORE words in Japanese, and others would use FEWER words. I don't really know Japanese, so please forgive me for using Korean as an example.
In Korean, a sentence is often shorter than an English sentence, due mainly to the fact that sentences are cut short by using context to fill in the missing words. For instance, saying "I love you" could be as short as 사랑해 ("sarang hae", simply the verb "love"), or as long as the fully qualified sentence 저는 당신을 사랑해요 (I [topic] you [object] love [verb + polite modifier]). In a text, how it is written depends on context, which is usually set by earlier sentences in the paragraph.
Anyway, having an algorithm to actually KNOW this kind of thing would be very difficult, so you're probably much better off just using statistics. What you should do is use random samples where the Japanese and English texts are known to have the same meaning. The larger the sample (and the more random it is) the better... though if they are truly random, it won't make much difference once you have more than a few hundred.
Now, another thing: this ratio would change completely depending on the type of text being translated. For instance, a highly technical document is quite likely to have a much higher Japanese/English length ratio than a soppy novel.
As for simply using your dictionary of word-to-word translations - that probably won't work too well (and is probably wrong). The same word does not translate to the same word every time in a different language (although this is much more likely to happen in technical discussions). For instance, the word beautiful. Not only is there more than one word I could assign it to in Korean (i.e. there is a choice), but sometimes I lose that choice, as in the sentence (that food is beautiful), where I don't mean the food looks good: I mean it tastes good, and my choice of translations for that word changes. And this is a VERY common circumstance.
Another big problem is optimal translation. Something that humans are really bad at, and something that computers are much, much worse at. Whenever I've proofread a document translated from another language to English, I can always see various ways to cut it much, much shorter.
So although, with statistics, you would be able to work out a pretty good average ratio in length between translations, this will be far different than it would be if all translations were optimal.
It seems simple enough - you just need to find out the ratios.
For each script, count the number of script characters and English words in your glossary and work out the ratio.
This can be augmented with the Japanese source documents, assuming you can both detect which script a Japanese word is in and what the English equivalent phrase is in the translation. Otherwise you'll have to guesstimate the ratios or ignore this as source data.
Then, as you say, count the number of words in each script of your source text, do the multiplies, and you should have a rough estimate.
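For the "count the number of characters in each script" part, counting by Unicode block is a reasonable first cut; here is a small Python sketch (it ignores punctuation, rōmaji and rarer blocks):

def script_counts(text):
    counts = {"kanji": 0, "hiragana": 0, "katakana": 0}
    for ch in text:
        cp = ord(ch)
        if 0x4E00 <= cp <= 0x9FFF:        # CJK Unified Ideographs
            counts["kanji"] += 1
        elif 0x3040 <= cp <= 0x309F:      # Hiragana block
            counts["hiragana"] += 1
        elif 0x30A0 <= cp <= 0x30FF:      # Katakana block
            counts["katakana"] += 1
    return counts

print(script_counts("くじらはコンピュータで鯨を調べた"))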
My (albeit tiny) experience seems to indicate that, no matter what the language, blocks of text take the same amount of printed space to convey equivalent information. So, for a large-ish block of text, you could assign a width count to each character in English (grab this from a common font like Times New Roman), and likewise use a common Japanese font at the same point size to calculate the number of characters that would be required.
