openNLP NGramModel does not keep the original order of the words?

Here is my simple code using openNLP:
import opennlp.tools.ngram.NGramModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.StringList;

public class NGramOrder {
    public static void main(String[] args) {
        String text = "This is the original sequence in the text";
        System.out.println(text);
        StringList tokens = new StringList(WhitespaceTokenizer.INSTANCE.tokenize(text));
        System.out.println("Tokens: " + tokens);
        NGramModel nGramModel = new NGramModel();
        nGramModel.add(tokens, 2, 2); // min and max n-gram length: bigrams only
        System.out.println("Total ngrams: " + nGramModel.numberOfGrams());
        for (StringList ngram : nGramModel) {
            System.out.println(nGramModel.getCount(ngram) + " - " + ngram);
        }
    }
}
and it gives the following output:
This is the original sequence in the text
Tokens: [This,is,the,original,sequence,in,the,text]
Total ngrams: 7
1 - [the,text]
1 - [the,original]
1 - [is,the]
1 - [sequence,in]
1 - [This,is]
1 - [original,sequence]
1 - [in,the]
So it does not keep the original order of the words in the sentence? How can I get [This,is] as the very first n-gram, then [is,the] as the second, and so on? And if the original ordering of the n-grams is lost, would that hurt?
Thanks for the help!

I think it's important to clarify what your use case is and why you think you need the order preserved.
N-grams are often used in bag-of-words models (which disregard order anyway) and/or in language models, where probability estimates (e.g. based on n-gram counts) are calculated at the n-gram level and aggregated using the chain rule.
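If the order really is needed, one option is to build the n-grams yourself from the token list and keep the counts separately. Below is a minimal sketch in plain Python, not the OpenNLP API; the helper name ordered_ngrams is made up for illustration.

```python
from collections import Counter

def ordered_ngrams(tokens, n):
    """Return the n-grams of `tokens` in the order they occur in the text."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This is the original sequence in the text".split()
bigrams = ordered_ngrams(tokens, 2)   # a list, so text order is preserved
counts = Counter(bigrams)             # counts do not depend on order

for gram in bigrams:
    print(counts[gram], "-", list(gram))
```

The list keeps [This, is] first and [is, the] second, while the counter plays the role of the count lookup, so nothing is lost by storing the two separately.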

Related

Scripting in elasticsearch

I have a couple of questions about scripting in Elasticsearch; I hope someone can help me. I need to add several values from the document to _score and sort by the total. First, I will describe the data I have and which values need to be added:
rating - a number from 1 to 9,
duration_bucket - a number from 0 to 2,
rating_adj [
{
text - text; if the passed parameter matches this value, the result is adjusted by the adj value below.
adj - the number by which the result is adjusted.
}]
And the score itself, which usually ranges from 1 to 4.
Initially, I just had a sort in this order:
score
rating
duration_bucket
But this gave a slightly different result.
Therefore, a small script was written that would add all these values.
def found = null;
if (params.text != null) {
    found = params._source['rating_adj'].find(item -> item.text == params.text);
}
def res = _score + params._source['duration_bucket'] + params._source['rating'];
if (found != null) {
    return res + found.adj;
}
return res;
And the first question. I've tried two options:
Via function_score, and then sorting by the resulting score.
Directly via a script sort.
I did not notice a difference in performance. Are there any significant differences between these approaches?
And the second question. When using this script, the CPU is fully loaded, in contrast to the usual sorting. Are there any ways to optimize the script, or is it all about hardware?

Google search suggestion implementation

In a recent Amazon interview I was asked to implement Google's "suggestion" feature. When a user enters "Aeniffer Aninston", Google suggests "Did you mean Jeniffer Aninston". I tried to solve it using hashing but could not cover the corner cases. Please let me know your thoughts on this.
There are 4 most common types of errors -
Omitted letter: "stck" instead of "stack"
One letter typo: "styck" instead of "stack"
Extra letter: "starck" instead of "stack"
Adjacent letters swapped: "satck" instead of "stack"
BTW, any two letters can be swapped, not only adjacent ones, but that is not a common typo.
Initial state - the typed word. Run BFS/DFS from the initial vertex. The depth of search is your own choice. Remember that increasing the depth of search dramatically increases the number of "probable corrections". I think depth ~ 4-5 is a good start.
After generating the "probable corrections", look up each generated candidate word in a dictionary - binary search in a sorted dictionary, or search in a trie populated with your dictionary.
A trie is faster, but binary search allows searching in a random access file without loading the dictionary into RAM. You only have to load a precomputed integer array: array[i] gives you the number of bytes to skip to access the i-th word. Words in the random access file should be written in sorted order. If you have enough RAM to store the dictionary, use a trie.
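The offset-array lookup can be sketched as follows. This is only an illustration in Python: an in-memory io.BytesIO stands in for the random access file, and the helper names build_index, read_word and contains are invented here.

```python
import io

def build_index(words):
    """Write the sorted words back to back; return (file-like object, offsets)."""
    data = io.BytesIO()              # stand-in for a file opened in binary mode
    offsets = []
    for word in sorted(words):
        offsets.append(data.tell())  # bytes to skip to reach this word
        data.write(word.encode("utf-8"))
    offsets.append(data.tell())      # end sentinel, so every word has a length
    return data, offsets

def read_word(f, offsets, i):
    """Read the i-th word straight from the file using the offset array."""
    f.seek(offsets[i])
    return f.read(offsets[i + 1] - offsets[i]).decode("utf-8")

def contains(f, offsets, target):
    """Binary search over the file; only the offsets live in memory."""
    lo, hi = 0, len(offsets) - 2
    while lo <= hi:
        mid = (lo + hi) // 2
        word = read_word(f, offsets, mid)
        if word == target:
            return True
        if word < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False

f, offsets = build_index(["stack", "stick", "apple"])
print(contains(f, offsets, "stack"), contains(f, offsets, "stck"))
```

Each lookup costs about log2(number of words) seeks and reads, which is the trade-off against a trie that needs the whole dictionary in RAM.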
Before suggesting corrections check typed word - if it is in a dictionary, provide nothing.
UPDATE
Generating corrections should be done by BFS - when I tried DFS, entries like "Jeniffer" showed "edit distance = 3". DFS doesn't work, since it makes a lot of changes where one step would do - for example, Jniffer->nJiffer->enJiffer->eJniffer->Jeniffer instead of Jniffer->Jeniffer.
Sample code for generating corrections by BFS
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;

public class Corrections
{
    static class Pair
    {
        private String word;
        private byte dist;
        // dist fits in a byte because dist<=127.
        // Moreover, dist<=6 in a real application
        public Pair(String word, byte dist)
        {
            this.word = word;
            this.dist = dist;
        }
        public String getWord()
        {
            return word;
        }
        public int getDist()
        {
            return dist;
        }
    }

    public static void main(String[] args) throws Exception
    {
        HashSet<String> usedWords;
        HashSet<String> dict;
        ArrayList<String> corrections;
        ArrayDeque<Pair> states;
        usedWords = new HashSet<String>();
        corrections = new ArrayList<String>();
        dict = new HashSet<String>();
        states = new ArrayDeque<Pair>();
        // populate dictionary. In real usage should be populated from prepared file.
        dict.add("Jeniffer");
        dict.add("Jeniffert"); // depth 2 test
        usedWords.add("Jniffer");
        states.add(new Pair("Jniffer", (byte)0));
        while(!states.isEmpty())
        {
            Pair head = states.pollFirst();
            //System.out.println(head.getWord()+" "+head.getDist());
            if(head.getDist()<=2)
            {
                // checking reached depth: words at distance <= 2 are expanded,
                // so 4 is the first depth where we don't generate anything
                // swap adjacent letters
                for(int i=0;i<head.getWord().length()-1;i++)
                {
                    // swap i-th and i+1-th letters
                    String newWord = head.getWord().substring(0,i)+head.getWord().charAt(i+1)+head.getWord().charAt(i)+head.getWord().substring(i+2);
                    // even if i==head.getWord().length()-2 and then i+2==head.getWord().length(),
                    // substring(i+2) doesn't throw an exception and returns an empty string;
                    // the same for substring(0,i) when i==0
                    if(!usedWords.contains(newWord))
                    {
                        usedWords.add(newWord);
                        if(dict.contains(newWord))
                        {
                            corrections.add(newWord);
                        }
                        states.addLast(new Pair(newWord, (byte)(head.getDist()+1)));
                    }
                }
                // insert letters
                for(int i=0;i<=head.getWord().length();i++)
                    for(char ch='a';ch<='z';ch++)
                    {
                        String newWord = head.getWord().substring(0,i)+ch+head.getWord().substring(i);
                        if(!usedWords.contains(newWord))
                        {
                            usedWords.add(newWord);
                            if(dict.contains(newWord))
                            {
                                corrections.add(newWord);
                            }
                            states.addLast(new Pair(newWord, (byte)(head.getDist()+1)));
                        }
                    }
            }
        }
        for(String correction:corrections)
        {
            System.out.println("Did you mean "+correction+"?");
        }
        usedWords.clear();
        corrections.clear();
        // helper data structures must be cleared after each generateCorrections call - must be empty for the future usage.
    }
}
Words in the dictionary - Jeniffer, Jeniffert (Jeniffert is just for testing).
Output:
Did you mean Jeniffer?
Did you mean Jeniffert?
Important!
I chose a generation depth of 2 (the code expands words up to distance 2 from the typed word). In a real application the depth should be 4-6, but as the number of combinations grows exponentially, I don't go so deep. There are some optimizations that reduce the number of branches in the search tree, but I haven't thought much about them. I wrote only the main idea.
Also, I used HashSet for storing the dictionary and for labeling used words. It seems HashSet's constant factor is too large when it contains a million objects. Maybe you should use a trie both for checking whether a word is in the dictionary and for checking whether a word is already labeled.
I didn't implement the erase-letter and change-letter operations because I wanted to show only the main idea.
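For completeness, here is a rough Python sketch of the same BFS with all four operations (insert, erase, change, swap adjacent). It is an illustration only, not a port of the Java sample: the names neighbours and corrections and the lowercase-only alphabet are assumptions made here.

```python
from collections import deque
import string

def neighbours(word, alphabet=string.ascii_lowercase):
    """Yield every string that is exactly one edit away from `word`."""
    for i in range(len(word) + 1):
        head, tail = word[:i], word[i:]
        for ch in alphabet:
            yield head + ch + tail                      # insert a letter
        if tail:
            yield head + tail[1:]                       # erase a letter
            for ch in alphabet:
                if ch != tail[0]:
                    yield head + ch + tail[1:]          # change a letter
        if len(tail) > 1:
            yield head + tail[1] + tail[0] + tail[2:]   # swap adjacent letters

def corrections(typed, dictionary, max_dist=2):
    """Breadth-first search over edits; returns {dictionary word: distance}."""
    if typed in dictionary:
        return {}                    # the word is fine, suggest nothing
    seen = {typed}
    found = {}
    queue = deque([(typed, 0)])
    while queue:
        word, dist = queue.popleft()
        if dist == max_dist:
            continue                 # do not expand beyond the chosen depth
        for cand in neighbours(word):
            if cand in seen:
                continue
            seen.add(cand)
            if cand in dictionary:
                found[cand] = dist + 1
            queue.append((cand, dist + 1))
    return found

print(corrections("Jniffer", {"Jeniffer", "Jeniffert"}))
```

Because BFS reaches every string first along a shortest edit path, the stored distance is the minimum number of these four operations, which is the property the DFS attempt lost.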

AS3 dynamic text algorithm

Afternoon,
I have an odd algorithm. I would like to populate a string of code dynamically based on some user entry.
I have a multi-dimensional array with data in it and a multi-line input text field.
What I want is for a user to be able to enter some text
example:
00
01 - 02 - 03
comments: 12
My code would identify the numbers and treat everything else as text.
Thus, if my array is data[x][#], the # will correspond to their entry.
I would get
algorithm_string = data[x][0] + "\n" + data[x][1] + " - " + data[x][2] + " - " + data[x][3] + "\n" + "comments: " + data[x][12]
So the algorithm would construct the above, and then I could run through the code.
for(var x:int = 0; x < data.length; x++){
    some_object._display_text.text = algorithm_string;
}
Ok so I want to first say that relying on a user to put in the entry exactly the way you want is probably not a good idea. They WILL make mistakes and your code WILL eventually not work as expected. I would recommend using 5 inputs restricted to numeric input, and labeling each field with which number should go in it.
However, you can accomplish what you are trying to do above like this:
var parts:Array = myInput.text.split(" ");
for (var i:int = 0; i < parts.length; i++){
    if(!isNaN(parseInt(parts[i]))){
        // you have a number here.
        data[x].push(parts[i]);
    } else {
        // this was not a number so ignore it
    }
}
Again, let me state that I think you should refactor how you get the numbers. That code will grab the numbers and put them in indexes 0, 1, 2, 3 and 4 of your data[x], but it relies on the user inputting the text perfectly every time.
Good luck! (refactor) :)
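For what the question literally describes (numbers become column indexes, everything else stays as text), a regular-expression substitution is one way to do it. The sketch below is in Python rather than AS3 and swaps in a different technique from the split-based snippet above; the function name render and the sample data are invented for illustration.

```python
import re

def render(template, row):
    """Replace each run of digits in `template` with row[that index]."""
    def lookup(match):
        # int("01") == 1, so leading zeros in the user's entry are harmless
        return str(row[int(match.group())])
    return re.sub(r"\d+", lookup, template)

template = "00\n01 - 02 - 03\ncomments: 12"
# Hypothetical data: 2 rows of 13 columns, each cell labelled r<row>c<col>
data = [[f"r{r}c{c}" for c in range(13)] for r in range(2)]

lines = [render(template, row) for row in data]
print(lines[0])
```

An index outside the row raises IndexError, so user input still needs validating, which is the point made above about not trusting free-form entry.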

Generating nice looking BETA keys

I built a web application that is going to launch a beta test soon. I would really like to hand out beta invites and keys that look nice.
i.e. A3E6-7C24-9876-235B
This is 16 hexadecimal digits.
It looks like the typical beta key you might see.
My question is what is a standard way to generate something like this and make sure that it is unique and that it will not be easy for someone to guess a beta key and generate their own.
I have some ideas that would probably work for beta keys:
MD5 is secure enough for this, but it is long and ugly looking and could cause confusion between 0 and O, or 1 and l.
I could start off with a large hexadecimal number that is 16 digits in length. To prevent people from guessing what the next beta key might be increment the value by a random number each time. The range of numbers between 1111-1111-1111-1111 and eeee-eeee-eeee-eeee will have plenty of room to spare even if I am skipping large quantities of numbers.
I guess I am just wondering if there is a standard way for doing this that I am not finding with google. Is there a better way?
The canonical "unique identifying number" is a uuid. There are various forms - you can generate one from random numbers (version 4) or from a hash of some value (user's email + salt?) (versions 3 and 5), for example.
Libraries for java, python and a bunch more exist.
PS I have to add that when I read your question title I thought you were looking for something cool and different. You might consider using an "interesting" word list and combining words with hyphens to encode a number (based on hash of email + salt). That would be much more attractive imho: "your beta code is secret-wombat-cookie-ninja" (I'm sure I read an article describing an example, but I can't find it now).
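Both ideas can be sketched in a few lines of Python. This is an illustration under stated assumptions: the tiny WORDS list, the salt value and the function names are made up, and a real word list would need to be far larger for the codes to be hard to guess.

```python
import hashlib
import uuid

def uuid_key():
    """Random (version 4) UUID, shown as four groups of four hex digits."""
    h = uuid.uuid4().hex.upper()[:16]
    # Note: the 13th hex digit of a version 4 UUID is always 4,
    # so this truncation carries 60 random bits, not 64.
    return "-".join(h[i:i + 4] for i in range(0, 16, 4))

# Toy list for illustration only (8 words, so each word encodes 3 bits).
WORDS = ["secret", "wombat", "cookie", "ninja", "velvet", "comet", "maple", "otter"]

def word_key(email, salt, n_words=4):
    """Derive a hyphenated word code from SHA-256(salt + email)."""
    digest = hashlib.sha256((salt + email).encode("utf-8")).digest()
    return "-".join(WORDS[b % len(WORDS)] for b in digest[:n_words])

print(uuid_key())
print(word_key("user@example.com", "some-private-salt"))
```

The hash-based code is reproducible from the email and salt, so it can be re-derived for checking instead of stored, whereas the random UUID key has to be saved when it is issued.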
One way (C# but the code is simple enough to port to other languages):
using System;
using System.Text;

class BetaKeys
{
    private static readonly Random random = new Random(Guid.NewGuid().GetHashCode());

    static void Main(string[] args)
    {
        string x = GenerateBetaString();
        Console.WriteLine(x);
    }

    public static string GenerateBetaString()
    {
        const string alphabet = "ABCDEF0123456789";
        string x = GenerateRandomString(16, alphabet);
        return x.Substring(0, 4) + "-" + x.Substring(4, 4) + "-"
            + x.Substring(8, 4) + "-" + x.Substring(12, 4);
    }

    public static string GenerateRandomString(int length, string alphabet)
    {
        int maxlen = alphabet.Length;
        StringBuilder randomChars = new StringBuilder(length);
        for (int i = 0; i < length; i++)
        {
            randomChars.Append(alphabet[random.Next(0, maxlen)]);
        }
        return randomChars.ToString();
    }
}
Output:
97A8-55E5-C6B8-959E
8C60-6597-B71D-5CAF
8E1B-B625-68ED-107B
A6B5-1D2E-8D77-EB99
5595-E8DC-3A47-0605
Doing it this way gives you precise control over the characters in the alphabet. If you need crypto-strength randomness (unlikely), use a cryptographic random number generator class to generate random bytes (possibly taking each byte modulo the alphabet length).
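If crypto-strength randomness is wanted after all, the same alphabet-based idea can be sketched with Python's secrets module, which draws from the operating system's secure random source. The function names and default parameters below are illustrative, not part of the C# answer.

```python
import secrets

ALPHABET = "ABCDEF0123456789"

def beta_key(groups=4, group_len=4, alphabet=ALPHABET):
    """Build a key such as XXXX-XXXX-XXXX-XXXX from a secure random source."""
    chars = [secrets.choice(alphabet) for _ in range(groups * group_len)]
    return "-".join("".join(chars[i:i + group_len])
                    for i in range(0, len(chars), group_len))

def unique_keys(count):
    """Keep drawing keys until `count` distinct ones have been collected."""
    keys = set()
    while len(keys) < count:
        keys.add(beta_key())
    return keys

print(beta_key())
```

secrets.choice picks uniformly from the alphabet, so confusing characters can be dropped from ALPHABET without introducing the bias that a plain byte-modulo approach can have when the alphabet length does not divide 256.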
Computing power is cheap, take your idea of the MD5 and run an "aesthetic" of your own devising over the set. The code below generates 2000 unique keys almost instantaneously that do not have a 0,1,L,O character in them. Modify aesthetic to fit any additional criteria:
import random, hashlib

def potential_key():
    x = random.random()
    m = hashlib.md5()
    m.update(str(x).encode("ascii"))  # md5 needs bytes in Python 3
    s = m.hexdigest().upper()[:16]
    return "%s-%s-%s-%s" % (s[:4], s[4:8], s[8:12], s[12:])

def aesthetic(s):
    # hex digits never contain L or O, so only 0 and 1 can actually reject a key
    bad_chars = ["0", "1", "L", "O"]
    for b in bad_chars:
        if b in s:
            return False
    return True

key_set = set()
while len(key_set) < 2000:
    k = potential_key()
    if aesthetic(k):
        key_set.add(k)
print(key_set)
Example keys:
'4297-CAC6-9DA8-625A', '43DD-2ED4-E4F8-3E8D', '4A8D-D5EF-C7A3-E4D5',
'A68D-9986-4489-B66C', '9B23-6259-9832-9639', '2C36-FE65-EDDB-2CF7',
'BFB6-7769-4993-CD86', 'B4F4-E278-D672-3D2C', 'EEC4-3357-2EAB-96F5',
'6B69-C6DA-99C3-7B67', '9ED7-FED5-3CC6-D4C6', 'D3AA-AF48-6379-92EF', ...

How to split a string into words. Ex: "stringintowords" -> "String Into Words"?

What is the right way to split a string into words ?
(string doesn't contain any spaces or punctuation marks)
For example: "stringintowords" -> "String Into Words"
Could you please advise what algorithm should be used here ?
! Update: For those who think this question is just for curiosity. This algorithm could be used to camelcase domain names ("sportandfishing .com" -> "SportAndFishing .com") and this algo is currently used by aboutus dot org to do this conversion dynamically.
Let's assume that you have a function isWord(w), which checks if w is a word using a dictionary. Let's for simplicity also assume for now that you only want to know whether for some word w such a splitting is possible. This can be easily done with dynamic programming.
Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the prefix w[1..i] can be split into words. Then set S[1] = isWord(w[1]) and for i = 2 to length(w) calculate
S[i] = isWord(w[1..i]) or (for some j in {2..i}: S[j-1] and isWord(w[j..i])).
This takes O(length(w)^2) time if dictionary queries are constant time. To actually find the splitting, just store the winning split position j in each S[i] that is set to true. This can also be adapted to enumerate all solutions by storing all such splits.
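A minimal Python sketch of this recurrence, with the winning split stored so the words can be read back. It uses 0-based prefixes instead of S[1..n], and the function name segment and the toy dictionary are assumptions for illustration.

```python
def segment(w, is_word):
    """Return one split of `w` into dictionary words, or None if impossible."""
    n = len(w)
    # split_at[i] is the start index j of the last word in some valid split
    # of w[:i], or None when w[:i] cannot be split.
    split_at = [None] * (n + 1)
    split_at[0] = 0                  # the empty prefix is trivially splittable
    for i in range(1, n + 1):
        for j in range(i):
            if split_at[j] is not None and is_word(w[j:i]):
                split_at[i] = j      # remember the winning split
                break
    if split_at[n] is None:
        return None
    words, i = [], n
    while i > 0:                     # walk the stored splits backwards
        j = split_at[i]
        words.append(w[j:i])
        i = j
    return words[::-1]

dictionary = {"string", "into", "words", "in", "to"}
parts = segment("stringintowords", dictionary.__contains__)
print(" ".join(p.capitalize() for p in parts))
```

The two nested loops give the quadratic number of isWord calls mentioned above; slicing the substrings adds its own cost, so the bound assumes the dictionary query dominates.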
As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given by Falk Hüffner. Additional info though:
(a) you should consider implementing isWord with a trie, which will save you a lot of time if you use it properly (that is, by incrementally testing for words).
(b) typing "segmentation dynamic programming" yields a score of more detailed answers, including university-level lectures with pseudo-code, such as this lecture at Duke (which even goes so far as to provide a simple probabilistic approach for dealing with words that are not contained in any dictionary).
There should be a fair bit in the academic literature on this. The key words you want to search for are word segmentation. This paper looks promising, for example.
In general, you'll probably want to learn about markov models and the viterbi algorithm. The latter is a dynamic programming algorithm that may allow you to find plausible segmentations for a string without exhaustively testing every possible segmentation. The essential insight here is that if you have n possible segmentations for the first m characters, and you only want to find the most likely segmentation, you don't need to evaluate every one of these against subsequent characters - you only need to continue evaluating the most likely one.
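To make the "keep only the most likely one" idea concrete, here is a small Python sketch that uses a unigram word model, which is a simplification of a full Markov model. The probabilities in toy_prob are invented for the example and the function name is hypothetical.

```python
import math

def best_segmentation(text, prob, max_word_len=20):
    """Most probable split of `text`, given word probabilities in `prob`."""
    n = len(text)
    # best[i] = (log-probability of the best split of text[:i], start of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            if word in prob and best[j][0] > -math.inf:
                score = best[j][0] + math.log(prob[word])
                if score > best[i][0]:
                    best[i] = (score, j)   # only the best split per prefix survives
    if best[n][0] == -math.inf:
        return None
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return words[::-1]

# Invented numbers, purely for illustration; all must be greater than zero.
toy_prob = {"windows": 0.02, "window": 0.01, "team": 0.03, "steam": 0.001, "blog": 0.01}
print(best_segmentation("windowsteamblog", toy_prob))
```

With these numbers "windows team blog" beats "window steam blog", because only one best candidate is kept for each prefix rather than every partial segmentation.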
If you want to ensure that you get this right, you'll have to use a dictionary based approach and it'll be horrendously inefficient. You'll also have to expect to receive multiple results from your algorithm.
For example: windowsteamblog (of http://windowsteamblog.com/ fame)
windows team blog
window steam blog
Consider the sheer number of possible splittings for a given string. If you have n characters in the string, there are n-1 possible places to split. For example, for the string cat, you can split before the a and you can split before the t. This results in 4 possible splittings.
You could look at this problem as choosing where to split the string. You also need to choose how many splits there will be. So there are Sum(i = 0 to n - 1, n - 1 choose i) possible splittings. By the binomial theorem, with x and y both being 1, this is equal to pow(2, n-1).
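The count is easy to check by brute force for short strings. A small Python sketch (the function name splittings is made up here) that enumerates every choice of cut points:

```python
from itertools import combinations

def splittings(s):
    """Enumerate every way to cut `s` into consecutive non-empty pieces."""
    n = len(s)
    result = []
    for k in range(n):                            # k = number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            result.append([s[a:b] for a, b in zip(bounds, bounds[1:])])
    return result

for pieces in splittings("cat"):
    print(pieces)
print(len(splittings("words")))   # 5 characters
```

For "cat" this lists the four splittings mentioned above, and for a 5-character string it gives 16, matching pow(2, n-1).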
Granted, a lot of this computation rests on common subproblems, so Dynamic Programming might speed up your algorithm. Off the top of my head, computing a boolean matrix M such that M[i,j] is true if and only if the substring of your given string from i to j is a word would help out quite a bit. You still have an exponential number of possible segmentations, but you would quickly be able to eliminate a segmentation if an early split did not form a word. A solution would then be a sequence of integers (i0, j0, i1, j1, ...) with the condition that j_k = i_(k+1).
If your goal is to correctly camel-case URLs, I would sidestep the problem and go for something a little more direct: Get the homepage for the URL, remove any spaces and capitalization from the source HTML, and search for your string. If there is a match, find that section in the original HTML and return it. You'd need an array NumSpaces that records how much whitespace occurs in the original string before each character, like so:
Needle: isashort
Haystack: This is a short phrase
Preprocessed: thisisashortphrase
NumSpaces : 000011233333444444
And your answer would come from:
location = Preprocessed.Search(Needle)
locationInOriginal = location + NumSpaces[location]
originalLength = Needle.length() + NumSpaces[location + Needle.length() - 1] - NumSpaces[location]
Haystack.substring(locationInOriginal, originalLength)   // arguments are (start, length)
Of course, this would break if madduckets.com did not have "Mad Duckets" somewhere on the home page. Alas, that is the price you pay for avoiding an exponential problem.
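Here is how the NumSpaces bookkeeping might look in Python. It is a sketch that assumes plain ASCII text and treats only the space character as whitespace; the function name find_original is invented for the example.

```python
def find_original(needle, haystack):
    """Find `needle` (lowercase, no spaces) in `haystack`; return the matching
    slice of `haystack` with its original spaces and capitals, or None."""
    stripped = []
    num_spaces = []   # num_spaces[i]: spaces before stripped[i] in haystack
    spaces = 0
    for ch in haystack:
        if ch == " ":
            spaces += 1
        else:
            stripped.append(ch.lower())   # assumes lower() keeps one char (ASCII)
            num_spaces.append(spaces)
    location = "".join(stripped).find(needle)
    if location < 0 or not needle:
        return None
    last = location + len(needle) - 1     # index of the needle's last character
    start = location + num_spaces[location]
    end = last + num_spaces[last] + 1     # one past the last original character
    return haystack[start:end]

print(find_original("isashort", "This is a short phrase"))
print(find_original("madduckets", "Welcome to Mad Duckets!"))
```

The space count is read at the needle's last character rather than one past it, so a space that follows the match is not pulled into the result.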
This can be actually done (to a certain degree) without dictionary. Essentially, this is an unsupervised word segmentation problem. You need to collect a large list of domain names, apply an unsupervised segmentation learning algorithm (e.g. Morfessor) and apply the learned model for new domain names. I'm not sure how well it would work, though (but it would be interesting).
This is basically a variation of a knapsack problem, so what you need is a comprehensive list of words and any of the solutions covered in Wiki.
With a fair-sized dictionary this is going to be an insanely resource-intensive and lengthy operation, and you cannot even be sure that the problem will be solved.
Create a list of possible words and sort it from long words to short words.
Check each entry in the list against the first part of the string. If it matches, remove it from the string and append it to your sentence with a space. Repeat this.
A simple Java solution which has O(n^2) running time.
import java.util.HashMap;
import java.util.HashSet;

public class Solution {
    // should contain the list of all words, or you can use any other data structure (e.g. a Trie)
    private HashSet<String> dictionary = new HashSet<String>();

    public String parse(String s) {
        return parse(s, new HashMap<String, String>());
    }

    // returns s split into dictionary words separated by spaces, or null if no split exists
    public String parse(String s, HashMap<String, String> map) {
        if (map.containsKey(s)) {
            return map.get(s);
        }
        if (dictionary.contains(s)) {
            return s;
        }
        for (int left = 1; left < s.length(); left++) {
            String leftSub = s.substring(0, left);
            if (!dictionary.contains(leftSub)) {
                continue;
            }
            String rightSub = s.substring(left);
            String rightParsed = parse(rightSub, map);
            if (rightParsed != null) {
                String parsed = leftSub + " " + rightParsed;
                map.put(s, parsed);
                return parsed;
            }
        }
        map.put(s, null);
        return null;
    }
}
I was looking at the problem and thought maybe I could share how I did it.
It's a little too hard to explain my algorithm in words so maybe I could share my optimized solution in pseudocode:
string mainword = "stringintowords";
array substrings = get_all_substrings(mainword);
/** this way, the dictionary is not queried for word validity
 *  on every substring; it is queried once and for all,
 *  eliminating multiple trips to the data storage
 */
string query = "select word from dictionary where word in " + substrings;
array validwords = execute(query).getArray();
validwords = validwords.sort(length, desc);
array segments = [];
while(mainword != ""){
    bool matched = false;
    for(x = 0; x < validwords.length; x++){
        if(mainword.startswith(validwords[x])) {
            segments.push(validwords[x]);
            mainword = mainword.substring(validwords[x].length);
            matched = true;
            break; // restart from the longest word on the shortened string
        }
    }
    /**
     * remove the first character only if none of the valid words matched, then start again;
     * you may need to add the first character to the result if you want to
     */
    if(!matched) mainword = mainword.substring(1);
}
string result = segments.join(" ");
