What is the most efficient way to replace a list of words without touching html attributes? - algorithm

I absolutely disagree that this question is a duplicate! I am asking for an efficient way to replace hundreds of words at once. This is an algorithm question! All the provided links are about replacing a single word. Should I repeat that expensive operation hundreds of times? I'm sure there are better ways, such as a suffix tree where I filter out the HTML while building the tree. I removed the regex tag because, for no good reason, you are focusing on that part.
I want to take a given set of words (more than 100) and translate them. My first idea was to use a simple regular expression, which works better than expected. As a sample:
const input = "I like strawberry cheese cake with apple juice"
const en2de = {
apple: "Apfel",
strawberry: "Erdbeere",
banana: "Banane",
/* ... */}
input.replace(new RegExp(Object.keys(en2de).join("|"), "gi"), match => en2de[match.toLowerCase()])
This works fine at first glance. However, it gets strange if you have words that contain each other, like "pineapple", which would return "pineApfel", which is total nonsense. So I was thinking about checking word boundaries and similar things. While playing around I created this test case:
Apple is a company
That created the output:
Apfel is a company.
The translation is wrong, which is somewhat tolerable, but the link is broken. That must not happen.
So I was thinking about extending the regex to check whether there is a quote before the word. I know well that parsing HTML with regex is a bad idea, but I thought this should work anyway. In the end I gave up and looked for other developers' solutions, and found a couple of questions on Stack Overflow, all without answers, so it seems to be a hard problem (or my search skills are bad).
So I took two steps back and thought about implementing it myself with a parser or something like that. However, since I have multiple inputs and need to ignore case, I was wondering what the best way is.
Right now I am thinking of building a dictionary with pointers to the positions of the words. I would store the dictionary in lower case so that lookups are fast, and I could also skip all words with the wrong prefix, etc., to get my matches. In the end I would replace the words from the end to the beginning to avoid breaking the indices. But is that approach efficient? Is there a better way to achieve this?
While my sample is in JavaScript, the solution does not have to be in JS, as long as it doesn't include dozens of dependencies that cannot easily be translated to JS.
TL;DR:
I want to replace multiple words with other words in a case-insensitive way without breaking HTML.

You may try a TreeWalker and replace the text in place.
To find words, you can tokenize your text, lower-case each word, and map it.
const mapText = (dic, s) => {
return s.replace(/[a-zA-Z-_]+/g, w => {
return dic[w.toLowerCase()] || w
})
}
const dic = {
'grodzi': 'grodzila',
'laaaa': 'forever',
}
const treeWalker = document.createTreeWalker(
document.body,
NodeFilter.SHOW_TEXT
)
// skip body node
let currentNode = treeWalker.nextNode()
while(currentNode) {
const newS = mapText(dic, currentNode.data)
currentNode.data = newS
currentNode = treeWalker.nextNode()
}
p {background:#eeeeee;}
<p>
grodzi
LAAAA
</p>
The link stays untouched.
However, mapping each word to another language is bound to fail (be it a word with no direct equivalent, humour/irony, or simply grammatical constructs). For that matter (which is a hard problem on its own), you may rely on tools that translate the data for you (neural networks, APIs, ...).
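Outside a browser there is no TreeWalker, so the same "tokenize, lower-case, look up" idea needs another way to skip markup. The following is a minimal sketch, not a full HTML parser: it assumes tags never contain a ">" inside attribute values and that a literal "<" in text is escaped as an entity. The name translateOutsideTags and the sample href are hypothetical.

```javascript
// Hypothetical helper: translate whole words in text segments only,
// leaving anything between "<" and ">" untouched.
function translateOutsideTags(html, dict) {
  // Lower-case the keys once so each lookup is a single Map access.
  const lookup = new Map(
    Object.entries(dict).map(([k, v]) => [k.toLowerCase(), v])
  );
  // Splitting on a capturing group keeps the tags in the result array.
  return html
    .split(/(<[^>]*>)/)
    .map(part =>
      part.startsWith("<")
        ? part // a tag: copy through unchanged
        : part.replace(/[A-Za-z]+/g, w => lookup.get(w.toLowerCase()) ?? w)
    )
    .join("");
}

const en2de = { apple: "Apfel", strawberry: "Erdbeere" };
console.log(
  translateOutsideTags('<a href="apple.html">Apple</a> and pineapple', en2de)
);
```

Because each run of letters is looked up as a whole, "pineapple" is left alone, and the work is one pass over the input regardless of how many words the dictionary holds.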

Here is my current work-in-progress solution using a suffix tree (or at least how I interpreted it). I'm building a dictionary of all words that are not inside a tag, together with their positions. After sorting the matches by position I replace them all. This works for me without handling HTML at all.
function suffixTree(input) {
const dict = new Map()
let start = 0
let insideTag = false
// define word borders
const borders = ' .,<"\'(){}\r\n\t'.split('')
// build dictionary
for (let end = 0; end <= input.length; end++) {
const c = input[end]
if (c === '<') insideTag = true
if (c === '>') {
insideTag = false
start = end + 1
continue
}
if (insideTag && c !== '<') continue
if (borders.indexOf(c) >= 0) {
if(start !== end) {
const word = input.substring(start, end).toLowerCase()
const entry = dict.get(word) || []
// save the word and its position in an array so when the word is
// used multiple times that we can use this list
entry.push([start, end])
dict.set(word, entry)
}
start = end + 1
}
}
// last word handling
const word = input.substring(start).toLowerCase()
const entry = dict.get(word) || []
entry.push([start, input.length])
dict.set(word, entry)
// create a list of replace operations, we would break the
// indices if we do that directly
const words = Object.keys(en2de)
const replacements = []
words.forEach(word => {
(dict.get(word) || []).forEach(match => {
// [0] is start, [1] is end, [2] is the replacement
replacements.push([match[0], match[1], en2de[word]])
})
})
// convert the input to an array of UTF-16 code units (so the indices
// recorded above still line up) and replace the found ranges,
// starting from the end
let output = input.split('')
replacements.sort((a, b) => b[0] - a[0])
replacements.forEach(match => {
output.splice(match[0], match[1] - match[0], match[2])
})
return output.join('')
}
Feel free to leave a comment on how this can be improved.

Related

Kotlin map not working with List of String

I have been working on code where I have to generate all possible ways to the target string. I am using the below-mentioned code.
Print Statement:
println("---------- How Construct -------")
println("${
window.howConstruct("purple", listOf(
"purp",
"p",
"ur",
"le",
"purpl"
))
}")
Function Call:
fun howConstruct(
target: String,
wordBank: List<String>,
): List<List<String>> {
if (target.isEmpty()) return emptyList()
var result = emptyList<List<String>>()
for (word in wordBank) {
if (target.indexOf(word) == 0) { // Starting with prefix
val substring = target.substring(word.length)
val suffixWays = howConstruct(substring, wordBank)
val targetWays = suffixWays.map { way ->
val a = way.toMutableList().apply {
add(word)
}
a.toList()
}
result = targetWays
}
}
return result
}
Expected Output:-
[['purp','le'],['p','ur','p','le']]
Current Output:-
[]
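Your code is almost working; only a few small changes are needed to get the required output:
If the target is empty, return listOf(emptyList()) instead of emptyList().
Use add(0, word) instead of add(word).
Use result = result + targetWays instead of result = targetWays, so that matches found for an earlier word are not overwritten by a later one (here "purpl" matches as a prefix but yields no ways, which would reset the result to an empty list).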
The first of those changes is the important one. Your function returns a list of matches; and since each match is itself a list of strings, it returns a list of lists of strings. Once your code has matched the entire target and calls itself one last time, it returns an empty list (i.e. no matches) instead of a list containing an empty list (meaning one match with no remaining strings).
The second change simply fixes the order of strings within each match, which was reversed (because it appended the prefix after the returned suffix match).
However, there are many other ways that code could be improved. Rather than list them all individually, it's probably easier to give an alternative version:
fun howConstruct(target: String, wordBank: List<String>
): List<List<String>>
= if (target == "") listOf(emptyList())
else wordBank.filter{ target.endsWith(it) } // Look for suffixes of the target in the word bank
.flatMap { suffix: String ->
howConstruct(target.removeSuffix(suffix), wordBank) // For each, recurse to search the rest
.map{ it + suffix } } // And append the suffix to each match.
That does almost exactly the same as your code, except that it searches from the end of the string — matching suffixes — instead of from the beginning. The result is the same; the main benefit is that it's simpler to append a suffix string to a partial match list (using +) than to prepend a prefix (which is quite messy, as you found).
However, it's a lot more concise, mainly because it uses a functional style — in particular, it uses filter() to determine which words are valid suffixes, and flatMap() to collate the list of matches corresponding to each one recursively, as well as map() to append the suffix to each one (like your code does). That avoids all the business of looping over lists, creating lists, and adding to them. As a result, it doesn't need to deal with mutable lists or variables, avoiding some sources of confusion and error.
I've written it with an expression body (with = instead of { … }) for simplicity. I find that simpler and clearer for short functions, though this one is about the limit. It might fit as an extension function on String, since it effectively returns a transformation of the string without any side effects, though again, that tends to work best on short functions.
There are also several small tweaks. It's a bit simpler — and more efficient — to use startsWith() or endsWith() instead of indexOf(); removePrefix() or removeSuffix() is arguably slightly clearer than substring(); and I find == "" clearer than isEmpty().
(Also, the name howConstruct() doesn't really describe the result very well, but I haven't come up with anything better so far…)
Many of these changes are of course a matter of personal preference, and I'm sure other developers would write it in many other ways! But I hope this has given some ideas.
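The base-case point made earlier in this answer (an empty target has exactly one construction, the empty one) can be checked with a small port of the prefix-based recursion. This is only an illustrative sketch in JavaScript, not the Kotlin code itself; it assumes the word bank contains no empty strings, which would otherwise recurse forever.

```javascript
// Port of the prefix-based recursion; names mirror the Kotlin version.
function howConstruct(target, wordBank) {
  // One way to build the empty string: use no words at all.
  if (target === "") return [[]];
  let result = [];
  for (const word of wordBank) {
    if (target.startsWith(word)) {
      const suffixWays = howConstruct(target.slice(word.length), wordBank);
      // Prepend the prefix to every way of building the remainder,
      // and accumulate instead of overwriting earlier matches.
      result = result.concat(suffixWays.map(way => [word, ...way]));
    }
  }
  return result;
}

console.log(howConstruct("purple", ["purp", "p", "ur", "le", "purpl"]));
```

Returning `[]` from the base case instead would make every `map` run over an empty list, so no construction could ever be reported.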

Using the Haxe While Loop to Remove All of a Value from an Array

I want to remove all occurrences of a possibly duplicated value from an array. At the moment I'm using the remove(x:T):Bool function in a while loop, but I'm wondering about the expression part.
I've started by using:
function removeAll(array:Array<String>, element:String):Void
while (array.remove(element)) {}
but I'm wondering if any of these lines would be more efficient:
while (array.remove(element)) continue;
while (array.remove(element)) true;
while (array.remove(element)) 0;
or if it makes any kind of difference.
I'm guessing that using continue is less efficient because it actually has to do something, true and 0 are slightly more efficient, but still do something, and {} would probably be most efficient.
Does anyone have any background information on this?
While others suggested filter, it creates a new array instance, which may cause your other code to lose its reference.
If you loop over array.remove, it scans from the front of the array every time, which is not very performant.
IMO a better approach is to use a reverse while loop:
var i = array.length;
while(--i >= 0)
if(array[i] == element) array.splice(i, 1);
It doesn't make any difference. In fact, there's not even any difference in the generated code for the {}, 0 and false cases: they all end up generating {}, at least on the JS target.
However, you could run into issues if you have a large array with many duplicates: in that case, remove() would be called many times, and it has to iterate over the array each time (until it finds a match, that is). In that case, it's probably more efficient to use filter():
function removeAll(array:Array<String>, element:String):Array<String>
return array.filter(function(e) return e != element);
Personally, I also find this to be a bit more elegant than your while-loop with an empty body. But again, it depends on the use case: this does create a new array, and thus causes an allocation. Usually, that's not worth worrying about, but if you for instance do it in the update loop of a game, you might want to avoid it.
In terms of the expression part of the while loop, it seems that it's just set to empty braces ({}) when compiled, so it doesn't really matter what you do.
In terms of performance, a much better solution is Method 2 from the following:
class Test
{
static function main()
{
var thing:Array<String> = new Array<String>();
for (index in 0...1000)
{
thing.push("0");
thing.push("1");
}
var copy1 = thing.copy();
var copy2 = thing.copy();
trace("epoch");
while (copy1.remove("0")) {}
trace("check");
// Method 2.
copy2 = [
for (item in Lambda.filter(copy2, function(v)
{return v != "0";}))
item
];
trace("check");
}
}
which can be seen [here](https://try.haxe.org/#D0468 "Try Haxe example."). For 200,000 one-character elements in an Array<String>, Method 2 takes 0.017s while Method 1 takes 44.544s.
For large arrays, it will be faster to populate a temporary array and then assign it back (method3 in try).
OR
If you don't want to use a temporary array, you can assign back in place and splice (method4 in try).
https://try.haxe.org/#5f80c
Both are more verbose code-wise because I set up vars, but on a Mac they seem faster at runtime. A summary of my method3 approach:
while( i < l ) { if( ( s = copy[ i++ ] ) != '0' ) arr[ j++ ] = s; }
copy = arr;
Am I missing something obvious against these approaches?
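The read-and-copy idea can also be done in place with a read index and a write index, which keeps the original array reference valid (the concern raised about filter) and touches each element once. A minimal sketch in JavaScript, one of Haxe's compile targets; removeAllInPlace is a hypothetical name, and the timings of the Haxe versions are not claimed to carry over.

```javascript
// Remove every occurrence of `element` from `array` without allocating
// a second array: kept items are shifted towards the front.
function removeAllInPlace(array, element) {
  let write = 0;
  for (let read = 0; read < array.length; read++) {
    if (array[read] !== element) array[write++] = array[read];
  }
  // Drop the leftover tail in a single step.
  array.length = write;
  return array;
}

const sample = ["0", "1", "0", "1", "0"];
removeAllInPlace(sample, "0");
console.log(sample);
```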

How to display a list of number of words of each length - Javascript

Hi guys, I am really stuck on this one :S I have a local .txt file with a random sentence, and my program is meant to:
I am finding it difficult to do the third task. My code is:
JavaScript
lengths.forEach((leng) => {
counter[leng] = counter[leng] || 0;
counter[leng]++;
});
$("#display_File_most").text(counter);
}
}
r.readAsText(f);
}
});
</script>
I have used this question for help but no luck - Using Javascript to find most common words in string?
I believe I have to store the sentence in an array and loop through it, but I'm uncertain whether that is the correct step or whether there is a quicker way to find the solution, so I'm asking you guys.
Thanks for your time & I hope my question made sense :)
If you break your solution into separate, well-defined tasks, it becomes really simple. Here they are together:
Convert the words into an array. Your gut was right about this :)
var source = "Hello world & good morning. The date is 18/09/2018";
var words = source.split(' ');
The next step is to find out the length of each word
var lengths = words.map(function(word) {
return word.length;
});
Finally, the most complicated part is to get the number of occurrences of each length. One idea is to use an object as a key/value store, where the key is the length and the value is its count (source: https://stackoverflow.com/a/10541220/1505348)
Now you will see that the counter object holds each word length along with how many times it occurs in the source string.
var source = "Hello world & good morning. The date is 18/09/2018";
var words = source.split(' ');
var lengths = words.map(function(word) {
return word.length;
});
var counter = {};
lengths.forEach((leng) => {
counter[leng] = counter[leng] || 0;
counter[leng]++;
});
console.log(counter);
3. Produce a list of number of words of each length in sentence (not done).
Based on the question, would this not be the solution?
var words = str.split(" ");
var count = {};
for (var i = 0; i<words.length; i++){
count[words[i].length] = (count [words[i].length] || 0) + 1
}

How to prevent users writing with caps lock?

I don't really like people who write with Caps Lock. In addition to my aversion, it defaces the whole application. I am wondering how to prevent users from writing everything with Caps Lock on. I cannot force all text to lowercase because of special names and abbreviations. What logic should I use?
Politely decline their posts, explaining why, if the number of capital letters exceeds the number of lowercase letters by more than 30, say.
Don't implement this on a FORTRAN forum
You could check how many uppercase characters are in a word, then limit that. Someone above gave the example of names like 'McLaren'; this approach would allow that. The downside is that if you set the maximum to 3, 'LOL' would still be possible.
The way to go would be to take the length of the word ('McLaren' would be 7) and then cap the uppercase count at a percentage like 20%. This allows longer words to have more uppercase characters, but not to be all caps. (Nothing will completely prevent it, but this will make it harder for them.)
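Both heuristics suggested so far (a whole-post margin and a per-word share of capitals) are easy to prototype. The sketch below is only an illustration: the function names are hypothetical, the margin of 30 is the value floated above, and only ASCII letters are counted, which is an assumption that would not hold for names with accented characters.

```javascript
// Count ASCII upper- and lower-case letters; other characters are ignored.
function countCase(text) {
  const upper = (text.match(/[A-Z]/g) || []).length;
  const lower = (text.match(/[a-z]/g) || []).length;
  return { upper, lower };
}

// Whole-post rule: capitals outnumber lower-case letters by more than `margin`.
function exceedsCapsMargin(text, margin = 30) {
  const { upper, lower } = countCase(text);
  return upper - lower > margin;
}

// Per-word rule: share of upper-case letters in a single word (0 to 1).
function upperRatio(word) {
  const { upper, lower } = countCase(word);
  const letters = upper + lower;
  return letters === 0 ? 0 : upper / letters;
}

console.log(upperRatio("McLaren"), upperRatio("LOL"));
```

Note that 'McLaren' has 2 capitals out of 7 letters, roughly 29%, so a strict 20% per-word cap would reject it too; the threshold needs tuning against real names before it is enforced.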
Fun fact, today is international caps-lock day. :)
keypress: function(e) {
var ev = e ? e : window.event;
if (!ev) {
return;
}
var targ = ev.target ? ev.target : ev.srcElement;
// get key pressed
var which = -1;
if (ev.which) {
which = ev.which;
} else if (ev.keyCode) {
which = ev.keyCode;
}
// get shift status
var shift_status = false;
if (ev.shiftKey) {
shift_status = ev.shiftKey;
} else if (ev.modifiers) {
shift_status = !!(ev.modifiers & 4);
}
// At this point, you have the ASCII code in "which",
// and shift_status is true if the shift key is pressed
}
Source: http://24ways.org/2007/capturing-caps-lock
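The handler above stops once it has the character code and the shift status. One common way to finish the check is to compare the letter's case with the shift state; the sketch below assumes ASCII letters only, and the name capsLockLikelyOn is hypothetical. The lower-case-with-shift branch depends on the platform (some systems still produce capitals in that case), so treat the result as a hint rather than a fact.

```javascript
// Guess the Caps Lock state from a keypress character code and shift status.
function capsLockLikelyOn(charCode, shiftPressed) {
  const isUpper = charCode >= 65 && charCode <= 90;  // 'A'..'Z'
  const isLower = charCode >= 97 && charCode <= 122; // 'a'..'z'
  // A capital without shift, or a lower-case letter with shift,
  // suggests that Caps Lock is inverting the case.
  return (isUpper && !shiftPressed) || (isLower && shiftPressed);
}

console.log(capsLockLikelyOn("A".charCodeAt(0), false));
```

In current browsers, `event.getModifierState("CapsLock")` on a keyboard event reports the state directly, which avoids the guesswork where it is available.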

Word-separating algorithm

What is the algorithm - seemingly in use on domain parking pages - that takes a spaceless bunch of words (eg "thecarrotofcuriosity") and more-or-less correctly breaks it down into the constituent words (eg "the carrot of curiosity") ?
Start with a basic Trie data structure representing your dictionary. As you iterate through the characters of the string, search your way through the trie with a set of pointers rather than a single pointer; the set is seeded with the root of the trie. For each letter, the whole set is advanced at once via the pointer indicated by the letter, and if a set element cannot be advanced by the letter, it is removed from the set. Whenever you reach a possible end-of-word, add a new root-of-trie to the set (keeping track of the list of words seen so far, associated with that set element). Finally, once all characters have been processed, return an arbitrary list of words whose set element is at the root-of-trie. If there's more than one, that means the string could be broken up in multiple ways (such as "therapistforum", which can be parsed as ["therapist", "forum"] or ["the", "rapist", "forum"]), and it's undefined which one we'll return.
Or, in a wacked up pseudocode (Java foreach, tuple indicated with parens, set indicated with braces, cons using head :: tail, [] is the empty list):
List<String> breakUp(String str, Trie root) {
Set<(List<String>, Trie)> set = {([], root)};
for (char c : str) {
Set<(List<String>, Trie)> newSet = {};
for (List<String> ls, Trie t : set) {
Trie tNext = t.follow(c);
if (tNext != null) {
newSet.add((ls, tNext));
if (tNext.isWord()) {
newSet.add((t.follow(c).getWord() :: ls, root));
}
}
}
set = newSet;
}
for (List<String> ls, Trie t : set) {
if (t == root) return ls;
}
return null;
}
Let me know if I need to clarify or I missed something...
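For anyone who wants to run the idea, here is a minimal JavaScript sketch of the pseudocode above. The trie layout (a Map of children plus a `word` field) and the names buildTrie and breakUp are assumptions made for this example. Unlike the cons-based pseudocode it appends words, so the result comes back in reading order.

```javascript
// Build a trie from a word list; each node has a Map of children and,
// if a dictionary word ends there, that word.
function buildTrie(words) {
  const root = { next: new Map(), word: null };
  for (const w of words) {
    let node = root;
    for (const ch of w) {
      if (!node.next.has(ch)) node.next.set(ch, { next: new Map(), word: null });
      node = node.next.get(ch);
    }
    node.word = w;
  }
  return root;
}

// Advance a set of (words so far, trie node) states one character at a time.
function breakUp(str, root) {
  let states = [{ words: [], node: root }];
  for (const ch of str) {
    const nextStates = [];
    for (const { words, node } of states) {
      const child = node.next.get(ch);
      if (!child) continue; // this state cannot consume ch: drop it
      nextStates.push({ words, node: child });
      if (child.word !== null) {
        // A word just ended: also restart from the root.
        nextStates.push({ words: [...words, child.word], node: root });
      }
    }
    states = nextStates;
  }
  // Only states sitting at the root have consumed the string as whole words.
  const done = states.find(s => s.node === root);
  return done ? done.words : null;
}

const trie = buildTrie(["the", "car", "carrot", "of", "curiosity"]);
console.log(breakUp("thecarrotofcuriosity", trie));
```

The state list can grow when many splits are possible; keeping only one state per trie node would bound it, at the cost of choosing a split early.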
I would imagine they take a dictionary word list like /usr/share/dict/words on your common or garden variety Unix system and try to find sets of word matches (starting from the left?) that result in the largest amount of original text being covered by a match. A simple breadth-first-search implementation would probably work fine, since it obviously doesn't have to run fast.
I'd imagine these sites do it similarly to this:
Get a list of words for your target language
Remove "useless" words like "a", "the", ...
Run through the list and check which of the words are substrings of the domain name
Take the most common words of the remaining list (Or the ones with the highest adsense rating,...)
Of course that leads to nonsense for expertsexchange, but what else would you expect there...
(Disclaimer: I did not try it myself, so take it merely as food for experimentation. 4-grams are picked mostly out of the blue, just from my experience that 3-grams won't work all too well; 5-grams and more might work better, even though you would have to deal with a pretty large table.) It's also simplistic in the sense that it does not take the end of the string into account; if it works for you otherwise, you'd probably need to think about fixing the endings.
This algorithm would run in a predictable time proportional to the length of the string that you are trying to split.
So, first: take a lot of human-readable texts. For each text, supposing it is in a single string str, run the following algorithm (pseudocode-ish notation; it assumes that [] is hashtable-like indexing and that nonexistent indexes return 0):
for(i=0;i<length(str)-5;i++) {
// take 4-character substring starting at position i
subs2 = substring(str, i, 4);
if(has_space(subs2)) {
subs = substring(str, i, 5);
delete_space(subs);
yes_space[subs][position(space, subs2)]++;
} else {
subs = subs2;
no_space[subs]++;
}
}
This will build you the tables which will help to decide whether a given 4-gram would need to have a space in it inserted or not.
Then, take your string to split, I denote it as xstr, and do:
for(i=0;i<length(xstr)-5;i++) {
subs = substring(xstr, i, 4);
for(j=0;j<4;j++) {
do_insert_space_here[i+j] -= no_space[subs];
}
for(j=0;j<4;j++) {
do_insert_space_here[i+j] += yes_space[subs][j];
}
}
Then you can walk the "do_insert_space_here[]" array - if an element at a given position is bigger than 0, then you should insert a space in that position in the original string. If it's less than zero, then you shouldn't.
Please drop a note here if you try it (or something of this sort) and it works (or does not work) for you :-)
