What kinds of algorithms and data structures would help me do that?
I have a file containing about 10,000 lines loaded in memory in an ordered set. Given a search string, I want to be able to get all the lines that have words prefixed with the words found in the search string. Let me give an example to clarify this:
Lines:
"A brow Fox flies."
"Boxes are full of food."
"Cats runs slow"
"Dogs hates eagles"
"Dolphins have eyes and teath"
Case 1:
search string = "fl b a"
"A brow Fox flies."
Explanation: the search string has three words, "fl", "b", and "a", and
the only line that has words prefixed with the words
from the search string is line 1.
Case 2:
search string "e do ha"
"Dogs hates eagles", "Dolphins have eyes and teath"
Solution
(Fast enough for me: it took about 30 ms, including sorting the final result, on my PC on a set of 10k lines with 3 words per line.)
I used a trie, as suggested in the answer.
And some other hacky methods to filter out duplicate and false-positive results (mainly hash sets).
I think what you're probably wanting is a trie. Construct one for the set of all words in your document, and have each leaf point to a hashset containing the indices of the lines in which the key of the leaf appears.
To execute a search, you'd use each fragment of the search string to navigate to a node in the tree and take the union over the hashsets of all leaves in that node's subtree. Then you'd take the intersection of those unions over the set of fragments to get the list of lines satisfying the search string.
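To make that concrete, here is a minimal Python sketch of the approach. It assumes the line set is stored at every trie node rather than only at the leaves (which precomputes the union over each subtree described above); the TrieNode/build_trie/search names are illustrative, not from the original answer.

class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.lines = set()   # indices of lines containing a word that passes through this node

def build_trie(lines):
    root = TrieNode()
    for line_no, line in enumerate(lines):
        for word in line.lower().split():
            node = root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
                node.lines.add(line_no)   # every prefix of the word now points back to this line
    return root

def search(trie, query):
    result = None
    for fragment in query.lower().split():
        node = trie
        for ch in fragment:
            node = node.children.get(ch)
            if node is None:
                return set()              # no word in any line starts with this fragment
        # intersect the line sets across all fragments
        result = node.lines if result is None else result & node.lines
    return result or set()

lines = ["A brow Fox flies.", "Boxes are full of food.", "Cats runs slow",
         "Dogs hates eagles", "Dolphins have eyes and teath"]
trie = build_trie(lines)
print([lines[i] for i in sorted(search(trie, "fl b a"))])   # ['A brow Fox flies.']

Storing the line set at every node trades a little memory for lookup speed: each query fragment then needs only one walk down the trie plus a set intersection.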
Here is my 2 cents:
class DicVal
{
    public int OriginalValue;   // how many words in a line must start with this prefix
    public int CurrentValue;    // how many matches are still needed for the current line
    public int LineNumber;      // the line that CurrentValue currently refers to
}

private static void Main()
{
    var a = "A brow Fox flies.\r\n" +
            "Boxes are full of food.\r\n" +
            "Cats runs slow\r\n" +
            "Dogs hates eagles\r\n" +
            "A brow Fox flies. AA AB AC\r\n" +
            "Dolphins have eyes and teath";
    var lines = a.Split(new[] {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries);
    var dic = new Dictionary<string, DicVal>
    {
        {"fl", new DicVal { OriginalValue = 1, LineNumber = -1}},
        {"b", new DicVal { OriginalValue = 1, LineNumber = -1}},
        {"a", new DicVal { OriginalValue = 4, LineNumber = -1}}
    };
    var globalCount = dic.Sum(x => x.Value.OriginalValue);
    var lineNumber = 0;
    foreach (var line in lines)
    {
        var words = line.Split(' ');
        var currentCount = globalCount;
        foreach (var word in words.Select(x => x.ToLower()))
        {
            // check every prefix of the word against the search dictionary
            for (var i = 1; i <= word.Length; i++)
            {
                var substr = word.Substring(0, i);
                if (dic.ContainsKey(substr))
                {
                    // reset the per-line counter when we move on to a new line
                    if (dic[substr].LineNumber != lineNumber)
                    {
                        dic[substr].CurrentValue = dic[substr].OriginalValue;
                        dic[substr].LineNumber = lineNumber;
                    }
                    if (dic[substr].CurrentValue > 0)
                    {
                        currentCount--;
                        dic[substr].CurrentValue--;
                    }
                }
            }
        }
        // the line matches when every required prefix has been satisfied
        if (currentCount == 0)
            Console.WriteLine(line);
        lineNumber++;
    }
}
Not going to explain much, as code is the best documentation :P.
Output: A brow Fox flies. AA AB AC
Assuming you implement everything efficiently, the running time will be as good as possible, since you need to read every word at least ONCE.
Further optimization can be done by applying threading. You can look into the PARALLEL AGGREGATION concept, as this problem can be parallelized easily.
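For illustration, here is a hedged sketch of the parallel-aggregation idea, written in Python rather than the C# above (in .NET you would reach for PLINQ or Parallel.ForEach); the matches check is a simplified stand-in for whatever per-line test you actually use.

from concurrent.futures import ProcessPoolExecutor

def matches(line, fragments):
    # simplified check: every search fragment must prefix at least one word in the line
    words = [w.lower() for w in line.split()]
    return all(any(w.startswith(f) for w in words) for f in fragments)

def search_chunk(args):
    chunk, fragments = args
    # each worker produces a partial result for its own slice of the lines
    return [line for line in chunk if matches(line, fragments)]

def parallel_search(lines, query, workers=4):
    fragments = query.lower().split()
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partial_results = pool.map(search_chunk, [(chunk, fragments) for chunk in chunks])
    # aggregation step: merge the partial results into one list
    return [line for partial in partial_results for line in partial]

if __name__ == "__main__":
    lines = ["A brow Fox flies.", "Boxes are full of food.", "Cats runs slow",
             "Dogs hates eagles", "Dolphins have eyes and teath"]
    print(parallel_search(lines, "fl b a"))   # ['A brow Fox flies.']

Each worker filters its own chunk of lines independently, and merging the partial results at the end is the aggregation step.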
Here's a fairly simple implementation that should be appropriate for your use case. The idea is that you can store all combinations of short prefixes for each line (and for each query), since you only have 10,000 lines and assuming each line doesn't contain too many words. Then look up each hash generated from the query string. For each hash match, we then check for an exact match. In my example code I consider only prefixes of length 1; however, you could repeat this approach for prefixes of length 2 and 3, provided the prefixes in your query have those lengths too.
__author__ = 'www.google.com/+robertking'

from itertools import combinations
from collections import defaultdict

lines = [
    "A brow Fox flies.",
    "Boxes are full of food.",
    "Cats runs slow",
    "Dogs hates eagles",
    "Dolphins have eyes and teath"
]
lines = [line.lower() for line in lines]

def short_prefixes(line):
    for word in line.split():
        yield word[:1]

def get_hashes(line):
    starts = list(short_prefixes(line))
    for prefixes_in_hash in range(1, min(4, len(starts))):
        for hash_group in combinations(starts, r=prefixes_in_hash):
            yield tuple(sorted(hash_group))

def get_hash_map():
    possible_matches = defaultdict(list)
    for line_pos, line in enumerate(lines):
        for hash in get_hashes(line):
            possible_matches[hash].append(line_pos)
    return possible_matches

possible_matches = get_hash_map()

def ok(line, q):
    return all(line.startswith(prefix) or ((" " + prefix) in line) for prefix in q)

def query(search_string):
    search_string = search_string.lower()
    q = search_string.split()
    hashes = set(get_hashes(search_string))
    possible_lines = set()
    for hash in hashes:
        for line_pos in possible_matches[hash]:
            possible_lines.add(line_pos)
    for line_pos in possible_lines:
        if ok(lines[line_pos], q):
            yield lines[line_pos]

print(list(query("fl b a")))
#['a brow fox flies.']
Related
So I was creating an adjacency list from an undirected graph:
val presentedGraph = listOf(
    listOf('i', 'j'),
    listOf('k', 'i'),
    listOf('m', 'k'),
    listOf('k', 'l'),
    listOf('o', 'n')
)
The outcome that I was looking for was this
hashMapOf(
    'i' to listOf('j', 'k'),
    'j' to listOf('i'),
    'k' to listOf('i', 'm', 'l'),
    'm' to listOf('k'),
    'l' to listOf('k'),
    'o' to listOf('n'),
    'n' to listOf('o')
)
But got this instead
{i=[i], j=[j], k=[k], l=[l], m=[m], n=[n], o=[o]}
Here's the code for it
fun undirectedPath(edges: List<List<Char>>, root: Char, destination: Char) {
    val graph = buildGraph(edges)
    println(graph)
}

fun buildGraph(edges: List<List<Char>>): HashMap<Char, List<Char>> {
    val graph = hashMapOf<Char, List<Char>>()
    for (i in edges.indices) {
        for (j in edges[i].indices) {
            val a = edges[i][j]
            val b = edges[i][j]
            if (!graph.containsKey(a)) { graph[a] = listOf() }
            if (!graph.containsKey(b)) { graph[b] = listOf() }
            graph[a] = listOf(b)
            graph[b] = listOf(a)
        }
    }
    return graph
}
Any help will be appreciated, Thank You.
Several things wrong here:
The fact that you're setting both a and b to the same expression ought to be a clue that one of them is wrong! In fact a should be set to edges[i][0].
Because j runs from 0, it effectively assumes an extra edge from each node to itself. To avoid that, j should skip the first item and start from 1.
Each time you assign graph[a] and graph[b], you discard any previous items. That's why the result has only one target for each edge. To fix that, you need to add() the target to the existing list…
…which means that each target list must be a MutableList.
Those changes should be enough to get the result you want.
However, there are still several code smells present. For one thing, the input is a list of lists — but each of the inner lists has exactly two items. It would be neater to use a more precise structure, such as a Pair.
And it's always worth being aware of the standard library, which includes a wide range of manipulations and algorithms. In this case, you could replace the whole function with a one-liner:
fun buildGraph(edges: List<Pair<Char, Char>>)
        = (edges + edges.map{ it.second to it.first })
            .groupBy({ it.first }, { it.second })
As well as being a good deal shorter, that also makes it a good deal clearer what it's doing: combining the list of edges with the reverse list, and returning a map from each node to the list of nodes it connects to/from.
You can try this.
val hashMap = HashMap<Char, ArrayList<Char>>()
presentedGraph.forEach { list ->
    list.forEach { char ->
        if (!hashMap.containsKey(char)) {
            hashMap[char] = arrayListOf()
        }
        hashMap[char]?.addAll(list.filter { char != it }.toList().distinct())
    }
}
println(hashMap)
println(hashMap)
Output:
{i=[j, k], j=[i], k=[i, m, l], l=[k], m=[k], n=[o], o=[n]}
I've spent some time working on the problem and got this close:
fun lengthOfLongestSubstring(s: String): Int {
    var set = HashSet<Char>()
    var initalChar = 0
    var count = 0
    s.forEach { r ->
        while (!set.add(s[r]))
            set.remove(s[r])
        initalChar++
        set.add(s[r])
        count = maxOf(count, r - initialChar + 1)
    }
    return count
}
I understand that a HashSet is needed to answer the question since it doesn't allow for repeating characters but I keep getting a type mismatch error. I'm not above being wrong. Any assistance will be appreciated.
Your misunderstanding is that r represents a character in the string, not an index of the string, so saying s[r] doesn't make sense. You just mean r.
But you are also using r on its own, so you should be using forEachIndexed, which lets you access both the element of the sequence and the index of that element:
s.forEachIndexed { i, r ->
    while (!set.add(r))
        set.remove(r)
    initialChar++
    set.add(r)
    count = maxOf(count, i - initialChar + 1)
}
Though there are still some parts of your code that don't quite make sense.
while(!set.add(r)) set.remove(r) is functionally the same as set.add(r). If add returns false, that means the element is already in the set; you remove it, and the next iteration of the loop adds it back into the set. If add returns true, that means the set didn't have the element and it was successfully added. So in any case, the result is that you add r to the set.
And then you do set.add(r) again two lines later for some reason?
Anyway, here is a brute-force solution that you can use as a starting point to optimise:
fun lengthOfLongestSubstring(s: String): Int {
    val set = mutableSetOf<Char>()
    var currentMax = 0
    // for each substring starting at index i...
    for (i in s.indices) {
        // update the current max from the previous iterations...
        currentMax = maxOf(currentMax, set.size)
        // clear the set to record a new substring
        set.clear()
        // loop through the characters in this substring
        for (j in i..s.lastIndex) {
            if (!set.add(s[j])) { // if the letter already exists
                break // go to the next iteration of the outer for loop
            }
        }
    }
    return maxOf(currentMax, set.size)
}
Hi guys, I am really stuck in this one situation :S I have a local .txt file with a random sentence, and my program is meant to:
I am finding it difficult to do the third task. My code is:
JavaScript
lengths.forEach((leng) => {
counter[leng] = counter[leng] || 0;
counter[leng]++;
});
$("#display_File_most").text(counter);
}
}
r.readAsText(f);
}
});
</script>
I have used this question for help but no luck - Using Javascript to find most common words in string?
I believe I have to store the sentence in an array and loop through it, but I'm uncertain whether that is the correct step or if there is a quicker way of finding the solution, so I'm asking you guys.
Thanks for your time & I hope my question made sense :)
If you break your solution into separate, well-defined tasks, it becomes really simple. Here they are together:
Convert the sentence into an array of words. Your gut was right about this :)
var source = "Hello world & good morning. The date is 18/09/2018";
var words = source.split(' ');
The next step is to find out the length of each word
var lengths = words.map(function(word) {
    return word.length;
});
Finally, the most complicated part is to get the number of occurrences of each length. One idea is to use an object as a key/value store, where the key is a word length and the value is its count (source: https://stackoverflow.com/a/10541220/1505348).
Now you will see that the counter object holds each word length together with the number of times it occurs in the source string.
var source = "Hello world & good morning. The date is 18/09/2018";
var words = source.split(' ');
var lengths = words.map(function(word) {
return word.length;
});
var counter = {};
lengths.forEach((leng) => {
counter[leng] = counter[leng] || 0;
counter[leng]++;
});
console.log(counter);
3. Produce a list of the number of words of each length in the sentence (not done).
Based on the question, would this not be the solution?
var words = str.split(" ");
var count = {};
for (var i = 0; i < words.length; i++) {
    count[words[i].length] = (count[words[i].length] || 0) + 1;
}
Here's a self-thought up quiz very similar to a real life problem that I'm facing.
Say I have a list of strings (say it's called stringlist), and among them some have two digit numbers attached at the end. For example, "foo", "foo01", "foo24".
I want to group those with the same letters (but with different two digit numbers at the end).
So, "foo", "foo01", and "foo24" would be under the group "foo".
However, I can't just check for any string that begins with "foo", because we can also have "food", "food08", "food42".
There are no duplicates.
It is possible to have numbers in the middle. Ex) "foo543food43" is under group "foo543food"
Or even multiple numbers at the end. Ex) "foo1234" is under group "foo12"
The most obvious solution I can think of is having a list of digits:
numbers = ["0", "1", "2", ... "9"]
Then, I would do
grouplist = [[]] //Of the form: [[group_name1, word_index1, word_index2, ...], [group_name2, ...]]
for(word_index=0; word_index < len(stringlist); word_index++) //loop through stringlist
    for(char_index=0; char_index < len(stringlist[word_index]); char_index++) //loop through the word
        if(char_index == len(stringlist[word_index])-1) //Reached the end
            for(number1 in numbers)
                if(stringlist[word_index][char_index] == number1) //Found a number at the end
                    for(number2 in numbers)
                        if(stringlist[word_index][char_index-1] == number2) //Found another number one before the end
                            group_name = stringlist[word_index].substring(0,char_index-1)
                            for(group_element in grouplist)
                                if(group_element[0] == group_name) //Does that group name exist already? If so, add the index to the end. If not, add the group name and the index.
                                    group_element.append(word_index)
                                else
                                    grouplist.append([stringlist[word_index].substring(0,char_index-1), word_index])
                            break //If you found the second number, stop looping through numbers
                    break //If you found the first number, stop looping through numbers
Now this looks messy as hell. Any cleaner way you guys can think of?
Any of the data structures, including the final result's, can be whatever you want them to be.
I would create a map that maps the group name to a list of all Strings in the corresponding group.
Here is my approach in Java:
public Map<String, List<String>> createGroupMap(List<String> listOfAllStrings) {
    Map<String, List<String>> result = new HashMap<>();
    for (String s : listOfAllStrings) {
        addToMap(result, s);
    }
    return result;
}

private void addToMap(Map<String, List<String>> map, String s) {
    String group = getGroupName(s);
    if (!map.containsKey(group))
        map.put(group, new ArrayList<String>());
    map.get(group).add(s);
}

private String getGroupName(String s) {
    return s.replaceFirst("\\d+$", "");
}
Maybe you can gain some speed by avoiding the RegExp in getGroupName(..) but you need to profile it to be sure that an implementation without RegExp would be faster.
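For illustration, a regex-free variant could simply scan backwards past the trailing digits; here is a hedged sketch of that idea in Python rather than Java (the Java version would do the same scan with Character.isDigit):

def get_group_name(s):
    # walk back from the end while the characters are digits, then cut the
    # string there -- equivalent to replaceFirst("\\d+$", "")
    end = len(s)
    while end > 0 and s[end - 1].isdigit():
        end -= 1
    return s[:end]

print(get_group_name("foo24"))         # foo
print(get_group_name("foo543food43"))  # foo543food

Whether this actually beats the regex is exactly the kind of thing the profiler has to decide.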
You can divide the string into 2 parts like this.
// splits the string into a group-name prefix and a trailing number of at most
// two digits, e.g. divide("foo1234") -> {"foo12", 34}
pair<string, int> divide(string s) {
    int r = 0;
    if (isdigit(s.back())) {
        r = s.back() - '0';
        s.pop_back();
        if (isdigit(s.back())) {
            r += 10 * (s.back() - '0');
            s.pop_back();
        }
    }
    return {s, r};
}
Source: Microsoft interview question
We are given a file containing words. We need to determine all the anagrams present in it.
Can someone suggest the most optimal algorithm to do this?
The only way I know is:
Sorting all the words, then checking.
It would be good to know more about the data before suggesting an algorithm, but let's just assume that the words are in English and in a single case.
Let's assign each letter a prime number from 2 to 101. For each word we can compute its "anagram number" by multiplying the numbers corresponding to its letters.
Let's declare a dictionary of {number, list} pairs, and one list to collect the resulting anagrams into.
Then we can collect anagrams in two steps: traverse the file and put each word into the dictionary's list according to its "anagram number"; then traverse the map and, for every list with length greater than 1, store its contents in the single big anagram list.
UPDATE:
import operator
from functools import reduce

words = ["thore", "ganamar", "notanagram", "anagram", "other"]

letter_code = {'a':2, 'b':3, 'c':5, 'd':7, 'e':11, 'f':13, 'g':17, 'h':19, 'i':23, 'j':29, 'k':31, 'l':37, 'm':41, 'n':43,
               'o':47, 'p':53, 'q':59, 'r':61, 's':67, 't':71, 'u':73, 'v':79, 'w':83, 'x':89, 'y':97, 'z':101}

def evaluate(word):
    # the "anagram number": product of the primes assigned to the word's letters
    return reduce(operator.mul, [letter_code[letter] for letter in word])

anagram_map = {}
anagram_list = []
for word in words:
    anagram_number = evaluate(word)
    if anagram_number in anagram_map:
        anagram_map[anagram_number] += [word]
    else:
        anagram_map[anagram_number] = [word]
    if len(anagram_map[anagram_number]) == 2:
        anagram_list += anagram_map[anagram_number]
    elif len(anagram_map[anagram_number]) > 2:
        anagram_list += [word]

print(anagram_list)
Of course the implementation can be optimized further. For instance, you don't really need a map of anagrams; just counters would do fine. But I guess the code illustrates the idea best as it is.
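As a hedged sketch of that counter-only variant (reusing the words list and the evaluate() helper from the snippet above), a two-pass version could look like this:

from collections import Counter

# first pass: count how often each anagram number occurs
counts = Counter(evaluate(word) for word in words)
# second pass: keep every word whose anagram number occurs more than once
anagram_list = [word for word in words if counts[evaluate(word)] > 1]
print(anagram_list)   # ['thore', 'ganamar', 'anagram', 'other']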
You can use "Tries".A trie (derived from retrieval) is a multi way search tree. Tries use pattern matching algorithms. It's basic use is to create spell check programs, but I think it can help your case..
Have a look at this link http://ww0.java4.datastructures.net/handouts/Tries.pdf
I just did this one not too long ago, in a different way.
split the file content into an array of words
create a HashMap that maps a key string to a linked list of strings
for each word in the array, sort the letters in the word and use that as the key to a linked list of anagrams
public static void allAnagrams2(String s) {
    String[] input = s.toLowerCase().replaceAll("[^a-z\\s]", "").split("\\s");
    HashMap<String, LinkedList<String>> hm = new HashMap<String, LinkedList<String>>();
    for (int i = 0; i < input.length; i++) {
        String current = input[i];
        char[] chars = current.toCharArray();
        Arrays.sort(chars);
        String key = new String(chars);   // sorted letters act as the anagram key
        LinkedList<String> ll = hm.containsKey(key) ? hm.get(key) : new LinkedList<String>();
        ll.add(current);
        if (!hm.containsKey(key))
            hm.put(key, ll);
    }
}
A slightly different approach from the one above, returning a set of anagrams instead.
public static HashSet<String> anagrams(String[] list) {
    HashMap<String, String> hm = new HashMap<String, String>();
    HashSet<String> anagrams = new HashSet<String>();
    for (int i = 0; i < list.length; i++) {
        char[] chars = list[i].toCharArray();
        Arrays.sort(chars);
        String k = new String(chars);   // sorted letters as the anagram key
        if (hm.containsKey(k)) {
            anagrams.add(list[i]);      // both the current word and the stored one are anagrams
            anagrams.add(hm.get(k));
        } else {
            hm.put(k, list[i]);
        }
    }
    return anagrams;
}