I have a column in Power Query (standalone power query with Excel), with text like this
"Hazelnut Berries Nuts Raspberry"
I need to identify whether it contains more than one instance of "nut" (or "berry") and remove the generic word, so that the result is
"Hazelnut Raspberry"
I have seen this post, but it works off whole words repeated.
I'm not entirely certain about your criteria for searching for the words you want to remove (PQ is fairly limited in how it can evaluate this with built in functions anyways). This will look through that string and remove any words that start with "Nut" or "Berr".
Text.Combine(List.Transform(Text.Split("Hazelnut Berries Nuts Raspberry", " "), each if (Text.StartsWith(_, "Nut") or Text.StartsWith(_, "Berr")) then null else _), " ")
Which will get your desired output. Don't know if you need more detailed criteria for evaluating each word, but that would probably need a custom function.
List.Distinct: https://learn.microsoft.com/en-ie/powerquery-m/list-distinct should do it; something like: List.Distinct(Text.Split("Hazelnut Berries Nuts Raspberry", " "))
You might need a bit more if your list could contain multiple spaces or other "stuff"
Related
Suppose I want to switch certain pairs of words. Say, for example, I want to switch dogs with cats and mice with rats, so that
This is my opinion about dogs and cats: I like dogs but I don't like cats. This is my opinion about mice and rats: I'm afraid of mice but I'm not afraid of rats.
becomes
This is my opinion about cats and dogs: I like cats but I don't like dogs. This is my opinion about rats and mice: I'm afraid of rats but I'm not afraid of mice.
The naïve approach
text = text.replace("dogs", "cats")
.replace("cats", "dogs")
.replace("mice", "rats")
.replace("rats", "mice")
is problematic since it can perform replacement on the same words multiple times. Either of the above example sentences would become
This is my opinion about dogs and dogs: I like dogs but I don't like dogs. This is my opinion about mice and mice: I'm afraid of mice but I'm not afraid of mice.
What's the simplest algorithm for replacing string pairs, while preventing something from being replaced multiple times?
Use whichever string search algorithm you deem to be appropriate, as long as it is able to search for regular expressions. Search for a regex that matches all the words you want to swap, e.g. dogs|cats|mice|rats. Maintain a separate string (in many languages, this needs to be some kind of StringBuilder in order for repeated appending to be fast) for the result, initially empty. For each match, you append the characters between the end of the previous match (or the beginning of the string) and the current match, and then you append the appropriate replacement (presumably obtained from a hashmap) to the result.
Most standard libraries should allow you to do this easily with built-in methods. For a Java example, see the documentation of Matcher.appendReplacement(StringBuffer, String). I recall doing this in C# as well, using a feature where you can specify a lambda function that decides what to replace each match with.
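As a rough illustration of the single-pass idea described above, here is a minimal Python sketch. The function name swap_words and the shortened sample sentence are made up for this example; re.sub with a function argument plays the role of the "lambda that decides the replacement", and the dictionary stands in for the hashmap.

```python
import re

def swap_words(text, pairs):
    """Swap each pair of words in one pass, so nothing is replaced twice."""
    # Two-way lookup: dogs -> cats and cats -> dogs, and so on.
    mapping = {}
    for a, b in pairs:
        mapping[a] = b
        mapping[b] = a
    # Try longer alternatives first so a short word never shadows a longer one.
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in alternatives))
    # re.sub copies the text between matches and calls the function per match.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

text = "I like dogs but I don't like cats. I'm afraid of mice but not of rats."
print(swap_words(text, [("dogs", "cats"), ("mice", "rats")]))
# prints: I like cats but I don't like dogs. I'm afraid of rats but not of mice.
```

Because every match is looked up in the original text and written straight to the result, a replacement is never scanned again.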
A naive solution that avoids any unexpected outcomes would be to replace each string with a temporary string, and then replace the temporary strings with the final strings. This assumes however, that you can form a string which is known not to be in the text, e.g.
text = text.replace("dogs", "{]1[}")
.replace("cats", "{]2[}")
.replace("mice", "{]3[}")
.replace("rats", "{]4[}")
.replace("{]2[}", "dogs")
.replace("{]1[}", "cats")
.replace("{]4[}", "mice")
.replace("{]3[}", "rats")
I am admittedly not very familiar with regex, so my idea is to create an array then loop through the elements to see if it should be replaced. First split() the sentence into an array of words:
String text = "This is my opinion about dogs and cats: I like dogs but I don't like cats.";
String[] sentence = text.split("[^a-zA-Z]"); //can't avoid regex here
Then use a for loop which contains a series of if statements to replace words:
for (int i = 0; i < sentence.length; i++) {
    if (sentence[i].equals("cats")) {
        sentence[i] = "dogs";
    }
    //more similar else-if branches (else-if, so a word is swapped only once)
}
Now sentence[] contains the new sentence with words. Some regex magic should allow you to also keep punctuation marks. I hope this helps, and please let me know if anything could be improved.
I have a document and a query term. I want to:
1. Find the query term in the document.
2. Pad each occurrence of the query term with a certain text marker.
For example
Text: I solemnly swear that I am upto no good.
Query: swear
Output: I solemnly MATCHSTART swear MATCHEND that I am upto no good.
Assuming that I have multiple query words and a large document, how can I do this efficiently?
I did go over various links on the internet but couldn't find anything very conclusive or definite. Moreover, this is just a programming question and has nothing to do with search engine development or information retrieval.
Any help would be appreciated. Thanks.
If each of your queries is a single word (a substring that contains no space, tab, newline, etc.), and you can tolerate a very low probability of false positives (marking a word that is not in the query set), you can use a Bloom filter: http://en.wikipedia.org/wiki/Bloom_filter
First, load your query words into the Bloom filter, then scan the document and look up each word in the filter. If the lookup is positive, mark the word.
You can use my implementation of bloom filter: http://olegh.cc.st/src/bloom.c.txt
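For readers who want to see the idea without the C source, here is a small Python sketch of the same approach. It is not the linked implementation: the class name BloomFilter, the helper mark_matches, the bit-array size and the use of salted SHA-256 hashes are all assumptions chosen for brevity.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, word):
        # Derive several bit positions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))

def mark_matches(text, bloom):
    """Wrap every whitespace-separated token the filter reports as present."""
    out = []
    for token in text.split():
        if token in bloom:
            out.append(f"MATCHSTART {token} MATCHEND")
        else:
            out.append(token)
    return " ".join(out)

bloom = BloomFilter()
for q in ["swear", "good"]:
    bloom.add(q)
print(mark_matches("I solemnly swear that I am upto no good", bloom))
```

Tokens are compared exactly as split on whitespace, so a word with trailing punctuation such as "good." would need to be normalised before the lookup.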
In Python:
text = "I solemnly swear that I am up to no good."  # read in however you like
query = input("Query: ")
text = text.replace(" " + query + " ", " MATCHSTART " + query + " MATCHEND ")
print(text)
OUTPUT:
I solemnly MATCHSTART swear MATCHEND that I am up to no good.
You could also use regex, but that's slower, so I just used string concat to add whitespace to the beginning and end of the word (so as not to match "swears" or "swearing" or "sportswear"). This is easily translatable to whatever language you prefer.
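If words can sit next to punctuation or at the very start or end of the text, the regex route mentioned above is one way to handle it, and it also covers several query words in one pass. The following is only a sketch; the function name mark_terms is invented for the example, and it assumes a non-empty list of queries.

```python
import re

def mark_terms(text, queries):
    """Wrap whole-word occurrences of any query term in marker text."""
    # Longer terms first, so a shorter term never wins over a longer one.
    alternatives = sorted(set(queries), key=len, reverse=True)
    # \b keeps "swear" from matching inside "swears" or "sportswear".
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(q) for q in alternatives) + r")\b"
    )
    return pattern.sub(lambda m: f"MATCHSTART {m.group(0)} MATCHEND", text)

print(mark_terms("I solemnly swear that I am up to no good.", ["swear", "good"]))
# prints: I solemnly MATCHSTART swear MATCHEND that I am up to no MATCHSTART good MATCHEND.
```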
I'm building an application that returns results based on a movie input from a user. If the user messes up and forgets to space out the title of the movie is there a way I can still take the input and return the correct data? For example "outofsight" will still be interpreted as "out of sight".
There is no regex that can do this in a good and reliable way. You could try a search server like Solr.
Alternatively, you could do auto-complete in the GUI (if you have one) on the input of the user, and this way mitigate some of the common errors users can end up doing.
Example:
User wants to search for "outofsight"
Starts typing "out"
Sees "out of sight" as suggestion
Selects "out of sight" from suggestions
????
PROFIT!!!
There's no regex that can tell you where the word breaks were supposed to be. For example, if the input is "offlight", is it supposed to return "Off Light" or "Of Flight"?
This is impossible without a dictionary and some kind of fuzzy-search algorithm. For the latter see "How can I do fuzzy substring matching in Ruby?".
You could take a string and put \s* in between each character.
So outofsight would be converted to:
o\s*u\s*t\s*o\s*f\s*s\s*i\s*g\s*h\s*t
... and match out of sight.
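A quick Python sketch of that conversion, assuming the candidate titles are available as a plain list (the titles and the function name spaced_pattern are made up for illustration):

```python
import re

def spaced_pattern(squashed):
    """Build a regex that allows optional whitespace between every character."""
    chars = [re.escape(c) for c in squashed if not c.isspace()]
    return re.compile(r"\s*".join(chars), re.IGNORECASE)

titles = ["Out of Sight", "Outbreak", "Off Light"]
pattern = spaced_pattern("outofsight")
# fullmatch requires the whole title to fit the pattern, not just a part of it.
matches = [t for t in titles if pattern.fullmatch(t)]
print(matches)
# prints: ['Out of Sight']
```

This only recovers missing spaces; it does nothing for misspellings, and every title still has to be tested against the pattern.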
You can't do this with regular expressions, unless you want to store one or more patterns to match for each movie record. That would be silly.
A better approach for catching minor misspellings would be to calculate Levenshtein distances between what the user is typing and your movie titles. However, when your list of movies is large, this will become a rather slow operation, so you're better off using a dedicated search engine like Lucene/Solr that excels at this sort of thing.
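To make the Levenshtein suggestion concrete, here is a small sketch of the standard dynamic-programming distance plus a naive "closest title" lookup. The function names and the sample titles are assumptions for the example, and the linear scan over all titles is exactly the part that becomes slow for a large catalogue.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

def closest_title(query, titles):
    """Return the title with the smallest edit distance to the query."""
    return min(titles, key=lambda t: levenshtein(query.lower(), t.lower()))

print(levenshtein("outofsight", "out of sight"))
# prints: 2
print(closest_title("outofsight", ["Out of Sight", "Outbreak", "Insight"]))
# prints: Out of Sight
```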
I'm very new to programming and am a beginner in Ruby. I've done a lot of searching to try to find the answers I need, but nothing seems to match what I'm looking for.
I need to make a program for work that will:
1. Get keywords from the user
2. Match those keywords with the same keywords in a database of sentences, and then
3. Spit out randomized sentences that:
   - contain all the keywords 1 time
   - do NOT contain keywords not listed
   - do NOT duplicate keywords
Important to know: Sentences all have a mix of several keywords, NOT one per sentence
1 & 2 are OK, I've been able to do those. My problem is with part 3. I've tried long lists of "if include?" parameters, but it never ends up working and I know there must be a better way to do this.
My grasp of Ruby (and programming generally) is basic and I don't really know what it can and can't do, so any tips or hints in what functions would be useful would be very very much appreciated.
If the match is found, why don't you consecutively pop it out of your array/db? It will ensure no duplication, since that record would not be present to be matched later. No?
Consider this snippet:
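db = [%q(It is hot today), %q(It is going to rain), %q(Where are you, sonny?), %q(sentence contains is and are)]
keyw = %w(is am are)
de = []  # indexes of the sentences matched by the current keyword
keyw.each do |word|
  for index in 0...db.length
    if db[index].include?(word)  # substring match
      puts "Matched #{word} with #{db[index]}"
      de << index
    end
  end
  # Delete the matched sentences, highest index first, so they cannot match a later keyword
  until de.empty?
    db.delete_at(de.pop)
  end
end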
db is an example database and keyw contains the keywords.
Corresponding output:
Matched is with It is hot today
Matched is with It is going to rain
Matched is with sentence contains is and are
Matched are with Where are you, sonny?
No duplication. :)
I am working on implementing an autocompletion script in JavaScript. However, some of the names are two-word names with a space in the middle. What kind of algorithm can you use to deal with that? I am using a trie to store the names.
The only solutions I could come up with were just saying that two word names cannot be used (either run them together or put a dash in the middle). The other idea was to create a list of these kind of names and have a separate loop to check the input. The other and possibly best idea I have is to redesign it slightly and have categories for first and last names and then an extra name category. I was wondering if there was a better solution out there?
Edit: I realized I wasn't very clear on what I was asking. My problem isn't adding two word phrases to the trie, but returning them when someone is typing in a name. In the trie I split the first and last names so you can search by either. So if someone types in the first name and then a space, how would I tell if they are typing in the rest of the first name or if they are now typing in the last name.
Why not have the trie also include the names with spaces?
Once you have a list of candidates, split each of them on the space and show the first token...
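One possible reading of this suggestion, sketched in Python rather than JavaScript: store each full name (spaces included) in the trie, and also store every suffix that starts at a word boundary, so that typing a first name, a last name, or "first name + space + start of the next word" all lead to the same full name. The class and function names here are invented for the example, and keeping a set of names on every node is a simplification that costs memory.

```python
class Trie:
    def __init__(self):
        self.children = {}
        self.names = set()  # full names reachable through this node

    def insert(self, key, full_name):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
            node.names.add(full_name)

    def complete(self, prefix):
        node = self
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        return sorted(node.names)

def index_name(trie, full_name):
    """Index the name under every suffix that begins at a word boundary."""
    words = full_name.lower().split()
    for i in range(len(words)):
        trie.insert(" ".join(words[i:]), full_name)

trie = Trie()
for name in ["Mary Ann Smith", "Mary Jones", "Ann Lee"]:
    index_name(trie, name)

print(trie.complete("mary "))   # prints: ['Mary Ann Smith', 'Mary Jones']
print(trie.complete("mary a"))  # prints: ['Mary Ann Smith']
print(trie.complete("ann"))     # prints: ['Ann Lee', 'Mary Ann Smith']
```

With this layout the space is just another character in the key, so there is no need to decide whether the user has moved on to the last name.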
Is there a reason you are rolling your own autocomplete script, instead of using a currently existing one, such as YUI autocomplete? (i.e. are you doing it just for fun?, etc.)
If you have a way to parse the two-word names, then just include spaces in your trie. But if you cannot determine what is a two-word name and what is two separate words, and your trie cannot be large enough to hold all two-word sequences, then you have a problem.
One simple way to solve this is to default to allowing two-word pairs, but if you have too much branching after the space, throw away that entire branch. This way, when the first word is predictive for the second, you'll get autocompletion, but when it could be any of a huge number of things, your trie will end at the end of a single word.
If you are using a multiline editor, I guess the best choice is for each autocomplete item to be a single word. So the first name, middle name and last name must be parsed and each added as a lookup item.
For a (one-line) text box, you can include whitespace (and the firstname + space + middlename + space + lastname pattern) in the search criteria.