Search for a list of words in a paragraph - algorithm

I have a paragraph written in English.
I have a list of words.
I want to check whether the paragraph contains any one of the words in the list.
What is the best algorithm to do so?
Presently, I have the following, but it seems very naive:
private boolean findMatch(List<String> list, String param, ArrayList<String> skipChars) {
    for (String s : list) {
        // Ignore words that appear in the skip list.
        if (skipChars != null && skipChars.contains(s)) {
            continue;
        }
        // Substring test: this also matches inside longer words.
        if (param.contains(s)) {
            return true;
        }
    }
    return false;
}

Split the paragraph into words and store them in a hash table.
Now, for each word in your list, search for it in the hash table.
For real-life applications this will probably do.
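The split-and-hash idea above can be sketched in Java as follows. This is only a minimal sketch: the class name WordSetSearch, the method name containsAny, the choice to split on runs of non-letter, non-digit characters, and the case-insensitive comparison are all assumptions, not part of the original suggestion.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class WordSetSearch {
    // Returns true if any entry of `words` occurs as a whole word in `paragraph`.
    static boolean containsAny(String paragraph, List<String> words) {
        // Split on runs of characters that are neither letters nor digits,
        // and lower-case everything so the comparison ignores case.
        Set<String> seen = new HashSet<>(Arrays.asList(
                paragraph.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")));
        for (String w : words) {
            if (seen.contains(w.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog.";
        System.out.println(containsAny(text, Arrays.asList("cat", "fox")));
        System.out.println(containsAny(text, Arrays.asList("cat", "own")));
    }
}
```

Unlike the indexOf version in the question, this matches whole words only, so "own" does not match "brown". Building the set costs one pass over the paragraph, and each lookup is then expected constant time.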
--EDIT--
If you cannot split the paragraph into words, and you only need to tell whether at least one word from the list occurs in the paragraph, I suggest constructing a trie from your list of words, then going over the paragraph and checking the trie for matches as you go.
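A rough Java sketch of the trie variant, under stated assumptions: the names TrieSearch, Node and matchesAnywhere are hypothetical, and matching uses substring semantics (like indexOf in the question) because no word boundaries are assumed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TrieSearch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean terminal; // true if a word from the list ends at this node
    }

    private final Node root = new Node();

    TrieSearch(List<String> words) {
        for (String w : words) {
            if (w.isEmpty()) {
                continue; // an empty word would trivially match everywhere
            }
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            n.terminal = true;
        }
    }

    // Returns true if any word from the list occurs anywhere in `text`.
    boolean matchesAnywhere(String text) {
        for (int start = 0; start < text.length(); start++) {
            Node n = root;
            // Walk the trie from this start position until it dead-ends.
            for (int i = start; i < text.length(); i++) {
                n = n.next.get(text.charAt(i));
                if (n == null) {
                    break;
                }
                if (n.terminal) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

This restarts the walk at every position, so the worst case is roughly the paragraph length times the longest word length. The Aho-Corasick algorithm extends the same trie with failure links to avoid the restarts, at the cost of more code.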

In C# I usually use LINQ to Objects to query a list and get the result.
This is my code:
private bool findMatch(List<String> list, String param, List<String> skipChars)
{
    if (skipChars == null)
        skipChars = new List<string>();
    // Any() stops at the first match instead of counting all of them.
    return list.Except(skipChars).Any(l => param.IndexOf(l) != -1);
}

Related

The probability distribution of two words in a file using java 8

I need the number of lines that contain two words. For this purpose I have written the following code:
The input file contains 1000 lines and about 4,000 words, and it takes about 4 hours.
Is there a library in Java that can do it faster?
Can I implement this code using Apache Lucene or Stanford CoreNLP to achieve a shorter run time?
ArrayList<String> reviews = new ArrayList<String>();
ArrayList<String> terms = new ArrayList<String>();
Map<String,Double> pij = new HashMap<String,Double>();
BufferedReader br = null;
FileReader fr = null;
try
{
    fr = new FileReader("src/reviews-preprocessing.txt");
    br = new BufferedReader(fr);
    String line;
    while ((line= br.readLine()) != null)
    {
        for(String term : line.split(" "))
        {
            if(!terms.contains(term))
                terms.add(term);
        }
        reviews.add(line);
    }
}
catch (IOException e) { e.printStackTrace();}
finally
{
    try
    {
        if (br != null)
            br.close();
        if (fr != null)
            fr.close();
    }
    catch (IOException ex) { ex.printStackTrace();}
}
long Count = reviews.size();
for(String term_i : terms)
{
    for(String term_j : terms)
    {
        if(!term_i.equals(term_j))
        {
            double p = (double) reviews.parallelStream().filter(s -> s.contains(term_i) && s.contains(term_j)).count();
            String key = String.format("%s_%s", term_i,term_j);
            pij.put(key, p/Count);
        }
    }
}
Your first loop, which collects the distinct words, relies on ArrayList.contains, which has linear time complexity, instead of using a Set. So if we assume n_d distinct words, it already has a time complexity of “number of lines”×n_d.
Then, you are creating n_d×n_d word combinations and probing all 1,000 lines for the presence of each combination. In other words, if we only assume 100 distinct words, you are performing 1,000×100 + 100×100×1,000 = 10,100,000 operations; if we assume 500 distinct words, we’re talking about 250,500,000 already.
Instead, you should just create the combinations actually existing in a line and collect them into the map. This will only process those combinations that actually occur, and you may improve this further by only keeping one of each “a_b”/“b_a” pair, as the probability of both is identical. Then, you are only performing “number of lines”דwords per line”דwords per line” operations, in other words, roughly 16,000 operations in your case (1,000 lines with about 4 words each).
The following method combines all words of a line, keeping only one of each “a_b”/“b_a” pair, and eliminates duplicates so that each combination is counted at most once per line.
static Stream<String> allCombinations(String line) {
    String[] words = line.split(" ");
    return Arrays.stream(words)
        .flatMap(word1 ->
            Arrays.stream(words)
                .filter(word2 -> word1.compareTo(word2)<0)
                .map(word2 -> word1+'_'+word2))
        .distinct();
}
This method can be used like this:
List<String> lines = Files.readAllLines(Paths.get("src/reviews-preprocessing.txt"));
double ratio = 1.0/lines.size();
Map<String, Double> pij = lines.stream()
    .flatMap(line -> allCombinations(line))
    .collect(Collectors.groupingBy(Function.identity(),
        Collectors.summingDouble(x->ratio)));
It ran through my copy of “War and Peace” within a few seconds, without needing any attempt at parallel processing. Not surprisingly, “and_the” was the combination with the highest probability.
You may consider changing the line
String[] words = line.split(" ");
to
String[] words = line.toLowerCase().split("\\W+");
to generalize the code to work with different input, handling multiple spaces or other punctuation characters and ignoring the case.
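To see the whole approach on a tiny input without a file, here is a self-contained sketch. The class name PairProbability, the method name pairProbabilities and the four sample lines are made up for illustration; the logic follows the snippets above, with the lower-casing split applied and empty tokens filtered out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PairProbability {
    // All distinct "a_b" pairs (a < b lexicographically) occurring in one line.
    static Stream<String> allCombinations(String line) {
        String[] words = line.toLowerCase().split("\\W+");
        return Arrays.stream(words)
                .filter(w -> !w.isEmpty()) // a leading delimiter yields an empty token
                .flatMap(w1 -> Arrays.stream(words)
                        .filter(w2 -> w1.compareTo(w2) < 0)
                        .map(w2 -> w1 + '_' + w2))
                .distinct();
    }

    // Fraction of lines that contain both words of each pair.
    static Map<String, Double> pairProbabilities(List<String> lines) {
        double ratio = 1.0 / lines.size();
        return lines.stream()
                .flatMap(PairProbability::allCombinations)
                .collect(Collectors.groupingBy(Function.identity(),
                        Collectors.summingDouble(x -> ratio)));
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the review file.
        List<String> lines = Arrays.asList(
                "good cheap phone", "cheap phone", "good battery", "bad screen");
        pairProbabilities(lines).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

With these four lines, "cheap_phone" occurs in two of them and therefore gets probability 0.5, while every other pair occurs once and gets 0.25. The reversed key "phone_cheap" is never produced.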

Most efficient way to determine if there are any differences between specific properties of 2 lists of items?

In C# .NET 4.0, I am struggling to come up with the most efficient way to determine if the contents of 2 lists of items contain any differences.
I don't need to know what the differences are, just true/false whether the lists are different based on my criteria.
The 2 lists I am trying to compare contain FileInfo objects, and I want to compare only the FileInfo.Name and FileInfo.LastWriteTimeUtc properties of each item. All the FileInfo items are for files located in the same directory, so the FileInfo.Name values will be unique.
To summarize, I am looking for a single Boolean result for the following criteria:
Does ListA contain any items with FileInfo.Name not in ListB?
Does ListB contain any items with FileInfo.Name not in ListA?
For items with the same FileInfo.Name in both lists, are the FileInfo.LastWriteTimeUtc values different?
Thank you,
Kyle
I would use a custom IEqualityComparer<FileInfo> for this task:
public class FileNameAndLastWriteTimeUtcComparer : IEqualityComparer<FileInfo>
{
    public bool Equals(FileInfo x, FileInfo y)
    {
        if(Object.ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.FullName.Equals(y.FullName) && x.LastWriteTimeUtc.Equals(y.LastWriteTimeUtc);
    }
    public int GetHashCode(FileInfo fi)
    {
        unchecked // Overflow is fine, just wrap
        {
            int hash = 17;
            hash = hash * 23 + fi.FullName.GetHashCode();
            hash = hash * 23 + fi.LastWriteTimeUtc.GetHashCode();
            return hash;
        }
    }
}
Now you can use a HashSet<FileInfo> with this comparer and HashSet<T>.SetEquals:
var comparer = new FileNameAndLastWriteTimeUtcComparer();
var uniqueFiles1 = new HashSet<FileInfo>(list1, comparer);
bool anyDifferences = !uniqueFiles1.SetEquals(list2);
Note that I've used FileInfo.FullName instead of Name, since names alone are not unique in general (they are only unique within a single directory, as in your case).
Sidenote: another advantage is that you can use this comparer for many LINQ methods like GroupBy, Except, Intersect or Distinct.
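The underlying idea (reduce each file to a name plus timestamp key, then test the two key sets for equality) is not specific to C#. Purely as an illustration, and in Java rather than C#, a sketch could look like this. The class FileListDiff, the helper file, and the use of Map.Entry as a stand-in for FileInfo are all assumptions made for the example.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FileListDiff {
    // Hypothetical stand-in for FileInfo: file name plus last-write time in UTC millis.
    static Map.Entry<String, Long> file(String name, long lastWriteUtcMillis) {
        return new SimpleImmutableEntry<>(name, lastWriteUtcMillis);
    }

    // True if the two lists differ in any name or in any timestamp for a shared name.
    static boolean anyDifferences(List<Map.Entry<String, Long>> a,
                                  List<Map.Entry<String, Long>> b) {
        // Entries compare by key and value, so set equality covers all three criteria.
        Set<Map.Entry<String, Long>> setA = new HashSet<>(a);
        Set<Map.Entry<String, Long>> setB = new HashSet<>(b);
        return !setA.equals(setB);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> listA = Arrays.asList(file("a.txt", 1L), file("b.txt", 2L));
        List<Map.Entry<String, Long>> listB = Arrays.asList(file("b.txt", 2L), file("a.txt", 9L));
        System.out.println(anyDifferences(listA, listB));
    }
}
```

Set equality is order-independent, and an item missing on either side or carrying a different timestamp makes the sets unequal, which matches the three criteria in the question.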
This is not the most efficient way (probably ranks a 4 out of 5 in the quick-and-dirty category):
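// Project only the properties to compare. Carrying the FileInfo itself along
// would make every pair differ, since FileInfo compares by reference.
var comparableListA = ListA.Select(a =>
    new { Name = a.Name, LastWrite = a.LastWriteTimeUtc });
var comparableListB = ListB.Select(b =>
    new { Name = b.Name, LastWrite = b.LastWriteTimeUtc });
// Except is one-directional, so check both directions.
var youHaveDiff = comparableListA.Except(comparableListB).Any()
    || comparableListB.Except(comparableListA).Any();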
Explanation:
Anonymous classes are compared by property values, which is what you're looking to do, which led to my thinking of doing a LINQ projection along those lines.
P.S.
You should double check the syntax, I just rattled this off without the compiler.

Item-by-item list comparison, updating each item with its result (no third list)

The solutions I have found so far in my research on comparing lists of objects have usually generated a new list of objects, say of those items existing in one list, but not in the other. In my case, I want to compare two lists to discover the items whose key exists in one list and not the other (comparing both ways), and for those keys found in both lists, checking whether the value is the same or different.
The object being compared has multiple properties that constitute the key, plus a property that constitutes the value, and finally, an enum property that describes the result of the comparison, e.g., {Equal, NotEqual, NoMatch, NotYetCompared}. So my object might look like:
class MyObject
{
    //Key combination
    string columnA;
    string columnB;
    decimal columnC;
    //The Value
    decimal columnD;
    //Enum for comparison, used for styling the item (value hidden from UI)
    //Alternatively...this could be a string type, holding the enum.ToString()
    MyComparisonEnum result;
}
These objects are collected into two ObservableCollection<MyObject> instances to be compared. Both lists are presented in the UI in data grids, with the rows styled based on the comparison result enum, so the user can easily see which keys are in the new dataset but not in the old, and vice versa, along with those keys present in both datasets with a different value.
Would LINQ be suitable as a tool to solve this efficiently, or should I use loops to scan the lists and break out when the key is found, etc. (a solution like this comes naturally to me from my procedural programming background)... or some other method?
Thank you!
You can use Except and Intersect:
var list1 = new List<MyObject>();
var list2 = new List<MyObject>();
// initialization code
var notIn2 = list1.Except(list2);
var notIn1 = list2.Except(list1);
var both = list1.Intersect(list2);
To find objects with different values (ColumnD) you can use this (quite efficient) Linq query:
var diffValue = from o1 in list1
                join o2 in list2
                on new { o1.columnA, o1.columnB, o1.columnC } equals new { o2.columnA, o2.columnB, o2.columnC }
                where o1.columnD != o2.columnD
                select new { Object1 = o1, Object2 = o2 };
foreach (var diff in diffValue)
{
    MyObject obj1 = diff.Object1;
    MyObject obj2 = diff.Object2;
    Console.WriteLine("Obj1-Value:{0} Obj2-Value:{1}", obj1.columnD, obj2.columnD);
}
Except and Intersect compare by key as intended once you override Equals and GetHashCode appropriately:
class MyObject
{
    //Key combination
    string columnA;
    string columnB;
    decimal columnC;
    //The Value
    decimal columnD;
    //Enum for comparison, used for styling the item (value hidden from UI)
    //Alternatively...this could be a string type, holding the enum.ToString()
    MyComparisonEnum result;
    public override bool Equals(object obj)
    {
        if (obj == null || !(obj is MyObject)) return false;
        MyObject other = (MyObject)obj;
        // The static string.Equals is null-safe; a null key part is treated as
        // equal only to another null, while GetHashCode maps null to "".
        return string.Equals(columnA, other.columnA) && string.Equals(columnB, other.columnB) && columnC == other.columnC;
    }
    public override int GetHashCode()
    {
        int hash = 19;
        hash = hash + (columnA ?? "").GetHashCode();
        hash = hash + (columnB ?? "").GetHashCode();
        hash = hash + columnC.GetHashCode();
        return hash;
    }
}
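The answer above produces new sequences, whereas the question asks for each item to be updated in place. One way to do that, sketched here in Java and not taken from the answer, is to index one list by its composite key and then make a single pass over the other. The names InPlaceCompare, Item and Result are hypothetical, the numeric columns are simplified to long, and keys are assumed to be unique within each list.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InPlaceCompare {
    enum Result { EQUAL, NOT_EQUAL, NO_MATCH, NOT_YET_COMPARED }

    static final class Item {
        final String columnA;
        final String columnB;
        final long columnC;   // last part of the key (simplified numeric type)
        final long columnD;   // the value
        Result result = Result.NOT_YET_COMPARED;

        Item(String a, String b, long c, long d) {
            columnA = a; columnB = b; columnC = c; columnD = d;
        }

        // Composite key; lists compare element by element.
        List<Object> key() {
            return Arrays.<Object>asList(columnA, columnB, columnC);
        }
    }

    // Sets `result` on every item of both lists; no third list of items is built.
    static void compare(List<Item> oldItems, List<Item> newItems) {
        Map<List<Object>, Item> oldByKey = new HashMap<>();
        for (Item o : oldItems) {
            o.result = Result.NO_MATCH; // stays NO_MATCH unless a partner turns up
            oldByKey.put(o.key(), o);
        }
        for (Item n : newItems) {
            Item o = oldByKey.get(n.key());
            if (o == null) {
                n.result = Result.NO_MATCH;
            } else {
                Result r = (o.columnD == n.columnD) ? Result.EQUAL : Result.NOT_EQUAL;
                o.result = r;
                n.result = r;
            }
        }
    }
}
```

The map lookup replaces the inner scan-and-break loop, so the work is proportional to the combined size of the two lists instead of their product.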

Linq query to find non-numeric items in list?

Suppose I have the following list:
var strings = new List<string>();
strings.Add("1");
strings.Add("12.456");
strings.Add("Foobar");
strings.Add("0.56");
strings.Add("zero");
Is there some sort of query I can write in Linq that will return to me only the numeric items, i.e. the 1st, 2nd, and 4th items from the list?
-R.
strings.Where(s => { double ignored; return double.TryParse(s, out ignored); })
This returns, still as strings, all the items that can be parsed as doubles. If you want them as numbers (which makes more sense), you could write an extension method:
public static IEnumerable<double> GetDoubles(this IEnumerable<string> strings)
{
    foreach (var s in strings)
    {
        double result;
        if (double.TryParse(s, out result))
            yield return result;
    }
}
Don't forget that double.TryParse() uses your current culture, so it will give different results on different computers. If you don't want that, use double.TryParse(s, NumberStyles.AllowDecimalPoint, CultureInfo.InvariantCulture, out result).
Try this:
double dummy = 0;
var strings = new List<string>();
strings.Add("1");
strings.Add("12.456");
strings.Add("Foobar");
strings.Add("0.56");
strings.Add("zero");
var numbers = strings.Where(a=>double.TryParse(a, out dummy));
You could use a simple predicate to examine each string, like so:
var strings = new List<string>();
strings.Add("1");
strings.Add("12.456");
strings.Add("Foobar");
strings.Add("0.56");
strings.Add("zero");
var nums = strings.Where( s => s.ToCharArray().All( n => Char.IsNumber( n ) || n == '.' ) );

Multidimensional data lookup

I have a collection of tuples of N values. A value may be a wildcard (matches any value), or a concrete value. What would be the best way to lookup all tuples in the collection matching a specific tuple without scanning the entire collection and testing items one by one?
E.g. 1.2.3 matches 1.*.3 and *.*.3, but not 1.2.4 or *.2.4.
What data structure am I looking for here?
I'd use a trie to implement this. Here's how I would construct the trie:
The data structure would look like:
Trie{
    Integer value
    Map<Integer, Trie> tries
}
To insert:
insert(tuple, trie){
    curTrie = trie
    foreach(number in tuple){
        nextTrie = curTrie.getTrie(number)
        //add the number to the trie if it isn't in there
        if(nextTrie == null){
            nextTrie = new Trie(number)
            curTrie.setTrie(number, nextTrie)
        }
        curTrie = nextTrie
    }
}
To get all the tuples:
getTuples(tuple, trie){
    //base case: the whole tuple has been consumed
    if(tuple is empty)
        return { empty tuple }
    allTuples = {}
    if(head(tuple) == "*"){
        //wildcard: follow every child
        forEach((number, subTrie) in trie.tries){
            forEach(partialTuple in getTuples(restOf(tuple), subTrie)){
                allTuples.add(number + partialTuple)
            }
        }
        return allTuples
    }
    subTrie = trie.getTrie(head(tuple))
    if(subTrie == null)
        return allTuples //no stored tuple matches this prefix
    forEach(partialTuple in getTuples(restOf(tuple), subTrie)){
        allTuples.add(head(tuple) + partialTuple)
    }
    return allTuples
}
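The pseudocode above treats the wildcard as part of the query, while the example in the question (1.2.3 matching 1.*.3) also has wildcards in the stored tuples. A Java sketch that handles both sides might look like the following. The class name WildcardTrie, the ANY constant, and the use of strings for tuple components are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WildcardTrie {
    static final String ANY = "*";

    private final Map<String, WildcardTrie> children = new HashMap<>();
    private boolean terminal; // true if a stored tuple ends at this node

    void insert(List<String> tuple) {
        WildcardTrie node = this;
        for (String part : tuple) {
            node = node.children.computeIfAbsent(part, k -> new WildcardTrie());
        }
        node.terminal = true;
    }

    // Returns every stored tuple that matches `query`, wildcards allowed on both sides.
    List<List<String>> lookup(List<String> query) {
        List<List<String>> out = new ArrayList<>();
        collect(query, 0, new ArrayList<>(), out);
        return out;
    }

    private void collect(List<String> query, int depth,
                         List<String> path, List<List<String>> out) {
        if (depth == query.size()) {
            if (terminal) {
                out.add(new ArrayList<>(path));
            }
            return;
        }
        String part = query.get(depth);
        if (part.equals(ANY)) {
            // A wildcard in the query matches every stored value at this level.
            for (Map.Entry<String, WildcardTrie> e : children.entrySet()) {
                descend(e.getKey(), e.getValue(), query, depth, path, out);
            }
        } else {
            // A concrete value matches itself and a stored wildcard.
            descend(part, children.get(part), query, depth, path, out);
            descend(ANY, children.get(ANY), query, depth, path, out);
        }
    }

    private void descend(String key, WildcardTrie child, List<String> query, int depth,
                         List<String> path, List<List<String>> out) {
        if (child == null) {
            return;
        }
        path.add(key);
        child.collect(query, depth + 1, path, out);
        path.remove(path.size() - 1); // backtrack
    }
}
```

For a fully concrete query, each level follows at most two branches (the exact value and the stored wildcard), so the lookup visits at most 2^N nodes for tuples of length N regardless of how many tuples are stored.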

Resources