Sorting a large number of strings and grouping - algorithm

I'm trying to take a set of unique strings and split it into disjoint groups by the following criterion: if two strings share a value in one or more columns, they belong to the same group.
For example
111;123;222
200;123;100
300;;100
All of them belong to one group, because of the overlaps:
the first string matches the second on the value "123"
the second string matches the third on the value "100"
After getting these groups, I need to save them to a text file.
I have a 60 MB file of such strings to process (time limit: 30 seconds).
My first thought is to split the strings into columns and then look for matches, but I'm not sure that is the best approach.
Please help me find a solution.
For now I have this code; it runs in about 2.5-3 seconds:
// reading from the file
File file = new File(path);
InputStream inputFS = new FileInputStream(file);
BufferedReader br = new BufferedReader(new InputStreamReader(inputFS));
List<String> inputList = br.lines().collect(Collectors.toList());
br.close();

List<String> firstValues = new ArrayList<>();
List<String> secondValues = new ArrayList<>();
List<String> thirdValues = new ArrayList<>();

// extracting distinct values and splitting
final String qq = "\"";
inputList.stream()
        .map(s -> s.split(";"))
        .forEach(strings -> {
            firstValues.add(strings.length > 0 ? strings[0].replaceAll(qq, "") : null);
            secondValues.add(strings.length > 1 ? strings[1].replaceAll(qq, "") : null);
            thirdValues.add(strings.length > 2 ? strings[2].replaceAll(qq, "") : null);
        });
// todo: add to maps by the row and then find groups
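For the grouping step left in the todo, a disjoint-set (union-find) structure is a common fit: index the rows, union each row with the first row that used the same value in the same column (matching the example, where "123" and "100" link rows through identical columns), and read the groups off the set representatives. A minimal sketch, not the asker's code; all names are illustrative:

```java
import java.util.*;

public class RowGrouper {
    // Union-find with path halving.
    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    static void union(int[] parent, int a, int b) {
        parent[find(parent, a)] = find(parent, b);
    }

    // Groups row indices: rows sharing a value in the same column end up together.
    static Map<Integer, List<Integer>> group(List<String> rows) {
        int[] parent = new int[rows.size()];
        for (int i = 0; i < parent.length; i++) parent[i] = i;

        Map<String, Integer> firstSeen = new HashMap<>(); // "column:value" -> first row index
        for (int i = 0; i < rows.size(); i++) {
            String[] cols = rows.get(i).split(";");
            for (int c = 0; c < cols.length; c++) {
                if (cols[c].isEmpty()) continue;          // empty cells match nothing
                Integer j = firstSeen.putIfAbsent(c + ":" + cols[c], i);
                if (j != null) union(parent, i, j);
            }
        }
        Map<Integer, List<Integer>> groups = new HashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            groups.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("111;123;222", "200;123;100", "300;;100", "900;901;902");
        System.out.println(group(rows).values()); // rows 0-2 form one group, row 3 its own
    }
}
```

Because each value is looked up once in a hash map and the union-find operations are near constant time, this pass is linear in the input size, which matters for the 60 MB / 30 s budget.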

Related

Filter multiple values from arrays

I have old Java code like the below:

String str1 = "aaa,bbb,ccc,ddd,eee,fff,ggg";
String str2 = "111,222,333,444,555,666,777";
// Start
String[] str1Array = str1.split(",", -1);
String[] str2Array = str2.split(",", -1);
HashMap<String, String> someMap = new HashMap<String, String>();
if (str1Array.length == str2Array.length) {
    for (int i = 0; i < str1Array.length; i++) {
        if (str1Array[i].equalsIgnoreCase("bbb") || str1Array[i].equalsIgnoreCase("ddd") || str1Array[i].equalsIgnoreCase("ggg")) {
            System.out.println("Processing:" + str1Array[i] + "-" + str2Array[i]);
            someMap.put(str1Array[i], str2Array[i]);
        }
    }
}
for (Map.Entry<String, String> entry : someMap.entrySet()) {
    System.out.println(someProcessMethod1(entry.getKey()) + " : " + someProcessMethod2(entry.getValue()));
}
// Finish
I am trying to use Java 8 (I am still learning) to filter and then process the filtered values. I need some pointers here.
I am starting here (I know it's wrong):

String[] mixAndFilter = (String[]) Stream.concat(Stream.of(str1Array), Stream.of(str2Array)).filter(???????).thenWhatHere()
    .toArray(b -> new String[b]);
for (String b : mixAndFilter) {
    System.out.println(mapkey : mapValue);
}

What should I do here to filter those three strings and use them in my someProcessMethods as is done the old way? If possible, is there a way to get things done from Start to Finish in the Java 8 way? Any pointers/solution/pseudo-code is welcome.
Update (as asked by WJS):
The goal is to:
Filter the bbb, ddd, and ggg strings from the first array, then map them to the corresponding values in the second array, so that the expected output is:
bbb : 222
ddd : 444
ggg : 777
The easiest way to use streams here is to:
split the strings as you are already doing
set up a set of strings to use as a filter list
generate a range of indices for the arrays
if the string at index i is contained in the set, pass that index through the filter
then use the index to populate the map
String str1 = "aaa,bbb,ccc,ddd,eee,fff,ggg";
String str2 = "111,222,333,444,555,666,777";
Set<String> filterList = Set.of("bbb", "ddd", "ggg");
String[] str1Array = str1.split(",", -1);
String[] str2Array = str2.split(",", -1);
Map<String, String> someMap = IntStream
        .range(0, str1Array.length).boxed()
        .filter(i -> filterList.contains(str1Array[i]))
        .collect(Collectors.toMap(i -> str1Array[i],
                                  i -> str2Array[i]));
someMap.entrySet().forEach(System.out::println);
prints
bbb=222
ddd=444
ggg=777
It is possible to make splitting the strings part of the streaming process, but I found that much more cumbersome.
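If the `key=value` form printed by `Map.Entry.toString` is not what you want, a small formatting step at the end recovers the `key : value` output the question asked for. A sketch along the lines of the answer above; `filterPairs` is an illustrative helper name, not part of the original code:

```java
import java.util.*;
import java.util.stream.*;

public class FilterAndFormat {
    // Builds a key -> value map for the keys present in the wanted set.
    static Map<String, String> filterPairs(String keysCsv, String valuesCsv, Set<String> wanted) {
        String[] keys = keysCsv.split(",", -1);
        String[] values = valuesCsv.split(",", -1);
        return IntStream.range(0, Math.min(keys.length, values.length))
                .filter(i -> wanted.contains(keys[i]))
                .boxed()
                .collect(Collectors.toMap(i -> keys[i], i -> values[i]));
    }

    public static void main(String[] args) {
        Map<String, String> someMap = filterPairs(
                "aaa,bbb,ccc,ddd,eee,fff,ggg",
                "111,222,333,444,555,666,777",
                Set.of("bbb", "ddd", "ggg"));

        // "key : value" formatting, matching the question's expected output.
        someMap.entrySet().stream()
                .sorted(Map.Entry.comparingByKey())
                .forEach(e -> System.out.println(e.getKey() + " : " + e.getValue()));
    }
}
```

The sort is only there to make the output order deterministic; a `HashMap` on its own promises no iteration order.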

How to get new user input in a stream while it's running, using Java 8

I need to validate user input and, if it doesn't meet the conditions, replace it with correct input. So far I am stuck on two parts. I'm fairly new to Java 8 and not so familiar with all the libraries, so if you can give me advice on where to read up more on these I would appreciate it.
List<String> input = Arrays.asList(args);
List<String> validatedinput = input.stream()
    .filter(p -> {
        if (p.matches("[0-9, /,]+")) {
            return true;
        }
        System.out.println("The value has to be a positive number and not a character");
        // Does the new input actually get saved here?
        sc.nextLine();
        return false;
    }) // And here I am not really sure how to map the String object
    .map(String::)
    .validatedinput(Collectors.toList());
This type of logic shouldn't be done with streams, a while loop would be a good candidate for it.
First, let's partition the data into two lists, one list representing the valid inputs and the other representing invalid inputs:
Map<Boolean, List<String>> resultSet =
    Arrays.stream(args)
          .collect(Collectors.partitioningBy(s -> s.matches(yourRegex),
                   Collectors.toCollection(ArrayList::new)));
Then create the while loop to ask the user to correct all their invalid inputs:
int i = 0;
List<String> invalidInputs = resultSet.get(false);
final int size = invalidInputs.size();
while (i < size) {
    System.out.println("The value --> " + invalidInputs.get(i) +
                       " has to be a positive number and not a character");
    String temp = sc.nextLine();
    if (temp.matches(yourRegex)) {
        resultSet.get(true).add(temp);
        i++;
    }
}
Now, you can collect the list of all the valid inputs and do what you like with it:
List<String> result = resultSet.get(true);
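Assembled into one runnable sketch. The answer's `yourRegex` placeholder is spelled out here as a simple digits-only pattern, and the Scanner is passed in so the retry loop is testable; both are assumptions, not part of the original answer:

```java
import java.util.*;
import java.util.stream.*;

public class InputValidator {
    // Partition args into valid/invalid, then re-prompt until each invalid entry is replaced.
    static List<String> validate(String[] args, Scanner sc) {
        String regex = "[0-9]+"; // assumed "positive number" rule
        Map<Boolean, List<String>> resultSet = Arrays.stream(args)
                .collect(Collectors.partitioningBy(s -> s.matches(regex),
                         Collectors.toCollection(ArrayList::new)));

        for (String bad : resultSet.get(false)) {
            System.out.println("The value --> " + bad + " has to be a positive number");
            String temp;
            do {
                temp = sc.nextLine(); // keep reading until a valid replacement arrives
            } while (!temp.matches(regex));
            resultSet.get(true).add(temp);
        }
        return resultSet.get(true);
    }

    public static void main(String[] args) {
        // Simulated console input: "abc" is invalid and gets replaced by "7".
        List<String> result = validate(new String[]{"12", "abc"}, new Scanner("7\n"));
        System.out.println(result); // prints [12, 7]
    }
}
```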

The probability distribution of two words in a file using Java 8

I need the number of lines that contain a given pair of words, and for this purpose I have written the code below.
The input file contains 1,000 lines and about 4,000 words, and the code takes about 4 hours to run.
Is there a library in Java that can do this faster?
Could I implement this with Apache Lucene or Stanford CoreNLP to achieve a shorter run time?
ArrayList<String> reviews = new ArrayList<String>();
ArrayList<String> terms = new ArrayList<String>();
Map<String, Double> pij = new HashMap<String, Double>();
BufferedReader br = null;
FileReader fr = null;
try {
    fr = new FileReader("src/reviews-preprocessing.txt");
    br = new BufferedReader(fr);
    String line;
    while ((line = br.readLine()) != null) {
        for (String term : line.split(" ")) {
            if (!terms.contains(term))
                terms.add(term);
        }
        reviews.add(line);
    }
}
catch (IOException e) { e.printStackTrace(); }
finally {
    try {
        if (br != null)
            br.close();
        if (fr != null)
            fr.close();
    }
    catch (IOException ex) { ex.printStackTrace(); }
}
long Count = reviews.size();
for (String term_i : terms) {
    for (String term_j : terms) {
        if (!term_i.equals(term_j)) {
            double p = (double) reviews.parallelStream()
                    .filter(s -> s.contains(term_i) && s.contains(term_j)).count();
            String key = String.format("%s_%s", term_i, term_j);
            pij.put(key, p / Count);
        }
    }
}
Your first loop, which collects the distinct words, relies on ArrayList.contains, which has linear time complexity, instead of using a Set. So if we assume n_d distinct words, it already has a time complexity of (number of lines) × n_d.
Then, you are creating n_d × n_d word combinations and probing all 1,000 lines for the presence of each combination. In other words, if we assume only 100 distinct words, you are performing 1,000×100 + 100×100×1,000 = 10,100,000 operations; if we assume 500 distinct words, we're talking about 250,500,000 already.
Instead, you should only create the combinations that actually exist in a line and collect them into the map. This processes only the combinations that actually occur, and you can improve it further by keeping just one of each "a_b"/"b_a" pair, as the probability of both is identical. Then you are only performing (number of lines) × (words per line) × (words per line) operations, in other words, roughly 16,000 operations in your case.
The following method combines all words of a line, keeping only one of each "a_b"/"b_a" pair, and eliminates duplicates so that each combination is counted at most once per line.
static Stream<String> allCombinations(String line) {
    String[] words = line.split(" ");
    return Arrays.stream(words)
        .flatMap(word1 -> Arrays.stream(words)
            .filter(word2 -> word1.compareTo(word2) < 0)
            .map(word2 -> word1 + '_' + word2))
        .distinct();
}
This method can be used like

List<String> lines = Files.readAllLines(Paths.get("src/reviews-preprocessing.txt"));
double ratio = 1.0 / lines.size();
Map<String, Double> pij = lines.stream()
    .flatMap(line -> allCombinations(line))
    .collect(Collectors.groupingBy(Function.identity(),
                                   Collectors.summingDouble(x -> ratio)));
It ran through my copy of "War and Peace" within a few seconds, without any attempt at parallel processing. Not very surprisingly, "and_the" was the combination with the highest probability.
You may consider changing the line
String[] words = line.split(" ");
to
String[] words = line.toLowerCase().split("\\W+");
to generalize the code to work with different input, handling multiple spaces or other punctuation characters and ignoring case.
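As an aside, the first issue called out above (the linear-time ArrayList.contains scan while collecting distinct words) disappears as soon as a Set is used; a small sketch:

```java
import java.util.*;

public class DistinctWords {
    // Set.add is O(1) amortized, so collecting distinct words is linear overall,
    // unlike ArrayList.contains, which rescans the whole list for every word.
    static Collection<String> distinctTerms(List<String> lines) {
        Set<String> terms = new LinkedHashSet<>(); // keeps first-seen order
        for (String line : lines) {
            terms.addAll(Arrays.asList(line.split(" ")));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(distinctTerms(Arrays.asList("a b a", "b c"))); // prints [a, b, c]
    }
}
```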

Group lines of log-file using Linq

I have an array of strings from a log file with the following format:
var lines = new []
{
    "--------",
    "TimeStamp: 12:45",
    "Message: Message #1",
    "--------",
    "--------",
    "TimeStamp: 12:54",
    "Message: Message #2",
    "--------",
    "--------",
    "Message: Message #3",
    "TimeStamp: 12:55",
    "--------"
};
I want to group each set of lines (as delimited by "--------") into a list using LINQ. Basically, I want a List<List<string>> or similar where each inner list contains 4 strings - 2 separators, a timestamp and a message.
I should add that I would like to make this as generic as possible, as the log-file format could change.
Can this be done?
Will this work?
var result = Enumerable.Range(0, lines.Length / 4)
    .Select(l => lines.Skip(l * 4).Take(4).ToList())
    .ToList();
EDIT:
This looks a little hacky but I'm sure it can be cleaned up
IEnumerable<List<String>> GetLogGroups(string[] lines)
{
    var list = new List<String>();
    foreach (var line in lines)
    {
        list.Add(line);
        if (list.Count(l => l.All(c => c == '-')) == 2)
        {
            yield return list;
            list = new List<string>();
        }
    }
}
You should be able to do better than returning a List<List<string>>. If you're using C# 4, you could project each set of values into a dynamic type where the string before the colon becomes the property name and the string after it becomes the property value. You then create a custom iterator which reads lines until the closing "--------" appears in each set and then yield returns that row. On MoveNext, you read the next set of lines. Rinse and repeat until EOF. I don't have time at the moment to write up a full implementation, but my sample on reading CSV and using LINQ over dynamic objects may give you an idea of what you can do. See http://www.thinqlinq.com/Post.aspx/Title/LINQ-to-CSV-using-DynamicObject (note: this sample is in VB, but the same can be done in C# with some modifications).
The iterator implementation has the added benefit of not having to load the entire document into memory before parsing. With this version, you only load the amount for one set of blocks at a time. It allows you to handle really large files.
Assuming that your structure is always
delimiter
TimeStamp
Message
delimiter
public List<List<String>> ConvertLog(String[] log)
{
    var logSet = new List<List<String>>();
    for (int i = 0; i + 3 < log.Length; i += 4)
    {
        var set = new List<String>() { log[i], log[i+1], log[i+2], log[i+3] };
        logSet.Add(set);
    }
    return logSet;
}
Or in LINQ

public List<List<String>> ConvertLog(String[] log)
{
    return Enumerable.Range(0, log.Length / 4)
        .Select(l => log.Skip(l * 4).Take(4).ToList())
        .ToList();
}

Need an algorithm to group several parameters of a person under the person's name

I have a bunch of names in alphabetical order, with multiple instances of the same name appearing consecutively so that identical names are grouped together. Beside each name, after a comma, I have a role that has been assigned to that person, one name-role pair per line, something like what's shown below:
name1,role1
name1,role2
name1,role3
name1,role8
name2,role8
name2,role2
name2,role4
name3,role1
name4,role5
name4,role1
...
..
.
I am looking for an algorithm to take the above .csv file as input create an output .csv file in the following format
name1,role1,role2,role3,role8
name2,role8,role2,role4
name3,role1
name4,role5,role1
...
..
.
So basically I want each name to appear only once and then the roles to be printed in csv format next to the names for all names and roles in the input file.
The algorithm should be language independent. I would appreciate it if it does NOT use OOP principles :-) I am a newbie.
This obviously has some formatting bugs, but it will get you started.
var lastName = "";
do {
    var name = readName();
    var role = readRole();
    if (lastName != name) {
        print("\n" + name + ",");
        lastName = name;
    }
    print(role + ",");
} while (reader.isReady());
This is easy to do if your language has associative arrays: arrays that can be indexed by anything (such as a string) rather than just numbers. Some languages call them "hashes," "maps," or "dictionaries."
On the other hand, if you can guarantee that the names are grouped together as in your sample data, Stefan's solution works quite well.
It's kind of a pity you said it had to be language-agnostic, because Python is rather well-qualified for this:

import itertools

def split(s):
    return s.strip().split(',', 1)

with open(filename, 'r') as f:
    for name, lines in itertools.groupby(f, lambda s: split(s)[0]):
        print name + ',' + ','.join(split(s)[1] for s in lines)
Basically the groupby call takes all consecutive lines with the same name and groups them together.
Now that I think about it, though, Stefan's answer is probably more efficient.
Here is a solution in Java:
Scanner sc = new Scanner(new File(fileName));
Map<String, List<String>> nameRoles = new HashMap<String, List<String>>();
while (sc.hasNextLine()) {
    String line = sc.nextLine();
    String[] args = line.split(",");
    if (nameRoles.containsKey(args[0])) {
        nameRoles.get(args[0]).add(args[1]);
    } else {
        List<String> roles = new ArrayList<String>();
        roles.add(args[1]);
        nameRoles.put(args[0], roles);
    }
}
// then print it out
for (String name : nameRoles.keySet()) {
    List<String> roles = nameRoles.get(name);
    System.out.print(name + ",");
    for (String role : roles) {
        System.out.print(role + ",");
    }
    System.out.println();
}
With this approach, you can also handle random (unsorted) input like:
name1,role1
name3,role1
name2,role8
name1,role2
name2,role2
name4,role5
name4,role1
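Since several questions in this dump use Java 8 streams, the same map can also be built in one pass with Collectors.groupingBy. A sketch; file reading is reduced to an in-memory list for brevity, and a LinkedHashMap keeps first-seen name order:

```java
import java.util.*;
import java.util.stream.*;

public class NameRoles {
    // Group "name,role" lines into name -> [roles]; works for unsorted input too.
    static Map<String, List<String>> groupRoles(List<String> lines) {
        return lines.stream()
                .map(line -> line.split(",", 2))
                .collect(Collectors.groupingBy(a -> a[0], LinkedHashMap::new,
                         Collectors.mapping(a -> a[1], Collectors.toList())));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("name1,role1", "name3,role1", "name1,role2");
        // Print each name followed by its comma-separated roles.
        groupRoles(lines).forEach((name, roles) ->
                System.out.println(name + "," + String.join(",", roles)));
    }
}
```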
Here it is in C# using nothing fancy. It should be self-explanatory:
static void Main(string[] args)
{
    using (StreamReader file = new StreamReader("input.txt"))
    {
        string prevName = "";
        while (!file.EndOfStream)
        {
            string line = file.ReadLine();     // read a line
            string[] tokens = line.Split(','); // split the name and the parameter
            string name = tokens[0];
            string param = tokens[1];
            if (name == prevName)
            {
                // Same name as the previous line: just append the parameter.
                // This works because the names are sorted.
                Console.Write(param + " ");
            }
            else
            {
                // A new name: we are definitely done with the previous name
                // and have printed all of its parameters (due to the sorting).
                if (prevName != "") // avoid an extra newline the first time around
                {
                    Console.WriteLine();
                }
                Console.Write(name + ": " + param + " "); // the output format can easily be tweaked to print commas
                prevName = name;
            }
        }
    }
}
