I've got some big (let's say 200 MiB - 2 GiB) textual files filled with tons of duplicated records. Each line can have about 100 or even more exact duplicates spread over the file. The task is to remove all the repetitions, leaving one unique instance of every record.
I've implemented it as follows:
object CleanFile {
  def apply(s: String, t: String) {
    import java.io.{PrintWriter, FileWriter, BufferedReader, FileReader}

    println("Reading " + s + "...")
    var linesRead = 0
    val lines = new scala.collection.mutable.ArrayBuffer[String]()
    val fr = new FileReader(s)
    val br = new BufferedReader(fr)
    var rl = ""
    while (rl != null) {
      rl = br.readLine()
      if (!lines.contains(rl))
        lines += rl
      linesRead += 1
      if (linesRead > 0 && linesRead % 100000 == 0)
        println(linesRead + " lines read, " + lines.length + " unique found.")
    }
    br.close()
    fr.close()
    println(linesRead + " lines read, " + lines.length + " unique found.")

    println("Writing " + t + "...")
    val fw = new FileWriter(t)
    val pw = new PrintWriter(fw)
    lines.foreach(line => pw.println(line))
    pw.close()
    fw.close()
  }
}
And it takes ~15 minutes (on my Core 2 Duo with 4 GB of RAM) to process a 92 MiB file, while the following command:
awk '!seen[$0]++' filename
takes about a minute to process a 1.1 GiB file (which would take many hours with my code above).
What's wrong with my code?
What is wrong is that you're using an array to store the lines. Lookup (lines.contains) takes O(n) in an array, so the whole thing runs in O(n²) time. By contrast, the Awk solution uses a hashtable, meaning O(1) lookup and a total running time of O(n).
Try using a mutable.HashSet instead.
You could also just read all lines and call .distinct on them. I don't know how distinct is implemented, but I'm betting it uses a HashSet to do it.
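For example, here is a minimal sketch of the same streaming read built on a mutable.LinkedHashSet, which gives O(1) membership checks while preserving the first-seen order of lines (the method name is illustrative, not your CleanFile.apply):

import java.io.{BufferedReader, FileReader, PrintWriter}
import scala.collection.mutable

def cleanFile(s: String, t: String): Unit = {
  val seen = mutable.LinkedHashSet[String]()  // O(1) lookup instead of ArrayBuffer's O(n) contains
  val br = new BufferedReader(new FileReader(s))
  var line = br.readLine()
  while (line != null) {
    seen += line                              // no-op when the line is a duplicate
    line = br.readLine()
  }
  br.close()
  val pw = new PrintWriter(t)
  seen.foreach(l => pw.println(l))
  pw.close()
}

As a side effect this also avoids a subtle bug in your loop: when readLine() returns null at end of file, the null itself passes the contains check, gets appended to lines, and is later printed as "null".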
I have written code that generates a random number between 1 and 9.
I now want to extend this so that it generates a new number if the current number has already been used before.
Sub main()
    Dim max, min
    max = 9
    min = 1
    Randomize
    MsgBox(Int((max-min+1)*Rnd+min))
    ' Random number between 1-9 is generated
I tried to implement a loop, but I'm unsure how it would work, as I would need to keep the generated numbers in memory:
    If Int = random
        Msgbox("Already in use")
    End If
    If Int = not random Then
        Msgbox("Can be used")
    End If
End Sub
Sounds like you just want to keep track of which random numbers have already been chosen. You could handle this any number of different ways (e.g., with an array, hash table/dictionary, numeric bit mask, etc.).
The solution I present below is similar to a numeric bit mask, but uses a string. Starting at all zeros (e.g., "0000"), each indexable position in the string gets populated with a one (1) until the string becomes all ones (e.g., "1111"). While potentially oversimplified -- since it assumes your min will always be one (1) -- it should get you started.
Dim min : min = 1
Dim max : max = 9
Dim result : result = ""
Dim r, s : s = String(max, "0")

Randomize
Do
    r = Int((max-min+1)*Rnd+min)
    If "0" = Mid(s, r, 1) Then
        WScript.Echo "Can be used: " & r
        result = result & ":" & r
        s = Left(s, r-1) & "1" & Right(s, max-r)
    Else
        WScript.Echo "Already in use: " & r
    End If
Loop Until String(max, "1") = s
WScript.Echo "Result" & result
Sample output:
Can be used: 4
Can be used: 5
Can be used: 9
Can be used: 3
Already in use: 3
Can be used: 1
Can be used: 8
Can be used: 6
Already in use: 6
Can be used: 7
Already in use: 8
Already in use: 4
Already in use: 3
Already in use: 1
Already in use: 6
Can be used: 2
Result:4:5:9:3:1:8:6:7:2
Hope this helps.
I've created a MappedBytes instance over a file that I'm using as a shared cache between different Java processes.
I would like to be able to split out additional MappedBytes instances (or ByteBuffer, or any other instance) from the original that provide direct read/write access to a subset of the underlying file.
I've spent today experimenting with different methods, but options like subBytes(), rawCopy() and copyTo() all seem to create local copies of the underlying data, rather than accessing the file directly.
For example:
File tmpFile = new File(System.getProperty("java.io.tmpdir"), "data.dat");
MappedFile mappedFile = MappedFile.mappedFile(tmpFile, 1000, 100, 10, false);
MappedBytes original = MappedBytes.mappedBytes(mappedFile);
original.zeroOut(0, 1000);
original.writeInt(0, 1234);

BytesStore copy = original.bytesStore().subBytes(0, 200);

// Print out the int in the two BytesStores.
// This shows that the copy has the same contents as the original.
System.out.println("Original(0): " + original.readInt(0));
System.out.println("Copy(0): " + copy.readInt(0));

// Now modify the copy and print out the new int in the two BytesStores again.
copy.writeInt(50, 4321);
System.out.println("Original(50): " + original.readInt(50));
System.out.println("Copy(50): " + copy.readInt(50));
Produces the output:
Original(0): 1234
Copy(0): 1234
Original(50): 0
Copy(50): 4321
The copy has been modified, but not the original. I would like the original to be modified as well; can Chronicle-Bytes do this?
Thanks for your help,
Josh.
This is a self-contained test which I think behaves the way you need.
@Test
public void multiBytes() throws FileNotFoundException {
    String tmpfile = OS.TMP + "/data.dat";
    MappedFile mappedFile = MappedFile.mappedFile(new File(tmpfile), 64 << 10);
    MappedBytes original = MappedBytes.mappedBytes(mappedFile);
    original.zeroOut(0, 1000);
    original.writeInt(0, 1234);

    PointerBytesStore pbs = new PointerBytesStore();
    pbs.set(original.addressForRead(50), 100);

    // Print out the int in the two BytesStores.
    // pbs starts at offset 50 of the original, so pbs(0) reads original(50).
    System.out.println("Original(0): " + original.readInt(0));
    System.out.println("PBS(0): " + pbs.readInt(0));

    // Now write through the view and read the same bytes via the original.
    pbs.writeInt(0, 4321);
    System.out.println("Original(50): " + original.readInt(50));
    System.out.println("PBS(0): " + pbs.readInt(0));

    original.writeInt(54, 12345678);
    System.out.println("Original(54): " + original.readInt(54));
    System.out.println("PBS(4): " + pbs.readInt(4));
}
prints
Original(0): 1234
PBS(0): 0
Original(50): 4321
PBS(0): 4321
Original(54): 12345678
PBS(4): 12345678
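For clarity: the PointerBytesStore is not a copy. The call pbs.set(original.addressForRead(50), 100) turns pbs into a 100-byte view directly over the mapped memory starting at offset 50, which is why pbs.readInt(0) and original.readInt(50) see the same bytes.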
I have a list of documents to process, and for each record I want to attach some metadata to the document "member" inside the "corpus" data structure that tm, the R package, generates (from reading in text files).
This for-loop works, but it is very slow; performance seems to degrade roughly as f ~ 1/n_docs.
for (i in seq(from = 1, to = length(corpus), by = 1)){
    if (opts$options$verbose == TRUE || i %% 50 == 0){
        print(paste(i, " ", substr(corpus[[i]], 1, 140), sep = " "))
    }
    DublinCore(corpus[[i]], "title") = csv[[i,10]]
    DublinCore(corpus[[i]], "Publisher") = csv[[i,16]]  # institutions
}
This may do something to the corpus variable, but I don't know what.
But when I put the work inside tm_map() (similar to the lapply() function), it runs much faster, yet the changes are not made persistent:
i = 0
corpus = tm_map(corpus, function(x){
    i <<- i + 1
    if (opts$options$verbose == TRUE){
        print(paste(i, " ", substr(x, 1, 140), sep = " "))
    }
    meta(x, tag = "Heading") = csv[[i,10]]
    meta(x, tag = "publisher") = csv[[i,16]]
})
The corpus variable has empty metadata fields after tm_map returns, when they should be filled. I have a few other things to do with the collection afterwards.
The R documentation for the meta() function says this:
Examples:
data("crude")
meta(crude[[1]])
DublinCore(crude[[1]])
meta(crude[[1]], tag = "Topics")
meta(crude[[1]], tag = "Comment") <- "A short comment."
meta(crude[[1]], tag = "Topics") <- NULL
DublinCore(crude[[1]], tag = "creator") <- "Ano Nymous"
DublinCore(crude[[1]], tag = "Format") <- "XML"
DublinCore(crude[[1]])
meta(crude[[1]])
meta(crude)
meta(crude, type = "corpus")
meta(crude, "labels") <- 21:40
meta(crude)
I tried many of these calls (with my variable corpus instead of crude), but they do not seem to work.
Someone else seems to have had the same problem with a similar data set (a forum post from 2009, no response).
Here's a bit of benchmarking...
With the for loop:
expr.for <- function() {
    for (i in seq(from = 1, to = length(corpus), by = 1)){
        DublinCore(corpus[[i]], "title") = LETTERS[round(runif(26))]
        DublinCore(corpus[[i]], "Publisher") = LETTERS[round(runif(26))]
    }
}
microbenchmark(expr.for())
# Unit: milliseconds
#         expr      min       lq   median       uq      max
# 1 expr.for() 21.50504 22.40111 23.56246 23.90446 70.12398
With tm_map:
corpus <- crude
expr.map <- function() {
    tm_map(corpus, function(x) {
        meta(x, "title") = LETTERS[round(runif(26))]
        meta(x, "Publisher") = LETTERS[round(runif(26))]
        x
    })
}
microbenchmark(expr.map())
# Unit: milliseconds
#         expr      min       lq   median       uq      max
# 1 expr.map() 5.575842 5.700616 5.796284 5.886589 8.753482
So the tm_map version, as you noticed, is about 4 times faster.
In your question you say that the changes in the tm_map version are not persistent; that is because you don't return x at the end of your anonymous function. The end of it should be:
meta(x, tag = "Heading") = csv[[i,10]]
meta(x, tag = "publisher" ) = csv[[i,16]]
x
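Putting it together with the snippet from your question (same variables as there, verbose printing omitted):

i <- 0
corpus <- tm_map(corpus, function(x){
    i <<- i + 1
    meta(x, tag = "Heading") <- csv[[i,10]]
    meta(x, tag = "publisher") <- csv[[i,16]]
    x  # return the modified document so tm_map keeps the change
})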
There are many reports of slow performance of Octave's dlmread. I was hoping this would be fixed in 3.2.4, but when I tried to load a CSV file of roughly 8 columns × 4 million rows (about 32 million values in total), it also took a very, very long time. I searched the web but could not find a workaround. Does anybody know a good one?
I experienced the same problem and had R handy, so my solution was to use "read.csv" in R, write a ".mat" file with the R package "R.matlab", and then load that in Octave.
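A minimal sketch of that round trip (the file names are illustrative, and writeMat comes from the R.matlab package):

# In R:
library(R.matlab)
d <- read.csv("data.csv")                  # slow, but only done once
writeMat("data.mat", data = as.matrix(d))

Afterwards, load("data.mat") in Octave gives you a matrix variable named data.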
"read.csv" can be pretty slow too, but this worked very well in my case.
The reason is that Octave has a bug whereby appending data to a very large matrix takes more time than appending the same amount of data to a small matrix.
Below is my attempt. I chose to save the data every 50,000 lines, so that in the meantime I could already take a look at it instead of being forced to wait. It is slower for small files, but much faster for large ones.
function alldata = load_data(filename)
    fid = fopen(filename, 'r');
    s = 0;
    data = [];
    alldata = [];
    save "temp.mat" alldata;
    if fid == -1
        disp("Couldn't find file mydata");
    else
        while (~feof(fid))
            line = fgetl(fid);
            # reading time as hh:mm:ss:ms and data as float
            [t1, t2, t3, t4, d] = sscanf(line, '%i:%i:%i:%i %f', "C");
            s++;
            t = (t1 * 3600000 + t2 * 60000 + t3 * 1000 + t4);
            data = [data; t, d];
            if (mod(s, 10000) == 0)
                #disp(s), disp(" "), disp(t), disp(" "), disp(d), disp("\n");
                disp(s);
                fflush(stdout);
            end
            if (mod(s, 50000) == 0)
                load "temp.mat";
                alldata = [alldata; data];
                data = [];
                save "temp.mat" alldata;
                disp("data saved");
                fflush(stdout);
            end
        end
        disp(s);
        load "temp.mat";
        alldata = [alldata; data];
        save "temp.mat" alldata;
        disp("data saved");
        fflush(stdout);
    end
    fclose(fid);
Here is a workaround that I am using.
I did not find that sscanf would parse input lines as indicated above, and I did not use the temp file.
My .csv file has a large number of rows. It begins with an 18-line header, followed by a data block in which each row has 135 columns. Each data row starts with a dd/mm/yyyy hh:mm field. The following code has been tested; it also catches bad lines, and reports where they are, using try/catch.
My .csv file came from a customer who dumped his PARCView load in an Excel file.
function [tags, descr, alldata] = fbcsvread(filename)
    fid = fopen(filename, 'r');
    s = 0;
    data = [];
    alldata = zeros(1, 135);
    if fid == -1
        printf("Couldn't find file %s\n", filename);
    else
        linecount = 1;
        while (~feof(fid))
            line = fgetl(fid);
            data2 = zeros(1, 135);
            if linecount == 1
                tags = strsplit(line, ",");
            elseif linecount == 2
                descr = strsplit(line, ",");
            elseif linecount >= 19
                data = strsplit(line, ",");
                datetime = strsplit(char(data(1)), " ");
                modyyr = strsplit(char(datetime(1)), "/");
                hrmin = strsplit(char(datetime(2)), ":");
                year1 = sscanf(char(modyyr(3)), "%d", "C");
                day1 = sscanf(char(modyyr(2)), "%d", "C");
                month1 = sscanf(char(modyyr(1)), "%d", "C");
                hour1 = sscanf(char(hrmin(1)), "%d", "C");
                minute1 = sscanf(char(hrmin(2)), "%d", "C");
                realtime = datenum(year1, month1, day1, hour1, minute1);
                data2(1) = realtime;
                for location = 2:134
                    try
                        data2(location) = sscanf(char(data(location)), "%f", "C");
                    catch
                        printf("Error at %s %s\n", char(datetime(1)), char(datetime(2)));
                        fflush(stdout);
                    end_try_catch
                endfor
                alldata(linecount-18, :) = data2;
                if mod(linecount, 50) == 0
                    printf(".");
                    fflush(stdout);
                endif
            endif
            linecount = linecount + 1;
        endwhile
        fclose(fid);
    endif
endfunction
I apologize for creating a thread similar to many that are already out there, but I mainly wanted to get some insight on methods.
I have a list of Strings (could be just one, or over a thousand).
Format = XXX-XXXXX-XX, where each character is alphanumeric.
I am trying to generate a unique string (currently 18 characters long, though it could probably be longer, provided it doesn't push file names or paths past their length limits) that I can reproduce given that same list. Order doesn't matter, although I may be interested if it's easier when the order is restricted as well.
My current Java code is as follows (it failed today, hence why I am here):
public String createOutputFileName(ArrayList<String> alInput, EnumFPFunction efpf, boolean pHeaders) {
    /* create file name based on input list */
    String sFileName = "";
    long partNum = 0;
    for (String sGPN : alInput) {
        sGPN = sGPN.replaceAll("-", "");      // remove dashes
        partNum += Long.parseLong(sGPN, 36);  // (base 36)
    }
    sFileName = Long.toString(partNum);
    if (sFileName.length() > 19) {
        sFileName = sFileName.substring(0, 18); // max length of 19
    }
    return sFileName;
}
So obviously just adding them did not work out so well, as I found out (I also think I should take the last 18 digits rather than the first 18).
Are there any good methods out there (possibly CRC-related) that would work?
To assist with my key creation:
The first 3 characters are almost always numeric, and there would probably be many duplicates (out of 100 strings, there may be only 10 different starting numbers).
The characters I and O are not allowed.
There will never be a letter followed by a number in the last two-character subset.
I would use the system time. Here's how you might do it in Java:
public String createOutputFileName() {
    long mills = System.currentTimeMillis();
    long nanos = System.nanoTime();
    return mills + " " + nanos;
}
If you want to add some information about the items and their part numbers, you can, of course!
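For instance, a quick sketch (appending the list size is just an illustration on top of the timestamp idea, not something your naming scheme requires):

public String createOutputFileName(ArrayList<String> alInput) {
    // the timestamps keep the name unique; the suffix carries list information
    return System.currentTimeMillis() + "_" + System.nanoTime()
            + "_" + alInput.size() + "parts";
}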
======== EDIT: "What do I mean by batch object" =========
class Batch {
    ArrayList<Item> itemsToProcess;
    String inputFilename; // input to external process
    boolean processingFinished;

    public Batch(ArrayList<Item> itemsToProcess) {
        this.itemsToProcess = itemsToProcess;
        inputFilename = null;
        processingFinished = false;
    }

    public void processWithExternal() throws IOException, InterruptedException {
        if (inputFilename != null || processingFinished) {
            throw new IllegalStateException("Cannot initiate process more than once!");
        }
        String base = System.currentTimeMillis() + " " + System.nanoTime();
        this.inputFilename = base + "_input";
        writeItemsToFile();

        // however you build your process, do it here
        Process p = new ProcessBuilder("myProcess", "myargs", inputFilename).start();
        p.waitFor();
        processingFinished = true;
    }

    private void writeItemsToFile() throws IOException {
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(inputFilename)));
        int flushcount = 0;
        for (Item item : itemsToProcess) {
            String output = item.getFileRepresentation();
            out.println(output);
            if (++flushcount % 10 == 0) out.flush();
        }
        out.flush();
        out.close();
    }
}
In addition to GlowCoder's response, here is another "decent" method I have thought of that would work.
Instead of just summing the list in base 36, I would do two separate things to the same list: since negative and decimal numbers cannot occur, adding every number and, separately, multiplying every number, then concatenating the two base-36 strings, isn't a bad way either.
In my case, I would take the last nine digits of the sum and the last nine digits of the product. This eliminates my previous errors and makes the result quite robust. Errors are obviously still possible once overflow starts occurring, but it works in this case, and extending the allowable string length would make it more robust still.
Sample code:
public String createOutputFileName(ArrayList<String> alInput, EnumFPFunction efpf, boolean pHeaders) {
    /* create file name based on input list */
    String sFileName1 = "";
    String sFileName2 = "";
    long partNum1 = 0; // starting point for addition
    long partNum2 = 1; // starting point for multiplication
    for (String sGPN : alInput) {
        sGPN = sGPN.replaceAll("-", "");       // remove dashes
        partNum1 += Long.parseLong(sGPN, 36);  // (base 36)
        partNum2 *= Long.parseLong(sGPN, 36);  // (base 36)
    }
    // Initial strings
    sFileName1 = "000000000" + Long.toString(partNum1, 36); // base 36
    sFileName2 = "000000000" + Long.toString(partNum2, 36); // base 36
    // Cropped strings
    sFileName1 = sFileName1.substring(sFileName1.length() - 9, sFileName1.length());
    sFileName2 = sFileName2.substring(sFileName2.length() - 9, sFileName2.length());
    return sFileName1 + sFileName2;
}