Every day there is a file of 2,000,000 to 4,000,000 lines, each line containing a unique 15-digit number, like this:
850025000010145
401115000010152
400025000010166
770025555010152
512498004158752
Since the beginning of the current year, some number of such files has accumulated. So I have to compare every line of today's file against all previous files from the beginning of the year and return only those numbers that have never appeared before in any of the checked files.
Which language and algorithm should I use? How would I implement it?
You should be able to do this without having to write any code beyond a simple script (e.g. bash, Windows batch, PowerShell, etc.). There are standard tools that make quick work of this type of thing.
First, you have some number of files that contain from 2 million to 4 million numbers. It's difficult to work with all those files, so the first thing you want to do is create a combined file that's sorted. The simple-minded way to do that is to concatenate all the files into a single file, sort it, and remove duplicates. For example, using the GNU/Linux cat and sort commands:
cat file1 file2 file3 file4 > combined
sort -u combined > combined_sort
(The -u removes duplicates)
The problem with that approach is that you end up sorting a very large file. Figure 4 million lines of 15 characters plus a newline each, and almost 100 days of files, and you're working with 7 gigabytes. A whole year's worth of data would be 25 gigabytes. That takes a long time.
So instead, sort each individual file, then merge them:
sort -u file1 >file1_sort
sort -u file2 >file2_sort
...
sort -m -u file1_sort file2_sort file3_sort > combined_sort
The -m switch merges the already-sorted files.
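With dozens of daily files you would script the per-file sorts rather than typing them all out. A minimal sketch, assuming every daily file matches the glob file* and nothing else in the directory does:
mkdir -p sorted
for f in file*; do
    sort -u "$f" > "sorted/$f"        # sort each day's file individually
done
sort -m -u sorted/* > combined_sort   # merge the already-sorted files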
Now what you have is a sorted list of all the identifiers you've seen so far. You want to compare today's file with that. First, sort today's file:
sort -u today >today_sort
Now, you can compare the files and output only the files unique to today's file:
comm -2 -3 today_sort combined_sort
-2 says suppress lines that occur only in the second file, and -3 says to suppress lines that are common to both files. So all you'll get is the lines in today_sort that don't exist in combined_sort.
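A tiny demonstration of those switches:
printf 'b\nc\nd\n' > today_sort
printf 'a\nb\nc\n' > combined_sort
comm -2 -3 today_sort combined_sort   # prints only "d"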
Now, if you're going to do this every day, then you need to take the output from the comm command and merge it with combined_sort so that you can use that combined file tomorrow. That prevents you from having to rebuild the combined_sort file every day. So:
comm -2 -3 today_sort combined_sort > new_values
Then:
sort -m combined_sort new_values > combined_sort_new
You'd probably want to name the file with the date, so you'd have combined_sort_20140401 and combined_sort_20140402, etc.
So if you started at the beginning of the year and wanted to do this every day, your script would look something like:
sort -u $todays_file > todays_sorted_file
comm -2 -3 todays_sorted_file $old_combined_sort > todays_uniques
sort -m $old_combined_sort todays_uniques > $new_combined_sort
$todays_file, $old_combined_sort, and $new_combined_sort are parameters that you pass on the command line. So, if the script was called "daily":
daily todays_file.txt all_values_20140101 all_values_20140102
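Put together, a minimal sketch of that daily script (the intermediate names todays_sorted_file and todays_uniques are just placeholders):
#!/bin/sh
# daily -- usage: daily todays_file old_combined_sort new_combined_sort
todays_file=$1
old_combined_sort=$2
new_combined_sort=$3

sort -u "$todays_file" > todays_sorted_file
comm -2 -3 todays_sorted_file "$old_combined_sort" > todays_uniques
sort -m "$old_combined_sort" todays_uniques > "$new_combined_sort"
cat todays_uniques   # the numbers never seen before today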
If you must solve the problem by hand:
- Convert the strings to 64-bit integers. This saves space (2x to 4x) and speeds up calculations.
- Sort current file of integers
- Merge current file with old data file (already sorted), selecting new numbers
The merging step can look like the merge step of merge sort.
You can store ranges of numbers in separate files to avoid extra large file sizes.
P.S. I wanted to propose a bit map, but it would be about 125 TB in size (10^15 possible values at one bit each).
One solution could be to build a prefix tree based on the previous n-1 files (suppose the n-th file was created today). The most time-consuming build process has to be done only once. After you build the prefix tree, you can save it as a file (google for this topic).
Run the program to check the new file:
try (BufferedReader br = new BufferedReader(new FileReader("new_file.txt"))) {
    String line = br.readLine();
    while (line != null) {
        if (!tree.contains(line)) { // never seen in any previous file
            counter++;
            tree.insert(line);      // remember it for future runs
        }
        line = br.readLine();
    }
}
So every day you run this 'pseudo' code, get the unique numbers, and update the tree.
contains takes O(m) time, where m is the number of characters.
insert takes O(m) time too.
I would suggest Java.
For some context, I am trying to combine multiple files (in an ordered fashion) named FILENAME.xxx.xyz (xxx starts from 001 and increases by 1) into a single file (denoted as $COMBINED_FILE), then replace a number of lines of text in the $COMBINED_FILE taking values from another file (named $ACTFILE). I have two for loops to do this which work perfectly fine. However, when I have a larger number of files, this process tends to take a fairly long time. As such, I am wondering if anyone has any ideas on how to speed this process up?
Step 1:
for i in {001..999}; do
[[ ! -f ${FILENAME}.${i}.xyz ]] && break
cat ${FILENAME}.${i}.xyz >> ${COMBINED_FILE}
mv -f ${FILENAME}.${i}.xyz ${XYZDIR}/${JOB_BASENAME}_${i}.xyz
done
Step 2:
for ((j=0; j<=${NUM_CONF}; j++)); do
let "n = 2 + (${j} * ${LINES_PER_CONF})"
let "m = ${j} + 1"
ENERGY=$(awk -v NUM=$m 'NR==NUM { print $2 }' $ACTFILE)
sed -i "${n}s/.*/${ENERGY}/" ${COMBINED_FILE}
done
I forgot to mention: there are other files named FILENAME.*.xyz which I do not want to append to the $COMBINED_FILE
Some details about the files:
FILENAME.xxx.xyz are molecular xyz files of the form:
Line 1: Number of atoms
Line 2: Title
Line 3-Number of atoms: Molecular coordinates
Line (number of atoms +1): same as line 1
Line (number of atoms +2): Title 2
... continues on (where line 1 through Number of atoms is associated with conformer 1, and so on)
The ACT file is a file containing the energies which has the form:
Line 1: conformer1 Energy
Line 2: conformer2 Energy2
Where conformer1 is in column 1 and the energy is in column 2.
The goal is to make each conformer's energy the title line for that conformer in the combined file.
If you know that at least one matching file exists, you should be able to do this:
cat -- ${FILENAME}.[0-9][0-9][0-9].xyz > ${COMBINED_FILE}
Note that this will match the 000 file, whereas your script counts from 001. If you know that 000 either doesn't exist or isn't a problem if it were to exist, then you should just be able to do the above.
However, moving these files to new names in another directory does require a loop, or one of the less-than-highly-portable pattern-based renaming utilities.
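A loop along the lines of Step 1 in the question would do; this sketch just reuses the question's own variables:
for f in "${FILENAME}".[0-9][0-9][0-9].xyz; do
    i=${f%.xyz}   # strip the .xyz extension
    i=${i##*.}    # keep only the three-digit counter
    mv -f -- "$f" "${XYZDIR}/${JOB_BASENAME}_${i}.xyz"
done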
If you could change your workflow so that the filenames are preserved, it could just be:
mv -- ${FILENAME}.[0-9][0-9][0-9].xyz ${XYZDIR}/${JOB_BASENAME}
where we now have a directory named after the job basename, rather than a path component fragment.
The Step 2 processing should be doable entirely in Awk, rather than a shell loop; you can read the file into an associative array indexed by line number, and have random access over it.
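A sketch of that idea, assuming $ACTFILE and $LINES_PER_CONF mean what they do in the question (energy in column 2 of the ACT file, one conformer per line, and a fixed number of lines per conformer):
awk -v lpc="$LINES_PER_CONF" '
    NR == FNR { energy[FNR] = $2; next }   # first file: index energies by conformer number
    FNR > 1 && (FNR - 2) % lpc == 0 {      # title lines sit at 2, 2+lpc, 2+2*lpc, ...
        print energy[(FNR - 2) / lpc + 1]
        next
    }
    { print }
' "$ACTFILE" "$COMBINED_FILE" > combined.tmp && mv combined.tmp "$COMBINED_FILE"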
Awk can also accept multiple files, so the following pattern may be workable for processing the individual files:
awk 'your program' ${FILENAME}.[0-9][0-9][0-9].xyz
for instance just before concatenating and moving them away. Then you don't have to rely on a fixed LINES_PER_CONF and such. Awk has the FNR variable, which is the record number in the current file; condition/action pairs can tell when processing has moved on to the next file.
GNU Awk also has extensions BEGINFILE and ENDFILE, which are similar to the standard BEGIN and END, but are executed around each processed file; you can do some calculations over the record and in ENDFILE print the results for that file, and clear your accumulation variables for the next file. This is nicer than checking for FNR == 1, and having an END action for the last file.
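For instance, counting the records in each file with those hooks (GNU Awk only):
gawk 'BEGINFILE { n = 0 }                          # reset per-file state
      { n++ }
      ENDFILE { print FILENAME ": " n " records" }' ${FILENAME}.[0-9][0-9][0-9].xyz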
If you really wanna materialize all the file names without globbing, you can always jot it (it's like seq, with more integer digits in default mode before going to scientific notation):
jot -w 'myFILENAME.%03d' - 0 999 |
mawk '_<(_+=(NR == +_)*__)' \_=17 __=91 # extracting fixed interval
# samples without modulo(%) math
myFILENAME.016
myFILENAME.107
myFILENAME.198
myFILENAME.289
myFILENAME.380
myFILENAME.471
myFILENAME.562
myFILENAME.653
myFILENAME.744
myFILENAME.835
myFILENAME.926
I have 600 CSV files of ~1 MB each, for a total of roughly 600 MB. I want to put all of them into a sqlite3 db. So my first step would be to merge them into one big csv (of ~600 MB, right?) before importing it into a sql db.
However, when I run the following bash command (to merge all files keeping one header):
cat file-chunk0001.csv | head -n1 > file.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> file.csv; done
The resulting file.csv has a size of 38 GB, at which point the process stops because I have no space left on the device.
So my question is: why would the merged file be more than 50 times bigger than expected? And what can I do to put the data into a sqlite3 db with a reasonable size?
I guess my first question is: if you know how to do a for loop, why do you need to merge all the files into a single CSV file? Can't you just load them one after the other?
But your problem is an infinite loop. Your wildcard (*.csv) includes the file you're writing to. You could put your output file in a different directory or make sure your file glob does not include the output file (for f in file-*.csv maybe).
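A sketch of the fix, assuming all the inputs really are named file-chunk*.csv, so the glob can never pick up the output file:
head -n 1 file-chunk0001.csv > file.csv   # keep one header
for f in file-chunk*.csv; do
    tail -n +2 "$f" >> file.csv           # append data rows only
done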
I have the following data containing a subset of record numbers formatting like so:
>head pilot.dat
AnalogPoint,206407
AnalogPoint,2584
AnalogPoint,206292
AnalogPoint,206278
AnalogPoint,206409
AnalogPoint,206410
AnalogPoint,206254
AnalogPoint,206266
AnalogPoint,206408
AnalogPoint,206284
I want to compare the list of entries to another subset file called "disps.dat" to find duplicates, which is formatted in the same way:
>head disps.dat
StatusPoint,280264
StatusPoint,280266
StatusPoint,280267
StatusPoint,280268
StatusPoint,280269
StatusPoint,280335
StatusPoint,280336
StatusPoint,280334
StatusPoint,280124
I used the command:
grep -f pilot.dat disps.dat > duplicate.dat
However, the output file "duplicate.dat" is listing records that exist in the second file "disps.dat", but do not exist in the first file.
(Note, both files are big, so the samples shown above don't have duplicates, but I do expect and have confirmed at least 10-12k duplicates to show up in total)
> head duplicate.dat
AnalogPoint,208106
AnalogPoint,208107
StatusPoint,1235220
AnalogPoint,217270
AnalogPoint,217271
AnalogPoint,217272
AnalogPoint,217273
AnalogPoint,217274
AnalogPoint,217275
AnalogPoint,217277
> grep "AnalogPoint,208106" pilot.dat
>
I tested the above command with a smaller sample of data (10 records), also formatted the same, and the results work fine, so I'm a little bit confused on why it is failing on the larger execution.
I also tried feeding it in as a string with -F, thinking that the "," comma might be the source of the issue. Right now, I am feeding the data through a 'for' loop and echoing each line, which is executing very, very slowly, but at least it will help me rule out the regex possibility.
The -x or -w option is needed to do an exact match.
-x matches the whole line exactly, and -w matches the pattern as a whole word, blocking adjacent word characters, which works in my case to handle trailing numbers.
The issue is that a record in the first file such as:
"AnalogPoint,1"
Would end up flagging records in the second file like:
"AnalogPoint,10"
"AnalogPoint,123"
"AnalogPoint,100200"
And so on.
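For example, treating each record as a fixed string that must match a whole line (-F is an extra safeguard against regex interpretation; the essential part here is -x):
grep -x -F -f pilot.dat disps.dat > duplicate.dat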
Thanks to @Barmar for pointing out my issue.
I have a ~90GB file. Each line consists of tab-separated pairs such as Something \t SomethingElse. My main goal is to find the frequency of each unique line in the file. So I tried
sort --parallel=50 bigFile | uniq -c > new_sortedFile
which did not work due to the file size. Then I split the original big file into 9 parts (each 10 GB), and the same command worked for each file separately.
So my question is how can I aggregate the result of those 9 files into a single file in order to have the same result of the bigFile?
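One way, sketched under the assumption that the set of unique lines fits in memory: each of the 9 outputs is lines of the form "count line", so you can sum the counts per line with awk (part1_counts through part9_counts are placeholder names for your 9 result files):
awk '{
    c = $1                                            # leading count from uniq -c
    line = $0
    sub(/^[[:space:]]*[0-9]+[[:space:]]/, "", line)   # strip the count, keep the line (tabs intact)
    total[line] += c
}
END {
    for (l in total) printf "%7d %s\n", total[l], l
}' part*_counts > new_sortedFile
The output order is arbitrary; pipe it through sort if you need it ordered.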
I'm trying to split a very big CSV file into smaller more manageable ones. I've tried split but it seems that it tops out at 676 files.
The CSV file I have is in excess of 80 MB and I'd like to split it into 50-line files.
Note: by "better" I mean one that uses a numbering structure instead of split's a-z sequencing.
split is the right tool; the problem is that the default suffix length is only 2, and 26^2 = 676. If you make it longer you should be fine:
split -a LEN file
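And since the question asks for a numbering structure, GNU split can produce numeric suffixes directly; a sketch assuming the input is called bigfile.csv:
split -d -a 3 -l 50 bigfile.csv chunk_
# produces chunk_000, chunk_001, ... with 50 lines each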
Use cat -n to number each line and pipe the output to grep with parameters that select only the n lines you want.