Is it possible to infer the edit distance using a unified diff?

I have 2 files with contents spanning multiple lines. I'd like to find the edit distance; i.e. how many changes are required to transform A to B assuming only insertions and deletions are possible.
> cat > A
A
B
C
D
E
> cat > B
A
B
D
D
F
E
> diff -u A B
--- A 2015-05-12 16:09:31.000000000 +0200
+++ B 2015-05-12 16:09:42.000000000 +0200
@@ -1,5 +1,6 @@
 A
 B
-C
 D
+D
+F
 E
Would it be accurate to say that the total number of + and - lines gives me the edit distance?

Going by your definition of edit distance (similar to the "longest common subsequence problem"), you will first need to define what a single change is:
a single character?
a line?
a file?
The longest common subsequence problem is a classic computer science
problem, the basis of data comparison programs such as the diff
utility, and has applications in bioinformatics. It is also widely
used by revision control systems such as Git for reconciling multiple
changes made to a revision-controlled collection of files.
Assuming you want lines to define a change (based on your example), then yes, the total number of + and - lines in the diff output would suffice, as long as you don't count the ---/+++ file header lines. This is because an update/substitution will show up as both a deletion (-) and an insertion (+).
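As a cross-check on the counting argument, here is a minimal Python sketch that computes the insert/delete-only distance directly from the longest common subsequence, without going through diff. The function name indel_distance is made up for illustration. One caveat, stated as an assumption rather than a guarantee: GNU diff documents a --minimal option, which suggests its default output is not promised to be the smallest possible edit script for every input, so the two numbers could differ on large or unusual files.

```python
def indel_distance(a, b):
    """Insert/delete-only edit distance between two sequences of lines."""
    # prev[j] holds the LCS length of the part of a processed so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    lcs = prev[-1]
    # Every line outside the LCS is either deleted from a or inserted from b.
    return len(a) + len(b) - 2 * lcs


file_a = ["A", "B", "C", "D", "E"]
file_b = ["A", "B", "D", "D", "F", "E"]
print(indel_distance(file_a, file_b))  # 3: one "-" line and two "+" lines
```

For the files in the question the LCS is A, B, D, E, so the result agrees with the one - and two + lines in the diff.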
See also http://en.wikipedia.org/wiki/Diff_utility#Unified_format

Related

Is there a faster way to combine files in an ordered fashion than a for loop?

For some context, I am trying to combine multiple files (in an ordered fashion) named FILENAME.xxx.xyz (xxx starts from 001 and increases by 1) into a single file (denoted as $COMBINED_FILE), then replace a number of lines of text in the $COMBINED_FILE taking values from another file (named $ACTFILE). I have two for loops to do this which work perfectly fine. However, when I have a larger number of files, this process tends to take a fairly long time. As such, I am wondering if anyone has any ideas on how to speed this process up?
Step 1:
for i in {001..999}; do
[[ ! -f ${FILENAME}.${i}.xyz ]] && break
cat ${FILENAME}.${i}.xyz >> ${COMBINED_FILE}
mv -f ${FILENAME}.${i}.xyz ${XYZDIR}/${JOB_BASENAME}_${i}.xyz
done
Step 2:
for ((j=0; j<=${NUM_CONF}; j++)); do
let "n = 2 + (${j} * ${LINES_PER_CONF})"
let "m = ${j} + 1"
ENERGY=$(awk -v NUM=$m 'NR==NUM { print $2 }' $ACTFILE)
sed -i "${n}s/.*/${ENERGY}/" ${COMBINED_FILE}
done
I forgot to mention: there are other files named FILENAME.*.xyz which I do not want to append to the $COMBINED_FILE
Some details about the files:
FILENAME.xxx.xyz are molecular xyz files of the form:
Line 1: Number of atoms
Line 2: Title
Line 3-Number of atoms: Molecular coordinates
Line (number of atoms +1): same as line 1
Line (number of atoms +2): Title 2
... continues on (where line 1 through Number of atoms is associated with conformer 1, and so on)
The ACT file is a file containing the energies which has the form:
Line 1: conformer1 Energy
Line 2: conformer2 Energy2
Where conformer1 is in column 1 and the energy is in column 2.
The goal is to make each conformer's energy the title line of that conformer in the combined file (each energy must end up as the title of its own conformer).
If you know that at least one matching file exists, you should be able to do this:
cat -- ${FILENAME}.[0-9][0-9][0-9].xyz > ${COMBINED_FILE}
Note that this will match the 000 file, whereas your script counts from 001. If you know that 000 either doesn't exist or isn't a problem if it were to exist, then you should just be able to do the above.
However, moving these files to renamed names in another directory does require a loop, or one of the less-than-highly portable pattern-based renaming utilities.
If you could change your workflow so that the filenames are preserved, it could just be:
mv -- ${FILENAME}.[0-9][0-9][0-9].xyz ${XYZDIR}/${JOB_BASENAME}
where we now have a directory named after the job basename, rather than a path component fragment.
The Step 2 processing should be doable entirely in Awk, rather than a shell loop; you can read the file into an associative array indexed by line number, and have random access over it.
Awk can also accept multiple files, so the following pattern may be workable for processing the individual files:
awk 'your program' ${FILENAME}.[0-9][0-9][0-9].xyz
for instance just before catenating and moving them away. Then you don't have to rely on a fixed LINES_PER_CONF and such. Awk has the FNR variable, which is the record number within the current file; condition/action pairs can tell when processing has moved to the next file.
GNU Awk also has extensions BEGINFILE and ENDFILE, which are similar to the standard BEGIN and END, but are executed around each processed file; you can do some calculations over the record and in ENDFILE print the results for that file, and clear your accumulation variables for the next file. This is nicer than checking for FNR == 1, and having an END action for the last file.
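The single-pass idea behind the Awk suggestion can be sketched as follows, in Python for illustration only. The point is that the energies are loaded once and the combined file is walked once, instead of running awk and sed -i once per conformer. The function name retitle is hypothetical, and the sketch assumes the standard XYZ layout (an atom-count line, a title line, then that many coordinate lines), which may not match your files exactly.

```python
def retitle(xyz_lines, act_lines):
    """Replace each conformer's title line with the matching energy.

    Assumes each block is: atom count, title, then that many coordinate lines.
    """
    # Column 2 of the ACT file is the energy; row k belongs to conformer k.
    energies = [line.split()[1] for line in act_lines if line.strip()]
    out = []
    i = 0
    conformer = 0
    while i < len(xyz_lines) and conformer < len(energies):
        n_atoms = int(xyz_lines[i].split()[0])          # line 1: number of atoms
        out.append(xyz_lines[i])
        out.append(energies[conformer])                 # line 2: title becomes energy
        out.extend(xyz_lines[i + 2 : i + 2 + n_atoms])  # coordinate lines
        i += 2 + n_atoms
        conformer += 1
    out.extend(xyz_lines[i:])  # anything left over is copied unchanged
    return out


combined = ["2", "title one", "H 0 0 0", "H 0 0 1",
            "2", "title two", "O 0 0 0", "H 0 1 0"]
act = ["conformer1 -1.5", "conformer2 -2.25"]
print("\n".join(retitle(combined, act)))
```

Because the atom count is read from each block, no fixed number of lines per conformer is needed.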
If you really want to materialize all the file names without globbing, you can always use jot (it's like seq, but prints more integer digits in default mode before going to scientific notation):
jot -w 'myFILENAME.%03d' - 0 999 |
mawk '_<(_+=(NR == +_)*__)' \_=17 __=91 # extracting fixed interval
# samples without modulo(%) math
myFILENAME.016
myFILENAME.107
myFILENAME.198
myFILENAME.289
myFILENAME.380
myFILENAME.471
myFILENAME.562
myFILENAME.653
myFILENAME.744
myFILENAME.835
myFILENAME.926

How to name hundreds of files in increasing order in bash?

I need to download 300 files from the cloud and name them one by one in increasing order. I can do this for a single file by running the following command. The pathname before '>' is the location of the input file; the pathname after '>' is where I want to save the result.
/Applications/samtools-1.14/samtools depth -r dna /Volumes/lab/plants/aligned_data/S1_dedup.bam > /Volumes/lab/students/test1.txt
My question is how to change the numbers in 'S1_dedup.bam' and 'test1.txt' from 1 to 300 in a loop (or something similar), instead of hardcoding the numbers 300 times by hand.
for ((i=1; i<=300; i++))
do
    /Applications/samtools-1.14/samtools depth -r dna /Volumes/lab/plants/aligned_data/S${i}_dedup.bam > /Volumes/lab/students/test${i}.txt
done
You can use a for loop:
for i in {1..300}
do
    /Applications/samtools-1.14/samtools depth -r dna /Volumes/lab/plants/aligned_data/S${i}_dedup.bam > /Volumes/lab/students/test${i}.txt
done

AWK - replace with constant character in a specified number of random lines

I'm tasked with imputing masked genotypes, and I have to mask (hide) 2% of genotypes.
The file I do this in looks like this (genotype.dat):
M rs4911642
M rs9604821
M rs9605903
M rs5746647
M rs5747968
M rs5747999
M rs2070501
M rs11089263
M rs2096537
and to mask it, I simply change M to S2.
Yet, I have to do this for 110 (2%) of 5505 lines, so my strategy of using a random number generator (generating 110 numbers between 1 and 5505 and then manually changing the M on the corresponding lines to S2) took almost an hour... (I know, not terribly sophisticated).
I thought about saving the numbers in a separate file (maskedlines.txt) and then telling awk to replace the first character on each of those line numbers with S2, but I could not find any adaptable example of how to do this.
Anyway, any suggestions of how to tackle this will be deeply appreciated.
Here's one simple way, if you have shuf (it's in GNU coreutils, so if you have Linux, you almost certainly have it):
sed "$(printf '%ds/M/S2/;' $(shuf -n110 -i1-5505 | sort -n))" \
genotype.dat > genotype.masked
A more sophisticated version wouldn't depend on knowing that you want 110 of 5505 lines masked; you can easily extract the line count with lines=$(wc -l < genotype.dat), and from there you can compute the percentage.
shuf is used to produce a random sample of lines, usually from a file; the -i1-5505 option means to use the integers from 1 to 5505 instead, and -n110 means to produce a random sample of 110 (without repetition). I sorted that for efficiency before using printf to create a sed edit script.
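The "compute the percentage from the line count" variant mentioned above can be sketched like this, in Python for illustration. The function name mask_random_lines and its parameters are made up for this example; random.sample draws without repetition, which plays the role of shuf -n here.

```python
import random


def mask_random_lines(lines, fraction=0.02, rng=random):
    """Return a copy of lines with a random fraction changed from 'M' to 'S2'."""
    count = round(len(lines) * fraction)
    # sample() draws without repetition, like shuf -n.
    chosen = set(rng.sample(range(len(lines)), count))
    masked = []
    for idx, line in enumerate(lines):
        if idx in chosen and line.startswith("M"):
            masked.append("S2" + line[1:])
        else:
            masked.append(line)
    return masked


# Synthetic stand-in for genotype.dat: 5505 lines that all start with "M".
genotypes = ["M rs%d" % n for n in range(5505)]
result = mask_random_lines(genotypes, rng=random.Random(0))
print(sum(line.startswith("S2 ") for line in result))  # 110
```

With 5505 lines and a fraction of 0.02 the count rounds to 110, matching the numbers in the question.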
awk 'NR==FNR{a[$1]=1;next;} a[FNR]{$1="S2"} 1' maskedlines.txt genotype.dat
How it works
In sum, we first read maskedlines.txt into an associative array a. This file is assumed to have one number per line, and the element of a indexed by that number is set to one. We then read genotype.dat. If the element of a for the current line number is one, we change the first field to S2 to mask it. The line, whether changed or not, is then printed.
In detail:
NR==FNR{a[$1]=1;next;}
In awk, FNR is the number of records (lines) read so far from the current file and NR is the total number of lines read so far. So, when NR==FNR, we are reading the first file (maskedlines.txt). This file contains the line number of lines in genotype.dat that are to be masked. For each of these line numbers, we set a to 1. We then skip the rest of the commands and jump to the next line.
a[FNR]{$1="S2"}
If we get here, we are working on the second file: genotype.dat. For each line in this file, we check to see if its line number, FNR, was mentioned in maskedlines.txt. If it was, we set the first field to S2 to mask this line.
1
This is awk's cryptic shorthand to print the current line.

Comparison of extra-large subsets of strings

Every day there is one file of 2,000,000 to 4,000,000 strings, each a unique 15-digit number on its own line, like this:
850025000010145
401115000010152
400025000010166
770025555010152
512498004158752
Since the beginning of the current year, a corresponding number of such files has accumulated. I have to compare every line of today's file with all previous files since the beginning of the year and return only those numbers that have never appeared in any of the checked files.
Which language and algorithm should I use? How to implement it?
You should be able to do this without having to write any code beyond a simple script (i.e. bash, Windows batch, Powershell, etc.). There are standard tools that make quick work of this type of thing.
First, you have some number of files that contain from 2 million to 4 million numbers. It's difficult to work with all those files, so the first thing you want to do is create a combined file that's sorted. The simple-minded way to do that is to concatenate all the files into a single file, sort it, and remove duplicates. For example, using the GNU/Linux cat and sort commands:
cat file1 file2 file3 file4 > combined
sort -u combined > combined_sort
(The -u removes duplicates)
The problem with that approach is that you end up sorting a very large file. Figure 4 million lines at 15 characters, plus newlines, on each line, and almost 100 days of files, and you're working with 7 gigabytes. A whole year's worth of data would be 25 gigabytes. That takes a long time.
So instead, sort each individual file, then merge them:
sort -u file1 >file1_sort
sort -u file2 >file2_sort
...
sort -m -u file1_sort file2_sort file3_sort > combined_sort
The -m switch merges the already-sorted files.
Now what you have is a sorted list of all the identifiers you've seen so far. You want to compare today's file with that. First, sort today's file:
sort -u today >today_sort
Now, you can compare the files and output only the lines unique to today's file:
comm -2 -3 today_sort combined_sort
-2 says suppress lines that occur only in the second file, and -3 says to suppress lines that are common to both files. So all you'll get is the lines in today_sort that don't exist in combined_sort.
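To make the behaviour of comm -2 -3 concrete, here is a small Python sketch of the same idea: walk two sorted, duplicate-free streams in step and keep only what appears in the first. The function name only_in_first is hypothetical; this illustrates the technique and is not how comm itself is implemented.

```python
def only_in_first(first_sorted, second_sorted):
    """Yield items of first_sorted that do not occur in second_sorted.

    Both inputs must be sorted in the same order and free of duplicates,
    which is what sort -u produces.
    """
    second = iter(second_sorted)
    current = next(second, None)
    for item in first_sorted:
        # Advance the second stream until it catches up with item.
        while current is not None and current < item:
            current = next(second, None)
        if current is None or current != item:
            yield item


today_sort = ["400025000010166", "401115000010152", "850025000010145"]
combined_sort = ["400025000010166", "512498004158752", "770025555010152"]
print(list(only_in_first(today_sort, combined_sort)))
# ['401115000010152', '850025000010145']
```

Since each stream is read once from front to back, neither file has to fit in memory, which is the reason for sorting first.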
Now, if you're going to do this every day, then you need to take the output from the comm command and merge it with combined_sort so that you can use that combined file tomorrow. That prevents you from having to rebuild the combined_sort file every day. So:
comm -2 -3 today_sort combined_sort > new_values
Then:
sort -m combined_sort new_values > combined_sort_new
You'd probably want to name the file with the date, so you'd have combined_sort_20140401 and combined_sort_20140402, etc.
So if you started at the beginning of the year and wanted to do this every day, your script would look something like:
sort -u $todays_file > todays_sorted_file
comm -2 -3 todays_sorted_file $old_combined_sort > todays_uniques
sort -m $old_combined_sort todays_uniques > $new_combined_sort
$todays_file, $old_combined_sort, and $new_combined_sort are parameters that you pass on the command line. So, if the script was called "daily":
daily todays_file.txt all_values_20140101 all_values_20140102
If you must solve the problem by hand:
- Convert the strings to 64-bit integers. This saves space (2x to 4x) and speeds up calculations.
- Sort the current file of integers.
- Merge the current file with the old data file (already sorted), selecting the new numbers.
The merging step may look like the merge step of MergeSort.
You can store ranges of numbers in separate files to avoid extra large file sizes.
P.S. I wanted to propose using a bitmap, but it would be about 125 TB in size.
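The size figures in this answer can be checked with a few lines of arithmetic, shown here in Python. It assumes one bit per possible 15-digit value for the bitmap, 16 bytes per number as text (15 digits plus a newline), and decimal terabytes.

```python
MAX_ID = 10**15 - 1            # largest 15-digit number
assert MAX_ID < 2**63          # fits in a signed 64-bit integer

bitmap_bytes = 10**15 // 8     # one bit per possible 15-digit value
print(bitmap_bytes / 10**12)   # 125.0 (terabytes)

text_bytes_per_id = 16         # 15 digits plus a newline
int_bytes_per_id = 8           # one 64-bit integer
print(text_bytes_per_id / int_bytes_per_id)  # 2.0, the lower end of "2x to 4x"
```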
One solution could be to build a prefix tree from the previous n-1 files (supposing the n-th file was created today). The most time-consuming part, building the tree, has to be done only once. After you build the prefix tree, you can save it to a file (search online for how to serialize it).
Run the program to check the new file:
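// Assumes `tree` (the prefix tree built from earlier files) and `counter` are defined elsewhere.
try (BufferedReader br = new BufferedReader(new FileReader("new_file.txt"))) {
    String line = br.readLine();
    while (line != null) {
        if (!tree.contains(line)) {
            counter++;          // number not seen before
            tree.insert(line);  // remember it for future runs
        }
        line = br.readLine();
    }
}
So every day you run this 'pseudo' code, get the numbers that have not been seen before, and update the tree.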
contains takes O(m) time where m is number of chars
insert takes O(m) time too
I would suggest Java.
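For illustration, here is what the contains and insert operations could look like, sketched in Python rather than Java to keep it short. The class DigitTrie and the helper new_numbers are made-up names, and the nested-dict representation is chosen for readability; with hundreds of millions of keys it would likely use far too much memory, so treat it as a sketch of the idea only.

```python
class DigitTrie:
    """Minimal prefix tree over strings, stored as nested dicts."""

    _END = "$"  # marker key; safe here because the keys contain only digits

    def __init__(self):
        self.root = {}

    def insert(self, key):
        """Add key, walking one level per character: O(m) for m characters."""
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def contains(self, key):
        """Return True only if key was inserted as a whole string: O(m)."""
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return False
        return self._END in node


def new_numbers(tree, lines):
    """Return the lines not yet in tree, inserting each one as it is found."""
    fresh = []
    for line in lines:
        if not tree.contains(line):
            fresh.append(line)
            tree.insert(line)
    return fresh


tree = DigitTrie()
tree.insert("850025000010145")
print(new_numbers(tree, ["850025000010145", "401115000010152"]))
# ['401115000010152']
```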

Difference between "**/*/" and "**/"?

Here are two ways to use glob to recursively list directories:
Dir.glob("**/*/")
Dir.glob("**/")
The output appears to be the same, at least for a small subtree. Is there a difference between those two commands I am missing out on?
The ** matches 0 or more directories. By placing a * at the end you remove directories in the root, essentially making it 1 or more:
a = Dir.glob('/tmp/**/*/').sort
b = Dir.glob('/tmp/**/').sort
b.size # => 19
a.size # => 18
b - a # => ["/tmp/"]
Without a leading constant path though, it doesn't look like there is a difference as 0 length matches aren't interesting and don't get put in the results.
In that case, no, there isn't.
But, there are cases where that type of distinction can be important. If the patterns were instead **/* and **/*/* to recursively match files rather than directories, the first one would include files in the current directory while the latter would only list files that were at least one level down from the current directory since the /*/ in the middle has to match something.
