What is the general format of Ruby "diff-lcs" diff output? - ruby

The Ruby diff-lcs library does a great job of generating the changeset you need to get from one sequence to another but the format of the output is somewhat confusing to me. I would expect a list of changes but instead the output is always a list containing one or two lists of changes. What is the meaning/intent of having multiple lists of changes?
Consider the following simple example:
> Diff::LCS.diff('abc', 'a-c')
# => [[#<Diff::LCS::Change:0x01 @action="-", @position=1, @element="b">,
#      #<Diff::LCS::Change:0x02 @action="+", @position=1, @element="-">],
#     [#<Diff::LCS::Change:0x03 @action="-", @position=3, @element="">]]
Ignoring the fact that the last change is blank, why are there two lists of changes instead of just one?

You might have better luck with a better example. If you do this:
Diff::LCS.diff('ab cd', 'a- c_')
Then the output looks like this (with the noise removed):
[
  [
    <@action="-", @position=1, @element="b">,
    <@action="+", @position=1, @element="-">
  ], [
    <@action="-", @position=4, @element="d">,
    <@action="+", @position=4, @element="_">
  ]
]
If we look at Diff::LCS.diff('ab cd ef', 'a- c_ e+'), then we'd get three inner arrays instead of two.
What possible reason could there be for this? There are three operations in a diff:
Add a string.
Remove a string.
Change a string.
A change is really just a combination of a remove and an add, so we're left with just remove and add as the fundamental operations; these line up with the @action values quite nicely. However, when humans look at diffs, we want to see a change as a distinct operation: we want to see that b has become -. The "remove b, add -" version is an implementation detail.
If all we had was this:
[
  <@action="-", @position=1, @element="b">,
  <@action="+", @position=1, @element="-">,
  <@action="-", @position=4, @element="d">,
  <@action="+", @position=4, @element="_">
]
then you'd have to figure out which +/- pairs were really changes and which were separate additions and removals.
So the inner arrays map the two fundamental operations (add, remove) to the three operations (add, remove, change) that humans want to see.
You might want to examine the structure of the outputs from these as well:
Diff::LCS.diff('ab cd', 'a- x c_')
Diff::LCS.diff('ab', 'abx')
Diff::LCS.diff('ab', 'xbx')
I think an explicit change @action for Diff::LCS::Change would be better, but at least the inner arrays let you group the individual additions and removals into higher-level edits.
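To make the grouping idea concrete, here is a minimal sketch in Python (not Ruby, and not diff-lcs code). It assumes each inner array is represented as a list of (action, position, element) tuples standing in for Diff::LCS::Change objects; classify_hunk is a hypothetical helper name, and the third hunk is invented purely for illustration.

```python
def classify_hunk(hunk):
    """Label one inner array of changes as 'add', 'remove', or 'change'."""
    actions = {action for action, _position, _element in hunk}
    if actions == {"+"}:
        return "add"
    if actions == {"-"}:
        return "remove"
    # A hunk that mixes "-" and "+" is what a human would call a change.
    return "change"


hunks = [
    [("-", 1, "b"), ("+", 1, "-")],
    [("-", 4, "d"), ("+", 4, "_")],
    [("+", 6, "x")],  # hypothetical hunk, not taken from the output above
]

labels = [classify_hunk(hunk) for hunk in hunks]
print(labels)
```

With a single flat list this classification would not be possible without guessing which +/- entries belong together.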

Related

In Pylint, how do I disable "Exactly one space after comma" for multidimensional array indices?

I like having PyLint check that commas are generally followed by spaces, except in one case: multidimensional indices. For example, I get the following warning from Pylint:
C: 31, 0: Exactly one space required after comma
num_features = len(X_train[0,:])
^ (bad-whitespace)
Is there a way to get rid of the warnings requiring spaces after commas for the case multidimensional arrays, but keep the space-checking logic the same for all other comma uses?
Thanks!
I am sure you figured this out by now but for anyone, like me, who happened upon this looking for an answer...
Use # pylint: disable=C0326 on the line that is guilty of this. For instance:
num_features = len(X_train[0,:])  # pylint: disable=C0326
This applies to multiple kinds of whitespace errors. See the Pylint wiki.
You'll almost certainly want to disable this via the .pylintrc file for larger situations.
For example, say I have:
x111 = thing.abc(asdf)
x112_b = thing1.abc(asdf)
x112_b224 = thing.abc(asdf)
x112_f = thing1.abc(asdf)
... lots more
Now, presume I want to align the assignments so the structure is easier to see:
x111   = thing.abc(asdf)
x112_b = thing1.abc(asdf)
... lots more
so I add the following line to .pylintrc:
disable=C0326,C0115,C0116
(Note that only the first code, C0326, matters here; I'm leaving the two docstring codes in so you can see that you simply list every message you want to ignore.)

Checking if a word is already in a list (Bash)

I'm writing a small bash script and am trying to test if a newly generated word is already in a list of all previously made words.
This is what I'm working with now:
dict=("word1"... "word21") #there would be 21 words in here
prev_guesses=()
guess_creator() {
guess=""
for i in {1..5} ;
do
guess_num=$( shuf -i 0-21 -n 1 )
guess+="${dict[$guess_num]}"
done
# using recursion to take another guess
if [ $guess (is in) $prev_guesses] ; then
guess_creator
else
prev_guess+=($guess)
fi
}
I'm also not sure if recursion works like this in bash. If it doesn't, how can I actually "unbreak" this code? The idea is to have this function output a unique string every time it runs so I can use it later on in the script.
I have three questions:
How can I compare guess to the list prev_guesses and get a true or false output?
How can I append the guessed string to the list prev_guesses? (I just checked, and it is just concatenating the strings together; I need a list like prev_guesses=("guess1" "guess2"...). I may have solved this with the final edit.)
Does this recursion in guess_creator work?
Associative Arrays
Since you are only interested in »is this word in the list or not?« but not in the order of entries, you could use an associative array (also known as dictionary or hash map) to store your words. Checking whether an entry is in such a map is very fast (time complexity O(1)):
declare -A oldGuesses=([word1]= [word2]= [word3]=)
if [[ "${oldGuesses[$guess]+1}" ]]; then
echo "$guess was already taken"
else
echo "$guess was not taken yet"
fi
You can add an entry to oldGuesses using
oldGuesses["newEntry"]=
Don't worry about the empty right hand side. Maps are normally used to store key-value pairs. Here we only use the keys (the things which are written inside the []).
Avoiding the list of guesses completely
You mentioned that you want to bruteforce and that the list could grow up to 4M entries. I would advise against using bash, but even more against storing all guesses at all (no matter what language you are using).
Instead, enumerate all possible guesses in an ordered way:
You want to create guesses which are five concatenated words?
Just create five for-loops:
for w1 in "${dict[@]}"; do
  for w2 in "${dict[@]}"; do
    for w3 in "${dict[@]}"; do
      for w4 in "${dict[@]}"; do
        for w5 in "${dict[@]}"; do
          guess="$w1$w2$w3$w4$w5"
          # do something with your guess here
        done
      done
    done
  done
done
Benefits of this approach over your old approach:
Don't have to store 4M guesses.
Don't have to search through 4M guesses whenever taking a new guess.
Guarantees that the same guess is not picked over and over again.
Terminates when all possible guesses are made.
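Since the advice above is to avoid bash for this, the same ordered enumeration could be sketched in Python as follows. This is only an illustration under assumptions: the three-word list is a placeholder for the real 21-word dictionary, and all_guesses is a hypothetical name.

```python
from itertools import islice, product

words = ["word1", "word2", "word3"]  # placeholder for the real dictionary


def all_guesses(words, length=5):
    """Lazily yield every concatenation of `length` words, in a fixed order."""
    for combo in product(words, repeat=length):
        yield "".join(combo)


# Nothing is stored: guesses are produced one at a time, on demand.
first_three = list(islice(all_guesses(words), 3))
print(first_three)
```

Because the generator walks the combinations in order, no guess is produced twice and the loop ends once every combination has been tried.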
There's nothing built in for that with plain bash arrays (Socowi's idea of using an associative array is better); you would have to iterate through the list again, or maybe use grep or something similar.
To refer to all the elements of an array you need the syntax ${prev_guesses[*]},
so you can append with something like
prev_guesses=(${prev_guesses[*]} $guess)
Spaces in your words would make it all more complicated.
It should do. BUT....
That's the hard way. If you want to avoid repeating guesses, better to take out each guess from the array when you take it, so you can't take it again.
Easier still is to use the shuf command to do everything:
guess=($( shuf -e ${dict[*]} -n 5))
This shuffles your words and takes the first five.

Need an algorithm that detects diffs between two files for additions and reorders

I am trying to figure out if there are existing algorithms that can detect changes between two files in terms of additions but also reorders. I have an example below:
1 - User1 commit
processes = 1
a = 0
allactive = []
2 - User2 commit
processes = 2
a = 0
allrecords = range(10)
allactive = []
3 - User3 commit
a = 0
allrecords = range(10)
allactive = []
processes = 2
I need to be able to say that for example user1 code is the three initial lines of code, user 2 added the "allrecords = range(10)" part (as well as a number change), and user 3 did not change anything since he/she just reordered the code.
Ideally, at commit 3, I want to be able to look at the code and say that from character 0 to 20 (this is user1's code), 21-25 user2's code, 26-30 user1's code etc.
I know there are two popular algorithms, Longest common subsequence and longest common substring but I am not sure which one can correctly count additions of new code but be able also to identify reorders.
Of course this still leaves out the question of having the same substring existing twice in a text. Are there any other algorithms that are better suited to this problem?
Each "diff" algorithm defines a set of possible code-change edit types, and then (typically) tries to find the smallest set of such changes that explains how the new file resulted from the old. Usually such algorithms are defined purely syntactically; semantics are not taken into account.
So what you want, based on your example, is an algorithm that allows "change line", "insert line", "move line" (and presumably "delete line" [not in your example but necessary for a practical set of edits]). Given this, you ought to be able to define a dynamic programming algorithm to find a smallest set of edits that explains how one file differs from another. Note that this set is defined in terms of edits to whole lines, rather like classical "diff"; of course classical diff does not have "change line" or "move line", which is why you are looking for something else.
You could pick different types of deltas. Your example explicitly noted "number change"; if narrowly interpreted, this is NOT an edit on lines, but rather within lines. Once you start to allow partial line edits, you need to define how much of a partial line edit is allowed ("unit of change"). (Will your edit set allow "change of digit"?)
Our Smart Differencer family of tools defines the set of edits over well-defined sub-phrases of the targeted language; we use formal language grammar (non)terminals as the unit of change. [This makes each member of the family specific to the grammar of some language] Deltas include programmer-centric concepts such as "replace phrase by phrase", "delete listmember", "move listmember", "copy listmember", "rename identifier"; the algorithm operates by computing a minimal tree difference in terms of these operations. To do this, the SmartDifferencer needs (and has) a full parser (producing ASTs) for the language.
You didn't identify the language for your example. But in general, for a language looking like that, the SmartDifferencer would typically report that User2 commit changes were:
Replaced (numeric literal) "1" in line 1 column 13 by "2"
Inserted (statement) "allrecords = range(10)" after line 2
and that User3 commit changes were:
Move (statement) at line 1 after line 4
If you know who contributed the original code, then with the edits you can straightforwardly determine who contributed which part of the final answer. You have to decide the unit of reporting; e.g., whether you want to report such contributions on a line-by-line basis for easy readability, or whether you really want to track that Mary wrote the code but Joe modified the number.
Detecting that User3's change is semantically null can't be done with a purely syntax-driven diff tool of any kind. To do this, the tool has to compute the syntactic deltas somehow, and then compute the side effects of all statements (well, "phrases"), which requires a full static analyzer of the language to interpret the deltas and see whether they have such null effects. Such a static analyzer requires a parser anyway, so it makes sense to do a tree-based differencer, but it also requires a lot more than just a parser. [We have such language front ends and have considered building such tools, but haven't gotten there yet.]
Bottom line: there is no simple algorithm for determining "that user3 did not change anything". There is reasonable hope that such tools can be built.
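As a much cruder illustration than the tree-based approach described above, here is a minimal Python sketch of whole-line attribution that tolerates reorders. It is a toy under stated assumptions, not the Smart Differencer algorithm: a line keeps its earlier author only if identical text existed anywhere in the previous commit, any edited line counts as new, and attribute_lines is a hypothetical name.

```python
from collections import defaultdict, deque


def attribute_lines(commits):
    """Return [(line, author)] for the last commit.

    commits is a chronological list of (author, lines) pairs.
    """
    previous = defaultdict(deque)  # line text -> authors of each copy in the prior commit
    attributed = []
    for author, lines in commits:
        current = defaultdict(deque)
        attributed = []
        for line in lines:
            # Reuse the earlier author if an identical line existed anywhere
            # in the previous commit; otherwise credit the current author.
            owner = previous[line].popleft() if previous[line] else author
            current[line].append(owner)
            attributed.append((line, owner))
        previous = current
    return attributed


commits = [
    ("User1", ["processes = 1", "a = 0", "allactive = []"]),
    ("User2", ["processes = 2", "a = 0", "allrecords = range(10)", "allactive = []"]),
    ("User3", ["a = 0", "allrecords = range(10)", "allactive = []", "processes = 2"]),
]

result = attribute_lines(commits)
for line, owner in result:
    print(f"{owner}: {line}")
```

Note the limitation: "processes = 2" is credited entirely to User2, because at whole-line granularity an edited line is indistinguishable from a new one. Tracking that User1 wrote the line and User2 changed only the number needs a finer unit of change, as discussed above.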

Python3 Make tie-breaking lambda sort more pythonic?

As an exercise in python lambdas (just so I can learn how to use them more properly) I gave myself an assignment to sort some strings based on something other than their natural string order.
I scraped apache for version number strings and then came up with a lambda to sort them based on numbers I extracted with regexes. It works, but I think it can be better I just don't know how to improve it so it's more robust.
from lxml import html
import requests
import re
# Send GET request to page and parse it into a list of html links
jmeter_archive_url='https://archive.apache.org/dist/jmeter/binaries/'
jmeter_archive_get=requests.get(url=jmeter_archive_url)
page_tree=html.fromstring(jmeter_archive_get.text)
list_of_links=page_tree.xpath('//a[@href]/text()')
# Filter out all the non-md5s. There are a lot of links, and ultimately
# it's more data than needed for this exercise
jmeter_md5_list=list(filter(lambda x: x.endswith('.tgz.md5'), list_of_links))
# Here's where the 'magic' happens. We use two different regexes to rip the first
# and then the second number out of the string and turn them into integers. We
# then return them in the order we grabbed them, allowing us to tie break.
jmeter_md5_list.sort(key=lambda val: (int(re.search('(\d+)\.\d+', val).group(1)), int(re.search('\d+\.(\d+)', val).group(1))))
print(jmeter_md5_list)
This does have the desired effect. The output is:
['jakarta-jmeter-2.5.1.tgz.md5', 'apache-jmeter-2.6.tgz.md5', 'apache-jmeter-2.7.tgz.md5', 'apache-jmeter-2.8.tgz.md5', 'apache-jmeter-2.9.tgz.md5', 'apache-jmeter-2.10.tgz.md5', 'apache-jmeter-2.11.tgz.md5', 'apache-jmeter-2.12.tgz.md5', 'apache-jmeter-2.13.tgz.md5']
So we can see that the strings are sorted into an order that makes sense. Lowest version first and highest version last. Immediate problems that I see with my solution are two-fold.
First, we have to create two different regexes to get the numbers we want instead of just capturing groups 1 and 2. Mainly because I know there are no multiline lambdas, I don't know how to reuse a single regex object instead of creating a second.
Secondly, this only works as long as the version numbers are two numbers separated by a single period. The first element is 2.5.1, which is sorted into the correct place but the current method wouldn't know how to tie break for 2.5.2, or 2.5.3, or for any string with an arbitrary number of version points.
So it works, but there's got to be a better way to do it. How can I improve this?
This is not a full answer, but it will get you far along the road to one.
The return value of the key function can be a tuple, and tuples sort naturally. You want the output from the key function to be:
((2, 5, 1), 'jakarta-jmeter')
((2, 6), 'apache-jmeter')
etc.
Do note that this is a poor use case for a lambda regardless.
Originally, I came up with this:
jmeter_md5_list.sort(key=lambda val: list(map(int, re.compile('(\d+(?!$))').findall(val))))
However, based on Ignacio Vazquez-Abrams's answer, I made the following changes.
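def sortable_key_from_string(value):
    # Every run of digits that does not end the string, e.g. (2, 5, 1)
    version_tuple = tuple(map(int, re.compile(r'(\d+(?!$))').findall(value)))
    # Leading non-digit prefix, e.g. 'apache-jmeter-', used as the tie-breaker
    match = re.match(r'^(\D+)', value)
    version_name = ''
    if match:
        version_name = match.group(1)
    return (version_tuple, version_name)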
and this:
jmeter_md5_list.sort(key=sortable_key_from_string)
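For comparison, here is a minimal sketch of the tuple-key idea that uses one compiled regex and handles any number of version components. It assumes every filename ends in a dotted version followed by .tgz.md5, as in the filtered list above; version_key and VERSION_RE are hypothetical names, and the sample list is a subset of the output shown in the question.

```python
import re

# One pattern captures the whole dotted version just before ".tgz.md5".
VERSION_RE = re.compile(r"-(\d+(?:\.\d+)*)\.tgz\.md5$")


def version_key(filename):
    """Return (version tuple, filename) so ties fall back to the name."""
    match = VERSION_RE.search(filename)
    if match is None:
        # Assumption: names without a recognizable version sort first.
        return ((), filename)
    version = tuple(int(part) for part in match.group(1).split("."))
    return (version, filename)


names = [
    "apache-jmeter-2.10.tgz.md5",
    "jakarta-jmeter-2.5.1.tgz.md5",
    "apache-jmeter-2.9.tgz.md5",
    "apache-jmeter-2.6.tgz.md5",
]
ordered = sorted(names, key=version_key)
print(ordered)
```

Because tuples compare element by element, (2, 5, 1) sorts before (2, 6) and (2, 9) before (2, 10) without any special casing.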

Parsing one large array into several sub-arrays

I have a list of adjectives (found here), that I would like to be the basis for a "random_adjective(category)" method.
I'm really just taking a stab at this, as my first real attempt at a useful program.
Step 1: Open file, remove formatting. No problem.
list=File.read('adjectivelist')
list.gsub(/\n/, " ")
The next step is to break the string up by category..
list.split(" ")
Now I have an array of every word in the file. Neat. The ones with a tilde before them represent the category names.
Now I would like to break up this LARGE array into several smaller ones, based on category.
I need help with the syntax here, although the pseudocode for this would be something like
Scan the array for an element which begins with a tilde.
Now create a new array based on the name of that element sans the tilde, and ALSO place this "category name" into the "categories" array. Now pull all the elements from the main array, and pop them into the sub-array, until you meet another tilde. Then repeat the process until there are no more elements in the array.
Finally I would pull a random word from the category named in the parameter. If there was no category name matching the parameter, it would return false and exit (this is simply in case I want to add more categories later.)
Tips would be appreciated
You may want to go back and split first time around like this:
categories = list.split(" ~")
Then each list item will start with the category name. This will save you having to go back through your data structure as you suggest. Consider that a tip: sometimes it's better to rethink the start of a coding problem than to head inexorably forwards.
The structure you are reaching towards is probably a Hash, where the keys are category names, and the values are arrays of all the matching adjectives. It might look like this:
{
'category' => [ 'word1', 'word2', 'word3' ]
}
So you might do this:
words_in_category = Hash.new
categories.each do |category_string|
  cat_name, *words = category_string.split(" ")
  words_in_category[cat_name] = words
end
Finally, to pick a random element from an array, Ruby provides a very useful method sample, so you can just do this
words_in_category[ chosen_category ].sample
. . . assuming chosen_category contains the string name of an actual category. I'll leave it to you to figure out how to put this all together and handle errors, bad input etc
Use slice_before:
categories = list.split(" ").slice_before(/~\w+/)
This will create a sub-array for each word starting with ~, containing that word and all the words before the next match.
If this file format is your own and you are free to change it, then I recommend you save the data in YAML or JSON format and read it when needed. There are libraries to do this, so there is no need to worry about the parsing mess. Don't spend time reinventing the wheel.
