Can a set have duplicate elements? - set

I have been asked a question that is a little ambiguous for my coursework.
The array of strings is regarded as a set, i.e. unordered.
I'm not sure whether I need to remove duplicates from this array?
I've tried googling but one place will tell me something different to the next. Any help would be appreciated.

From Wikipedia in Set (Mathematics)
A set is a collection of well defined and distinct objects.
Perhaps the confusion derives from the fact that a set does not depend on the way its elements are displayed. A set remains the same if its elements are allegedly repeated or rearranged.
As such, the programming languages I know would not put an element into a set if the element already belongs to it, or they would replace it if it already exists, but would never allow a duplication.
Programming Language Examples
Let me offer a few examples in different programming languages.
In Python
A set in Python is defined as "an unordered collection of unique elements". And if you declare a set like a = {1,2,2,3,4} it will only add 2 once to the set.
If you do print(a) the output will be {1,2,3,4}.
Haskell
In Haskell the insert operation of sets is defined as: "[...] if the set already contains an element equal to the given value, it is replaced with the new value."
As such, if you do this: let a = fromList([1,2,2,3,4]), if you print a to the main ouput it would render [1,2,3,4].
Java
In Java sets are defined as: "a collection that contains no duplicate elements.". Its add operation is defined as: "adds the specified element to this set if it is not already present [...] If this set already contains the element, the call leaves the set unchanged".
Set<Integer> myInts = new HashSet<>(asList(1,2,2,3,4));
System.out.println(myInts);
This code, as in the other examples, would ouput [1,2,3,4].

A set cannot have duplicate elements by its mere definition. The correct structure to allow duplicate elements is Multiset or Bag:
In mathematics, a multiset (or bag) is a generalization of the concept of a set that, unlike a set, allows multiple instances of the multiset's elements. For example, {a, a, b} and {a, b} are different multisets although they are the same set. However, order does not matter, so {a, a, b} and {a, b, a} are the same multiset.
A very common and useful example of a Multiset in programming is the collection of values of an object:
values({a: 1, b: 1}) //=> Multiset(1,1)
The values here are unordered, yet cannot be reduced to Set(1) that would e.g. break the iteration over the object values.
Further, quoting from the linked Wikipedia article (see there for the references):
Multisets have become an important tool in databases.[18][19][20] For instance, multisets are often used to implement relations in database systems. Multisets also play an important role in computer science.

Let A={1,2,2,3,4,5,6,7,...} and B={1,2,3,4,5,6,7,...} then any element in A is in B and any element in B is in A ==> A contains B and B contains A ==> A=B. So of course sets can have duplicate elements, it's just that the one with duplicate elements would end up being exactly the same as the one without duplicate elements.

"Sets are Iterables that contain no duplicate elements."
https://docs.scala-lang.org/overviews/collections/sets.html

Related

Parse multiple array similarity query

I am working on an algorithm that will compare 2 objects, object 1 and object 2. Each object has attributes which are 5 different arrays, array A, B, C, D, and E.
In order for the two objects to be a match, at least one item from Object 1 A must be in Object 2 A AND Object 1 B must be in Object 2 B, etc through object E must be similar. With a higher number of matches in each array A-E, the higher of a score The match will produce.
Am I going to have to pull Object 1 and object 2 then do an n^2 complexity search on each array to determine which ones exist in both arrays? Then I would go about serving a score by how many matches there were in each array, then add them up and the total would give me the score.
I feel like there has to be a better option for this, especially for Parse.com
Maybe I am going about this problem all wrong, can someone PLEASE help me with this problem. I would provide some code for this one, but I have not started the code yet because I cannot wrap my head around the best way to design it. The two object database are in place already though.
Thanks!
As I said, I may be thinking of this problem in the wrong way. If I am unclear about anything that I am trying to do, let me know and I will update accordingly.
Simplest solution:
Copy all elements from some array object1 to hash table (unordered map), and thereafter iterate array in the 2nd object, and lookup presence in the map. Thus, time complexity is O(N).
Smart solution:
Keep elements in all objects not in the "naive arrays", but in the arrays, structured as hash tables with double hashing algorithm. If so, all arrays in an objects1, 2, already will be pre-indexed, and what is you needed - iterate array, contains less number of elements, and match elements vs longest pre-indexed array.

Why are Python sets not considered sequences?

In the python documentation for versions 2.x it says explicitly that there are seven sequence data types. The docs go on to discuss sets and tuples some time later (on the same page), both of which are not included in the above seven. Does anyone know what exactly makes defines a sequence type? My intuited definition has sets and tuples fitting the bill quite nicely, and I haven't had any luck finding an explicit official definition.
Thanks!
The word "sequence" implies an order, but sets are not in a specific order.
Element index is a fundamental notion for Python sequences. If you look at the table of sequence operations, you'll see a few that work directly with indices:
s[i] ith item of s, origin 0 (3)
s[i:j] slice of s from i to j (3)(4)
s[i:j:k] slice of s from i to j with step k (3)(5)
s.index(i) index of the first occurence of i in s
Sets and dictionaries have no notion of an element index, and therefore can't be considered sequences.
In mathematics, informally speaking, a sequence is an ordered list of objects (or events). Like a set, it contains members (also called elements, or terms). The number of ordered elements (possibly infinite) is called the length of the sequence. Unlike a set, order matters, and exactly the same elements can appear multiple times at different positions in the sequence. Most precisely, a sequence can be defined as a function whose domain is a countable totally ordered set, such as the natural numbers.
http://en.wikipedia.org/wiki/Sequence
;)
See the Python glossary:
Sequence
An iterable which supports efficient element access using integer indices via the __getitem__() special method and defines a len() method that returns the length of the sequence. Some built-in sequence types are list, str, tuple, and unicode. Note that dict also supports __getitem__() and __len__(), but is considered a mapping rather than a sequence because the lookups use arbitrary immutable keys rather than integers.
Tuples are sequences. Sets aren't sequences - they have no order and they can't be indexed via set[index] - they even don't have any kind of notion of indices. (They are iterable, though - you can iterate over their items.)

How to determine correspondence between two lists of names?

I have:
1 million university student names and
3 million bank customer names
I manage to convert strings into numerical values based on hashing (similar strings have similar hash values). I would like to know how can I determine correlation between these two sets to see if values are pairing up at least 60%?
Can I achieve this using ICC? How does ICC 2-way random work?
Please kindly answer ASAP as I need this urgently.
This kind of entity resolution etc is normally easy, but I am surprised by the hashing approach here. Hashing loses information that is critical to entity resolution. So, if possible, you shouldn't use hash, rather the original strings.
Assuming using original strings is an option, then you would want to do something like this:
List A (1M), List B (3M)
// First, match the entities that match very well, and REMOVE them.
for a in List A
for b in List B
if compare(a,b) >= MATCH_THRESHOLD // This may be 90% etc
add (a,b) to matchedList
remove a from List A
remove b from List B
// Now, match the entities that match well, and run bipartite matching
// Bipartite matching is required because each entity can match "acceptably well"
// with more than one entity on the other side
for a in List A
for b in List B
compute compare(a,b)
set edge(a,b) = compare(a,b)
If compare(a,b) < THRESHOLD // This seems to be 60%
set edge(a,b) = 0
// Now, run bipartite matcher and take results
The time complexity of this algorithm is O(n1 * n2), which is not very good. There are ways to avoid this cost, but they depend upon your specific entity resolution function. For example, if the last name has to match (to make the 60% cut), then you can simply create sublists in A and B that are partitioned by the first couple of characters of the last name, and just run this algorithm between corresponding list. But it may very well be that last name "Nuth" is supposed to match "Knuth", etc. So, some local knowledge of what your name comparison function is can help you divide and conquer this problem better.

What is determining the items that make the difference between two arrays called?

I want to find which elements of two arrays make the two arrays different.
For example, if I start off with
known_unacceptable_array = [bad, bad, good, good, good, bad, good]
known_acceptable_array = []
and an array is only unacceptable if there's three bads (but I don't know that at the time), but I'm able to evaluate whether an array is acceptable or unacceptable, I would like to find the smallest array that makes the array unacceptable
possibly_minimal_unacceptable = [bad, bad, bad]
maximal_acceptable = [bad, bad] # Third bad required to make the array unacceptable
What is this problem called, and what algorithms are there for this?
Edit: The elements can't be changed in order, and adding an element can only either change the list from being acceptable to unacceptable or have no effect - it can't change it from being unacceptable to acceptable.
Background: I've randomly generated thousands of instructions that make a ruby interpreter crash, and I want to isolate the specific instructions that cause it to crash, and at the time I thought that multiple bad instructions were required to make it crash. A very naive attempt to determine what the bad instructions is at this link
What is determining the elements that make the difference
between two arrays called?
Differencing is often called
subtraction.
I want to determine which elements of two arrays make the
two arrays different.
Again, that's subtraction(at least
some form of it):
Given A ={ x , y , z } B = { x , y a },
A - B = { z , -a }
or "only A has z and only B has a", or "z and a" make them
different.
For example, if I start off with
known_bad = [bad, bad, good, good, good, bad, good] >
known_good = []
Why start with a full array and an empty one? Isn't this an
extreme case, or are these "two arrays" not two of which you
are trying to determine the "difference."
possibly_minimal_bad = [bad, bad, bad]
maximal_good = [bad, bad] # Third bad required to make the list bad
Is this just a set of rules? Or is this the result of
finding the difference between the two arrays of the previous
(known_good,bad) set?
What is this problem called, and what algorithms are there
for this?
If it isn't called "difference" or "subtraction" then why
introduce it that way?
Is the problem: a. going from
the first two arrays (known_xx) to the second two (min,max);
or is it: b. classifying finite sequences of the words "good"
and "bad."
a) I can't see a relation between the first two
arrays and the second two. How did you get from the first two
to the second?
b) Classifying a sequence of words could be
"parsing a language", or decoding a message, recognizing a
pattern, etc.
Is it "Pattern Recognition"?
It appears that you are looking for a pattern in test input(or test point) data and it's relationship to product failure,
and want to represent the relationship in some codical
form for further analysis. Or searching for a correlation between certain test points and product failure. That makes this question rather
interesting. However, the presentation of the question
is quite confusing. Maybe those groups of
equations could be explained a little more, clarifying if they are related,and if so, then: In what way?
I'm not entirely sure if I understand the question. If my answer is unsatisfactory, please rephrase your question to be more clear. I'll base my answer on this.
I want to determine which elements of two arrays make the two arrays different.
This is a combination of the three set operations union, intersection and difference. Different combinations can achieve the same result.
Complement is the the subset of A which is not in B.
Intersection is the set of elements which is both in A and B, but not just A or B.
Union is the subset which is either in A or B (no duplicates).
It sounds like you want the union of both complements, which is:
A\B ∪ B\A
Or the complement between the intersection and the union:
A∩B \ A∪B
See http://en.wikipedia.org/wiki/Set_operations_(Boolean) for more information.

Understanding retainAll of Set- which is the specified set?

I'm having trouble understanding just how to use retainAll in Java. Its function is to create an intersection between sets A,B, where the resulting set has all the common elements between those two sets. And according to the Javadocs, retainAll()
Retains only the elements in this set that are contained in the specified
collection (optional operation). In other words, removes from this set
all of its elements that are not contained in the specified collection
for sets A,B, a.retainAll(b), which is the specified collection? Is it the argument passed to the method? The textbook is not clear on this point.
The specified collection is B. "This set" is A, since it is the set that has the method on it. B is the other "specified collection".

Resources