Understanding retainAll of Set: which is the specified set?

I'm having trouble understanding how to use retainAll in Java. Its function is to compute the intersection of two sets A and B, so that the resulting set contains exactly the elements common to both. According to the Javadocs, retainAll():
"Retains only the elements in this set that are contained in the specified collection (optional operation). In other words, removes from this set all of its elements that are not contained in the specified collection."
For sets A and B and the call a.retainAll(b), which is the specified collection? Is it the argument passed to the method? The textbook is not clear on this point.

The specified collection is B, the argument passed to the method. "This set" is A, since it is the set the method is called on. After a.retainAll(b), A holds the intersection and B is unchanged.
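The receiver/argument split can be seen by running a small experiment. The sketch below is only a rough analogue in Python, not the Java API: it assumes that Python's set.intersection_update plays the role of retainAll (it mutates the set it is called on and leaves the argument alone). Unlike retainAll, it does not return a boolean.

```python
a = {1, 2, 3, 4}   # plays the role of "this set"
b = {3, 4, 5}      # plays the role of "the specified collection"

# Rough analogue of a.retainAll(b): only `a` is modified.
a.intersection_update(b)

print(sorted(a))  # [3, 4]
print(sorted(b))  # [3, 4, 5]
```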

Related

Parse multiple array similarity query

I am working on an algorithm that will compare two objects, object 1 and object 2. Each object has five attributes, each of which is an array: A, B, C, D, and E.
For the two objects to match, at least one item from object 1's A must be in object 2's A, AND at least one item from object 1's B must be in object 2's B, and so on through E. The more matches there are in each array A-E, the higher the score the match will produce.
Am I going to have to pull object 1 and object 2 and then do an O(n^2) search on each array to determine which items exist in both arrays? I would then compute a score from the number of matches in each array and add them up to get the total score.
I feel like there has to be a better option for this, especially for Parse.com.
Maybe I am going about this problem all wrong. Can someone please help me with this? I would provide some code, but I have not started coding yet because I cannot wrap my head around the best way to design it. The two object databases are already in place, though.
Thanks!
As I said, I may be thinking of this problem in the wrong way. If I am unclear about anything that I am trying to do, let me know and I will update accordingly.
Simplest solution:
Copy all elements from an array of object 1 into a hash table (unordered map), then iterate over the corresponding array of object 2 and look up each element in the map. The expected time complexity is then O(N) per pair of arrays.
Smart solution:
Store the elements of each object not in plain arrays but in arrays structured as hash tables (for example, with double hashing). Then all arrays in objects 1 and 2 are already pre-indexed, and all you need to do is iterate over the array with fewer elements and look each element up in the larger pre-indexed array.
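A minimal sketch of the hash-based approach, assuming each object is represented as a dict mapping the attribute names "A" to "E" to lists (that representation and the name match_score are hypothetical, not anything Parse provides). It counts distinct common items per attribute and returns 0 as soon as one attribute has no overlap.

```python
ATTRIBUTES = ("A", "B", "C", "D", "E")

def match_score(obj1, obj2):
    """Return the total number of shared items, or 0 if any attribute has none."""
    total = 0
    for name in ATTRIBUTES:
        # Building the sets is O(n); the intersection is expected O(min(n, m)).
        common = set(obj1[name]) & set(obj2[name])
        if not common:
            return 0  # every attribute must have at least one match
        total += len(common)
    return total

obj1 = {"A": [1, 2], "B": ["x"], "C": [7, 8, 9], "D": ["p", "q"], "E": [0]}
obj2 = {"A": [2, 3], "B": ["x", "y"], "C": [9, 8], "D": ["q"], "E": [0, 1]}
print(match_score(obj1, obj2))  # 6
```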

Disjoint Set Operation Find_Set(x) using linked list

It's about the naive Union-Find algorithm using the linked-list representation of disjoint sets:
The Find_Set(x) operation returns a pointer to the representative of the set containing element x. This is said to require O(1) time, since the node containing x has a pointer directly to the representative of x. But first we need to find the node containing element x among all the disjoint sets, and that search is not O(1). I don't understand how Find_Set(x) is O(1) (as given in books) when we don't know which disjoint set the node containing x belongs to.
Each element is assumed to contain a pointer/reference to the set it belongs to (the set can actually be represented by one of its member elements). So when querying Find_Set(x), since you already have the element x, you simply have to consult this pointer/reference, and the operation is O(1). With a linked-list implementation, where each set is stored as a linked list of elements, each element holds a pointer to the head of the linked list, which is chosen as the representative element of the set.
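A minimal sketch of that representation, assuming the caller keeps a handle to each node (here a plain dict from value to node, which is an illustrative choice and not part of the textbook structure). The names Node, make_set, find_set and union are hypothetical. Union appends the shorter list to the longer one and rewrites the representative pointers of the moved nodes.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.rep = self    # pointer to the representative (head of the list)
        self.tail = self   # only meaningful on the head node
        self.size = 1      # only meaningful on the head node

def make_set(value):
    return Node(value)

def find_set(node):
    # O(1): just follow the stored pointer, no searching.
    return node.rep

def union(x, y):
    a, b = find_set(x), find_set(y)
    if a is b:
        return a
    if a.size < b.size:
        a, b = b, a          # always append the shorter list to the longer one
    a.tail.next = b
    a.tail = b.tail
    a.size += b.size
    node = b
    while node is not None:  # repoint every moved node at the new head
        node.rep = a
        node = node.next
    return a

# The caller holds the nodes, so locating "the node containing x" is a dict lookup.
nodes = {v: make_set(v) for v in "abcd"}
union(nodes["a"], nodes["b"])
union(nodes["b"], nodes["c"])
print(find_set(nodes["c"]) is find_set(nodes["a"]))  # True
print(find_set(nodes["d"]) is nodes["d"])            # True
```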

Why are Python sets not considered sequences?

In the Python documentation for versions 2.x it says explicitly that there are seven sequence data types. The docs go on to discuss sets and tuples some time later (on the same page), both of which are not included in the above seven. Does anyone know what exactly defines a sequence type? My intuitive definition has sets and tuples fitting the bill quite nicely, and I haven't had any luck finding an explicit official definition.
Thanks!
The word "sequence" implies an order, but sets are not in a specific order.
Element index is a fundamental notion for Python sequences. If you look at the table of sequence operations, you'll see a few that work directly with indices:
s[i]        ith item of s, origin 0
s[i:j]      slice of s from i to j
s[i:j:k]    slice of s from i to j with step k
s.index(i)  index of the first occurrence of i in s
Sets and dictionaries have no notion of an element index, and therefore can't be considered sequences.
In mathematics, informally speaking, a sequence is an ordered list of objects (or events). Like a set, it contains members (also called elements, or terms). The number of ordered elements (possibly infinite) is called the length of the sequence. Unlike a set, order matters, and exactly the same elements can appear multiple times at different positions in the sequence. Most precisely, a sequence can be defined as a function whose domain is a countable totally ordered set, such as the natural numbers.
http://en.wikipedia.org/wiki/Sequence
;)
See the Python glossary:
Sequence
An iterable which supports efficient element access using integer indices via the __getitem__() special method and defines a len() method that returns the length of the sequence. Some built-in sequence types are list, str, tuple, and unicode. Note that dict also supports __getitem__() and __len__(), but is considered a mapping rather than a sequence because the lookups use arbitrary immutable keys rather than integers.
Tuples are sequences. Sets aren't sequences: they have no order and they can't be indexed via set[index]; they don't even have any notion of indices. (They are iterable, though, so you can iterate over their items.)
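The difference is easy to check interactively. The sketch below assumes Python 3 (the question refers to the 2.x docs), where the abstract base class lives in collections.abc; supports_indexing is a hypothetical helper written for this illustration and assumes a non-empty container.

```python
from collections.abc import Sequence

t = (10, 20, 30)
s = {10, 20, 30}

print(isinstance(t, Sequence))   # True
print(isinstance(s, Sequence))   # False
print(t[0], t[1:], t.index(20))  # 10 (20, 30) 1

def supports_indexing(obj):
    """Return True if obj[0] is allowed (assumes obj is non-empty)."""
    try:
        obj[0]
    except TypeError:  # e.g. "'set' object is not subscriptable"
        return False
    return True

print(supports_indexing(t))  # True
print(supports_indexing(s))  # False
```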

Can a set have duplicate elements?

I have been asked a question that is a little ambiguous for my coursework.
The array of strings is regarded as a set, i.e. unordered.
I'm not sure whether I need to remove duplicates from this array?
I've tried googling but one place will tell me something different to the next. Any help would be appreciated.
From Wikipedia, Set (mathematics):
A set is a collection of well defined and distinct objects.
Perhaps the confusion derives from the fact that a set does not depend on the way its elements are displayed. A set remains the same if its elements are written more than once or rearranged.
As such, the programming languages I know would not put an element into a set if the element already belongs to it, or they would replace it if it already exists, but would never allow a duplication.
Programming Language Examples
Let me offer a few examples in different programming languages.
In Python
A set in Python is defined as "an unordered collection of unique elements". If you declare a set like a = {1,2,2,3,4}, it will only add 2 once to the set.
If you do print(a), the output will be {1, 2, 3, 4}.
Haskell
In Haskell the insert operation of sets is defined as: "[...] if the set already contains an element equal to the given value, it is replaced with the new value."
As such, if you do let a = fromList([1,2,2,3,4]) (with fromList from Data.Set) and print a to the main output, it would render fromList [1,2,3,4].
Java
In Java sets are defined as: "a collection that contains no duplicate elements.". Its add operation is defined as: "adds the specified element to this set if it is not already present [...] If this set already contains the element, the call leaves the set unchanged".
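// needs java.util.Arrays, java.util.HashSet and java.util.Set
Set<Integer> myInts = new HashSet<>(Arrays.asList(1, 2, 2, 3, 4));
System.out.println(myInts);
This code, as in the other examples, prints only the four distinct elements, typically as [1, 2, 3, 4] (a HashSet does not guarantee iteration order).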
A set cannot have duplicate elements by its mere definition. The correct structure to allow duplicate elements is Multiset or Bag:
In mathematics, a multiset (or bag) is a generalization of the concept of a set that, unlike a set, allows multiple instances of the multiset's elements. For example, {a, a, b} and {a, b} are different multisets although they are the same set. However, order does not matter, so {a, a, b} and {a, b, a} are the same multiset.
A very common and useful example of a multiset in programming is the collection of values of an object:
values({a: 1, b: 1}) //=> Multiset(1,1)
The values here are unordered, yet they cannot be reduced to Set(1), because that would, for example, break iteration over the object's values.
Further, quoting from the linked Wikipedia article (see there for the references):
Multisets have become an important tool in databases.[18][19][20] For instance, multisets are often used to implement relations in database systems. Multisets also play an important role in computer science.
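For a concrete feel of the set/multiset distinction, here is a small sketch that uses Python's collections.Counter as a stand-in for a multiset (an assumption for illustration; Counter is a dict of counts rather than a dedicated multiset type).

```python
from collections import Counter

bag1 = Counter(["a", "a", "b"])   # {a, a, b}
bag2 = Counter(["a", "b"])        # {a, b}
bag3 = Counter(["a", "b", "a"])   # {a, b, a}

print(bag1 == bag2)  # False: different multisets
print(bag1 == bag3)  # True: order does not matter
print(set(bag1.elements()) == set(bag2.elements()))  # True: the same set

# The values of {a: 1, b: 1} form the multiset (1, 1), not the set {1}.
values = Counter({"a": 1, "b": 1}.values())
print(values[1])  # 2
```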
Let A={1,2,2,3,4,5,6,7,...} and B={1,2,3,4,5,6,7,...} then any element in A is in B and any element in B is in A ==> A contains B and B contains A ==> A=B. So of course sets can have duplicate elements, it's just that the one with duplicate elements would end up being exactly the same as the one without duplicate elements.
"Sets are Iterables that contain no duplicate elements."
https://docs.scala-lang.org/overviews/collections/sets.html

How to find the URLs that are in set A but not in set B

There are two sets of URLs, A and B, each containing millions of URLs. How can I get the URLs from A that are not in B? What is the best method?
Note: you can use any technique and any tools, such as a database, MapReduce, hash codes, etc. The solution should be both memory-efficient and time-efficient, keeping in mind that each set (A and B) has millions of URLs.
A decent algorithm might be:
1. Load all of set A into a hashmap, O(a).
2. Traverse set B and, for each item, delete the identical value from the hashmap (the copy of set A) if it exists, O(b).
Then your hashmap holds the result. This would be O(a+b), where a is the size of set A and b is the size of set B. (In practice, this is multiplied by the hash time, which is ideally about O(1) for a good hash.)
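The hashmap approach above could be sketched as follows, assuming both collections fit in memory and using a Python set as the hash table (the function name only_in_a is hypothetical).

```python
def only_in_a(urls_a, urls_b):
    """Return the URLs of A that do not occur in B, in expected O(a + b) time."""
    remaining = set(urls_a)        # load all of A into a hash table
    for url in urls_b:             # stream through B once
        remaining.discard(url)     # no error if the URL is not present
    return remaining

a = ["http://x.example/1", "http://x.example/2", "http://x.example/3"]
b = ["http://x.example/2", "http://x.example/4"]
print(sorted(only_in_a(a, b)))  # ['http://x.example/1', 'http://x.example/3']
```

Because B is only iterated, it can be read lazily from a file, so only A has to be held in memory.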
Something perhaps a little naive might be a procedure like:
1. Sort list A.
2. Sort list B.
3. Walk lists A and B together, with one pointer into each, such that:
a. When the current elements match, advance both pointers (the URL is in both sets).
b. When the current element of B sorts before the current element of A, advance the pointer into B (this rule discards elements of B that are not in A).
c. When the current element of B sorts after the current element of A, or B is exhausted, the current element of A is not in B: record it and advance the pointer into A.
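The sort-and-walk procedure above might look like this as a sketch. The function name sorted_difference is hypothetical, and on a match it advances only the pointer into A so that duplicate entries in A are also handled; the pointer into B catches up on later iterations.

```python
def sorted_difference(urls_a, urls_b):
    """Return the elements of A that are not in B, using two sorted lists."""
    a = sorted(urls_a)
    b = sorted(urls_b)
    i = j = 0
    result = []
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            result.append(a[i])   # B has moved past a[i], so a[i] is not in B
            i += 1
        elif a[i] == b[j]:
            i += 1                # a[i] is in B; skip it
        else:
            j += 1                # b[j] is not in A; discard it
    return result

print(sorted_difference([5, 1, 3], [4, 3, 2]))  # [1, 5]
```

Sorting dominates the cost at O(a log a + b log b), but the walk itself needs only sequential access, which suits data that is sorted on disk.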
This might actually be an interesting place to apply Bloom filters: construct a Bloom filter for set B, then for every URL in set A test whether it is in the filter. A Bloom filter has no false negatives, so every URL reported as absent is definitely not in B; with a small, tunable probability of false positives, a few URLs of A that are not in B may be missed.
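A minimal Bloom filter sketch for this use, written from scratch rather than taken from any library. The class and function names are hypothetical, the sizing uses the standard formulas for a target false-positive rate, and the k bit positions are derived from one SHA-256 digest by double hashing, which is just one reasonable choice.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, expected_items, false_positive_rate=0.01):
        # m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hash functions
        n, p = expected_items, false_positive_rate
        self.num_bits = max(8, math.ceil(-n * math.log(p) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step odd
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def probably_only_in_a(urls_a, urls_b):
    """URLs of A that are certainly not in B (a few may be missed)."""
    urls_b = list(urls_b)
    bloom = BloomFilter(expected_items=max(1, len(urls_b)))
    for url in urls_b:
        bloom.add(url)
    return [url for url in urls_a if url not in bloom]
```

The filter needs roughly 10 bits per URL of B at a 1% false-positive rate, regardless of URL length, which is where the memory saving comes from. If an exact answer is required, the candidates the filter reports as present in B still have to be checked against B itself.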
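If the URLs are stored one per line in files A and B, a shell one-liner also works:
(sort -u A; cat B B) | sort | uniq -u
Every line of B appears at least twice in the combined stream, so uniq -u drops it, while a line that occurs only in A appears exactly once (after sort -u) and is kept.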
