Efficient data structure to look up co-occurrence within a document - data-structures

I have 10 million documents. Each document is a set of tokens (about 100 unique tokens) regardless of their frequency (bag-of-words). All unique tokens in all documents form the vocabulary V. The size of V is roughly 50000.
Requirement 1:
Given a set of tokens, say (t1, t2) (there may be more than two tokens in practice) where t1,t2 in V, I want to find all tokens that co-occur with t1,t2 in at least one document. That is to say, find all tokens u, satisfying u in V and t1,t2,u constituting a subset of at least one document.
Requirement 2:
Given a set of tokens, find documents that contain these tokens.
Is there any efficient data structure to fulfill my requirements? "Efficienct" means avoid iterating all documents.

You can use a map (usually implemented as either a hashmap or a binary tree) that maps each token to the set of documents that contain this token.
When given two tokens t1,t2, compute the intersection of the two corresponding sets of documents. This gives you the set of documents that contain both t1 and t2. Then return the union of all the tokens contained in these documents.
In python:
from collections import defaultdict
def build_map(documents):
m = defaultdict(set)
for i, document in enumerate(documents):
for token in document:
m[token].add(i)
return m
def cooccuring_with_pair(m, documents, t1, t2):
doc_ids = set.intersection(m[t1], m[t2])
return set().union(*(documents[i] for i in doc_ids))
Testing the python code with a small example:
documents = [set(s.split()) for s in ('the quick brown fox jumps over the lazy dog', 'the grey cat jumps in fright when hearing the dog', 'two brown mice are chased by a dog', 'the quick brown fox grows old and slow')]
m = build_map(documents)
print(m)
# defaultdict(<class 'set'>, {'brown': {0, 2, 3}, 'lazy': {0}, 'fox': {0, 3}, 'the': {0, 1, 3}, 'over': {0}, 'jumps': {0, 1}, 'dog': {0, 1, 2}, 'quick': {0, 3}, 'cat': {1}, 'hearing': {1}, 'fright': {1}, 'grey': {1}, 'when': {1}, 'in': {1}, 'mice': {2}, 'by': {2}, 'a': {2}, 'chased': {2}, 'two': {2}, 'are': {2}, 'and': {3}, 'slow': {3}, 'grows': {3}, 'old': {3}})
for t1, t2 in [('jumps', 'dog'), ('brown', 'fox'), ('cat', 'fox')]:
print(t1, t2)
print(cooccuring_with_pair(m, documents, t1, t2))
print()
# jumps dog
# {'brown', 'lazy', 'fox', 'cat', 'the', 'hearing', 'over', 'jumps', 'dog', 'fright', 'grey', 'when', 'quick', 'in'}
#
# brown fox
# {'brown', 'lazy', 'and', 'fox', 'the', 'slow', 'grows', 'old', 'over', 'jumps', 'dog', 'quick'}
#
# cat fox
# set()

Related

algorithm to get all combinations of splitting up N items into K bins

Assuming I have a list of elements [1,2,3,4,] and a number of bins (let's assume 2 bins), I want to come up with a list of all combinations of splitting up items 1-4 into the 2 bins. Solution should look something like this
[{{1}, {2,3,4}}, {{2}, {1,3,4}}, {{3}, {1,2,4}}, {{4}, {1,2,3}},
{{1,2}, {3,4}}, {{1,3}, {2,4}}, {{1,4}, {2,3}}, {{}, {1, 2, 3, 4}, {{1, 2, 3, 4}, {}}]
Also, order does matter -- I didn't write out all the return values but {{1, 2, 3}, {4}} is a different solution from {{3, 2, 1}, {4}}
A common approach is as follows.
If you have, say, K bins, then add K-1 special values to your initial array. I will use the -1 value assuming that it never occurs in the initial array; you can choose a different value.
So for your example the initial array becomes arr=[1,2,3,4,-1]; if K were, say, 4, the array would be arr=[1,2,3,4,-1,-1,-1].
Now list all the permutations of array arr. For each permutation, treat all -1s as bin separators, so all the elements befors the first -1 go to the first bin (in that particular order), all the elements between first and second -1 go to the second bin, and so on.
For example:
[-1,1,2,3,4] -> {{}, {1,2,3,4}}
[2,1,3,-1,4] -> {{2,3,4}, {4}}
[3,1,2,-1,4] -> {{3,1,2}, {4}}
[1,3,-1,2,4] -> {{1,3}, {2,4}}
and so on.
Generating all permutation is a standard task (see, for example, Permutations in JavaScript?), and splitting an array by -1s should be easy.

Find all partitions from a list of subsets

Given a list of specific subsets like
S = [ {1, 2}, {3, 4}, {1}, {2, 3}, {4}, {3} ]
and a "universe" set like
U = {1, 2, 3, 4}
what elegant and simple algorithm can be used to find all the possible partitions of U made of sets from S? With this example, such partitions include
{1, 2} {3, 4}
{1, 2} {3} {4}
etc.
Use recursion.
Split the problem into two smaller problems based on whether the first element is used or not:
Partition using {1,2} and any of the the remaining sets.
Partition without using {1,2} but using any of the the remaining sets.
These two options cover all possibilities.
The first is solved by partitioning {3,4} using only [ {3, 4}, {1}, {2, 3}, {4}, {3} ].
The second is solved by partitioning {1,2,3,4} using only [ {3, 4}, {1}, {2, 3}, {4}, {3} ].
To see how to solve these smaller problems refer to this similar question.

generate all combintations for list with repeated items

Related to this question, I am wondering the algorithms (and actual code in java/c/c++/python/etc., if you have!) to generate all combinations of r elements for a list with m elements in total. Some of these m elements may be repeated.
Thanks!
recurse for each element type
int recurseMe(list<list<item>> items, int r, list<item> container)
{
if (r == container.length)
{
//print out your collection;
return 1;
}
else if (container.length > score)
{
return 0;
}
list<item> firstType = items[0];
int score = 0;
for(int i = 0; i < firstType.length; i++)
{
score += recurseMe(items without items[0], r, container + i items from firstType);
}
return score;
}
This takes as input a list containing lists of items, assuming each inner list represents a unique type of item. You may have to build a sorting function to feed as input to this.
//start with a list<item> original;
list<list<item>> grouped = new list<list<item>>();
list<item> sorted = original.sort();//use whichever method for this
list<item> temp = null;
item current = null;
for(int x = 0; x < original.length; x++)
if (sorted[x] == current)
{
temp.add(current);
}
else
{
if (temp != null && temp.isNotEmpty)
grouped.add(temp);
temp = new list<item>();
temp.add(sorted[x]);
}
}
if (temp != null && temp.isNotEmpty)
grouped.add(temp);
//grouped is the result
This sorts the list, then creates sublists containing elements that are the same, inserting them into the list of lists grouped
Here is a recursion that I believe is closely related to Jean-Bernard Pellerin's algorithm, in Mathematica.
This takes input as the number of each type of element. The output is in similar form. For example:
{a,a,b,b,c,d,d,d,d} -> {2,2,1,4}
Function:
f[k_, {}, c__] := If[+c == k, {{c}}, {}]
f[k_, {x_, r___}, c___] := Join ## (f[k, {r}, c, #] & /# 0~Range~Min[x, k - +c])
Use:
f[4, {2, 2, 1, 4}]
{{0, 0, 0, 4}, {0, 0, 1, 3}, {0, 1, 0, 3}, {0, 1, 1, 2}, {0, 2, 0, 2},
{0, 2, 1, 1}, {1, 0, 0, 3}, {1, 0, 1, 2}, {1, 1, 0, 2}, {1, 1, 1, 1},
{1, 2, 0, 1}, {1, 2, 1, 0}, {2, 0, 0, 2}, {2, 0, 1, 1}, {2, 1, 0, 1},
{2, 1, 1, 0}, {2, 2, 0, 0}}
An explanation of this code was requested. It is a recursive function that takes a variable number of arguments. The first argument is k, length of subset. The second is a list of counts of each type to select from. The third argument and beyond is used internally by the function to hold the subset (combination) as it is constructed.
This definition therefore is used when there are no more items in the selection set:
f[k_, {}, c__] := If[+c == k, {{c}}, {}]
If the total of the values of the combination (its length) is equal to k, then return that combination, otherwise return an empty set. (+c is shorthand for Plus[c])
Otherwise:
f[k_, {x_, r___}, c___] := Join ## (f[k, {r}, c, #] & /# 0~Range~Min[x, k - +c])
Reading left to right:
Join is used to flatten out a level of nested lists, so that the result is not an increasingly deep tensor.
f[k, {r}, c, #] & calls the function, dropping the first position of the selection set (x), and adding a new element to the combination (#).
/# 0 ~Range~ Min[x, k - +c]) for each integer between zero and the lesser of the first element of the selection set, and k less total of combination, which is the maximum that can be selected without exceeding combination size k.
I'm going to make this an answer rather than a bunch of comments.
My original comment was:
The CombinationGenerator Java class systematically generates all
combinations of n elements, taken r at a time. The algorithm is
described by Kenneth H. Rosen, Discrete Mathematics and Its
Applications, 2nd edition (NY: McGraw-Hill, 1991), pp. 284-286." See
merriampark.com/comb.htm. It has a link to source code.
As you pointed out in your comment, you want unique combinations. So, given the array ["a", "a", "b", "b"], you want it to generate aab, abb. The code I linked generates aab, aab, baa, baa.
With that array, removing duplicates is very easy. Depending on how you implement it, you either let it generate the duplicates and then filter them after the fact (i.e. selecting unique elements from an array), or you modify the code to include a hash table so that when it generates a combination, it checks the hash table before putting the item into the output array.
Looking something up in a hash table is an O(1) operation, so either of those is going to be efficient. Doing it after the fact will be a little bit more expensive, because you'll have to copy items. Still, you're talking O(n), where n is the number of combinations generated.
There is one complication: order is irrelevant. That is, given the array ["a", "b", "a", "b"], the code will generate aba, abb, aab, bab. In this case, aba and aab are duplicate combinations, as are abb and bab, and using a hash table isn't going to remove those duplicates for you. You could, though, create a bit mask for each combination, and use the hash table idea with the bit masks. This would be slightly more complicated, but not terribly so.
If you sort the initial array first, so that duplicate items are adjacent, then the problem goes away and you can use the hash table idea.
There's undoubtedly a way to modify the code to prevent it from generating duplicates. I can see a possible approach, but it would be messy and expensive. It would probably make the algorithm slower than if you just used the hash table idea. The approach I would take:
Sort the input array
Use the linked code to generate the combinations
Use a hash table or some other code to select unique items.
Although ... a thought that occurred to me.
Is it true that if you sort the input array, then any generated duplicates will be adjacent? That is, given the input array ["a", "a", "b", "b"], then the generated output will be aab, aab, abb, abb, in that order. This will be implementation dependent, of course. But if it's true in your implementation, then modifying the algorithm to remove duplicates is a simple matter of checking to see if the current combination is equal to the previous one.

A fast implementation in Mathematica for Position2D

I'm looking for a fast implementation for the following, I'll call it Position2D for lack of a better term:
Position2D[ matrix, sub_matrix ]
which finds the locations of sub_matrix inside matrix and returns the upper left and lower right row/column of a match.
For example, this:
Position2D[{
{0, 1, 2, 3},
{1, 2, 3, 4},
{2, 3, 4, 5},
{3, 4, 5, 6}
}, {
{2, 3},
{3, 4}
}]
should return this:
{
{{1, 3}, {2, 4}},
{{2, 2}, {3, 3}},
{{3, 1}, {4, 2}}
}
It should be fast enough to work quickly on 3000x2000 matrices with 100x100 sub-matrices. For simplicity, it is enough to only consider integer matrices.
Algorithm
The following code is based on an efficient custom position function to find positions of (possibly overlapping) integer sequences in a large integer list. The main idea is that we can first try to eficiently find the positions where the first row of the sub-matrix is in the Flatten-ed large matrix, and then filter those, extracting full sub-matrices and comparing to the sub-matrix of interest. This will be efficient for most cases except very pathological ones (those, for which this procedure would generate a huge number of potential position candidates, while the true number of entries of the sub-matrix would be much smaller. But such cases seem rather unlikely generally, and then, further improvements to this simple scheme can be made).
For large matrices, the proposed solution will be about 15-25 times faster than the solution of #Szabolcs when a compiled version of sequence positions function is used, and 3-5 times faster for the top-level implementation of sequence positions - finding function. The actual speedup depends on matrix sizes, it is more for larger matrices. The code and benchmarks are below.
Code
A generally efficient function for finding positions of a sub-list (sequence)
These helper functions are due to Norbert Pozar and taken from this Mathgroup thread. They are used to efficiently find starting positions of an integer sequence in a larger list (see the mentioned post for details).
Clear[seqPos];
fdz[v_] := Rest#DeleteDuplicates#Prepend[v, 0];
seqPos[list_List, seq_List] :=
Fold[
fdz[#1 (1 - Unitize[list[[#1]] - #2])] + 1 &,
fdz[Range[Length[list] - Length[seq] + 1] *
(1 - Unitize[list[[;; -Length[seq]]] - seq[[1]]])] + 1,
Rest#seq
] - Length[seq];
Example of use:
In[71]:= seqPos[{1,2,3,2,3,2,3,4},{2,3,2}]
Out[71]= {2,4}
A faster position-finding function for integers
However fast seqPos might be, it is still the major bottleneck in my solution. Here is a compiled-to-C version of this, which gives another 5x performance boost to my code:
seqposC =
Compile[{{list, _Integer, 1}, {seq, _Integer, 1}},
Module[{i = 1, j = 1, res = Table[0, {Length[list]}], ctr = 0},
For[i = 1, i <= Length[list], i++,
If[list[[i]] == seq[[1]],
While[j < Length[seq] && i + j <= Length[list] &&
list[[i + j]] == seq[[j + 1]],
j++
];
If[j == Length[seq], res[[++ctr]] = i];
j = 1;
]
];
Take[res, ctr]
], CompilationTarget -> "C", RuntimeOptions -> "Speed"]
Example of use:
In[72]:= seqposC[{1, 2, 3, 2, 3, 2, 3, 4}, {2, 3, 2}]
Out[72]= {2, 4}
The benchmarks below have been redone with this function (also the code for main function is slightly modified )
Main function
This is the main function. It finds positions of the first row in a matrix, and then filters them, extracting the sub-matrices at these positions and testing against the full sub-matrix of interest:
Clear[Position2D];
Position2D[m_, what_,seqposF_:Automatic] :=
Module[{posFlat, pos2D,sp = If[seqposF === Automatic,seqposC,seqposF]},
With[{dm = Dimensions[m], dwr = Reverse#Dimensions[what]},
posFlat = sp[Flatten#m, First#what];
pos2D =
Pick[Transpose[#], Total[Clip[Reverse#dm - # - dwr + 2, {0, 1}]],2] &#
{Mod[posFlat, #, 1], IntegerPart[posFlat/#] + 1} &#Last[dm];
Transpose[{#, Transpose[Transpose[#] + dwr - 1]}] &#
Select[pos2D,
m[[Last## ;; Last## + Last#dwr - 1,
First## ;; First## + First#dwr - 1]] == what &
]
]
];
For integer lists, the faster compiled subsequence position-finding function seqposC can be used (this is a default). For generic lists, one can supply e.g. seqPos, as a third argument.
How it works
We will use a simple example to dissect the code and explain its inner workings. This defines our test matrix and sub-matrix:
m = {{0, 1, 2, 3}, {1, 2, 3, 4}, {2, 3, 4, 5}};
what = {{2, 3}, {3, 4}};
This computes the dimensions of the above (it is more convenient to work with reversed dimensions for a sub-matrix):
In[78]:=
dm=Dimensions[m]
dwr=Reverse#Dimensions[what]
Out[78]= {3,4}
Out[79]= {2,2}
This finds a list of starting positions of the first row ({2,3} here) in the Flattened main matrix. These positions are at the same time "flat" candidate positions of the top left corner of the sub-matrix:
In[77]:= posFlat = seqPos[Flatten#m, First#what]
Out[77]= {3, 6, 9}
This will reconstruct the 2D "candidate" positions of the top left corner of a sub-matrix in a full matrix, using the dimensions of the main matrix:
In[83]:= posInterm = Transpose#{Mod[posFlat,#,1],IntegerPart[posFlat/#]+1}&#Last[dm]
Out[83]= {{3,1},{2,2},{1,3}}
We can then try using Select to filter them out, extracting the full sub-matrix and comparing to what, but we'll run into a problem here:
In[84]:=
Select[posInterm,
m[[Last##;;Last##+Last#dwr-1,First##;;First##+First#dwr-1]]==what&]
During evaluation of In[84]:= Part::take: Cannot take positions 3 through 4
in {{0,1,2,3},{1,2,3,4},{2,3,4,5}}. >>
Out[84]= {{3,1},{2,2}}
Apart from the error message, the result is correct. The error message itself is due to the fact that for the last position ({1,3}) in the list, the bottom right corner of the sub-matrix will be outside the main matrix. We could of course use Quiet to simply ignore the error messages, but that's a bad style. So, we will first filter those cases out, and this is what the line Pick[Transpose[#], Total[Clip[Reverse#dm - # - dwr + 2, {0, 1}]], 2] &# is for. Specifically, consider
In[90]:=
Reverse#dm - # - dwr + 2 &#{Mod[posFlat, #, 1],IntegerPart[posFlat/#] + 1} &#Last[dm]
Out[90]= {{1,2,3},{2,1,0}}
The coordinates of the top left corners should stay within a difference of dimensions of matrix and a sub-matrix. The above sub-lists were made of x and y coordiantes of top - left corners. I added 2 to make all valid results strictly positive. We have to pick only coordiantes at those positions in Transpose#{Mod[posFlat, #, 1], IntegerPart[posFlat/#] + 1} &#Last[dm] ( which is posInterm), at which both sub-lists above have strictly positive numbers. I used Total[Clip[...,{0,1}]] to recast it into picking only at those positions at which this second list has 2 (Clip converts all positive integers to 1, and Total sums numbers in 2 sublists. The only way to get 2 is when numbers in both sublists are positive).
So, we have:
In[92]:=
pos2D=Pick[Transpose[#],Total[Clip[Reverse#dm-#-dwr+2,{0,1}]],2]&#
{Mod[posFlat,#,1],IntegerPart[posFlat/#]+1}&#Last[dm]
Out[92]= {{3,1},{2,2}}
After the list of 2D positions has been filtered, so that no structurally invalid positions are present, we can use Select to extract the full sub-matrices and test against the sub-matrix of interest:
In[93]:=
finalPos =
Select[pos2D,m[[Last##;;Last##+Last#dwr-1,First##;;First##+First#dwr-1]]==what&]
Out[93]= {{3,1},{2,2}}
In this case, both positions are genuine. The final thing to do is to reconstruct the positions of the bottom - right corners of the submatrix and add them to the top-left corner positions. This is done by this line:
In[94]:= Transpose[{#,Transpose[Transpose[#]+dwr-1]}]&#finalPos
Out[94]= {{{3,1},{4,2}},{{2,2},{3,3}}}
I could have used Map, but for a large list of positions, the above code would be more efficient.
Example and benchmarks
The original example:
In[216]:= Position2D[{{0,1,2,3},{1,2,3,4},{2,3,4,5},{3,4,5,6}},{{2,3},{3,4}}]
Out[216]= {{{3,1},{4,2}},{{2,2},{3,3}},{{1,3},{2,4}}}
Note that my index conventions are reversed w.r.t. #Szabolcs' solution.
Benchmarks for large matrices and sub-matrices
Here is a power test:
nmat = 1000;
(* generate a large random matrix and a sub-matrix *)
largeTestMat = RandomInteger[100, {2000, 3000}];
what = RandomInteger[10, {100, 100}];
(* generate upper left random positions where to insert the submatrix *)
rposx = RandomInteger[{1,Last#Dimensions[largeTestMat] - Last#Dimensions[what] + 1}, nmat];
rposy = RandomInteger[{1,First#Dimensions[largeTestMat] - First#Dimensions[what] + 1},nmat];
(* insert the submatrix nmat times *)
With[{dwr = Reverse#Dimensions[what]},
Do[largeTestMat[[Last#p ;; Last#p + Last#dwr - 1,
First#p ;; First#p + First#dwr - 1]] = what,
{p,Transpose[{rposx, rposy}]}]]
Now, we test:
In[358]:= (ps1 = position2D[largeTestMat,what])//Short//Timing
Out[358]= {1.39,{{{1,2461},{100,2560}},<<151>>,{{1900,42},{1999,141}}}}
In[359]:= (ps2 = Position2D[largeTestMat,what])//Short//Timing
Out[359]= {0.062,{{{2461,1},{2560,100}},<<151>>,{{42,1900},{141,1999}}}}
(the actual number of sub-matrices is smaller than the number we try to generate, since many of them overlap and "destroy" the previously inserted ones - this is so because the sub-matrix size is a sizable fraction of the matrix size in our benchmark).
To compare, we should reverse the x-y indices in one of the solutions (level 3), and sort both lists, since positions may have been obtained in different order:
In[360]:= Sort#ps1===Sort[Reverse[ps2,{3}]]
Out[360]= True
I do not exclude a possibility that further optimizations are possible.
This is my implementation:
position2D[m_, k_] :=
Module[{di, dj, extractSubmatrix, pos},
{di, dj} = Dimensions[k] - 1;
extractSubmatrix[{i_, j_}] := m[[i ;; i + di, j ;; j + dj]];
pos = Position[ListCorrelate[k, m], ListCorrelate[k, k][[1, 1]]];
pos = Select[pos, extractSubmatrix[#] == k &];
{#, # + {di, dj}} & /# pos
]
It uses ListCorrelate to get a list of potential positions, then filters those that actually match. It's probably faster on packed real matrices.
As per Leonid's suggestion here's my solution. I know it isn't very efficient (it's about 600 times slower than Leonid's when I timed it) but it's very short, rememberable, and a nice illustration of a rarely used function, PartitionMap. It's from the Developer package, so it needs a Needs["Developer`"] call first.
Given that, Position2D can be defined as:
Position2D[m_, k_] := Position[PartitionMap[k == # &, m, Dimensions[k], {1, 1}], True]
This only gives the upper-left coordinates. I feel the lower-right coordinates are really redundant, since the dimensions of the sub-matrix are known, but if the need arises one can add those to the output by prepending {#, Dimensions[k] + # - {1, 1}} & /# to the above definition.
How about something like
Position2D[bigMat_?MatrixQ, smallMat_?MatrixQ] :=
Module[{pos, sdim = Dimensions[smallMat] - 1},
pos = Position[bigMat, smallMat[[1, 1]]];
Quiet[Select[pos, (MatchQ[
bigMat[[Sequence##Thread[Span[#, # + sdim]]]], smallMat] &)],
Part::take]]
which will return the top left-hand positions of the submatrices.
Example:
Position2D[{{0, 1, 2, 3}, {1, 2, 3, 4}, {2, 3, 4, 5}, {3, 5, 5, 6}},
{{2, 3}, {3, _}}]
(* Returns: {{1, 3}, {2, 2}, {3, 1}} *)
And to search a 1000x1000 matrix, it takes about 2 seconds on my old machine
SeedRandom[1]
big = RandomInteger[{0, 10}, {1000, 1000}];
Position2D[big, {{1, 1, _}, {1, 1, 1}}] // Timing
(* {1.88012, {{155, 91}, {295, 709}, {685, 661},
{818, 568}, {924, 45}, {981, 613}}} *)

Why does Extract add extra {} to the result and what is best way to remove them

When I type the following
lis = {1, 2};
pos = Position[lis, 1];
result = Extract[lis, pos]
the result is always a list.
{1}
another example
lis = {{1}, {2}};
pos = Position[lis, {1}];
result = Extract[lis, pos]
{{1}}
Mathematica always adds an extra {} in the result. What would be the best way to remove this extra {}, other than applying Flatten[result,1] each time? And is there a case where removing these extra {} can cause a problem?
You probably realise this, but Position and Extract return lists because the requested values may be found in more than one position. So in general, removing the outer brackets doesn't make sense.
If you are sure the result is a singleton list, using Flatten would destroy information if the element is itself a list. For example,
Position[{{1}},1]
gives a list whose sole element is a list. So in this case, using Extract would make more sense.
Even so, there are many situations where Mathematica treats {x} very differently to x, as in
Position[1,1]
Position[{1},1]
which have very different results. So whether you can remove the outer braces from a one-member list depends on what you plan to do with it.
If I understood your question correctly, you are asking why
lis = {{1}, {2}};
pos = Position[lis, {1}];
result = Extract[lis, pos]
returns
{{1}}
rather than
{1}
The answer is, I think, simple: Position[lis,{1}] gives the position at which {1}, not 1 appears in lis; when you then go and look at that position using Extract, you do indeed get {1} which is then wrapped in a list (which is exactly what happened in the first case, when you looked for 1 and obtained {1} as a result; just replace 1 by {1}, because that is now what you are asking for.
To see this more clearly, try
lis = {f[1], f[2]};
pos = Position[lis, f[1]];
result = Extract[lis, pos]
which gives
{f[1]}
The point here is that List in {1} (which is the same as List[1] if you check look at the FullForm) before was just a head, like f here. Should mathematica have remove f here? If not, then why should it have removed the innermost List earlier?
And finally, if you really want to remove the inner {} in your second example, try
lis = {{1}, {2, {1}}};
pos = Position[lis, {1}];
result = Extract[lis, pos, #[[1]] &]
giving
{1, 1}
EDIT: I am becoming puzzled with some of the answers here. If I do
lis = {{1}, {2, {1, 2, {1}}}};
pos = Position[lis, 1];
result = Extract[lis, pos]
then I get
{1, 1, 1}
in result. I only get the extra brackets when I actually obtain the positions of {1} in pos instead of the positions of 1 (and then when I look at those positions, I find {1}). Or am I missing something in your question?
Short answer: You should probably use First#Position[...]
Long answer:
Lets separate the question to 2 parts:
Why do you have the extra {} in the result for Position?
i.e. why:
lis = {1, 2};
Position[lis, 1]
returns {{1}}?
This is in order to work consistently with n-dimensional list, that may have the requested values in more than one position. For example:
lis = {{1, 2, 3}, {1, 5, 6}, {1, 2, 1}};
Position[lis, 1]
returns {{1, 1}, {2, 1}, {3, 1}, {3, 3}}
which is a list of the coordinates the result is found in.
So in your case:
lis = {1, 2};
Position[lis, 1]
return {{1}}, as in: we found your requested value one time, in the coordinate-set {1}.
Now, a lot of times Mathematica assume that there might be a list of solutions (for example, in Solve), but the user know that he expect only one. A suitable code to this in your case will be First#Position[...]. this will return the first (and, assumebly, only) element in the list of positions --
So, if you are sure that the element you are searching for exist only once in the list and want to know where, use this way.
Why do you have the extra {} in the result for Extract?
Extract can work in two different ways.
If I'm doing Extract[{{a, b, c}, {d, e, f}, {g, e, h}}, {1, 2}]
I will get b, so extract with a 1 dimensional list of is just choosing and returning this element. In fact, Extract[lis, {1, 2}] is equal to lis[[1, 2]]
If I'm doing Extract[{{a, b, c}, {d, e, f}, {g, e, h}}, {{1, 2}, {3, 4}}]
I will get {b, h}, so extract with a 2 dimensional list is choosing and returning a list of elements.
In your case(s), you are doing Extract[lis, {{1}}], as in: give me a list containing only the element lis[[1]]. The result is always this element in a list, which is the extra {}

Resources