What is the best way to find the period in a repeating list?
For example:
a = {4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2}
has repeat {4, 5, 1, 2, 3} with the remainder {4, 5, 1, 2} matching, but being incomplete.
The algorithm should be fast enough to handle longer cases, like so:
b = RandomInteger[10000, {100}];
a = Join[b, b, b, b, Take[b, 27]]
The algorithm should return $Failed if there is no repeating pattern like above.
Please see the comments interspersed with the code on how it works.
(* True if a has period p *)
testPeriod[p_, a_] := Drop[a, p] === Drop[a, -p]
(* are all the list elements the same? *)
homogeneousQ[list_List] := Length#Tally[list] === 1
homogeneousQ[{}] := Throw[$Failed] (* yes, it's ugly to put this here ... *)
(* auxiliary for findPeriodOfFirstElement[] *)
reduce[a_] := Differences#Flatten#Position[a, First[a], {1}]
(* the first element occurs every ?th position ? *)
findPeriodOfFirstElement[a_] := Module[{nl},
nl = NestWhileList[reduce, reduce[a], ! homogeneousQ[#] &];
Fold[Total#Take[#2, #1] &, 1, Reverse[nl]]
]
(* the period must be a multiple of the period of the first element *)
period[a_] := Catch#With[{fp = findPeriodOfFirstElement[a]},
Do[
If[testPeriod[p, a], Return[p]],
{p, fp, Quotient[Length[a], 2], fp}
]
]
Please ask if findPeriodOfFirstElement[] is not clear. I did this independently (for fun!), but now I see that the principle is the same as in Verbeia's solution, except the problem pointed out by Brett is fixed.
I was testing with
b = RandomInteger[100, {1000}];
a = Flatten[{ConstantArray[b, 1000], Take[b, 27]}];
(Note the low integer values: there will be lots of repeating elements within the same period *)
EDIT: According to Leonid's comment below, another 2-3x speedup (~2.4x on my machine) is possible by using a custom position function, compiled specifically for lists of integers:
(* Leonid's reduce[] *)
myPosition = Compile[
{{lst, _Integer, 1}, {val, _Integer}},
Module[{pos = Table[0, {Length[lst]}], i = 1, ctr = 0},
For[i = 1, i <= Length[lst], i++,
If[lst[[i]] == val, pos[[++ctr]] = i]
];
Take[pos, ctr]
],
CompilationTarget -> "C", RuntimeOptions -> "Speed"
]
reduce[a_] := Differences#myPosition[a, First[a]]
Compiling testPeriod gives a further ~20% speedup in a quick test, but I believe this will depend on the input data:
Clear[testPeriod]
testPeriod =
Compile[{{p, _Integer}, {a, _Integer, 1}},
Drop[a, p] === Drop[a, -p]]
Above methods are better if you have no noise. If your signal is only approximate then Fourier transform methods might be useful. I'll illustrate with a "parametrized" setup wherein the length and number of repetitions of the base signal, the length of the trailing part, and a bound on the noise perturbation are all variables one can play with.
noise = 20;
extra = 40;
baselen = 103;
base = RandomInteger[10000, {baselen}];
repeat = 5;
signal = Flatten[Join[ConstantArray[base, repeat], Take[base, extra]]];
noisysignal = signal + RandomInteger[{-noise, noise}, Length[signal]];
We compute the absolute value of the FFT. We adjoin zeros to both ends. The object will be to threshold by comparing to neighbors.
sigfft = Join[{0.}, Abs[Fourier[noisysignal]], {0}];
Now we create two 0-1 vectors. In one we threshold by making a 1 for each element in the fft that is greater than twice the geometric mean of its two neighbors. In the other we use the average (arithmetic mean) but we lower the size bound to 3/4. This was based on some experimentation. We count the number of 1s in each case. Ideally we'd get 100 for each, as that would be the number of nonzeros in a "perfect" case of no noise and no tail part.
In[419]:=
thresh1 =
Table[If[sigfft[[j]]^2 > 2*sigfft[[j - 1]]*sigfft[[j + 1]], 1,
0], {j, 2, Length[sigfft] - 1}];
count1 = Count[thresh1, 1]
thresh2 =
Table[If[sigfft[[j]] > 3/4*(sigfft[[j - 1]] + sigfft[[j + 1]]), 1,
0], {j, 2, Length[sigfft] - 1}];
count2 = Count[thresh2, 1]
Out[420]= 114
Out[422]= 100
Now we get our best guess as to the value of "repeats", by taking the floor of the total length over the average of our counts.
approxrepeats = Floor[2*Length[signal]/(count1 + count2)]
Out[423]= 5
So we have found that the basic signal is repeated 5 times. That can give a start toward refining to estimate the correct length (baselen, above). To that end we might try removing elements at the end and seeing when we get ffts closer to actually having runs of four 0s between nonzero values.
Something else that might work for estimating number of repeats is finding the modal number of zeros in run length encoding of the thresholded ffts. While I have not actually tried that, it looks like it might be robust to bad choices in the details of how one does the thresholding (mine were just experiments that seem to work).
Daniel Lichtblau
The following assumes that the cycle starts on the first element and gives the period length and the cycle.
findCyclingList[a_?VectorQ] :=
Module[{repeats1, repeats2, cl, cLs, vec},
repeats1 = Flatten#Differences[Position[a, First[a]]];
repeats2 = Flatten[Position[repeats1, First[repeats1]]];
If[Equal ## Differences[repeats2] && Length[repeats2] > 2(*
is potentially cyclic - first element appears cyclically *),
cl = Plus ### Partition[repeats1, First[Differences[repeats2]]];
cLs = Partition[a, First[cl]];
If[SameQ ## cLs (* candidate cycles all actually the same *),
vec = First[cLs];
{Length[vec], vec}, $Failed], $Failed] ]
Testing
b = RandomInteger[50, {100}];
a = Join[b, b, b, b, Take[b, 27]];
findCyclingList[a]
{100, {47, 15, 42, 10, 14, 29, 12, 29, 11, 37, 6, 19, 14, 50, 4, 38,
23, 3, 41, 39, 41, 17, 32, 8, 18, 37, 5, 45, 38, 8, 39, 9, 26, 33,
40, 50, 0, 45, 1, 48, 32, 37, 15, 37, 49, 16, 27, 36, 11, 16, 4, 28,
31, 46, 30, 24, 30, 3, 32, 31, 31, 0, 32, 35, 47, 44, 7, 21, 1, 22,
43, 13, 44, 35, 29, 38, 31, 31, 17, 37, 49, 22, 15, 28, 21, 8, 31,
42, 26, 33, 1, 47, 26, 1, 37, 22, 40, 27, 27, 16}}
b1 = RandomInteger[10000, {100}];
a1 = Join[b1, b1, b1, b1, Take[b1, 23]];
findCyclingList[a1]
{100, {1281, 5325, 8435, 7505, 1355, 857, 2597, 8807, 1095, 4203,
3718, 3501, 7054, 4620, 6359, 1624, 6115, 8567, 4030, 5029, 6515,
5921, 4875, 2677, 6776, 2468, 7983, 4750, 7609, 9471, 1328, 7830,
2241, 4859, 9289, 6294, 7259, 4693, 7188, 2038, 3994, 1907, 2389,
6622, 4758, 3171, 1746, 2254, 556, 3010, 1814, 4782, 3849, 6695,
4316, 1548, 3824, 5094, 8161, 8423, 8765, 1134, 7442, 8218, 5429,
7255, 4131, 9474, 6016, 2438, 403, 6783, 4217, 7452, 2418, 9744,
6405, 8757, 9666, 4035, 7833, 2657, 7432, 3066, 9081, 9523, 3284,
3661, 1947, 3619, 2550, 4950, 1537, 2772, 5432, 6517, 6142, 9774,
1289, 6352}}
This case should fail because it isn't cyclical.
findCyclingList[Join[b, Take[b, 11], b]]
$Failed
I tried to something with Repeated, e.g. a /. Repeated[t__, {2, 100}] -> {t} but it just doesn't work for me.
Does this work for you?
period[a_] :=
Quiet[Check[
First[Cases[
Table[
{k, Equal ## Partition[a, k]},
{k, Floor[Length[a]/2]}],
{k_, True} :> k
]],
$Failed]]
Strictly speaking, this will fail for things like
a = {1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5}
although this can be fixed by using something like:
(Equal ## Partition[a, k]) && (Equal ## Partition[Reverse[a], k])
(probably computing Reverse[a] just once ahead of time.)
I propose this. It borrows from both Verbeia and Brett's answers.
Do[
If[MatchQ ## Equal ## Partition[#, i, i, 1, _], Return ## i],
{i, #[[ 2 ;; Floor[Length##/2] ]] ~Position~ First##}
] /. Null -> $Failed &
It is not quite as efficient as Vebeia's function on long periods, but it is faster on short ones, and it is simpler as well.
I don't know how to solve it in mathematica, but the following algorithm (written in python) should work. It's O(n) so speed should be no concern.
def period(array):
if len(array) == 0:
return False
else:
s = array[0]
match = False
end = 0
i = 0
for k in range(1,len(array)):
c = array[k]
if not match:
if c == s:
i = 1
match = True
end = k
else:
if not c == array[i]:
match = False
i += 1
if match:
return array[:end]
else:
return False
# False
print(period([4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2,1]))
# [4, 5, 1, 2, 3]
print(period([4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2]))
# False
print(period([4]))
# [4, 2]
print(period([4,2,4]))
# False
print(period([4,2,1]))
# False
print(period([]))
Ok, just to show my own work here:
ModifiedTortoiseHare[a_List] := Module[{counter, tortoise, hare},
Quiet[
Check[
counter = 1;
tortoise = a[[counter]];
hare = a[[2 counter]];
While[(tortoise != hare) || (a[[counter ;; 2 counter - 1]] != a[[2 counter ;; 3 counter - 1]]),
counter++;
tortoise = a[[counter]];
hare = a[[2 counter]];
];
counter,
$Failed]]]
I'm not sure this is a 100% correct, especially with cases like {pattern,pattern,different,pattern, pattern} and it gets slower and slower when there are a lot of repeating elements, like so:
{ 1,2,1,1, 1,2,1,1, 1,2,1,1, ...}
because it is making too many expensive comparisons.
#include <iostream>
#include <vector>
using namespace std;
int period(vector<int> v)
{
int p=0; // period 0
for(int i=p+1; i<v.size(); i++)
{
if(v[i] == v[0])
{
p=i; // new potential period
bool periodical=true;
for(int i=0; i<v.size()-p; i++)
{
if(v[i]!=v[i+p])
{
periodical=false;
break;
}
}
if(periodical) return p;
i=p; // try to find new period
}
}
return 0; // no period
}
int main()
{
vector<int> v3{1,2,3,1,2,3,1,2,3};
cout<<"Period is :\t"<<period(v3)<<endl;
vector<int> v0{1,2,3,1,2,3,1,9,6};
cout<<"Period is :\t"<<period(v0)<<endl;
vector<int> v1{1,2,1,1,7,1,2,1,1,7,1,2,1,1};
cout<<"Period is :\t"<<period(v1)<<endl;
return 0;
}
This sounds like it might relate to sequence alignment. These algorithms are well studied, and might already be implemented in mathematica.
I am exporting data from mathematica in this manner to a file with "dat" extension.
numbercount=0;
exporttable =
TableForm[
Flatten[
Table[
Table[
Table[{++numbercount, xcord, ycord, zcord}, {xcord, 0, 100, 5}],
{ycord, 0, 100, 5}],
{zcord,10, 100, 10}],
2]];
Export["mydata.dat", exporttable]
Now what happens is the "mydata.dat" file the output appears like this
1 0 0 10
2 5 0 10
3 10 0 10 and so on
But I want the data to appear like this in the "mydata.dat" file.
1, 0, 0, 10
2, 5, 0, 10
3, 10, 0, 10 and so on
If you observer I want a comma after every first,second and third number but not after the fourth number in each line.
I have tried this code it inserts the commas between the number But it takes a long time to run as I have huge amounts of data to be exported.I also feel that someone can perhaps come up with a better solution.
numbercount=0;
exporttable =Flatten[
Table[
Table[
Table[{++numbercount, xcord, ycord, zcord}, {xcord, 0, 100, 5}],
{ycord, 0, 100, 5}],
{zcord,10, 100, 10}],
2];
x = TableForm[Insert[
exporttable[[i]], ",", {{2}, {3}, {4}}], {i, 1, Length[exporttable]}];
Export["mydata.dat", x]
Have you tried exporting it as a CSV file? The third parameter of Export is file type, so you'd type
Export["mydata.dat", x, "CSV"]
To add to this, here is a categorical list and an alphabetical list of the available formats in Mathematica.
As an aside note, please note that you can build your list with only one Table command, and without explicit index variables:
exporttable1 = MapIndexed[Join[#2, #1] &,
Flatten[Table[{xcord, ycord, zcord},
{zcord, 10, 100, 10},
{ycord, 0, 100, 5},
{xcord, 0, 100, 5}], 2]]
exporttable1 == exporttable
(*
-> True
*)
I have the following problem.
I need to build a very large number of definitions(*) such as
f[{1,0,0,0}] = 1
f[{0,1,0,0}] = 2
f[{0,0,1,0}] = 3
f[{0,0,0,1}] = 2
...
f[{2,3,1,2}] = 4
...
f[{n1,n2,n3,n4}] = some integer
...
This is just an example. The length of the argument list does not need to be 4 but can be anything.
I realized that the lookup for each value slows down with exponential complexity when the length of the argument list increases. Perhaps this is not so strange, since it is clear that in principle there is a combinatorial explosion in how many definitions Mathematica needs to store.
Though, I have expected Mathematica to be smart and that value extract should be constant time complexity. Apparently it is not.
Is there any way to speed up lookup time? This probably has to do with how Mathematica internally handles symbol definition lookups. Does it phrases the list until it finds the match? It seems that it does so.
All suggestions highly appreciated.
With best regards
Zoran
(*) I am working on a stochastic simulation software that generates all configurations of a system and needs to store how many times each configuration occurred. In that sense a list {n1, n2, ..., nT} describes a particular configuration of the system saying that there are n1 particles of type 1, n2 particles of type 2, ..., nT particles of type T. There can be exponentially many such configurations.
Could you give some detail on how you worked out that lookup time is exponential?
If it is indeed exponential, perhaps you could speed things up by using Hash on your keys (configurations), then storing key-value pairs in a list like {{key1,value1},{key2,value2}}, kept sorted by key and then using binary search (which should be log time). This should be very quick to code up but not optimum in terms of speed.
If that's not fast enough, one could think about setting up a proper hashtable implementation (which I thought was what the f[{0,1,0,1}]=3 approach did, without having checked).
But some toy example of the slowdown would be useful to proceed further...
EDIT: I just tried
test[length_] := Block[{f},
Do[
f[RandomInteger[{0, 10}, 100]] = RandomInteger[0, 10];,
{i, 1, length}
];
f[{0, 0, 0, 0, 1, 7, 0, 3, 7, 8, 0, 4, 5, 8, 0, 8, 6, 7, 7, 0, 1, 6,
3, 9, 6, 9, 2, 7, 2, 8, 1, 1, 8, 4, 0, 5, 2, 9, 9, 10, 6, 3, 6,
8, 10, 0, 7, 1, 2, 8, 4, 4, 9, 5, 1, 10, 4, 1, 1, 3, 0, 3, 6, 5,
4, 0, 9, 5, 4, 6, 9, 6, 10, 6, 2, 4, 9, 2, 9, 8, 10, 0, 8, 4, 9,
5, 5, 9, 7, 2, 7, 4, 0, 2, 0, 10, 2, 4, 10, 1}] // timeIt
]
with timeIt defined to accurately time even short runs as follows:
timeIt::usage = "timeIt[expr] gives the time taken to execute expr,
repeating as many times as necessary to achieve a total time of \
1s";
SetAttributes[timeIt, HoldAll]
timeIt[expr_] := Module[{t = Timing[expr;][[1]], tries = 1},
While[t < 1.,
tries *= 2;
t = Timing[Do[expr, {tries}];][[1]];
];
Return[t/tries]]
and then
out = {#, test[#]} & /# {10, 100, 1000, 10000, 100000, 100000};
ListLogLogPlot#out
(also for larger runs). So it seems constant time here.
Suppose you enter your information not like
f[{1,0,0,0}] = 1
f[{0,1,0,0}] = 2
but into a n1 x n2 x n3 x n4 matrix m like
m[[2,1,1,1]] = 1
m[[1,2,1,1]] = 2
etc.
(you could even enter values not as f[{1,0,0,0}]=1, but as f[{1,0,0,0},1] with
f[li_List, i_Integer] := Part[m, Apply[Sequence, li + 1]] = i;
f[li_List] := Part[m, Apply[Sequence, li + 1]];
where you have to initialize m e.g. by m = ConstantArray[0, {4, 4, 4, 4}];)
Let's compare timings:
testf[z_] :=
(
Do[ f[{n1, n2, n3, n4}] = RandomInteger[{1,100}], {n1,z}, {n2,z}, {n3,z},{n4,z}];
First[ Timing[ Do[ f[{n2, n4, n1, n3}], {n1, z}, {n2, z}, {n3, z}, {n4, z} ] ] ]
);
Framed[
ListLinePlot[
Table[{z, testf[z]}, {z, 22, 36, 2}],
PlotLabel -> Row[{"DownValue approach: ",
Round[MemoryInUse[]/1024.^2],
" MB needed"
}],
AxesLabel -> {"n1,n2,n3,n4", "time/s"},ImageSize -> 500
]
]
Clear[f];
testf2[z_] :=
(
m = RandomInteger[{1, 100}, {z, z, z, z}];
f2[ni__Integer] := m[[Sequence ## ({ni} + 1)]];
First[ Timing[ Do[ f2[{n2, n4, n1, n3}], {n1, z}, {n2, z}, {n3, z}, {n4, z}] ] ]
)
Framed[
ListLinePlot[
Table[{z, testf2[z]}, {z, 22, 36, 2}],
PlotLabel -> Row[{"Matrix approach: ",
Round[MemoryInUse[]/1024.^2],
" MB needed"
}],
AxesLabel -> {"n1,n2,n3,n4", "time/s"}, ImageSize -> 500
]
]
gives
So for larger sets up information a matrix approach seems clearly preferrable.
Of course, if you have truly large data, say more GB than you have RAM, then you just
have to use a database and DatabaseLink.