Elasticsearch - Replicas unassigned after reopening index (INDEX_REOPENED error)
I closed my index and reopened it, and now my replica shards cannot be assigned.
curl -s -XGET localhost:9201/_cat/shards?h=index,shard,prirep,state,unassigned.reason | grep UNASSIGNED
2018.03.27-team-logs 2 r UNASSIGNED INDEX_REOPENED
2018.03.27-team-logs 5 r UNASSIGNED INDEX_REOPENED
2018.03.27-team-logs 3 r UNASSIGNED INDEX_REOPENED
2018.03.27-team-logs 4 r UNASSIGNED INDEX_REOPENED
2018.03.27-team-logs 1 r UNASSIGNED INDEX_REOPENED
2018.03.27-team-logs 0 r UNASSIGNED INDEX_REOPENED
2018.03.28-team-logs 2 r UNASSIGNED INDEX_REOPENED
2018.03.28-team-logs 5 r UNASSIGNED INDEX_REOPENED
2018.03.28-team-logs 3 r UNASSIGNED INDEX_REOPENED
2018.03.28-team-logs 4 r UNASSIGNED INDEX_REOPENED
2018.03.28-team-logs 1 r UNASSIGNED INDEX_REOPENED
2018.03.28-team-logs 0 r UNASSIGNED INDEX_REOPENED
Could anybody explain what this error means and how to solve it? Before I closed the index everything worked fine. The index is configured with 6 shards and 1 replica, on Elasticsearch 6.2.
EDIT:
Output of curl -XGET "localhost:9201/_cat/shards":
2018.03.29-team-logs 1 r STARTED 1739969 206.2mb 10.207.46.247 elk-es-data-hot-1.platform.osdc2.mall.local
2018.03.29-team-logs 1 p STARTED 1739969 173mb 10.206.46.246 elk-es-data-hot-2.platform.osdc1.mall.local
2018.03.29-team-logs 2 p STARTED 1739414 169.9mb 10.207.46.247 elk-es-data-hot-1.platform.osdc2.mall.local
2018.03.29-team-logs 2 r STARTED 1739414 176.3mb 10.207.46.248 elk-es-data-hot-2.platform.osdc2.mall.local
2018.03.29-team-logs 4 p STARTED 1740185 186mb 10.206.46.247 elk-es-data-hot-1.platform.osdc1.mall.local
2018.03.29-team-logs 4 r STARTED 1740185 169.4mb 10.206.46.246 elk-es-data-hot-2.platform.osdc1.mall.local
2018.03.29-team-logs 5 r STARTED 1739660 164.3mb 10.207.46.248 elk-es-data-hot-2.platform.osdc2.mall.local
2018.03.29-team-logs 5 p STARTED 1739660 180.1mb 10.206.46.246 elk-es-data-hot-2.platform.osdc1.mall.local
2018.03.29-team-logs 3 p STARTED 1740606 171.2mb 10.207.46.248 elk-es-data-hot-2.platform.osdc2.mall.local
2018.03.29-team-logs 3 r STARTED 1740606 173.4mb 10.206.46.247 elk-es-data-hot-1.platform.osdc1.mall.local
2018.03.29-team-logs 0 r STARTED 1740166 169.7mb 10.207.46.247 elk-es-data-hot-1.platform.osdc2.mall.local
2018.03.29-team-logs 0 p STARTED 1740166 187mb 10.206.46.247 elk-es-data-hot-1.platform.osdc1.mall.local
2018.03.28-team-logs 1 p STARTED 2075020 194.2mb 10.207.46.248 elk-es-data-hot-2.platform.osdc2.mall.local
2018.03.28-team-logs 1 r UNASSIGNED
2018.03.28-team-logs 2 p STARTED 2076268 194.9mb 10.206.46.247 elk-es-data-hot-1.platform.osdc1.mall.local
2018.03.28-team-logs 2 r UNASSIGNED
2018.03.28-team-logs 4 p STARTED 2073906 194.9mb 10.207.46.247 elk-es-data-hot-1.platform.osdc2.mall.local
2018.03.28-team-logs 4 r UNASSIGNED
2018.03.28-team-logs 5 p STARTED 2072921 195mb 10.207.46.248 elk-es-data-hot-2.platform.osdc2.mall.local
2018.03.28-team-logs 5 r UNASSIGNED
2018.03.28-team-logs 3 p STARTED 2074579 194.1mb 10.206.46.246 elk-es-data-hot-2.platform.osdc1.mall.local
2018.03.28-team-logs 3 r UNASSIGNED
2018.03.28-team-logs 0 p STARTED 2073349 193.9mb 10.207.46.248 elk-es-data-hot-2.platform.osdc2.mall.local
2018.03.28-team-logs 0 r UNASSIGNED
2018.03.27-team-logs 1 p STARTED 356769 33.5mb 10.207.46.246 elk-es-data-warm-1.platform.osdc2.mall.local
2018.03.27-team-logs 1 r UNASSIGNED
2018.03.27-team-logs 2 p STARTED 356798 33.6mb 10.206.46.244 elk-es-data-warm-2.platform.osdc1.mall.local
2018.03.27-team-logs 2 r UNASSIGNED
2018.03.27-team-logs 4 p STARTED 356747 33.7mb 10.207.46.246 elk-es-data-warm-1.platform.osdc2.mall.local
2018.03.27-team-logs 4 r UNASSIGNED
2018.03.27-team-logs 5 p STARTED 357399 33.8mb 10.207.46.245 elk-es-data-warm-2.platform.osdc2.mall.local
2018.03.27-team-logs 5 r UNASSIGNED
2018.03.27-team-logs 3 p STARTED 357957 33.7mb 10.206.46.245 elk-es-data-warm-1.platform.osdc1.mall.local
2018.03.27-team-logs 3 r UNASSIGNED
2018.03.27-team-logs 0 p STARTED 356357 33.4mb 10.207.46.245 elk-es-data-warm-2.platform.osdc2.mall.local
2018.03.27-team-logs 0 r UNASSIGNED
.kibana 0 p STARTED 2 12.3kb 10.207.46.247 elk-es-data-hot-1.platform.osdc2.mall.local
.kibana 0 r UNASSIGNED
Output of curl -XGET "localhost:9201/_cat/nodes":
10.207.46.248 8 82 0 0.07 0.08 0.11 d - elk-es-data-hot-2
10.206.46.245 9 64 0 0.04 0.11 0.08 d - elk-es-data-warm-1
10.207.46.249 11 90 0 0.00 0.01 0.05 m * elk-es-master-2
10.207.46.245 9 64 0 0.00 0.01 0.05 d - elk-es-data-warm-2
10.206.46.247 12 82 0 0.00 0.06 0.08 d - elk-es-data-hot-1
10.206.46.244 10 64 0 0.08 0.04 0.05 d - elk-es-data-warm-2
10.207.46.243 5 86 0 0.00 0.01 0.05 d - elk-kibana
10.206.46.248 10 92 1 0.04 0.18 0.24 m - elk-es-master-1
10.206.46.246 6 82 0 0.02 0.07 0.09 d - elk-es-data-hot-2
10.207.46.247 9 82 0 0.06 0.06 0.05 d - elk-es-data-hot-1
10.206.46.241 6 91 0 0.00 0.02 0.05 m - master-test
10.206.46.242 8 89 0 0.00 0.02 0.05 d - es-kibana
10.207.46.246 8 64 0 0.00 0.02 0.05 d - elk-es-data-warm-1
It is expected behaviour.
Elasticsearch will not put a primary shard and its replica on the same
node, so you will need at least 2 data nodes to hold 1 replica.
You can simply set the number of replicas to 0:
PUT */_settings
{
"index" : {
"number_of_replicas" : 0
}
}
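To check whether the change took effect and the shards were assigned, the cluster health API can be polled. A minimal sketch, only assuming the port 9201 used elsewhere in the question; wait_for_status makes the call block until the cluster is green (or the timeout expires):
curl -s -XGET "localhost:9201/_cluster/health?wait_for_status=green&timeout=30s&pretty"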
UPDATE:
After running the following request
GET /_cluster/allocation/explain?pretty
we can see the response here:
https://pastebin.com/1ag1Z7jL
"explanation" : "there are too many copies of the shard allocated to
nodes with attribute [datacenter], there are [2] total configured
shard copies for this shard id and [3] total attribute values,
expected the allocated shard count per attribute [2] to be less than
or equal to the upper bound of the required number of shards per
attribute [1]"
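The request above was sent without a body, so Elasticsearch picks an arbitrary unassigned shard to explain. A hedged sketch of asking about one specific replica instead (index name and shard number taken from the unassigned list above; the Content-Type header is required for request bodies in 6.x):
curl -s -XGET "localhost:9201/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d '
{
  "index": "2018.03.28-team-logs",
  "shard": 0,
  "primary": false
}'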
Probably you have an allocation awareness (zone) setting in use. Elasticsearch will avoid putting a primary and its replica shard in the same zone.
https://www.elastic.co/guide/en/elasticsearch/reference/current/allocation-awareness.html
With ordinary awareness, if one zone lost contact with the other zone,
Elasticsearch would assign all of the missing replica shards to a
single zone. But in this example, this sudden extra load would cause
the hardware in the remaining zone to be overloaded.
Forced awareness solves this problem by NEVER allowing copies of the
same shard to be allocated to the same zone.
For example, let's say we have an awareness attribute called zone, and
we know we are going to have two zones, zone1 and zone2. Here is how
we can force awareness on a node:
cluster.routing.allocation.awareness.force.zone.values: zone1,zone2
cluster.routing.allocation.awareness.attributes: zone
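In this cluster the allocation explanation refers to the attribute [datacenter], so the awareness attribute is presumably datacenter rather than zone. A hedged sketch for checking how it is configured on the nodes and, since the awareness settings are dynamic, overriding them through the cluster settings API; dc1 and dc2 are placeholders for whatever datacenter names your nodes actually report:
curl -s -XGET "localhost:9201/_nodes/settings?pretty" | grep -A 1 awareness
curl -s -XPUT "localhost:9201/_cluster/settings" -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "datacenter",
    "cluster.routing.allocation.awareness.force.datacenter.values": "dc1,dc2"
  }
}'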
Related
Indexing very slow in elasticsearch
I can't increase the indexing rate beyond 10000 events/second no matter what I do. I am getting around 13000 events per second from Kafka in a single Logstash instance. I am running 3 Logstash instances on different machines reading data from the same Kafka topic. I have set up an ELK cluster with 3 Logstash instances reading data from Kafka and sending it to my Elastic cluster. My cluster contains 3 Logstash, 3 Elastic master nodes, 3 Elastic client nodes and 50 Elastic data nodes.
Logstash 2.0.4, Elasticsearch 5.0.2, Kibana 5.0.2.
All Citrix VMs having the same configuration: Red Hat Linux 7, Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 6 cores, 32 GB RAM, 2 TB spinning media.
Logstash config file:
output {
  elasticsearch {
    hosts => ["dataNode1:9200","dataNode2:9200","dataNode3:9200" up to "dataNode50:9200"]
    index => "logstash-applogs-%{+YYYY.MM.dd}-1"
    workers => 6
    user => "uname"
    password => "pwd"
  }
}
Elasticsearch data node's elasticsearch.yml file:
cluster.name: my-cluster-name
node.name: node46-data-46
node.master: false
node.data: true
bootstrap.memory_lock: true
path.data: /apps/dataES1/data
path.logs: /apps/dataES1/logs
discovery.zen.ping.unicast.hosts: ["master1","master2","master3"]
network.host: hostname
http.port: 9200
The only change that I made in my jvm.options file is:
-Xms15g
-Xmx15g
System config changes that I did are as follows:
vm.max_map_count=262144
and in /etc/security/limits.conf I added:
elastic soft nofile 65536
elastic hard nofile 65536
elastic soft memlock unlimited
elastic hard memlock unlimited
elastic soft nproc 65536
elastic hard nproc unlimited
Indexing Rate
One of the active data nodes:
$ sudo iotop -o
Total DISK READ : 0.00 B/s | Total DISK WRITE : 243.29 K/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 357.09 K/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
5199 be/3 root 0.00 B/s 3.92 K/s 0.00 % 1.05 % [jbd2/xvdb1-8]
14079 be/4 elkadmin 0.00 B/s 51.01 K/s 0.00 % 0.53 % java -Xms15g -Xmx15g -XX:+UseConcMarkSweepGC -XX:CMSIni~h-5.0.2/lib/* org.elasticsearch.bootstrap.Elasticsearch
13936 be/4 elkadmin 0.00 B/s 51.01 K/s 0.00 % 0.39 % java -Xms15g -Xmx15g -XX:+UseConcMarkSweepGC -XX:CMSIni~h-5.0.2/lib/* org.elasticsearch.bootstrap.Elasticsearch
13857 be/4 elkadmin 0.00 B/s 58.86 K/s 0.00 % 0.34 % java -Xms15g -Xmx15g -XX:+UseConcMarkSweepGC -XX:CMSIni~h-5.0.2/lib/* org.elasticsearch.bootstrap.Elasticsearch
13960 be/4 elkadmin 0.00 B/s 35.32 K/s 0.00 % 0.33 % java -Xms15g -Xmx15g -XX:+UseConcMarkSweepGC -XX:CMSIni~h-5.0.2/lib/* org.elasticsearch.bootstrap.Elasticsearch
13964 be/4 elkadmin 0.00 B/s 31.39 K/s 0.00 % 0.27 % java -Xms15g -Xmx15g -XX:+UseConcMarkSweepGC -XX:CMSIni~h-5.0.2/lib/* org.elasticsearch.bootstrap.Elasticsearch
14078 be/4 elkadmin 0.00 B/s 11.77 K/s 0.00 % 0.00 % java -Xms15g -Xmx15g -XX:+UseConcMarkSweepGC -XX:CMSIni~h-5.0.2/lib/* org.elasticsearch.bootstrap.Elasticsearch
Index details:
index shard prirep state docs store
logstash-applogs-2017.01.23-3 11 r STARTED 30528186 35gb
logstash-applogs-2017.01.23-3 11 p STARTED 30528186 30.3gb
logstash-applogs-2017.01.23-3 9 p STARTED 30530585 35.2gb
logstash-applogs-2017.01.23-3 9 r STARTED 30530585 30.5gb
logstash-applogs-2017.01.23-3 1 r STARTED 30526639 30.4gb
logstash-applogs-2017.01.23-3 1 p STARTED 30526668 30.5gb
logstash-applogs-2017.01.23-3 14 p STARTED 30539209 35.5gb
logstash-applogs-2017.01.23-3 14 r STARTED 30539209 35gb
logstash-applogs-2017.01.23-3 12 p STARTED 30536132 30.3gb
logstash-applogs-2017.01.23-3 12 r STARTED 30536132 30.3gb
logstash-applogs-2017.01.23-3 15 p STARTED 30528216 30.4gb
logstash-applogs-2017.01.23-3 15 r STARTED 30528216 30.4gb
logstash-applogs-2017.01.23-3 19 r STARTED 30533725 35.3gb
logstash-applogs-2017.01.23-3 19 p STARTED 30533725 36.4gb
logstash-applogs-2017.01.23-3 18 r STARTED 30525190 30.2gb
logstash-applogs-2017.01.23-3 18 p STARTED 30525190 30.3gb
logstash-applogs-2017.01.23-3 8 p STARTED 30526785 35.8gb
logstash-applogs-2017.01.23-3 8 r STARTED 30526785 35.3gb
logstash-applogs-2017.01.23-3 3 p STARTED 30526960 30.4gb
logstash-applogs-2017.01.23-3 3 r STARTED 30526960 30.2gb
logstash-applogs-2017.01.23-3 5 p STARTED 30522469 35.3gb
logstash-applogs-2017.01.23-3 5 r STARTED 30522469 30.8gb
logstash-applogs-2017.01.23-3 6 p STARTED 30539580 30.9gb
logstash-applogs-2017.01.23-3 6 r STARTED 30539580 30.3gb
logstash-applogs-2017.01.23-3 7 p STARTED 30535488 30.3gb
logstash-applogs-2017.01.23-3 7 r STARTED 30535488 30.4gb
logstash-applogs-2017.01.23-3 2 p STARTED 30524575 35.2gb
logstash-applogs-2017.01.23-3 2 r STARTED 30524575 35.3gb
logstash-applogs-2017.01.23-3 10 p STARTED 30537232 30.4gb
logstash-applogs-2017.01.23-3 10 r STARTED 30537232 30.4gb
logstash-applogs-2017.01.23-3 16 p STARTED 30530098 30.3gb
logstash-applogs-2017.01.23-3 16 r STARTED 30530098 30.3gb
logstash-applogs-2017.01.23-3 4 r STARTED 30529877 30.2gb
logstash-applogs-2017.01.23-3 4 p STARTED 30529877 30.2gb
logstash-applogs-2017.01.23-3 17 r STARTED 30528132 30.2gb
logstash-applogs-2017.01.23-3 17 p STARTED 30528132 30.4gb
logstash-applogs-2017.01.23-3 13 r STARTED 30521873 30.3gb
logstash-applogs-2017.01.23-3 13 p STARTED 30521873 30.4gb
logstash-applogs-2017.01.23-3 0 r STARTED 30520172 30.4gb
logstash-applogs-2017.01.23-3 0 p STARTED 30520172 30.5gb
I tested the incoming data in Logstash by dumping the data into a file. I got a file of 290 MB with 377822 lines in 30 seconds. So there is no issue from Kafka: at a given time I am receiving 35000 events per second in my 3 Logstash servers, but my Elasticsearch is only able to index a maximum of 10000 events per second. Can someone please help me with this issue?
Edit: I tried sending the requests in batches of the default 125, then 500, 1000 and 10000, but I still didn't get any improvement in the indexing speed.
I improved the indexing rate by moving to larger machines for the data nodes. Data node: a VMware virtual machine with the following config: 14 CPUs @ 2.60GHz, 64GB RAM, 31GB dedicated to Elasticsearch. The fastest disk available to me was SAN with Fibre Channel, as I couldn't get any SSDs or local disks. I achieved a maximum indexing rate of 100,000 events per second. Each document's size is around 2 to 5 KB.
UnavailableShardsException on ElasticSearch
I'm using Elasticsearch on my dedicated server (not Amazon). Recently it's giving me an error like:
UnavailableShardsException[[tribune][4] Primary shard is not active or isn't assigned is a known node. Timeout: [1m], request: delete {[tribune][news][90755]}]
Whenever I query /_cat/shards?v the result is:
index shard prirep state docs store ip node
tribune 4 p UNASSIGNED
tribune 4 r UNASSIGNED
tribune 0 p STARTED 5971 34mb ***.**.***.** Benny Beckley
tribune 0 r UNASSIGNED
tribune 3 p STARTED 5875 33.9mb ***.**.***.** Benny Beckley
tribune 3 r UNASSIGNED
tribune 1 p INITIALIZING ***.**.***.** Benny Beckley
tribune 1 r UNASSIGNED
tribune 2 p STARTED 5875 33.6mb ***.**.***.** Benny Beckley
tribune 2 r UNASSIGNED
Elasticsearch: how to change cluster health from yellow to green
I have a cluster with one node (local). Cluster health is yellow. Now I added one more node, but shards cannot be allocated on the second node, so the health of my cluster is still yellow. I cannot change this state to green, unlike in this guide: health cluster example. So how do I change the health state to green?
My cluster:
Cluster health:
curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "astrung",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 22,
  "active_shards" : 22,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 20
}
Shard status:
curl -XGET 'http://localhost:9200/_cat/shards?v'
index shard prirep state docs store ip node
_river 0 p STARTED 2 8.1kb 192.168.1.3 One
_river 0 r UNASSIGNED
megacorp 4 p STARTED 1 3.4kb 192.168.1.3 One
megacorp 4 r UNASSIGNED
megacorp 0 p STARTED 2 6.1kb 192.168.1.3 One
megacorp 0 r UNASSIGNED
megacorp 3 p STARTED 1 2.2kb 192.168.1.3 One
megacorp 3 r UNASSIGNED
megacorp 1 p STARTED 0 115b 192.168.1.3 One
megacorp 1 r UNASSIGNED
megacorp 2 p STARTED 1 2.2kb 192.168.1.3 One
megacorp 2 r UNASSIGNED
mybucket 2 p STARTED 1 2.1kb 192.168.1.3 One
mybucket 2 r UNASSIGNED
mybucket 0 p STARTED 0 115b 192.168.1.3 One
mybucket 0 r UNASSIGNED
mybucket 3 p STARTED 2 5.4kb 192.168.1.3 One
mybucket 3 r UNASSIGNED
mybucket 1 p STARTED 1 2.2kb 192.168.1.3 One
mybucket 1 r UNASSIGNED
mybucket 4 p STARTED 1 2.5kb 192.168.1.3 One
mybucket 4 r UNASSIGNED
.kibana 0 r INITIALIZING 192.168.1.3 Two
.kibana 0 p STARTED 2 8.9kb 192.168.1.3 One
.marvel-kibana 2 p STARTED 0 115b 192.168.1.3 One
.marvel-kibana 2 r UNASSIGNED
.marvel-kibana 0 r INITIALIZING 192.168.1.3 Two
.marvel-kibana 0 p STARTED 1 2.9kb 192.168.1.3 One
.marvel-kibana 3 p STARTED 0 115b 192.168.1.3 One
.marvel-kibana 3 r UNASSIGNED
.marvel-kibana 1 p STARTED 0 115b 192.168.1.3 One
.marvel-kibana 1 r UNASSIGNED
.marvel-kibana 4 p STARTED 0 115b 192.168.1.3 One
.marvel-kibana 4 r UNASSIGNED
user_ids 4 p STARTED 11 5kb 192.168.1.3 One
user_ids 4 r UNASSIGNED
user_ids 0 p STARTED 7 25.1kb 192.168.1.3 One
user_ids 0 r UNASSIGNED
user_ids 3 p STARTED 11 4.9kb 192.168.1.3 One
user_ids 3 r UNASSIGNED
user_ids 1 p STARTED 8 28.7kb 192.168.1.3 One
user_ids 1 r UNASSIGNED
user_ids 2 p STARTED 11 8.5kb 192.168.1.3 One
user_ids 2 r UNASSIGNED
I suggest updating the replication factor of all the indices to 0 and then updating it back to 1. Here's a curl to get you started:
curl -XPUT 'http://localhost:9200/_settings' -H 'Content-Type: application/json' -d '
{
  "index" : {
    "number_of_replicas" : 0
  }
}'
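A minimal follow-up sketch, using the same settings endpoint, for restoring one replica once the cluster has gone green:
curl -XPUT 'http://localhost:9200/_settings' -H 'Content-Type: application/json' -d '
{
  "index" : {
    "number_of_replicas" : 1
  }
}'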
Like @mohitt said above, update number_of_replicas to zero (for local dev only; be careful using this in production). You can run the below in the Kibana Dev Tools console:
PUT _settings
{
  "index" : {
    "number_of_replicas" : 0
  }
}
Though recovery normally takes a long time, looking at the number and size of your documents it should take a very short time to recover. It looks like you have issues with the nodes contacting each other: check firewall rules and ensure ports 9200 and 9300 are reachable from each node.
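A minimal sketch for verifying that, assuming nc and curl are available and node-2 stands in for the other node's real hostname or IP:
# transport port used for node-to-node communication
nc -zv node-2 9300
# HTTP port
curl -s http://node-2:9200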
Filter Data In a Cleaner/More Efficient Way
I have a set of data with a bunch of columns. Something like the following (in reality my data has about half a million rows):
big = [ 1 1 0.93 0.58;
        1 2 0.40 0.34;
        1 3 0.26 0.31;
        1 4 0.40 0.26;
        2 1 0.60 0.04;
        2 2 0.84 0.55;
        2 3 0.53 0.72;
        2 4 0.00 0.39;
        3 1 0.27 0.51;
        3 2 0.46 0.18;
        3 3 0.61 0.01;
        3 4 0.07 0.04;
        4 1 0.26 0.43;
        4 2 0.77 0.91;
        4 3 0.49 0.80;
        4 4 0.40 0.55;
        5 1 0.77 0.40;
        5 2 0.91 0.28;
        5 3 0.80 0.65;
        5 4 0.05 0.06;
        6 1 0.41 0.37;
        6 2 0.11 0.87;
        6 3 0.78 0.61;
        6 4 0.87 0.51 ];
Now, let's say I want to get rid of the rows where the first column is a 3 or a 6. I'm doing that like so:
filterRows = [3 6];
for i = filterRows
    big = big(~ismember(1:size(big,1), find(big(:,1) == i)), :);
end
Which works, but the loop makes me think I'm missing a more efficient trick. Is there a better way to do this? Originally I tried:
big(find(big(:,1) == filterRows ),:) = [];
but of course that doesn't work.
Use logical indexing:
rows = (big(:, 1) == 3 | big(:, 1) == 6);
big(rows, :) = [];
In the general case, where the values of the first column are stored in filterRows, you can generate the logical vector rows with ismember:
rows = ismember(big(:, 1), filterRows);
or with bsxfun:
rows = any(bsxfun(@eq, big(:, 1), filterRows(:).'), 2);
Permute all unique enumerations of a vector in R
I'm trying to find a function that will permute all the unique permutations of a vector, while not counting juxtapositions within subsets of the same element type. For example:
dat <- c(1,0,3,4,1,0,0,3,0,4)
has
factorial(10)
> 3628800
possible permutations, but only 10!/(2!*2!*4!*2!)
factorial(10)/(factorial(2)*factorial(2)*factorial(2)*factorial(4))
> 18900
unique permutations when ignoring juxtapositions within subsets of the same element type.
I can get this by using unique() and the permn() function from the package combinat
unique( permn(dat) )
but this is computationally very expensive, since it involves enumerating n!, which can be an order of magnitude more permutations than I need. Is there a way to do this without first computing n!?
EDIT: Here's a faster answer; again based on the ideas of Louisa Grey and Bryce Wagner, but with faster R code thanks to better use of matrix indexing. It's quite a bit faster than my original:
> ddd <- c(1,0,3,4,1,0,0,3,0,4)
> system.time(up1 <- uniqueperm(d))
   user  system elapsed
  0.183   0.000   0.186
> system.time(up2 <- uniqueperm2(d))
   user  system elapsed
  0.037   0.000   0.038
And the code:
uniqueperm2 <- function(d) {
  dat <- factor(d)
  N <- length(dat)
  n <- tabulate(dat)
  ng <- length(n)
  if(ng==1) return(d)
  a <- N-c(0,cumsum(n))[-(ng+1)]
  foo <- lapply(1:ng, function(i) matrix(combn(a[i],n[i]),nrow=n[i]))
  out <- matrix(NA, nrow=N, ncol=prod(sapply(foo, ncol)))
  xxx <- c(0,cumsum(sapply(foo, nrow)))
  xxx <- cbind(xxx[-length(xxx)]+1, xxx[-1])
  miss <- matrix(1:N,ncol=1)
  for(i in seq_len(length(foo)-1)) {
    l1 <- foo[[i]]
    nn <- ncol(miss)
    miss <- matrix(rep(miss, ncol(l1)), nrow=nrow(miss))
    k <- (rep(0:(ncol(miss)-1), each=nrow(l1)))*nrow(miss) + l1[,rep(1:ncol(l1), each=nn)]
    out[xxx[i,1]:xxx[i,2],] <- matrix(miss[k], ncol=ncol(miss))
    miss <- matrix(miss[-k], ncol=ncol(miss))
  }
  k <- length(foo)
  out[xxx[k,1]:xxx[k,2],] <- miss
  out <- out[rank(as.numeric(dat), ties="first"),]
  foo <- cbind(as.vector(out), as.vector(col(out)))
  out[foo] <- d
  t(out)
}
It doesn't return the same order, but after sorting, the results are identical.
up1a <- up1[do.call(order, as.data.frame(up1)),]
up2a <- up2[do.call(order, as.data.frame(up2)),]
identical(up1a, up2a)
For my first attempt, see the edit history.
The following function (which implements the classic formula for repeated permutations just like you did manually in your question) seems quite fast to me:
upermn <- function(x) {
  n <- length(x)
  duplicates <- as.numeric(table(x))
  factorial(n) / prod(factorial(duplicates))
}
It does compute n! but not like the permn function, which generates all permutations first. See it in action:
> dat <- c(1,0,3,4,1,0,0,3,0,4)
> upermn(dat)
[1] 18900
> system.time(uperm(dat))
   user  system elapsed
  0.000   0.000   0.001
UPDATE: I have just realized that the question was about generating all unique permutations, not just specifying the number of them - sorry for that!
You could improve the unique(permn(...)) part by specifying the unique permutations for one less element and later adding the unique elements in front of them. Well, my explanation may fail, so let the source speak:
uperm <- function(x) {
  u <- unique(x)  # unique values of the vector
  result <- x     # let's start the result matrix with the vector
  for (i in 1:length(u)) {
    v <- x[-which(x==u[i])[1]]  # leave the first occurance of duplicated values
    result <- rbind(result, cbind(u[i], do.call(rbind, unique(permn(v)))))
  }
  return(result)
}
This way you could gain some speed. I was too lazy to run the code on the vector you provided (took so much time); here is a small comparison on a smaller vector:
> dat <- c(1,0,3,4,1,0,0)
> system.time(unique(permn(dat)))
   user  system elapsed
  0.264   0.000   0.268
> system.time(uperm(dat))
   user  system elapsed
  0.147   0.000   0.150
I think you could gain a lot more by rewriting this function to be recursive!
UPDATE (again): I have tried to make up a recursive function with my limited knowledge:
uperm <- function(x) {
  u <- sort(unique(x))
  l <- length(u)
  if (l == length(x)) {
    return(do.call(rbind,permn(x)))
  }
  if (l == 1) return(x)
  result <- matrix(NA, upermn(x), length(x))
  index <- 1
  for (i in 1:l) {
    v <- x[-which(x==u[i])[1]]
    newindex <- upermn(v)
    if (table(x)[i] == 1) {
      result[index:(index+newindex-1),] <- cbind(u[i], do.call(rbind, unique(permn(v))))
    } else {
      result[index:(index+newindex-1),] <- cbind(u[i], uperm(v))
    }
    index <- index+newindex
  }
  return(result)
}
Which has a great gain:
> system.time(unique(permn(c(1,0,3,4,1,0,0,3,0))))
   user  system elapsed
 22.808   0.103  23.241
> system.time(uperm(c(1,0,3,4,1,0,0,3,0)))
   user  system elapsed
  4.613   0.003   4.645
Please report back if this would work for you!
One option that hasn't been mentioned here is the allPerm function from the multicool package. It can be used pretty easily to get all the unique permutations:
library(multicool)
perms <- allPerm(initMC(dat))
dim(perms)
# [1] 18900 10
head(perms)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]    4    4    3    3    1    1    0    0    0     0
# [2,]    0    4    4    3    3    1    1    0    0     0
# [3,]    4    0    4    3    3    1    1    0    0     0
# [4,]    4    4    0    3    3    1    1    0    0     0
# [5,]    3    4    4    0    3    1    1    0    0     0
# [6,]    4    3    4    0    3    1    1    0    0     0
In benchmarking I found it to be faster on dat than the solutions from the OP and daroczig but slower than the solution from Aaron.
I don't actually know R, but here's how I'd approach the problem:
Find how many of each element type there are, i.e.
4 X 0
2 X 1
2 X 3
2 X 4
Sort by frequency (which the above already is).
Start with the most frequent value, which takes up 4 of the 10 spots. Determine the unique combinations of 4 values within the 10 available spots:
(0,1,2,3),(0,1,2,4),(0,1,2,5),(0,1,2,6) ... (0,1,2,9),(0,1,3,4),(0,1,3,5) ... (6,7,8,9)
Go to the second most frequent value; it takes up 2 of 6 available spots, and determine its unique combinations of 2 of 6:
(0,1),(0,2),(0,3),(0,4),(0,5),(1,2),(1,3) ... (4,6),(5,6)
Then 2 of 4:
(0,1),(0,2),(0,3),(1,2),(1,3),(2,3)
And the remaining values, 2 of 2:
(0,1)
Then you need to combine them into each possible combination. Here's some pseudocode (I'm convinced there's a more efficient algorithm for this, but this shouldn't be too bad):
lookup = (0,1,3,4)
For each of the above sets of combinations, example:
input = ((0,2,4,6),(0,2),(2,3),(0,1))
newPermutation = (-1,-1,-1,-1,-1,-1,-1,-1,-1,-1)
for i = 0 to 3
  index = 0
  for j = 0 to 9
    if newPermutation(j) = -1
      if index = input(i)(j)
        newPermutation(j) = lookup(i)
        break
      else
        index = index + 1
Another option is the iterpc package; I believe it is the fastest of the existing methods. More importantly, the result is in dictionary order (which may be somehow preferable).
dat <- c(1, 0, 3, 4, 1, 0, 0, 3, 0, 4)
library(iterpc)
getall(iterpc(table(dat), order=TRUE))
The benchmark indicates that iterpc is significantly faster than all other methods described here:
library(multicool)
library(microbenchmark)
microbenchmark(uniqueperm2(dat),
               allPerm(initMC(dat)),
               getall(iterpc(table(dat), order=TRUE)))
Unit: milliseconds
                                     expr         min         lq        mean      median          uq        max neval
                         uniqueperm2(dat)   23.011864   25.33241   40.141907   27.143952   64.147399   74.66312   100
                     allPerm(initMC(dat)) 1713.549069 1771.83972 1814.434743 1810.331342 1855.869670 1937.48088   100
 getall(iterpc(table(dat), order = TRUE))    4.332674    5.18348    7.656063    5.989448    6.705741   49.98038   100
As this question is old and continues to attract many views, this post is solely meant to inform R users of the current state of the language with regard to performing the popular task outlined by the OP. As @RandyLai alludes to, there are packages developed with this task in mind. They are: arrangements and RcppAlgos*.
Efficiency
They are very efficient and quite easy to use for generating permutations of a multiset.
dat <- c(1, 0, 3, 4, 1, 0, 0, 3, 0, 4)
dim(RcppAlgos::permuteGeneral(sort(unique(dat)), freqs = table(dat)))
[1] 18900 10
microbenchmark(algos = RcppAlgos::permuteGeneral(sort(unique(dat)), freqs = table(dat)),
               arngmnt = arrangements::permutations(sort(unique(dat)), freq = table(dat)),
               curaccptd = uniqueperm2(dat), unit = "relative")
Unit: relative
      expr       min        lq       mean    median        uq       max neval
     algos  1.000000  1.000000  1.0000000  1.000000  1.000000 1.0000000   100
   arngmnt  1.501262  1.093072  0.8783185  1.089927  1.133112 0.3238829   100
 curaccptd 19.847457 12.573657 10.2272080 11.705090 11.872955 3.9007364   100
With RcppAlgos we can utilize parallel processing for even better efficiency on larger examples.
hugeDat <- rep(dat, 2)[-(1:5)]
RcppAlgos::permuteCount(sort(unique(hugeDat)), freqs = table(hugeDat))
[1] 3603600
microbenchmark(algospar = RcppAlgos::permuteGeneral(sort(unique(hugeDat)), freqs = table(hugeDat), nThreads = 4),
               arngmnt = arrangements::permutations(sort(unique(hugeDat)), freq = table(hugeDat)),
               curaccptd = uniqueperm2(hugeDat), unit = "relative", times = 10)
Unit: relative
      expr      min        lq      mean    median       uq      max neval
  algospar  1.00000  1.000000  1.000000  1.000000  1.00000  1.00000    10
   arngmnt  3.23193  3.109092  2.427836  2.598058  2.15965  1.79889    10
 curaccptd 49.46989 45.910901 34.533521 39.399481 28.87192 22.95247    10
Lexicographical Order
A nice benefit of these packages is that the output is in lexicographical order:
head(RcppAlgos::permuteGeneral(sort(unique(dat)), freqs = table(dat)))
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    0    0    0    1    1    3    3    4     4
[2,]    0    0    0    0    1    1    3    4    3     4
[3,]    0    0    0    0    1    1    3    4    4     3
[4,]    0    0    0    0    1    1    4    3    3     4
[5,]    0    0    0    0    1    1    4    3    4     3
[6,]    0    0    0    0    1    1    4    4    3     3
tail(RcppAlgos::permuteGeneral(sort(unique(dat)), freqs = table(dat)))
         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[18895,]    4    4    3    3    0    1    1    0    0     0
[18896,]    4    4    3    3    1    0    0    0    0     1
[18897,]    4    4    3    3    1    0    0    0    1     0
[18898,]    4    4    3    3    1    0    0    1    0     0
[18899,]    4    4    3    3    1    0    1    0    0     0
[18900,]    4    4    3    3    1    1    0    0    0     0
identical(RcppAlgos::permuteGeneral(sort(unique(dat)), freqs = table(dat)),
          arrangements::permutations(sort(unique(dat)), freq = table(dat)))
[1] TRUE
Iterators
Additionally, both packages offer iterators that allow for memory-efficient generation of permutations, one by one:
algosIter <- RcppAlgos::permuteIter(sort(unique(dat)), freqs = table(dat))
algosIter$nextIter()
[1] 0 0 0 0 1 1 3 3 4 4
algosIter$nextNIter(5)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    0    0    0    1    1    3    4    3     4
[2,]    0    0    0    0    1    1    3    4    4     3
[3,]    0    0    0    0    1    1    4    3    3     4
[4,]    0    0    0    0    1    1    4    3    4     3
[5,]    0    0    0    0    1    1    4    4    3     3
## last permutation
algosIter$back()
[1] 4 4 3 3 1 1 0 0 0 0
## use reverse iterator methods
algosIter$prevNIter(5)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    4    4    3    3    1    0    1    0    0     0
[2,]    4    4    3    3    1    0    0    1    0     0
[3,]    4    4    3    3    1    0    0    0    1     0
[4,]    4    4    3    3    1    0    0    0    0     1
[5,]    4    4    3    3    0    1    1    0    0     0
* I am the author of RcppAlgos
Another option is to use the Rcpp package. The difference is that it returns a list.
//[[Rcpp::export]]
std::vector<std::vector< int > > UniqueP(std::vector<int> v){
  std::vector< std::vector<int> > out;
  std::sort (v.begin(),v.end());
  do {
    out.push_back(v);
  } while ( std::next_permutation(v.begin(),v.end()));
  return out;
}
Unit: milliseconds
             expr       min      lq     mean    median       uq      max neval cld
 uniqueperm2(dat) 10.753426 13.5283 15.61438 13.751179 16.16061 34.03334   100   b
     UniqueP(dat)  9.090222  9.6371 10.30185  9.838324 10.20819 24.50451   100   a