Mesos webUI displays only the most recently connected slave - mesos

I am setting up a Mesos cluster for demonstration purposes. My setup is one master and two slaves. All machines run Ubuntu 14.04 LTS and are on the same local network. I start the Mesos master with the command given in the documentation:
./mesos-master.sh --ip=192.168.65.27 --work_dir=/var/lib/mesos
I can start a slave on another machine and connect it to the master with:
./mesos-slave.sh --master=192.168.65.27:5050
But whenever I try to connect one more slave to the same master, only the most recently connected slave gets listed in the Mesos webUI at http://192.168.65.24:5050/#/slaves. I checked the terminal outputs. For one slave it goes on as follows:
I0107 11:31:39.346242 6742 slave.cpp:3053] Current usage 5.71%. Max allowed age:
5.900004310159479days
I0107 11:32:39.349727 6744 slave.cpp:3053] Current usage 5.71%. Max allowed age:
5.900000961729503days
I0107 11:33:39.355268 6740 slave.cpp:3053] Current usage 5.71%. Max allowed age:
5.900000256796875days
I0107 11:34:39.355785 6744 slave.cpp:3053] Current usage 5.71%. Max allowed age:
5.900000080563727days
I0107 11:35:39.376319 6742 slave.cpp:3053] Current usage 5.71%. Max allowed age:
5.900009538409780days
and for the other it is as given below:
I0106 11:34:34.815814 6238 slave.cpp:3053] Current usage 40.74%. Max allowed age:
3.448325928286030days
I0106 11:35:34.816684 6238 slave.cpp:3053] Current usage 40.74%. Max allowed age:
3.448326110500035days
I0106 11:36:34.821465 6244 slave.cpp:3053] Current usage 40.74%. Max allowed age:
3.448323923931886days
I0106 11:37:34.822031 6243 slave.cpp:3053] Current usage 40.74%. Max allowed age:
3.448324106145903days
I0106 11:38:34.846472 6243 slave.cpp:3053] Current usage 40.74%. Max allowed age:
3.448324835001956days
I0106 11:39:34.889264 6243 slave.cpp:3053] Current usage 40.74%. Max allowed age:
3.448322101791771days
The terminal output of the Mesos master is given below:
I0107 15:12:28.482170 6412 master.cpp:2781] Removing old disconnected slave 20150107-
150547-406956224-5050-6393-31 at slave(1)#127.0.1.1:5051 (mesos_slave2-ThinkCentre-
Edge72) because a registration attempt is being made from slave(1)#127.0.1.1:5051
I0107 15:12:28.482221 6412 master.cpp:4218] Removing slave 20150107-150547-406956224-
5050-6393-31 at slave(1)#127.0.1.1:5051 (mesos_slave2-ThinkCentre-Edge72)
I0107 15:12:28.482307 6414 hierarchical_allocator_process.hpp:467] Removed slave
20150107-150547-406956224-5050-6393-31
I0107 15:12:28.482364 6412 master.cpp:2811] Registering slave at slave(1)#127.0.1.1:5051
(mesos_slave1-ThinkCentre-Edge72) with id 20150107-150547-406956224-5050-6393-32
I0107 15:12:28.482379 6414 registrar.cpp:422] Attempting to update the 'registry'
I0107 15:12:28.483706 6413 log.cpp:680] Attempting to append 344 bytes to the log
I0107 15:12:28.483772 6413 coordinator.cpp:340] Coordinator attempting to write APPEND
action at position 847
I0107 15:12:28.484074 6413 replica.cpp:508] Replica received write request for position
847
I0107 15:12:28.537632 6413 leveldb.cpp:343] Persisting action (364 bytes) to leveldb
took 53.520241ms
I0107 15:12:28.537683 6413 replica.cpp:676] Persisted action at 847
I0107 15:12:28.537832 6413 replica.cpp:655] Replica received learned notice for position
847
I0107 15:12:28.579407 6413 leveldb.cpp:343] Persisting action (366 bytes) to leveldb
took 41.551104ms
I0107 15:12:28.579454 6413 replica.cpp:676] Persisted action at 847
I0107 15:12:28.579471 6413 replica.cpp:661] Replica learned APPEND action at position
847
I0107 15:12:28.579779 6413 registrar.cpp:479] Successfully updated 'registry'
I0107 15:12:28.579825 6409 log.cpp:699] Attempting to truncate the log to 847
I0107 15:12:28.579876 6413 registrar.cpp:422] Attempting to update the 'registry'
I0107 15:12:28.579929 6414 master.cpp:4321] Removed slave 20150107-150547-406956224-
5050-6393-31 (mesos_slave2-ThinkCentre-Edge72)
I0107 15:12:28.580001 6412 coordinator.cpp:340] Coordinator attempting to write TRUNCATE
action at position 848
I0107 15:12:28.580216 6412 replica.cpp:508] Replica received write request for position
848
I0107 15:12:28.621160 6412 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took
40.912992ms
I0107 15:12:28.621215 6412 replica.cpp:676] Persisted action at 848
I0107 15:12:28.621426 6413 replica.cpp:655] Replica received learned notice for position
848
I0107 15:12:28.662858 6413 leveldb.cpp:343] Persisting action (20 bytes) to leveldb took
41.418688ms
I0107 15:12:28.662943 6413 leveldb.cpp:401] Deleting ~2 keys from leveldb took 32165ns
I0107 15:12:28.662963 6413 replica.cpp:676] Persisted action at 848
I0107 15:12:28.662976 6413 replica.cpp:661] Replica learned TRUNCATE action at position
848
I0107 15:12:28.663244 6409 log.cpp:680] Attempting to append 550 bytes to the log
I0107 15:12:28.663331 6408 coordinator.cpp:340] Coordinator attempting to write APPEND
action at position 849
I0107 15:12:28.663539 6409 replica.cpp:508] Replica received write request for position
849
I0107 15:12:28.704601 6409 leveldb.cpp:343] Persisting action (570 bytes) to leveldb
took 41.040042ms
I0107 15:12:28.704654 6409 replica.cpp:676] Persisted action at 849
I0107 15:12:28.704839 6410 replica.cpp:655] Replica received learned notice for position
849
I0107 15:12:28.746300 6410 leveldb.cpp:343] Persisting action (572 bytes) to leveldb
took 41.427841ms
I0107 15:12:28.746354 6410 replica.cpp:676] Persisted action at 849
I0107 15:12:28.746371 6410 replica.cpp:661] Replica learned APPEND action at position
849
I0107 15:12:28.746661 6410 registrar.cpp:479] Successfully updated 'registry'
I0107 15:12:28.746722 6414 log.cpp:699] Attempting to truncate the log to 849
I0107 15:12:28.746759 6409 master.cpp:2851] Registered slave 20150107-150547-406956224-
5050-6393-32 at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-Edge72)
I0107 15:12:28.746775 6410 coordinator.cpp:340] Coordinator attempting to write TRUNCATE
action at position 850
I0107 15:12:28.746789 6409 master.cpp:4085] Adding slave 20150107-150547-406956224-5050-
6393-32 at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-Edge72) with cpus(*):4;
mem(*):6785; disk(*):144943; ports(*):[31000-32000]
I0107 15:12:28.746940 6407 replica.cpp:508] Replica received write request for position
850
I0107 15:12:28.746958 6409 master.cpp:775] Slave 20150107-150547-406956224-5050-6393-32
at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-Edge72) disconnected
I0107 15:12:28.746968 6409 master.cpp:1680] Disconnecting slave 20150107-150547-
406956224-5050-6393-32
I0107 15:12:28.746999 6409 hierarchical_allocator_process.hpp:442] Added slave 20150107-
150547-406956224-5050-6393-32 (mesos_slave1-ThinkCentre-Edge72) with cpus(*):4;
mem(*):6785; disk(*):144943; ports(*):[31000-32000] (and cpus(*):4; mem(*):6785;
disk(*):144943; ports(*):[31000-32000] available)
I0107 15:12:28.747051 6409 hierarchical_allocator_process.hpp:481] Slave 20150107-
150547-406956224-5050-6393-32 deactivated
I0107 15:12:28.788100 6407 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took
41.139421ms
I0107 15:12:28.788164 6407 replica.cpp:676] Persisted action at 850
I0107 15:12:28.788331 6411 replica.cpp:655] Replica received learned notice for position
850
I0107 15:12:28.829857 6411 leveldb.cpp:343] Persisting action (20 bytes) to leveldb took
41.48648ms
I0107 15:12:28.829910 6411 leveldb.cpp:401] Deleting ~2 keys from leveldb took 23122ns
I0107 15:12:28.829924 6411 replica.cpp:676] Persisted action at 850
I0107 15:12:28.829936 6411 replica.cpp:661] Replica learned TRUNCATE action at position
850
I0107 15:12:29.070030 6412 master.cpp:2781] Removing old disconnected slave 20150107-
150547-406956224-5050-6393-32 at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-
Edge72) because a registration attempt is being made from slave(1)#127.0.1.1:5051
I0107 15:12:29.070081 6412 master.cpp:4218] Removing slave 20150107-150547-406956224-
5050-6393-32 at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-Edge72)
I0107 15:12:29.070184 6407 hierarchical_allocator_process.hpp:467] Removed slave
20150107-150547-406956224-5050-6393-32
I0107 15:12:29.070305 6412 master.cpp:2811] Registering slave at slave(1)#127.0.1.1:5051
(mesos_slave1-ThinkCentre-Edge72) with id 20150107-150547-406956224-5050-6393-33
I0107 15:12:29.070363 6411 registrar.cpp:422] Attempting to update the 'registry'
I0107 15:12:29.071686 6412 log.cpp:680] Attempting to append 344 bytes to the log
I0107 15:12:29.071750 6414 coordinator.cpp:340] Coordinator attempting to write APPEND
action at position 851
I0107 15:12:29.072064 6408 replica.cpp:508] Replica received write request for position
851
I0107 15:12:29.104596 6408 leveldb.cpp:343] Persisting action (364 bytes) to leveldb
took 32.50025ms
I0107 15:12:29.104645 6408 replica.cpp:676] Persisted action at 851
I0107 15:12:29.104837 6409 replica.cpp:655] Replica received learned notice for position
851
I0107 15:12:29.146327 6409 leveldb.cpp:343] Persisting action (366 bytes) to leveldb
took 41.451476ms
I0107 15:12:29.146374 6409 replica.cpp:676] Persisted action at 851
I0107 15:12:29.146390 6409 replica.cpp:661] Replica learned APPEND action at position
851
I0107 15:12:29.146685 6409 registrar.cpp:479] Successfully updated 'registry'
I0107 15:12:29.146765 6412 log.cpp:699] Attempting to truncate the log to 851
I0107 15:12:29.146781 6409 registrar.cpp:422] Attempting to update the 'registry'
I0107 15:12:29.146823 6407 master.cpp:4321] Removed slave 20150107-150547-406956224-
5050-6393-32 (mesos_slave1-ThinkCentre-Edge72)
I0107 15:12:29.146837 6411 coordinator.cpp:340] Coordinator attempting to write TRUNCATE
action at position 852
I0107 15:12:29.147100 6414 replica.cpp:508] Replica received write request for position
852
I0107 15:12:29.188091 6414 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took
40.960719ms
I0107 15:12:29.188145 6414 replica.cpp:676] Persisted action at 852
I0107 15:12:29.188280 6414 replica.cpp:655] Replica received learned notice for position
852
I0107 15:12:29.229823 6414 leveldb.cpp:343] Persisting action (20 bytes) to leveldb took
41.531512ms
I0107 15:12:29.229907 6414 leveldb.cpp:401] Deleting ~2 keys from leveldb took 30444ns
I0107 15:12:29.229926 6414 replica.cpp:676] Persisted action at 852
I0107 15:12:29.229939 6414 replica.cpp:661] Replica learned TRUNCATE action at position
852
I0107 15:12:29.230134 6410 log.cpp:680] Attempting to append 550 bytes to the log
I0107 15:12:29.230185 6410 coordinator.cpp:340] Coordinator attempting to write APPEND
action at position 853
I0107 15:12:29.230376 6411 replica.cpp:508] Replica received write request for position
853
I0107 15:12:29.271564 6411 leveldb.cpp:343] Persisting action (570 bytes) to leveldb
took 41.128758ms
I0107 15:12:29.271617 6411 replica.cpp:676] Persisted action at 853
I0107 15:12:29.271826 6411 replica.cpp:655] Replica received learned notice for position
853
I0107 15:12:29.313411 6411 leveldb.cpp:343] Persisting action (572 bytes) to leveldb
took 41.551225ms
I0107 15:12:29.313457 6411 replica.cpp:676] Persisted action at 853
I0107 15:12:29.313473 6411 replica.cpp:661] Replica learned APPEND action at position
853
I0107 15:12:29.313753 6410 registrar.cpp:479] Successfully updated 'registry'
I0107 15:12:29.313794 6409 log.cpp:699] Attempting to truncate the log to 853
I0107 15:12:29.313823 6413 master.cpp:2851] Registered slave 20150107-150547-406956224-
5050-6393-33 at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-Edge72)
I0107 15:12:29.313843 6413 master.cpp:4085] Adding slave 20150107-150547-406956224-5050-
6393-33 at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-Edge72) with cpus(*):4;
mem(*):6785; disk(*):144943; ports(*):[31000-32000]
I0107 15:12:29.313854 6410 coordinator.cpp:340] Coordinator attempting to write TRUNCATE
action at position 854
I0107 15:12:29.314043 6409 replica.cpp:508] Replica received write request for position
854
I0107 15:12:29.314091 6413 master.cpp:775] Slave 20150107-150547-406956224-5050-6393-33
at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-Edge72) disconnected
I0107 15:12:29.314115 6413 master.cpp:1680] Disconnecting slave 20150107-150547-
406956224-5050-6393-33
I0107 15:12:29.314128 6410 hierarchical_allocator_process.hpp:442] Added slave 20150107-
150547-406956224-5050-6393-33 (mesos_slave1-ThinkCentre-Edge72) with cpus(*):4;
mem(*):6785; disk(*):144943; ports(*):[31000-32000] (and cpus(*):4; mem(*):6785;
disk(*):144943; ports(*):[31000-32000] available)
I0107 15:12:29.314184 6410 hierarchical_allocator_process.hpp:481] Slave 20150107-
150547-406956224-5050-6393-33 deactivated
I0107 15:12:29.355125 6409 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took
41.039775ms
I0107 15:12:29.355178 6409 replica.cpp:676] Persisted action at 854
I0107 15:12:29.355316 6409 replica.cpp:655] Replica received learned notice for position
854
I0107 15:12:29.396852 6409 leveldb.cpp:343] Persisting action (20 bytes) to leveldb took
41.50737ms
I0107 15:12:29.396934 6409 leveldb.cpp:401] Deleting ~2 keys from leveldb took 30887ns
I0107 15:12:29.396955 6409 replica.cpp:676] Persisted action at 854
I0107 15:12:29.396967 6409 replica.cpp:661] Replica learned TRUNCATE action at position
854
I0107 15:12:29.529793 6407 master.cpp:2781] Removing old disconnected slave 20150107-
150547-406956224-5050-6393-33 at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-
Edge72) because a registration attempt is being made from slave(1)#127.0.1.1:5051
I0107 15:12:29.529831 6407 master.cpp:4218] Removing slave 20150107-150547-406956224-
5050-6393-33 at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-Edge72)
I0107 15:12:29.529917 6414 hierarchical_allocator_process.hpp:467] Removed slave
20150107-150547-406956224-5050-6393-33
I0107 15:12:29.529963 6407 master.cpp:2811] Registering slave at slave(1)#127.0.1.1:5051
(mesos_slave1-ThinkCentre-Edge72) with id 20150107-150547-406956224-5050-6393-34
I0107 15:12:29.529988 6412 registrar.cpp:422] Attempting to update the 'registry'
I0107 15:12:29.531298 6410 log.cpp:680] Attempting to append 344 bytes to the log
I0107 15:12:29.531371 6410 coordinator.cpp:340] Coordinator attempting to write APPEND
action at position 855
I0107 15:12:29.531597 6411 replica.cpp:508] Replica received write request for position
855
I0107 15:12:29.571789 6411 leveldb.cpp:343] Persisting action (364 bytes) to leveldb
took 40.154081ms
I0107 15:12:29.571836 6411 replica.cpp:676] Persisted action at 855
I0107 15:12:29.572059 6412 replica.cpp:655] Replica received learned notice for position
855
I0107 15:12:29.613510 6412 leveldb.cpp:343] Persisting action (366 bytes) to leveldb
took 41.426794ms
I0107 15:12:29.613565 6412 replica.cpp:676] Persisted action at 855
I0107 15:12:29.613584 6412 replica.cpp:661] Replica learned APPEND action at position
855
I0107 15:12:29.613906 6414 registrar.cpp:479] Successfully updated 'registry'
I0107 15:12:29.613915 6412 log.cpp:699] Attempting to truncate the log to 855
I0107 15:12:29.613998 6407 coordinator.cpp:340] Coordinator attempting to write TRUNCATE
action at position 856
I0107 15:12:29.614168 6407 replica.cpp:508] Replica received write request for position
856
I0107 15:12:29.613999 6414 registrar.cpp:422] Attempting to update the 'registry'
I0107 15:12:29.614001 6412 master.cpp:4321] Removed slave 20150107-150547-406956224-
5050-6393-33 (mesos_slave1-ThinkCentre-Edge72)
I0107 15:12:29.655239 6407 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took
41.046542ms
I0107 15:12:29.655294 6407 replica.cpp:676] Persisted action at 856
I0107 15:12:29.655437 6407 replica.cpp:655] Replica received learned notice for position
856
I0107 15:12:29.696975 6407 leveldb.cpp:343] Persisting action (20 bytes) to leveldb took
41.51733ms
I0107 15:12:29.697057 6407 leveldb.cpp:401] Deleting ~2 keys from leveldb took 30767ns
I0107 15:12:29.697078 6407 replica.cpp:676] Persisted action at 856
I0107 15:12:29.697090 6407 replica.cpp:661] Replica learned TRUNCATE action at position
856
I0107 15:12:29.697302 6414 log.cpp:680] Attempting to append 550 bytes to the log
I0107 15:12:29.697357 6412 coordinator.cpp:340] Coordinator attempting to write APPEND
action at position 857
I0107 15:12:29.697630 6409 replica.cpp:508] Replica received write request for position
857
I0107 15:12:29.738620 6409 leveldb.cpp:343] Persisting action (570 bytes) to leveldb
took 40.971859ms
I0107 15:12:29.738662 6409 replica.cpp:676] Persisted action at 857
I0107 15:12:29.738785 6409 replica.cpp:655] Replica received learned notice for position
857
I0107 15:12:29.780450 6409 leveldb.cpp:343] Persisting action (572 bytes) to leveldb
took 41.637468ms
I0107 15:12:29.780506 6409 replica.cpp:676] Persisted action at 857
I0107 15:12:29.780524 6409 replica.cpp:661] Replica learned APPEND action at position
857
I0107 15:12:29.780766 6409 registrar.cpp:479] Successfully updated 'registry'
I0107 15:12:29.780788 6410 log.cpp:699] Attempting to truncate the log to 857
I0107 15:12:29.780823 6413 coordinator.cpp:340] Coordinator attempting to write TRUNCATE
action at position 858
I0107 15:12:29.780838 6412 master.cpp:2851] Registered slave 20150107-150547-406956224-
5050-6393-34 at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-Edge72)
I0107 15:12:29.780858 6412 master.cpp:4085] Adding slave 20150107-150547-406956224-5050-
6393-34 at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-Edge72) with cpus(*):4;
mem(*):6785; disk(*):144943; ports(*):[31000-32000]
I0107 15:12:29.780977 6410 replica.cpp:508] Replica received write request for position
858
I0107 15:12:29.780987 6412 master.cpp:775] Slave 20150107-150547-406956224-5050-6393-34
at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-Edge72) disconnected
I0107 15:12:29.780997 6412 master.cpp:1680] Disconnecting slave 20150107-150547-
406956224-5050-6393-34
I0107 15:12:29.781045 6414 hierarchical_allocator_process.hpp:442] Added slave 20150107-
150547-406956224-5050-6393-34 (mesos_slave1-ThinkCentre-Edge72) with cpus(*):4;
mem(*):6785; disk(*):144943; ports(*):[31000-32000] (and cpus(*):4; mem(*):6785;
disk(*):144943; ports(*):[31000-32000] available)
I0107 15:12:29.781128 6414 hierarchical_allocator_process.hpp:481] Slave 20150107-
150547-406956224-5050-6393-34 deactivated
I0107 15:12:29.822186 6410 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took
41.194886ms
I0107 15:12:29.822240 6410 replica.cpp:676] Persisted action at 858
I0107 15:12:29.822494 6407 replica.cpp:655] Replica received learned notice for position
858
I0107 15:12:29.863934 6407 leveldb.cpp:343] Persisting action (20 bytes) to leveldb took
41.397501ms
I0107 15:12:29.863970 6407 leveldb.cpp:401] Deleting ~2 keys from leveldb took 14397ns
I0107 15:12:29.863980 6407 replica.cpp:676] Persisted action at 858
I0107 15:12:29.863987 6407 replica.cpp:661] Replica learned TRUNCATE action at position
858
I0107 15:12:30.644934 6413 http.cpp:466] HTTP request for '/master/state.json'
I0107 15:12:36.460794 6408 master.cpp:2781] Removing old disconnected slave 20150107-
150547-406956224-5050-6393-34 at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-
Edge72) because a registration attempt is being made from slave(1)#127.0.1.1:5051
I0107 15:12:36.460850 6408 master.cpp:4218] Removing slave 20150107-150547-406956224-
5050-6393-34 at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-Edge72)
I0107 15:12:36.460953 6407 hierarchical_allocator_process.hpp:467] Removed slave
20150107- 150547-406956224-5050-6393-34
I0107 15:12:36.461001 6408 master.cpp:2811] Registering slave at slave(1)#127.0.1.1:5051
(mesos_slave1-ThinkCentre-Edge72) with id 20150107-150547-406956224-5050-6393-35
I0107 15:12:36.461027 6407 registrar.cpp:422] Attempting to update the 'registry'
I0107 15:12:36.461735 6409 log.cpp:680] Attempting to append 344 bytes to the log
I0107 15:12:36.461803 6408 coordinator.cpp:340] Coordinator attempting to write APPEND
action at position 859
I0107 15:12:36.462061 6414 replica.cpp:508] Replica received write request for position
859
I0107 15:12:36.479645 6414 leveldb.cpp:343] Persisting action (364 bytes) to leveldb
took 17.545233ms
I0107 15:12:36.479693 6414 replica.cpp:676] Persisted action at 859
I0107 15:12:36.479923 6414 replica.cpp:655] Replica received learned notice for position
859
I0107 15:12:36.515755 6414 leveldb.cpp:343] Persisting action (366 bytes) to leveldb
took 35.807609ms
I0107 15:12:36.515801 6414 replica.cpp:676] Persisted action at 859
I0107 15:12:36.515818 6414 replica.cpp:661] Replica learned APPEND action at position
859
I0107 15:12:36.516130 6414 registrar.cpp:479] Successfully updated 'registry'
I0107 15:12:36.516180 6407 log.cpp:699] Attempting to truncate the log to 859
I0107 15:12:36.516222 6414 registrar.cpp:422] Attempting to update the 'registry'
I0107 15:12:36.516242 6407 coordinator.cpp:340] Coordinator attempting to write TRUNCATE
action at position 860
I0107 15:12:36.516243 6410 master.cpp:4321] Removed slave 20150107-150547-406956224-
5050-6393-34 (mesos_slave1-ThinkCentre-Edge72)
I0107 15:12:36.516499 6411 replica.cpp:508] Replica received write request for position
860
I0107 15:12:36.557504 6411 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took
40.974358ms
I0107 15:12:36.557559 6411 replica.cpp:676] Persisted action at 860
I0107 15:12:36.557689 6412 replica.cpp:655] Replica received learned notice for position
860
I0107 15:12:36.599247 6412 leveldb.cpp:343] Persisting action (20 bytes) to leveldb took
41.522371ms
I0107 15:12:36.599334 6412 leveldb.cpp:401] Deleting ~2 keys from leveldb took 32551ns
I0107 15:12:36.599354 6412 replica.cpp:676] Persisted action at 860
I0107 15:12:36.599367 6412 replica.cpp:661] Replica learned TRUNCATE action at position
860
I0107 15:12:36.599599 6409 log.cpp:680] Attempting to append 550 bytes to the log
I0107 15:12:36.599660 6409 coordinator.cpp:340] Coordinator attempting to write APPEND
action at position 861
I0107 15:12:36.599859 6409 replica.cpp:508] Replica received write request for position
861
I0107 15:12:36.641084 6409 leveldb.cpp:343] Persisting action (570 bytes) to leveldb
took 41.195477ms
I0107 15:12:36.641137 6409 replica.cpp:676] Persisted action at 861
I0107 15:12:36.641297 6409 replica.cpp:655] Replica received learned notice for position
861
I0107 15:12:36.682837 6409 leveldb.cpp:343] Persisting action (572 bytes) to leveldb
took 41.528099ms
I0107 15:12:36.682858 6409 replica.cpp:676] Persisted action at 861
I0107 15:12:36.682868 6409 replica.cpp:661] Replica learned APPEND action at position
861
I0107 15:12:36.683104 6409 registrar.cpp:479] Successfully updated 'registry'
I0107 15:12:36.683161 6410 log.cpp:699] Attempting to truncate the log to 861
I0107 15:12:36.683189 6410 coordinator.cpp:340] Coordinator attempting to write TRUNCATE
action at position 862
I0107 15:12:36.683195 6414 master.cpp:2851] Registered slave 20150107-150547-406956224-
5050-6393-35 at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-Edge72)
I0107 15:12:36.683213 6414 master.cpp:4085] Adding slave 20150107-150547-406956224-5050-
6393-35 at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-Edge72) with cpus(*):4;
mem(*):6785; disk(*):144943; ports(*):[31000-32000]
I0107 15:12:36.683306 6407 replica.cpp:508] Replica received write request for position
862
I0107 15:12:36.683346 6414 master.cpp:775] Slave 20150107-150547-406956224-5050-6393-35
at slave(1)#127.0.1.1:5051 (mesos_slave1-ThinkCentre-Edge72) disconnected
I0107 15:12:36.683368 6414 master.cpp:1680] Disconnecting slave 20150107-150547-
406956224-5050-6393-35
I0107 15:12:36.683452 6412 hierarchical_allocator_process.hpp:442] Added slave 20150107-
150547-406956224-5050-6393-35 (mesos_slave1-ThinkCentre-Edge72) with cpus(*):4;
mem(*):6785; disk(*):144943; ports(*):[31000-32000] (and cpus(*):4; mem(*):6785;
disk(*):144943; ports(*):[31000-32000] available)
I0107 15:12:36.683516 6412 hierarchical_allocator_process.hpp:481] Slave 20150107-
150547-406956224-5050-6393-35 deactivated
I0107 15:12:36.724563 6407 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took
41.207139ms
I0107 15:12:36.724611 6407 replica.cpp:676] Persisted action at 862
I0107 15:12:36.724793 6408 replica.cpp:655] Replica received learned notice for position
862
I0107 15:12:36.766309 6408 leveldb.cpp:343] Persisting action (20 bytes) to leveldb took
41.490591ms
I0107 15:12:36.766386 6408 leveldb.cpp:401] Deleting ~2 keys from leveldb took 30472ns
I0107 15:12:36.766405 6408 replica.cpp:676] Persisted action at 862
I0107 15:12:36.766417 6408 replica.cpp:661] Replica learned TRUNCATE action at position
862
I0107 15:12:40.659693 6413 http.cpp:466] HTTP request for '/master/state.json'
I tried running mesos-master.sh on a different machine and connecting slaves from other machines, but the output is the same. How can I fix this?

Looks like the slaves are trying to register as 127.0.1.1:5051, so only one slave can be registered at that hostname:port. There are at least two ways to fix this:
Set --ip=<non-localhost-ip> when launching mesos-slave
Modify /etc/hosts on each slave so that its hostname resolves to something other than 127.0.x.1 (see the sketch below).
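For example, a minimal sketch of the first option, assuming 192.168.65.30 is the second slave's own LAN address (substitute each slave's real IP):
./mesos-slave.sh --master=192.168.65.27:5050 --ip=192.168.65.30
For the second option, change the 127.0.1.1 line that Ubuntu puts in /etc/hosts so the hostname maps to the LAN address instead, e.g.:
# before: 127.0.1.1 mesos_slave2-ThinkCentre-Edge72
192.168.65.30 mesos_slave2-ThinkCentre-Edge72
Either way, each slave then registers with a unique non-loopback address, so the master stops treating every registration as a re-registration of the same 127.0.1.1:5051 slave.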

Install Mesosphere. You can get the instructions here. After installation, run mesos-master on the master and mesos-slave on the slaves with an IP that points to your master. If you get a "port already in use" error, as mentioned by Adam above, try running the slave on a different port using the argument --port=value. If you get "permission denied" while running on another port, try sudo.
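A rough sketch of those steps (the master IP comes from the question above; 5052 is just an arbitrary free port):
# on the master
mesos-master --ip=192.168.65.27 --work_dir=/var/lib/mesos
# on each slave; add --port only if you hit the "port already in use" error
mesos-slave --master=192.168.65.27:5050 --port=5052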

Related

Why does Spark only use one executor on my 2 worker node cluster if I increase the executor memory past 5 GB?

I am using a 3 node cluster: 1 master node and 2 worker nodes, using T2.large EC2 instances.
The "free -m" command gives me the following info:
Master:
total used free shared buffers cached
Mem: 7733 6324 1409 0 221 4555
-/+ buffers/cache: 1547 6186
Swap: 1023 0 1023
Worker Node 1:
total used free shared buffers cached
Mem: 7733 3203 4530 0 185 2166
-/+ buffers/cache: 851 6881
Swap: 1023 0 1023
Worker Node 2:
total used free shared buffers cached
Mem: 7733 3402 4331 0 185 2399
-/+ buffers/cache: 817 6915
Swap: 1023 0 1023
In the yarn-site.xml file, I have the following properties set:
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>7733</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>7733</value>
</property>
In $SPARK_HOME/conf/spark-defaults.conf I am setting spark.executor.cores to 2 and spark.executor.instances to 2.
When looking at the spark-history UI after running my spark application, both executors (1 and 2) show up in the "Executors" tab along with the driver. In the cores column on that same page, it says 2 for both executors.
When I set the executor memory to 5G or lower, my Spark application runs fine with executors on both worker nodes. When I set the executor memory to 6G or more, only one worker node runs an executor. Why does this happen? Note: I have tried increasing yarn.nodemanager.resource.memory-mb and it doesn't change this behavior.

Hadoop 2 node cluster UI showing 1 live Node

I am trying to configure a Hadoop 2.7 two-node cluster. When I start Hadoop using start-dfs.sh and start-yarn.sh, all services on the master and slave start perfectly.
Here is the jps output on my master:
23913 Jps
22140 SecondaryNameNode
22316 ResourceManager
22457 NodeManager
21916 DataNode
21777 NameNode
Here is the jps output on my slave:
17223 Jps
14225 DataNode
14363 NodeManager
But the Hadoop cluster UI shows only 1 live datanode.
Here is the dfsadmin report (/bin/hdfs dfsadmin -report):
Live datanodes (1):
Name: 192.168.1.104:50010 (nn1.cluster.com)
Hostname: nn1.cluster.com
Decommission Status : Normal
Configured Capacity: 401224601600 (373.67 GB)
DFS Used: 237568 (232 KB)
Non DFS Used: 48905121792 (45.55 GB)
DFS Remaining: 352319242240 (328.12 GB)
DFS Used%: 0.00%
DFS Remaining%: 87.81%
I am able to ssh to all machines.
Here is a sample of the namenode logs (IP = 192.168.1.104):
2016-07-12 01:17:34,293 INFO BlockStateChange: BLOCK* processReport: from storage DS-d9ed40cf-bd5d-4033-a6ca-14fb4a8c3587 node DatanodeRegistration(192.168.1.104:50010, datanodeUuid=b702b518-5daa-4fa1-8e69-e4d620a72470, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-e86d0353-9f33-495b-88fa-16035abd3672;nsid=616310490;c=0), blocks: 24, hasStaleStorage: false, processing time: 0 msecs
2016-07-12 01:17:35,501 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* registerDatanode: from DatanodeRegistration(192.168.1.104:50010, datanodeUuid=37038a9f-23ac-42e2-abea-bdf356aaefbe, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-e86d0353-9f33-495b-88fa-16035abd3672;nsid=616310490;c=0) storage 37038a9f-23ac-42e2-abea-bdf356aaefbe
2016-07-12 01:17:35,501 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: BLOCK* registerDatanode: 192.168.1.104:50010
2016-07-12 01:17:35,501 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.1.104:50010
2016-07-12 01:17:35,501 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Number of failed storage changes from 0 to 0
2016-07-12 01:17:35,502 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.1.104:50010
2016-07-12 01:17:35,504 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Number of failed storage changes from 0 to 0
2016-07-12 01:17:35,504 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new storage ID DS-495b6b0e-f1fc-407c-bb9f-6c314c2fdaec for DN 192.168.1.104:50010
Here is a sample of the datanode logs (IP = 192.168.1.104):
2016-07-12 02:02:12,044 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action : DNA_REGISTER from nn1.cluster.com/192.168.1.104:8020 with active state
2016-07-12 02:02:12,045 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid b702b518-5daa-4fa1-8e69-e4d620a72470) service to nn1.cluster.com/192.168.1.104:8020 beginning handshake with NN
2016-07-12 02:02:12,047 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid b702b518-5daa-4fa1-8e69-e4d620a72470) service to nn1.cluster.com/192.168.1.104:8020 successfully registered with NN
2016-07-12 02:02:12,050 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0x236119eb3082, containing 1 storage report(s), of which we sent 1. The reports had 24 total blocks and used 1 RPC(s). This took 0 msec to generate and 1 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
2016-07-12 02:02:12,050 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Got finalize command for block pool BP-1235752202-192.168.1.104-1468159707934
2016-07-12 02:02:15,049 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action : DNA_REGISTER from nn1.cluster.com/192.168.1.104:8020 with active state
2016-07-12 02:02:15,052 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid b702b518-5daa-4fa1-8e69-e4d620a72470) service to nn1.cluster.com/192.168.1.104:8020 beginning handshake with NN
2016-07-12 02:02:15,056 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid b702b518-5daa-4fa1-8e69-e4d620a72470) service to nn1.cluster.com/192.168.1.104:8020 successfully registered with NN
2016-07-12 02:02:15,061 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0x2361cd4be40d, containing 1 storage report(s), of which we sent 1. The reports had 24 total blocks and used 1 RPC(s). This took 0 msec to generate and 2 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
2016-07-12 02:02:15,061 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Got finalize command for block pool BP-1235752202-192.168.1.104-1468159707934
Here is a sample of the 2nd datanode's logs (IP = 192.168.35.128):
2016-07-12 11:45:07,346 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action : DNA_REGISTER from nn1.cluster.com/192.168.1.104:8020 with active state
2016-07-12 11:45:07,349 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid 37038a9f-23ac-42e2-abea-bdf356aaefbe) service to nn1.cluster.com/192.168.1.104:8020 beginning handshake with NN
2016-07-12 11:45:07,355 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid 37038a9f-23ac-42e2-abea-bdf356aaefbe) service to nn1.cluster.com/192.168.1.104:8020 successfully registered with NN
2016-07-12 11:45:07,364 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0xb0de42ec7c, containing 1 storage report(s), of which we sent 1. The reports had 0 total blocks and used 1 RPC(s). This took 0 msec to generate and 4 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
2016-07-12 11:45:07,364 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Got finalize command for block pool BP-1235752202-192.168.1.104-1468159707934
2016-07-12 11:45:10,360 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action : DNA_REGISTER from nn1.cluster.com/192.168.1.104:8020 with active state
2016-07-12 11:45:10,363 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid 37038a9f-23ac-42e2-abea-bdf356aaefbe) service to nn1.cluster.com/192.168.1.104:8020 beginning handshake with NN
2016-07-12 11:45:10,370 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid 37038a9f-23ac-42e2-abea-bdf356aaefbe) service to nn1.cluster.com/192.168.1.104:8020 successfully registered with NN
2016-07-12 11:45:10,377 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0xb191ea9cb9, containing 1 storage report(s), of which we sent 1. The reports had 0 total blocks and used 1 RPC(s). This took 0 msec to generate and 3 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
2016-07-12 11:45:10,377 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Got finalize command for block pool BP-1235752202-192.168.1.104-1468159707934
2016-07-12 11:45:13,376 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action : DNA_REGISTER from nn1.cluster.com/192.168.1.104:8020 with active state
2016-07-12 11:45:13,380 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid 37038a9f-23ac-42e2-abea-bdf356aaefbe) service to nn1.cluster.com/192.168.1.104:8020 beginning handshake with NN
2016-07-12 11:45:13,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid 37038a9f-23ac-42e2-abea-bdf356aaefbe) service to nn1.cluster.com/192.168.1.104:8020 successfully registered with NN
2016-07-12 11:45:13,395 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0xb245b893c4, containing 1 storage report(s), of which we sent 1. The reports had 0 total blocks and used 1 RPC(s). This took 0 msec to generate and 5 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
2016-07-12 11:45:13,396 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Got finalize command for block pool BP-1235752202-192.168.1.104-1468159707934
Why is this happening? Thank you so much for help!
Got the solution. If the slave nodes/datanodes are alive individually but do not show up in the hadoop dfsadmin -report output, then there is a problem with the communication: the datanodes cannot reach the master. Technically speaking, the issue is the firewall; the firewall on the master node is blocking the communication.
We have to stop the firewall on the master, or allow the specific IPs to access the master.
To stop the firewall on CentOS, run the commands below:
service iptables save
service iptables stop
chkconfig iptables off
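If stopping the firewall entirely is too heavy-handed, a sketch of the narrower "allow the specific IP" route (the addresses and port below are taken from the logs in this question; adjust them to your cluster):
# from the 2nd datanode, first confirm the namenode RPC port is reachable
nc -zv 192.168.1.104 8020
# on the master, allow only that datanode to reach the namenode RPC port
iptables -I INPUT -p tcp -s 192.168.35.128 --dport 8020 -j ACCEPT
service iptables save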
Got the solution. The issue was in the namenode IP address. I had used the IP address from the wlan0 interface, which keeps changing. Since I have VMware Workstation installed, I used the static IP address from the vmnet interface instead, and after that change the UI shows 2 live nodes instead of 1.
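To make the chosen address explicit rather than interface-dependent, it helps to verify what the cluster is actually configured with and what the namenode's hostname resolves to on every node; a quick check, assuming the names from the logs above:
# which address the cluster is configured to use (from core-site.xml)
hdfs getconf -confKey fs.defaultFS
# confirm the hostname resolves to the static vmnet address, not a wlan0 address
getent hosts nn1.cluster.com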

why the mesos agent cannot register

I started a mesos-master and a mesos-agent on my virtual machine (master and agent are both on the same server).
# mesos-master --work_dir=/opt/mesos_master
# GLOG_v=1 mesos-agent --master=127.0.0.1:5050 \
--isolation=docker/runtime,filesystem/linux \
--work_dir=/opt/mesos_slave --image_providers=docker
And I got screen output like this:
I0726 18:13:57.042263 8224 master.cpp:4619] Registered agent 28354e0c-fe56-4a82-a420-98489be4519a-S2 at slave(1)#202.106.199.37:5051 (bt-199-037.bta.net.cn) with cpus(*):4; mem(*):944; disk(*):10680; ports(*):[31000-32000]
I0726 18:13:57.042392 8224 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 226
I0726 18:13:57.042790 8224 hierarchical.cpp:478] Added agent 28354e0c-fe56-4a82-a420-98489be4519a-S2 (bt-199-037.bta.net.cn) with cpus(*):4; mem(*):944; disk(*):10680; ports(*):[31000-32000] (allocated: )
I0726 18:13:57.042994 8224 replica.cpp:537] Replica received write request for position 226 from (21)#202.106.199.37:5050
I0726 18:13:57.050371 8224 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 7.277511ms
I0726 18:13:57.050611 8224 replica.cpp:712] Persisted action at 226
I0726 18:13:57.050882 8224 replica.cpp:691] Replica received learned notice for position 226 from #0.0.0.0:0
I0726 18:13:57.053961 8224 leveldb.cpp:341] Persisting action (20 bytes) to leveldb took 3.035601ms
I0726 18:13:57.054203 8224 leveldb.cpp:399] Deleting ~2 keys from leveldb took 167530ns
I0726 18:13:57.054226 8224 replica.cpp:712] Persisted action at 226
I0726 18:13:57.054234 8224 replica.cpp:697] Replica learned TRUNCATE action at position 226
I0726 18:14:46.817351 8228 master.cpp:4520] Agent 28354e0c-fe56-4a82-a420-98489be4519a-S2 at slave(1)#202.106.199.37:5051 (bt-199-037.bta.net.cn) already registered, resending acknowledgement
E0726 18:14:50.530529 8231 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0726 18:15:00.045917 8231 process.cpp:2105] Failed to shutdown socket with fd 13: Transport endpoint is not connected
I0726 18:15:00.045985 8226 master.cpp:1245] Agent 28354e0c-fe56-4a82-a420-98489be4519a-S2 at slave(1)#202.106.199.37:5051 (bt-199-037.bta.net.cn) disconnected
I0726 18:15:00.046139 8226 master.cpp:2784] Disconnecting agent 28354e0c-fe56-4a82-a420-98489be4519a-S2 at slave(1)#202.106.199.37:5051 (bt-199-037.bta.net.cn)
I0726 18:15:00.046185 8226 master.cpp:2803] Deactivating agent 28354e0c-fe56-4a82-a420-98489be4519a-S2 at slave(1)#202.106.199.37:5051 (bt-199-037.bta.net.cn)
I0726 18:15:00.046233 8226 hierarchical.cpp:571] Agent 28354e0c-fe56-4a82-a420-98489be4519a-S2 deactivated
Does anybody know why the agent cannot get registered with the master?
I have seen this issue before. Add your local IP to /etc/mesos-master/ip or /etc/mesos-slave/ip.
When you see the following line in your mesos-master log file:
master.cpp:3216] Deactivating agent AGENT_ID at slave(1)#127.0.1.1:5051 (HOSTNAME)
it means that you didn't specify the mesos-agent IP address. Add --ip=AGENT_HOST_IP as a startup parameter to your agent startup script or command.
You didn't tell the master which network interface to listen on. Most probably (that is what your agent log hints at) it listens at 202.106.199.37:5050.
Either explicitly tell your master to listen on 127.0.0.1 via the --ip flag, or tell your agent where your master is (you can get this information from its log).
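Putting the suggestions above together, a sketch for this single-host setup (127.0.0.1 stands in for whatever address you want master and agent to use; the /etc/mesos-*/ip files are only read by the Mesosphere package init scripts, not by hand-started binaries):
# for package-managed daemons
echo 127.0.0.1 | sudo tee /etc/mesos-master/ip
echo 127.0.0.1 | sudo tee /etc/mesos-slave/ip
# for daemons started by hand, pass the address explicitly
mesos-master --ip=127.0.0.1 --work_dir=/opt/mesos_master
GLOG_v=1 mesos-agent --master=127.0.0.1:5050 --ip=127.0.0.1 \
    --isolation=docker/runtime,filesystem/linux \
    --work_dir=/opt/mesos_slave --image_providers=docker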

SonarQube WebServer process spikes CPU after a while

We're running SonarQube 5.1.2 on an AWS node. After a short period of use, typically a day or two, the Sonar web server becomes unresponsive and spikes the server's CPUs:
top - 01:59:47 up 2 days, 3:43, 1 user, load average: 1.89, 1.76, 1.11
Tasks: 93 total, 1 running, 92 sleeping, 0 stopped, 0 zombie
Cpu(s): 94.5%us, 0.0%sy, 0.0%ni, 5.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 7514056k total, 2828772k used, 4685284k free, 155372k buffers
Swap: 0k total, 0k used, 0k free, 872440k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2328 root 20 0 3260m 1.1g 19m S 188.3 15.5 62:51.79 java
11 root 20 0 0 0 0 S 0.3 0.0 0:07.90 events/0
2284 root 20 0 3426m 407m 19m S 0.3 5.5 9:51.04 java
1 root 20 0 19356 1536 1224 S 0.0 0.0 0:00.23 init
The 188% CPU load is coming from the WebServer process:
$ ps -eF|grep "root *2328"
root 2328 2262 2 834562 1162384 0 Mar01 ? 01:06:24 /usr/java/jre1.8.0_25/bin/java -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djruby.management.enabled=false -Djruby.compile.invokedynamic=false -Xmx768m -XX:MaxPermSize=160m -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/opt/sonar/temp -cp ./lib/common/*:./lib/server/*:/opt/sonar/lib/jdbc/mysql/mysql-connector-java-5.1.34.jar org.sonar.server.app.WebServer /tmp/sq-process615754070383971531properties
We initially thought that we were running on way too small of a node and recently upgraded to an m3-large instance, but we're seeing the same problem (except now it's spiking 2 CPUs instead of one).
The only interesting info in the log is this:
2016.03.04 01:52:38 WARN web[o.e.transport] [sonar-1456875684135] Received response for a request that has timed out, sent [39974ms] ago, timed out [25635ms] ago, action [cluster:monitor/nodes/info], node [[#transport#-1][xxxxxxxx-build02-us-west-2b][inet[/127.0.0.1:9001]]], id [43817]
2016.03.04 01:53:19 INFO web[o.e.client.transport] [sonar-1456875684135] failed to get node info for [#transport#-1][xxxxxxxx-build02-us-west-2b][inet[/127.0.0.1:9001]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[/127.0.0.1:9001]][cluster:monitor/nodes/info] request_id [43817] timed out after [14339ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366) ~[elasticsearch-1.4.4.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [na:1.8.0_25]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [na:1.8.0_25]
at java.lang.Thread.run(Unknown Source) [na:1.8.0_25]
Does anyone know what might be going on here, or have some ideas on how to further diagnose this problem?

mesos slave on different host unable to add itself

The Mesos slave is unable to add itself to the cluster. Right now I have 3 machines, with 3 slaves running and 1 master.
But on the Mesos page I can see just one master and one slave (the one on the same host as the master). I can see Marathon running, the apps, etc.
But the other slaves are unable to connect to the master.
Slave logs:
I0825 21:30:00.971642 4110 slave.cpp:4193] Received oversubscribable resources from the resource estimator
I0825 21:30:01.000732 4106 group.cpp:313] Group process (group(1)#127.0.1.1:5051) connected to ZooKeeper
I0825 21:30:01.000821 4106 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0825 21:30:01.000874 4106 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I0825 21:30:01.007753 4106 detector.cpp:138] Detected a new leader: (id='9')
I0825 21:30:01.008038 4106 group.cpp:656] Trying to get '/mesos/info_0000000009' in ZooKeeper
W0825 21:30:01.020577 4106 detector.cpp:444] Leading master master#127.0.1.1:5050 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
I0825 21:30:01.021152 4106 detector.cpp:481] A new leading master (UPID=master#127.0.1.1:5050) is detected
I0825 21:30:01.021353 4106 status_update_manager.cpp:176] Pausing sending status updates
I0825 21:30:01.021385 4105 slave.cpp:684] New master detected at master#127.0.1.1:5050
I0825 21:30:01.022073 4105 slave.cpp:709] No credentials provided. Attempting to register without authentication
E0825 21:30:01.022299 4113 socket.hpp:107] Shutdown failed on fd=11: Transport endpoint is not connected [107]
ZooKeeper on the master:
ls /mesos
[info_0000000009, info_0000000010, log_replicas]
ls /mesos/info_0000000009
[]
Please note these lines in the slave logs:
Trying to get '/mesos/info_0000000009' in ZooKeeper
and then why does the slave assume the master is 127.0.1.1:5050? I never specified that:
Leading master master#127.0.1.1:5050
but ZooKeeper returns
ls /mesos/info_0000000009
[]
I looked into the master's ZooKeeper and found that it was not set at all. Is it a bug in Mesos, or am I missing some configuration?
Also, the ZooKeeper logs on the master show it closed the client connection (maybe the client then started to connect to some other master):
2015-08-25 21:30:01,882 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxn#349] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14f657dafeb000d, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
2015-08-25 21:30:01,884 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxn#1001] - Closed socket connection for client /192.168.0.3:53125 which had sessionid 0x14f657dafeb000d
2015-08-25 21:30:01,952 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.3:53166
Note: the slave on the same host as the master works perfectly fine.
I have been trying to resolve this for more than 2 days now. Please help.
Looks like a bug to me. Where can I see the current master in ZooKeeper? Is it something like /mesos/info_0000000009? But I was getting this in ZooKeeper:
ls /mesos/info_0000000009
[]
an empty array there. Is this correct? Because the client logs were trying to look for this:
I0825 21:30:01.008038 4106 group.cpp:656] Trying to get '/mesos/info_0000000009' in ZooKeeper
W0825 21:30:01.020577 4106 detector.cpp:444] Leading master master#127.0.1.1:5050 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
I0825 21:30:01.021152 4106 detector.cpp:481] A new leading master (UPID=master#127.0.1.1:5050) is detected
and then the client tries 127.0.1.1:5050.
Here are the complete slave logs:
Log file created at: 2015/08/27 07:12:56
Running on machine: vvwslave1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0827 07:12:56.406455 1303 logging.cpp:172] INFO level logging started!
I0827 07:12:56.438398 1303 main.cpp:162] Build: 2015-07-24 10:05:39 by root
I0827 07:12:56.438534 1303 main.cpp:164] Version: 0.23.0
I0827 07:12:56.438634 1303 main.cpp:167] Git tag: 0.23.0
I0827 07:12:56.438733 1303 main.cpp:171] Git SHA: 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
I0827 07:12:56.510270 1303 containerizer.cpp:111] Using isolation: posix/cpu,posix/mem
I0827 07:12:56.566021 1329 group.cpp:313] Group process (group(1)#127.0.1.1:5051) connected to ZooKeeper
I0827 07:12:56.566082 1329 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0827 07:12:56.566108 1329 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I0827 07:12:56.571959 1303 main.cpp:249] Starting Mesos slave
I0827 07:12:56.587656 1303 slave.cpp:190] Slave started on 1)#127.0.1.1:5051
I0827 07:12:56.587723 1303 slave.cpp:191] Flags at startup: --authenticatee="crammd5" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_remove_delay="6hrs" --docker_sandbox_directory="/mnt/mesos/sandbox" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://192.168.0.2:2281/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true" --strict="true" --switch_user="true" --version="false" --work_dir="/tmp/mesos"
I0827 07:12:56.592327 1303 slave.cpp:354] Slave resources: cpus(*):2; mem(*):979; disk(*):67653; ports(*):[31000-32000]
I0827 07:12:56.592576 1303 slave.cpp:384] Slave hostname: vvwslave1
I0827 07:12:56.592608 1303 slave.cpp:389] Slave checkpoint: true
I0827 07:12:56.633998 1330 state.cpp:36] Recovering state from '/tmp/mesos/meta'
I0827 07:12:56.644068 1330 status_update_manager.cpp:202] Recovering status update manager
I0827 07:12:56.644907 1330 containerizer.cpp:316] Recovering containerizer
I0827 07:12:56.650073 1330 slave.cpp:4026] Finished recovery
I0827 07:12:56.650527 1330 slave.cpp:4179] Querying resource estimator for oversubscribable resources
I0827 07:12:56.650653 1330 slave.cpp:4193] Received oversubscribable resources from the resource estimator
I0827 07:12:56.657416 1329 detector.cpp:138] Detected a new leader: (id='14')
I0827 07:12:56.657564 1329 group.cpp:656] Trying to get '/mesos/info_0000000014' in ZooKeeper
W0827 07:12:56.659080 1329 detector.cpp:444] Leading master master#127.0.1.1:5050 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
I0827 07:12:56.677889 1329 detector.cpp:481] A new leading master (UPID=master#127.0.1.1:5050) is detected
I0827 07:12:56.677989 1329 slave.cpp:684] New master detected at master#127.0.1.1:5050
I0827 07:12:56.678146 1326 status_update_manager.cpp:176] Pausing sending status updates
I0827 07:12:56.678195 1329 slave.cpp:709] No credentials provided. Attempting to register without authentication
I0827 07:12:56.678239 1329 slave.cpp:720] Detecting new master
I0827 07:12:56.678591 1329 slave.cpp:3087] master#127.0.1.1:5050 exited
W0827 07:12:56.678702 1329 slave.cpp:3090] Master disconnected! Waiting for a new master to be elected
E0827 07:12:56.678460 1332 socket.hpp:107] Shutdown failed on fd=11: Transport endpoint is not connected [107]
E0827 07:12:57.068922 1332 socket.hpp:107] Shutdown failed on fd=11: Transport endpoint is not connected [107]
E0827 07:12:58.829129 1332 socket.hpp:107] Shutdown failed on fd=11: Transport endpoint is not connected [107]
And here are the complete ZooKeeper logs from the master:
2015-08-27 07:12:42,672 - INFO [main:QuorumPeerConfig#101] - Reading configuration from: /etc/zookeeper/conf/zoo.cfg
2015-08-27 07:12:42,718 - ERROR [main:QuorumPeerConfig#283] - Invalid configuration, only one server specified (ignoring)
2015-08-27 07:12:42,720 - INFO [main:DatadirCleanupManager#78] - autopurge.snapRetainCount set to 10
2015-08-27 07:12:42,720 - INFO [main:DatadirCleanupManager#79] - autopurge.purgeInterval set to 0
2015-08-27 07:12:42,721 - INFO [main:DatadirCleanupManager#101] - Purge task is not scheduled.
2015-08-27 07:12:42,721 - WARN [main:QuorumPeerMain#113] - Either no config or no quorum defined in config, running in standalone mode
2015-08-27 07:12:42,741 - INFO [main:QuorumPeerConfig#101] - Reading configuration from: /etc/zookeeper/conf/zoo.cfg
2015-08-27 07:12:42,765 - ERROR [main:QuorumPeerConfig#283] - Invalid configuration, only one server specified (ignoring)
2015-08-27 07:12:42,765 - INFO [main:ZooKeeperServerMain#95] - Starting server
2015-08-27 07:12:42,776 - INFO [main:Environment#100] - Server environment:zookeeper.version=3.4.5--1, built on 06/10/2013 17:26 GMT
2015-08-27 07:12:42,776 - INFO [main:Environment#100] - Server environment:host.name=vvw
2015-08-27 07:12:42,776 - INFO [main:Environment#100] - Server environment:java.version=1.7.0_79
2015-08-27 07:12:42,776 - INFO [main:Environment#100] - Server environment:java.vendor=Oracle Corporation
2015-08-27 07:12:42,777 - INFO [main:Environment#100] - Server environment:java.home=/usr/lib/jvm/java-7-openjdk-amd64/jre
2015-08-27 07:12:42,777 - INFO [main:Environment#100] - Server environment:java.class.path=/etc/zookeeper/conf:/usr/share/java/jline.jar:/usr/share/java/log4j-1.2.jar:/usr/share/java/xercesImpl.jar:/usr/share/java/xmlParserAPIs.jar:/usr/share/java/netty.jar:/usr/share/java/slf4j-api.jar:/usr/share/java/slf4j-log4j12.jar:/usr/share/java/zookeeper.jar
2015-08-27 07:12:42,777 - INFO [main:Environment#100] - Server environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2015-08-27 07:12:42,779 - INFO [main:Environment#100] - Server environment:java.io.tmpdir=/tmp
2015-08-27 07:12:42,779 - INFO [main:Environment#100] - Server environment:java.compiler=<NA>
2015-08-27 07:12:42,779 - INFO [main:Environment#100] - Server environment:os.name=Linux
2015-08-27 07:12:42,779 - INFO [main:Environment#100] - Server environment:os.arch=amd64
2015-08-27 07:12:42,780 - INFO [main:Environment#100] - Server environment:os.version=3.19.0-25-generic
2015-08-27 07:12:42,780 - INFO [main:Environment#100] - Server environment:user.name=zookeeper
2015-08-27 07:12:42,780 - INFO [main:Environment#100] - Server environment:user.home=/var/lib/zookeeper
2015-08-27 07:12:42,780 - INFO [main:Environment#100] - Server environment:user.dir=/
2015-08-27 07:12:42,789 - INFO [main:ZooKeeperServer#726] - tickTime set to 2000
2015-08-27 07:12:42,789 - INFO [main:ZooKeeperServer#735] - minSessionTimeout set to -1
2015-08-27 07:12:42,789 - INFO [main:ZooKeeperServer#744] - maxSessionTimeout set to -1
2015-08-27 07:12:42,806 - INFO [main:NIOServerCnxnFactory#94] - binding to port 0.0.0.0/0.0.0.0:2281
2015-08-27 07:12:42,826 - INFO [main:FileSnap#83] - Reading snapshot /var/lib/zookeeper/version-2/snapshot.705
2015-08-27 07:12:42,859 - INFO [main:FileTxnSnapLog#240] - Snapshotting: 0x728 to /var/lib/zookeeper/version-2/snapshot.728
2015-08-27 07:12:44,848 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.2:44500
2015-08-27 07:12:44,857 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#793] - Connection request from old client /192.168.0.2:44500; will be dropped if server is in r-o mode
2015-08-27 07:12:44,859 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.2:44500
2015-08-27 07:12:44,862 - INFO [SyncThread:0:FileTxnLog#199] - Creating new log file: log.729
2015-08-27 07:12:45,299 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10000 with negotiated timeout 10000 for client /192.168.0.2:44500
2015-08-27 07:12:45,505 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.2:44501
2015-08-27 07:12:45,506 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#793] - Connection request from old client /192.168.0.2:44501; will be dropped if server is in r-o mode
2015-08-27 07:12:45,506 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.2:44501
2015-08-27 07:12:45,509 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.2:44502
2015-08-27 07:12:45,510 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#793] - Connection request from old client /192.168.0.2:44502; will be dropped if server is in r-o mode
2015-08-27 07:12:45,510 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.2:44502
2015-08-27 07:12:45,538 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.2:44503
2015-08-27 07:12:45,538 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.2:44504
2015-08-27 07:12:45,538 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#793] - Connection request from old client /192.168.0.2:44503; will be dropped if server is in r-o mode
2015-08-27 07:12:45,539 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.2:44503
2015-08-27 07:12:45,539 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#793] - Connection request from old client /192.168.0.2:44504; will be dropped if server is in r-o mode
2015-08-27 07:12:45,539 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.2:44504
2015-08-27 07:12:45,564 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10001 with negotiated timeout 10000 for client /192.168.0.2:44501
2015-08-27 07:12:45,674 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10002 with negotiated timeout 10000 for client /192.168.0.2:44502
2015-08-27 07:12:45,675 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10003 with negotiated timeout 10000 for client /192.168.0.2:44503
2015-08-27 07:12:45,676 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10004 with negotiated timeout 10000 for client /192.168.0.2:44504
2015-08-27 07:12:46,183 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.2:44506
2015-08-27 07:12:46,189 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.2:44506
2015-08-27 07:12:46,232 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10005 with negotiated timeout 10000 for client /192.168.0.2:44506
2015-08-27 07:12:48,195 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.2:44508
2015-08-27 07:12:48,196 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.2:44508
2015-08-27 07:12:48,212 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10006 with negotiated timeout 40000 for client /192.168.0.2:44508
2015-08-27 07:12:49,872 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.2:44509
2015-08-27 07:12:49,873 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#793] - Connection request from old client /192.168.0.2:44509; will be dropped if server is in r-o mode
2015-08-27 07:12:49,873 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.2:44509
2015-08-27 07:12:49,878 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10007 with negotiated timeout 10000 for client /192.168.0.2:44509
2015-08-27 07:12:56,161 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.3:60436
2015-08-27 07:12:56,161 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#793] - Connection request from old client /192.168.0.3:60436; will be dropped if server is in r-o mode
2015-08-27 07:12:56,161 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.3:60436
2015-08-27 07:12:56,189 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10008 with negotiated timeout 10000 for client /192.168.0.3:60436
And the logs from the master node:
I0827 07:12:45.412888 1604 leveldb.cpp:176] Opened db in 567.381081ms
I0827 07:12:45.469497 1604 leveldb.cpp:183] Compacted db in 56.508537ms
I0827 07:12:45.469674 1604 leveldb.cpp:198] Created db iterator in 21452ns
I0827 07:12:45.502590 1604 leveldb.cpp:204] Seeked to beginning of db in 32.834339ms
I0827 07:12:45.502900 1604 leveldb.cpp:273] Iterated through 3 keys in the db in 101809ns
I0827 07:12:45.503026 1604 replica.cpp:744] Replica recovered with log positions 73 -> 74 with 0 holes and 0 unlearned
I0827 07:12:45.507745 1643 log.cpp:238] Attempting to join replica to ZooKeeper group
I0827 07:12:45.507983 1643 recover.cpp:449] Starting replica recovery
I0827 07:12:45.508095 1643 recover.cpp:475] Replica is in VOTING status
I0827 07:12:45.508167 1643 recover.cpp:464] Recover process terminated
I0827 07:12:45.536058 1604 main.cpp:383] Starting Mesos master
I0827 07:12:45.559154 1604 master.cpp:368] Master 20150827-071245-16842879-5050-1604 (vvwmaster) started on 127.0.1.1:5050
I0827 07:12:45.559239 1604 master.cpp:370] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --framework_sorter="drf" --help="false" --hostname="vvwmaster" --initialize_driver_logging="true" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="1" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://192.168.0.2:2281/mesos" --zk_session_timeout="10secs"
I0827 07:12:45.559460 1604 master.cpp:417] Master allowing unauthenticated frameworks to register
I0827 07:12:45.559491 1604 master.cpp:422] Master allowing unauthenticated slaves to register
I0827 07:12:45.559587 1604 master.cpp:459] Using default 'crammd5' authenticator
W0827 07:12:45.559619 1604 authenticator.cpp:504] No credentials provided, authentication requests will be refused.
I0827 07:12:45.559909 1604 authenticator.cpp:511] Initializing server SASL
I0827 07:12:45.564357 1642 group.cpp:313] Group process (group(1)#127.0.1.1:5050) connected to ZooKeeper
I0827 07:12:45.564539 1642 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0827 07:12:45.564590 1642 group.cpp:385] Trying to create path '/mesos/log_replicas' in ZooKeeper
I0827 07:12:45.675650 1644 group.cpp:313] Group process (group(2)#127.0.1.1:5050) connected to ZooKeeper
I0827 07:12:45.675717 1644 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
I0827 07:12:45.675750 1644 group.cpp:385] Trying to create path '/mesos/log_replicas' in ZooKeeper
I0827 07:12:45.676774 1639 group.cpp:313] Group process (group(3)#127.0.1.1:5050) connected to ZooKeeper
I0827 07:12:45.676828 1639 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0827 07:12:45.676857 1639 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I0827 07:12:45.678182 1640 group.cpp:313] Group process (group(4)#127.0.1.1:5050) connected to ZooKeeper
I0827 07:12:45.678235 1640 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0827 07:12:45.678380 1640 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I0827 07:12:45.809567 1645 network.hpp:415] ZooKeeper group memberships changed
I0827 07:12:45.816505 1644 group.cpp:656] Trying to get '/mesos/log_replicas/0000000013' in ZooKeeper
I0827 07:12:45.820705 1645 network.hpp:463] ZooKeeper group PIDs: { log-replica(1)#127.0.1.1:5050 }
I0827 07:12:46.020447 1644 contender.cpp:131] Joining the ZK group
I0827 07:12:46.020498 1639 master.cpp:1420] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I0827 07:12:46.078451 1643 contender.cpp:247] New candidate (id='14') has entered the contest for leadership
I0827 07:12:46.078984 1645 detector.cpp:138] Detected a new leader: (id='14')
I0827 07:12:46.079110 1645 group.cpp:656] Trying to get '/mesos/info_0000000014' in ZooKeeper
W0827 07:12:46.084359 1645 detector.cpp:444] Leading master master#127.0.1.1:5050 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
I0827 07:12:46.084485 1645 detector.cpp:481] A new leading master (UPID=master#127.0.1.1:5050) is detected
I0827 07:12:46.084553 1645 master.cpp:1481] The newly elected leader is master#127.0.1.1:5050 with id 20150827-071245-16842879-5050-1604
I0827 07:12:46.084653 1645 master.cpp:1494] Elected as the leading master!
I0827 07:12:46.084682 1645 master.cpp:1264] Recovering from registrar
I0827 07:12:46.084812 1645 registrar.cpp:313] Recovering registrar
I0827 07:12:46.085160 1645 log.cpp:661] Attempting to start the writer
I0827 07:12:46.085683 1639 replica.cpp:477] Replica received implicit promise request with proposal 18
I0827 07:12:46.231271 1639 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 145.505945ms
I0827 07:12:46.231402 1639 replica.cpp:345] Persisted promised to 18
I0827 07:12:46.231667 1640 coordinator.cpp:230] Coordinator attemping to fill missing position
I0827 07:12:46.231801 1640 log.cpp:677] Writer started with ending position 74
I0827 07:12:46.232197 1646 leveldb.cpp:438] Reading position from leveldb took 60443ns
I0827 07:12:46.232319 1646 leveldb.cpp:438] Reading position from leveldb took 21312ns
I0827 07:12:46.232934 1646 registrar.cpp:346] Successfully fetched the registry (247B) in 148.019968ms
I0827 07:12:46.233131 1646 registrar.cpp:445] Applied 1 operations in 17888ns; attempting to update the 'registry'
I0827 07:12:46.234346 1640 log.cpp:685] Attempting to append 286 bytes to the log
I0827 07:12:46.234463 1640 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 75
I0827 07:12:46.234748 1645 replica.cpp:511] Replica received write request for position 75
I0827 07:12:46.274888 1645 leveldb.cpp:343] Persisting action (305 bytes) to leveldb took 40.044935ms
I0827 07:12:46.275140 1645 replica.cpp:679] Persisted action at 75
I0827 07:12:46.275503 1646 replica.cpp:658] Replica received learned notice for position 75
I0827 07:12:46.307917 1646 leveldb.cpp:343] Persisting action (307 bytes) to leveldb took 32.320539ms
I0827 07:12:46.308076 1646 replica.cpp:679] Persisted action at 75
I0827 07:12:46.308112 1646 replica.cpp:664] Replica learned APPEND action at position 75
I0827 07:12:46.308668 1646 registrar.cpp:490] Successfully updated the 'registry' in 75.472128ms
I0827 07:12:46.308749 1646 registrar.cpp:376] Successfully recovered registrar
I0827 07:12:46.308888 1646 log.cpp:704] Attempting to truncate the log to 75
I0827 07:12:46.309002 1646 master.cpp:1291] Recovered 1 slaves from the Registry (247B) ; allowing 10mins for slaves to re-register
I0827 07:12:46.309056 1646 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 76
I0827 07:12:46.309252 1646 replica.cpp:511] Replica received write request for position 76
I0827 07:12:46.352067 1646 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 42.749912ms
I0827 07:12:46.352377 1646 replica.cpp:679] Persisted action at 76
I0827 07:12:46.352900 1646 replica.cpp:658] Replica received learned notice for position 76
I0827 07:12:46.407814 1646 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 54.686166ms
I0827 07:12:46.408033 1646 leveldb.cpp:401] Deleting ~2 keys from leveldb took 50800ns
I0827 07:12:46.408068 1646 replica.cpp:679] Persisted action at 76
I0827 07:12:46.408102 1646 replica.cpp:664] Replica learned TRUNCATE action at position 76
I0827 07:12:46.884490 1644 master.cpp:3332] Registering slave at slave(1)#127.0.1.1:5051 (vvw) with id 20150827-071245-16842879-5050-1604-S0
I0827 07:12:46.900085 1644 registrar.cpp:445] Applied 1 operations in 43323ns; attempting to update the 'registry'
I0827 07:12:46.901564 1639 log.cpp:685] Attempting to append 440 bytes to the log
I0827 07:12:46.901736 1639 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 77
I0827 07:12:46.902035 1639 replica.cpp:511] Replica received write request for position 77
I0827 07:12:46.947882 1639 leveldb.cpp:343] Persisting action (459 bytes) to leveldb took 45.777578ms
I0827 07:12:46.948067 1639 replica.cpp:679] Persisted action at 77
I0827 07:12:46.948422 1639 replica.cpp:658] Replica received learned notice for position 77
I0827 07:12:46.992007 1639 leveldb.cpp:343] Persisting action (461 bytes) to leveldb took 43.518061ms
I0827 07:12:46.992187 1639 replica.cpp:679] Persisted action at 77
I0827 07:12:46.992249 1639 replica.cpp:664] Replica learned APPEND action at position 77
I0827 07:12:46.992826 1640 registrar.cpp:490] Successfully updated the 'registry' in 92.466176ms
I0827 07:12:46.992949 1639 log.cpp:704] Attempting to truncate the log to 77
I0827 07:12:46.993027 1639 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 78
I0827 07:12:46.993371 1639 replica.cpp:511] Replica received write request for position 78
I0827 07:12:46.993588 1640 master.cpp:3395] Registered slave 20150827-071245-16842879-5050-1604-S0 at slave(1)#127.0.1.1:5051 (vvw) with cpus(*):4; mem(*):1846; disk(*):141854; ports(*):[31000-32000]
I0827 07:12:46.993785 1644 hierarchical.hpp:528] Added slave 20150827-071245-16842879-5050-1604-S0 (vvw) with cpus(*):4; mem(*):1846; disk(*):141854; ports(*):[31000-32000] (allocated: )
I0827 07:12:47.018685 1641 master.cpp:3687] Received update of slave 20150827-071245-16842879-5050-1604-S0 at slave(1)#127.0.1.1:5051 (vvw) with total oversubscribed resources
I0827 07:12:47.018934 1641 hierarchical.hpp:588] Slave 20150827-071245-16842879-5050-1604-S0 (vvw) updated with oversubscribed resources (total: cpus(*):4; mem(*):1846; disk(*):141854; ports(*):[31000-32000], allocated: )
I0827 07:12:47.036170 1639 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 42.72315ms
I0827 07:12:47.036388 1639 replica.cpp:679] Persisted action at 78
"But at the mesos page i can see just one master and one slave (same as the master's host present)."
Most probably this happens because the master cannot establish a connection back to agents (a.k.a. slaves) running on other machines. Right now (this may change with the new HTTP API), the master must be able to open a connection to an agent, which means an agent must report a non-local IP when it registers with the master. From your logs it looks like the agents bind to local IPs (127.0.1.1); Ubuntu maps the machine's hostname to 127.0.1.1 in /etc/hosts by default, so every agent resolves to the same local address. Because both agents then advertise the identical address slave(1)#127.0.1.1:5051, each new registration displaces the previous one, which is exactly why only the most recently connected slave shows up in the webUI. You can change the advertised address via the --ip flag.
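For example, on a slave machine whose LAN address is 192.168.0.3 (one of the addresses visible in your ZooKeeper log; substitute each slave's real IP), you could start the agent like this, assuming the agents also locate the master through ZooKeeper as your master's --zk flag suggests:
./mesos-slave.sh --master=zk://192.168.0.2:2281/mesos --ip=192.168.0.3
After that, the registration line in the master log should show slave(1)#192.168.0.3:5051 rather than slave(1)#127.0.1.1:5051, and both slaves should stay listed.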
I have noticed that you are running Mesos as a service. In that case there should be a configuration file where you specify the master IP (or the ZooKeeper URL), and its default value is 127.0.1.1, so only the slave on the same machine as the master can connect to it, because when running mesos-slave you must give it the master's IP.
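If you installed Mesos from the Mesosphere packages (the usual way to get it running as a service), the init scripts read one file per flag from /etc. A minimal sketch, assuming that standard package layout and the addresses from your logs; adjust paths and IPs to your setup:
# on each slave machine
echo "zk://192.168.0.2:2281/mesos" | sudo tee /etc/mesos/zk    # how to find the master
echo "192.168.0.3" | sudo tee /etc/mesos-slave/ip              # advertise this LAN IP instead of 127.0.1.1
sudo service mesos-slave restart
Each file under /etc/mesos-slave/ is turned into the corresponding --<flag> command-line option when the service starts.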
