I was trying to use the NEAR Lake Indexer, but after completing every step, when I ran
./target/release/near-lake --home ~/.near/mainnet run --endpoint https://my-endpoint --bucket my-bucket --region us-east-1 --stream-while-syncing sync-from-latest
the only output is
INFO stats: # 9820210 Waiting for peers 0 peers ⬇ 0 B/s ⬆ 0 B/s 0.00 bps 0 gas/s CPU: 0%, Mem: 32.8 MB
Why is it trying to fetch block 9820210 and not the latest one, even though I passed in "sync-from-latest"?
You need to follow the standard nearcore network sync procedures: https://github.com/near/near-lake-indexer#syncing
INFO stats: # 9820210 Waiting for peers 0 peers ⬇ 0 B/s ⬆ 0 B/s 0.00 bps 0 gas/s CPU: 0%, Mem: 32.8 MB
This means that your node does not have a network state backup and is waiting to get synced. Why that is not happening is not clear to me; it needs to be asked on https://github.com/near/nearcore, giving them a way to reproduce the issue by running neard instead of the indexer.
Why is it trying to fetch block 9820210 but not the latest one, even though I passed in "sync-from-latest"?
The stats log line is emitted by nearcore, and it reports the most recently synced block on the node, not the block the indexer was asked to start from.
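To watch for this state without reading the log by eye, the stats line can be parsed for the block height and peer count. This is only a minimal Python sketch: the pattern is inferred from the single sample line quoted in the question (with the bandwidth fields left out), parse_stats is a hypothetical helper, and the layout may differ between nearcore versions.

```python
import re

# Pattern inferred from the one sample line in the question; this is an
# assumption about the layout, not a documented nearcore log format.
STATS_RE = re.compile(
    r"stats: #\s*(?P<height>\d+)\s+(?P<status>.*?)\s+(?P<peers>\d+) peers"
)

def parse_stats(line):
    """Return (block_height, status, peer_count), or None if the line does not match."""
    m = STATS_RE.search(line)
    if m is None:
        return None
    return int(m.group("height")), m.group("status"), int(m.group("peers"))

# Sample line from the question, bandwidth fields omitted.
sample = "INFO stats: # 9820210 Waiting for peers 0 peers 0.00 bps 0 gas/s CPU: 0%, Mem: 32.8 MB"

height, status, peers = parse_stats(sample)
if peers == 0:
    # With no peers the node cannot sync, so the reported height will not advance.
    print(f"node is at block {height}: {status} (no peers connected)")
```

If the height stays the same and the peer count stays at 0 across many lines, the node is not syncing at all, which is a nearcore networking or state problem rather than an indexer one.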
I have seen lots of answers on SO and Quora, along with many other websites. Some people solved the problem by configuring the firewall for the slaves' IPs; others said it is a UI glitch. I am confused. I have two datanodes: one is a pure datanode and the other is a namenode+datanode. The problem is that <master-ip>:50075 shows only one datanode (the one on the machine that also runs the namenode), but hdfs dfsadmin -report shows that I have two datanodes, and after starting Hadoop on my master, running jps on the pure datanode (slave) machine shows the datanode running.
The firewall is off on both machines: sudo ufw status verbose returns Status: inactive. The same scenario occurs with Spark: the Spark UI shows the worker on the master node but not the pure worker node, even though the worker is running on the pure worker machine. Again, is this a UI glitch, or am I missing something?
hdfs dfsadmin -report
Configured Capacity: 991216451584 (923.14 GB)
Present Capacity: 343650484224 (320.05 GB)
DFS Remaining: 343650418688 (320.05 GB)
DFS Used: 65536 (64 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Pending deletion blocks: 0
-------------------------------------------------
Live datanodes (2):
Name: 10.10.10.105:50010 (ekbana)
Hostname: ekbana
Decommission Status : Normal
Configured Capacity: 24690192384 (22.99 GB)
DFS Used: 32768 (32 KB)
Non DFS Used: 7112691712 (6.62 GB)
DFS Remaining: 16299675648 (15.18 GB)
DFS Used%: 0.00%
DFS Remaining%: 66.02%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Jul 25 04:27:36 EDT 2017
Name: 110.44.111.147:50010 (saque-slave-ekbana)
Hostname: ekbana
Decommission Status : Normal
Configured Capacity: 966526259200 (900.15 GB)
DFS Used: 32768 (32 KB)
Non DFS Used: 590055215104 (549.53 GB)
DFS Remaining: 327350743040 (304.87 GB)
DFS Used%: 0.00%
DFS Remaining%: 33.87%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Jul 25 04:27:36 EDT 2017
/etc/hadoop/masters file on master node
ekbana
/etc/hadoop/slaves file on master node
ekbana
saque-slave-ekbana
/etc/hadoop/masters file on slave node
saque-master
Note: saque-master on the slave machine and ekbana on the master machine are mapped to the same IP.
Also UI looks similar to this question's UI
It's because both datanodes report the same hostname (ekbana).
The UI shows only one entry per hostname.
To confirm this, start only the datanode that is not on the master; you will see an entry for it in the UI.
If you start the other datanode too, it will mask the second entry, because the hostname is the same.
Change the hostname on one of the machines and try again.
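The duplicate is visible in the dfsadmin report from the question: both datanodes have different Name addresses but the same Hostname. Here is a small Python sketch that groups the report by hostname to flag such collisions. It assumes the Name:/Hostname: layout shown in the question, and addresses_by_hostname is a hypothetical helper, not a Hadoop tool.

```python
from collections import defaultdict

def addresses_by_hostname(report):
    """Group datanode addresses from `hdfs dfsadmin -report` output by their Hostname value."""
    groups = defaultdict(list)
    current = None
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("Name:"):
            # e.g. "Name: 10.10.10.105:50010 (ekbana)" -> "10.10.10.105:50010"
            current = line.split()[1]
        elif line.startswith("Hostname:") and current is not None:
            hostname = line.split(":", 1)[1].strip()
            groups[hostname].append(current)
            current = None
    return dict(groups)

# The relevant lines from the report in the question.
report = """\
Name: 10.10.10.105:50010 (ekbana)
Hostname: ekbana
Name: 110.44.111.147:50010 (saque-slave-ekbana)
Hostname: ekbana
"""

for host, addrs in addresses_by_hostname(report).items():
    if len(addrs) > 1:
        print(f"hostname {host!r} is reported by {len(addrs)} datanodes: {addrs}")
```

Any view that is keyed by hostname can show only one of these two datanodes, which matches what the UI displays.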
I also faced a similar issue, where I couldn't see datanode information on the dfshealth.html page. I had two hosts named master and slave.
etc/hadoop/masters (on master machine)
master
etc/hadoop/slaves
master
slave
etc/hadoop/masters (slave machine)
master
etc/hadoop/slaves
slave
With this configuration, I was able to see the datanodes in the UI.
I started a mesos-master and a mesos-agent on my virtual machine (master and agent both on the same server).
# mesos-master --work_dir=/opt/mesos_master
# GLOG_v=1 mesos-agent --master=127.0.0.1:5050 \
--isolation=docker/runtime,filesystem/linux \
--work_dir=/opt/mesos_slave --image_providers=docker
And I got screen output like this:
I0726 18:13:57.042263 8224 master.cpp:4619] Registered agent 28354e0c-fe56-4a82-a420-98489be4519a-S2 at slave(1)#202.106.199.37:5051 (bt-199-037.bta.net.cn) with cpus(*):4; mem(*):944; disk(*):10680; ports(*):[31000-32000]
I0726 18:13:57.042392 8224 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 226
I0726 18:13:57.042790 8224 hierarchical.cpp:478] Added agent 28354e0c-fe56-4a82-a420-98489be4519a-S2 (bt-199-037.bta.net.cn) with cpus(*):4; mem(*):944; disk(*):10680; ports(*):[31000-32000] (allocated: )
I0726 18:13:57.042994 8224 replica.cpp:537] Replica received write request for position 226 from (21)#202.106.199.37:5050
I0726 18:13:57.050371 8224 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 7.277511ms
I0726 18:13:57.050611 8224 replica.cpp:712] Persisted action at 226
I0726 18:13:57.050882 8224 replica.cpp:691] Replica received learned notice for position 226 from #0.0.0.0:0
I0726 18:13:57.053961 8224 leveldb.cpp:341] Persisting action (20 bytes) to leveldb took 3.035601ms
I0726 18:13:57.054203 8224 leveldb.cpp:399] Deleting ~2 keys from leveldb took 167530ns
I0726 18:13:57.054226 8224 replica.cpp:712] Persisted action at 226
I0726 18:13:57.054234 8224 replica.cpp:697] Replica learned TRUNCATE action at position 226
I0726 18:14:46.817351 8228 master.cpp:4520] Agent 28354e0c-fe56-4a82-a420-98489be4519a-S2 at slave(1)#202.106.199.37:5051 (bt-199-037.bta.net.cn) already registered, resending acknowledgement
E0726 18:14:50.530529 8231 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0726 18:15:00.045917 8231 process.cpp:2105] Failed to shutdown socket with fd 13: Transport endpoint is not connected
I0726 18:15:00.045985 8226 master.cpp:1245] Agent 28354e0c-fe56-4a82-a420-98489be4519a-S2 at slave(1)#202.106.199.37:5051 (bt-199-037.bta.net.cn) disconnected
I0726 18:15:00.046139 8226 master.cpp:2784] Disconnecting agent 28354e0c-fe56-4a82-a420-98489be4519a-S2 at slave(1)#202.106.199.37:5051 (bt-199-037.bta.net.cn)
I0726 18:15:00.046185 8226 master.cpp:2803] Deactivating agent 28354e0c-fe56-4a82-a420-98489be4519a-S2 at slave(1)#202.106.199.37:5051 (bt-199-037.bta.net.cn)
I0726 18:15:00.046233 8226 hierarchical.cpp:571] Agent 28354e0c-fe56-4a82-a420-98489be4519a-S2 deactivated
Does anybody know why the agent cannot register with the master?
I have seen this issue before. Add your local IP to /etc/mesos-master/ip or /etc/mesos-slave/ip.
When you see the following line in your mesos-master log file:
master.cpp:3216] Deactivating agent AGENT_ID at slave(1)#127.0.1.1:5051 (HOSTNAME)
it means that you didn't specify the mesos-agent IP address. Add the startup parameter --ip=AGENT_HOST_IP to your agent startup script or command.
You didn't tell the master which network interface to listen on. Most probably (that's what your log hints at) it listens at 202.106.199.37:5050.
Either explicitly tell your master to listen on 127.0.0.1 via the --ip flag, or tell your agent where your master is (you can get this information from its log).
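To check which addresses the processes are really using, the endpoint addresses can be pulled out of the master log and compared with the value passed to --master. This is a rough Python sketch that assumes log lines shaped like the excerpts in the question. The excerpts show "#" between the process name and the address while stock Mesos logs print "@", so the pattern accepts both. pid_addresses is a hypothetical helper, not part of Mesos.

```python
import re

# Matches the address part of a process ID such as "slave(1)#202.106.199.37:5051".
PID_RE = re.compile(
    r"(?P<name>[\w()-]*)[#@](?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+)"
)

def pid_addresses(log_text):
    """Return the set of (ip, port) pairs that appear as process addresses in a Mesos log."""
    return {(m.group("ip"), int(m.group("port"))) for m in PID_RE.finditer(log_text)}

# Two lines taken from the master log in the question.
log = """\
I0726 18:13:57.042994 8224 replica.cpp:537] Replica received write request for position 226 from (21)#202.106.199.37:5050
I0726 18:15:00.045985 8226 master.cpp:1245] Agent 28354e0c-fe56-4a82-a420-98489be4519a-S2 at slave(1)#202.106.199.37:5051 (bt-199-037.bta.net.cn) disconnected
"""

# The agent in the question was started with --master=127.0.0.1:5050.
configured_master = ("127.0.0.1", 5050)

seen = pid_addresses(log)
if configured_master not in seen:
    print(f"--master is {configured_master}, but the log only shows {sorted(seen)}")
```

Here the master's own messages come from 202.106.199.37:5050, not from the 127.0.0.1:5050 the agent was pointed at, which is consistent with the mismatch described above.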
A Mesos slave is unable to add itself to the cluster. Right now I have 3 machines, with 3 slaves running and 1 master.
But on the Mesos page I can see just one master and one slave (the one on the same host as the master). I can see Marathon running, the apps, etc.
The other slaves, however, are unable to connect to the master.
Slave logs:
I0825 21:30:00.971642 4110 slave.cpp:4193] Received oversubscribable resources from the resource estimator
I0825 21:30:01.000732 4106 group.cpp:313] Group process (group(1)#127.0.1.1:5051) connected to ZooKeeper
I0825 21:30:01.000821 4106 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0825 21:30:01.000874 4106 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I0825 21:30:01.007753 4106 detector.cpp:138] Detected a new leader: (id='9')
I0825 21:30:01.008038 4106 group.cpp:656] Trying to get '/mesos/info_0000000009' in ZooKeeper
W0825 21:30:01.020577 4106 detector.cpp:444] Leading master master#127.0.1.1:5050 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
I0825 21:30:01.021152 4106 detector.cpp:481] A new leading master (UPID=master#127.0.1.1:5050) is detected
I0825 21:30:01.021353 4106 status_update_manager.cpp:176] Pausing sending status updates
I0825 21:30:01.021385 4105 slave.cpp:684] New master detected at master#127.0.1.1:5050
I0825 21:30:01.022073 4105 slave.cpp:709] No credentials provided. Attempting to register without authentication
E0825 21:30:01.022299 4113 socket.hpp:107] Shutdown failed on fd=11: Transport endpoint is not connected [107]
ZooKeeper on master:
ls /mesos
[info_0000000009, info_0000000010, log_replicas]
ls /mesos/info_0000000009
[]
Please note this line in the slave logs:
Trying to get '/mesos/info_0000000009' in ZooKeeper
After that, the slave assumes the master is at 127.0.1.1:5050, which I never specified:
Leading master master#127.0.1.1:5050
but ZooKeeper returns
ls /mesos/info_0000000009
[]
I looked into the master's ZooKeeper and found that it was not set at all. Is it a bug in Mesos, or am I missing some configuration?
Also, the ZooKeeper logs on the master show that it closed the client connection (maybe the client had started to connect to some other master):
2015-08-25 21:30:01,882 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxn#349] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14f657dafeb000d, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
2015-08-25 21:30:01,884 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxn#1001] - Closed socket connection for client /192.168.0.3:53125 which had sessionid 0x14f657dafeb000d
2015-08-25 21:30:01,952 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.3:53166
Note: the slave on the same host as the master works perfectly fine.
I have been trying to resolve this for more than 2 days now. Please help.
This looks like a bug to me. Where can I see the current master in ZooKeeper? Is it something like /mesos/info_0000000009? When I listed it in ZooKeeper, I got:
ls /mesos/info_0000000009
[]
That is an empty array. Is this correct? The slave logs show it looking for exactly this node:
I0825 21:30:01.008038 4106 group.cpp:656] Trying to get '/mesos/info_0000000009' in ZooKeeper
W0825 21:30:01.020577 4106 detector.cpp:444] Leading master master#127.0.1.1:5050 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
I0825 21:30:01.021152 4106 detector.cpp:481] A new leading master (UPID=master#127.0.1.1:5050) is detected
and then the client tries 127.0.1.1:5050.
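Two things are worth checking here. First, in the ZooKeeper CLI, ls lists a znode's children, so an empty list only means the node has no children; get /mesos/info_0000000009 would show the data stored in it. Second, the master log further down says the master itself started on 127.0.1.1:5050, and the master ID it prints (20150827-071245-16842879-5050-1604) appears to carry the same address. The Python sketch below decodes it, assuming the ID has the form date-time-ip-port-pid with the IP stored as a little-endian 32-bit integer. That format is an assumption about this Mesos version, and decode_master_id_ip is a hypothetical helper.

```python
import ipaddress
import struct

def decode_master_id_ip(master_id):
    """Decode the third field of the master ID into a dotted-quad IPv4 string.

    Assumption: the field is the IPv4 address read as a little-endian
    unsigned 32-bit integer.
    """
    ip_field = int(master_id.split("-")[2])
    return str(ipaddress.IPv4Address(struct.pack("<I", ip_field)))

# Master ID taken from the master log in this question.
ip = decode_master_id_ip("20150827-071245-16842879-5050-1604")
print(ip, ipaddress.ip_address(ip).is_loopback)
```

If the master advertises a loopback address, a slave on another machine that follows it ends up connecting to itself, while a slave on the master's own host still reaches the master. That would match the symptom described here. A likely fix, not confirmed in this thread, is to start the master with --ip set to an address the slaves can reach (192.168.0.2 in this setup), or to make the master's hostname resolve to that address.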
Here are the complete slave logs:
Log file created at: 2015/08/27 07:12:56
Running on machine: vvwslave1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0827 07:12:56.406455 1303 logging.cpp:172] INFO level logging started!
I0827 07:12:56.438398 1303 main.cpp:162] Build: 2015-07-24 10:05:39 by root
I0827 07:12:56.438534 1303 main.cpp:164] Version: 0.23.0
I0827 07:12:56.438634 1303 main.cpp:167] Git tag: 0.23.0
I0827 07:12:56.438733 1303 main.cpp:171] Git SHA: 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
I0827 07:12:56.510270 1303 containerizer.cpp:111] Using isolation: posix/cpu,posix/mem
I0827 07:12:56.566021 1329 group.cpp:313] Group process (group(1)#127.0.1.1:5051) connected to ZooKeeper
I0827 07:12:56.566082 1329 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0827 07:12:56.566108 1329 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I0827 07:12:56.571959 1303 main.cpp:249] Starting Mesos slave
I0827 07:12:56.587656 1303 slave.cpp:190] Slave started on 1)#127.0.1.1:5051
I0827 07:12:56.587723 1303 slave.cpp:191] Flags at startup: --authenticatee="crammd5" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_remove_delay="6hrs" --docker_sandbox_directory="/mnt/mesos/sandbox" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://192.168.0.2:2281/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true" --strict="true" --switch_user="true" --version="false" --work_dir="/tmp/mesos"
I0827 07:12:56.592327 1303 slave.cpp:354] Slave resources: cpus(*):2; mem(*):979; disk(*):67653; ports(*):[31000-32000]
I0827 07:12:56.592576 1303 slave.cpp:384] Slave hostname: vvwslave1
I0827 07:12:56.592608 1303 slave.cpp:389] Slave checkpoint: true
I0827 07:12:56.633998 1330 state.cpp:36] Recovering state from '/tmp/mesos/meta'
I0827 07:12:56.644068 1330 status_update_manager.cpp:202] Recovering status update manager
I0827 07:12:56.644907 1330 containerizer.cpp:316] Recovering containerizer
I0827 07:12:56.650073 1330 slave.cpp:4026] Finished recovery
I0827 07:12:56.650527 1330 slave.cpp:4179] Querying resource estimator for oversubscribable resources
I0827 07:12:56.650653 1330 slave.cpp:4193] Received oversubscribable resources from the resource estimator
I0827 07:12:56.657416 1329 detector.cpp:138] Detected a new leader: (id='14')
I0827 07:12:56.657564 1329 group.cpp:656] Trying to get '/mesos/info_0000000014' in ZooKeeper
W0827 07:12:56.659080 1329 detector.cpp:444] Leading master master#127.0.1.1:5050 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
I0827 07:12:56.677889 1329 detector.cpp:481] A new leading master (UPID=master#127.0.1.1:5050) is detected
I0827 07:12:56.677989 1329 slave.cpp:684] New master detected at master#127.0.1.1:5050
I0827 07:12:56.678146 1326 status_update_manager.cpp:176] Pausing sending status updates
I0827 07:12:56.678195 1329 slave.cpp:709] No credentials provided. Attempting to register without authentication
I0827 07:12:56.678239 1329 slave.cpp:720] Detecting new master
I0827 07:12:56.678591 1329 slave.cpp:3087] master#127.0.1.1:5050 exited
W0827 07:12:56.678702 1329 slave.cpp:3090] Master disconnected! Waiting for a new master to be elected
E0827 07:12:56.678460 1332 socket.hpp:107] Shutdown failed on fd=11: Transport endpoint is not connected [107]
E0827 07:12:57.068922 1332 socket.hpp:107] Shutdown failed on fd=11: Transport endpoint is not connected [107]
E0827 07:12:58.829129 1332 socket.hpp:107] Shutdown failed on fd=11: Transport endpoint is not connected [107]
And the complete ZooKeeper logs from the master:
2015-08-27 07:12:42,672 - INFO [main:QuorumPeerConfig#101] - Reading configuration from: /etc/zookeeper/conf/zoo.cfg
2015-08-27 07:12:42,718 - ERROR [main:QuorumPeerConfig#283] - Invalid configuration, only one server specified (ignoring)
2015-08-27 07:12:42,720 - INFO [main:DatadirCleanupManager#78] - autopurge.snapRetainCount set to 10
2015-08-27 07:12:42,720 - INFO [main:DatadirCleanupManager#79] - autopurge.purgeInterval set to 0
2015-08-27 07:12:42,721 - INFO [main:DatadirCleanupManager#101] - Purge task is not scheduled.
2015-08-27 07:12:42,721 - WARN [main:QuorumPeerMain#113] - Either no config or no quorum defined in config, running in standalone mode
2015-08-27 07:12:42,741 - INFO [main:QuorumPeerConfig#101] - Reading configuration from: /etc/zookeeper/conf/zoo.cfg
2015-08-27 07:12:42,765 - ERROR [main:QuorumPeerConfig#283] - Invalid configuration, only one server specified (ignoring)
2015-08-27 07:12:42,765 - INFO [main:ZooKeeperServerMain#95] - Starting server
2015-08-27 07:12:42,776 - INFO [main:Environment#100] - Server environment:zookeeper.version=3.4.5--1, built on 06/10/2013 17:26 GMT
2015-08-27 07:12:42,776 - INFO [main:Environment#100] - Server environment:host.name=vvw
2015-08-27 07:12:42,776 - INFO [main:Environment#100] - Server environment:java.version=1.7.0_79
2015-08-27 07:12:42,776 - INFO [main:Environment#100] - Server environment:java.vendor=Oracle Corporation
2015-08-27 07:12:42,777 - INFO [main:Environment#100] - Server environment:java.home=/usr/lib/jvm/java-7-openjdk-amd64/jre
2015-08-27 07:12:42,777 - INFO [main:Environment#100] - Server environment:java.class.path=/etc/zookeeper/conf:/usr/share/java/jline.jar:/usr/share/java/log4j-1.2.jar:/usr/share/java/xercesImpl.jar:/usr/share/java/xmlParserAPIs.jar:/usr/share/java/netty.jar:/usr/share/java/slf4j-api.jar:/usr/share/java/slf4j-log4j12.jar:/usr/share/java/zookeeper.jar
2015-08-27 07:12:42,777 - INFO [main:Environment#100] - Server environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2015-08-27 07:12:42,779 - INFO [main:Environment#100] - Server environment:java.io.tmpdir=/tmp
2015-08-27 07:12:42,779 - INFO [main:Environment#100] - Server environment:java.compiler=<NA>
2015-08-27 07:12:42,779 - INFO [main:Environment#100] - Server environment:os.name=Linux
2015-08-27 07:12:42,779 - INFO [main:Environment#100] - Server environment:os.arch=amd64
2015-08-27 07:12:42,780 - INFO [main:Environment#100] - Server environment:os.version=3.19.0-25-generic
2015-08-27 07:12:42,780 - INFO [main:Environment#100] - Server environment:user.name=zookeeper
2015-08-27 07:12:42,780 - INFO [main:Environment#100] - Server environment:user.home=/var/lib/zookeeper
2015-08-27 07:12:42,780 - INFO [main:Environment#100] - Server environment:user.dir=/
2015-08-27 07:12:42,789 - INFO [main:ZooKeeperServer#726] - tickTime set to 2000
2015-08-27 07:12:42,789 - INFO [main:ZooKeeperServer#735] - minSessionTimeout set to -1
2015-08-27 07:12:42,789 - INFO [main:ZooKeeperServer#744] - maxSessionTimeout set to -1
2015-08-27 07:12:42,806 - INFO [main:NIOServerCnxnFactory#94] - binding to port 0.0.0.0/0.0.0.0:2281
2015-08-27 07:12:42,826 - INFO [main:FileSnap#83] - Reading snapshot /var/lib/zookeeper/version-2/snapshot.705
2015-08-27 07:12:42,859 - INFO [main:FileTxnSnapLog#240] - Snapshotting: 0x728 to /var/lib/zookeeper/version-2/snapshot.728
2015-08-27 07:12:44,848 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.2:44500
2015-08-27 07:12:44,857 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#793] - Connection request from old client /192.168.0.2:44500; will be dropped if server is in r-o mode
2015-08-27 07:12:44,859 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.2:44500
2015-08-27 07:12:44,862 - INFO [SyncThread:0:FileTxnLog#199] - Creating new log file: log.729
2015-08-27 07:12:45,299 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10000 with negotiated timeout 10000 for client /192.168.0.2:44500
2015-08-27 07:12:45,505 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.2:44501
2015-08-27 07:12:45,506 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#793] - Connection request from old client /192.168.0.2:44501; will be dropped if server is in r-o mode
2015-08-27 07:12:45,506 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.2:44501
2015-08-27 07:12:45,509 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.2:44502
2015-08-27 07:12:45,510 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#793] - Connection request from old client /192.168.0.2:44502; will be dropped if server is in r-o mode
2015-08-27 07:12:45,510 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.2:44502
2015-08-27 07:12:45,538 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.2:44503
2015-08-27 07:12:45,538 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.2:44504
2015-08-27 07:12:45,538 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#793] - Connection request from old client /192.168.0.2:44503; will be dropped if server is in r-o mode
2015-08-27 07:12:45,539 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.2:44503
2015-08-27 07:12:45,539 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#793] - Connection request from old client /192.168.0.2:44504; will be dropped if server is in r-o mode
2015-08-27 07:12:45,539 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.2:44504
2015-08-27 07:12:45,564 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10001 with negotiated timeout 10000 for client /192.168.0.2:44501
2015-08-27 07:12:45,674 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10002 with negotiated timeout 10000 for client /192.168.0.2:44502
2015-08-27 07:12:45,675 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10003 with negotiated timeout 10000 for client /192.168.0.2:44503
2015-08-27 07:12:45,676 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10004 with negotiated timeout 10000 for client /192.168.0.2:44504
2015-08-27 07:12:46,183 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.2:44506
2015-08-27 07:12:46,189 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.2:44506
2015-08-27 07:12:46,232 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10005 with negotiated timeout 10000 for client /192.168.0.2:44506
2015-08-27 07:12:48,195 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.2:44508
2015-08-27 07:12:48,196 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.2:44508
2015-08-27 07:12:48,212 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10006 with negotiated timeout 40000 for client /192.168.0.2:44508
2015-08-27 07:12:49,872 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.2:44509
2015-08-27 07:12:49,873 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#793] - Connection request from old client /192.168.0.2:44509; will be dropped if server is in r-o mode
2015-08-27 07:12:49,873 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.2:44509
2015-08-27 07:12:49,878 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10007 with negotiated timeout 10000 for client /192.168.0.2:44509
2015-08-27 07:12:56,161 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.0.3:60436
2015-08-27 07:12:56,161 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#793] - Connection request from old client /192.168.0.3:60436; will be dropped if server is in r-o mode
2015-08-27 07:12:56,161 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2281:ZooKeeperServer#839] - Client attempting to establish new session at /192.168.0.3:60436
2015-08-27 07:12:56,189 - INFO [SyncThread:0:ZooKeeperServer#595] - Established session 0x14f6cd241e10008 with negotiated timeout 10000 for client /192.168.0.3:60436
And the logs from the master node:
I0827 07:12:45.412888 1604 leveldb.cpp:176] Opened db in 567.381081ms
I0827 07:12:45.469497 1604 leveldb.cpp:183] Compacted db in 56.508537ms
I0827 07:12:45.469674 1604 leveldb.cpp:198] Created db iterator in 21452ns
I0827 07:12:45.502590 1604 leveldb.cpp:204] Seeked to beginning of db in 32.834339ms
I0827 07:12:45.502900 1604 leveldb.cpp:273] Iterated through 3 keys in the db in 101809ns
I0827 07:12:45.503026 1604 replica.cpp:744] Replica recovered with log positions 73 -> 74 with 0 holes and 0 unlearned
I0827 07:12:45.507745 1643 log.cpp:238] Attempting to join replica to ZooKeeper group
I0827 07:12:45.507983 1643 recover.cpp:449] Starting replica recovery
I0827 07:12:45.508095 1643 recover.cpp:475] Replica is in VOTING status
I0827 07:12:45.508167 1643 recover.cpp:464] Recover process terminated
I0827 07:12:45.536058 1604 main.cpp:383] Starting Mesos master
I0827 07:12:45.559154 1604 master.cpp:368] Master 20150827-071245-16842879-5050-1604 (vvwmaster) started on 127.0.1.1:5050
I0827 07:12:45.559239 1604 master.cpp:370] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --framework_sorter="drf" --help="false" --hostname="vvwmaster" --initialize_driver_logging="true" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="1" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://192.168.0.2:2281/mesos" --zk_session_timeout="10secs"
I0827 07:12:45.559460 1604 master.cpp:417] Master allowing unauthenticated frameworks to register
I0827 07:12:45.559491 1604 master.cpp:422] Master allowing unauthenticated slaves to register
I0827 07:12:45.559587 1604 master.cpp:459] Using default 'crammd5' authenticator
W0827 07:12:45.559619 1604 authenticator.cpp:504] No credentials provided, authentication requests will be refused.
I0827 07:12:45.559909 1604 authenticator.cpp:511] Initializing server SASL
I0827 07:12:45.564357 1642 group.cpp:313] Group process (group(1)#127.0.1.1:5050) connected to ZooKeeper
I0827 07:12:45.564539 1642 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0827 07:12:45.564590 1642 group.cpp:385] Trying to create path '/mesos/log_replicas' in ZooKeeper
I0827 07:12:45.675650 1644 group.cpp:313] Group process (group(2)#127.0.1.1:5050) connected to ZooKeeper
I0827 07:12:45.675717 1644 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
I0827 07:12:45.675750 1644 group.cpp:385] Trying to create path '/mesos/log_replicas' in ZooKeeper
I0827 07:12:45.676774 1639 group.cpp:313] Group process (group(3)#127.0.1.1:5050) connected to ZooKeeper
I0827 07:12:45.676828 1639 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0827 07:12:45.676857 1639 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I0827 07:12:45.678182 1640 group.cpp:313] Group process (group(4)#127.0.1.1:5050) connected to ZooKeeper
I0827 07:12:45.678235 1640 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0827 07:12:45.678380 1640 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I0827 07:12:45.809567 1645 network.hpp:415] ZooKeeper group memberships changed
I0827 07:12:45.816505 1644 group.cpp:656] Trying to get '/mesos/log_replicas/0000000013' in ZooKeeper
I0827 07:12:45.820705 1645 network.hpp:463] ZooKeeper group PIDs: { log-replica(1)#127.0.1.1:5050 }
I0827 07:12:46.020447 1644 contender.cpp:131] Joining the ZK group
I0827 07:12:46.020498 1639 master.cpp:1420] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I0827 07:12:46.078451 1643 contender.cpp:247] New candidate (id='14') has entered the contest for leadership
I0827 07:12:46.078984 1645 detector.cpp:138] Detected a new leader: (id='14')
I0827 07:12:46.079110 1645 group.cpp:656] Trying to get '/mesos/info_0000000014' in ZooKeeper
W0827 07:12:46.084359 1645 detector.cpp:444] Leading master master#127.0.1.1:5050 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
I0827 07:12:46.084485 1645 detector.cpp:481] A new leading master (UPID=master#127.0.1.1:5050) is detected
I0827 07:12:46.084553 1645 master.cpp:1481] The newly elected leader is master#127.0.1.1:5050 with id 20150827-071245-16842879-5050-1604
I0827 07:12:46.084653 1645 master.cpp:1494] Elected as the leading master!
I0827 07:12:46.084682 1645 master.cpp:1264] Recovering from registrar
I0827 07:12:46.084812 1645 registrar.cpp:313] Recovering registrar
I0827 07:12:46.085160 1645 log.cpp:661] Attempting to start the writer
I0827 07:12:46.085683 1639 replica.cpp:477] Replica received implicit promise request with proposal 18
I0827 07:12:46.231271 1639 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 145.505945ms
I0827 07:12:46.231402 1639 replica.cpp:345] Persisted promised to 18
I0827 07:12:46.231667 1640 coordinator.cpp:230] Coordinator attemping to fill missing position
I0827 07:12:46.231801 1640 log.cpp:677] Writer started with ending position 74
I0827 07:12:46.232197 1646 leveldb.cpp:438] Reading position from leveldb took 60443ns
I0827 07:12:46.232319 1646 leveldb.cpp:438] Reading position from leveldb took 21312ns
I0827 07:12:46.232934 1646 registrar.cpp:346] Successfully fetched the registry (247B) in 148.019968ms
I0827 07:12:46.233131 1646 registrar.cpp:445] Applied 1 operations in 17888ns; attempting to update the 'registry'
I0827 07:12:46.234346 1640 log.cpp:685] Attempting to append 286 bytes to the log
I0827 07:12:46.234463 1640 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 75
I0827 07:12:46.234748 1645 replica.cpp:511] Replica received write request for position 75
I0827 07:12:46.274888 1645 leveldb.cpp:343] Persisting action (305 bytes) to leveldb took 40.044935ms
I0827 07:12:46.275140 1645 replica.cpp:679] Persisted action at 75
I0827 07:12:46.275503 1646 replica.cpp:658] Replica received learned notice for position 75
I0827 07:12:46.307917 1646 leveldb.cpp:343] Persisting action (307 bytes) to leveldb took 32.320539ms
I0827 07:12:46.308076 1646 replica.cpp:679] Persisted action at 75
I0827 07:12:46.308112 1646 replica.cpp:664] Replica learned APPEND action at position 75
I0827 07:12:46.308668 1646 registrar.cpp:490] Successfully updated the 'registry' in 75.472128ms
I0827 07:12:46.308749 1646 registrar.cpp:376] Successfully recovered registrar
I0827 07:12:46.308888 1646 log.cpp:704] Attempting to truncate the log to 75
I0827 07:12:46.309002 1646 master.cpp:1291] Recovered 1 slaves from the Registry (247B) ; allowing 10mins for slaves to re-register
I0827 07:12:46.309056 1646 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 76
I0827 07:12:46.309252 1646 replica.cpp:511] Replica received write request for position 76
I0827 07:12:46.352067 1646 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 42.749912ms
I0827 07:12:46.352377 1646 replica.cpp:679] Persisted action at 76
I0827 07:12:46.352900 1646 replica.cpp:658] Replica received learned notice for position 76
I0827 07:12:46.407814 1646 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 54.686166ms
I0827 07:12:46.408033 1646 leveldb.cpp:401] Deleting ~2 keys from leveldb took 50800ns
I0827 07:12:46.408068 1646 replica.cpp:679] Persisted action at 76
I0827 07:12:46.408102 1646 replica.cpp:664] Replica learned TRUNCATE action at position 76
I0827 07:12:46.884490 1644 master.cpp:3332] Registering slave at slave(1)#127.0.1.1:5051 (vvw) with id 20150827-071245-16842879-5050-1604-S0
I0827 07:12:46.900085 1644 registrar.cpp:445] Applied 1 operations in 43323ns; attempting to update the 'registry'
I0827 07:12:46.901564 1639 log.cpp:685] Attempting to append 440 bytes to the log
I0827 07:12:46.901736 1639 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 77
I0827 07:12:46.902035 1639 replica.cpp:511] Replica received write request for position 77
I0827 07:12:46.947882 1639 leveldb.cpp:343] Persisting action (459 bytes) to leveldb took 45.777578ms
I0827 07:12:46.948067 1639 replica.cpp:679] Persisted action at 77
I0827 07:12:46.948422 1639 replica.cpp:658] Replica received learned notice for position 77
I0827 07:12:46.992007 1639 leveldb.cpp:343] Persisting action (461 bytes) to leveldb took 43.518061ms
I0827 07:12:46.992187 1639 replica.cpp:679] Persisted action at 77
I0827 07:12:46.992249 1639 replica.cpp:664] Replica learned APPEND action at position 77
I0827 07:12:46.992826 1640 registrar.cpp:490] Successfully updated the 'registry' in 92.466176ms
I0827 07:12:46.992949 1639 log.cpp:704] Attempting to truncate the log to 77
I0827 07:12:46.993027 1639 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 78
I0827 07:12:46.993371 1639 replica.cpp:511] Replica received write request for position 78
I0827 07:12:46.993588 1640 master.cpp:3395] Registered slave 20150827-071245-16842879-5050-1604-S0 at slave(1)#127.0.1.1:5051 (vvw) with cpus(*):4; mem(*):1846; disk(*):141854; ports(*):[31000-32000]
I0827 07:12:46.993785 1644 hierarchical.hpp:528] Added slave 20150827-071245-16842879-5050-1604-S0 (vvw) with cpus(*):4; mem(*):1846; disk(*):141854; ports(*):[31000-32000] (allocated: )
I0827 07:12:47.018685 1641 master.cpp:3687] Received update of slave 20150827-071245-16842879-5050-1604-S0 at slave(1)#127.0.1.1:5051 (vvw) with total oversubscribed resources
I0827 07:12:47.018934 1641 hierarchical.hpp:588] Slave 20150827-071245-16842879-5050-1604-S0 (vvw) updated with oversubscribed resources (total: cpus(*):4; mem(*):1846; disk(*):141854; ports(*):[31000-32000], allocated: )
I0827 07:12:47.036170 1639 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 42.72315ms
I0827 07:12:47.036388 1639 replica.cpp:679] Persisted action at 78
"But at the mesos page i can see just one master and one slave (same as the master's host present)."
Most probably this happens because the master cannot establish a connection to agents (aka slaves) running on other machines. Right now (this may change with the new HTTP API), the master must be able to open a connection to an agent, which means an agent must report a non-local IP when it registers with the master. From your logs it looks like the agents bind to a local IP (127.0.1.1). You can change that via the --ip flag.
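To check this quickly against a master log, here is a minimal Python sketch that scans "Registered slave" lines and reports agents that registered with a loopback address. The function name loopback_agents and the regex are my own, not part of Mesos. The pattern accepts either "@" or "#" before the address because the excerpt pasted in this thread shows "#".

```python
import ipaddress
import re

# Address an agent registered with, e.g. "slave(1)@127.0.1.1:5051".
# Assumption: the separator may be "@" or "#" (the pasted log shows "#").
AGENT_ADDR = re.compile(
    r"Registered slave \S+ at slave\(\d+\)[@#](\d{1,3}(?:\.\d{1,3}){3}):(\d+)"
)


def loopback_agents(log_lines):
    """Return (ip, port) pairs for agents that registered with a loopback IP."""
    found = []
    for line in log_lines:
        match = AGENT_ADDR.search(line)
        if not match:
            continue
        ip, port = match.group(1), int(match.group(2))
        # Everything in 127.0.0.0/8 counts as loopback, including 127.0.1.1.
        if ipaddress.ip_address(ip).is_loopback:
            found.append((ip, port))
    return found


sample = [
    "I0827 07:12:46.993588 1640 master.cpp:3395] Registered slave "
    "20150827-071245-16842879-5050-1604-S0 at slave(1)#127.0.1.1:5051 (vvw) "
    "with cpus(*):4; mem(*):1846; disk(*):141854; ports(*):[31000-32000]",
]
print(loopback_agents(sample))
```

Any agent this reports is advertising an address that other machines cannot reach, which fits the symptom of only the co-located slave showing up.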
I noticed that you are running Mesos as a service, so I think there must be a configuration file where you specify your master IP (or ZooKeeper IP). The default value in that file is probably 127.0.1.1, so only the slave on the same machine as the master can connect to it, because mesos-slave must be given the master IP when it starts.
I find WinDbg very useful during development and debugging,
but I mostly use WinDbg for user-mode debugging.
What can kernel debugging in WinDbg do?
Or, when should I use WinDbg's kernel debugging?
Is there a tutorial about kernel debugging in WinDbg?
Thanks in advance.
You usually use kernel debugging when you need to debug low-level device drivers that interact directly with the hardware.
Debugging in kernel mode is more complicated. Among other things, for a live kernel debug session you have to run the debugger on a different system than the one being debugged. For the majority of developers, user mode is enough to do most of the work.
Advanced Windows Debugging is a very good book about debugging with WinDbg (it includes discussions of kernel debugging).
The dump analysis site has many tutorials, including kernel debugging scenarios.
The main difference between user-mode and kernel-mode WinDbg is that in kernel mode you can see EVERY process and all threads. You won't necessarily get to see every stack frame, since they are frequently paged out by the memory manager.
Some commands I use frequently:
!process 0 0
lists every running process:
**** NT ACTIVE PROCESS DUMP ****
PROCESS 80a02a60 Cid: 0002 Peb: 00000000 ParentCid: 0000
DirBase: 00006e05 ObjectTable: 80a03788 TableSize: 150.
Image: System
PROCESS 80986f40 Cid: 0012 Peb: 7ffde000 ParentCid: 0002
DirBase: 000bd605 ObjectTable: 8098fce8 TableSize: 38.
Image: smss.exe
PROCESS 80958020 Cid: 001a Peb: 7ffde000 ParentCid: 0012
DirBase: 0008b205 ObjectTable: 809782a8 TableSize: 150.
Image: csrss.exe
PROCESS 80955040 Cid: 0020 Peb: 7ffde000 ParentCid: 0012
DirBase: 00112005 ObjectTable: 80955ce8 TableSize: 54.
Image: winlogon.exe
PROCESS 8094fce0 Cid: 0026 Peb: 7ffde000 ParentCid: 0020
DirBase: 00055005 ObjectTable: 80950cc8 TableSize: 222.
Image: services.exe
PROCESS 8094c020 Cid: 0029 Peb: 7ffde000 ParentCid: 0020
DirBase: 000c4605 ObjectTable: 80990fe8 TableSize: 110.
Image: lsass.exe
PROCESS 809258e0 Cid: 0044 Peb: 7ffde000 ParentCid: 0026
DirBase: 001e5405 ObjectTable: 80925c68 TableSize: 70.
Image: SPOOLSS.EXE
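If you save that listing to a text file, a small script can turn it into something searchable. The following is only a sketch in Python: parse_processes and children_of are hypothetical helper names, and the regexes assume the three-line layout shown above (PROCESS line, DirBase line, Image line).

```python
import re

# "PROCESS <address> Cid: <id> Peb: <addr> ParentCid: <id>"
PROCESS = re.compile(
    r"PROCESS\s+([0-9a-fA-F]+)\s+Cid:\s+([0-9a-fA-F]+)\s+"
    r"Peb:\s+[0-9a-fA-F]+\s+ParentCid:\s+([0-9a-fA-F]+)"
)
IMAGE = re.compile(r"Image:\s+(\S+)")


def parse_processes(text):
    """Collect address, Cid, ParentCid and image name for each process."""
    procs = []
    for line in text.splitlines():
        m = PROCESS.search(line)
        if m:
            procs.append({"address": m.group(1), "cid": m.group(2),
                          "parent": m.group(3), "image": None})
            continue
        m = IMAGE.search(line)
        if m and procs:
            # The Image line belongs to the most recent PROCESS line.
            procs[-1]["image"] = m.group(1)
    return procs


def children_of(procs, parent_cid):
    """Image names of all processes whose ParentCid equals parent_cid."""
    return [p["image"] for p in procs if p["parent"] == parent_cid]


sample = """\
PROCESS 80a02a60 Cid: 0002 Peb: 00000000 ParentCid: 0000
DirBase: 00006e05 ObjectTable: 80a03788 TableSize: 150.
Image: System
PROCESS 80986f40 Cid: 0012 Peb: 7ffde000 ParentCid: 0002
DirBase: 000bd605 ObjectTable: 8098fce8 TableSize: 38.
Image: smss.exe
PROCESS 80958020 Cid: 001a Peb: 7ffde000 ParentCid: 0012
DirBase: 0008b205 ObjectTable: 809782a8 TableSize: 150.
Image: csrss.exe
PROCESS 80955040 Cid: 0020 Peb: 7ffde000 ParentCid: 0012
DirBase: 00112005 ObjectTable: 80955ce8 TableSize: 54.
Image: winlogon.exe
"""
procs = parse_processes(sample)
print(children_of(procs, "0012"))  # processes started by smss.exe
```

The "address" field is the value printed right after PROCESS, which is the one you would hand to the debugger when selecting a process.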
.process {x}
Selects the process you want to make active, usually followed by the !threads command to list that process's current threads.
!stacks 0x2 {foo.sys}
searches ALL threads for call stacks that contain the specified driver.
!poolused
useful when debugging low kernel memory situations and all you have is a kernel crash dump
.crash
Useful for when you are debugging live via serial cable and you want to make the target machine write a crash dump
!vm 1
Useful display of the memory manager's statistics, for example:
*** Virtual Memory Usage ***
Physical Memory: 16270 ( 65080 Kb)
Page File: \??\E:\pagefile.sys
Current: 98304Kb Free Space: 61044Kb
Minimum: 98304Kb Maximum: 196608Kb
Available Pages: 5543 ( 22172 Kb)
ResAvail Pages: 6759 ( 27036 Kb)
Locked IO Pages: 112 ( 448 Kb)
Free System PTEs: 45089 ( 180356 Kb)
Free NP PTEs: 5145 ( 20580 Kb)
Free Special NP: 336 ( 1344 Kb)
Modified Pages: 714 ( 2856 Kb)
NonPagedPool Usage: 877 ( 3508 Kb)
NonPagedPool Max: 6252 ( 25008 Kb)
PagedPool 0 Usage: 729 ( 2916 Kb)
PagedPool 1 Usage: 432 ( 1728 Kb)
PagedPool 2 Usage: 436 ( 1744 Kb)
PagedPool Usage: 1597 ( 6388 Kb)
PagedPool Maximum: 13312 ( 53248 Kb)
Shared Commit: 1097 ( 4388 Kb)
Special Pool: 229 ( 916 Kb)
Shared Process: 1956 ( 7824 Kb)
PagedPool Commit: 1597 ( 6388 Kb)
Driver Commit: 828 ( 3312 Kb)
Committed pages: 21949 ( 87796 Kb)
Commit limit: 36256 ( 145024 Kb)
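To read those numbers at a glance, you can compute how close each pool is to its limit. This is a rough Python sketch, not a debugger feature: parse_vm and percent are names I made up, and the regex only handles the "label: pages ( N Kb)" lines from the output above.

```python
import re

# One statistics line: label, page count, then the size in Kb in parentheses.
VM_LINE = re.compile(r"^\s*(.+?):\s+(\d+)\s+\(\s*(\d+)\s*Kb\)\s*$")


def parse_vm(text):
    """Map each label to a (pages, kilobytes) tuple."""
    stats = {}
    for line in text.splitlines():
        m = VM_LINE.match(line)
        if m:
            stats[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return stats


def percent(stats, used_key, limit_key):
    """Usage as a percentage of the limit, based on page counts."""
    return 100.0 * stats[used_key][0] / stats[limit_key][0]


sample = """\
Physical Memory: 16270 ( 65080 Kb)
NonPagedPool Usage: 877 ( 3508 Kb)
NonPagedPool Max: 6252 ( 25008 Kb)
PagedPool Usage: 1597 ( 6388 Kb)
PagedPool Maximum: 13312 ( 53248 Kb)
Committed pages: 21949 ( 87796 Kb)
Commit limit: 36256 ( 145024 Kb)
"""
stats = parse_vm(sample)

pages, kilobytes = stats["Physical Memory"]
print("Kb per page:", kilobytes // pages)  # 65080 // 16270 -> 4
print("NonPagedPool %:", percent(stats, "NonPagedPool Usage", "NonPagedPool Max"))
print("PagedPool %:", percent(stats, "PagedPool Usage", "PagedPool Maximum"))
print("Commit %:", percent(stats, "Committed pages", "Commit limit"))
```

In this sample both pools are well below their maximums, while the commit charge is a bit over half of the limit.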
And don't forget the ALMIGHTY !locks,
absolutely essential for troubleshooting a deadlocked machine:
kd> !locks
**** DUMP OF ALL RESOURCE OBJECTS ****
KD: Scanning for held locks......
Resource # 0x80e97620 Shared 4 owning threads
Threads: ff688da0-01<*> ff687da0-01<*> ff686da0-01<*> ff685da0-01<*>
KD: Scanning for held locks.......................................................
Resource # 0x80e23f38 Shared 1 owning threads
Threads: 80ed0023-01<*> *** Actual Thread 80ed0020
KD: Scanning for held locks.
Resource # 0x80d8b0b0 Shared 1 owning threads
Threads: 80ed0023-01<*> *** Actual Thread 80ed0020
2263 total locks, 3 locks currently held
Using this command, you can track down threads that are stuck waiting for another thread to release an ERESOURCE.
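When the list of held locks is long, it helps to group it by owning thread so you can see which thread holds the most resources. Below is a small Python sketch of that idea. resources_by_thread is a hypothetical helper, and the patterns assume the layout pasted above, accepting "@" or "#" after "Resource" because the paste shows "#".

```python
import re
from collections import defaultdict

RESOURCE = re.compile(r"Resource\s+[@#]\s+(0x[0-9a-fA-F]+)")
# Owner entries look like "ff688da0-01<*>"; capture only the thread value.
THREAD = re.compile(r"\b([0-9a-fA-F]{8})-\d+<\*>")


def resources_by_thread(text):
    """Map each owning thread value to the resources it currently holds."""
    owners = defaultdict(list)
    current = None
    for line in text.splitlines():
        res = RESOURCE.search(line)
        if res:
            current = res.group(1)
            continue
        if current and line.lstrip().startswith("Threads:"):
            for thread in THREAD.findall(line):
                owners[thread].append(current)
    return dict(owners)


sample = """\
Resource # 0x80e97620 Shared 4 owning threads
Threads: ff688da0-01<*> ff687da0-01<*> ff686da0-01<*> ff685da0-01<*>
Resource # 0x80e23f38 Shared 1 owning threads
Threads: 80ed0023-01<*> *** Actual Thread 80ed0020
Resource # 0x80d8b0b0 Shared 1 owning threads
Threads: 80ed0023-01<*> *** Actual Thread 80ed0020
"""
owners = resources_by_thread(sample)
for thread, held in owners.items():
    if len(held) > 1:
        print(thread, "holds", held)
```

In the sample, one thread value owns two resources, so that thread would be the first one I would inspect. The sketch keys on the value as printed and ignores the "Actual Thread" note.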
You'll probably only want to debug in kernel mode when your code is running in kernel mode, i.e. when you're writing a driver or something else that runs in the kernel. Or possibly if you're trying to learn more about Windows itself at a very low level by exploring the kernel and poking and prodding at things.
When looking for tutorials and other reference material, you might look for "kd" references as well, since they are likely to be very similar. (kd is a command-line kernel debugging tool.)