Hadoop: Diagnose long-running job
I need help diagnosing why a particular job in the JobTracker is running so long, and finding ways to speed it up.
Here is an excerpt of the job in question (please pardon the formatting):
Hadoop job_201901281553_38848
User: mapred
Job-ACLs: All users are allowed
Job Setup: Successful
Status: Running
Started at: Fri Feb 01 12:39:05 CST 2019
Running for: 3hrs, 23mins, 58sec
Job Cleanup: Pending
Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
map     100.00%     1177       0        0        1177      0       0 / 0
reduce  95.20%      12         0        2        10        0       0 / 0
Counter Map Reduce Total
File System Counters
FILE: Number of bytes read 1,144,088,621 1,642,723,691 2,786,812,312
FILE: Number of bytes written 3,156,884,366 1,669,567,665 4,826,452,031
FILE: Number of read operations 0 0 0
FILE: Number of large read operations 0 0 0
FILE: Number of write operations 0 0 0
HDFS: Number of bytes read 11,418,749,621 0 11,418,749,621
HDFS: Number of bytes written 0 8,259,932,078 8,259,932,078
HDFS: Number of read operations 2,365 5 2,370
HDFS: Number of large read operations 0 0 0
HDFS: Number of write operations 0 12 12
Job Counters
Launched map tasks 0 0 1,177
Launched reduce tasks 0 0 12
Data-local map tasks 0 0 1,020
Rack-local map tasks 0 0 157
Total time spent by all maps in occupied slots (ms) 0 0 4,379,522
Total time spent by all reduces in occupied slots (ms) 0 0 81,115,664
Map-Reduce Framework
Map input records 77,266,616 0 77,266,616
Map output records 77,266,616 0 77,266,616
Map output bytes 11,442,228,060 0 11,442,228,060
Input split bytes 177,727 0 177,727
Combine input records 0 0 0
Combine output records 0 0 0
Reduce input groups 0 37,799,412 37,799,412
Reduce shuffle bytes 0 1,853,727,946 1,853,727,946
Reduce input records 0 76,428,913 76,428,913
Reduce output records 0 48,958,874 48,958,874
Spilled Records 112,586,947 62,608,254 175,195,201
CPU time spent (ms) 2,461,980 14,831,230 17,293,210
Physical memory (bytes) snapshot 366,933,626,880 9,982,947,328 376,916,574,208
Virtual memory (bytes) snapshot 2,219,448,848,384 23,215,755,264 2,242,664,603,648
Total committed heap usage (bytes) 1,211,341,733,888 8,609,333,248 1,219,951,067,136
AcsReducer
ColumnDeletesOnTable- 0 3,284,862 3,284,862
ColumnDeletesOnTable- 0 3,285,695 3,285,695
ColumnDeletesOnTable- 0 3,284,862 3,284,862
ColumnDeletesOnTable- 0 129,653 129,653
ColumnDeletesOnTable- 0 129,653 129,653
ColumnDeletesOnTable- 0 129,653 129,653
ColumnDeletesOnTable- 0 129,653 129,653
ColumnDeletesOnTable- 0 517,641 517,641
ColumnDeletesOnTable- 0 23,786 23,786
ColumnDeletesOnTable- 0 594,872 594,872
ColumnDeletesOnTable- 0 597,739 597,739
ColumnDeletesOnTable- 0 595,665 595,665
ColumnDeletesOnTable- 0 36,101,345 36,101,345
ColumnDeletesOnTable- 0 11,791 11,791
ColumnDeletesOnTable- 0 11,898 11,898
ColumnDeletesOnTable- 0 176 176
RowDeletesOnTable- 0 224,044 224,044
RowDeletesOnTable- 0 224,045 224,045
RowDeletesOnTable- 0 224,044 224,044
RowDeletesOnTable- 0 17,425 17,425
RowDeletesOnTable- 0 17,425 17,425
RowDeletesOnTable- 0 17,425 17,425
RowDeletesOnTable- 0 17,425 17,425
RowDeletesOnTable- 0 459,890 459,890
RowDeletesOnTable- 0 23,786 23,786
RowDeletesOnTable- 0 105,910 105,910
RowDeletesOnTable- 0 107,829 107,829
RowDeletesOnTable- 0 105,909 105,909
RowDeletesOnTable- 0 36,101,345 36,101,345
RowDeletesOnTable- 0 11,353 11,353
RowDeletesOnTable- 0 11,459 11,459
RowDeletesOnTable- 0 168 168
WholeRowDeletesOnTable- 0 129,930 129,930
deleteRowsCount 0 37,799,410 37,799,410
deleteRowsMicros 0 104,579,855,042 104,579,855,042
emitCount 0 48,958,874 48,958,874
emitMicros 0 201,996,180 201,996,180
rollupValuesCount 0 37,799,412 37,799,412
rollupValuesMicros 0 234,085,342 234,085,342
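As a rough cross-check on the custom AcsReducer counters above, here is a minimal sketch that divides each *Micros counter by its matching *Count counter. It assumes, purely from the counter names, that the *Micros values are cumulative microseconds summed over all reduce tasks; the job's source would be needed to confirm that.

```python
# Counter values copied from the job page above.
# Assumption: each *Micros counter is cumulative microseconds across all reducers.
counters = {
    "deleteRows": {"count": 37_799_410, "micros": 104_579_855_042},
    "emit": {"count": 48_958_874, "micros": 201_996_180},
    "rollupValues": {"count": 37_799_412, "micros": 234_085_342},
}


def summarize(counters):
    """Return {name: (avg_micros_per_call, share_of_total_timed_micros)}."""
    total = sum(c["micros"] for c in counters.values())
    return {
        name: (c["micros"] / c["count"], c["micros"] / total)
        for name, c in counters.items()
    }


summary = summarize(counters)
# Print the steps ordered by their share of the timed work, largest first.
for name, (avg, share) in sorted(summary.items(), key=lambda kv: -kv[1][1]):
    print(f"{name:>13}: {avg:10.1f} us/call, {share:6.2%} of timed work")
```

If that assumption holds, deleteRows averages roughly 2.8 ms per call and accounts for over 99% of the timed work, which would point at the delete path rather than at shuffle, sort, or emit.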
As you can see, it's been running for almost 3.5 hours now. There were 1177 map tasks and they completed some time ago. The reduce phase is incomplete at 95%.
So I drill into the 'reduce' link and it takes me to the tasklist. If I drill into the first incomplete task, here it is:
Job job_201901281553_38848
All Task Attempts
Task Attempts Machine Status Progress Start Time Shuffle Finished Sort Finished Finish Time Errors Task Logs Counters Actions
attempt_201901281553_38848_r_000000_0 RUNNING 70.81% 2/1/2019 12:39 1-Feb-2019 12:39:59 (18sec) 1-Feb-2019 12:40:01 (2sec) Last 4KB, Last 8KB, All 60
From there I can see the machine/datanode running the task, so I ssh into it and look at the log (filtering on just the task in question)
from datanode $/var/log/hadoop-0.20-mapreduce/hadoop-mapred-tasktracker-.log
2019-02-01 12:39:40,836 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201901281553_38848_r_000000_0 task's state:UNASSIGNED
2019-02-01 12:39:40,838 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201901281553_38848_r_000000_0 which needs 1 slots
2019-02-01 12:39:40,838 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 21 and trying to launch attempt_201901281553_38848_r_000000_0 which needs 1 slots
2019-02-01 12:39:40,925 INFO org.apache.hadoop.mapred.TaskController: Writing commands to /disk12/mapreduce/tmp-map-data/ttprivate/taskTracker/mapred/jobcache/job_201901281553_38848/attempt_201901281553_38848_r_000000_0/taskjvm.sh
2019-02-01 12:39:41,904 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201901281553_38848_r_-819481850 given task: attempt_201901281553_38848_r_000000_0
2019-02-01 12:39:49,011 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.09402435% reduce > copy (332 of 1177 at 23.66 MB/s) >
2019-02-01 12:39:56,250 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.25233644% reduce > copy (891 of 1177 at 12.31 MB/s) >
2019-02-01 12:39:59,206 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.25233644% reduce > copy (891 of 1177 at 12.31 MB/s) >
2019-02-01 12:39:59,350 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.33333334% reduce > sort
2019-02-01 12:40:01,599 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.33333334% reduce > sort
2019-02-01 12:40:02,469 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6667039% reduce > reduce
2019-02-01 12:40:05,565 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6667039% reduce > reduce
2019-02-01 12:40:11,666 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6668788% reduce > reduce
2019-02-01 12:40:14,755 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.66691136% reduce > reduce
2019-02-01 12:40:17,838 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6670001% reduce > reduce
2019-02-01 12:40:20,930 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6671631% reduce > reduce
2019-02-01 12:40:24,016 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6672566% reduce > reduce
.. and these lines repeat in this manner for hours..
So it appears the shuffle and sort phases went very quickly, but after that the reduce phase just crawls: the percentage slowly increases, and it takes hours before the task completes.
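To put a number on "crawling", here is a small sketch that extrapolates from two of the "reduce > reduce" progress lines in the log above. It assumes the logged value is a fraction of 1 (the log shows 0.33 at the start of sort and 0.67 at the start of reduce, despite the % sign) and that progress is linear, which may not be true for this job.

```python
import re
from datetime import datetime

# Two "reduce > reduce" progress lines copied from the TaskTracker log above.
LOG = """\
2019-02-01 12:40:02,469 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6667039% reduce > reduce
2019-02-01 12:40:24,016 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6672566% reduce > reduce
"""

LINE = re.compile(r"^(\S+ \S+) INFO \S+ (attempt_\S+) ([\d.]+)% reduce > reduce")


def parse(log):
    """Yield (timestamp, progress fraction) for each reduce-phase progress line."""
    for line in log.splitlines():
        m = LINE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
            yield ts, float(m.group(3))


def eta_hours(samples):
    """Linear extrapolation from the first and last sample to progress 1.0."""
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    rate = (p1 - p0) / (t1 - t0).total_seconds()  # progress per second
    return (1.0 - p1) / rate / 3600


samples = list(parse(LOG))
print(f"estimated time remaining: {eta_hours(samples):.1f} h")
```

From these two samples the estimate comes out at a few hours for a single reduce task. Treat it as a rough indication only, since the rate may well change over the life of the task.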
1) That looks like the bottleneck here. Am I correct that my job is long-running because this task (and many tasks like it) spends a very long time in the reduce phase?
2) If so, what are my options for speeding it up?
Load on the datanode assigned that task appears reasonably low, as does its iowait:
top - 15:20:03 up 124 days, 1:04, 1 user, load average: 3.85, 5.64, 5.96
Tasks: 1095 total, 2 running, 1092 sleeping, 0 stopped, 1 zombie
Cpu(s): 3.8%us, 1.5%sy, 0.9%ni, 93.6%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 503.498G total, 495.180G used, 8517.543M free, 5397.789M buffers
Swap: 2046.996M total, 0.000k used, 2046.996M free, 432.468G cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
82236 hbase 20 0 16.9g 16g 17m S 136.9 3.3 26049:16 java
30143 root 39 19 743m 621m 13m R 82.3 0.1 1782:06 clamscan
62024 mapred 20 0 2240m 1.0g 24m S 75.1 0.2 1:21.28 java
36367 mapred 20 0 1913m 848m 24m S 11.2 0.2 22:56.98 java
36567 mapred 20 0 1898m 825m 24m S 9.5 0.2 22:23.32 java
36333 mapred 20 0 1879m 880m 24m S 8.2 0.2 22:44.28 java
36374 mapred 20 0 1890m 831m 24m S 6.9 0.2 23:15.65 java
and a snippet of iostat -xm 4:
avg-cpu: %user %nice %system %iowait %steal %idle
2.15 0.92 0.30 0.17 0.00 96.46
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 350.25 0.00 30.00 0.00 1.49 101.67 0.02 0.71 0.00 0.71 0.04 0.12
sdb 0.00 2.75 0.00 6.00 0.00 0.03 11.67 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 9.75 0.00 1.25 0.00 0.04 70.40 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 6.50 0.00 0.75 0.00 0.03 77.33 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 5.75 0.00 0.50 0.00 0.02 100.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 8.00 0.00 0.75 0.00 0.03 93.33 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 6.25 0.00 0.50 0.00 0.03 108.00 0.00 0.00 0.00 0.00 0.00 0.00
sdi 0.00 3.75 93.25 0.50 9.03 0.02 197.57 0.32 3.18 3.20 0.00 1.95 18.30
sdj 0.00 3.50 0.00 0.50 0.00 0.02 64.00 0.00 0.00 0.00 0.00 0.00 0.00
sdk 0.00 7.00 0.00 0.75 0.00 0.03 82.67 0.00 0.33 0.00 0.33 0.33 0.03
sdl 0.00 6.75 0.00 0.75 0.00 0.03 80.00 0.00 0.00 0.00 0.00 0.00 0.00
sdm 0.00 7.75 0.00 5.75 0.00 0.05 18.78 0.00 0.04 0.00 0.04 0.04 0.03
#<machine>:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 40G 5.9G 32G 16% /
tmpfs 252G 0 252G 0% /dev/shm
/dev/sda1 488M 113M 350M 25% /boot
/dev/sda8 57G 460M 54G 1% /tmp
/dev/sda7 9.8G 1.1G 8.2G 12% /var
/dev/sda5 40G 17G 21G 45% /var/log
/dev/sda6 30G 4.4G 24G 16% /var/log/audit.d
/dev/sdb1 7.2T 3.3T 3.6T 48% /disk1
/dev/sdc1 7.2T 3.3T 3.6T 49% /disk2
/dev/sdd1 7.2T 3.3T 3.6T 48% /disk3
/dev/sde1 7.2T 3.3T 3.6T 48% /disk4
/dev/sdf1 7.2T 3.3T 3.6T 48% /disk5
/dev/sdi1 7.2T 3.3T 3.6T 48% /disk6
/dev/sdg1 7.2T 3.3T 3.6T 48% /disk7
/dev/sdh1 7.2T 3.3T 3.6T 48% /disk8
/dev/sdj1 7.2T 3.3T 3.6T 48% /disk9
/dev/sdk1 7.2T 3.3T 3.6T 48% /disk10
/dev/sdm1 7.2T 3.3T 3.6T 48% /disk11
/dev/sdl1 7.2T 3.3T 3.6T 48% /disk12
This is Hadoop 2.0.0-cdh4.3.0. It's highly available, with 3 ZooKeeper nodes, 2 namenodes, and 35 datanodes. YARN is not installed. We use HBase and Oozie. Jobs mainly come in via Hive and HUE.
Each datanode has 2 physical CPUs, each with 22 cores. Hyperthreading is enabled.
If you need more information, please let me know. My guess is that maybe I need more reducers, that some mapred-site.xml settings need tuning, that the input data from the map phase is too large, or that the Hive query needs to be written better. I'm a fairly new Hadoop administrator, so any detailed advice would be great.
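On the "more reducers" guess: if this job was submitted through Hive (an assumption; the custom AcsReducer counters suggest it may be a hand-written MapReduce job instead), the reducer count is estimated from input size. The sketch below reproduces that size-based rule. The default values used for hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max are what I believe older Hive releases ship with, so check them on the actual cluster.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers):
    """Size-based estimate: one reducer per bytes_per_reducer of input, capped."""
    return max(1, min(max_reducers, math.ceil(input_bytes / bytes_per_reducer)))


HDFS_BYTES_READ = 11_418_749_621  # "HDFS: Number of bytes read" from the counters

# Assumed defaults for older Hive releases; verify on your cluster.
default_estimate = estimate_reducers(HDFS_BYTES_READ, 1_000_000_000, 999)
# A smaller per-reducer size yields more reducers for the same input.
smaller_estimate = estimate_reducers(HDFS_BYTES_READ, 256_000_000, 999)

print(default_estimate, smaller_estimate)
```

The first estimate is 12, which happens to match the 12 reduce tasks this job launched, although that may be a coincidence. For a plain MapReduce job the count would come from mapred.reduce.tasks or from the job code instead.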
Thanks!
Related
near indexer does not add anything to the database
I've tried to run https://github.com/near/near-indexer-for-explorer
No firewall, and the IP is accessible (tested right now).
With empty data, it waits for peers forever.
With data from the run started some days ago
./target/release/indexer-explorer --home-dir ../.near/mainnet run --store-genesis --stream-while-syncing --allow-missing-relations-in-first-blocks 1000 sync-from-latest
It does something
Nov 01 18:42:23.293 INFO indexer_for_explorer: AccessKeys from genesis were added/updated successful.
Nov 01 18:42:33.188 INFO stats: # 9820210 Waiting for peers 1/1/40 peers ⬇ 0 B/s ⬆ 0 B/s 0.00 bps 0 gas/s CPU: 0%, Mem: 0 B
Nov 01 18:42:43.190 INFO stats: # 9820210 Downloading headers 68.72% (13074549) 3/3/40 peers ⬇ 149.3kiB/s ⬆ 6.0kiB/s 0.00 bps 0 gas/s CPU: 23%, Mem: 510.7 MiB
Nov 01 18:42:53.192 INFO stats: # 9820210 Downloading headers 68.72% (13074559) 2/2/40 peers ⬇ 299.4kiB/s ⬆ 297.5kiB/s 0.00 bps 0 gas/s CPU: 40%, Mem: 621.3 MiB
Nov 01 18:43:03.194 INFO stats: # 9820210 Downloading headers 68.72% (13074569) 1/1/40 peers ⬇ 150.1kiB/s ⬆ 148.9kiB/s 0.00 bps 0 gas/s CPU: 42%, Mem: 520.7 MiB
Nov 01 18:43:13.196 INFO stats: # 9820210 Downloading headers 68.72% (13074578) 2/2/40 peers ⬇ 150.3kiB/s ⬆ 148.8kiB/s 0.00 bps 0 gas/s CPU: 10%, Mem: 631.6 MiB
Nov 01 18:43:23.198 INFO stats: # 9820210 Downloading headers 68.72% (13074590) 2/1/40 peers ⬇ 294.1kiB/s ⬆ 297.6kiB/s 0.00 bps 0 gas/s CPU: 14%, Mem: 601.5 MiB
Nov 01 18:43:33.200 INFO stats: # 9820210 Downloading headers 68.72% (13074598) 1/1/40 peers ⬇ 149.4kiB/s ⬆ 148.8kiB/s 0.00 bps 0 gas/s CPU: 2%, Mem: 602.9 MiB
Nov 01 18:43:43.203 INFO stats: # 9820210 EPnLgE7iEq9s7yTkos96M3cWymH5avBAPm3qx3NXqR8H -/4 2/2/40 peers ⬇ 150.0kiB/s ⬆ 148.8kiB/s 0.00 bps 0 gas/s CPU: 9%, Mem: 657.0 MiB
Nov 01 18:43:53.209 INFO stats: # 9820210 Downloading headers 68.72% (13074608) 1/1/40 peers ⬇ 150.5kiB/s ⬆ 148.8kiB/s 0.00 bps 0 gas/s CPU: 3%, Mem: 661.0 MiB
Nov 01 18:44:03.212 INFO stats: # 9820210 EPnLgE7iEq9s7yTkos96M3cWymH5avBAPm3qx3NXqR8H -/4 1/1/40 peers ⬇ 148.6kiB/s ⬆ 148.8kiB/s 0.00 bps 0 gas/s CPU: 4%, Mem: 664.8 MiB
Nov 01 18:44:13.213 INFO stats: # 9820210 EPnLgE7iEq9s7yTkos96M3cWymH5avBAPm3qx3NXqR8H -/4 0/0/40 peers ⬇ 0 B/s ⬆ 0 B/s 0.00 bps 0 gas/s CPU: 2%, Mem: 664.8 MiB
Nov 01 18:44:23.215 INFO stats: # 9820210 EPnLgE7iEq9s7yTkos96M3cWymH5avBAPm3qx3NXqR8H -/4 0/0/40 peers ⬇ 0 B/s ⬆ 0 B/s 0.00 bps 0 gas/s CPU: 1%, Mem: 666.8 MiB
Nov 01 18:44:33.217 INFO stats: # 9820210 Downloading headers 68.72% (13074655) 1/1/40 peers ⬇ 150.0kiB/s ⬆ 148.8kiB/s 0.00 bps 0 gas/s CPU: 11%, Mem: 614.7 MiB
Nov 01 18:44:43.219 INFO stats: # 9820210 EPnLgE7iEq9s7yTkos96M3cWymH5avBAPm3qx3NXqR8H -/4 0/0/40 peers ⬇ 0 B/s ⬆ 0 B/s 0.00 bps 0 gas/s CPU: 1%, Mem: 614.9 MiB
Nov 01 18:44:53.224 INFO stats: # 9820210 EPnLgE7iEq9s7yTkos96M3cWymH5avBAPm3qx3NXqR8H -/4 0/0/40 peers ⬇ 0 B/s ⬆ 0 B/s 0.00 bps 0 gas/s CPU: 1%, Mem: 614.9 MiB
Nov 01 18:45:03.227 INFO stats: # 9820210 EPnLgE7iEq9s7yTkos96M3cWymH5avBAPm3qx3NXqR8H -/4 0/0/40 peers ⬇ 0 B/s ⬆ 0 B/s 0.00 bps 0 gas/s CPU: 1%, Mem: 616.4 MiB
Nov 01 18:45:13.232 INFO stats: # 9820210 EPnLgE7iEq9s7yTkos96M3cWymH5avBAPm3qx3NXqR8H -/4 0/0/40 peers ⬇ 0 B/s ⬆ 0 B/s 0.00 bps 0 gas/s CPU: 1%, Mem: 616.4 MiB
But nothing gets added to the database. What am I doing wrong?
Your concern about the indexing part (no data in the database) will be resolved once the node reaches the “syncing blocks” stage. Currently, your node is still only at the “syncing block headers” stage. To speed up this process, start from a backup: https://docs.near.org/docs/develop/node/validator/running-a-node#starting-a-node-from-backup
As to the fact that the node dropped off the p2p network, I have no clue why that could have happened. I recommend you start by running a simple neard node and report any issues there before you get to the Indexer (which is just an extension to nearcore, so you can use the same home and data folder).
First of all, the indexer requires full archive mode. Links to 5-epoch backups are misleading; they are not usable for the indexer.
Second (this may save a lot of download time), the indexer requires the AVX extension to run. If your CPU does not support AVX, don't bother building nearcore. That should be mentioned in the docs. nearcore depends on some wasm, and wasm requires AVX to run. Without AVX, the indexer will run for some time and then fail miserably.
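Given the AVX requirement described above, one way to check a Linux machine before building is to look for the avx flag in /proc/cpuinfo. This is only a sketch; the sample excerpts are made up for illustration, and other operating systems expose CPU features differently.

```python
def has_cpu_flag(cpuinfo_text, flag="avx"):
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists the flag."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Compare whole tokens so that "avx2" alone does not count as "avx".
        if key.strip() == "flags" and flag in value.split():
            return True
    return False


# Hypothetical excerpts, for illustration only.
SAMPLE_WITH_AVX = "processor\t: 0\nflags\t\t: fpu sse sse2 avx avx2\n"
SAMPLE_WITHOUT_AVX = "processor\t: 0\nflags\t\t: fpu sse sse2\n"

print(has_cpu_flag(SAMPLE_WITH_AVX), has_cpu_flag(SAMPLE_WITHOUT_AVX))

# On a real Linux machine:
# with open("/proc/cpuinfo") as f:
#     print(has_cpu_flag(f.read()))
```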
Hadoop cluster stops after mapping
I have a Hadoop cluster consisting of three machines. I put a 20 GB file into Hadoop and start the job, and it stops after the map phase:
"13/08/22 08:09:34 INFO mapred.JobClient: map 100% reduce 11%"
After mapping, none of the CPUs are doing any work. I can wait a whole day, but it never resumes. What can I do?
These are the last 10 lines of my log file, when map is at 100% and reduce is at 11%:
2013-08-22 14:15:32,503 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2013-08-22 14:15:32,542 INFO org.apache.hadoop.mapred.MapTask: Finished spill 67
2013-08-22 14:15:32,552 INFO org.apache.hadoop.mapred.Merger: Merging 68 sorted segments
2013-08-22 14:15:32,558 INFO org.apache.hadoop.mapred.Merger: Merging 5 intermediate segments out of a total of 68
2013-08-22 14:15:32,622 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 64 segments left of total size: 1600710 bytes
2013-08-22 14:15:32,708 INFO org.apache.hadoop.mapred.Task: Task:attempt_201308221308_0002_m_000302_0 is done. And is in the process of commiting
2013-08-22 14:15:32,717 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201308221308_0002_m_000302_0' done.
2013-08-22 14:15:32,759 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-08-22 14:15:32,774 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2013-08-22 14:15:32,774 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName llobocki for UID 1000 from the native implementation My Child of master hadoop thread dump, when map is 100% and reduce is 11%: 2013-08-23 11:37:26 Full thread dump Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode): "Attach Listener" daemon prio=10 tid=0x0000000000f85800 nid=0x3873 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Thread for polling Map Completion Events" daemon prio=10 tid=0x00007fc32860c800 nid=0x1d7a waiting on condition [0x00007fc31c183000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2882) "Thread for merging in memory files" daemon prio=10 tid=0x00007fc32860a800 nid=0x1d78 in Object.wait() [0x00007fc31c284000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000005bd6dd7c8> (a java.lang.Object) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$ShuffleRamManager.waitForDataToMerge(ReduceTask.java:1197) - locked <0x00000005bd6dd7c8> (a java.lang.Object) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2760) "Thread for merging on-disk files" daemon prio=10 tid=0x00007fc328608000 nid=0x1d77 in Object.wait() [0x00007fc31c385000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000005bd713988> (a java.util.TreeSet) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2654) - locked <0x00000005bd713988> (a java.util.TreeSet) "MapOutputCopier attempt_201308230927_0001_r_000000_0.4" prio=10 tid=0x00007fc328606800 nid=0x1d76 in Object.wait() [0x00007fc31c486000] java.lang.Thread.State: WAITING (on object monitor) at 
java.lang.Object.wait(Native Method) - waiting on <0x00000005bd762eb0> (a java.util.ArrayList) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1324) - locked <0x00000005bd762eb0> (a java.util.ArrayList) "MapOutputCopier attempt_201308230927_0001_r_000000_0.3" prio=10 tid=0x00007fc328602000 nid=0x1d75 in Object.wait() [0x00007fc31c587000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000005bd762eb0> (a java.util.ArrayList) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1324) - locked <0x00000005bd762eb0> (a java.util.ArrayList) "MapOutputCopier attempt_201308230927_0001_r_000000_0.2" prio=10 tid=0x00007fc328600000 nid=0x1d73 in Object.wait() [0x00007fc31c688000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000005bd762eb0> (a java.util.ArrayList) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1324) - locked <0x00000005bd762eb0> (a java.util.ArrayList) "MapOutputCopier attempt_201308230927_0001_r_000000_0.1" prio=10 tid=0x00007fc3285ff000 nid=0x1d72 in Object.wait() [0x00007fc31c789000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000005bd762eb0> (a java.util.ArrayList) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1324) - locked <0x00000005bd762eb0> (a java.util.ArrayList) "MapOutputCopier attempt_201308230927_0001_r_000000_0.0" prio=10 tid=0x00007fc3285f8800 nid=0x1d70 in Object.wait() [0x00007fc31c88a000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000005bd762eb0> (a 
java.util.ArrayList) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1324) - locked <0x00000005bd762eb0> (a java.util.ArrayList) "communication thread" daemon prio=10 tid=0x00007fc3285d2000 nid=0x1d53 in Object.wait() [0x00007fc31c9b3000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000005bd762e90> (a java.lang.Object) at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:658) - locked <0x00000005bd762e90> (a java.lang.Object) at java.lang.Thread.run(Thread.java:724) "Timer for 'ReduceTask' metrics system" daemon prio=10 tid=0x00007fc3285b1000 nid=0x1d49 in Object.wait() [0x00007fc31cbb5000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000005bd919a30> (a java.util.TaskQueue) at java.util.TimerThread.mainLoop(Timer.java:552) - locked <0x00000005bd919a30> (a java.util.TaskQueue) at java.util.TimerThread.run(Timer.java:505) "Thread for syncLogs" daemon prio=10 tid=0x00007fc328494000 nid=0x1d3e waiting on condition [0x00007fc31cebd000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.mapred.Child$3.run(Child.java:139) "IPC Client (47) connection to /127.0.0.1:35127 from job_201308230927_0001" daemon prio=10 tid=0x00007fc328492800 nid=0x1d3d in Object.wait() [0x00007fc31cfbe000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000005bd721b60> (a org.apache.hadoop.ipc.Client$Connection) at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:747) - locked <0x00000005bd721b60> (a org.apache.hadoop.ipc.Client$Connection) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:789) "Service Thread" daemon prio=10 tid=0x00007fc3280f4000 nid=0x1cf7 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE 
"C2 CompilerThread1" daemon prio=10 tid=0x00007fc3280f1800 nid=0x1cf5 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "C2 CompilerThread0" daemon prio=10 tid=0x00007fc3280ee800 nid=0x1cf4 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Signal Dispatcher" daemon prio=10 tid=0x00007fc3280ec800 nid=0x1cf3 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Finalizer" daemon prio=10 tid=0x00007fc32809e000 nid=0x1ce5 in Object.wait() [0x00007fc2c1b7f000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000005bd6fb1f8> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135) - locked <0x00000005bd6fb1f8> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:189) "Reference Handler" daemon prio=10 tid=0x00007fc32809c000 nid=0x1ce4 in Object.wait() [0x00007fc2c1c80000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000005bd6fade8> (a java.lang.ref.Reference$Lock) at java.lang.Object.wait(Object.java:503) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133) - locked <0x00000005bd6fade8> (a java.lang.ref.Reference$Lock) "main" prio=10 tid=0x00007fc32800b000 nid=0x1cc8 waiting on condition [0x00007fc32dc3a000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.fetchOutputs(ReduceTask.java:2191) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:386) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190) at 
org.apache.hadoop.mapred.Child.main(Child.java:249) "VM Thread" prio=10 tid=0x00007fc328094800 nid=0x1cdf runnable "GC task thread#0 (ParallelGC)" prio=10 tid=0x00007fc328018800 nid=0x1ccc runnable "GC task thread#1 (ParallelGC)" prio=10 tid=0x00007fc32801a800 nid=0x1cce runnable "GC task thread#2 (ParallelGC)" prio=10 tid=0x00007fc32801c800 nid=0x1cd7 runnable "GC task thread#3 (ParallelGC)" prio=10 tid=0x00007fc32801e000 nid=0x1cd8 runnable "VM Periodic Task Thread" prio=10 tid=0x00007fc3280fe800 nid=0x1cf8 waiting on condition JNI global references: 224 During mapping the net traffic is ~20 MiB on master, but when reduce starts, net traffic goes down to 3 KiB. iostat of my machines. Master during map: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb 0.00 0.00 0.00 7.00 0.00 0.02 6.29 0.48 68.43 68.29 47.80 sda 0.00 0.00 43.00 7.00 5.38 0.02 221.04 0.22 4.42 2.78 13.90 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md3 0.00 0.00 43.00 3.00 5.38 0.01 239.83 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Filesystem: rMB_nor/s wMB_nor/s rMB_dir/s wMB_dir/s rMB_svr/s wMB_svr/s ops/s rops/s wops/s Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb 0.00 14.00 0.00 53.00 0.00 1.34 51.66 1.58 29.77 5.38 28.50 sda 3.00 14.00 34.00 53.00 4.62 1.34 140.34 1.27 14.55 3.84 33.40 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md3 0.00 0.00 37.00 62.00 4.62 1.32 122.99 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Slave during map: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s 
avgrq-sz avgqu-sz await svctm %util sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 2.00 0.00 12.00 4.00 1.75 0.01 225.25 0.76 47.50 25.19 40.30 sdb 0.00 0.00 0.00 6.00 0.00 0.02 6.00 0.09 20.00 14.67 8.80 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md3 0.00 0.00 14.00 2.00 1.75 0.01 225.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Filesystem: rMB_nor/s wMB_nor/s rMB_dir/s wMB_dir/s rMB_svr/s wMB_svr/s ops/s rops/s wops/s Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 0.00 0.00 28.00 4.00 3.50 0.01 224.81 0.39 12.28 7.16 22.90 sdb 0.00 0.00 5.00 3.00 0.42 0.01 110.25 0.25 31.50 22.12 17.70 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md3 0.00 0.00 33.00 0.00 3.92 0.00 243.39 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Master stopped: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb 0.00 0.00 8.00 1.00 1.00 0.00 228.44 0.03 3.44 3.00 2.70 sda 0.00 0.00 0.00 1.00 0.00 0.00 8.00 0.01 13.00 13.00 1.30 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md3 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Filesystem: rMB_nor/s wMB_nor/s rMB_dir/s wMB_dir/s rMB_svr/s wMB_svr/s ops/s rops/s wops/s Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb 0.00 0.00 8.00 0.00 
1.00 0.00 256.00 0.01 0.62 0.50 0.40 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md3 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Filesystem: rMB_nor/s wMB_nor/s rMB_dir/s wMB_dir/s rMB_svr/s wMB_svr/s ops/s rops/s wops/s Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.02 2.38 2.38 1.90 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md3 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Filesystem: rMB_nor/s wMB_nor/s rMB_dir/s wMB_dir/s rMB_svr/s wMB_svr/s ops/s rops/s wops/s Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.01 0.75 0.50 0.40 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md3 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Slave stopped: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.01 1.38 1.12 0.90 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 
0.00 0.00 0.00 0.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md3 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Filesystem: rMB_nor/s wMB_nor/s rMB_dir/s wMB_dir/s rMB_svr/s wMB_svr/s ops/s rops/s wops/s Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 0.00 0.00 7.00 0.00 0.88 0.00 256.00 0.01 0.71 0.57 0.40 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md3 0.00 0.00 7.00 0.00 0.88 0.00 256.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Filesystem: rMB_nor/s wMB_nor/s rMB_dir/s wMB_dir/s rMB_svr/s wMB_svr/s ops/s rops/s wops/s Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.01 0.75 0.62 0.50 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md3 0.00 0.00 8.00 0.00 1.00 0.00 256.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I've solved my problem. I had an incorrect entry in /etc/hosts.
Earlier: ip alias
Now: ip domain alias
cpu pegged, memory pegged, strace output
All of a sudden, an application that I've been working on is hammering the server's CPU and memory. I have not made any code changes since this "pegging" started. I did some digging with strace to try to find out what is going on, but I need help deciphering what the output really means.

I took one pid that had been running for several minutes at 100% CPU and about 1.5GB of memory and ran strace -c on it. I got the following output. Most of the time is going to the clone and wait4 calls. Can anybody give me a direction to move in with this info?

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.10    1.011982      252996         4           clone
 30.67    0.573741          51     11296           brk
 11.98    0.224099       56025         4           wait4
  1.96    0.036701          53       687           munmap
  0.68    0.012642        1580         8           mremap
  0.58    0.010928           2      4886         5 read
  0.01    0.000135           0      2854           fstat
  0.01    0.000112           0       845        25 open
  0.00    0.000073           0       935       115 lstat
  0.00    0.000044           0        97        23 access
  0.00    0.000043           0       464        40 stat
  0.00    0.000037           0       466           write
  0.00    0.000037           0       466           gettimeofday
  0.00    0.000035           0       840           close
  0.00    0.000000           0       173           poll
  0.00    0.000000           0       210           lseek
  0.00    0.000000           0       688           mmap
  0.00    0.000000           0         5           rt_sigaction
  0.00    0.000000           0         5           rt_sigprocmask
  0.00    0.000000           0        16           writev
  0.00    0.000000           0        55           setitimer
  0.00    0.000000           0         2           socket
  0.00    0.000000           0         2         2 connect
  0.00    0.000000           0         4           accept
  0.00    0.000000           0         6           shutdown
  0.00    0.000000           0         4           getsockname
  0.00    0.000000           0        10           setsockopt
  0.00    0.000000           0         2           getsockopt
  0.00    0.000000           0         8           semop
  0.00    0.000000           0        38           fcntl
  0.00    0.000000           0       168           flock
  0.00    0.000000           0        12           getdents
  0.00    0.000000           0        25           getcwd
  0.00    0.000000           0        10           chdir
  0.00    0.000000           0         2           unlink
  0.00    0.000000           0         2           chmod
  0.00    0.000000           0        15           umask
  0.00    0.000000           0         7           times
  0.00    0.000000           0         2         1 futex
  0.00    0.000000           0         4           epoll_wait
  0.00    0.000000           0         6           openat
  0.00    0.000000           0         4           pipe2
------ ----------- ----------- --------- --------- ----------------
100.00    1.870609                 25337       211 total

NEW - ADDITIONAL INFO

I have some more info on this.
Users of my application upload images. The directory the images are uploaded to now holds more than 80k image files. When I strace a process that is hogging CPU and memory, I see a lot of reads related to that directory, followed by brk calls. Below is an excerpt from the strace.

poll([{fd=18, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(18, "e\2\0\0\3SELECT \"File\".\"ClassName\", "..., 617) = 617
read(18, "\1\0\0\1\0244\0\0\2\3def\4live\4File\4File\tCla"..., 16384) = 11488
read(18, "\0010\0010\00266\10MMSImage\343\0\0D\10MMSImage\02320"..., 16384) = 13032
read(18, "\373\0011\0010\0010\003122\10MMSImage\345\0\0|\10MMSImag"..., 16384) = 11584
read(18, "257\373\0011\0010\0010\003179\10MMSImage\345\0\0\255\10MMSI"..., 16384) = 16384
read(18, "nswer_Image/0_9373043939_1321407"..., 16384) = 16384
read(18, "or question #1054.Eassets/UserCo"..., 16384) = 16384

Then, a bit later, a ton of brk calls:

brk(0x7f2ba285e000) = 0x7f2ba285e000
brk(0x7f2ba291e000) = 0x7f2ba291e000
brk(0x7f2ba295e000) = 0x7f2ba295e000
brk(0x7f2ba299e000) = 0x7f2ba299e000
brk(0x7f2ba29de000) = 0x7f2ba29de000
brk(0x7f2ba2a1f000) = 0x7f2ba2a1f000
brk(0x7f2ba2a5f000) = 0x7f2ba2a5f000
brk(0x7f2ba2a9f000) = 0x7f2ba2a9f000
brk(0x7f2ba2adf000) = 0x7f2ba2adf000
brk(0x7f2ba2b1f000) = 0x7f2ba2b1f000
brk(0x7f2ba2b5f000) = 0x7f2ba2b5f000
brk(0x7f2ba2b9f000) = 0x7f2ba2b9f000
brk(0x7f2ba2bdf000) = 0x7f2ba2bdf000
brk(0x7f2ba2c1f000) = 0x7f2ba2c1f000
brk(0x7f2ba2c5f000) = 0x7f2ba2c5f000
brk(0x7f2ba2c9f000)

Does anybody know what may be going on there? Do I have too many files in the directory, or something like that? Does the hard drive have to seek to the directory too much?
Well, let's start at the top. A lot of time in clone and wait4 suggests that this application is forking or spawning threads more than it should. As for the directory size: if you are reading the image files into dynamically allocated memory, brk has to be called to get more memory. It's that simple. If you ask me, just take out a few of the forks or threads.
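To make the point about clone, wait4 and brk easier to check, here is a minimal Python sketch that ranks the rows of an strace -c summary by time and by call count. The function and class names (parse_strace_summary, SyscallStat) are made up for illustration, and the sample text is just a few rows copied from the question.

```python
from typing import NamedTuple


class SyscallStat(NamedTuple):
    name: str
    seconds: float
    usecs_per_call: int
    calls: int


def parse_strace_summary(text: str) -> list[SyscallStat]:
    """Parse the data rows of an `strace -c` summary (hypothetical helper)."""
    stats = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 5 fields, or 6 when the errors column is filled in.
        if len(fields) not in (5, 6):
            continue
        try:
            seconds = float(fields[1])
            usecs = int(fields[2])
            calls = int(fields[3])
        except ValueError:
            continue  # header or separator line
        if fields[-1] == "total":
            continue  # summary row, not a syscall
        stats.append(SyscallStat(fields[-1], seconds, usecs, calls))
    return stats


# A few rows copied from the output in the question.
SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.10    1.011982      252996         4           clone
 30.67    0.573741          51     11296           brk
 11.98    0.224099       56025         4           wait4
  0.58    0.010928           2      4886         5 read
------ ----------- ----------- --------- --------- ----------------
100.00    1.870609                 25337       211 total
"""

stats = parse_strace_summary(SAMPLE)
slowest = max(stats, key=lambda s: s.seconds)
busiest = max(stats, key=lambda s: s.calls)
print("most time: ", slowest.name, slowest.seconds, "s in", slowest.calls, "calls")
print("most calls:", busiest.name, busiest.calls, "calls")
```

On these rows, clone tops the time ranking with only 4 calls, while brk tops the call count. Also keep in mind that strace -c counts time spent inside system calls, and the total here is under 2 seconds, so CPU burned in the application's own code over those minutes would not appear in this table.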
Understanding ruby-prof output
I ran ruby-prof on one of my programs. I'm trying to figure out what each field means. I'm guessing everything is CPU time (and not wall clock time), which is fantastic. I want to understand what the "---" stands for. Is there some sort of stack information in there? What does calls a/b mean?

Thread ID: 81980260
Total Time: 0.28

  %total   %self     total      self      wait     child     calls   Name
--------------------------------------------------------------------------------
                      0.28      0.00      0.00      0.28       5/6   FrameParser#receive_data
 100.00%   0.00%      0.28      0.00      0.00      0.28         6   FrameParser#read_frames
                      0.28      0.00      0.00      0.28       4/4   ChatServerClient#receive_frame
                      0.00      0.00      0.00      0.00      5/47   Fixnum#+
                      0.00      0.00      0.00      0.00       1/2   DebugServer#receive_frame
                      0.00      0.00      0.00      0.00     10/29   String#[]
                      0.00      0.00      0.00      0.00     10/21   <Class::Range>#allocate
                      0.00      0.00      0.00      0.00     10/71   String#index
--------------------------------------------------------------------------------
 100.00%   0.00%      0.28      0.00      0.00      0.28         5   FrameParser#receive_data
                      0.28      0.00      0.00      0.28       5/6   FrameParser#read_frames
                      0.00      0.00      0.00      0.00      5/16   ActiveSupport::CoreExtensions::String::OutputSafety#add_with_safety
--------------------------------------------------------------------------------
                      0.28      0.00      0.00      0.28       4/4   FrameParser#read_frames
 100.00%   0.00%      0.28      0.00      0.00      0.28         4   ChatServerClient#receive_frame
                      0.28      0.00      0.00      0.28       4/6   <Class::Lal>#safe_call
--------------------------------------------------------------------------------
                      0.00      0.00      0.00      0.00       1/6   <Class::Lal>#safe_call
                      0.00      0.00      0.00      0.00       1/6   DebugServer#receive_frame
                      0.28      0.00      0.00      0.28       4/6   ChatServerClient#receive_frame
 100.00%   0.00%      0.28      0.00      0.00      0.28         6   <Class::Lal>#safe_call
                      0.21      0.00      0.00      0.21       2/4   ChatUserFunction#register
                      0.06      0.00      0.00      0.06       2/2   ChatUserFunction#packet
                      0.01      0.00      0.00      0.01     4/130   Class#new
                      0.00      0.00      0.00      0.00       1/1   DebugServer#profile_stop
                      0.00      0.00      0.00      0.00      1/33   String#==
                      0.00      0.00      0.00      0.00       1/6   <Class::Lal>#safe_call
                      0.00      0.00      0.00      0.00       5/5   JSON#parse
                      0.00      0.00      0.00      0.00       5/8   <Class::Log>#log
                      0.00      0.00      0.00      0.00       5/5   String#strip!
--------------------------------------------------------------------------------
Each section of the ruby-prof output (the sections are separated by the "---" lines) examines one particular method. For instance, look at the first section of your output. The read_frames method on FrameParser is the focus, and the section is basically saying the following:

- 100% of the execution time that was profiled was spent inside of FrameParser#read_frames.
- FrameParser#read_frames was called 6 times.
- 5 out of the 6 calls to read_frames came from FrameParser#receive_data, and this accounted for 100% of the execution time (this is the line above the read_frames line).

The lines below the read_frames line (but within that first section) are all of the methods that FrameParser#read_frames calls (you should be aware of these, since this seems to be your code), how many of each method's total calls read_frames is responsible for (the a/b in the calls column), and how much time those calls took. They are ordered by which of them took up the most execution time. In your case, that is the receive_frame method on the ChatServerClient class.

You can then look down at the section focusing on receive_frame (2 down, centered on the '100%' line for receive_frame) and see how its performance is broken down. Each section is set up the same way, and usually the callee that took the most time is the focus of the next section down. ruby-prof continues this way through the full call stack, so you can go as deep as you want until you find the bottleneck you'd like to resolve.
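As an illustration of how the a/b calls column reads, here is a small Python sketch. The helper names (parse_calls, describe) are invented for this example and are not part of ruby-prof; the cells come from the first section of the output in the question.

```python
def parse_calls(cell: str) -> tuple[int, int]:
    """Split a calls cell such as '5/6' into (made, total).

    A plain number (the focus row, e.g. '6') is returned as (6, 6).
    """
    if "/" in cell:
        made, total = cell.split("/")
        return int(made), int(total)
    n = int(cell)
    return n, n


def describe(focus: str, other: str, cell: str, is_caller: bool) -> str:
    """Spell out one row of a section in words.

    is_caller=True  -> `other` is listed above the focus row (it calls the focus)
    is_caller=False -> `other` is listed below the focus row (the focus calls it)
    """
    made, total = parse_calls(cell)
    if is_caller:
        return f"{other} made {made} of the {total} calls to {focus}"
    return f"{focus} made {made} of the {total} calls to {other}"


focus = "FrameParser#read_frames"
print(describe(focus, "FrameParser#receive_data", "5/6", is_caller=True))
print(describe(focus, "ChatServerClient#receive_frame", "4/4", is_caller=False))
print(describe(focus, "Fixnum#+", "5/47", is_caller=False))
```

So for a row above the focus line, b is the focus method's total call count; for a row below it, b is the listed method's total call count across the whole profile.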
Percentage of memory used by a process
How do I list processes along with the percentage of memory used by each one? Normally prstat -J gives the size of the process image, the RSS (resident set size), and so on, but I want the percentage per process. I am working on Solaris. Additionally, what commands do you regularly use for monitoring processes and their performance? That might be very useful to everyone!
The top command will give you several memory-consumption numbers. htop is much nicer, and will give you percentages, but it isn't installed by default on most systems.
Run top and then press Shift+O. This brings up the sort options; press n (this may be different on your machine) for memory and then hit Enter.

Example of memory sort:

top - 08:17:29 up 3 days, 8:54, 6 users, load average: 13.98, 14.01, 11.60
Tasks: 654 total, 2 running, 652 sleeping, 0 stopped, 0 zombie
Cpu(s): 14.7%us, 1.5%sy, 0.0%ni, 59.5%id, 23.5%wa, 0.1%hi, 0.8%si, 0.0%st
Mem: 65851896k total, 49049196k used, 16802700k free, 1074664k buffers
Swap: 50331640k total, 0k used, 50331640k free, 32776940k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
21635 oracle    15   0 6750m 636m  51m S  1.6  1.0  62:34.53 oracle
21623 oracle    15   0 6686m 572m  53m S  1.1  0.9  61:16.95 oracle
21633 oracle    16   0 6566m 445m 235m S  3.7  0.7  30:22.60 oracle
21615 oracle    16   0 6550m 428m 220m S  3.7  0.7  29:36.74 oracle
16349 oracle    RT   0  431m 284m  41m S  0.5  0.4   2:41.08 ocssd.bin
17891 root      RT   0  139m 118m  40m S  0.5  0.2  41:08.19 osysmond
18154 root      RT   0  182m  98m  43m S  0.0  0.2  10:02.40 ologgerd
12211 root      15   0 1432m  84m  14m S  0.0  0.1  17:57.80 java

Another method on Solaris is to run prstat -s size 1 1.

Example prstat output:

www004:/# prstat -s size 1 1
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
   420 nobody    139M   60M sleep   29   10   1:46:56 0.1% webservd/76
   603 nobody    135M   59M sleep   29   10   5:33:18 0.1% webservd/96
   339 root      134M   70M sleep   59    0   0:35:38 0.0% java/24
   435 iplanet   132M   55M sleep   29   10   1:10:39 0.1% webservd/76
   573 nobody    131M   53M sleep   29   10   0:24:32 0.0% webservd/76
   588 nobody    130M   53M sleep   29   10   2:40:55 0.1% webservd/86
   454 nobody    128M   51M sleep   29   10   0:09:01 0.0% webservd/76
   489 iplanet   126M   49M sleep   29   10   0:00:13 0.0% webservd/74
   405 root      119M   45M sleep   29   10   0:00:13 0.0% webservd/31
   717 root       54M   46M sleep   59    0   2:31:27 0.2% agent/7

Keep in mind this is sorted by size, not RSS. If you need it sorted by RSS, use the rss key:

www004:/# prstat -s rss 1 1
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
   339 root      134M   70M sleep   59    0   0:35:39 0.1% java/24
   420 nobody    139M   60M sleep   29   10   1:46:57 0.4% webservd/76
   603 nobody    135M   59M sleep   29   10   5:33:19 0.5% webservd/96
   435 iplanet   132M   55M sleep   29   10   1:10:39 0.0% webservd/76
   573 nobody    131M   53M sleep   29   10   0:24:32 0.0% webservd/76
   588 nobody    130M   53M sleep   29   10   2:40:55 0.0% webservd/86
   454 nobody    128M   51M sleep   29   10   0:09:01 0.0% webservd/76
   489 iplanet   126M   49M sleep   29   10   0:00:13 0.0% webservd/74
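For what it's worth, the %MEM figure in the top output is just resident memory (RES) divided by total physical memory, so the same percentage can be derived by hand from any tool that reports RSS. A minimal Python sketch of that arithmetic, using values from the top example above (the helper names to_kib and mem_percent are made up, and the suffixes are assumed to be binary units):

```python
# Multipliers to KiB for the size suffixes that appear in the top output.
UNIT_KIB = {"k": 1, "m": 1024, "g": 1024 * 1024}


def to_kib(size: str) -> int:
    """Convert a size such as '636m' or '65851896k' to KiB."""
    size = size.strip().lower()
    if size[-1] in UNIT_KIB:
        return int(float(size[:-1]) * UNIT_KIB[size[-1]])
    return int(size)  # assumption: a bare number is already in KiB


def mem_percent(res: str, total: str) -> float:
    """Resident memory as a percentage of total physical memory, one decimal."""
    return round(100.0 * to_kib(res) / to_kib(total), 1)


TOTAL = "65851896k"  # "Mem: ... total" line from the top example
for pid, res in [("21635", "636m"), ("21623", "572m"), ("16349", "284m")]:
    print(pid, res, mem_percent(res, TOTAL))
```

These come out to 1.0, 0.9 and 0.4, matching the %MEM column for those PIDs. For the prstat output, the same calculation needs the machine's total physical memory, which is not shown in that example.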
I'm not sure whether ps is standardized, but at least on Linux, ps -o %mem gives the percentage of memory used (you would obviously want to add some other columns as well).