Impala memory limit exceeded with simple count query - hadoop
Edit:
There were some corrupt Avro files in the table. After removing them, everything works fine. I had decompressed those files to JSON with avro-tools, and the decompressed output is not very large either, so the ~100 GB allocation presumably comes from a bogus block size read out of a corrupt file rather than from real data; it seems to be a bug in how Impala handles corrupt Avro files.
I have an Impala table stored as gzip-compressed Avro, partitioned by "day". When I execute the query:
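In case it helps anyone else, this is roughly how the bad files can be spotted with avro-tools; the warehouse path and the avro-tools jar name are illustrative, not the exact ones from my environment:
# Pull the partition's files out of HDFS (path is an example).
hdfs dfs -get /user/hive/warehouse/adhoc_data_fast.db/log/day=2017-04-05 ./day=2017-04-05
# A corrupt data file makes `tojson` exit non-zero, so the bad ones stand out.
for f in ./day=2017-04-05/*.avro; do
  if ! java -jar avro-tools-1.8.1.jar tojson "$f" > /dev/null 2>&1; then
    echo "possibly corrupt: $f"
  fi
done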
select count(0) from adhoc_data_fast.log where day='2017-04-05';
It fails with:
Query: select count(0) from adhoc_data_fast.log where day='2017-04-05'
Query submitted at: 2017-04-06 13:35:56 (Coordinator: http://szq7.appadhoc.com:25000)
Query progress can be monitored at: http://szq7.appadhoc.com:25000/query_plan?query_id=ef4698db870efd4d:739c89ef00000000
WARNINGS:
Memory limit exceeded
GzipDecompressor failed to allocate 109051904000 bytes.
Each node is configured with 96 GB of memory, and the single pool memory limit is set to 300 GB.
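For reference, the per-daemon limit can be eyeballed from the impalad debug web UI; the hostname below is one of my nodes and 25000 is the default debug port, but treat the exact endpoints as an assumption about your Impala version:
# /memz shows the process memory limit and current usage, /varz the startup flags.
curl -s http://szq7.appadhoc.com:25000/memz | head -n 20
curl -s http://szq7.appadhoc.com:25000/varz | grep -i mem_limit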
None of the compressed files is larger than 250 MB (how I list them is sketched right after the file list):
62M log.2017-04-05.1491321605834.avro
79M log.2017-04-05.1491323647211.avro
62M log.2017-04-05.1491327241311.avro
60M log.2017-04-05.1491330839609.avro
52M log.2017-04-05.1491334439092.avro
59M log.2017-04-05.1491338038503.avro
93M log.2017-04-05.1491341639694.avro
130M log.2017-04-05.1491345239969.avro
147M log.2017-04-05.1491348843931.avro
183M log.2017-04-05.1491352442955.avro
218M log.2017-04-05.1491359648079.avro
181M log.2017-04-05.1491363247597.avro
212M log.2017-04-05.1491366845827.avro
207M log.2017-04-05.1491370445873.avro
197M log.2017-04-05.1491374045830.avro
164M log.2017-04-05.1491377650935.avro
155M log.2017-04-05.1491381249597.avro
203M log.2017-04-05.1491384846366.avro
185M log.2017-04-05.1491388450262.avro
198M log.2017-04-05.1491392047694.avro
206M log.2017-04-05.1491395648818.avro
214M log.2017-04-05.1491399246407.avro
167M log.2017-04-05.1491402846469.avro
77M log.2017-04-05.1491406180615.avro
3.2M log.2017-04-05.1491409790105.avro
1.3M log.2017-04-05.1491413385884.avro
928K log.2017-04-05.1491416981829.avro
832K log.2017-04-05.1491420581588.avro
1.1M log.2017-04-05.1491424180191.avro
2.6M log.2017-04-05.1491427781339.avro
3.8M log.2017-04-05.1491431382552.avro
3.3M log.2017-04-05.1491434984679.avro
5.2M log.2017-04-05.1491438586674.avro
5.1M log.2017-04-05.1491442192541.avro
2.3M log.2017-04-05.1491445789230.avro
884K log.2017-04-05.1491449386630.avro
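The sizes above come from something like the following (warehouse path is illustrative):
# Human-readable per-file sizes for the partition directory in HDFS.
hdfs dfs -du -h /user/hive/warehouse/adhoc_data_fast.db/log/day=2017-04-05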
I fetched them from HDFS and used avro-tools to convert them to JSON in order to decompress them (the commands are sketched after the listing). None of the decompressed files is larger than 1 GB:
16M log.2017-04-05.1491321605834.avro.json
308M log.2017-04-05.1491323647211.avro.json
103M log.2017-04-05.1491327241311.avro.json
150M log.2017-04-05.1491330839609.avro.json
397M log.2017-04-05.1491334439092.avro.json
297M log.2017-04-05.1491338038503.avro.json
160M log.2017-04-05.1491341639694.avro.json
95M log.2017-04-05.1491345239969.avro.json
360M log.2017-04-05.1491348843931.avro.json
338M log.2017-04-05.1491352442955.avro.json
71M log.2017-04-05.1491359648079.avro.json
161M log.2017-04-05.1491363247597.avro.json
628M log.2017-04-05.1491366845827.avro.json
288M log.2017-04-05.1491370445873.avro.json
162M log.2017-04-05.1491374045830.avro.json
90M log.2017-04-05.1491377650935.avro.json
269M log.2017-04-05.1491381249597.avro.json
620M log.2017-04-05.1491384846366.avro.json
70M log.2017-04-05.1491388450262.avro.json
30M log.2017-04-05.1491392047694.avro.json
114M log.2017-04-05.1491395648818.avro.json
370M log.2017-04-05.1491399246407.avro.json
359M log.2017-04-05.1491402846469.avro.json
218M log.2017-04-05.1491406180615.avro.json
29M log.2017-04-05.1491409790105.avro.json
3.9M log.2017-04-05.1491413385884.avro.json
9.3M log.2017-04-05.1491416981829.avro.json
8.3M log.2017-04-05.1491420581588.avro.json
2.3M log.2017-04-05.1491424180191.avro.json
25M log.2017-04-05.1491427781339.avro.json
24M log.2017-04-05.1491431382552.avro.json
5.7M log.2017-04-05.1491434984679.avro.json
35M log.2017-04-05.1491438586674.avro.json
5.8M log.2017-04-05.1491442192541.avro.json
23M log.2017-04-05.1491445789230.avro.json
4.3M log.2017-04-05.1491449386630.avro.json
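The JSON files above were produced with a loop along these lines (jar name illustrative); since tojson fully decodes each file, the output size is a reasonable proxy for the uncompressed data:
# Decode every downloaded Avro file to JSON, then list the resulting sizes.
for f in ./day=2017-04-05/*.avro; do
  java -jar avro-tools-1.8.1.jar tojson "$f" > "$f.json"
done
ls -lh ./day=2017-04-05/*.json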
Here is the Impala profile:
[szq7.appadhoc.com:21000] > profile;
Query Runtime Profile:
Query (id=ef4698db870efd4d:739c89ef00000000):
Summary:
Session ID: f54bb090170bcdb6:621ac5796ef2668c
Session Type: BEESWAX
Start Time: 2017-04-06 13:35:56.454441000
End Time: 2017-04-06 13:35:57.326967000
Query Type: QUERY
Query State: EXCEPTION
Query Status:
Memory limit exceeded
GzipDecompressor failed to allocate 109051904000 bytes.
Impala Version: impalad version 2.7.0-cdh5.9.1 RELEASE (build 24ad6df788d66e4af9496edb26ac4d1f1d2a1f2c)
User: ubuntu
Connected User: ubuntu
Delegated User:
Network Address: ::ffff:192.168.1.7:29026
Default Db: default
Sql Statement: select count(0) from adhoc_data_fast.log where day='2017-04-05'
Coordinator: szq7.appadhoc.com:22000
Query Options (non default):
Plan:
----------------
Estimated Per-Host Requirements: Memory=410.00MB VCores=1
WARNING: The following tables are missing relevant table and/or column statistics.
adhoc_data_fast.log
03:AGGREGATE [FINALIZE]
| output: count:merge(0)
| hosts=13 per-host-mem=unavailable
| tuple-ids=1 row-size=8B cardinality=1
|
02:EXCHANGE [UNPARTITIONED]
| hosts=13 per-host-mem=unavailable
| tuple-ids=1 row-size=8B cardinality=1
|
01:AGGREGATE
| output: count(0)
| hosts=13 per-host-mem=10.00MB
| tuple-ids=1 row-size=8B cardinality=1
|
00:SCAN HDFS [adhoc_data_fast.log, RANDOM]
partitions=1/7594 files=38 size=3.45GB
table stats: unavailable
column stats: all
hosts=13 per-host-mem=400.00MB
tuple-ids=0 row-size=0B cardinality=unavailable
----------------
Estimated Per-Host Mem: 429916160
Estimated Per-Host VCores: 1
Tables Missing Stats: adhoc_data_fast.log
Request Pool: default-pool
Admission result: Admitted immediately
ExecSummary:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
-------------------------------------------------------------------------------------------------------------
03:AGGREGATE 1 52.298ms 52.298ms 0 1 4.00 KB -1.00 B FINALIZE
02:EXCHANGE 1 676.993ms 676.993ms 0 1 0 -1.00 B UNPARTITIONED
01:AGGREGATE 0 0.000ns 0.000ns 0 1 0 10.00 MB
00:SCAN HDFS 0 0.000ns 0.000ns 0 -1 0 400.00 MB adhoc_data_fast.log
Planner Timeline: 69.589ms
- Analysis finished: 6.642ms (6.642ms)
- Equivalence classes computed: 6.980ms (337.753us)
- Single node plan created: 13.302ms (6.322ms)
- Runtime filters computed: 13.368ms (65.984us)
- Distributed plan created: 15.131ms (1.763ms)
- Lineage info computed: 16.488ms (1.356ms)
- Planning finished: 69.589ms (53.101ms)
Query Timeline: 874.026ms
- Start execution: 63.320us (63.320us)
- Planning finished: 72.764ms (72.701ms)
- Submit for admission: 73.592ms (827.496us)
- Completed admission: 73.775ms (183.088us)
- Ready to start 13 remote fragments: 126.950ms (53.175ms)
- All 13 remote fragments started: 161.919ms (34.968ms)
- Rows available: 856.761ms (694.842ms)
- Unregister query: 872.527ms (15.765ms)
- ComputeScanRangeAssignmentTimer: 356.136us
ImpalaServer:
- ClientFetchWaitTimer: 0.000ns
- RowMaterializationTimer: 0.000ns
Execution Profile ef4698db870efd4d:739c89ef00000000:(Total: 782.712ms, non-child: 0.000ns, % non-child: 0.00%)
Number of filters: 0
Filter routing table:
ID Src. Node Tgt. Node(s) Targets Target type Partition filter Pending (Expected) First arrived Completed Enabled
----------------------------------------------------------------------------------------------------------------------------
Fragment start latencies: Count: 13, 25th %-ile: 1ms, 50th %-ile: 1ms, 75th %-ile: 1ms, 90th %-ile: 2ms, 95th %-ile: 2ms, 99.9th %-ile: 35ms
Per Node Peak Memory Usage: szq15.appadhoc.com:22000(0) szq1.appadhoc.com:22000(0) szq13.appadhoc.com:22000(0) szq12.appadhoc.com:22000(0) szq11.appadhoc.com:22000(0) szq20.appadhoc.com:22000(0) szq14.appadhoc.com:22000(0) szq8.appadhoc.com:22000(0) szq5.appadhoc.com:22000(0) szq9.appadhoc.com:22000(0) szq4.appadhoc.com:22000(0) szq6.appadhoc.com:22000(0) szq7.appadhoc.com:22000(0)
- FiltersReceived: 0 (0)
- FinalizationTimer: 0.000ns
Coordinator Fragment F01:(Total: 729.811ms, non-child: 0.000ns, % non-child: 0.00%)
MemoryUsage(500.000ms): 12.00 KB
- AverageThreadTokens: 0.00
- BloomFilterBytes: 0
- PeakMemoryUsage: 12.00 KB (12288)
- PerHostPeakMemUsage: 0
- PrepareTime: 52.291ms
- RowsProduced: 0 (0)
- TotalCpuTime: 0.000ns
- TotalNetworkReceiveTime: 676.991ms
- TotalNetworkSendTime: 0.000ns
- TotalStorageWaitTime: 0.000ns
BlockMgr:
- BlockWritesOutstanding: 0 (0)
- BlocksCreated: 0 (0)
- BlocksRecycled: 0 (0)
- BufferedPins: 0 (0)
- BytesWritten: 0
- MaxBlockSize: 8.00 MB (8388608)
- MemoryLimit: 102.40 GB (109951164416)
- PeakMemoryUsage: 0
- TotalBufferWaitTime: 0.000ns
- TotalEncryptionTime: 0.000ns
- TotalIntegrityCheckTime: 0.000ns
- TotalReadBlockTime: 0.000ns
CodeGen:(Total: 63.837ms, non-child: 63.837ms, % non-child: 100.00%)
- CodegenTime: 828.728us
- CompileTime: 2.957ms
- LoadTime: 0.000ns
- ModuleBitcodeSize: 1.89 MB (1984232)
- NumFunctions: 7 (7)
- NumInstructions: 96 (96)
- OptimizationTime: 8.070ms
- PrepareTime: 51.769ms
AGGREGATION_NODE (id=3):(Total: 729.291ms, non-child: 52.298ms, % non-child: 7.17%)
ExecOption: Codegen Enabled
- BuildTime: 0.000ns
- GetResultsTime: 0.000ns
- HTResizeTime: 0.000ns
- HashBuckets: 0 (0)
- LargestPartitionPercent: 0 (0)
- MaxPartitionLevel: 0 (0)
- NumRepartitions: 0 (0)
- PartitionsCreated: 0 (0)
- PeakMemoryUsage: 4.00 KB (4096)
- RowsRepartitioned: 0 (0)
- RowsReturned: 0 (0)
- RowsReturnedRate: 0
- SpilledPartitions: 0 (0)
EXCHANGE_NODE (id=2):(Total: 676.993ms, non-child: 676.993ms, % non-child: 100.00%)
BytesReceived(500.000ms): 0
- BytesReceived: 0
- ConvertRowBatchTime: 0.000ns
- DeserializeRowBatchTimer: 0.000ns
- FirstBatchArrivalWaitTime: 0.000ns
- PeakMemoryUsage: 0
- RowsReturned: 0 (0)
- RowsReturnedRate: 0
- SendersBlockedTimer: 0.000ns
- SendersBlockedTotalTimer(*): 0.000ns
Averaged Fragment F00:
split sizes: min: 114.60 MB, max: 451.79 MB, avg: 271.65 MB, stddev: 104.16 MB
completion times: min:694.632ms max:728.356ms mean: 725.379ms stddev:8.878ms
execution rates: min:157.45 MB/sec max:620.68 MB/sec mean:374.89 MB/sec stddev:144.30 MB/sec
num instances: 13
Fragment F00:
Instance ef4698db870efd4d:739c89ef00000001 (host=szq5.appadhoc.com:22000):
Instance ef4698db870efd4d:739c89ef00000002 (host=szq8.appadhoc.com:22000):
Instance ef4698db870efd4d:739c89ef00000003 (host=szq14.appadhoc.com:22000):
Instance ef4698db870efd4d:739c89ef00000004 (host=szq20.appadhoc.com:22000):
Instance ef4698db870efd4d:739c89ef00000005 (host=szq11.appadhoc.com:22000):
Instance ef4698db870efd4d:739c89ef00000006 (host=szq12.appadhoc.com:22000):
Instance ef4698db870efd4d:739c89ef00000007 (host=szq13.appadhoc.com:22000):
Instance ef4698db870efd4d:739c89ef00000008 (host=szq1.appadhoc.com:22000):
Instance ef4698db870efd4d:739c89ef00000009 (host=szq15.appadhoc.com:22000):
Instance ef4698db870efd4d:739c89ef0000000a (host=szq6.appadhoc.com:22000):
Instance ef4698db870efd4d:739c89ef0000000b (host=szq4.appadhoc.com:22000):
Instance ef4698db870efd4d:739c89ef0000000c (host=szq9.appadhoc.com:22000):
Instance ef4698db870efd4d:739c89ef0000000d (host=szq7.appadhoc.com:22000):
So why does Impala need so much memory?
It could be that Impala is missing statistics on your table for that partition. The explain plan highlights the following:
Estimated Per-Host Requirements: Memory=410.00MB VCores=1
WARNING: The following tables are missing relevant table and/or column statistics.
adhoc_data_fast.log
Try running a COMPUTE STATS on the table, or a COMPUTE INCREMENTAL STATS for the partition.
e.g.
COMPUTE INCREMENTAL STATS adhoc_data_fast.log PARTITION (day='2017-04-05');
This will help Impala with its resource planning. I would be surprised if this alone fixes the error, but it's worth trying first.
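Once the stats run finishes, you can confirm that Impala now has row counts for that partition (run via impala-shell, adding -i <impalad-host> if you are not on a node). In the SHOW TABLE STATS output, #Rows should change from -1 to a real number for day='2017-04-05':
impala-shell -q "COMPUTE INCREMENTAL STATS adhoc_data_fast.log PARTITION (day='2017-04-05')"
impala-shell -q "SHOW TABLE STATS adhoc_data_fast.log"
impala-shell -q "SHOW COLUMN STATS adhoc_data_fast.log"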
Related
Can't add node to the CockroachDB cluster
I'm staking to join a CockroachDB node to a cluster. I've created first cluster, then try to join 2nd node to the first node, but 2nd node created new cluster as follows. Does anyone knows whats are wrong steps on the following my steps, any suggestions are wellcome. I've started first node as follows: cockroach start --insecure --advertise-host=163.172.156.111 * Check out how to secure your cluster: https://www.cockroachlabs.com/docs/v19.1/secure-a-cluster.html * CockroachDB node starting at 2019-05-11 01:11:15.45522036 +0000 UTC (took 2.5s) build: CCL v19.1.0 # 2019/04/29 18:36:40 (go1.11.6) webui: http://163.172.156.111:8080 sql: postgresql://root#163.172.156.111:26257?sslmode=disable client flags: cockroach <client cmd> --host=163.172.156.111:26257 --insecure logs: /home/ueda/cockroach-data/logs temp dir: /home/ueda/cockroach-data/cockroach-temp449555924 external I/O path: /home/ueda/cockroach-data/extern store[0]: path=/home/ueda/cockroach-data status: initialized new cluster clusterID: 3e797faa-59a1-4b0d-83b5-36143ddbdd69 nodeID: 1 Then, start secondary node to join to 163.172.156.111, but can't join: cockroach start --insecure --advertise-addr=128.199.127.164 --join=163.172.156.111:26257 CockroachDB node starting at 2019-05-11 01:21:14.533097432 +0000 UTC (took 0.8s) build: CCL v19.1.0 # 2019/04/29 18:36:40 (go1.11.6) webui: http://128.199.127.164:8080 sql: postgresql://root#128.199.127.164:26257?sslmode=disable client flags: cockroach <client cmd> --host=128.199.127.164:26257 --insecure logs: /home/ueda/cockroach-data/logs temp dir: /home/ueda/cockroach-data/cockroach-temp067740997 external I/O path: /home/ueda/cockroach-data/extern store[0]: path=/home/ueda/cockroach-data status: restarted pre-existing node clusterID: a14e89a7-792d-44d3-89af-7037442eacbc nodeID: 1 The cockroach.log of joining node shows some gosip error: cat cockroach-data/logs/cockroach.log I190511 01:21:13.762309 1 util/log/clog.go:1199 [config] file created at: 2019/05/11 01:21:13 I190511 01:21:13.762309 1 util/log/clog.go:1199 [config] running on machine: amfortas I190511 01:21:13.762309 1 util/log/clog.go:1199 [config] binary: CockroachDB CCL v19.1.0 (x86_64-unknown-linux-gnu, built 2019/04/29 18:36:40, go1.11.6) I190511 01:21:13.762309 1 util/log/clog.go:1199 [config] arguments: [cockroach start --insecure --advertise-addr=128.199.127.164 --join=163.172.156.111:26257] I190511 01:21:13.762309 1 util/log/clog.go:1199 line format: [IWEF]yymmdd hh:mm:ss.uuuuuu goid file:line msg utf8=✓ I190511 01:21:13.762307 1 cli/start.go:1033 logging to directory /home/ueda/cockroach-data/logs W190511 01:21:13.763373 1 cli/start.go:1068 RUNNING IN INSECURE MODE! - Your cluster is open for any client that can access <all your IP addresses>. - Any user, even root, can log in without providing a password. - Any user, connecting as root, can read or write any data in your cluster. - There is no network encryption nor authentication, and thus no confidentiality. Check out how to secure your cluster: https://www.cockroachlabs.com/docs/v19.1/secure-a-cluster.html I190511 01:21:13.763675 1 server/status/recorder.go:610 available memory from cgroups (8.0 EiB) exceeds system memory 992 MiB, using system memory W190511 01:21:13.763752 1 cli/start.go:944 Using the default setting for --cache (128 MiB). A significantly larger value is usually needed for good performance. If you have a dedicated server a reasonable setting is --cache=.25 (248 MiB). 
I190511 01:21:13.764011 1 server/status/recorder.go:610 available memory from cgroups (8.0 EiB) exceeds system memory 992 MiB, using system memory W190511 01:21:13.764047 1 cli/start.go:957 Using the default setting for --max-sql-memory (128 MiB). A significantly larger value is usually needed in production. If you have a dedicated server a reasonable setting is --max-sql-memory=.25 (248 MiB). I190511 01:21:13.764239 1 server/status/recorder.go:610 available memory from cgroups (8.0 EiB) exceeds system memory 992 MiB, using system memory I190511 01:21:13.764272 1 cli/start.go:1082 CockroachDB CCL v19.1.0 (x86_64-unknown-linux-gnu, built 2019/04/29 18:36:40, go1.11.6) I190511 01:21:13.866977 1 server/status/recorder.go:610 available memory from cgroups (8.0 EiB) exceeds system memory 992 MiB, using system memory I190511 01:21:13.867002 1 server/config.go:386 system total memory: 992 MiB I190511 01:21:13.867063 1 server/config.go:388 server configuration: max offset 500000000 cache size 128 MiB SQL memory pool size 128 MiB scan interval 10m0s scan min idle time 10ms scan max idle time 1s event log enabled true I190511 01:21:13.867098 1 cli/start.go:929 process identity: uid 1000 euid 1000 gid 1000 egid 1000 I190511 01:21:13.867115 1 cli/start.go:554 starting cockroach node I190511 01:21:13.868242 21 storage/engine/rocksdb.go:613 opening rocksdb instance at "/home/ueda/cockroach-data/cockroach-temp067740997" I190511 01:21:13.894320 21 server/server.go:876 [n?] monitoring forward clock jumps based on server.clock.forward_jump_check_enabled I190511 01:21:13.894813 21 storage/engine/rocksdb.go:613 opening rocksdb instance at "/home/ueda/cockroach-data" W190511 01:21:13.896301 21 storage/engine/rocksdb.go:127 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/version_set.cc:2566] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed. W190511 01:21:13.905666 21 storage/engine/rocksdb.go:127 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/version_set.cc:2566] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed. I190511 01:21:13.911380 21 server/config.go:494 [n?] 1 storage engine initialized I190511 01:21:13.911417 21 server/config.go:497 [n?] RocksDB cache size: 128 MiB I190511 01:21:13.911427 21 server/config.go:497 [n?] store 0: RocksDB, max size 0 B, max open file limit 10000 W190511 01:21:13.912459 21 gossip/gossip.go:1496 [n?] no incoming or outgoing connections I190511 01:21:13.913206 21 server/server.go:926 [n?] Sleeping till wall time 1557537673913178595 to catches up to 1557537674394265598 to ensure monotonicity. Delta: 481.087003ms I190511 01:21:14.251655 65 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n?] circuitbreaker: gossip [::]:26257->163.172.156.111:26257 tripped: initial connection heartbeat failed: rpc error: code = Unknown desc = client cluster ID "a14e89a7-792d-44d3-89af-7037442eacbc" doesn't match server cluster ID "3e797faa-59a1-4b0d-83b5-36143ddbdd69" I190511 01:21:14.251695 65 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 [n?] circuitbreaker: gossip [::]:26257->163.172.156.111:26257 event: BreakerTripped W190511 01:21:14.251763 65 gossip/client.go:122 [n?] 
failed to start gossip client to 163.172.156.111:26257: initial connection heartbeat failed: rpc error: code = Unknown desc = client cluster ID "a14e89a7-792d-44d3-89af-7037442eacbc" doesn't match server cluster ID "3e797faa-59a1-4b0d-83b5-36143ddbdd69" I190511 01:21:14.395848 21 gossip/gossip.go:392 [n1] NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"128.199.127.164:26257" > attrs:<> locality:<> ServerVersion:<major_val:19 minor_val:1 patch:0 unstable:0 > build_tag:"v19.1.0" started_at:1557537674395557548 W190511 01:21:14.458176 21 storage/replica_range_lease.go:506 can't determine lease status due to node liveness error: node not in the liveness table I190511 01:21:14.458465 21 server/node.go:461 [n1] initialized store [n1,s1]: disk (capacity=24 GiB, available=18 GiB, used=2.2 MiB, logicalBytes=41 MiB), ranges=20, leases=0, queries=0.00, writes=0.00, bytesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=6467.00 p90=26940.00 pMax=43017435.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00 pMax=0.00} I190511 01:21:14.458775 21 storage/stores.go:244 [n1] read 0 node addresses from persistent storage I190511 01:21:14.459095 21 server/node.go:699 [n1] connecting to gossip network to verify cluster ID... W190511 01:21:14.469842 96 storage/store.go:1525 [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown I190511 01:21:14.474785 21 server/node.go:719 [n1] node connected via gossip and verified as part of cluster "a14e89a7-792d-44d3-89af-7037442eacbc" I190511 01:21:14.475033 21 server/node.go:542 [n1] node=1: started with [<no-attributes>=/home/ueda/cockroach-data] engine(s) and attributes [] I190511 01:21:14.475393 21 server/status/recorder.go:610 [n1] available memory from cgroups (8.0 EiB) exceeds system memory 992 MiB, using system memory I190511 01:21:14.475514 21 server/server.go:1582 [n1] starting http server at [::]:8080 (use: 128.199.127.164:8080) I190511 01:21:14.475572 21 server/server.go:1584 [n1] starting grpc/postgres server at [::]:26257 I190511 01:21:14.475605 21 server/server.go:1585 [n1] advertising CockroachDB node at 128.199.127.164:26257 W190511 01:21:14.475655 21 jobs/registry.go:341 [n1] unable to get node liveness: node not in the liveness table I190511 01:21:14.532949 21 server/server.go:1650 [n1] done ensuring all necessary migrations have run I190511 01:21:14.533020 21 server/server.go:1653 [n1] serving sql connections I190511 01:21:14.533209 21 cli/start.go:689 [config] clusterID: a14e89a7-792d-44d3-89af-7037442eacbc I190511 01:21:14.533257 21 cli/start.go:697 node startup completed: CockroachDB node starting at 2019-05-11 01:21:14.533097432 +0000 UTC (took 0.8s) build: CCL v19.1.0 # 2019/04/29 18:36:40 (go1.11.6) webui: http://128.199.127.164:8080 sql: postgresql://root#128.199.127.164:26257?sslmode=disable client flags: cockroach <client cmd> --host=128.199.127.164:26257 --insecure logs: /home/ueda/cockroach-data/logs temp dir: /home/ueda/cockroach-data/cockroach-temp067740997 external I/O path: /home/ueda/cockroach-data/extern store[0]: path=/home/ueda/cockroach-data status: restarted pre-existing node clusterID: a14e89a7-792d-44d3-89af-7037442eacbc nodeID: 1 I190511 01:21:14.541205 146 server/server_update.go:67 [n1] no need to upgrade, cluster already at the newest version I190511 01:21:14.555557 149 sql/event_log.go:135 [n1] Event: "node_restart", target: 1, info: {Descriptor:{NodeID:1 Address:128.199.127.164:26257 Attrs: 
Locality: ServerVersion:19.1 BuildTag:v19.1.0 StartedAt:1557537674395557548 LocalityAddress:[] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} ClusterID:a14e89a7-792d-44d3-89af-7037442eacbc StartedAt:1557537674395557548 LastUp:1557537671113461486} I190511 01:21:14.916458 59 gossip/gossip.go:1510 [n1] node has connected to cluster via gossip I190511 01:21:14.916660 59 storage/stores.go:263 [n1] wrote 0 node addresses to persistent storage I190511 01:21:24.480247 116 storage/store.go:4220 [n1,s1] sstables (read amplification = 2): 0 [ 51K 1 ]: 51K 6 [ 1M 1 ]: 1M I190511 01:21:24.480380 116 storage/store.go:4221 [n1,s1] ** Compaction Stats [default] ** Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop ---------------------------------------------------------------------------------------------------------------------------------------------------------- L0 1/0 50.73 KB 0.5 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 8.0 0 1 0.006 0 0 L6 1/0 1.26 MB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0 Sum 2/0 1.31 MB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 8.0 0 1 0.006 0 0 Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 8.0 0 1 0.006 0 0 Uptime(secs): 10.6 total, 10.6 interval Flush(GB): cumulative 0.000, interval 0.000 AddFile(GB): cumulative 0.000, interval 0.000 AddFile(Total Files): cumulative 0, interval 0 AddFile(L0 Files): cumulative 0, interval 0 AddFile(Keys): cumulative 0, interval 0 Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count estimated_pending_compaction_bytes: 0 B I190511 01:21:24.481565 121 server/status/runtime.go:500 [n1] runtime stats: 170 MiB RSS, 114 goroutines, 0 B/0 B/0 B GO alloc/idle/total, 14 MiB/16 MiB CGO alloc/total, 0.0 CGO/sec, 0.0/0.0 %(u/s)time, 0.0 %gc (7x), 50 KiB/1.5 MiB (r/w)net What is the possibly cause to block to join? Thank you for your suggestion!
It seems you had previously started the second node (the one running on 128.199.127.164) by itself, creating its own cluster. This can be seen in the error message:
W190511 01:21:14.251763 65 gossip/client.go:122 [n?] failed to start gossip client to 163.172.156.111:26257: initial connection heartbeat failed: rpc error: code = Unknown desc = client cluster ID "a14e89a7-792d-44d3-89af-7037442eacbc" doesn't match server cluster ID "3e797faa-59a1-4b0d-83b5-36143ddbdd69"
To be able to join the cluster, the data directory of the joining node must be empty. You can either delete cockroach-data or specify an alternate directory with --store=/path/to/data-dir
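A minimal sketch of the first option, assuming the node on 128.199.127.164 holds nothing you need to keep (flags and paths mirror the ones in the question):
# On the second node: stop cockroach, wipe the stray single-node cluster's data,
# then start it again pointing --join at the first node.
cockroach quit --insecure --host=128.199.127.164:26257
rm -rf /home/ueda/cockroach-data
cockroach start --insecure --advertise-addr=128.199.127.164 --join=163.172.156.111:26257
Alternatively, keep the old directory untouched and pass --store=/path/to/empty-dir to the start command.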
Is hive.exec.parallel broken?
Apparently, there is a reason why hive.exec.parallel is false by default. When I set it to true (as recommended by an answer to my previous question), my process dies with this message: MapReduce Jobs Launched: Job 0: Map: 2 Reduce: 1 Cumulative CPU: 6.43 sec HDFS Read: 556 HDFS Write: 96 SUCCESS Job 1: Map: 1 Reduce: 1 Cumulative CPU: 3.15 sec HDFS Read: 475 HDFS Write: 96 SUCCESS Job 2: Map: 1 Reduce: 1 Cumulative CPU: 3.36 sec HDFS Read: 475 HDFS Write: 96 SUCCESS Job 3: Map: 1 Reduce: 1 Cumulative CPU: 2.19 sec HDFS Read: 475 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 15 seconds 130 msec OK normalized_keyword pixel_id count sum_log events Time taken: 72.419 seconds ... 14.98user 0.62system 1:16.79elapsed 20%CPU (0avgtext+0avgdata 851392maxresident)k 8inputs+2096outputs (0major+83271minor)pagefaults 0swaps text: java.io.EOFException at java.io.DataInputStream.readShort(Unknown Source) at org.apache.hadoop.fs.shell.Display$Text.getInputStream(Display.java:113) at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:81) at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:306) at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:278) at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:260) at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:244) at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190) at org.apache.hadoop.fs.shell.Command.run(Command.java:154) at org.apache.hadoop.fs.FsShell.run(FsShell.java:254) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) at org.apache.hadoop.fs.FsShell.main(FsShell.java:304) No useful data is produced. set hive.exec.parallel.thread.number=2 has no effect (same failure) Suggestions? EDIT: hive --version does not work, but when it starts, it prints Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hive/lib/hive-common-0.10.0-cdh4.4.0.jar!/hive-log4j.properties so, I guess, the version is 0.10.0.
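For what it's worth, the whole comparison boils down to running the same statement with the flag off and on; the table and columns below are placeholders for the real aggregation:
# Placeholder query; substitute the real one that produces normalized_keyword/pixel_id/etc.
Q="SELECT normalized_keyword, pixel_id, count(*) FROM my_table GROUP BY normalized_keyword, pixel_id"
hive -e "set hive.exec.parallel=false; ${Q};"
hive -e "set hive.exec.parallel=true; set hive.exec.parallel.thread.number=2; ${Q};"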
Cassandra read latency high even with row caching, why?
I am testing cassandra performance with a simple model. CREATE TABLE "NoCache" ( key ascii, column1 ascii, value ascii, PRIMARY KEY (key, column1) ) WITH COMPACT STORAGE AND bloom_filter_fp_chance=0.010000 AND caching='ALL' AND comment='' AND dclocal_read_repair_chance=0.000000 AND gc_grace_seconds=864000 AND read_repair_chance=0.100000 AND replicate_on_write='true' AND populate_io_cache_on_flush='false' AND compaction={'class': 'SizeTieredCompactionStrategy'} AND compression={'sstable_compression': 'SnappyCompressor'}; I am fetching 100 columns of a row key using pycassa, get/xget function (). but getting read latency about 15ms in the server. colums=COL_FAM.get(row_key, column_count=100) nodetool cfstats Column Family: NoCache SSTable count: 1 Space used (live): 103756053 Space used (total): 103756053 Number of Keys (estimate): 128 Memtable Columns Count: 0 Memtable Data Size: 0 Memtable Switch Count: 0 Read Count: 20 Read Latency: 15.717 ms. Write Count: 0 Write Latency: NaN ms. Pending Tasks: 0 Bloom Filter False Positives: 0 Bloom Filter False Ratio: 0.00000 Bloom Filter Space Used: 976 Compacted row minimum size: 4769 Compacted row maximum size: 557074610 Compacted row mean size: 87979499 Latency of this type is amazing! When nodetool info shows that read hits directly in the row cache. Row Cache : size 4834713 (bytes), capacity 67108864 (bytes), 35 hits, 38 requests, 1.000 recent hit rate, 0 save period in seconds Can anyone tell me why is cassandra taking so much time while reading from row cache?
Enable tracing and see what it's doing. http://www.datastax.com/dev/blog/tracing-in-cassandra-1-2
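A minimal way to capture such a trace, assuming your cqlsh supports the TRACING command and piped input (the keyspace and row key below are placeholders):
# TRACING ON prints a per-step breakdown (cache hit, sstable reads, network) after each query.
cqlsh <<'EOF'
TRACING ON;
SELECT * FROM my_keyspace."NoCache" WHERE key = 'some_row_key' LIMIT 100;
TRACING OFF;
EOF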
Caching not Working in Cassandra
I dont seem to have any caching enabled when checking in Opscenter or cfstats. Im running Cassandra 1.1.7 with Solandra on Debian. I have set the required global options in cassandra.yaml: key_cache_size_in_mb: 800 key_cache_save_period: 14400 row_cache_size_in_mb: 800 row_cache_save_period: 15400 row_cache_provider: SerializingCacheProvider Column Families were created as follows: create column family example with column_type = 'Standard' and comparator = 'BytesType' and default_validation_class = 'BytesType' and key_validation_class = 'BytesType' and read_repair_chance = 1.0 and dclocal_read_repair_chance = 0.0 and gc_grace = 864000 and min_compaction_threshold = 4 and max_compaction_threshold = 32 and replicate_on_write = true and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' and caching = 'ALL'; Opscenter shows no data available on caching graphs and CFSTATS doesn't show any cache related fields: Column Family: charsets SSTable count: 1 Space used (live): 5558 Space used (total): 5558 Number of Keys (estimate): 128 Memtable Columns Count: 0 Memtable Data Size: 0 Memtable Switch Count: 0 Read Count: 61381 Read Latency: 0.123 ms. Write Count: 0 Write Latency: NaN ms. Pending Tasks: 0 Bloom Filter False Postives: 0 Bloom Filter False Ratio: 0.00000 Bloom Filter Space Used: 16 Compacted row minimum size: 1917 Compacted row maximum size: 2299 Compacted row mean size: 2299 Any help or suggestions are appreciated. Sam
The caching stats have been moved from cfstats to info in Cassandra 1.1. If you run nodetool info you should see something like:
Key Cache : size 5552 (bytes), capacity 838860800 (bytes), 38 hits, 47 requests, 0.809 recent hit rate, 14400 save period in seconds
Row Cache : size 0 (bytes), capacity 838860800 (bytes), 0 hits, 0 requests, NaN recent hit rate, 15400 save period in seconds
This is because there are now global caches, rather than per-CF. It seems that Opscenter needs updating for this change - maybe there is a later version available that will work.
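A quick way to check after restarting with the new cassandra.yaml values (the host is a placeholder):
# The global key/row cache counters now live in `nodetool info`.
nodetool -h 127.0.0.1 info | grep -i cache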
Need help analysing the VarnishStat results
I am a newbie with Varnish. I have successfully installed it and now its working, but I need some guidance from the more knowledgeable people about how the server is performing. I read this article - http://kristianlyng.wordpress.com/2009/12/08/varnishstat-for-dummies/ but I am still not sure howz the server performance. The server has been running since last 9 hours. I understand that more content will be cached with time so cache hit ratio will better, but right now my concern is about intermediate help from your side on server performance. Hitrate ratio: 10 100 613 Hitrate avg: 0.2703 0.3429 0.4513 239479 8.00 7.99 client_conn - Client connections accepted 541129 13.00 18.06 client_req - Client requests received 157594 1.00 5.26 cache_hit - Cache hits 3 0.00 0.00 cache_hitpass - Cache hits for pass 313499 9.00 10.46 cache_miss - Cache misses 67377 4.00 2.25 backend_conn - Backend conn. success 316739 7.00 10.57 backend_reuse - Backend conn. reuses 910 0.00 0.03 backend_toolate - Backend conn. was closed 317652 8.00 10.60 backend_recycle - Backend conn. recycles 584 0.00 0.02 backend_retry - Backend conn. retry 3 0.00 0.00 fetch_head - Fetch head 314040 9.00 10.48 fetch_length - Fetch with Length 4139 0.00 0.14 fetch_chunked - Fetch chunked 5 0.00 0.00 fetch_close - Fetch wanted close 386 . . n_sess_mem - N struct sess_mem 55 . . n_sess - N struct sess 313452 . . n_object - N struct object 313479 . . n_objectcore - N struct objectcore 38474 . . n_objecthead - N struct objecthead 368 . . n_waitinglist - N struct waitinglist 12 . . n_vbc - N struct vbc 61 . . n_wrk - N worker threads 344 0.00 0.01 n_wrk_create - N worker threads created 2935 0.00 0.10 n_wrk_queued - N queued work requests 1 . . n_backend - N backends 47 . . n_expired - N expired objects 149425 . . n_lru_moved - N LRU moved objects 1 0.00 0.00 losthdr - HTTP header overflows 461727 10.00 15.41 n_objwrite - Objects sent with write 239468 8.00 7.99 s_sess - Total Sessions 541129 13.00 18.06 s_req - Total Requests 64678 3.00 2.16 s_pipe - Total pipe 5346 0.00 0.18 s_pass - Total pass 318187 9.00 10.62 s_fetch - Total fetch 193589421 3895.84 6459.66 s_hdrbytes - Total header bytes 4931971067 14137.41 164569.09 s_bodybytes - Total body bytes 117585 3.00 3.92 sess_closed - Session Closed 2283 0.00 0.08 sess_pipeline - Session Pipeline 892 0.00 0.03 sess_readahead - Session Read Ahead 458468 10.00 15.30 sess_linger - Session Linger 414010 9.00 13.81 sess_herd - Session herd 36912073 880.96 1231.68 shm_records - SHM records
What VCL are you using? If the answer is 'none' then you are probably not getting a very good hitrate. On a fresh install, Varnish is quite conservative about what it caches (and rightly so), but you can probably improve matters by reading how to achieve a high hitrate. If it's safe to, you can selectively unset cookies and normalise requests with your VCL, which will result in fewer backend calls. How much of your website is cacheable? Is your object cache big enough? If you can answer those two questions, you ought to be able to achieve a great hitrate with Varnish.
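As a rough way to watch whether VCL changes are helping, the relevant counters can be pulled from varnishstat; the counter names below are the Varnish 3.x ones, so adjust for your version:
# Hit/miss volume plus eviction pressure; a steadily growing n_lru_nuked suggests the cache is too small.
varnishstat -1 | egrep 'cache_hit|cache_miss|n_lru_nuked|n_object'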