Using ExtractText for getting log data in nifi - apache-nifi

I have retrieved custom log data in tailFail and then split the data(line by line). Now I want to get useful data from nifi-api.log.
I used this expression like this:
^(.*)$
but processor make flowfiele unmatched.
1. how should i replace my expression?

It depends on what information you are looking for in the log messages. The expression you've posted simply matches the entire content.
Let's say you have the following log output and want to collect the times for flowfile repository checkpointing to do analysis:
2017-08-25 10:36:31,942 INFO [pool-10-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 0 records in 229 milliseconds
2017-08-25 10:36:35,571 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#6527aa0 checkpointed with 0 Records and 0 Swap Files in 14 milliseconds (Stop-the-world time = 4 milliseconds, Clear Edit Logs time = 7 millis), max Transaction ID -1
2017-08-25 10:38:31,942 INFO [pool-10-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2017-08-25 10:38:32,162 INFO [pool-10-thread-1] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#6cca70e3 checkpointed with 0 Records and 0 Swap Files in 218 milliseconds (Stop-the-world time = 92 milliseconds, Clear Edit Logs time = 98 millis), max Transaction ID -1
2017-08-25 10:38:32,162 INFO [pool-10-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 0 records in 218 milliseconds
2017-08-25 10:38:35,584 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#6527aa0 checkpointed with 0 Records and 0 Swap Files in 13 milliseconds (Stop-the-world time = 6 milliseconds, Clear Edit Logs time = 4 millis), max Transaction ID -1
2017-08-25 10:40:32,161 INFO [pool-10-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2017-08-25 10:40:32,341 INFO [pool-10-thread-1] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#6cca70e3 checkpointed with 0 Records and 0 Swap Files in 177 milliseconds (Stop-the-world time = 71 milliseconds, Clear Edit Logs time = 87 millis), max Transaction ID -1
2017-08-25 10:40:32,341 INFO [pool-10-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 0 records in 178 milliseconds
2017-08-25 10:40:35,592 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#6527aa0 checkpointed with 0 Records and 0 Swap Files in 11 milliseconds (Stop-the-world time = 5 milliseconds, Clear Edit Logs time = 4 millis), max Transaction ID -1
Using an expression like ^[\d\-\s\:,]+\s(INFO|WARN|ERROR).*(\d+) milliseconds would allow you to filter those messages and with your capture groups, understand the severity of the message and the timing.

You can use following regex in extractText processor for extracts value.
regex:(.*)
Then use RouteOnAttribute to check that log to be ERROR/WARN/INFO by below expressions.
INFO:${regex:toLower():contains('info')}
ERROR:${regex:toLower():contains('error')}
WARN:${regex:toLower():contains('warn')}
Now route your flowfiles as per attributes and then do whatever you wants.
Hope this helpful for you

Related

When I use flink to write hbase,getting hbase error in region server

Soft version as follows:
apache hbase 2.1.6
apache flink 1.13.6
apache hadoop 3.1.1
When I use the hbase-client api to access hbase, I get the following error:
Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=16, exceptions:
Wed Sep 28 03:03:11 UTC 2022, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68532: java.io.IOException: Invalid currTagsLen -32239. Block offset: 1319713, block length: 99991, position: 42422 (without header). path=hdfs://cthbaseclusterpro01/apps/hbase/data/data/default/expose/cd083a4a1ef04baff94ebb5aabdb8cb8/i/1f6dd8a1bc054eefbc9faa1bf625e24f
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:472)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
Caused by: java.lang.IllegalStateException: Invalid currTagsLen -32239. Block offset: 1319713, block length: 99991, position: 42422 (without header). path=hdfs://cthbaseclusterpro01/apps/hbase/data/data/default/expose/cd083a4a1ef04baff94ebb5aabdb8cb8/i/1f6dd8a1bc054eefbc9faa1bf625e24f
at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.checkTagsLen(HFileReaderImpl.java:642)
at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readKeyValueLen(HFileReaderImpl.java:630)
at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl._next(HFileReaderImpl.java:1080)
at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.next(HFileReaderImpl.java:1097)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:208)
at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:120)
at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:653)
at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:153)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:6581)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6745)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6518)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3155)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3404)
at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42190)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
... 3 more
The exception for hbase regionserver is as follows:
2022-09-28 11:19:36,019 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system started
2022-09-28 11:20:20,946 INFO [MemStoreFlusher.0] regionserver.HRegion: Flushing 1/1 column families, dataSize=1.95 MB heapSize=2.09 MB
2022-09-28 11:20:20,969 INFO [MemStoreFlusher.0] regionserver.DefaultStoreFlusher: Flushed memstore data size=1.95 MB at sequenceid=8934625 (bloomFilter=true), to=hdfs://cthbaseclusterpro01/apps/hbase/data/data/default/expose/e63ee2269b0b076a415c5f76d5468
55f/.tmp/i/2629dbae7d5e402489ef56b1c097289f
2022-09-28 11:20:20,977 INFO [MemStoreFlusher.0] regionserver.HStore: Added hdfs://cthbaseclusterpro01/apps/hbase/data/data/default/expose/e63ee2269b0b076a415c5f76d546855f/i/2629dbae7d5e402489ef56b1c097289f, entries=1212, sequenceid=8934625, filesize=359.
1 K
2022-09-28 11:20:20,978 INFO [MemStoreFlusher.0] regionserver.HRegion: Finished flush of dataSize ~1.95 MB/2041026, heapSize ~2.09 MB/2190200, currentSize=0 B/0 for e63ee2269b0b076a415c5f76d546855f in 32ms, sequenceid=8934625, compaction requested=true
2022-09-28 11:20:20,986 INFO [regionserver/bghbaseclusterdn9528:16020-shortCompactions-1664173471436] regionserver.HRegion: Starting compaction of i in expose,9ffffff6,1663741391432.e63ee2269b0b076a415c5f76d546855f.
2022-09-28 11:20:20,986 INFO [regionserver/bghbaseclusterdn9528:16020-shortCompactions-1664173471436] regionserver.HStore: Starting compaction of [hdfs://cthbaseclusterpro01/apps/hbase/data/data/default/expose/e63ee2269b0b076a415c5f76d546855f/i/98d0ecd1ed
7744a8a5f94923c382861e, hdfs://cthbaseclusterpro01/apps/hbase/data/data/default/expose/e63ee2269b0b076a415c5f76d546855f/i/30bab1682dba4721b25e58b78dd17255, hdfs://cthbaseclusterpro01/apps/hbase/data/data/default/expose/e63ee2269b0b076a415c5f76d546855f/i/f8
0c2f08176e417a9184f434d4300935, hdfs://cthbaseclusterpro01/apps/hbase/data/data/default/expose/e63ee2269b0b076a415c5f76d546855f/i/52baca576c154c26b7df3b5d126d47b8, hdfs://cthbaseclusterpro01/apps/hbase/data/data/default/expose/e63ee2269b0b076a415c5f76d5468
55f/i/7d8291d422d042de9aa43aa5b79da6ad, hdfs://cthbaseclusterpro01/apps/hbase/data/data/default/expose/e63ee2269b0b076a415c5f76d546855f/i/8bf3b47909ab4eeb86d8a5c283cfe942, hdfs://cthbaseclusterpro01/apps/hbase/data/data/default/expose/e63ee2269b0b076a415c5
f76d546855f/i/0663d48a4ed94dbe9fdc78f6649c1eb3, hdfs://cthbaseclusterpro01/apps/hbase/data/data/default/expose/e63ee2269b0b076a415c5f76d546855f/i/b80b55d744174bc882db93283cd70c71] into tmpdir=hdfs://cthbaseclusterpro01/apps/hbase/data/data/default/expose/e
63ee2269b0b076a415c5f76d546855f/.tmp, totalSize=18.9 M
2022-09-28 11:20:21,153 INFO [regionserver/bghbaseclusterdn9528:16020-shortCompactions-1664173471436] throttle.PressureAwareThroughputController: e63ee2269b0b076a415c5f76d546855f#i#compaction#637 average throughput is 122.45 MB/second, slept 0 time(s) and
total slept time is 0 ms. 0 active operations remaining, total limit is 61.86 MB/second
2022-09-28 11:20:21,159 ERROR [regionserver/bghbaseclusterdn9528:16020-shortCompactions-1664173471436] regionserver.CompactSplit: Compaction failed region=expose,9ffffff6,1663741391432.e63ee2269b0b076a415c5f76d546855f., storeName=i, priority=73, startTime=
1664335220978
java.lang.IllegalStateException: Invalid currTagsLen -9. Block offset: 1677972, block length: 161891, position: 48652 (without header). path=hdfs://cthbaseclusterpro01/apps/hbase/data/data/default/expose/e63ee2269b0b076a415c5f76d546855f/i/b80b55d744174bc88
2db93283cd70c71
at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.checkTagsLen(HFileReaderImpl.java:642)
at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readKeyValueLen(HFileReaderImpl.java:630)
at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl._next(HFileReaderImpl.java:1080)
at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.next(HFileReaderImpl.java:1097)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:208)
at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:120)
at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:653)
at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:388)
at org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:327)
at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:65)
at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:126)
at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1410)
at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2187)
at org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.doCompaction(CompactSplit.java:596)
at org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.run(CompactSplit.java:638)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2022-09-28 11:20:25,000 INFO [RpcServer.default.FPBQ.Fifo.handler=18,queue=3,port=16020] regionserver.HRegion: writing data to region expose,9ffffff6,1663741391432.e63ee2269b0b076a415c5f76d546855f. with WAL disabled. Data may be lost in the event of a cra
sh.
2022-09-28 11:24:01,565 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=1.08 GB, freeSize=2.52 GB, max=3.60 GB, blockCount=17155, accesses=133155383, hits=132992986, hitRatio=99.88%, , cachingAccesses=132985682, cachingHits=132951576, cac
hingHitsRatio=99.97%, evictions=16199, evicted=0, evictedPerRun=0.0
2022-09-28 11:24:01,569 INFO [MobFileCache #0] mob.MobFileCache: MobFileCache Statistics, access: 0, miss: 0, hit: 0, hit ratio: 0%, evicted files: 0
2022-09-28 11:24:05,246 INFO [regionserver/bghbaseclusterdn9528:16020.logRoller] wal.AbstractFSWAL: Rolled WAL /apps/hbase/data/WALs/bghbaseclusterdn9528,16020,1664173440239/bghbaseclusterdn9528%2C16020%2C1664173440239.1664331845190 with entries=21, files
ize=5.39 KB; new WAL /apps/hbase/data/WALs/bghbaseclusterdn9528,16020,1664173440239/bghbaseclusterdn9528%2C16020%2C1664173440239.1664335445235
I found some solutions in code. such as HBASE-21507、HBASE-24515、HBASE-21775

Kafka consumer does not fetch new records when using topic pattern and large messages

I hope someone of you can help me.
I'm using spring boot 2.3.4 with spring kafka 2.5.6. I recently had to reset an offset and saw some strange behavior. We consumed the messages, but after every X (variating) messages we had a timeout of 10 seconds before the consumption continued.
This is my configuration:
spring:
kafka:
bootstrap-servers: localhost:9092
consumer:
enable-auto-commit: false
auto-offset-reset: earliest
heartbeat-interval: 1000
max-poll-records: 50
group-id: kafka-fetch-demo
fetch-max-wait: 10000
listener:
type: single
concurrency: 1
poll-timeout: 1000
no-poll-threshold: 2
monitor-interval: 10
ack-mode: manual
producer:
acks: all
batch-size: 0
retries: 0
This is an examle listener code:
#KafkaListener(id = LISTENER_ID, idIsGroup = false, topicPattern = "#{demoProperties.getTopicPattern()}")
public void onEvent(Acknowledgment acknowledgment, ConsumerRecord<byte[], String> record) {
log.info("Received record on topic {}, partition {} and offset {}",
record.topic(),
record.partition(),
record.offset());
acknowledgment.acknowledge();
}
Analysis
I figured out that the 10 second timeout came from the fetch.max.wait.ms property. However I'm not able to figure out why this property applies.
As far as I understand the fetch-max-wait property only determines the maximum time the broker waits before providing the consumer with new records even if the fetch.min.bytes is not exceeded. (Which in my case is set to the default 1 and should always be fullfilled)
Furthermore I analyzed that this problem only applies when using topic patterns and "larger" messages.
Reproduction
I uploaded an demo application on Github to reproduce the issue: https://github.com/kraennix/kafka-fetch-demo.
How I did reproduce it:
I put a thousand messages with 17,1 KB per message on a kafka topic.
I start my consuming application that listens per topic pattern to this topic. Then you can see this stopping behaviour.
Note: If I do the same with "small" messages (89 Bytes) it works as expected.
Logs
In the logs you can see the successful commit, but then the it says Skipping fetch
2021-01-16 15:04:40.773 DEBUG 19244 --- [_LISTENER-0-C-1] essageListenerContainer$ListenerConsumer : Commit list: {publish.LargeTopic.2.test-0=OffsetAndMetadata{offset=488, leaderEpoch=null, metadata=''}}
2021-01-16 15:04:40.773 DEBUG 19244 --- [_LISTENER-0-C-1] essageListenerContainer$ListenerConsumer : Committing: {publish.LargeTopic.2.test-0=OffsetAndMetadata{offset=488, leaderEpoch=null, metadata=''}}
2021-01-16 15:04:40.773 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Sending OffsetCommit request with {publish.LargeTopic.2.test-0=OffsetAndMetadata{offset=488, leaderEpoch=null, metadata=''}} to coordinator localhost:9092 (id: 2147483647 rack: null)
2021-01-16 15:04:40.773 DEBUG 19244 --- [_LISTENER-0-C-1] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Using older server API v7 to send OFFSET_COMMIT {group_id=kafka-fetch-demo,generation_id=4,member_id=consumer-kafka-fetch-demo-1-cf8e747f-531d-457a-aca8-18960c518ef9,group_instance_id=null,topics=[{name=publish.LargeTopic.2.test,partitions=[{partition_index=0,committed_offset=488,committed_leader_epoch=-1,committed_metadata=}]}]} with correlation id 62 to node 2147483647
2021-01-16 15:04:40.778 TRACE 19244 --- [_LISTENER-0-C-1] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Completed receive from node 2147483647 for OFFSET_COMMIT with correlation id 62, received {throttle_time_ms=0,topics=[{name=publish.LargeTopic.2.test,partitions=[{partition_index=0,error_code=0}]}]}
2021-01-16 15:04:40.779 DEBUG 19244 --- [_LISTENER-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Committed offset 488 for partition publish.LargeTopic.2.test-0
2021-01-16 15:04:40.779 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Skipping fetch for partition publish.LargeTopic.1.test-0 because previous request to localhost:9092 (id: 0 rack: null) has not been processed
2021-01-16 15:04:40.779 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Skipping fetch for partition publish.LargeTopic.2.test-0 because previous request to localhost:9092 (id: 0 rack: null) has not been processed
2021-01-16 15:04:40.779 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Skipping fetch for partition publish.LargeTopic.1.test-0 because previous request to localhost:9092 (id: 0 rack: null) has not been processed
2021-01-16 15:04:40.779 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Skipping fetch for partition publish.LargeTopic.2.test-0 because previous request to localhost:9092 (id: 0 rack: null) has not been processed
When there is a change in the Size of the message, you might need to change below 2 Props
heartbeat-interval: 1000
max-poll-records: 50
Your heart beat interval is 1sec and Max poll wait is 10secs. If the size of the message is high and you are processing the consumed messages in the same thread, then Heartbeat check will fail by the time the next Pull triggered. Make sure to process messages by an Executor using Callable.
Increase the Heart Beat Interval to 5 to 10 secs and Reduce Max Poll records to 15 when the messages size is high. Hope, this can help

MiNiFi - NiFi Connection Failure: Unknown Host Exception : Able to telnet host from the machine where MiNiFi is running

I am running MiNiFi in a Linux Box (gateway server) which is behind my company's firewall. My NiFi is running on an AWS EC2 cluster (running in standalone mode).
I am trying to send data from the Gateway to NiFi running in AWS EC2.
From gateway, I am able to telnet to EC2 node with the public DNS and the remote port which I have configured in the nifi.properties file
nifi.properties
# Site to Site properties
nifi.remote.input.host=ec2-xxx.us-east-2.compute.amazonaws.com
nifi.remote.input.secure=false
nifi.remote.input.socket.port=1026
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec
nifi.remote.contents.cache.expiration=30 secs
Telnet connection from Gateway to NiFi
iot1#iothdp02:~/minifi/minifi-0.5.0/conf$ telnet ec2-xxx.us-east-2.compute.amazonaws.com 1026
Trying xx.xx.xx.xxx...
Connected to ec2-xxx.us-east-2.compute.amazonaws.com.
Escape character is '^]'.
The Public DNS is resolving to the correct Public IP of the EC2 node.
From the EC2 node, when I do nslookup on the Public DNS, it gives back the private IP.
From AWS Documentation: "The public IP address is mapped to the primary private IP address through network address translation (NAT). "
Hence, I am not adding the Public DNS and the Public IP in /etc/host file in the EC2 node.
From MiNiFi side, I am getting the below error:
minifi-app.log
iot1#iothdp02:~/minifi/minifi-0.5.0/logs$ cat minifi-app.log
2018-11-14 16:00:47,910 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2018-11-14 16:00:47,911 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 0 records in 0 milliseconds
2018-11-14 16:01:02,334 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#67207d8a checkpointed with 0 Records and 0 Swap Files in 20 milliseconds (Stop-the-world time = 6 milliseconds, Clear Edit Logs time = 4 millis), max Transaction ID -1
2018-11-14 16:02:47,911 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2018-11-14 16:02:47,912 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 0 records in 0 milliseconds
2018-11-14 16:03:02,354 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#67207d8a checkpointed with 0 Records and 0 Swap Files in 18 milliseconds (Stop-the-world time = 3 milliseconds, Clear Edit Logs time = 5 millis), max Transaction ID -1
2018-11-14 16:03:10,636 WARN [Timer-Driven Process Thread-8] o.a.n.r.util.SiteToSiteRestApiClient Failed to get controller from http://ec2-xxx.us-east-2.compute.amazonaws.com:9090/nifi-api due to java.net.UnknownHostException: ec2-xxx.us-east-2.compute.amazonaws.com: unknown error
2018-11-14 16:03:10,636 WARN [Timer-Driven Process Thread-8] o.apache.nifi.controller.FlowController Unable to communicate with remote instance RemoteProcessGroup[http://ec2-xxx.us-east-2.compute.amazonaws.com:9090/nifi] due to org.apache.nifi.controller.exception.CommunicationsException: org.apache.nifi.controller.exception.CommunicationsException: Unable to communicate with Remote NiFi at URI http://ec2-xxx.us-east-2.compute.amazonaws.com:9090/nifi due to: ec2-xxx.us-east-2.compute.amazonaws.com: unknown error
2018-11-14 16:04:47,912 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2018-11-14 16:04:47,912 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 0 records in 0 milliseconds
2018-11-14 16:05:02,380 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#67207d8a checkpointed with 0 Records and 0 Swap Files in 25 milliseconds (Stop-the-world time = 8 milliseconds, Clear Edit Logs time = 6 millis), max Transaction ID -1
2018-11-14 16:06:47,912 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2018-11-14 16:06:47,912 INFO [pool-31-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 0 records in 0 milliseconds
2018-11-14 16:07:02,399 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#67207d8a checkpointed with
MiNiFi config.yml
MiNiFi Config Version: 3
Flow Controller:
name: Gateway-IDS_v0.1
comment: "1. ConsumeMQTT - MiNiFi will consume mqtt messages in gateway\n2. Remote\
\ Process Group will send messages to NiFi "
Core Properties:
flow controller graceful shutdown period: 10 sec
flow service write delay interval: 500 ms
administrative yield duration: 30 sec
bored yield duration: 10 millis
max concurrent threads: 1
variable registry properties: ''
FlowFile Repository:
partitions: 256
checkpoint interval: 2 mins
always sync: false
Swap:
threshold: 20000
in period: 5 sec
in threads: 1
out period: 5 sec
out threads: 4
Content Repository:
content claim max appendable size: 10 MB
content claim max flow files: 100
always sync: false
Provenance Repository:
provenance rollover time: 1 min
implementation: org.apache.nifi.provenance.MiNiFiPersistentProvenanceRepository
Component Status Repository:
buffer size: 1440
snapshot frequency: 1 min
Security Properties:
keystore: ''
keystore type: ''
keystore password: ''
key password: ''
truststore: ''
truststore type: ''
truststore password: ''
ssl protocol: ''
Sensitive Props:
key:
algorithm: PBEWITHMD5AND256BITAES-CBC-OPENSSL
provider: BC
Processors:
- id: 6396f40f-118f-33f4-0000-000000000000
name: ConsumeMQTT
class: org.apache.nifi.processors.mqtt.ConsumeMQTT
max concurrent tasks: 1
scheduling strategy: TIMER_DRIVEN
scheduling period: 0 sec
penalization period: 30 sec
yield period: 1 sec
run duration nanos: 0
auto-terminated relationships list: []
Properties:
Broker URI: tcp://localhost:1883
Client ID: nifi
Connection Timeout (seconds): '30'
Keep Alive Interval (seconds): '60'
Last Will Message:
Last Will QoS Level:
Last Will Retain:
Last Will Topic:
MQTT Specification Version: '0'
Max Queue Size: '10'
Password:
Quality of Service(QoS): '0'
SSL Context Service:
Session state: 'true'
Topic Filter: MQTT
Username:
Controller Services: []
Process Groups: []
Input Ports: []
Output Ports: []
Funnels: []
Connections:
- id: f0007aa3-cf32-3593-0000-000000000000
name: ConsumeMQTT/Message/85ebf198-0166-1000-5592-476a7ba47d2e
source id: 6396f40f-118f-33f4-0000-000000000000
source relationship names:
- Message
destination id: 85ebf198-0166-1000-5592-476a7ba47d2e
max work queue size: 10000
max work queue data size: 1 GB
flowfile expiration: 0 sec
queue prioritizer class: ''
Remote Process Groups:
- id: c00d3132-375b-323f-0000-000000000000
name: ''
url: http://ec2-xxx.us-east-2.compute.amazonaws.com:9090
comment: ''
timeout: 30 sec
yield period: 10 sec
transport protocol: RAW
proxy host: ''
proxy port: ''
proxy user: ''
proxy password: ''
local network interface: ''
Input Ports:
- id: 85ebf198-0166-1000-5592-476a7ba47d2e
name: From MiNiFi
comment: ''
max concurrent tasks: 1
use compression: false
Properties:
Port: 1026
Host Name: ec2-xxx.us-east-2.compute.amazonaws.com
Output Ports: []
NiFi Properties Overrides: {}
Any pointers on how to troubleshoot this issue?
In MiNiFi config.yml, I changed the URL under Remote Process Groups from http://ec2-xxx.us-east-2.compute.amazonaws.com:9090 to http://ec2-xxx.us-east-2.compute.amazonaws.com:9090/nifi.

Use PIG to count the number of records in an avro file

I can open a avro file in HUE and HUE shows me it has 10 records. i can browse through all the 10 records in HUE.
Now I write the following code in PIG
data = LOAD '/user/admin/2015/10/04/02/file1.avro' USING AvroStorage();
data_group = GROUP data ALL;
row_count = FOREACH data_group GENERATE COUNT(data);
dump row_count;
The output of the job is
Input(s):
Successfully read 4 records (58507 bytes) from: "/user/admin/2015/10/04/02/file1.avro"
Output(s):
Successfully stored 1 records (6 bytes) in: "hdfs://nn1/tmp/temp-268177355/tmp915757783"
Counters:
Total records written : 1
Total bytes written : 6
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1438959478020_940907
2015-10-29 19:08:55,252 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-10-29 19:08:55,252 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-10-29 19:08:55,253 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2015-10-29 19:08:55,261 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-10-29 19:08:55,261 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(4)
How did 10 become 4. Is there a different way to count the number of records in an avro file using PIG?

Hive not enforcing bucketing

I am going through the Hive tutorial in the O'Reilly Hadoop book by Tom White. I am trying to make a bucketed table, but I can't get Hive to create the buckets. I can create the table and load the data into it, but all of the data is then stored in one file.
I am running a pseudo-distributed Hadoop cluster. I'm using Hadoop 1.2.1 and Hive 0.10.0 with a MySql metastore.
The data (shown below) are initially in the table 'users'. They are to be put in a table with 4 buckets, i.e. one user per bucket.
select * from users;
OK
id name
0 Nat
2 Joe
3 Kay
4 Ann
I set the properties below in an attempt to enforce bucketing (I don't think that setting mapred.reduce.tasks explicitly is necessary, but I included it just in case).
set hive.enforce.bucketing=true;
set mapred.reduce.tasks=4;
Then I create the table 'bucketed_users' and load the data into it.
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id)
SORTED BY (id ASC) INTO 4 BUCKETS;
INSERT OVERWRITE TABLE bucketed_users SELECT * FROM users;
The output:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/katrina/katrina_20131003204949_a56048f5-ab2f-421b-af45-9ec3ff85731c.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2013-10-03 20:49:34,011 null map = 0%, reduce = 0%
2013-10-03 20:49:35,026 null map = 0%, reduce = 100%
Ended Job = job_local1250355097_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Loading data to table records.bucketed_users
Deleted hdfs://localhost/user/hive/warehouse/records/bucketed_users
Table records.bucketed_users stats: [num_partitions: 0, num_files: 1, num_rows: 4, total_size: 24, raw_data_size: 20]
OK
id name
Time taken: 8.527 seconds
The data have been loaded into 'bucketed_users' correctly (SELECT * FROM bucketed_users shows all users) but the number of files created is just 1 (num_files: 1 above) rather than the desired 4. Looking at the bucketed_users directory in HDFS (dfs -ls /user/hive/warehouse/records/bucketed_users;) shows just one file, 000000_0. How can I enforce bucketing?
The full log is below:
2013-10-03 20:49:30,769 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - Execution log at: /tmp/katrina/katrina_20131003204949_a56048f5-ab2f-421b-af45-9ec3ff85731c.log
2013-10-03 20:49:31,139 INFO exec.ExecDriver (ExecDriver.java:execute(328)) - Using org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
2013-10-03 20:49:31,144 INFO exec.ExecDriver (ExecDriver.java:execute(350)) - adding libjars: file:///Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar
2013-10-03 20:49:31,144 INFO exec.ExecDriver (ExecDriver.java:addInputPaths(852)) - Processing alias users
2013-10-03 20:49:31,145 INFO exec.ExecDriver (ExecDriver.java:addInputPaths(870)) - Adding input file hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:31,145 INFO exec.Utilities (Utilities.java:isEmptyPath(1900)) - Content Summary not cached for hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:31,365 WARN util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(52)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-10-03 20:49:32,410 INFO exec.ExecDriver (ExecDriver.java:createTmpDirs(219)) - Making Temp Directory: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/-ext-10000
2013-10-03 20:49:32,420 WARN mapred.JobClient (JobClient.java:copyAndConfigureFiles(746)) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-03 20:49:32,648 WARN snappy.LoadSnappy (LoadSnappy.java:<clinit>(46)) - Snappy native library not loaded
2013-10-03 20:49:32,655 INFO io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(370)) - CombineHiveInputSplit creating pool for hdfs://localhost/user/hive/warehouse/records/users; using filter path hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:32,661 INFO mapred.FileInputFormat (FileInputFormat.java:listStatus(199)) - Total input paths to process : 1
2013-10-03 20:49:32,716 INFO io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(411)) - number of splits 1
2013-10-03 20:49:32,847 INFO filecache.TrackerDistributedCacheManager (TrackerDistributedCacheManager.java:downloadCacheObject(423)) - Creating hive-builtins-0.10.0.jar in /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar-work--7485859847513724632 with rwxr-xr-x
2013-10-03 20:49:32,850 INFO filecache.TrackerDistributedCacheManager (TrackerDistributedCacheManager.java:downloadCacheObject(435)) - Extracting /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar-work--7485859847513724632/hive-builtins-0.10.0.jar to /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar-work--7485859847513724632
2013-10-03 20:49:32,870 INFO filecache.TrackerDistributedCacheManager (TrackerDistributedCacheManager.java:downloadCacheObject(463)) - Cached file:///Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar as /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar
2013-10-03 20:49:32,880 INFO filecache.TrackerDistributedCacheManager (TrackerDistributedCacheManager.java:localizePublicCacheObject(486)) - Cached file:///Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar as /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar
2013-10-03 20:49:32,987 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - Job running in-process (local Hadoop)
2013-10-03 20:49:33,034 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(340)) - Waiting for map tasks
2013-10-03 20:49:33,037 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(204)) - Starting task: attempt_local1250355097_0001_m_000000_0
2013-10-03 20:49:33,073 INFO mapred.Task (Task.java:initialize(534)) - Using ResourceCalculatorPlugin : null
2013-10-03 20:49:33,077 INFO mapred.MapTask (MapTask.java:updateJobWithSplit(455)) - Processing split: Paths:/user/hive/warehouse/records/users/users.txt:0+24InputFormatClass: org.apache.hadoop.mapred.TextInputFormat
2013-10-03 20:49:33,093 INFO io.HiveContextAwareRecordReader (HiveContextAwareRecordReader.java:initIOContext(154)) - Processing file hdfs://localhost/user/hive/warehouse/records/users/users.txt
2013-10-03 20:49:33,093 INFO mapred.MapTask (MapTask.java:runOldMapper(419)) - numReduceTasks: 1
2013-10-03 20:49:33,099 INFO mapred.MapTask (MapTask.java:<init>(949)) - io.sort.mb = 100
2013-10-03 20:49:33,541 INFO mapred.MapTask (MapTask.java:<init>(961)) - data buffer = 79691776/99614720
2013-10-03 20:49:33,542 INFO mapred.MapTask (MapTask.java:<init>(962)) - record buffer = 262144/327680
2013-10-03 20:49:33,550 INFO ExecMapper (ExecMapper.java:configure(69)) - maximum memory = 2088435712
2013-10-03 20:49:33,551 INFO ExecMapper (ExecMapper.java:configure(74)) - conf classpath = [file:/tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar/]
2013-10-03 20:49:33,551 INFO ExecMapper (ExecMapper.java:configure(76)) - thread classpath = [file:/tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar/]
2013-10-03 20:49:33,585 INFO exec.MapOperator (MapOperator.java:setChildren(387)) - Adding alias users to work list for file hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:33,587 INFO exec.MapOperator (MapOperator.java:setChildren(402)) - dump TS struct<id:int,name:string>
2013-10-03 20:49:33,588 INFO ExecMapper (ExecMapper.java:configure(91)) -
<MAP>Id =10
<Children>
<TS>Id =0
<Children>
<SEL>Id =1
<Children>
<RS>Id =2
<Parent>Id = 1 null<\Parent>
<\RS>
<\Children>
<Parent>Id = 0 null<\Parent>
<\SEL>
<\Children>
<Parent>Id = 10 null<\Parent>
<\TS>
<\Children>
<\MAP>
2013-10-03 20:49:33,588 INFO exec.MapOperator (Operator.java:initialize(321)) - Initializing Self 10 MAP
2013-10-03 20:49:33,588 INFO exec.TableScanOperator (Operator.java:initialize(321)) - Initializing Self 0 TS
2013-10-03 20:49:33,588 INFO exec.TableScanOperator (Operator.java:initializeChildren(386)) - Operator 0 TS initialized
2013-10-03 20:49:33,589 INFO exec.TableScanOperator (Operator.java:initializeChildren(390)) - Initializing children of 0 TS
2013-10-03 20:49:33,589 INFO exec.SelectOperator (Operator.java:initialize(425)) - Initializing child 1 SEL
2013-10-03 20:49:33,589 INFO exec.SelectOperator (Operator.java:initialize(321)) - Initializing Self 1 SEL
2013-10-03 20:49:33,592 INFO exec.SelectOperator (SelectOperator.java:initializeOp(58)) - SELECT struct<id:int,name:string>
2013-10-03 20:49:33,594 INFO exec.SelectOperator (Operator.java:initializeChildren(386)) - Operator 1 SEL initialized
2013-10-03 20:49:33,595 INFO exec.SelectOperator (Operator.java:initializeChildren(390)) - Initializing children of 1 SEL
2013-10-03 20:49:33,595 INFO exec.ReduceSinkOperator (Operator.java:initialize(425)) - Initializing child 2 RS
2013-10-03 20:49:33,595 INFO exec.ReduceSinkOperator (Operator.java:initialize(321)) - Initializing Self 2 RS
2013-10-03 20:49:33,595 INFO exec.ReduceSinkOperator (ReduceSinkOperator.java:initializeOp(112)) - Using tag = -1
2013-10-03 20:49:33,606 INFO exec.ReduceSinkOperator (Operator.java:initializeChildren(386)) - Operator 2 RS initialized
2013-10-03 20:49:33,606 INFO exec.ReduceSinkOperator (Operator.java:initialize(361)) - Initialization Done 2 RS
2013-10-03 20:49:33,606 INFO exec.SelectOperator (Operator.java:initialize(361)) - Initialization Done 1 SEL
2013-10-03 20:49:33,606 INFO exec.TableScanOperator (Operator.java:initialize(361)) - Initialization Done 0 TS
2013-10-03 20:49:33,607 INFO exec.MapOperator (Operator.java:initialize(361)) - Initialization Done 10 MAP
2013-10-03 20:49:33,637 INFO exec.MapOperator (MapOperator.java:cleanUpInputFileChangedOp(494)) - Processing alias users for file hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:33,638 INFO exec.MapOperator (Operator.java:forward(774)) - 10 forwarding 1 rows
2013-10-03 20:49:33,638 INFO exec.TableScanOperator (Operator.java:forward(774)) - 0 forwarding 1 rows
2013-10-03 20:49:33,639 INFO exec.SelectOperator (Operator.java:forward(774)) - 1 forwarding 1 rows
2013-10-03 20:49:33,641 INFO ExecMapper (ExecMapper.java:map(148)) - ExecMapper: processing 1 rows: used memory = 114294872
2013-10-03 20:49:33,642 INFO exec.MapOperator (Operator.java:close(549)) - 10 finished. closing...
2013-10-03 20:49:33,643 INFO exec.MapOperator (Operator.java:close(555)) - 10 forwarded 4 rows
2013-10-03 20:49:33,643 INFO exec.MapOperator (Operator.java:logStats(845)) - DESERIALIZE_ERRORS:0
2013-10-03 20:49:33,643 INFO exec.TableScanOperator (Operator.java:close(549)) - 0 finished. closing...
2013-10-03 20:49:33,643 INFO exec.TableScanOperator (Operator.java:close(555)) - 0 forwarded 4 rows
2013-10-03 20:49:33,643 INFO exec.SelectOperator (Operator.java:close(549)) - 1 finished. closing...
2013-10-03 20:49:33,644 INFO exec.SelectOperator (Operator.java:close(555)) - 1 forwarded 4 rows
2013-10-03 20:49:33,644 INFO exec.ReduceSinkOperator (Operator.java:close(549)) - 2 finished. closing...
2013-10-03 20:49:33,644 INFO exec.ReduceSinkOperator (Operator.java:close(555)) - 2 forwarded 0 rows
2013-10-03 20:49:33,644 INFO exec.SelectOperator (Operator.java:close(570)) - 1 Close done
2013-10-03 20:49:33,644 INFO exec.TableScanOperator (Operator.java:close(570)) - 0 Close done
2013-10-03 20:49:33,644 INFO exec.MapOperator (Operator.java:close(570)) - 10 Close done
2013-10-03 20:49:33,645 INFO ExecMapper (ExecMapper.java:close(215)) - ExecMapper: processed 4 rows: used memory = 114767288
2013-10-03 20:49:33,647 INFO mapred.MapTask (MapTask.java:flush(1289)) - Starting flush of map output
2013-10-03 20:49:33,659 INFO mapred.MapTask (MapTask.java:sortAndSpill(1471)) - Finished spill 0
2013-10-03 20:49:33,661 INFO mapred.Task (Task.java:done(858)) - Task:attempt_local1250355097_0001_m_000000_0 is done. And is in the process of commiting
2013-10-03 20:49:33,668 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(466)) - hdfs://localhost/user/hive/warehouse/records/users/users.txt:0+24
2013-10-03 20:49:33,668 INFO mapred.Task (Task.java:sendDone(970)) - Task 'attempt_local1250355097_0001_m_000000_0' done.
2013-10-03 20:49:33,668 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(229)) - Finishing task: attempt_local1250355097_0001_m_000000_0
2013-10-03 20:49:33,668 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(348)) - Map task executor complete.
2013-10-03 20:49:33,680 INFO mapred.Task (Task.java:initialize(534)) - Using ResourceCalculatorPlugin : null
2013-10-03 20:49:33,680 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(466)) -
2013-10-03 20:49:33,690 INFO mapred.Merger (Merger.java:merge(408)) - Merging 1 sorted segments
2013-10-03 20:49:33,695 INFO mapred.Merger (Merger.java:merge(491)) - Down to the last merge-pass, with 1 segments left of total size: 70 bytes
2013-10-03 20:49:33,695 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(466)) -
2013-10-03 20:49:33,697 INFO ExecReducer (ExecReducer.java:configure(100)) - maximum memory = 2088435712
2013-10-03 20:49:33,697 INFO ExecReducer (ExecReducer.java:configure(105)) - conf classpath = [file:/tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar/]
2013-10-03 20:49:33,697 INFO ExecReducer (ExecReducer.java:configure(107)) - thread classpath = [file:/tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar/]
2013-10-03 20:49:33,698 INFO ExecReducer (ExecReducer.java:configure(149)) -
<OP>Id =3
<Children>
<FS>Id =4
<Parent>Id = 3 null<\Parent>
<\FS>
<\Children>
<\OP>
2013-10-03 20:49:33,698 INFO exec.ExtractOperator (Operator.java:initialize(321)) - Initializing Self 3 OP
2013-10-03 20:49:33,698 INFO exec.ExtractOperator (Operator.java:initializeChildren(386)) - Operator 3 OP initialized
2013-10-03 20:49:33,698 INFO exec.ExtractOperator (Operator.java:initializeChildren(390)) - Initializing children of 3 OP
2013-10-03 20:49:33,698 INFO exec.FileSinkOperator (Operator.java:initialize(425)) - Initializing child 4 FS
2013-10-03 20:49:33,699 INFO exec.FileSinkOperator (Operator.java:initialize(321)) - Initializing Self 4 FS
2013-10-03 20:49:33,701 INFO exec.FileSinkOperator (Operator.java:initializeChildren(386)) - Operator 4 FS initialized
2013-10-03 20:49:33,701 INFO exec.FileSinkOperator (Operator.java:initialize(361)) - Initialization Done 4 FS
2013-10-03 20:49:33,701 INFO exec.ExtractOperator (Operator.java:initialize(361)) - Initialization Done 3 OP
2013-10-03 20:49:33,706 INFO ExecReducer (ExecReducer.java:reduce(243)) - ExecReducer: processing 1 rows: used memory = 117749816
2013-10-03 20:49:33,707 INFO exec.ExtractOperator (Operator.java:forward(774)) - 3 forwarding 1 rows
2013-10-03 20:49:33,707 INFO exec.FileSinkOperator (FileSinkOperator.java:createBucketFiles(458)) - Final Path: FS hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000/000000_0
2013-10-03 20:49:33,707 INFO exec.FileSinkOperator (FileSinkOperator.java:createBucketFiles(460)) - Writing to temp file: FS hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_task_tmp.-ext-10000/_tmp.000000_0
2013-10-03 20:49:33,707 INFO exec.FileSinkOperator (FileSinkOperator.java:createBucketFiles(481)) - New Final Path: FS hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000/000000_0
2013-10-03 20:49:33,737 INFO ExecReducer (ExecReducer.java:close(301)) - ExecReducer: processed 4 rows: used memory = 118477400
2013-10-03 20:49:33,737 INFO exec.ExtractOperator (Operator.java:close(549)) - 3 finished. closing...
2013-10-03 20:49:33,737 INFO exec.ExtractOperator (Operator.java:close(555)) - 3 forwarded 4 rows
2013-10-03 20:49:33,737 INFO exec.FileSinkOperator (Operator.java:close(549)) - 4 finished. closing...
2013-10-03 20:49:33,737 INFO exec.FileSinkOperator (Operator.java:close(555)) - 4 forwarded 0 rows
2013-10-03 20:49:33,990 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - Hadoop job information for null: number of mappers: 0; number of reducers: 0
2013-10-03 20:49:34,011 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - 2013-10-03 20:49:34,011 null map = 0%, reduce = 0%
2013-10-03 20:49:34,111 INFO jdbc.JDBCStatsPublisher (JDBCStatsPublisher.java:publishStat(137)) - Stats publishing for key hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/-ext-10000/000000
2013-10-03 20:49:34,143 INFO exec.FileSinkOperator (Operator.java:logStats(845)) - TABLE_ID_1_ROWCOUNT:4
2013-10-03 20:49:34,143 INFO exec.ExtractOperator (Operator.java:close(570)) - 3 Close done
2013-10-03 20:49:34,145 INFO mapred.Task (Task.java:done(858)) - Task:attempt_local1250355097_0001_r_000000_0 is done. And is in the process of commiting
2013-10-03 20:49:34,146 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(466)) - reduce > reduce
2013-10-03 20:49:34,147 INFO mapred.Task (Task.java:sendDone(970)) - Task 'attempt_local1250355097_0001_r_000000_0' done.
2013-10-03 20:49:35,026 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - 2013-10-03 20:49:35,026 null map = 0%, reduce = 100%
2013-10-03 20:49:35,030 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - Ended Job = job_local1250355097_0001
2013-10-03 20:49:35,033 INFO exec.FileSinkOperator (Utilities.java:mvFileToFinalPath(1361)) - Moving tmp dir: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000 to: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000.intermediate
2013-10-03 20:49:35,036 INFO exec.FileSinkOperator (Utilities.java:mvFileToFinalPath(1372)) - Moving tmp dir: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000.intermediate to: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/-ext-10000
I can't reproduce that:
hive> INSERT OVERWRITE TABLE bucketed_users SELECT * FROM unbucketed_users;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_1384565454792_0070, Tracking URL = http://sandbox.hortonworks.com:8088/proxy/application_1384565454792_0070/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1384565454792_0070
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2013-11-16 05:04:12,290 Stage-1 map = 0%, reduce = 0%
2013-11-16 05:04:33,868 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.16 sec
MapReduce Total cumulative CPU time: 7 seconds 160 msec
Ended Job = job_1384565454792_0070
Loading data to table default.bucketed_users
rmr: DEPRECATED: Please use 'rm -r' instead.
Moved: 'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/bucketed_users' to trash at: hdfs://sandbox.hortonworks.com:8020/user/hue/.Trash/Current
Table default.bucketed_users stats: [num_partitions: 0, num_files: 4, num_rows: 0, total_size: 24, raw_data_size: 0]
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 4 Cumulative CPU: 7.16 sec HDFS Read: 259 HDFS Write: 24 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 160 msec
OK
Time taken: 19.291 seconds
hive> dfs -ls /apps/hive/warehouse/bucketed_users;
Found 4 items
-rw-r--r-- 3 hue hdfs 12 2013-11-16 05:04 /apps/hive/warehouse/bucketed_users/000000_0
-rw-r--r-- 3 hue hdfs 0 2013-11-16 05:04 /apps/hive/warehouse/bucketed_users/000001_0
-rw-r--r-- 3 hue hdfs 6 2013-11-16 05:04 /apps/hive/warehouse/bucketed_users/000002_0
-rw-r--r-- 3 hue hdfs 6 2013-11-16 05:04 /apps/hive/warehouse/bucketed_users/000003_0
It is very odd that you see a conversion to MapJoin, you should not see that since your query has no joins in it. Is that really the query you're running? If you're seeing that I would suggest to:
hive.auto.convert.join=false;
If that fixes it you should file a bug.
Odd, this works for me , However since you specify that your table is sorted you also need to set
set hive.enforce.sorting=true;
in addition of
set hive.enforce.bucketing = true;
I'm wondering if the combination of bucket/sort table and only setting one of the enforce setting messes it up somehow.

Resources