Hive Query having virtual column INPUT__FILE__NAME in where clause gives exception - hadoop

Executing hive query with filter on virtual column INPUT__FILE__NAME result in following exception.
hive> select count(*) from netflow where INPUT__FILE__NAME='vzb.1351794600.0';
FAILED: SemanticException java.lang.RuntimeException: cannot find field input__file__name from [org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField#1d264bf5, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField#3d44d0c6,
.
.
.
org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField#7e6bc5aa]
This error is different from the one we get when column name is wrong
hive> select count(*) from netflow where INPUT__FILE__NAM='vzb.1351794600.0';
FAILED: SemanticException [Error 10004]: Line 1:35 Invalid table alias or column reference 'INPUT__FILE__NAM': (possible column names are: first, last, ....)
But using this virtual column in select clause works fine.
hive> select INPUT__FILE__NAME from netflow group by INPUT__FILE__NAME;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201306041359_0006, Tracking URL = http://192.168.0.224:50030/jobdetails.jsp?jobid=job_201306041359_0006
Kill Command = /opt/hadoop/bin/../bin/hadoop job -kill job_201306041359_0006
Hadoop job information for Stage-1: number of mappers: 12; number of reducers: 4
2013-06-14 18:20:10,265 Stage-1 map = 0%, reduce = 0%
2013-06-14 18:20:33,363 Stage-1 map = 8%, reduce = 0%
.
.
.
2013-06-14 18:21:15,554 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201306041359_0006
MapReduce Jobs Launched:
Job 0: Map: 12 Reduce: 4 HDFS Read: 3107826046 HDFS Write: 55 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
hdfs://192.168.0.224:9000/data/jk/vzb/vzb.1351794600.0
Time taken: 78.467 seconds
I am trying to create external hive table on already present HDFS data. And I have extra files in the folder that I want to ignore. Similar to what is asked and suggested in following stackflow questions
how to make hive take only specific files as input from hdfs folder
when creating an external table in hive can I point the location to specific files in a direcotry?
Any help would be appreciated.
Full stack trace I am getting is as follows
2013-06-14 15:01:32,608 ERROR ql.Driver (SessionState.java:printError(401)) - FAILED: SemanticException java.lang.RuntimeException: cannot find field input__
org.apache.hadoop.hive.ql.parse.SemanticException: java.lang.RuntimeException: cannot find field input__file__name from [org.apache.hadoop.hive.serde2.object
at org.apache.hadoop.hive.ql.optimizer.pcr.PcrOpProcFactory$FilterPCR.process(PcrOpProcFactory.java:122)
at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:87)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:124)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:101)
at org.apache.hadoop.hive.ql.optimizer.pcr.PartitionConditionRemover.transform(PartitionConditionRemover.java:86)
at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:102)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:8163)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:50)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:335)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:893)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:412)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:755)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: cannot find field input__file__name from [org.apache.hadoop.hive.ser
at org.apache.hadoop.hive.ql.optimizer.ppr.PartitionPruner.prune(PartitionPruner.java:231)
at org.apache.hadoop.hive.ql.optimizer.pcr.PcrOpProcFactory$FilterPCR.process(PcrOpProcFactory.java:112)
... 23 more
Caused by: java.lang.RuntimeException: cannot find field input__file__name from [org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyF
at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:344)
at org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector.getStructFieldRef(UnionStructObjectInspector.java:100)
at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57)
at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:128)
at org.apache.hadoop.hive.ql.optimizer.ppr.PartExprEvalUtils.prepareExpr(PartExprEvalUtils.java:100)
at org.apache.hadoop.hive.ql.optimizer.ppr.PartitionPruner.pruneBySequentialScan(PartitionPruner.java:328)
at org.apache.hadoop.hive.ql.optimizer.ppr.PartitionPruner.prune(PartitionPruner.java:219)
... 24 more

Related

select count(*) failing on hive / hadoop on windows

hive> select count(*) from test_db.cust;
Query ID = EMMAdmin_20200106222630_32064e30-7ae6-4e0a-bf1b-b0979e297102
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Kill Command = C:\hadoop-2.10.0\bin\mapred job -kill job_1578366054556_0003
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2020-01-06 22:26:40,010 Stage-1 map = 0%, reduce = 0%
Ended Job = job_1578366054556_0003 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
hive>
error on Hadoop admin:
Application application_1578366054556_0001 failed 2 times due to AM Container for appattempt_1578366054556_0001_000002 exited with exitCode: -1
Failing this attempt.Diagnostics:
[2020-01-06 22:08:40.708]The command line has a length of 12995 exceeds maximum allowed length of 8191. Command starts with: "#set HADOOP_CLASSPATH=%PWD%;job.jar/*;job.jar/classes/;job.jar/lib/*;%PWD%/*;C:\apache-hive-3.0.0-bi "
normal select queries are working fine
hive> select country from test_db.cust where first_name like 'S%';
OK
USA
USA
USA
USA
USA
USA
USA
USA
Time taken: 0.229 seconds, Fetched: 8 row(s)
hive>
enter image description here
The problem is from commander.
You could use the next syntax, in order to output the result into a file:
hive -e "your query" > ~/sample_output.txt

Hive | ArrayIndexOutOfBounds | More than 1000 columns in select query

I am trying to run some group operations (like max, min, avg, count etc) on Hive table with 300 columns. So, my select query would have more than 1000 columns and more than 4000 characters.
The select query is failing. I am facing the below issue.
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: Hive Runtime Error while closing operators
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:217)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ArrayIndexOutOfBoundsException: -128
at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1084)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:598)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:199)
... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ArrayIndexOutOfBoundsException: -128
at org.apache.hadoop.hive.ql.exec.GroupByOperator.flush(GroupByOperator.java:1042)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1081)
... 13 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ArrayIndexOutOfBoundsException: -128
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.processOp(ReduceSinkOperator.java:401)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:1007)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.flush(GroupByOperator.java:1025)
... 14 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: -128
at java.util.ArrayList.elementData(ArrayList.java:400)
at java.util.ArrayList.get(ArrayList.java:413)
at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.serialize(BinarySortableSerDe.java:797)
at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.serialize(BinarySortableSerDe.java:609)
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.toHiveKey(ReduceSinkOperator.java:508)
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.processOp(ReduceSinkOperator.java:394)
... 17 more
I get this error when I try to run the query on Hive terminal.
There is table in hive which contains 300 columns and when I perform group functions like count, min, max, distinct etc. on all the columns of this table, I face the above error. The hive query for this is huge and has 300*6 (let's consider 6 group functions - applied to each and every column) columns in it.

Too many counter groups while storing Hive partitioned table as parquet

I have created a table sample with id as its partition and stored it in parquet format.
create table sample(uuid String,date String,Name String,EmailID String,Comments String,CompanyName String,country String,url String,keyword String,source String) PARTITIONED BY (id String) Stored as parquet;
Then I inserted values into it using below command
INSERT INTO TABLE sample PARTITION (id) Select uuid,date,Name,EmailID,Comments,CompanyName,country,url,keyword,source,id from inter distribute by id;
This query results in following issue
Error: java.lang.RuntimeException: Hive Runtime Error while closing operators: org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counter groups: 51 max=50 at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:295) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:453) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1613) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163) Caused by: org.apache.hadoop.mapreduce.counters.LimitExceededException: org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counter groups: 51 max=50 at org.apache.hadoop.mapreduce.counters.Limits.checkCounters(Limits.java:97) at org.apache.hadoop.mapreduce.counters.Limits.incrCounters(Limits.java:108) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:78) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:95) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounterImpl(AbstractCounterGroup.java:123) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:113) at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:130) at org.apache.hadoop.mapred.Counters$Group.findCounter(Counters.java:369) at org.apache.hadoop.mapred.Counters$Group.getCounterForName(Counters.java:314) at org.apache.hadoop.mapred.Counters.findCounter(Counters.java:479) at org.apache.hadoop.mapred.Counters.incrCounter(Counters.java:544) at org.apache.hadoop.mapred.Task$TaskReporter.incrCounter(Task.java:679) at org.apache.hadoop.hive.ql.exec.mr.ExecMapper$ReportStats.func(ExecMapper.java:261) at org.apache.hadoop.hive.ql.exec.Operator.preorderMap(Operator.java:850) at org.apache.hadoop.hive.ql.exec.Operator.preorderMap(Operator.java:853) at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:289) ... 7 more Caused by: org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counter groups: 51 max=50 at org.apache.hadoop.mapreduce.counters.Limits.checkGroups(Limits.java:118) at org.apache.hadoop.mapreduce.counters.AbstractCounters.getGroup(AbstractCounters.java:230) at org.apache.hadoop.mapred.Counters.getGroup(Counters.java:113) at org.apache.hadoop.mapred.Counters.findCounter(Counters.java:479) at org.apache.hadoop.mapred.Counters.incrCounter(Counters.java:544) at org.apache.hadoop.mapred.Task$TaskReporter.incrCounter(Task.java:679) at org.apache.hadoop.hive.ql.stats.CounterStatsPublisher.publishStat(CounterStatsPublisher.java:54) at org.apache.hadoop.hive.ql.exec.FileSinkOperator.publishStats(FileSinkOperator.java:1167) at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1017) at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:598) at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610) at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:287) ... 7 more Container killed by the ApplicationMaster. Container killed on request. Exit code is 137 Container exited with a non-zero exit code 137
NOTE id column have 1 million distinct values
Any one help me in this?
You should expand the counters limit, such as:
mapreduce.job.counters.limit=1000
mapreduce.job.counters.max=1000
mapreduce.job.counters.groups.max=500
mapreduce.job.counters.group.name.max=1000
mapreduce.job.counters.counter.name.max=500

Hive > insert overwrite table / local directory doen't work

trying to join two tables and store the output in some table or local directory
map reduce job is success but nothing comes up in the output path / table.
can some one help me ?
hive> insert overwrite table order_result select e.emp_id as emp_id, count(distinct p.product_id) as product_id, sum(p.quantity) as quantity from emp e join orders p on e.emp_id = p.emp_id group by e.emp_id order by quantity desc, product_id asc;
Total jobs = 3
Stage-1 is selected by condition resolver.
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1438631656520_0053, Tracking URL = http://localhost:8088/proxy/application_1438631656520_0053/
Kill Command = /usr/lib/hadoop-2.2.0/bin/hadoop job -kill job_1438631656520_0053
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2015-08-04 07:45:28,470 Stage-1 map = 0%, reduce = 0%
2015-08-04 07:45:58,648 Stage-1 map = 22%, reduce = 0%, Cumulative CPU 11.62 sec
2015-08-04 07:46:01,302 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 12.05 sec
MapReduce Total cumulative CPU time: 3 seconds 0 msec
Ended Job = job_1438631656520_0055
Loading data to table test_join.order_result
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted hdfs://localhost:8020/user/hive/warehouse/test_join.db/order_result
Table test_join.order_result stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
MapReduce Jobs Launched:
Job 0: Map: 3 Reduce: 1 Cumulative CPU: 305.34 sec HDFS Read: 354101279 HDFS Write: 96 SUCCESS
Job 1: Map: 1 Reduce: 1 Cumulative CPU: 2.76 sec HDFS Read: 462 HDFS Write: 96 SUCCESS
Job 2: Map: 1 Reduce: 1 Cumulative CPU: 3.0 sec HDFS Read: 462 HDFS Write: 48 SUCCESS
Total MapReduce CPU Time Spent: 5 minutes 11 seconds 100 msec
OK
Time taken: 817.424 seconds
hive> select * from order_result;
OK
Time taken: 0.146 seconds
Can u check the query if you are getting output or not, as per the MR log which was shared, could see the Input Data Size was 354101279 where as output is only 96
HDFS Read: 354101279 HDFS Write: 96 SUCCESS
belive the query is working fine but not producing any output.
it could be reason like.
Both Input Table is having data but the Data Type for emp_id is not
corrent or not matching
The issue was with the data in DB tables; it has some NULL values;
If anyone face the same issue; check the data inside the tables is as per your requirement and don't have any null values.

DSE 4.0.1: hive count different than cassandra count

We're running Datastax Enterprise 4.0.1 and running into a very strange issue when inserting rows into Cassandra and then querying hive for the COUNT(1).
The setup: DSE 4.0.01, Cassandra 2.0, Hive, brand new cluster. Insert 10,000 rows into Cassandra and then:
cqlsh:pageviews> select count(1) from pageviews_v1 limit 100000;
count
-------
10000
(1 rows)
cqlsh:pageviews>
But from Hive:
hive> select count(1) from pageviews_v1 limit 100000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201403272330_0002, Tracking URL = http://ip:50030/jobdetails.jsp?jobid=job_201403272330_0002
Kill Command = /usr/bin/dse hadoop job -kill job_201403272330_0002
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2014-03-27 23:38:22,129 Stage-1 map = 0%, reduce = 0%
<snip>
2014-03-27 23:38:49,324 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.31 sec
MapReduce Total cumulative CPU time: 11 seconds 310 msec
Ended Job = job_201403272330_0002
MapReduce Jobs Launched:
Job 0: Map: 4 Reduce: 1 Cumulative CPU: 11.31 sec HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 310 msec
OK
1723
Time taken: 38.634 seconds, Fetched: 1 row(s)
Only 1,723 rows. I'm so confused. The CQL3 ColumnFamily definition is:
CREATE TABLE pageviews_v1 (
website text,
date text,
created timestamp,
browser_id text,
ip text,
referer text,
user_agent text,
PRIMARY KEY ((website, date), created, browser_id)
) WITH CLUSTERING ORDER BY (created DESC, browser_id ASC) AND
bloom_filter_fp_chance=0.001000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=1.000000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='NONE' AND
memtable_flush_period_in_ms=0 AND
compaction={'min_sstable_size': '52428800', 'class': 'SizeTieredCompactionStrategy'} AND
compression={'chunk_length_kb': '64', 'sstable_compression': 'LZ4Compressor'};
And in Hive we have:
CREATE EXTERNAL TABLE pageviews_v1(
website string COMMENT 'from deserializer',
date string COMMENT 'from deserializer',
created timestamp COMMENT 'from deserializer',
browser_id string COMMENT 'from deserializer',
ip string COMMENT 'from deserializer',
referer string COMMENT 'from deserializer',
user_agent string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.cassandra.cql3.serde.CqlColumnSerDe'
STORED BY
'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
WITH SERDEPROPERTIES (
'serialization.format'='1',
'cassandra.columns.mapping'='website,date,created,browser_id,ip,referer,ua')
LOCATION
'cfs://ip/user/hive/warehouse/pageviews.db/pageviews_v1'
TBLPROPERTIES (
'cassandra.partitioner'='org.apache.cassandra.dht.Murmur3Partitioner',
'cassandra.ks.name'='pageviews',
'cassandra.cf.name'='pageviews_v1',
'auto_created'='true')
Has anyone else experienced similar?
It's probably the consistency setting on the HIVE table as per this document.
Change the hive query to "select count(*) from pageviews_v1 ;"
The issue appears to be with CLUSTERING ORDERY BY. Removing that resolves the COUNT misreporting from Hive.

Resources