DSE 4.0.1: hive count different than cassandra count

DSE 4.0.1: hive count different than cassandra count - hadoop

We're running Datastax Enterprise 4.0.1 and running into a very strange issue when inserting rows into Cassandra and then querying hive for the COUNT(1).
The setup: DSE 4.0.01, Cassandra 2.0, Hive, brand new cluster. Insert 10,000 rows into Cassandra and then:
cqlsh:pageviews> select count(1) from pageviews_v1 limit 100000;
count
-------
10000
(1 rows)
cqlsh:pageviews>
But from Hive:
hive> select count(1) from pageviews_v1 limit 100000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201403272330_0002, Tracking URL = http://ip:50030/jobdetails.jsp?jobid=job_201403272330_0002
Kill Command = /usr/bin/dse hadoop job -kill job_201403272330_0002
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2014-03-27 23:38:22,129 Stage-1 map = 0%, reduce = 0%
<snip>
2014-03-27 23:38:49,324 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.31 sec
MapReduce Total cumulative CPU time: 11 seconds 310 msec
Ended Job = job_201403272330_0002
MapReduce Jobs Launched:
Job 0: Map: 4 Reduce: 1 Cumulative CPU: 11.31 sec HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 310 msec
OK
1723
Time taken: 38.634 seconds, Fetched: 1 row(s)
Only 1,723 rows. I'm so confused. The CQL3 ColumnFamily definition is:
CREATE TABLE pageviews_v1 (
website text,
date text,
created timestamp,
browser_id text,
ip text,
referer text,
user_agent text,
PRIMARY KEY ((website, date), created, browser_id)
) WITH CLUSTERING ORDER BY (created DESC, browser_id ASC) AND
bloom_filter_fp_chance=0.001000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=1.000000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='NONE' AND
memtable_flush_period_in_ms=0 AND
compaction={'min_sstable_size': '52428800', 'class': 'SizeTieredCompactionStrategy'} AND
compression={'chunk_length_kb': '64', 'sstable_compression': 'LZ4Compressor'};
And in Hive we have:
CREATE EXTERNAL TABLE pageviews_v1(
website string COMMENT 'from deserializer',
date string COMMENT 'from deserializer',
created timestamp COMMENT 'from deserializer',
browser_id string COMMENT 'from deserializer',
ip string COMMENT 'from deserializer',
referer string COMMENT 'from deserializer',
user_agent string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.cassandra.cql3.serde.CqlColumnSerDe'
STORED BY
'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
WITH SERDEPROPERTIES (
'serialization.format'='1',
'cassandra.columns.mapping'='website,date,created,browser_id,ip,referer,ua')
LOCATION
'cfs://ip/user/hive/warehouse/pageviews.db/pageviews_v1'
TBLPROPERTIES (
'cassandra.partitioner'='org.apache.cassandra.dht.Murmur3Partitioner',
'cassandra.ks.name'='pageviews',
'cassandra.cf.name'='pageviews_v1',
'auto_created'='true')
Has anyone else experienced similar?

It's probably the consistency setting on the HIVE table as per this document.

Change the hive query to "select count(*) from pageviews_v1 ;"

The issue appears to be with CLUSTERING ORDERY BY. Removing that resolves the COUNT misreporting from Hive.

Related

select count(*) failing on hive / hadoop on windows

hive> select count(*) from test_db.cust;
Query ID = EMMAdmin_20200106222630_32064e30-7ae6-4e0a-bf1b-b0979e297102
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Kill Command = C:\hadoop-2.10.0\bin\mapred job -kill job_1578366054556_0003
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2020-01-06 22:26:40,010 Stage-1 map = 0%, reduce = 0%
Ended Job = job_1578366054556_0003 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
hive>
error on Hadoop admin:
Application application_1578366054556_0001 failed 2 times due to AM Container for appattempt_1578366054556_0001_000002 exited with exitCode: -1
Failing this attempt.Diagnostics:
[2020-01-06 22:08:40.708]The command line has a length of 12995 exceeds maximum allowed length of 8191. Command starts with: "#set HADOOP_CLASSPATH=%PWD%;job.jar/*;job.jar/classes/;job.jar/lib/*;%PWD%/*;C:\apache-hive-3.0.0-bi "
normal select queries are working fine
hive> select country from test_db.cust where first_name like 'S%';
OK
USA
USA
USA
USA
USA
USA
USA
USA
Time taken: 0.229 seconds, Fetched: 8 row(s)
hive>
enter image description here

The problem is from commander.
You could use the next syntax, in order to output the result into a file:
hive -e "your query" > ~/sample_output.txt

Optimizing the hive query :Apache Hive

The following hive query which finds the lead and lag on a single column. The query spawns 1 Mapper and 50 Reducers. How can i optimize the query to spawn less reduces.
Table description
col_name data_type comment
# col_name data_type comment
a int
Data in tale
select * from foo;
OK
foo.a 1 2 3 4 5 6 3 4 6 78 9 7 NULL
select lag(a,1) over (order by a) as next,lead(a,1) over (order by a) as prev from foo;
Query ID =
phodisvc_20170403015502_de129135-eb19-4c4d-8161-c3f217a45928 Total
jobs = 1 Launching Job 1 out of 1 Number of reduce tasks not
specified. Defaulting to jobconf value of: 50 In order to change the
average load for a reducer (in bytes): set
hive.exec.reducers.bytes.per.reducer= In order to limit the
maximum number of reducers: set hive.exec.reducers.max= In
order to set a constant number of reducers: set
mapreduce.job.reduces= Kill Command =
/opt/mapr/hadoop/hadoop-2.7.0/bin/hadoop job -kill
job_1489146839620_136214 Hadoop job information for Stage-1: number of
mappers: 1; number of reducers: 50

Hive partition query is scanning all partitions

When I write the hive query like below
select count(*)
from order
where order_month >= '2016-11';
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 1
I am getting 5 mappers only it means reading required partitions only(2016-11 and 2016-12)
Same query I write using function
select count(*)
from order
where order_month >= concat(year(DATE_SUB(to_date(from_unixtime(UNIX_TIMESTAMP())),10)),'-',month(DATE_SUB(to_date(from_unixtime(UNIX_TIMESTAMP())),10)));
Note:
concat(year(DATE_SUB(to_date(from_unixtime(UNIX_TIMESTAMP())),10)),'-',month(DATE_SUB(to_date(from_unixtime(UNIX_TIMESTAMP())),10)))
= '2016-11'
Hadoop job information for Stage-1: number of mappers: 216; number of reducers: 1
this time it is reading all partitions {i.e. 2004-10 to 2016-12}. .
How to modify the query to read required partitions only.

unix_timestamp() function is non-deterministic and prevents proper optimization of queries - this has been deprecated since 2.0 in favour of CURRENT_TIMESTAMP and CURRENT_DATE.
Use current_date, also no need to calculate year and month separately:
where order_month >= substr(date_sub(current_date, 10),1,7)

Hive > insert overwrite table / local directory doen't work

trying to join two tables and store the output in some table or local directory
map reduce job is success but nothing comes up in the output path / table.
can some one help me ?
hive> insert overwrite table order_result select e.emp_id as emp_id, count(distinct p.product_id) as product_id, sum(p.quantity) as quantity from emp e join orders p on e.emp_id = p.emp_id group by e.emp_id order by quantity desc, product_id asc;
Total jobs = 3
Stage-1 is selected by condition resolver.
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1438631656520_0053, Tracking URL = http://localhost:8088/proxy/application_1438631656520_0053/
Kill Command = /usr/lib/hadoop-2.2.0/bin/hadoop job -kill job_1438631656520_0053
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2015-08-04 07:45:28,470 Stage-1 map = 0%, reduce = 0%
2015-08-04 07:45:58,648 Stage-1 map = 22%, reduce = 0%, Cumulative CPU 11.62 sec
2015-08-04 07:46:01,302 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 12.05 sec
MapReduce Total cumulative CPU time: 3 seconds 0 msec
Ended Job = job_1438631656520_0055
Loading data to table test_join.order_result
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted hdfs://localhost:8020/user/hive/warehouse/test_join.db/order_result
Table test_join.order_result stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
MapReduce Jobs Launched:
Job 0: Map: 3 Reduce: 1 Cumulative CPU: 305.34 sec HDFS Read: 354101279 HDFS Write: 96 SUCCESS
Job 1: Map: 1 Reduce: 1 Cumulative CPU: 2.76 sec HDFS Read: 462 HDFS Write: 96 SUCCESS
Job 2: Map: 1 Reduce: 1 Cumulative CPU: 3.0 sec HDFS Read: 462 HDFS Write: 48 SUCCESS
Total MapReduce CPU Time Spent: 5 minutes 11 seconds 100 msec
OK
Time taken: 817.424 seconds
hive> select * from order_result;
OK
Time taken: 0.146 seconds

Can u check the query if you are getting output or not, as per the MR log which was shared, could see the Input Data Size was 354101279 where as output is only 96
HDFS Read: 354101279 HDFS Write: 96 SUCCESS
belive the query is working fine but not producing any output.
it could be reason like.
Both Input Table is having data but the Data Type for emp_id is not
corrent or not matching

The issue was with the data in DB tables; it has some NULL values;
If anyone face the same issue; check the data inside the tables is as per your requirement and don't have any null values.

Hadoop Still Treats Commas as Delimiters after Explictly Declaring a Different Character

I'm currently importing data into a hive table. When we created the table we used
CREATE EXTERNAL TABLE Customers
(
Code string,
Company string,
FirstName string,
LastName string,
DateOfBirth string,
PhoneNo string,
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
as there are commas in our data. However, we've now discovered that the commas are still being treated as field delimiters, as well as the | we're using to separate the fields. Is there any way to work around this? Do we have to escape every single comma in our data or is there an easier way to get it set up?
Example data
1|2|3|4
a|b|c|d
John|Joe|Bob, Jr|Alex
Which when put in the table appears as
1 2 3 4
a b c d
John Joe Bob Jr
With Jr occupying its own column and bumping Alex from the table.

It is working fine for me using your data. Hive version is 0.13
hive> create external table foo(
> first string,
> second string,
> third string,
> forth string)
> row format delimited fields terminated by '|' lines terminated by '\n';
OK
Time taken: 3.222 seconds
hive> load data inpath '/user/xilan/data.txt' overwrite into table foo;
hive> select third from foo;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1422157058628_0001, Tracking URL = http://host:8088/proxy/application_1422157058628_0001/
Kill Command = /scratch/xilan/hadoop/bin/hadoop job -kill job_1422157058628_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-03-27 07:05:41,901 Stage-1 map = 0%, reduce = 0%
2015-03-27 07:05:50,190 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.24 sec
MapReduce Total cumulative CPU time: 1 seconds 240 msec
Ended Job = job_1422157058628_0001
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 1.24 sec HDFS Read: 245 HDFS Write: 12 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 240 msec
OK
3
c
Bob, Jr
Time taken: 18.853 seconds, Fetched: 3 row(s)
hive>

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

DSE 4.0.1: hive count different than cassandra count - hadoop

It's probably the consistency setting on the HIVE table as per this document.

Change the hive query to "select count(*) from pageviews_v1 ;"

The issue appears to be with CLUSTERING ORDERY BY. Removing that resolves the COUNT misreporting from Hive.

Related

select count(*) failing on hive / hadoop on windows

Optimizing the hive query :Apache Hive

Hive partition query is scanning all partitions

Hive > insert overwrite table / local directory doen't work

Hadoop Still Treats Commas as Delimiters after Explictly Declaring a Different Character

Categories

Resources