I'm currently importing data into a Hive table. When we created the table we used:
CREATE EXTERNAL TABLE Customers
(
Code string,
Company string,
FirstName string,
LastName string,
DateOfBirth string,
PhoneNo string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
We chose | as the delimiter because there are commas in our data. However, we've now discovered that commas are still being treated as field delimiters, in addition to the | we're using to separate the fields. Is there any way to work around this? Do we have to escape every single comma in our data, or is there an easier way to set this up?
Example data
1|2|3|4
a|b|c|d
John|Joe|Bob, Jr|Alex
When loaded into the table, this appears as:
1 2 3 4
a b c d
John Joe Bob Jr
Jr occupies its own column and bumps Alex out of the table.
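As a quick sanity check outside Hive, the sample rows can be split on the pipe alone to confirm what a pipe-only parse should produce. This is only a Python sketch, assuming the raw file contains exactly the three example lines above; `split_row` is a hypothetical helper, not part of Hive.

```python
# Sample rows copied from the example data above.
SAMPLE_LINES = [
    "1|2|3|4",
    "a|b|c|d",
    "John|Joe|Bob, Jr|Alex",
]


def split_row(line, delimiter="|"):
    """Split one raw line on the field delimiter only; commas are left untouched."""
    return line.rstrip("\n").split(delimiter)


rows = [split_row(line) for line in SAMPLE_LINES]

# Print the field count and the fields of every row.
for row in rows:
    print(len(row), row)
```

If every row comes back with four fields here but the table still shows a shifted column, the file itself is probably fine, and the table definition or the tool displaying the result is the more likely place to look.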
It works fine for me with your data on Hive 0.13:
hive> create external table foo(
> first string,
> second string,
> third string,
> forth string)
> row format delimited fields terminated by '|' lines terminated by '\n';
OK
Time taken: 3.222 seconds
hive> load data inpath '/user/xilan/data.txt' overwrite into table foo;
hive> select third from foo;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1422157058628_0001, Tracking URL = http://host:8088/proxy/application_1422157058628_0001/
Kill Command = /scratch/xilan/hadoop/bin/hadoop job -kill job_1422157058628_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-03-27 07:05:41,901 Stage-1 map = 0%, reduce = 0%
2015-03-27 07:05:50,190 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.24 sec
MapReduce Total cumulative CPU time: 1 seconds 240 msec
Ended Job = job_1422157058628_0001
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 1.24 sec HDFS Read: 245 HDFS Write: 12 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 240 msec
OK
3
c
Bob, Jr
Time taken: 18.853 seconds, Fetched: 3 row(s)
hive>
hive> select count(*) from test_db.cust;
Query ID = EMMAdmin_20200106222630_32064e30-7ae6-4e0a-bf1b-b0979e297102
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Kill Command = C:\hadoop-2.10.0\bin\mapred job -kill job_1578366054556_0003
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2020-01-06 22:26:40,010 Stage-1 map = 0%, reduce = 0%
Ended Job = job_1578366054556_0003 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
hive>
Error shown in the Hadoop admin UI:
Application application_1578366054556_0001 failed 2 times due to AM Container for appattempt_1578366054556_0001_000002 exited with exitCode: -1
Failing this attempt.Diagnostics:
[2020-01-06 22:08:40.708]The command line has a length of 12995 exceeds maximum allowed length of 8191. Command starts with: "#set HADOOP_CLASSPATH=%PWD%;job.jar/*;job.jar/classes/;job.jar/lib/*;%PWD%/*;C:\apache-hive-3.0.0-bi "
Normal select queries work fine:
hive> select country from test_db.cust where first_name like 'S%';
OK
USA
USA
USA
USA
USA
USA
USA
USA
Time taken: 0.229 seconds, Fetched: 8 row(s)
hive>
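The problem comes from the command prompt: as the diagnostics show, the generated command line (12995 characters) exceeds the maximum allowed length of 8191.
You could use the following syntax to write the result to a file: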
hive -e "your query" > ~/sample_output.txt
Hive version 0.13.
I have a column in Hive that needs to be enclosed in double quotes.
I tried the query below, but the output is returned as NULL:
select "\""+notes_detail+"\"" from service_request_notes.TSS_INCIDENT_NOTES_F_SAMPLE limit 1;
whereas the output should be the following value, enclosed in double quotes:
select notes_detail from service_request_notes.TSS_INCIDENT_NOTES_F_SAMPLE limit 1;
Job 0: Map: 106 Cumulative CPU: 151.24 sec MAPRFS Read: 0 MAPRFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 minutes 31 seconds 240 msec
OK
notes_detail
From: scottw#acespower.com
I am trying to join two tables and store the output in a table or a local directory.
The MapReduce job succeeds, but nothing appears in the output path or table.
Can someone help me?
hive> insert overwrite table order_result select e.emp_id as emp_id, count(distinct p.product_id) as product_id, sum(p.quantity) as quantity from emp e join orders p on e.emp_id = p.emp_id group by e.emp_id order by quantity desc, product_id asc;
Total jobs = 3
Stage-1 is selected by condition resolver.
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1438631656520_0053, Tracking URL = http://localhost:8088/proxy/application_1438631656520_0053/
Kill Command = /usr/lib/hadoop-2.2.0/bin/hadoop job -kill job_1438631656520_0053
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2015-08-04 07:45:28,470 Stage-1 map = 0%, reduce = 0%
2015-08-04 07:45:58,648 Stage-1 map = 22%, reduce = 0%, Cumulative CPU 11.62 sec
2015-08-04 07:46:01,302 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 12.05 sec
MapReduce Total cumulative CPU time: 3 seconds 0 msec
Ended Job = job_1438631656520_0055
Loading data to table test_join.order_result
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted hdfs://localhost:8020/user/hive/warehouse/test_join.db/order_result
Table test_join.order_result stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
MapReduce Jobs Launched:
Job 0: Map: 3 Reduce: 1 Cumulative CPU: 305.34 sec HDFS Read: 354101279 HDFS Write: 96 SUCCESS
Job 1: Map: 1 Reduce: 1 Cumulative CPU: 2.76 sec HDFS Read: 462 HDFS Write: 96 SUCCESS
Job 2: Map: 1 Reduce: 1 Cumulative CPU: 3.0 sec HDFS Read: 462 HDFS Write: 48 SUCCESS
Total MapReduce CPU Time Spent: 5 minutes 11 seconds 100 msec
OK
Time taken: 817.424 seconds
hive> select * from order_result;
OK
Time taken: 0.146 seconds
Can you check whether the query itself returns any output? According to the MapReduce log you shared, the input size was 354101279 bytes, whereas the output was only 96 bytes:
HDFS Read: 354101279 HDFS Write: 96 SUCCESS
I believe the query runs fine but does not produce any output.
One possible reason: both input tables have data, but the data type of emp_id is incorrect or does not match between them.
The issue was with the data in the DB tables: it had some NULL values.
If anyone faces the same issue, check that the data in the tables is what you expect and does not contain any NULL values.
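The effect of NULL join keys can be reproduced on a small scale. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, on the assumption that both follow the standard SQL rule that NULL = NULL is not true, so such rows never match in an inner join. It also assumes the NULLs were in the join column emp_id, which the answer above does not state. The table contents are made up for illustration, and the ORDER BY from the original query is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (emp_id INTEGER);
    CREATE TABLE orders (emp_id INTEGER, product_id INTEGER, quantity INTEGER);
    -- Hypothetical data: every join key is NULL on both sides.
    INSERT INTO emp VALUES (NULL), (NULL);
    INSERT INTO orders VALUES (NULL, 10, 5), (NULL, 11, 2);
""")

# Same join and aggregation shape as the query in the question.
joined = conn.execute("""
    SELECT e.emp_id, COUNT(DISTINCT p.product_id), SUM(p.quantity)
    FROM emp e JOIN orders p ON e.emp_id = p.emp_id
    GROUP BY e.emp_id
""").fetchall()

# Count the rows whose join key is NULL in each input table.
null_emp = conn.execute(
    "SELECT COUNT(*) FROM emp WHERE emp_id IS NULL").fetchone()[0]
null_orders = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE emp_id IS NULL").fetchone()[0]

print(joined)                 # rows produced by the join
print(null_emp, null_orders)  # NULL join keys per table
```

Running the same kind of `IS NULL` count against the real emp and orders tables is a cheap way to check whether this is what emptied the result.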
We're running Datastax Enterprise 4.0.1 and running into a very strange issue when inserting rows into Cassandra and then querying Hive for the COUNT(1).
The setup: DSE 4.0.1, Cassandra 2.0, Hive, brand new cluster. Insert 10,000 rows into Cassandra and then:
cqlsh:pageviews> select count(1) from pageviews_v1 limit 100000;
count
-------
10000
(1 rows)
cqlsh:pageviews>
But from Hive:
hive> select count(1) from pageviews_v1 limit 100000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201403272330_0002, Tracking URL = http://ip:50030/jobdetails.jsp?jobid=job_201403272330_0002
Kill Command = /usr/bin/dse hadoop job -kill job_201403272330_0002
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2014-03-27 23:38:22,129 Stage-1 map = 0%, reduce = 0%
<snip>
2014-03-27 23:38:49,324 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.31 sec
MapReduce Total cumulative CPU time: 11 seconds 310 msec
Ended Job = job_201403272330_0002
MapReduce Jobs Launched:
Job 0: Map: 4 Reduce: 1 Cumulative CPU: 11.31 sec HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 310 msec
OK
1723
Time taken: 38.634 seconds, Fetched: 1 row(s)
Only 1,723 rows. I'm so confused. The CQL3 ColumnFamily definition is:
CREATE TABLE pageviews_v1 (
website text,
date text,
created timestamp,
browser_id text,
ip text,
referer text,
user_agent text,
PRIMARY KEY ((website, date), created, browser_id)
) WITH CLUSTERING ORDER BY (created DESC, browser_id ASC) AND
bloom_filter_fp_chance=0.001000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=1.000000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='NONE' AND
memtable_flush_period_in_ms=0 AND
compaction={'min_sstable_size': '52428800', 'class': 'SizeTieredCompactionStrategy'} AND
compression={'chunk_length_kb': '64', 'sstable_compression': 'LZ4Compressor'};
And in Hive we have:
CREATE EXTERNAL TABLE pageviews_v1(
website string COMMENT 'from deserializer',
date string COMMENT 'from deserializer',
created timestamp COMMENT 'from deserializer',
browser_id string COMMENT 'from deserializer',
ip string COMMENT 'from deserializer',
referer string COMMENT 'from deserializer',
user_agent string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.cassandra.cql3.serde.CqlColumnSerDe'
STORED BY
'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
WITH SERDEPROPERTIES (
'serialization.format'='1',
'cassandra.columns.mapping'='website,date,created,browser_id,ip,referer,ua')
LOCATION
'cfs://ip/user/hive/warehouse/pageviews.db/pageviews_v1'
TBLPROPERTIES (
'cassandra.partitioner'='org.apache.cassandra.dht.Murmur3Partitioner',
'cassandra.ks.name'='pageviews',
'cassandra.cf.name'='pageviews_v1',
'auto_created'='true')
Has anyone else experienced something similar?
It's probably the consistency setting on the Hive table as per this document.
Change the Hive query to "select count(*) from pageviews_v1 ;"
The issue appears to be with CLUSTERING ORDER BY. Removing that resolves the COUNT misreporting from Hive.
hive> select * from example;
OK
1 hello yang
2 hello bear
3 aaa
Time taken: 51.273 seconds -> that's ok !
hive> select count(key) from example;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Starting Job = job_201309170341_0001, Tracking URL = ...
Kill Command = /usr/bin/dse hadoop job -Dmapred.job.tracker=10.10.5.153:8012 -kill job_201309170341_0001
Hadoop job information for Stage-1: number of mappers: 1537; number of reducers: 1
After waiting for 1 hour, I get the count: 3!
Why does it take so long, and why is the number of mappers so large (1537)?
Do you have vnodes enabled? It looks like you do. We are working on supporting Hadoop on vnodes, but until that is done, it is recommended to disable vnodes for a Hadoop data center/cluster.