Presto not returning rows from Hive metastore

I'm very new to AWS EMR. I've got Hive up and running and have been querying external tables in S3 with no problems. I have now installed Presto onto the EMR cluster; it seems to be up and running and can read the Hive metastore. However, every query I run returns the column headers but no actual rows (example query below).
presto:default> select count(*) from patrequests;
_col0
-------
0
(1 row)
Query 20171113_163811_00033_vdw6c, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
Querying the same table in Hive works fine:
hive> select * from patrequests limit 10;
OK
2017-10-01T00:00:18.6772628Z 779 ***** ***** ***** ***** 2017 10
Time taken: 2.876 seconds, Fetched: 10 row(s)
The data is stored in an S3 bucket in JSON format - no nesting.
Any help greatly appreciated.
Thanks

The problem seemed to be that the JSON SerDe org.openx.data.jsonserde.JsonSerDe was not available to Presto. Bootstrapping the instances with the following script, pulled from an S3 bucket, seemed to resolve the issue:
#!/bin/bash
wget -P /usr/lib/presto/plugin/hive-hadoop2/ 'https://s3-eu-west-1.amazonaws.com/########/json-serde-1.3.9-SNAPSHOT-jar-with-dependencies.jar';
wget -P /usr/lib/hive-hcatalog/share/hcatalog/ 'https://s3-eu-west-1.amazonaws.com/########/json-serde-1.3.9-SNAPSHOT-jar-with-dependencies.jar';
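For completeness, a bootstrap action like this has to be registered when the cluster is launched. A rough sketch using the AWS CLI, where the cluster settings, bucket name, and script name are placeholders rather than the ones actually used:
aws emr create-cluster \
  --name "presto-hive-cluster" \
  --release-label emr-5.9.0 \
  --applications Name=Hive Name=Presto \
  --instance-type m4.large --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://your-bucket/install-json-serde.sh
The bootstrap script runs on every node before the applications start, which is what puts the SerDe jar in place for both Hive and Presto.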

Related

Getting NULL values after loading data into Hive tables from an online dataset

I am trying to load data from an online dataset into my Hive table using the Hue interface, but I am getting NULL values.
Here's my dataset:
https://www.kaggle.com/psparks/instacart-market-basket-analysis?select=aisles.csv
Here's my code:
CREATE TABLE IF NOT EXISTS AISLES (aisles_id INT, aisles STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");
Here's how I loaded the data:
LOAD DATA LOCAL INPATH '/home/hadoop/aisles.csv' INTO TABLE aisles;
Workarounds I tried, with no luck:
FIELDS TERMINATED BY ','
FIELDS TERMINATED BY '\t'
FIELDS TERMINATED BY ''
FIELDS TERMINATED BY ' '
I also tried removing LINES TERMINATED BY '\n'.
This is how I downloaded the data:
[hadoop@ip-172-31-76-58 ~]$ wget -O aisles.csv "https://www.kaggle.com/psparks/instacart-market-basket-analysis?select=aisles.csv"
--2020-10-14 23:50:06-- https://www.kaggle.com/psparks/instacart-market-basket-analysis?select=aisles.csv
Resolving www.kaggle.com (www.kaggle.com)... 35.244.233.98
Connecting to www.kaggle.com (www.kaggle.com)|35.244.233.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘aisles.csv’
I checked the location of the table I created; this is what it says:
hdfs://ip-172-31-76-58.ec2.internal:8020/user/hive/warehouse/aisles
I tried browsing the directory to see where the file was saved:
[hadoop@ip-172-31-76-58 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 1 items
drwxrwxrwt - arjiesaenz hadoop 0 2020-10-15 00:57 /user/hive/warehouse/aisles
So I tried to change my load script like this:
LOAD DATA INPATH '/user/hive/warehouse/aisles.csv' INTO TABLE aisles;
But I got an error:
Error while compiling statement: FAILED: SemanticException line 6:61 Invalid path ''/user/hive/warehouse/aisles.csv'': No files matching path hdfs://ip-172-31-76-58.ec2.internal:8020/user/hive/warehouse/aisles.csv
Hopefully someone can help me pinpoint the problem with my code.
Thanks.
I tried the same on my hadoop cluster. The code worked without any issues.
Here's my execution snippet:
hive> CREATE TABLE IF NOT EXISTS AISLES (aisles_id INT, aisles STRING)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE
> tblproperties("skip.header.line.count"="1");
OK
Time taken: 0.034 seconds
hive> load data inpath '/user/hirwuser1448/aisles.csv' into table AISLES;
Loading data to table revisit.aisles
Table revisit.aisles stats: [numFiles=1, totalSize=2603]
OK
Time taken: 0.183 seconds
hive> select * from AISLES limit 10;
OK
1 prepared soups salads
2 specialty cheeses
3 energy granola bars
4 instant foods
5 marinades meat preparation
6 other
7 packaged meat
8 bakery desserts
9 pasta sauce
10 kitchen supplies
Time taken: 0.038 seconds, Fetched: 10 row(s)
I think you need to cross-check whether your dataset aisles.csv is at the HDFS location and not stored in a local directory.
The problem is with your LOAD command:
LOAD DATA INPATH '/user/hive/warehouse/aisles.csv' INTO TABLE aisles;
I see you tried browsing the directory to see the saved file. Do you see aisles.csv under that directory? If the file is there, then you are giving the wrong path in your LOAD command; otherwise, the file isn't there at all.
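To illustrate, a rough sketch of that check (paths taken from the question; whether the file is only on the local disk is an assumption to verify):
# see whether aisles.csv is actually in HDFS
hdfs dfs -ls /user/hive/warehouse/
# if it only exists on the local filesystem, copy it into HDFS first
hdfs dfs -put /home/hadoop/aisles.csv /user/hive/warehouse/aisles.csv
After that, LOAD DATA INPATH '/user/hive/warehouse/aisles.csv' INTO TABLE aisles; can find the file. Alternatively, LOAD DATA LOCAL INPATH works directly against the local path when the Hive CLI runs on that node.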
I found a workaround: I downloaded the dataset, uploaded it to an Amazon S3 bucket, and used the S3 path in the LOAD command.
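For reference, a minimal sketch of that workaround; the bucket and key below are placeholders, not the actual ones used:
LOAD DATA INPATH 's3://your-bucket/instacart/aisles.csv' INTO TABLE aisles;
On EMR, Hive can typically read s3:// paths directly through EMRFS, so the LOAD statement looks the same as with an HDFS path.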

How to find recently updated values in hive without using time stamp

I have a table like
id name sal
1 Saa 45000
2 aaa 33000
after incremental load
id name sal
3 bbb 55000
How can I get only the recently updated values without a timestamp?
The easiest and most efficient way is using a partition. You can have a partitioned table and create a new partition every time you do the incremental load. This way the latest partition will only have the latest records.
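A minimal sketch of that approach, assuming the load date is used as the partition key (the table and column names here are invented for illustration):
-- partitioned table keyed by the load date
CREATE TABLE emp_part (id INT, name STRING, sal INT)
PARTITIONED BY (load_dt STRING);
-- each incremental load lands in its own partition
INSERT INTO TABLE emp_part PARTITION (load_dt = '2018-03-09')
SELECT id, name, sal FROM emp_staging;
-- the most recently loaded rows are then simply the newest partition
SELECT * FROM emp_part WHERE load_dt = '2018-03-09';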
Please note that very frequent incremental loads can lead to a lot of small partitions, which may not be an optimal data design.
There are a couple of other ways of doing this, but the right one purely depends on your use case, data rate, and volume.
Hope that helps!
Create a table.
hive> create table student(id int, name string);
OK
Time taken: 3.503 seconds
Insert one record into the table.
hive> insert into student values(1, 'first');
hive> select * from student;
OK
1 first
Time taken: 0.109 seconds, Fetched: 1 row(s)
Use the command below in the Hive terminal to find the location of the table, i.e. the metastore location of the student table.
hive> describe formatted student;
You should get the details as shown below.
# Detailed Table Information
Database: retaildb
Owner: root
CreateTime: Thu Mar 08 15:52:47 PST 2018
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student
Now check the content:
[root@quickstart cloudera]# hdfs dfs -ls hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student
Found 1 items
-rwxr-xr-x 1 root supergroup 8 2018-03-08 15:53 hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/000000_0
[root@quickstart cloudera]# hdfs dfs -cat hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/*
1first
Insert one more record and query the table again.
hive> insert into student values(1, 'second');
hive> select * from student;
OK
1 first
1 second
Time taken: 0.095 seconds, Fetched: 2 row(s)
Check the metastore location.
[root@quickstart cloudera]# hdfs dfs -ls hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/
Found 2 items
-rwxr-xr-x 1 root supergroup 8 2018-03-08 15:53 hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/000000_0
-rwxr-xr-x 1 root supergroup 9 2018-03-08 15:57 hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/000000_0_copy_1
[root@quickstart cloudera]# hdfs dfs -cat hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/000000_0_copy_1
1second
[root@quickstart cloudera]# hdfs dfs -cat hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/*
1first
1second

Why does querying an external hive table require write access to the hdfs directory?

I've hit an interesting permissions problem when setting up an external table to view some Avro files in Hive.
The Avro files are in this directory :
drwxr-xr-x - myserver hdfs 0 2017-01-03 16:29 /server/data/avrofiles/
The myserver account can write to this directory, but regular users cannot.
As the database admin, I create an external table in Hive referencing this directory:
hive> create external table test_table (data string) stored as avro location '/server/data/avrofiles';
Now as a regular user I try to query the table:
hive> select * from test_table limit 10;
FAILED: HiveException java.security.AccessControlException: Permission denied: user=regular.joe, access=WRITE, inode="/server/data/avrofiles":myserver:hdfs:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
Weird; I'm only trying to read the contents of the files using Hive, not write to them.
Oddly, I don't get the same problem when I partition the table like this:
As database_admin:
hive> create external table test_table_partitioned (data string) partitioned by (value string) stored as avro;
OK
Time taken: 0.104 seconds
hive> alter table test_table_partitioned add if not exists partition (value='myvalue') location '/server/data/avrofiles';
OK
As a regular user:
hive> select * from test_table_partitioned where value = 'some_value' limit 10;
OK
Can anyone explain this?
One interesting thing I noticed is that the Location values for the two tables are different and the directories have different permissions:
hive> describe formatted test_table;
Location: hdfs://server.companyname.com:8020/server/data/avrofiles
$ hadoop fs -ls /server/data/
drwxr-xr-x - myserver hdfs 0 2017-01-03 16:29 /server/data/avrofiles/
(regular users cannot write)
hive> describe formatted test_table_partitioned;
Location: hdfs://server.companyname.com:8020/apps/hive/warehouse/my-database.db/test_table_partitioned
$ hadoop fs -ls /apps/hive/warehouse/my-database.db/
drwxrwxrwx - database_admin hadoop 0 2017-01-04 14:04 /apps/hive/warehouse/my-database.db/test_table_partitioned
anyone can do anything :)

Unable to see created database in hive in specified location

I'm new to Hive.
I've created a database in Hive, and by default it is created in the Hive warehouse. When I run -ls against the Hive warehouse I am able to see the created database practice.db.
Query Used to Create Database:
create database practice
COMMENT 'Holds all practice tables';
I've created another database in Hive. When I run the -ls command against the path where I created the database, I am unable to see practice_first.db.
Query Used to Create Database:
create database practice_first
COMMENT 'Holds all practice tables'
LOCATION '/somepath in hdfs here';
I also checked the Hive warehouse; practice_first.db is not there either.
When I run show databases I am able to see the practice_first database in the list of databases.
Any suggestions on where Hive created practice_first.db?
Thanks in Advance.
I just tested to make sure it appears without any tables being added, and it does. Try using describe on the database after you create it to verify the path:
hive> create database test1 COMMENT 'testing' LOCATION '/user/testuser/test1.db';
OK
Time taken: 0.63 seconds
hive> describe database test1;
OK
test1 testing hdfs://nameserviceHA/user/testuser/test1.db testuser USER
Time taken: 0.183 seconds, Fetched: 1 row(s)
hive> exit;
$ hadoop fs -ls /user/testuser/ | grep test
drwxr-xr-x - testuser testgrp 0 2015-02-12 15:43 /user/testuser/test1.db
Doing the same with the default location will automatically name the directory with a .db extension. Also, it looks like the group in this case will be hive instead of the user's group:
hive> drop database test1;
OK
Time taken: 0.379 seconds
hive> create database test1 COMMENT 'testing';
OK
Time taken: 0.319 seconds
hive> describe database test1;
OK
test1 testing hdfs://nameserviceHA/user/hive/warehouse/test1.db testuser USER
Time taken: 0.263 seconds, Fetched: 1 row(s)
hive> exit;
$ hadoop fs -ls /user/hive/warehouse | grep test1
drwxrwxrwt - testuser hive 0 2015-02-12 15:53 /user/hive/warehouse/test1.db
It will be created at the path specified by LOCATION.
Check HIVE-1537.
You will be able to see a practice_first.db directory only if
1) you create the database in the default location, or
2) you explicitly name the directory practice_first.db in the LOCATION clause of the CREATE DATABASE statement.
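A small sketch of option 2, where the HDFS path is only an example:
CREATE DATABASE practice_first
COMMENT 'Holds all practice tables'
LOCATION '/user/hadoop/practice_first.db';
With an explicit LOCATION, Hive uses exactly the directory you name, so hdfs dfs -ls /user/hadoop/ will then show practice_first.db; with an arbitrary directory name, the database still lives at that path, just without a .db suffix to look for.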

Hive - Queries on Partitions return nothing

I have a table that is being partitioned by a specific start date (ds). I can query the latest partition (the previous day's data) and it will use the partition fine.
hive> select count(1) from vtc4 where ds='2012-11-01' ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 6.43 sec HDFS Read: 46281957 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 430 msec
OK
151225
Time taken: 35.007 seconds
However, when I try to query earlier partitions, hive seems to read the partition fine, but does not return any results.
hive> select count(1) from vtc4 where ds='2012-10-31' ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 7.64 sec HDFS Read: 37754168 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 640 msec
OK
0
Time taken: 29.07 seconds
However, if I tell hive to run the query against the date field inside the table itself, and don't use the partition, I get the correct result.
hive> select count(1) from vtc4 where date_started >= "2012-10-31 00:00:00" and date_started < "2012-11-01 00:00:00" ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 63 Reduce: 1 Cumulative CPU: 453.52 sec HDFS Read: 16420276606 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 7 minutes 33 seconds 520 msec
OK
123201
Time taken: 265.874 seconds
What am I missing here? I'm running hadoop 1.03 and hive 0.9. I'm pretty new to hive/hadoop, so any help would be appreciated.
Thanks.
EDIT 1:
hive> describe formatted vtc4 partition (ds='2012-10-31');
Partition Value: [2012-10-31 ]
Database: default
Table: vtc4
CreateTime: Wed Oct 31 12:02:24 PDT 2012
LastAccessTime: UNKNOWN
Protect Mode: None
Location: hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-10-31
Partition Parameters:
transient_lastDdlTime 1351875579
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.191 seconds
The partition folders exist, but when I try to do a hadoop fs -ls on hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-10-31 it says the file/directory does not exist. If I browse to that directory using the web interface, I can get into the folder, as well as see the /part-m-000* files. If I do an fs -ls on hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-11-01 it works fine.
Seems like either a permissions thing, or something funky with either Hive's or the namenode's metadata. Here's what I would try:
Copy the data in that partition to some other location in HDFS. You may need to do this as the hive or hdfs user, depending on how your permissions are set up.
alter table vtc4 drop partition (ds='2012-10-31');
alter table vtc4 add partition (ds='2012-10-31');
Copy the data back into that partition on HDFS.
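A rough sketch of those steps (the warehouse path is taken from the question; run the copies as whichever user has write access there):
# 1. copy the partition's data somewhere safe first
hdfs dfs -cp /user/hive/warehouse/vtc4/ds=2012-10-31 /tmp/vtc4_ds_2012-10-31_backup
# 2. drop and re-add the partition so the metastore entry is rebuilt
hive -e "alter table vtc4 drop partition (ds='2012-10-31'); alter table vtc4 add partition (ds='2012-10-31');"
# 3. copy the data back into the recreated partition directory
hdfs dfs -cp /tmp/vtc4_ds_2012-10-31_backup/* /user/hive/warehouse/vtc4/ds=2012-10-31/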
Another thing with Hive partitions is that they sometimes don't get registered in the metastore when created outside of Hive (e.g. from SparkSQL). You can also try MSCK REPAIR TABLE xc_bonus; after any changes to partitions so they are reflected correctly.
