delete duplicates using pig where there is no primary key - hadoop

I'm new to Hadoop and I have a use case with three columns: name, value, timestamp. The data is comma separated and in CSV format. I need to check for duplicates and delete them using Pig. How can I achieve that?

You can use Pig's DISTINCT operator to remove duplicates.
See the Pig documentation on DISTINCT for details.
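For example, a minimal sketch (the file path and field names are assumptions taken from the question):
-- Load the comma-separated file
records = LOAD '/user/hdfs/dirtodata/MyData.csv' USING PigStorage(',')
          AS (name:chararray, value:chararray, timestamp:chararray);
-- DISTINCT keeps exactly one copy of each fully identical row
deduped = DISTINCT records;
-- Write the de-duplicated data back to HDFS
STORE deduped INTO '/user/hdfs/dirtodata/MyData_dedup' USING PigStorage(',');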
As you say that your data resides in a Hive table and you want to access that data through Pig, you can use HCatLoader() to read the Hive table from Pig. HCatalog can be used for both external and internal (managed) Hive tables. But before using it, verify that HCatalog is configured on your cluster; if you are using a Hadoop 2.x distribution it should already be there.
Using HCatalog, your Pig LOAD statement will look like this:
A = LOAD 'table_name' USING org.apache.hive.hcatalog.pig.HCatLoader();
If you don't want to use HCatalog, your Hive tables are external tables, and you know the HDFS location of the data, then you can use CSVLoader() to access the data. Using CSVLoader(), your Pig LOAD statement will look like this:
REGISTER piggybank.jar
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
--Load data using CSVLoader.
A = LOAD '/user/hdfs/dirtodata/MyData.csv' using CSVLoader AS (
    name:chararray, value:chararray, timestamp:chararray
);
Hive external tables are designed so that the data can be accessed from outside Hive, for example from Pig or MapReduce programs. If, on the other hand, your Hive table is an internal (managed) table and you want to analyze the data using Pig, then you can use HCatLoader() to access the table data through Pig.
In both scenarios the original data is not affected during the analysis: you are only reading the data, not modifying it.
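Putting it together for the de-duplication use case, a sketch that reads a Hive table, removes duplicates, and writes the result into another Hive table (the table names are assumptions, the target table must already exist, and the script has to be run with pig -useHCatalog):
A = LOAD 'source_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
B = DISTINCT A;
STORE B INTO 'deduped_table' USING org.apache.hive.hcatalog.pig.HCatStorer();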
Please refer to the links below to understand more about HCatalog.
http://hortonworks.com/hadoop-tutorial/how-to-use-hcatalog-basic-pig-hive-commands/
https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat

Related

Main purpose of the MetaStore in Hive?

I am a little confused about the purpose of the MetaStore. When you create a table in Hive:
CREATE TABLE <table_name> (column1 data_type, column2 data_type);
LOAD DATA INPATH '<HDFS_file_location>' INTO TABLE managed_table;
So I know this command takes the contents of the file in HDFS, creates a metadata form of it, and stores it in the MetaStore (including column types, column names, the location of the file in HDFS, etc.). It doesn't actually move the data from HDFS into Hive.
But what is the purpose of storing this MetaData?
When I connect to Hive using Spark SQL, for example, the MetaStore doesn't contain the actual data in HDFS, just the metadata. So is the MetaStore simply used by Hive for the parsing and compilation steps of the HiveQL query and to create the MapReduce jobs?
The metastore stores schema (table definitions including location in HDFS, SerDe, columns, comments, types, partition definitions, views, access permissions, etc.) and statistics. There is no such operation as moving data from HDFS into Hive, because Hive table data is stored in HDFS (or another compatible filesystem like S3). You can define a new table, or even several tables, on top of some location in HDFS and put files in it. You can change an existing table location or partition location; all of this information is stored in the metastore, so Hive knows how to access the data. A table is a logical object defined in the metastore, and the data itself is just files in some location in HDFS.
See also this answer about Hive query execution flow(high level): https://stackoverflow.com/a/45587873/2700344
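To make this concrete, a minimal sketch of defining a table purely as metadata over files that already sit in HDFS (the location, table, and column names are assumptions):
-- Nothing is read or moved here; only table metadata is written to the metastore.
CREATE EXTERNAL TABLE visits_raw (
  id  INT,
  url STRING,
  ref STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/visits';
Dropping such an external table later removes only the metastore entry; the files under /data/visits are left untouched.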
Hive performs schema-on-read operations, which means that for the data to be processed in some structured manner (i.e. as a table-like object), the layout of that data needs to be described in a relational structure.
Regarding "takes the contents of the file in HDFS and creates a MetaData form of it": as far as I know, no files are actually read when you create a table.
Spark SQL connects to the metastore directly. Both Spark and HiveServer have their own query parsers; parsing is not part of the metastore. MapReduce/Tez/Spark jobs are also not handled by the metastore. It is just a relational database. If it is MySQL, PostgreSQL, or Oracle, you can easily connect to it and inspect the contents. By default, both Hive and Spark use an embedded Derby database.

How to encrypt data in hdfs, and then create hive or impala table to query it?

Recently, I came across a situation:
There is a data file in a remote HDFS. We need to encrypt the data file and then create an Impala table to query the data in our local HDFS system. I don't know how Impala can query an encrypted data file, or how to solve this.
It can be done by creating a User Defined Function (UDF) in Hive. You can write the UDF by implementing Hive's UDF interface, then build a jar from your UDF class and make it available to Hive (for example by putting it in Hive's lib directory or adding it with ADD JAR).
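As a rough sketch of the registration side in Hive (the jar path, class name, function name, and table/column names are hypothetical; the actual decryption logic lives in your UDF class):
ADD JAR /path/to/my-crypto-udf.jar;
CREATE TEMPORARY FUNCTION decrypt_col AS 'com.example.hive.udf.DecryptUDF';
SELECT decrypt_col(encrypted_column) FROM encrypted_table;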

How to insert data from file into HBase table?

I made example.nt, which looks like below:
1 "aaaaa1" "bbbbb1" "ccccc1"
2 "aaaaa2" "bbbbb2" "ccccc2"
3 "aaaaa3" "bbbbb3" "ccccc3"
.......
I want to insert this data into an HBase table which consists of
(key int,subject string,predicate string,object string)
(:key,cf1:val1,cf1:val2,cf1:val3)
I want to perform this insert from the HBase shell.
How can I do this?
The HBase shell is not designed for this purpose; it only lets you insert data into HBase row by row with put commands.
Instead, you can use the importtsv tool, which allows you to import text data directly into HBase.
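For reference, creating the table and inserting a single row from the HBase shell would look roughly like this (the table and column family names are the ones assumed below; repeating such puts for every row is exactly what importtsv saves you from):
create 'so_table', 'cf1'
put 'so_table', '1', 'cf1:val1', 'aaaaa1'
put 'so_table', '1', 'cf1:val2', 'bbbbb1'
put 'so_table', '1', 'cf1:val3', 'ccccc1'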
Assuming you have already created the HBase table so_table with one column family cf1, and your example.nt file is in the /tmp/example/ directory in HDFS, you can use it in the following way:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf1:val1,cf1:val2,cf1:val3 so_table /tmp/example/
You may need to add an option to change the column separator:
-Dimporttsv.separator=';'
Note that this way the data is inserted into HBase directly, via many put operations. There is another way to use the importtsv tool that is better suited to bulk loading large amounts of input data: generate StoreFiles (HFiles) first and then load them into HBase in one go with the completebulkload tool:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.bulk.output=/tmp/example_output -Dimporttsv.columns=HBASE_ROW_KEY,cf1:val1,cf1:val2,cf1:val3 so_table /tmp/example/
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/example_output so_table
You can read the official documentation of this tool: https://hbase.apache.org/book.html#_importtsv

storing pig output into Hive table in a single instance

I would like to insert the Pig output into Hive tables (the tables in Hive are already created with the exact schema). I just need to insert the output values into those tables. I don't want to use the usual method of first storing into a file, then reading that file from Hive and inserting into the tables. I want to avoid that extra hop.
Is it possible? If so, please tell me how this can be done.
Thanks
OK. Create an external Hive table whose location points to a directory in HDFS, let's say:
create external table emp_records(id int,
name String,
city String)
row format delimited
fields terminated by '|'
location '/user/cloudera/outputfiles/usecase1';
Just create a table like the one above; there is no need to load any file into that directory.
Now write a Pig script that reads data from some input directory, and when you store the output of that Pig script, use the following:
A = LOAD 'inputfile.txt' USING PigStorage(',') AS(id:int,name:chararray,city:chararray);
B = FILTER A BY id >= 678933;
C = FOREACH B GENERATE id,name,city;
STORE C INTO '/user/cloudera/outputfiles/usecase1' USING PigStorage('|');
Ensure that the destination location, the delimiter, and the schema layout of the final FOREACH statement in your Pig script match the Hive DDL schema.
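Once the Pig STORE finishes, the rows are immediately visible to Hive because the external table points at that location; a quick sanity check from the Hive shell might look like:
select * from emp_records limit 10;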
There are two approaches, explained below with an 'Employee' table example, for storing Pig output into a Hive table. (The prerequisite is that the Hive table should already be created.)
A = LOAD 'EMPLOYEE.txt' USING PigStorage(',') AS(EMP_NUM:int,EMP_NAME:chararray,EMP_PHONE:int);
Approach 1: Using Hcatalog
-- dump pig result to Hive using HCatalog
STORE A INTO 'Empdb.employee' USING org.apache.hive.hcatalog.pig.HCatStorer();
(or)
Approach 2: Using HDFS physical location
-- dump pig result to the external hive warehouse location
STORE A INTO 'hdfs://<<nmhost>>:<<port>>/user/hive/warehouse/Empdb/employee/' USING PigStorage(',');
You can store it using HCatalog:
STORE D INTO 'tablename' USING org.apache.hive.hcatalog.pig.HCatStorer();
See the link below:
https://acadgild.com/blog/loading-and-storing-hive-data-into-pig
The best way is to use HCatalog to write the data into the Hive table.
STORE final_data INTO 'Hive_table_name' using org.apache.hive.hcatalog.pig.HCatStorer();
But before storing the data, make sure the columns in the 'final_data' relation are matched and mapped exactly to the schema of the table.
And run your Pig script like this:
pig -useHCatalog script.pig

Access Hive Data from Java

I need to access the data in Hive from Java. According to the documentation for the Hive JDBC driver, the current JDBC driver can only be used to query data from the default database of Hive.
Is there a way to access data from a Hive database other than the default one, through Java?
For example, say you have a Hive table:
create table visit (
id int,
url string,
ref string
)
partitioned by (date string)
Then you can use the statement
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM visit WHERE date='2013-05-15';
to dump the data to HDFS and then write a MapReduce job to process it. Or you can use the statement
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hdfs_out' SELECT * FROM visit WHERE date='2013-05-15';
to dump the data to the local file system and write a normal Java program to process it.
The JDBC documentation is found in the Hive Confluence documentation. To use the JDBC driver you need access to a running Hive server (HiveServer2).
But there are other possibilities to access the data... it all depends on your setup. You could, for example, also use Spark, assuming the Hive and Hadoop configs are set appropriately.
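A minimal JDBC sketch, assuming HiveServer2 is reachable and that the database is named directly in the connection URL (the host, port, database name, and credentials are placeholders; the visit table is the one from the example above):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the database name goes at the end of the URL
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver-host:10000/mydb", "user", "password");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, url FROM visit LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getInt("id") + "\t" + rs.getString("url"));
            }
        }
    }
}
Fully qualified names such as mydb.visit also work in the query itself if you prefer to stay on the default database.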
