Index HBase data to Solr via Hive external table - hadoop

I have crawled some data with Nutch 2.3.1. The data is stored in an HBase 0.98 table. I have created an external Hive table that imports data from the HBase table. Now I have to index this data into Solr 4.10.3. For that I have followed this well-known tutorial. I have created the Hive table like
create external table if not exists solr_items (
id STRING,
content STRING,
url STRING,
title STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
stored by "com.chimpler.hive.solr.SolrStorageHandler"
with serdeproperties ("solr.column.mapping"="id,content,url,title")
tblproperties ("solr.url" = "http://localhost:8983/solr/collection1") ;
There was a problem when I tried to copy data from HBase, posted here. So I decided to first index some dummy data. For that I decided to load data from a file like
LOAD DATA LOCAL INPATH 'data.csv3' OVERWRITE INTO TABLE solr_items;
But it gave following error
FAILED: SemanticException [Error 10101]: A non-native table cannot be used as target for LOAD
Where is the problem?
Hadoop version is 1.2.1.

You can't use LOAD DATA on non-native tables (tables defined with a storage handler). From the Hive LanguageManual DML:
Hive does not do any transformation while loading data into tables.
Load operations are currently pure copy/move operations that move
datafiles into locations corresponding to Hive tables.
Hive obviously can't just copy data files in the case of a Solr-backed table because Solr uses its own internal data representation.
You can insert though:
insert into table solr_items select * from tempTable;
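For the dummy-data case, a minimal sketch of that workflow might look like the following; the staging table name solr_items_staging is an assumption, and the columns and file path mirror the question.
-- Native staging table with the same columns as the Solr-backed table (hypothetical name)
create table if not exists solr_items_staging (
id STRING,
content STRING,
url STRING,
title STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
-- LOAD works here because the staging table is a native Hive table
LOAD DATA LOCAL INPATH 'data.csv3' OVERWRITE INTO TABLE solr_items_staging;
-- Push the rows into Solr through the storage handler
insert into table solr_items select id, content, url, title from solr_items_staging;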

Related

Load HBase table from Hive

I am trying to load an HBase table from a Hive table. For that I am using the following approach, and it works fine if I have only a single column family in the HBase table; however, if I have multiple families it throws an error.
Approach
Source table
CREATE EXTERNAL TABLE temp.employee_orc(id String, name String, Age int)
STORED AS ORC
LOCATION '/tmp/employee_orc/table';
Create Hive table with HBase SerDe
CREATE TABLE temp.employee_hbase(id String, name String, age int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,emp:name,emp:Age')
TBLPROPERTIES("hbase.table.name" = "bda:employee_hbase", "hfile.family.path"="/tmp/employee_hbase/emp", "hive.hbase.generatehfiles"="true");
Export the HFiles
SET hive.hbase.generatehfiles=true;
INSERT OVERWRITE TABLE temp.employee_hbase SELECT DISTINCT id, name, Age FROM temp.employee_orc CLUSTER BY id;
Load the HBase table
export HADOOP_CLASSPATH=`hbase classpath`
hadoop jar /usr/hdp/current/hbase-client/lib/hbase-server.jar completebulkload /tmp/employee_hbase/ 'bda:employee_hbase'
Error
I am getting the following error if I have multiple column families in the HBase table:
java.lang.RuntimeException: Hive Runtime Error while closing operators: java.io.IOException: Multiple family directories found in hdfs://hadoopdev/apps/hive/warehouse/temp.db/employee_hbase/_temporary/0/_temporary/attempt_1527799542731_1180_r_000000_0
Is there another way to load the HBase table, if not this approach?
When bulk loading from Hive to HBase this way, the target table can only have a single column family.
You can use HBase's own bulk load, which supports multiple column families, or you can use a separate Hive table for each column family.
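As a rough sketch of the one-Hive-table-per-column-family route (the second family dept and its column are invented for illustration; everything else follows the DDL from the question):
-- One Hive table per HBase column family, each with its own HFile output path
CREATE TABLE temp.employee_hbase_emp(id String, name String, age int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,emp:name,emp:Age')
TBLPROPERTIES("hbase.table.name" = "bda:employee_hbase", "hfile.family.path"="/tmp/employee_hbase/emp", "hive.hbase.generatehfiles"="true");
-- Hypothetical second family
CREATE TABLE temp.employee_hbase_dept(id String, dept_name String)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,dept:name')
TBLPROPERTIES("hbase.table.name" = "bda:employee_hbase", "hfile.family.path"="/tmp/employee_hbase/dept", "hive.hbase.generatehfiles"="true");
Each INSERT OVERWRITE then writes HFiles for a single family, and one completebulkload run over /tmp/employee_hbase/ should pick up both family directories.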

Unable to partition Hive table backed by HDFS

Maybe this is an easy question but I am having a difficult time resolving the issue. At this time, I have a pseudo-distributed HDFS that contains recordings encoded using protobuf 3.0.0. Then, using Elephant-Bird/Hive, I am able to put that data into Hive tables to query. The problem that I am having is partitioning the data.
This is the table create statement that I am using
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE;
The table is created and I do not receive any runtime errors when I query the table.
When I attempt to load data as follows:
ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'
I receive an "OK" statement. However, when I query the table:
select * from test_messages limit 1;
I receive the following error:
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
I have been reading up on Hive tables and have seen that the partition columns do not need to be part of the data being loaded. The reason I am trying to partition by date is partly for performance but, more so, because the "LOAD DATA ..." statements move the files between directories in HDFS.
P.S. I have confirmed that I am able to run queries against the Hive table without partitioning.
Any thoughts?
I see that you have created an EXTERNAL TABLE. So you cannot add or drop partitions using Hive; you need to create the folder using HDFS, MapReduce, or Spark. An EXTERNAL table can only be read by Hive but is not managed by Hive. You can check the HDFS location '/test/dt=20171117' and you will see that the folder has not been created.
My suggestion is to create the folder (partition) using "hadoop fs -mkdir '/test/20171117'" and then try to query the table. Although it will return 0 rows, you can add data to that folder and read it from Hive.
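A short shell sketch of that suggestion (the local file name is a placeholder):
# Create the directory the ADD PARTITION statement points at, then copy a data file into it
hadoop fs -mkdir /test/20171117
hadoop fs -put /local/path/recordings.seq /test/20171117/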
You need to specify a LOCATION for an EXTERNAL TABLE
CREATE EXTERNAL TABLE
...
LOCATION '/test';
Then, is the data actually a sequence file? All you've said is that it's protobuf data. I'm not sure how the Elephant-Bird library works, but you'll want to double-check that.
Then, your table locations need to look like /test/dt=value in order for Hive to read them.
After you create an external table over an HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore.
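Putting those two points together, a hedged sketch (the DDL is the question's own, with an added LOCATION; the partition directory would need to be renamed to the dt=value form):
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties ("serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE
LOCATION '/test';
-- with the data sitting in /test/dt=20171117/, register the partition(s)
MSCK REPAIR TABLE test_messages;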

How to create a Hive table on top of ORC format data?

I have source data in ORC format on HDFS.
I created an external Hive table on top of the HDFS data with the command below. I am using Hive version 1.2.1.
CREATE EXTERNAL TABLE IF NOT EXISTS test (table_columns ...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001'
STORED AS orc
LOCATION 'path'
TBLPROPERTIES ("orc.compress"="SNAPPY");
But while selecting data from the table I am getting this exception:
"protobuf.InvalidProtocolBufferException: Protocol message was too large"
Please help me to resolve this issue.
Thanks.

How to preprocess the data and load it into Hive

I completed my Hadoop course and now I want to work on Hadoop. I want to know the workflow from data ingestion to visualizing the data.
I am aware of how the ecosystem components work and I have built a Hadoop cluster with 8 datanodes and 1 namenode:
1 namenode -- ResourceManager, NameNode, SecondaryNameNode, Hive
8 datanodes -- DataNode, NodeManager
I want to know the following things:
I got the data as .tar files of structured data, and the first 4 lines of each file contain a description. I am a little bit confused about how to process this type of data.
1.a Can I directly process the data, since these are tar files? If yes, how do I remove the data in the first four lines, or do I need to untar the files and remove the first 4 lines?
1.b I want to process this data using Hive.
Please suggest how to do that.
Thanks in advance.
Can I directly process the data as these are tar files?
Yes, see the solution below.
If yes, how do I remove the data in the first four lines?
Starting with Hive v0.13.0, there is a table property, tblproperties ("skip.header.line.count"="1"), that you set when creating a table to tell Hive the number of header rows to ignore. To ignore the first four lines, use tblproperties ("skip.header.line.count"="4").
-- the header-skipping property goes on the table the raw file is loaded into
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
tblproperties("skip.header.line.count"="4");
CREATE TABLE raw_sequence (line STRING)
STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/tmp/test.tar' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
To view the data:
select * from raw_sequence
Reference: Compressed Data Storage
Follow the below steps to achieve your goal:
Copy the data (i.e., the tar file) to the client system where Hadoop is installed.
Untar the file, manually remove the description lines, and save the result locally (a shell sketch of these steps follows this answer).
Create the metadata (i.e., the table) in Hive based on the description.
E.g., if the description contains emp_id, emp_no, etc., then create the table in Hive using this information, and also make note of the field separator used in the data file and use the corresponding field separator in the create table query. Assuming the file contains two columns separated by a comma, below is the syntax to create the table in Hive.
Create table tablename (emp_id int, emp_no int)
Row Format Delimited
Fields Terminated by ','
Since the data is in a structured format, you can load it into the Hive table using the command below.
LOAD DATA LOCAL INPATH '/LOCALFILEPATH' INTO TABLE TABLENAME;
Now the local data will be moved to HDFS and loaded into the Hive table.
Finally, you can query the Hive table with SELECT * FROM TABLENAME;
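A sketch of steps 1-2 above followed by the load; the tar name, extracted file name, and column names are placeholders matching the example table.
# Untar locally and strip the 4-line description header (file names are placeholders)
tar -xf data.tar
tail -n +5 extracted_data.csv > employee_clean.csv
Then in Hive:
CREATE TABLE employee (emp_id int, emp_no int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/path/to/employee_clean.csv' INTO TABLE employee;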

Data in HDFS files not seen under Hive table

I have to create a Hive table from data present in Oracle tables.
I'm doing a Sqoop import, thereby converting the Oracle data into HDFS files, and then creating a Hive table on top of those HDFS files.
The Sqoop job completes successfully and the files get generated in the HDFS target directory.
Then I run the create table script in Hive. The table gets created, but it is an empty table; no data is seen in the Hive table.
Has anyone faced a similar problem?
Hive's default delimiter is Ctrl-A ('\001'); if you don't specify a delimiter, it will use the default one. Add the line below to your Hive script.
row format delimited fields terminated by '\t'
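For example, a minimal sketch (table and column names are invented, and the tab delimiter is an assumption; it must match whatever delimiter the Sqoop import actually wrote):
-- hypothetical table over the Sqoop output directory
CREATE EXTERNAL TABLE my_table (
id INT,
name STRING
)
row format delimited fields terminated by '\t'
LOCATION '/user/hive/sqoop_target_dir';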
Your Hive script and your expectation are wrong. You are trying to create a partitioned table directly on top of the data you have already imported; partitions won't work that way. If your table had no partitions, you would be able to see the data.
Basically, if you want a partitioned table, you can't create it on the underlying data as you have tried above. If you want Hive partitions, load the data from an intermediate table (or from that Sqoop directory) into your partitioned table.
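A hedged sketch of that pattern, with all table and column names invented for illustration:
-- Intermediate (non-partitioned) table over the Sqoop output directory
CREATE EXTERNAL TABLE staging_orders (id INT, amount DOUBLE, order_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hive/sqoop_target_dir';
-- Partitioned target table
CREATE TABLE orders_part (id INT, amount DOUBLE)
PARTITIONED BY (order_date STRING);
-- Load the partitions from the intermediate table
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE orders_part PARTITION (order_date)
SELECT id, amount, order_date FROM staging_orders;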
