Index Hive table data in Parquet format to Cloudera Search/Solr

In a Cloudera cluster with Kerberos enabled, I want to index data from a Hive table stored in Parquet format into Cloudera Search (Solr). What is the best way to achieve this? The data may be approximately 10-20 million rows.
I have found two ways so far:
1. Using the MapReduce indexing tool with morphlines for Parquet (it would be great to get some help here).
2. Using a custom Hive SerDe, https://github.com/lucidworks/hive-solr; I am not sure whether this works with newer Hive versions.
Are there any other mechanisms to index this data?

Approach 1 looks like a good fit to me, and it is in line with the Cloudera Search Guide (MapReduce Indexing).
I'm not sure whether ORC's native file indexes could be used here.

Related

Query Parquet data through Vertica (Vertica Hadoop Integration)

I have a Hadoop cluster with three nodes, and Vertica is co-located on the cluster. There are Parquet files (partitioned by Hive) on HDFS. My goal is to query those files using Vertica.
So far I have used the HDFS Connector: I create an external table in Vertica and link it to HDFS:
CREATE EXTERNAL TABLE tableName (columns)
AS COPY FROM 'hdfs://hostname/...../data' PARQUET;
Since the data is large, this method does not perform well.
I have done some research on Vertica Hadoop Integration.
I have tried HCatalog, but there is a configuration error on my Hadoop cluster, so that is not working.
My use case is to query the data with Vertica without changing its format on HDFS (Parquet). Any ideas on how to do that?
EDIT: The only reason Vertica performed slowly is that it couldn't use the Parquet partitions. Newer Vertica versions (8+) can utilize Hive's metadata, so HCatalog is not needed.
Terminology note: you're not using the HDFS Connector, which is good, as it's deprecated as of 8.0.1. You're using the direct interface described in Reading Hadoop Native File Formats, with libhdfs++ (the hdfs scheme) rather than WebHDFS (the webhdfs scheme). That's all good so far. (You can also use the HCatalog Connector, but you need to do some additional configuration and it will not be faster than an external table.)
Your Hadoop cluster has only 3 nodes and Vertica is co-located on them, so you should be getting the benefits of node locality automatically -- Vertica will use the nodes that have the data locally when planning queries.
You can improve query performance by partitioning and sorting the data so Vertica can use predicate pushdown, and also by compressing the Parquet files. You said you don't want to change the data so maybe these suggestions don't work for you; they're not specific to Vertica so they might be worth considering anyway. (If you're using other tools to interact with your Parquet data, they'll benefit from these changes too.) The documentation of these techniques was improved in 8.0.x (link is to 8.1 but this was in 8.0.x too).
Additional partitioning support was added in 8.0.1. It looks like you're using at least 8.0; I can't tell if you're using 8.0.1. If you are, you can create the external table to only pay attention to the partitions you care about with something like:
CREATE EXTERNAL TABLE t (id int, name varchar(50),
created date, region varchar(50))
AS COPY FROM 'hdfs:///path/*/*/*'
PARQUET(hive_partition_cols='created,region');
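For context on why the partition columns have to be named explicitly: Hive writes each partition as a directory called column=value, so those values live in the file path rather than inside the Parquet files. A minimal Python sketch of that naming convention follows. The example path and the helper name partition_values are made up for illustration, and this is not a description of how Vertica implements the feature.

```python
from pathlib import PurePosixPath


def partition_values(path, partition_cols):
    """Extract Hive-style partition values (col=value directories) from a path."""
    values = {}
    for part in PurePosixPath(path).parts:
        key, sep, value = part.partition("=")
        # Only directory names of the form col=value for a known column count.
        if sep and key in partition_cols:
            values[key] = value
    return values


# Hypothetical file under a table partitioned by created and region.
example = "/path/created=2017-01-01/region=emea/part-00000.parquet"
print(partition_values(example, ["created", "region"]))
```

A filter on region can then skip whole directories without opening the files in them, which is the saving that partition pruning aims for.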

Hive or Hbase when we need to pull more number of columns?

I have a data structure in Hadoop with 100 columns and a few hundred rows. Most of the time I need to query 65% of the columns. In this case, which is better to use: HBase or Hive? Please advise.
The number of columns you access is not, by itself, the criterion for choosing between HBase and Hive.
HIVE (SQL) :
Use Hive when you have warehousing needs, you are good at SQL, and you don't want to write MapReduce jobs. One important point, though: Hive queries get converted into corresponding MapReduce jobs under the hood, which run on your cluster and give you the result. Hive does the trick for you, but not every problem can be solved using HiveQL. Sometimes, if you need really fine-grained and complex processing, you might have to fall back on MapReduce.
Hbase (NoSQL database):
If you have data that you want to access in real time, you can store it in HBase.
An HBase get 'rowkey' is powerful when you know your access pattern.
HBase follows CP of the CAP theorem.
Consistency:
Every node in the system contains the same data (e.g. replicas are never out of date)
Availability:
Every request to a non-failing node in the system returns a response
Partition Tolerance:
System properties (consistency and/or availability) hold even when the system is partitioned (communication lost) and data is lost (node lost)
also have a look at this
It's very difficult to answer the question in one line.
HBase is a NoSQL database: you need to store your data denormalized, because HBase is very bad at joining tables.
Hive: You can store data in a similar (normalized) format in Hive, but you would only see benefits when doing batch processing.

Big data architecture: Elasticsearch-Kibana with Hive or Hadoop

I am dealing with a problem: I want to build a data visualization and prediction infrastructure.
I thought about Kibana + Elasticsearch on HDFS (with ES-Hadoop), and Spark (Python) on HDFS for modeling.
My question is: can I properly index data in HDFS with ES, or should I use Hive or Spark between Elasticsearch and HDFS?
I don't know which architecture is the best way to go.
ES-Hadoop will allow you to index data in HDFS directly with Elasticsearch. If you need to manipulate the data on its way from HDFS to ES, for example, performing lookups or filtering out data based on some criteria, you could use a tool like StreamSets Data Collector - see the blog post for a bit more detail.
Full disclosure - I'm the community champion at StreamSets.
If your question is about the performance difference between indexing from Hive and from Hadoop, there will not be any difference. Even with Hive, the data is stored in HDFS and can be accessed through external tables. How you want to use the indexes will determine your choice. Hive provides a structure on the data, and you can apply many built-in functions to operate on it.

how to manage modified data in Apache Hive

We are working on Cloudera CDH and trying to run reporting on data stored in Apache Hadoop. We send daily reports to the client, so we need to import data from the operational store into Hadoop daily.
Hadoop works in append-only mode, hence we cannot run Hive update/delete queries. We can perform INSERT OVERWRITE on dimension tables and add delta values to the fact tables, but introducing thousands of delta rows daily does not seem like a great solution.
Are there any better, standard ways to update modified data in Hadoop?
Thanks
HDFS might be append-only, but Hive has supported updates since 0.14 (on tables that support ACID transactions).
See here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update
A design pattern is to take all your previous and current data and insert it into a new table every time.
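The rebuild pattern above can be sketched in a few lines of Python: combine the previous snapshot with the day's delta and keep only the newest row per key. The column names id and updated_at and the sample rows are assumptions for illustration. In Hive the same idea would typically be expressed as a query that picks the latest row per key and writes the result into a fresh table.

```python
def rebuild_snapshot(previous_rows, delta_rows, key="id", version="updated_at"):
    """Return one row per key, keeping the row with the highest version value."""
    latest = {}
    for row in list(previous_rows) + list(delta_rows):
        current = latest.get(row[key])
        # Later rows win ties, so a delta row replaces an equally dated old row.
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


previous = [
    {"id": 1, "amount": 10, "updated_at": "2017-01-01"},
    {"id": 2, "amount": 20, "updated_at": "2017-01-01"},
]
delta = [
    {"id": 2, "amount": 25, "updated_at": "2017-01-02"},  # modified row
    {"id": 3, "amount": 30, "updated_at": "2017-01-02"},  # new row
]
for row in rebuild_snapshot(previous, delta):
    print(row)
```

The old table stays untouched until the new one is complete, so readers never see a half-applied update; the cost is rewriting all of the data on every run.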
Depending on your use case, have a look at Apache Impala, HBase, ... or even Drill.

How to get data from HDFS? Hive?

I am new to Hadoop. I ran a map reduce on my data and now I want to query it so I can put it into my website. Is Apache Hive the best way to do that? I would greatly appreciate any help.
Keep in mind that Hive is a batch processing system, which under the hood converts SQL statements into a series of MapReduce jobs with intermediate stages in between. Hive is also a high-latency system: depending on your dataset sizes, you are looking at minutes to hours, or even days, to process a complicated query.
So, if you want to serve the results of your MapReduce job output on your website, it's highly recommended that you export the results back to an RDBMS using Sqoop and then take it from there.
Or, if the data itself is huge and cannot be exported back to an RDBMS, another option you could consider is a NoSQL system like HBase.
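To make the "export, then serve from a database" idea concrete, here is a minimal Python sketch. It uses an in-memory SQLite database purely as a stand-in for the RDBMS, and a hard-coded string as a stand-in for the exported job output; the table name, column names, and sample words are all made up for illustration. In a real setup Sqoop would do the export step.

```python
import csv
import io
import sqlite3

# Stand-in for the MapReduce output; in practice this comes from HDFS.
mapreduce_output = "hello,3\nworld,5\nhadoop,2\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wordcount (word TEXT PRIMARY KEY, occurrences INTEGER)")
rows = [(word, int(n)) for word, n in csv.reader(io.StringIO(mapreduce_output))]
conn.executemany("INSERT INTO wordcount (word, occurrences) VALUES (?, ?)", rows)
conn.commit()


def lookup(word):
    """Return the stored count for a word, or None if the word is absent."""
    row = conn.execute(
        "SELECT occurrences FROM wordcount WHERE word = ?", (word,)
    ).fetchone()
    return row[0] if row else None


print(lookup("world"))
```

The website then only ever issues small indexed lookups like this one, and never waits on a batch job.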
Welcome to Hadoop!
I highly recommend you watch Cloudera Essentials for Apache Hadoop | Chapter 5: The Hadoop Ecosystem and familiarize yourself with the different ways to transfer data inbound and outbound from your HDFS cluster. The video is easy-to-watch and describes advantages / disadvantages to each tool, but this outline should give you the basics of the Hadoop Ecosystem:
Flume - Data integration and import of flat files into HDFS. Designed for asynchronous data streams (e.g., log files). Distributed, scalable, and extensible. Supports various endpoints. Allows preprocessing on data before loading to HDFS.
Sqoop - Bidirectional transfer of structured data (RDBMS) and HDFS. Permits incremental import to HDFS. RDBMS must support JDBC or ODBC.
Hive - SQL-like interface to Hadoop. Requires table structure. JDBC and/or ODBC is required.
HBase - Allows interactive access to HDFS. Sits on top of HDFS and applies structure to data. Allows for random reads, scales horizontally with cluster. Not a full query language; only permits get/put/scan operations (can be used with Hive and/or Impala). Row-key indexes only on data. Does not use Map Reduce paradigm.
Impala - Similar to Hive, high-performance SQL Engine for querying vast amounts of data stored in HDFS. Does not use Map Reduce. Good alternative to Hive.
Pig - Data flow language for transforming large datasets. Permits schema optionally defined at runtime. PigServer (Java API) permits programmatic access.
Note: I assume the data you are trying to read already exists in HDFS. However, some of the products in the Hadoop ecosystem may be useful for your application or as a general reference, so I included them.
If you're only looking to get data from HDFS then yes, you can do so via Hive.
However, you'll benefit most from it if your data is already organized (for instance, in columns).
Let's take an example: your MapReduce job produced a CSV file named wordcount.csv containing two columns, word and count. This CSV file is on HDFS.
Let's now suppose you want to know the number of occurrences of the word "gloubiboulga". You can achieve this with the following code:
-- Table matching the two CSV columns
CREATE TABLE data
(
word STRING,
count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- The file is already on HDFS, so LOCAL is not used
LOAD DATA INPATH '/wordcount.csv'
OVERWRITE INTO TABLE data;
SELECT word, count FROM data WHERE word = 'gloubiboulga';
Please note that while this language looks a lot like SQL, you'll still have to learn a few things about it.

Resources