Moving clustered data from HDFS to Hive - hadoop

I have been experimenting with Mahout in the Cloudera demo VM and have successfully clustered the sample synthetic control data (https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html) using the k-Means algorithm. I have used ClusterDumper and can view the Mahout output, but now I want to put the output into a Hive table. How would I go about doing this?

There is no direct integration. Your best bet is to modify ClusterDumper to produce some kind of textual representation that can be imported into Hive as tabular data.

Create an external table in Hive that points to the Mahout output path.
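A minimal sketch of the external-table approach follows. Mahout's k-Means job writes Hadoop SequenceFiles, so this assumes the clustered points have first been exported as tab-delimited text, one point per line, as the previous answer suggests. The table name, column names, and HDFS path are hypothetical placeholders, not taken from the question.

```sql
-- Hypothetical layout: one line per point, "cluster_id<TAB>point"
CREATE EXTERNAL TABLE IF NOT EXISTS kmeans_clusters (
  cluster_id INT,     -- cluster the point was assigned to
  point      STRING   -- the point's vector, kept as raw text
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/cloudera/mahout/clusters-text';  -- hypothetical path

-- Example query: number of points per cluster
SELECT cluster_id, COUNT(*) AS num_points
FROM kmeans_clusters
GROUP BY cluster_id;
```

Because the table is external, dropping it later removes only the Hive metadata and leaves the files in place.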

Related

Index Hive table data in Parquet format to Cloudera Search/ Solr

In a Cloudera cluster with Kerberos enabled, I want to index data from a Hive table stored in Parquet format into Cloudera Search (Solr). What is the best way to achieve this? The data may be approximately 10-20 million.
I have found two ways so far:
1. Using the MapReduce indexing tool and morphlines for Parquet (it would be great if I could get some help here).
2. Using a custom Hive SerDe, https://github.com/lucidworks/hive-solr; I am not sure whether this will work on higher Hive versions.
The first approach seems good to me and is in line with the Cloudera Search Guide - MapReduce Indexing.
Are there any other mechanisms to index this data?
I am not sure whether it would be possible to use the ORC file format's native indexes.

Query Parquet data through Vertica (Vertica Hadoop Integration)

I have a Hadoop cluster with three nodes, and Vertica is co-located on the cluster. There are Parquet files (partitioned by Hive) on HDFS. My goal is to query those files using Vertica.
So far I have used the HDFS Connector: basically, I create an external table in Vertica and link it to HDFS:
CREATE EXTERNAL TABLE tableName (columns)
AS COPY FROM 'hdfs://hostname/...../data' PARQUET;
Since the data size is big, this method does not achieve good performance.
I have done some research (Vertica Hadoop Integration).
I have tried HCatalog, but there is a configuration error on my Hadoop cluster, so that is not working.
My use case is to leave the data format on HDFS (Parquet) unchanged while querying it with Vertica. Any ideas on how to do that?
EDIT: The only reason Vertica performed slowly is that it could not use Parquet's partitions. Higher versions of Vertica (8+) can now utilize Hive's metadata, so HCatalog is not needed.
Terminology note: you're not using the HDFS Connector. Which is good, as it's deprecated as of 8.0.1. You're using the direct interface described in Reading Hadoop Native File Formats, with libhdfs++ (the hdfs scheme) rather than WebHDFS (the webhdfs scheme). That's all good so far. (You can also use the HCatalog Connector, but you need to do some additional configuration and it will not be faster than an external table.)
Your Hadoop cluster has only 3 nodes and Vertica is co-located on them, so you should be getting the benefits of node locality automatically -- Vertica will use the nodes that have the data locally when planning queries.
You can improve query performance by partitioning and sorting the data so Vertica can use predicate pushdown, and also by compressing the Parquet files. You said you don't want to change the data so maybe these suggestions don't work for you; they're not specific to Vertica so they might be worth considering anyway. (If you're using other tools to interact with your Parquet data, they'll benefit from these changes too.) The documentation of these techniques was improved in 8.0.x (link is to 8.1 but this was in 8.0.x too).
Additional partitioning support was added in 8.0.1. It looks like you're using at least 8.0; I can't tell if you're using 8.0.1. If you are, you can create the external table to only pay attention to the partitions you care about with something like:
CREATE EXTERNAL TABLE t (id int, name varchar(50),
created date, region varchar(50))
AS COPY FROM 'hdfs:///path/*/*/*'
PARQUET(hive_partition_cols='created,region');

Hdfs and Hbase: how it works?

Hi everybody
I'm quite new to big data. I have installed an HDFS + HBase test database, and I use Talend Big Data (an ETL tool) to run my tests.
I would like to know: if I put a file directly into HDFS, without going through HBase, can I never query that data? I mean, do I have to read the entire file if I want to filter for the data I am interested in?
Thanks a lot for any help !
HDFS is just a distributed file system; you cannot query your files without going through an intermediate component.
HBase is a NoSQL database that persists your data on HDFS; use it when you need random access to your data.
If you want to store your files on HDFS as they are and query them, you can create an external table over them using Hive.
The best option is to use Hive on top of the files that are on HDFS. You can use bucketing and partitioning in Hive to improve performance.
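As a rough illustration of the partitioning idea, the sketch below defines an external table over files that stay where they are on HDFS. It assumes comma-delimited text files laid out in one directory per day. The table name, columns, and paths are all hypothetical.

```sql
-- Hypothetical table over existing comma-delimited files on HDFS
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  user_id STRING,
  url     STRING
)
PARTITIONED BY (log_date STRING)   -- one subdirectory per day
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/web_logs';         -- hypothetical path

-- Register an existing directory as a partition
ALTER TABLE web_logs ADD IF NOT EXISTS
  PARTITION (log_date = '2015-01-01')
  LOCATION '/data/web_logs/log_date=2015-01-01';

-- Filtering on the partition column lets Hive read only that directory
SELECT url, COUNT(*) AS hits
FROM web_logs
WHERE log_date = '2015-01-01'
GROUP BY url;
```

With this layout, a query that filters on log_date does not scan the whole data set, which addresses the concern about having to read every file.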

How to get data from HDFS? Hive?

I am new to Hadoop. I ran a map reduce on my data and now I want to query it so I can put it into my website. Is Apache Hive the best way to do that? I would greatly appreciate any help.
Keep in mind that Hive is a batch processing system, which under the hood converts SQL statements into a series of MapReduce jobs with stage builds in between. Hive is also a high-latency system: depending on your dataset sizes, a complicated query can take minutes, hours, or even days to process.
So, if you want to serve the results of your MapReduce job on your website, it's highly recommended that you export the results back to an RDBMS using Sqoop and take it from there.
Or, if the data itself is huge and cannot be exported back to an RDBMS, another option is to use a NoSQL system like HBase.
Welcome to Hadoop!
I highly recommend you watch Cloudera Essentials for Apache Hadoop | Chapter 5: The Hadoop Ecosystem and familiarize yourself with the different ways to transfer data inbound and outbound from your HDFS cluster. The video is easy-to-watch and describes advantages / disadvantages to each tool, but this outline should give you the basics of the Hadoop Ecosystem:
Flume - Data integration and import of flat files into HDFS. Designed for asynchronous data streams (e.g., log files). Distributed, scalable, and extensible. Supports various endpoints. Allows preprocessing on data before loading to HDFS.
Sqoop - Bidirectional transfer of structured data between an RDBMS and HDFS. Permits incremental import to HDFS. RDBMS must support JDBC or ODBC.
Hive - SQL-like interface to Hadoop. Requires table structure. JDBC and/or ODBC is required.
HBase - Allows interactive access to HDFS. Sits on top of HDFS and applies structure to data. Allows for random reads, scales horizontally with cluster. Not a full query language; only permits get/put/scan operations (can be used with Hive and/or Impala). Row-key indexes only on data. Does not use Map Reduce paradigm.
Impala - Similar to Hive, high-performance SQL Engine for querying vast amounts of data stored in HDFS. Does not use Map Reduce. Good alternative to Hive.
Pig - Data flow language for transforming large datasets. Permits schema optionally defined at runtime. PigServer (Java API) permits programmatic access.
Note: I assume the data you are trying to read already exists in HDFS. However, some of the products in the Hadoop ecosystem may be useful for your application or as a general reference, so I included them.
If you're only looking to get data from HDFS then yes, you can do so via Hive.
However, you'll benefit most from it if your data is already organized (for instance, in columns).
Let's take an example: your map-reduce job produced a CSV file named wordcount.csv containing two columns, word and count. This CSV file is on HDFS.
Let's now suppose you want to know the number of occurrences of the word "gloubiboulga". You can achieve this with the following code:
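-- Two columns, matching the layout of wordcount.csv
CREATE TABLE data
(
word STRING,
count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- The file is already on HDFS, so LOCAL is not used;
-- LOAD DATA INPATH moves the file into the table's directory
LOAD DATA INPATH '/wordcount.csv'
OVERWRITE INTO TABLE data;
-- Look up the count for a single word
SELECT word, count FROM data WHERE word = 'gloubiboulga';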
Please note that while this language looks very much like SQL, you'll still have to learn a few things about it.

Mahout Hive Integration

I want to combine Hadoop-based Mahout recommenders with Apache Hive so that my generated recommendations are stored directly in my Hive tables. Does anyone know of tutorials for this?
Hadoop-based Mahout recommenders can store their results in HDFS directly.
Hive also allows you to create a table schema on top of any data using CREATE EXTERNAL TABLE recommend_table, which also specifies the location of the data (LOCATION '/home/admin/userdata';).
This way you are assured that when new data is written to that location (/home/admin/userdata), it is immediately available to Hive and can be queried through the existing table schema, recommend_table.
I had blogged about it some time back: external-tables-in-hive-are-handy. This solution helps for any kind of map-reduce program output that needs to be available immediately for Hive ad-hoc queries.
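Put together, the statement described above might look like the sketch below. The table name and location come from the answer. The column layout is an assumption: it supposes the recommender job writes plain text lines consisting of a user ID, a tab, and a bracketed list of item:score pairs, so check the actual output files before relying on it.

```sql
-- Table name and LOCATION are from the answer; the columns are assumed
CREATE EXTERNAL TABLE IF NOT EXISTS recommend_table (
  user_id         BIGINT,   -- assumed: first tab-separated field
  recommendations STRING    -- assumed: raw "[item:score,...]" text
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/home/admin/userdata';

-- New files written under the location show up in later queries
SELECT user_id, recommendations
FROM recommend_table
LIMIT 10;
```

Keeping the recommendation list as a single STRING column avoids guessing at its internal structure; it can be split later with Hive's string functions if needed.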

Resources