Mahout Hive Integration

Mahout Hive Integration - hadoop

I want to combine Hadoop based Mahout recommenders with Apache Hive.So that My generated Recommendations are directly stored in to my Hive Tables..Do any one know similar tutorials for this..?

Hadoop based Mahout recommenders can store the results in HDFS directly.
Hive also allows you to create table schema on top of any data using CREATE EXTERNAL TABLE recommend_table which also specifies the location of the data (LOCATION '/home/admin/userdata';).
This way you are ensured that when new data is written to that location - /home/admin/userdata then it is already available to Hive and can be queried by existing Table schema : recommend_table.
I had blogged about it some time back: external-tables-in-hive-are-handy. This solution helps for any kind of map-reduce program output that needs to be available immediately for Hive ad-hoc queries.

Related

Why Hive when HDFS already provide data storage?

I have started learning Hadoop.I understood that HDFS provides distributed storage system and Mapreduce is for data processing.Now i ma reading Hadoop ecosystem.
From the definition of Hive, it is a data ware house built on hadoop for providing SQL like interface.
My question is when hadoop provides HDFS which is falut tolerant , distributed then why hive? Does hive replaces HDFS?.
Does hive provide only sql interface or storage also?

Hive does not replace HDFS. Hive provides sql type interface to data that is stored in HDFS. Its basically used for querying and analysis of data that is stored. Hive in a sense actually eliminates a lot of boiler plate code, that you would have to write if you were using mapreduce. for example just think of how you are going to create different types of joins(left, right, bucketed) or group by clause or any other sql clause in mapreduce and you will get your answer (you lines of code will easily scale to 100's ). Hive provides them out-of-the-box. You dont need to write those lengthy programs in mapreduce. Hive already does that for you.
One thing to note is, Hive itself uses Mapreduce behind the scenes. So any group by, count, join is converted to mapreduce jobs only. You can change this though to Tez/Spark.
for your second question, hive does not provide any storage, it just uses a database (derby as default, MySQL would be a good choice if you want to use a different db) as a metastore just to store the metadata related to the tables, partitions, views, buckets etc.. (metadata is like location of tables, type of data stored in tables, partitions info of the tables, created date, modified date etc..) you create with hive.

To answer your question in comment...
Hive can process structured (csv,txt etc) data & semi-structured(xml,json,parquet etc). It cannot process unstructured data like audio, video etc.
Note: Semi structured data can be handled in DDLs and also through spark to be put into Hive.
I encourage you to learn what is external and managed tables in hive too.
Happy learning.

Query Parquet data through Vertica (Vertica Hadoop Integration)

So I have a Hadoop cluster with three nodes. Vertica is co-located on cluster. There are Parquet files (partitioned by Hive) on HDFS. My goal is to query those files using Vertica.
Right now what I did is using HDFS Connector, basically create an external table in Vertica, then link it to HDFS:
CREATE EXTERNAL TABLE tableName (columns)
AS COPY FROM "hdfs://hostname/...../data" PARQUET;
Since the data size is big. This method will not achieve good performance.
I have done some research, Vertica Hadoop Integration
I have tried HCatalog but there's some configuration error on my Hadoop so that's not working.
My use case is to not change data format on HDFS(Parquet), while query it using Vertica. Any ideas on how to do that?
EDIT: The only reason Vertica got slow performance is because it cant use Parquet's partitions. With higher version Vertica(8+), it can utlize hive's metadata now. So no HCatalog needed.

Terminology note: you're not using the HDFS Connector. Which is good, as it's deprecated as of 8.0.1. You're using the direct interface described in Reading Hadoop Native File Formats, with libhdfs++ (the hdfs scheme) rather than WebHDFS (the webhdfs scheme). That's all good so far. (You can also use the HCatalog Connector, but you need to do some additional configuration and it will not be faster than an external table.)
Your Hadoop cluster has only 3 nodes and Vertica is co-located on them, so you should be getting the benefits of node locality automatically -- Vertica will use the nodes that have the data locally when planning queries.
You can improve query performance by partitioning and sorting the data so Vertica can use predicate pushdown, and also by compressing the Parquet files. You said you don't want to change the data so maybe these suggestions don't work for you; they're not specific to Vertica so they might be worth considering anyway. (If you're using other tools to interact with your Parquet data, they'll benefit from these changes too.) The documentation of these techniques was improved in 8.0.x (link is to 8.1 but this was in 8.0.x too).
Additional partitioning support was added in 8.0.1. It looks like you're using at least 8.0; I can't tell if you're using 8.0.1. If you are, you can create the external table to only pay attention to the partitions you care about with something like:
CREATE EXTERNAL TABLE t (id int, name varchar(50),
created date, region varchar(50))
AS COPY FROM 'hdfs:///path/*/*/*'
PARQUET(hive_partition_cols='created,region');

how to manage modified data in Apache Hive

We are working on Cloudera CDH and trying to perform reporting on the data stored on Apache Hadoop. We send daily reports to client so need to import data from operational store to hadoop daily.
Hadoop works on the append only mode. Hence we can not perform the Hive update/delete query. We can perform Insert overwrite on dimension tables and add delta values in the fact tables. Introducing thousands for the delta rows daily does not seem quite impressive solution.
Are there any other standard better ways to update modified data in Hadoop?
Thanks

HDFS might be append only, but Hive does support updates from 0.14 on.
see here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update
A design pattern is to take all your previous and current data and insert it into a new table every time.
Depending on your usecase have a look at Apache Impala/Hbase/... or even Drill.

Hbase in comparison with Hive

Im trying to get a clear understanding on HBASE.
Hive:- It just create a Tabular Structure for the Underlying Files in
HDFS. So that we can enable the user to have Querying Abilities on the
HDFS file. Correct me if im wrong here?
Hbase- Again, we have create a Similar table Structure, But bit more
in Structured way( Column Oriented) again over HDFS File system.
aren't they both Same considering the type of job they does. except that Hive runs on Mapredeuce.
Also is that true that we cant create a Hbase table over an Already existing HDFS file?

Hive shares a very similar structures to traditional RDBMS (But Not all), HQL syntax is almost similar to SQL which is good for Database Programmer from learning perspective where as HBase is completely diffrent in the sense that it can be queried only on the basis of its Row Key.
If you want to design a table in RDBMS, you will be following a structured approach in defining columns concentrating more on attributes, while in Hbase the complete design is concentrated around the data, So depending on the type of query to be used we can design a table in Hbase also the columns will be dynamic and will be changing at Runtime (core feature of NoSQL)

You said aren't they both Same considering the type of job they does. except that Hive runs on Mapredeuce .This is not a simple thinking.Because when a hive query is executed, a mapreduce job will be created and triggered.Depending upon data size and complexity it may consume time, since for each mapreduce job, there are some number of steps to do by JobTracker, initializing tasks like maps,combine,shufflesort, reduce etc.
But in case we access HBase, it directly lookup the data they indexed based on specified Scan or Get parameters. Means it just act as a database.

Hive and HBase are completely different things
Hive is a way to create map/reduce jobs for data that resides on HDFS (can be files or HBase)
HBase is an OLTP oriented key-value store that resides on HDFS and can be used in Map/Reduce jobs
In order for Hive to work it holds metadata that maps the HDFS data into tabular data (since SQL works on tables).
I guess it is also important to note that in recent versions Hive is evolving to go beyond a SQL way to write map/reduce jobs and with what HortonWorks calls the "stinger initiative" they have added a dedicated file format (Orc) and import Hive's performance (e.g. with the upcoming Tez execution engine) to deliver SQL on Hadoop (i.e. relatively fast way to run analytics queries for data stored on Hadoop)

Hive:
It's just create a Tabular Structure for the Underlying Files in HDFS. So that we can enable the user to have SQL-like Querying Abilities on existing HDFS files - with typical latency up to minutes.
However, for best performance it's recommended to ETL data into Hive's ORC format.
HBase:
Unlike Hive, HBase is NOT about running SQL queries over existing data in HDFS.
HBase is a strictly-consistent, distributed, low-latency KEY-VALUE STORE.
From The HBase Definitive Guide:
The canonical use case of Bigtable and HBase is the webtable, that is, the web pages
stored while crawling the Internet.
The row key is the reversed URL of the page—for example, org.hbase.www. There is a
column family storing the actual HTML code, the contents family, as well as others
like anchor, which is used to store outgoing links, another one to store inbound links,
and yet another for metadata like language.
Using multiple versions for the contents family allows you to store a few older copies
of the HTML, and is helpful when you want to analyze how often a page changes, for
example. The timestamps used are the actual times when they were fetched from the
crawled website.
The fact that HBase uses HDFS is just an implementation detail: it allows to run HBase on an existing Hadoop cluster, it guarantees redundant storage of data; but it is not a feature in any other sense.
Also is that true that we cant create a Hbase table over an already
existing HDFS file?
No, it's NOT true. Internally HBase stores data in its HFile format.

Moving clustered data from HDFS to Hive

I have been experimenting with Mahout in the Cloudera demo VM and have successfully clustered the sample synthetic control data (https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html) using the k-Means algorithm. I have used ClusterDumper and can view the Mahout output, but now I want to put the output into a Hive table. How would I go about doing this?

There is no direct integration. Your best bet is to modify ClusterDumper to produce some kind of textual representation that can be imported into Hive as tabular data.

Create an external table in Hive, That should point to Mahout o/p path.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio