I am planning for the CCDH certification. Can anyone help me with the requirement below? Does it mean we have to write MR code equivalent to HiveQL statements like SELECT, JOIN, etc., or is it something else?
http://www.cloudera.com/content/cloudera/en/training/certification/ccdh/prep.html
Querying Objectives
Write a MapReduce job to implement a HiveQL statement.
Write a MapReduce job to query data stored in HDFS.
Per this Cloudera forum thread:
Please explain what is really expected in part four. Are we really expected to write an entire job? I can't imagine that; can you clarify this point? And how difficult are the HiveQL queries? Do we have to know subtle SELECT clauses?
In part 4 you are expected to read an entire job (driver, mapper, reducer), dissect it, and understand what the code is or is not doing. It's basically a dissection exercise: given the following code, what is the outcome? The HiveQL queries are not difficult if you know HiveQL or SQL, which are not difficult.
I can't guarantee it's true, but that's a post by a Cloudera employee (a bit dated though - it's from 2014-02).
Related
There are several kinds of file formats, such as Impala internal tables or external table formats like CSV, Parquet, and HBase. We need to guarantee an average insert rate of 50K rows/s, with each row around 1 KB. Some of the data can also be updated occasionally, and we need to run some aggregation operations on that data.
I think HBase is not a good choice for large aggregation computations when used from Impala as an external table. Does anybody have a suggestion about this?
Thanks, Chen.
I've never worked with Impala, but I can tell you a few things based on my experience with Hive.
HBase will be faster if you have a good key design and a proper schema because, just like with Hive, Impala will translate your WHERE clause into scan filters; how much this helps depends a lot on the type of queries you run. There are multiple techniques to reduce the amount of data read by a job, from simple ones like providing start and stop rowkeys, time ranges, reading only some families/columns, and the already mentioned filters, to more complex solutions like performing realtime aggregations on your data (*) and keeping them as counters.
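To make those simpler techniques concrete, here is a minimal sketch using the HBase Java client (assuming the HBase 1.x API; the table name, column family, qualifiers, rowkeys, timestamps, and filter value below are all made-up placeholders):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class NarrowScanExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("metrics"))) { // placeholder table

            Scan scan = new Scan();
            // Restrict the scan to a rowkey range instead of the whole table.
            scan.setStartRow(Bytes.toBytes("sensor42#20140101"));
            scan.setStopRow(Bytes.toBytes("sensor42#20140201"));
            // Only read the column the query actually needs.
            scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("width"));
            // Restrict by cell timestamp as well.
            scan.setTimeRange(1388534400000L, 1391212800000L);
            // Push a value filter down to the region servers.
            scan.setFilter(new SingleColumnValueFilter(
                    Bytes.toBytes("d"), Bytes.toBytes("width"),
                    CompareOp.GREATER, Bytes.toBytes(100L)));

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    // Aggregate or emit each matching row here.
                }
            }
        }
    }
}
```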
Regarding your insert rate, HBase can perfectly handle it with the proper infrastructure (better to use the native HBase Java API); you can also buffer your writes to get even better performance.
*Not sure if Impala supports HBase counters.
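As a rough sketch of the buffered-write idea above, assuming the HBase 1.x BufferedMutator API (older clients do roughly the same thing with HTable and setAutoFlush(false)); the table, family, rowkeys, and buffer size are placeholders:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedWriteExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // Buffer several MB of mutations client-side before flushing to the region servers.
        BufferedMutatorParams params =
                new BufferedMutatorParams(TableName.valueOf("metrics")) // placeholder table
                        .writeBufferSize(8 * 1024 * 1024);

        try (Connection connection = ConnectionFactory.createConnection(conf);
             BufferedMutator mutator = connection.getBufferedMutator(params)) {

            for (long i = 0; i < 1_000_000; i++) {
                Put put = new Put(Bytes.toBytes("sensor42#" + i)); // placeholder rowkey
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("width"), Bytes.toBytes(i));
                mutator.mutate(put); // buffered, not one RPC per row
            }
            mutator.flush(); // push any remaining buffered puts
        }
    }
}
```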
I have a question: how do I write a MapReduce job to implement a HiveQL statement? For example, we have a table with columns color, width, and some others. If I want to select color in Hive, I can run select color from tablename;. What is the equivalent MapReduce code for getting the color column?
You can use the Thrift server. You can connect to Hive through JDBC; all you need is to include the hive-jdbc jar in your classpath.
However, is this advisable? That I am not really sure about. It is a very bad design pattern if you do it in the mapper, since the number of mappers is determined by the data size.
The same can be achieved with multiple inputs into the MR job.
But then, I do not know that much about your use case, so Thrift will be the way to go.
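For what it's worth, a minimal sketch of the JDBC route described above, assuming HiveServer2 listening on localhost:10000 and the example table from the question (host, port, database, and credentials are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 driver from the hive-jdbc jar; the older HiveServer1 uses
        // org.apache.hadoop.hive.jdbc.HiveDriver and a jdbc:hive:// URL instead.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             // No trailing semicolon inside the JDBC statement.
             ResultSet rs = stmt.executeQuery("SELECT color FROM tablename")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```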
For converting Hive queries to MapReduce jobs, YSmart is the best option:
http://ysmart.cse.ohio-state.edu/
YSmart can either be downloaded or used online.
Check the companion code for Chapter 5 - Join Patterns from the MapReduce Design Patterns book. In the join pattern, the fields are extracted in the mapper and emitted.
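Along the same lines, here is a minimal map-only sketch of the "select color from tablename" example from the question, assuming the table's data lives in HDFS as comma-delimited text with color as the first field (the delimiter and field position are assumptions):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Rough MapReduce equivalent of "select color from tablename" for comma-delimited
// text in HDFS, assuming color is the first field.
public class SelectColor {

    public static class ProjectColorMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length > 0) {
                // Emit only the projected column; a plain SELECT needs no reducer.
                outKey.set(fields[0]);
                context.write(outKey, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "select color");
        job.setJarByClass(SelectColor.class);
        job.setMapperClass(ProjectColorMapper.class);
        job.setNumReduceTasks(0); // map-only job: output is written straight from the mapper
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```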
I want to figure out whether there is any software or algorithm that can identify the map and reduce phases in a given piece of code on its own.
This is what happens when you run Hive or Pig queries: you just submit your queries, they automatically get converted into the corresponding MR jobs, and you get the result without having to do anything additional. Have a look at ANTLR (ANother Tool for Language Recognition), which Hive uses to parse a given Hive query into an abstract syntax tree that is then compiled into the corresponding MR job. ANTLR is a parser generator for reading, processing, executing, or translating structured text or binary files.
Do you need something else? Apologies if I have got it wrong.
Where does Apache HiveQL store the Map/Reduce code it generates?
I believe Hive doesn't really generate Map/Reduce code in the sense of something you could get from Java source, because the query is interpreted by the Hive query planner.
If you want to get an idea of what kind of operations your Hive queries generate, you could prefix your queries with EXPLAIN and you will see the abstract syntax tree, the dependency graph, and the plan of each stage. More info on EXPLAIN here.
If you really want to see some Map/Reduce jobs, you could try YSmart which will translate your HiveQL statements into working Java Map/Reduce code. I haven't used it personally, but I know people who have and said good things about it.
It seems that Hive calls this method on every query execution:
http://hive.apache.org/docs/r0.9.0/api/org/apache/hadoop/hive/ql/exec/Task.html#execute(org.apache.hadoop.hive.ql.DriverContext)
Guys, I am a newbie to Hive and have some doubts about it.
Normally we write a custom Hive UDF for a particular set of columns (consider the UDF to be in Java), meaning it performs some operation on those particular columns.
What I am wondering is: can we write a UDF that takes a particular column as input, builds a query from it, and returns that query so that it executes on the Hive CLI with the column as input?
Can we do this? If yes, please suggest how.
Thanks, and sorry for my bad English.
This is not possible out of the box because, as the Hive query runs, a plan has already been built and is going to execute. What you suggest is to dynamically change that plan while it is running, which is hard not only because the plan is already built, but also because the Hadoop MapReduce jobs are already running.
What you can do is have your initial Hive query output new Hive queries to a file, and then have some sort of bash/perl/python script that goes through that file, formulates the new Hive queries, and passes them to the CLI.
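The answer above suggests bash/perl/python; purely as an illustration, the same driver loop in Java might look roughly like this, assuming the generated statements were written one per line to a placeholder path:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class RunGeneratedQueries {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Placeholder path: wherever the first Hive job wrote the generated queries,
        // assumed here to be one complete HiveQL statement per line.
        String queryFile = "/tmp/generated_queries.hql";

        try (BufferedReader reader = new BufferedReader(new FileReader(queryFile))) {
            String query;
            while ((query = reader.readLine()) != null) {
                if (query.trim().isEmpty()) {
                    continue;
                }
                // Hand each generated statement back to the Hive CLI.
                Process hive = new ProcessBuilder("hive", "-e", query)
                        .inheritIO()
                        .start();
                int exitCode = hive.waitFor();
                if (exitCode != 0) {
                    System.err.println("Query failed with exit code " + exitCode + ": " + query);
                }
            }
        }
    }
}
```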