I am a newbie to Hive and have some doubts about it.
Normally we write a custom UDF in Hive for a particular set of columns (consider the UDF is written in Java), meaning it performs some operation on those columns.
What I am wondering is: can we write a UDF that takes a particular column as input, builds a query from it, and returns that query so that it is then executed on the Hive CLI?
Is this possible? If yes, please suggest how.
Thanks, and sorry for my bad English.
This is not possible out of the box, because by the time a Hive query is running, a plan has already been built and is executing. What you suggest is to dynamically change that plan while it is running, which is hard not only because the plan is already built, but also because the Hadoop MapReduce jobs are already running.
What you can do is have your initial Hive query output new Hive queries to a file, then have some sort of bash/perl/python script (or a small Java driver) that goes through that file and passes each new query to the CLI.
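As a rough sketch of that driver idea in Java (the file path /tmp/generated_queries.hql and the one-query-per-line layout are assumptions for illustration; a bash or python loop around hive -e would do the same job):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Reads the queries produced by the first Hive job (assumed to be one query
    // per line in /tmp/generated_queries.hql) and hands each one to the Hive CLI
    // with "hive -e".
    public class GeneratedQueryRunner {
        public static void main(String[] args) throws IOException, InterruptedException {
            try (BufferedReader reader = new BufferedReader(new FileReader("/tmp/generated_queries.hql"))) {
                String query;
                while ((query = reader.readLine()) != null) {
                    if (query.trim().isEmpty()) {
                        continue; // skip blank lines
                    }
                    // Launch the Hive CLI for this single query and wait for it to finish.
                    Process hive = new ProcessBuilder("hive", "-e", query)
                            .inheritIO()
                            .start();
                    int exitCode = hive.waitFor();
                    if (exitCode != 0) {
                        System.err.println("Query failed (exit " + exitCode + "): " + query);
                    }
                }
            }
        }
    }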
I was trying to use Hive to query tables I saved using saveAsTable() provided by Spark DataFrames. Everything works well when I query through hiveContext.sql(). However, when I switch to the Hive CLI and describe the table, the schema shows up as something like col, array, and the table is no longer queryable.
Any ideas how to work through this? Is there a reliable way to make Hive understand the metadata defined in Spark instead of explicitly defining the schema?
Sometimes I use Spark to infer the schema from the raw data, or read the schema from certain file formats like Parquet, so I don't want to hand-create tables whose schema could be inferred automatically.
Thanks a lot for any advice!
I want to list Hadoop files on HDFS under a specific folder that were created on a specific day. Is there a command/option to do this?
Thanks in advance,
Lin
As far as I know, the hadoop command does not support this.
You could write a script (or a small program against the HDFS API, as sketched below) to achieve this, but that is not a great solution.
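A minimal sketch of that idea using the HDFS FileSystem API, filtering on modification time (HDFS does not track a separate creation time); the folder path and date are placeholders:

    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Lists files under a folder whose modification date matches a given day.
    public class ListFilesByDay {
        public static void main(String[] args) throws Exception {
            String folder = "/data/mydir";   // placeholder folder
            String day = "2015-06-01";       // placeholder day (yyyy-MM-dd)

            FileSystem fs = FileSystem.get(new Configuration());
            SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");

            for (FileStatus status : fs.listStatus(new Path(folder))) {
                String fileDay = fmt.format(new Date(status.getModificationTime()));
                if (day.equals(fileDay)) {
                    System.out.println(status.getPath());
                }
            }
        }
    }

Note that if the data were laid out in day partitions (suggestion 1 below), you would only need to list a single directory such as /data/mydir/dt=2015-06-01 instead of scanning everything.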
My suggestions:
Organize your files in a way that is more convenient to use. In your case, a time-based partition would be better (see the sketch after the links below).
If you want to make data analysis easier, use a database built on HDFS such as Hive. Hive supports partitions as well as SQL-like query and insert statements.
More about Hive and Hive partitions:
https://hive.apache.org/
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
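As an illustration of the partition suggestion, here is a minimal sketch that creates a day-partitioned table through the Hive JDBC driver; the connection URL, table name, columns, and paths are all placeholders, and the same two statements could just as well be run directly in the Hive CLI:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Creates a day-partitioned Hive table and loads one day of data into it.
    // Each partition ends up in its own HDFS directory, e.g.
    //   /user/hive/warehouse/events/dt=2015-06-01/
    // so "files for a specific day" becomes a single directory listing.
    public class CreatePartitionedTable {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 driver
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://localhost:10000/default", "", "");
                 Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS events (msg STRING) "
                           + "PARTITIONED BY (dt STRING)");
                stmt.execute("LOAD DATA INPATH '/data/mydir/2015-06-01' "
                           + "INTO TABLE events PARTITION (dt = '2015-06-01')");
            }
        }
    }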
I have a question: how do I write a MapReduce job that implements a HiveQL statement? For example, we have a table with columns color, width, and some others. If I want to select color in Hive I can run select color from tablename;. What is the equivalent MapReduce code for getting color?
You can use the Thrift server. You can connect to Hive through JDBC. All you need is to include the hive-jdbc jar in your classpath.
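For illustration, a minimal JDBC client might look like the sketch below. It assumes HiveServer2 (driver class org.apache.hive.jdbc.HiveDriver and a jdbc:hive2:// URL); the older HiveServer1 setup uses org.apache.hadoop.hive.jdbc.HiveDriver and jdbc:hive:// instead. Host, port, and table name are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Minimal Hive JDBC client: connects to HiveServer2 and selects one column.
    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://localhost:10000/default", "", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT color FROM tablename")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }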
However, is this advisable? That I am not really sure about. It is a very bad design pattern if you are doing it in the mapper, since the number of mappers is determined by the data size.
The same thing can be achieved by feeding multiple inputs into the MR job.
But then I do not know that much about your use case, so Thrift may be the way to go.
For converting Hive queries to MapReduce jobs, YSmart is the best option:
http://ysmart.cse.ohio-state.edu/
YSmart can either be downloaded or used in its online version.
Check the companion code for Chapter 5 - Join Patterns from the MapReduce Design Patterns book. In the join pattern, the fields are extracted in the mapper and emitted.
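To make the select color example concrete: a simple projection boils down to a map-only job whose mapper extracts the field and emits it. A rough sketch, assuming the table's underlying files are comma-delimited text with color as the first field (adjust the delimiter and field index to your layout, and set the number of reducers to 0):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only "projection": reads delimited lines and emits only the color field.
    public class SelectColorMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        private final Text color = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length > 0) {
                color.set(fields[0]);
                context.write(color, NullWritable.get());
            }
        }
    }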
I am using Hive to load a data file and run Hadoop MapReduce on it, but I am stuck at the create table query. I have data like 59.7*, 58.9*, where * is just a character. I want to make two columns that store 59.7 and 58.9. Can anyone help with that? Thanks
You can use RegexSerDe to do that. You can visit this page if you need an example.
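For example, a table definition along these lines might work (the DDL in the comment below is a hypothetical sketch; the exact regex depends on your file layout, and on older Hive versions the class is org.apache.hadoop.hive.contrib.serde2.RegexSerDe instead). The short Java check just confirms that the regex splits the sample line 59.7*, 58.9* into the two capture groups that RegexSerDe would map to the two columns:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Quick local check of the regex you would hand to RegexSerDe, e.g. in a DDL like:
    //
    //   CREATE TABLE readings (temp1 STRING, temp2 STRING)
    //   ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    //   WITH SERDEPROPERTIES ("input.regex" = "([0-9.]+)\\*,\\s*([0-9.]+)\\*")
    //   STORED AS TEXTFILE;
    //
    // Each capture group becomes one column, so "59.7*, 58.9*" yields 59.7 and 58.9.
    public class RegexSerDeCheck {
        public static void main(String[] args) {
            Pattern pattern = Pattern.compile("([0-9.]+)\\*,\\s*([0-9.]+)\\*");
            Matcher matcher = pattern.matcher("59.7*, 58.9*");
            if (matcher.matches()) {
                System.out.println("column 1 = " + matcher.group(1)); // 59.7
                System.out.println("column 2 = " + matcher.group(2)); // 58.9
            }
        }
    }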
Where does Apache HiveQL store the Map/Reduce code it generates?
I believe Hive doesn't really generate Map/Reduce code in the sense of Java source you could read; the query is compiled by the Hive query planner into a plan that Hive's generic map and reduce operators interpret at runtime.
If you want to get an idea of what kind of operations your Hive queries generate, you can prefix a query with EXPLAIN (for example, EXPLAIN SELECT * FROM some_table;) and you will see the abstract syntax tree, the dependency graph, and the plan of each stage. More info on EXPLAIN here.
If you really want to see some Map/Reduce code, you could try YSmart, which translates HiveQL statements into working Java Map/Reduce code. I haven't used it personally, but I know people who have, and they said good things about it.
It seems that Hive goes through this method on every query execution:
http://hive.apache.org/docs/r0.9.0/api/org/apache/hadoop/hive/ql/exec/Task.html#execute(org.apache.hadoop.hive.ql.DriverContext)