How to create a query in Hive for particular data - hadoop

I am using Hive to load a data file and run Hadoop MapReduce on it, but I am stuck at the CREATE TABLE query. I have data like this: 59.7*, 58.9*, where * is just a character. I want to make two columns that store 59.7 and 58.9. Can anyone help with that? Thanks

You can use RegexSerDe to do that. You can visit this page if you need an example.
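A minimal sketch of such a table definition, assuming the file is comma-separated text and the trailing character can be anything; the table name, column names, regex, and location are illustrative, not from the question:

-- each capture group becomes one column; the lone '.' after each group skips the
-- trailing character (e.g. '*')
CREATE EXTERNAL TABLE readings (
  col1 STRING,
  col2 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([0-9.]+)., ?([0-9.]+)."
)
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/readings';

-- the built-in RegexSerDe requires STRING columns, so cast to numbers at query time
SELECT CAST(col1 AS DOUBLE) AS v1, CAST(col2 AS DOUBLE) AS v2 FROM readings;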

Related

Convert HBase table data to CSV

How do I convert HBase table data to a .CSV file? I'm trying to convert the table data to CSV format, but I couldn't find any code.
hbase001> list
Table
sample_data
Creating an external Hive table mapped onto the HBase table using HBaseStorageHandler can solve your problem. You can then use "select * from table_name" to get the data into a CSV table (stored as textfile, fields terminated by ','). Please refer to the link below for reference.
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-Usage
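A minimal sketch of that approach, assuming the HBase table sample_data has a single column family cf with a column c1; the column mapping and the Hive table names are illustrative:

-- external Hive table backed by the existing HBase table
CREATE EXTERNAL TABLE hbase_sample_data (rowkey STRING, c1 STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:c1")
TBLPROPERTIES ("hbase.table.name" = "sample_data");

-- plain text table with comma-separated fields, i.e. the CSV copy
CREATE TABLE sample_data_csv
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT * FROM hbase_sample_data;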
There are plenty of ways to solve your task. You can use Spark, regular MapReduce, or special tools like Sqoop. This task is rather trivial and you can implement it yourself as you learn Hadoop. The quickest way for starters is probably Sqoop. Please get yourself familiar with this powerful tool and play with it.
Good luck!

Not able to index a column in an HBase table using Phoenix

I have stored 3 million records in an HBase table using CSV bulk load, and fetching data with SQL through Phoenix takes too long. So I created an index using Phoenix, but I am not able to populate the index table records using the Phoenix MapReduce process. I used the following command to build the index data.
sudo -u hdfs hbase org.apache.phoenix.mapreduce.index.IndexTool --data-table BIGSS --index-table BIG_STATUS_INDEXS --output-path STATUS_HFILES
I am not sure where these HFiles are stored in HDFS. Please tell me where they will be stored, or, if I did something wrong, please help me with that.
I have attached a screenshot with the error I am facing. Please help with this.
I used the following link for reference.
https://phoenix.apache.org/secondary_indexing.html
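For context, IndexTool is meant to populate an index that was declared asynchronously; a rough sketch of that flow follows, with an illustrative column list since the actual schema is not shown in the question:

-- declare the index without building it; ASYNC defers population to the MapReduce job
CREATE INDEX BIG_STATUS_INDEXS ON BIGSS (STATUS) ASYNC;

-- the IndexTool command above then writes HFiles under the --output-path directory
-- in HDFS and bulk-loads them into the index table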

HDFS and HBase: how do they work?

Hi everybody,
I'm quite new to big data. I have installed an HDFS + HBase test database and I use Talend Big Data (an ETL tool) for my tests.
I would like to know: if I put a file directly into HDFS, without going through HBase, can I never query that data? I mean, do I have to read the entire file if I want to filter for the data I want, is that right?
Thanks a lot for any help!
HDFS is just a distributed file system; you cannot query your files without going through an intermediate component.
HBase is a NoSQL database that persists your data on HDFS; use it when you need random access to your data.
If you want to store your files on HDFS as they are and query them, you can create an external table over them using Hive, as sketched below.
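A minimal sketch of such an external table, assuming comma-delimited text files already sitting in an HDFS directory; the path, table name, and columns are illustrative:

CREATE EXTERNAL TABLE raw_files (
  id    STRING,
  value STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/test/raw_files';

-- the files stay where they are in HDFS; Hive only reads them at query time
SELECT * FROM raw_files WHERE id = '42';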
The best option is to use Hive on top of the files that are in HDFS. You can use bucketing and partitioning in Hive for performance improvement; a sketch follows.
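A minimal sketch of a partitioned and bucketed Hive table; the names, partition column, and bucket count are illustrative, not from the question:

-- partitioning prunes whole directories at query time; bucketing helps joins and sampling
CREATE TABLE logs_bucketed (
  user_id STRING,
  action  STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 16 BUCKETS
STORED AS ORC;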

Show Hadoop files on HDFS created only on a specific day

I want to show Hadoop files on HDFS, under a specific folder, which were created on a specific day. Is there a command/option to do this?
Thanks in advance,
Lin
As far as I know, the hadoop command does not support this directly.
You can write a script to achieve this, but that is not a good implementation.
My suggestions:
Organize your files in a way that is more convenient to use. In your case, a time-based partition would be better (see the sketch below).
If you want to make data analysis easier, use a database built on HDFS such as Hive. Hive supports partitioning and SQL-like queries and inserts.
More about Hive and Hive partitions:
https://hive.apache.org/
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
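A minimal sketch of a daily-partitioned Hive table; the table, columns, location, and dates are illustrative:

CREATE EXTERNAL TABLE events (
  id      STRING,
  payload STRING
)
PARTITIONED BY (dt STRING)
STORED AS TEXTFILE
LOCATION '/data/events';

-- each day's files live in their own directory, e.g. /data/events/dt=2016-05-01/,
-- so finding or querying one day's data touches only that partition
SELECT * FROM events WHERE dt = '2016-05-01';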

Can we run queries from a custom UDF in Hive?

Guys, I am a newbie to Hive and have some doubts about it.
Normally we write a custom UDF in Hive for a particular set of columns (consider the UDF is written in Java), meaning it performs some operation on those particular columns.
What I am wondering is: can we write a UDF that takes a particular column as input, builds a query from it, and returns that query so that it is executed on the Hive CLI with the column as input?
Can we do this? If yes, please advise.
Thanks, and sorry for my bad English.
This is not possible out of the box, because by the time the Hive query is running, a plan has already been built and is executing. What you suggest is to dynamically change that plan while it is running, which is hard not only because the plan is already built, but also because the Hadoop MapReduce jobs are already running.
What you can do is have your initial Hive query output new Hive queries to a file, then have some sort of bash/perl/python script that goes through that file, formulates the new Hive queries, and passes them to the CLI.
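A minimal sketch of the query-generation step, assuming a control table named control_list with a column tbl_name (all names here are illustrative):

-- write one complete Hive statement per row to a local directory
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/generated_queries'
SELECT CONCAT('SELECT COUNT(*) FROM ', tbl_name, ';') FROM control_list;

-- a wrapper script can then concatenate the output files and feed them back in,
-- e.g. with hive -f /tmp/generated_queries/000000_0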
