How to do SQL-like querying of parquet files in kedro - kedro

I'm new to kedro, and I'm wondering whether I can run SQL-like queries against the parquet files instead of using the DataFrame APIs. Please help me out if there is a way.
Thanks in advance!
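Kedro itself does not ship a SQL engine as far as I know, but one common workaround is to run the SQL inside a node with DuckDB, which can query parquet files directly. A minimal sketch follows; the file path, column names, and node function are hypothetical and not part of the kedro API:

import duckdb
import pandas as pd

def query_shuttles() -> pd.DataFrame:
    # DuckDB reads the parquet file directly in the FROM clause.
    result = duckdb.query(
        "SELECT company_id, AVG(price) AS avg_price "
        "FROM 'data/01_raw/shuttles.parquet' "
        "GROUP BY company_id"
    )
    return result.to_df()

If your catalog entries are loaded through PySpark instead, you can register the loaded DataFrame as a temporary view with createOrReplaceTempView and run spark.sql(...) over it.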

Related

Can ParquetWriter or AvroParquetWriter store the schema separately?

Do you know whether ParquetWriter or AvroParquetWriter can store the schema separately, without the data?
Currently the schema is written into the parquet file:
ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(new Path(file.getName()))
        .withSchema(payload.getSchema())
        .build();
Do you know if it is possible to write only the data, without the schema, into a parquet file?
Thank you!
@ЭльфияВалиева No, the parquet metadata (schema) in the footer is required: it is what gives parquet readers the schema they need to read the parquet data.
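As a small illustration of that point (a sketch in Python with pyarrow rather than Java; the file name is a placeholder), the schema a reader works with is read back out of the footer:

import pyarrow.parquet as pq

parquet_file = pq.ParquetFile("payload.parquet")  # placeholder file name
print(parquet_file.schema_arrow)  # the schema, reconstructed from the footer
print(parquet_file.metadata)      # row-group and column metadata, also from the footer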

HBase Table Data Convert to CSV

How do I convert HBase table data to a .CSV file? I'm trying to convert the table data to CSV format, but I couldn't find any code.
hbase001> list
Table
sample_data
Creating an external Hive table mapped onto the HBase table using HBaseStorageHandler can solve your problem: you can then use "select * from table_name" to load the data into a CSV-style table (stored as textfile, fields terminated by ','), as sketched below. Please refer to the link below for reference.
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-Usage
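A minimal sketch of that approach, driven from Python through the Hive CLI; the Hive table name, column names, and the 'cf:value' column mapping are placeholders you would replace with the real HBase schema:

import subprocess

hql = """
CREATE EXTERNAL TABLE hbase_sample_data (rowkey STRING, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:value')
TBLPROPERTIES ('hbase.table.name' = 'sample_data');

CREATE TABLE sample_data_csv
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT * FROM hbase_sample_data;
"""

# Requires the Hive CLI on the PATH and the hive-hbase-handler jar on Hive's classpath.
subprocess.run(["hive", "-e", hql], check=True)

The comma-delimited files then sit under the new table's warehouse directory on HDFS and can be copied out with hdfs dfs -get.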
There are plenty of ways to solve your task. You can use Spark, regular MapReduce, or special tools like Sqoop. The task is fairly straightforward, and you can implement it yourself once you learn Hadoop. The quickest way for starters is probably Sqoop. Please familiarize yourself with this powerful tool and play with it.
Good luck!

How to move HBase tables to HDFS in Parquet format?

I have to build a tool which will move our data from HBase (HFiles) to HDFS in parquet format.
Please suggest one of the best ways to move data from HBase tables to Parquet tables.
We have to move 400 million records from HBase to Parquet. How can we achieve this, and what is the fastest way to move the data?
Thanks in advance.
Regards,
Pardeep Sharma.
Please have a look at the project tmalaska/HBase-ToHDFS,
which reads an HBase table and writes the output as Text, Seq, Avro, or Parquet.
Example usage for Parquet (exports the data to Parquet):
hadoop jar HBaseToHDFS.jar ExportHBaseTableToParquet exportTest c export.parquet false avro.schema
I recently open-sourced a patch to HBase which tackles the problem you are describing.
Have a look here: https://github.com/ibm-research-ireland/hbaquet

While creating a table, how to identify the data types in Hive

I am learning to use Hadoop for performing Big Data related operations.
I need to perform some queries on a collection of data sets split across 8 CSV files. Each file has multiple sheets, and the query concerns only one of them (sheet name: Table4).
The dataset can be downloaded here : http://www.census.gov/hhes/www/hlthins/data/utilization/tables.html
A sample data snapshot is attached for quick reference.
I have already converted the above xls file to csv.
I am not sure how to group the data while creating the table in Hive.
It would be really helpful if you could guide me here.
Note: I am a novice with Hadoop and Big Data, so if anyone could guide me on how to proceed further, I'd be very grateful.
If you need information on the queries or anything else let me know.
Thanks!
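One way to have the data types identified for you is to let Spark infer them from the CSV and persist the result as a Hive table. This is only a sketch: the HDFS path and table name below are placeholders, and it assumes sheet Table4 has already been exported to its own CSV files.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("census-table4")
    .enableHiveSupport()
    .getOrCreate()
)

df = (
    spark.read
    .option("header", "true")       # first row holds the column names
    .option("inferSchema", "true")  # sample the data to guess INT/DOUBLE/STRING
    .csv("hdfs:///data/census/table4/*.csv")
)

df.printSchema()  # review the inferred types before committing to them
df.write.mode("overwrite").saveAsTable("census_table4")

If you would rather write plain Hive DDL, printSchema gives you the column names and types to copy into a CREATE TABLE statement.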

How to create a query in Hive for particular data

I am using Hive to load a data file and run Hadoop MapReduce on it, but I am stuck at the CREATE TABLE query. I have data like this: 59.7*, 58.9*, where * is just a character. I want two columns that store 59.7 and 58.9. Can anyone help with that? Thanks
You can use RegexSerDe to do that. You can visit this page if you need an example.
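A minimal sketch of the RegexSerDe approach, driven through the Hive CLI from Python; the table and column names are made up, and note that RegexSerDe requires the columns to be declared as STRING, so cast them when querying:

import subprocess

hql = """
CREATE TABLE readings (value1 STRING, value2 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([0-9.]+)[^,]*, *([0-9.]+).*'
)
STORED AS TEXTFILE;
"""

# The regex captures the two numeric parts of a line like "59.7*, 58.9*"
# and drops the trailing "*" characters.
subprocess.run(["hive", "-e", hql], check=True)

After loading the file into the table, SELECT CAST(value1 AS DOUBLE), CAST(value2 AS DOUBLE) FROM readings returns 59.7 and 58.9.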

Resources