Using ParquetFileWriter to write data into a Parquet file? - parquet

I am new to Parquet!
I have tried the example code below to write data into a Parquet file using ParquetWriter.
http://php.sabscape.com/blog/?p=623
The example above uses ParquetWriter, but I want to use ParquetFileWriter to write data into Parquet files efficiently.
Please suggest an example of how to write Parquet files using ParquetFileWriter.

You can probably get some idea from a Parquet column reader that I wrote here.
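
For what it's worth, ParquetFileWriter is the low-level class that writes the physical file structure (row groups, pages, footer); the record-level ParquetWriter drives a ParquetFileWriter internally and is usually the more practical, and just as efficient, entry point. Here is a minimal sketch, assuming a newer parquet-mr release (org.apache.parquet.* packages; older releases use the parquet.* prefix) and a made-up schema and output path, using the ExampleParquetWriter/GroupWriteSupport classes that ship with parquet-mr:

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class SimpleParquetWrite {
    public static void main(String[] args) throws Exception {
        // Hypothetical two-column schema.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message example { required binary name (UTF8); required int32 age; }");

        // ExampleParquetWriter wraps ParquetWriter with GroupWriteSupport;
        // the file layout itself is written by ParquetFileWriter under the hood.
        ParquetWriter<Group> writer = ExampleParquetWriter
            .builder(new Path("/tmp/example.parquet"))
            .withType(schema)
            .build();

        SimpleGroupFactory factory = new SimpleGroupFactory(schema);
        for (int i = 0; i < 10; i++) {
            Group row = factory.newGroup()
                .append("name", "row-" + i)
                .append("age", 20 + i);
            writer.write(row);
        }
        writer.close();  // flushes the row groups and writes the footer
    }
}

If you really do want to call ParquetFileWriter directly you also have to produce the encoded pages yourself (the job of the column writers in parquet-column), so going through a WriteSupport-based writer is almost always the simpler route.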

Related

Using ParquetWriter, is it possible to write data into Parquet with bucketing?

I am writing data into Parquet files programmatically with AvroParquetWriter, but I also want to write the Parquet files with bucketing. Is that possible?
Thanks in advance!
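
There is no built-in bucketing option on the writer itself as far as I know, but you can emulate Hive-style bucketing by hashing the bucketing column and keeping one writer per bucket file. A rough sketch, assuming a hypothetical Avro schema, output paths, bucket count, and id as the bucketing key:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class BucketedParquetWrite {
    private static final int NUM_BUCKETS = 4;  // hypothetical bucket count

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"int\"},"
            + "{\"name\":\"name\",\"type\":\"string\"}]}");

        // One AvroParquetWriter per bucket, named like Hive's bucketed output files.
        List<ParquetWriter<GenericRecord>> writers = new ArrayList<ParquetWriter<GenericRecord>>();
        for (int b = 0; b < NUM_BUCKETS; b++) {
            writers.add(AvroParquetWriter.<GenericRecord>builder(
                    new Path(String.format("/tmp/users/bucket_%05d.parquet", b)))
                .withSchema(schema)
                .build());
        }

        // Route each record by hashing the bucketing column (here: id).
        for (int id = 0; id < 100; id++) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", id);
            record.put("name", "user-" + id);
            int bucket = (id & Integer.MAX_VALUE) % NUM_BUCKETS;
            writers.get(bucket).write(record);
        }

        for (ParquetWriter<GenericRecord> writer : writers) {
            writer.close();
        }
    }
}

If the files are later declared as a bucketed Hive table you would still need to match Hive's own hash function for the buckets to line up, so treat this only as an illustration of routing records to per-bucket files.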

How to read a Parquet schema in a non-MapReduce Java program

Is there a way to read the column names of a Parquet file directly from its metadata, without MapReduce? Please give an example. I am using Snappy as the compression codec.
You can either use ParquetFileReader or the existing parquet-tools (https://github.com/Parquet/parquet-mr/tree/master/parquet-tools) to read Parquet files from the command line.
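
If you just need the schema from plain Java, the footer metadata is enough (Snappy only affects the page data, not the footer). A small sketch with a hypothetical file path, assuming the org.apache.parquet.* packages of newer parquet-mr releases:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

public class PrintParquetSchema {
    public static void main(String[] args) throws Exception {
        Path path = new Path("/data/file.parquet");  // hypothetical path
        // Reads only the file footer; no MapReduce involved.
        ParquetMetadata footer = ParquetFileReader.readFooter(new Configuration(), path);
        MessageType schema = footer.getFileMetaData().getSchema();
        System.out.println(schema);                  // full message type
        for (Type field : schema.getFields()) {
            System.out.println(field.getName());     // individual column names
        }
    }
}

With parquet-tools, the equivalent from the command line is the schema (or meta) command.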

Reading a text file into HBase with MapReduce and storing it in an HTable

I am new to HBase MapReduce and the Hadoop database. I need to read a raw text file from a MapReduce job and store the retrieved data into an HTable using the HBase MapReduce API.
I have been googling for many days but I am not able to understand the exact flow. Can anyone please provide me with some sample code for reading data from a file?
I need to read data from text/CSV files. I can find some examples of reading data from the command prompt. Which method can we use to read an XML file, FileInputFormat or something else? Please help me learn the MapReduce API and provide simple read and write examples.
You can import your CSV data into HBase using the importtsv and completebulkload tools. importtsv loads the CSV data into files on HDFS, and completebulkload loads them into the specified HTable. You can use these tools both from the command line and from Java code. If this helps, let me know and I will provide sample code or commands.
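
If you go the Java route instead, the usual pattern is a map-only job whose mapper turns each CSV line into a Put and writes it through TableOutputFormat. A sketch with a hypothetical table name ("mytable"), column family ("cf"), and column layout; note that older HBase releases use Put.add(...) where newer ones use Put.addColumn(...):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CsvToHBase {

    public static class CsvMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            // First column as the row key, second column as a cell value (hypothetical layout).
            Put put = new Put(Bytes.toBytes(fields[0]));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes(fields[1]));
            context.write(new ImmutableBytesWritable(Bytes.toBytes(fields[0])), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "csv-to-hbase");
        job.setJarByClass(CsvToHBase.class);
        job.setMapperClass(CsvMapper.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Wires up TableOutputFormat for the target table; no reducer is needed.
        TableMapReduceUtil.initTableReducerJob("mytable", null, job);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}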

Amazon EMR JSON

I am using Amazon EMR Hadoop Hive for big data processing. The current data in my log files is in CSV format. In order to build a table from the log files, I wrote a regex expression to parse the data and store it into the different columns of an external table. I know that a SerDe can be used to read data in JSON format, which means each log file line could be a JSON object. Is there any Hadoop performance advantage if my log files are in JSON format compared to CSV format?
If you can process the output of the table (the one you created with the regexp), why add another processing step? Try to avoid unnecessary work.
I think the main issue here is which format is faster to read. I believe CSV will provide better speed than JSON, but don't take my word for it. Hadoop really doesn't care; it's all byte arrays to it once in memory.

Filtering Using MapReduce in Hadoop

I want to filter records from a given file based on some criteria: if the value of the third field equals some value, then retrieve that record and save it in the output file. I am taking a CSV file as input. Can anyone suggest something?
The simplest way would probably be to use Pig, with something like:
orig = LOAD 'filename.csv' USING PigStorage(',') AS (first, second, third:chararray, ...);
filtered_orig = FILTER orig BY third == 'somevalue';
STORE filtered_orig INTO 'newfilename' USING PigStorage(',');
If you need scalability you can use Hadoop in the following way:
Install Hadoop, install Hive, and put your CSV files into HDFS.
Define the CSV file as an external table (http://hive.apache.org/docs/r0.8.1/language_manual/data-manipulation-statements.html) and then you can write SQL against the CSV file. The results of the SQL can then be exported back to CSV.
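
Since the question asks for MapReduce specifically, here is a minimal map-only job in plain Java that does the same filtering: keep every CSV record whose third field equals a given value. The field index and the 'somevalue' literal are placeholders taken from the question:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CsvFilter {

    public static class FilterMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            // Keep the record only if the third field matches the wanted value.
            if (fields.length > 2 && fields[2].equals("somevalue")) {
                context.write(NullWritable.get(), value);  // emit the whole line unchanged
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "csv-filter");
        job.setJarByClass(CsvFilter.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0);  // map-only: filtered lines go straight to the output files
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}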