Create Parquet file with qualified name in Databricks - parquet

I have to process some raw CSV data files with cleansing transformations and load the result as a .parquet file in the cleanse layer. The raw layer file (CSV) and the cleanse layer file should have the same name.
But I cannot save the .parquet file with the given name: it creates a directory with that name, and the .parquet files underneath are saved with random names. Please help me accomplish this.

This is how Spark writes Parquet output: as a collection of part files, one per partition, each holding one or more row groups.
The name of your parquet is the folder the chunks of data will be saved under.
If you want a single file, you have to collapse the data into a single partition before writing (the result is still one part file inside that folder), and you will likely lose the parallelization capabilities that multiple part files offer for both read and write.
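If a single file with a fixed name is still required, a workaround often used is to write the data with one partition into a temporary folder and then move the one part file to the target name. The sketch below covers only the move step, in plain Python. The function name and the paths are hypothetical, and it assumes the folder is reachable as a local path (on Databricks that usually means a mounted path; otherwise the equivalent file-system utilities would be needed).

```python
import shutil
from pathlib import Path


def promote_single_part_file(output_dir: str, target_file: str) -> Path:
    """Move the only part-*.parquet file in output_dir to target_file.

    target_file must live outside output_dir, because output_dir is
    deleted afterwards.
    """
    parts = sorted(Path(output_dir).glob("part-*.parquet"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")

    target = Path(target_file)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(parts[0]), str(target))

    # Drop the leftover folder together with marker files such as _SUCCESS.
    shutil.rmtree(output_dir)
    return target
```

For example, promote_single_part_file("/tmp/sales_tmp", "/tmp/cleanse/sales.parquet") would leave one file named sales.parquet, matching the raw file's base name. This only makes sense for data small enough to be written by a single task.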

Related

Amazon Glue - Create Single Parquet

I have a data source that generates hourly files in CSV format, which are pushed to S3. Using Glue, I do some ETL and push the transformed data back to S3.
The other department that consumes this data wants yesterday's files consolidated into a single file.
I have written a Python program that consolidates yesterday's 24 files into a single CSV file.
The single consolidated file now also needs to be available in Parquet.
I created a crawler to generate my CSV table, and I have a Glue job that converts the single transformed file into Parquet, but I am getting multiple parts of the Parquet file, which I believe is because of the snappy compression. I want to create a single one. How can I do this in Glue?
Secondly, I would like to understand when to use multiple Parquet files and when it makes sense to create a single one.
You can convert the DynamicFrame to a Spark DataFrame, call repartition(1), and then call write.
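A minimal sketch of that suggestion, assuming df is a Spark DataFrame (in a Glue job, typically the result of calling toDF() on a DynamicFrame). The function name and the "overwrite" mode are my own choices, not something from the question.

```python
def write_single_parquet(df, path: str) -> None:
    """Write df as a single Parquet part file under the folder `path`.

    df is assumed to be a Spark DataFrame.
    """
    # repartition(1) moves all rows into one partition, so one task writes
    # one part file. The output is still a folder containing that file.
    df.repartition(1).write.mode("overwrite").parquet(path)
```

The number of part files follows the number of partitions at write time, not the compression codec. A single file is practical when the data is small enough for one task to write and one reader to consume; larger data sets usually benefit from several files, because they can be written and read in parallel.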

How to read a Parquet file from an S3 bucket in NiFi?

I am trying to read a Parquet file from an S3 bucket in NiFi.
To read the file I used the ListS3 and FetchS3Object processors and then an ExtractAttribute processor. Up to that point it looked fine.
The files are parquet.gz files, and by no means was I able to generate a flowfile from them. My final purpose is to load the file into NoSQL (Snowflake).
FetchParquet works with HDFS, which we do not use.
My next option is to use the ExecuteScript processor (with Python) to read these Parquet files and save them back as text.
Can somebody please suggest a workaround?
It depends what you need to do with the Parquet files.
For example, if you wanted to get them to your local disk, then ListS3 -> FetchS3Object -> PutFile would work fine. This is because this scenario is just moving around bytes and doesn't really matter whether it is Parquet or not.
If you need to actually interpret the Parquet data in some way, which it sounds like you do for getting it into a database, then you need to use FetchParquet and convert from Parquet to some other format like Avro, JSON, or CSV, and then send that to one of the database processors.
You can use the Fetch/Put Parquet processors, or any other HDFS processors, with S3 by configuring a core-site.xml with an S3 filesystem.
http://apache-nifi-users-list.2361937.n4.nabble.com/PutParquet-with-S3-td3632.html
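As a rough illustration of such a core-site.xml, here is a fragment using the Hadoop S3A connector. The key values are placeholders, and this is only a sketch: the processor also needs the S3A classes (the hadoop-aws library and its AWS SDK dependency) on its classpath, and the exact setup depends on your NiFi and Hadoop versions.

```xml
<configuration>
  <!-- Placeholder credentials for the S3A filesystem; replace with your own. -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

With a file like this referenced from the processor's Hadoop configuration, paths would be given in the s3a://bucket/path form.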

How to output multiple S3 files in Parquet

Writing Parquet data can be done with something like the following. But what if I want to write to more than one file and, moreover, output to multiple S3 files so that reading a single column does not read all the S3 data? How can this be done?
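// file (a Hadoop Path), schema (an Avro Schema) and i are defined elsewhere.
AvroParquetWriter<GenericRecord> writer =
    new AvroParquetWriter<GenericRecord>(file, schema);
// Build one record that matches the schema and append it to the file.
GenericData.Record record = new GenericRecordBuilder(schema)
    .set("name", "myname")
    .set("favorite_number", i)
    .set("favorite_color", "mystring").build();
writer.write(record);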
For example, what if I want to partition by a column value so that all the data with a favorite_color of red goes in one file and the data with blue goes in another, to minimize the cost of certain queries? There should be something similar in a Hadoop context. All I can find are things that mention Spark, using something like
df.write.parquet("hdfs:///my_file", partitionBy=["created_year", "created_month"])
But I can find no equivalent to partitionBy in plain Java with Hadoop.
In a typical MapReduce application, the number of output files is the same as the number of reducers in your job. So if you want multiple output files, set the number of reducers accordingly:
job.setNumReduceTasks(N);
or alternatively via the configuration property on the command line:
-Dmapreduce.job.reduces=N
I don't think it is possible to have one column per file with the Parquet format. The internal structure of Parquet files is initially split by row groups, and only these row groups are then split by columns.
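Partitioning by a column value, as partitionBy does in Spark, comes down to routing each record to a writer chosen by that value and laying the files out in column=value folders. The sketch below shows only that routing logic in Python. It writes JSON Lines as a stand-in for Parquet, and the function name and file names are hypothetical. In plain Java the same pattern would keep one Parquet writer per distinct value.

```python
import json
from pathlib import Path


def write_partitioned(records, base_dir, column):
    """Write each record to <base_dir>/<column>=<value>/part-00000.jsonl.

    Returns the sorted list of partition values that were written.
    """
    handles = {}  # one open output file per distinct partition value
    try:
        for record in records:
            value = record[column]
            if value not in handles:
                part_dir = Path(base_dir) / f"{column}={value}"
                part_dir.mkdir(parents=True, exist_ok=True)
                handles[value] = open(part_dir / "part-00000.jsonl", "w")
            handles[value].write(json.dumps(record) + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

A query that filters on favorite_color = 'red' then only needs to open the favorite_color=red folder. Keeping one open writer per value works while the number of distinct values is small.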

What's the recommended way of loading data into Hive from compressed files?

I came across this page on CompressedStorage in the documentation and it has me a bit confused.
According to the page, if my input files (on AWS s3) are compressed gzip files, I should first load the data with the option STORED AS TextFile and then create another table with the option STORED AS SEQUENCEFILE and insert the data into that. Is that really the recommended way?
Or can I just load the data straight into a table set with the option STORED AS SEQUENCEFILE?
If the former method is really the recommended way, is there any further explanation as to why it is?
You must load your data in its own format. That is, if your files are text files, load them as TextFile, and if your files are sequence files, load them as SEQUENCEFILE.
For Hive the compression format doesn't matter, because it decompresses the files on the fly using the file extension as a reference (provided the compression codec is configured properly in Hadoop).
The suggestion on the page you are sharing is that it is better to work with sequence files than with compressed text files. That is because a gzip file is not splittable: if you have a very big gzip file, the whole file has to be processed by a single mapper, which prevents the work from being distributed in parallel among the cluster nodes.
So Hive's suggestion is to convert compressed text files into sequence files to avoid that limitation. It is only about performance.
If your files are small (less than one Hadoop block, 128 MB by default), it doesn't matter.
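To put rough numbers on that, here is a small back-of-the-envelope sketch. It is a simplification that assumes one map task per block for splittable files and ignores split-size settings and other details of the real input format. The function name is hypothetical.

```python
import math

DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB


def estimated_map_tasks(file_size, splittable, block_size=DEFAULT_BLOCK_SIZE):
    """Roughly estimate how many mappers one input file can be spread over."""
    if not splittable:
        # A gzip text file has to be read from start to end by one mapper.
        return 1
    # Splittable input: about one mapper per block, at least one.
    return max(1, math.ceil(file_size / block_size))
```

Under these assumptions a 1 GB splittable file can be processed by about 8 mappers, while a 1 GB gzip file gets 1. For a 100 MB file both cases give 1, which is why the format makes no difference for small files.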

How to work on a specific part of a CSV file uploaded into HDFS?

I'm new to Hadoop and have a question: if I export a relational database into a CSV file and then upload it into HDFS, how do I work on a specific part (table) of the file using MapReduce?
Thanks in advance.
I assume that the RDBMS tables are exported to individual CSV files, one per table, and stored in HDFS. I presume that by 'specific part (table)' you mean column data within the table(s). If so, place the individual CSV files into separate file paths, say /user/userName/dbName/tables/table1.csv
Now you can configure the job with the input path and the field positions. Consider using the default input format so that your mapper gets one line at a time as input. Based on the configuration/properties, you can read the specific fields and process the data.
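The per-line field selection described above could look roughly like this in Python, for example as the core of a Hadoop Streaming mapper. The function name and the column positions are hypothetical, and the sketch assumes each CSV record fits on one line.

```python
import csv


def select_fields(lines, field_indexes):
    """Yield only the requested columns from each CSV line.

    Rows that are empty or too short for the requested positions are skipped.
    """
    needed = max(field_indexes) + 1
    for row in csv.reader(lines):
        if len(row) < needed:
            continue
        yield [row[i] for i in field_indexes]
```

In a streaming mapper, lines would be sys.stdin and each selected row would be printed as a tab-separated line for the next stage. For instance, select_fields(sys.stdin, [0, 2]) keeps the first and third columns.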
Cascading allows you to get started very quickly with MapReduce. It has a framework that lets you set up Taps to access sources (your CSV file) and process the data inside a pipeline, for example to add column A to column B and place the sum into column C by selecting them as Fields.
Use BigTable, which means converting your database to one big table.
