Is there a way to load local csv.gz file in huggingface dataloader? - huggingface-datasets

I'm using the datasets library by Hugging Face to load a CSV dataset stored locally. The problem is that the dataset is compressed and stored as a csv.gz file, so I'm not able to load it with the load_dataset('csv', '<local_file_path>.csv') method.

Just decompress it with gunzip:
gunzip -v yourdataset.csv.gz
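If shelling out isn't convenient, the same decompression can be done on the fly with Python's standard library. A minimal sketch (the path is a placeholder); once the plain .csv exists you can pass it to load_dataset as usual:

```python
import csv
import gzip

def read_csv_gz(path):
    """Decompress a .csv.gz file on the fly and return its rows."""
    # gzip.open in text mode ("rt") decompresses transparently,
    # so csv.reader sees ordinary text lines.
    with gzip.open(path, mode="rt", newline="") as f:
        return list(csv.reader(f))
```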

Related

Create parquet file with qualified name in databricks

I have to process some raw CSV data files with cleansing transformations and load them as .parquet files in the cleanse layer. The raw layer file (CSV) and the cleanse layer file should have the same name.
But I cannot save the .parquet file under the given name: a directory with that name is created, and underneath it the .parquet part files are saved with random names. Please help me accomplish this.
This is how Parquet output is designed: a collection of multiple row groups split across part files.
The name you give is the folder the chunks of data will be saved under.
If you want a single file, you will have to use a different file format, and you will likely lose the parallel read and write capabilities Parquet offers.
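A common workaround, sketched below, assumes the DataFrame was first written as a single part (e.g. with coalesce(1)) to a locally accessible path; the lone part file is then moved out of the directory and given the qualified name. promote_part_file is a hypothetical helper, not a Databricks or Spark API:

```python
import glob
import os
import shutil

def promote_part_file(parquet_dir, target_path):
    """Move the single part-*.parquet file out of the directory Spark
    wrote and give it the desired qualified name."""
    parts = glob.glob(os.path.join(parquet_dir, "part-*.parquet"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    shutil.move(parts[0], target_path)
    # Drop the now-redundant directory wrapper (_SUCCESS marker etc.).
    shutil.rmtree(parquet_dir)
    return target_path
```

Note that the resulting file is still a valid single-file Parquet dataset; you have only discarded the directory layout, not the format.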

Amazon Glue - Create Single Parquet

I have a data source that generates hourly files in CSV format, which are pushed to S3. Using Glue I do some ETL and push the transformed data back to S3.
The other department that consumes this data wants yesterday's files consolidated into a single file.
I have written a Python program that consolidates yesterday's 24 files into a single CSV file.
It is now also required that the single consolidated file be available in Parquet.
I created a crawler to generate my CSV table, and I have a Glue job that converts the single transformed file into Parquet. However, I get multiple parts of the Parquet file, which I believe is because of Snappy compression, but I want a single one. How can I do this in Glue? Secondly, I would like to understand when to use multiple Parquet files and when it makes sense to create a single one.
You can convert the DynamicFrame out to a Spark DataFrame, call repartition(1), and then call write.
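The CSV consolidation step mentioned in the question can be sketched with the standard library (the glob pattern and paths are placeholders; the real job would read from S3 rather than local disk):

```python
import csv
import glob

def consolidate_csv(pattern, out_path):
    """Concatenate hourly CSV files into one, keeping the header once."""
    paths = sorted(glob.glob(pattern))
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for i, path in enumerate(paths):
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)
                if i == 0:
                    writer.writerow(header)  # write the header only once
                writer.writerows(reader)
    return out_path
```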

How to read parquet file from s3 bucket in nifi?

I am trying to read a Parquet file from an S3 bucket in NiFi.
To read the file I used the ListS3 and FetchS3Object processors and then the ExtractAttribute processor; up to there it looked fine.
The files are parquet.gz files, and by no means was I able to generate usable flowfiles from them. My final purpose is to load the file into Snowflake.
FetchParquet works with HDFS, which we are not using.
My next option is to use the ExecuteScript processor (with Python) to read these Parquet files and save them back as text.
Can somebody please suggest a workaround?
It depends what you need to do with the Parquet files.
For example, if you wanted to get them onto your local disk, then ListS3 -> FetchS3Object -> PutFile would work fine. That scenario is just moving bytes around, so it doesn't matter whether the content is Parquet or not.
If you need to actually interpret the Parquet data in some way, which it sounds like you do in order to get it into a database, then you need to use FetchParquet, convert from Parquet to some other format like Avro, JSON, or CSV, and then send that to one of the database processors.
You can use the Fetch/Put Parquet processors, or any other HDFS processors, with S3 by configuring a core-site.xml with an S3 filesystem.
http://apache-nifi-users-list.2361937.n4.nabble.com/PutParquet-with-S3-td3632.html
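A minimal core-site.xml sketch for that approach; the property names assume the Hadoop s3a connector is available in the Hadoop libraries your NiFi processors are pointed at, and the values are placeholders:

```xml
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

Point the processor's "Hadoop Configuration Resources" property at this file and use s3a:// paths for the directory.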

Can I pull data directly from hive table to H2O?

We have our data stored in Hive text files and Parquet files. Is there any way to load directly from these into H2O, or do we have to go through an intermediate step like CSV or a pandas dataframe?
Yes, you can find all the information you need in the H2O documentation.
H2O currently supports the following file types:
CSV (delimited) files (including GZipped CSV)
ORC
SVMLight
ARFF
XLS
XLSX
Avro version 1.8.0 (without multifile parsing or column type modification)
Parquet
Notes:
ORC is available only if H2O is running as a Hadoop job.
Users can also import Hive files that are saved in ORC format.
When doing a parallel data import into a cluster:
If the data is an unzipped CSV file, H2O can do offset reads, so each node in your cluster can directly read its part of the CSV file in parallel.
If the data is zipped, H2O has to read and unzip the whole file before doing the parallel read.
So, if you have very large data files and are reading from HDFS, it is best to use unzipped CSV. But if the data is farther away than the LAN, then it is best to use zipped CSV.
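The offset-read distinction above can be illustrated with a toy stdlib sketch (not H2O's actual reader): a plain file supports seeking straight to a byte range, while a gzip stream has to be decompressed from the start.

```python
import gzip

def read_slice_plain(path, offset, length):
    """A worker can seek directly to its assigned byte range."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

def read_slice_gzip(path, offset, length):
    """gzip has no random access: every byte before the offset must
    still be decompressed (and discarded) first."""
    with gzip.open(path, "rb") as f:
        f.read(offset)  # sequential decompression up to the offset
        return f.read(length)
```

Both return the same bytes, but only the plain-file version skips the preceding data, which is what makes per-node parallel reads cheap.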

Analyzing huge amount of JSON files on S3

I have a huge amount of JSON files, >100 TB in total. Each JSON file is 10 GB bzipped, each line contains a JSON object, and they are stored on S3.
If I want to transform the JSON into CSV (also stored on S3) so I can import it into Redshift directly, is writing custom code using Hadoop the only choice?
Would it be possible to do ad hoc queries on the JSON files without transforming the data into another format? Since the source keeps growing, I don't want to convert it every time I need to run a query.
The quickest and easiest way would be to launch an EMR cluster loaded with Hive to do the heavy lifting. Using the JsonSerde, you can easily transform the data into CSV format: it only requires an INSERT of the data into a CSV-formatted table from the JSON-formatted table.
A good tutorial for handling the JsonSerde can be found here:
http://aws.amazon.com/articles/2855
Also a good library used for CSV format is:
https://github.com/ogrodnek/csv-serde
The EMR cluster can be short-lived and only needed for that one job, and it can also run across low-cost spot instances.
Once you have the CSV format, the Redshift COPY documentation should suffice.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
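The per-line JSON-to-CSV transformation Hive performs can be sketched in plain Python (the field names are hypothetical; in practice the JsonSerde and an INSERT...SELECT do this at scale across the cluster):

```python
import csv
import io
import json

def jsonl_to_csv(lines, fieldnames):
    """Flatten newline-delimited JSON objects into CSV rows."""
    buf = io.StringIO()
    # extrasaction="ignore" drops JSON keys not listed in fieldnames.
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for line in lines:
        writer.writerow(json.loads(line))
    return buf.getvalue()
```

The resulting CSV can then be loaded with Redshift's COPY command as described in the documentation linked above.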
