How to specify number of partitions when writing a Parquet file? - parquet

parquet_writer.write_table(table)
This line writes a single file.
The documentation says:
This creates a single Parquet file. In practice, a Parquet dataset may consist of many files in many directories. We can read a single file back with read_table:
Is there a way for PyArrow to create a Parquet file in the form of a directory with multiple part files in it, such as:
ls -lrt permit-inspections-recent.parquet
... 14:53 part-00001-bd5d902d-fac9-4e03-b63e-6a8dfc4060b6.snappy.parquet
... 14:53 part-00000-bd5d902d-fac9-4e03-b63e-6a8dfc4060b6.snappy.parquet
Regards,
Yash

You need to tell Arrow how to partition the data. This is done with the partition_cols argument. See here: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_to_dataset.html
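As a minimal sketch (the table contents, the "year" column, and the permits.parquet output path are made up for illustration), pyarrow.parquet.write_to_dataset writes one directory per partition value, each containing part files:

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical table with a "year" column used only to illustrate partitioning.
table = pa.table({
    "year": [2020, 2020, 2021, 2021],
    "permit_id": [1, 2, 3, 4],
})

# Writes a directory tree such as permits.parquet/year=2020/<part>.parquet
pq.write_to_dataset(table, root_path="permits.parquet", partition_cols=["year"])

# The whole directory can be read back as a single table.
combined = pq.read_table("permits.parquet")

Note that the resulting layout is Hive-style (one subdirectory per partition value) rather than the flat part-00000 files shown in the question.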

Related

Combine CSV files with condition

I need to combine all the CSV files (.csv) in some directory, provided that there is another file in the directory with the same name but a different extension (.csv.done).
If a csv file doesn't have a corresponding .done file, then I don't need it in the combine process.
What is the best way to do this using Bash?
This approach is a solution to your problem. I see you've commented that it "didn't work", but whatever the reason is for it not working, it's likely simple to fix e.g. if you forgot to include key details, or failed to adapt it appropriately to suit your specific situation. If you need further help troubleshooting, add more info to your question.
The approach:
for f in *.csv.done
do
cat "${f%.*}" >> combined_file.csv
done
How it works:
In your example, you have 3 files named 1.csv 2.csv 3.csv and two 'done' files named 1.csv.done 2.csv.done.
This script begins by making a list of all files that end in .csv.done (two files: 1.csv.done 2.csv.done).
It then uses a parameter expansion, specifically ${parameter%word}, to 'shorten' the name of the two files in the list to .csv (instead of .csv.done).
Then it 'prints' the content of the two 'shortened' filenames (1.csv and 2.csv) into a 'combined' file.
It doesn't 'print' the content of 1.csv.done or 2.csv.done, or 3.csv, because these files weren't in the original 'list'.
If you run this script multiple times, it will keep appending the contents of files 1.csv and 2.csv to the 'combined' file (only run it once, or delete the 'combined' file before running it again).

How to merge CSV files in Hadoop?

I am new to the Hadoop framework and I would like to merge 4 CSV files into a single file.
All the 4 CSV files have same headers and order is also the same.
I don't think Pig STORE offers such a feature.
You could use Spark's coalesce(1) function; however, there is little reason to do this, as almost all Hadoop processing tools prefer to read directories, not files.
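For completeness, a hedged PySpark sketch of that coalesce(1) approach (the input and output paths and the header option are assumptions, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read all the CSV files (same headers, same column order) from a hypothetical input directory.
df = spark.read.option("header", True).csv("/data/input_csvs")

# coalesce(1) forces a single output partition, so Spark writes one part-00000 file.
df.coalesce(1).write.option("header", True).csv("/data/merged_csv")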
You should ideally not be storing raw CSV in Hadoop for very long anyway; rather, convert it to ORC or Parquet, which are columnar formats. Especially if you are already reading CSV to begin with, do not output CSV again.
If the idea is to produce one CSV to download later, then I would suggest using Hive + Beeline to do that.
The following will store the result in a file on the local file system:
beeline -u 'jdbc:hive2://[databaseaddress]' --outputformat=csv2 -f yourSQlFile.sql > theFileWhereToStoreTheData.csv
Try using the getmerge utility to merge the CSV files.
For example, suppose EMP_FILE1.csv, EMP_FILE2.csv and EMP_FILE3.csv are placed at some location on HDFS. You can merge all these files into a single merged file at a new location (note that getmerge writes the merged output to the local file system).
hadoop fs -getmerge /hdfsfilelocation/EMP_FILE* /newhdfsfilelocation/MERGED_EMP_FILE.csv

filename of the sqoop import

When we import from an RDBMS to HDFS using Sqoop, we give a target directory to store the data. Once the job completes, we see files named like part-m-00000 as mapper output. Is there any way to specify the filename in which the data will be stored? Does Sqoop have any option like that?
According to this answer, you can specify arguments passed to MapReduce with the -D option, which can accept file name options:
-Dmapreduce.output.basename=myoutputprefix
Although this will change the basename of your file, it will not change the part numbers.
The same answer appears on other sites: cloudera, hadoopinrealworld.
No, you can't rename it.
You can specify --target-dir <dir> to set the directory where all the data is imported.
In this directory, you will see many part files (e.g. part-m-00000). These part files are created by the various mappers (remember -m <number> in your sqoop import command).
Since data is imported in multiple files, how would you name each part file?
I do not see any additional benefit in renaming them.

Generating TDB Dataset from archive containing N-TRIPLES files

Apologies, in advance, for a possible duplicate.
I have an archive containing 117,426 files (each in the N-TRIPLES format) that I wish to load into the default graph of a TDB dataset. Due to the large number of files, I need to be able to perform this import without manually selecting individual files for upload.
I am in Bash, with Jena and Fuseki distributions at my disposal.
If possible, I want to avoid the worst-case scenario of just writing a java application to do this. If I have to write a java application for this, what hooks exist in RIOT/TDB to perform programmatic bulk-loading?
As a general comment, one way is to concatenate the N-Triples files to generate one single file.
You can load many files at once with either tdbloader or tdbloader2.
tdbloader --loc DB ... your files ...
The 117,426 files may strain your OS for a single command-line invocation. You can pipe the files into tdbloader (it's just like concatenating the files first):
... | tdbloader --loc DB -- -
where ... is some way to get bash to cat the files (possibly from a subshell),
e.g. (you'll need to adjust this to pick up all 117,426 files):
( for x in data*.nt
do
cat "$x"
done
) | tdbloader --loc DB -- -

Mahout - Naive Bayes

I tried deploying the 20 newsgroups example with Mahout, and it seems to be working fine. Out of curiosity, I would like to dig deeper into the model statistics.
For example, the bayes-model directory contains the following subdirectories:
trainer-tfIdf trainer-thetaNormalizer trainer-weights
which contain part-00000 files. I would like to read the contents of these files for a better understanding, but the cat command doesn't seem to work; it prints some garbage.
Any help is appreciated.
Thanks
The 'part-00000' files are created by Hadoop, and are in Hadoop's SequenceFile format, containing values specific to Mahout. You can't open them as text files, no. You can find the utility class SequenceFileDumper in Mahout that will try to output the content as text to stdout.
As to what those values are to begin with, they're intermediate results of the multi-stage Hadoop-based computation performed by Mahout. You can read the code to get a better sense of what these are. The "tfidf" directory for example contains intermediate calculations related to term frequency.
You can read part-00000 files using Hadoop's filesystem -text option. Just go into the Hadoop directory and type the following:
`bin/hadoop dfs -text /Path-to-part-file/part-m-00000`
part-m-00000 will be printed to STDOUT.
If it gives you an error, you might need to set the HADOOP_CLASSPATH variable. For example, if running it gives you
text: java.io.IOException: WritableName can't load class: org.apache.mahout.math.VectorWritable
then add the corresponding class to the HADOOP_CLASSPATH variable
export HADOOP_CLASSPATH=/src/mahout/trunk/math/target/mahout-math-0.6-SNAPSHOT.jar
That worked for me ;)
In order to read part-00000 (sequence files) you need to use the "seqdumper" utility. Here's an example I used for my experiments:
MAHOUT_HOME$ bin/mahout seqdumper -s ~/clustering/experiments-v1/t14/tfidf-vectors/part-r-00000 -o ~/vectors-v2-1010
-s is the sequence file you want to convert to plain text
-o is the output file
