Is it possible to write parquet statistics with pyarrow? - parquet

This option exists in Spark, and I saw that pyarrow's write_table() accepts **kwargs, but following the .pyx source I couldn't trace it to anything like min/max.
Is this supported, and if so, how is it achieved?

pyarrow already writes the min/max statistics for Parquet files by default. There is no option for this in pyarrow because the underlying parquet-cpp library always writes them. At the time of writing, only min and max are written; the other statistics can neither be supplied manually nor are they computed on the fly by parquet-cpp. If you need them, you should open an issue in (Py)Arrow's issue tracker and consider contributing the missing code yourself.
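For example, here is a minimal sketch (the file name and column are made up for illustration) that writes a table with write_table() and then reads the min/max statistics back from the file metadata:
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({"x": [1, 5, 3]})
pq.write_table(table, "example.parquet")  # min/max statistics are written by default
stats = pq.ParquetFile("example.parquet").metadata.row_group(0).column(0).statistics
print(stats.min, stats.max, stats.null_count)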

Related

Sync files on hdfs having same size but varies in contents

I am trying to sync files from one Hadoop cluster to another using DistCp and Airbnb's ReAir utility, but neither of them is working as expected.
If the file size is the same on source and destination, both of them fail to update it even if the file contents have changed (the checksums also differ), unless the overwrite option is used.
I need to keep around 30 TB of data in sync, so loading the complete dataset every time is not feasible.
Could anyone please suggest how I can bring two datasets in sync when the file sizes are the same (the count in the source has changed) but the checksums differ?
The way DistCp handles syncing between files that are the same size but have different contents is by comparing their so-called FileChecksum. The FileChecksum was first introduced in HADOOP-3981, mostly for the purpose of being used in DistCp. Unfortunately, this has the known shortcoming of being incompatible between different storage implementations, and even incompatible between HDFS instances that have different internal block/chunk settings. Specifically, that FileChecksum bakes in the structure of having, for example, 512 bytes per chunk and 128 MB per block.
Since GCS doesn't have the same notions of "chunks" or "blocks", there's no way for it to have any similar definition of a FileChecksum. The same is also true of all other object stores commonly used with Hadoop; the DistCp documentation appendix discusses this fact under "DistCp and Object Stores".
That said, there's a neat trick that can be done to define a nice standardized representation of a composite CRC for HDFS files that is mostly in-place compatible with existing HDFS deployments; I've filed HDFS-13056 with a proof of concept to try to get this added upstream, after which it should be possible to make it work out-of-the-box against GCS, since GCS also supports file-level CRC32C.

Save and Process huge amount of small files with spark

I'm new to big data! I have some questions about how to process and how to save a large amount of small files (pdf and ppt/pptx) in Spark, on EMR clusters.
My goal is to save the data (pdf and pptx) into HDFS (or some other datastore on the cluster), then extract the content from these files with Spark and save it in Elasticsearch or some relational database.
I have read about the small-files problem when saving data in HDFS. What is the best way to save a large amount of pdf & pptx files (maximum size 100-120 MB)? I have read about Sequence Files and HAR (Hadoop archive), but I don't understand exactly how either of them works and can't figure out which is best.
What is the best way to process these files? I understood that some solutions could be FileInputFormat or CombineFileInputFormat, but again I don't know exactly how they work. I know that I can't run every small file as a separate task because the cluster would become a bottleneck.
Thanks!
If you use object stores (like S3) instead of HDFS, then there is no need to apply any changes or conversions to your files; you can keep each of them as a single object or blob (this also means they are easily readable with standard tools and don't need to be unpacked or reformatted with custom classes or code).
You can then read the files using Python tools like boto (for S3), or, if you are working with Spark, using the wholeTextFiles or binaryFiles methods and then wrapping the result in a BytesIO (Python) / ByteArrayInputStream (Java) to read it with standard libraries.
2) When processing the files, the distinction that matters is between items and partitions. If you have 10,000 files you can create 100 partitions containing 100 files each. Each file will need to be processed one at a time anyway, since the header information is relevant and likely different for each file.
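A rough PySpark sketch of this approach (the s3a:// path is a placeholder, and the real PDF/PPTX parsing is left as a stub where BytesIO is used):
import io
from pyspark import SparkContext
sc = SparkContext(appName="small-files-demo")
blobs = sc.binaryFiles("s3a://my-bucket/documents/")  # one (path, raw bytes) pair per file
blobs = blobs.repartition(100)                        # e.g. 10000 files -> 100 partitions of ~100 files
def extract_content(path_and_bytes):
    path, raw = path_and_bytes
    buf = io.BytesIO(raw)               # wrap the bytes so a standard parsing library can treat them as a file
    return (path, len(buf.getvalue()))  # stub: swap in your real PDF/PPTX parser here
results = blobs.map(extract_content).collect()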
Meanwhile, I found some solutions for that small files problem in HDFS. I can use the following approaches:
HDFS Federation help us to distribute the load of namenodes: https://hortonworks.com/blog/an-introduction-to-hdfs-federation/
HBase could also be a good alternative if your file sizes are not too large.
There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to the Data Model, and that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.
https://hbase.apache.org/book.html
Apache Ozone is object storage like S3, but on-premises. At the time of writing, as far as I know, Ozone is not production-ready. https://hadoop.apache.org/ozone/

h2o sparkling water save frame to disk

I am trying to import a frame by creating an h2o frame from a Spark parquet file.
The file is 2 GB and has about 12M rows and sparse vectors with 12k cols.
It is not that big in parquet format, but the import takes forever.
In h2o it is actually reported as 447 MB compressed size. Quite small, actually.
Am I doing it wrong? And once I do finish importing (it took 39 min), is there any way in h2o to save the frame to disk for fast loading next time?
I understand h2o does some magic behind the scenes which takes so long, but I only found a download-CSV option, which is slow and huge for 11k x 1M sparse data, and I doubt it would be any faster to import.
I feel like there is a part missing. Any info about h2o data import/export is appreciated.
Model save/load works great, but train/val/test data loading seems an unreasonably slow procedure.
I have 10 Spark workers with 10 GB each and gave the driver 8 GB. This should be plenty.
Use h2o.exportFile() (h2o.export_file() in Python), with the parts argument set to -1. The -1 effectively means that each machine in the cluster will export just its own data. In your case you'd end up with 10 files, and it should be 10 times quicker than otherwise.
To read them back in, use h2o.importFile() and specify all 10 parts when loading:
frame <- h2o.importFile(c(
"s3n://mybucket/my.dat.1",
"s3n://mybucket/my.dat.2",
...
) )
By giving an array of files, they will be loaded and parsed in parallel.
For a cluster on a local LAN I'd recommend using HDFS for this. I've had reasonable results keeping the files on S3 when running a cluster on EC2.
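For completeness, a hedged Python equivalent of the R snippet above, assuming the h2o Python client, a placeholder HDFS path, and that frame is the H2OFrame you already built:
import h2o
h2o.export_file(frame, "hdfs:///tmp/my_frame_export", parts=-1)  # -1: one part file per node
frame2 = h2o.import_file("hdfs:///tmp/my_frame_export")          # reads the part files back in parallel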
I suggest exporting the dataframe from Spark into the SVMLight file format (see MLUtils.saveAsLibSVMFile(...)). This format can then be natively ingested by H2O.
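A rough PySpark sketch of that suggestion, with a toy RDD standing in for your real sparse rows and a placeholder HDFS path:
from pyspark import SparkContext
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import SparseVector
import h2o
sc = SparkContext(appName="svmlight-export-demo")
rows = sc.parallelize([LabeledPoint(0.0, SparseVector(12000, {3: 1.0})),
                       LabeledPoint(1.0, SparseVector(12000, {7: 2.5}))])
MLUtils.saveAsLibSVMFile(rows, "hdfs:///tmp/svmlight_export")
h2o.init()
frame = h2o.import_file("hdfs:///tmp/svmlight_export")  # H2O parses the SVMLight files natively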
As Darren pointed out, you can export data from H2O in multiple parts, which speeds up the export. However, H2O currently only supports export to CSV files, which is sub-optimal for your use case of very sparse data. This functionality is accessible via the Java API:
water.fvec.Frame.export(yourFrame, "/target/directory", yourFrame.key.toString, true, -1 /* automatically determine number of part files */)

Is it possible for Parquet to compress the summary file (_metadata) in the MR job?

Right now we are using a MapReduce job to convert data and store the result in the Parquet format.
A summary file (_metadata) is generated as well, but the problem is that it is too big (over 5 GB). Is there any way to reduce its size?
Credits to Alex Levenson and Ryan Blue:
Alex Levenson:
You can push the reading of the summary file to the mappers instead of reading it on the submitter node:
ParquetInputFormat.setTaskSideMetaData(conf, true);
(Ryan Blue: This is the default from 1.6.0 forward)
or setting "parquet.task.side.metadata" to true in your configuration. We
had a similar issue, by default the client reads the summary file on the
submitter node which takes a lot of time and memory. This flag fixes the
issue for us by instead reading each individual file's metadata from the
file footer in the mappers (each mapper reads only the metadata it needs).
Another option, which is something we've been talking about in the past, is
to disable creating this metadata file at all, as we've seen creating it
can be expensive too, and if you use the task side metadata approach, it's
never used.
(Ryan Blue: There's an option to suppress the files, which I recommend. Now that file metadata is handled on the tasks, there's not much need for the summary files.)

Accessing Hadoop Distributed Cache in UDF

Is it possible to pick up files from the distributed cache in a UDF?
Before diving in further, I spent quite a bit of time trying to find an answer to this particular question (on StackOverflow and otherwise), and was not able to find one.
The main crux of the problem is as follows: I would like to take a file that is already on HDFS, copy it over to the distributed cache in Pig, and then be able to read this file from the cache in a Java UDF. Another hiccup is that, due to the design of the program, I am unable to extend 'EvalFunc', which might otherwise solve the issue.
I specified SET mapred.cache.files '$PATH_TO_FILE_ON_HDFS' as well as SET mapred.create.symlink 'yes' in my Pig script, passed the file path as a parameter to the UDF, and attempted to use the FileSystem and FileReader classes to access the file, to no avail.
Please let me know if I can further clarify this/provide any more relevant details.
