Spark 2.2.0 no longer includes DirectParquetOutputCommitter. As an alternative, I can use
dataset.write
  .option("mapreduce.fileoutputcommitter.algorithm.version", "2") // magic here
  .parquet("s3a://...")
to avoid creating the _temporary folder on S3.
Everything works fine until I add partitionBy to my Dataset:
dataset.write
  .partitionBy("a", "b")
  .option("mapreduce.fileoutputcommitter.algorithm.version", "2") // magic stops working; _temporary is created on S3
  .parquet("s3a://...")
I also tried adding the following, but it didn't work:
spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
With partitionBy on a Spark Dataset, it creates _temporary and then moves the files, which is a very slow operation.
Is there any alternative, or a configuration I am missing?
Hadoop 3.1's s3a will have a zero-rename committer built in (via HADOOP-13786). Until then, you can make use of its precursor, which is from Netflix.
Note that "algorithm 2" isn't a magic step that eliminates the _temporary directory; it just renames task output directly to the destination when the individual tasks commit. It is still prone to errors if there is delayed consistency in the directory listing, and it is still O(data). You cannot safely use either the v1 or v2 committers directly with S3, at least not with the S3A connector as shipped in Hadoop 2.x.
Alternatives (in order of recommendation + ease - top is best):
Use Netflix's S3Committer: https://github.com/rdblue/s3committer/
Write to HDFS, then copy to S3 (e.g. via s3distcp)
Don't use the partitionBy, but instead iterate over all the partition permutations and write the results dynamically to each partitioned directory
Write a custom file committer
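To illustrate the third alternative (iterating over the partition combinations yourself), here is a minimal sketch in plain Python. It only plans the writes: it groups rows by their partition values and builds the Hive-style key=value directory that partitionBy would produce. The bucket name, base path and row data are hypothetical, and the actual write call (for example, one filtered write per directory, with the partition columns dropped from the data) is left out.

```python
from collections import defaultdict


def partition_path(base, keys, values):
    """Build a Hive-style partition directory such as base/a=1/b=x."""
    segments = [f"{key}={value}" for key, value in zip(keys, values)]
    return "/".join([base.rstrip("/")] + segments)


def group_by_partition(rows, keys):
    """Group rows by the tuple of their partition-column values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[key] for key in keys)].append(row)
    return dict(groups)


# Hypothetical sample data and destination; not taken from the question.
rows = [
    {"a": 1, "b": "x", "value": 10},
    {"a": 1, "b": "y", "value": 20},
    {"a": 1, "b": "x", "value": 30},
]
keys = ("a", "b")
base = "s3a://example-bucket/table"

# One entry per partition combination: destination directory -> rows to write.
write_plan = {
    partition_path(base, keys, combo): members
    for combo, members in group_by_partition(rows, keys).items()
}

for path, members in sorted(write_plan.items()):
    print(path, len(members))
```

Each entry of write_plan would then become one plain, unpartitioned write to its own directory.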
Related
rdd.saveAsTextFile("s3n://bucket-name/path") creates an empty file named after the folder: [folder-name]_$folder$
It seems this empty file is used by the hadoop-aws jar (of org.apache.hadoop) to mimic the S3 filesystem as a Hadoop filesystem.
But my application writes thousands of files to S3. Because saveAsTextFile creates a folder (from the given path) to write the data (from the rdd), my application ends up creating thousands of these empty files: [directory-name]_$folder$.
Is there a way to make rdd.saveAsTextFile not to write these empty files?
Stop using s3n and switch to s3a. It's faster and actually supported. That will make this issue go away, along with the atrocious performance problems when reading large Parquet/ORC files.
Also, if your app is creating thousands of small files in S3, you are creating future performance problems: listing and opening files on S3 is slow. Try to combine source data into larger columnar-formatted files and use whatever SELECT mechanism your framework has to read only the bits you want.
I'm new to big data! I have some questions about how to process and how to store a large number of small files (pdf and ppt/pptx) in Spark, on EMR clusters.
My goal is to save the data (pdf and pptx) into HDFS (or some other type of datastore on the cluster), then extract the content of these files with Spark and save it in Elasticsearch or a relational database.
I have read about the small files problem when saving data in HDFS. What is the best way to save a large number of pdf and pptx files (maximum size 100-120 MB)? I have read about Sequence Files and HAR (Hadoop archive), but I don't understand exactly how either of them works, and I can't figure out which is best.
What is the best way to process these files? I understand that some solutions could be FileInputFormat or CombineFileInputFormat, but again I don't know exactly how they work. I know that I can't run every small file as a separate task, because that would make the cluster a bottleneck.
Thanks!
1) If you use object stores (like S3) instead of HDFS, then there is no need to apply any changes or conversions to your files, and you can keep each of them as a single object or blob (this also means they are easily readable using standard tools and needn't be unpacked or reformatted with custom classes or code).
You can then read the files using Python tools like boto (for S3) or, if you are working with Spark, using the wholeTextFiles or binaryFiles methods and then creating a BytesIO (Python) / ByteArrayInputStream (Java) to read them with standard libraries.
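As a rough sketch of the BytesIO approach: a .pptx file is a ZIP container, so once the raw bytes of one file are in memory they can be wrapped in io.BytesIO and handed to the standard zipfile module. The tiny archive built below is only a stand-in for real file content, and the helper name is hypothetical. In PySpark the bytes would typically be the second element of each (path, content) pair returned by binaryFiles.

```python
import io
import zipfile


def list_archive_members(raw):
    """Wrap raw bytes in a file-like object and list the ZIP entries."""
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        return archive.namelist()


# Build a small in-memory ZIP as a stand-in for the bytes of a real .pptx file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("ppt/slides/slide1.xml", "<slide/>")
raw_bytes = buffer.getvalue()

members = list_archive_members(raw_bytes)
print(members)
```

The same function can be mapped over the contents of many files, since it never touches the local disk.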
2) When processing the files, you have the distinction between items and partitions. If you have 10,000 files, you can create 100 partitions containing 100 files each. Each file still needs to be processed one at a time, since the header information is relevant and likely different for each file.
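The items-versus-partitions arithmetic can be sketched in plain Python. This is only an illustration of how 10,000 file names fall into 100 groups of 100; in Spark the grouping would normally be requested through the number of partitions rather than computed by hand. The file names are hypothetical.

```python
def split_into_partitions(items, num_partitions):
    """Split items into num_partitions contiguous, nearly equal groups."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    base, extra = divmod(len(items), num_partitions)
    partitions = []
    start = 0
    for index in range(num_partitions):
        # The first `extra` partitions take one additional item each.
        end = start + base + (1 if index < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


# Hypothetical file names standing in for the real documents.
files = [f"doc_{number:05d}.pdf" for number in range(10000)]
partitions = split_into_partitions(files, 100)

print(len(partitions), len(partitions[0]))
```

Within each partition the files are still handled one by one; the partition only decides which task picks them up.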
Meanwhile, I found some solutions for that small files problem in HDFS. I can use the following approaches:
HDFS Federation helps to distribute the load across namenodes: https://hortonworks.com/blog/an-introduction-to-hdfs-federation/
HBase could also be a good alternative if your files are not too large.
There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to the Data Model, and that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.
https://hbase.apache.org/book.html
Apache Ozone, which is object storage like S3 but on-premises. At the time of writing, as far as I know, Ozone is not production ready. https://hadoop.apache.org/ozone/
We have a system, including some Oracle and Microsoft SQL DBMSs, that gets data from several different sources and in different formats, then stores and processes it. "Different formats" means files: dbf, xls and others, including binary formats (images), which are imported into the DBMS with different tools, plus direct access to the databases. I want to isolate all the incoming data, store it "forever", and be able to retrieve it later by source and creation time. After some research I want to try the Hadoop ecosystem, but I'm not quite sure whether it's an adequate solution for this goal. And which parts of the ecosystem should I use? HDFS alone, Hive, maybe something else? Could you give me some advice?
I assume you want to store the files that contain the data -- effectively a searchable file archive.
The files themselves can just be stored in HDFS ... or you may find a system like Amazon's S3 cheaper and more flexible. As you store the files, you could manage the other data about the data, namely: location, source, and creation time by appending to another file -- a simple tab-separated file or several other formats supported by Hadoop make this easy.
You can manage and query the file with Hive or other SQL-on-Hadoop tools. In effect, you're creating a simple file system with special attributes, so the trick would be to make sure that each time you write a file, you also write the metadata. You may have to handle cases like write failures, what happens when you delete, rename, or move files (I know, you say "never").
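A minimal sketch of the "write the metadata every time you write a file" idea, using a local tab-separated file and Python's csv module. The field names, paths and timestamps are hypothetical, and failure handling (for example, a metadata write that fails after the file write succeeded) is deliberately left out.

```python
import csv
import os
import tempfile

# Hypothetical column layout for the metadata index.
FIELDS = ("location", "source", "created")


def append_metadata(index_path, location, source, created):
    """Append one tab-separated metadata record for a stored file."""
    with open(index_path, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter="\t").writerow([location, source, created])


def read_metadata(index_path):
    """Read the index back as a list of dicts keyed by FIELDS."""
    with open(index_path, newline="", encoding="utf-8") as handle:
        return [dict(zip(FIELDS, row)) for row in csv.reader(handle, delimiter="\t")]


index_path = os.path.join(tempfile.mkdtemp(), "file_index.tsv")
append_metadata(index_path, "/archive/foo/myfile.dbf", "foo", "2015-12-01T08:30:00")
append_metadata(index_path, "/archive/bar/myexcel.xls", "bar", "2015-12-01T09:15:00")

records = read_metadata(index_path)
print(records)
```

A file in this shape can be exposed as an external table to Hive or another SQL-on-Hadoop tool and then filtered by source and creation time.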
Depending on your needs, your solution might be simpler still: you may find that storing the data in subdirectories within HDFS (or AWS S3) is enough. For example, if you wanted to store DBF files from source "foo" and XLS files from "bar", created on December 1, 2015, you could simply create a directory structure like
/2015/12/01/foo/dbf/myfile.dbf
/2015/12/01/bar/xls/myexcel.xls
This solution has the advantage of being self-maintaining -- the file path stores the metadata which makes it very portable and simple, requiring nothing more than a shell script to implement.
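The path convention above can be generated mechanically. The following sketch assumes the layout /YYYY/MM/DD/source/extension/filename from the example; the function name is hypothetical.

```python
import posixpath
from datetime import date


def archive_path(created, source, filename):
    """Build /YYYY/MM/DD/<source>/<extension>/<filename> for a stored file."""
    extension = posixpath.splitext(filename)[1].lstrip(".").lower()
    return posixpath.join("/", f"{created:%Y/%m/%d}", source, extension, filename)


dbf_path = archive_path(date(2015, 12, 1), "foo", "myfile.dbf")
xls_path = archive_path(date(2015, 12, 1), "bar", "myexcel.xls")
print(dbf_path)
print(xls_path)
```

Because the source and creation date are encoded in the path, retrieving files "by source and creation time" becomes a matter of listing the right directory prefix.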
I don't think there's any reason to make the solution more complicated than necessary. Hadoop or S3 are both fine for long-term, high-durability storage and for querying. My company has found that storing the information about the file in Hadoop (which we use for many other purposes) and storing the files themselves on AWS S3 is far simpler, more easily secured and much cheaper.
There are various things that you may want to do, each with their own solution. If more than 1 use case is relevant for you, you probably want to implement multiple solutions in parallel.
1. Store files for use
If you want to store files in a way that they can be picked up efficiently (distributed), the solution is simple: Put the files on HDFS.
2. Store the information for use
If you want to use the information rather than the files themselves, you should be interested in storing the information in a way that it can be picked up efficiently. The general solution here would be: Parse the files in a lossless way and store their information in a database.
You may find that storing information in (partitioned) ORC files can be nice for this. You can do this with Hive, Pig or even UDFs (e.g. python) in Pig.
3. Keep the files for the future
In this case you would mostly care about preserving the files, and not so much about ease of access. Here the recommended solution is: Store compressed files with proper backups
Note that the replication that HDFS does is there to deal more efficiently with data (and hardware issues). Just having your data on HDFS does NOT mean that it is backed up.
Thanks for the answers. I'm still not quite getting the answer I want. It's a particular question involving HDFS and the concat api.
Here it is. When concat talks about files, does it mean only "files created and managed by HDFS?" Or will it work on files that are not known to HDFS but just happen to live on the datanodes?
The idea is to
Create a file and save it through HDFS. It's broken up into blocks and saved to the datanodes.
Go directly to the datanodes and make local copies of the blocks using normal shell commands.
Alter those copies. I now have a set of blocks that Hadoop doesn't know about. The checksums are definitely bad.
Use concat to stitch the copies together and "register" them with HDFS.
At the end of all that, I have two files as far as HDFS is concerned. The original and an updated copy. Essentially, I put the data blocks on the datanodes without going through Hadoop. The concat code put all those new blocks into a new HDFS file without having to pass the data through Hadoop.
I don't think this will work, but I need to be sure it won't. It was suggested to me as a possible solution to the update problem. I need to convince them this will not work.
The base philosophy of HDFS is:
write-once, read-many
So it is not possible to update files with the base implementation of HDFS. You can only append to the end of an existing file, and only if you are using a Hadoop branch that allows it. (The original version doesn't allow it.)
An alternative could be to use a non-standard HDFS replacement such as the MapR file system: https://www.mapr.com/blog/get-real-hadoop-read-write-file-system#.VfHYK2wViko
Go for HBase, which is built on top of Hadoop to support CRUD operations in the big data Hadoop world.
If you are not supposed to use a NoSQL database, then there is no way to update HDFS files. The only option is to rewrite them.
While building an infrastructure for one of my current projects I've faced the problem of replacement of already existing HDFS files. More precisely, I want to do the following:
We have a few machines (log-servers) which are continuously generating logs. We have a dedicated machine (log-preprocessor) which is responsible for receiving log chunks (each chunk is about 30 minutes in length and 500-800 MB in size) from the log-servers, preprocessing them and uploading them to the HDFS of our Hadoop cluster.
Preprocessing is done in 3 steps:
for each logserver: filter (in parallel) received log chunk (output file is about 60-80mb)
combine (merge-sort) all output files from the step1 and do some minor filtering (additionally, 30-min files are combined together into 1-hour files)
using current mapping from external DB, process the file from step#2 to obtain the final logfile and put this file to HDFS.
The final logfiles are used as input for several periodic Hadoop applications which run on a Hadoop cluster. In HDFS the logfiles are stored as follows:
hdfs:/spool/.../logs/YYYY-MM-DD.HH.MM.log
Problem description:
The mapping which is used on step 3 changes over time and we need to reflect these changes by recalculating step3 and replacing old HDFS files with new ones. This update is performed with some periodicity (e.g. every 10-15 minutes) at least for last 12 hours. Please note that, if the mapping has changed, the result of applying step3 on the same input file may be significantly different (it will not be just a superset/subset of previous result). So we need to overwrite existing files in HDFS.
However, we can't just do hadoop fs -rm and then hadoop fs -copyFromLocal, because if some Hadoop application is using the file while it is temporarily removed, the app may fail. The solution I use: put a new file next to the old one; the files have the same name but different suffixes denoting the file's version. Now the layout is the following:
hdfs:/spool/.../logs/2012-09-26.09.00.log.v1
hdfs:/spool/.../logs/2012-09-26.09.00.log.v2
hdfs:/spool/.../logs/2012-09-26.09.00.log.v3
hdfs:/spool/.../logs/2012-09-26.10.00.log.v1
hdfs:/spool/.../logs/2012-09-26.10.00.log.v2
During its start (setup), any Hadoop application chooses the files with the most up-to-date versions and works with them. So even if some update is going on, the application will not experience any problems because no input file is removed.
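The version selection described above could look roughly like the following sketch, which assumes the .log.vN suffix convention from the listing and compares versions numerically (so that v10 sorts after v9). The function name is hypothetical, and obtaining the directory listing from HDFS is left out.

```python
import re

# Matches names such as ".../2012-09-26.09.00.log.v3" and captures base and version.
VERSIONED = re.compile(r"^(?P<base>.+\.log)\.v(?P<version>\d+)$")


def latest_versions(paths):
    """Return, for each logfile, the path carrying the highest version suffix."""
    latest = {}
    for path in paths:
        match = VERSIONED.match(path)
        if match is None:
            continue  # ignore anything that does not follow the naming scheme
        base = match.group("base")
        version = int(match.group("version"))
        if base not in latest or version > latest[base][0]:
            latest[base] = (version, path)
    return sorted(path for _, path in latest.values())


listing = [
    "hdfs:/spool/.../logs/2012-09-26.09.00.log.v1",
    "hdfs:/spool/.../logs/2012-09-26.09.00.log.v2",
    "hdfs:/spool/.../logs/2012-09-26.09.00.log.v3",
    "hdfs:/spool/.../logs/2012-09-26.10.00.log.v1",
    "hdfs:/spool/.../logs/2012-09-26.10.00.log.v2",
]
selected = latest_versions(listing)
print(selected)
```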
Questions:
Do you know some easier approach to this problem which does not use this complicated/ugly file versioning?
Some applications may start using an HDFS file that is still being uploaded (applications see this file in HDFS but don't know whether it is consistent). In the case of gzip files this may lead to failed mappers. Could you please advise how I could handle this issue? I know that for local file systems I can do something like:
cp infile /finaldir/outfile.tmp && mv /finaldir/outfile.tmp /finaldir/outfile
This works because mv is an atomic operation; however, I'm not sure that this is the case for HDFS. Could you please advise whether HDFS has some atomic operation like mv in conventional local file systems?
Thanks in advance!
IMO, the file rename approach is absolutely fine to go with.
HDFS, up to 1.x, lacks atomic renames (they are dirty updates, IIRC), but the operation has usually been considered 'atomic-like' and has never caused problems in the specific scenario you have in mind here. You can rely on this without worrying about a partial state, since the source file is already created and closed.
HDFS 2.x onwards supports proper atomic renames (via a new API call) that have replaced the earlier version's dirty one. It is also the default behavior of rename if you use the FileContext APIs.
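For reference, here is the write-then-rename pattern as a local-filesystem sketch in Python, where os.replace plays the role of mv. It is only an analogue of the HDFS case: on HDFS the same shape would be an upload to a temporary name followed by a rename to the final name. The function name and file contents are hypothetical.

```python
import os
import tempfile


def publish_atomically(data, final_path):
    """Write data to a temporary file in the target directory, then rename it."""
    directory = os.path.dirname(final_path) or "."
    # The temporary file must live on the same filesystem as the final path.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        # Readers see either the old file or the complete new one, never a partial one.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


target_dir = tempfile.mkdtemp()
target = os.path.join(target_dir, "outfile")
publish_atomically(b"first version\n", target)
publish_atomically(b"second version\n", target)

with open(target, "rb") as handle:
    print(handle.read())
```

Readers that only ever open the final name never observe a half-written file, which is the property the gzip mappers in the question need.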