rdd.saveAsTextFile("s3n://bucket-name/path) is creating an empty file with folder name as - [folder-name]_$folder$
Seems like this empty file in used by hadoop-aws jar (of org.apache.hadoop) to mimick S3 filesystem as hadoop filesystem.
But, my application writes thousands of files to S3. As saveAsTextFile creates folder (from the given path) to write the data (from rdd) my application ends up creating thousands of these empty files - [directory-name]_$folder$.
Is there a way to make rdd.saveAsTextFile not to write these empty files?
Stop using s3n and switch to s3a. It's faster and actually supported; that will make this issue go away, along with the atrocious performance problems reading large Parquet/ORC files.
Also, if your app is creating thousands of small files in S3, you are creating future performance problems: listing and opening files on S3 is slow. Try to combine source data into larger columnar-formatted files and use whatever SELECT mechanism your framework has to read only the bits you want.
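For reference, a minimal sketch of the same write over s3a (assuming the hadoop-aws jar and a matching AWS SDK are on the classpath; the bucket name and the credential wiring are placeholders, and credentials can also come from the default provider chain):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("s3a-write-example")
val sc = new SparkContext(conf)

// Placeholder credential wiring; instance profiles or env vars work just as well.
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

val rdd = sc.parallelize(Seq("line1", "line2"))
rdd.saveAsTextFile("s3a://bucket-name/path")  // note s3a://, not s3n://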
Related
I am using goroutines to download data from S3 concurrently. For context, I currently have a group of samples. Each sample contains data in the form of a map, where the key is the name of a file and the value points to its path in S3. Each sample has about 10 files that need to be downloaded from S3. I download all of these files in parallel and write to a shared zipfile object (I have the mutexes and such figured out).
I've figured out the concurrency aspect of this problem, but the issue I face is organizing the zipfile object. Is it possible to create a subdirectory within a zipfile object? Otherwise I'm left with a massive zip object of all the data I need, but it is not organized in any tangible way. Ideally, I'd be able to create a folder in the zipfile object for each sample and save all of that sample's file data to it, but I don't know if that's possible.
The zip format has no notion of a folder or directory; it just contains a list of files.
The file names may be composed so that they contain folder prefixes, but those folders are only "virtual" and are not recorded the way they are in "real" file systems.
So no, you can't create a directory in a zip file.
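The practical workaround is simply to put the "folder" into the entry name. A minimal sketch (in Scala over java.util.zip to match the rest of this thread; the same idea applies with Go's archive/zip, where the entry name is a slash-separated path; the sample names and data are placeholders):

import java.io.FileOutputStream
import java.util.zip.{ZipEntry, ZipOutputStream}

val zipOut = new ZipOutputStream(new FileOutputStream("samples.zip"))

// "Folders" inside a zip are just slash-separated prefixes in the entry names.
def addEntry(name: String, data: Array[Byte]): Unit = {
  zipOut.putNextEntry(new ZipEntry(name))  // e.g. "sample-001/metadata.json"
  zipOut.write(data)
  zipOut.closeEntry()
}

// Each sample's files go under its own "virtual" directory.
addEntry("sample-001/metadata.json", "{}".getBytes("UTF-8"))
addEntry("sample-002/metadata.json", "{}".getBytes("UTF-8"))
zipOut.close()

Any unzip tool will then present sample-001/ and sample-002/ as directories.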
In Spark 2.2.0 there is no DirectParquetOutputCommitter. As an alternative, I can use
dataset
  .write
  .option("mapreduce.fileoutputcommitter.algorithm.version", "2") // magic here
  .parquet("s3a://...")
to avoid creating the _temporary folder on S3.
Everything works fine until I add partitionBy to my Dataset:
dataset
  .write
  .partitionBy("a", "b")
  .option("mapreduce.fileoutputcommitter.algorithm.version", "2") // magic stops working: _temporary is created on S3
  .parquet("s3a://...")
I also tried adding the following, but it didn't work:
spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
With partitionBy on a Spark Dataset, it creates _temporary and then moves the files, which becomes a very slow operation.
Is there any alternative or missing configuration?
Hadoop 3.1's s3a will have a zero-rename committer built in (via HADOOP-13786). Until then, you can make use of its precursor from Netflix.
Note that "algorithm 2" isn't a magic step that eliminates the _temporary dir; it just renames task output directly to the destination when the individual tasks commit. It is still prone to errors if directory listings are delayed-consistent, and it is still O(data). You cannot safely use either the v1 or v2 committers directly with S3, not with the S3A connector as shipped in Hadoop 2.x.
Alternatives (in order of recommendation + ease - top is best):
Use Netflix's S3Committer: https://github.com/rdblue/s3committer/
Write to HDFS, then copy to S3 (e.g. via s3distcp)
Don't use partitionBy; instead, iterate over all the partition permutations and write the results directly to each partitioned directory (see the sketch after this list)
Write a custom file committer
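For the third alternative, a minimal sketch of what "iterate over the partition permutations" can look like, assuming a DataFrame named dataset with partition columns "a" and "b" (the base path and column names are placeholders, and collecting the distinct combinations to the driver is only reasonable when their number is small):

import org.apache.spark.sql.{DataFrame, SaveMode}

def writePartitionsByHand(dataset: DataFrame, baseDir: String): Unit = {
  // Enumerate the (a, b) combinations, then write each subset straight into its
  // Hive-style partition directory, avoiding a job-wide _temporary/rename phase.
  val combos = dataset.select("a", "b").distinct().collect()
  combos.foreach { row =>
    val (a, b) = (row.get(0), row.get(1))
    dataset
      .filter(dataset("a") === a && dataset("b") === b)
      .drop("a", "b")  // the partition values live in the path, as partitionBy would do
      .write
      .mode(SaveMode.Overwrite)
      .parquet(s"$baseDir/a=$a/b=$b")
  }
}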
Background: we are trying to read different file types (CSV or Parquet) into pyspark, and I have the task of writing a program that will determine the file type.
It appears that Parquet files are always directories; a Parquet file shows up in HDFS as a directory.
We also have some CSV files that are directories, where the file name is the directory name and the directory contains several part files. What processes do this?
Why are some files plain 'files' and some files 'directories'?
It will depend on what process produced those files. For example, when MapReduce produces output, it always produces a directory and then creates one output file per reducer within that directory. This is done so that each reducer can create its output independently.
Judging from Spark's CSV package, it expects to output to a single file. So perhaps the single-file CSVs are being generated by Spark and the directories by MapReduce.
To be as generic as possible, it may be a good idea to do the following: check whether the path in question is a directory. If not, check its extension. If it is, look at the extensions of the files inside the directory. This should work for each of your situations.
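A minimal sketch of that check using the Hadoop FileSystem API (the extension handling is simplified and the paths are placeholders; treat this as an illustration rather than a complete detector):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def guessFormat(pathStr: String): String = {
  val path = new Path(pathStr)
  val fs = FileSystem.get(new URI(pathStr), new Configuration())

  def byExtension(name: String): String =
    if (name.endsWith(".parquet")) "parquet"
    else if (name.endsWith(".csv")) "csv"
    else "unknown"

  if (fs.getFileStatus(path).isDirectory) {
    // Look at the data files inside, skipping markers like _SUCCESS.
    val children = fs.listStatus(path).map(_.getPath.getName)
      .filterNot(n => n.startsWith("_") || n.startsWith("."))
    children.map(byExtension).find(_ != "unknown").getOrElse("unknown")
  } else {
    byExtension(path.getName)
  }
}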
Note that some input formats (e.g. MapReduce input formats) will only accept directories as inputs, and some (e.g. Spark's textFile) will only accept files/globs of files. You need to be aware of what is expected from the libraries you are interacting with.
All the data on your hard drive consists of files and folders. The basic difference between the two is that files store data, while folders store files and other folders.
Hadoop execution engines generally create a directory and write multiple part files as output, based on the number of reducers or executors used.
When you name an output file abc.csv, it doesn't mean that it's a single file with the data. It's just the output location, which MapReduce (generally) interprets as a new directory to be created, within which it creates the output files (part files).
In the case of Spark, when you write a file (for example with .saveAsTextFile), it may create only a single file.
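A small illustration of that output layout (paths are placeholders; the exact number of part files depends on the number of partitions, so treat this as a sketch):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("output-layout-demo").getOrCreate()
val df = spark.range(0, 1000).toDF("id")

// Produces a directory out/csv-many/ containing one part-* file per partition plus _SUCCESS.
df.write.csv("out/csv-many")

// Still a directory, but with a single part-* file inside, because there is only one partition.
df.coalesce(1).write.csv("out/csv-single")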
I'm new to big data! I have some questions about how to process and how to save a large amount of small files (PDF and PPT/PPTX) in Spark, on EMR clusters.
My goal is to save the data (PDF and PPTX) into HDFS (or some type of datastore on the cluster), then extract the content from these files with Spark and save it in Elasticsearch or some relational database.
I have read about the small-files problem when saving data in HDFS. What is the best way to save a large amount of PDF & PPTX files (maximum size 100-120 MB)? I have read about Sequence Files and HAR (Hadoop archives), but I don't understand exactly how either of them works and I can't figure out which is best.
What is the best way to process these files? I understand that some solutions could be FileInputFormat or CombineFileInputFormat, but again I don't know exactly how they work. I know that I can't run every small file in a separate task, because the cluster would become a bottleneck.
Thanks!
If you use object stores (like S3) instead of HDFS, then there is no need to apply any changes or conversions to your files; you can keep each one as a single object or blob (this also means they are easily readable using standard tools and don't need to be unpacked or reformatted with custom classes or code).
You can then read the files using Python tools like boto (for S3), or, if you are working with Spark, using the wholeTextFiles or binaryFiles methods and then wrapping the result in a BytesIO (Python) / ByteArrayInputStream (Java) to read it with standard libraries.
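A minimal sketch of the binaryFiles route (the bucket/prefix is a placeholder and the actual PDF/PPTX parsing, e.g. with Apache Tika or POI, is left out):

import java.io.ByteArrayInputStream
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("small-files-demo").getOrCreate()
val sc = spark.sparkContext

// Each object becomes one (path, PortableDataStream) record.
val files = sc.binaryFiles("s3a://bucket-name/docs/")

val extracted = files.map { case (path, stream) =>
  val in = new ByteArrayInputStream(stream.toArray())  // hand the bytes to any standard parser
  // val text = parse(in)                              // placeholder for a Tika/POI call
  (path, in.available())                               // placeholder: just record the size
}
extracted.take(5).foreach(println)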
When processing the files, you have the distinction between items and partitions. If you have 10,000 files you can create 100 partitions containing 100 files each. Each file will need to be processed one at a time anyway, since the header information is relevant and likely different for each file.
Meanwhile, I found some solutions for the small-files problem in HDFS. I can use the following approaches:
HDFS Federation helps us distribute the load across namenodes: https://hortonworks.com/blog/an-introduction-to-hdfs-federation/
HBase could also be a good alternative if your file sizes are not too large (see the sketch after this list).
There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to the Data Model, and that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.
https://hbase.apache.org/book.html
Apache Ozone, which is object storage like S3 but on-premises. At the time of writing, as far as I know, Ozone is not production ready. https://hadoop.apache.org/ozone/
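For the HBase option, a minimal sketch of storing one small file as a cell value (assuming a table "docs" with a column family "f" already exists; the table, family, and row key names are placeholders, and the size caveats quoted above still apply):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("docs"))

// Read the whole (smallish) file into memory and store it as a single cell.
val pdfBytes = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("report.pdf"))
val put = new Put(Bytes.toBytes("report.pdf"))  // row key: the file name
put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("data"), pdfBytes)
table.put(put)

table.close()
connection.close()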
While building the infrastructure for one of my current projects I've faced the problem of replacing already existing HDFS files. More precisely, I want to do the following:
We have a few machines (log-servers) which continuously generate logs. We have a dedicated machine (log-preprocessor) which is responsible for receiving log chunks (each chunk is about 30 minutes in length and 500-800 MB in size) from the log-servers, preprocessing them, and uploading them to HDFS on our Hadoop cluster.
Preprocessing is done in 3 steps:
for each log-server: filter (in parallel) the received log chunk (the output file is about 60-80 MB)
combine (merge-sort) all output files from step 1 and do some minor filtering (additionally, 30-minute files are combined into 1-hour files)
using the current mapping from an external DB, process the file from step 2 to obtain the final logfile, and put this file into HDFS.
The final logfiles are used as input for several periodic Hadoop applications running on the Hadoop cluster. In HDFS, logfiles are stored as follows:
hdfs:/spool/.../logs/YYYY-MM-DD.HH.MM.log
Problem description:
The mapping used in step 3 changes over time, and we need to reflect these changes by recalculating step 3 and replacing old HDFS files with new ones. This update is performed with some periodicity (e.g. every 10-15 minutes), at least for the last 12 hours. Please note that if the mapping has changed, the result of applying step 3 to the same input file may be significantly different (it will not just be a superset/subset of the previous result). So we need to overwrite existing files in HDFS.
However, we can't just do hadoop fs -rm and then hadoop fs -copyFromLocal, because if some Hadoop application is using a file that has been temporarily removed, the app may fail. The solution I use is to put a new file next to the old one; the files have the same name but different suffixes denoting the files' version. Now the layout is the following:
hdfs:/spool/.../logs/2012-09-26.09.00.log.v1
hdfs:/spool/.../logs/2012-09-26.09.00.log.v2
hdfs:/spool/.../logs/2012-09-26.09.00.log.v3
hdfs:/spool/.../logs/2012-09-26.10.00.log.v1
hdfs:/spool/.../logs/2012-09-26.10.00.log.v2
Any Hadoop application, during its start (setup), chooses the files with the most up-to-date versions and works with them. So even if an update is going on, the application will not experience any problems, because no input file is removed.
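A minimal sketch of that version-selection step, assuming the YYYY-MM-DD.HH.MM.log.vN naming shown above (the base directory is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

def latestVersion(fs: FileSystem, dir: String, logName: String): Option[Path] = {
  // Glob for every version of this logfile and keep the highest .vN suffix.
  val statuses: Array[FileStatus] =
    Option(fs.globStatus(new Path(s"$dir/$logName.v*"))).getOrElse(Array.empty[FileStatus])
  val candidates = statuses.map(_.getPath)
  if (candidates.isEmpty) None
  else Some(candidates.maxBy(_.getName.stripPrefix(s"$logName.v").toInt))
}

val fs = FileSystem.get(new Configuration())
latestVersion(fs, "/spool/logs", "2012-09-26.09.00.log").foreach { p =>
  println(s"Using input: $p")  // e.g. .../2012-09-26.09.00.log.v3
}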
Questions:
Do you know some easier approach to this problem which does not use this complicated/ugly file versioning?
Some applications may start using an HDFS file which is currently being uploaded but not yet complete (applications see this file in HDFS but don't know whether it is consistent). In the case of gzip files this may lead to failed mappers. Could you please advise how I could handle this issue? I know that for local file systems I can do something like:
cp infile /finaldir/outfile.tmp && mv /finaldir/outfile.tmp /finaldir/outfile
This works because mv is an atomic operation; however, I'm not sure this is the case for HDFS. Could you please advise whether HDFS has some atomic operation like mv on conventional local file systems?
Thanks in advance!
IMO, the file rename approach is absolutely fine to go with.
HDFS, up to 1.x, lacks atomic renames (they were dirty updates, IIRC), but the operation has usually been considered 'atomic-like' and has never caused problems in the specific scenario you have in mind here. You can rely on it without worrying about a partial state, since the source file is already created and closed.
HDFS 2.x onwards supports proper atomic renames (via a new API call), which replaced the earlier version's dirty one. It is also the default behavior of rename if you use the FileContext APIs.
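A minimal sketch of the write-then-rename pattern over HDFS using the FileContext API mentioned above (Hadoop 2.x+; the paths are placeholders, and the .tmp file is assumed to be fully written and closed before the rename):

import org.apache.hadoop.fs.{FileContext, Options, Path}

val fc = FileContext.getFileContext()
val tmp = new Path("/spool/logs/2012-09-26.09.00.log.tmp")
val dst = new Path("/spool/logs/2012-09-26.09.00.log")

// Rename within the same filesystem; OVERWRITE atomically replaces an existing destination.
fc.rename(tmp, dst, Options.Rename.OVERWRITE)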