I am working with Dataproc and Parquet on Google Cloud Platform, with data on GCS, and writing lots of small to moderately sized files is a major hassle: it is a couple of times slower than writing fewer, bigger files, or than writing to HDFS.
The Hadoop community has been working on S3Guard, which uses DynamoDB to give S3A consistent metadata and listings. Similarly, s3committer uses S3's multi-part upload API to provide a simple alternative committer that is much more efficient.
I am looking for similar solutions on GCS. The S3 multi-part upload API is one of the few things not offered by GCS's XML API, so it cannot be used as is. Instead, GCS has a compose API where you upload objects separately and then issue a compose request to stitch them together. It seems like this could be used to adapt the multi-part upload approach from s3committer, but I am not quite sure.
I could not find any information about using S3Guard on GCS with an alternate key-value store (or about the S3A connector in general -- I am not even sure it can be used with the GCS XML API).
Zero-rename commits seem to be a common concern with Hadoop and Apache Spark. What are the usual solutions to that on GCS, besides writing fewer, bigger files?
There are a few different things in play here. For enforcing list consistency, Dataproc traditionally relied on a per-cluster NFS mount to apply client-enforced list-after-write consistency; more recently, Google Cloud Storage has improved its list-after-write consistency semantics, and list operations are now strongly consistent immediately after writes. Dataproc is phasing out client-enforced consistency, so something like S3Guard on DynamoDB is no longer needed for GCS.
As for multipart upload: in theory it could be possible to use GCS compose as you mention, but parallel multipart upload of a single large file mostly helps in single-stream situations. Most Hadoop/Spark workloads already parallelize across tasks on each machine, so there is little benefit in multithreading each individual upload stream; aggregate throughput will be about the same with or without parallel multipart uploads.
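If you do want to experiment with stitching separately uploaded parts together, a minimal sketch using the google-cloud-storage Python client could look like the following (the bucket and object names are hypothetical; compose accepts at most 32 source objects per request, so larger sets have to be composed in stages):

```python
from google.cloud import storage

def compose_parts(bucket_name, part_names, final_name):
    """Combine separately uploaded objects into one final object via compose."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    parts = [bucket.blob(name) for name in part_names]
    # Compose is limited to 32 source objects per call.
    bucket.blob(final_name).compose(parts)
    # Optionally clean up the intermediate part objects afterwards.
    for part in parts:
        part.delete()

# Hypothetical usage:
# compose_parts("my-bucket", ["tmp/part-0", "tmp/part-1"], "output/part-000.parquet")
```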
So that leaves the question of using the multi-part API to perform conditional/atomic commits. The GCS connector for Hadoop currently uses something called "resumable uploads", where it is theoretically possible for one node to be responsible for "committing" an object that was uploaded by a completely different node; the client libraries just aren't currently structured to make this very straightforward. At the same time, the "copy-and-delete" phase of a GCS "rename" differs from S3 in that the copy is done as a metadata operation rather than a true data copy. This makes GCS amenable to using vanilla Hadoop FileCommitters, instead of needing to commit "directly" into the final location and skip the "_temporary" machinery. It may not be ideal to have to copy/delete the metadata of each file instead of doing a true directory rename, but that cost is proportional only to the number of files, not to the underlying data size.
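To make that cost model concrete, a directory "rename" on GCS boils down to something like the sketch below (using the google-cloud-storage Python client; this illustrates the per-file copy + delete idea, it is not the actual connector code):

```python
from google.cloud import storage

def rename_directory(bucket_name, src_prefix, dst_prefix):
    """'Rename' a directory on GCS: one server-side copy plus one delete per object.

    The copies do not move object data through the client, so the cost scales
    with the number of files rather than with the bytes they contain.
    """
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    for blob in client.list_blobs(bucket_name, prefix=src_prefix):
        new_name = dst_prefix + blob.name[len(src_prefix):]
        bucket.copy_blob(blob, bucket, new_name)  # server-side copy
        blob.delete()                             # remove the source object

# Hypothetical usage:
# rename_directory("my-bucket", "output/_temporary/0/task_001/", "output/")
```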
Of course, all this still doesn't solve the fact that committing lots of small files is inefficient. It does, however, make it likely that the "direct commit" aspect isn't as much of a factor as you might think; more often the bigger issue is something like Hive not parallelizing file commits at completion time, especially when committing to lots of partition directories. Spark is much better at this, and Hive should be improving over time.
There is also a recent performance improvement from a native SSL library in Dataproc 1.2, which you get out of the box without having to write fewer, bigger files.
Otherwise, real solutions really do involve writing fewer, bigger files, since even if you fix the write side, you'll suffer on the read side if you have too many small files. GCS is heavily optimized for throughput, so with anything less than around 64MB or 128MB you may spend more time on the overhead of spinning up a task and opening the stream than on actual computation (you should be able to read that much data in roughly 200ms-500ms).
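As a back-of-envelope illustration (the throughput and per-task overhead figures below are assumptions roughly consistent with the 200ms-500ms estimate above, not measured numbers):

```python
# Rough model: time actually spent reading a block vs fixed per-task overhead.
throughput_mb_per_s = 250   # assumed effective single-stream read throughput
task_overhead_s = 1.0       # assumed task startup + stream-open overhead

for size_mb in (8, 64, 128, 512):
    read_s = size_mb / throughput_mb_per_s
    useful = read_s / (read_s + task_overhead_s)
    print(f"{size_mb:4d} MB block: read {read_s:5.2f}s, "
          f"fraction of time spent actually reading: {useful:.0%}")
```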
In that vein, you'd want to make sure you set things like hive.merge.mapfiles, hive.merge.mapredfiles, or hive.merge.tezfiles if you're using those, or repartition your Spark dataframes before saving to GCS; merging into larger partitions is usually well worth it for keeping your files manageable and for faster reads later on.
Edit: One thing I forgot to mention is that I've been loosely using the term repartition; in this case, since we're strictly trying to bunch the output up into larger files, you may do better with coalesce instead; there's more discussion in another StackOverflow question about repartition vs coalesce.
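For example, in PySpark (the paths and the target partition count are placeholders to adapt to your data sizes):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-output").getOrCreate()

df = spark.read.parquet("gs://my-bucket/raw/")   # hypothetical input path

# coalesce() only merges existing partitions (no shuffle), which is usually
# enough when the goal is simply fewer, larger output files; repartition()
# performs a full shuffle but produces more evenly sized partitions.
(df.coalesce(32)                                 # pick a count that yields ~128MB+ files
   .write.mode("overwrite")
   .parquet("gs://my-bucket/compacted/"))
```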
S3Guard, HADOOP-13345, retrofits consistency onto S3 by having DynamoDB store the listings. This makes it possible, for the first time, to reliably use S3A as a direct destination of work. Without that, execution time may seem like the problem, but the real one is that the rename-based committer may get an inconsistent listing and not even see all the files it has to rename.
The S3Guard Committer work, HADOOP-13786, will, when finished (as of Aug 2017, still a work in progress), provide two committers.
Staging committer
workers write to local filesystem
Task committer uploads to S3 but does not complete the operation. Instead it saves commit metainfo to HDFS.
This commit metainfo is committed as normal task/job data in HDFS.
In Job commit, the committer reads the data of pending commits from HDFS and completes them, then does cleanup of any outstanding commits.
Task commit is an upload of all data, time is O(data/bandwidth).
This is based on Ryan's s3committer at Netflix and is the one which is going to be safest to play with at first.
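A rough sketch of the task-side idea, written against S3 with boto3 (the pending-file layout, paths and function names here are my own placeholders rather than the actual HADOOP-13786 code, which stores this metainfo in HDFS as normal task output):

```python
import json
import boto3

s3 = boto3.client("s3")

def task_commit(bucket, key, local_part_files, pending_dir):
    """Upload all the data for one output file but defer completion,
    recording the metainfo needed to finish the upload at job commit."""
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    for number, path in enumerate(local_part_files, start=1):
        # Note: every part except the last must be at least 5 MB.
        with open(path, "rb") as body:
            resp = s3.upload_part(Bucket=bucket, Key=key,
                                  UploadId=upload["UploadId"],
                                  PartNumber=number, Body=body)
        parts.append({"PartNumber": number, "ETag": resp["ETag"]})
    # Persist the pending-commit metainfo for the job committer to pick up.
    pending_path = f"{pending_dir}/{key.replace('/', '_')}.pending"
    with open(pending_path, "w") as out:
        json.dump({"bucket": bucket, "key": key,
                   "upload_id": upload["UploadId"], "parts": parts}, out)
```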
Magic committer
Called because it does "magic" inside the filesystem.
the Filesystem itself recognises paths like s3a://dest/__magic/job044/task001/__base/part-000.orc.snappy
redirects the write to the final destination, s3a://dest/part-000.orc.snappy, but doesn't complete the write in the stream close() call.
saves the commit metainfo to s3a, here s3a://dest/__magic/job044/task001/__base/part-000.orc.snappy.pending
Task commit: loads all .pending files from that dir, aggregates, saves elsewhere. Time is O(files); data size unimportant.
Task abort: load all .pending files, abort the commits
Job commit: load all pending files from committed tasks and complete them.
Because it is listing files in S3, it will need S3Guard to deliver consistency on AWS S3 (other S3 implementations are consistent out of the box, so don't need it).
Both committers share the same codebase, job commit for both will be O(files/threads), as they are all short POST requests which don't take up bandwidth or much time.
In tests, the staging committer is faster than the magic one for small test-scale files, because the magic committer talks more to S3, which is slow...though S3Guard speeds up listing/getFileStatus calls. The more data you write, the longer task commit on the staging committer takes, whereas task commit for the magic one is constant for the same number of files. Both are faster than using rename(), which is mimicked by list, copy and delete.
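To see why job commit scales as O(files/threads), here is a sketch of completing a batch of pending uploads in parallel, matching the pending-file layout from the task-side sketch above (again a placeholder illustration, not the committer's real code):

```python
import glob
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")

def complete_pending(pending_file):
    """Finish one deferred multipart upload -- a single short POST, no data copied."""
    with open(pending_file) as f:
        pending = json.load(f)
    s3.complete_multipart_upload(Bucket=pending["bucket"], Key=pending["key"],
                                 UploadId=pending["upload_id"],
                                 MultipartUpload={"Parts": pending["parts"]})

def job_commit(pending_dir, threads=16):
    pending_files = glob.glob(f"{pending_dir}/*.pending")
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Wall-clock time is roughly O(number of files / threads).
        list(pool.map(complete_pending, pending_files))
```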
GCS and Hadoop/Spark Commit algorithms
(I haven't looked at the GCS code here, so reserve the right to be wrong. Treat Dennis Huo's statements as authoritative.)
If GCS does rename() more efficiently than the S3A copy-then-delete, it should be faster, more O(files) than O(data), depending on parallelisation in the code.
I don't know if they can go for a 0-rename committer. The changes in the mapreduce code under FileOutputFormat are designed to support different/pluggable committers for different filesystems, so they have the opportunity to do something here.
For now, make sure you are using the v2 MR commit algorithm, which, while less resilient to failures, at least pushes the renames into task commit rather than job commit.
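For example, when submitting from Spark you can select the v2 algorithm through the standard Hadoop property (a minimal sketch; the app name and the rest of the job are placeholders):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("v2-commit-example")
         # v2 commit algorithm: renames happen at task commit, not job commit.
         .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
         .getOrCreate())
```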
See also Spark and Object Stores.
I was casually wondering if there was a difference in read/write performance for files that are copied to the same folder as opposed to those moved (via mv).
I imagine that performing some serial operation on several files located in a contiguous block of storage would be faster than on files scattered across a hard drive, which (I guess?) is what you get when you copy files rather than move them from disparate origins. So... is there a performance difference between files moved vs copied into the same directory, how significant is it, and does it depend on the storage technology (HDD, SSD)?
Note, I am not wondering whether mv vs cp is faster. Please don't respond with a description of the difference between the commands. Thanks!
The way that move and copy work will have some (limited) bearing on this if source and destination are located on the same physical volume.
However, assuming source and destination are not on the same volume, both behave the same in terms of writing the destination data. If the destination volume is completely empty and freshly formatted, then you 'probably' stand a good chance of the data being written to a similar location. If data is or has been written to the volume, there is no guarantee the file system won't simply scatter the data anyway.
The file system will ultimately decide where the data is to be stored on the actual storage medium, and it may decide that neighbouring blocks are not the best solution. Copy or Move is irrelevant, as both will require the file system to store the data.
Grouping those files by mount point is possibly the best way of ensuring they reside within a similar region of storage.
HTH
Question
Would Hadoop be a good candidate for the following use case:
Simple key-value store (primarily needs to GET and SET by key)
Very small "rows" (32-byte key-value pairs)
Heavy deletes
Heavy writes
On the order of a 100 million to 1 billion key-value pairs
Majority of data can be contained on SSDs (solid state drives) instead of in RAM.
More info
The reason I ask is that I keep seeing references to the Hadoop file system and to Hadoop being used as the foundation for a lot of other database implementations that aren't necessarily designed for MapReduce.
Currently, we are storing this data in Redis. Redis performs great, but since it keeps all of its data in RAM, we have to use expensive machines with upwards of 128GB of RAM. It would be nice to instead use a system that relies on SSDs. This way we would have the freedom to build much bigger hash tables.
We have also stored this data using Cassandra, but Cassandra tends to "break" if the deletes become too heavy.
Hadoop (contrary to popular media opinion) is not a database. What you describe is a database workload, so Hadoop is not a good candidate for you. Also, the answer below is opinionated, so feel free to prove me wrong with benchmarks.
If you care about "NoSql DB's" that are on top of Hadoop:
HBase would be suited for heavy writes, but sucks on huge deletes
Cassandra same story, but writes are not as fast as in HBase
Accumulo might be useful for very frequent updates, but will suck on deletes as well
None of them makes "real" use of SSDs; I don't think any of them gets a huge speedup from them.
All of them suffer from costly compactions once your tablets (in BigTable terminology) start to fragment, so heavy deletes are a fairly obvious limiting factor.
What you can do to mitigate the deletion issue is to just overwrite with a constant "deleted" value, which works around the compaction. However, this grows your table, which can be costly on SSDs as well, and you will need to filter the deleted values on read, which likely affects read latency.
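As a toy illustration of that overwrite-instead-of-delete pattern (an in-memory sketch to show the idea, not code for any of the databases above):

```python
DELETED = b"\x00" * 32   # constant sentinel standing in for "deleted"

class SoftDeleteStore:
    """Minimal key-value wrapper that never issues real deletes."""

    def __init__(self):
        self._kv = {}

    def put(self, key, value):
        self._kv[key] = value

    def delete(self, key):
        # Overwrite with the sentinel instead of deleting, so no tombstones
        # pile up for compaction to deal with; the table keeps growing instead.
        self._kv[key] = DELETED

    def get(self, key):
        value = self._kv.get(key)
        return None if value == DELETED else value   # filter deleted rows on read
```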
From what you describe, Amazon's DynamoDB architecture sounds like the best candidate here, although deletes are also costly there - maybe not as much as in the above alternatives.
BTW: the recommended way of deleting lots of rows from the tables in any of the above databases is to just completely delete the table. If you can fit your design into this paradigm, any of those will do.
Although this isn't an answer to your question, in the context of what you say about
It would be nice to instead use a system that relies on SSDs. This way
we would have the freedom to build much bigger hash tables.
you might consider taking a look at Project Voldemort.
Speaking as a Cassandra user, I know what you mean when you say it's the compaction and the tombstones that are a problem. I have myself run into TombstoneOverwhelmingException a couple of times and hit dead ends.
You might want to have a look at this article by LinkedIn.
It says:
Memcached is all in memory so you need to squeeze all your data into
memory to be able to serve it (which can be an expensive proposition
if the generated data set is large).
And finally
all we do is just mmap the entire data set into the process address
space and access it there. This provides the lowest overhead caching
possible, and makes use of the very efficient lookup structures in the
operating system.
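To make that concrete, here is a minimal mmap sketch in Python (the fixed 32-byte record layout and the file path are assumptions for illustration, not Voldemort's actual on-disk format):

```python
import mmap

RECORD_SIZE = 32   # matches the 32-byte key-value pairs mentioned in the question

def read_record(path, index):
    """Map the store file into the address space and read one fixed-size record;
    the operating system's page cache does the caching for us."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = index * RECORD_SIZE
            return bytes(mm[offset:offset + RECORD_SIZE])
```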
I don't know if this fits your case, but you could consider evaluating Voldemort. Best of luck.
I was going through Hadoop: The Definitive Guide and I came across these lines:
Normalization poses problems for MapReduce, since it makes reading a record a nonlocal operation, and one of the central assumptions that
MapReduce makes is that it is possible to perform (high-speed)
streaming reads and writes.
Can someone explain what do these lines actually mean in layman language?
I know what normalization is. How does it make reading a record a non-local operation? What is the meaning of a non-local operation in reference to Hadoop?
In Hadoop, a local operation refers to executing code in the same physical location where the data it needs to work with is stored.
When you normalize your data, you're essentially splitting it up. If this split-up data ends up distributed across two physically different locations, you suddenly have non-local operations.
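For example, in Spark terms (table names and paths are hypothetical), a denormalized table can be scanned block by block where it lives, whereas a normalized schema forces a join that shuffles data between nodes:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("normalization-example").getOrCreate()

# Denormalized: each record already carries everything it needs, so a task can
# stream its local block from start to end.
orders_flat = spark.read.parquet("hdfs:///data/orders_denormalized/")
totals = orders_flat.groupBy("customer_country").sum("amount")

# Normalized: the customer attributes live in a separate table, so every record
# requires a join -- a shuffle that pulls data across the network (non-local reads).
orders = spark.read.parquet("hdfs:///data/orders/")
customers = spark.read.parquet("hdfs:///data/customers/")
totals_norm = (orders.join(customers, "customer_id")
                     .groupBy("customer_country").sum("amount"))
```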
The following paragraphs of text (from http://developer.yahoo.com/hadoop/tutorial/module2.html) mention that large, sequentially readable files are not suitable for local caching, but I don't understand what "local" means here...
I can see two interpretations: either the client caches data from HDFS, or the datanode caches HDFS data in its local filesystem or memory for clients to access quickly. Can anyone explain more? Thanks a lot.
But while HDFS is very scalable, its high performance design also restricts it to a
particular class of applications; it is not as general-purpose as NFS. There are a large
number of additional decisions and trade-offs that were made with HDFS. In particular:
Applications that use HDFS are assumed to perform long sequential streaming reads from
files. HDFS is optimized to provide streaming read performance; this comes at the expense of
random seek times to arbitrary positions in files.
Data will be written to the HDFS once and then read several times; updates to files
after they have already been closed are not supported. (An extension to Hadoop will provide
support for appending new data to the ends of files; it is scheduled to be included in
Hadoop 0.19 but is not available yet.)
Due to the large size of files, and the sequential nature of reads, the system does
not provide a mechanism for local caching of data. The overhead of caching is great enough
that data should simply be re-read from HDFS source.
Individual machines are assumed to fail on a frequent basis, both permanently and
intermittently. The cluster must be able to withstand the complete failure of several
machines, possibly many happening at the same time (e.g., if a rack fails all together).
While performance may degrade proportional to the number of machines lost, the system as a
whole should not become overly slow, nor should information be lost. Data replication
strategies combat this problem.
Any real MapReduce job is probably going to process GBs (tens, hundreds, or thousands) of data from HDFS.
Therefore any one mapper instance is most probably going to be processing a fair amount of data (the typical block size is 64/128/256 MB depending on your configuration) in a sequential manner (it will read the file/block in its entirety from start to end).
It is also unlikely that another mapper instance running on the same machine will want to process that data block again in the immediate future, especially since multiple mapper instances will also be processing data alongside this mapper in any one TaskTracker (hopefully with a fair few being 'local' to the actual physical location of the data, i.e. a replica of the data block also existing on the same machine the mapper instance is running on).
With all this in mind, caching the data read from HDFS is probably not going to gain you much - you'll most probably not get a cache hit on that data before another block is queried and will ultimately replace it in the cache.