How does MinIO replicate compressed objects - minio

I am trying to better understand how server-side bucket replication works w.r.t to compressed objects. Seemingly, MinIO does not (yet) support compressed transfer between client and server (https://github.com/minio/minio/issues/6880).
I'm storing large text data (~1 GB per object) in MinIO. The server is configured to store these objects compressed on disk. The compressed objects are just about 1/10 in size.
Now, I'm wondering how these objects are replicated. Does MinIO use a lower level protocol to replicate nodes that transfer the already compressed objects as they are?
Or are they uncompressed, transferred, and then compressed again as if replicated by using the regular client API?

You have the option of uploading compressed object to MinIO server or utilizing the server side compression feature.
W.r.t replication, if the object was compressed before upload to the primary server, this would be replicated as compressed to the replication target.
However, if you are using server side compression, the data will be uncompressed and transferred to the target - there is no assumption made that the replication target has compression enabled. You will have to independently enable compression on primary and target
Please know you can find us at https://slack.min.io/ 24/7/365. If you have commercial questions, please reach out to us on hello#min.io or on our Ask an Expert Chat functionality at https://min.io/pricing?action=talk-to-us.

Related

NiFi | FlowFile Memory dependency

I am trying to understand NiFi data flow mechanism . I read that Nifi has flow file which holds content and metadata (flow file attribute).
So I wanted to understand if I have 1 TB of data placed on edge node and would like to pass it to Nifi processors , is it going to load everything into memory to be used by processor?
FlowFiles (herein referred to as FF) are analogous to HTTP data in that they are comprised of content and attributes (metadata) as you highlight. However, the way these are handled within the NiFi framework is that the metadata resides in memory (up to a configured limit per connection) and the content portion of the FF is actually a pointer to the content on disk. That is once the content is received into NiFi, it is not longer held in memory at any point in time, utilizing a pass by reference approach allowing NiFi to handle arbitrarily large files. The only thing stored in memory is the metadata of FFs which is configurable to limit the number on a per connection basis.
When a processor needs to make a change, this exercises the copy on write approach for modifications.
In general, processors use a streaming approach for reading/writing data to/from the content repository. To that end, the included processors avoid storing the totality of a FF's content in memory as it could prove prohibitive. Simple routing and movement of data for an arbitrarily large file should be no issue; avoiding excess pressure on the heap memory. When looking at doing transformations/modifications on such files, the answer is that it is typically okay, but it depends on the specifics of the data type.

doubts regarding migration to big data

I have a few doubts regarding hadoop
In one of the videos published by cloudera an instructer told that in hadoop there is HDFS. Every file will be stored as a set of chucks or blocks. Each block will be replicated three times in different machines to minimize the point of failure. Each mapper will process a single hdfs block.
From these logics i perceived that if i have a server having some 100 peta bytes of logs which are not stored in traditional file system unlike hdfs.
Main doubt 1. Now if i want to analyse this huge data efficiently using the mapreduce technique then do i have to transfer the data in a new server running hdfs and having three times the storage of the old server.
In one more video which was also published by cloudera..the instructer mentioned clearly that we dont need to migrate the traditional system to a new system, we can use hadoop and map reduce on top of that. This is little contradictry to the statement mentioned in first point.
Main doubt 2: Lets assume that point 2 statement is true. Now how can this be possible. I mean how can we apply hadoop and map reduce on a traditional file system where there is no replication of blocks or name node ..deamon on each machine.
My main task is to Facilitate fast analysis of a huge amount of logs which are currently not stored in hdfs. For doing this will i need a new server or not.
P.S: I need some good tutorial or Books or some articles which could give me in depth knowledge of big data so that i can start working on it.
So recomendations are most welcome.
Hadoop is just an infrastructure for running a MapReduce style workload (for "big data" or "analytics" atop a cluster of servers.
You can use HDFS for data sharing across the nodes, then use Hadoop's built in workload management to distribute work to nodes where the data is stored. This is sometimes called "function shipping."
But it's also possible to not use HDFS. You can use another network file sharing / distribution mechanism. FTP (file copies), S3 (access from the Amazon Web Services cloud), and a variety of other clustered/distributed file systems are supported by various vendors/platforms. Some of these move the data to the system on which workload is being done ("data shipping").
Which storage strategy is appropriate, efficient, and performant is a big question, and depends greatly on your infrastructure and your MapReduce app's data access patterns. In general, however, analytics jobs are resource hungry, so only small analytics apps tend to run on servers doing other work (the "original systems"). So processing "big data" does tend to suggest new servers--if not ones you buy, ones you rent temporarily from a cloud service like AWS, RackSpace, etc.--and data streaming from replicas/clones of data captured in production ("secondary storage") rather than data still resident on "primary storage."
If you're just starting out with small or modest apps, you might be able to access data in-place, directly from existing systems. But if you've got 100 PB of logs, you're going to want that processed on systems devoted to the task.

Serve static files from Hadoop

My job is to design a distributed system for static image/video files. The size of the data is about tens of Terabytes. It's mostly for HTTP access (thus no processing on data; or only simple processing such as resizing- however it's not important because it can be done directly in the application).
To be a little more clear, it's a system that:
Must be distributed (horizontal scale), because the total size of data is very big.
Primarily serves small static files (such as images, thumbnails, short videos) via HTTP.
Generally, no requirement on processing the data (thus MapReduce is not needed)
Setting HTTP access on the data could be done easily.
(Should have) good throughput.
I am considering:
Native network file system: But it seems not feasible because the data can not fit into one machine.
Hadoop filesystem. I worked with Hadoop mapreduce before, but I have no experience using Hadoop as a static file repository for HTTP requests. So I don't know if it's possible or if it's a recommended way.
MogileFS. It seems promising, but I feel that using MySQL to manage local files (on a single machine) will create too much overhead.
Any suggestion please?
I am the author of Weed-FS. For your requirement, WeedFS is ideal. Hadoop can not handle many small files, in addition to your reasons, each file needs to have an entry in the master. If the number of files are big, the hdfs master node can not scale.
Weed-FS is getting faster when compiled with latest Golang releases.
Many new improvements have been done on Weed-FS recently. Now you can test and compare very easily with the built-in upload tool. This one upload all files recursively under a directory.
weed upload -dir=/some/directory
Now you can compare by "du -k /some/directory" to see the disk usage, and "ls -l /your/weed/volume/directory" to see the Weed-FS disk usage.
And I suppose you would need replication with data center, rack aware, etc. They are in now!
Hadoop is optimized for large files e.g. It's default block size is 64M. A lot of small files are both wasteful and hard to manage on Hadoop.
You can take a look at other distributed file systems e.g. GlusterFS
Hadoop has a rest API for acessing files. See this entry in the documentation. I feel that Hadoop is not meant for storing large number of small files.
HDFS is not geared up to efficiently accessing small files: it is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.
Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies 150 bytes. The block size is 64 mb. So even if the file is of 10kb, it would be allocated an entire block of 64 mb. Thats a waste disk space.
If the file is very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1GB file broken into 16 files of 64MB blocks, and 10,000 or so 100KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file.
In "Hadoop Summit 2011", there was this talk by Karthik Ranganathan about Facebook Messaging in which he gave away this bit: Facebook stores data (profiles, messages etc) over HDFS but they dont use the same infra for images and videos. They have their own system named Haystack for images. Its not open source but they shared the abstract design level details about it.
This brings me to weed-fs: an open source project for inspired by Haystacks' design. Its tailor made for storing files. I have not used it till now but seems worth a shot.
If you are able to batch the files and have no requirement to update a batch after adding to HDFS, then you could compile multiple small files into a single larger binary sequence file. This is a more efficient way to store small files in HDFS (as Arnon points out above, HDFS is designed for large files and becomes very inefficient when working with small files).
This is the approach I took when using Hadoop to process CT images (details at Image Processing in Hadoop). Here the 225 slices of the CT scan (each an individual image) were compiled into a single, much larger, binary sequence file for long streaming reads into Hadoop for processing.
Hope this helps!
G

Distributed key-value storage for total data size of 80TB

TL;DR:
I'd like to have recommendations for a distributed key-value storage, for avg. entry size of up to 50KB, to be installed on a Linux environment (dedicated servers).
A file-system solution would do.
I found a few solutions: Ceph, Cassandra, Riak, and a few more.
Details
I'm looking for a storage solution for one of our components, it should be a key-value storage, flat namespace.
Scenario
The read/write patterns are very simple:
Once a key-value is written, there are a few reads within the next hours.
After that, nothing touches the given key-value. We'd like to keep the data for future purposes, "Storage mode".
Other usage aspects
OS: Linux
Python client/connector
Total size: up to 80TB (this value also represents future needs).
Avg Entry Size (for a single value in a k-v pair): 10 to 50 KB, uncompressed, mostly textual data
Compression: either built-in or external.
Encryption: not needed
Network bandwidth: 1Gb, single LAN
Servers: dedicated (not in the cloud)
Most important requirements
The "base" requirements are:
OS: Linux
Python client/connector OR RESTful API via HTTP
Can easily store up to 80TB (this value also represents future needs).
Max read latency: a few seconds for first reads, 30 seconds for "storage mode" (see above for explanation)
Built in replication (so that data is stored on more than a single node)
Nice to have
RESTful gateway
Background data backup to another store (for data recovery in case of a disaster).
Easy to configure
What I've found so far
Ceph
HDFS
HBase on top of HDFS
Lustre
GlusterFS
Mongo's GridFS - but can I trust Mongo's infrastructure?
Cassandra - not an option, since the merge process consumes double disk size
Riak - looks like it has the same issue as Cassandra, needs more research
Swift + OpenStack (actual storage can be on Amazon S3)
Voldemort
There are dozens of additional tools, but I won't write them here since some of them have proprietary license, and others seem to be immature.
I'd appreciate any recommendation on any of the tools I mentioned above (with total capacity of more than 50TB), or on a tool you think is sufficient.
Just use Ceph (I mean direct librados usage). Don't use GlusterFS -- it's hangy.

Storage format in HDFS

How Does HDFS store data?
I want to store huge files in a compressed fashion.
E.g : I have a 1.5 GB of file, with default replication factor of 3.
It requires (1.5)*3 = 4.5 GB of space.
I believe currently no implicit compression of data takes place.
Is there a technique to compress the file and store it in HDFS to save disk space ?
HDFS stores any file in a number of 'blocks'. The block size is configurable on a per file basis, but has a default value (like 64/128/256 MB)
So given a file of 1.5 GB, and block size of 128 MB, hadoop would break up the file into ~12 blocks (12 x 128 MB ~= 1.5GB). Each block is also replicated a configurable number of times.
If your data compresses well (like text files) then you can compress the files and store the compressed files in HDFS - the same applies as above, so if the 1.5GB file compresses to 500MB, then this would be stored as 4 blocks.
However, one thing to consider when using compression is whether the compression method supports splitting the file - that is can you randomly seek to a position in the file and recover the compressed stream (GZIp for example does not support splitting, BZip2 does).
Even if the method doesn't support splitting, hadoop will still store the file in a number of blocks, but you'll lose some benefit of 'data locality' as the blocks will most probably be spread around your cluster.
In your map reduce code, Hadoop has a number of compression codecs installed by default, and will automatically recognize certain file extensions (.gz for GZip files for example), abstracting you away from worrying about whether the input / output needs to be compressed.
Hope this makes sense
EDIT Some additional info in response to comments:
When writing to HDFS as output from a Map Reduce job, see the API for FileOutputFormat, in particular the following methods:
setCompressOutput(Job, boolean)
setOutputCompressorClass(Job, Class)
When uploading files to HDFS, yes they should be pre-compressed, and with the associated file extension for that compression type (out of the box, hadoop supports gzip with the .gz extension, so file.txt.gz would denote a gzipped file)
Some time ago I tried to summarize that in a blog post here.
Essentially that is a question of data splittability, as a file is devided into blocks which are elementary blocks for replication. Name node is responsible for keeping track of all those blocks belonging to one file. It is essential that block is autonomous when choosing compression - not all codecs are splittable. If the format + codec is not splittable that means that in order to decompress it it needs to be in one place which has big impact on parallelism in mapreduce. Essentially running in single slot.
Hope that helps.
Have a look at presentation # Hadoop_Summit, especially Slide 6 and Slide 7.
If DFS block size is 128 MB, for 4.5 GB storage (including replication factor of 3), you need 35.15 ( ~36 blocks)
Only bzip2 file format is splittable. In other formats, all blocks of entire files are stored in same Datanode
Have a look at algorithm types and class names and codecs
#Chris White answer provides information on how to enable zipping while writing Map output
The answer to this question is to first understand the file format available in Hadoop today. There is now choice available within HDFS that can manage file format and compression techniques. Alternative to explicit encoding and splitting using LZO or BZIP. There is many format that today support block compression and columnar row compression with features.
A storage format is a way you define how information is to be stored. This is sometimes usually indicated by the extension of the file. For example we know images can be several storage formats, PNG, JPG, and GIF etc. All these formats can store the same image, but each has specific storage characteristics.
In Hadoop filesystem you have all of traditional storage formats available to you (like you can store PNG and JPG images on HDFS if you like), but you also have some Hadoop-focused file formats to use for structured and unstructured data.
Why is it important to know these formats
In any performance tradeoffs, a huge bottleneck for HDFS-enabled applications like MapReduce, Hive, HBase, and Spark is the time it takes to find relevant data in a particular location and the time it takes to write the data back to another location. These issues are accentuated when you manage large datasets. The Hadoop file formats have evolved to ease these issues across a number of use cases.
Choosing an appropriate file format can have some significant benefits:
Optimum read time
Optimum write time
Spliting or partitioning of files (so you don’t need to read the whole file, just a part of it)
Schema adaption (allowing a field changes to a dataset) Compression support (without sacrificing these features)
Some file formats are designed for general use, others are designed for more specific use cases (like powering a database), and some are designed with specific data characteristics in mind. So there really is quite a lot of choice when storing data in Hadoop and one should know to optimally store data in HDFS. Currently my go to storage is ORC format.
Check if your Big data components (Spark, Hive, HBase etc) support these format and make the decision accordingly. For example, I am currently injecting data into Hive and converting it into ORC format which works for me in terms of compression and performance.
Some common storage formats for Hadoop include:
Plain text storage (eg, CSV, TSV files, Delimited file etc)
Data is laid out in lines, with each line being a record. Lines are terminated by a newline character \n in the typical UNIX world. Text-files are inherently splittable. but if you want to compress them you’ll have to use a file-level compression codec that support splitting, such as BZIP2. This is not efficient and will require a bit of work when performing MapReduce tasks.
Sequence Files
Originally designed for MapReduce therefore very easy to integrate with Hadoop MapReduce processes. They encode a key and a value for each record and nothing more. Stored in a binary format that is smaller than a text-based format. Even here it doesn't encode the key and value in anyway. One benefit of sequence files is that they support block-level compression, so you can compress the contents of the file while also maintaining the ability to split the file into segments for multiple map tasks. Though still not efficient as per statistics like Parquet and ORC.
Avro
The format encodes the schema of its contents directly in the file which allows you to store complex objects natively. Its file format with additional framework for, serialization and deserialization framework. With regular old sequence files you can store complex objects but you have to manage the process. It also supports block-level compression.
Parquet
My favorite and hot format these days. Its a columnar file storage structure while it encodes and writes to the disk. So datasets are partitioned both horizontally and vertically. One huge benefit of columnar oriented file formats is that data in the same column tends to be compressed together which can yield some massive storage optimizations (as data in the same column tends to be similar). Try using this if your processing can optimally use column storage. You can refer to advantages of columnar storages.
If you’re chopping and cutting up datasets regularly then these formats can be very beneficial to the speed of your application, but frankly if you have an application that usually needs entire rows of data then the columnar formats may actually be a detriment to performance due to the increased network activity required.
ORC
ORC stands for Optimized Row Columnar which means it can store data in an optimized way than the other file formats. ORC reduces the size of the original data up to 75%(eg: 100GB file will become 25GB). As a result the speed of data processing also increases. ORC shows better performance than Text, Sequence and RC file formats.
An ORC file contains rows data in groups called as Stripes along with a file footer. ORC format improves the performance when Hive is processing the data.
It is similar to the Parquet but with different encoding technique. Its not for this thread but you can lookup on Google for differences.

Resources