Understanding snapshots in Ceph

Our team is currently deciding whether or not to implement snapshotting on CephFS directories, and we are therefore trying to understand the effects and performance issues that snapshots cause on the cluster.
Our main concern is: "How will the cluster be affected when data is written to a file under a snapshot?" We found out that Ceph uses a copy-on-write mechanism for snapshot clones, so my question is: if, for example, I have a 1 GB file under a snapshot and I append another 10 MB of data to it, how much data will be copied because of the new write?
My understanding is that, since Ceph stripes a file across multiple objects, only the object containing the last stripe_unit (assuming it is not completely filled) will be copied and the new data appended to it; Ceph then somehow serves the new object when I request the current version of the file and the old object when I read the file through the snapshot. Data copied = O(10 MB), i.e. on the order of the data written, plus a few metadata changes.
Or, since Ceph now uses BlueStore as its storage layer, does it have even better optimisations than the above case? For example, when editing the object corresponding to the last stripe_unit, will Ceph just write the new data to a location on disk, update the object's metadata to include the location of the new data, and also maintain snapshot-based versions of that metadata to provide the file contents at previous points in time? Data copied/written = 10 MB plus some more metadata changes (compared to the above case).
Or is it the case that Ceph copies the whole file and edits the new copy, i.e. the data copied is 1 GB + 10 MB? I assume this is not the case because it is clearly suboptimal for large files.
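To make the first case concrete, this is the back-of-envelope arithmetic I have in mind (a minimal sketch assuming the CephFS default layout of 4 MiB objects with stripe_count = 1; layouts are configurable, so the numbers are only illustrative):

```python
# Back-of-envelope sketch: which RADOS objects would a 10 MB append touch?
# Assumes the CephFS default layout (object_size = stripe_unit = 4 MiB,
# stripe_count = 1); real layouts are configurable, so treat this as illustrative.

OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB

def objects_touched_by_append(file_size: int, append_size: int) -> range:
    """Object indices that receive new bytes when appending to a file."""
    first = file_size // OBJECT_SIZE                     # object holding the current tail
    last = (file_size + append_size - 1) // OBJECT_SIZE  # object holding the new tail
    return range(first, last + 1)

touched = objects_touched_by_append(file_size=10**9, append_size=10 * 10**6)
print(f"objects touched: {len(touched)} (indices {touched.start}..{touched.stop - 1})")
# -> 3 objects: the partially filled tail object plus two new ones, so if the
#    copy-on-write happens at object granularity the copied data stays on the
#    order of the appended 10 MB, not the 1 GB file size.
```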
PS: Any resources on measuring the effect of snapshots on the cluster, or on the internals of Ceph snapshots, would be very much appreciated. I've searched the internet extensively but couldn't find any relevant data. I tried reading the code, but you can probably guess how that went.

Some resources for understanding the fundamentals of Ceph snapshots are the following:
Chapter 9, "Storage Provisioning with Ceph", of the book Learning Ceph
The chapter "Planning for Ceph" in the book Mastering Ceph
Furthermore, if you want BlueStore-specific information about snapshots, you may need to read the following two resources, as they explicitly explain BlueStore-based snapshots:
File systems unfit as distributed storage backends: lessons from 10 years of Ceph evolution
The Case for Custom Storage Backends in Distributed Storage Systems

Related

Archiving log files from Elasticsearch and bringing them back to minimize storage cost

Please, I need some answers from experienced people, since it's my first time using the Elastic Stack (internship). Assume that I inject logs (coming from multiple servers: Apache, nginx, ...) into Elasticsearch. After a month, or maybe less, Elasticsearch will be filled up with logs, which will be very expensive in terms of storage and performance, so I need to set a limit (let's assume 100 GB of logs): when that limit is reached, I need to remove the logs from Elasticsearch to free space for new incoming logs. However, I should preserve the old logs rather than just delete them (I googled some solutions, but they were all about deleting old logs to free space, which is not helpful in my case), and I should be able to bring those old logs back into Elasticsearch if needed. My question is: is there a way (optimal in terms of cost and performance, such as compressing old logs or something similar) to achieve this with minimal cost?
You can use the snapshot and restore feature with a custom repository to offload old data and retrieve it when needed. Try the following guide:
https://www.elastic.co/guide/en/kibana/7.5/snapshot-restore-tutorial.html
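As a rough sketch of what that flow looks like from code (this assumes the 7.x elasticsearch-py client to match the 7.5 guide above, a path.repo-backed filesystem repository, and placeholder repository/index names):

```python
# Minimal sketch of snapshot-then-delete for old log indices, assuming the
# 7.x elasticsearch-py client and a shared-filesystem ("fs") repository.
# Names (old_logs_repo, logs-2023.01.*, /mnt/es_backups) are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Register a repository (the location must be listed in path.repo).
es.snapshot.create_repository(
    repository="old_logs_repo",
    body={"type": "fs", "settings": {"location": "/mnt/es_backups", "compress": True}},
)

# 2. Snapshot the old indices, then delete them to free disk and heap.
es.snapshot.create(
    repository="old_logs_repo",
    snapshot="logs-2023-01",
    body={"indices": "logs-2023.01.*", "include_global_state": False},
    wait_for_completion=True,
)
es.indices.delete(index="logs-2023.01.*")

# 3. Later, bring the data back only when it is actually needed.
es.snapshot.restore(
    repository="old_logs_repo",
    snapshot="logs-2023-01",
    body={"indices": "logs-2023.01.*"},
    wait_for_completion=True,
)
```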

Uploading data to an HDFS cluster from a custom format

I have several machines with TBs of log data in a custom format that can be read with a C++ library. I want to upload all the data to a Hadoop cluster (HDFS) while converting it to Parquet files.
This is an ongoing process (meaning every day I will get more data), not a one-time effort.
What is the best way to do this efficiently (performance-wise)?
Is the Parquet C++ library as good as the Java one (updates, bugs, etc.)?
The solution should handle tens of TBs per day or even more in the future.
Log data arrives continuously and should be available on the HDFS cluster immediately.
Performance-wise, your best approach will be to gather the data in batches and then write out a new Parquet file per batch. If your data is received as single lines and you want to persist them immediately on HDFS, you could also write them out to a row-based format that supports single-line appends, e.g. Avro, and regularly run a job that compacts them into a single Parquet file.
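As a rough sketch of the batch approach with pyarrow (the read_records() generator is a hypothetical stand-in for a Python binding of your C++ reader, and the namenode address, target paths, and batch size are placeholders):

```python
# Sketch of batch ingestion into Parquet on HDFS with pyarrow. The
# read_records() generator is a hypothetical stand-in for your C++ reader;
# the namenode address, target paths, and batch size are placeholders.
import datetime
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

def read_records():
    """Stand-in for the custom C++ library binding; yields one dict per log line."""
    yield {"ts": "2023-01-01T00:00:00", "host": "web-1", "msg": "example"}

hdfs = fs.HadoopFileSystem(host="namenode", port=8020)  # needs libhdfs on the client
BATCH_SIZE = 500_000  # tune so each Parquet file ends up roughly 128 MB - 1 GB

def flush(rows, batch_id):
    """Write one batch of dict records as a single Parquet file on HDFS."""
    table = pa.Table.from_pylist(rows)
    day = datetime.date.today().isoformat()
    hdfs.create_dir(f"/data/logs/dt={day}", recursive=True)
    pq.write_table(table, f"/data/logs/dt={day}/part-{batch_id:05d}.parquet",
                   filesystem=hdfs, compression="snappy")

batch, batch_id = [], 0
for record in read_records():
    batch.append(record)
    if len(batch) >= BATCH_SIZE:
        flush(batch, batch_id)
        batch, batch_id = [], batch_id + 1
if batch:  # write the final, partially filled batch
    flush(batch, batch_id)
```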
Library-wise, parquet-cpp is in much more active development at the moment than parquet-mr (the Java library). This is mainly because active parquet-cpp development (re-)started about 1.5 years ago (winter/spring 2016). So updates to the C++ library are happening very quickly at the moment, while the Java library is very mature, as it has had a huge user base for quite some years. There are some features, like predicate pushdown, that are not yet implemented in parquet-cpp, but these are all on the read path, so they don't matter for writing.
We are now at a point with parquet-cpp where it already runs very stably in different production environments, so in the end your choice between the C++ and Java libraries should mainly depend on your system environment. If all your code currently runs in the JVM, use parquet-mr; otherwise, if you're a C++/Python/Ruby user, use parquet-cpp.
Disclaimer: I'm one of the parquet-cpp developers.

Redis memory management - clear based on key, database or instance

I am very new to Redis. I've implemented caching in our application and it works nicely. I want to store two main data types: a directory listing and file content. It's not really relevant, but this will cache files served up via WebDAV.
I want the file structure to remain almost forever. The file content needs to be cached for a short time only. I have set up my expiry/TTL to reflect this.
When the server reaches memory capacity, is it possible to prioritise certain cached items over others, i.e. flush a key, flush a whole database, or flush a whole instance of Redis?
I want to keep my directory listing and flush the file content when memory begins to be an issue.
EDIT: This article seems to be what I need. I think I will need to use volatile-ttl. My file content will have a much shorter TTL set, so in theory this should be cleared first. If anyone has any other helpful advice I would love to hear it, but for now I am going to implement this.
The article describes what I needed. I have implemented volatile-ttl as my memory-management policy.
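A small sketch of that setup with redis-py (key names and the memory limit are illustrative; in production, maxmemory and maxmemory-policy would normally be set in redis.conf rather than via CONFIG SET):

```python
# Sketch of the volatile-ttl setup with redis-py; key names are illustrative.
# In production you would normally set maxmemory/maxmemory-policy in redis.conf
# rather than via CONFIG SET at runtime.
import redis

r = redis.Redis(host="localhost", port=6379)

r.config_set("maxmemory", "256mb")
r.config_set("maxmemory-policy", "volatile-ttl")  # evict keys with the nearest TTL first

# Directory listings: no TTL, so volatile-ttl never evicts them.
r.set("dir:/shared/docs", '["report.pdf", "notes.txt"]')

# File content: short TTL, so it both expires quickly and is the first
# candidate for eviction under memory pressure.
r.set("file:/shared/docs/report.pdf", b"...file bytes...", ex=300)
```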

Store Images to display in SOLR search results

I have built a SOLR index that contains image thumbnail URLs, and I want to render an image along with each search result. The problem is that those images can run into the millions, and I think storing the images in the index as binary data would make the index humongous.
I am seeking guidance on how to store those images efficiently after fetching them from the URLs: should I use the plain file system and have them served by Tomcat, or should I use a JCR repository like Apache Jackrabbit?
Any guidance would be greatly appreciated.
Thank You.
I would evaluate the actual requirements before finally deciding how to persist the images.
Do you require versioning?
Are you planning to store only the images, or additional metadata as well?
Do you have any horizontal-scaling requirements?
Do you require any image processing or scaling?
Do you need access to the image metadata?
Do you require additional tooling for managing the images?
Are you willing to invest time in learning an additional technology?
Storing the images on the file system and making them available through an image spooler implementation is the simplest way to persist them.
But if you identify some of the above-mentioned requirements (which are typical for a content repository or a DAM system), then you would end up reinventing the wheel with the file-system approach.
The other option is to use some kind of content repository. A JCR repository such as Jackrabbit, or its commercial implementation CRX, is one option; Alfresco (which supports CMIS) would be another valid one.
Features like versioning, post-processing (scaling, ...), and metadata extraction and management are supported by both of the mentioned repository solutions. But this requires you to learn a new technology, which can be time-consuming, and both repository technologies can get complex.
If horizontal scaling is a requirement, I would consider a commercially supported repository implementation (CRX or Alfresco Enterprise), because the community releases are lacking this functionality.
Personally, I would base any decision on the above-mentioned requirements.
I have worked extensively with Jackrabbit, CRX, and Alfresco CE and EE, and personally I would go for Alfresco, as in my experience it scales better with larger amounts of data.
I'm not aware of an image spooling solution that fits your needs exactly, but it shouldn't be too difficult to implement one, apart from the fact that recurring scaling operations may be very resource-intensive.
I would go for the following approach if FS is enough for you:
Separate images and thumbnails into two locations. The images root folder will remain; the thumbnails folder is temporary.
Create a temporary thumbnail folder for each indexing run. All thumbnails for that run are stored under that location; scaling can be achieved with e.g. ImageMagick.
The temporary thumbnail folder can then easily be dropped as soon as the next run has been completed.
If you are planning to store millions of images then avoid putting all the files in the same directory. Browsing flat hierarchies with too many entries will be a nightmare.
Better to create a tree structure, e.g. derived from the current datetime (year/month/day/hour/minute: 2013/06/01/08/45).
This makes sure that the number of files inside the last folder does not get too big (Alfresco uses the same pattern for storing binary objects on the file system, and it has proven to work nicely).
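A minimal sketch of that layout in Python (using Pillow for the scaling instead of ImageMagick; the root paths and the 200x200 thumbnail size are placeholders):

```python
# Sketch of the per-run, datetime-bucketed thumbnail layout described above.
# Uses Pillow for scaling; IMAGES_ROOT, THUMBS_ROOT and the 200x200 size are
# placeholders for your actual setup.
import datetime
from pathlib import Path
from PIL import Image

IMAGES_ROOT = Path("/data/images")      # permanent originals
THUMBS_ROOT = Path("/data/thumbnails")  # per-run, disposable

def run_folder(now=None):
    """Create the year/month/day/hour/minute bucket for this indexing run."""
    now = now or datetime.datetime.now()
    folder = THUMBS_ROOT / now.strftime("%Y/%m/%d/%H/%M")
    folder.mkdir(parents=True, exist_ok=True)
    return folder

def make_thumbnails(target, size=(200, 200)):
    for src in IMAGES_ROOT.rglob("*.jpg"):
        thumb = target / src.relative_to(IMAGES_ROOT)
        thumb.parent.mkdir(parents=True, exist_ok=True)
        with Image.open(src) as img:
            img.thumbnail(size)  # in-place scaling, keeps aspect ratio
            img.save(thumb)

make_thumbnails(run_folder())
# Once the next run completes, the previous run's folder under THUMBS_ROOT can
# simply be removed (e.g. with shutil.rmtree).
```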

What are the file update requirements of HDFS?

Under the Simple Coherency Model section of the HDFS Architecture guide, it states (emphasis mine):
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A Map/Reduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.
I am confused by the use of "need not" here. Do they really mean "must not" or "should not"? If so, how can programs like HBase provide update support? If they really do mean "need not" (i.e. "doesn't have to"), what is being conveyed? What file system requires you to change a file once it is written?
As far as I know, the "need not" is part of the assumption that "simplifies data coherency issues and enables high throughput...". It actually means "can't". But you can delete the whole file and create it again.
Since hadoop 0.20.2-append (as shown here) you can append data.
From what I have read, I understand that HBase mainly works in memory (the WAL? section 11.8.3) and that modifications get appended as markers. For example, to delete a column it writes a tombstone (see section 5.8.1.5) that just marks the delete, with periodic compaction.
Maybe I am wrong. Good moment for me to learn the exact explanation :)
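To make that distinction concrete, here is a small sketch using the standard hdfs dfs shell from Python (paths are placeholders; -appendToFile is available in Hadoop 2.x and later, with append enabled on the cluster):

```python
# Small illustration of the write-once-plus-append model using the standard
# `hdfs dfs` shell from Python. Paths are placeholders; -appendToFile is
# available in Hadoop 2.x and later (with append enabled on the cluster).
import subprocess

def hdfs(*args):
    subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-put", "day1.log", "/logs/app.log")           # create the file once
hdfs("-appendToFile", "day2.log", "/logs/app.log")  # appending is supported...

# ...but random in-place edits are not: to "change" earlier records you either
# rewrite the whole file, or, like HBase, append markers/tombstones and rely on
# periodic compaction to produce a cleaned-up file.
hdfs("-rm", "/logs/app.log")
hdfs("-put", "rebuilt.log", "/logs/app.log")
```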
