Can I create a snapshot view with an old baseline without creating a new stream from the source stream? I have been creating a child stream and pointing it to the old baseline; then making a snapshot of the newly created stream with the old baseline.
Not really a big deal; I'm just curious whether there is another way.
If the baseline is a full baseline, then yes, you can create a base snapshot view with a config spec like:
element * CHECKEDOUT
element /Vob/Component/... YourBaselineId
element * /main/0
See "What is the difference between Full baseline and Incremental baseline in Clearcase UCM?" to know more about the importance of using a full baseline for base ClearCase views.
Otherwise, your method (sub-stream, rebase, mkview) is the standard one: it creates a UCM view based on the configuration of the sub-stream.
Our team is currently deciding whether to implement snapshotting on cephfs directories or not, and thus trying to understand the effects and performance issues caused by snapshots on the cluster.
Our main concern is: "How will the cluster be affected when data is written to a file under a snapshot?" We were able to find out that Ceph uses a copy-on-write mechanism for snapshots. So my question is: if, for example, I have a 1GB file under a snapshot and I append another 10MB of data to the file, how much data will be copied because of the new write?
My understanding is that since Ceph stripes a file into multiple objects, only the object containing the last stripe_unit (assuming it's not completely filled) will be copied, the new data will be appended to that copy, and Ceph then somehow serves the new object when I request the current version of the file and the old object when I request the file from the snapshot. Data copied = O(10MB), i.e. on the order of the data written, plus a few metadata changes.
Or, since Ceph now uses BlueStore as the storage layer, does it have even better optimisations than the above case: when editing the object corresponding to the last stripe_unit, will Ceph just write the new data to a location on disk and update the object's metadata to include the location of the new data, while also maintaining snapshot-based versions of the metadata to provide the file contents at previous points in time? Data copied/written = 10MB plus some more metadata changes (compared to the above case).
Or is it the case that Ceph will copy the whole file and edit the new copy, i.e. data copied is 1GB + 10MB? I am assuming this is not the case, because it's clearly suboptimal for large files.
PS: Any resources on measuring the effect of snapshots on the cluster and any resources that explain the internals of Ceph snapshots will be very much appreciated. I've done extensive searching on the internet but couldn't find any relevant data. Tried reading the code but you guys can probably guess how it went.
Some resources for understanding the fundamentals of Ceph snapshots are as follows:
Chapter 9, "Storage Provisioning with Ceph", of the book Learning Ceph
The chapter "Planning for Ceph" in the book Mastering Ceph
Furthermore, if you want BlueStore-specific information about snapshots, you may want to read the following two resources, as they explicitly explain BlueStore-based snapshots:
File systems unfit as distributed storage backends: lessons from 10 years of Ceph evolution
The Case for Custom Storage Backends in Distributed Storage Systems
A redirect-on-write snapshot causes any changes/updates to be redirected to new blocks. It's easy to see how this can work if data is to be appended, but what if data in a block is modified or deleted? Since the snapshotted block can't be modified, how is the information about what is modified or deleted applied? It can't be just metadata from here on out, right? That would really slow things down if the data is to be used for analysis.
Usually, you use a layered filesystem. Each snapshot creates a new layer, and when you ask for a file's metadata or data, you query the top layer, which delegates to the lower layers if the current layer has no data about the query.
When you delete a file, you simply record in the top layer that file xxx is deleted.
When you modify a block, you create the new block in the top layer, with metadata referencing that one block in the new layer and delegating the others to the lower layers.
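A minimal toy sketch of that lookup logic (the layer structure and the whiteout marker here are invented for illustration):

# Toy model of layered snapshot lookup: each layer maps a block key to
# data, or to WHITEOUT if it was deleted in that layer.
WHITEOUT = object()

def read_block(layers, key):
    # layers[0] is the top (newest) layer; older snapshots follow.
    for layer in layers:
        if key in layer:
            value = layer[key]
            return None if value is WHITEOUT else value
    return None  # never existed in any layer

# Base snapshot holds two blocks; the top layer modifies one and
# deletes the other.
base = {"file1:block0": b"old data", "file1:block1": b"more data"}
top = {"file1:block0": b"new data", "file1:block1": WHITEOUT}

assert read_block([top, base], "file1:block0") == b"new data"
assert read_block([top, base], "file1:block1") is None     # deleted on top
assert read_block([base], "file1:block1") == b"more data"  # snapshot view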
This is how Docker works, for instance; to go a little bit deeper you can check these links:
https://docs.docker.com/engine/userguide/storagedriver/overlayfs-driver/#how-container-reads-and-writes-work-with-overlay-or-overlay2
https://docs.docker.com/engine/userguide/storagedriver/btrfs-driver/#how-the-btrfs-storage-driver-works
I work regularly on an OBIEE repository with 100+ facts and 200+ logical table sources.
Every time I add a new dimension, I have to go one-by-one to set the logical level for that dimension in each and every table source.
Is there any way, when adding a dimension, to default all LTS to a specific logical level?
Same answer as for the cross-post here:
Is there any way, when adding a dimension, to default all LTS to a specific logical level?
No.
A model this size is probably not the best approach in an ideal world.
The only option you have is to look at a script-based approach. There is a supported API for modifying the RPD metadata (and here too), but it'd be pretty easy to screw things up, doubly so given the size and complexity of your existing RPD.
You can see an example of it in action in a blog I wrote here. Note that all I change in that blog post is the value of an existing repository variable. To add in additional content, with the issue of GUIDs etc, gets very hairy indeed.
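For a flavour of that route, here is a rough sketch driving the Oracle BI Server XML utilities from a script. The file names and password are placeholders, and the exact flags should be treated as assumptions to verify against your OBIEE version's documentation:

import subprocess

# Dump the existing RPD to XUDML so it can be inspected or patched...
subprocess.run(["biserverxmlgen", "-R", "current.rpd", "-P", "Password01",
                "-O", "current.xml"], check=True)

# ...then apply a hand- or script-edited patch file against the base
# RPD to produce a new repository.
subprocess.run(["biserverxmlexec", "-I", "patch.xml", "-B", "current.rpd",
                "-P", "Password01", "-O", "new.rpd"], check=True)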
tl;dr : sit tight, and keep clicking, unless you're feeling brave
I have built a SOLR index which holds the image thumbnail URLs, so that I can render an image along with the search results. The problem is that those images can run into millions, and I think storing the images in the index as binary data would make the index humongous.
I am seeking guidance on how to efficiently store those images after fetching them from the URLs. Should I use the plain file system and have them served by Tomcat, or should I use a JCR repository like Apache Jackrabbit?
Any guidance would be greatly appreciated.
Thank You.
I would evaluate the actual requirements before finally deciding how to persist the images.
Do you require versioning?
Are you planning to store only the images, or additional metadata too?
Do you have any requirements for horizontal scaling?
Do you require any image processing or scaling?
Do you need access to the image metadata?
Do you require additional tooling for managing the images?
Are you willing to invest time in learning an additional technology?
Storing them on the file system and making them available through an image spooler implementation is the simplest way to persist your images.
But if you identify some of the above-mentioned requirements (which are typical for a content repository or a DAM system), then you would end up reinventing the wheel with the filesystem approach.
The other option is using a kind of content repository. A JCR repository like, for example, Jackrabbit, or its commercial implementation CRX, is one option. Alfresco (which supports CMIS) would be another valid one.
Features like versioning, post-processing (scaling, ...), and metadata extraction and management are supported by both of the mentioned repository solutions. But this requires you to learn a new technology, which can be time-consuming, and both repository technologies can get complex.
If horizontal scaling is a requirement, I would consider the commercially supported repository implementations (CRX or Alfresco Enterprise), because the community releases are lacking this functionality.
Personally, I would base any decision on the above-mentioned requirements.
I have worked extensively with Jackrabbit, CRX, and Alfresco CE and EE, and personally I would go for Alfresco, as I experienced it to scale better with larger amounts of data.
I'm not aware of an image spooler solution that fits your needs exactly, but it shouldn't be too difficult to implement one, apart from the fact that recurring scaling operations may be very resource-intensive.
I would go for the following approach if the FS is enough for you:
Separate images and thumbnails into two locations: the images root folder will remain, while the thumbnails folder is temporary.
Create a temporary thumbnail folder for each indexing run. All thumbnails for that run are stored under that location; scaling can be achieved with e.g. ImageMagick (see the sketch below).
The temporary thumbnail folder can then easily be dropped as soon as the next run has been completed.
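A minimal sketch of such a run, assuming ImageMagick's convert command is available on the PATH (the folder layout and thumbnail size are placeholders):

import datetime
import pathlib
import subprocess

IMAGES_ROOT = pathlib.Path("/data/images")             # permanent originals
run_id = datetime.datetime.now().strftime("%Y%m%d%H%M")
thumb_dir = pathlib.Path("/data/thumbnails") / run_id  # per-run, disposable
thumb_dir.mkdir(parents=True, exist_ok=True)

for src in IMAGES_ROOT.rglob("*.jpg"):
    dst = thumb_dir / src.name
    # ImageMagick's -thumbnail keeps the aspect ratio within 200x200.
    subprocess.run(["convert", str(src), "-thumbnail", "200x200", str(dst)],
                   check=True)

# Once the next run has completed, the previous run's folder can
# simply be deleted.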
If you are planning to store millions of images, then avoid putting all files in the same directory; browsing flat hierarchies with too many entries will be a nightmare.
Better to create a tree structure, e.g. derived from the current datetime (year/month/day/hour/minute: 2013/06/01/08/45).
This makes sure that the number of files inside the last folder doesn't get too big (Alfresco uses the same pattern for storing binary objects on the FS, and it has proven to work nicely).
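As an illustration of that layout (the root path and helper name are invented):

import datetime
import pathlib

def storage_path(root, now=None):
    # Build a year/month/day/hour/minute folder for a newly stored file.
    now = now or datetime.datetime.now()
    path = pathlib.Path(root) / now.strftime("%Y/%m/%d/%H/%M")
    path.mkdir(parents=True, exist_ok=True)
    return path

# e.g. /data/images/2013/06/01/08/45
print(storage_path("/data/images", datetime.datetime(2013, 6, 1, 8, 45)))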
I'm trying to optimize my workflow as I still spend quite some time waiting for the computer when it should be the other way 'round IMO.
I'm supposed to hand in topical branches implementing a single feature or fixing a single bug, along with a full build log and regression test report. The project is huge, it takes about 30 minutes to compile on a fairly modern machine when compiling in a snapshot view.
My current workflow thus is to do all development work in a single snapshot view, and when a feature is ready for submission, I create a new dynamic view, merge the relevant changes from the snapshot and start the build/testing procedure overnight.
In a dynamic view, a full build takes about six hours, which is a major PITA, so I'm looking for a way to improve these figures. I've toyed with the cache settings, but that doesn't seem to make much difference. I'm currently pondering writing a script that will create a snapshot view with the same spec as the dynamic view, fetch the files into it and build there, but before I do that I wonder if there is a better way of improving my build times.
Can I somehow make MVFS cache all retrieved objects locally (I have lots of both hard disk space and RAM), ideally sharing the cache between multiple dynamic views (as I build feature branches, most files are bound to be identical between two different branches)?
Is there any other setting I could tune to speed up local builds?
Am I doing it wrong (i.e. is there a better workflow for me, considering that snapshot views take about one hour to create)?
Considering that you can have a dynamic view and a snapshot view with the same config spec, I would really recommend:
having a dynamic view ready for merge operations
then, once the merge is done, updating your snapshot view (no need to recreate it from scratch, which takes too much time. Just launch an update)
That way, you get the best of both worlds:
easy and quick merges within the dynamic view
"fast"(er) compilation within the snapshot view dedicated for that step.
Even if the config spec might have to change in your case (if you really have to use one view per branch), you can still change the config spec of an existing snapshot view (and still benefit from an incremental update), rather than recreating a snapshot view for each branch you need to compile on.
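A rough sketch of that update step from a script (the view path and config-spec file are placeholders; cleartool must be on the PATH):

import subprocess

SNAPSHOT_VIEW = "/views/build_snapshot"   # existing snapshot view root

# Point the existing snapshot view at the branch's config spec; on a
# snapshot view, setcs itself triggers an incremental update of the
# loaded files.
subprocess.run(["cleartool", "setcs", "/specs/feature_branch.cs"],
               cwd=SNAPSHOT_VIEW, check=True)

# If the config spec is unchanged, a plain update is enough; -force
# skips the confirmation prompt.
subprocess.run(["cleartool", "update", "-force"],
               cwd=SNAPSHOT_VIEW, check=True)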