Nexus OSS - Clean up the proxy/attributes/ directory

I have a Nexus OSS instance with the following settings:
it is a proxy of http://repo1.maven.org/maven2/
I have overridden the "local storage location" with a path to a network device
Everything is OK and my Nexus instance works fine... but I noticed that the number of inodes grows a lot.
After a quick check, I can tell that almost all of these inodes come from the proxy/attributes/ directory.
According to the documentation:
Stores data about the files contained in a remote repository. Each
proxy repository has a subdirectory in the proxy/attributes/ directory
and every file that Nexus has interacted with in the remote repository
has an XML file which captures such data as the: last requested
timestamp, the remote URL for a particular file, the length of the
file, and the digests for a particular file among other things. If you
need to backup the local cached contents of a proxy repository, you
should also back up the contents of the proxy repository's directory
under proxy/attributes/.
OK, I understand why there are a lot of little files in this location, but I have a simple question: to avoid reaching my inode limit, could I periodically clean up the contents of proxy/attributes/ without breaking anything, and will these files be recreated on demand if needed?
I found nothing about this in the documentation...
Any clue would be greatly appreciated!

You can find details on the contents of the working folder here: https://docs.sonatype.com/display/SPRTNXOSS/Nexus+Workspace+Directories+Analysis
The part you're specifically interested in is this:
/Proxy: This folder contains an "attributes" subfolder, which holds as
many subfolders as you have repos (the repoId is the name of each).
This is the place where "item attributes" are persisted as lots
of very small files. These files contain information about expiration
status and are consulted during proxying. Therefore, they have an impact
on proxy and group lookup speed if stored on a slow disk. These files
are recreated on demand if they are missing or corrupted and thus
don't need to be backed up.
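Since these attribute files are recreated on demand, a periodic cleanup of proxy/attributes/ should be safe. A minimal sketch in Go of such a cleanup (the path and the 30-day age threshold are placeholders, not Nexus defaults):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func main() {
	// Adjust to your overridden "local storage location"; this path is an example only.
	root := "/data/nexus/proxy/attributes"
	// Remove attribute files that have not been touched for 30 days.
	cutoff := time.Now().AddDate(0, 0, -30)

	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() && info.ModTime().Before(cutoff) {
			return os.Remove(path)
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "cleanup failed:", err)
	}
}

Run it from cron (or any scheduler) at whatever interval keeps your inode count in check; Nexus will rebuild any attribute file it still needs the next time the corresponding artifact is requested.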
Hope that helps; if you need more realtime assistance, feel free to hop onto the user list or IRC: http://nexus.sonatype.org/project-information.html

Related

Windows Projected File System read only?

I tried to play around with Projected File System to implement a user mode ram drive (previously I had used Dokan). I have two questions:
Is this a read-only projection? I could not find any notification sent to me when opening the file from, say, Notepad and writing to it.
Is the file actually created on the disk once I use PrjWriteFileData()? From what I have understood, yes.
In that case, what useful thing could one do with this library if there is no writing to the projected files? It seems to me that the only useful thing is to initially create a directory tree from somewhere else (say, a remote repo), but nothing beyond that. Dokan still seems the way to go.
The short answer:
It's not read-only but you can't write your files directly to a "source" filesystem via a projected one.
The PrjWriteFileData method is used for populating placeholder files on the "scratch" (projected) file system, so it doesn't affect the "source" file system.
The long answer:
As stated in the comment by @zett42, ProjFS was mainly designed as a remote git file system. The main goal of any file versioning system is to handle multiple versions of files. From this a question arises: should a write through ProjFS overwrite the file inside the remote repository? That would be disastrous. When working with git you always write files locally, and they are not synced until you push the changes to a remote repository.
When you enumerate files, nothing is written to the local file system. From the ProjFS documentation:
When a provider first creates a virtualization root it is empty on the
local system. That is, none of the items in the backing data store
have yet been cached to disk.
Only after a file is opened does ProjFS create a "placeholder" for it in the local file system - I assume it's a file with a special structure (not a real one).
As files and directories under the virtualization root are opened, the
provider creates placeholders on disk, and as files are read the
placeholders are hydrated with contents.
What "hydrated" is mean? Most likely, it represents a special data structure partially filled with real data. I would imaginge a placeholder as a sponge partially filled with data.
As items are opened, ProjFS requests information from the provider to allow placeholders for those items to be created in the local file system. As item contents are accessed, ProjFS requests those contents from the provider. The result is that from the user's perspective, virtualized files and directories appear similar to normal files and directories that already reside on the local file system.
Only after a file is updated (modified) does it stop being a placeholder - it becomes a "full file/directory":
For files: The file's content (primary data stream) has been modified.
The file is no longer a cache of its state in the provider's store.
Files that have been created on the local file system (i.e. that do
not exist in the provider's store at all) are also considered to be
full files.
For directories: Directories that have been created on the local file
system (i.e. that do not exist in the provider's store at all) are
considered to be full directories. A directory that was created on
disk as a placeholder never becomes a full directory.
This means that on the first write the placeholder is replaced by a real file in the local FS. But how do we keep the "remote" file in sync with the modified one? (1)
When the provider calls PrjWritePlaceholderInfo to write the
placeholder information, it supplies the ContentID in the VersionInfo
member of the placeholderInfo argument. The provider should then
record that a placeholder for that file or directory was created in
this view.
Notice "The provider should then record that a placeholder for that file". It means that in order to sync the file later with a correct view representation we have to remember with which version a modified file is associated. Imagine we are in a git repository and we change the branch. In this case, we may update one file multiple times in different branches. Now, why and when the provider calls PrjWritePlaceholderInfo?
... These placeholders represent the state of the backing store at the
time they were created. These cached items, combined with the items
projected by the provider in enumerations, constitute the client's
"view" of the backing store. From time to time the provider may wish
to update the client's view, whether because of changes in the backing
store, or because of explicit action taken by the user to change their
view.
Once again, imagine switching branches in a git repository; you have to update a file if it's different in the other branch. Continuing with question (1): imagine you want to "push" from a particular branch. First of all, you have to know which files are modified. If you did not record the placeholder info while modifying your files, you won't be able to do this correctly (at least for the git repository example).
Remember that a placeholder is replaced by a real file on modification? ProjFS has an OnNotifyFileHandleClosedFileModifiedOrDeleted event. Here is the signature of the callback:
public void NotifyFileHandleClosedFileModifiedOrDeletedCallback(
    string relativePath,                   // path of the item inside the virtualization (scratch) root
    bool isDirectory,
    bool isFileModified,                   // the handle modified the file before it was closed
    bool isFileDeleted,                    // the file was deleted
    uint triggeringProcessId,
    string triggeringProcessImageFileName) // the process that made the change
For our purposes, the most important parameter here is relativePath. It contains the name of the modified file inside the "scratch" (projected) file system. At this point you also know that the file is a real file (not a placeholder) and that it has already been written to disk (that is, you cannot intercept the call before the file is written). Now you may copy it to the desired location (or do it later) - it depends on your goals.
Answering question #2: it seems that PrjWriteFileData is used only for populating the "scratch" file system, and you cannot use it to update the "source" file system.
Applications:
As for applications, you can still implement a remote file system (instead of using Dokan), but all writes will be cached locally instead of being written directly to the remote location. A few use case ideas:
Distributed File Systems
Online Drive Client
A File System "Dispatcher" (for example, you may write your files in different folders depending on particular conditions)
A File Versioning System (for example, you may preserve different versions of the same file after a modification)
Mirroring data from your app to a file system (for example, you can "project" a text file with indentations to folders, sub-folders and files)
P.S.: I'm not aware of any undocumented APIs, but from my point of view (according to the documentation) we cannot use ProjFS for purposes like a RAM disk, or to write files directly to the "source" file system without writing them to the "local" file system first.

Golang file and folder replication / mirroring across multiple servers

Consider this scenario. In a load-balanced environment, I have 3 separate instances of a CMS running on 3 different physical servers. These 3 separate running instances of the application share the same database.
On each server, the CMS has a /media folder where all media subfolders and files reside. My question is how I'd implement/code a file replication service in Golang, so that when a subfolder or file is added/changed/deleted on one of the servers, it gets copied/replicated/deleted on all the other servers.
What packages would I need to look into, or perhaps you have a small code snippet to help me get started? That would be awesome.
Edit:
This question has been marked as "duplicate", but it is not. It is however an alternative to setting up a shared network file system. I'm thinking that keeping a copy of the same file on all servers, synchronizing and keeping them updated might be better than sharing them.
You probably shouldn't do this. Use a distributed file system, object storage (ala S3 or GCS) or a syncing program like btsync or syncthing.
If you still want to do this yourself, it will be challenging. You are basically building a distributed database and they are difficult to get right.
At first blush you could check out something like etcd or raft, but unfortunately etcd doesn't work well with large files.
You could, on upload, also copy the file to every other server using ssh. But then what happens when a server goes down? Or what happens when two people update the same file at the same time?
Maybe you could design it such that every file gets a unique id (perhaps based on the hash of its contents so you can safely dedupe) and those files can never be updated or deleted, only added. That would solve the simultaneous update problem, but you'd still have the downtime problem.
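To make the content-derived id concrete, here is a minimal sketch in Go (SHA-256 is an arbitrary choice, and the path in main is hypothetical): identical contents always map to the same id, so re-uploads can be deduplicated and an id never changes.

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// fileID returns a content-derived identifier for a file.
func fileID(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	id, err := fileID("media/example.jpg") // hypothetical path
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println(id)
}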
One approach would be for each server to maintain an append-only version log when a file is added:
VERSION | FILE HASH
      1 | abcd123
      2 | efgh456
      3 | ijkl789
With that, you can pull every file from a server, and a single number is sufficient to know which files have been added. (For example, if you think Server A is on version 5 and you are informed it is now on version 7, you know you need to sync 2 files.)
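A minimal sketch in Go of that catch-up logic (the VersionEntry type and the in-memory log are illustrative only; a real implementation would read the log from the database table or peer described next):

package main

import "fmt"

// VersionEntry mirrors one row of the append-only log: a monotonically
// increasing version number plus the content hash of the file that was added.
type VersionEntry struct {
	Version  int
	FileHash string
}

// missingEntries returns the entries a peer has that we have not yet synced,
// given the last version we know we have locally.
func missingEntries(peerLog []VersionEntry, localVersion int) []VersionEntry {
	var missing []VersionEntry
	for _, e := range peerLog {
		if e.Version > localVersion {
			missing = append(missing, e)
		}
	}
	return missing
}

func main() {
	peerLog := []VersionEntry{
		{1, "abcd123"}, {2, "efgh456"}, {3, "ijkl789"},
	}
	// We believe we are at version 1, so versions 2 and 3 still need syncing.
	for _, e := range missingEntries(peerLog, 1) {
		fmt.Printf("need file %s (version %d)\n", e.FileHash, e.Version)
	}
}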
You could do this with a database table:
ID | LOCAL_SERVER_ID | REMOTE_SERVER_ID | VERSION | FILE HASH
Which you could periodically poll and do your syncing via ssh or http between machines. If a server was down you could just retry until it works.
Or, if you didn't want a centralized database for this, you could use a library like memberlist. The local metadata for each node could be its version.
Either way, there will be some delay between when a file is uploaded to a single server and when it's available on all of them. Handling that well is hard, which is why you probably shouldn't do this.

What is the algorithm that dropbox uses to identify list of files/folders changed locally when the app was not running?

I understand that we can identify changes in the file system while the app is running, using OS events. I am just wondering: when the app is not running, if I make lots of changes to the file system (add / modify / delete / rename a few files and folders), what algorithm does Dropbox use to identify these changes? One thing I could think of is comparing the last modified time (LMT) of each file on the file system against the LMT value stored while the app was last running. In that case we'd have to loop through all the files anyway, and the LMT doesn't change on a rename. I just wanted to see whether there is a better approach, as relying on LMT has its own problems.
Any comments?
I don't know if it's how Dropbox handles it but here is a strategy that may be useful:
You have a root directory handled by Dropbox. If I were Dropbox, I'd keep hashes for each file I have on the server. Starting from the root, the app would scan the file tree (directories + files) and compute the hashes for each file.
The scan would lead to a double-index hashtable. Each file and directory would be indexed using its relative path (from the root Dropbox directory). A second index would be made using the hash(es) of each file.
Now, the app has scanned and established the double-indexed hashtable. The server would then send the tuples (relative path, hashes of the file). Let (f, h) be such a file tuple:
The app would try to get the file through the path index using f:
If there is a result, compare the hashes. If they don't match, update the file on the remote server
If there is no result, the file may have been deleted OR moved/renamed. The app then tries to get the file through the hash index using h: if there is a match, that means the file is still there, only under a different path (hence moved or renamed). The app sends the info and the file is appropriately moved/renamed on the server.
If the file has been found neither by hash nor by path, it has been deleted from the Dropbox file tree: we delete it accordingly on the server.
Note that this strategy needs a synchronization mechanism to know, when meeting a match, if the file has to be updated on the client or on the server. This could be achieved by storing the time of the last update run by Dropbox (on the client and the server) and who performed this last update (on the server).
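A minimal sketch in Go of the double-indexed scan and the three reconciliation cases described above (SHA-256, in-memory maps, and the example server tuple in main are all assumptions; a real client would also need the synchronization metadata mentioned in the note):

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// scan walks root and builds the two indexes: relative path -> hash and hash -> relative path.
func scan(root string) (byPath, byHash map[string]string, err error) {
	byPath = make(map[string]string)
	byHash = make(map[string]string)
	err = filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		h := sha256.New()
		if _, err := io.Copy(h, f); err != nil {
			return err
		}
		rel, _ := filepath.Rel(root, path)
		sum := hex.EncodeToString(h.Sum(nil))
		byPath[rel] = sum
		byHash[sum] = rel
		return nil
	})
	return byPath, byHash, err
}

// reconcile applies the three cases above to one server-side tuple (f, h).
func reconcile(byPath, byHash map[string]string, f, h string) {
	switch {
	case byPath[f] == h:
		// same path, same hash: nothing to do
	case byPath[f] != "" && byPath[f] != h:
		fmt.Printf("update %s on the server\n", f)
	case byHash[h] != "":
		fmt.Printf("move/rename %s -> %s on the server\n", f, byHash[h])
	default:
		fmt.Printf("delete %s on the server\n", f)
	}
}

func main() {
	byPath, byHash, err := scan("dropbox-root") // hypothetical local root
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	reconcile(byPath, byHash, "docs/report.txt", "deadbeef") // hypothetical server tuple
}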

Magento 1.7 Local to Remote server transfer

After transferring my local Magento site to the remote server, I get errors such as:
"a:5:{i:0;s:45:"Unable to read response, or response is empty";i:1;s:1151:"#0 C:\xampp\htdocs\magento\magento-1.7.0.2\magento\lib\Varien\Http\Client.php(61): Zend_Http_Client->request('GET')"
The remote site is still holding the path from my local server. How do I change it? Are there any default config settings?
To a large degree this depends on exactly which files you uploaded to the remote site; the most likely suspects are var/resource_config.json and anything in var/cache/mage--*. I believe you should be able to delete both, but if you're the more cautious type you can always rename them.

What's the best way to (programmatically) determine a file's network origin?

For an application I'm writing, I want to programmatically find out which computer on the network a file came from. How can I best accomplish this?
Do I need to monitor network transactions or is this data stored somewhere in Windows?
When a file is copied to the local system, Windows does not keep any record of where it was copied from. So unless the application that created it saved such information in the file, it will be lost.
With file auditing, file and directory operations can be tracked, but I don't think that will include the source path for file copies (just who created the file and when).
Yes, it seems like you would either need to detect the file transfer based on interception of network traffic, or if you have the ability to alter the file in some way, use public key cryptography to sign files using a machine-specific key before they are transferred.
Create a service on either the destination computer or the file-hosting computers that adds records to an Alternate Data Stream attached to each file, much the way Windows handles ZoneInfo for files downloaded from the internet.
You can have a background process on machine A which "tags" each file as having been tagged by machine A on such-and-such a date and time. Then when machine B downloads the file, assuming we are using NTFS filesystems, it can see the tag from A. Or, if you can't have a process at the server, you can use NTFS streams on the "client" side via packet sniffing methods as others have described. The bonus here is that future file-copies will retain the data as long as it is between NTFS systems.
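A minimal sketch in Go of writing and reading such a tag via an NTFS Alternate Data Stream (Windows/NTFS only; the file path and the stream name "origin" are placeholders):

package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	const file = `C:\share\example.docx` // hypothetical file being tagged
	host, _ := os.Hostname()
	tag := fmt.Sprintf("tagged by %s at %s\n", host, time.Now().Format(time.RFC3339))

	// On NTFS, "name:stream" addresses an alternate data stream of the file,
	// similar to the zone information Windows records for downloaded files.
	if err := os.WriteFile(file+":origin", []byte(tag), 0644); err != nil {
		fmt.Fprintln(os.Stderr, "could not write origin stream:", err)
		return
	}

	// Reading the tag back; streams survive copies between NTFS volumes.
	data, err := os.ReadFile(file + ":origin")
	if err == nil {
		fmt.Print(string(data))
	}
}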
Alternative: require that all file transfers be done through a web portal with built-in logging (as opposed to network drag-and-drop), or through some other kind of file-retrieval proxy. Do you have control over procedures like this?
