Although the general question of Hadoop/HDFS on Windows has been posed before, I haven't seen anyone present the use case I think is most important for Windows support: how can Windows end stations participate in an HDFS environment and consume files stored in HDFS?
In particular, let's say we have a nice Linux-based HDFS environment with lots of nodes, analysis jobs being run, etc., and all is happy. How can Windows desktops also consume the files? Suppose our analytics find a few interesting files among the millions of mostly uninteresting ones. Now we want to bring them into a desktop application to visualize them. The most natural way for the desktop to consume these is via a Windows share, ideally served by a Windows server.
Windows' implementation of CIFS is orders of magnitude better than Samba -- I'm stating that as a fact, not a point of debate. That isn't to say that Samba cannot be made to work, only that there are good reasons to have a very strong preference for essentially exporting this HDFS file system as CIFS.
It's possible to do this via some workflow where a back-end process takes the interesting files and copies them out. But this is cumbersome in many cases and does not give the Windows-shackled analyst the freedom to explore the files on his own as easily.
Hence, what I'm really looking for is:
Windows server
HDFS as a "mounted" file system; Windows is thought of as an HDFS "client"
Export this file system from Windows as a CIFS server
Consume files on Windows desktop
Have all the usual Windows group permissions work correctly (e.g. by mapping through to NFSv4 ACLs).
Btw, if we replace "HDFS" with "GPFS" in this question, it all does work. At the moment, this is a key differentiator between HDFS and GPFS in my environment. Yes, there are many more points of comparison, but I'd rather not focus on GPFS vs. HDFS in general at the moment.
Could someone please add the #GPFS tag?
HDFS provides a REST API through WebHDFS and HttpFS for various operations. The REST API can be accessed programmatically from many languages, and most of those languages also have libraries that make it easy to work against a REST API.
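For instance, here is a minimal Python sketch against the WebHDFS REST API. The NameNode host, port (9870 is the Hadoop 3 default), user name, and paths are all assumptions to adapt to your cluster:

```python
import requests

# Assumed values for illustration only: adjust the NameNode host, port,
# user, and HDFS paths to match your cluster.
NAMENODE = "http://namenode.example.com:9870"
USER = "analyst"

# OPEN reads a file. WebHDFS answers with a redirect to the DataNode that
# actually serves the bytes; requests follows the redirect automatically.
resp = requests.get(
    f"{NAMENODE}/webhdfs/v1/data/results/interesting.csv",
    params={"op": "OPEN", "user.name": USER},
    stream=True,
)
resp.raise_for_status()
with open("interesting.csv", "wb") as out:
    for chunk in resp.iter_content(chunk_size=65536):
        out.write(chunk)

# LISTSTATUS enumerates a directory, which is enough to let a client browse.
listing = requests.get(
    f"{NAMENODE}/webhdfs/v1/data/results",
    params={"op": "LISTSTATUS", "user.name": USER},
).json()
for status in listing["FileStatuses"]["FileStatus"]:
    print(status["pathSuffix"], status["length"])
```

A Windows desktop (or a Windows server re-exporting files as CIFS) could pull files this way without any Hadoop client install, at the cost of doing it per file rather than as a mounted file system.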
I haven't tried it out, but according to the Hadoop documentation it should also be possible to mount HDFS on a Windows machine.
I'm building a (for now pretty minimal) network sync system for some of our users, involving a Samba server on one end and an rsync cron job which is "installed" for OS X or Linux clients by running a simple bash script linked from our intranet.
I need to do the same thing for Windows clients. I know there are several rsync implementations on Windows (I used cwRsync ages ago), but are there any (off the top of your head) that I can silently pass a config to during install? As it is, I guess I'm going to have to write a crappy old batch file to interface with Windows Task Scheduler, but I'd at least like clients installing this to not have to enter any more than their username and password.
Thanks!
I've had success with
RichCopy
RoboCopy
Cygwin rsync.exe
All using scheduled tasks.
RichCopy (and maybe RoboCopy) has options to save config files from the GUI. All worked well for me from a batch file; a minimal sketch of the scheduled-task wiring follows the feature list below.
All three have restartable/incremental modes. Most are highly aware of NTFS-specific features; think:
NTFS encryption
NTFS compression
permissions (ACLs)
alternate NTFS streams
junctions/reparse points
hardlinks/symlinks
etc.
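To make the scheduled-task approach concrete, here is a minimal sketch that wraps RoboCopy so the same command can be registered with Task Scheduler. The paths are placeholders, and the flag set is one reasonable combination, not the only one:

```python
import subprocess

# Placeholder paths -- adjust for your environment.
SRC = r"C:\Users\alice\Documents"
DST = r"\\server\backup\alice"

result = subprocess.run([
    "robocopy", SRC, DST,
    "/MIR",      # mirror the source tree; only changed files are copied
    "/Z",        # restartable mode, survives dropped connections
    "/COPYALL",  # data, attributes, timestamps, NTFS ACLs, owner, auditing
    "/R:3",      # retry a failing file 3 times
    "/W:5",      # wait 5 seconds between retries
])

# RoboCopy exit codes 0-7 all mean some flavor of success; 8+ means errors.
if result.returncode >= 8:
    raise SystemExit(f"robocopy failed with code {result.returncode}")
```

You would then register the script with Task Scheduler, e.g. `schtasks /Create /SC DAILY /TN MirrorDocs /TR "python C:\scripts\mirror.py"`, and it runs unattended.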
I recently converted my laptop to an Ubuntu/Win7 dual-boot system, each OS with its own partition, plus a third shared partition. I'd like to use Eclipse and access my SVN repository regardless of which system I'm booted into at the time.
If I have my local SVN repository on the shared partition, how can I point the workspaces on both Ubuntu and Windows at the same files?
The only other alternative I've come up with is for each OS to have its own working copy, and to apply commits and updates as necessary.
Edit for clarification
I'm not asking if it's possible to have a single workspace for both Linux and Windows. I had in mind a single source folder on the shared partition that is linked into each workspace. Therefore, the file paths would be OS-specific, and only the source code would be shared.
I don't think this is really possible.
There are a number of files in the workspace .metadata folder (for instance the definition of the JREs/JDKs, or the Eclipse path) that depend on the underlying file system (e.g. c:\eclipse on Windows on one side and /home/me/eclipse on the other).
What you might be able to do, at best, is keep two different workspaces, one for Windows and one for Linux.
These two distinct workspaces would, in turn, share a number of projects. These projects would not be in the so-called default location (that is, under the workspace folder) of either workspace, but under a separate hierarchy on your shared partition. Yet because of this decoupling you would end up doing several things twice (such as defining launch configurations, etc.). Which is fair enough, I guess.
Finally, since Linux can read NTFS file systems pretty well using ntfs-3g (with the exception of ACLs, which is actually a plus here), you can keep your shared partition in NTFS. Windows is far less capable at reading/writing ext3 (let alone ext4). Just make sure you mount your NTFS partition with a common character set.
In addition, instead of dual booting, you could run Windows inside a VM (e.g. VirtualBox) and share the common data as a Linux shared folder, either through standard Samba or through VirtualBox's built-in shared-folder mechanism. The difference from a dual boot is that you could, in theory, cheat and have Eclipse access your two different workspaces simultaneously while they share the same projects. Of course, this would require some tweaking on the Samba side for lock management.
I would not recommend sharing the same Eclipse workspace between two operating systems, as they use different path syntaxes. You can also run into character-encoding and/or line-separator issues in files placed under source control. Typically, a source control system adjusts these during checkout and reverses the adjustment on check-in. You can hit weird problems if you retrieve code on Windows and check it back in on Linux (or the other way around).
I would recommend keeping your Windows and Linux workspaces separate.
I am developing a webapp that will mostly be used on a LAN. I have deployed this app at several locations; some of them run Windows and some run Linux (with no X Window System). I need to know if there is software out there that can easily synchronize my files, stored somewhere in the cloud, to both Linux and Windows machines (the cloud service can be one provided by the app developers, or a different cloud). My English is a bit rusty, so I'm going to explain this in simple words.
I will work on my local machine. I want to upload the files somewhere in the cloud, and the clients installed on the LAN servers should synchronize the files. The client must be available for Linux on the console (as a daemon if possible), while on Windows it can be something like Dropbox or Ubuntu One.
Does somebody know of such an app?
Dropbox is available for Linux.
You could also investigate unison.
I think "Git" is the best solution to develop your project in different machine.
You can sync your code with easy command through this app, and it will record all the version of your code.
Just google "Git tutorial", and you will find many useful introductions.
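If it helps, the whole day-to-day loop is only a handful of commands. This sketch drives them from Python purely for illustration; the repository URL and directory name are hypothetical:

```python
import subprocess

REPO = "https://git.example.com/mywebapp.git"  # hypothetical URL
CLONE_DIR = "mywebapp"

def git(*args):
    # Run a git command inside the working copy.
    subprocess.run(["git", *args], cwd=CLONE_DIR, check=True)

# One-time setup on each machine.
subprocess.run(["git", "clone", REPO, CLONE_DIR], check=True)

# The daily loop on whichever machine you are working on:
git("pull")                                 # fetch and merge the latest code
# ... edit files ...
git("add", "-A")                            # stage every change
git("commit", "-m", "describe the change")
git("push")                                 # publish for the other machines
```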
Eight years on, I think there is a great tool called Syncthing that should be considered.
https://syncthing.net/
Syncthing is a continuous file synchronization program. It synchronizes files between two or more computers and replaces proprietary sync and cloud services with something open, trustworthy and decentralized. Your data is your data alone and you deserve to choose where it is stored, if it is shared with some third party and how it's transmitted over the internet.
Check the list of Syncthing's goals for more details.
I'm working on a Windows platform and want to be able to auto-sync my files one way, "on change", to my virtual Windows or Linux web server. I also need to be able to filter file types. I can connect to the remote machine via network drives.
I'm ideally looking for a free, easy-to-set-up solution. A commercial product that does what I need is ViceVersa, but it's a little overkill and costs money :)
Thanks
Josh
I'd use rsync: simple, easy to set up, and it provides the filters you need. It's also very light on bandwidth after the first pass.
Here is a link explaining how to get it working in Windows
Whilst rsync doesn't allow "on-change" auto-syncing, it is very fast when it scans a synced directory (even a very large one), so you could schedule a frequent sync to overcome this.
Edit: You could combine it with a program like this to trigger an rsync on folder-content changes; a rough sketch of the idea is below. Cheaper than ViceVersa.
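As a rough sketch of that trigger-on-change idea, the Python watchdog library can watch a folder and shell out to rsync on every change. The paths, the .php/.css file-type filter, and the assumption that an rsync binary (e.g. from Cygwin or cwRsync) is on the PATH are all placeholders:

```python
import subprocess
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

WATCH_DIR = r"C:\projects\site"      # path watchdog monitors (Windows syntax)
SRC = "/cygdrive/c/projects/site/"   # the same folder in cwRsync/Cygwin syntax
DST = "/cygdrive/z/site/"            # network drive mapped to the server

class RsyncOnChange(FileSystemEventHandler):
    def on_any_event(self, event):
        # One-way sync; the include/exclude trio limits the transfer to
        # .php and .css files while still descending into every directory.
        subprocess.run([
            "rsync", "-az", "--delete",
            "--include=*/", "--include=*.php", "--include=*.css",
            "--exclude=*",
            SRC, DST,
        ])

observer = Observer()
observer.schedule(RsyncOnChange(), WATCH_DIR, recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)   # rsync re-scans fast, so re-running on every
                        # event is tolerable for modest directory trees
except KeyboardInterrupt:
    observer.stop()
observer.join()
```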
For other users, it's worth mentioning lsyncd; it will auto-sync on changes between two machines (by default deferring to rsync). It only works on Linux, though, but if that's not a problem it works great.
It also seems that SparkleShare has finally released some working code (a Dropbox clone). I haven't tried it myself, but it does cross-platform syncing and you can set up your own server.
I am thinking about a script/program that can run in the background and attempt to back up or synchronize a given filesystem path to a mirror location (probably on an external/separate storage device).
This should apply to Windows, but it could just as well be used under Linux.
Differential/incremental backups are a bonus.
Windows System State backups are a bonus too.
Keeping the origin free of metadata is essential (unlike version control).
Searching by file or activity date could be interesting (like version control).
Backup repositories should be easy to browse and take little space.
Deleted files should be available for recovery for a period of time.
Windows Backup is tedious and bloated and limited.
Tar-gzipped archives are not accessible enough.
User interaction during backup should be nonexistent.
Amanda is the ultimate full-featured open-source backup solution, and there's a (relatively) new Zmanda Windows Client.
Duplicity is free and creates encrypted, incremental, compressed offsite backups. It's a Linux app, but you could run it in Cygwin or a small virtual machine.
I've written a Perl script that runs it via a cron job to back up several very big directories over DSL, and it works great.
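The cron-driven wrapper can be tiny. Here is a minimal Python sketch of the same idea; the directories, target host, and passphrase are placeholders, and in practice you would not hard-code the passphrase:

```python
import os
import subprocess

# Placeholder passphrase: duplicity reads it from the environment to
# GPG-encrypt the archives. Use a key file or secrets manager in practice.
os.environ["PASSPHRASE"] = "change-me"

BACKUPS = [
    ("/home/me/photos", "photos"),
    ("/home/me/projects", "projects"),
]

for src, name in BACKUPS:
    subprocess.run([
        "duplicity",
        "--full-if-older-than", "1M",   # fresh full backup monthly,
                                        # incrementals in between
        src,
        f"sftp://backup@backup.example.com//backups/{name}",
    ], check=True)
```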
Check out AJCBackup. Does an excellent job at a good price.
Acronis True Image is great. It's not free, but the Home edition is pretty cheap for what it does, and it works reliably. It does image- and file-based backups, scheduling, instant backup of chosen folders accessible from the Explorer context menu, and incremental/differential backups, and it can mount the backup files as Windows volumes so you can browse them, copy files out, etc. It has saved my ass a few times already.