It seems that, the way the "List" processors work, we can't put them in the middle of a flow. Then how do we set attributes on ListHDFS? For example, if I want to parameterize "Directory" and pass it in at runtime.
You can use expression language in the Directory property, but only to reference variables, system properties, or other EL expressions; flow file attributes are not available.
ListHDFS (and the other List processors) are made to track state and determine which files are new since the previous listing. If you were allowed to specify the directory from an incoming flow file, the directory could change at any moment, which would make the previous state meaningless; alternatively, the processor would need to track state for N directories, which could grow large, and it would be unclear when a directory was no longer being listed.
It might be helpful to implement another processor that allows dynamic listings but does not store state.
In case someone else runs into this question: GetHDFSFileInfo is what you are looking for.
Recently I stumbled upon the article "If you want to use GUIDs to identify your files, then nobody's stopping you" by Raymond Chen and wanted to implement this method. But then I found that there is another way to get a file ID: GetFileInformationByHandleEx with FILE_INFO_BY_HANDLE_CLASS::FileIdInfo, using the FileId field (128 bits).
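For reference, this is roughly how the two calls compare (a trimmed, untested sketch; the path is made up, error handling is omitted, and FileIdInfo needs the Windows 8+ headers):

#define _WIN32_WINNT 0x0602   // FileIdInfo / FILE_ID_INFO need Windows 8+ headers
#include <windows.h>
#include <winioctl.h>
#include <cstdio>

int main()
{
    // Hypothetical path; write access requested because the FSCTL below may create an ID.
    HANDLE h = CreateFileW(L"C:\\temp\\example.txt", GENERIC_READ | GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE) return 1;

    // 1) 128-bit file ID via GetFileInformationByHandleEx.
    FILE_ID_INFO idInfo = {};
    if (GetFileInformationByHandleEx(h, FileIdInfo, &idInfo, sizeof(idInfo)))
    {
        // idInfo.FileId.Identifier is 16 bytes; on NTFS the high half comes back zero for me.
        printf("volume serial: %llu\n", (unsigned long long)idInfo.VolumeSerialNumber);
    }

    // 2) NTFS object ID via DeviceIoControl (created on demand by this control code).
    FILE_OBJECTID_BUFFER objId = {};
    DWORD bytes = 0;
    if (DeviceIoControl(h, FSCTL_CREATE_OR_GET_OBJECT_ID, nullptr, 0,
                        &objId, sizeof(objId), &bytes, nullptr))
    {
        // objId.ObjectId is a 16-byte (GUID-sized) identifier.
    }

    CloseHandle(h);
    return 0;
}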
I tried both, and both methods work as expected, but I have a few questions I cannot find any answers to:
These methods return different IDs (and the ID from GetFileInformationByHandleEx seems to use only the low 64 bits, leaving the high part as zero). What does each of them represent? Are they essentially the same thing, or just two independent mechanisms to achieve the same goal?
Edit: Actually, I've just found some information. So the ObjectID from DeviceIoControl is the NTFS object ID, but what is the other ID then? How do they relate (if at all)? Are both methods available only on NTFS, or will at least one of them work on FAT16/32, exFAT, etc.?
The documentation for FILE_INFO_BY_HANDLE_CLASS::FileIdInfo doesn't say that the ID may not exist, unlike FSCTL_CREATE_OR_GET_OBJECT_ID, where I need to explicitly state that I want the ID to be created if there isn't one already. Will there be any bad consequences if I just blindly request creation of object IDs for every file I work with?
I found a comment on this question saying that these IDs remain unchanged if a file is moved to another volume (logical or physical). I tested only the DeviceIoControl method, and the IDs indeed don't change across drives, but if I move the file I'm required to supply OpenFileById with a handle to the destination volume, otherwise it won't open the file. So, is there a way to make OpenFileById find a file without keeping a volume reference?
I'm thinking of enumerating all connected volumes and trying to open the file by ID on each until one succeeds, but I'm not sure how reliable this is. Could two identical IDs exist that reference different files on different volumes?
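Something like this is what I have in mind (rough, untested sketch; the helper name is made up, and it assumes the 16-byte object ID has been copied into a GUID):

#include <windows.h>
#include <cwchar>

// Try every mounted volume until OpenFileById succeeds for the given NTFS object ID.
HANDLE OpenByObjectIdAnywhere(const GUID& objectId)
{
    WCHAR volName[MAX_PATH];
    HANDLE hFind = FindFirstVolumeW(volName, MAX_PATH);
    if (hFind == INVALID_HANDLE_VALUE) return INVALID_HANDLE_VALUE;

    HANDLE result = INVALID_HANDLE_VALUE;
    do
    {
        // FindFirstVolume returns "\\?\Volume{...}\"; drop the trailing backslash
        // so CreateFile opens the volume itself (used only as a hint handle).
        size_t len = wcslen(volName);
        if (len > 0 && volName[len - 1] == L'\\') volName[len - 1] = L'\0';

        HANDLE hVol = CreateFileW(volName, 0, FILE_SHARE_READ | FILE_SHARE_WRITE,
                                  nullptr, OPEN_EXISTING, 0, nullptr);
        if (hVol != INVALID_HANDLE_VALUE)
        {
            FILE_ID_DESCRIPTOR desc = {};
            desc.dwSize = sizeof(desc);
            desc.Type = ObjectIdType;   // FileIdType / ExtendedFileIdType for the other kind of ID
            desc.ObjectId = objectId;

            result = OpenFileById(hVol, &desc, GENERIC_READ,
                                  FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr, 0);
            CloseHandle(hVol);
            if (result != INVALID_HANDLE_VALUE) break;   // found it on this volume
        }
    } while (FindNextVolumeW(hFind, volName, MAX_PATH));

    FindVolumeClose(hFind);
    return result;
}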
How fast is it to ask the system to get (or create) an ID? Will it hurt performance if I add the ID query to regular file enumeration procedures, or would I be better off doing it only on demand, when I really need it?
I'm trying to help my team streamline a data ingestion process that is taking up a substantial amount of time. We receive data in multiple formats and with attributes arranged differently. Is there a way using RapidMiner to create a process that:
Processes files that are dropped into a folder, on a schedule (this one I think I know, but I'd love tips, as scheduled processes are new to me)
Automatically identifies the input file type and routes it to the correct operator ("Read CSV", for example)
Recognizes a relatively small number of attributes and arranges them accordingly. In some cases attributes are named the same way as in our ingestion format, and in others they are not (phone vs. phone # vs. Phone, for example)
The attributes we process mostly consist of name, id, phone, email, and address. Also, in some cases names are split into first/last and in others they are a full name.
I recognize that munging files for such simple attributes shouldn't be that hard, but the number of files we receive and the lack of order make it very difficult to streamline the process without a bit of automation. I'm also going to move to a standardized receiving format, but for a number of reasons that's on the horizon rather than an immediate solution.
I appreciate any tips or guidance you can share.
Your question is relatively broad, so unfortunately I can't give you a complete answer. But here are some ideas on how I would tackle the points you mentioned:
For full process scheduling, RapidMiner Server is what you are looking for. In that case you can either define a schedule (e.g., check regularly for new files) or even define a web service to trigger the process.
For selecting the correct operator depending on the file type, you could use a combination of "Loop Files" and macro extraction to get the correct type, and then use either "Branch" or "Select Subprocess" to switch between the different input routes.
The "Select Attributes" operator has some very powerful options to
select specific subsets only. In your example I would go for a
regular expression akin to [pP]hone.* to get the different spelling
variants. Also very helpful in that case would be the "Reorder
Attributes" operator and "Rename by Replacing" to create a common
naming schema.
A general tip when building more complex process pipelines is to organize your different tasks in sub-processes and use the "Execute Process" operator. This makes everything much more readable and maintainable. A good error-handling strategy is also important for dealing with unforeseen data formats.
For more elaborate answers and tips from many advanced RapidMiner users, I also highly recommend the RapidMiner community.
I hope this gives a good starting point for your project.
NTFS files can have object IDs. These IDs can be set using FSCTL_SET_OBJECT_ID. However, the MSDN article says:
Modifying an object identifier can result in the loss of data from portions of a file, up to and including entire volumes of data.
But it doesn't go into any more detail. How can this result in loss of data? Is it talking about potential object ID collisions in the file system, and does NTFS rely on them in some way?
Side note: I did some experimenting with this before I found that paragraph and set the object IDs of some newly created files; here's hoping that my file system's still intact.
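For what it's worth, the experiment boiled down to something like this (untested sketch; the helper name is made up, the handle presumably needs write access, and the call fails if the file already has an object ID):

#include <windows.h>
#include <winioctl.h>
#include <objbase.h>    // CoCreateGuid (link with ole32)
#include <cstring>

bool SetRandomObjectId(const wchar_t* path)
{
    HANDLE h = CreateFileW(path, GENERIC_READ | GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE) return false;

    FILE_OBJECTID_BUFFER buf = {};
    GUID g;
    CoCreateGuid(&g);                                   // any 16-byte value will do
    memcpy(buf.ObjectId, &g, sizeof(buf.ObjectId));

    DWORD bytes = 0;
    BOOL ok = DeviceIoControl(h, FSCTL_SET_OBJECT_ID, &buf, sizeof(buf),
                              nullptr, 0, &bytes, nullptr);
    CloseHandle(h);
    return ok != FALSE;
}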
I really don't think this can directly result in loss of data.
The only way I can imagine it being possible is if e.g. a backup program assumes that (1) every file has an Object Id, and (2) that the program is keeping track of all IDs at all times. In that case it might assume that an ID that is not in its database must refer to a file that should not exist, and it might delete the file.
Yeah, I know it sounds ridiculous, but that's the only way I can think of in which this might happen. I don't think you can lose data just by changing IDs.
They are used by the distributed link tracking service, which enables client applications to track link sources that have moved. The link tracking service maintains its link to an object only by using these object identifiers (IDs).
So coming back to your question,
Is it talking about potential object ID collisions in the file system?
I don't think so. Windows does provide us the option to set object IDs using FSCTL_SET_OBJECT_ID, but that doesn't bring a risk of ID collisions:
Attempting to set an object identifier on an object that already has an object identifier will fail.
.. and does NTFS rely on them in some way?
Yes. Object identifiers are used to track files and directories. An index of all object IDs is stored on the volume. Rename, backup, and restore operations preserve object IDs. However, copy operations do not preserve object IDs, because that would violate their uniqueness.
How can this result in loss of data?
You won't get into serious problems if you change (or rather set) the object ID of user-created files (as you did). However, if a user (knowingly or unknowingly) sets an object ID used by a shared object file/library, the change will not be reflected as is.
Since Windows doesn't want everyone (except developers) to play with crucial library files, it issues a generic warning:
Modifying an object identifier can result in the loss of data from portions of a file, up to and including entire volumes of data.
Bottom line: Change it if you know what you are doing.
There's another MSDN article on distributed link tracking and object identifiers.
Hope it helps!
EDIT:
Thanks to @Mehrdad for pointing this out. I didn't mean the object identifiers of DLLs themselves, but the ones they use internally.
OLEACC (a DLL) provides the Active Accessibility runtime and manages requests from Active Accessibility clients [source]. It uses the OBJID_QUERYCLASSNAMEIDX object identifier [source].
I need to implement a Workflow system.
For example, to export some data, I need to:
Use an XSLT processor to transform an XML file
Use the resulting transformation to convert into an arbitrary data structure
Use the resulting (file or data) and generate an archive
Move the archive into a given folder.
I started by creating two types of class: a Workflow, which is responsible for adding new Step objects and running them, and the Steps themselves.
Each Step implements a StepInterface.
My main concern is that every step depends on the previous one (except the first), and I'm wondering what would be the best way to handle this.
I thought of looping over the steps and providing each step with the result of the previous one (if any), but I'm not really happy with it.
Another idea would have been to allow a "previous" Step to be set on a Step, like:
$s = new Step();
$s->setPreviousStep($previousStep);
But then I lose the utility of a Workflow class.
Any ideas or advice?
By the way, I'm also concerned about the success or failure of the whole workflow: if any step fails, I need to roll back or clean up the data from the previous steps.
I implemented a similar workflow engine last year (closed source though, so no code that I can share). Here are a few ideas based on that experience:
StepInterface - can do what you're doing right now - abstract a single step.
Additionally, provide a rollback capability, but I think a step should know when it fails and clean up before proceeding further. An abstract step can handle this for you (template method).
You might want to consider branching based on the StepResult - so you could do a StepMatcher that takes a stepResult object and a conditional - its sub-steps are executed only if the conditional returns true.
You could also do a StepException to handle exceptional flows if a step errors out. Ideally, this is something that you can define either at a workflow level (do this if any step fails) and/or at a step level.
I'd taken the approach that a step returns a well defined structure (StepResult) that's available to the next step. If there's bulky data (say a large file etc), then the URI/locator to the resource is passed in the StepResult.
Your workflow is going to need a context to work with - in the example you quote, this would be the name of the file, the location of the archive and so on - so think of a WorkflowContext (sketched below).
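To tie these pieces together, here's a minimal sketch of what I mean. The design is language-agnostic (your code is PHP; this is C++ purely for concreteness), and everything beyond the Step/StepResult/WorkflowContext names is illustrative:

#include <map>
#include <memory>
#include <string>
#include <vector>

// Shared state for the whole run: file names, archive location, and so on.
struct WorkflowContext {
    std::map<std::string, std::string> values;     // e.g. "archive.path" -> "/tmp/out.zip"
};

// Well-defined result handed to the next step; bulky data is passed as a URI/locator.
struct StepResult {
    bool ok = false;
    std::string resourceUri;
};

struct StepInterface {
    virtual ~StepInterface() = default;
    virtual StepResult run(WorkflowContext& ctx, const StepResult* previous) = 0;
    virtual void rollback(WorkflowContext& ctx) {}  // optional cleanup, see above
};

class Workflow {
public:
    void addStep(std::unique_ptr<StepInterface> step) { steps_.push_back(std::move(step)); }

    bool run(WorkflowContext& ctx) {
        std::vector<StepInterface*> done;
        StepResult last;
        const StepResult* prev = nullptr;
        for (auto& step : steps_) {
            last = step->run(ctx, prev);
            if (!last.ok) {                         // failure: roll back in reverse order
                for (auto it = done.rbegin(); it != done.rend(); ++it) (*it)->rollback(ctx);
                return false;
            }
            done.push_back(step.get());
            prev = &last;
        }
        return true;
    }

private:
    std::vector<std::unique_ptr<StepInterface>> steps_;
};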
Additional thoughts
You might want to consider the following too - if this is something that you're planning to implement as a large scale service/server:
Steps could be in libraries that were dynamically loaded
Workflow definition in an XML/JSON file - again, dynamically reloaded when edited.
Remote invocation and callback - submit a job to a remote service with a callback API. When the remote service calls back, the workflow execution is picked up at the subsequent step in the flow.
Parallel execution where possible etc.
Stateless design
Rolling back can fit into this structure easily, as each Step will implement its own rollback() method, which the workflow can call (preferably in reverse order) if any of the steps fail.
As for the main question, it really depends on how sophisticated you want to get. On a basic level, you can define a StepResult interface, which is returned by each step and passed on to the next one. The obvious problem with this approach is that each step has to "know" which implementation of StepResult to expect. For small systems this may be acceptable; for larger systems you'd probably need some kind of configurable mapping framework that can be told how to convert the result of the previous step into the input of the next one. So the Workflow calls a Step, the Step returns a StepResult, the Workflow then calls a StepResultConverter (which is your configurable mapping thingy), the StepResultConverter returns a StepInput, the Workflow then calls the next Step with that StepInput, and so on.
I've had great success implementing workflows using a finite state machine. It can be as simple or complicated as you like, with multiple workflows linking to each other. Generally an FSM can be implemented as a simple table. The current state of a given object is tracked in a history table by keeping a journal of the transitions on the object and simply retrieving the last entry. So a transition would be of the form:
nextState = TransLookup(currState, Event, [Condition])
If you are implementing a front end you can use this transition information to construct a list of the events available to a given object in its current state.
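As a minimal illustration (state and event names are made up to match the export example, and the optional Condition argument is left out):

#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using State = std::string;
using Event = std::string;

// Transition table: (current state, event) -> next state.
const std::map<std::pair<State, Event>, State> kTransitions = {
    {{"transformed", "convert"}, "converted"},
    {{"converted",   "archive"}, "archived"},
    {{"archived",    "move"},    "delivered"},
};

// Journal of transitions; the object's current state is simply the last entry.
std::vector<State> history = {"transformed"};

std::optional<State> TransLookup(const State& curr, const Event& ev) {
    auto it = kTransitions.find({curr, ev});
    if (it == kTransitions.end()) return std::nullopt;  // event not allowed in this state
    return it->second;
}

bool Fire(const Event& ev) {
    if (auto next = TransLookup(history.back(), ev)) {
        history.push_back(*next);                       // journal the transition
        return true;
    }
    return false;
}

Building the list of events available to an object in its current state is then just a scan of the table for entries whose state component matches.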
I've been using ReadDirectoryChangesW to monitor a particular portion of the file system. It rather nicely provides a partial pathname to the file or directory which changed along with a clue about the nature of the change. This may have spoiled me.
I also need to monitor a particular portion of the registry, but it looks as if RegNotifyChangeKeyValue is very coarse. It will tell me that something under the given key changed, but it doesn't seem to want to tell me what that something might have been. Bummer!
The portion of the registry in question is arbitrarily deep, so enumerating all the sub-keys and calling RegNotifyChangeKeyValue for each probably isn't a hot idea because I'll eventually end up having to overcome MAXIMUM_WAIT_OBJECTS. Plus I'd have to adjust the set of keys I'd passed to RegNotifyChangeKeyValue, which would be a fair amount of effort to do without enumerating the sub-keys every time, which would defeat a fair amount of the purpose.
Any ideas?
Unfortunately, yes. You probably have to cache all the values of interest to your code and update this cache yourself whenever you get a change trigger, or else set up multiple watchers, one on each of the individual data items of interest. As you noted, the second solution gets unwieldy very quickly.
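A rough, untested sketch of the first approach: watch the whole subtree with a single notification, then re-enumerate and diff against a cached snapshot to find out what actually changed (the enumeration is deliberately simplified and skips long names and values over 4 KB):

#include <windows.h>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// "subkey\path\valueName" -> raw value data.
using Snapshot = std::map<std::wstring, std::vector<BYTE>>;

static void SnapshotSubtree(HKEY key, const std::wstring& prefix, Snapshot& out)
{
    for (DWORD i = 0;; ++i)                 // values directly under this key
    {
        WCHAR name[256]; DWORD nameLen = 256;
        BYTE data[4096]; DWORD dataLen = sizeof(data);
        if (RegEnumValueW(key, i, name, &nameLen, nullptr, nullptr, data, &dataLen) != ERROR_SUCCESS)
            break;
        out[prefix + L"\\" + name].assign(data, data + dataLen);
    }
    for (DWORD i = 0;; ++i)                 // recurse into subkeys
    {
        WCHAR sub[256]; DWORD subLen = 256;
        if (RegEnumKeyExW(key, i, sub, &subLen, nullptr, nullptr, nullptr, nullptr) != ERROR_SUCCESS)
            break;
        HKEY child;
        if (RegOpenKeyExW(key, sub, 0, KEY_READ, &child) == ERROR_SUCCESS)
        {
            SnapshotSubtree(child, prefix + L"\\" + sub, out);
            RegCloseKey(child);
        }
    }
}

static void DiffSnapshots(const Snapshot& before, const Snapshot& after)
{
    for (const auto& kv : after)
    {
        auto it = before.find(kv.first);
        if (it == before.end())            wprintf(L"added:   %ls\n", kv.first.c_str());
        else if (it->second != kv.second)  wprintf(L"changed: %ls\n", kv.first.c_str());
    }
    for (const auto& kv : before)
        if (after.find(kv.first) == after.end())
            wprintf(L"removed: %ls\n", kv.first.c_str());
}

void WatchSubtree(HKEY root)                // pass an already-opened key, e.g. from RegOpenKeyEx
{
    HANDLE event = CreateEventW(nullptr, FALSE, FALSE, nullptr);
    Snapshot before;
    SnapshotSubtree(root, L"", before);

    for (;;)
    {
        // One watcher for the whole subtree (bWatchSubtree = TRUE) sidesteps the
        // per-key handle explosion and the MAXIMUM_WAIT_OBJECTS limit.
        RegNotifyChangeKeyValue(root, TRUE,
                                REG_NOTIFY_CHANGE_NAME | REG_NOTIFY_CHANGE_LAST_SET,
                                event, TRUE);
        WaitForSingleObject(event, INFINITE);

        Snapshot after;
        SnapshotSubtree(root, L"", after);
        DiffSnapshots(before, after);       // reports which values were added/changed/removed
        before = std::move(after);
    }
}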
If you can implement the required code in .NET, you can get the same effect more elegantly via RegistryEvent and its subclasses.