Generalizing the multiple processes in flow chart diagram - etl

Below is a general ETL flow chart diagram.I am really confused if it is a good practice to draw such a flow chart.Especially at lines connecting the final output , and the big box used to generalize the whole process that goes from input format to 'Validate files as per type of file' to 'Adjustments for desired output' and finally to 'outputs'.

In concept, to provide an overview of the process, a Context diagram or Data Flow Diagram can be used as well as a Flow Chart. While all these diagrams are old, they are usually useful.
I suggest that you check the following points in your diagram - What I have here are personal suggestions based on long history with ETL.
Your processing shape (say rectangle) must have at least one entry (input) and at least one exist(output). Example: The two large rectangles don't have clear inputs and outputs! What is going on there?
You can't have outputs without a process - Example: Type 1 is somehow split into Type 1.1 and Type 1.2. How is that split happened? Is it by a program? Which one? Another example is Type 1.1 has an arrow connected to Output 1. Again same questions hold. Another example is "Rejected folder" to email.
The name of the process should indicate what the process is or the actual program component name or physical file names. This is valuable in ETL.
You may want to show the process triggering event such as time of day.
Use remarks.
Make the level of detail consistent.

Related

How to go down (recursively) in UIPath

I want to go down recursively and automatically at the host and clusters tab (the blue one in the picture), and then take the text like guest os, compability, etc.
I already know how to get the text from summary, but I got the problem to loop (go down) at the host and cluster tabs.
Any idea how to do it?
Thank you very much.
It's probably not the best idea to have one robot do everything - i.e. setting filters, parsing results, clicking on each result, parsing data, then going back, and so on. Instead of that approach, divide and conquer. Create multiple sequences/workflows, each one with one specific task in mind.
Here's how I would tackle it:
Create a sequence that opens the target page, setting the filters (e.g. Germany, as mentioned in your link: https://www.autoscout24.de/ergebnisse?cy=D&powertype=kw&atype=C&ustate=N%2CU&sort=standard&desc=0&page=1&size=20)
Have the same robot extract the link for each result. Don't go into the details, this could potentially be the task of another robot.
Have the links stored in a queue. Read more about Queues and Transactions here.
Create another sequence that takes the next item in the queue and open the stored link. This will take you directly to the results page (e.g. https://www.autoscout24.de/angebote/opel-corsa-12v-city-servo-ahk-benzin-silber-84a66193-b6d6-4fc9-b222-ec5a7319f221?cldtidx=1)
Parse any kind of data you need.
This comes with the benefit that you can use multiple robots at the same time extracting data, potentially increasing scraping significantly. The Queues and Transactions feature will make sure each result is visited only once, and that multiple robots don't process the same item multiple times.
Edit: you might want to start with ReFramework, which I'd recommend.

Is there a process for munging data from many different formats in RapidMiner?

I'm trying to help my team streamline a data ingestion process that is taking up a substantial amount of time. We receive data in multiple formats and with attributes arranged differently. Is there a way using RapidMiner to create a process that:
Processes files on a schedule that are dropped into a folder (this
one I think I know but I'd love tips on this as scheduled processes
are new to me)
Automatically identifies input filetype and routes to the correct operator ("Read CSV" for example)
Recognizes a relatively small number of attributes and arranges them accordingly. In some cases, attributes are named the same way as our ingestion format and in others they are not (phone vs phone # vs Phone for example)
The attributes we process mostly consist of name, id, phone, email, address. Also, in some cases names are split first/last and in some they are full name.
I recognize that munging files for such simple attributes shouldn't be that hard but the number of files we receive and lack of order makes it very difficult to streamline a process without a bit of automation. I'm also going to move to a standardized receiving format but for a number of reasons that's on the horizon and not an immediate solution.
I appreciate any tips or guidance you can share.
Your question is relative broad, so unfortunately I can't give you complete answer. But here are some ideas on how I would tackle the points you mentioned:
For a full process scheduling RapidMiner Server is what you are
looking for. In that case you can either define a schedule (e.g.,
check regularly for new files) or even define a web service to
trigger the process.
For selecting the correct operator depending on file type, you could
use a combination of "Loop Files" and macro extraction to get the
correct type and the use either "Branch" or "Select Subprocess" for
switching to different input routes.
The "Select Attributes" operator has some very powerful options to
select specific subsets only. In your example I would go for a
regular expression akin to [pP]hone.* to get the different spelling
variants. Also very helpful in that case would be the "Reorder
Attributes" operator and "Rename by Replacing" to create a common
naming schema.
A general tip when building more complex process pipelines is to organize your different tasks in sub-processes and use the "Execute Process" operator. This makes everything much more readable and maintainable. Also a good error handling strategy is important to handle unforeseen data formats.
For more elaborate answers and tips from many adavanced RapidMiner users, I also highly recommend the RapidMiner community.
I hope this gives a good starting point for your project.

Delays in updating content controls when more context.sync() are used on WORD for Mac

We update content control for every character typed in the task pane’s input field. So that user can see the live updates on the word document.
Recently we added functionality for locking content controls. And it happens as below:
User input (types a character) in a input field
We search a content control for that input field (involves context.sync)
Unlock the content control (involves context.sync)
Update value in content control (involves context.sync)
Lock back the content control (involves context.sync)
All this works nice in Word for windows without problems.
But is extremely (visibly) slow with Word for Mac (apple machines)
How should I overcome the delays happening on Mac?
As Juan mentioned in the comment, there are some important details that the team would need to investigate. Sample code would be good too.
That being said, just looking at what you describe, I think you can dramatically cut down on the context.sync() statements. Unlocking the content control, updating its value, and locking it should all be possible to do in one sync.
I have a bunch of details about minimizing sync-s in my book, "Building Office Add-ins using Office.js. Quoting one of the sections from it:
As an add-in author, your job is to minimize the number of context.sync()
calls. Each sync is an extra round-trip to the host application; and when
that application is Office Online, the cost of each of those round-trip adds up
quickly.
If you set out to write your add-in with this in principle in mind, you will
find that you need a surprisingly small number of sync calls. In fact, when
writing this chapter, I found that I really needed to rack my brain to come up with a scenario that did need more than two sync calls. The trick for
minimizing sync calls is to arrange the application logic in such a way that
you're initially scraping the document for whatever information you need
(and queuing it all up for loading), and then following up with a bunch
of operations that modify the document (based on the previously-loaded
data). You've seen several examples of this already: one in the intro chapter,
when describing why Office.js is async; and more recently in the "canonical
sample" section at the beginning of this chapter. For the latter, note that the
scenario itself was reasonably complex: reading document data, processing
it to determine which city has experienced the highest growth, and then
creating a formatted table and chart out of that data. However, given the
"time-travel" superpowers of proxy objects, you can still accomplish this task
as one group of read operations, followed by a group of write operations.
Still, there are some scenarios where multiple loads may be required. And in
fact, there may be legitimate scenarios where even doing an extra sync is the
right thing to do – if it saves on loading a bunch of unneeded data. You will
see an example of this later in the chapter.

Hadoop for the Wikipedia pagecount dataset

I want to build a Hadoop-Job that basically takes the wikipedia pagecount-statistic as input and creates a list like
en-Articlename: en:count de:count fr:count
For that I need the different articlenames related to each language - i.e. Bruges(en, fr), Brügge(de), which the MediaWikiApi query articlewise(http://en.wikipedia.org/w/api.php?action=query&titles=Bruges&prop=langlinks&lllimit=500).
My question is to find the right approach to solve this problem.
My sketched approach would be:
Process the pagecount file line by line (line-example 'de Brugge 2 48824')
Query the MediaApi and write sth. like'en-Articlename: process-language-key:count'
Aggreate all en-Articlename-values to one line (maybe in a second job?)
Now it seems rather unhandy to query the MediaAPI for every line but currently can not get my head around a better solution.
Do you think the current approach for is feasible or can you think of a different one?
On a sidenote: The created job-chain shall be used to do some time-measuring on my (small) Hadoop-Cluster, so altering the task is still okay
Edit:
Here is a quite similar discussion which I just found now..
I think it isn't a good idea to query MediaApi during your batch processing due to:
network latency (your processing will be considerably slowed down)
single point of failure (if the api or your internet connection goes down your calculation will be aborted)
external dependency (its hard to repeat the calculation and got the same result)
legal issues and a ban possibility
The possible solution to your problem is to download the whole wikipedia dump. Each article contains links to that article in other languages in a predefined format, so you can easily write a map/reduce job that collects that information and builds a correspondence between English article name and the rest.
Then you can use the correspondence in a map/reduce job processing pagecount-statistic. If you do that you'll become independent to mediawiki's api, speed up your data processing and improve debugging.

Designing a complex workflow diagram

We've got a surprisingly complex workflow that needs to be monitored by a quasi-technical employees with an in-house webapp. There's about 30 steps, some of which are manual (editing), some are semi-automated stop points (like "the files have been received" or customer approval of certain templates), and some are completely automated (file conversion, search indexing, etc). The flowchart for all of these steps is large and complicated, and three people might be working on three completely different steps at any one time.
How would you present this vast amount of information as usefully as possible to your users? Just showing the whole diagram seems like the brute force solution. But it's big, and it'll likely get bigger as we do more things. Not to mention the complexity necessary to encode this entire diagram in HTML.
I assume you don't want to show these just for entertainment or mockery, but help the users along the way, automating as much as possible, document the process etc. It would probably help if you clearly define the goals or purpose of your app.
I don't see a point in showing the entire workflow, except for "debugging the business rules" or maybe the clients want to see it.
If your goal is to help users do their job, I would present the state of the "project" (or whatever term fits better) is at, and possible transitions to other states.
The State might be multiple mostly independent variables, e.g. one might describe the progress of content - e.g. "incomplete" / "complete" / "reviewed by 2nd staffer" / "signed off by 2nd staffer", others might contain a schedule that is developed in parallel, e.g. "test print date = not scheduled", "print date = not scheduled", "final delivery = tomorrow, preferredly yesterday".
A transition might be "Seint to customer for review", "mark as content-complete", "content modified", etc.
Is this what you have in mind?
I propose to divide your workflow in modules and represent the active state for each module.
A module is a subset of your main workflow. For example it could be divided by tasks, person, roles, department, etc. This will greatly simplify the representation of the workflow. Let's says someone is responsible for data entry at many critical moments. We can group all his tasks in one module (or sub-workflow) containing the same activities, inputs, outputs and conditions. Modules could be inter-dependants and related.
A state is where we are located in a module. In simple workflows there is only one active task. In real life we are multi-threaded! So maybe in one module many states could be active at the same time. The state also includes active inputs, outputs and memory bits.
An input is something required to perform an activity for evaluation a boolean condition. It could be a document, a piece of data, a signal...
An output is something resulting from a task: an information, a document, a signal...
Enough definitions?
Then simply convert your workflow into a LADDER LOGIC and you have your states!
See Ladder Logic definition on Wikipedia
You display only active states:
Active task(s) for the module
Inputs required / inputs confirmed
Output required / output realized
Conditions to continue
Seems abstract?
Here is a small example...
Janet enters data in the system. She manages the green tasks of the diagram. We focus only on her work, not other tasks. She knows how to do 16 tasks in the workflow. We are waiting the following actions from her to continue, and her Intranet dashboard says:
Priority 1: You must send a PO to order enough pencils for the next month based on the sales report.
Task: Send a purchase order
Inputs: Forecast report from the marketing department
Outputs: PO, vendor, item, quantity
Condition for completion: PO sent and order confirmation received from supplier
Priority 2: You must enter into the financial system the number of erasers rejected by production
Task: Data entry
Inputs: Reject count from production
Outputs: Number of rejects
Condition for completion: data entered and confirmed
We do a lot of troubleshooting on automated production systems having hundreds of thousands ladder steps (the workflow is too complex to be represented in a whole). When the system is blocked we look at each module and determine what inputs are missing to activation task completion.
Good luck!
This sounds like the sort of application for which BPEL is suited.
Of course you don't want to re-architect your system right now. But there are a number of BPEL implmentations out there, some of which include graphical editing tools. One of these might help you in your current situation, because they are good at handling scope and hiding detail. So I think you might derive benefit from drawing your workflow as a BPEL diagram even if you don't do anything else with the language.
The Wikipedia page lists several of the available implementations. In addition, Oracle's JDeveloper IDE includes a BPEL Diagrammer as part of its SOA suite; unfortunately it is no longer part of the standard install but it is still available. Find out more.
Try doing it in layers. You have the most detailed layer done, now add additional docs with the details hidden, grouped into higher-level business processes. Users should be able to safely ignore some of those details, but it's good for them to have visibility of how their part fits in to the whole.
You may need more than one higher-level document.
You can use Prezi to present this information to users in a lucid manner.
Split and present the work flow into phases such that the end user is easily able to identify the phase he is currently in.
Display as many number of phases as the number of inputs. The workflow starts with 6 different inputs so display the six different buttons on screen enabling the user to select the input that he wants.
On selecting the button zoom into the workflow depicting the next steps. This would also help the user to verify the actions that he has done so far to reach the current states.
This would also help the user to verify the actions that he has done so far to reach the current states. But this way of presenting could become cumbersome for the users as the number of steps that he has completed goes up. Say the user has almost reached the end of the workflow. To check for the next step he should go through all the steps which might frustrate the user.
To avoid this you can split the complete work flow chronologically into 3-5 phases. The phases should be split logically. The ultimate aim would be not to overwhelm the users with the full work flow. Personally i would try to avoid the task involving this workflow if presented the way you have shown. No offense. I bet you also feel the same.
Could give you a better picture if you could re-post the image after replacing the state names with numbers.
I'd recommend having the whole flow documented somewhere, but in terms of what is distributed to users, how about focusing on task-oriented flows? No one user will be responsible for the entire process I would imagine.
For example, let's say I have 2 roles, A and B, and 6 tasks, 1 through 6, executed in order. Each task may have multiple steps but is self-contained (e.g. download the file, review, run process, review again, upload). A does the even tasks and B does the odd tasks.
A would need to know about those detailed steps that comprise tasks 2, 4, and 6 but not about what goes on in 1, 3, and 5. So hand A a detailed set of flows for the tasks he is responsible for, along with a diagram that treats each task as a black box.
If the flow can't be made modular in this way, you may want to review the process itself to see why it's so complex.
How about showing an example of a workflow scenario, that is, showing the transitions in one possible passing through the workflow? You could cater this to a specific user profile and highlight the pertinent states, dimming the others. This allows them to get a clear idea of the transitions by seeing a real-life example.

Resources