I want to go down recursively and automatically at the host and clusters tab (the blue one in the picture), and then take the text like guest os, compability, etc.
I already know how to get the text from summary, but I got the problem to loop (go down) at the host and cluster tabs.
Any idea how to do it?
Thank you very much.
It's probably not the best idea to have one robot do everything - i.e. setting filters, parsing results, clicking on each result, parsing data, then going back, and so on. Instead of that approach, divide and conquer. Create multiple sequences/workflows, each one with one specific task in mind.
Here's how I would tackle it:
Create a sequence that opens the target page, setting the filters (e.g. Germany, as mentioned in your link: https://www.autoscout24.de/ergebnisse?cy=D&powertype=kw&atype=C&ustate=N%2CU&sort=standard&desc=0&page=1&size=20)
Have the same robot extract the link for each result. Don't go into the details, this could potentially be the task of another robot.
Have the links stored in a queue. Read more about Queues and Transactions here.
Create another sequence that takes the next item in the queue and open the stored link. This will take you directly to the results page (e.g. https://www.autoscout24.de/angebote/opel-corsa-12v-city-servo-ahk-benzin-silber-84a66193-b6d6-4fc9-b222-ec5a7319f221?cldtidx=1)
Parse any kind of data you need.
This comes with the benefit that you can use multiple robots at the same time extracting data, potentially increasing scraping significantly. The Queues and Transactions feature will make sure each result is visited only once, and that multiple robots don't process the same item multiple times.
Edit: you might want to start with ReFramework, which I'd recommend.
Related
We update content control for every character typed in the task pane’s input field. So that user can see the live updates on the word document.
Recently we added functionality for locking content controls. And it happens as below:
User input (types a character) in a input field
We search a content control for that input field (involves context.sync)
Unlock the content control (involves context.sync)
Update value in content control (involves context.sync)
Lock back the content control (involves context.sync)
All this works nice in Word for windows without problems.
But is extremely (visibly) slow with Word for Mac (apple machines)
How should I overcome the delays happening on Mac?
As Juan mentioned in the comment, there are some important details that the team would need to investigate. Sample code would be good too.
That being said, just looking at what you describe, I think you can dramatically cut down on the context.sync() statements. Unlocking the content control, updating its value, and locking it should all be possible to do in one sync.
I have a bunch of details about minimizing sync-s in my book, "Building Office Add-ins using Office.js. Quoting one of the sections from it:
As an add-in author, your job is to minimize the number of context.sync()
calls. Each sync is an extra round-trip to the host application; and when
that application is Office Online, the cost of each of those round-trip adds up
quickly.
If you set out to write your add-in with this in principle in mind, you will
find that you need a surprisingly small number of sync calls. In fact, when
writing this chapter, I found that I really needed to rack my brain to come up with a scenario that did need more than two sync calls. The trick for
minimizing sync calls is to arrange the application logic in such a way that
you're initially scraping the document for whatever information you need
(and queuing it all up for loading), and then following up with a bunch
of operations that modify the document (based on the previously-loaded
data). You've seen several examples of this already: one in the intro chapter,
when describing why Office.js is async; and more recently in the "canonical
sample" section at the beginning of this chapter. For the latter, note that the
scenario itself was reasonably complex: reading document data, processing
it to determine which city has experienced the highest growth, and then
creating a formatted table and chart out of that data. However, given the
"time-travel" superpowers of proxy objects, you can still accomplish this task
as one group of read operations, followed by a group of write operations.
Still, there are some scenarios where multiple loads may be required. And in
fact, there may be legitimate scenarios where even doing an extra sync is the
right thing to do – if it saves on loading a bunch of unneeded data. You will
see an example of this later in the chapter.
hi folks,
We use StormCrawler with elasticsearch to make an index of our homepage which consist of "old pages" and "new pages".
My Question in short:
If two pages A(old),B(new) link to page X, how to pass metadata from B to X?
My Question in long:
We relauched our homepage step by step. So at time we have pdf-Files which are reachable via only the old html-pages, via only the new html-page or on both ways.
For "order by" purpose we must mark all pdf-Files which are reachable by the new html-pages.
So we insert "newHomepage=true" to seeds.txt and "metadata.transfer/-newHomepage" to "crawler-conf.yaml": Fine :-)
But for the pdf-Files which are reachable from old !and! new html-pages, we now have a race condition: If our pdf-File is "DISCOVERED" from an old page this information (newHomepage=false) is in Status-Index and can not be overridden.
( StatusUpdaterBolt does not override documents, IndexerBolt does override by default).
To make the thinks more complicate: in our case a URL (at html-page) to a PDF is redirected two times, before the file is delivered.
So from my point of view we have two possibilities:
Start the crawler two times. First we only index our new pages (and all reachable pdf files), second we index our old pages.
--> Problems with new pages which are changed after crawler was started
Store "outbound_links" and use them to set "newHomepage" independently from the crawler
--> short times with wrong metadata in index
Any advice or other ideas?
Best regards
Karsten
thanks for sharing your problem and great to hear that you are using SC. This is an interesting and unusual use case.
Your analysis of the problem is correct. An intuitive approach would be to extend the default StatusUpdaterBolt so that it updates the metadata if a document already exists. You'd need to remove the part that does the check on whether the doc has a status of DISCOVERED.
This would slow things down, but since you are dealing with a single website, this should not have a massive impact.
You could push the logic even further by setting a new nextFetchDate if the document had been fetched so that it gets refetched and updated quicker in the doc index (as opposed to the status one).
Below is a general ETL flow chart diagram.I am really confused if it is a good practice to draw such a flow chart.Especially at lines connecting the final output , and the big box used to generalize the whole process that goes from input format to 'Validate files as per type of file' to 'Adjustments for desired output' and finally to 'outputs'.
In concept, to provide an overview of the process, a Context diagram or Data Flow Diagram can be used as well as a Flow Chart. While all these diagrams are old, they are usually useful.
I suggest that you check the following points in your diagram - What I have here are personal suggestions based on long history with ETL.
Your processing shape (say rectangle) must have at least one entry (input) and at least one exist(output). Example: The two large rectangles don't have clear inputs and outputs! What is going on there?
You can't have outputs without a process - Example: Type 1 is somehow split into Type 1.1 and Type 1.2. How is that split happened? Is it by a program? Which one? Another example is Type 1.1 has an arrow connected to Output 1. Again same questions hold. Another example is "Rejected folder" to email.
The name of the process should indicate what the process is or the actual program component name or physical file names. This is valuable in ETL.
You may want to show the process triggering event such as time of day.
Use remarks.
Make the level of detail consistent.
We've got a surprisingly complex workflow that needs to be monitored by a quasi-technical employees with an in-house webapp. There's about 30 steps, some of which are manual (editing), some are semi-automated stop points (like "the files have been received" or customer approval of certain templates), and some are completely automated (file conversion, search indexing, etc). The flowchart for all of these steps is large and complicated, and three people might be working on three completely different steps at any one time.
How would you present this vast amount of information as usefully as possible to your users? Just showing the whole diagram seems like the brute force solution. But it's big, and it'll likely get bigger as we do more things. Not to mention the complexity necessary to encode this entire diagram in HTML.
I assume you don't want to show these just for entertainment or mockery, but help the users along the way, automating as much as possible, document the process etc. It would probably help if you clearly define the goals or purpose of your app.
I don't see a point in showing the entire workflow, except for "debugging the business rules" or maybe the clients want to see it.
If your goal is to help users do their job, I would present the state of the "project" (or whatever term fits better) is at, and possible transitions to other states.
The State might be multiple mostly independent variables, e.g. one might describe the progress of content - e.g. "incomplete" / "complete" / "reviewed by 2nd staffer" / "signed off by 2nd staffer", others might contain a schedule that is developed in parallel, e.g. "test print date = not scheduled", "print date = not scheduled", "final delivery = tomorrow, preferredly yesterday".
A transition might be "Seint to customer for review", "mark as content-complete", "content modified", etc.
Is this what you have in mind?
I propose to divide your workflow in modules and represent the active state for each module.
A module is a subset of your main workflow. For example it could be divided by tasks, person, roles, department, etc. This will greatly simplify the representation of the workflow. Let's says someone is responsible for data entry at many critical moments. We can group all his tasks in one module (or sub-workflow) containing the same activities, inputs, outputs and conditions. Modules could be inter-dependants and related.
A state is where we are located in a module. In simple workflows there is only one active task. In real life we are multi-threaded! So maybe in one module many states could be active at the same time. The state also includes active inputs, outputs and memory bits.
An input is something required to perform an activity for evaluation a boolean condition. It could be a document, a piece of data, a signal...
An output is something resulting from a task: an information, a document, a signal...
Enough definitions?
Then simply convert your workflow into a LADDER LOGIC and you have your states!
See Ladder Logic definition on Wikipedia
You display only active states:
Active task(s) for the module
Inputs required / inputs confirmed
Output required / output realized
Conditions to continue
Seems abstract?
Here is a small example...
Janet enters data in the system. She manages the green tasks of the diagram. We focus only on her work, not other tasks. She knows how to do 16 tasks in the workflow. We are waiting the following actions from her to continue, and her Intranet dashboard says:
Priority 1: You must send a PO to order enough pencils for the next month based on the sales report.
Task: Send a purchase order
Inputs: Forecast report from the marketing department
Outputs: PO, vendor, item, quantity
Condition for completion: PO sent and order confirmation received from supplier
Priority 2: You must enter into the financial system the number of erasers rejected by production
Task: Data entry
Inputs: Reject count from production
Outputs: Number of rejects
Condition for completion: data entered and confirmed
We do a lot of troubleshooting on automated production systems having hundreds of thousands ladder steps (the workflow is too complex to be represented in a whole). When the system is blocked we look at each module and determine what inputs are missing to activation task completion.
Good luck!
This sounds like the sort of application for which BPEL is suited.
Of course you don't want to re-architect your system right now. But there are a number of BPEL implmentations out there, some of which include graphical editing tools. One of these might help you in your current situation, because they are good at handling scope and hiding detail. So I think you might derive benefit from drawing your workflow as a BPEL diagram even if you don't do anything else with the language.
The Wikipedia page lists several of the available implementations. In addition, Oracle's JDeveloper IDE includes a BPEL Diagrammer as part of its SOA suite; unfortunately it is no longer part of the standard install but it is still available. Find out more.
Try doing it in layers. You have the most detailed layer done, now add additional docs with the details hidden, grouped into higher-level business processes. Users should be able to safely ignore some of those details, but it's good for them to have visibility of how their part fits in to the whole.
You may need more than one higher-level document.
You can use Prezi to present this information to users in a lucid manner.
Split and present the work flow into phases such that the end user is easily able to identify the phase he is currently in.
Display as many number of phases as the number of inputs. The workflow starts with 6 different inputs so display the six different buttons on screen enabling the user to select the input that he wants.
On selecting the button zoom into the workflow depicting the next steps. This would also help the user to verify the actions that he has done so far to reach the current states.
This would also help the user to verify the actions that he has done so far to reach the current states. But this way of presenting could become cumbersome for the users as the number of steps that he has completed goes up. Say the user has almost reached the end of the workflow. To check for the next step he should go through all the steps which might frustrate the user.
To avoid this you can split the complete work flow chronologically into 3-5 phases. The phases should be split logically. The ultimate aim would be not to overwhelm the users with the full work flow. Personally i would try to avoid the task involving this workflow if presented the way you have shown. No offense. I bet you also feel the same.
Could give you a better picture if you could re-post the image after replacing the state names with numbers.
I'd recommend having the whole flow documented somewhere, but in terms of what is distributed to users, how about focusing on task-oriented flows? No one user will be responsible for the entire process I would imagine.
For example, let's say I have 2 roles, A and B, and 6 tasks, 1 through 6, executed in order. Each task may have multiple steps but is self-contained (e.g. download the file, review, run process, review again, upload). A does the even tasks and B does the odd tasks.
A would need to know about those detailed steps that comprise tasks 2, 4, and 6 but not about what goes on in 1, 3, and 5. So hand A a detailed set of flows for the tasks he is responsible for, along with a diagram that treats each task as a black box.
If the flow can't be made modular in this way, you may want to review the process itself to see why it's so complex.
How about showing an example of a workflow scenario, that is, showing the transitions in one possible passing through the workflow? You could cater this to a specific user profile and highlight the pertinent states, dimming the others. This allows them to get a clear idea of the transitions by seeing a real-life example.
Am trying to write a piece of code which will allows the user to type text into a textbox which then gets saved on the server. When the user types some more text in the textbox, I want only the difference to be sent to the server.
Is there a difference algorithm for JS which I can use to send only information about the difference. So it should be able to tell the difference between two text boxes essentially.
It could also be language agnostic and I can port it.
Thank you for your time.
UPDATE
In simple words. I have a text area which keeps saving the text in the box every X seconds. Now to save bandwidth I only want it to send the difference from the last saved revision (which I can say put in a variable. Initially this will be empty). Now the JS has to check the difference between the last revision and the current state of the textbox and generate a change list to send to the server.
UPDATE 2
Something like www.etherpad.com
Google DiffMatchPatch has a Javascript implementation, I've used it with much success.
http://code.google.com/p/google-diff-match-patch/
The Python difflib module does this and more. It's very flexible but might be challenging to port to Javascript.
Regarding your update, I'm first wondering why you need to worry about bandwidth. Unless your users are typing a lot of text into an edit box (which has its own usability issues) then there just aren't that many bytes to send. Send the whole text box each time you autosave. Users can't type fast enough to really notice the use of bandwidth.
Or, you could meet halfway. Every time you autosave, check to see whether the user has only added new text to the end compared to the last time. If so, send an "append" type update with just the new text. If the user has gone back and edited anything else, then send a "replace" type update where you send the whole text. This takes care of the common append-only case without severely complicating your implementation.
Instead of calculating a diff between 2 texts, which is difficult,
you could always, while people are editting, record the keystrokes and the caret position in the textbox. If you send this over every now and then (and clean the buffer), the server can playback the exact same sequence.
This code-smells of premature optimization. Perhaps you should implement your solution first and then see about optimizing your transfer rates using diffs. How much text are you looking at? Because the request and response packets are going to be more or less the same size with only a few bytes difference for your message, so the savings could be very minimal.
At the very least, complete your solution without optimization and profile your network traffic using tools like Firebug and then test to see how much worse the performance is with what you would consider to be the maximum text block that could be sent.
Finally, you could always use the TypeWatch JQuery plugin to listen for change events in the textbox. You can set a delay so that once the user finishes typing and the delay elapses, the callback function is triggered. This means that the text will only be sent when the user types something, and only when they are finished typing. This will be significantly more efficient than repeatedly polling the server.
Depends on how far you are ready to go.
You would like to check deltav algorithm, it is used by svn in particular: http://svn.apache.org/repos/asf/subversion/trunk/notes/svndiff