Is there any systematic way to turn on or turn off some Bolts in StormCrawler? - apache-storm

I have developed a StormCrawler project with several additional bolts in its topology. My crawler has to run 24/7 without any downtime, so I cannot restart it to change the topology configuration. I want to bypass (turn on or turn off) some bolts during runtime. What is the best way to disable and enable bolts in StormCrawler at runtime?
Thanks

There is no way of doing this out of the box, so you'd have to implement the logic for turning the bolts on/off in the bolts themselves.
If what you need is to refresh their configuration, you could implement a dynamic mechanism: for instance, store the config of the bolts in an Elasticsearch index and reload it periodically.
We already have something a bit like this with the JSONURLFilterWrapper and the equivalent for the ParseFilter. We could have an abstract ES-backed, dynamically configurable bolt. Feel free to open an issue on GitHub if you think this is of interest or, even better, contribute a PR ;-)
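For illustration, here is a minimal sketch of the "in the bolts themselves" approach: a wrapper bolt that periodically re-reads an on/off flag from some external store and simply forwards tuples untouched while disabled. This is not a StormCrawler feature; the field names, the loadEnabledFlag() hook and the Storm 2.x method signatures are assumptions, and how you back the flag (an Elasticsearch index, ZooKeeper, a config service) is up to you.

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;

// Sketch only: a bolt that re-reads an on/off flag at runtime and, while disabled,
// passes tuples straight through so downstream components keep receiving them.
public abstract class SwitchableBolt extends BaseRichBolt {

    private static final long CHECK_INTERVAL_MS = 60_000;

    protected OutputCollector collector;
    private volatile boolean enabled = true;
    private long lastCheck = 0L;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        long now = System.currentTimeMillis();
        if (now - lastCheck > CHECK_INTERVAL_MS) {
            // Hypothetical hook: read the flag from an ES index, ZooKeeper, an HTTP
            // endpoint, etc. Nothing in StormCrawler provides this out of the box.
            enabled = loadEnabledFlag();
            lastCheck = now;
        }
        if (enabled) {
            process(input);
        } else {
            // Bypass: emit the incoming values unchanged (assumes this bolt declares
            // the same fields it consumes, e.g. url / content / metadata).
            collector.emit(input, input.getValues());
        }
        collector.ack(input);
    }

    /** The real work of the bolt when it is enabled. */
    protected abstract void process(Tuple input);

    /** Where the dynamic on/off flag comes from. */
    protected abstract boolean loadEnabledFlag();

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Adjust to whatever fields your topology actually uses.
        declarer.declare(new Fields("url", "content", "metadata"));
    }
}
```

Whether the bypass should emit at all depends on where the bolt sits in the topology; a bolt at the end of a branch might just ack and drop instead.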

Related

The best way to schedule/automate MarkLogic data hub flows/custom steps

I use DMSDK to ingest data, and I have multiple custom flows to run after data ingestion. Instead of manually running the flows one by one, what is the best way to orchestrate MarkLogic data hub flows?
Gradle, triggers, or other scheduling tools?
I concur with Dave Cassel that NiFi, or perhaps something like MuleSoft, or maybe even Camel is a great way to manage running your flows. Particularly if you are talking about operational management.
To answer regarding the other mechanisms:
Crontab doesn't connect to MarkLogic itself. You'd have to write scripts or code to make something actually happen (see the sketch after this list). You won't have much control either, nor logging, unless you add that as well.
We have great plugins for Gradle that make running flows really easy. They're great during development and such, but perhaps less suited to scheduling or operational tasks.
Triggers inside MarkLogic only respond to insertion of data, so you'd still have to initiate an update from outside anyhow.
Scheduled Tasks inside MarkLogic have similar limitations to Crontab and Gradle. They don't do much by themselves, so you have to write code anyhow, and they provide no logging of their own, nor ways to operationally manage the tasks other than through the Admin UI.
A JAR package might work, depending on what JAR package you actually mean. You can create a JAR of your ml-gradle project, but that doesn't give you much gain over calling Gradle itself.
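To make the "write scripts or code" point concrete, here is a rough sketch of the kind of glue you end up owning if you go the Crontab/Scheduled Task route yourself: a tiny Java scheduler that shells out to the Gradle wrapper. The hubRunFlow task name, the flowName parameter and the flow name itself are assumptions based on the ml-data-hub Gradle plugin; check the task names for your DHF version.

```java
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: run a Data Hub flow on a fixed schedule by shelling out to Gradle.
public class FlowScheduler {

    // Hypothetical flow name; replace with a flow defined in your project.
    private static final String FLOW_NAME = "MyHarmonizeFlow";

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Kick the flow off once an hour; adjust to your ingestion cadence.
        scheduler.scheduleAtFixedRate(FlowScheduler::runFlow, 0, 1, TimeUnit.HOURS);
    }

    private static void runFlow() {
        try {
            // Assumes the ml-data-hub Gradle plugin exposes a hubRunFlow task;
            // the exact task name and parameters depend on your DHF version.
            Process gradle = new ProcessBuilder("./gradlew", "hubRunFlow", "-PflowName=" + FLOW_NAME)
                    .inheritIO() // forward Gradle's output so there is at least some logging
                    .start();
            int exitCode = gradle.waitFor();
            if (exitCode != 0) {
                // Retry and notification logic would have to live here -- the very
                // operational features a tool like NiFi gives you out of the box.
                System.err.println("Flow " + FLOW_NAME + " failed with exit code " + exitCode);
            }
        } catch (IOException e) {
            System.err.println("Could not start Gradle: " + e.getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```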
Personally, I'd have a close look at the operational requirements. Think of, for instance: the need to get a status overview, interrupt schedules, retry loops on failure, built-in logging, and facilities to send notifications when attention is needed.
HTH!
There are a variety of answers that will work, of course; my preference is NiFi. This keeps any scheduling overhead outside of MarkLogic, with the trade-off that you'll need to have NiFi running.

Apache Nifi - About .Nar File Changes and Nifi Restarting

I want suggestions for my application:
I have multitenancy in NiFi: each process group belongs to a different tenant/user.
When one tenant/user changes something like their custom processor (which produces a new .nar file), we need to copy that .nar file into the lib folder and restart NiFi. But this restarts the whole NiFi server, so every tenant/user and process group gets restarted.
Is there a way to restart only one tenant/user or process group, or to have the new .nar file picked up without restarting NiFi at all?
NiFi does not currently have the kind of warm restart option that you describe; however, a lot of the base functionality needed to support it is in the code base, and the concept is on the community roadmap.
Some options that might help you today:
Consider segregating the tenants with a high rate of code change into separate development environments. You could possibly leverage the Docker builds to provide flexibility and easy automation, and then promote the end-of-day versions of your NARs into the 'Production' cluster each night, hopefully without disturbing users.
Consider utilising the NiFi Site-to-Site capability to have linked NiFi environments instead of a single shared one. Processors that change regularly could then be called out to and updated on their own schedule.
Consider why you are changing processor code so regularly; there may be a better approach than hard-coding logic and parameters into the processors. The variable registry, various controller services, the flow registry, etc. all provide a very rich feature set.

Using opendaylight to document networks

I am facing a task to analyze, document and visualize a rather large global network (more than 200 sites, various technologies) and wonder if opendaylight might help me with this. The documentation should mainly focus on Layer 3 and Layer 4 but may partially also include L2 topology. There is no need to actually use a controller to configure devices through southbound APIs, but I would like to use the benefits of a structured, consistent description (e.g. YANG) and a visualization (DLUX, NeXt) which ODL provides.
So here's my question: Is there any way to manually add a topology (nodes, links) through a graphical editor to ODL?
From the general project description and from the last 20 seconds of this video, NeXt in general seems to be able to add/modify topology information. How far does the NeXt integration with ODL go? Would I need to write my own app (which could use NeXt) that uses the RESTCONF APIs to add information to the topology? Or should I maybe create a virtual topology of the real network using Mininet? Any other ideas?
I understand that an SDN controller is probably not the right kind of tool for this task, but the alternatives (e.g. Net2Plan, Visio, yEd) are basically local solutions, a bit too complicated, or provide no standardized DSL for topologies. Besides that, the documentation covering the ODL and NeXt integration is very limited - I couldn't get a grasp of how that integration should work (I'll try harder).
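For reference, writing nodes and links into the topology datastore over RESTCONF is a fairly small amount of code. The sketch below assumes a local ODL instance with RESTCONF on port 8181, default admin/admin credentials, and the classic /restconf/config URL layout (newer releases expose /rests/data/... instead); the topology id, node ids and link ids are made up, and the JSON follows the network-topology YANG model that ODL ships with.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch only: manually put a hand-written topology into the ODL config datastore.
public class TopologyLoader {

    public static void main(String[] args) throws Exception {
        // Pre-RFC-8040 RESTCONF path; adjust host, port and path to your ODL release.
        String url = "http://localhost:8181/restconf/config/"
                + "network-topology:network-topology/topology/manual-docs";

        // Minimal document: two nodes and one link between them.
        String body = """
            {
              "topology": [{
                "topology-id": "manual-docs",
                "node": [
                  {"node-id": "site-berlin"},
                  {"node-id": "site-tokyo"}
                ],
                "link": [{
                  "link-id": "berlin-tokyo",
                  "source": {"source-node": "site-berlin"},
                  "destination": {"dest-node": "site-tokyo"}
                }]
              }]
            }
            """;

        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .header("Authorization", "Basic " + auth)
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

A visualization such as DLUX's topology view (or a NeXt-based app) could then read the same data back from the datastore, which is essentially the "own app over RESTCONF" route described above.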

Monitoring solution for EC2 based deployment

We have some 20 or so servers in EC2, most are dynamically spawned (scaling groups).
We're looking for a solution to monitor the uptime of our application.
As an added bonus, this solution could also extend to actually monitoring the servers involved, so it's easy to go back in time and see what happened just before a downtime or whatnot.
We're looking for a hosted solution ideally, and it should be easy to scale with it (it needs to somehow dynamically deal with servers being added/removed with no interaction from us).
Anyways, hoping for some recommendations from you guys.
A bit of background ...
We're currently using a custom Nagios setup; it's been reduced to basically doing a simple HTTP check now that the servers have become fully dynamic. We've already been using PagerDuty to deliver the pages. It does OK, but for the maintenance cost we could just as well be using an HTTP check at Server Density or Pingdom.
I've looked briefly at ServerDensity, and it does look promising, I especially like their install mechanism of just dumping their files into your AMI and it takes care of the rest.
I'd like to know what options there are, though, before diving deeper into any particular solution.
We use a combination of Server Density for monitoring and PagerDuty for alerting. The two work quite well together.

Has anyone tried using ZooKeeper?

I'm currently looking into memcached as a way to coordinate a group of servers, but came across Apache's ZooKeeper along the way. It looks interesting, and Yahoo uses it, so it shouldn't be bad, but I'd never heard of it before, so I'm kind of skeptical. Has anyone else given it a try? Any comments or ideas?
ZooKeeper and Memcached have different purposes. You can use memcached to do server coordination, but you'll have to do most of this work yourself. Memcached only allows coordination in that it caches common data lookups to be used by multiple clients. From reading ZooKeeper's documentation, it has a much broader focus than this. ZooKeeper seems to provide support for server clustering, which isn't the same as the cache clustering memcached provides.
Have a look at Brad Fitzpatrick's Linux Journal article on memcached to get a better idea what I mean.
To get an overview of what ZooKeeper is capable of, watch the following presentation by its creators. It's capable of so much more (creating queues, electing master processes amongst a group of peers, distributed high-performance runtime configuration, rendezvous points for disjoint processes, determining whether processes are still running, etc.).
http://zookeeper.sourceforge.net/index.sf.shtml
To answer your question: if "coordination" is what you are looking for, ZooKeeper is much better targeted at that than memcached.
ZooKeeper is great for coordinating data across servers. It does a good job of ordering every transaction and guaranteeing that transactions happen in order. However, when you're first breaking into it, the documentation sucks; it's very 'high-level', without enough concrete examples or explanations of how to properly handle certain events. One of the included examples (as of version 3.3.3) had its own bugs in it.
Your code will also need to be cognizant of both event-driven and polling interactions. In a massively distributed architecture, acting upon 'events' can inadvertently create a stampede that may not be desirable for your environment (the herd effect).
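To make the coordination angle concrete, here is a minimal sketch of ZooKeeper-style group membership: each process registers an ephemeral sequential znode and watches the parent for changes. The connection string, paths and payload are placeholders. Having every member watch the same parent is exactly the pattern that produces the herd effect mentioned above; the lock and leader-election recipes avoid it by watching only the immediate predecessor znode.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch only: register this process in a group and watch for membership changes.
public class MembershipDemo {

    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create the parent node if it does not exist yet (ignoring the
        // exists/create race for brevity; real code should handle NodeExists).
        if (zk.exists("/workers", false) == null) {
            zk.create("/workers", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Register as an ephemeral, sequential child: it disappears automatically
        // when the session dies, which is how liveness is detected.
        String me = zk.create("/workers/worker-",
                "host-a".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Registered as " + me);

        // Watch the group. Every member watching the same parent is the herd-effect
        // pattern discussed above, so keep the reaction to this event cheap.
        List<String> members = zk.getChildren("/workers",
                event -> System.out.println("Membership changed: " + event));
        System.out.println("Current members: " + members);

        Thread.sleep(60_000); // keep the session alive long enough to observe changes
        zk.close();
    }
}
```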
