Creating Custom Processor In Apache Nifi - apache-nifi

I am building a custom processor to process flow files. To process a flow file I need to read a CSV file from my local file system, so I created a property descriptor CSV_PATH as follows:
public static final PropertyDescriptor CSV_PATH = new PropertyDescriptor
        .Builder().name("CSV Path")
        .displayName("CSV Path")
        .description("CSV Path Reader")
        .required(true)
        .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
        .build();

@Override
protected void init(final ProcessorInitializationContext context) {
    final List<PropertyDescriptor> descriptors = new ArrayList<PropertyDescriptor>();
    descriptors.add(JSON_PATH);
    descriptors.add(CSV_PATH);
    this.descriptors = Collections.unmodifiableList(descriptors);

    final Set<Relationship> relationships = new HashSet<Relationship>();
    relationships.add(SUCCESS);
    this.relationships = Collections.unmodifiableSet(relationships);
}
Now I want to get the value of the CSV_PATH property that is set in the UI while configuring the processor, but I am not able to retrieve it. Also, if I hardcode the file path in the code, I am still not able to read the CSV from the local file system.

You want to use the following code to retrieve the value of the PropertyDescriptor from the ProcessContext:
@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) {
    FlowFile flowFile = session.get();
    if (flowFile == null) {
        return;
    }
    final String csvPath = context.getProperty(CSV_PATH).getValue();
    // Do something with csvPath
}
If you decide to support NiFi Expression Language in that property descriptor, you will also want to evaluate for that:
final String csvPath = context.getProperty(CSV_PATH).evaluateAttributeExpressions().getValue();
There are additional overloads of that method, covering flowfile attributes, the variable registry, custom decorators, etc.
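For instance, a minimal sketch of evaluating against the attributes of the incoming flow file, assuming the descriptor is declared with Expression Language support (newer NiFi builders take an ExpressionLanguageScope; older releases take a boolean instead):
public static final PropertyDescriptor CSV_PATH = new PropertyDescriptor
        .Builder().name("CSV Path")
        .displayName("CSV Path")
        .description("CSV Path Reader")
        .required(true)
        // declare EL support so ${...} expressions are evaluated
        .expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
        .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
        .build();

// inside onTrigger(), evaluate the property against the current flow file's attributes
final String csvPath = context.getProperty(CSV_PATH)
        .evaluateAttributeExpressions(flowFile)
        .getValue();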
This is documented in the Apache NiFi Developer's Guide. I recently did a presentation at Dataworks Summit Barcelona 2019 covering custom processor development with some best practices included and examples that may be helpful. You can also look at any existing processor in the NiFi codebase to see examples.

Related

Connecting to Accumulo inside a Mapper using Kerberos

I am moving some software from an older Hadoop cluster (which uses username/password authentication) to a newer one, 2.6.0-cdh5.12.0, which has Kerberos authentication enabled.
I have been able to get many of the existing Map/Reduce jobs that use Accumulo for input and/or output to work fine using a DelegationToken set on the AccumuloInput/OutputFormat classes.
However, I have one job that uses AccumuloInput/OutputFormat for input and output, but also, inside its Mapper.setup() method, connects to Accumulo via Zookeeper so that in Mapper.map() it can compare each key/value being processed to an entry in another Accumulo table.
I included the relevant code below, which shows the setup() method connecting to Zookeeper using a PasswordToken and then creating an Accumulo table Scanner that is then used in the map() method.
So the question is: how do I replace the use of the PasswordToken with a KerberosToken when setting up the Accumulo Scanner in Mapper.setup()? I can find no way to "get" the DelegationToken that I set on the AccumuloInput/OutputFormat classes.
I have tried context.getCredentials().getAllTokens() and looked for a token of type org.apache.accumulo.core.client.security.tokens.AuthenticationToken -- all of the tokens returned there are of type org.apache.hadoop.security.token.Token.
Please note that I typed the code fragments in rather than cut/paste, as the code runs on a network that is not connected to the internet -- so there may be a typo. :)
//****************************
// code in the M/R driver
//****************************
ClientConfiguration accumuloCfg = ClientConfiguration.loadDefault()
        .withInstance("Accumulo1").withZkHosts("zookeeper1");
ZooKeeperInstance inst = new ZooKeeperInstance(accumuloCfg);
// conn is a Connector obtained from inst with the user's Kerberos credentials
AuthenticationToken dt = conn.securityOperations().getDelegationToken(new DelegationTokenConfig());
AccumuloInputFormat.setConnectorInfo(job, username, dt);
AccumuloOutputFormat.setConnectorInfo(job, username, dt);
// other job setup and then
job.waitForCompletion(true);
//****************************
// this is inside the Mapper class of the M/R job
//****************************
private Scanner index_scanner;

public void setup(Context context) {
    Configuration cfg = context.getConfiguration();
    // properties set and passed from the M/R driver program
    String username = cfg.get("UserName");
    String password = cfg.get("Password");
    String accumuloInstName = cfg.get("InstanceName");
    String zookeepers = cfg.get("Zookeepers");
    String tableName = cfg.get("TableName");
    Instance inst = new ZooKeeperInstance(accumuloInstName, zookeepers);
    try {
        AuthenticationToken passwordToken = new PasswordToken(password);
        Connector conn = inst.getConnector(username, passwordToken);
        index_scanner = conn.createScanner(tableName, conn.securityOperations().getUserAuthorizations(username));
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public void map(Key key, Value value, Context context) throws IOException, InterruptedException {
    String uuid = key.getRow().toString();
    index_scanner.clearColumns();
    index_scanner.setRange(Range.exact(uuid));
    for (Entry<Key, Value> entry : index_scanner) {
        // do some processing in here
    }
}
The provided AccumuloInputFormat and AccumuloOutputFormat have a method to set the token in the job configuration: Accumulo*putFormat.setConnectorInfo(job, principal, token). You can also serialize the token to a file in HDFS using the AuthenticationTokenSerializer and use the version of setConnectorInfo that accepts a file name.
If a KerberosToken is passed in, the job will create a DelegationToken to use, and if a DelegationToken is passed in, it will just use that.
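If you need a usable token inside the Mapper itself (as in your secondary-scan case), one hedged sketch is to serialize the DelegationToken in the driver and rebuild it in setup(); the property names here are made up, and the AuthenticationTokenSerializer usage reflects my reading of the Accumulo 1.x client API, so verify it against your version:
// Driver: serialize the delegation token and stash it in the job configuration.
// (For anything sensitive, the Hadoop credentials / distributed cache is preferable to plain config.)
AuthenticationToken dt = conn.securityOperations().getDelegationToken(new DelegationTokenConfig());
Configuration conf = job.getConfiguration();
conf.set("accumulo.token.class", dt.getClass().getName());
conf.set("accumulo.token.bytes",
        Base64.getEncoder().encodeToString(
                AuthenticationToken.AuthenticationTokenSerializer.serialize(dt)));

// Mapper.setup(): rebuild the token and use it instead of a PasswordToken.
byte[] tokenBytes = Base64.getDecoder().decode(cfg.get("accumulo.token.bytes"));
AuthenticationToken token = AuthenticationToken.AuthenticationTokenSerializer
        .deserialize(cfg.get("accumulo.token.class"), tokenBytes);
Connector conn = inst.getConnector(username, token);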
The provided AccumuloInputFormat should handle its own scanner, so normally, you shouldn't have to do that in your Mapper if you've set the configuration properly. However, if you're doing secondary scanning (for something like a join) inside your Mapper, you can inspect the provided AccumuloInputFormat's RecordReader source code for an example of how to retrieve the configuration and construct a Scanner.

Dump Spring boot Configuration

Our Ops guys want the Spring Boot configuration (i.e. all properties) to be dumped to the log file when the app starts. I assume this can be done by injecting the properties with the @ConfigurationProperties annotation and printing them.
The question is whether there is a better or built-in mechanism to achieve this.
Given that there does not seem to be a built-in solution, I tried to cook up my own. Here is what I came up with:
@Component
public class ConfigurationDumper {

    @Autowired
    public void init(Environment env) {
        log.info("{}", env);
    }
}
The challenge with this is that it does not print variables that are in my application.yml. Instead, here is what I get:
StandardServletEnvironment
{
activeProfiles=[],
defaultProfiles=[default],
propertySources=[
servletConfigInitParams,
servletContextInitParams,
systemProperties,
systemEnvironment,
random,
applicationConfig: [classpath: /application.yml]
]
}
How can I fix this so as to have all properties loaded and printed?
If you use the actuator, the env endpoint will give you all the configuration properties set in the ConfigurableEnvironment, and configprops will give you the list of @ConfigurationProperties, but not in the log.
Take a look at the source code of the env endpoint; maybe it will give you an idea of how you could get all the properties you are interested in.
There is no built-in mechanism and it really depends on what you mean by "all properties": do you want only the keys that you actually wrote, or all properties (including defaults)?
For the former, you could easily listen for ApplicationEnvironmentPreparedEvent and log the property sources you're interested in, as sketched below. For the latter, /configprops is indeed a much better/more complete output.
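A minimal sketch of that listener (registered via SpringApplication.addListeners or META-INF/spring.factories; the class name is illustrative, and System.out is used because logging may not be initialized yet at that point of startup):
public class PropertyLoggingListener
        implements ApplicationListener<ApplicationEnvironmentPreparedEvent> {

    @Override
    public void onApplicationEvent(ApplicationEnvironmentPreparedEvent event) {
        ConfigurableEnvironment env = event.getEnvironment();
        for (PropertySource<?> source : env.getPropertySources()) {
            // Only enumerable sources (application.yml, system properties, ...) expose their keys.
            if (source instanceof EnumerablePropertySource) {
                for (String key : ((EnumerablePropertySource<?>) source).getPropertyNames()) {
                    // getProperty resolves the effective value across all sources
                    System.out.println(source.getName() + " | " + key + "=" + env.getProperty(key));
                }
            }
        }
    }
}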
This logs only the properties configured in *.properties files.
/**
 * Maps each property name to its origin.
 * @return a map where the key is the property name and the value is the origin
 */
public Map<String, String> fromWhere() {
    final Map<String, String> mapToLog = new HashMap<>();
    final MutablePropertySources propertySources = env.getPropertySources();
    final Iterator<?> it = propertySources.iterator();
    while (it.hasNext()) {
        final Object object = it.next();
        if (object instanceof MapPropertySource) {
            MapPropertySource propertySource = (MapPropertySource) object;
            String propertySourceName = propertySource.getName();
            // only look at sources backed by .properties files
            if (propertySourceName.contains("properties")) {
                Map<String, Object> sourceMap = propertySource.getSource();
                for (String key : sourceMap.keySet()) {
                    final String envValue = env.getProperty(key);
                    String env2Val = System.getProperty(key);
                    // "file:" in the source name means an external config file, otherwise it came from the jar
                    String source = propertySource.getName().contains("file:") ? "FILE" : "JAR";
                    if (envValue != null && envValue.equals(env2Val)) {
                        source = "ENV";
                    }
                    mapToLog.putIfAbsent(key, source);
                }
            }
        }
    }
    return mapToLog;
}
Here is my example output, which shows the property name, its value, and where it comes from (my property values describe their own origin):
myprop: fooFromJar from JAR
aPropFromFile: fromExternalConfFile from FILE
mypropEnv: here from vm arg from ENV
ENV means that I have given it by -D to JVM.
JAR means it is from application.properties inside JAR
FILE means it is from application.properties outside JAR

How to provide values to storm for calculation

I have a hard time understanding how to provide values to Storm, since I am a newbie to Storm.
I started with the starter kit. I went through the TestWordSpout, in which the following code provides new values:
public void nextTuple() {
    Utils.sleep(100);
    final String[] words = new String[] {"nathan", "mike", "jackson", "golda", "bertels"};
    final Random rand = new Random();
    final String word = words[rand.nextInt(words.length)];
    _collector.emit(new Values(word));
}
So I see it's taking one word at a time: _collector.emit(new Values(word));
How can I provide a collection of words directly? Is this possible?
TestWordSpout.java
What I mean is that when nextTuple is called, a new word is selected at random from the list and emitted. The random output may look like this over time:
#100ms: nathan
#200ms: golda
#300ms: golda
#400ms: jackson
#500ms: mike
#600ms: nathan
#700ms: bertels
What if I already have a collection like this list and just want to feed it to Storm?
Storm is designed and built to process continuous streams of data. Please see Rationale for Storm. It's very unlikely that a fixed collection of input data is fed directly into a Storm cluster; generally, the input data comes from JMS queues, Apache Kafka, Twitter feeds, etc. I suspect what you actually want is to pass in a few configuration values, in which case the following applies.
Given Storm's design purpose, only limited configuration details are typically passed to it, such as RDBMS connection details (Oracle/DB2/MySQL etc.), JMS provider details (IBM MQ/RabbitMQ etc.), or Apache Kafka/HBase details.
For your particular question, or for providing the configuration details for the products above, there are three ways that I can think of:
1. Set the configuration details on the instance of the Spout or Bolt.
For example, declare the instance variables and assign the values as part of the Spout/Bolt constructor, as below:
public class TestWordSpout extends BaseRichSpout {
    List<String> listOfValues;

    public TestWordSpout(List<String> listOfValues) {
        this.listOfValues = listOfValues;
    }
}
On the topology submission class, create an instance of Spout with the list of values
List<String> listOfValues=new ArrayList<String>();
listOfValues.add("nathan");
listOfValues.add("golda");
listOfValues.add("mike");
builder.setSpout("word", new TestWordSpout(listOfValues), 3);
These values are then available as instance variables in the nextTuple() method.
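A minimal sketch of what nextTuple() could look like with that list (illustrative only; it simply cycles through the injected values):
private int index = 0;

@Override
public void nextTuple() {
    Utils.sleep(100);
    if (listOfValues == null || listOfValues.isEmpty()) {
        return;
    }
    // emit the injected words instead of the hardcoded array
    final String word = listOfValues.get(index);
    index = (index + 1) % listOfValues.size();
    _collector.emit(new Values(word));
}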
Please look at the Storm integrations in Storm contrib for how configurations are set for RDBMS/Kafka etc., as above.
2. Set the configuration in getComponentConfiguration(). This method is used to override the topology configuration; however, you could pass in a few details, as below:
@Override
public Map<String, Object> getComponentConfiguration() {
    Map<String, Object> ret = new HashMap<String, Object>();
    if (!_isDistributed) {
        ret.put(Config.TOPOLOGY_MAX_TASK_PARALLELISM, 1);
        return ret;
    } else {
        List<String> listOfValues = new ArrayList<String>();
        listOfValues.add("nathan");
        listOfValues.add("golda");
        listOfValues.add("mike");
        ret.put("listOfValues", listOfValues);
    }
    return ret;
}
and the configuration details are available in the open() or prepare() method of Spout/Bolt respectively.
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
    _collector = collector;
    this.listOfValues = (List<String>) conf.get("listOfValues");
}
3. Declare the configuration in a property file and package it as part of the jar file that is submitted to the Storm cluster. The Nimbus node copies the jar file to the worker nodes and makes it available to the executor threads. The open()/prepare() method can then read the property file and assign the values to instance variables, as sketched below.
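A rough sketch of that, assuming a spout-config.properties file at the root of the topology jar with a comma-separated "words" entry (both names are made up for illustration):
@Override
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
    _collector = collector;
    Properties props = new Properties();
    // the topology jar is on the worker classpath, so the file is readable as a resource
    try (InputStream in = getClass().getResourceAsStream("/spout-config.properties")) {
        if (in != null) {
            props.load(in);
        }
    } catch (IOException e) {
        throw new RuntimeException("Could not load spout-config.properties", e);
    }
    this.listOfValues = Arrays.asList(props.getProperty("words", "").split(","));
}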
"Values" type accept any kind of objects and any number.
So you can simply send a List for instance from the execute method of a Bolt or from the nextTuple method of a Spout:
List<String> words = new ArrayList<>();
words.add("one word");
words.add("another word");
_collector.emit(new Values(words));
You can add a new field too; just be sure to declare it in the declareOutputFields method:
_collector.emit(new Values(words, "a new field value!"));
And in your declareOutputFields method
@Override
public void declareOutputFields(final OutputFieldsDeclarer outputFieldsDeclarer) {
    outputFieldsDeclarer.declare(new Fields("collection", "newField"));
}
You can get the fields in the next Bolt in the topology from the tuple object given by the execute method:
List<String> collection = (List<String>) tuple.getValueByField("collection");
String newFieldValue = tuple.getStringByField("newField");

ApacheConnector does not process request headers that were set in a WriterInterceptor

I am experiencing problems when configuring my Jersey Client with the ApacheConnector. It seems to ignore all request headers that I define in a WriterInterceptor. I can tell that the WriterInterceptor is called when I set a breakpoint within WriterInterceptor#aroundWriteTo(WriterInterceptorContext). Contrary to that, I can observe that the modification of the entity stream is preserved.
Here is a runnable example demonstrating my problem:
public class ApacheConnectorProblemDemonstration extends JerseyTest {

    private static final Logger LOGGER = Logger.getLogger(JerseyTest.class.getName());
    private static final String QUESTION = "baz", ANSWER = "qux";
    private static final String REQUEST_HEADER_NAME_CLIENT = "foo-cl", REQUEST_HEADER_VALUE_CLIENT = "bar-cl";
    private static final String REQUEST_HEADER_NAME_INTERCEPTOR = "foo-ic", REQUEST_HEADER_VALUE_INTERCEPTOR = "bar-ic";
    private static final int MAX_CONNECTIONS = 100;
    private static final String PATH = "/";

    @Path(PATH)
    public static class TestResource {
        @POST
        public String handle(InputStream questionStream,
                             @HeaderParam(REQUEST_HEADER_NAME_CLIENT) String client,
                             @HeaderParam(REQUEST_HEADER_NAME_INTERCEPTOR) String interceptor)
                throws IOException {
            assertEquals(REQUEST_HEADER_VALUE_CLIENT, client);
            // Here, the header that was set in the client's writer interceptor is lost.
            assertEquals(REQUEST_HEADER_VALUE_INTERCEPTOR, interceptor);
            // However, the input stream got gzipped, so the WriterInterceptor has been partly applied.
            assertEquals(QUESTION, new Scanner(new GZIPInputStream(questionStream)).nextLine());
            return ANSWER;
        }
    }

    @Provider
    @Priority(Priorities.ENTITY_CODER)
    public static class ClientInterceptor implements WriterInterceptor {
        @Override
        public void aroundWriteTo(WriterInterceptorContext context)
                throws IOException, WebApplicationException {
            context.getHeaders().add(REQUEST_HEADER_NAME_INTERCEPTOR, REQUEST_HEADER_VALUE_INTERCEPTOR);
            context.setOutputStream(new GZIPOutputStream(context.getOutputStream()));
            context.proceed();
        }
    }

    @Override
    protected Application configure() {
        enable(TestProperties.LOG_TRAFFIC);
        enable(TestProperties.DUMP_ENTITY);
        return new ResourceConfig(TestResource.class);
    }

    @Override
    protected Client getClient(TestContainer tc, ApplicationHandler applicationHandler) {
        ClientConfig clientConfig = tc.getClientConfig() == null ? new ClientConfig() : tc.getClientConfig();
        clientConfig.property(ApacheClientProperties.CONNECTION_MANAGER, makeConnectionManager(MAX_CONNECTIONS));
        clientConfig.register(ClientInterceptor.class);
        // If I do not use the Apache connector, I avoid this problem.
        clientConfig.connector(new ApacheConnector(clientConfig));
        if (isEnabled(TestProperties.LOG_TRAFFIC)) {
            clientConfig.register(new LoggingFilter(LOGGER, isEnabled(TestProperties.DUMP_ENTITY)));
        }
        configureClient(clientConfig);
        return ClientBuilder.newClient(clientConfig);
    }

    private static ClientConnectionManager makeConnectionManager(int maxConnections) {
        PoolingClientConnectionManager connectionManager = new PoolingClientConnectionManager();
        connectionManager.setMaxTotal(maxConnections);
        connectionManager.setDefaultMaxPerRoute(maxConnections);
        return connectionManager;
    }

    @Test
    public void testInterceptors() throws Exception {
        Response response = target(PATH)
                .request()
                .header(REQUEST_HEADER_NAME_CLIENT, REQUEST_HEADER_VALUE_CLIENT)
                .post(Entity.text(QUESTION));
        assertEquals(200, response.getStatus());
        assertEquals(ANSWER, response.readEntity(String.class));
    }
}
I want to use the ApacheConnector in order to optimize for concurrent requests via the PoolingClientConnectionManager. Did I mess up the configuration?
PS: The exact same problem occurs when using the GrizzlyConnector.
After further research, I assume that this is rather a misbehavior in the default Connector that uses a HttpURLConnection. As I explained in this other self-answered question of mine, the documentation states:
Whereas filters are primarily intended to manipulate request and response parameters like HTTP headers, URIs and/or HTTP methods, interceptors are intended to manipulate entities, via manipulating entity input/output streams
A WriterInterceptor is not supposed to manipulate header values, while a {Client,Server}RequestFilter is not supposed to manipulate the entity stream. If you need to do both, the two components should be bundled within a javax.ws.rs.core.Feature or within the same class that implements both interfaces. (This can be problematic if you need to set two different Priority values, though.)
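For example, a hedged sketch of such a bundle, reusing the constants from the test above (a ClientRequestFilter sets the header, the WriterInterceptor wraps the stream, and a Feature registers both together):
public static class GzipWithHeaderFeature implements Feature {

    public static class HeaderFilter implements ClientRequestFilter {
        @Override
        public void filter(ClientRequestContext requestContext) {
            // headers belong in a filter, not in the writer interceptor
            requestContext.getHeaders().add(REQUEST_HEADER_NAME_INTERCEPTOR, REQUEST_HEADER_VALUE_INTERCEPTOR);
        }
    }

    public static class GzipInterceptor implements WriterInterceptor {
        @Override
        public void aroundWriteTo(WriterInterceptorContext context) throws IOException {
            context.setOutputStream(new GZIPOutputStream(context.getOutputStream()));
            context.proceed();
        }
    }

    @Override
    public boolean configure(FeatureContext context) {
        context.register(HeaderFilter.class);
        context.register(GzipInterceptor.class);
        return true;
    }
}
Registering GzipWithHeaderFeature on the ClientConfig would then replace the ClientInterceptor registration.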
All this is very unfortunate, though, since JerseyTest uses the default Connector based on HttpURLConnection, so all my unit tests succeeded while the real-life application misbehaved because it was configured with an ApacheConnector. Also, rather than silently suppressing the changes, I wish Jersey would throw me an exception. (This is a general issue I have with Jersey: when I, for example, used a too-new version of the ClientConnectionManager, where the interface had been renamed to HttpClientConnectionManager, I was simply informed in a one-line log statement that all my configuration efforts were ignored. I did not discover this log statement until very late in development.)

jersey-freemarker

I'm developing a small tool based on Jersey and FreeMarker which will enable designers to test their FreeMarker templates locally, using some mock objects.
I'm sorry to write here, but I can't find any documentation about it except some code and javadocs.
To do that I did the following:
1 Dependencies:
<dependency>
    <groupId>com.sun.jersey.contribs</groupId>
    <artifactId>jersey-freemarker</artifactId>
    <version>1.9</version>
</dependency>
2 Starting Grizzly, telling it where to find the FreeMarker templates:
protected static HttpServer startServer() throws IOException {
    System.out.println("Starting grizzly...");
    Map<String, Object> params = new HashMap<String, Object>();
    params.put("com.sun.jersey.freemarker.templateBasePath", "/");
    ResourceConfig rc = new PackagesResourceConfig("resource.package");
    rc.setPropertiesAndFeatures(params);
    HttpServer server = GrizzlyServerFactory.createHttpServer(BASE_URI, rc);
    server.getServerConfiguration().addHttpHandler(
            new StaticHttpHandler("/libs"), "/libs");
    return server;
}
3 Creating the root resource and binding the FreeMarker files:
@Context ResourceConfig resourceConfig;

@Path("{path: ([^\\s]+(\\.(?i)(ftl))$)}")
public Viewable renderFtl(@PathParam("path") String path) throws IOException {
    Viewable view = new Viewable("/" + path);
    return view;
}
Everything works fine, except that the FreeMarker files are not rendered. I get an empty white page, but the file exists and the debugger enters the renderFtl method correctly.
Do you know how I can fix that?
I read a lot of articles here and around the web, but only old posts or articles talking about Spring integration, and I don't want to integrate Spring because I don't need it.
I really like Jersey. I think it is one of the most complete and powerful frameworks in the Java world, but any time I try to find documentation on specific features or contrib libraries, I'm lost... There is no escape from the Google Groups forums :)
Where can I find complete documentation about it?
Thanks a lot, David
Updates:
Trying to solve this, I understood that I cannot use the built-in Jersey support, because it needs the files to be placed in the resources tree. So what I did is build the FreeMarker configuration directly at runtime (just for testing for now) and return a StreamingOutput object:
@Path("{path: ([^\\s]+(\\.(?i)(ftl))$)}")
public StreamingOutput renderFtl(@PathParam("path") String path) throws Exception {
    Configuration cfg = new Configuration();
    // Specify the data source where the template files come from.
    // Here I set a file directory for it:
    cfg.setDirectoryForTemplateLoading(new File("."));
    // Create the root hash
    Map<String, Object> root = new HashMap<String, Object>();
    Template temp = cfg.getTemplate(path);
    return new FTLOutput(root, temp);
}
FTLOutput is here:
This is not good code, but it is for testing only...
class FTLOutput implements StreamingOutput {

    private Object root;
    private Template t;

    public FTLOutput(Object root, Template t) {
        this.root = root;
        this.t = t;
    }

    @Override
    public void write(OutputStream output) throws IOException {
        Writer writer = new OutputStreamWriter(output);
        try {
            t.process(root, writer);
            writer.flush();
        } catch (TemplateException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
I see no errors while debugging, and FreeMarker tells me that the template is found and rendered, but Jersey still gives me no result...
I really don't know why!
Why are you using Jersey 1.9? 1.11 is already out; you should update if you can.
Have you seen the "freemarker" sample from Jersey? It demonstrates a simple use case of using FreeMarker with Jersey.
Where are your resources?
Templates are found by calling [LastMatchedResourceClass].getResources(...), so if your templates are not accessible as resources, they can't be rendered correctly. You can check out the Jersey source and place some breakpoints in FreemarkerViewProcessor; it should tell you where exactly the problem is.
