Override storm.yaml setting in Java bolts/spouts - apache-storm

Need help with tuning Apache Storm. I have run a command on the Nimbus server to increase the number of executors for a spout and for a bolt.
My question is simple. Does the command:
storm rebalance TopologyName -e spout/or/bolt=
Does this override the parallelism hint set in the Java code?
I ran this and did not see a change in the Storm web UI.
Also, is there a way to override this parameter from the storm.yaml file?
topology.max.spout.pending: 1000
Thanks for any help on this. I do have an excellent book on Storm, but I cannot figure out why my changes are not reflected after the rebalance...

Did you set the number of tasks high enough? See here for further details:
Rebalancing executors in Apache Storm
https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
So yes, it does override the parallelism hint, but only if the number of tasks for the component is high enough to allow the extra executors.
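For example, rebalance can only raise the executor count up to the number of tasks declared for the component. A minimal sketch of declaring extra tasks up front (MySpout, MyBolt, and the component/topology names are placeholders):
TopologyBuilder builder = new TopologyBuilder();
// parallelism hint of 2 executors, but 8 tasks, so a later
// "storm rebalance myTopology -e mySpout=8" has room to grow
builder.setSpout("mySpout", new MySpout(), 2).setNumTasks(8);
builder.setBolt("myBolt", new MyBolt(), 2).setNumTasks(8).shuffleGrouping("mySpout");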
And yes, you can use storm.yaml to set the default "max pending" parameter. This value can be changed for each topology individually by overwriting the default value in the configuration you provide for a topology when submitting it:
Config conf = new Config();
conf.setMaxSpoutPending( /* put your value here */ );
StormSubmitter.submitTopology("topologyName", conf, builder.createTopology());

Related

storm rebalance command not updating the number of workers for a topology

I tried executing the following command for storm 1.1.1:
storm rebalance [topologyName] -n [number_of_worker]
The command runs successfully, but the number of workers remains unchanged. I tried reducing the number of workers too. That also didn't work.
I have no clue what's happening. Any pointer will be helpful.
FYI:
I have implemented custom scheduling. Is it because of that?
You can always check Storm's source code behind that CLI. Or code the rebalance yourself (tested against 1.0.2):
RebalanceOptions rebalanceOptions = new RebalanceOptions();
rebalanceOptions.set_num_workers(newNumWorkers);
// rebalance() is called on a Nimbus client instance, not statically
Nimbus.Client client = NimbusClient.getConfiguredClient(Utils.readStormConfig()).getClient();
client.rebalance("foo", rebalanceOptions);

Using Kafka to import data to Hadoop

Firstly, I was deciding what to use to get events into Hadoop, where they will be stored and periodically analyzed (possibly using Oozie to schedule the periodic analysis): Kafka or Flume. I concluded that Kafka is probably the better solution, since we also have a component that does event processing, so this way both the batch and event-processing components get their data in the same way.
But now I'm looking for concrete suggestions on how to get data out of the broker and into Hadoop.
I found here that Flume can be used in combination with Kafka
Flume - Contains Kafka Source (consumer) and Sink (producer)
And I also found, on the same page and in the Kafka documentation, that there is something called Camus:
Camus - LinkedIn's Kafka=>HDFS pipeline. This one is used for all data at LinkedIn, and works great.
I'm interested in which would be the better (and easier, better-documented) solution for this. Also, are there any examples or tutorials on how to do it?
When should I use these variants over the simpler high-level consumer?
I'm open to suggestions if there is another/better solution than these two.
Thanks
You can use Flume to dump data from Kafka to HDFS. Flume has a Kafka source and sink. It's a matter of a property file change. An example is given below.
Steps:
Create a kafka topic
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic testkafka
Write to the above created topic using kafka console producer
kafka-console-producer --broker-list localhost:9092 --topic testkafka
Configure a flume agent with the following properties
flume1.sources = kafka-source-1
flume1.channels = hdfs-channel-1
flume1.sinks = hdfs-sink-1
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.zookeeperConnect = localhost:2181
flume1.sources.kafka-source-1.topic = testkafka
flume1.sources.kafka-source-1.batchSize = 100
flume1.sources.kafka-source-1.channels = hdfs-channel-1
flume1.channels.hdfs-channel-1.type = memory
flume1.sinks.hdfs-sink-1.channel = hdfs-channel-1
flume1.sinks.hdfs-sink-1.type = hdfs
flume1.sinks.hdfs-sink-1.hdfs.writeFormat = Text
flume1.sinks.hdfs-sink-1.hdfs.fileType = DataStream
flume1.sinks.hdfs-sink-1.hdfs.filePrefix = test-events
flume1.sinks.hdfs-sink-1.hdfs.useLocalTimeStamp = true
flume1.sinks.hdfs-sink-1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d
flume1.sinks.hdfs-sink-1.hdfs.rollCount=100
flume1.sinks.hdfs-sink-1.hdfs.rollSize=0
flume1.channels.hdfs-channel-1.capacity = 10000
flume1.channels.hdfs-channel-1.transactionCapacity = 1000
Save the above config file as example.conf
Run the flume agent
flume-ng agent -n flume1 -c conf -f example.conf -Dflume.root.logger=INFO,console
Data will now be dumped to HDFS under the following path:
/tmp/kafka/%{topic}/%y-%m-%d
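If you want to sanity-check the output programmatically rather than via the CLI, a small sketch with the Hadoop FileSystem API can list what the sink has written. The class name and the topic path here are just illustrative, and it assumes your Hadoop configuration files are on the classpath:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListKafkaDumps {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS is taken from core-site.xml on the classpath
        FileSystem fs = FileSystem.get(new Configuration());
        // list the date-partitioned directories created by the HDFS sink
        for (FileStatus status : fs.listStatus(new Path("/tmp/kafka/testkafka"))) {
            System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
        }
    }
}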
Most of the time, I see people using Camus with Azkaban.
You can look at Mate1's GitHub repo for their implementation of Camus. It's not a tutorial, but I think it could help you:
https://github.com/mate1/camus

Hadoop MapReduce log4j - log messages to a custom file in userlogs/job_ dir?

It's not clear to me how one should configure Hadoop MapReduce log4j at a job level. Can someone help me answer these questions?
1) How do I add support for log4j logging from a client machine? I.e., I want to use a log4j property file on the client machine, and hence don't want to disturb the Hadoop log4j setup in the cluster. I would think having the property file in the project/jar should suffice, and Hadoop's distributed cache should do the rest when transferring the map-reduce jar.
2) How do I log messages to a custom file under the $HADOOP_HOME/logs/userlogs/job_/ dir?
3) Will the map-reduce task use both log4j property files, i.e. the one supplied by the client job and the one present in the Hadoop cluster? If yes, would the log4j.rootLogger add both property values?
Thanks
Srivatsan Nallazhagappan
You can configure log4j directly in your code. For example, you can call PropertyConfigurator.configure(properties); e.g. in the mapper/reducer setup method.
This is an example with the properties stored on HDFS:
InputStream is = fs.open(log4jPropertiesPath);
Properties properties = new Properties();
properties.load(is);
PropertyConfigurator.configure(properties);
where fs is a FileSystem object and log4jPropertiesPath is a path on HDFS.
With this you can also output logs to a directory named with the job_id. For example, you can modify your properties before calling PropertyConfigurator.configure(properties):
Enumeration<?> propertiesNames = properties.propertyNames();
while (propertiesNames.hasMoreElements()) {
    String propertyKey = (String) propertiesNames.nextElement();
    String propertyValue = properties.getProperty(propertyKey);
    if (propertyValue.indexOf(JOB_ID_PATTERN) != -1) {
        properties.setProperty(propertyKey,
                propertyValue.replace(JOB_ID_PATTERN, context.getJobID().toString()));
    }
}
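Putting the pieces together, a minimal sketch of a mapper setup() method could look like the following. The HDFS properties path and the JOB_ID_PATTERN token are hypothetical placeholders you would adapt to your setup:
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.Properties;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.PropertyConfigurator;

public class LoggingMapper extends Mapper<LongWritable, Text, Text, Text> {

    // hypothetical token used inside the log4j properties file, e.g. in a file appender path
    private static final String JOB_ID_PATTERN = "@JOB_ID@";

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // hypothetical HDFS location of the log4j properties file
        Path log4jPropertiesPath = new Path("/config/log4j-job.properties");
        FileSystem fs = FileSystem.get(context.getConfiguration());

        Properties properties = new Properties();
        try (InputStream is = fs.open(log4jPropertiesPath)) {
            properties.load(is);
        }

        // substitute the current job id into any property value that references the token
        Enumeration<?> names = properties.propertyNames();
        while (names.hasMoreElements()) {
            String key = (String) names.nextElement();
            String value = properties.getProperty(key);
            if (value.contains(JOB_ID_PATTERN)) {
                properties.setProperty(key, value.replace(JOB_ID_PATTERN, context.getJobID().toString()));
            }
        }

        // reconfigure log4j for this task with the rewritten properties
        PropertyConfigurator.configure(properties);
    }
}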
There is no straightforward way to override the log4j properties at the individual-job level.
A MapReduce job itself doesn't store its logs in HDFS; it writes logs to the local file system (${hadoop.log.dir}/userlogs) of the nodes running the tasks. There is a separate YARN process called log aggregation which collects those logs and combines them.
Use yarn logs --applicationId <appId> to fetch the full log, then use a Unix command to parse and extract the part of the log you need.

Kafka Storm spout changing topology and consuming from the old offset

I am using the Kafka spout for consuming messages. But if I have to change the topology and upload it again, will it resume from the old message or start from the newest message? The Kafka spout lets us specify the timestamp from which to consume, but how will I know the timestamp?
spoutConfig.forceStartOffsetTime(-1);
It will choose the latest offset written around that timestamp to start consuming. You can force the spout to always start from the latest offset by passing in -1, and you can force it to start from the earliest offset by passing in -2.
references
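For illustration, the two special values applied to the same spoutConfig object from the question:
// start consuming from the latest offset in the topic
spoutConfig.forceStartOffsetTime(-1);
// or start consuming from the earliest available offset
spoutConfig.forceStartOffsetTime(-2);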
If you are using KafkaSpout ensure the following:
In your SpoutConfig, the "id" and "zkroot" do NOT change after redeploying the new version of the topology. Storm uses the "zkroot" and "id" to store the topic offset in ZooKeeper.
KafkaConfig.forceFromStart is set to false.
KafkaSpout stores the offsets in ZooKeeper. Be very careful during re-deployment: if you set forceFromStart to true in the KafkaSpout's KafkaConfig (which can be the case when you first deploy the topology), it will ignore the offsets stored in ZooKeeper. Make sure you set it to false.
Consider writing your topology so that the KafkaConfig.forceFromStart value is read from a properties file when your topology's main() method executes, as in the sketch below. This will allow your administrators to control whether the Kafka messages are replayed or not.
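A rough sketch of that suggestion, using the older storm-kafka (0.9.x-style) API this answer refers to; the properties file name, topic, zkRoot, and spout id are placeholders, and newer storm-kafka releases renamed forceFromStart to ignoreZkOffsets:
import java.io.FileInputStream;
import java.util.Properties;

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;

public class KafkaTopologyMain {
    public static void main(String[] args) throws Exception {
        // hypothetical properties file controlled by administrators
        Properties props = new Properties();
        props.load(new FileInputStream("topology.properties"));

        // keep the zkRoot ("/kafka-spout") and id ("kafka-spout") stable across redeployments
        SpoutConfig spoutConfig = new SpoutConfig(new ZkHosts("localhost:2181"),
                "mytopic", "/kafka-spout", "kafka-spout");

        // let operators decide whether to replay: forceStartOffsetTime(-2) reads from the
        // beginning, forceStartOffsetTime(-1) from the tip; otherwise the ZooKeeper offset is used
        if (Boolean.parseBoolean(props.getProperty("forceFromStart", "false"))) {
            spoutConfig.forceStartOffsetTime(Long.parseLong(props.getProperty("startOffsetTime", "-2")));
        }

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);

        StormSubmitter.submitTopology("my-topology", new Config(), builder.createTopology());
    }
}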
Basically the sequence of events will be:
The first time, start the topology reading from the beginning, with the properties below:
forceFromStart = true
startOffsetTime = -2
The above props will force it to start from the beginning of the topic. Remember to set both properties, because forceFromStart tells Storm to read the startOffsetTime property and use its value to determine where to start reading, ignoring the ZooKeeper offset.
From now on your topology will run and ZooKeeper will maintain the offset. If your worker dies, it will be restarted by the supervisor and start reading from the offset stored in ZooKeeper.
Now, if you want to restart your topology and read from where it left off before shutdown, use the property below and restart the topology:
forceFromStart = false
With the above property, you are telling Storm not to read the startOffsetTime value and instead use the ZooKeeper offset that was maintained before you shut down your topology.
From now on, every time you restart the topology, it will read from where it left off.
If you want to restart your topology and read from the head/tip of the topic, use the properties below and restart the topology:
forceFromStart = true
startOffsetTime = -1
With the above properties, you are telling Storm to ignore the ZooKeeper offset and start from the latest offset, i.e. the tip of the topic.

More than 120 counters in hadoop

There's a limit on the number of Hadoop counters. It's 120 by default. I tried to use the configuration "mapreduce.job.counters.limit" to change that, but it doesn't work. I've looked at the source code; it seems the JobConf instance in the class "org.apache.hadoop.mapred.Counters" is private.
Has anybody seen that before? What's your solution?
THX :)
You can override that property in mapred-site.xml on your JT, TT, and client nodes, but keep in mind that this will be a system-wide modification:
<configuration>
...
<property>
<name>mapreduce.job.counters.limit</name>
<value>500</value>
</property>
...
</configuration>
Then restart the mapreduce service on your cluster.
In Hadoop 2, this configuration parameter is called
mapreduce.job.counters.max
Setting it on the command line or in your Configuration object isn't enough, though. You need to call the static method
org.apache.hadoop.mapreduce.counters.Limits.init()
in the setup() method of your mapper or reducer to get the setting to take effect.
Tested with 2.6.0 and 2.7.1.
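A minimal sketch of that setup() call; note that Limits is an internal Hadoop class, and in the 2.x sources I have seen, init() accepts the job Configuration:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.counters.Limits;

public class CounterHeavyMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) {
        Configuration conf = context.getConfiguration();
        // re-read the counter limits from this job's configuration so that
        // mapreduce.job.counters.max set on the job takes effect inside the task
        Limits.init(conf);
    }
}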
These limits are set via the config file; the parameters below are the ones that take effect:
mapreduce.job.counters.max=1000
mapreduce.job.counters.groups.max=500
mapreduce.job.counters.group.name.max=1000
mapreduce.job.counters.counter.name.max=500
Just adding this in case anyone else faces the same problem we did: increasing the counters from within MRJob.
To raise the number of counters, add emr_configurations to your mrjob.conf (or pass it to MRJob as a config parameter):
runners:
  emr:
    emr_configurations:
      - Classification: mapred-site
        Properties:
          mapreduce.job.counters.max: 1024
          mapreduce.job.counters.counter.name.max: 256
          mapreduce.job.counters.groups.max: 256
          mapreduce.job.counters.group.name.max: 256
We can customize the limits as command-line options for specific jobs only, instead of making the change in mapred-site.xml:
-Dmapreduce.job.counters.limit=x
-Dmapreduce.job.counters.groups.max=y
NOTE: x and y are custom values based on your environment/requirement.
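One caveat: these -D options are picked up through Hadoop's generic options parsing, so the job driver needs to go through ToolRunner (or GenericOptionsParser). A bare-bones sketch, where CounterHeavyJob and the job wiring are placeholders:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CounterHeavyJob extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -Dmapreduce.job.counters.* values
        // from the command line, because ToolRunner parsed the generic options
        Job job = Job.getInstance(getConf(), "counter-heavy-job");
        // ... set jar, mapper, reducer, and input/output paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new CounterHeavyJob(), args));
    }
}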
