How to check if a parameter is set or not on Hadoop?

How can I check the value of a particular parameter (say io.sort.mb) on Hadoop while I'm running a benchmark (say teragen)?
I know you can always go to the configuration files and look, but I have many configuration files, plus some parameters get overwritten (like the number of map tasks).
I don't have a GUI. Is there a command to see this?
Thanks!

Whatever you set in the configuration files should be available in your job.xml, which you can find in the job tracker under the specific job's entry.
From code, these values are available through the Configuration object.
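If you can grab the job.xml for the running job (for example, from the job tracker's job entry), you can look the parameter up directly. A minimal Python sketch, using a made-up job.xml excerpt for illustration:

```python
import xml.etree.ElementTree as ET

def get_conf_value(job_xml, name):
    """Return the value of a named property in a Hadoop job.xml, or None."""
    root = ET.fromstring(job_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

# Hypothetical job.xml excerpt, just for illustration:
SAMPLE = """<configuration>
  <property><name>io.sort.mb</name><value>100</value></property>
  <property><name>mapred.map.tasks</name><value>8</value></property>
</configuration>"""

print(get_conf_value(SAMPLE, "io.sort.mb"))  # -> 100
```

This shows the effective value the job actually ran with, including any overrides, which is exactly what the configuration files alone won't tell you.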


Apache NiFi GetTwitter

I have a simple question, as I am new to NiFi.
I have a GetTwitter processor set up and configured (correctly, I assume), with the Twitter Endpoint set to Sample Endpoint. The processor runs, but nothing happens: I get no input/output.
How do I troubleshoot what it is doing (or, in this case, not doing)?
A couple of things you might look at:
What activity does the processor show? Check the metrics to see whether anything has been attempted (Tasks/Time) and whether it succeeded (Out).
Stop the downstream processor temporarily to make any output FlowFiles visible in the connection queue.
Are there errors? These typically appear as a yellow icon in the top-left corner of the processor.
Are there related messages in logs/nifi-app.log?
It might also help us help you if you describe the GetTwitter Property settings a bit more. Can you share a screenshot (minus keys)?
In my case it was because there were two sensitive values set. According to the documentation, when a sensitive value is set, the nifi.properties file's nifi.sensitive.props.key value must be set; it is an empty string by default in the Hortonworks Data Platform distribution. I set it to some random string (literally random_STRING, but you can use anything), re-created my process from the template, and it began working.
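For reference, that setting lives in nifi.properties; the value below is just the example string from above, and any non-empty value should work:

```properties
# conf/nifi.properties -- empty by default in some distributions
nifi.sensitive.props.key=random_STRING
```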
In general, I suppose this can be debugged by setting the log level to DEBUG.
However, in my case the issue was resolved more easily:
I had just set up a new cluster, and decided to copy all the Twitter keys and secrets to Notepad first.
It turned out that, despite carefully copying the keys from Twitter, one of them had a leading tab. When pasted directly into the GetTwitter processor this did not show, but fortunately it showed up in Notepad, and I was able to remove it and get things working.

How to run MapReduce from within a Pig script

I want to understand how to call a MapReduce job from within a Pig script.
I referred to this link:
https://wiki.apache.org/pig/NativeMapReduce
But I am not sure how Pig will know which code is my mapper or reducer; the explanation there is not very clear.
If someone can illustrate it with an example, it would be a great help.
Thanks in advance,
Cheers :)
An example from the Pig documentation:
A = LOAD 'WordcountInput.txt';
B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir'
AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`;
In the above example, Pig stores the input data from A into inputDir and loads the job's output data from outputDir.
There is also a jar in HDFS called wordcount.jar, containing a class org.myorg.WordCount whose main method takes care of setting the mappers and reducers, input and output, etc.
You could also call the mapreduce job via hadoop jar mymr.jar org.myorg.WordCount inputDir outputDir.
By default, Pig expects you to supply the map/reduce program. However, Hadoop comes with default mapper/reducer implementations, and those are what Pig falls back on when no map/reduce class is specified.
Pig also picks up Hadoop's properties along with its own specific ones. Try setting the properties below in your Pig script; they should be picked up by Pig as well.
SET mapred.mapper.class="<fully qualified classname for mapper>"
SET mapred.reducer.class="<fully qualified classname for reducer>"
The same can be set using the -Dmapred.mapper.class option as well. A comprehensive list is here
Depending on your Hadoop installation, the properties could also be:
mapreduce.map.class
mapreduce.reduce.class
Just FYI: the old mapred API has been deprecated. Versions before 0.20.1 used mapred; versions after that use mapreduce.
Moreover, Pig has its own set of properties, which can be viewed with the command pig -help properties.
For example, in my Pig installation the output is:
The following properties are supported:
Logging:
verbose=true|false; default is false. This property is the same as -v switch
brief=true|false; default is false. This property is the same as -b switch
debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as -d switch
aggregate.warning=true|false; default is true. If true, prints count of warnings
of each type rather than logging each warning.
Performance tuning:
pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory).
Note that this memory is shared across all large bags used by the application.
pig.skewedjoin.reduce.memusagea=<mem fraction>; default is 0.3 (30% of all memory).
Specifies the fraction of heap available for the reducer to perform the join.
pig.exec.nocombiner=true|false; default is false.
Only disable combiner as a temporary workaround for problems.
opt.multiquery=true|false; multiquery is on by default.
Only disable multiquery as a temporary workaround for problems.
opt.fetch=true|false; fetch is on by default.
Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
pig.tmpfilecompression=true|false; compression is off by default.
Determines whether output of intermediate jobs is compressed.
pig.tmpfilecompression.codec=lzo|gzip; default is gzip.
Used in conjunction with pig.tmpfilecompression. Defines compression type.
pig.noSplitCombination=true|false. Split combination is on by default.
Determines if multiple small files are combined into a single map.
pig.exec.mapPartAgg=true|false. Default is false.
Determines if partial aggregation is done within map phase,
before records are sent to combiner.
pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.
If the in-map partial aggregation does not reduce the output num records
by this factor, it gets disabled.
Miscellaneous:
exectype=mapreduce|local; default is mapreduce. This property is the same as -x switch
pig.additional.jars.uris=<comma seperated list of jars>. Used in place of register command.
udf.import.list=<comma seperated list of imports>. Used to avoid package names in UDF.
stop.on.failure=true|false; default is false. Set to true to terminate on the first error.
pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.
Determines the timezone used to handle datetime datatype and UDFs. Additionally, any Hadoop property can be specified.

Updating a file in the distributed cache in Hadoop

How can we update a file (or files) in the distributed cache?
For instance, I have a properties file in the distributed cache, and I have now added a few more values to it.
Options:
Append new values in old file and restart the job.
Replace the old file with new one and restart the job.
Place the new file in new location and point to that location.
Which of the above options are correct, and why?
This requires an understanding of how the distributed cache works:
When you add a file to the distributed cache, the file is copied to each task node when the job runs and is then available locally. Since multiple copies are created, it cannot be modified in place.
Options 2 and 3 sound feasible, but I am not sure either is the right way.
If the file just has a bunch of properties, you can set these in the Configuration object instead of using a file in the distributed cache. You could then use the collector to write the output to the desired location. (I do not know your use case clearly, so this may not be suitable.)
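As a sketch of the configuration-object route: with Hadoop Streaming, properties passed on the command line with -D are exported into each task's environment with dots replaced by underscores, so a task script can read them directly (my.custom.threshold is a hypothetical property name):

```python
import os

def get_job_conf(name, default=None):
    """Read a job configuration property from the task environment.

    Hadoop Streaming exports conf properties as environment variables,
    with non-alphanumeric characters replaced by underscores.
    """
    return os.environ.get(name.replace(".", "_"), default)

# Hypothetical property, e.g. submitted with: -D my.custom.threshold=42
threshold = int(get_job_conf("my.custom.threshold", "10"))
```

Changing such a property for the next run is then just a matter of changing the -D flag, with no cache file to replace.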

How to define a shared (global) variable in Hadoop?

I need a shared (global) variable that is accessible to all mappers and reducers. Mappers only read its value, but reducers change some values in it to be used in the next iteration. I know the DistributedCache is one technique for this, but it only supports reading a shared value.
This is exactly what ZooKeeper was built for. ZooKeeper can keep up with lots of reads from mappers/reducers, and still be able to handle a write now and then.
The other option would be to set values in the configuration object. However, this only persists globally for a single job, so you would have to manage passing the value across jobs yourself. Also, you cannot change it while the job is running.

Modify the default WorkManager in WebSphere 7 using a wsadmin script

I want to raise the maximum number of threads in the default work manager's thread pool using a wsadmin (Jython) script. What is the best approach?
I can't seem to find documentation of a fine-grained control that would let me modify just this property. The closest I can find to what I want is AdminTask.applyConfigProperties, which requires passing a file. The documentation explains that if you want to modify an existing property, you must extract the existing properties file, edit it in an editor, and then pass the edited file to applyConfigProperties.
I want to avoid the manual step of extracting the existing properties file and editing it. The script needs to run completely unattended. In fact, I'd prefer not to use a file at all, but just set the property to a value directly in the script.
Something like the following pseudo-code:
defaultwmId = AdminConfig.getid("wm/default")
AdminTask.setProperty(defaultwmId, ['-propertyName', maxThreads, '-propertyValue', 20])
The following represents a fairly simplistic wsadmin approach to updating the max threads on the default work managers:
workManagers = AdminConfig.getid("/WorkManagerInfo:DefaultWorkManager/").splitlines()
for workManager in workManagers:
    AdminConfig.modify(workManager, '[[maxThreads "20"]]')
AdminConfig.save()
Note that the first line retrieves all of the default work managers across all scopes, so if you want to modify only one (for example, a particular application server's or cluster's work manager), you will need to refine the containment path further. Also, you may need to synchronize the nodes and restart the modified servers in order for the property to be applied at runtime.
More information on the use of the AdminConfig scripting object can be found in the WAS InfoCenter:
http://publib.boulder.ibm.com/infocenter/wasinfo/v7r0/index.jsp?topic=/com.ibm.websphere.nd.doc/info/ae/ae/rxml_adminconfig1.html
