Hadoop streaming - remove trailing tab from reducer output

I have a hadoop streaming job whose output does not contain key/value pairs. You can think of it as value-only pairs or key-only pairs.
My streaming reducer (a php script) is outputting records separated by newlines. Hadoop streaming treats this as a key with no value, and inserts a tab before the newline. This extra tab is unwanted.
How do I remove it?
I am using hadoop 1.0.3 with AWS EMR. I downloaded the source of hadoop 1.0.3 and found this code in hadoop-1.0.3/src/contrib/streaming/src/java/org/apache/hadoop/streaming/PipeReducer.java :
reduceOutFieldSeparator = job_.get("stream.reduce.output.field.separator", "\t").getBytes("UTF-8");
So I tried passing -D stream.reduce.output.field.separator= as an argument to the job, with no luck. I also tried -D mapred.textoutputformat.separator= and -D mapreduce.output.textoutputformat.separator=, also with no luck.
I've searched Google, of course, and nothing I found worked. One search result even stated there was no argument that could be passed to achieve the desired result (though the hadoop version in that case was much older).
Here is my code (with added line breaks for readability):
hadoop jar streaming.jar -files s3n://path/to/a/file.json#file.json
-D mapred.output.compress=true -D stream.reduce.output.field.separator=
-input s3n://path/to/some/input/*/* -output hdfs:///path/to/output/dir
-mapper 'php my_mapper.php' -reducer 'php my_reducer.php'
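For illustration, a hypothetical stand-in for the PHP reducer described above (a sketch, not the actual script): it emits newline-separated records that contain no tab, so streaming treats each whole line as the key with an empty value and the default TextOutputFormat appends the unwanted tab.
#!/usr/bin/env python
# Hypothetical value-only reducer, for illustration only.
# Each emitted line has no key/value separator, so Hadoop streaming sees it
# as "key with empty value" and the default output format appends '\t'.
import sys

for line in sys.stdin:
    # input arrives as "key<TAB>value"; this job only cares about the value
    value = line.rstrip("\n").split("\t", 1)[-1]
    print(value)   # ends up in the part file as "value\t" plus newline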

In case it helps others: using the tips above, I was able to get a working implementation:
public class CustomOutputFormat<K, V> extends org.apache.hadoop.mapred.TextOutputFormat<K, V> {....}
with exactly one line of the built-in implementation of 'getRecordWriter' changed to:
String keyValueSeparator = job.get("mapred.textoutputformat.separator", "");
instead of:
String keyValueSeparator = job.get("mapred.textoutputformat.separator", "\t");
After compiling that into a jar and including it in my hadoop streaming call (via the instructions on hadoop streaming), the call looked like:
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
-archives 'hdfs:///user/the/path/to/your/jar/onHDFS/theNameOfTheJar.jar' \
-libjars theNameOfTheJar.jar \
-outputformat com.yourcompanyHere.package.path.tojavafile.CustomOutputFormat \
-file yourMapper.py -mapper yourMapper.py \
-file yourReducer.py -reducer yourReducer.py \
-input $yourInputFile \
-output $yourOutputDirectoryOnHDFS
I also included the jar in the folder I issued that call from.
It worked great for my needs (and it created no tabs at the end of the lines after the reducer).
Update: based on a comment implying this is indeed helpful for others, here's the full source of my CustomOutputFormat.java file:
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Progressable;
import org.apache.hadoop.util.ReflectionUtils;
public class CustomOutputFormat<K, V> extends TextOutputFormat<K, V> {

    public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job, String name,
            Progressable progress) throws IOException {
        boolean isCompressed = getCompressOutput(job);
        // Changing the default from '\t' to blank
        String keyValueSeparator = job.get("mapred.textoutputformat.separator", ""); // '\t'
        if (!isCompressed) {
            Path file = FileOutputFormat.getTaskOutputPath(job, name);
            FileSystem fs = file.getFileSystem(job);
            FSDataOutputStream fileOut = fs.create(file, progress);
            return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
        } else {
            Class<? extends CompressionCodec> codecClass = getOutputCompressorClass(job,
                    GzipCodec.class);
            // create the named codec
            CompressionCodec codec = ReflectionUtils.newInstance(codecClass, job);
            // build the filename including the extension
            Path file = FileOutputFormat.getTaskOutputPath(job, name + codec.getDefaultExtension());
            FileSystem fs = file.getFileSystem(job);
            FSDataOutputStream fileOut = fs.create(file, progress);
            return new LineRecordWriter<K, V>(new DataOutputStream(
                    codec.createOutputStream(fileOut)), keyValueSeparator);
        }
    }
}
FYI: for your own use case, be sure to check that this does not adversely affect the hadoop-streaming-managed interactions (in terms of separating key vs. value) between your mapper and reducer. To clarify:
From my testing: if you have a tab in every line of your data (with something on each side of it), you can leave the built-in defaults as they are. Streaming will interpret everything before the first tab as your 'key' and everything on that row after it as your 'value.' As such, it does not see a 'null value' and won't append a tab after your reducer. (You'll see your final outputs sorted on the value of the 'key' that streaming interprets in each row as occurring before each tab.)
Conversely, if you have no tabs in your data and you don't override the defaults using the above trick(s), then you'll see trailing tabs after the run completes, and the above override becomes the fix.
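To make that split behavior concrete, here is a small illustrative paraphrase (not Hadoop's actual code) of how streaming splits a record on the first separator:
# Illustrative paraphrase of splitting a record on the first separator
# (default '\t'); not the actual Hadoop implementation.
def split_key_val(line, sep="\t"):
    i = line.find(sep)
    if i == -1:
        # no separator: the whole line becomes the key, the value is empty,
        # and the default output format will append a trailing tab
        return line, ""
    return line[:i], line[i + len(sep):]

print(split_key_val("k1\tsome value"))   # ('k1', 'some value')
print(split_key_val("just a value"))     # ('just a value', '')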

Looking at the org.apache.hadoop.mapreduce.lib.output.TextOutputFormat source, I see two things:
The write(key, value) method only writes the separator when both the key and the value are non-null; with streaming, the missing value arrives as an empty (rather than null) Text, so the separator still gets written
The separator always falls back to the default (\t) when mapred.textoutputformat.separator resolves to null (which I'm assuming is what happens with -D stream.reduce.output.field.separator=)
Your only solution may be to write your own OutputFormat that works around these two issues.
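For reference, a Python-flavored paraphrase of that write logic (a sketch based on my reading of TextOutputFormat's LineRecordWriter, not the actual Java source):
def line_record_write(out, key, value, sep="\t"):
    # Paraphrase of LineRecordWriter.write, for illustration only.
    null_key = key is None
    null_value = value is None
    if null_key and null_value:
        return
    if not null_key:
        out.write(str(key))
    if not (null_key or null_value):
        out.write(sep)            # separator only when both parts are present
    if not null_value:
        out.write(str(value))
    out.write("\n")
Because streaming supplies an empty string rather than a null for the missing value, the separator branch still fires, which is where the trailing tab comes from.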
My testing
In a task I had, I wanted to reformat a line from
id1|val1|val2|val3
id2|val1
into:
id1|val1,val2,val3
id2|val1
I had a custom mapper (a Perl script) to convert the lines. For this task, I initially tried treating the input as key-only (or value-only), but got results with the trailing tab.
At first I just specified:
-D stream.map.input.field.separator='|' -D stream.map.output.field.separator='|'
This gave the mapper a key, value pair, since my mapping wanted a key anyway. But this output now had the tab after the first field.
I got the desired output when I added:
-D mapred.textoutputformat.separator='|'
If I didn't set it, or set it to blank
-D mapred.textoutputformat.separator=
then I would again get a tab after the first field.
It made sense once I looked at the source for TextOutputFormat.
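For what it's worth, a hypothetical Python equivalent of that Perl mapper (illustrative only, assuming the flags above): with stream.map.output.field.separator='|' the first field becomes the key, and with mapred.textoutputformat.separator='|' the key and value are re-joined with '|' instead of a tab.
#!/usr/bin/env python
# Hypothetical Python version of the reformatting mapper described above:
# "id1|val1|val2|val3" -> key "id1", value "val1,val2,val3".
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("|")
    key, rest = fields[0], fields[1:]
    # The '|' below is read as the key/value separator because of
    # -D stream.map.output.field.separator='|'; the final file is then
    # written as "id1|val1,val2,val3" thanks to
    # -D mapred.textoutputformat.separator='|'.
    print(key + "|" + ",".join(rest))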

I too had this problem. I was using a Python map-only job that was basically just emitting lines of CSV data. After examining the output, I noticed the \t at the end of every line.
foo,bar,baz\t
What I discovered is that the mapper and the Python stream are both dealing with key/value pairs. If you don't emit the default separator, the whole line of CSV data is considered the "key," and the framework, which requires a key and a value, slaps on a \t and an empty value.
Since my data was essentially a CSV string, I set the separator for both the stream and the mapred output to a comma. The framework read everything up to the first comma as the key and everything after the first comma as the value. Then, when it wrote the results out to file, it wrote key comma value, which effectively created the output I was after.
foo,bar,baz
In my case, I added the below to prevent the framework from adding \t to the end of my csv output...
-D mapred.reduce.tasks=0 \
-D stream.map.output.field.separator=, \
-D mapred.textoutputformat.separator=, \
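For illustration, a minimal sketch of the kind of map-only job described above (not the original script): it just emits CSV lines, and with the comma separators configured as shown, the framework splits each line on the first comma into key and value and re-joins them with a comma on output, so no trailing tab appears.
#!/usr/bin/env python
# Minimal sketch of a map-only streaming job emitting CSV records.
# With stream.map.output.field.separator=, everything before the first comma
# is treated as the key and the rest as the value; with
# mapred.textoutputformat.separator=, they are re-joined with a comma,
# so the output stays "foo,bar,baz" with no trailing tab.
import sys

for line in sys.stdin:
    record = line.rstrip("\n")
    # ... any per-record transformation would go here ...
    print(record)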

Related

TextFSM Template for Netmiko for "inc" phrase

I am trying to create a TextFSM template with the Netmiko library. While it works for most commands, it does not work when I try to perform an "inc" operation on the network device. The TextFSM index file does not seem to recognize the same command for two different templates; for instance:
If I give the command - show running | inc syscontact
and give another command - show running | inc syslocation
in the TextFSM index, the template seems to recognize only the first command, and not the second one.
I understand that I can get the necessary data with a regex expression for syscontact and syslocation in the templates, however I want to achieve this with the "inc" command from the device itself. Is there a way this can be done?
You need to escape the pipe in the index file, e.g. sh[[ow]] ru[[nning]] \| inc syslocation
There is a different way to parse everything you want, using the TTP module. You can take the code I wrote below as an example. You can create your own templates.
from pprint import pprint
from ttp import ttp
import json
import time
with open("showSystemInformation.txt") as f:
    data_to_parse = f.read()
ttp_template = """
<group name="Show_System_Information">
System Name : {{System_Name}}
System Type : {{System_Type}} {{System_Type_2}}
System Version : {{Version}}
System Up Time : {{System_Uptime_Days}} days, {{System_Uptime_HR_MIN_SEC}} (hr:min:sec)
Last Saved Config : {{Last_Saved_Config}}
Time Last Saved : {{Last_Time_Saved_Date}} {{Last_Time_Saved_HR_MIN_SEC}}
Time Last Modified : {{Last_Time_Modified_Date}} {{Last_Time_Modifed_HR_MIN_SEC}}
</group>
"""
parser = ttp(data=data_to_parse, template=ttp_template)
parser.parse()
# print result in JSON format
results = parser.result(format='json')[0]
print(results)
Example run:
[appadmin#ryugbz01 Nokia]$ python3 showSystemInformation.py
[
    {
        "Show_System_Information": {
            "Last_Saved_Config": "cf3:\\config.cfg",
            "Last_Time_Modifed_HR_MIN_SEC": "11:46:57",
            "Last_Time_Modified_Date": "2022/02/09",
            "Last_Time_Saved_Date": "2022/02/07",
            "Last_Time_Saved_HR_MIN_SEC": "15:55:39",
            "System_Name": "SR7-2",
            "System_Type": "7750",
            "System_Type_2": "SR-7",
            "System_Uptime_Days": "17",
            "System_Uptime_HR_MIN_SEC": "05:24:44.72",
            "Version": "C-16.0.R9"
        }
    }
]

pyspark writeStream: Each Data Frame row in a separate json file

I am using pyspark to read data from a Kafka topic as a streaming dataframe as follows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col

# json_schema is assumed to be defined elsewhere (the schema of the Kafka messages)
spark = SparkSession.builder \
    .appName("Spark Structured Streaming from Kafka") \
    .getOrCreate()

sdf = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .option("startingOffsets", "latest") \
    .option("failOnDataLoss", "false") \
    .load() \
    .select(from_json(col("value").cast("string"), json_schema).alias("parsed_value"))

sdf_ = sdf.select("parsed_value.*")
My goal is to write each of the sdf_ rows as a separate json file.
The following code:
writing_sink = sdf_.writeStream \
    .format("json") \
    .option("path", "/Desktop/...") \
    .option("checkpointLocation", "/Desktop/...") \
    .start()

writing_sink.awaitTermination()
will write several rows of the dataframe to the same json file, depending on the size of the micro-batch (or at least that is my hypothesis).
What I need is to tweak the above so that each row of the dataframe is written in a separate json file.
I have also tried using partitionBy('column'), but this still does not do exactly what I need: it creates folders within which the json files may still have multiple rows written in them (if they share the same id).
Any ideas that could help out here? Thanks in advance.
Found out that the following option does the trick:
.option("maxRecordsPerFile", 1)

Prettify YAML with comments

1. Summary
I can't find how I can automatically prettify my YAML files.
2. Data
Example:
I have a SashaPrettifyYAML.yaml file:
sasha_commands:
  # Sasha comment
  sasha_command_help: {call: sublime.command_help, caption: 'Sasha Command: Command Help'}
3. Expected behavior
I want to delete {braces}:
sasha_commands:
  # Sasha comment
  sasha_command_help:
    call: sublime.command_help
    caption: 'Sasha Command: Command Help'
4. Not helped
Pretty YAML (based on PyYAML) and online formatters such as YAML Formatter and OnlineYAMLTools delete comments;
I can't find the required option in ruamel.yaml.cmd;
align-yaml aligns, but does not prettify, the YAML file.
There is no option to do this in ruamel.yaml.cmd, but it is fairly straightforward to do this with a small python program and using ruamel.yaml, by loading and dumping in round-trip mode (the default).
The only thing you need to do is make sure the flow-style on the data-structure that is the value for the key sasha_command_help is set to block-style (which is how I interpret your definition of "prettifying YAML"):
import sys
import ruamel.yaml
yaml_str = """\
sasha_commands:
  # Sasha comment
  sasha_command_help: {call: sublime.command_help, caption: 'Sasha Command: Command Help'}
"""
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load(yaml_str)
data['sasha_commands']['sasha_command_help'].fa.set_block_style()
yaml.dump(data, sys.stdout)
This will give exactly the output you expect.
A recursive data structure walker can be found in scalarstring.py in the ruamel.yaml source, and adapted to make a generic "make-everything-block-style" routine:
import sys
import ruamel.yaml
def block_style(base):
    """
    This routine walks over a simple, i.e. consisting of dicts, lists and
    primitives, tree loaded from YAML. It recurses into dict values and list
    items, and sets block-style on these.
    """
    if isinstance(base, dict):
        for k in base:
            try:
                base.fa.set_block_style()
            except AttributeError:
                pass
            block_style(base[k])
    elif isinstance(base, list):
        for elem in base:
            try:
                base.fa.set_block_style()
            except AttributeError:
                pass
            block_style(elem)
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
file_in = sys.argv[1]
file_out = sys.argv[2]
with open(file_in) as fp:
    data = yaml.load(fp)
block_style(data)
with open(file_out, 'w') as fp:
    yaml.dump(data, fp)
If you store the above in prettifyyaml.py you can call it with:
python prettifyyaml.py SashaPrettifyYAML.yaml Prettified.yaml
Since you are already using single quotes around the scalar that has embedded spaces, you won't see a change if you leave out yaml.preserve_quotes = True. But if you had used a double quoted scalar then that line makes sure the double quotes are preserved.
I had the same problem, so I wrote my own YAML beautifier: https://github.com/wangkuiyi/yamlfmt. I hope it helps.
I tried the top results from Google, but none of them address the requirements of https://sqlflow.org/sqlflow, which I am leading:
https://pypi.org/project/yamlfmt cannot handle a file of multiple YAML documents separated by ---
https://github.com/devopyio/yamlfmt cannot handle multiple files.
https://github.com/miekg/yamlfmt/blob/master/fmt.go cannot replace (inline edit) the input files.
You can use yq tool - it's easy to install and use, and it's well maintained.
Supposing you have an example.yml file to format, it can be processed in the following ways:
from file: yq r --unwrapScalar -p pv -P example.yml '*'
from stdin: cat example.yml | yq r --unwrapScalar -p pv -P - '*'

How do I log to file in Scalding?

In my Scalding map reduce code, I want to log out certain steps that are happening so that I can debug the map-reduce jobs if something goes wrong.
How can I add logging to my scalding job?
E.g.
import com.twitter.scalding._
class WordCountJob(args: Args) extends Job(args) {
  //LOG: Starting job at time blah..
  TextLine( args("input") )
    .read
    .flatMap('line -> 'word) {
      line: String =>
        line.trim.toLowerCase.split("\\W+")
    }
    .groupBy('word) { group => group.size('count) }
    .write(Tsv(args("output")))
  //LOG - ending job at time...
}
Any logging framework will do. You can obviously also use println() - it will appear in your job's stdout log file in the job history of your hadoop cluster (in hdfs mode) or in your console (in local mode).
Also consider defining a trap with the addTrap() method for catching erroneous records.

wsadmin: How to inspect existing resource references?

With $AdminApp view <applicationName> -MapResRefToEJB it is possible to list the resource references defined for a deployed EJB module. However, the result of that command is plain text (that in addition may be localized). To extract that information one would have to parse this text, which is not very convenient. Is there a way to get the same information (i.e. the resources references of an application) in a structured form using $AdminConfig?
The AppManagement MBean provides this data in a structured format (Vector of AppDeploymentTasks). To obtain this data using wsadmin scripting (jython):
import sys
import java.util
import javax.management as mgmt

appName = sys.argv[0]
appMgmt = mgmt.ObjectName(AdminControl.completeObjectName("WebSphere:*,type=AppManagement"))
appInfo = AdminControl.invoke_jmx(appMgmt, "getApplicationInfo", [appName, java.util.Hashtable(), None], ["java.lang.String", "java.util.Hashtable", "java.lang.String"])

for task in appInfo :
    if (task.getName() == "MapResRefToEJB") :
        resRefs = task.getTaskData()
        # skip the first row since it contains the headers
        for i in range(1, len(resRefs)) :
            resRef = resRefs[i]
            print
            print "URI:", resRef[4]
            print "EJB:", resRef[3]
            print "Name:", resRef[5]
            print "Type:", resRef[6]
            print "JNDI:", resRef[8]
