How to count the number of output tuples in Cascading - Hadoop

Using the Cascading framework, I am filtering some tuples and writing them to a file on S3.
I also want to count the total number of output tuples. One simple way is to download the output S3 file and count the number of lines.
Is there any other way to dump the count of output tuples to another file?

This can be done using the FlowProcess.
We can write a custom Function that passes each tuple through unchanged and increments a Hadoop counter:
public class Counter extends BaseOperation implements Function {
    private final String counterGroup;
    private final String counterName;

    public Counter(String counterGroup, String counterName) {
        super(Fields.ARGS);
        this.counterGroup = counterGroup;
        this.counterName = counterName;
    }

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        // pass the tuple through unchanged and bump the counter once per tuple
        functionCall.getOutputCollector().add(functionCall.getArguments());
        flowProcess.increment(counterGroup, counterName, 1);
    }
}
Use it like this:
groupByPipe = new Each(groupByPipe, new Counter(COUNTER_GROUP_NAME, counterName));
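To actually dump the count to another file, the counter can be read from the flow stats once the flow has completed. A minimal sketch, assuming Cascading 2.x and the same COUNTER_GROUP_NAME / counterName used above (the flow setup and the output file name are placeholders):
// run the flow as usual (flowConnector and flowDef are placeholders for your own setup)
Flow flow = flowConnector.connect(flowDef);
flow.complete();

// read back the counter incremented by the Counter function above
long outputTuples = flow.getFlowStats().getCounterValue(COUNTER_GROUP_NAME, counterName);

// write the count to a separate file (local here; write to S3 the same way as the main output if needed)
try (PrintWriter writer = new PrintWriter("tuple-count.txt")) {
    writer.println(outputTuples);
}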

Related

Sending input from a single spout to multiple bolts with fields grouping in Apache Storm

builder.setSpout("spout", new TweetSpout());
builder.setBolt("bolt", new TweetCounter(), 2).fieldsGrouping("spout",
        new Fields("field1"));
I have an input field "field1" used in the fields grouping. By definition of fields grouping, all tweets with the same "field1" should go to a single task of TweetCounter. The number of executors set for the TweetCounter bolt is 2.
However, if "field1" is the same in all tuples of the incoming stream, does this mean that even though I specified 2 executors for TweetCounter, the stream would only be sent to one of them while the other instance remains idle?
To go further with my particular use case, how can I use a single spout and send data to different bolts based on a particular value of an input field (field1)?
It seems one way to solve this problem is to use direct grouping, where the source decides which component will receive the tuple:
This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods of OutputCollector. A bolt can get the task ids of its consumers either by using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
You can see an example of its use here:
collector.emitDirect(getWordCountIndex(word),new Values(word));
where getWordCountIndex returns the index of the task where this tuple will be processed.
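For completeness, a rough sketch of the wiring a direct stream needs (the stream id "words", the consumer component id "tweet-counter" and the routing rule are assumptions, not part of the quoted docs):
// in the emitting component: declare the stream as a direct stream
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declareStream("words", true, new Fields("word"));
}

// in prepare(): look up the task ids of the consuming bolt
List<Integer> counterTasks = context.getComponentTasks("tweet-counter");

// when emitting: choose the target task from the value of the routing field
int target = counterTasks.get((word.hashCode() & Integer.MAX_VALUE) % counterTasks.size());
collector.emitDirect(target, "words", new Values(word));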
An alternative to using emitDirect as described in this answer is to implement your own stream grouping. The complexity is about the same, but it allows you to reuse grouping logic across multiple bolts.
For example, the shuffle grouping in Storm is implemented as a CustomStreamGrouping as follows:
public class ShuffleGrouping implements CustomStreamGrouping, Serializable {
    private ArrayList<List<Integer>> choices;
    private AtomicInteger current;

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks) {
        choices = new ArrayList<List<Integer>>(targetTasks.size());
        for (Integer i : targetTasks) {
            choices.add(Arrays.asList(i));
        }
        current = new AtomicInteger(0);
        Collections.shuffle(choices, new Random());
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        int rightNow;
        int size = choices.size();
        while (true) {
            rightNow = current.incrementAndGet();
            if (rightNow < size) {
                return choices.get(rightNow);
            } else if (rightNow == size) {
                current.set(0);
                return choices.get(0);
            }
            // race condition with another thread, and we lost. try again
        }
    }
}
Storm will call prepare to tell you the task ids your grouping is responsible for, as well as some context on the topology. When Storm emits a tuple from a bolt/spout where you're using this grouping, Storm will call chooseTasks which lets you define which tasks the tuple should go to. You would then use the grouping when building your topology as shown:
TopologyBuilder tp = new TopologyBuilder();
tp.setSpout("spout", new MySpout(), 1);
tp.setBolt("bolt", new MyBolt())
.customGrouping("spout", new ShuffleGrouping());
Be aware that groupings need to be Serializable and thread safe.

Spring Batch FixedLengthTokenizer Range without upper limit is not working

As per the Javadoc of the Range class:
A Range can be unbounded at the maximum side. This can be specified by passing {@link Range#UPPER_BORDER_NOT_DEFINED} as the max value or by using the constructor {@link #Range(int)}.
I have one line like:
SomeText sometext etc
Update: Basically I have a multi-line data set like the one below. Itemid is the identifier of the start of a record. I am using SingleItemPeekableItemReader and PatternMatchingCompositeLineTokenizer; after a lot of effort I got it working and was able to read the data into the required POJO. The solution is built on
https://docs.spring.io/spring-batch/4.0.x/reference/html/common-patterns.html#multiLineRecords.
The input file looks like this:
Itemid1-ID1
SomeRandomText1SomeRandomText1SomeRandomText1
SomeRandomText1
SomeRandomText1SomeRandomText1SomeRandomText1
Itemid2-ID2
SomeRandomText1SomeRandomText1
SomeRandomText1
SomeRandomText1
SomeRandomText1SomeRandomText1
The data item is like:
class Pojo
{
    String id;
    String data; // concatenated string of all remaining lines, until the next data item
}
If I want to configure the FixedLengthTokenizer to read this into a single field:
public FixedLengthTokenizer head()
{
    FixedLengthTokenizer token = new FixedLengthTokenizer();
    token.setNames("id");
    token.setColumns(new Range(1));
    return token;
}
My expectation is that if I do not provide the max limit in the Range constructor, it will read the full line. But I am getting a "Line is longer than max range 1" exception.
Can someone please help?
In your case, you need to specify two ranges: one for the id and another one for the actual data. Here is an example:
@Test
public void testFixedLengthTokenizerUnboundedRange() {
    FixedLengthTokenizer tokenizer = new FixedLengthTokenizer();
    tokenizer.setNames("id", "data");
    tokenizer.setColumns(new Range(1, 5), new Range(6));
    FieldSet tokens = tokenizer.tokenize("12345\nSomeRandomText1\nSomeRandomText2");
    assertEquals("12345", tokens.readString("id"));
    assertEquals("SomeRandomText1\nSomeRandomText2", tokens.readString("data"));
}
This test passes, so the unbounded range is working as expected.
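Applied to the head() method from the question, the configuration could look like the following (the 1-5 width for the id field is an assumption; adjust it to the actual record layout):
public FixedLengthTokenizer head() {
    FixedLengthTokenizer tokenizer = new FixedLengthTokenizer();
    tokenizer.setNames("id", "data");
    // bounded range for the id, unbounded range (no max) for the rest of the line
    tokenizer.setColumns(new Range(1, 5), new Range(6));
    return tokenizer;
}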
Hope the example helps.

HBase - Hadoop: TableInputFormat extension

I am using an HBase table as my input, whose keys I have pre-processed to consist of a number concatenated with the respective row ID. I want to be sure that all rows whose key starts with the same number are processed by the same mapper in a MapReduce job. I am aware that this could be achieved by extending TableInputFormat, and I have seen one or two posts about extending this class, but I am looking for the most efficient way to do this in particular.
If anyone has any ideas, please let me know.
You can use a PrefixFilter in your scan.
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PrefixFilter.html
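A minimal sketch of attaching the PrefixFilter to the scan that feeds a table mapper (the table name, the prefix variable and the mapper/output classes are placeholders):
Scan scan = new Scan();
// only rows whose key starts with thePrefix reach the mappers
scan.setFilter(new PrefixFilter(Bytes.toBytes(thePrefix)));
TableMapReduceUtil.initTableMapperJob("my_table", scan,
        MyMapper.class, Text.class, Result.class, mapReduceJob);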
And parallelize the launch of your different mapper jobs using a Future:
final Future<Boolean> newJobFuture = executor.submit(new Callable<Boolean>() {
    @Override
    public Boolean call() throws Exception {
        Job mapReduceJob = MyJobBuilder.createJob(args, thePrefix, ...);
        return mapReduceJob.waitForCompletion(true);
    }
});
But I believe what you are looking for is really more the job of a reducer.

SingleColumnValueFilter not returning proper number of rows

In our HBase table, each row has a column called crawl identifier. Using a MapReduce job, we only want to process rows from a given crawl at any one time. In order to run the job more efficiently, we gave our scan object a filter that (we hoped) would remove all rows except those with the given crawl identifier. However, we quickly discovered that our jobs were not processing the correct number of rows.
I wrote a test mapper to simply count the number of rows with the correct crawl identifier, without any filters. It iterated over all the rows in the table and counted the correct, expected number of rows (~15000). When we took that same job and added a filter to the scan object, the count dropped to ~3000. There was no manipulation of the table itself during or between these two jobs.
Since adding the scan filter caused the visible rows to change so dramatically, we suspect that we simply built the filter incorrectly.
Our MapReduce job features a single mapper:
public static class RowCountMapper extends TableMapper<ImmutableBytesWritable, Put> {
    public String crawlIdentifier;

    // counters
    private static enum CountRows {
        ROWS_WITH_MATCHED_CRAWL_IDENTIFIER
    }

    @Override
    public void setup(Context context) {
        Configuration configuration = context.getConfiguration();
        crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
    }

    @Override
    public void map(ImmutableBytesWritable legacykey, Result row, Context context) {
        String rowIdentifier = HBaseSchema.getValueFromRow(row, HBaseSchema.CRAWL_IDENTIFIER_COLUMN);
        if (StringUtils.equals(crawlIdentifier, rowIdentifier)) {
            context.getCounter(CountRows.ROWS_WITH_MATCHED_CRAWL_IDENTIFIER).increment(1L);
        }
    }
}
The filter setup is like this:
String crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
if (StringUtils.isBlank(crawlIdentifier)) {
    throw new IllegalArgumentException("Crawl Identifier not set.");
}
// build an HBase scanner
Scan scan = new Scan();
SingleColumnValueFilter filter = new SingleColumnValueFilter(HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily(),
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getQualifier(),
        CompareOp.EQUAL,
        Bytes.toBytes(crawlIdentifier));
filter.setFilterIfMissing(true);
scan.setFilter(filter);
Are we using the wrong filter, or have we configured it wrong?
EDIT: we're looking at manually adding all the column families as per https://issues.apache.org/jira/browse/HBASE-2198 but I'm pretty sure the Scan includes all the families by default.
The filter looks correct, but one scenario that could cause this relates to character encodings. Your filter uses Bytes.toBytes(String), which uses UTF-8 [1], whereas you might be using the native character encoding in HBaseSchema, or when writing the record if you use String.getBytes() [2]. Check that the crawlIdentifier was originally written to HBase using the following, to ensure the filter compares like for like in the filtered scan:
Bytes.toBytes(crawlIdentifier)
[1] http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Bytes.html#toBytes(java.lang.String)
[2] http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html#getBytes()
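For illustration, the encoding difference shows up at write time. A hypothetical Put using the column constants from the question (not code from the original answer):
Put put = new Put(rowKey);
// matches the filter above: UTF-8 encoding via the HBase helper
put.add(HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily(),
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getQualifier(),
        Bytes.toBytes(crawlIdentifier));
// may NOT match if the platform default charset is not UTF-8:
// put.add(family, qualifier, crawlIdentifier.getBytes());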

Too many filter matches in Pig

I have a list of filter keywords (about 1000 in number) and I need to filter a field of a relation in Pig using this list.
Initially, I have declared these keywords like:
%declare p1 '.keyword1.';
....
...
%declare p1000 '.keyword1000.';
I am then doing filtering like:
Filtered = FILTER SRC BY (not $0 matches '$p1') and (not $0 matches '$p2') and ...... (not $0 matches '$p1000');
DUMP Filtered;
Assume that my source relation is in SRC and I need to apply the filtering on the first field, i.e. $0.
If I reduce the number of filters to 100-200, it works fine, but as the number of filters increases to 1000, it stops working.
Can somebody suggest a work around to get the results right?
Thanks in advance
You can write a simple filter UDF where you'd perform all the checks, something like:
package myudfs;

import java.io.IOException;
import java.util.List;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class MYFILTER extends FilterFunc
{
    private static List<String> filterList;

    static {
        // load all filter keywords here
    }

    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            // keep the row only if it does not match any filter keyword
            return !filterList.contains(str);
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
One shallow approach is to divide the filtering into stages: filter keywords 1 to 100 in stage one, then another 100 in the next stage, and so on, for a total of count(keywords)/100 stages. However, given more details of your data, there is probably a better solution.
As for the above shallow approach, you can wrap the Pig script in a shell script that parcels out the keyword subsets and starts a run for the current subset being filtered.
