I have a list of filter keywords (about 1000 in number) and I need to filter a field of a relation in Pig using this list.
Initially, I have declared these keywords like:
%declare p1 '.keyword1.';
....
...
%declare p1000 '.keyword1000.';
I am then doing filtering like:
Filtered = FILTER SRC BY (not $0 matches '$p1') and (not $0 matches '$p2') and ...... (not $0 matches '$p1000');
DUMP Filtered;
Assume that my source relation is SRC and I need to apply the filtering on the first field, i.e. $0.
If I reduce the number of filters to 100-200, it works fine, but once the number of filters grows to 1000 it stops working.
Can somebody suggest a workaround to get the results right?
Thanks in advance.
You can write a simple filter UDF where you'd perform all the checks, something like this:
package myudfs;

import java.io.IOException;
import java.util.List;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class MYFILTER extends FilterFunc
{
    static List<String> filterList;

    static {
        // load all filters into filterList once per JVM
    }

    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            String str = (String) input.get(0);
            return !filterList.contains(str);
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
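The static block above is only a placeholder. As a sketch (assuming the keyword list is bundled in the UDF jar as a plain-text resource named keywords.txt, one keyword per line), it could be filled in like this:
static {
    filterList = new java.util.ArrayList<String>();
    // keywords.txt is an assumed resource name: one filter keyword per line, bundled in the UDF jar
    try (java.io.BufferedReader in = new java.io.BufferedReader(new java.io.InputStreamReader(
            MYFILTER.class.getResourceAsStream("/keywords.txt"), java.nio.charset.StandardCharsets.UTF_8))) {
        String line;
        while ((line = in.readLine()) != null) {
            if (!line.trim().isEmpty()) {
                filterList.add(line.trim());
            }
        }
    } catch (IOException e) {
        // fail fast if the keyword list cannot be loaded
        throw new ExceptionInInitializerError(e);
    }
}
You would then REGISTER the UDF jar in your Pig script and replace the long chain of matches clauses with a single FILTER SRC BY myudfs.MYFILTER($0);. Note that filterList.contains(str) is an exact match on the whole field; if you need the regex semantics of your matches expressions, the check inside exec would have to loop over the list and apply str.matches(pattern) per keyword instead.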
One shallow approach is to divide the filtering into stages: filter keywords 1 to 100 in stage one, then filter another 100, and so on, for a total of (count(keywords)/100) stages. However, given more details about your data, there is probably a better solution to this.
As for the shallow solution above, you can wrap the Pig script in a shell script that does the parcelling out of the input and starts the run on the current keyword subset being filtered.
Related
I am using Spring Boot + OpenCSV to parse a CSV with 120 columns (sample 1). I upload the file, process each row, and in case of errors return a similar CSV (say errorCSV). This errorCSV will have only the errored-out rows with the 120 original columns and 3 additional columns with details on what went wrong (sample error file 2).
I have used annotation-based processing and the beans are populating fine. But I need to get the header names in the order they appear in the CSV, which is the challenging part. I then need to capture the exception and the original data during parsing; the two together can later be used when writing the error CSV.
CSVReaderHeaderAware headerReader = new CSVReaderHeaderAware(reader);
try {
    header = headerReader.readMap().keySet();
} catch (CsvValidationException e) {
    e.printStackTrace();
}
However, the header order is jumbled and there is no way to get the header index, because CSVReaderHeaderAware internally uses a HashMap. To solve this I built my own class. It is a replica of CSVReaderHeaderAware, except that I used a LinkedHashMap:
public class CSVReaderHeaderOrderAware extends CSVReader {

    private final Map<String, Integer> headerIndex = new LinkedHashMap<>();

    ....

    // This code cannot be done with a stream and Collectors.toMap()
    // because Map.merge() does not play well with null values. Some
    // implementations throw a NullPointerException, others simply remove
    // the key from the map.
    Map<String, String> resultMap = new LinkedHashMap<>(headerIndex.size() * 2);
}
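For reference, here is roughly how the replica captures the header order (a simplified sketch, assuming opencsv 5.x; the constructor mirrors what CSVReaderHeaderAware does, but stores the positions in the LinkedHashMap so iteration order matches the file):
public CSVReaderHeaderOrderAware(Reader reader) throws IOException, CsvValidationException {
    super(reader);
    // read the first record as the header and remember each column's position in file order
    String[] headers = readNext();
    for (int i = 0; i < headers.length; i++) {
        headerIndex.put(headers[i], i);
    }
}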
It does the job; however, I wanted to check whether this is the best way out, or can you think of a better way to get the header names and failed values back and write them to a CSV?
I referred to the following links but couldn't get much help:
How to read from particular header in opencsv?
As per the Javadoc of the Range class:
A Range can be unbounded at the maximum side. This can be specified by passing {@link Range#UPPER_BORDER_NOT_DEFINED} as max value or using constructor {@link #Range(int)}.
I have one line like:
SomeText sometext etc
Update (input file): Basically I have a multi-line data set like the one below. ItemId is the identifier of the start of a record. I am using SingleItemPeekableItemReader and PatternMatchingCompositeLineTokenizer; after a lot of effort I got it working and was able to read the data into the required POJO. The solution is built on
https://docs.spring.io/spring-batch/4.0.x/reference/html/common-patterns.html#multiLineRecords.
The input data looks like:
Itemid1-ID1
SomeRandomText1SomeRandomText1SomeRandomText1
SomeRandomText1
SomeRandomText1SomeRandomText1SomeRandomText1
Itemid2-ID2
SomeRandomText1SomeRandomText1
SomeRandomText1
SomeRandomText1
SomeRandomText1SomeRandomText1
The data item is like:
class Pojo
{
    String id;
    // concatenated string of all remaining lines, until the next data item starts
    String data;
}
If I want to configure FixedLengthTokenizer to read this into a single field:
public FixedLengthTokenizer head()
{
FixedLengthTokenizer token = new FixedLengthTokenizer();
token.setNames("id");
token.setColumns(new Range(1));
return token;
}
My expectation is that if I do not provide the max limit in the Range constructor, it will read the full line. But I am getting a "Line is longer than max range 1" exception.
Can someone please help?
In your case, you need to specify two ranges: one for the ID and another one for the actual data. Here is an example:
@Test
public void testFixedLengthTokenizerUnboundedRange() {
    FixedLengthTokenizer tokenizer = new FixedLengthTokenizer();
    tokenizer.setNames("id", "data");
    tokenizer.setColumns(new Range(1, 5), new Range(6));
    FieldSet tokens = tokenizer.tokenize("12345\nSomeRandomText1\nSomeRandomText2");
    assertEquals("12345", tokens.readString("id"));
    assertEquals("SomeRandomText1\nSomeRandomText2", tokens.readString("data"));
}
This test is passing. So the unbounded range is working as expected.
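Applied to the head() bean from the question, the configuration might look something like this (the ID width of 5 characters is just an assumption; adjust the first Range to your actual layout):
public FixedLengthTokenizer head() {
    FixedLengthTokenizer tokenizer = new FixedLengthTokenizer();
    tokenizer.setNames("id", "data");
    // bounded range for the id, unbounded range for everything after it
    tokenizer.setColumns(new Range(1, 5), new Range(6));
    return tokenizer;
}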
Hope the example helps.
Using the Cascading framework, I am filtering some tuples and writing them out to an S3 file.
I also want to count the total number of output tuples. One simple way is to download the output S3 file and count the number of lines.
Is there any other way to dump the count of output tuples to another file?
This can be done using a FlowProcess.
We can write a custom function:
public class Counter extends BaseOperation implements Function {
    ...
    @Override
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        // pass the tuple through unchanged and increment the counter
        functionCall.getOutputCollector().add(functionCall.getArguments());
        flowProcess.increment(counterGroup, counterName, 1);
    }
    ...
}
Use :
groupByPipe = new Each(groupByPipe, new Counter(COUNTER_GROUP_NAME, counterName));
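Once the flow has completed, the counter value can be read back from the flow's stats and dumped wherever you need it. A rough sketch (the flow variable and the output file name are assumptions):
flow.complete();

// read the counter that Counter incremented and write it to a local file
long outputTupleCount = flow.getFlowStats().getCounterValue(COUNTER_GROUP_NAME, counterName);
java.nio.file.Files.write(java.nio.file.Paths.get("output-tuple-count.txt"),
        String.valueOf(outputTupleCount).getBytes(java.nio.charset.StandardCharsets.UTF_8));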
I am currently trying to play around with mahout. I purchased the book Mahout in Action.
The whole process is understood and with simple test data sets I was already successful.
Now I have a classification problem that I would like to solve.
The target variable has been identified; for now I will call it x.
The existing data in our database has already been classified with -1, 0 and +1.
We defined several predictor variables which we select with an SQL query.
These are the product's attributes: language, country, category (of the shop), title, description.
Now I want them to be written directly to a SequenceFile, for which I wrote a little helper class that appends to the sequence file each time a new row of the SQL result set has been processed:
public void appendToFile(String classification, String databaseID, String language, String country, String vertical, String title, String description) {
    int count = 0;
    Text key = new Text();
    Text value = new Text();
    key.set("/" + classification + "/" + databaseID);
    //??value.set(message);
    try {
        this.writer.append(key, value);
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
If I only had the title or so, I could simply store it in the value - but how do I store multiple values like country, lang, and so on, for that particular key?
Thanks for any help!
You shouldn't be storing structures in a seq file; just dump all the text you have, separated by a space.
It's simply a place to put all your content for term counting and such when using something like Naive Bayes; it doesn't care about structure.
Then, when you have the classification, look up the structure in your database.
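In the appendToFile helper from the question, that could be as simple as concatenating the text fields into the value (a sketch; the space separator and the field order are arbitrary choices):
// key: "/classification/databaseID", value: all text fields joined with spaces
key.set("/" + classification + "/" + databaseID);
value.set(language + " " + country + " " + vertical + " " + title + " " + description);
this.writer.append(key, value);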
In our HBase table, each row has a column called crawl identifier. Using a MapReduce job, we only want to process at any one time rows from a given crawl. In order to run the job more efficiently we gave our scan object a filter that (we hoped) would remove all rows except those with the given crawl identifier. However, we quickly discovered that our jobs were not processing the correct number of rows.
I wrote a test mapper that simply counts the number of rows with the correct crawl identifier, without any filters. It iterated over all the rows in the table and counted the correct, expected number of rows (~15000). When we took that same job and added a filter to the scan object, the count dropped to ~3000. There was no manipulation of the table itself during or in between these two jobs.
Since adding the scan filter caused the visible rows to change so dramatically, we expect that we simply built the filter incorrectly.
Our MapReduce job features a single mapper:
public static class RowCountMapper extends TableMapper<ImmutableBytesWritable, Put> {

    public String crawlIdentifier;

    // counters
    private static enum CountRows {
        ROWS_WITH_MATCHED_CRAWL_IDENTIFIER
    }

    @Override
    public void setup(Context context) {
        Configuration configuration = context.getConfiguration();
        crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
    }

    @Override
    public void map(ImmutableBytesWritable legacykey, Result row, Context context) {
        String rowIdentifier = HBaseSchema.getValueFromRow(row, HBaseSchema.CRAWL_IDENTIFIER_COLUMN);
        if (StringUtils.equals(crawlIdentifier, rowIdentifier)) {
            context.getCounter(CountRows.ROWS_WITH_MATCHED_CRAWL_IDENTIFIER).increment(1L);
        }
    }
}
The filter setup is like this:
String crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
if (StringUtils.isBlank(crawlIdentifier)) {
    throw new IllegalArgumentException("Crawl Identifier not set.");
}

// build an HBase scanner
Scan scan = new Scan();
SingleColumnValueFilter filter = new SingleColumnValueFilter(
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily(),
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getQualifier(),
        CompareOp.EQUAL,
        Bytes.toBytes(crawlIdentifier));
filter.setFilterIfMissing(true);
scan.setFilter(filter);
Are we using the wrong filter, or have we configured it wrong?
EDIT: we're looking at manually adding all the column families as per https://issues.apache.org/jira/browse/HBASE-2198 but I'm pretty sure the Scan includes all the families by default.
The filter looks correct, but one scenario that could cause this relates to character encodings. Your filter is using Bytes.toBytes(String), which uses UTF-8 [1], whereas you might be using the platform's native character encoding in HBaseSchema or when you write the record if you use String.getBytes() [2]. Check that the crawlIdentifier was originally written to HBase using the following, to ensure the filter is comparing like for like in the filtered scan:
Bytes.toBytes(crawlIdentifier)
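On the write side that would mean something like the following sketch (it uses the older Put.add API; rowKey and table are assumed to exist in your write path, and HBaseSchema is the question's own helper):
Put put = new Put(Bytes.toBytes(rowKey));
// store the crawl identifier UTF-8 encoded, so it matches what the filter compares against
put.add(HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily(),
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getQualifier(),
        Bytes.toBytes(crawlIdentifier)); // not crawlIdentifier.getBytes()
table.put(put);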
[1] http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Bytes.html#toBytes(java.lang.String)
[2] http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html#getBytes()