Performing a bulk load in Cassandra with MapReduce - Hadoop

I haven't got much experience working with Cassandra, so please excuse me if my approach is wrong.
I am trying to do a bulk load into Cassandra with MapReduce, basically the word count example.
Reference: http://henning.kropponline.de/2012/11/15/using-cassandra-hadoopbulkoutputformat/
I have put in the simple Hadoop WordCount mapper example and slightly modified the driver code and the reducer as per the above example.
I have successfully generated the output file as well. Now my doubt is: how do I perform the loading into Cassandra? Is there anything wrong with my approach?
Please advise.
This is a part of the driver code
Job job = new Job();
job.setJobName(getClass().getName());
job.setJarByClass(CassaWordCountJob.class);
Configuration conf = job.getConfiguration();
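// Cassandra output settings: target keyspace/column family, partitioner and Thrift endpoint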
conf.set("cassandra.output.keyspace", "test");
conf.set("cassandra.output.columnfamily", "words");
conf.set("cassandra.output.partitioner.class", "org.apache.cassandra.dht.RandomPartitioner");
conf.set("cassandra.output.thrift.port","9160"); // default
conf.set("cassandra.output.thrift.address", "localhost");
conf.set("mapreduce.output.bulkoutputformat.streamthrottlembits", "400");
job.setMapperClass(CassaWordCountMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
job.setReducerClass(CassaWordCountReducer.class);
FileOutputFormat.setOutputPath(job, new Path("/home/user/Desktop/test/cassandra"));
MultipleOutputs.addNamedOutput(job, "reducer", BulkOutputFormat.class, ByteBuffer.class, List.class);
return job.waitForCompletion(true) ? 0 : 1;
The mapper is the same as the normal word count mapper: it just tokenizes the input and emits (word, 1).
The reducer class is of the form
public class CassaWordCountReducer extends
        Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        List<Mutation> columnsToAdd = new ArrayList<Mutation>();
        Integer wordCount = 0;
        for (IntWritable value : values) {
            wordCount += value.get();
        }
        Column countCol = new Column(ByteBuffer.wrap("count".getBytes()));
        countCol.setValue(ByteBuffer.wrap(wordCount.toString().getBytes()));
        countCol.setTimestamp(new Date().getTime());
        ColumnOrSuperColumn wordCosc = new ColumnOrSuperColumn();
        wordCosc.setColumn(countCol);
        Mutation countMut = new Mutation();
        countMut.column_or_supercolumn = wordCosc;
        columnsToAdd.add(countMut);
        context.write(ByteBuffer.wrap(key.toString().getBytes()), columnsToAdd);
    }
}

To do bulk loads into Cassandra, I would advise looking at this article from DataStax. Basically, you need to do two things for bulk loading:
Your output data won't natively fit into Cassandra; you need to transform it into SSTables.
Once you have your SSTables, you need to be able to stream them into Cassandra. Of course you don't simply want to copy each SSTable to every node; you only want to copy the relevant part of the data to each node.
In your case, when using BulkOutputFormat, it should do all of that for you, as it uses the sstableloader behind the scenes. I've never used it with MultipleOutputs, but it should work fine.
I think the error in your case is that you're not using MultipleOutputs correctly: you're still doing a context.write, when you should really be writing to your MultipleOutputs object. The way you're doing it right now, since you're writing to the regular Context, your output gets picked up by the default output format, TextOutputFormat, and not the one you defined in your MultipleOutputs. More information on how to use MultipleOutputs in your reducer here.
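For illustration, here is a rough, untested sketch of how the reducer might look when it writes through MultipleOutputs; the named output "reducer" is the one registered in your driver, and everything else mirrors your original code:
public class CassaWordCountReducer extends
        Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> {

    private MultipleOutputs<ByteBuffer, List<Mutation>> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<ByteBuffer, List<Mutation>>(context);
    }

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int wordCount = 0;
        for (IntWritable value : values) {
            wordCount += value.get();
        }
        Column countCol = new Column(ByteBuffer.wrap("count".getBytes()));
        countCol.setValue(ByteBuffer.wrap(String.valueOf(wordCount).getBytes()));
        countCol.setTimestamp(System.currentTimeMillis());
        ColumnOrSuperColumn wordCosc = new ColumnOrSuperColumn();
        wordCosc.setColumn(countCol);
        Mutation countMut = new Mutation();
        countMut.column_or_supercolumn = wordCosc;
        List<Mutation> columnsToAdd = new ArrayList<Mutation>();
        columnsToAdd.add(countMut);
        // Write to the named output so BulkOutputFormat handles it, not the default TextOutputFormat
        multipleOutputs.write("reducer", ByteBuffer.wrap(key.toString().getBytes()), columnsToAdd);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();
    }
}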
Once you write to the BulkOutputFormat you defined, your SSTables should get created and streamed to Cassandra from each node in your cluster; you shouldn't need any extra step, the output format takes care of it for you.
Also, I would advise looking at this post, where they also explain how to use BulkOutputFormat, but they use a ConfigHelper which you might want to take a look at to more easily configure your Cassandra endpoint.
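For reference, a hedged sketch of what the ConfigHelper-based setup could look like in your driver; the method names come from org.apache.cassandra.hadoop.ConfigHelper, but exact signatures vary between Cassandra versions, so treat this as an outline rather than a drop-in replacement:
// Assumed equivalent of the raw conf.set(...) calls above, using ConfigHelper
ConfigHelper.setOutputInitialAddress(conf, "localhost");
ConfigHelper.setOutputRpcPort(conf, "9160");
ConfigHelper.setOutputColumnFamily(conf, "test", "words");
ConfigHelper.setOutputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
job.setOutputFormatClass(BulkOutputFormat.class); // when not going through MultipleOutputs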

Related

OpenCSV: getting the list of header names in the order they appear in the CSV

I am using Spring Boot + OpenCSV to parse a CSV with 120 columns (sample 1). I upload the file, process each row, and in case of error return a similar CSV (say errorCSV). This errorCSV will have only the errored-out rows, with the 120 original columns and 3 additional columns for details on what went wrong. Sample error file 2
I have used annotation-based processing and the beans are populating fine. But I need to get the header names in the order they appear in the CSV, which is the challenging part. I then need to capture the exception and the original data during parsing; the two together can later be used to write the error CSV.
CSVReaderHeaderAware headerReader;
headerReader = new CSVReaderHeaderAware(reader);
try {
header = headerReader.readMap().keySet();
} catch (CsvValidationException e) {
e.printStackTrace();
}
However, the header order is jumbled and there is no way to get the header index, because CSVReaderHeaderAware internally uses a HashMap. To solve this I built my own class. It is a replica of CSVReaderHeaderAware 3, except that I used a LinkedHashMap:
public class CSVReaderHeaderOrderAware extends CSVReader {
private final Map<String, Integer> headerIndex = new LinkedHashMap<>();
}
....
// This code cannot be done with a stream and Collectors.toMap()
// because Map.merge() does not play well with null values. Some
// implementations throw a NullPointerException, others simply remove
// the key from the map.
Map<String, String> resultMap = new LinkedHashMap<>(headerIndex.size()*2);
It does the job; however, I wanted to check whether this is the best way out, or whether you can think of a better way to get the header names and failed values back and write them to a CSV.
I referred to the following links but couldn't get much help:
How to read from particular header in opencsv?
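Not an approach from the original post, but a possible alternative worth weighing: a plain CSVReader returns each row as a String[] in file order, so reading the first row gives the headers exactly as they appear, without touching the annotation-based bean parsing ("file.csv" is a placeholder path):
// Hedged sketch: read the header row in file order with a plain CSVReader (opencsv 5.x)
try (CSVReader csvReader = new CSVReader(new FileReader("file.csv"))) {
    String[] headersInOrder = csvReader.readNext(); // columns exactly as they appear in the CSV
    // keep headersInOrder for writing the error CSV later
} catch (IOException | CsvValidationException e) {
    e.printStackTrace();
}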

Spring Batch - Loop reader, processor and writer for N times

In Spring Batch, how do I loop the reader, processor and writer N times?
My requirement is:
I have "N" number of customers/clients.
For each customer/client, I need to fetch the records from the database (reader), then process (processor) all records for that customer/client, and then write the records into a file (writer).
How do I loop the Spring Batch job N times?
AFAIK, I'm afraid there's no framework support for this scenario, at least not the way you want to solve it.
I'd suggest solving the problem differently:
Option 1
Read/process/write all records from all customers at once. You can only do this if they are all in the same DB. I would not recommend it otherwise, because you'll have to configure JTA/XA transactions and it's not worth the trouble.
Option 2
Run your job once for each client (the best option in my opinion). Save the necessary info for each client in different properties files (DB connection data, values to filter records by client, whatever other client-specific data you may need) and pass the client to use as a parameter to the job. This way you can control which client is processed and when, using bash scripts and/or cron. If you use Spring Boot + Spring Batch you can store the client configuration in profiles (application-clientX.properties) and run the process like:
$> java -Dspring.profiles.active="clientX" \
-jar "yourBatch-1.0.0-SNAPSHOT.jar" \
-next
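If you prefer to drive this from code instead of the command line, a rough sketch of the same idea would be to pass the client as a job parameter; jobLauncher and job are assumed injected beans and "clientX" is a placeholder, none of which comes from the original answer:
// Hedged sketch: launch the same job once per client, passing the client id as a JobParameter
JobParameters params = new JobParametersBuilder()
        .addString("client", "clientX")
        .addLong("timestamp", System.currentTimeMillis()) // keeps each run's parameters unique
        .toJobParameters();
jobLauncher.run(job, params);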
Bonus - Option 3
If none of the above fits your needs, or you insist on solving the problem the way you presented it, then you can dynamically configure the job depending on parameters, creating one step per client using Java config:
@Bean
public Job job() {
    JobBuilder jb = jobBuilders.get("job");
    for (Client c : clientsToProcess) {
        jb.flow(buildStepByClient(c));
    }
    return jb.build();
}
Again, I strongly advise you not to go this way: it's ugly, against the framework's philosophy, hard to maintain and debug, and you'll probably have to use JTA/XA here as well...
I hope I've been of some help!
Local partitioning will solve your problem.
In your partitioner, you put all of your client ids into a map, as shown below (just pseudocode):
public class PartitionByClient implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> result = new HashMap<>();
        int partitionNumber = 1;
        for (String client : allClients) {
            ExecutionContext value = new ExecutionContext();
            value.putString("client", client);
            result.put("Client [" + client + "] : THREAD " + partitionNumber, value);
            partitionNumber++;
        }
        return result;
    }
}
This is just pseudocode; you should look at the detailed documentation of partitioning.
You will have to mark your reader, processor and writer with @StepScope (i.e. whichever part needs the value of your client). The reader will use this client in the WHERE clause of its SQL. You will use @Value("#{stepExecutionContext[client]}") String client in the reader (etc.) definition to inject this value.
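As an illustration, here is a sketch of a step-scoped JDBC reader that picks up the client from the step execution context; the Record type, table and column names are placeholders, not something from the original answer:
// Hedged sketch: step-scoped reader filtered by the client injected from the step execution context
@Bean
@StepScope
public JdbcCursorItemReader<Record> reader(
        @Value("#{stepExecutionContext['client']}") String client,
        DataSource dataSource) {
    JdbcCursorItemReader<Record> reader = new JdbcCursorItemReader<>();
    reader.setDataSource(dataSource);
    reader.setSql("SELECT * FROM records WHERE client_id = ?"); // placeholder table/column
    reader.setPreparedStatementSetter(ps -> ps.setString(1, client));
    reader.setRowMapper(new BeanPropertyRowMapper<>(Record.class));
    return reader;
}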
Now the final piece: you will need a task executor, and a number of clients equal to concurrencyLimit will start in parallel, provided you set this task executor in your master partitioner step configuration.
@Bean
public TaskExecutor taskExecutor() {
    SimpleAsyncTaskExecutor simpleTaskExecutor = new SimpleAsyncTaskExecutor();
    simpleTaskExecutor.setConcurrencyLimit(concurrencyLimit);
    return simpleTaskExecutor;
}
concurrencyLimit will be 1 if you wish only one client running at a time.
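To tie it together, a rough sketch of the master step wiring the partitioner, grid size and task executor; it assumes a stepBuilderFactory-based Java config and a slaveStep bean, neither of which is spelled out in the original answer:
// Hedged sketch: master step that fans out one slave step execution per partition (client)
@Bean
public Step masterStep(StepBuilderFactory stepBuilderFactory,
                       Step slaveStep,
                       TaskExecutor taskExecutor) {
    return stepBuilderFactory.get("masterStep")
            .partitioner("slaveStep", new PartitionByClient())
            .step(slaveStep)
            .gridSize(concurrencyLimit) // placeholder; gridSize is a hint passed to the partitioner
            .taskExecutor(taskExecutor)
            .build();
}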

Adding partitions to Hive from a MapReduce Job

I am new to Hive and MapReduce and would really appreciate your answer; please also suggest the right approach.
I have defined an external table logs in Hive, partitioned on date and origin server, with an external location on HDFS /data/logs/. I have a MapReduce job which fetches these log files, splits them, and stores them under the folder mentioned above, like
"/data/logs/dt=2012-10-01/server01/"
"/data/logs/dt=2012-10-01/server02/"
...
...
From the MapReduce job I would like to add partitions to the table logs in Hive. I know of two approaches:
alter table command -- Too many alter table commands
adding dynamic partitions
For approach two I see only examples of INSERT OVERWRITE, which is not an option for me. Is there a way to add these new partitions to the table after the end of the job?
To do this from within a Map/Reduce job I would recommend using Apache HCatalog, which is a new project under the Hadoop umbrella.
HCatalog is really an abstraction layer on top of HDFS, so you can write your outputs in a standardized way, be it from Hive, Pig or M/R. Where this comes into the picture for you is that you can directly load data into Hive from your Map/Reduce job using the output format HCatOutputFormat. Below is an example taken from the official website.
A current code example for writing out a specific partition for (a=1,b=1) would go something like this:
Map<String, String> partitionValues = new HashMap<String, String>();
partitionValues.put("a", "1");
partitionValues.put("b", "1");
HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
HCatOutputFormat.setOutput(job, info);
And to write to multiple partitions, separate jobs will have to be kicked off with each of the above.
You can also use dynamic partitions with HCatalog, in which case you could load as many partitions as you want in the same job!
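The gist of dynamic partitioning, as a hedged sketch based on the newer OutputJobInfo API (rather than the HCatTableInfo call above, so details may differ by version), is that passing null instead of a partition map asks HCatalog to derive the partition values from the records themselves:
// Hedged sketch: null partition values => dynamic partitioning, values taken from each record
OutputJobInfo jobInfo = OutputJobInfo.create(dbName, tblName, null);
HCatOutputFormat.setOutput(job, jobInfo);
HCatOutputFormat.setSchema(job, HCatOutputFormat.getTableSchema(job.getConfiguration()));
job.setOutputFormatClass(HCatOutputFormat.class);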
I recommend reading further on HCatalog on the website provided above, which should give you more details if needed.
In reality, things are a little more complicated than that, which is unfortunate because it is undocumented in official sources (as of now), and it takes a few days of frustration to figure out.
I've found that I need to do the following to get HCatalog Mapreduce jobs to work with writing to dynamic partitions:
In the record-writing phase of my job (usually the reducer), I have to manually add my dynamic partitions (HCatFieldSchema) to my HCatSchema objects.
The trouble is that HCatOutputFormat.getTableSchema(config) does not actually return the partition fields; they need to be added manually:
HCatFieldSchema hfs1 = new HCatFieldSchema("date", Type.STRING, null);
HCatFieldSchema hfs2 = new HCatFieldSchema("some_partition", Type.STRING, null);
schema.append(hfs1);
schema.append(hfs2);
Here's the code for writing into multiple tables with dynamic partitioning in one job using HCatalog; the code has been tested on Hadoop 2.5.0 and Hive 0.13.1:
// ... Job setup, InputFormatClass, etc ...
String dbName = null;
String[] tables = {"table0", "table1"};
job.setOutputFormatClass(MultiOutputFormat.class);
MultiOutputFormat.JobConfigurer configurer = MultiOutputFormat.createConfigurer(job);
List<String> partitions = new ArrayList<String>();
partitions.add(0, "partition0");
partitions.add(1, "partition1");
HCatFieldSchema partition0 = new HCatFieldSchema("partition0", TypeInfoFactory.stringTypeInfo, null);
HCatFieldSchema partition1 = new HCatFieldSchema("partition1", TypeInfoFactory.stringTypeInfo, null);
for (String table : tables) {
configurer.addOutputFormat(table, HCatOutputFormat.class, BytesWritable.class, HCatRecord.class);
OutputJobInfo outputJobInfo = OutputJobInfo.create(dbName, table, null);
outputJobInfo.setDynamicPartitioningKeys(partitions);
HCatOutputFormat.setOutput(
configurer.getJob(table), outputJobInfo
);
HCatSchema schema = HCatOutputFormat.getTableSchema(configurer.getJob(table).getConfiguration());
schema.append(partition0);
schema.append(partition1);
HCatOutputFormat.setSchema(
configurer.getJob(table),
schema
);
}
configurer.configure();
return job.waitForCompletion(true) ? 0 : 1;
Mapper:
public static class MyMapper extends Mapper<LongWritable, Text, BytesWritable, HCatRecord> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        HCatRecord record = new DefaultHCatRecord(3); // including partitions
        record.set(0, value.toString());
        // partitions must be set after non-partition fields
        record.set(1, "0"); // partition0=0
        record.set(2, "1"); // partition1=1
        MultiOutputFormat.write("table0", null, record, context);
        MultiOutputFormat.write("table1", null, record, context);
    }
}

Hbase - Hadoop : TableInputFormat extension

Using an HBase table as my input, whose keys I have pre-processed so that they consist of a number concatenated with the respective row ID, I want to be sure that all rows whose keys start with the same number will be processed by the same mapper in an M/R job. I am aware that this could be achieved by extending TableInputFormat, and I have seen one or two posts about extending this class, but I am looking for the most efficient way to do this in particular.
If anyone has any ideas, please let me know.
You can use a PrefixFilter in your scan.
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PrefixFilter.html
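For example, a hedged sketch of wiring the prefix filter into the scan you hand to TableMapReduceUtil; the table name, prefix variable and mapper class are placeholders:
// Hedged sketch: restrict the scan to keys starting with a given prefix
Scan scan = new Scan();
scan.setFilter(new PrefixFilter(Bytes.toBytes(thePrefix)));
scan.setCaching(500);
scan.setCacheBlocks(false); // recommended for MapReduce scans
TableMapReduceUtil.initTableMapperJob(
        "my_table",          // placeholder table name
        scan,
        MyTableMapper.class, // placeholder mapper extending TableMapper
        Text.class,
        IntWritable.class,
        job);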
And parallelize the launch of your different jobs (one per prefix) using a Future:
final Future<Boolean> newJobFuture = executor.submit(new Callable<Boolean>() {
    @Override
    public Boolean call() throws Exception {
        Job mapReduceJob = MyJobBuilder.createJob(args, thePrefix,
                ...);
        return mapReduceJob.waitForCompletion(true);
    }
});
But I believe what you are looking for is really more the job of a reducer.

Hadoop: Map Reduce: read from HBase, but filter rows by content of one column

I am really new to Hadoop and I am not able to find an answer to my question. I want to write a MapReduce job where I read from HBase and then write to a simple text file.
In HBase, I've got a column representing an id. Now I don't want to work on all rows contained in my HBase table, but only on those between a minId and a maxId.
I found out that I could possibly use filters (scan.setFilter), so that I can filter out rows which don't match my request.
This is my first MapReduce job, so please be patient :-)
I've got a starter class, where I configure the job and the Scan object and then start the job.
Now, my first try looks like this:
private Scan getScan()
{
final Scan scan = new Scan();
// ** FILTER **
List<Filter> filters = new ArrayList<Filter>();
Filter filter1 = new ValueFilter(CompareFilter.CompareOp.GREATER_OR_EQUAL, new BinaryComparator(Bytes.toBytes(Integer.parseInt(minId))));
filters.add(filter1);
Filter filter2 = new ValueFilter(CompareFilter.CompareOp.LESS_OR_EQUAL, new BinaryComparator(Bytes.toBytes(Integer.parseInt(maxId))));
filters.add(filter2);
FilterList filterList = new FilterList(filters);
scan.setFilter(filterList);
scan.setCaching(500);
scan.setCacheBlocks(false);
// id
scan.addColumn("columnfamily".getBytes(), "id".getBytes());
return scan;
}
Well, I'm not sure if this is the right way to do it. I also read that I could maybe pass my minId and maxId to the map job with the Configuration object, but I'm not sure how.
Besides, what do I have to do afterwards? I would normally just initiate the job with initTableMapperJob and pass the Scan object to it. I've read something about ResultScanner and the like; do I need them? I thought the MapReduce framework would then automatically pass the correct rows to my map function, is that correct?
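As an aside, a sketch based on standard Hadoop/HBase APIs (not something from the question) of how the ids could be passed through the Configuration and how the filtered Scan is handed to initTableMapperJob; the property names, table name and mapper class are placeholders. The framework then feeds only the rows that pass the scan's filters to the map() calls, so no ResultScanner is needed in the mapper.
// Hedged sketch: pass minId/maxId via the Configuration and wire the filtered Scan into the job
Configuration conf = HBaseConfiguration.create();
conf.set("my.min.id", minId);   // placeholder property names
conf.set("my.max.id", maxId);
Job job = Job.getInstance(conf, "hbase-filter-job");
TableMapReduceUtil.initTableMapperJob("mytable", getScan(), MyTableMapper.class,
        Text.class, Text.class, job);
// Inside the mapper, the values can be read back in setup():
//   String minId = context.getConfiguration().get("my.min.id");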

Resources