I have a small SQLite database (postcode -> US city name) and a big S3 file of users. I would like to map every user to the city name associated with their postcode.
I am following the well-known WordCount.java example, but I'm not sure how MapReduce works internally:
Is my mapper created once per S3 input file?
Should I connect to the SQLite database when the mapper is created? Should I do so in the mapper's constructor?
MapReduce is a framework for writing applications that process big data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job typically runs on top of HDFS (Hadoop Distributed File System) in two phases, the map phase and the reduce phase.
To answer your first question, "Is my mapper created once per S3 input file?":
The number of mappers equals the number of input splits, and by default one split is created per block.
A high-level overview looks something like this:
input file -> InputFormat -> Splits -> RecordReader -> Mapper -> Partitioner -> Shuffle & Sort -> Reducer -> final output
Example:
Your input files are server1.log, server2.log and server3.log.
The InputFormat creates a number of splits based on the block size (by default).
One Mapper is allocated to work on each split.
A RecordReader sits between the split and the Mapper and reads the records (lines) from the split.
Then the Partitioner runs.
After the Partitioner, the Shuffle & Sort phase starts.
Reducer.
Final output.
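To make the "one mapper per split" rule concrete, here is a small sketch of the arithmetic. The split-size rule (max of the minimum size and the smaller of maximum size and block size) is the one FileInputFormat uses; the class and method names are mine, and the count is only an estimate because Hadoop tolerates a slightly oversized last split.

```java
// Sketch only: estimates how many splits (and therefore map tasks) one file produces.
// SplitEstimate and its method names are hypothetical, not part of Hadoop.
final class SplitEstimate {
    private SplitEstimate() {}

    /** Split size = max(minSize, min(maxSize, blockSize)). */
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Approximate number of splits for a non-empty file (ceiling division). */
    static long estimateSplits(long fileSize, long splitSize) {
        if (fileSize <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (fileSize + splitSize - 1) / splitSize;
    }
}
```

For example, a 1 GiB file with a 128 MiB split size gives an estimate of 8 map tasks, while a 100 MiB file gives 1.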
Answer to your second question:
Below are the three standard lifecycle methods of Mapper.
@Override
protected void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context)
        throws IOException, InterruptedException {
    // Called once per input record: filter or transform your data here
}
@Override
protected void setup(Mapper<Object, Text, Text, IntWritable>.Context context)
        throws IOException, InterruptedException {
    System.out.println("called only once, when the task starts");
}
@Override
protected void cleanup(Mapper<Object, Text, Text, IntWritable>.Context context)
        throws IOException, InterruptedException {
    System.out.println("called only once, when the task ends");
}
1) A mapper is created once per split, which is usually 128 MB or 256 MB.
You can configure the split size with these parameters: mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize. If the input file is smaller than the split size, it all goes to one map task.
2) You can use the setup and cleanup methods to manage resources for the task. setup is called once when the task starts and cleanup is called once at the end. So you can open your database connection in setup (and probably not just connect, but load all cities into memory for performance) and close the connection in cleanup (if you decided to keep the connection open instead of loading the data).
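As a rough illustration of point 2, the sketch below models only the setup/map/cleanup lifecycle in plain Java, without Hadoop or SQLite on the classpath. The class name, the "userId,postcode" record layout and the "UNKNOWN" fallback are assumptions of mine; in a real job the table would be read from the SQLite file inside Mapper.setup.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a Hadoop Mapper: only the lifecycle is modeled here.
final class CityLookupTask {
    private Map<String, String> cityByPostcode;

    /** Called once per task: copy the whole lookup table into memory. */
    void setup(Map<String, String> source) {
        cityByPostcode = new HashMap<>(source);
    }

    /** Called once per record; the "userId,postcode" layout is an assumption. */
    String map(String userRecord) {
        String[] fields = userRecord.split(",", -1);
        if (fields.length < 2) {
            return null; // malformed record, skipped
        }
        String city = cityByPostcode.getOrDefault(fields[1].trim(), "UNKNOWN");
        return fields[0].trim() + "\t" + city;
    }

    /** Called once per task: release whatever setup acquired. */
    void cleanup() {
        cityByPostcode = null;
    }
}
```

Because the table is loaded once per task, the per-record work is a single in-memory lookup rather than a database query.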
I am using the Processor API to delete messages from a state store. The delete works; I confirmed it with an interactive query against the state store by Kafka key. However, it does not reduce the size of the Kafka Streams files on local disk under the directory tmp/kafka-streams.
@Override
public void init(ProcessorContext processorContext) {
    this.processorContext = processorContext;
    processorContext.schedule(Duration.ofSeconds(10), PunctuationType.STREAM_TIME, new Punctuator() {
        @Override
        public void punctuate(long l) {
            processorContext.commit();
        }
    }); // invoke punctuate every 10 seconds of stream time
    this.statestore = (KeyValueStore<String, GenericRecord>) processorContext.getStateStore(StateStoreEnum.HEADER.getStateStore());
    log.info("Processor initialized");
}
@Override
public void process(String key, GenericRecord value) {
    statestore.all().forEachRemaining(keyValue -> {
        statestore.delete(keyValue.key);
    });
}
kafka streams directory size
2.3M /private/tmp/kafka-streams
3.3M /private/tmp/kafka-streams
Do I need any specific configuration to keep the file size under control? If it doesn't work this way, is it okay to delete the kafka-streams directory? I assume it should be safe, since such a delete removes the record from both the state store and the changelog topic.
RocksDB does file compaction in the background. Hence, if you need more aggressive compaction, you should pass in a custom RocksDBConfigSetter via the Streams config parameter rocksdb.config.setter. For more details about RocksDB, check out the RocksDB documentation.
https://docs.confluent.io/current/streams/developer-guide/config-streams.html#rocksdb-config-setter
However, I would not recommend changing RocksDB configs as long as there is no real issue; you can do more harm than good. Your store seems to be quite small, so I don't see a real problem at the moment.
By the way: if you go to production, you should change the state.dir config to an appropriate directory where the state is not lost when the machine restarts. If you keep state in the default /tmp location, it is most likely gone after a restart, and an expensive recovery from the changelog topics would be triggered.
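For reference, the two settings mentioned above are ordinary Streams properties. The sketch below only builds a java.util.Properties object with the state.dir and rocksdb.config.setter keys; the helper class is hypothetical, and whatever directory or setter class name you pass in is your own choice, not something Kafka provides.

```java
import java.util.Properties;

// Hypothetical helper: collects the state-related Kafka Streams settings discussed above.
final class StreamsStateConfig {
    private StreamsStateConfig() {}

    static Properties build(String stateDir, String rocksDbSetterClass) {
        Properties props = new Properties();
        // Keep state outside /tmp so it survives a machine restart.
        props.setProperty("state.dir", stateDir);
        // Optional: only set a custom RocksDBConfigSetter when there is a real issue.
        if (rocksDbSetterClass != null) {
            props.setProperty("rocksdb.config.setter", rocksDbSetterClass);
        }
        return props;
    }
}
```

These properties would be merged with the rest of the application's Streams configuration (application.id, bootstrap.servers and so on) before the KafkaStreams instance is created.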
I have a business case that requires merging multiple CSV files (1000+ files, each containing about 1000 records) into a single CSV file using Spring Batch.
Please give me your guidance and solutions, in terms of both approach and performance.
So far, I have tried two approaches.
Approach 1:
A chunk-oriented step with MultiResourceItemReader to read the files from the directory and
FlatFileItemWriter as the item writer.
The issue here is that processing is very slow because it is single-threaded, but the approach works as expected.
Approach 2:
Using MultiResourcePartitioner as the partitioner and an AsyncTaskExecutor as the task executor.
The issue here is that, since it is asynchronous and multi-threaded, data gets overwritten or corrupted while merging into the final single file.
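You can wrap your FlatFileItemWriter in an AsyncItemWriter and use it together with an AsyncItemProcessor. The AsyncItemProcessor runs the processing on several threads, while the AsyncItemWriter unwraps the results and hands them to the single delegate writer, so the output file is not corrupted and performance improves.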
@Bean
public AsyncItemWriter<Customer> asyncItemWriter() throws Exception {
    AsyncItemWriter<Customer> asyncItemWriter = new AsyncItemWriter<>();
    asyncItemWriter.setDelegate(flatFileItemWriter);
    asyncItemWriter.afterPropertiesSet();
    return asyncItemWriter;
}
@Bean
public AsyncItemProcessor<Customer, Customer> asyncItemProcessor() throws Exception {
    AsyncItemProcessor<Customer, Customer> asyncItemProcessor = new AsyncItemProcessor<>();
    asyncItemProcessor.setDelegate(itemProcessor());
    asyncItemProcessor.setTaskExecutor(threadPoolTaskExecutor());
    asyncItemProcessor.afterPropertiesSet();
    return asyncItemProcessor;
}
@Bean
public TaskExecutor threadPoolTaskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(10);
    executor.setMaxPoolSize(10);
    executor.setThreadNamePrefix("default_task_executor_thread");
    executor.initialize();
    return executor;
}
Since the headers are the same in your source and destination files, I wouldn't recommend using the readers provided by Spring Batch to convert lines into specific beans. Column-level information is not needed, and since CSV is a text format you can work with whole lines without breaking them into fields.
Also, partitioning per file is going to be very slow if you have that many files. You should first fix the number of partitions (say 10 or 20) and group your files into that many partitions. Secondly, file writing is a disk-bound operation rather than a CPU-bound one, so multi-threading won't help much.
What I suggest instead is to write a custom reader and writer in plain Java, along the lines suggested in this answer, where your reader returns a List<String> and your writer receives a List<List<String>> that you can write to the file.
If you have enough memory to hold the lines from all files at once, you can read all files in one go and keep returning chunk_size items. Otherwise, reading a small set of files at a time until the chunk size limit is reached should be good enough. Your reader returns null when there are no more files to read.
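A minimal sketch of such a reader in plain Java is shown below. It is not a Spring Batch ItemReader; the class name and the chunk-based read() contract are my own assumptions, and header handling (for example skipping the header line of every file but the first) is left out for brevity.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical line-level reader: walks the files in order and hands back raw lines in chunks.
final class ChunkedLineReader implements AutoCloseable {
    private final Iterator<Path> files;
    private final int chunkSize;
    private BufferedReader current;

    ChunkedLineReader(List<Path> files, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.files = new ArrayList<>(files).iterator();
        this.chunkSize = chunkSize;
    }

    /** Returns up to chunkSize lines, or null once every file is exhausted. */
    List<String> read() throws IOException {
        List<String> chunk = new ArrayList<>(chunkSize);
        while (chunk.size() < chunkSize) {
            if (current == null) {
                if (!files.hasNext()) {
                    break; // no more files
                }
                current = Files.newBufferedReader(files.next(), StandardCharsets.UTF_8);
            }
            String line = current.readLine();
            if (line == null) {
                current.close(); // end of this file, move on to the next one
                current = null;
            } else {
                chunk.add(line);
            }
        }
        return chunk.isEmpty() ? null : chunk;
    }

    @Override
    public void close() throws IOException {
        if (current != null) {
            current.close();
            current = null;
        }
    }
}
```

A single writer thread can then append each returned chunk to the output file, which keeps the merged file consistent without any locking.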
I ran into an interesting situation, and now I am looking for a way to reproduce it intentionally. On my local single-node setup, I ran two jobs simultaneously from the terminal. Both jobs use the same reducer; they differ only in the map function (the aggregation key, i.e. the group by). The output of both jobs was written to the output of the first job (the second job did create its own folder, but it was empty). I am working on providing rollup aggregations across various levels, and this behavior fascinates me: the aggregation output from two different levels is available in one single file (and perfectly sorted).
My question is how to achieve the same on a real Hadoop cluster with multiple data nodes. That is, I programmatically start multiple jobs that all read the same input file, map the data differently, and use the same reducer, and the output ends up in one single file instead of 5 different output files.
Please advise.
I looked at merging the output files after the reduce phase before I decided to ask my question.
Thanks and Kind regards,
Moiz Ahmed.
When different Mappers consume the same input file, in other words the same data structure, the source code of all these mappers can be placed into separate methods of a single Mapper implementation, with a parameter from the context deciding which map functions to invoke. On the plus side, you only need to start one MapReduce job. The example is pseudocode:
class ComplexMapper extends Mapper {
    protected BitSet mappingBitmap = new BitSet();

    protected void setup(Context context) ... {
        String params = context.getConfiguration().get("params");
        // analyze params and set the corresponding bits in mappingBitmap
    }
    protected void mapA(Object key, Object value, Context context) {
        .....
        context.write(keyA, value);
    }
    protected void mapB(Object key, Object value, Context context) {
        .....
        context.write(keyB, value);
    }
    protected void mapC(Object key, Object value, Context context) {
        .....
        context.write(keyC, value);
    }
    public void map(Object key, Object value, Context context) ..... {
        if (mappingBitmap.get(1)) {
            mapA(key, value, context);
        }
        if (mappingBitmap.get(2)) {
            mapB(key, value, context);
        }
        if (mappingBitmap.get(3)) {
            mapC(key, value, context);
        }
    }
}
Of course, it can be implemented more elegantly with interfaces etc.
In the job setup just add:
Configuration conf = new Configuration();
conf.set("params", "AB");
Job job = new Job(conf);
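To show how the "params" string could drive the dispatch, here is a small runnable sketch in plain Java without Hadoop. The letter-to-bit mapping ('A' to bit 1, 'B' to bit 2, 'C' to bit 3), the "country,city,amount" record layout and the level prefix on the emitted keys are all assumptions of mine, chosen to match the bit numbers used in the pseudocode.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical dispatch helper: one pass over a record, one emitted line per enabled level.
final class FlagDispatch {
    private FlagDispatch() {}

    /** Turns a string such as "AB" into bits: 'A' -> bit 1, 'B' -> bit 2, 'C' -> bit 3. */
    static BitSet parse(String params) {
        BitSet bits = new BitSet();
        for (char c : params.toCharArray()) {
            if (c < 'A' || c > 'C') {
                throw new IllegalArgumentException("unexpected flag: " + c);
            }
            bits.set(c - 'A' + 1);
        }
        return bits;
    }

    /** Emits "level:key<TAB>amount" for each enabled aggregation level. */
    static List<String> map(BitSet enabled, String record) {
        String[] f = record.split(",");
        if (f.length < 3) {
            throw new IllegalArgumentException("expected country,city,amount");
        }
        List<String> out = new ArrayList<>();
        if (enabled.get(1)) {
            out.add("A:" + f[0] + "\t" + f[2]);              // group by country
        }
        if (enabled.get(2)) {
            out.add("B:" + f[0] + "/" + f[1] + "\t" + f[2]); // group by country and city
        }
        if (enabled.get(3)) {
            out.add("C:ALL\t" + f[2]);                       // grand total
        }
        return out;
    }
}
```

Prefixing each key with its level is one way to let a single reducer keep the rollup levels apart in the shared output.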
As Praveen Sripati mentioned, having a single output file forces you to use just one Reducer, which might be bad for performance. You can always concatenate the part-* files when you download them from HDFS. Example:
hadoop fs -text /output_dir/part* > wholefile.txt
Usually each reduce task produces a separate file in HDFS, so that the reduce tasks can operate in parallel. If the requirement is to have one output file from the reduce phase, configure the job to have a single reduce task. The number of reducers can be configured using the mapred.reduce.tasks property, which defaults to 1. The drawback of this approach is that the single reducer might become a bottleneck for the job.
Another option is to use some other output format that allows multiple reducers to write to the same sink simultaneously, such as DBOutputFormat. Once the job is complete, the results can be exported from the DB into a flat file. This approach allows multiple reduce tasks to run in parallel.
Another option is to merge the output files, as mentioned in the OP. So, based on the pros and cons of each approach and the volume of data to be processed, one of the approaches can be chosen.
I forgot which API/method to call, but my problem is this:
My mapper will run for more than 10 minutes, and I don't want to increase the default timeout.
Instead, I want my mapper to send an update ping to the task tracker when it is in the particular code path that takes longer than 10 minutes.
Please let me know which API/method to call.
You can simply increment a counter and call progress. This ensures that the task sends a heartbeat back to the TaskTracker, so that it knows the task is still alive.
In the new API this is managed through the context, see here: http://hadoop.apache.org/common/docs/r1.0.0/api/index.html
E.g.
@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // increment counter
    context.getCounter(SOME_ENUM).increment(1);
    context.progress();
}
In the old API there is the Reporter class:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html
You typically use the Reporter to let the framework know you're still alive.
Quote from the javadoc:
Mapper and Reducer can use the Reporter provided to report progress or
just indicate that they are alive. In scenarios where the application
takes an insignificant amount of time to process individual key/value
pairs, this is crucial since the framework might assume that the task
has timed-out and kill that task.
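If the slow code path is a long loop, you probably don't want to report on every iteration. A minimal sketch of reporting every N items is below; the progress callback is a plain Runnable standing in for something like context::progress or reporter::progress, and the class and parameter names are made up for illustration.

```java
// Hypothetical helper: runs a long loop and pings a progress callback every reportEvery items.
final class ProgressReportingLoop {
    private ProgressReportingLoop() {}

    /** Returns how many times the callback was invoked. */
    static long process(long items, long reportEvery, Runnable progress) {
        if (reportEvery <= 0) {
            throw new IllegalArgumentException("reportEvery must be positive");
        }
        long reports = 0;
        for (long i = 1; i <= items; i++) {
            // ... the expensive per-item work would go here ...
            if (i % reportEvery == 0) {
                progress.run(); // tell the framework the task is still alive
                reports++;
            }
        }
        return reports;
    }
}
```

Pick reportEvery so that the interval between two calls stays well below the task timeout.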
I am trying to send two files to a Hadoop reducer.
I tried DistributedCache, but nothing I add with addCacheFile in main seems to be returned by getLocalCacheFiles in the mapper.
Right now I am using FileSystem to read the file, but I am running locally, so I can just pass the name of the file. I am wondering how to do this on a real Hadoop cluster.
Is there any way to send values to the mapper other than the file that it is reading?
I also had a lot of problems with the distributed cache and with sending parameters. The options that worked for me are below.
For distributed cache usage:
For me it was a nightmare to get the URL/path of the file on HDFS in Map or Reduce, but it worked with a symlink.
In the run() method of the job:
DistributedCache.addCacheFile(new URI(file+"#rules.dat"), conf);
DistributedCache.createSymlink(conf);
Then read the file in Map or Reduce.
Declare a field in the class, before the methods:
public static FSDataInputStream rules;
and then in the setup() method of Map or Reduce:
rules = FileSystem.get(new Configuration()).open(new Path("rules.dat"));
For parameters:
Send some values to Map or Reduce (could be a filename to open from HDFS):
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
...
conf.set("level", otherArgs[2]); //sets variable level from command line, it could be a filename
...
}
then in Map or Reduce class just:
int level = Integer.parseInt(conf.get("level")); //this is int, but you can read also strings, etc.
If the distributed cache suits your needs, it is the way to go.
getLocalCacheFiles works differently in local mode and in distributed mode (it actually does not work in local mode).
Look into this link: http://developer.yahoo.com/hadoop/tutorial/module5.html
and look for the phrase "As a cautionary note:".