Query related to Hadoop's map-reduce - hadoop

Scenario:
I have a subset of a database and a data warehouse, and I have brought both onto HDFS.
I want to analyse results based on the subset and the data warehouse.
(In short, for each record in the subset I have to scan every record in the data warehouse.)
Question:
I want to do this task using MapReduce. I do not understand how to take both files as input to the mapper, or how to handle both files in the map phase.
Please suggest some ideas for how I can do this.

Check Section 3.5 (Relational Joins) in Data-Intensive Text Processing with MapReduce for map-side joins, reduce-side joins and memory-backed joins. In any case, the MultipleInputs class is used to have multiple mappers process different files in a single job.
FYI, you could use Apache Sqoop to import the DB into HDFS.

Some time ago I wrote a Hadoop MapReduce job for one of my classes. I was scanning several IMDb databases and producing merged information about actors (basically the name, the biography and the films each actor appeared in were in different databases). I think you can use the same approach I used for my homework:
I wrote a separate MapReduce job for each database file to convert them all to the same format, placing a two-letter prefix in front of every output row to tell the sources apart: 'BI' (biography), 'MV' (movies) and so on. Then I used all these produced files as input for my final MapReduce job, which processed them and grouped them in the desired way.
I am not even sure that you need so much work if you are really going to scan every line of the data warehouse. Maybe in that case you can just do the scan in either the map or the reduce phase (depending on what additional processing you want to do), but my suggestion assumes that you actually need to filter the data warehouse based on the subset. If the latter, my suggestion might work for you.
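The prefix-tagging approach can be sketched in plain Java, without any Hadoop classes. This is only a simulation under stated assumptions: the tab-separated "TAG, key, payload" line layout, the class name TaggedJoin and the output format are hypothetical; only the 'BI' and 'MV' tags come from the answer above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join over tagged records.
// Assumption: every line looks like "TAG<TAB>key<TAB>payload".
class TaggedJoin {

    // "Map + shuffle": re-key each record by its join key and keep the
    // source tag together with the payload.
    static Map<String, List<String>> shuffle(List<String> taggedLines) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String line : taggedLines) {
            String[] parts = line.split("\t", 3);
            String tag = parts[0], key = parts[1], payload = parts[2];
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(tag + ":" + payload);
        }
        return groups;
    }

    // "Reduce": for one key, merge the biography with all movies.
    static String reduce(String key, List<String> values) {
        String bio = "";
        List<String> movies = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("BI:")) {
                bio = v.substring(3);
            } else if (v.startsWith("MV:")) {
                movies.add(v.substring(3));
            }
        }
        return key + " | " + bio + " | " + String.join(", ", movies);
    }

    // Run the whole simulated job and return one merged line per key.
    static List<String> join(List<String> taggedLines) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffle(taggedLines).entrySet()) {
            out.add(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "BI\tactor1\tborn 1950",
            "MV\tactor1\tFilm A",
            "MV\tactor2\tFilm B",
            "MV\tactor1\tFilm C");
        join(lines).forEach(System.out::println);
    }
}
```

In a real job the shuffle step is done by the framework; the tag is what lets a single reducer tell which source each value came from.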

Related

Processing HUGE number of small files independently

The task is to process a HUGE number (around 10,000,000) of small files (each around 1MB) independently (i.e. the result of processing file F1 is independent of the result of processing F2).
Someone suggested Map-Reduce (on Amazon-EMR Hadoop) for my task. However, I have serious doubts about MR.
The reason is that the files in my case are processed independently. As far as I understand MR, it works best when the output depends on many individual files (for example, counting the frequency of each word across many documents, since a word might appear in any document in the input). But in my case, I just need a lot of independent CPUs/cores.
I was wondering if you have any advice on this.
Side note: there is another issue, which is that MR works best for "huge files rather than a huge number of small files". There seem to be solutions for that, though, so I am ignoring it for now.
It is possible to use MapReduce for your needs. MapReduce has two phases, Map and Reduce, but the reduce phase is not mandatory. For your situation you could write a map-only MapReduce job and put all the calculations on a single file into a customised Map function.
However, I haven't processed such a huge number of files in a single job, so I have no idea how efficient it is. Try it yourself, and share with us :)
This is quite easy to do. In such cases, the data for the MR job is typically the list of files (and not the files themselves). So the size of the data submitted to Hadoop is the size of 10M file names, which is on the order of a couple of gigs at most.
One uses MR to split up the list of files into smaller fragments (how many can be controlled by various options). Then each mapper gets a list of files. It can process one file at a time and generate the output.
(FWIW, I would suggest Qubole (where I am a founder) instead of EMR, because it would save you a ton of money with auto-scaling and spot integration.)
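The idea of feeding the job a list of file names, splitting that list into fragments and processing each file independently can be simulated in plain Java. This is a rough sketch, not Hadoop code: FileListSplitter, processFile and the ":done" result are hypothetical placeholders, and a parallel stream stands in for the mappers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java simulation of a map-only job whose input is a list of file names.
class FileListSplitter {

    // Split the list into at most `fragments` contiguous chunks
    // (assumes fragments > 0).
    static List<List<String>> split(List<String> fileNames, int fragments) {
        List<List<String>> chunks = new ArrayList<>();
        int chunkSize = (fileNames.size() + fragments - 1) / fragments; // ceiling division
        for (int start = 0; start < fileNames.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, fileNames.size());
            chunks.add(new ArrayList<>(fileNames.subList(start, end)));
        }
        return chunks;
    }

    // Hypothetical per-file work; a real mapper would open and process the file.
    static String processFile(String name) {
        return name + ":done";
    }

    // Each chunk plays the role of one mapper's input; there is no reduce step.
    static List<String> run(List<String> fileNames, int fragments) {
        return split(fileNames, fragments).parallelStream()
            .flatMap(chunk -> chunk.stream().map(FileListSplitter::processFile))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("f1", "f2", "f3", "f4", "f5");
        System.out.println(split(names, 2));
        System.out.println(run(names, 2));
    }
}
```

Because each file's result is independent, no data has to move between chunks, which is why the reduce phase can be dropped.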

MapReduce for same task/different data

We have a system that is made up of multiple PostgreSQL databases. Each database has the same tables, i.e., schema, but only carries a share of the data (and not the full data!). The reason for distributing the data is that our customers run queries that are rather complex and perform up to 100 calculations per row.
By distributing the data to multiple databases, we want to lower the amount of work processed by each database, and ultimately speed up search. At the end, we combine the results of each database to create the final results.
A friend of mine has recommended looking at MapReduce (Hadoop). In my opinion, map-reduce only makes sense if the individual workers share the same data but perform different types of work on it (which corresponds to multiple instruction, single data).
In our case, however, the workers should perform the same task, but perform that task on various data (corresponds to single instruction, multiple data).
Does MapReduce (Hadoop) make sense for the paradigm same task executed on different data?
Does MapReduce (Hadoop) make sense for the paradigm same task executed on different data?
Yes.
I think you have a misconception about Hadoop and MapReduce. A MapReduce job does indeed work on the same type of data (i.e., "same tables"), but different segments of that data. The parallel Map and Reduce tasks are the same tasks over different portions of the data. MapReduce is most definitely "single instruction, multiple data" from your definition.
Hadoop is by no means a drop-in replacement for a SQL database. They do different things in different ways. Here are some other things to note:
Note that MapReduce is only really going to do batch analytics for you: things like rollups, counts and aggregates. You won't be able to retrieve or search with MapReduce effectively. Also, updating data in Hadoop is not something you typically want to do; you treat the data as more or less "append only". For any of that, you'll probably want to look at HBase.
Hadoop's file system segments the data for you. From a file system perspective, it'll look like files in folders that contain CSV (or some other file format). Files get split up into blocks, which can then be operated on separately with map tasks. You won't have to manually shard the data like you are now.
Take a look at Hive. It's an abstraction layer on top of MapReduce that translates a light version of SQL into MapReduce jobs under the covers. It should let you convert some of your logic a bit more easily.

How to design Hadoop job to match fields from one file to another

I have two different files which each contain different data. I would like to do some processing with these files then merge the data together based on matching keys. What is the best way to implement this in Hadoop? I was thinking of somehow creating two mappers that would each process one file then a reducer to combine the data? I'm not sure if this is even possible. Does anyone have any suggestion as to how I can combine data from two files in Hadoop?
There are many ways to write a map/reduce job (Hive, Pig, Cascading, Java etc.), but essentially a join is a multi-input job where the mappers emit records in the (key_to_join_by, rest_of_data) format and the reducer does the actual join (unless one of the files is small enough to hold in memory, in which case you can do the join in the mapper).
You can see an example of how to do this in Pig here
Can you give examples of your files? It is not clear what you are asking. Are you talking about doing joins in Hadoop? If so, you will need two mapper classes. Or you can use Hive, which makes performing joins easier. Please look at this for examples of both possible solutions: Joins in Hadoop
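For the case where one file is small enough to hold in memory, the join can happen in the mapper itself. The following is a minimal plain-Java sketch of that idea, not Hadoop code; the "key,rest" line layout, the class name MapSideJoin and the sample data are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of a map-side (in-memory) join.
// Assumption: every line in both inputs looks like "key,rest".
class MapSideJoin {

    // Setup step: load the small file into a lookup table keyed by join key.
    static Map<String, String> loadSmall(List<String> smallLines) {
        Map<String, String> lookup = new HashMap<>();
        for (String line : smallLines) {
            String[] parts = line.split(",", 2);
            lookup.put(parts[0], parts[1]);
        }
        return lookup;
    }

    // Map step: stream the large file and join each record against the table.
    // Records without a match are dropped (inner join).
    static List<String> join(List<String> smallLines, List<String> largeLines) {
        Map<String, String> lookup = loadSmall(smallLines);
        List<String> joined = new ArrayList<>();
        for (String line : largeLines) {
            String[] parts = line.split(",", 2);
            String match = lookup.get(parts[0]);
            if (match != null) {
                joined.add(parts[0] + "," + match + "," + parts[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> small = Arrays.asList("1,alice", "2,bob");
        List<String> large = Arrays.asList("1,order-a", "3,order-b", "2,order-c", "1,order-d");
        join(small, large).forEach(System.out::println);
    }
}
```

No shuffle or reducer is involved here, which is the main attraction of this variant when the small side really does fit in memory.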

Is Hadoop the right tech for this?

If I had millions of records that are constantly being updated and added to every day, and I needed to comb through all of the data for records that match specific logic, then take that matching subset and insert it into a separate database, would I use Hadoop and MapReduce for such a task, or is there some other technology I am missing? The main reason I am looking for something other than a standard RDBMS is that all of the base data is from multiple sources and not uniformly structured.
Map-Reduce is designed for algorithms that can be parallelized and local results can be computed and aggregated. A typical example would be counting words in a document. You can split this up into multiple parts where you count some of the words on one node, some on another node, etc and then add up the totals (obviously this is a trivial example, but illustrates the type of problem).
Hadoop is designed for processing large data files (such as log files). The default block size is 64MB (128MB from Hadoop 2 onwards), so having millions of small records wouldn't really be a good fit for Hadoop.
To deal with the non-uniformly structured data, you might consider a NoSQL database (such as MongoDB), which is designed to handle data where a lot of the columns are null.
Hadoop/MR is designed for batch processing, not real-time processing, so an alternative such as Twitter Storm or HStreaming has to be considered.
Also, look at Hama for real-time processing of data. Note that real-time processing in Hama is still crude and needs a lot of improvement.
I would recommend Storm or Flume. In either of these you may analyze each record as it comes in and decide what to do with it.
If your data volumes are not great, and millions of records does not sound like much, I would suggest trying to get the most from an RDBMS, even if your schema will not be properly normalized.
I think even a table of structure K1, K2, K3, Blob will be more useful t
In NoSQL, key-value stores are built to support schemaless data in various flavors, but their query capabilities are limited.
The only case I can think of as useful is the MongoDB/CouchDB capability to index schemaless data. You will be able to get records by some attribute value.
Regarding Hadoop MapReduce, I think it is not useful unless you want to harness a lot of CPUs for your processing, have a lot of data, or need distributed sort capability.

Hadoop one Map and multiple Reduce

We have a large dataset to analyze with multiple reduce functions.
All the reduce algorithms work on the same dataset generated by the same map function. Reading the large dataset every time costs too much; it would be better to read it only once and pass the mapped data to multiple reduce functions.
Can I do this with Hadoop? I've searched the examples and the intarweb but I could not find any solutions.
Maybe a simple solution would be to write a job that doesn't have a reduce function. So you would pass all the mapped data directly to the output of the job. You just set the number of reducers to zero for the job.
Then you would write a job for each different reduce function that works on that data. This would mean storing all the mapped data on the HDFS though.
Another alternative might be to combine all your reduce functions into a single Reducer which outputs to multiple files, using a different output for each function. Multiple outputs are mentioned in this article for Hadoop 0.19. I'm pretty sure that this feature is broken in the new mapreduce API released with 0.20.1, but you can still use it in the older mapred API.
Are you expecting every reducer to work on exactly the same mapped data? At the very least the "key" should be different, since it decides which reducer a record goes to.
You can write each output multiple times in the mapper, using a key that combines $i and $key (where $i is for the i-th reducer, and $key is your original key). You then need to add a "Partitioner" to make sure these n records are distributed across the reducers based on $i, and use a "GroupingComparator" to group records by the original $key.
It's possible to do that, but not in a trivial way in one MR job.
You may use composite keys. Let's say you need two kinds of reducers, 'R1' and 'R2'. Add ids for these as a prefix to your output keys in the mapper. So, in the mapper, a key 'K' now becomes 'R1:K' or 'R2:K'.
Then, in the reducer, pass values to implementations of R1 or R2 based on the prefix.
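The composite-key idea can be simulated in plain Java as follows. This is only a sketch under assumptions: it emits every record under both prefixes so that both reduce functions see all the data, and it uses "sum" for R1 and "max" for R2 purely as hypothetical stand-ins for the real reduce logic. No Hadoop classes are involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of dispatching to different reduce functions
// by a key prefix added in the mapper.
class CompositeKeyDispatch {

    // Map: emit each (key, value) once per reducer kind, prefixing the key.
    static void map(String key, int value, Map<String, List<Integer>> shuffle) {
        for (String prefix : new String[] {"R1", "R2"}) {
            shuffle.computeIfAbsent(prefix + ":" + key, k -> new ArrayList<>()).add(value);
        }
    }

    // Reduce: choose the implementation from the prefix.
    // R1 = sum and R2 = max are hypothetical examples.
    static int reduce(String compositeKey, List<Integer> values) {
        if (compositeKey.startsWith("R1:")) {
            int sum = 0;
            for (int v : values) {
                sum += v;
            }
            return sum;
        }
        if (compositeKey.startsWith("R2:")) {
            return Collections.max(values);
        }
        throw new IllegalArgumentException("Unknown reducer prefix: " + compositeKey);
    }

    // Run the simulated job over parallel arrays of keys and values.
    static Map<String, Integer> run(String[] keys, int[] values) {
        Map<String, List<Integer>> shuffle = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            map(keys[i], values[i], shuffle);
        }
        Map<String, Integer> results = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle.entrySet()) {
            results.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(run(new String[] {"x", "y", "x"}, new int[] {3, 5, 4}));
    }
}
```

The trade-off is that the map output is duplicated once per reduce function, so the shuffle volume grows with the number of prefixes.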
I guess you want to run different reducers in a chain. In Hadoop, 'multiple reducers' means running multiple instances of the same reducer. I would propose that you run one reducer at a time, providing a trivial map function for all of them except the first one. To minimize the time for data transfer, you can use compression.
Of course you can define multiple reducers. For the Job (Hadoop 0.20) just add:
job.setNumReduceTasks(<number>);
However, your infrastructure has to support multiple reducers, meaning that you have to
have more than one CPU available
adjust mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml accordingly
And of course your job has to meet some requirements. Without knowing exactly what you want to do, I can only give broad tips:
the map output keys either have to be partitionable by % numReducers OR you have to define your own partitioner:
job.setPartitionerClass(...)
for example with a random partitioner ...
the data must be reducible in the partitioned format ... (references needed?)
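To make "partitionable by % numReducers" concrete, here is a small plain-Java sketch of a modulo partitioner. It is an illustration, not Hadoop code: ModuloPartitioner and its methods are hypothetical names, and the formula mirrors the hash-modulo scheme that, as far as I know, Hadoop's default hash partitioner uses. Actual partition numbers in Hadoop depend on the key type's own hashCode.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of assigning map output keys to reducers.
class ModuloPartitioner {

    // Mask off the sign bit so the result is never negative,
    // then take the remainder by the number of reducers.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Group keys into one bucket per reducer. Equal keys always land
    // in the same bucket, which is what makes the data reducible.
    static List<List<String>> assign(List<String> keys, int numReducers) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partition(key, numReducers)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        System.out.println(assign(Arrays.asList("a", "b", "c", "d", "a"), 4));
    }
}
```

A random partitioner would break the "equal keys go to the same reducer" property, so it only suits reduce functions that do not need all values of a key in one place.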
You'll get multiple output files, one for each reducer. If you want sorted output, you have to add another job that reads all the files (multiple map tasks this time ...) and writes them sorted with only one reducer ...
Have a look at the Combiner class too, which is a local reducer. It means that you can already aggregate (reduce) in memory over the partial data emitted by map.
A very nice example is the WordCount example. Map emits each word as key and its count as 1: (word, 1). The Combiner gets partial data from map and emits (word, partial count) locally. The Reducer does exactly the same, but now some (combined) word counts are already >1. This saves bandwidth.
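The bandwidth saving from a combiner can be shown with a plain-Java simulation of WordCount. This is a sketch, not the Hadoop WordCount source: CombinerDemo and its method names are hypothetical, each input string stands for one map task's split, and the size of the shuffled list stands for the data sent over the network.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of WordCount with and without a combiner.
class CombinerDemo {

    // Map: emit (word, 1) for every word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reduce: sum the counts per word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }

    // Combine: the same summation, applied locally to one map task's output.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> pairs) {
        return new ArrayList<>(reduce(pairs).entrySet());
    }

    // Collect everything the map tasks would send to the reducer.
    static List<Map.Entry<String, Integer>> shuffle(List<String> splits, boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapped = map(split);
            shuffled.addAll(useCombiner ? combine(mapped) : mapped);
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("to be or not to be", "to do");
        System.out.println("records without combiner: " + shuffle(splits, false).size());
        System.out.println("records with combiner:    " + shuffle(splits, true).size());
        System.out.println(reduce(shuffle(splits, true)));
    }
}
```

The final counts are identical either way; only the number of records crossing from map to reduce changes. This works because summation is associative and commutative, which is the usual condition for reusing a reducer as a combiner.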
I still don't get your problem. You can use the following sequence:
database-->map-->reduce(use cat or None depending on requirement)
Then store the data representation you have extracted.
If you are saying that it is small enough to fit in memory, then storing it on disk shouldn't be an issue.
Also, your use of the MapReduce paradigm for the given problem is incorrect. Using a single map function and multiple "different" reduce functions makes no sense; it shows that you are just using map to pass data out to different machines to do different things. You don't require Hadoop or any other special architecture for that.

Resources