Query preprocessing: Hadoop or distributed system

I am trying to optimize the performance of a search engine by preprocessing all the results. We have around 50k search terms. I am planning to search these 50k terms beforehand and save the results in memory (memcached/redis). Searching all 50k terms takes more than a day in my case, since we do deep semantic search, so I am planning to distribute the searching (preprocessing) over several nodes. I was considering using Hadoop. My input size is very small, probably less than 1MB, even though the total number of search terms is over 50k. But searching each term takes over a minute, i.e. the job is more computation-oriented than data-oriented. So I am wondering whether I should use Hadoop or build my own distributed system. I remember reading that Hadoop is used mainly when the input is very large. Please suggest how to go about this.
Also, I read that Hadoop reads data by block size, i.e. 64MB for each JVM/mapper. Is it possible to make it a number of lines instead of a block size? For example: every mapper gets 1000 lines instead of 64MB. Is it possible to achieve this?

Hadoop can definitely handle this task. Yes, much of Hadoop was designed to handle jobs with very large input or output data, but that's not its sole purpose. It can work well for close to any type of distributed batch processing. You'll want to take a look at NLineInputFormat; it allows you to split your input up based on exactly what you wanted: the number of lines.
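For example, a map-only job using NLineInputFormat might be set up roughly like this. This is only a sketch: SearchTermMapper and runDeepSemanticSearch are placeholder names for your own search code, not anything from the question.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PreprocessSearchTerms {

        // One search term per input line; runs the expensive search and emits (term, result).
        public static class SearchTermMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String term = value.toString().trim();
                String result = runDeepSemanticSearch(term);   // placeholder for the real search
                context.write(new Text(term), new Text(result));
            }

            private String runDeepSemanticSearch(String term) {
                return "";                                      // stand-in implementation
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "preprocess-search-terms");
            job.setJarByClass(PreprocessSearchTerms.class);

            // Split the input by line count instead of block size: 1000 lines per mapper.
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 1000);

            job.setMapperClass(SearchTermMapper.class);
            job.setNumReduceTasks(0);                           // map-only: no reduce phase needed
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You could then load the output key/value pairs into memcached/redis in a separate step once the job finishes.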

Related

Does Apache Spark read and process at the same time, or does it first read the entire file into memory and then start transformations?

I am curious whether Spark first reads the entire file into memory and only then starts processing it (applying transformations and actions), or whether it reads the first chunk of a file, applies transformations to it, reads the second chunk, and so on.
Is there any difference between Spark and Hadoop in this regard? I read that Spark keeps the entire file in memory most of the time, while Hadoop does not. But what about the initial step, when we read it for the first time and map the keys?
Thanks
I think a fair characterisation would be this:
Both Hadoop (or more accurately MapReduce) and Spark use the same underlying filesystem HDFS to begin with.
During the Mapping phase both will read all data and actually write the map result to disk so that it can be sorted and distributed between nodes via the Shuffle logic.
Both of them do in fact try and cache the data just mapped in memory in addition to spilling it to disk for the Shuffle to do its work.
The difference here though is that Spark is a lot more efficient in this process, trying to optimally align the node chosen for a specific computation with the data already cached on a certain node.
Since Spark also does something called lazy evaluation, the memory use of Spark is very different from Hadoop's, as a result of planning computation and caching simultaneously.
In the steps of a word-count job, Hadoop does this:
1) Map all the words to 1.
2) Write all those mapped pairs of (word, 1) to a single file in HDFS (a single file could still span multiple nodes of the distributed HDFS); this is the shuffle phase.
3) Sort the rows of (word, 1) in that shared file; this is the sorting phase.
4) Have the reducers read sections (partitions) from that shared file, which now contains all the words sorted, and sum up all those 1s for every word.
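For concreteness, a bare-bones version of that Hadoop word count (a standard textbook sketch, not code from the original answer) looks roughly like this:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Step 1: map every word to 1; the framework then writes, shuffles and sorts
        // these (word, 1) pairs (steps 2 and 3) before they reach the reducers.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Step 4: each reducer receives a sorted partition of words and sums the 1s.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }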
Spark on the other hand will go the other way around:
1) It figures that, like in Hadoop, it is probably most efficient to have all those words summed up via separate reducer runs, so it decides according to some factors that it wants to split the job into x parts and then merge them into the final result.
2) So it knows that words will have to be sorted, which will require at least part of them in memory at a given time.
3) After that it evaluates that such a sorted list will require all words mapped to (word, 1) pairs to start the calculation.
It works through steps 3, then 2, then 1.
Now the trick relative to Hadoop is that it knows in step 3 which in-memory cached items it will need in step 2, and in step 2 it already knows how these parts (mostly key-value pairs) will be needed in the final step 1.
This allows Spark to plan the execution of jobs very efficiently by caching data it knows will be needed in later stages of the job. Hadoop, working from the beginning (mapping) to the end without explicitly looking ahead into the following stages, simply cannot use memory this efficiently, and hence doesn't waste resources keeping in memory the large chunks that Spark would keep. Unlike Spark, it just doesn't know whether all the pairs from a map phase will be needed in the next step.
The fact that it appears that Spark is keeping the whole dataset in memory hence isn't something Spark actively does, but rather a result of the way Spark is able to plan the execution of a job.
On the other hand, Spark may actually be able to keep fewer things in memory in a different kind of job. Counting the number of distinct words is a good example here, in my opinion.
Here Spark would have planned ahead and can immediately drop a repeated word from the cache/memory when encountering it during the mapping, while Hadoop would go ahead and waste memory shuffling the repeated words too. (I acknowledge there are a million ways to make Hadoop do this as well, but it's not out of the box; there are also ways of writing your Spark job in unfortunate ways that break these optimisations, but it's not so easy to fool Spark here :))
I hope this helps you understand that the memory use is just a natural consequence of the way Spark works, not something actively aimed at and also not something strictly required by Spark. It is also perfectly capable of repeatedly spilling data back to disk between steps of the execution when memory becomes an issue.
For more insight into this I recommend learning about the DAG scheduler in Spark from here to see how this is actually done in code.
You'll see that it always follows the pattern of working out where what data is and will be cached before figuring out what to calculate where.
Spark uses lazy iterators to process data and can spill data to disk if necessary. It doesn't read all the data into memory.
The difference compared to Hadoop is that Spark can chain multiple operations together.
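As a small illustration (the Java API and the input path here are just assumptions for this sketch), nothing below is read from disk until the final action runs; the transformations merely chain together into a plan:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class LazyWordCount {
        public static void main(String[] args) {
            JavaSparkContext sc =
                    new JavaSparkContext(new SparkConf().setAppName("lazy-word-count"));

            // Transformations only build the plan (the DAG); nothing is read yet.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");  // assumed path
            JavaRDD<String> words =
                    lines.flatMap(l -> Arrays.asList(l.split("\\s+")).iterator());

            // The action below triggers the read; partitions are streamed through the
            // chained operations and spilled to disk if they don't fit in memory.
            long distinctWords = words
                    .mapToPair(w -> new Tuple2<>(w, 1))
                    .reduceByKey(Integer::sum)
                    .count();

            System.out.println("distinct words: " + distinctWords);
            sc.stop();
        }
    }

Swapping the action (count, saveAsTextFile, and so on) changes what gets materialised, but not when reading starts: it always starts at the action.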

Processing HUGE number of small files independently

The task is to process a HUGE number (around 10,000,000) of small files (each around 1MB) independently (i.e. the result of processing file F1 is independent of the result of processing F2).
Someone suggested Map-Reduce (on Amazon-EMR Hadoop) for my task. However, I have serious doubts about MR.
The reason is that processing the files is, in my case, independent. As far as I understand MR, it works best when the output depends on many individual files (for example, counting the frequency of each word across many documents, since a word might appear in any document in the input). But in my case, I just need a lot of independent CPUs/cores.
I was wondering if you have any advice on this.
Side note: there is another issue, which is that MR works best for "huge files rather than a huge number of small files". Although there seem to be solutions for that, so I am ignoring it for now.
It is possible to use MapReduce for your needs. In MapReduce there are two phases, Map and Reduce, but the reduce phase is not a must. For your situation you could write a map-only MapReduce job, and all the calculations on a single file should be put into a customised map function.
However, I haven't processed such a huge number of files in a single job, so I have no idea about its efficiency. Try it yourself, and share with us :)
This is quite easy to do. In such cases, the input data for the MR job is typically the list of files (and not the files themselves). So the size of the data submitted to Hadoop is the size of 10M file names, which is on the order of a couple of gigabytes at most.
One uses MR to split up the list of files into smaller fragments (how many can be controlled by various options). Then each mapper gets a list of files. It can process one file at a time and generate the output.
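A rough sketch of that mapper (FileListMapper and processFile are hypothetical names, not from the answer): each input line is treated as an HDFS path which the mapper opens and processes itself.

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class FileListMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Path file = new Path(value.toString());          // one file path per input line
            FileSystem fs = file.getFileSystem(context.getConfiguration());
            try (FSDataInputStream in = fs.open(file)) {
                String result = processFile(in);             // your independent per-file work
                context.write(new Text(file.getName()), new Text(result));
            }
        }

        private String processFile(FSDataInputStream in) throws IOException {
            return "";                                       // stand-in for the real processing
        }
    }

Combined with something like NLineInputFormat (as in the earlier question), you can also control how many file names each mapper receives.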
(FWIW, I would suggest Qubole (where I am a founder) instead of EMR because it would save you a ton of money with auto-scaling and spot integration.)

Hadoop smaller input file

I am using Hadoop in a slightly different way. In my case, the input size is really small, but the computation time is long: I have a complicated algorithm that I run on every line of input. So even though the input size is less than 5MB, the overall computation time is over 10 hours. That is why I am using Hadoop here. I am using NLineInputFormat to split the file by number of lines rather than by block size. In my initial testing, I had around 1500 lines (splitting by 200 lines) and I saw only a 1.5x improvement on a four-node cluster compared to running it serially on one machine. I am using VMs. Could that be the issue, or is it that for smaller inputs there won't be much benefit from Hadoop? Any insights will be really helpful.
To me, your workload resembles a SETI@home workload: small payloads but hours of crunching time.
Hadoop (or more specifically HDFS) is not designed for lots of small files. But I doubt that is an issue for MapReduce - the processing framework you are using.
If you want to keep your workload together:
1) Split it into individual files (one workload, one file). If a file is smaller than the block size, it will go to one mapper. Typical block sizes are 64MB or 128MB.
2) Create a wrapper for FileInputFormat and override the isSplitable() method to return false. This makes sure the entire file contents are fed to one mapper, rather than Hadoop trying to split it line by line (see the sketch below).
reference : http://hadoopilluminated.com/hadoop_book/HDFS_Intro.html
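A minimal sketch of option 2), assuming the newer mapreduce API with TextInputFormat as the base class:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Never split: the entire file's lines are read by a single mapper.
            return false;
        }
    }

You would then register it in the driver with job.setInputFormatClass(WholeFileTextInputFormat.class).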
Hadoop is not really good at dealing with tons of small files; hence, it is often desirable to combine a large number of small input files into a smaller number of bigger files so as to reduce the number of mappers.
Input to a Hadoop MapReduce process is abstracted by InputFormat. FileInputFormat is the default implementation that deals with files in HDFS. With FileInputFormat, each file is split into one or more InputSplits, typically upper-bounded by the block size. This means the number of input splits is lower-bounded by the number of input files. This is not an ideal environment for a MapReduce process dealing with a large number of small files, because the overhead of coordinating the distributed processes dwarfs the useful work per file.
The basic parameter which drives the split size is mapred.max.split.size.
Using CombineFileInputFormat and this parameter we can control the number of mappers.
Check out the implementation I had for another answer here.
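That linked implementation isn't reproduced here, but as a rough sketch of the idea (the 128MB cap is just an example value):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class CombineSmallFilesConfig {
        // Pack many small files into a few large splits so far fewer mappers run.
        public static void configure(Job job) {
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Caps each combined split at 128MB; this sets the split-size
            // parameter mentioned above (mapred.max.split.size in older releases).
            CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        }
    }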

Is Hadoop the right tech for this?

If I had millions of records of data that are constantly being updated and added to every day, and I needed to comb through all of the data for records that match specific logic and then take that matching subset and insert it into a separate database, would I use Hadoop and MapReduce for such a task, or is there some other technology I am missing? The main reason I am looking for something other than a standard RDBMS is that all of the base data comes from multiple sources and is not uniformly structured.
Map-Reduce is designed for algorithms that can be parallelized and local results can be computed and aggregated. A typical example would be counting words in a document. You can split this up into multiple parts where you count some of the words on one node, some on another node, etc and then add up the totals (obviously this is a trivial example, but illustrates the type of problem).
Hadoop is designed for processing large data files (such as log files). The default block size is 64MB, so having millions of small records wouldn't really be a good fit for Hadoop.
To deal with the issue of having non-uniformly structured data, you might consider a NoSQL database, which is designed to handle data where a lot of columns are null (such as MongoDB).
Hadoop/MR are designed for batch processing and not for real-time processing. So some other alternative, like Twitter Storm or HStreaming, has to be considered.
Also, look at Hama for real time processing of data. Note that real time processing in Hama is still crude and a lot of improvement/work has to be done.
I would recommend Storm or Flume. In either of these you may analyze each record as it comes in and decide what to do with it.
If your data volumes are not that great (and millions of records don't sound like they are), I would suggest trying to get the most out of an RDBMS, even if your schema will not be properly normalized.
I think even a table with the structure (K1, K2, K3, Blob) would be more useful here.
In NoSQL, key-value stores are built to support schemaless data in various flavors, but their query capabilities are limited.
The only case I can think of as useful is MongoDB's/CouchDB's capability to index schemaless data. You will be able to get records by some attribute value.
Regarding Hadoop MapReduce, I think it is not useful unless you want to harness a lot of CPUs for your processing, have a lot of data, or need distributed sort capability.

Converting Word docs to PDF using Hadoop

Say I want to convert 1000s of Word files to PDF: would using Hadoop to approach this problem make sense? Would using Hadoop have any advantage over simply using multiple EC2 instances with job queues?
Also, if there was 1 file and 10 free nodes, would Hadoop split the file and send it to the 10 nodes, or would the file be sent to just 1 node while 9 sit idle?
There isn't much advantage in using Hadoop for this use case. Having competing consumers read from a queue and produce output is going to be a lot easier to set up and will probably be more efficient.
Hadoop would not automatically split a document and process sections on different nodes. Although if you had a really big document (many thousands of pages long), then the Hadoop use case would make sense, but only when the time to produce a PDF on a single machine is significant.
The map tasks could print a few thousand pages each and the reduce task could merge the PDFs into a single document, although the resulting file may be difficult to read if it is very large.
Say if I want to convert 1000s of word files to pdf then would using Hadoop to approach this problem make sense? Would using Hadoop have any advantage over simply using multiple EC2 instances with job queues?
I think either tool could accomplish this task, so it depends on what you plan to do with the documents after conversion. Derek Gottfrid at the New York Times famously found Hadoop to be a useful tool for large-scale document conversion, so it's certainly within the realm of tasks at which Hadoop performs well.
Also if there was 1 file and 10 free nodes then would hadoop split the file and send it to the 10 nodes or will the file be sent to just 1 node while 9 sit idle?
It depends on the InputFormat you use. As you can see in the documentation, you can specify how to compute the "InputSplits", which might include splitting a large document into chunks.
Good luck with whatever tool you choose for this problem!
Regards,
Jeff
How many 1000s are you talking about? If this is a one-off batch, I would set it up on a single machine and simply let it run; I think you'll be surprised at how fast you can convert 1000s of docs to PDF. Even if you need to run the task for a couple of days, if it's a one-off conversion then there is no need for complications such as Hadoop. If you are continually converting 1000s of docs, then it's probably worth the effort of setting up something else.
