Indexing process in Hadoop

Could anybody please explain what is meant by the indexing process in Hadoop?
Is it something like the traditional indexing of data that we do in an RDBMS? Drawing the same analogy in Hadoop, do we index the data blocks and store the physical addresses of the blocks in some data structure?
So it would take additional space in the cluster.
I googled around this topic but could not find anything satisfactory or detailed.
Any pointers will help.
Thanks in advance.

Hadoop stores data in files and does not index them. To find something, we have to run a MapReduce job that goes through all the data. Hadoop is efficient where the data is too big for a database. With very large datasets, the cost of regenerating indexes is so high that you can't easily index changing data.
However, we can use indexing in HDFS at two levels of granularity: file-based indexing and InputSplit-based indexing.
Let's assume that we have 2 files to store in HDFS for processing. The first one is 500 MB and the 2nd one is around 250 MB. With a 128 MB split size we'll therefore have 4 InputSplits for the 1st file and 2 InputSplits for the 2nd file.
We can apply the 2 types of indexing to this case -
1. With file-based indexing, if the value you query for appears in both files you will end up with 2 files (the full data set here), meaning that your indexed query will be equivalent to a full-scan query
2. With InputSplit-based indexing, you might end up with, say, 4 of the 6 InputSplits. The performance should definitely be better than doing a full-scan query.
Now, to implement an InputSplit-based index we need to perform the following steps:
Build an index from your full data set - This can be achieved by writing a MapReduce job that extracts the value we want to index and outputs it together with an MD5 hash of its InputSplit (a sketch of such a mapper follows these steps).
Get the InputSplit(s) for the indexed value you are looking for - The output of that MapReduce program will be reduced files (containing the indices based on InputSplits) stored in HDFS.
Execute your actual MapReduce job on the indexed InputSplits only - This can be done because Hadoop retrieves the InputSplits to be used through the FileInputFormat.class. We will create our own IndexFileInputFormat class extending the default FileInputFormat.class and overriding its getSplits() method. You have to read the file you created in the previous step, add all your indexed InputSplits to a list, and then compare this list with the one returned by the super class. You return to the JobTracker only the InputSplits that were found in your index.
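A minimal sketch of such an index-building mapper, assuming a tab-separated text input whose first field is the value to index; the class name IndexBuilderMapper and the choice of hashing the InputSplit's toString() are illustrative, not taken from the linked post.

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IndexBuilderMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String splitHash;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // Identify the InputSplit this mapper is reading by hashing its description.
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(context.getInputSplit().toString().getBytes("UTF-8"));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            splitHash = sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (indexed value, InputSplit hash); a reducer can then deduplicate the
        // hashes per value and write the index file to HDFS.
        String indexedValue = value.toString().split("\t")[0];
        context.write(new Text(indexedValue), new Text(splitHash));
    }
}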
In the Driver class we now have to use this IndexFileInputFormat class, setting it as the InputFormatClass:
job.setInputFormatClass(IndexFileInputFormat.class);
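A rough sketch of what such an IndexFileInputFormat could look like. For simplicity it extends TextInputFormat (which inherits getSplits() from FileInputFormat and keeps the default line reader) and assumes the index for the value you are querying has already been reduced to a file in HDFS containing one InputSplit hash per line; the configuration key index.file.path is a made-up name.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class IndexFileInputFormat extends TextInputFormat {

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        // Ask the default implementation for every split, then keep only those
        // whose MD5 hash appears in the index file produced in step 1.
        List<InputSplit> allSplits = super.getSplits(job);
        Set<String> indexedHashes = loadIndexedSplitHashes(job);
        List<InputSplit> filtered = new ArrayList<>();
        for (InputSplit split : allSplits) {
            if (indexedHashes.contains(md5(split.toString()))) {
                filtered.add(split);
            }
        }
        return filtered;
    }

    // Reads the index file (one InputSplit hash per line) from a path passed
    // through the job configuration under the made-up key "index.file.path".
    private Set<String> loadIndexedSplitHashes(JobContext job) throws IOException {
        Set<String> hashes = new HashSet<>();
        Path indexPath = new Path(job.getConfiguration().get("index.file.path"));
        FileSystem fs = indexPath.getFileSystem(job.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(indexPath)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                hashes.add(line.trim());
            }
        }
        return hashes;
    }

    // Same hashing scheme as the index-building mapper above.
    private static String md5(String input) throws IOException {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : digest.digest(input.getBytes("UTF-8"))) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IOException(e);
        }
    }
}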
For a code sample and other details, refer to this post -
https://hadoopi.wordpress.com/2013/05/24/indexing-on-mapreduce-2/

We can identify 2 different levels of granularity for creating indices: an index based on the file URI or an index based on the InputSplit. Let's take 2 different examples of data sets.
First example:
2 files in your data set fit in 25 blocks and have been identified as 7 different InputSplits. The target you are looking for (grey highlighted) is available in file #1 (blocks #2, #8 and #13) and in file #2 (block #17)
With file-based indexing, you will end up with 2 files (the full data set here), meaning that your indexed query will be equivalent to a full-scan query
With InputSplit-based indexing, you will end up with 4 of the 7 available InputSplits. The performance should definitely be better than doing a full-scan query
Let’s take a second example:
This time the same data set has been sorted by the column you want to index. The target you are looking for (grey highlighted) is now available in file #1 (blocks #1, #2, #3 and #4).
With file-based indexing, you will end up with only 1 file from your data set
With InputSplit-based indexing, you will end up with 1 of the 7 available InputSplits
For this specific study, I decided to use a custom InputSplit-based index. I believe such an approach should be quite a good balance between the effort it takes to implement, the added value it might bring in terms of performance optimization, and its expected applicability regardless of the data distribution.

Related

Create Input Format of Elasticsearch using Flink Rich InputFormat

We are using Elasticsearch 6.8.4 and Flink 1.0.18.
We have an index with 1 shard and 1 replica in Elasticsearch, and I want to create a custom input format to read and write data in Elasticsearch using the Apache Flink DataSet API with more than 1 input split in order to achieve better performance. So is there any way I can achieve this requirement?
Note: The per-document size is large (almost 8 MB) and I can read only 10 documents at a time because of the size constraint; per read request, we want to retrieve 500k records.
As per my understanding, the parallelism should be equal to the number of shards/partitions of the data source. However, since we store only a small amount of data, we have kept the number of shards at 1, and since we have fairly static data it only grows slightly per month.
Any help or example of source code will be much appreciated.
You need to be able to generate queries to ES that effectively partition your source data into relatively equal chunks. Then you can run your input source with a parallelism > 1, and have each sub-task read only part of the index data.
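One way to do that with a single-shard index is Elasticsearch's sliced scroll, which partitions a scroll by document ID rather than by shard. Below is a rough, untested sketch of a Flink DataSet RichInputFormat built on that idea with the 6.8 high-level REST client; the class name ElasticsearchSliceInputFormat, the localhost:9200 host and the read-only scope (the question also mentions writing) are all assumptions.

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.flink.api.common.io.DefaultInputSplitAssigner;
import org.apache.flink.api.common.io.RichInputFormat;
import org.apache.flink.api.common.io.statistics.BaseStatistics;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.io.GenericInputSplit;
import org.apache.flink.core.io.InputSplitAssigner;
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchScrollRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.slice.SliceBuilder;

public class ElasticsearchSliceInputFormat extends RichInputFormat<String, GenericInputSplit> {

    private final String index;
    private final int slices;
    private transient RestHighLevelClient client;
    private transient String scrollId;
    private transient Deque<String> buffer;
    private transient boolean exhausted;

    public ElasticsearchSliceInputFormat(String index, int slices) {
        this.index = index;
        this.slices = slices;
    }

    @Override public void configure(Configuration parameters) { }

    @Override public BaseStatistics getStatistics(BaseStatistics cached) { return cached; }

    @Override
    public GenericInputSplit[] createInputSplits(int minNumSplits) {
        // One split per slice; each parallel sub-task reads a disjoint part of the index.
        GenericInputSplit[] splits = new GenericInputSplit[slices];
        for (int i = 0; i < slices; i++) {
            splits[i] = new GenericInputSplit(i, slices);
        }
        return splits;
    }

    @Override
    public InputSplitAssigner getInputSplitAssigner(GenericInputSplit[] splits) {
        return new DefaultInputSplitAssigner(splits);
    }

    @Override
    public void open(GenericInputSplit split) throws IOException {
        client = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http")));
        buffer = new ArrayDeque<>();
        exhausted = false;
        // Sliced scroll: slice id = split number, max = total number of splits.
        SearchSourceBuilder source = new SearchSourceBuilder()
                .query(QueryBuilders.matchAllQuery())
                .size(10) // only 10 documents per request because each one is ~8 MB
                .slice(new SliceBuilder(split.getSplitNumber(), split.getTotalNumberOfSplits()));
        SearchRequest request = new SearchRequest(index).source(source).scroll(TimeValue.timeValueMinutes(2));
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        scrollId = response.getScrollId();
        enqueue(response);
    }

    private void enqueue(SearchResponse response) {
        SearchHit[] hits = response.getHits().getHits();
        if (hits.length == 0) {
            exhausted = true;
            return;
        }
        for (SearchHit hit : hits) {
            buffer.add(hit.getSourceAsString());
        }
    }

    @Override
    public boolean reachedEnd() throws IOException {
        if (!buffer.isEmpty()) {
            return false;
        }
        if (exhausted) {
            return true;
        }
        // Fetch the next scroll page before deciding whether this slice is finished.
        SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId).scroll(TimeValue.timeValueMinutes(2));
        SearchResponse response = client.scroll(scrollRequest, RequestOptions.DEFAULT);
        scrollId = response.getScrollId();
        enqueue(response);
        return buffer.isEmpty();
    }

    @Override
    public String nextRecord(String reuse) {
        return buffer.poll();
    }

    @Override
    public void close() throws IOException {
        if (client != null) {
            client.close();
        }
    }
}

With this, env.createInput(new ElasticsearchSliceInputFormat("your-index", 4)).setParallelism(4) would let 4 sub-tasks read disjoint slices even though the index has only 1 shard, because sliced scroll splits the work by document ID rather than by shard.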

Try to confirm my understanding of HBase and MapReduce behavior

I'm trying to do some process on my HBase dataset. But I'm pretty new to the HBase and Hadoop ecosystem.
I would like to get some feedback from this community, to see if my understanding of HBase and the MapReduce operation on it is correct.
Some backgrounds here:
1. We have an HBase table that is about 1 TB and exceeds 100 million records.
2. It has 3 region servers and each region server contains about 80 regions, making 240 regions in total.
3. The records in the table should be fairly uniformly distributed across the regions, from what I know.
And what I'm trying to achieve is that I could filter out rows based on some column values, and export those rows to HDFS filesystem or something like that.
For example, we have a column named "type" and it might contain value 1 or 2 or 3. I would like to have 3 distinct HDFS files (or directories, as data on HDFS is partitioned) that have records of type 1, 2, 3 respectively.
From what I can tell, MapReduce seems like a good approach to attack these kinds of problems.
I've done some research and experiment, and could get the result I want. But I'm not sure if I understand the behavior of HBase TableMapper and Scan, yet it's crucial for our code's performance, as our dataset is really large.
To simplify the issue, I would take the official RowCounter implementation as an example, and I would like to confirm my knowledge is correct.
So my questions about HBase with MapReduce are:
In the simplest form of RowCounter (without any optional arguments), it is actually a full table scan. HBase iterates over all records in the table and emits each row to the map method of RowCounterMapper. Is this correct?
The TableMapper will divide the work based on how many regions we have in the table. For example, if we have only 1 region in our HBase table, it will only have 1 map task, which is effectively a single thread and does not utilize any parallel processing of our Hadoop cluster?
If the above is correct, is it possible to configure HBase to spawn multiple tasks per region? For example, when we run RowCounter on a table that has only 1 region, could it still have 10 or 20 tasks and count the rows in parallel?
Since TableMapper also depends on the Scan operation, I would also like to confirm my understanding of the Scan operation and its performance.
If I use setStartRow / setStopRow to limit the scope of my dataset, then since the rowkey is indexed it does not impact our performance, because it does not trigger a full table scan.
In our case, we might need to filter our data based on their modified time, so we might use scan.setTimeRange() to limit the scope of our dataset. My question is: since HBase does not index the timestamp, will this scan become a full table scan, with no advantage compared to just filtering in our MapReduce job itself?
Finally, actually we have some discussion on how we should do this export. And we have the following two approaches, yet not sure which one is better.
Using the MapReduce approach described above. But we are not sure if the parallelism will be bound by how many regions the table has, i.e., the concurrency never exceeds the region count, and we could not improve our performance unless we increase the number of regions.
We maintain a rowkey list in a separate place (perhaps on HDFS), and we use Spark to read the file, then just fetch the records using simple Get operations. All the concurrency occurs on the Spark / Hadoop side.
I would like some suggestions from this community about which solution is better; it would be really helpful. Thanks.
It seems like you have a very small cluster. Scalability also depends on the number of region servers (RS), so merely increasing the number of regions in the table without increasing the number of region servers won't really help you speed up the job. I think 80 regions/RS for that table is decent enough.
I am assuming you are going to use TableInputFormat; it works by running 1 mapper per region and performs server-side filtering based on the scan object. I agree that scanning with TableInputFormat is the optimal approach to export a large amount of data from HBase, but scalability and performance are not simply proportional to the number of regions. There are many other factors: the number of RSs, RAM and disk on each RS, and the uniform distribution of data are some of them.
In general, I would go with #1, since you just need to prepare a scan object and then HBase will take care of the rest (a rough sketch follows below).
#2 is more cumbersome, since you need to maintain the rowkey state outside HBase.
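For reference, a hedged sketch of approach #1 using the classic HBase 1.x mapreduce API: a map-only TableMapper job driven by a Scan that carries a server-side SingleColumnValueFilter on the "type" column. The column family "cf", table name "my_table" and output path are assumptions, not details from the question; you would run one such job (or change the filter value) per type.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ExportByType {

    // One mapper per region; only rows that pass the server-side filter reach it.
    public static class ExportMapper extends TableMapper<NullWritable, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            context.write(NullWritable.get(), new Text(value.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "export-type-1");
        job.setJarByClass(ExportByType.class);

        Scan scan = new Scan();
        scan.setCaching(500);       // larger scanner caching helps full scans
        scan.setCacheBlocks(false); // don't pollute the block cache from MapReduce
        // Server-side filter: only rows whose "type" column equals "1" are returned.
        scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("cf"), Bytes.toBytes("type"),
                CompareOp.EQUAL, Bytes.toBytes("1")));

        TableMapReduceUtil.initTableMapperJob(
                "my_table", scan, ExportMapper.class,
                NullWritable.class, Text.class, job);

        job.setNumReduceTasks(0); // map-only export straight to HDFS
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/export/type_1"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}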

Saving ordered dataframe in Spark

I'm trying to save an ordered dataframe into HDFS. My code looks like this:
dataFrame.orderBy("index").write().mode(SaveMode.Overwrite).parquet(getPath());
I run the same code on two different clusters; one cluster uses Spark 1.5.0, the other 1.6.0. When running on the cluster with Spark 1.5.0, it does not preserve sorting after saving to disk.
Are there any specific cluster settings to preserve sorting while saving data to disk, or is this a known problem of that Spark version? I've searched the Spark documentation but couldn't find any info about it.
Update:
I've checked the files in Parquet and in both cases the files are sorted. So the problem occurs while reading: Spark 1.5.0 doesn't preserve ordering while reading, and 1.6.0 does.
So my question now is: is it possible to read a sorted file and preserve ordering in Spark 1.5.0?
There are several things going on here:
When you are writing, Spark splits the data into several partitions and those are written separately, so even if the data is ordered it is split.
When you are reading, the partitions do not preserve ordering between them, so your data would only be sorted within blocks. Worse, there might be something different from a 1:1 mapping of file to partition:
Several files might be mapped to a single partition in the wrong order, causing the sorting inside the partition to only hold within blocks
A single file might be divided between partitions (if it is larger than the block size).
Based on the above, the easiest solution would be to repartition (or rather coalesce) to 1 partition when writing and thus have 1 file. When that file is read, the data would be ordered if the file is smaller than the block size (you can even make the block size very large to ensure this).
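For illustration, a minimal sketch of that single-file variant using the same Java API as the snippet above (SaveMode, getPath() and the "index" column come from the question):

dataFrame.orderBy("index")
         .coalesce(1)                       // collapse to a single partition -> a single Parquet file
         .write()
         .mode(SaveMode.Overwrite)
         .parquet(getPath());

Because coalesce(1) concatenates the already-sorted, range-partitioned partitions in order, the single output file stays globally sorted; the price is that the final stage runs as one task.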
The problem with this solution is that it reduces your parallelism (when you write you need to coalesce, and when you read you need to repartition again to regain parallelism); the coalesce/repartition can be costly.
The second problem with this solution is that it doesn't scale well (you might end up with a huge file).
A better solution depends on your use case. The basic idea is to use partitioning before sorting. For example, if you are planning to do a custom aggregation that requires the sorting, then making sure to keep a 1:1 mapping between files and partitions guarantees the sorting within each partition, which might be enough for you. You can also add the maximum value inside each partition as a second value, then group by it and do a secondary sort.

How to create a key, value pair in a MapReduce program if values are stored across block boundaries?

The input file I need to process has data classified by headers and their respective records. My 200 MB file has 3 such headers, and their records are split across 4 blocks (3 × 64 MB and 1 × 8 MB).
The data is in the format below:
HEADER 1
Record 1
Record 2
.
.
Record n
HEADER 2
Record 1
Record 2
.
.
Record n
HEADER 3
Record 1
Record 2
.
.
Record n
All I need is to take the HEADER as a key and the records below it as values, and perform some operations in my mapper code.
The problem here is that my records are split across different blocks.
Suppose my first header and its respective records occupy 70 MB; that means they occupy the 64 MB of the first block and 6 MB of the 2nd block.
Now, how does the mapper that runs on the 2nd block know that those 6 MB of the file belong to the records of HEADER 1?
Can anyone please explain how to get the header and its records completely?
You need a custom RecordReader and a custom line reader that process the file in that way, rather than reading each line independently.
Since the splits are calculated in the client, every mapper already knows if it needs to discard the records of previous header or not.
Hopefully the link below is helpful:
How does Hadoop process records split across block boundaries?
You have two ways:
A single mapper handling all the records, so you have the complete data in a single class and you decide how to separate it (see the sketch after this list). Given the input size, this will have performance issues. More info in Hadoop: The Definitive Guide, MapReduce Types and Formats, Input Formats, Preventing Splitting. It needs less coding effort, and if your mapper has less data and runs frequently, this approach is OK.
If you plan to use a custom split and record reader, you are modifying the way the framework works, because your records are similar to TextInputFormat's. So mostly there is no need to write a custom record reader. However, you do need to define how the splits are made. In general, splits are made roughly equal to the block size to take advantage of data locality. In your case, your data (mainly the header part) can end in any block, and you should split accordingly. All the above changes need to be made to make MapReduce work with the data you have.
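A small sketch of the first approach, assuming the mapreduce (new) API: an input format that simply refuses to split the file, so a single mapper sees every header together with all of its records. The class name is illustrative.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // the whole file becomes one split, hence one mapper
    }
}

In the driver you would then set job.setInputFormatClass(NonSplittableTextInputFormat.class); and keep the header/record grouping logic in the mapper itself.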
You can increase the HDFS block size from its default to 128 MB, and if the file is smaller than that it will be stored as a single block.

Query preprocessing: Hadoop or distributed system

I am trying to optimize the performance of a search engine by preprocessing all the results. We have around 50k search terms. I am planning to search these 50k terms beforehand and save the results in memory (memcached/redis). Searching for all 50k terms takes more than a day in my case, since we do deep semantic search, so I am planning to distribute the searching (preprocessing) over several nodes. I was considering using Hadoop. My input size is very small, probably less than 1 MB even though the total number of search terms is over 50k, but searching each term takes over a minute, i.e., the job is more computation-oriented than data-oriented. So I am wondering if I should use Hadoop or build my own distributed system. I remember reading that Hadoop is used mainly when the input is very large. Please suggest how I should go about this.
Also, I read that Hadoop reads data by block size, i.e., 64 MB for each JVM/mapper. Is it possible to make it a number of lines instead of a block size? For example, every mapper gets 1000 lines instead of 64 MB. Is it possible to achieve this?
Hadoop can definitely handle this task. Yes, much of Hadoop was designed to handle jobs with very large input or output data, but that's not its sole purpose. It can work well for close to any type of distributed batch processing. You'll want to take a look at NLineInputFormat; it allows you to split your input up based on exactly what you wanted: the number of lines.
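For illustration, a minimal driver sketch wiring up NLineInputFormat so each mapper receives 1000 search terms (lines) instead of a block-sized split; the class name, paths and the commented-out SearchTermMapper are placeholders for your own code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SearchPreprocessDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "search-term-preprocessing");
        job.setJarByClass(SearchPreprocessDriver.class);

        // Hand each mapper 1000 lines of the terms file, regardless of HDFS block size.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1000);
        NLineInputFormat.addInputPath(job, new Path(args[0]));

        // job.setMapperClass(SearchTermMapper.class); // your deep-semantic-search mapper goes here
        job.setNumReduceTasks(0); // map-only: each mapper writes its precomputed results
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}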
