Create an Elasticsearch input format using Flink RichInputFormat

We are using Elasticsearch 6.8.4 and Flink 1.0.18.
We have an index with 1 shard and 1 replica in Elasticsearch, and I want to create a custom input format to read and write data in Elasticsearch using the Apache Flink DataSet API with more than one input split, in order to achieve better performance. Is there any way I can achieve this?
Note: each document is large (almost 8 MB) and I can read only 10 documents at a time because of the size constraint, while per read request we want to retrieve 500k records.
As per my understanding, the parallelism should equal the number of shards/partitions of the data source. However, since we store only a small amount of data, we have kept the number of shards at 1, and the data is fairly static, growing only slightly each month.
Any help or example of source code will be much appreciated.

You need to be able to generate queries to ES that effectively partition your source data into relatively equal chunks. Then you can run your input source with a parallelism > 1, and have each sub-task read only part of the index data.
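For Elasticsearch 6.x, one way to generate such roughly equal chunks even with a single shard is a sliced scroll: each Flink sub-task reads one slice of the index. Below is a minimal sketch, not a definitive implementation; the host, index name and the 10-documents-per-page size are assumptions taken from the question. It builds on Flink's GenericInputFormat, which already creates one generic split per parallel sub-task.

import org.apache.flink.api.common.io.GenericInputFormat;
import org.apache.flink.core.io.GenericInputSplit;
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchScrollRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.slice.SliceBuilder;

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

public class SlicedScrollInputFormat extends GenericInputFormat<String> {

    private transient RestHighLevelClient client;
    private transient Deque<String> buffer;
    private transient String scrollId;
    private transient boolean scrollExhausted;

    @Override
    public void open(GenericInputSplit split) throws IOException {
        super.open(split);
        // Host and index name are assumptions; adjust to your cluster.
        client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));
        buffer = new ArrayDeque<>();
        scrollExhausted = false;

        SearchSourceBuilder source = new SearchSourceBuilder()
                .query(QueryBuilders.matchAllQuery())
                .size(10); // documents are ~8 MB, so keep pages small
        if (split.getTotalNumberOfSplits() > 1) {
            // Each parallel sub-task reads one slice of the scroll: slice id = split
            // number, slice count = total number of splits (i.e. the parallelism).
            source.slice(new SliceBuilder(split.getSplitNumber(), split.getTotalNumberOfSplits()));
        }
        SearchRequest request = new SearchRequest("my-index")
                .source(source)
                .scroll(TimeValue.timeValueMinutes(5));
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        scrollId = response.getScrollId();
        enqueue(response.getHits().getHits());
    }

    private void enqueue(SearchHit[] hits) {
        if (hits.length == 0) {
            scrollExhausted = true;
        }
        for (SearchHit hit : hits) {
            buffer.add(hit.getSourceAsString());
        }
    }

    @Override
    public boolean reachedEnd() throws IOException {
        if (!buffer.isEmpty()) {
            return false;
        }
        if (scrollExhausted) {
            return true;
        }
        // Fetch the next page of this slice's scroll.
        SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId)
                .scroll(TimeValue.timeValueMinutes(5));
        SearchResponse response = client.scroll(scrollRequest, RequestOptions.DEFAULT);
        scrollId = response.getScrollId();
        enqueue(response.getHits().getHits());
        return buffer.isEmpty();
    }

    @Override
    public String nextRecord(String reuse) {
        return buffer.poll();
    }

    @Override
    public void close() throws IOException {
        if (client != null) {
            client.close();
        }
    }
}

You would then create the DataSet with something like env.createInput(new SlicedScrollInputFormat()).setParallelism(4); the number of slices simply follows the parallelism of the source.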

Related

Nifi Hbase data insertion taking more space than original data

I am doing real-time data transformation using NiFi, and after processing, the data is stored in HBase. I am using PutHBaseJSON to store the data, with a UUID as the row key/id. The original size of a single JSON document, as shown in NiFi data provenance or by an online tool, is 390 bytes. But for 15 million records the stored size is 55 GB, which works out to about 3.9 KB per record.
So I do not understand how the data is stored, why the size stored in HBase is so much larger than the original data size, and how I can reduce or optimize it in both HBase and NiFi (if any changes are required).
JSON:
{"_id":"61577d7aba779647060cb4e9","index":0,"guid":"c70bff48-008d-4f5b-b83a-f2064730f69c","isActive":true,"balance":"$3,410.16","picture":"","age":40,"eyeColor":"green","name":"Delia Mason","gender":"female","company":"INTERODEO","email":"deliamason#interodeo.com","phone":"+1 (892) 525-3498","address":"682 Macon Street, Clinton, Idaho, 3964","about":"","registered":"2019-09-03T06:00:32 -06:-30"}
Steps to reproduce in nifi:
GenerateFlowFile ---> PutHBaseJSON (UUID as row key)
Update 1:
[screenshot: the data as stored in HBase]
I think the main thing that may be surprising you is that HBase stores each column of a table as an individual record (a key-value cell that repeats the row key, column family, qualifier and timestamp).
Suppose your UUID is 40 characters on average, field 1, 2 and 3 may each be 5 on average and perhaps it adds a timestamp of length 15.
Now originally you would have an amount of data of size 40+5+5+5+15 = 70
And after storing per row as per your screenshot, with three columns it would become 3*(40+5+15)=180 and this effect can increase if you have smaller or more fields.
I got this understanding from your screenshot but also from this article: https://dzone.com/articles/how-to-improve-apache-hbase-performance-via-data-s
Now the obvious way forward, if you want to reduce your footprint, is to reduce that overhead. I believe the article recommends serialization, but it may also simply be possible to put the entire JSON body into one column, depending on how you plan to access it.
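For example, a minimal sketch of the single-column approach with the HBase Java client (the "events" table and "d" column family are made-up names):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.util.UUID;

public class SingleColumnJsonWriter {
    public static void main(String[] args) throws Exception {
        // Trimmed example payload; in the NiFi flow this would be the flow file content.
        String json = "{\"_id\":\"61577d7aba779647060cb4e9\",\"name\":\"Delia Mason\"}";

        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("events"))) {
            // One cell per row: the whole JSON body goes into a single column,
            // so the row key, column family and qualifier are stored only once
            // instead of once per JSON field.
            Put put = new Put(Bytes.toBytes(UUID.randomUUID().toString()));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("json"), Bytes.toBytes(json));
            table.put(put);
        }
    }
}

The trade-off is that you can no longer filter or read individual fields server-side; you always fetch and parse the whole document.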

Spark not ignoring empty partitions

I am trying to read a subset of a dataset by using a pushdown predicate.
My input dataset is 1.2 TB spread over 43,436 Parquet files stored on S3. With the pushdown predicate I am supposed to read 1/4 of the data.
Looking at the Spark UI, I see that the job actually reads 1/4 of the data (300 GB), but there are still 43,436 partitions in the first stage of the job. However, only 1/4 of these partitions have data; the other 3/4 are empty (see the median input size in the attached screenshots).
I was expecting Spark to create only non-empty partitions. I am seeing a 20% performance overhead when reading the whole dataset with the pushdown predicate compared to another job that reads the pre-filtered dataset (1/4 of the data) directly. I suspect that this overhead is due to the huge number of empty partitions/tasks in my first stage, so I have two questions:
Is there any workaround to avoid these empty partitions?
Can you think of any other reason for the overhead? Maybe the pushdown filter execution is naturally a little slow?
Thank you in advance
Using S3 Select, you can retrieve only a subset of data.
With Amazon EMR release version 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR. S3 Select allows applications to retrieve only a subset of data from an object.
Otherwise, S3 acts as an object store, in which case an entire object has to be read. In your case you have to read the full content of all the files and filter them on the client side.
There is actually a very similar question, where testing showed that:
The input size was always the same as the Spark job that processed all of the data
You can also see this question about optimizing data read from s3 of parquet files.
It seems your files are rather small: 1.2 TB / 43,436 ≈ 30 MB. So you may want to look at increasing spark.sql.files.maxPartitionBytes to see if it reduces the total number of partitions. I don't have much experience with S3, so I'm not sure whether it's going to help, given this note in its description:
The maximum number of bytes to pack into a single partition when
reading files. This configuration is effective only when using
file-based sources such as Parquet, JSON and ORC.
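For example (just an illustrative sketch; the S3 path, the predicate column and the "1g" value are placeholders, not from the question), the setting can be passed when building the session:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MaxPartitionBytesExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("pushdown-read")
                // Pack more of the small parquet files into each partition;
                // "1g" is only an illustrative value to experiment with.
                .config("spark.sql.files.maxPartitionBytes", "1g")
                .getOrCreate();

        Dataset<Row> subset = spark.read()
                .parquet("s3a://my-bucket/my-dataset/")   // hypothetical path
                .filter("event_date >= '2021-01-01'");    // hypothetical pushdown predicate

        // Check how many partitions (and therefore tasks) the first stage will have.
        System.out.println("partitions: " + subset.rdd().getNumPartitions());

        spark.stop();
    }
}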
Empty partitions: It seems that Spark (2.4.5) tries to create partitions of size ≈ spark.sql.files.maxPartitionBytes (default 128 MB) by packing many files into one partition; see the source code.
However, it does this work before running the job, so it can't know that 3/4 of the files will produce no data once the pushed-down predicate is applied. For the partitions packed only with files whose rows are all filtered out, I end up with empty partitions. This also explains why my max partition size is 44 MB and not 128 MB: no partition happened to contain only files that pass the pushdown filter.
20% overhead: Finally, this is not due to the empty partitions; I managed to get far fewer empty partitions by setting spark.sql.files.maxPartitionBytes to 1 GB, but it didn't improve reading. I think the overhead is due to opening many files and reading their metadata.
Spark estimates that opening a file is equivalent to reading 4 MB (spark.sql.files.openCostInBytes). So opening many files, even ones that won't actually be read thanks to the filter, is not negligible.
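As a rough back-of-the-envelope check using Spark's own estimate and the numbers above: 43,436 files × 4 MB of assumed open cost ≈ 170 GB, which is on the same order as the ~300 GB actually read, so it is plausible that the per-file open cost dominates the overhead.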

Data load from HDFS to ES taking very long time

I have created an external table in Hive and need to move the data to ES (a cluster of 2 nodes, each with 1 TB). The regular query below takes a very long time (more than 6 hours) for a source table with 9 GB of data.
INSERT INTO TABLE <ES_DB>.<EXTERNAL_TABLE_FOR_ES>
SELECT COL1, COL2, COL3..., COL10
FROM <HIVE_DB>.<HIVE_TABLE>;
The ES index has the default 5 shards and 1 replica. Could increasing the number of shards speed up the ingestion in any way?
Could someone suggest any improvements to speed up ingestion into the ES nodes?
You don't mention the methodology you're using to feed the data into ES so it's hard to see if you're using an ingestion pipeline or what technology to bridge the gap. Given that, I'll stick with generic advice on how to optimize ingestion into Elasticsearch.
Elastic has published some guidance for optimizing systems for ingestion, and there are three points that we've found do make a real difference:
Turn Off Replicas: Set the number of replicas to zero while ingesting the data to eliminate the need to copy the data while also ingesting it. This is an index-level setting ("number_of_replicas").
Don't Specify an ID: It isn't clear from your schema whether you are mapping any identifiers across, but if you can avoid specifying a document id and let Elasticsearch assign its own, that significantly improves performance.
Use Parallel Bulk Operations: Use the Bulk API to push data into ES and feed it with multiple threads, so it always has more than one bulk request to work on server-side.
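A minimal sketch of the first and third points using the Elasticsearch 6.x Java high-level REST client (the host, index name and dummy documents are placeholders; in practice you would feed several such bulk requests from parallel threads, for example via a BulkProcessor):

import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentType;

public class BulkLoadSketch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // 1. Turn replicas off for the duration of the load.
            client.indices().putSettings(
                    new UpdateSettingsRequest("my-index")
                            .settings(Settings.builder().put("index.number_of_replicas", 0)),
                    RequestOptions.DEFAULT);

            // 2. Send documents in bulk and let Elasticsearch generate the ids
            //    (no explicit id is set on the IndexRequest).
            BulkRequest bulk = new BulkRequest();
            for (int i = 0; i < 1000; i++) {
                String doc = "{\"col1\":\"value-" + i + "\"}";
                bulk.add(new IndexRequest("my-index", "_doc").source(doc, XContentType.JSON));
            }
            client.bulk(bulk, RequestOptions.DEFAULT);

            // 3. Restore the replica count once the load has finished.
            client.indices().putSettings(
                    new UpdateSettingsRequest("my-index")
                            .settings(Settings.builder().put("index.number_of_replicas", 1)),
                    RequestOptions.DEFAULT);
        }
    }
}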
Finally, have you installed Kibana and monitored your nodes to see what they are limited by, in particular CPU or memory?

Indexing process in Hadoop

Could anybody please explain to me what is meant by the indexing process in Hadoop?
Is it something like the traditional indexing of data that we do in an RDBMS? Drawing the same analogy, do we index the data blocks in Hadoop and store the physical addresses of the blocks in some data structure?
So it would take up additional space in the cluster.
I have Googled around this topic but could not find anything satisfactory or detailed.
Any pointers will help.
Thanks in advance
Hadoop stores data in files, and does not index them. To find something, we have to run a MapReduce job going through all the data. Hadoop is efficient where the data is too big for a database. With very large datasets, the cost of regenerating indexes is so high you can't easily index changing data.
However, we can index data in HDFS in two ways: file-based indexing and InputSplit-based indexing.
Let's assume that we have 2 files stored in HDFS for processing. The first one is 500 MB and the second one is around 250 MB. With a 128 MB split size we'll therefore have 4 InputSplits for the first file and 2 for the second.
We can apply 2 types of indexing for the mentioned case -
1. With file-based indexing, you will end up with both files (the full data set here), meaning that your indexed query will be equivalent to a full-scan query.
2. With InputSplit-based indexing, you will end up with only the InputSplits that contain the indexed value. The performance should definitely be better than doing a full-scan query.
Now, to implement an InputSplit index, we need to perform the following steps:
Build an index from your full data set - this can be achieved by writing a MapReduce job to extract the value we want to index and output it together with its InputSplit MD5 hash.
Get the InputSplit(s) for the indexed value you are looking for - the output of that MapReduce job is a set of reduce files (containing the index based on InputSplits) stored in HDFS.
Execute your actual MapReduce job on the indexed InputSplits only - Hadoop can do this because it retrieves the InputSplits to use via FileInputFormat. We create our own IndexFileInputFormat class extending the default FileInputFormat and overriding its getSplits() method. We read the file created in the previous step, add all the indexed InputSplits to a list, and then compare this list with the one returned by the superclass, returning to the JobTracker only the InputSplits that were found in the index.
In the driver class we now have to use this IndexFileInputFormat class, setting it as the input format:
job.setInputFormatClass(IndexFileInputFormat.class);
For a code sample and other details, refer to https://hadoopi.wordpress.com/2013/05/24/indexing-on-mapreduce-2/
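For reference, here is a rough sketch of the getSplits() override described in step 3. It assumes the indexing job wrote one MD5 hash per matching InputSplit (derived here from path, start and length) to a file whose location is passed via a hypothetical index.file property; the hash scheme and property name are assumptions, and the MD5 helper comes from Apache commons-codec.

import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class IndexFileInputFormat extends TextInputFormat {

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        // Hashes of the InputSplits that contain the indexed value, as produced
        // by the indexing MapReduce job and stored in HDFS.
        Set<String> indexedHashes = loadIndexedSplitHashes(context);

        List<InputSplit> allSplits = super.getSplits(context);
        List<InputSplit> filtered = new ArrayList<>();
        for (InputSplit split : allSplits) {
            FileSplit fileSplit = (FileSplit) split;
            // The hash scheme must match whatever the indexing job used.
            String hash = DigestUtils.md5Hex(fileSplit.getPath().toString()
                    + ":" + fileSplit.getStart() + ":" + fileSplit.getLength());
            if (indexedHashes.contains(hash)) {
                filtered.add(split); // keep only the splits found in the index
            }
        }
        return filtered;
    }

    private Set<String> loadIndexedSplitHashes(JobContext context) throws IOException {
        // "index.file" is a made-up property name pointing at the index output in HDFS.
        Path indexPath = new Path(context.getConfiguration().get("index.file"));
        FileSystem fs = indexPath.getFileSystem(context.getConfiguration());
        Set<String> hashes = new HashSet<>();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(indexPath)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                hashes.add(line.trim());
            }
        }
        return hashes;
    }
}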
We can identify 2 different levels of granularity for creating indices: Index based on File URI or index based on InputSplit. Let’s take 2 different examples of data set.
[image: example 1 - index values spread across blocks and InputSplits]
First example:
2 files in your data set fit in 25 blocks and have been identified as 7 different InputSplits. The target you are looking for (highlighted in grey) is available in file #1 (blocks #2, #8 and #13) and in file #2 (block #17).
With File based indexing, you will end up with 2 files (full data set here), meaning that your indexed query will be equivalent to a full scan query
With InputSplit based indexing, you will end up with 4 InputSplits on 7 available. The performance should be definitely better than doing a full scan query
[image: example 2 - data set sorted on the indexed column]
Let’s take a second example:
This time the same data set has been sorted by the column you want to index. The target you are looking for (highlighted in grey) is now available in file #1 (blocks #1, #2, #3 and #4).
With File based indexing, you will end up with only 1 file from your data set
With InputSplit based indexing, you will end up with 1 InputSplit on 7 available
For this specific study, I decided to use a custom InputSplit-based index. I believe this approach strikes a good balance between the effort it takes to implement, the added value it might bring in terms of performance optimization, and its expected applicability regardless of the data distribution.

How to compare two large data sets using hadoop mapreduce?

I am new to Hadoop and MapReduce. We have a normal Java application in which we read a file (8 GB in size) from the Hadoop file system and apply some rules to that data. After applying the rules we get a Java HashMap (which is huge in size) and we keep that data in a cache or buffer. At the same time we get data from Hive by running a query and prepare another Java HashMap, which is again huge. Now we compare the data in both HashMaps to prepare a final report and check the data accuracy.
Since we are using a normal Java program for the above process, we are facing the problems below.
Processing this huge amount of data takes ages: the input file contains tens of millions of records, and we need to apply rules to each row to extract the data, so it takes days to complete the job. At the same time Hive contains the same amount of data, and the query takes too much time to return it.
Since we are keeping the data in buffers, we are facing memory issues.
Now we are trying to implement the same in hadoop mapreduce.
What is the best way to achieve the above scenario?
What are the best ways to implement the above scenario in mapreduce?
How can I increase the application performance by using mapreduce?
8 GB is a tiny data set. I can fit 4 of these 'data sets' into my laptop's RAM! Just dump it into any relational engine and massage it as you see fit until the cows come home. This is not 'big data'.
For the record, the way to do processing of two truly large datasets (say +1 TB each) in Hive is a sort-merge-bucket join (aka. SMB join). Read LanguageManual JoinOptimization, watch Join Strategies in Hive.
