Apache Drill Query data rertival is not constant on HDFS system - hadoop

I am working on Apache Drill and HDFS in my project.
I am dealing with v.big file (e.g 150GB) and that file is stored in HDFS system. I am writing my Drill query such a way that i will get some amount of data and i will process that (e.g 100 rows) and then again fire a query on that file, so my performance will increase.
(e.g SELECT * FROM dfs.file path LIMIT 100 )
But every time when i perform a query on that File which is in HDFS system, i am not getting consistent data. It changes every time as Hadoop will fetch that data from any cluster.
Because of that, it may be the case that during the entire process of getting all the record, i may get the same records which i have already.

You might be lucky with using pagination with LIMIT and OFFSET, altough I am not sure about it's behaviour with HDFS.
There is a question with a similar approach How to use apache drill do page search and the documentation says:
The OFFSET clause provides a way to skip a specified number of first rows in a result set before starting to return any rows.
(Source)

Related

How to find optimum Spark-athena file size

I have a spark job that writes to s3 bucket and have a athena table on top of this location.
The table is partitioned. Spark was writing 1GB single file per partition. We experimented with maxRecordsPerFile option thus writing only 500MB data per file. In the above case we ended up having 2 files with 500MB each
This saved 15 mins in run-time on the EMR
However, there was a problem with athena. Athena query CPU time started getting worse with the new file size limit.
I tried comparing the same data with the same query before and after execution and this is what I found:
Partition columns = source_system, execution_date, year_month_day
Query we tried:
select *
from dw.table
where source_system = 'SS1'
and year_month_day = '2022-09-14'
and product_vendor = 'PV1'
and execution_date = '2022-09-14'
and product_vendor_commission_amount is null
and order_confirmed_date is not null
and filter = 1
order by product_id
limit 100;
Execution time:
Before: 6.79s
After: 11.102s
Explain analyze showed that the new structure had to scan more data.
Before: CPU: 13.38s, Input: 2619584 rows (75.06MB), Data Scanned: 355.04MB; per task: std.dev.: 77434.54, Output: 18 rows (67.88kB)
After: CPU: 20.23s, Input: 2619586 rows (74.87MB), Data Scanned: 631.62MB; per task: std.dev.: 193849.09, Output: 18 rows (67.76kB)
Can you please guide me why this takes double the time? What are the things to look out for? Is there a sweet spot on file size that would be optimal for spark & athena combination?
One hypothesis is that pushdown filters are more effective with the single file strategy.
From AWS Big Data Blog's post titled Top 10 Performance Tuning Tips for Amazon Athena:
Parquet and ORC file formats both support predicate pushdown (also
called predicate filtering). Both formats have blocks of data that
represent column values. Each block holds statistics for the block,
such as max/min values. When a query is being run, these statistics
determine whether the block should be read or skipped depending on the
filter value used in the query. This helps reduce data scanned and
improves the query runtime. To use this capability, add more filters
in the query (for example, using a WHERE clause).
One way to optimize the number of blocks to be skipped is to identify
and sort by a commonly filtered column before writing your ORC or
Parquet files. This ensures that the range between the min and max of
values within the block are as small as possible within each block.
This gives it a better chance to be pruned and also reduces data
scanned further.
To test it I would suggest to do another experiment if possible. Change the spark job and sort the data before persisting it into the two files. Use the following order:
source_system, execution_date, year_month_day, product_vendor, product_vendor_commission_amount, order_confirmed_date, filter and product_id. Then check the query statistics.
At least the dataset would be optimised for the presented use case. Otherwise, change it according to the most heavy queries.
The post comments about optimal file sizes too and it gives a general rule of thumb. From my experience, Spark works well with sizes between 128MB and 2GB. It should be also fine for other query engines like Presto used by Athena.
My suggestion would be to break year_month_day/execution date ( as mostly used in the queries ) to Year, Month and Day partitions , which would reduce the amount of data scan and efficient filtering.

Can Hive periodically append or insert incremental data to the same table file in hdfs?

I'm loading network captured data every minute from Spark streaming (from Flume exec), then aggregate data according to ip address, save to Hive at the end. To make it faster I create Hive ORC table with partition on ip address, it works well. The only issue is every minute it creates many (depends on how many ip addresses) kb small files, now I use "ALTER TABLE...CONCATENATE;" to merge them manually, but I think it could be easier, so want to ask whether there is solution that can incrementally merge/append new data to first minute table files instead of creating new table files every minute. Any suggestion is appreciated!
I give up, looks no direct solution as Hive can't append content to existing datafile for performance consideration. Now my alternative is still to concatenate it every week, the problem is query will be broken with error message (complaining it can't find data file) when it's doing concatenation, so there is big business impact. Now I'm thinking replacing Hive with HBase or Kudu which is more flexible and can provide update/delete operation.

Processing very large dataset in real time in hadoop

I'm trying to understand how to architect a big data solution. I have historic data of 400TB of data and every hour 1GB of data is getting inserted.
Since data is confidential, I'm describing sample scenario, Data contains information of all activities in a bank branch. With every hour, when new data is inserted(no updation) into hdfs, I need to find how many loans closed, loans created,accounts expired, etc ( around 1000 analytics to be performed). Analytics involve processing entire 400TB of data.
I was plan was to use hadoop + spark. But I'm being suggested to use HBase. Reading through all the documents, I'm not able to find a clear advantage.
What is the best way to go for data which will grow to 600TB
1. MR for analytics and impala/hive for query
2. Spark for analytics and query
3. HBase + MR for analytics and query
Thanks in advance
About HBase:
HBase is a database that is build over HDFS. HBase uses HDFS to store data.
Basically, HBase will allow you to update records, have versioning and deletion of single records. HDFS does not support file updates, so HBase is introducing something you can consider "virtual" operations, and merge data from multiple sources (original files, delete markers) when you are asking it for data. Also, HBase as key-value store is creating indices to support selecting by key.
Your problem:
Choosing the technology in such situations you should look into what you are going to do with the data: Single query on Impala (with Avro schema) can be much faster than MapReduce (not to mention Spark). Spark will be faster in batch jobs, when there is caching involved.
You are probably familiar with Lambda architecture, if not, take a look into it. For what I can tell you now, the third option you mentioned (HBase and MR only) won't be good. I did not try Impala + HBase, so I can't say anything about performance, but HDFS (plain files) + Spark + Impala (with Avro), worked for me: Spark was doing reports for pre-defined queries (after that, data was stored in objectFiles - not human-readable, but very fast), Impala for custom queries.
Hope it helps at least a little.

Running a MR Job on a portion of the HDFS file

Imagine you have a big file stored in hdtf which contains structured data. Now the goal is to process only a portion of data in the file like all the lines in the file where second column value is between so and so. Is it possible to launch the MR job such that hdfs only stream the relevant portion of the file versus streaming everything to the mappers.
The reason is that I want to expedite the job speed by only working on the portion that I need. Probably one approach is to run a MR job to get create a new file but I am wondering if one can avoid that?
Please note that the goal is to keep the data in HDFS and I do not want to read and write from database.
HDFS stores files as a bunch of bytes in blocks, and there is no indexing, and therefore no way to only read in a portion of your file (at least at the time of this writing). Furthermore, any given mapper may get the first block of the file or the 400th, and you don't get control over that.
That said, the whole point of MapReduce is to distribute the load over many machines. In our cluster, we run up to 28 mappers at a time (7 per node on 4 nodes), so if my input file is 1TB, each map slot may only end up reading 3% of the total file, or about 30GB. You just perform the filter that you want in the mapper, and only process the rows you are interested in.
If you really need filtered access, you might want to look at storing your data in HBase. It can act as a native source for MapReduce jobs, provides filtered reads, and stores its data on HDFS, so you are still in the distributed world.
One answer is looking at the way that hive solves this problem. The data is in "tables" which are really just meta data about files on disk. Hive allows you to set columns on which a table is partitioned. This creates a separate folder for each partition so if you were partitioning a file by date you would have:
/mytable/2011-12-01
/mytable/2011-12-02
Inside of the date directory would be you actual files. So if you then ran a query like:
SELECT * FROM mytable WHERE dt ='2011-12-01'
Only files in /mytable/2011-12-01 would be fed into the job.
Tho bottom line is that if you want functionality like this you either want to move to a higher level language (hive/pig) or you need to roll your own solutions.
Big part of the processing cost - is data parsing to produce Key-Values to the Mapper. We create there (usually) one java object per value + some container. It is costly both in terms of CPU and garbage collector pressure
I would suggest the solution "in the middle". You can write input format which will read the input stream and skip non-relevant data in the early stage (for example by looking into few first bytes of the string). As a result you will read all data, but actually parse and pass to the Mapper - only portion of it.
Another approach I would consider - is to use RCFile format (or other columnar format), and take care that relevant and non relevant data will sit in the different columns.
If the files that you want to process have some unique attribute about their filename (like extension or partial filename match), you can also use the setInputPathFilter method of FileInputFormat to ignore all but the ones you want for your MR job. Hadoop by default ignores all ".xxx" and _xxx" files/dirs, but you can extend with setInputPathFilter.
As others have noted above, you will likely get sub-optimal performance out of your cluster doing something like this which breaks the "one block per mapper" paradigm, but sometimes this is acceptable. Can sometimes take more to "do it right", esp if you're dealing with a small amount of data & the time to re-architect and/or re-dump into HBase would eclipse the extra time required to run your job sub-optimally.

Storage of parsed log data in hadoop and exporting it into relational DB

I have a requirement of parsing both Apache access logs and tomcat logs one after another using map reduce. Few fields are being extracted from tomcat log and rest from Apache log.I need to merge /map extracted fields based on the timestamp and export these mapped fields into a traditional relational db ( ex. MySQL ).
I can parse and extract information using regular expression or pig. The challenge i am facing is on how to map extracted information from both logs into a single aggregate format or file and how to export this data to MYSQL.
Few approaches I am thinking of
1) Write output of map reduce from both parsed Apache access logs and tomcat logs into separate files and merge those into a single file ( again based on timestamp ). Export this data to MySQL.
2) Use Hbase or Hive to store data in table format in hadoop and export that to MySQL
3) Directly write the output of map reduce to MySQL using JDBC.
Which approach would be most viable and also please suggest any other alternative solutions you know.
It's almost always preferable to have smaller, simpler MR jobs and chain them together than to have large, complex jobs. I think your best option is to go with something like #1. In other words:
Process Apache httpd logs into a unified format.
Process Tomcat logs into a unified format.
Join the output of 1 and 2 using whatever logic makes sense, writing the result into the same format.
Export the resulting dataset to your database.
You can probably perform the join and transform (1 and 2) in the same step. Use the map to transform and do a reduce side join.
It doesn't sound like you need / want the overhead of random access so I wouldn't look at HBase. This isn't its strong point (although you could do it in the random access sense by looking up each record in HBase by timestamp, seeing if it exists, merging the record in, or simply inserting if it doesn't exist, but this is very slow, comparatively). Hive could be conveinnient to store the "unified" result of the two formats, but you'd still have to transform the records into that format.
You absolutely do not want to have the reducer write to MySQL directly. This effectively creates a DDOS attack on the database. Consider a cluster of 10 nodes, each running 5 reducers, you'll have 50 concurrent writers to the same table. As you grow the cluster you'll exceed max connections very quickly and choke the RDBMS.
All of that said, ask yourself if it makes sense to put this much data into the database, if you're considering the full log records. This amount of data is precisely the type of case Hadoop itself is meant to store and process long term. If you're computing aggregates of this data, by all means, toss it into MySQL.
Hope this helps.

Resources