I am trying to reformat over 600 GB of CSV files into Parquet using Apache Drill in a single-node setup.
I run my SQL statement:
CREATE TABLE Data_Transform.`/` AS
....
FROM Data_source.`/data_dump/*`
It creates Parquet files, but I get the error:
Query Failed: An Error Occurred
org.apache.drill.common.exceptions.UserRemoteException: RESOURCE ERROR:
One or more nodes ran out of memory while executing the query.
Is there a way around this?
Or is there an alternative way to do the conversion?
I don't know if querying all those GB on a local node is feasible. If you've configured the memory per the docs, using a cluster of Drillbits to share the load is the obvious solution, but I guess you already know that.
If you're willing to experiment, and you're converting the CSV files with a select * rather than selecting individual columns, change the query to something like select columns[0] as user_id, columns[1] as user_name. Cast columns to types like INT, FLOAT, or DATE where possible. This avoids the overhead of reading and storing everything as VARCHAR and prepares the data for future queries, which would otherwise need those casts for any analysis.
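A minimal sketch of such a CTAS; the target table name and the column positions, names and types are hypothetical (adjust them to your actual data):
CREATE TABLE Data_Transform.`/converted` AS
SELECT
  CAST(columns[0] AS INT)   AS user_id,    -- hypothetical column layout
  columns[1]                AS user_name,
  CAST(columns[2] AS DATE)  AS signup_date,
  CAST(columns[3] AS FLOAT) AS score
FROM Data_source.`/data_dump/*`;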
I've also seen the following recommendation from a Drill developer: split the files into smaller files manually to work around local file system limitations, since Drill doesn't split individual files on block boundaries.
I'm loading network capture data every minute from Spark Streaming (from a Flume exec source), aggregating it by IP address, and saving it to Hive at the end. To make it faster I created a Hive ORC table partitioned on IP address, and that works well. The only issue is that every minute it creates many (depending on how many IP addresses there are) small KB-sized files. Right now I use "ALTER TABLE...CONCATENATE;" to merge them manually, but I think it could be easier, so I want to ask whether there is a solution that can incrementally merge/append new data to the first minute's table files instead of creating new table files every minute. Any suggestion is appreciated!
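For reference, the manual merge I run per partition looks roughly like this (the table name and partition spec are placeholders):
ALTER TABLE network_agg PARTITION (ip = '10.0.0.1') CONCATENATE;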
I give up; it looks like there is no direct solution, since Hive can't append content to an existing data file for performance reasons. My alternative for now is still to concatenate every week. The problem is that queries break with an error message (complaining that a data file can't be found) while the concatenation is running, so there is a big business impact. Now I'm thinking of replacing Hive with HBase or Kudu, which are more flexible and provide update/delete operations.
I have CSV files organized by date and time as follows
logs/YYYY/MM/DD/CSV files...
I have set up Apache Drill to execute SQL queries on top of these CSV files. Since there are many CSV files, their organization can be used to optimize performance. For example,
SELECT * from data where trans>='20170101' AND trans<'20170102';
In this SQL, the directory logs/2017/01/01 should be scanned for data. Is there a way to let Apache Drill do optimization based on this directory structure? Is it possible to do this in Hive, Impala or any other tool?
Please note:
SQL queries will almost always contain the time frame.
The number of CSV files in a given directory is not huge, but combined across all years' worth of data it is huge
There is a field called 'trans' in every CSV file, which contains the date and time.
The CSV file is put under appropriate directory based on the value of 'trans' field.
CSV files do not follow any schema. Columns may or may not be different.
Querying using a column inside the data file would not help with partition pruning.
You can use dir* variables in Drill to refer to the partitions of a table.
create view trans_logs_view as
select
`dir0` as `trans_year`,
`dir1` as `trans_month`,
`dir2` as `trans_date`,
*
from dfs.`/data/logs`;
You can query using the trans_year, trans_month and trans_date columns for partition pruning.
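For example, a query like the one below (the literal values are just an illustration) only needs to scan the matching directories:
select * from trans_logs_view
where trans_year = '2017' and trans_month = '01' and trans_date = '01';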
Also see if the query below helps with pruning.
select count(1) from dfs.`/data/logs`
where concat(`dir0`,`dir1`,`dir2`) between '20170101' AND '20170102';
If so, you can define a view that aliases concat(`dir0`,`dir1`,`dir2`) as the trans column and query against that.
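A sketch of such a view (the view name is just a placeholder):
create view trans_logs_by_trans_view as
select
concat(`dir0`,`dir1`,`dir2`) as `trans`,
*
from dfs.`/data/logs`;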
See below for more details.
https://drill.apache.org/docs/how-to-partition-data/
I am a newbie to the Hadoop ecosystem and I need some suggestions from Big Data experts on achieving schema verification/validation before loading huge data into HDFS.
The scenario is:
I have a huge dataset with a given schema (around 200 column headers). This dataset is going to be stored in Hive tables/HDFS. Before loading the data into the Hive table/HDFS I want to perform schema-level verification/validation on the supplied data, to avoid any unwanted errors/exceptions while loading it. For example, if somebody passes a data file having fewer or more columns, the load should fail at this first level of verification.
What could be the best possible approach for achieving the same?
Regards,
Bhupesh
Since you have files, you can add them into HDFS and run MapReduce on top of them. There you get hold of each row, so you can verify the number of columns, their types, and any other validations.
Regarding JSON/XML: there is a slight overhead in making MapReduce identify records in those formats. However, with respect to validation, there is schema validation you can enforce, and you can also restrict a field to specific allowed values using the schema. Once the schema is ready, you do the parsing (e.g. XML to Java) and then store the records at another, final HDFS location for further use (like HBase). When you are sure the data is validated, you can create Hive tables on top of it.
Use the utility below to create temp tables each time, based on the schema you receive in CSV format in the staging directory, and then apply some conditions to identify whether you have valid columns or not. Finally, load into the original table.
https://github.com/enahwe/Csv2Hive
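Independently of that utility, here is a minimal HiveQL sketch of a staging-level column-count check (the table name, location, delimiter and the expected count of 200 columns are assumptions):
-- stage each raw file as a single string column so nothing is parsed yet;
-- the default field delimiter ('\001') should not occur in CSV data, so each whole line lands in `line`
CREATE EXTERNAL TABLE staging_raw (line STRING)
LOCATION '/staging/raw';

-- count rows that do not have exactly 200 comma-separated fields
SELECT count(*) AS bad_rows
FROM staging_raw
WHERE size(split(line, ',')) <> 200;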
Hi everyone,
I have about 6 GB of data in HDFS that has been exported from MySQL, and I have written MapReduce jobs that pre-process the data to fill in some key fields so the data can be queried easily.
The business demands different aggregations of the data grouped by day, hour, hospital, area, etc.,
so I have to write many Hive SQL queries that export data to local disk, and then write Python scripts to parse the files on local disk to get the data that is needed.
Is there a good technique on Hadoop to address this? I am still considering options.
Can you help me, please?
Part 1: my environment
I have following files uploaded to Hadoop:
They are plain text
Each line contains JSON like:
{code:[int], customerId:[string], data:{[something more here]}}
code values are numbers from 1 to 3000,
customerId values total up to 4 million, with up to 0.5 million per day
All files are gzipped
In hive I created external table with custom JSON serde (let's call it CUSTOMER_DATA)
All files from each date are stored in a separate directory, and I use those directories as partitions in Hive tables
Most queries I run filter by date, code and customerId. I also have a second file (let's call it CUSTOMER_ATTRIBUTES) with the format:
[customerId] [attribute_1] [attribute_2] ... [attribute_n]
which contains data for all my customers, so it has up to 4 million rows.
I query and filter my data in the following way:
Filtering by date - partitions do the job here using WHERE partitionDate IN (20141020,20141020)
Filtering by code using a statement like, for example, WHERE code IN (1,4,5,33,6784)
Joining CUSTOMER_ATTRIBUTES with CUSTOMER_DATA with a query like
SELECT customerId
FROM CUSTOMER_DATA
JOIN CUSTOMER_ATTRIBUTES ON (CUSTOMER_ATTRIBUTES.customerId=CUSTOMER_DATA.customerId)
WHERE CUSTOMER_ATTRIBUTES.attribute_1=[something]
Part 2: question
Is there any efficient way to optimize my queries? I have read about indexes and buckets, but I don't know whether I can use them with external tables and whether they will optimize my queries.
Performance on search:
Internal vs. external tables make no difference as far as performance is concerned. You can build indexes on both. Either way, building indexes on large data sets is counterintuitive.
Bucketing the data on your search columns would give a lot of performance gains, but whether you can bucket your data or not depends on your use case.
You can consider additional partitioning (if possible) on code/customerId to get more gains. Hopefully you don't have too many unique code or customerId values.
Rather than trying these things out on your textual JSON-formatted data, I would strongly suggest you move away from JSON text data. Parsing JSON (text) is a big performance killer.
These days there are a lot of file formats that work pretty well. If you can't change the component that produces the data, you can use a series of queries and tables to convert to another file format. This is a one-time job for each partition's data. After that, your search queries will run faster on the newer file format.
For example, the RCFile format is supported by Hive. If you pull out code and customerId as separate columns in RCFile, then the query engine can completely skip the data column for rows not matching code IN (1,4,5,33,6784), reducing I/O heavily.
Also, storing data in RCFile, i.e. columnar storage, will help your joins. With RCFile, when you run a query with a join, the Hive execution engine will only read the required columns, again significantly reducing I/O. On top of this, if you bucket the columns that are part of the JOIN keys, it will lead to more performance gains.
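A hedged sketch of such a one-time conversion for a single partition (the target table name and partition value are assumptions; CUSTOMER_DATA is the existing JSON-serde table from the question):
CREATE TABLE customer_data_rc (
  code        INT,
  customerId  STRING,
  -- adjust this type to whatever your JSON serde exposes for the nested payload
  `data`      STRING
)
PARTITIONED BY (partitionDate STRING)
STORED AS RCFILE;

INSERT OVERWRITE TABLE customer_data_rc PARTITION (partitionDate = '20141020')
SELECT code, customerId, `data`
FROM CUSTOMER_DATA
WHERE partitionDate = '20141020';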
If you need to keep JSON due to the nested nature of the data, then I would suggest you look at Parquet.
It will give you the performance gains of RCFile plus a binary format (Avro, Thrift, etc.).
At my work we had 2 columns of heavily nested JSON data. We tried storing this as compressed text and as sequence file format. We then broke up the complex nested JSON columns into multiple, less nested columns and pulled out some frequently searched keys into their own columns. We stored this as RCFile, and the performance gains we observed on searching were huge.
Right now, with further bursts in data, we need to improve more. After trying a few more things and talking to the Cloudera folks, there is only one big area left to improve: move away from JSON parsing. Parquet seems to be the ideal candidate for this.
Yes, you can use indexes with external tables. Indexes do optimize search queries.
CREATE INDEX your_index_name ON TABLE your_table_name(field_you_want_to_index) AS 'COMPACT' WITH DEFERRED REBUILD;
Indexing takes a lot of time for a huge dataset, so we can do a deferred rebuild, i.e. after production hours :)
ALTER INDEX your_index_name ON your_table_name REBUILD;
You can even rebuild a specific partition:
ALTER INDEX your_index_name ON your_table_name PARTITION(your_field = 'any_thing') REBUILD;
When you JOIN two tables, bucketing is the best option to go with; it does a lot of optimization.
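A hedged sketch of bucketing the attributes table on the join key (the table name, bucket count and storage format are assumptions):
CREATE TABLE customer_attributes_bucketed (
  customerId  STRING,
  attribute_1 STRING,
  attribute_2 STRING
)
CLUSTERED BY (customerId) INTO 32 BUCKETS
STORED AS ORC;

-- populate it from the existing external table so buckets are actually written
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE customer_attributes_bucketed
SELECT customerId, attribute_1, attribute_2 FROM CUSTOMER_ATTRIBUTES;

-- let Hive use bucket map joins when both sides are bucketed on the join key
SET hive.optimize.bucketmapjoin = true;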