Given a directory with hundreds of tab-delimited csv files, each of which contains no header in the first row. That means we will specify the column names by other means.These files can be located on a local disk, or HDFS.
What is the most efficient way to index these files?
if you have a lot of files , i think there are several methods to improve indexing speed :
First , if your data on a local disk , you can build index use multithreading , but need to pay attention to, each thread has its own index of an output directory. Finally merged them into an index so that improve search speed .
Second , if your data on HDFS , i think use Hadoop MapReduce to build index is very powerful .
in addition , some UDF plugins of Pig or Hive also can build index easily , but
you need convert your data into hive table or make pig schemal , these is simple !
Third , in order to better understand above methods , maybe you can read
How to make indexing faster
Related
I want to configure a flume flow so that it takes in a CSV file as a source, checks the data, and dynamically separates each row of data into folders by year/month in HDFS. Is this possible?
I might suggest you look at using Nifi instead. I feel like it's the natural replacement for Flume.
Having said that it would seem that you might want to consider using a the spooling directory source and a hive sink (instead of hdfs). The hive partitions (Partitions on year/Month) would enable you to land the data in the Manner you are suggesting.
My question is mostly theoretical, but i have some tables that already follow some sort of partition scheme, lets say my table is partitioned by day, but after working with the data for sometime we want to modifity to month partitions instead, i could easily recreare the table with the new partition definition and reinsert the data, is this the best approach? sounds slow when the data is huge, i have seen there are multiple alter commands in hive for partitions, is there one that can help me achieve what i need?
Maybe there is another choice of concatenating the files and then recreating the table with the new partition?
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;
If there are any relevant references they are appreciated as well.
If the files are in daily folders, you can not mount many daily folders into single month partition, for each month, files needs to be moved to month folder. You can not do it as a metadata only operation.
If you are good in shell scripting you can write loop in hadoop fs -ls <table location> | sort, in the loop save path into variable, check if substring including yyyy-MM is different from previous, then create yyyy-MM folder. For each row in a loop do copy everything into month location (hadoop fs -cp daily_location/* month_location/), all can be done in single loop.
If you are on S3 and using AWS-CLI commands, creating of folders is not necessary, just copy.
If there are too many small files, you may want to concatenate them in monthly folders, if it is ORC, you can execute ALTER TABLE PARTITION CONCATENATE. If not ORC, then better use Hive INSERT OVERWRITE, it will do all that for you, you can configure merge task and finally your files will be in optimal size. Additionally you can improve compression efficiency and make possible to use bloom filters and internal indexes(if it is ORC/Parquet) if you add distribute by partition_col sort by <keys used in filters/joins>, this will greatly reduce table size and improve queries performance.
So, better use Hive for this task because it gives you opportunity to improve data storage: change storage format, concatenate files, sort to reduce compressed size and make indices and bloom filters be really useful.
Hi I don't understand why this code takes too much time.
val newDataDF = sqlContext.read.parquet("hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711*/*")
It's supposed than no bytes are transferred to the driver program, isn't it? How does read.parquet works?
What I can see from the Spark web UI is that read.spark fires about 4000 tasks (there's a lot of parquet files inside that folder).
The issue most likely is the file indexing that has to occur as the first step of loading a DataFrame. You said the spark.read.parquet fires off 4000 tasks, so you probably have many partition folders? Spark will get an HDFS directory listing and recursively get the FileStatus (size and splits) of all files in each folder. For efficiency Spark indexes the files in parallel, so you want to ensure you have enough cores to make it as fast as possible. You can also be more explicit in the folders you wish to read or define a Parquet DataSource table over the data to avoid the partition discovery each time you load it.
spark.sql("""
create table mydata
using parquet
options(
path 'hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711*/*'
)
""")
spark.sql("msck repair table mydata")
From this point on, when you query the data it will no longer have to do the partition discovery, but it'll still have to get the FileStatus for the files within the folders you query. If you add new partitions you can either add the partition explicitly of force a full repair table again:
spark.sql("""
alter table mydata add partition(foo='bar')
location 'hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711/foo=bar'
""")
I have CSV files organized by date and time as follows
logs/YYYY/MM/DD/CSV files...
I have setup Apache Drill to execute SQL queries on top of these CSV files. Since there are many CSV files; the organization of the files can be utilized to optimize the performance. For example,
SELECT * from data where trans>='20170101' AND trans<'20170102';
In this SQL, the directory logs/2017/01/01 should be scanned for data. Is there a way to let Apache Drill do optimization based on this directory structure? Is it possible to do this in Hive, Impala or any other tool?
Please note:
SQL queries will almost always contain the time frame.
Number of CSV files in a given directory is not huge. Combined all years worth of data, it will be huge
There is a field called 'trans' in every CSV file, which contains the date and time.
The CSV file is put under appropriate directory based on the value of 'trans' field.
CSV files do not follow any schema. Columns may or may not be different.
Querying using column inside the data file would not help in partition pruning.
You can use dir* variables in Drill to refer to partitions in table.
create view trans_logs_view as
select
`dir0` as `tran_year`,
`dir1` as `trans_month`,
`dir2` as `tran_date`, * from dfs.`/data/logs`;
You can query using tran_year,tran_month and tran_date columns for partition pruning.
Also see if below query helps for pruning.
select count(1) from dfs.`/data/logs`
where concat(`dir0`,`dir1`,`dir2`) between '20170101' AND '20170102';
If so , you can define view by aliasing concat(dir0,dir1,dir2) to trans column name and query.
See below for more details.
https://drill.apache.org/docs/how-to-partition-data/
how to work on specific part of cvs file uploaded into HDFS ?
I'm new in Hadoop and i have an a question that is if i export an a relational database into cvs file then uploaded it into HDFS . so how to work on specific part (table) in file using MapReduce .
thanks in advance .
I assume that the RDBMS tables are exported to individual csv files for each table and stored in HDFS. I presume that, you are referring to column(s) data within the table(s) when you mentioned 'specific part (table)'. If so, place the individual csv files into the separate file paths say /user/userName/dbName/tables/table1.csv
Now, you can configure the job for the input path and field occurrences. You may consider to use the default Input Format so that your mapper would get one line at time as input. Based on the configuration/properties, you can read the specific fields and process the data.
Cascading allows you to get started very quickly with MapReduce. It has framework that allows you to set up Taps to access sources (your CSV file) and process it inside a pipeline say to (for example) add column A to column B and place the sum into column C by selecting them as Fields
use BigTable means convert your database to one big table