Difference between a serial file and a multifile in Ab Initio - ETL

What are the differences between a serial file and a Multi file in Ab Initio?

A serial file contains the data directly.
A multifile, by contrast, contains the partitions in which the data is held; the data is displayed once each of the partitions is opened separately.
We use 'cat' for serial files and 'm_cat' for multifiles.
If 'cat' is used to open a multifile, it will just show the names of the partition files.
Typically, a multifile directory differs from a serial directory in that it contains a file called .mdir.
The .mdir file lists all the partitions that the multifile system refers to.
Performance-wise, multifiles increase the speed of execution of a graph, since the data is processed in parallel, whereas in a serial file the same data is processed sequentially.
Thanks
Arijit

A serial file is just like a normal file-system file, usually kept under serial paths.
A "multifile", on the other hand, is a single logical file divided into multiple files kept under different data partitions (the number of data partitions is defined by the MFS_DEPTH parameter).
If you want to operate on a multifile, you always do it from its control partition using 'm_' commands.
You can 'cd' to where the file resides in the control partition and run 'cat FILENAME.mfctl' to see where your file lies in each of the data partitions.

A serial file is a normal file.
A multifile is a file that is divided into different partitions, and each partition may be located anywhere the Ab Initio Co>Operating System is installed.
The icon for a multifile shows three platters.

Related

How to merge small HDFS files into one large file?

I have a number of small files generated from a Kafka stream, and I would like to merge them into one single file. The merge is based on date: the original folder may contain a number of files from previous days, but I only want to merge the files for a given date into one single file.
Any suggestions?
Use something like the code below to iterate over the smaller files and aggregate them into a big one (assuming that source contains the HDFS path to your smaller files, and target is the path where you want your big result file):
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(source)).map(_.getPath.toUri.getPath).
  foreach(name => spark.read.text(name).coalesce(1).write.mode(SaveMode.Append).text(target))
This example assumes text file format, but you can just as well read any Spark-supported format, and you can even use different formats for source and target.
You should be able to use .repartition(1) to write all results to one file. If you need to split by date, consider partitionBy("your_date_value"), as sketched below.
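A minimal PySpark sketch of that idea; the input and output paths, the header option, and the "your_date_value" column name are placeholders rather than anything from the original question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assume the small files are CSVs that already contain a date column named "your_date_value"
df = spark.read.option("header", "true").csv("hdfs:///path/to/small_files/*.csv")

# Repartitioning by the date column keeps each date's output down to a single part file,
# and partitionBy writes one directory per distinct date value
(df.repartition("your_date_value")
   .write.mode("overwrite")
   .partitionBy("your_date_value")
   .csv("hdfs:///path/to/merged_output"))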
If you're working with both HDFS and S3, this may also be helpful; you might even use s3-dist-cp and stay within HDFS.
https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
There's a specific option to aggregate multiple files in HDFS using a --groupBy option based on a regular expression pattern. So if the date is in the file name, you can group based on that pattern.
You can develop a Spark application that reads the data from the small files into a DataFrame and writes the DataFrame to the big file in append mode, roughly as sketched below.
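A minimal PySpark sketch of that approach, assuming the small files for one date can be selected with a path glob (the paths and the date in them are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read only the small files for the date of interest (placeholder glob)
df = spark.read.text("hdfs:///kafka/landing/2020-01-15/*")

# coalesce(1) produces a single output part file; append keeps the results of earlier merges
df.coalesce(1).write.mode("append").text("hdfs:///kafka/merged/2020-01-15")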

I have a CSV file with locations; I need to move S3 files to new locations

I am interested in loading my data into AWS Athena.
My data is organized by source_video, and under each one we have 11 CSV files that represent 11 tables referencing this data.
Athena wants to load by table and not by source_video,
so I have to move these files into folders based on table name rather than source_video.
I am fluent in Python and bash,
and I know how to use the AWS CLI.
I wish to know whether there is an easier way than running 4 million+ mv commands and executing them in different processes in parallel on several machines.
I have a CSV file that holds the locations of files stored as children of the source_video they were created for:
I have 400,000+ source_video locations
I have 11 files in each source_video location
i.e.
+source_video1
- 11 files by type
+source_video2
- 11 files by type
+source_video3
- 11 files by type
.
.
+source_video400,000+
- 11 files by type
I wish to move them into 11 folders, one per file type, with 400,000+ files in each.
fields: videoName, CClocation, identityLocation, TAGTAskslocation, M2Location
and other locations ....
Below is an example of 2 rows of data:
pj1/09/11/09/S1/S1_IBM2MP_0353_00070280_DVR1.avi,
S1_IBM2MP_0353_00070280_DVR1.avi,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsidentities.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsTAGtasks.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_AL1CCs2data.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_AL1CCs2extendeddata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_AL1CCsdata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsALGsruns.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_Cg1CCsdata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_Cg1CCfeat.csv, s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsidentitiestaggers.csv
pj1/09/11/09/S1/S1_IBM2MP_0443_00070380_DVR1.avi,
S1_IBM2MP_0443_00070380_DVR1.avi,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsidentities.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsTAGtasks.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_AL1CCs2data.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_AL1CCs2extendeddata.csv, s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_AL1CCsdata.csv, s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsALGsruns.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_Cg1CCsdata.csv, s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_Cg1CCfeat.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsidentitiestaggers.csv
You are correct. Athena expects all files related to one table to be located in one directory, or in subdirectories of one directory.
Given that you are going to touch so many files, you could choose to process the files rather than simply moving them, for example putting the contents of several files into a smaller number of files. You could also consider zipping the files, because this would cost you less to scan (Athena is charged based on the data read from disk; compressed files read less data and therefore cost less).
See: Analyzing Data in S3 using Amazon Athena
This type of processing could be done efficiently on an Amazon EMR cluster running Hadoop, but some specialist knowledge is required to run Hadoop, so it might be easier to use the coding with which you are familiar (e.g. Python).
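If you do go the Python route, a "move" in S3 is a copy followed by a delete; a minimal boto3 sketch driven by the CSV might look like the following. The bucket name, the CSV layout, the assumption of standard s3://bucket/key URLs, and the way the table name is derived from the file-name suffix are all hypothetical and would need adjusting:
import csv
import boto3

s3 = boto3.client("s3")
BUCKET = "bucket1"  # hypothetical bucket name

def table_name(key):
    # Hypothetical: derive the table from the file-name suffix, e.g. "..._CCsidentities.csv"
    return key.rsplit(".avi_", 1)[-1].replace(".csv", "")

with open("locations.csv", newline="") as f:
    for row in csv.reader(f):
        for url in row[2:]:  # columns that hold S3 locations (assumed layout)
            key = url.strip().split("://", 1)[-1].split("/", 1)[-1]  # drop "s3://bucket/"
            new_key = f"tables/{table_name(key)}/{key.rsplit('/', 1)[-1]}"
            s3.copy_object(Bucket=BUCKET, Key=new_key,
                           CopySource={"Bucket": BUCKET, "Key": key})
            s3.delete_object(Bucket=BUCKET, Key=key)
Millions of copy/delete calls are still a lot of API traffic, so you would typically parallelize or batch this; the sketch only shows the per-object logic.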

How to output multiple S3 files in Parquet

Writing Parquet data can be done with something like the following. But what if I want to write to more than just one file, and moreover to output multiple S3 files so that reading a single column does not read all of the S3 data? How can this be done?
AvroParquetWriter<GenericRecord> writer =
    new AvroParquetWriter<GenericRecord>(file, schema);
GenericData.Record record = new GenericRecordBuilder(schema)
    .set("name", "myname")
    .set("favorite_number", i)
    .set("favorite_color", "mystring")
    .build();
writer.write(record);
For example, what if I want to partition by a column value, so that all the data with favorite_color of red goes in one file and the data with blue goes in another, to minimize the cost of certain queries? There should be something similar in a Hadoop context. All I can find are things that mention Spark, using something like
df.write.parquet("hdfs:///my_file", partitionBy=["created_year", "created_month"])
But I can find no equivalent to partitionBy in plain Java with Hadoop.
In a typical MapReduce application, the number of output files will be the same as the number of reducers in your job. So if you want multiple output files, set the number of reducers accordingly:
job.setNumReduceTasks(N);
or alternatively via the system property:
-Dmapreduce.job.reduces=N
I don't think it is possible to have one column per file with the Parquet format. The internal structure of a Parquet file is first split into row groups, and only those row groups are then split by column.

What is the best place to store multiple small files in Hadoop?

I will have multiple small text files, around 10 KB in size, and I am confused about whether to store them in HBase or in HDFS. Which will be the more efficient storage?
To store them in HBase I need to parse each file first and then save it against some row key.
In HDFS I can directly create a path and save the file at that location.
But everything I have read so far says you should not have many small files and should instead create fewer, larger files.
However, I cannot merge those files, so I can't create big files out of the small ones.
Kindly suggest.
A large number of small files does not fit well with Hadoop, since each file occupies an HDFS block and each block requires one mapper to process it by default.
There are several options/strategies to minimize the impact of small files; all of them require processing the small files at least once and "packaging" them in a better format. If you are planning to read these files several times, pre-processing the small files can make sense, but if you will use them just one time then it doesn't matter much.
To process small files, my suggestion is to use CombineTextInputFormat (here is an example): https://github.com/lalosam/HadoopInExamples/blob/master/src/main/java/rojosam/hadoop/CombinedInputWordCount/DriverCIPWC.java
CombineTextInputFormat uses one mapper to process several files, but it may require transferring the files to the DataNode where the map task is running in order to put them together, and it can perform badly with speculative tasks, though you can disable those if your cluster is stable enough.
Alternatives for repackaging small files are:
Create sequence files where each record contains one of the small files. With this option you will keep the original files (see the sketch after this list).
Use IdentityMapper and IdentityReducer with fewer reducers than files. This is the easiest approach, but it requires that every line in the files be equivalent and independent (no headers or metadata at the beginning of a file that are needed to understand the rest of it).
Create an external table in Hive and then insert all the records from that table into a new table (INSERT INTO ... SELECT FROM ...). This approach has the same limitations as option two and requires Hive; the advantage is that you don't have to write a MapReduce job.
If you cannot merge the files as in options two or three, my suggestion is to go with option one.
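A minimal PySpark sketch of the first alternative (one record per small file in a sequence file); the paths are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# wholeTextFiles yields (path, file_contents) pairs, one pair per small file
pairs = sc.wholeTextFiles("hdfs:///path/to/small_files/*")

# Each record of the sequence file is one original small file, keyed by its path
pairs.saveAsSequenceFile("hdfs:///path/to/packed_sequence_file")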
You could try using HAR archives: https://hadoop.apache.org/docs/r2.7.2/hadoop-archives/HadoopArchives.html
Having many small, different files is not a problem in itself. If, for example, you have a table in Hive with many very small files in HDFS, that is not optimal; it is better to merge these files into fewer, larger ones, because a lot of mappers will be created when the table is read. If your files are completely different, like 'apples' and 'employees', and cannot be merged, then just store them as is.

How to work on a specific part of a CSV file uploaded into HDFS?

I'm new to Hadoop and I have a question: if I export a relational database into a CSV file and then upload it into HDFS, how do I work on a specific part (table) of the file using MapReduce?
Thanks in advance.
I assume that the RDBMS tables are exported to individual CSV files, one per table, and stored in HDFS. I presume that you are referring to column data within the tables when you mention 'specific part (table)'. If so, place the individual CSV files into separate file paths, say /user/userName/dbName/tables/table1.csv.
Now you can configure the job for the input path and the field positions. Consider using the default input format so that your mapper gets one line at a time as input. Based on the configuration/properties, you can read the specific fields and process the data.
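For illustration, a minimal Hadoop Streaming style mapper in Python that reads one CSV line at a time and emits only selected fields; the column indices and the comma delimiter are assumptions, not taken from the question:
#!/usr/bin/env python3
import sys

# Hypothetical: keep only columns 0 and 3 of each CSV line (e.g. an id and an amount)
WANTED = (0, 3)

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) <= max(WANTED):
        continue  # skip short or malformed lines
    print(f"{fields[WANTED[0]]}\t{fields[WANTED[1]]}")  # tab-separated key/value for the reducer
With Hadoop Streaming you would pass a script like this as the -mapper and point -input at the table's CSV path.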
Cascading allows you to get started very quickly with MapReduce. It is a framework that lets you set up Taps to access sources (your CSV file) and process them inside a pipeline, for example adding column A to column B and placing the sum into column C by selecting them as Fields.
Use a BigTable-style approach, i.e. convert your database into one big table.
