I have a big file (an attribute file) in my Amazon S3 bucket in .zip form. It is around 30 GB when unzipped. The file is updated every 2 days.
INDEX HEIGHT GENDER AGE
00125 155 MALE 15
01002 161 FEMALE 18
00410 173 MALE 17
00001 160 MALE 20
00010 159 FEMALE 22
.
.
.
My use case is such that I want to iterate once through the sorted attribute file in one program run. Since the zipped file is around 3.6 GB and is updated every 2 days, my code downloads it from S3 every time. (I could probably use caching, but currently I am not.)
I want the file to be sorted. Since the unzipped file is large and is expected to keep growing, I do not want to unzip it during the code run.
What I am trying to achieve is the following:
I have other files, the metric files. They are relatively small (~20-30 MB) and are sorted.
INDEX MARKS
00102 45
00125 62
00342 134
00410 159
.
.
.
Using the INDEX, I want to create a METRIC-ATTRIBUTE file for each METRIC file. If the attribute file were also sorted, I could do something similar to merging two sorted lists, taking only the rows with a common INDEX. It would take O(SIZE_OF_ATTRIBUTE_FILE + SIZE_OF_METRIC_FILE) space and time.
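For illustration, this is roughly the merge I have in mind: a minimal sketch in Python that assumes both inputs are whitespace-delimited text files with a header row, already sorted by the zero-padded INDEX (zip handling is ignored for clarity, and the file paths are placeholders).

```python
# Sketch of the sorted-merge step; paths and column layout are placeholders.
def merge_sorted_files(attr_path, metric_path, out_path):
    with open(attr_path) as attr_f, open(metric_path) as metric_f, \
         open(out_path, "w") as out_f:
        next(attr_f)    # skip header rows
        next(metric_f)
        attr_line = attr_f.readline()
        metric_line = metric_f.readline()
        while attr_line and metric_line:
            attr_cols = attr_line.split()
            metric_cols = metric_line.split()
            if attr_cols[0] == metric_cols[0]:
                # Common INDEX: emit the joined METRIC-ATTRIBUTE row.
                out_f.write(" ".join(metric_cols + attr_cols[1:]) + "\n")
                attr_line = attr_f.readline()
                metric_line = metric_f.readline()
            elif attr_cols[0] < metric_cols[0]:
                attr_line = attr_f.readline()
            else:
                metric_line = metric_f.readline()
```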
What is the best way to sort it (the attribute file)? An AWS solution is preferred.
You may be able to use the AWS Athena service. Athena can operate over big S3 files (even if they are zipped). You could create one table for each S3 file and run SQL queries against them.
Sorting is possible, since you can use the ORDER BY clause, but I don't know how efficient it will be.
It may not be necessary though. You will be able to join the tables on the INDEX column.
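For example, once both tables exist, the join could be submitted from Python with boto3; the database name, table names and output location below are made up for illustration.

```python
import boto3

# Database, table names and output location are placeholders.
athena = boto3.client("athena")

query = """
SELECT m."index", m.marks, a.height, a.gender, a.age
FROM metric m
JOIN attribute a ON m."index" = a."index"
ORDER BY m."index"
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(response["QueryExecutionId"])
```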
I've been looking at the Athena and PrestoDB documentation and can't find any reference to a limit on the number of elements in an array column and/or a maximum total size. Files will be in Parquet format, but that's negotiable if Parquet is the limiting factor.
Is this known?
MORE CONTEXT:
I'm going to be pushing data into a Firehose that will emit Parquet files to S3, which I plan to query using Athena. The data is a one-to-many mapping of S3 URIs to a set of IDs, e.g.
s3://bucket/key_one, 123
s3://bucket/key_one, 456
....
s3://bucket/key_two, 321
s3://bucket/key_two, 654
...
Alternately, I could store in this form:
s3://bucket/key_one, [123, 456, ...]
s3://bucket/key_two, [321, 654, ...]
Since Parquet is compressed I'm not concerned with the size of the files on S3. The repeated URIs should get taken care of by the compression.
What's of more concern is the number of calls I need to make to Firehose in order to insert records. In the first case there's one record per (object, ID) tuple, of which there are approximately 6000 per object. There's a "batch" call, but it's limited to 500 records per batch, so I would end up making multiple calls. This code will execute in a Lambda function, so I'm trying to save execution time wherever possible.
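For reference, a minimal sketch of the batching in Python; the delivery stream name and record shape are placeholders, and put_record_batch accepts at most 500 records per call.

```python
import json
import boto3

firehose = boto3.client("firehose")

def put_in_batches(records, stream_name="my-delivery-stream", batch_size=500):
    """Send records to Firehose in batches of at most 500 (the API limit)."""
    for i in range(0, len(records), batch_size):
        batch = [
            {"Data": (json.dumps(r) + "\n").encode("utf-8")}
            for r in records[i:i + batch_size]
        ]
        resp = firehose.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        # FailedPutCount > 0 means some records should be retried.
        if resp["FailedPutCount"]:
            print("records that failed:", resp["FailedPutCount"])
```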
There should not be any explicit limit on the Presto/Athena side for the number of elements in an array column type. Eventually it comes down to JVM limitations, which are huge. Just make sure that you have enough node memory available to handle these fields. It would be great if you could review your use case and avoid storing very large column values (of array type).
I am new to HBase and still not sure which component of the Hadoop ecosystem I will use in my case, or how I will analyse my data later, so I am just exploring options.
I have an Excel sheet with a summary of all the customers, like this but with ≈ 400 columns:
CustomerID Country Age E-mail
251648 Russia 27 boo#yahoo.com
487985 USA 30 foo#yahoo.com
478945 England 15 lala#yahoo.com
789456 USA 25 nana#yahoo.com
Also, I have .xls files created separately for each customer with information about them (one customer = one .xls file); the number of columns and the column names are the same in each file. Each of these files is named with a CustomerID. One looks like this:
'customerID_251648.xls':
feature1 feature2 feature3 feature4
0 33,878 yes 789,598
1 48,457 yes 879,594
1 78,495 yes 487,457
0 94,589 no 787,475
I have converted all these files into .csv format and now feel stuck about which component of the Hadoop ecosystem I should use for storing and querying such data.
My eventual goal is to query some customerID and to get all the information about a customer from all the files.
I think that HBase fits perfectly for that because I can create such a schema:
row key timestamp Column Family 1 Column Family 2
251648 Country Age E-Mail Feature1 Feature2 Feature3 Feature4
What is the best approach to upload and query such data in HBase? Should I first combine the information about a customer from the different sources and then upload it to HBase? Or can I keep separate .csv files for each customer and, when uploading to HBase, somehow choose which .csv to use for forming the column families?
For querying the data stored in HBase, I am going to write MapReduce tasks via a Python API.
Any help would be very much appreciated!
You are correct with the schema design. Also remember that HBase loads the whole column family during scans, so if you need all the data at one time it may be better to place everything in one column family.
A simple way to load the data would be to scan the first file with the customers and fetch the data from the second file on the fly. A bulk CSV load could be faster in execution time, but you'll spend more time writing code.
Maybe you also need to think about the row key, because HBase stores data in lexicographical order. If you have a lot of data, you'd better create the table with given split keys rather than letting HBase do the splits, because that can end up with unbalanced regions.
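As a rough illustration of the "fetch on the fly" approach in Python, using the happybase client; the host, table name, column families and file names below are placeholders.

```python
import csv
import happybase

# Host, table name and column families are placeholders for illustration.
connection = happybase.Connection("hbase-host")
table = connection.table("customers")

with open("customers_summary.csv") as summary:
    for row in csv.DictReader(summary):
        customer_id = row["CustomerID"]
        data = {}
        # Summary columns go into the first column family.
        for col, val in row.items():
            if col != "CustomerID":
                data[f"cf1:{col}".encode()] = val.encode()
        # The per-customer file is fetched on the fly into the second family.
        with open(f"customerID_{customer_id}.csv") as details:
            for i, feat in enumerate(csv.DictReader(details)):
                for col, val in feat.items():
                    data[f"cf2:{col}_{i}".encode()] = val.encode()
        table.put(customer_id.encode(), data)
```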
I am interested in loading my data into an AWS Athena database.
My data is organized by source_video, and each source_video contains 11 CSV files that represent 11 tables referencing that data.
Athena wants data loaded by table and not by source_video,
so for this I have to move these files into folders based on table name rather than source_video.
I am fluent in Python and bash,
and I know how to use the AWS CLI.
I wish to know if there is an easier way than generating 4 million+ mv commands and executing them in different processes in parallel on several machines.
I have a CSV file that holds the locations of the files, stored as children of the source_video they were created for:
I have 400,000+ source_video locations
I have 11 files in each source_video location
i.e.
+source_video1
- 11 files by type
+source_video2
- 11 files by type
+source_video3
- 11 files by type
.
.
+source_video400,000+
- 11 files by type
I wish to move them into 11 folders (one per file type), with 400,000+ files in each folder.
fields: videoName, CClocation, identityLocation, TAGTAskslocation, M2Location
and other locations ....
Below is an example of 2 rows of data:
pj1/09/11/09/S1/S1_IBM2MP_0353_00070280_DVR1.avi,
S1_IBM2MP_0353_00070280_DVR1.avi,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsidentities.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsTAGtasks.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_AL1CCs2data.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_AL1CCs2extendeddata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_AL1CCsdata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsALGsruns.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_Cg1CCsdata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_Cg1CCfeat.csv, s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsidentitiestaggers.csv
pj1/09/11/09/S1/S1_IBM2MP_0443_00070380_DVR1.avi,
S1_IBM2MP_0443_00070380_DVR1.avi,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsidentities.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsTAGtasks.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_AL1CCs2data.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_AL1CCs2extendeddata.csv, s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_AL1CCsdata.csv, s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsALGsruns.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_Cg1CCsdata.csv, s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_Cg1CCfeat.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsidentitiestaggers.csv
You are correct. Athena expects all files related to one table to be located in one directory, or in subdirectories of that directory.
Given that you are going to touch so many files, you could choose to process the files rather than simply moving them, for example by putting the contents of several files into a smaller number of files. You could also consider zipping the files, because this would cost you less to scan (Athena is charged based on the data read; compressed files mean less data read and therefore lower cost).
See: Analyzing Data in S3 using Amazon Athena
This type of processing could be done efficiently on an Amazon EMR cluster running Hadoop, but some specialist knowledge is required to run Hadoop, so it might be easier to use the tooling with which you are already familiar (e.g. Python).
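For example, since the objects never need to leave S3, server-side copies driven from Python avoid downloading anything; the bucket name, prefix and suffix-to-table mapping below are assumptions.

```python
import boto3
from concurrent.futures import ThreadPoolExecutor

# Bucket name, prefixes and the suffix-to-table mapping are placeholders.
s3 = boto3.client("s3")
BUCKET = "bucket1"

def copy_to_table_folder(key):
    """Server-side copy of one CSV into a per-table prefix, e.g.
    .../xxx.avi_CCsidentities.csv -> tables/CCsidentities/..."""
    table = key.rsplit(".avi_", 1)[-1].replace(".csv", "")
    dest = f"tables/{table}/{key.replace('/', '_')}"
    s3.copy_object(Bucket=BUCKET,
                   Key=dest,
                   CopySource={"Bucket": BUCKET, "Key": key})

# Collect the keys to reorganize.
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix="DB2/"):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

# Copies are independent, so they parallelize well with threads.
with ThreadPoolExecutor(max_workers=32) as pool:
    pool.map(copy_to_table_folder, keys)
```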
I have a use case where I have 4 TB of data in HBase tables that I have made queryable through Hive tables.
Now I want to extract 5K files out of the 30 tables that I have created in Hive.
These 5K files will be created by 5K predefined queries.
Can somebody suggest which approach I should follow for this?
The required time for this is 15 hours.
Should I write Java code to generate all these files?
File generation is fast: out of the 5K text files, there are 50 that take around 35 minutes; the rest are created very quickly.
I have to generate a zipped file and send it to the client using FTP.
If I understand your question correctly, you can accomplish your task by first exporting the query results via one of the methods described here: How to export a Hive table into a CSV file?, compressing the files into a zip archive, and then FTP'ing them. You can write a shell script to automate the process.
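A rough sketch of that pipeline in Python, assuming the queries can be run through the beeline CLI; the JDBC URL, query list, file names and FTP details are placeholders.

```python
import subprocess
import zipfile
from ftplib import FTP

# JDBC URL, query list, file names and FTP credentials are placeholders.
QUERIES = {
    "report_001.txt": "SELECT * FROM db.table1",
    # ... up to the 5K predefined queries
}

for out_name, sql in QUERIES.items():
    # Export one query result to a local text file via beeline.
    with open(out_name, "w") as out:
        subprocess.run(
            ["beeline", "-u", "jdbc:hive2://hive-host:10000",
             "--outputformat=csv2", "-e", sql],
            stdout=out, check=True)

# Bundle everything into one zip archive.
with zipfile.ZipFile("export.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for out_name in QUERIES:
        zf.write(out_name)

# Push the archive to the client's FTP server.
with FTP("ftp.example.com") as ftp:
    ftp.login("user", "password")
    with open("export.zip", "rb") as fh:
        ftp.storbinary("STOR export.zip", fh)
```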
I have developed an ETL with shell scripting.
After that, I found that there is an existing solution, Talend Open Studio.
I am thinking of using it for my future tasks.
But my problem is that the files that I want to integrate into the database must have their structure transformed. This is the structure that I have:
19-08-02 Name appel ok hope local merge (mk)
juin nov sept oct
00:00:t1 T1 299 0 24 8 3 64
F2 119 0 11 8 3 62
I1 25 0 2 9 4 64
F3 105 0 10 7 3 61
Regulated F2 0 0 0
FR T1 104 0 10 7 3 61
I must transform it into a flat file format.
Does Talend offer the possibility to do several transformations before integrating the data from CSV files into the database, or not?
Edit
This is an example of the flat file that I want to achieve before integrating the data into the database (only the first row is concerned):
Timer,T1,F2,I1,F3,Regulated F2,FR T1
00:00:t1,299,119,25,105,0,104
00:00:t2,649,119,225,165,5,102
00:00:t5,800,111,250,105,0,100
We can split the task into three pieces: extract, transform, load.
Extract
First you have to find out how to connect to the source. With Talend it is possible to connect to different kinds of sources, like databases, XML files, flat files, CSV, etc. The components are called tFileInput or tMySQLInput, to name a few.
Transform
Then you have to tell Talend how to split the data into columns. In your example, the delimiter could be the whitespace, although the splitting might be difficult because the field Name is also separated by whitespace.
Afterwards, since it is a column-to-row transposition, you have to write some Java code in a tJavaRow component, or you could alternatively use a tMap component with a conditional mapping: (row.Name.equals("T1") ? row.value : 0)
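Outside of Talend, the same column-to-row transposition can be sketched in plain Python to make the logic concrete; the column names come from the example in the question, and the parsing rules are assumptions about the file layout.

```python
import csv
import re

# Column names taken from the example; parsing rules are assumptions.
NAMES = ["T1", "F2", "I1", "F3", "Regulated F2", "FR T1"]
TIMER = re.compile(r"^\d{2}:\d{2}:t\d+")

def transpose(lines):
    rows, current = [], None
    for line in lines:
        line = line.strip()
        timer = TIMER.match(line)
        if timer:
            if current:
                rows.append(current)
            current = {"Timer": timer.group(0)}
            line = line[timer.end():].strip()
        if current is None or not line:
            continue
        # Match the (possibly multi-word) name and keep its first value.
        for name in NAMES:
            if line.startswith(name):
                values = line[len(name):].split()
                if values:
                    current[name] = values[0]
                break
    if current:
        rows.append(current)
    return rows

with open("input.txt") as src, open("flat.csv", "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=["Timer"] + NAMES)
    writer.writeheader()
    writer.writerows(transpose(src))
```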
Load
Then the transformation would be complete and your data could be stored in a database, a target file, etc. The components here would be called tFileOutput or tOracleOutput, for example.
Conclusion
Yes, it would be possible to build your ETL process in Talend. The transposition could be a little complicated if you are new to Talend, but if you keep in mind that Talend processes data row by row (as your script does, I assume), it is not that big of a problem.