How to convert a 'date' column in a Glue context dynamic frame using mapping? - parquet

I have an input csv file with two columns: name and expirationdate.
In my Glue job I open the file into a dynamic frame from options, and it looks good, but when I print the schema all of the columns are strings.
So I want to convert expirationdate into a date column and then save the frame as Parquet.
I am creating a temporary dynamic frame with a mapping as follows:
myFrame = originalDynamicFrameFromOptions.apply_mapping([('name','string','name','string'),('expirationdate','date','expirationdate','date')])
When I print the schema it looks good, but when I try to save the frame as Parquet I get the following error:
'Unsupported case of DataType: com.amazonaws.services.glue.schema.types and DynamicNode: stringnode'
Any help will be greatly appreciated.
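For what it's worth, here is a minimal sketch of the shape this kind of job usually takes, assuming the CSV dates are plain ISO-formatted strings. The source type in each mapping tuple describes the column as it currently is (a string after a CSV read), and the target type is what it should become. The frame and column names come from the question; the S3 paths and read/write options are illustrative placeholders, not from the question.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Illustrative read; the paths and format options are placeholders.
originalDynamicFrameFromOptions = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Source types say what the columns are now (all strings after a CSV read);
# target types say what they should become. Whether the string-to-date cast
# succeeds depends on the date format actually present in the file.
myFrame = originalDynamicFrameFromOptions.apply_mapping([
    ("name", "string", "name", "string"),
    ("expirationdate", "string", "expirationdate", "date"),
])

glue_context.write_dynamic_frame.from_options(
    frame=myFrame,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
)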

Related

Parquet meta data of "has_dictionary_page" is false but column has "PLAIN_DICTIONARY" encoding

I used the parquet module of pyarrow to read the metadata of a Parquet file with this code:
from pyarrow import parquet

p_file = parquet.ParquetFile("v-c000.gz.parquet")
for rg_idx in range(p_file.metadata.num_row_groups):
    rg = p_file.metadata.row_group(rg_idx)
    for col_idx in range(rg.num_columns):
        col = rg.column(col_idx)
        print(col)
and got in the output: has_dictionary_page: False (for every row group),
but according to my checks all the column chunks in all of the row groups are PLAIN_DICTIONARY encoded. Furthermore, I checked the statistics of the dictionary and could see all of its keys and values.
How is it possible that there is no dictionary page?
My best guess is that you are running into PARQUET-1547, which is described a bit more in this question.
In summary, some Parquet writers did not populate the dictionary_page_offset field correctly. Other Parquet readers have workarounds in place to recognize the invalid write. However, parquet-cpp (which is used by pyarrow) does not have such a workaround.
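To see the mismatch described above for yourself, you can compare each column chunk's reported encodings with its dictionary-page fields. This is just an illustrative extension of the snippet from the question (the file name is the question's placeholder):
from pyarrow import parquet

p_file = parquet.ParquetFile("v-c000.gz.parquet")
for rg_idx in range(p_file.metadata.num_row_groups):
    rg = p_file.metadata.row_group(rg_idx)
    for col_idx in range(rg.num_columns):
        col = rg.column(col_idx)
        # A chunk can list PLAIN_DICTIONARY among its encodings while
        # has_dictionary_page is False if the writer filled in the
        # dictionary page metadata incorrectly (PARQUET-1547).
        print(col.path_in_schema,
              col.encodings,
              col.has_dictionary_page,
              col.dictionary_page_offset)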

Talend tInputFileDelimited component java.lang.NumberFormatException for CSV file

As a beginner to TOS for BD (Talend Open Studio for Big Data), I am trying to read two CSV files in Talend OS. I have inferred the metadata schema from the same CSV file, set the first row to be the header, and set the delimiter to comma (,).
In my code:
The tMap will read the CSV file, do a lookup on another CSV file, and generate two output files, one for passed and one for rejected records.
But while running the job I am getting the below error.
Couldn't parse value for column 'Product_ID' in 'row1', value is '4569,Laptop,10'. Details: java.lang.NumberFormatException: For input string: "4569,Laptop,10"
I believe it is treating the entire row as one string and using it as the value of the "Product_ID" column.
I don't know why that is happening when I have set the delimiter and row separator correctly.
[Screenshot: Schema]
I can see that no rows are coming out of the first tInputFileDelimited due to the above error.
[Screenshot: Job Run]
[Screenshot: Input component]
Any idea what else I can check?
Thanks in advance.
In your last screenshot, you can see that the Field separator of your tFileInputDelimited_1 is ";" and not ",".
I believe that you haven't set up your component to use the metadata you created for your csv file.
So you need to configure the component to use the metadata you've created by selecting Repository under Property Type, and selecting the delimited file metadata.

How to load nested collections in hive with more than 3 levels

I'm struggling to load data into Hive, defined like this:
CREATE TABLE complexstructure (
  id STRING,
  date DATE,
  day_data ARRAY<STRUCT<offset:INT,data:MAP<STRING,FLOAT>>>
)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by ':';
The day_data field contains a complex structure that is difficult to load with load data inpath...
I've tried with '\004', ^D... a lot of options, but the data inside the map doesn't get loaded.
Here is my last try:
id_3054,2012-09-22,3600000:TOT'\005'0.716'\004'PI'\005'0.093'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.0'\004'RES'\005'0.0|7200000:TOT'\005'0.367'\004'PI'\005'0.066'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.0'\004'RES'\005'0.0|10800000:TOT'\005'0.268'\004'PI'\005'0.02'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.159'\004'RES'\005'0.0|14400000:TOT'\005'0.417'\004'PI'\005'0.002'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.165'\004'RES'\005'0.0
Before posting here, I tried (many, many) options, and this example doesn't work:
HIVE nested ARRAY in MAP data type
I'm using the image from HDP 2.2
Any help would be much appreciated
Thanks
Carlos
So finally I found a nice way to generate the file from Java. The trick is that Hive uses the first 8 ASCII characters as separators, but you can only override the first three. From the fourth on, you need to generate the actual ASCII characters.
After many tests, I ended up editing my file with a hex editor, and inserting the right values worked. But how can I do that in Java? It couldn't be simpler: just cast an int to char, and that will generate the corresponding ASCII character:
ASCII 4 -> ((char)4)
ASCII 5 -> ((char)5)
...
And so on.
Hope this helps!!
Carlos
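The answer above is about Java, but as a language-neutral sketch of the same trick (turning an integer code into the corresponding control character), here is a minimal Python example that writes one row shaped like the sample line in the question. The value layout follows that sample; the output file name is illustrative.
# Illustrative only: chr() plays the role of Java's (char) cast.
# Delimiters follow the table definition and the sample row:
# ',' between fields, '|' between array items, ':' between struct fields,
# chr(4) between map entries, chr(5) between a map key and its value.
MAP_ENTRY_SEP = chr(4)
MAP_KV_SEP = chr(5)

def map_to_field(values):
    # values is a dict such as {"TOT": 0.716, "PI": 0.093}
    return MAP_ENTRY_SEP.join(f"{k}{MAP_KV_SEP}{v}" for k, v in values.items())

element = "3600000:" + map_to_field({"TOT": 0.716, "PI": 0.093})
row = ",".join(["id_3054", "2012-09-22", element])

with open("complexstructure.txt", "w") as out:  # file name is made up
    out.write(row + "\n")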
You could store the Hive table in Parquet or ORC format, which support nested structures natively and more efficiently.

How do I split in Pig a tuple of many maps into different rows

I have a relation in Pig that looks like this:
([account_id#100,
timestamp#1434,
id#900],
[account_id#100,
timestamp#1434,
id#901],
[account_id#100,
timestamp#1434,
id#902])
As you can see, I have three map objects within a tuple. All of the data above is within the $0'th field of the relation, so the data above sits in a relation with a single bytearray column.
The data is loaded as follows:
data = load 's3://data/data' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
DESCRIBE data;
data: {bytearray}
How do I split this data structure into three rows so that the output is as follows?
data: {account_id:chararray, timestamp:chararray, id:int}
(100, 1434,900)
(100, 1434,901)
(100, 1434,902)
It is very difficult to guess your problem without having sample input data. If this is an intermediate result, then write it out using a STORE and post the output file as something we can use as input to try it out. I was able to solve this using STRSPLIT, but I am not sure whether you meant that the input is a single column in a single row, or whether these are three different rows with the same column.
In either case, flattening out the data using the FLATTEN operator and applying STRSPLIT afterwards should help. If I get more information and input data for the problem, I can give a working example.
Data -> FLATTEN to get out of the bag -> STRSPLIT over "," in a FOREACH ... GENERATE

Importing CSV data to update existing records with Rails

I'm having a bit of trouble getting a CSV into my application which I'd like to use to update existing records and create new ones. My CSV data only has two headers, Date and Total. I've created an import method in my model which creates everything, but when I upload the CSV it won't update existing records; it just creates duplicates.
Here is my method. As you can see, I'm finding the row by the Date heading using find_by, then creating a new record if this returns nothing, and updating it with the data from the current row if it matches. But that doesn't seem to happen; I just get duplicate rows.
def self.import(file)
  CSV.foreach(file.path, headers: true) do |row|
    entry = find_by(Date: row["Date"]) || new
    entry.update row.to_hash
    entry.save!
  end
end
I hope I'm understanding this correctly. As discovered in the comment thread for the question, the date was being persisted to the database in yyyy-mm-dd format, while the date read in from the CSV file was a string in a different format (dd-mm-yyyy according to the comments). Doing a find_by with that raw string never returned results, as the format differed from the one used in the database.
Date.parse will convert the string read from the CSV file into a true Date object which can be successfully compared against the date stored in the database.
So, rather than:
entry = find_by(Date: row["Date"]) || new
Use:
entry = find_by(Date: Date.parse(row["Date"])) || new
