How do I split a tuple of many maps into different rows in Pig?

I have a relation in Pig that looks like this:
([account_id#100,
timestamp#1434,
id#900],
[account_id#100,
timestamp#1434,
id#901],
[account_id#100,
timestamp#1434,
id#902])
As you can see, I have three map objects within a tuple. All of the data above is in the $0 field of the relation, so the relation has a single bytearray column.
The data is loaded as follows:
data = load 's3://data/data' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
DESCRIBE data;
data: {bytearray}
How do I split this data structure into three rows so that the output is as follows?
data: {account_id:chararray, timestamp:chararray, id:int}
(100, 1434, 900)
(100, 1434, 901)
(100, 1434, 902)

It is very difficult to guess your problem without sample input data. If this is an intermediate result, then write it out using STORE and share the output file as something we can load and try out. I was able to solve this using STRSPLIT, but I am not sure whether you meant that the input is a single column in a single row, or three different rows with the same column.
In either case, flattening out the data using the FLATTEN operator and then using STRSPLIT should help. If I get more information and input data for the problem, I can give a working example.
Data -> FLATTEN to get out of the bag -> STRSPLIT over "," in a FOREACH ... GENERATE

Related

Merge two bags and get all the fields from the first bag in Pig

I am new to Pig scripting and need some help with this issue.
I have two bags in Pig, and I want to get all the fields from the first bag, overwriting the first bag's data whenever the second bag has data for the same field.
The column list is dynamic (columns may be added or deleted at any time).
In set b we may also get data in other fields that are currently blank; if so, we need to overwrite set a with the data available in set b.
columns - uniqueid,catagory,b,c,d,e,f,region,g,h,date,direction,indicator
E.g.:
all_data = COGROUP a by (uniqueid), b by (uniqueid);
Output:
(1,{(1,test,,,,,,,,city,,,,,2020-06-08T18:31:09.000Z,west,,,,,,,,,,,,,A)},{(1,,,,,,,,,,,,,,2020-09-08T19:31:09.000Z,,,,,,,,,,,,,,N)})
(2,{(2,test2,,,,,,,,dist,,,,,2020-08-02T13:06:16.000Z,east,,,,,,,,,,,,A)},{(2,,,,,,,,,,,,,,2020-09-08T18:31:09.000Z,,,,,,,,,,,,,,N)})
Expected Result:
(1,test,,,,,,,,city,,,,,2020-09-08T19:31:09.000Z,west,,,,,,,,,,,,,N)
(2,test2,,,,,,,,dist,,,,,2020-09-08T18:31:09.000Z,east,,,,,,,,,,,,N)
I was able to achieve the expected output with the following:
final = FOREACH all_data GENERATE flatten($1), flatten($2.(region)) as region, flatten($2.(indicator)) as indicator;

How to use repeating field cardinality in the Render-CSV BW step?

I am building a generic CSV output module with a variable number of columns. The Data Format in BW (5.14) lets you define a repeating item and thus offers a list of items that I could map data to in the RenderCSV step.
But when I run this with data for more than one column (and loops), only one column is generated.
Is the feature broken, or am I using it wrongly?
Alternatively, I defined "enough" optional columns in the data format and mapped each field separately, which is not really a generic solution.
It looks like in BW 5, repeating elements aren't supported when using Data Format and Parse Data to parse text.
Please see https://support.tibco.com/s/article/Tibco-KnowledgeArticle-Article-27133
The workaround is to use the Data Format resource and the Parse Data and Mapper activities together. First use Data Format and Parse Data to parse the text into XML where every element represents one line of the text. Then use a Mapper activity and the tib:tokenize-allow-empty XSLT function to tokenize every line and get sub-elements for each field in the lines.
The linked article also has an attached workaround implementation.

Is there a simple way to use PySpark DataFrame filter() to split a frame into exactly two frames, based on a condition?

I'm working on a data warehouse project. I'm reading input data into a frame, and then I want to filter out the bad rows. However, I want to print some sample bad rows. What I have now is
df_good = df_input.filter(((df_input.info.isNull()) | (df_input.info == '')))
This filter works, but I cannot print out a sample of the dropped records. What I would like is something like:
df_keep, df_reject = df_input.filter_split(((df_input.info.isNull()) | (df_input.info == '')))
print("Sample rejected records:")
df_reject.show(5)
I found one method which involves running the filter, then joining the good data back to the original data with an outer join, and then filtering to find the original rows that are not in the good data set (a sketch of this approach follows below). But this iterates over the original data twice; I would like to pass through the data just once.
Any ideas? I am doing this in AWS Glue, so I may be able to use a Dynamic Frame function.
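For what it's worth, here is a minimal PySpark sketch of the join-back method described above (the toy data and the record_id key column are made up for the example); a left_anti join expresses the "not in the good data set" step directly, but it still makes two passes over the input:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# Stand-in for df_input; record_id is a hypothetical unique key column.
df_input = spark.createDataFrame(
    [(1, "ok"), (2, None), (3, "")], ["record_id", "info"]
)

split_cond = F.col("info").isNull() | (F.col("info") == "")
df_keep = df_input.filter(split_cond)
# Rows of df_input with no match in df_keep are the rejected records.
df_reject = df_input.join(df_keep, on="record_id", how="left_anti")

print("Sample rejected records:")
df_reject.show(5)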

Partial Vertical Caching of DataFrame

I use Spark with Parquet.
I'd like to be able to cache the columns we use most often for filtering, while keeping the others on disk.
I'm running something like:
myDataFrame.select("field1").cache
myDataFrame.select("field1").count
myDataFrame.select("field1").where($"field1">5).count
myDataFrame.select("field1", "field2").where($"field1">5).count
The fourth line doesn't use the cache.
Any simple solutions that can help here?
The reason this will not cache is that whenever you do a transformation on a dataframe (e.g. select), you are actually creating a new one. What you basically did was cache a dataframe containing only field1, and then build a dataframe containing only field1 where it is larger than 5 (you probably meant field2 here, but it doesn't matter).
On the fourth line you are creating a third dataframe which has no lineage to those two, only to the original dataframe.
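As a quick illustration of that lineage point, here is a PySpark sketch with made-up data (only the column names come from the question): caching a single projection that contains every column the later queries need lets all of them reuse the cache.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Stand-in for the question's dataframe; in practice this comes from reading parquet.
myDataFrame = spark.createDataFrame(
    [(i, i * 10) for i in range(20)], ["field1", "field2"]
)

# One cached projection holding both columns, so later queries share its lineage.
cached = myDataFrame.select("field1", "field2").cache()
cached.count()                              # materializes the cache
cached.where(cached.field1 > 5).count()     # served from the cache
cached.where(cached.field1 > 5).select("field1", "field2").count()  # also served from the cache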
If you generally do strong filtering (i.e. you get a very small number of elements) you can do something like this:
cachedDF = myDataFrame.select("field1", "field2", ... "fieldn").cache
cachedDF.count()
filteredDF = cachedDF.filter(some strong filter)
res = myDataFrame.join(broadcast(filteredDF), cond)
i.e., cachedDF has all the fields you filter on; you then filter very strongly and do an inner join (with cond being all the relevant selected fields or some id field), which gives you all the relevant data.
That said, in most cases, assuming you use a file format such as parquet, caching will not help you much.

SSIS: Variable as "NEW" derived column

I am trying to write the SQL task results to a flat file. I have a SQL task, followed by a Foreach Loop that parses the object result into variables. Inside the Foreach Loop I have a Data Flow.
Inside the Data Flow I have a Derived Column transformation, where I am trying to use the variables as columns, because I want to write them to the flat file. However, the Derived Column keeps complaining about not having any input columns (and writes 0 rows to the flat file), and I do not know why.
These are the instructions I am trying to follow: Using Variable as expression in Derived column transformation SSIS
The Derived Column transformation is part of a Data Flow. A Data Flow means that you have a set of rows with columns originating from some Data Flow Source, undergoing Data Flow transformations like Derived Column, and then passing the rows to a Data Flow Destination. A Data Flow transformation needs to have an input and an output.
In your case, create an OLE DB Source with a dummy query like `select 0 as dummy` and direct its output to your Derived Column. Later you can drop the dummy column.
