Column separation issue while using readerForListOf in jackson-dataformat-csv - java-8

We are using the following code to read data from a CSV file. By default it uses ',' as the column separator, but sometimes I have to use ';' instead.
CsvMapper mapper = new CsvMapper();
MappingIterator<List<String>> it = mapper
        .readerForListOf(String.class)
        .with(CsvParser.Feature.WRAP_AS_ARRAY)
        .with(CsvParser.Feature.EMPTY_STRING_AS_NULL) // !!! IMPORTANT
        .readValues(stream);
I have seen other ways to set the column separator, but when using readerForListOf I have no idea where to put that configuration.
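One approach that should work, sketched below as a minimal, unverified example (stream is assumed to be the same input source as above): build a CsvSchema that carries only the separator and attach it to the ObjectReader via with(schema).

import java.io.InputStream;
import java.util.List;
import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvParser;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

public class SemicolonCsvReader {
    static MappingIterator<List<String>> read(InputStream stream) throws Exception {
        CsvMapper mapper = new CsvMapper();
        // Schema that only overrides the column separator (';' instead of the default ',')
        CsvSchema schema = CsvSchema.emptySchema().withColumnSeparator(';');
        return mapper
                .readerForListOf(String.class)
                .with(schema) // ObjectReader.with(FormatSchema) applies the separator
                .with(CsvParser.Feature.WRAP_AS_ARRAY)
                .with(CsvParser.Feature.EMPTY_STRING_AS_NULL)
                .readValues(stream);
    }
}

You can then iterate the result as before, e.g. while (it.hasNextValue()) { List<String> row = it.nextValue(); ... }.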

Related

How to use the field cardinality repeating in Render-CSV BW step?

I am building a generic CSV output module with a variable number of columns. The Data Format resource in BW (5.14) lets you define a repeating item, and thus offers a list of items that I could use to map data to in the RenderCSV step.
But when I run this with data for more than one column (and loops), only one column is generated.
Is the feature broken, or am I using it wrongly?
Alternatively, I defined "enough" optional columns in the data format and mapped each field separately, which is not a really generic solution.
It looks like in BW 5, when using Data Format and Parse Data to parse text, repeating elements aren't supported.
Please see https://support.tibco.com/s/article/Tibco-KnowledgeArticle-Article-27133
The workaround is to use the Data Format resource, Parse Data and Mapper activities together. First use Data Format and Parse Data to parse the text into XML where every element represents one line of the text. Then use the Mapper activity and the tib:tokenize-allow-empty XSLT function to tokenize every line and get sub-elements for each field in the lines.
The linked article also has a workaround implementation attached.

Multiple table input for MapReduce

I am thinking of running a MapReduce job using Accumulo tables as input.
Is there a way to have two different tables as input, the same way it exists for multiple file inputs with addInputPath?
Or is it possible to have one input from a file and the other from a table with AccumuloInputFormat?
You probably want to take a look at AccumuloMultiTableInputFormat. The Accumulo manual demonstrates how to use it here.
Example Usage:
job.setInputFormatClass(AccumuloMultiTableInputFormat.class); // not AccumuloInputFormat
AccumuloMultiTableInputFormat.setConnectorInfo(job, user, new PasswordToken(pass));
AccumuloMultiTableInputFormat.setMockInstance(job, INSTANCE_NAME);
InputTableConfig tableConfig1 = new InputTableConfig();
InputTableConfig tableConfig2 = new InputTableConfig();
Map<String, InputTableConfig> configMap = new HashMap<>();
configMap.put(table1, tableConfig1);
configMap.put(table2, tableConfig2);
AccumuloMultiTableInputFormat.setInputTableConfigs(job, configMap);
See the unit test for AccumuloMultiTableInputFormat here for some additional information.
Note that, unlike normal multiple inputs, you can't specify a different Mapper to run on each table. That's not a massive problem in this case, though, since the incoming Key/Value types are the same, and in your mapper you can use:
RangeInputSplit split = (RangeInputSplit)c.getInputSplit();
String tableName = split.getTableName();
to work out which table the records are coming from (taken from the Accumulo manual).
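For illustration, here is a minimal sketch of a complete mapper built around that snippet; the table name "table1" and the Text output types are placeholders, not part of the original answer:

import java.io.IOException;
import org.apache.accumulo.core.client.mapreduce.RangeInputSplit;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MultiTableMapper extends Mapper<Key, Value, Text, Text> {
    @Override
    protected void map(Key key, Value value, Context c) throws IOException, InterruptedException {
        // Recover the source table from the input split
        RangeInputSplit split = (RangeInputSplit) c.getInputSplit();
        String tableName = split.getTableName();
        if ("table1".equals(tableName)) {
            // handle records from the first table
        } else {
            // handle records from the second table
        }
        // Emit the table name alongside the row id, to show both are available
        c.write(new Text(tableName), new Text(key.getRow()));
    }
}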

How to load nested collections in hive with more than 3 levels

I'm struggling to load data into Hive, defined like this:
CREATE TABLE complexstructure (
id STRING,
date DATE,
day_data ARRAY<STRUCT<offset:INT,data:MAP<STRING,FLOAT>>>
) row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by ':';
The day_data field contains a complex structure that is difficult to load with load data inpath...
I've tried '\004', ^D, and a lot of other options, but the data inside the map doesn't get loaded.
Here is my last try:
id_3054,2012-09-22,3600000:TOT'\005'0.716'\004'PI'\005'0.093'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.0'\004'RES'\005'0.0|7200000:TOT'\005'0.367'\004'PI'\005'0.066'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.0'\004'RES'\005'0.0|10800000:TOT'\005'0.268'\004'PI'\005'0.02'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.159'\004'RES'\005'0.0|14400000:TOT'\005'0.417'\004'PI'\005'0.002'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.165'\004'RES'\005'0.0
Before posting here, I tried many, many options, and this example doesn't work either:
HIVE nested ARRAY in MAP data type
I'm using the image from HDP 2.2
Any help would be much appreciated
Thanks
Carlos
So finally I found a nice way to generate the file from Java. The trick is that Hive uses the first 8 ASCII characters as separators, but you can only override the first three. From the fourth on, you need to generate the actual ASCII characters.
After many tests, I ended up editing my file with a hex editor, and inserting the right values worked. But how can I do that in Java? It couldn't be simpler: just cast an int to char, and that will generate the corresponding ASCII character:
ASCII 4 -> ((char)4)
ASCII 5 -> ((char)5)
...
And so on.
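For example, here is a minimal sketch that builds one row for the table above this way. The values are taken from the sample line earlier in the question, and the delimiter levels follow the DDL: ',' for fields, '|' for array items, ':' for struct members, then (char)4 and (char)5 for the inner map.

public class HiveRowBuilder {
    public static void main(String[] args) {
        char mapEntrySep = (char) 4; // 4th level: separates entries of the inner MAP
        char keyValueSep = (char) 5; // 5th level: separates a map key from its value
        StringBuilder row = new StringBuilder();
        row.append("id_3054").append(',');    // id (fields terminated by ',')
        row.append("2012-09-22").append(','); // date
        // First array element: struct member separator ':' between offset and map
        row.append("3600000").append(':')
           .append("TOT").append(keyValueSep).append("0.716").append(mapEntrySep)
           .append("PI").append(keyValueSep).append("0.093");
        row.append('|'); // collection items terminated by '|'
        // Second array element, shortened for brevity
        row.append("7200000").append(':')
           .append("TOT").append(keyValueSep).append("0.367");
        System.out.println(row);
    }
}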
Hope this helps!!
Carlos
Alternatively, you could store the Hive table in the Parquet or ORC format, which support nested structures natively and more efficiently.

How do I split a tuple of many maps into different rows in Pig

I have a relation in Pig that looks like this:
([account_id#100,
timestamp#1434,
id#900],
[account_id#100,
timestamp#1434,
id#901],
[account_id#100,
timestamp#1434,
id#902])
As you can see, I have three map objects within a tuple. All of the data above is within the $0'th field of the relation, so the data above is in a relation with a single bytearray column.
The data is loaded as follows:
data = load 's3://data/data' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
DESCRIBE data;
data: {bytearray}
How do I split this data structure into three rows so that the output is as follows?
data: {account_id:chararray, timestamp:chararray, id:int}
(100, 1434,900)
(100, 1434,901)
(100, 1434,902)
It is very difficult to guess your problem without sample input data. If this is an intermediate result, then write it out using STORE and post the output file as something we can use as input to try it out. I was able to solve this using STRSPLIT, but I am not sure whether you meant that the input is a single column in a single row, or three different rows with the same column.
In either case, flattening the data with the FLATTEN operator and then using STRSPLIT should help. If I get more information and input data for the problem, I can give a working example.
Data -> FLATTEN to get out of the bag -> STRSPLIT over "," in a FOREACH ... GENERATE

Oracle SQL*Loader, referencing calculated values

Hope you're having a nice day. I'm learning how to use functions with SQL*Loader and I have a question about it. Let's say I have this table:
table a
--------------
code
name
dept
birthdate
secret
The data.csv file contains this data:
name
dept
birthdate
and I'm using this control file to load data into it with SQL*Loader:
LOAD DATA
INFILE "data.csv"
APPEND INTO TABLE a
FIELDS TERMINATED BY ',' optionally enclosed by '"'
TRAILING NULLCOLS
(code "getCode(:name,:dept)",name,dept,birthdate,secret "getSecret(getCode(:name,:dept),birthdate)")
This works like a charm; it gets the values from my getCode and getSecret functions. However, I would like to reference the value previously calculated by getCode, so I don't have to nest functions in getSecret, like this:
getSecret(**getCode(:name,:dept)**,birthdate)
i've tried to do it like this:
getSecret(**:code**,birthdate)
but it gets the original value from the file (meaning null), not the value calculated by the function (I guess because it is evaluated on the fly). So my question is whether there is a way to avoid these nested calls for previously calculated values, so I don't lose performance recalculating the same values over and over again (the real table I'm using is about 10 times bigger and nests a lot of functions for these previously calculated values, which I guess reduces performance).
Any help would be appreciated, thanks!
Complement:
Sorry, but I haven't used external tables before (kinda new here). How could I implement this using those tables, considering all the calculated values I need to get from the functions I developed? (I tried a trigger (SQL Loader, Trigger saturation?) and it killed the database...)
I'm not aware of a way of doing this.
If you switched to using external tables you'd have a lot more freedom for this sort of thing -- common table expressions, leveraging subquery caching, that sort of stuff.
