How to extract key-value pairs from CSV using Talend

I have data for one column in my CSV file as:
`column1`
row1 : {'name':'Steve Jobs','location':'America','status':'none'}
row2 : {'name':'Mark','location':'America','status':'present'}
row3 : {'name':'Elan','location':'Canada','status':'present'}
I want the output for that column to be:
`name` `location` `status`
Steve Jobs America none
Mark America present
Elan Canada present
But sometimes a row contains multiple objects, like {'name':'Steve Jobs','location':'America','status':'none'},{'name':'Mark','location':'America','status':'present'}
Please help!

You have to use the tMap and tExtractDelimitedFields components.
Below is the step-by-step explanation:
1. Original data: row1 : {'name':'Steve Jobs','location':'America','status':'none'}
2. Strip the value inside the braces using the function below (note that substring's end index is exclusive, so indexOf("}") already excludes the closing brace):
row1.Column0.substring(row1.Column0.indexOf("{")+1, row1.Column0.indexOf("}"))
Now the result is: 'name':'Steve Jobs','location':'America','status':'none'
3. Extract the single column into multiple columns using tExtractDelimitedFields. Since the fields are separated by commas, the delimiter should be a comma. And since the data has 3 fields, create 3 fields in the component schema; the configuration is sketched below.
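The original post included a screenshot of the tExtractDelimitedFields settings; in essence they amount to something like this (a sketch, with the schema column names taken from the sample data):
Field to split: the column coming out of the tMap above
Field separator: ","
Schema: name (String), location (String), status (String)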
Now the result is:
name location status
'name':'Steve Jobs' 'location':'America' 'status':'none'
'name':'Mark' 'location':'America' 'status':'present'
'name':'Elan' 'location':'Canada' 'status':'present'
4. Using one more tMap, strip the key names and the single quotes from the data:
row2.name.replaceAll("'name':", "").replaceAll("'", "")
row2.location.replaceAll("'location':", "").replaceAll("'", "")
row2.status.replaceAll("'status':", "").replaceAll("'", "")
Your final result is:
name location status
Steve Jobs America none
Mark America present
Elan Canada present
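Outside of a Talend job, the same transformation steps can be sketched in plain Java (a minimal sketch, assuming exactly one {...} object per row as in the sample data):
public class ExtractFields {
    public static void main(String[] args) {
        String row = "{'name':'Steve Jobs','location':'America','status':'none'}";
        // Step 2: strip the braces (substring's end index is exclusive).
        String inner = row.substring(row.indexOf("{") + 1, row.indexOf("}"));
        // Step 3: split the single value into its delimited fields.
        String[] fields = inner.split(",");
        // Step 4: drop the key names and the single quotes.
        String name = fields[0].replaceAll("'name':", "").replaceAll("'", "");
        String location = fields[1].replaceAll("'location':", "").replaceAll("'", "");
        String status = fields[2].replaceAll("'status':", "").replaceAll("'", "");
        System.out.println(name + " " + location + " " + status); // Steve Jobs America none
    }
}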

Related

Remove all columns to the right of a specific column

I have an Excel template file with a dynamic number of columns that represent work week dates. Some users have decided to add their own subtotal columns to the right of those columns. I need a way to identify the first blank column, and then truncate that column and all columns following it.
I had previously been using the following script to remove all columns that begin with the word "Column":
// Create a list of columns that start with "Column" and remove them.
Removed_ColumnNum_Columns = Table.RemoveColumns(PreviousStepName, List.Select(Table.ColumnNames(PreviousStepName), each Text.StartsWith(_, "Column") )),
Based on being able to find the first ColumnXX column, I want to remove it and all columns after it.
You can use List.PositionOf to get your ColumnIndex instead of parsing text.
I'd put it together like this:
// [...]
// All column names from the previous step.
ColumnList = Table.ColumnNames(#"Promoted Headers"),
// The first column whose name starts with "Column".
ColumnXX = List.Select(ColumnList, each Text.StartsWith(_, "Column")){0},
// Its (zero-based) position in the column list.
ColumnIndex = List.PositionOf(ColumnList, ColumnXX),
// Keep only the columns before it.
ColumnsToKeep = List.FirstN(ColumnList, ColumnIndex),
FinalTable = Table.SelectColumns(#"Promoted Headers", ColumnsToKeep)
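If a template might contain no header starting with "Column" at all, the {0} indexer above will raise an error; a guarded variant (a sketch reusing the same step names) could be:
ColumnList = Table.ColumnNames(#"Promoted Headers"),
ColumnXXList = List.Select(ColumnList, each Text.StartsWith(_, "Column")),
// Fall back to the unchanged table when there is nothing to remove.
FinalTable = if List.IsEmpty(ColumnXXList)
    then #"Promoted Headers"
    else Table.SelectColumns(#"Promoted Headers", List.FirstN(ColumnList, List.PositionOf(ColumnList, ColumnXXList{0})))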
Remove Columns after ColumnXX
Find the first column whose name begins with "Column" and delete it and all columns following it. This parses the XX as the column index, so you need to make sure you haven't deleted any columns prior to this step; i.e. "Column35" needs to be the 35th column at this point in the code.
// Find the first ColumnXX column and remove it and all columns to the right.
ColumnXX = List.Select(Table.ColumnNames(#"Promoted Headers"), each Text.StartsWith(_, "Column")){0},
ColumnIndex = Number.FromText(Text.Middle(ColumnXX, 6,4)),
ColumnListToRemove = List.Range(Table.ColumnNames(#"Promoted Headers"),ColumnIndex-1),
RemovedTrailingColumns = Table.RemoveColumns(#"Promoted Headers", ColumnListToRemove),
To make this more robust I would prefer to have a way to identify the column index of columnXX without parsing the digits from it.

How to split a Webix datatable column into multiple columns?

In my Webix datatable, I am showing multiple values in the cells of some columns.
To identify which values belong to which header, I have separated the column headers by a '|' (pipe), and similarly the values under them.
Now, instead of delimiting the columns by '|', I need to split the columns into separate editable columns with the same names.
Please refer to this snippet : https://webix.com/snippet/8ce1148e
In the above snippet, for example, the Scores column will be split into two more editable columns, Rank and Vote; similarly the Place column into Type and Name.
The way the values of the first array elements are shown under each of them will remain as is.
How can this be done?
Thanks
While creating the column configuration for Webix, you can provide an array for the header field of the first column, along with a colspan, like below:
var columns = [];
columns[0] = { "id":"From", "header":[{ "text":"Date", "colspan":2 }, { "text":"From" }] };
columns[1] = { "id":"To", "header":[null, { "text":"To" }] };
columns[0] will create the Date and From headers, and columns[1] will create To.
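Applied to the snippet in the question, the same pattern would split Scores into Rank and Vote under one shared header (a sketch; the column ids, editor settings, and data field names are assumptions to adapt to your dataset):
webix.ui({
  view:"datatable",
  editable:true,
  columns:[
    { id:"rank", header:[{ text:"Scores", colspan:2 }, { text:"Rank" }], editor:"text" },
    { id:"vote", header:[null, { text:"Vote" }], editor:"text" },
    { id:"type", header:[{ text:"Place", colspan:2 }, { text:"Type" }], editor:"text" },
    { id:"name", header:[null, { text:"Name" }], editor:"text" }
  ],
  data:grid_data // your dataset, with rank/vote/type/name fields per row
});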

PIG: scalar has more than one row in the output

I have the following Pig code, in which I am checking the fields (srcgt & destgt in record) from the main files stored in record against the values in another file (intlgt.txt) containing 338, 918299, 181, 238, but it throws the error below. Can you please suggest how to overcome this on Apache Pig version 0.15.0 (r1682971)?
Pig code:
record = LOAD '/u02/20160201*.SMS' USING PigStorage('|','-tagFile') ;
intlgtrec = LOAD '/u02/config/intlgt.txt' ;
intlgt = foreach intlgtrec generate $0 as intlgt;
cdrfilter = foreach record generate (chararray) $1 as aparty, (chararray) $2 as bparty,(chararray) $3 as dt,(chararray)$4 as timestamp,(chararray) $29 as status,(chararray) $26 as srcgt,(chararray) $27 as destgt,(chararray)$0 as cdrfname ,(chararray) $13 as prepost;
intlcdrs = FILTER cdrfilter by ( STARTSWITH(srcgt,intlgt::intlgt) or STARTSWITH(destgt,intlgt::intlgt) );
Error is:
WARN org.apache.hadoop.mapred.LocalJobRunner - job_local1939982195_0002
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (338), 2nd :(918299) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar")
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
When you are using
intlcdrs = FILTER cdrfilter by ( STARTSWITH(srcgt,intlgt::intlgt) or STARTSWITH(destgt,intlgt::intlgt) );
Pig is looking for a scalar: a number or a chararray, but a single value. So Pig assumes your intlgt::intlgt is a relation with one row, e.g. the result of
intlgt = foreach (group intlgtrec all) generate COUNT_STAR(intlgtrec.$0)
(this would generate a single row, with the count of records in the original relation)
In your case, the intlgt contains more than one row, since you have not done any grouping on it.
Based on your code, you're trying to look for SMS messages that had an intlgt on either end. Possible solutions:
if your intlgt entries all have the same length (e.g. 3), then generate substring(srcgt, 1, 3) as srcgtshort, and JOIN intlgt::intlgt with record::srcgtshort. This will give you the records where srcgt begins with a value from intlgt. Then repeat this for destgt.
if they have a small number of lengths (e.g. some entries have length 3, some length 4, and some length 5), you can do the same thing, but it is more laborious (as a field is required for each length).
if the number of rows in the two relations is not too big, do a CROSS between them, which creates all possible combinations of rows from record and rows from intlgt. Then you can filter by STARTSWITH(srcgt, intlgt::intlgt), because the two of them are now fields in the same relation (see the sketch below). Beware of this approach, as the number of records can get HUGE!
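A sketch of the CROSS approach, reusing the relation names from the question:
-- Pair every CDR with every gt prefix, then keep the matching pairs.
crossed = CROSS cdrfilter, intlgt;
matched = FILTER crossed BY STARTSWITH(cdrfilter::srcgt, intlgt::intlgt) OR STARTSWITH(cdrfilter::destgt, intlgt::intlgt);
-- One CDR can match more than one prefix, so de-duplicate if needed.
intlcdrs = DISTINCT matched;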

Pig Latin: How to compare 2 CSVs with many rows and columns

I have a scenario with Pig that involves comparing 2 CSV files. Basically, it should read the 2 CSV files, compare them to each other, and create a log file containing the row number, and if possible the column number, of each differing value.
Sample output:
Found 1 different value :
Row : #8764
Column : #67
Expected : 8984954
Actual : 0
Is there a way in PIG to do this?

HBase shell - Retrieve (only) column values (and not column name)

I am pretty new to Hadoop and HBase, trying to learn and evaluate whether they can be used for my use case. Being new to Java (I am basically a Perl/Unix and DB developer), I am trying to get a solution in the HBase shell if possible.
I have an HBase table (schema below) where I am trying to store historical data (which can be used for audit and analytics).
Assume the basic structure is as below:
rowkey 'cf1:id', 'cf1:price', 'cf1:user', 'cf1:timestamp'
Now,
rowkey - instrument or any object
id - used to identify which column has the latest data. The first entry will have 1 as its value, and so on
user - the user who updated the data
e.g.
initially the data looks like,
hbase(main):009:0> scan 'price_history'
ROW COLUMN+CELL
row1 column=cf1:id, timestamp=1389020633920,value=1
row1 column=cf1:pr, timestamp=1389020654614, value=109.45
row1 column=cf1:us, timestamp=1389020668338, value=feed
row2 column=cf1:id, timestamp=1389020687334, value=1
row2 column=cf1:pr, timestamp=1389020697880, value=1345.65
row2 column=cf1:us, timestamp=1389020708403, value=feed
Now assume row2 (instrument 2) is updated on the same day with a new price:
hbase(main):003:0> scan 'price_history'
ROW COLUMN+CELL
row1 column=cf1:id, timestamp=1389020633920, value=1
row1 column=cf1:pr, timestamp=1389020654614, value=109.45
row1 column=cf1:us, timestamp=1389020668338, value=feed
row2 column=cf1:id, timestamp=1389020859674, value=2
row2 column=cf1:pr, timestamp=1389020697880, value=1345.65
row2 column=cf1:pr1, timestamp=1389020869856, value=200
row2 column=cf1:us, timestamp=1389020708403, value=feed
row2 column=cf1:us1, timestamp=1389020881601, value=user1
As you can see, id has changed to 2 to indicate that the second set of data is the latest, and new values/columns were added.
What I want is,
1) Can I fetch just the value of column id? i.e. the output should be 1 or 2, not all the other attributes.
2) Based on the above output I will fetch the further data, but can I also search by value and get the rowkey as output? i.e. something like: give me the rows whose VALUE is row1 (I can have a list of row1, row2, rown, ...).
Please assist in the HBase shell as much as possible (other solutions are also welcome).
Also, if any architect can suggest a better way to model the table to keep track of changes/versions of prices, that is also welcome.
Thanks.
That is going to be tough to do in the shell without a lot of piping output and grepping the results. The shell's output formatting also makes this difficult because of how it breaks up lines. A lighter-weight solution than writing Java would be to write your scanner in Ruby: HBase ships with the JRuby jar and lets you execute Ruby scripts.
include Java
import "org.apache.hadoop.hbase.HBaseConfiguration"
import "org.apache.hadoop.hbase.client.Scan"
import "org.apache.hadoop.hbase.util.Bytes"
import "org.apache.hadoop.hbase.client.HTable"
# Substitute your own table, column family and qualifier names.
config = HBaseConfiguration.create()
family = Bytes.toBytes("family-name")
qual = Bytes.toBytes("qualifier")
# Scan only the single column we care about.
scan = Scan.new()
scan.addColumn(family, qual)
table = HTable.new(config, "table-name")
scanner = table.getScanner(scan)
# Print just the latest value of that column for each row.
scanner.each do |result|
  keyval = result.getColumnLatest(family, qual)
  puts "#{Bytes.toDouble(keyval.getValue())}"
end
That should get you pretty close; you can add more data to the output, for example the row key. To run it, just use hbase org.jruby.Main your_ruby_file.rb
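For question 1 specifically, if the noisy formatting is acceptable, the piping-and-grepping route mentioned above can be sketched from the OS shell (table and column names taken from the question):
# Restrict the scan to cf1:id, then keep only the value portion of each line.
echo "scan 'price_history', {COLUMNS => 'cf1:id'}" | hbase shell | grep -o "value=.*"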
