How to get data in JSON format in Informatica?

I have Id and email in a flat file. I am using an Expression transformation, created an output port of string data type, and am trying to pass the value as below:
'[
{
"Id":"'||Id||'"
"email":"'||email||'"
}
]'
output:
"[
{
"Id":"2904"
"email":"k9#gmail.com"
}
]"
I don't need the double quotes at the start and end around the square brackets, and there should be no extra space after each line ends.
expected output:
[
{
"Id":"2904"
"email":"k9#gmail.com"
}
]
I will pass this string to the API web service as an HTTP request. Kindly help with the required format.

In the Expression transformation, create the below ports and logic.
-- assuming the input column is datacol
var_str1 = LTRIM(SUBSTR(datacol, INSTR(datacol, 'output:') + LENGTH('output:'))) -- everything after the word 'output:'
var_str2 = SUBSTR(var_str1, 2) -- removes the first double quote
var_str3 = SUBSTR(var_str2, 1, LENGTH(var_str2) - 1) -- removes the last double quote
out_str = var_str3
I did this as a step-by-step process; if you are comfortable, you can combine it all into one expression.
EDIT
var_* - denotes that these are variable ports.
out_* - denotes that these are output ports.
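If you would rather skip the variable ports, the same logic collapses into a single output-port expression. A sketch (the inner expression is simply repeated because there is no intermediate variable to reuse):

out_str = SUBSTR(
    LTRIM(SUBSTR(datacol, INSTR(datacol, 'output:') + LENGTH('output:'))),
    2,
    LENGTH(LTRIM(SUBSTR(datacol, INSTR(datacol, 'output:') + LENGTH('output:')))) - 2
)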

Related

Sybase UDF difficulty

When I try to run the following function on Sybase ASE 15.7, it just spins indefinitely. Each component of the function seems to act as expected independently. This is simply a mechanism to strip all non-numeric characters from a string. Any and all thoughts appreciated.
create function dbo.mdsudf_just_digits(@str varchar(64))
returns varchar(64)
as
begin
    while patindex('%[^0-9]%', @str) > 0
        set @str = stuff(@str, patindex('%[^0-9]%', @str), 1, '')
    return @str
end
-- A typical invocation would be like so:
select dbo.mdsudf_just_digits('hello123there456')
In Sybase (ASE) the empty string ('') actually translates into a single space:
select '.'+''+'.' -- period + empty string + period
go
---
. . -- we get a space between the periods
So in the current stuff(@str, ..., 1, '') you're actually replacing the first non-numeric character with a single space. This 'new' space then matches the non-number test on the next pass through the loop, at which point the space is replaced with ... another space. This leads to an infinite loop of constantly replacing the first non-number character with a space.
You can get around this by using NULL as the last arg to the stuff() call, e.g.:
set @str = stuff(@str, patindex('%[^0-9]%', @str), 1, NULL)
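For completeness, here is the corrected function in full; it is the same code as above with only the final stuff() argument changed:

create function dbo.mdsudf_just_digits(@str varchar(64))
returns varchar(64)
as
begin
    -- NULL as the replacement deletes the character outright,
    -- instead of inserting the single space that '' turns into
    while patindex('%[^0-9]%', @str) > 0
        set @str = stuff(@str, patindex('%[^0-9]%', @str), 1, NULL)
    return @str
end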

Parse a string in specific format and store its content for further processing

I have a string in the format below. The string on the LHS could be any string, and the RHS has values of varying length inside the {}, some separated by delimiters.
I am not able to understand how to extract the LHS and RHS into two distinct variables.
Input String format:
[TEAM DETAILS]={2,TeamName,23,4697}
I want to be able to extract the LHS, let us say into an array.
For the RHS, I need to process each entry separated by the commas and store them into an array as well.
I cannot understand how to do this. It looks simple, but I am not able to work out the logic.
This script:
# input
in="[TEAM DETAILS]={0001/0880,TeamName,0881,0882/3999,8400/8499,4900/4999,6900/6999,9101,9104,5851,5850,5855,7697}"
# get var name
# remove everything after ]=
var="${in%]=*}"
# remove the leading [
var="${var#[}"
# get values
# remove everything before ={
valstr="${in#*={}"
# remove trailing }
valstr="${valstr%'}'}"
# read string as array
IFS=, read -r -a "values" <<<"$valstr"
# output
declare -p var values
will output:
declare -- var="TEAM DETAILS"
declare -a values=([0]="0001/0880" [1]="TeamName" [2]="0881" [3]="0882/3999" [4]="8400/8499" [5]="4900/4999" [6]="6900/6999" [7]="9101" [8]="9104" [9]="5851" [10]="5850" [11]="5855" [12]="7697")
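From there, processing each RHS entry is just a plain loop over the array, e.g. (a minimal sketch):

for v in "${values[@]}"; do
    printf 'entry: %s\n' "$v"
done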

Extract 2 fields from string with search

I have a file with several lines of data. The fields are not always in the same position/column. I want to search for 2 strings and then show only the field and the data that follows. For example:
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
I would like to return the following:
"id":"1111","hwVersion":"4444"
"id":"5555","hwVersion":"7777"
I am struggling because the data isn't always in the same position, so I can't choose a column number. I feel I need to search for "id" and "hwVersion". Any help is GREATLY appreciated.
Totally agree with @KamilCuk. More specifically:
jq -c '{id: .id, hwVersion: .hwVersion}' <<< '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
Outputs:
{"id":"1111","hwVersion":"4444"}
Not quite the specified output, but valid JSON. More to the point, your input should probably be processed record by record, and my guess is that a two-column output with "id" and "hwVersion" would be even easier to parse:
cat << EOF | jq -j '"\(.id)\t\(.hwVersion)\n"'
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
EOF
Outputs:
1111 4444
5555 7777
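That tab-separated output is also easy to consume from a shell loop; a sketch (the input.jsonl filename is just a placeholder for your file):

jq -j '"\(.id)\t\(.hwVersion)\n"' input.jsonl |
while IFS=$'\t' read -r id hw; do
    echo "id=$id hwVersion=$hw"
done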
Since the data looks like mapping objects and even corresponds to JSON format, something like this should do, if you don't mind using Python (which comes with JSON support):
import json
def get_id_hw(s):
    d = json.loads(s)
    return '"id":"{}","hwVersion":"{}"'.format(d["id"], d["hwVersion"])
We take a line of input string into s and parse it as JSON into a dictionary d. Then we return a formatted string with the double-quoted id and hwVersion keys, each followed by a colon and the double-quoted value of the corresponding key from the previously obtained dict.
We can try this with these test input strings and print calls:
# These will be our test inputs.
s1 = '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
s2 = '{"id":"5555","name":"6666","hwVersion":"7777"}'
# we pass and print them here
print(get_id_hw(s1))
print(get_id_hw(s2))
But we can just as well iterate over lines of any input.
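For example, a minimal sketch that runs the get_id_hw function above over every line of standard input:

import sys

for line in sys.stdin:
    line = line.strip()
    if line:  # skip blank lines
        print(get_id_hw(line))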
If you really wanted to use awk, you could, but it's not the most robust and suitable tool:
awk '{ i = gensub(/.*"id":"([0-9]+)".*/, "\\1", "g")
       h = gensub(/.*"hwVersion":"([0-9]+)".*/, "\\1", "g")
       printf("\"id\":\"%s\",\"hwVersion\":\"%s\"\n", i, h) }' /your/file
Since you mention the position is not known and assuming it can be in any order, we use one regex to extract id and another to get hwVersion, then print them out in the given format. If the values could be something other than decimal digits as in your example, the [0-9]+ bit would need to reflect that.
And just for the fun of it (this preserves the order of entries from the file), in sed:
sed -e 's#.*\("\(id\|hwVersion\)":"[0-9]\+"\).*\("\(id\|hwVersion\)":"[0-9]\+"\).*#\1,\3#' file
It looks for two groups of "id" or "hwVersion" followed by :"<DECIMAL_DIGITS>".

How can I read a CSV file if only non-empty fields are wrapped by double quotes?

I'm trying to read a CSV file in a Bash script. I achieved that successfully using gawk and specifying FPAT like:
gawk -v LOGFILE="${LOGFILE}" 'BEGIN {
FPAT = "([^,]+)|(\"[^\"]+\")"
}
NR == 1{
# doing some logic with header
}
NR >= 2{
# doing some logic with fields
}' <filename>
The problem here is, the file contains data like:
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"
Now, with this data I'm getting wrong results because it skips the empty fields between consecutive commas, which gives me the wrong position numbers for the extracted data.
For example, it tells me "7865431234" is present at the 3rd position whereas it is at the 6th.
Can anyone suggest changes to get the correct positions of the fields?
Your FPAT requires each field to contain at least one character, but you want to recognize empty fields with zero characters. Add an alternative to FPAT that allows zero characters:
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")|" }
{ printf "%d:%d:", NR, NF; for (i = 1; i <= NF; i++) printf("[%s]", $i); print "" }'
Note the extra | at the end of FPAT. The action simply identifies the record number, the number of fields, and surrounds the value of each field with square brackets.
When your data string is provided to that script, the output is:
1:8:["RAM"]["31st street, Bengaluru, India"][][][]["7865431234"][]["VALID"]
That shows the four empty fields quite clearly.
Now all you have to do is deal with:
"Mr ""Manipulator"", the Artisan","29th Street, Delhi, India",,,"",,,"INVALID"
where there are double quotes inside the quoted value. That's not dreadfully hard to manage:
gawk 'BEGIN { FPAT = "([^,]+)|(\"([^\"]|\"\")*\")[^,]*|" }
{ printf "%d:%d:", NR, NF; for (i = 1; i <= NF; i++) printf("%d[%s]", i, $i); print "" }' "$#"
The FPAT says that a field is:
a sequence of non-commas,
or it is a field started with a double quote, containing zero or more instances of either:
a non-quote, or
two double quotes
followed by a double quote and optional non-comma data
or it is empty
Note that the 'optional non-comma data' should be empty, and only appears in malformed CSV data.
Given input data:
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"
"Mr ""Manipulator"", the Artisan","29th Street, Delhi, India",,,,,,"INVALID"
"Some","","Empty","",Fields "" Wrapped,"",in quotes
"Malformed" CSV,Data,"Note it has data after" a close quote,"and before a comma,",,"INVALID"
This produces:
1:8:1["RAM"]2["31st street, Bengaluru, India"]3[]4[]5[]6["7865431234"]7[]8["VALID"]
2:8:1["Mr ""Manipulator"", the Artisan"]2["29th Street, Delhi, India"]3[]4[]5[]6[]7[]8["INVALID"]
3:7:1["Some"]2[""]3["Empty"]4[""]5[Fields "" Wrapped]6[""]7[in quotes]
4:6:1["Malformed" CSV]2[Data]3["Note it has data after" a close quote]4["and before a comma,"]5[]6["INVALID"]
Note that the field numbers are included as a prefix to the bracketed data (so I tweaked the print format slightly).
About the only format this doesn't handle is one where newlines can be embedded in the data for a field — by the nature of the line-based input, it assumes that no field is split over multiple lines. (It also means it won't properly recognize a field that starts with a double quote and doesn't have a matching double quote before the end of the line. I suppose you could add an alternative to recognize that. It would be better just to make the data right.)
Note the advice in Sobrique's answer to use a tool designed to handle CSV for handling CSV. That is generally a good idea, and the more complex the sets of variations you have to deal with, the better an idea it is. This is close to as complicated a regex as you should consider using. Also note that although RFC 4180 defines a version of CSV formally and rigorously, there are multiple programs (including MS Office) that handle different but related formats.
If you have CSV that needs parsing, then whilst you can usually hack it with a regex, it's far easier to use a parser.
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new;
open( my $input, '<', 'flarg.csv' ) or die $!;
while ( my $row = $csv->getline($input) ) {
    if ( $. == 1 ) {
        # do first row stuff;
        print "Header: ", join( ",", @$row ), "\n";
    }
    else {
        print join( "\n", @$row );
    }
}
Or simpler yet, use Text::ParseWords, which is core.
#!/usr/bin/env perl
use strict;
use warnings;
use Text::ParseWords;
while ( my $line = <DATA> ) {
    my @fields = parse_line( ',', 1, $line );
    print join "\n", @fields;
}
__DATA__
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"

Why doesn't this Ruby replace regex work as expected?

Consider the following string which is a C fragment in a file:
strcat(errbuf,errbuftemp);
I want to replace errbuf (but not errbuftemp) with the prefix G-> plus errbuf. To do that successfully, I check the character after and the character before errbuf to see if it's in a list of approved characters and then I perform the replace.
I created the following Ruby file:
line = " strcat(errbuf,errbuftemp);"
item = "errbuf"
puts line.gsub(/([ \t\n\r(),\[\]]{1})#{item}([ \t\n\r(),\[\]]{1})/, "#{$1}G\->#{item}#{$2}")
Expected result:
strcat(G->errbuf,errbuftemp);
Actual result
strcatG->errbuferrbuftemp);
Basically, the matched characters before and after errbuf are not reinserted back with the replace expression.
Anyone can point out what I'm doing wrong?
Because you must use the syntax gsub(/.../) { "...#{$1}...#{$2}..." } or gsub(/.../, '...\1...\2...').
Here was the same problem: werid, same expression yield different value when excuting two times in irb
The problem is that the variable $1 is interpolated into the argument string before gsub is run, meaning that the previous value of $1 is what the symbol gets replaced with. You can replace the second argument with '\1 ?' to get the intended effect. (Chuck)
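Applied to the original snippet, the block form gives the expected output (a minimal sketch):

line = " strcat(errbuf,errbuftemp);"
item = "errbuf"
puts line.gsub(/([ \t\n\r(),\[\]])#{item}([ \t\n\r(),\[\]])/) { "#{$1}G->#{item}#{$2}" }
# prints: strcat(G->errbuf,errbuftemp);

Inside the block, $1 and $2 are evaluated after each match, so the captured surrounding characters are reinserted correctly.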
I think part of the problem is the use of gsub() instead of sub().
Here's two alternates:
str = 'strcat(errbuf,errbuftemp);'
str.sub(/\w+,/) { |s| 'G->' + s } # => "strcat(G->errbuf,errbuftemp);"
str.sub(/\((\w+)\b/, '(G->\1') # => "strcat(G->errbuf,errbuftemp);"
