How to handle CSV columns with value "NULL" as real NULL on the destination using Informatica Data Engineering? - etl

Is there any option to tell Informatica that it should treat the text "NULL" in any text file column as a real NULL value?
I would want to believe that I would not be required to perform a comparison transformation on each and every column.
The file is delimited, NOT fixed-width.
I would appreciate hearing from anyone who has been through this.
Thank you

With a CSV file this is quite simple: change the input type to Command, as mentioned in the docs, and use sed to read the file (the g flag replaces every occurrence on a line, not just the first):
sed 's/NULL//g' sed_test.csv
Here's a sample file that I've created. Below is the content of the file as displayed by the $ head sed_test.csv command:
George Washington,NULL,NULL, 1789-1797
John Adams,NULL,NULL, 1797-1801
Thomas Jefferson,NULL,NULL, 1801-1809
James Madison,NULL,NULL, 1809-1817
James Monroe,NULL,NULL, 1817-1825
John Quincy Adams,NULL,NULL, 1825-1829
Andrew Jackson,NULL,NULL, 1829-1837
Martin Van Buren,NULL,NULL, 1837-1841
William Henry Harrison,NULL,NULL, 1841
John Tyler,NULL,NULL, 1841-1845
Now, here's the same using sed 's/NULL//g' sed_test.csv | head:
George Washington,,, 1789-1797
John Adams,,, 1797-1801
Thomas Jefferson,,, 1801-1809
James Madison,,, 1809-1817
James Monroe,,, 1817-1825
John Quincy Adams,,, 1825-1829
Andrew Jackson,,, 1829-1837
Martin Van Buren,,, 1837-1841
William Henry Harrison,,, 1841
John Tyler,,, 1841-1845
As you can see, the columns are still there, but the NULL values are gone.
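One caveat with the sed approach is that it removes the text NULL anywhere it appears, including inside a longer value. A minimal sketch of a stricter filter follows, assuming the Command source is able to run a Python script instead of sed; the function name blank_out_nulls and the marker parameter are illustrative, not part of Informatica.

```python
import csv
import io

def blank_out_nulls(in_stream, out_stream, marker="NULL"):
    """Copy CSV rows, emptying only fields that consist solely of the marker."""
    reader = csv.reader(in_stream)
    writer = csv.writer(out_stream, lineterminator="\n")
    for row in reader:
        # Surrounding whitespace is ignored for the comparison only;
        # fields that merely contain the marker are left untouched.
        writer.writerow(["" if field.strip() == marker else field for field in row])

# Small in-memory demonstration using one line from the sample file.
sample = io.StringIO("George Washington,NULL,NULL, 1789-1797\nNULLIFY,x\n")
result = io.StringIO()
blank_out_nulls(sample, result)
print(result.getvalue(), end="")
```

To use it as a command, the same function could be called with an opened file and sys.stdout in place of the two StringIO objects.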

Related

Hide Repetitive data

I am using RDLC to generate my report in MVC. My task is to print a staff schedule. I am using two datasets: the first holds non-repeating data such as printedBy and printedDate, while the second holds the repeating data, which is the schedule. I used a table in this case. The code works and the data is shown, but I want to make some changes. This is an example of the current page:
7-Mar-2018 8:40 AM - 8:50 AM Ben
7-Mar-2018 8:50 AM - 9:00 AM Yusry
7-Mar-2018 9:10 AM - 9:20 AM Mark
8-Mar-2018 8:40 AM - 8:50 AM Joe
8-Mar-2018 8:50 AM - 8:50 AM Stan
I want it to show like this:
7-Mar-2018 8:40 AM - 8:50 AM Ben
8:50 AM - 9:00 AM Yusry
9:10 AM - 9:20 AM Mark
8-Mar-2018 8:40 AM - 8:50 AM Joe
8:50 AM - 8:50 AM Stan
What I mean is to hide the date if it is same as before. Is it possible to do this?
In order to achieve your desired behaviour you'll need to apply Grouping.
I believe it's quite a large subject to explain here. I suggest you take a look at this official tutorial.
To get the result from your example, you'll have to add a Row Group for the
date part of your printedDate field.
If you have any more specific questions about the subject, feel free to leave a comment and I'll be happy to explain further.
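To illustrate what grouping on the date achieves, here is a small sketch outside RDLC, written in Python. It is not report code; the function name hide_repeated_dates and the tuple layout are assumptions made for the example. Rows are grouped by date and the date is kept only on the first row of each group.

```python
from itertools import groupby

def hide_repeated_dates(rows):
    """rows: (date, time_range, name) tuples already sorted by date.
    Returns the same rows with the date blanked after the first row of each group."""
    result = []
    for date, group in groupby(rows, key=lambda row: row[0]):
        for index, (_, time_range, name) in enumerate(group):
            result.append((date if index == 0 else "", time_range, name))
    return result

schedule = [
    ("7-Mar-2018", "8:40 AM - 8:50 AM", "Ben"),
    ("7-Mar-2018", "8:50 AM - 9:00 AM", "Yusry"),
    ("7-Mar-2018", "9:10 AM - 9:20 AM", "Mark"),
    ("8-Mar-2018", "8:40 AM - 8:50 AM", "Joe"),
    ("8-Mar-2018", "8:50 AM - 8:50 AM", "Stan"),
]
for date, time_range, name in hide_repeated_dates(schedule):
    print(date, time_range, name)
```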

Render HTML to stdout as formatted text using Ruby

While building a CLI Google Card viewer, I stumbled on the problem of rendering HTML on the command line the way the browsers w3m or lynx do. The closest I have come is using the text returned by Nokogiri:
Nokogiri::HTML::parse(card_snippet).text
But it prints out as follows:
"Albert EinsteinTheoretical PhysicistAlbert Einstein was a German-born theoretical physicist. He developed the general theory of relativity, one of the two pillars of modern physics. Einstein's work is also known for its influence on the philosophy of science. WikipediaBorn: March 14, 1879, Ulm, GermanyDied: April 18, 1955, Princeton, New Jersey, United StatesInfluenced: Satyendra Nath Bose, Wolfgang Pauli, Leo Szilard, moreInfluenced by: Isaac Newton, Mahatma Gandhi, moreBooksThe World as I See It1949Relativity: The Special a...1916Ideas and Opinions2000Out of My Later Years2006The Meaning of Relativity1922The Evolution of Physics1938People also search forIsaac NewtonEduard EinsteinSonStephen HawkingElsa EinsteinSpouseMileva MarićFormer spouseThomas Edison"
But using lynx:
cat card_snippet.html | lynx -dump -stdin
Albert Einstein
Theoretical Physicist
Albert Einstein was a German-born theoretical physicist. He
developed the general theory of relativity, one of the two pillars
of modern physics. Einstein's work is also known for its influence
on the philosophy of science. Wikipedia
Born: March 14, 1879, Ulm, Germany
Died: April 18, 1955, Princeton, New Jersey, United States
Influenced: Satyendra Nath Bose, Wolfgang Pauli, Leo
Szilard,
Note: After stripping off some noise. But nonetheless the line endings are proper.
Any ideas for a similar solution in Ruby? The html snippet: Pastebin Link.
This works for me:
require 'nokogiri'
html = `curl http://pastebin.com/raw/pYKwACBp`
doc = Nokogiri::HTML(html)
puts doc.text.gsub(/[\r\n]+/,"\n").strip

Ruby Filtering issue

I have been trying to filter some text from a file to only show certain years of birth.
I have some Ruby that reads a file, ages.txt, which contains names and years of birth:
Joe Bloggs (2001)
Mary Bloggs (1987)
John apples (2010)
Old Guy (1990)
I wish to be able to filter for a year range, e.g. to find people born between 1987 and 1990.
I have tried:
#open('ages.txt') { |f| f.grep(/^19(8[0-9]|10)$/) }
but I get an error:
ages.rb:3:in `===': invalid byte sequence in UTF-8 (ArgumentError)
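Any help would be appreciated.
The ^ and the $ anchor the expression to the start and end of each line, so it can only match a line that consists of nothing but the year. That will not work here, because each line also contains a name and parentheses.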
Try: /19(8[7-9]|90)/
Here is a working version. Let me know if this helps.
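As a separate alternative to encoding the range in the regex itself, the year can be extracted and compared as a number, which makes it easy to change the range. The following is a rough sketch of that idea in Python, not the version linked above; born_between and YEAR_IN_PARENS are names invented for the example. The encoding error from the question is a different issue (the file contains bytes that are not valid UTF-8) and is not addressed here.

```python
import re

# A four-digit year in parentheses, as in "Mary Bloggs (1987)".
YEAR_IN_PARENS = re.compile(r"\((\d{4})\)")

def born_between(lines, first_year, last_year):
    """Return the lines whose year of birth lies in [first_year, last_year]."""
    matches = []
    for line in lines:
        found = YEAR_IN_PARENS.search(line)
        if found and first_year <= int(found.group(1)) <= last_year:
            matches.append(line.strip())
    return matches

people = ["Joe Bloggs (2001)", "Mary Bloggs (1987)", "John apples (2010)", "Old Guy (1990)"]
for person in born_between(people, 1987, 1990):
    print(person)
```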

Algorithm for grouping names

What's a good way to group this list of names:
Doctor Watson.
Dr. John Watson.
Dr. J Watson.
Watson.
J Watson.
Sherlock.
Mr. Holmes.
S Holmes.
Holmes.
Sherlock Holmes.
Into a grouped list of unique and complete names:
Dr. John Watson.
Mr. Sherlock Holmes.
Also interesting:
Mr Watson
Watson
Mrs Watson
Watson
John Watson
Since the algorithm doesn't need to make inferences about whether the first Watson is a Mr (likely) or Mrs but only group them uniquely, the only problem here is that John Watson obviously belongs to Mr and not Mrs Watson. Without a dictionary of given names for each gender, this can't be deduced.
So far I've thought of iterating through the list and checking each item with the remaining items. At each match, you group and start from the beginning again, and on the first pass where no grouping occurs you stop.
Here's some rough (and still untested) Python. You'd call it with a list of names.
def groupedNames(ns):
    if len(ns) > 1:
        # First item is the query, the rest are target names to try matching
        q = ns[0]
        # For storing unmatched names, passed on later
        unmatched = []
        for i in range(1, len(ns)):
            t = ns[i]
            if areMatchingNames(q, t):
                # groupNames() groups two names into one, retaining all info
                return groupedNames([groupNames(q, t)] + unmatched + ns[i+1:])
            else:
                unmatched.append(t)
        # q matched nothing, so keep it and carry on grouping the rest
        return [q] + groupedNames(unmatched)
    # Zero or one name left: nothing more to group
    return ns
If your names are always of the form [honorific][first name or initial]LastName, then you can start by extracting and sorting by the last name. If some names have the form LastName[,[honorific][first name or initial]], you can parse them and convert to the first form. Or, you might want to convert everything to some other form.
In any case, you put the names into some canonical form and then sort by last name. Your problem is greatly reduced. You can then sort by first name and honorific within a last name group and then go sequentially through them to extract the complete names from the fragments.
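A minimal sketch of the canonical-form step might look like the following. It assumes names are of the form [honorific] [first name or initial] LastName, and the HONORIFICS set, parse_name and group_by_last_name are all made up for the illustration. A single word such as "Sherlock." is treated as a last name here, which is one of the ambiguities that would need separate handling.

```python
from collections import defaultdict

# Assumed honorifics; "doctor" is normalised to "dr".
HONORIFICS = {"mr", "mrs", "ms", "dr", "doctor"}

def parse_name(raw):
    """Split e.g. 'Dr. John Watson.' into (honorific, first, last), lower-cased.
    Missing parts are returned as empty strings."""
    words = [w.strip(".,").lower() for w in raw.split()]
    words = [w for w in words if w]
    honorific = ""
    if words and words[0] in HONORIFICS:
        honorific = "dr" if words[0] == "doctor" else words[0]
        words = words[1:]
    if not words:
        return (honorific, "", "")
    last = words[-1]
    first = words[0] if len(words) > 1 else ""
    return (honorific, first, last)

def group_by_last_name(names):
    """Map each last name to its sorted list of (first, honorific) fragments."""
    groups = defaultdict(list)
    for raw in names:
        honorific, first, last = parse_name(raw)
        groups[last].append((first, honorific))
    return {last: sorted(parts) for last, parts in groups.items()}

names = ["Doctor Watson.", "Dr. John Watson.", "Dr. J Watson.", "Watson.",
         "J Watson.", "Mr. Holmes.", "S Holmes.", "Holmes.", "Sherlock Holmes."]
for last, parts in group_by_last_name(names).items():
    print(last, parts)
```

Within each group the fragments sort so that initials come before the full first names they abbreviate, which makes a sequential pass to merge them straightforward.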
As you noted, there are some ambiguities that you'll have to resolve. For example, you might have:
John Watson
Jane Watson
Dr. J. Watson
There's not enough information to say which of the two (if either!) is the doctor. And, as you pointed out, without information about the gender of names, you can't resolve Mr. J. Watson or Mrs. J. Watson.
I suggest using hashing here.
Define a hash function that interprets each word as a base 26 number, where a = 0 and z = 25.
Now just hash the individual words and add the results. So:
h(sherlock holmes) = h(sherlock) + h(holmes) = h(holmes) + h(sherlock).
Using this you can easily identify names like:
John Watson and Watson John
For ambiguities like Dr. John Watson and Mr John Watson you can define the hash value for Mr and Dr to be the same.
To resolve conflicts like J. Watson and John Watson, you can hash just the first letter of the first name together with the last name. You can extend the idea for similar conflicts.
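The hashing idea can be sketched roughly as follows. The function names and the HONORIFICS set are assumptions for the example, and honorifics are given the shared hash value 0 (by skipping them), which is one possible reading of "the same". Note that with a = 0 a leading "a" contributes nothing, and sums of different words can collide, so equal hashes only suggest a match.

```python
# Assumed set of honorifics that all share the hash value 0.
HONORIFICS = {"mr", "mrs", "ms", "dr", "doctor"}

def word_hash(word):
    """Interpret the letters of a word as a base 26 number (a = 0, z = 25)."""
    value = 0
    for ch in word.lower():
        if "a" <= ch <= "z":
            value = value * 26 + (ord(ch) - ord("a"))
    return value

def clean_words(name):
    """Lower-case the words of a name, dropping punctuation and honorifics."""
    words = [w.strip(".,").lower() for w in name.split()]
    return [w for w in words if w and w not in HONORIFICS]

def name_hash(name):
    """Order-independent hash: the sum of the word hashes."""
    return sum(word_hash(w) for w in clean_words(name))

def initial_key(name):
    """Hash of the first initial plus the last name, for 'J. Watson' vs 'John Watson'.
    Unlike name_hash, this depends on word order."""
    words = clean_words(name)
    if not words:
        return 0
    if len(words) == 1:
        return word_hash(words[0])
    return word_hash(words[0][0]) + word_hash(words[-1])

print(name_hash("John Watson") == name_hash("Watson John"))
print(name_hash("Dr. John Watson") == name_hash("Mr John Watson"))
print(initial_key("J. Watson") == initial_key("John Watson"))
```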

Yugoslavia ISO country code clarification

I see two ISO codes for Yugoslavia.
891 - Yugoslavia and
807 - Macedonia, the Former Yugoslav Republic Of
Can someone clarify which one to use?
Yugoslavia (Jugoslavija) doesn't exist any more. Macedonia is one of the 6 former republics. Serbia and Montenegro separated a few years ago. Now, all former republics are independent countries: Serbia, Croatia, Slovenia, Bosnia and Herzegovina, Montenegro and Macedonia.
Yugoslavia no longer exists. The nation that was Yugoslavia has now been split into several smaller nations, one of which is Macedonia.
The 'Former Yugoslav Republic of' notation is only used for Macedonia (ie not by any of the other states that were previously part of Yugoslavia), and isn't generally used much anyway (I'm not even sure if it's still part of Macedonia's official name), but it may be useful to distinguish it from the Greek province of Macedonia, which is geographically very close to it.
There has been quite a lot of change in that region in recent years; make sure you have an up-to-date ISO codes list.
Former country name: Yugoslavia
Former codes: YU, YUG, 891
Period of validity: 1974–2003
ISO 3166-3 code: YUCS
New country names and codes: Name changed to Serbia and Montenegro (CS, SCG, 891)
Notes:
Alphabetic codes used for both SFR Yugoslavia and FR Yugoslavia
Numeric code changed from 890 (for SFR Yugoslavia) to 891 (for FR Yugoslavia) in 1993
YU currently transitionally reserved
.yu deleted
ISO 3166-2:YU changed to ISO 3166-2:CS
Source: en.wikipedia.org/wiki/ISO_3166-3
P.S. Since Yugoslavia doesn't exist any more, what do you need its code for?
FR Yugoslavia doesn't exist anymore. It disintegrated in the Yugoslav wars. Six independent countries have since been established from the 6 former Yugoslav republics. You can read more about Yugoslavia here, and about the disintegration of Yugoslavia here.
These countries are (name, local name and country code):
Serbia (Srbija) - 688
Croatia (Hrvatska) - 191
Bosnia and Herzegovina (Bosna i Hercegovina) - 070
Slovenia (Slovenija) - 705
FYR Macedonia (Makedonija) - 807
Montenegro (Crna Gora) - 499
Please note three crucial facts:
All of these countries are independent (and have their own country code);
The legal successor is Serbia, as the legal successor of Serbia and Montenegro, which was the legal successor of FR Yugoslavia;
FYR Macedonia has that name only because Greece does not accept Macedonia as the name of an independent country (due to possible geo-political and historical misunderstandings related to the Greek region of Macedonia).
Neither.
Wikipedia has a list of currently-assigned ISO-3166 country codes, and also links to the source UN documents.
891 is now Serbia & Montenegro. Yugoslavia as a country no longer exists.
