Implementing Hive UDF - hadoop

I have a Hive table with an ip_address column. How can I find the country, city, and ZIP code from that ip_address column?
I see a UDF written for this:
https://github.com/edwardcapriolo/hive-geoip
How do I use a UDF in Hive? Can I choose the function name myself?
The UDF says we need a separate database:
http://geolite.maxmind.com/download/geoip/database/GeoLiteCountry/GeoIP.dat.gz
How do I make that database available to Hive?
Any feedback will be appreciated.
Thanks,
Rio

You use UDFs in Hive by adding the jars and creating temporary functions, as described in your first link.
add file GeoIP.dat;
add jar geo-ip-java.jar;
add jar hive-udf-geo-ip-jtg.jar;
create temporary function geoip as 'com.jointhegrid.hive.udf.GenericUDFGeoIP';
You may change the function name to whatever you prefer; simply replace the word after "temporary function" ("geoip" above) with whatever name you want.
Adding the database you linked to is a matter of downloading it to your Unix server and then unzipping it with gzip. Once it is in the GeoIP.dat format, move it and the jars you've downloaded into your /users/(your username)/ directory and then run the code as instructed above. The files must be in your top directory or else explicitly targeted in your add file and add jar statements; by that I mean instead of add file GeoIP.dat; it would be add file /users/wertz/downloads/GeoIP.dat; for example.
Finally, looking at the code, the UDF takes three arguments. The first is the IP address; the second is the field you're looking for (choices appear to be COUNTRY_NAME, COUNTRY_CODE, AREA_CODE, CITY, DMA_CODE, LATITUDE, LONGITUDE, METRO_CODE, POSTAL_CODE, REGION, ORG, or ID); and the final argument is the filename of the GeoIP database, which hopefully you have not changed from GeoIP.dat.
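As a rough sketch of how the function might then be called (the table name weblogs and column ip_address are placeholders, and the third argument is assumed to be the same file name registered with add file):
select ip_address,
       geoip(ip_address, 'COUNTRY_NAME', './GeoIP.dat') as country,
       geoip(ip_address, 'CITY', './GeoIP.dat') as city,
       geoip(ip_address, 'POSTAL_CODE', './GeoIP.dat') as zip
from weblogs;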

Related

File And File Grouping SQL server

I have a filegroup named Year2020 which contains three different .ndf files, for example Summer.ndf, Winter.ndf, and Fall.ndf.
Now I want to create a Fall table and I want the table to be saved in the Fall.ndf file, not in Summer.ndf or Winter.ndf. Is there a way to do something like this? I am using SQL Server.
The problem is they are all in the same filegroup named Year2020, so how can we save it exactly where we want?
When I save the Fall table it goes into Summer.ndf, not Fall.ndf.
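For reference, a minimal sketch of the setup described, with a hypothetical database name and file paths (neither is from the question):
ALTER DATABASE SalesDb ADD FILEGROUP Year2020;
ALTER DATABASE SalesDb ADD FILE (NAME = Summer, FILENAME = 'C:\Data\Summer.ndf') TO FILEGROUP Year2020;
ALTER DATABASE SalesDb ADD FILE (NAME = Winter, FILENAME = 'C:\Data\Winter.ndf') TO FILEGROUP Year2020;
ALTER DATABASE SalesDb ADD FILE (NAME = Fall, FILENAME = 'C:\Data\Fall.ndf') TO FILEGROUP Year2020;
-- CREATE TABLE ... ON targets a filegroup (or partition scheme), not an individual file,
-- so rows in this table can be allocated across any file in Year2020.
CREATE TABLE Fall (FallId INT PRIMARY KEY) ON Year2020;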

Multiple field names

I have a txt file, which I have to insert into a database.
My problem is that in some files I have the header "customer_" instead of "customer".
I don't know how to fix this in Pentaho. I've tried the "Select values" step, but I have no idea how it works.
My transformation for now: Get File Names -> CSV file input -> Text file output -> Table output.
You have Metadata Injection capabilities built into Pentaho Data Integration, but just "any" file won't work; you need some kind of logic to determine that "customer_", or whatever you get, maps to the "customer" column in the database.
Once you have the logic to map the variations of possible columns in the origin file to columns in the table, you can inject that metadata into your transformation, as sketched below.
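Purely as an illustration of that mapping logic (the table and column names below are hypothetical, not part of the question), the variations could be kept in a small mapping table that a driver transformation reads before injecting the metadata:
CREATE TABLE column_map (
  file_header VARCHAR(100),  -- header as it appears in the incoming txt file
  db_column   VARCHAR(100)   -- column name in the target database table
);
INSERT INTO column_map (file_header, db_column) VALUES ('customer_', 'customer');
INSERT INTO column_map (file_header, db_column) VALUES ('customer', 'customer');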

Check the input file names against the file names in the config table

I have a folder which contains many files, and I have a configuration table in a SQL database which contains the list of file names that I need to load to Azure Blob Storage.
I tried getting the file names from the source folder using the 'Get Metadata' activity and then used a Filter activity to filter the file names, but this way I have to hard-code the file name inside the filter.
Can someone please let me know a way to do this?
Here is an example.
I have the below files in a folder.
And the below entries in the SQL config table.
This is how the sample pipeline looks.
1. Look up the list of files from the SQL config table and, using a ForEach activity, append each file name to an array variable. In my example it is config_files. (The append expressions are sketched after this list.)
2. Using Get Metadata, list the childItems in the folder and append the file names into another variable. In my example it is files.
3. Use a Set Variable activity to store the result, i.e. the files that match the entries in the config table.
Expression: @intersection(variables('files'), variables('config_files'))
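As a rough sketch of the Append Variable expressions behind steps 1 and 2 (the Lookup column name file_name and the activity name Get Metadata1 are assumptions, not taken from the original pipeline):
Step 1, inside a ForEach over the Lookup output: @item().file_name
Step 2, inside a ForEach over @activity('Get Metadata1').output.childItems: @item().name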

Informatica Reading From Metadata

I have a metadata file named CONTACTS (SOURCE.CSV|TARGET.CSV). I read this file using a reader and populate the values into a table that I created as CONTACT_TABLE (PK NUMBER, SOURCE_NAME VARCHAR2(500), TARGET_NAME VARCHAR2(500)). After that, I want to read the source.csv and target.csv files stored in my CONTACT_TABLE and populate the values into another table called SOURCE_COLUMN_TARGET_COLUMN_TABLE (PK, FK referencing the PK of CONTACT_TABLE, SOURCE_COLUMN, TARGET_COLUMN). This table should contain all the columns of source and target, with a one-to-one relationship between them; for example, source.csv(fn) ----- target.csv(firstName).
My objective is that whenever we add some other attribute to the source or target, I should not have to change the entire mapping; for example, if we add source.csv(email) and target.csv(email), it should map directly.
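A minimal sketch of the two tables just described, in Oracle SQL (the column widths and constraint details for the second table are assumptions on my part, since only the column names are given):
CREATE TABLE CONTACT_TABLE (
  PK          NUMBER PRIMARY KEY,
  SOURCE_NAME VARCHAR2(500),   -- e.g. 'source.csv'
  TARGET_NAME VARCHAR2(500)    -- e.g. 'target.csv'
);
CREATE TABLE SOURCE_COLUMN_TARGET_COLUMN_TABLE (
  PK            NUMBER PRIMARY KEY,
  FK            NUMBER REFERENCES CONTACT_TABLE (PK),  -- points to the parent CONTACT_TABLE row
  SOURCE_COLUMN VARCHAR2(500),  -- e.g. 'fn' from source.csv
  TARGET_COLUMN VARCHAR2(500)   -- e.g. 'firstName' from target.csv
);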
Thanks!
Please help! I have to complete this task before Friday. I have searched every source; I found the dynamic mapping and parameter approaches, but they were not very helpful, and I want to do it this way.
It's not clear what you are asking, actually. The Source Analyzer uses the source files (.csv) on import itself, and thereby the Source Qualifier contains the same format.
So, if any values get added to your existing files (source.csv, target.csv), it becomes a new file layout for your existing mapping. Hence, you don't need to change the whole mapping; you just need to import it again.

postgresql update a bytea field with image from desktop

I'm trying to write an UPDATE statement in PostgreSQL (PG Commander) that will update a user profile image column.
I've tried this:
update mytable set avatarImg = pg_read_file('/Users/myUser/profile.png')::bytea where userid=5;
I got: ERROR: absolute path not allowed
Read the file in the client.
Escape the contents as bytea.
Insert into database as normal.
(Elaborating on Richard's correct but terse answer; his should be marked as correct):
pg_read_file is really only intended as an administrative tool, and per the manual:
The functions shown in Table 9-72 provide native access to files on the machine hosting the server. Only files within the database cluster directory and the log_directory can be accessed.
Even if that restriction didn't apply, using pg_read_file would be incorrect; you'd have to use pg_read_binary_file. You can't just read text and cast to bytea like that.
The path restrictions mean that you must read the file using the client application, as Richard says. Read the file from the client, set it as a bytea parameter placeholder in your SQL, and send the query.
Alternately, you could use lo_import to read the server-side file in as a binary large object, then read that as bytea and delete the binary large object.
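A rough sketch of that large-object route, assuming a server-side path readable by the postgres process, the same hypothetical mytable/avatarImg/userid names as above, and PostgreSQL 9.4 or later for lo_get:
DO $$
DECLARE
  img_oid oid;
BEGIN
  -- Import the server-side file as a large object and remember its OID.
  img_oid := lo_import('/var/lib/postgresql/profile.png');
  -- Copy the large object's contents into the bytea column.
  UPDATE mytable SET avatarImg = lo_get(img_oid) WHERE userid = 5;
  -- Remove the temporary large object.
  PERFORM lo_unlink(img_oid);
END
$$;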
pg_read_file can read files only from the data directory path. If you would like to know your data directory path, use:
SHOW data_directory;
For example, it will show:
/var/lib/postgresql/data
Copy your file to the directory mentioned.
After that you can use just the file name in your query.
UPDATE student_card SET student_image = pg_read_file('up.jpg')::bytea;
Or you can use the pg_read_binary_file function:
UPDATE student_card SET student_image = pg_read_binary_file('up.jpg')::bytea;
