How do I obfuscate a column in a Hive view?

I have created a view for a table as:
CREATE VIEW anonymous_table
AS SELECT id, value FROM sensitive_table
and would like the id field of sensitive_table to be obfuscated somehow, with an MD5 hash or something similar, so that people querying the view can't see the actual id. What is a good way to do this in Hive?

Some options:
Don't include ID in your view at all:
CREATE VIEW something AS SELECT "HIDDEN ID", value from sensitive_table;
If you still need there to be a distinct key available for each record, you could write a UDF to do whatever transformation you like:
ADD JAR mycode.jar;
CREATE TEMPORARY FUNCTION hash as 'com.example.MyUDF';
CREATE VIEW something as SELECT hash(id), value from sensitive_table;
BONUS: Seeing as your users can just look at the sensitive table anyway, you could hash the IDs before they arrive in Hive. This is probably the best option, honestly.
Either way, if you're transforming the IDs, a stable (deterministic) hash function is what you need if people still have to rely on the IDs for joining, aggregation, etc.
Here is the link to how to create a UDF
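On Hive 1.3.0 and later you can also skip the custom jar entirely, since the built-in md5() and sha2() functions are stable hashes. A minimal sketch of the view:
CREATE VIEW anonymous_table
AS SELECT sha2(CAST(id AS STRING), 256) AS id, value FROM sensitive_table;
Keep in mind that any deterministic hash of a low-cardinality id can be reversed by hashing candidate values, so consider concatenating a secret salt onto the id before hashing.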

Quicksight Lookup Values in Another Table

For example: I have a table called account, which represents the user. The user is in an organization, but we only store the org id.
What I'm currently doing is using a calculated field with the ifelse function, but there are a number of other areas with a lot of entries, so it's a lot of work to create all these calculated fields.
Is there a smarter way to do this?
The best way to do this is to add a join between the two tables:
Add both datasets (user and orgs)
In the user dataset, use the "Add data" option
Select the org dataset
Add a join between the two; the join clause will be something like user.org_id = org.id
You are probably past this point by now, but at least an answer is here now.

Choose the first non-null field for each record in ServiceNow

How to create a view in ServiceNow that combines multiple columns (same data type) into one by picking the first non-null value? Note that it should not actually modify the underlying data.
After a search of the documentation I thought I had an answer with function fields, but GlideFunction doesn't seem to have nvl/coalesce as a function. The functionality called coalesce in ServiceNow seems to relate to importing/permanently modifying data only.
An example would be if you have employee and department, both of which have a location field. Show the employee's location unless it is null, otherwise show the employee's department's location.
In standard SQL, I would do it like this:
CREATE VIEW my_view AS (
SELECT COALESCE(employee.location, department.location) AS location
FROM employee JOIN department
ON employee.department_id = department.department_id
);
You have not mentioned how you are going to query this view. ServiceNow does not give us control over column selection when designing views the way standard SQL does.
Use GlideRecord to conditionally select the columns based on nullability.
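A minimal server-side sketch, assuming hypothetical employee and department tables where department is a reference field on employee:
var gr = new GlideRecord('employee'); // hypothetical table name from the question
gr.query();
while (gr.next()) {
    // use the employee's own location unless it is empty,
    // otherwise dot-walk through the department reference field
    var location = !gr.location.nil()
        ? gr.getDisplayValue('location')
        : gr.department.location.getDisplayValue();
    gs.info(gr.getDisplayValue('name') + ': ' + location);
}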

DynamoDB Throughput vs Search time

I've just figured out a big mistake I made while creating my DynamoDB structure.
I've created 11 tables, where one of them is the table most frequently referred to and the others are complementary tables.
For example, I have a table called "Names" where I hold names (together with other info), and another table called "NamesMappings" holding all the names added to the "Names" table. Each time a user wants to add a name to the "Names" table, he first tries to put the name in "NamesMappings", and only if that succeeds (meaning the name doesn't already exist) does he add the name to the "Names" table. This procedure helps because the name is not unique and is not the primary key in the "Names" table; with this technique I don't have to search the "Names" table to check whether the name exists. Instead, I try to add it to the "NamesMappings" table, and only if that succeeds do I know it is a unique name.
First of all, I would like to ask whether this is a common approach or there is a better one.
Next, I figured out that with this design I soon reached 11 tables, each with the default provisioned capacity of 5 reads and 5 writes, which adds up to 55 provisioned read and 55 provisioned write units overall, beyond the free tier. Then I understood why I get all these charges each month: as the number of tables grows and I leave the provisioned capacity at its default (both read and write capacity at 5), I accumulate more and more provisioned capacity.
So, what should my conclusion be from this? Should I try to reduce the number of tables, even if it takes more effort to perform scanning and querying inside a table? Or should I keep splitting tables as I do now, but reduce the capacity of the mapping tables that are used only to indicate whether an item exists in another table?
If I understand your problem correctly, you're missing the whole concept of NoSQL databases.
Your Names table should have a hash key (which is similar to a primary key) holding a uniformly generated identifier (a UUID is a great candidate). This automatically makes the table queryable by that unique identifier. You said, however, that you don't know the ID but only the Name. This leads me to think you could create a Global Secondary Index (GSI) on the Name attribute inside the Names table so you can also query by Name. Up to this point, your table structure should look like this:
id | name
Both of them are independently queryable, which gives you a lot of flexibility already.
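A sketch of that table definition with the AWS SDK for JavaScript (the index name, region, and capacity values are assumptions):
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB({ region: 'us-east-1' });

dynamodb.createTable({
  TableName: 'Names',
  AttributeDefinitions: [
    { AttributeName: 'id', AttributeType: 'S' },
    { AttributeName: 'name', AttributeType: 'S' }
  ],
  KeySchema: [{ AttributeName: 'id', KeyType: 'HASH' }],
  GlobalSecondaryIndexes: [{
    IndexName: 'name-index', // hypothetical GSI name
    KeySchema: [{ AttributeName: 'name', KeyType: 'HASH' }],
    Projection: { ProjectionType: 'ALL' },
    ProvisionedThroughput: { ReadCapacityUnits: 1, WriteCapacityUnits: 1 }
  }],
  ProvisionedThroughput: { ReadCapacityUnits: 1, WriteCapacityUnits: 1 }
}, (err, data) => err ? console.error(err) : console.log('Names table created'));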
Now, let's say you want to add the NameMapping attribute (which I don't know what it looks like); you can simply add it to the Names table, getting rid of the NamesMappings table and greatly reducing the number of WCUs and RCUs across your account. Your table structure should now look like this:
id | name | mappings
where mappings is, let's say, a JSON object.
Since you can only query on top-level attributes in DynamoDB, you can now perform a query against the name attribute, which has a GSI configured. If the query returns nothing, then the name is unique. If you still need some data inside the mappings object, you could query by name and, in your code, apply a map/filter/reduce operation on the mappings attribute and decide what to do next.
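For instance, a uniqueness check against that GSI could look like this (the index name matches the hypothetical definition above; name is a DynamoDB reserved word, hence the alias):
const AWS = require('aws-sdk');
const doc = new AWS.DynamoDB.DocumentClient();

doc.query({
  TableName: 'Names',
  IndexName: 'name-index',
  KeyConditionExpression: '#n = :name',
  ExpressionAttributeNames: { '#n': 'name' }, // 'name' is reserved, so alias it
  ExpressionAttributeValues: { ':name': 'Alice' }
}, (err, data) => {
  if (err) return console.error(err);
  console.log(data.Count === 0 ? 'name is unique' : 'name already exists');
});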
Remember that duplication is just OK in a NoSQL world. This may look scary if you come from a purely SQL background, but in a NoSQL database data should be stored in such a way that you can fetch all the needed information in one go, therefore avoiding "joins" (joins are still possible, but since there are no strong relationships between entities, you have to perform them manually at the code level). To give you some real context: imagine you have an Orders table where you keep track of the ordered Products and the Store that each Order belongs to. You'd save both the Product and the Store objects (and not just their IDs, as you would the SQL way) inside the Order object, so if you query for a given OrderId in the future, you don't need extra calls (aka "joins") to the Product/Store tables to fetch the information, since everything is already stored inside the Order object.
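As a sketch, such a denormalized Order item (all attribute names hypothetical) might look like this:
// everything needed to render the order is embedded in the item itself
const order = {
  orderId: 'o-1001',
  store: { storeId: 's-7', name: 'Downtown Store' },      // embedded, not a foreign key
  products: [
    { productId: 'p-1', name: 'Keyboard', price: 49.90 }, // duplicated from Products
    { productId: 'p-2', name: 'Mouse', price: 19.90 }
  ]
};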

Translate column name from one language to another

Is it possible to translate column names from one language to another?
For example:
In one of our applications, the DB columns are named in Polish.
When I run select * from table_name, I want to be able to read the column names in English.
Create a view on top of your original table and give your users access only to the view; hide the real table from the end users. In the view you can map the columns of the original table (with their national characters) to column names in the target language.
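A minimal sketch, using hypothetical Polish column names:
CREATE VIEW table_name_en AS
SELECT imie     AS first_name,   -- "imie" (first name), hypothetical column
       nazwisko AS last_name,    -- "nazwisko" (surname), hypothetical column
       miasto   AS city          -- "miasto" (city), hypothetical column
FROM table_name;

GRANT SELECT ON table_name_en TO app_users;  -- hypothetical role; grant nothing on the base table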
I think the quickest way to do this is to define views (some sort of wrappers over the tables). However, this leads to extra maintenance, as each schema change in the source tables will require a change in the corresponding view.

Data dictionaries and functionality behind Code Road Map

I was looking at the Code Road Map feature that Toad provides, which shows the dependencies of objects.
Can anyone tell me on what basis Toad generates the dependencies? I am assuming there is a data dictionary view, dba_dependencies, which works in the background to derive these relations.
So can we write a script to which we pass an object name (a package name, a table name, among others) and which shows the dependencies of the object we passed in?
In Code Road Map there is also an option to generate data for a table. How does this work?
What is the algorithm behind it? If there is a foreign key on the child table and the parent table is empty, how does it work? How will it populate the referenced parent table first and then the child table?
Looking at the user_dependencies/dba_dependencies view structure, querying the view with the REFERENCED_NAME column equal to the object you're interested in should give you a list of objects in which that object is referenced.
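For example (the object and schema names are placeholders):
SELECT owner, name, type, dependency_type
  FROM dba_dependencies
 WHERE referenced_name  = UPPER('MY_PACKAGE')  -- the object you pass in
   AND referenced_owner = UPPER('MY_SCHEMA');  -- its owning schema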
The second question is too broad, and probably only the Toad developers know how they've implemented it. The data dictionary provides information about the various constraints on a table. My guess would be that the algorithm looks at the data dictionary and has different code paths for handling constraints / master-child relations; another assumption would be the use of handled exceptions to ensure the data is generated cleanly.
