How to "update" a column using pig latin - hadoop

Imagine I have the following table available to me:
A: { x: int, y: int, z: int, ...99 other columns... }
I now want to transform this, such that z is set to NULL where x > y, with the resulting dataset to be stored as B.
I want to do this without having to explicitly mention all the other columns, as that becomes a maintenance nightmare.
Is there a simple solution?

This issue is tracked in this JIRA:
PIG-1693 There needs to be a way in foreach to indicate "and all the rest of the fields"
Currently I don't know of anything simpler than doing what you describe, or not loading z and adding a new column z alongside the star expression.

I was able to reduce some of the column bloat by nesting the columns in single-row bags and flattening them afterwards.
Still, it feels like a bit of a hack, so I'm also investigating Cascading to see if it's a better fit for my scenario.

A feature to facilitate your scenario was added in Pig 0.9. The new project-range operator (..) allows you to express a range of fields by indicating the starting and/or ending field names as in this example:
result = FOREACH someInput GENERATE field1, field2, null as field3, field4 .. ;
In the example above field1/2/3/4 are actual field names. One of the fields is set to null while the other fields are kept intact.
More details in this "New Apache Pig 0.9 Features – Part 3" article: http://hortonworks.com/blog/new-apache-pig-0-9-features-part-3-additional-features/
To solve your specific problem you probably want to do a FILTER and an UNION to combine the results.
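As a concrete sketch of the original question (using Pig's bincond operator ?: in place of the FILTER/UNION route, and assuming x, y and z are the first three fields of A, so that $3 .. covers the remaining columns):
B = FOREACH A GENERATE x, y, ((x > y) ? null : z) AS z:int, $3 .. ;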

Of course you can select columns by column number, but that can easily become a nightmare if you change anything at all. I have found column names to be much more stable, and therefore I recommend the following solution:
Update mycol when it is between two known columns
You can use .. to indicate leading, trailing, or in-between columns. Here is how that would work out if you want to change the value of MyCol to updatedvalue:
aliasAfter = FOREACH aliasBefore GENERATE
    .. colBeforeMyCol, updatedvalue AS MyCol, colAfterMyCol ..;

Related

Crystal - Compare Strings that do not fully match

I am having some trouble with a query in Crystal 2008. I have two tables with columns that are loosely related, both contain addresses. One table column is just a street name while the other is a street name plus some additional info. I want to find all records where these have the same street name and only show those. Example below:
Address    AddressB
123 St     123 St, ABC City
123 St     345 St, ABC City
I have tried using a formula such as below
if({AddressB} startswith {Address}) then {AddressB} else 'ERROR'
I have also tried this with LIKE as well as with * wildcards. Nothing seems to work. I will admit I am pretty amateurish with SQL and Crystal, so formulas are a new frontier for me in writing reports. I should also note that the tables are linked appropriately with inner joins.
Any help would be greatly appreciated!
This should work. Perhaps your {Address} column is padded with spaces, so try:
IF ({AddressB} startswith Trim({Address})) THEN {AddressB} ELSE 'ERROR'
Test the effect of replacing the reference to the column name with the static text value that you "think" is in that column.
If you get a different behavior, what you think is in that column is not what is actually in that column. For example, the column might contain non-printable characters. You can get rid of those using the Replace function.
If you don't get a different behavior, then show us the expression with the static text values. That would allow us to replicate the behavior and understand the situation.
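Regarding the Replace suggestion: a minimal sketch in Crystal formula syntax, assuming the stray characters are carriage returns or line feeds (Chr(13)/Chr(10) - an assumption, since we don't know what the column actually contains):
// strip CR/LF from {Address} before comparing
IF ({AddressB} startswith Trim(Replace(Replace({Address}, Chr(13), ""), Chr(10), ""))) THEN {AddressB} ELSE 'ERROR'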
Note: the problem might be in your table join logic. If you have no join condition, then all records in TableA would join to all the records in TableB. In that case, you need to place the fields in the detail section to get a proper sense of what is being compared to what. Or rethink your join logic. Perhaps you should move one table to a subreport, or a SQL Expression instead of trying to include both tables in the main report.

Power Query - conditional replace/clear entire cell in multiple columns

I'm trying to clear the entire cell if it doesn't contain a given keyword.
I've managed to do this for one column:
Table.ReplaceValue(#"PrevStep",
    each [#"My Column"],
    each if Text.PositionOf([#"My Column"], "keyword") > -1 then [#"My Column"] else null,
    Replacer.ReplaceValue,
    {"My Column"})
The problem is I need to iterate/repeat that step for a number of columns... the number of columns may vary and column names also may be different every time. I can have all those column names put into a list but I'm not able to use it.
The solution I'm looking for may look like this
for each ColNam in MyColumnsList
Table.ReplaceValue(#"PrevStep",each [#"ColNam"], each if Text.PositionOf([#"ColNam"],"keyword")>-1 then [#"ColNam"] else null,Replacer.ReplaceValue,MyColumnsList)
next
but this is Power Query M, not VBA code - and of course the problem is with #"PrevStep", which starts to look like recursion... again, I do not know how to proceed.
Is the path I'm following correct, or should it be done some other way?
Thanks
Andrew
Unpivot your columns to turn all of them into two columns. Apply your replacement to the single value column, then pivot it back into the original format.
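A minimal sketch of that approach, assuming the previous step is #"PrevStep", the keyword is "keyword", and there is an "ID" column with unique values that should stay fixed (all placeholder names):
let
    // turn every column except the key column into Attribute/Value pairs
    Unpivoted = Table.UnpivotOtherColumns(#"PrevStep", {"ID"}, "Attribute", "Value"),
    // clear values that do not contain the keyword
    Cleared = Table.TransformColumns(Unpivoted, {{"Value",
        each if _ <> null and Text.Contains(Text.From(_), "keyword") then _ else null}}),
    // pivot back to the original shape
    Repivoted = Table.Pivot(Cleared, List.Distinct(Cleared[Attribute]), "Attribute", "Value")
in
    Repivoted
One caveat: Table.Pivot needs the key column to identify each row uniquely, or it will fail on duplicate keys.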

Combine Tables matching a name pattern in Power Query

I am trying to combine many tables whose names match a pattern.
So far, I have extracted the table names from #shared and have the table names in a list.
What I haven't been able to do is loop over this list and transform it into a list of tables that can be combined.
e.g. Name is the list with the table names:
Source = Table.Combine( { List.Transform(Name, each #shared[_] )} )
The error is:
Expression.Error: We cannot convert a value of type List to type Text.
Details:
Value=[List]
Type=[Type]
I have tried many ways but I am missing some kind of type transformation.
I was able to transform this list of tables names to a list of tables with:
T1 = List.Transform(Name, each Expression.Evaluate(_, #shared))
However, the Expression.Evaluate feels like an ugly hack. Is there a better way for this transformation?
With this list of tables, I tried to combine them with:
Source = Table.Combine(T1)
But I got the error:
Expression.Error: A cyclic reference was encountered during evaluation.
If I extract a table from the list by index (e.g. T1{2}) it works. So, following this line of thinking, I would need some kind of loop to append them.
Steps illustrating the problem (screenshots omitted). The objective is to append (Table.Combine) every table named T_\d+_Mov:
1. Filter the matching table names into a table.
2. Convert it to a list.
3. Convert the names in the list to the actual tables.
4. Combine them - and this is where I am stuck.
It is important to note that I don't want to use VBA for this. It would be easier to recreate the query from VBA by scanning ThisWorkbook.Queries(), but that would not be a clean reload when adding or removing tables.
The final solution, as suggested by @Michal Palko, was:
CT1 = Table.FromList(T1, Splitter.SplitByNothing(), {"Name"}, null, ExtraValues.Ignore),
EC1 = Table.ExpandTableColumn(CT1, "Name", Table.ColumnNames(CT1{0}[Name]) )
where T1 was the previous step.
The only caveat is that the first table must have all the columns, or the missing ones will be skipped.
I think there might be an easier way, but given your approach, try converting your list to a table (a single column) and then expanding that column:
Alternatively use Table.Combine(YourList)
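For reference, a minimal sketch of the whole pattern-based combine without Expression.Evaluate. The StartsWith/EndsWith test here is a looser stand-in for the T_\d+_Mov regex, and the query holding this code must not itself match the pattern, or you get the cyclic reference error again:
let
    // all query names visible in the workbook environment
    AllNames = Record.FieldNames(#shared),
    // rough name-pattern filter; tighten as needed
    MatchingNames = List.Select(AllNames, each Text.StartsWith(_, "T_") and Text.EndsWith(_, "_Mov")),
    // look each table up by name
    MatchingTables = List.Transform(MatchingNames, each Record.Field(#shared, _)),
    Combined = Table.Combine(MatchingTables)
in
    Combined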

Query performance in PostgreSQL using 'similar to'

I need to retrieve certain rows from a table depending on certain values in a specific column, named columnX in the example:
select *
from tableName
where columnX similar to ('%A%|%B%|%C%|%1%|%2%|%3%')
So if columnX contains at least one of the values specified (A, B, C, 1, 2, 3), I will keep the row.
I can't find a better approach than using similar to. The problem is that the query takes too long for a table with more than a million rows.
I've tried indexing it:
create index tableName_columnX_idx on tableName (columnX)
where columnX similar to ('%A%|%B%|%C%|%1%|%2%|%3%')
However, if the condition is variable (the values could be other than A, B, C, 1, 2, 3), I would need a different index for each condition.
Is there any better solution for this problem?
EDIT: Thanks everybody for the feedback. It looks like I've arrived at this point because of a design mistake (a topic I've posted about in a separate question).
If you are only going to search lists of one-character values, then split each string into an array of characters and index the array:
CREATE INDEX ix_tablename_columnxlist
    ON tableName
    USING GIN ((REGEXP_SPLIT_TO_ARRAY(columnX, '')));
then search against the index:
SELECT *
FROM tableName
WHERE REGEXP_SPLIT_TO_ARRAY(columnX, '') && ARRAY['A', 'B', 'C', '1', '2', '3']
I agree with @Quassnoi: a GIN index is fastest and simplest - unless write performance or disk space are issues, because it occupies a lot of space and costs quite a bit of performance on INSERT, UPDATE and DELETE.
My additional answer is triggered by your statement:
I can't find a better approach than using similar to.
If that is what you found, then your search isn't over yet. SIMILAR TO is a complete waste of time. Literally. PostgreSQL only features it to comply with the (weird) SQL standard. Inspect the output of EXPLAIN ANALYZE for your query and you will find that SIMILAR TO has been replaced by a regular expression.
Internally every SIMILAR TO expression is rewritten to a regular expression. Consequently, for each and every SIMILAR TO expression there is at least one regular expression match that is a bit faster. Let EXPLAIN ANALYZE translate it for you, if you are not sure. You won't find this in the manual, PostgreSQL does not promise to do it this way, but I have yet to see an exception.
More details in this related answer on dba.SE.
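In this particular case, since every alternative in the pattern is a single character wrapped in %...%, the whole SIMILAR TO test collapses into one character-class regex - a sketch of the equivalent query:
SELECT *
FROM   tableName
WHERE  columnX ~ '[ABC123]';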
This strikes me as a data modelling issue. You appear to be using a text field as a set, storing single character codes to identify values present in the set.
If so, I'd want to remodel this table to use one of the following approaches:
Standard relational normalization. Drop columnX, and replace it with a new table with a foreign key reference to tableName(id) and a charcode column that contains one character from the old columnX per row, like CREATE TABLE tablename_columnx_set(tablename_id integer not null references tablename(id), charcode "char", primary key (tablename_id, charcode)). You can then fairly efficiently search for keys in columnX using normal SQL subqueries, joins, etc. If your application can't cope with that change, you could always keep columnX and maintain the side table using triggers. (A sketch of this approach follows after this list.)
Convert columnX to a hstore of keys with a dummy value. You can then use hstore operators like columnX ?| ARRAY['A','B','C']. A GiST index on the hstore of columnX should provide fairly solid performance for those operations.
Split to an array as recommended by Quassnoi if your table change rate is low and you can pay the costs of the GIN index;
Convert columnX to an array of integers, use intarray and the intarray GiST index. Have a mapping table of codes to integers or convert in the application.
Time permitting I'll follow up with demos of each. Making up the dummy data is a pain, so it'll depend on what else is going on.
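A minimal sketch of the first (normalization) option, assuming tablename has an integer primary key id:
CREATE TABLE tablename_columnx_set (
    tablename_id integer NOT NULL REFERENCES tablename(id),
    charcode     "char"  NOT NULL,
    PRIMARY KEY (tablename_id, charcode)
);

-- rows whose old columnX contained at least one of A, B, C, 1, 2, 3
SELECT t.*
FROM   tablename t
WHERE  EXISTS (
    SELECT 1
    FROM   tablename_columnx_set s
    WHERE  s.tablename_id = t.id
      AND  s.charcode IN ('A', 'B', 'C', '1', '2', '3')
);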
I'll post this as an answer because it may guide other people in the future: why not have six columns, haveA, haveB ... have3, and do a six-part OR query? Or use a bitmask?
If there are too many attributes to assign a column each, I might try creating an "attribute" table:
(fkey, attr) VALUES (1, 'A'), (1, 'B'), (2, '3')
and let the DBMS worry about the optimization.
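A sketch of querying such an attribute table (the table and column names here are the hypothetical ones from the example above):
SELECT DISTINCT a.fkey
FROM   attribute a
WHERE  a.attr IN ('A', 'B', '3');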

Improve SQL Server 2005 Query Performance

I have a course search engine and when I try to do a search, it takes too long to show search results. You can try to do a search here
http://76.12.87.164/cpd/testperformance.cfm
At that page you can also see the database tables and indexes, if any.
I'm not using Stored Procedures - the queries are inline using Coldfusion.
I think I need to create some indexes but I'm not sure what kind (clustered, non-clustered) and on what columns.
Thanks
You need to create indexes on columns that appear in your WHERE clauses. There are a few exceptions to that rule:
If the column only has one or two unique values (the canonical example of this is "gender" - with only "Male" and "Female" the possible values, there is no point to an index here). Generally, you want an index that will be able to restrict the rows that need to be processed by a significant number (for example, an index that only reduces the search space by 50% is not worth it, but one that reduces it by 99% is).
If you are searching for x LIKE '%something', then there is no point in an index. If you think of an index as specifying a particular order for rows, then sorting by x when you're searching for '%something' is useless: you're going to have to scan all rows anyway.
So let's take a look at the case where you're searching for "keyword 'accounting'". According to your result page, the SQL that this generates is:
SELECT *
FROM (
    SELECT TOP 10
        ROW_NUMBER() OVER (ORDER BY sq.name) AS Row,
        sq.*
    FROM (
        SELECT
            c.*,
            p.providername,
            p.school,
            p.website,
            p.type
        FROM
            cpd_COURSES c, cpd_PROVIDERS p
        WHERE
            c.providerid = p.providerid AND
            c.activatedYN = 'Y' AND
            (
                c.name like '%accounting%' OR
                c.title like '%accounting%' OR
                c.keywords like '%accounting%'
            )
    ) sq
) AS temp
WHERE Row >= 1 AND Row <= 10
In this case, I will assume that cpd_COURSES.providerid is a foreign key to cpd_PROVIDERS.providerid, in which case the join can already use the primary key index on cpd_PROVIDERS. (Note that SQL Server does not automatically create an index on the foreign key side, so an index on cpd_COURSES.providerid may still help the join.)
Additionally, the activatedYN column is a Y/N column and (according to my rule above about needing to restrict the rows by much more than 50%) such a column should not be indexed, either.
Finally, because you are searching with an x LIKE '%accounting%' query, you don't need an index on name, title or keywords either, because it would never be used.
So the main thing you need to do in this case is make sure that cpd_COURSES.providerid actually is a foreign key for cpd_PROVIDERS.providerid.
SQL Server Specific
Because you're using SQL Server, Management Studio has a number of tools to help you decide where you need to put indexes. If you use the "Index Tuning Wizard", it is actually usually pretty good at telling you what will give you good performance improvements. You just cut and paste your query into it, and it'll come back with recommendations for indexes to add.
You still need to be a little bit careful with the indexes that you add, because the more indexes you have, the slower INSERTs and UPDATEs will be. So sometimes you'll need to consolidate indexes, or just ignore them altogether if they don't give enough of a performance benefit. Some judgement is required.
Is this the real live database data? 52,000 records is a very small table, relatively speaking, for what SQL 2005 can deal with.
I wonder how much RAM is allocated to the SQL server, or what sort of disk the database is on. An IDE or even SATA hard disk can't give the same performance as a 15K RPM SAS disk, and it would be nice if there was sufficient RAM to cache the bulk of the frequently accessed data.
Having said all that, I feel the " (c.name like '%accounting%' OR c.title like '%accounting%' OR c.keywords like '%accounting%') " clause is problematic.
Could you create a separate Course_Keywords table, with two columns, courseid and keyword (varchar(24) should be sufficient for the longest keyword?), and a composite clustered index on courseid + keyword?
Then, to make the UI even more friendly, use AJAX to apply keyword validation & auto-completion when people type words into the keywords input field. This gives you the behind-the-scenes benefit of having an exact keyword to search for, removing the need for pattern-matching with the LIKE operator...
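A minimal sketch of that side table, assuming cpd_COURSES has a courseid key (the names here follow the suggestion above but are otherwise assumptions):
CREATE TABLE Course_Keywords (
    courseid int         NOT NULL,
    keyword  varchar(24) NOT NULL
);
CREATE CLUSTERED INDEX IX_Course_Keywords ON Course_Keywords (courseid, keyword);

-- exact-match keyword lookup, replacing c.keywords LIKE '%accounting%'
SELECT c.*
FROM   cpd_COURSES c
WHERE  EXISTS (
    SELECT 1
    FROM   Course_Keywords k
    WHERE  k.courseid = c.courseid
      AND  k.keyword = 'accounting'
);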
Using CF9? Try using Solr full text search instead of %xxx%?
You'll want to create indexes on the fields you search by. An index is a secondary list of your records presorted by the indexed fields.
Think of an old-fashioned printed phone book: if you want to look up a person by their last name, the book is already sorted that way - last name is the clustered index field. If you wanted to find the people named Jennifer, or the person with the phone number 867-5309, you'd have to scan every entry, and it would take a long time. If there were an index in the back with all the phone numbers or first names listed in order, along with the page where each person appears, it would be a lot faster. Those would be the nonclustered indexes.
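To make that concrete, a hedged sketch of adding a nonclustered index on one of the searchable columns (the index name and the choice of column are assumptions):
CREATE NONCLUSTERED INDEX IX_cpd_COURSES_providerid
    ON cpd_COURSES (providerid);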
I would try changing your IN statements to an EXISTS query to see if you get better performance on the zip code lookup. My experience is that IN works great for small lists, but the larger they get, the better performance you get out of EXISTS, since the query engine stops searching for a specific value at the first instance it runs into.
<CFIF zipcodes is not "">
EXISTS (
SELECT zipcode
FROM cpd_CODES_ZIPCODES
WHERE zipcode = p.zipcode
AND 3963 * (ACOS((SIN(#getzipcodeinfo.latitude#/57.2958) * SIN(latitude/57.2958)) +
(COS(#getzipcodeinfo.latitude#/57.2958) * COS(latitude/57.2958) *
COS(longitude/57.2958 - #getzipcodeinfo.longitude#/57.2958)))) <= #radius#
)
</CFIF>
