Assessing from the end of a split array in Hive - hadoop

I need to split a tag that looks something like "B1/AHU/_1/RoomTemp", "B1/AHU/_1/109/Temp", so with a variable with a variable number of fields. I am interested in getting the last field, or sometimes the last but one. I was disappointed to find that negative indexes do not count from the right and allow me to select the last element of an array in Hive as they do in Python.
select tag,split(tag,'[/]')[ -1] from sensor
I was more surprised when this did not work either:
select tag,split(tag,'[/]')[ size(split(tag,'[\]'))-1 ] from sensor
Both times giving me an error along the lines of this:
FAILED: SemanticException 1:27 Non-constant expressions for array indexes not supported.
Error encountered near token '1'
So any ideas? I am kind of new to Hive. Regex's maybe? Or is there some syntactic sugar I am not aware of?

This question is getting a lot of views (over a thousand now), so I think it needs a proper answer. In the event I solved it with this:
select tag,reverse(split(reverse(tag),'[/]')[0]) from sensor
which is not actually stated in the other suggested answers - I got the idea from a suggestion in the comments.
This:
reverses the string (so "abcd/efgh" is now "hgfe/dcba")
splits it on "/" into an array (so we have "hgfe" and "dcba")
extracts the first element (which is "hgfe")
then finally re-reverses (giving us the desired "efgh")
Also note that the second-to-last element can be retrieved by substituting 1 for the 0, and so on for the others.

There is a great library of Hive UDFs here. One of them is LastIndexUDF(). It's pretty self-explainatory, it retrieves the last element of an array. There are instructions to build and use the jar on the main page. Hope this helps.

This seem to work for me, this returns the last element from the SPLIT array
SELECT SPLIT(INPUT__FILE__NAME,'/')[SIZE(SPLIT(INPUT__FILE__NAME,'/')) -1 ] from test_table limit 10;

After reading the LanguageManual UDF a while, I luckily found the function substring_index exactly meets your requirement, dosen't need any additional calculations at all.
The manual says:
substring_index(string A, string delim, int count) returns the substring from string A before count occurrences of the delimiter delim (as of Hive 1.3.0). If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. Substring_index performs a case-sensitive match when searching for delim. Example: substring_index('www.apache.org', '.', 2) = 'www.apache'.
Use cases:
SELECT SUBSTRING_INDEX('www.mysql.com', '.', 2);
--www.mysql
SELECT SUBSTRING_INDEX('www.mysql.com', '.', -1);
--com
See here for more information.

Related

Excel - Search an exact match within a string

I'm currently struggling on finding the formula that will resolve my problem.
Here's the status quo:
In Sheet 1, column A, I have a set of string, such as:
/search.action?gender=men&brand=10177&tag=10203&tag=10336
/search.action?gender=women&brand=11579&tag=10001&tag=10138
/search.action?gender=men&brand=12815&tag=10203&tag=10299
/search.action?gender=women&brand=1396&tag=10203&tag=10513
/search.action?gender=women&brand=11&tag=10001&tag=10073
/search.action?gender=women&brand=1396&tag=10203&tag=10336
/search.action?gender=women&brand=13
In Sheet 2, column A, I have a set of strings such as:
brand=10177
brand=12815
brand=13
brand=1396
brand=11579
Finally, in sheet 1, column B will be my "filter" with the formula I'm struggling to find. The goal of my formula is to detect in any of the strings in sheet 1 if one of the string in sheet 2 is present (as an exact match!). Indeed, now it only finds approximative matches. As you can see, the row 5 shouldn't return anything. But with my current formula it does.
Here's the formula:
{=IFERROR(INDEX('Sheet 2'!$A$1:$A$5;MATCH(1;COUNTIF(A1;"*"&'Sheet 2'!$A$1:$A$5&"*");0));"")}
Any idea on the matter?
Please note that I don't want to use VBA, macros, but only a formula.
Thanks a lot for your help!
Following will solve your problem I guess:
=VLOOKUP(MID(A2,FIND("&",A2)+1,FIND("&",A2,FIND("&",A2)+1)-FIND("&",A2)-1),Sheet2!A:A,1,FALSE)
Basically with find function I have identified the start and length of the string in between "&" signs. and used in vlookup.
Another point to mention is this formula is only looking for the first 2 "&" signs.
For completeness, here is another solution based on this answer
=INDEX(Sheet2!$A$1:$A$5,MAX(IF(ISERROR(FIND(Sheet2!$A$1:$A$5,A1)),-1,1)*(ROW(Sheet2!$A$1:$A$5)-ROW(Sheet2!$A$1)+1)))
This is a bit more general and it doesn't matter how many search tags there are.
However as it stands it would match brand=13 in the second sheet with brand=1396 in the first sheet. To avoid that you could add an ampersand to the search strings:-
=INDEX(Sheet2!$A$1:$A$5,MAX(IF(ISERROR(FIND(Sheet2!$A$1:$A$5&"&",A1&"&")),-1,1)*(ROW(Sheet2!$A$1:$A$5)-ROW(Sheet2!$A$1)+1)))
This formula throws a #VALUE error if there is no match: to avoid this, you would need to put an IFERROR statement round it:-
=IFERROR(INDEX(Sheet2!$A$1:$A$5,MAX(IF(ISERROR(FIND(Sheet2!$A$1:$A$5&"&",A1&"&")),-1,1)*(ROW(Sheet2!$A$1:$A$5)-ROW(Sheet2!$A$1)+1))),"")
All these are array formulae.

Selecting multiple results from XQUERY query

I am trying to select multiple columns from a query, but so far, I can only manage to select one. So I'm basically stuck with either selecting one, or all of them.
Here's my expression, what I got so far, which select only (1) column:
let $y := doc("http://en.wikipedia.org/wiki/List_of_deaths_on_eight-thousanders")//table[preceding-sibling::h2//span[string() = "K2"]][1]
return $y/tr/td[2]/string()
I would love some explanation of how one would go about doing this, since there's almost no documentation of this lovely language.
How would you like the result to be returned? You could construct new elements, or concatenate strings. There are many ways that this could be accomplished.
Here's one way to get comma-separated values:
return $y/tr/fn:string-join( (td[2] | td[4]), ", " )
You can try it on zorb.io.
Update
(td[2] | td[4]) selects both elements, and passes them, as a sequence, to fn:string-join(). | is the XQuery union operator (and can be substituted for the keyword).
As far as documention, the functx site documents the standard library (all fn-prefixed functions), and has useful examples. And the specs are surprisingly readable.

Format statement with unknown columns

I am attempting to use fortran to write out a comma-delimited file for import into another commercial package. The issue is that I have an unknown number of data columns. My output needs to look like this:
a_string,a_float,a_different_float,float_array_elem1,float_array_elem2,...,float_array_elemn
which would result in something that might look like this:
L1080,546876.23,4325678.21,300.2,150.125,...,0.125
L1090,563245.1,2356345.21,27.1245,...,0.00983
I have three issues. One, I would prefer the elements to be tightly grouped (variable column width), two, I do not know how to define a variable number of array elements in the format statement, and three, the array elements can span a large range--maybe 12 orders of magnitude. The following code conceptually does what I want, but the variable 'n' and the lack of column-width definition throws an error (of course):
WRITE(50,900) linenames(ii),loc(ii,1:2),recon(ii,1:n)
900 FORMAT(A,',',F,',',F,n(',',F))
(I should note that n is fixed at run-time.) The write statement does what I want it to when I do WRITE(50,*), except that it's width-delimited.
I think this thread almost answered my question, but I got quite confused: SO. Right now I have a shell script with awk fixing the issue, but that solution is...inelegant. I could do some manipulation to make the output a string, and then just write it, but I would rather like to avoid that option if at all possible.
I'm doing this in Fortran 90 but I like to try to keep my code as backwards-compatible as possible.
the format close to what you want is f0.3, this will give no spaces and a fixed number of decimal places. I think if you want to also lop off trailing zeros you'll need to do a good bit of work.
The 'n' in your write statement can be larger than the number of data values, so one (old school) approach is to put a big number there, eg 100000. Modern fortran does have some syntax to specify indefinite repeat, i'm sure someone will offer that up.
----edit
the unlimited repeat is as you might guess an asterisk..and is evideltly "brand new" in f2008
In order to make sure that no space occurs between the entries in your line, you can write them separately in character variables and then print them out using theadjustl() function in fortran:
program csv
implicit none
integer, parameter :: dp = kind(1.0d0)
integer, parameter :: nn = 3
real(dp), parameter :: floatarray(nn) = [ -1.0_dp, -2.0_dp, -3.0_dp ]
integer :: ii
character(30) :: buffer(nn+2), myformat
! Create format string with appropriate number of fields.
write(myformat, "(A,I0,A)") "(A,", nn + 2, "(',',A))"
! You should execute the following lines in a loop for every line you want to output
write(buffer(1), "(F20.2)") 1.0_dp ! a_float
write(buffer(2), "(F20.2)") 2.0_dp ! a_different_float
do ii = 1, nn
write(buffer(2+ii), "(F20.3)") floatarray(ii)
end do
write(*, myformat) "a_string", (trim(adjustl(buffer(ii))), ii = 1, nn + 2)
end program csv
The demonstration above is only for one output line, but you can easily write a loop around the appropriate block to execute it for all your output lines. Also, you can choose different numerical format for the different entries, if you wish.

ORA-00907 Error when using Analytic Function in a Query (PS/Query, Peopletools 8.51.12)

Query's throwing an ORA-00907 Error when I try to paste a list of values into a criteria.
Background: I'm not a developer, I'm just an end user that's studied enough to where I can write queries using PS/Query within Peoplesoft,
for my company's implementation. I work with Peoplesoft's FSCM module
(Financials and Supply Chain Management), currently on Version FSCM
8.90.08.024, using I think Oracle 11g as the base database.
I'm mostly self-taught, and the technical experts we have are busy
with database/application stuff, or they aren't familiar with my
section's specific data needs.
I should point out that I'm unable to directly write SQL statements to
Query the database. I have to use a built-in program called "PS/Query"
(also known as Query Manager) with a GUI that writes the SQL for you
and saves it as a Query that you can run to the database to extract
data. This is relevant to my question only in that:
1. I cannot create or alter views/tables
2. I cannot perform any type of SQL Statement except "SELECT"
3. I can embed PL/SQL, MetaSQL and plain SQL into Expressions
4. At this point, Query Manager is the only option I have.
PS/Query is my only experience with SQL so far, aside from Oracle's
documentation and sites like this. From my research, it's considered
extremely confining by "actual" SQL programmers.The restrictions on it
require you to do things in a manner that violates what seem to be
best practices of SQL coding.
Query Request: I have a query I've been requested to write that pulls out spend (on Vouchers and POs) against certain system-defined
Category Codes. What I'm trying to do is pull in Voucher IDs, sum the
merchandise amounts on them by Vendor and Category Code, and display
the results. Or in other words, for every unique combination of
Vendor/Category, add up all the Voucher Amounts that have that
Vendor/Category combination.
Using the SUM (Fieldname) OVER (PARTITION BY fieldname, fieldname)
syntax.
So the end result should look something like...
Code Vendor Amount
123-45 Acme $5000.00
123-45 Apple $4200.00
123-46 Acme $750.00
With that said, here's the SQL that Query Manager is displaying to get the result set I showed above:
SELECT DISTINCT D.CATEGORY_CD, D.TN_DESCR1000, C.VENDOR_ID, E.NAME1, SUM ( A.MERCH_AMT_VCHR) OVER (PARTITION BY D.CATEGORY_CD, C.VENDOR_ID),E.SETID,E.VENDOR_ID
FROM PS_PO_LINE_MATCHED A, PS_PO_LINE B, PS_PO_HDR C, PS_ITM_CAT_TBL D, PS_VENDOR E, PS_PYMNT_VCHR_XREF F
WHERE A.BUSINESS_UNIT = B.BUSINESS_UNIT
AND A.PO_ID = B.PO_ID
AND A.LINE_NBR = B.LINE_NBR
AND B.BUSINESS_UNIT = C.BUSINESS_UNIT
AND B.PO_ID = C.PO_ID
AND D.CATEGORY_ID = B.CATEGORY_ID
AND D.EFFDT =
(SELECT MAX(D_ED.EFFDT) FROM PS_ITM_CAT_TBL D_ED
WHERE D.SETID = D_ED.SETID
AND D.CATEGORY_TYPE = D_ED.CATEGORY_TYPE
AND D.CATEGORY_CD = D_ED.CATEGORY_CD
AND D.CATEGORY_ID = D_ED.CATEGORY_ID
AND D_ED.EFFDT <= SYSDATE)
AND ( F.SCHEDULED_PAY_DT >= TO_DATE('2010-07-01','YYYY-MM-DD')
AND F.SCHEDULED_PAY_DT <= TO_DATE('2011-06-30','YYYY-MM-DD'))
AND D.CATEGORY_CD LIKE :1
AND E.VENDOR_ID = C.VENDOR_ID
AND A.BUSINESS_UNIT = F.BUSINESS_UNIT
AND A.VOUCHER_ID = F.VOUCHER_ID
ORDER BY 1
Underlying Issue: This works fine, but it can only prompt on one
Category Code at a time. Category Codes are 5 digits, a 3-digit
"Class" followed by a dash and then a 2-digit "subclass. I have a list
of 375 Category Codes I need to get this Query result for.
I've set up a prompt on this version that allows entry of a Wildcard
(So 123-%%), but that's still about a hundred separate runs of the
Query. Query Manager allows use of an "In List" expression type in
Criteria, but it requires you to manually enter each entry in the
list.
I'm trying to set it up to where I can paste a plaintext copy of the
Code list into an Expression, with proper quotes/commas, and have it
evaluate that to give me a combined list of all the NIGP codes
specified. The Prompt field created by Query Manager doesn't allow
pasting of lists (as far as I know).
Attempted Solution: I viewed the page at http://peoplesoft.ittoolbox.com/groups/technical-functional/peoplesoft-other-l/create-an-expression-in-psoft-90-query-to-paste-a-list-of-emplids-2808427 and I've tried some of the answers given there, but none of them worked. That page led to me trying this modified SQL (obviously the list of codes is truncated a bit for display here):
SELECT DISTINCT D.CATEGORY_CD, D.TN_DESCR1000, C.VENDOR_ID, E.NAME1, SUM ( A.MERCH_AMT_VCHR) OVER (PARTITION BY D.CATEGORY_CD, C.VENDOR_ID),E.SETID,E.VENDOR_ID
FROM PS_PO_LINE_MATCHED A, PS_PO_LINE B, PS_PO_HDR C, PS_ITM_CAT_TBL D, PS_VENDOR E, PS_PYMNT_VCHR_XREF F
WHERE A.BUSINESS_UNIT = B.BUSINESS_UNIT
AND A.PO_ID = B.PO_ID
AND A.LINE_NBR = B.LINE_NBR
AND B.BUSINESS_UNIT = C.BUSINESS_UNIT
AND B.PO_ID = C.PO_ID
AND D.CATEGORY_ID = B.CATEGORY_ID
AND D.EFFDT =
(SELECT MAX(D_ED.EFFDT) FROM PS_ITM_CAT_TBL D_ED
WHERE D.SETID = D_ED.SETID
AND D.CATEGORY_TYPE = D_ED.CATEGORY_TYPE
AND D.CATEGORY_CD = D_ED.CATEGORY_CD
AND D.CATEGORY_ID = D_ED.CATEGORY_ID
AND D_ED.EFFDT <= SYSDATE)
AND ( F.SCHEDULED_PAY_DT >= TO_DATE('2010-07-01','YYYY-MM-DD')
AND F.SCHEDULED_PAY_DT <= TO_DATE('2011-06-30','YYYY-MM-DD'))
AND D.CATEGORY_CD = '005-00' OR D.CATEGORY_CD IN ('015-00,'' '015-06,'' '015-10,'' '615-07'')
AND E.VENDOR_ID = C.VENDOR_ID
AND A.BUSINESS_UNIT = F.BUSINESS_UNIT
AND A.VOUCHER_ID = F.VOUCHER_ID
ORDER BY 1
And the SQL above is what's giving me the ORA-00907 error. Has anyone ran into this problem before? Massive wall of text, I know. My apologies. This is my first post here and I'm trying not to leave anything relevant out.
I've got the immediate problem that spurred this question fixed,but that request is just the tip of a very large iceberg, and at some point I need to figure out a way to be able to paste plaintext lists in as criteria using Query Manager, preferably in a way that plays nice with Analytic Grouping.
TL;DR version:
Using Peoplesoft Query Manager to do an Analytic SUM with grouping using OVER, PARTITION BY. When I try to paste a list into the criteria, it throws an ORA-00907 Error.
Any help would be greatly appreciated. Thanks!
Ok, after a bit more tweaking with this, I've found what I think is the underlying issue.
The error, in this case, is two-fold. Part of it was my fault (I didn't check for Peoplesoft mangling the quotation marks I pulled from Word), and part of it was the way Query Manager interprets some kinds of functions (you have to wrap some stuff in a Case When statement to get it to evaluate properly).
First, the "My Fault" part:
Every time I was pasting in my list of test NIGP Codes, I was doing it from a file I kept saved in Microsoft Word.
Which has the probably-handy "replace straight quotes with smart quotes" feature. Peoplesoft goes bonkers when its presented a "smart quote", and will display them as upside-down question marks (there's probably a technical term, I don't know it).
So when I'd test suggestions (such as fixing the quote/comma order as suggested by #Rene Nyffenegger and #WayneH) I'd start with my base test query, add in the expressions and test it, saving it as a separate query. If they didn't work, I'd go back to the base query. That way I could iterate changes and save potential tests as different versions.
My mistake was in not saving the different versions, leaving the application and going back in. It's when you save the query, leave the page, go somewhere else in Peoplesoft, then go back to open Query Manager that it actually shows you that it's doing the character conversion. You can't see it unless you do that. Even though Query Manager is doing it. So it was throwing a character Query Manager wouldn't recognize, but not showing me the character it wouldn't recognize.
I got a new work PC recently, and I've now disabled the Smart Quotes auto-replace for future use.
Second, the "Query Manager: part:
On the version of this that I got to work, I made use of wrapping the "IN" function inside a Case statement. I've found that a lot of SQL functions, when used "plain" (as I'd define them by just copy-pasting from Oracle's definitions pages and filling in the appropriate variables) tend to give PS/Query (Query Manager) heartburn. But if you wrap them inside a CASE...WHEN...END statement that evaluates the result of the function and then build a criteria that selects based on certain values of that result, the function will work and properly display a result.
So for an example, set up this expression (like in the example from #qyb2zm302). I'm using different codes from what was in my original example, but they work the same (they're all five-digit, character-typed codes consisting of three digits, a dash, then two digits)
Case when E.CATEGORY_CD IN
('375-15', '375-30', '375-54', '375-60', '380-30','938-63')
then 'true'
else 'false'
end
And then set a criteria:
AND
Case when E.CATEGORY_CD IN
('375-15', '375-30', '375-54', '375-60', '380-30','938-63')
then 'true'
else 'false'
end
= 'true'
It'll run to completion and return any rows that have that Category Code.
If you don't want to do that, you can do like in #qyb2zm302's Method 2. The only downside to that in Query Manager is that you have to enter them into individual rows in the "List", and if you can only copy-paste 25 at a time.
Wrapping it in a Case Statement lets you paste it directly into an Expression, which is far better for larger lists.
Solutions:
The above is the code I went with that worked. It's simplifying a bit for brevity's sake, but it works.
In List works through the native Query Manager option as long as you manually-populate the list
D.CATEGORY_CD = '005-00' OR works as long as you wrap it in a Case Statement
D.CATEGORY_CD IN ('015-00','015-06','015-10','615-07') works as long as you wrap it in a Case Statement
Peoplesoft hates Smart Quotes. None of the above will work if you're copying quotation marks directly from Word, but you won't see it unless you save, leave and come back to the same query in edit mode
Formatting is important. All of the above require the proper comma/quotation formatting, as pointed out by Rene and Wayne. Meaning: ('xxx-xx', 'xxx-01','xxx-02') etc
Thanks to everyone who helped on this! I don't think I've head-desked this hard before on any question, but I guess that's part of the learning process. Since all the answers posted are valid and correct (or at least a portion of the larger "correct"), I'm going to flag them all.
The
D.CATEGORY_CD IN ('015-00,'' '015-06,'' '015-10,'' '615-07'')
part looks fishy to me
Since a '' within a string "evaluates" to a single ' the first string is
'015-00,'' '
followed by (the non-string)
015-06,
The following '' is probably the thing that the parser stumbles upon since it's pretty meaningless.
Edit try it with a D.CATEGORY_CD IN ('015-00', '015-06', '015-10', '615-07').
Following the link you posted, I see 2 methods for doing what you are trying to accomplish.
I also notice that you tried a 3rd method.
Method 1
Criteria > Add Criteria
Expression Type: Character
Length: 255
Expression Text: D.CATEGORY_CD IN ('015-00','015-06','015-10','615-07') AND 1
Condition Type: equal to
Constant: 1
Method 2
Criteria > Add Criteria
Field: D.CATEGORY_CD
Condition Type: in list
Value: 015-00','015-06','015-10','615-07
Method 3 (Your Method)
Criteria > Add Criteria
Field: D.CATEGORY_CD
Condition Type: equal to
Define Expression: '015-00' OR D.CATEGORY_CD IN ('015-00','015-06','015-10','615-07')
Question) Does the below exactly match the text you are putting the Expression box?
'015-00' OR D.CATEGORY_CD IN ('015-00','015-06','015-10','615-07')
If not, what are you putting in that box?
I think the D.CATEGORY_CD criteria are giving you the problems, I changed the double quotes to single quotes and then it still looked strange to me. I then notice the commas are inside your quotes and not between them, try making the one criteria line look like this:
before:
OR D.CATEGORY_CD IN ('015-00,'' '015-06,'' '015-10,'' '615-07'')
after:
OR D.CATEGORY_CD IN ('015-00', '015-06', '015-10', '615-07')
Also, the "IN" is an implied "OR" and I am not sure if you have parenthesis around the two D.CATEGORY_CD,
I would just put the one additional code into the IN criteria and remove the "D.CATEGORY_CD =" line:
before:
AND D.CATEGORY_CD = '005-00' OR D.CATEGORY_CD IN ('015-00', '015-06', '015-10', '615-07')
after:
AND D.CATEGORY_CD IN ('015-00', '015-06', '015-10', '615-07', '005-00')
Of course, you are already ordering by CATEGORY_CD, you could remove this criteria and pull all categories in one run (that is unless there are too many rows for excel), and then you might also want to include either VENDOR_ID or NAME1 in the ORDER BY clause.
Hope that helps you.

Selecting random phrase from a list

I've been playing around with a .lua file which passes a random phrase using the following line:
SendChatMessage(GetRandomArgument("text1", "text2", "text3", "text4"), "RAID")
My problem is that I have a lot of phrases and the one line of code is very long indeed.
Is there a way to hold
text1
text2
text3
text3
in a list somewhere else in the code (or externally) and call a random value from the main code. Would make maintaining the list of text options easier.
For lists up to a few hundred elements, then the following will work:
messages = {
"text1",
"text2",
"text3",
"text4",
-- ...
}
SendChatMessage(GetRandomArgument(unpack(messages)), "RAID")
For longer lists, you would be well served to replace GetRandomArgument with GetRandomElement that would take a single table as its argument and return a random entry from the table.
Edit: Olle's answer shows one way that something like GetRandomElement might be implemented. But it used table.getn on every call which is deprecated in Lua 5.1, and its replacement (table.maxn) has a runtime cost proportional to the number of elements in the table.
The function table.maxn is only required if the table in use might have missing elements in its array part. However, in this case of a list of items to choose among, there is likely to be no reason to need to allow holes in the list. If you need to edit the list at run time, you can always use table.remove to remove an item since it will also close the gap.
With a guarantee of no gaps in the array of text, then you can implement GetRandomElement like this:
function GetRandomElement(a)
return a[math.random(#a)]
end
So that you send the message like this:
SendChatMessage(GetRandomElement(messages), "RAID")
You want a table to contain your phrases like
phrases = { "tex1", "text2", "text3" }
table.insert(phrases ,"text4") -- alternative syntax
SendChatMessage(phrases[math.random(table.getn(phrases))], "RAID")
Note: getn gets the size of the table; math.random gets a random number (with a max of the size of the phrases table) and the phrases[] syntax returns the table element at the index inside [].

Resources