Hive function to retrieve particular array element - hadoop

I have a table that stores strings in an array. I couldn't figure out why, but a simple example looks like this:
+--------+----------------------------------+
| reason | string |
+--------+----------------------------------+
| \N | \N\N\N\NXXX - ABCDEFGH\N\N |
| \N | \N\N\N\NXXX - ABCDEFGH |
| \N | \N\N\N\N |
| \N | \N\N\N\NXXX - ABCDEFGH\N |
| \N | \N\N |
| \N | \N\N\N |
| \N | \N |
+--------+----------------------------------+
You can't see it in the table above, but the true format of the first string contains an extra separator character (shown in an image in the original post).
Basically, what I would like to retrieve is:
+--------+----------------------------------+
| reason | string |
+--------+----------------------------------+
| \N | XXX - ABCDEFGH |
+--------+----------------------------------+
XXX always remains the same, but ABCDEFGH may be any string.
The problem is that I can't use path.path.path_path[4], because the string XXX - ABCDEFGH may be the 4th or any other element of the array (even the 20th).
I tried where lower(path.path.string) like ('xxx - %') but received an error:
Select
path.path.reason,
path.path.string
From table_name
Where path.id = '123'
And datestr = '2018-07-21'

This regular expression will do the job for you: ([^\N$])+ (assuming the character shown in the image is a $).
First, you can use regexp_extract() to retrieve a particular array element.
It has the following syntax:
regexp_extract(string subject, string pattern, int index)
Second, you can use regexp_replace which has the following syntax:
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)
Test Data
WITH string_column AS (
    SELECT explode(array(
        'XXX - ABCSSSSSSSSSSSGH\N\N',
        '\N$\N$\N$\N$XXX - ABCDEFGH$\N\N',
        '\N\N\N\N',
        '\N\N\N\NXXX - ABCDEFGH\N')) AS str_column
)
SELECT regexp_replace(regexp_extract(str_column, '([^\N$])+', 0), "$", " ") AS string_col
FROM string_column
Will result in
------------------------------
| string_col |
------------------------------
| XXX - ABCSSSSSSSSSSSGH |
------------------------------
| XXX - ABCDEFGH |
------------------------------
| |
------------------------------
| XXX - ABCDEFGH |
------------------------------
Note: an index of 0 tells regexp_extract() to return the entire match for the pattern.
regexp_extract(str_column, '(,|[^\N$])+', 0)
The following statement then replaces any occurrence of '$' with a space:
regexp_replace(regexp_extract(str_column, '([^\N$])+', 0), "$", " ")
For more information on
regexp_replace & regexp_extract(): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
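For readers outside Hive, the same cleanup logic can be sketched in Python's re module. This is a hypothetical analogue of the regexp_extract/regexp_replace combination above, assuming the literal token \N marks nulls and $ is the hidden separator:

```python
import re

def extract_payload(raw):
    """Strip literal '\\N' null markers and '$' separators,
    returning the remaining text (the 'XXX - ...' payload)."""
    # Remove every literal backslash-N token, then turn '$' into spaces.
    cleaned = re.sub(r'\\N', '', raw)
    cleaned = cleaned.replace('$', ' ')
    return cleaned.strip()

print(extract_payload(r'\N$\N$\N$\N$XXX - ABCDEFGH$\N\N'))  # -> XXX - ABCDEFGH
print(extract_payload(r'\N\N\N\N'))                          # -> (empty string)
```

This mirrors the Hive answer's intent (keep only the non-null payload) rather than its exact character-class pattern.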

Related

How to split a row where there are 2 pieces of data in each cell, separated by a carriage return?

Someone gives me a file that sometimes contains badly formed data.
The data should look like this:
+---------+-----------+--------+
| Name | Initial | Age |
+---------+-----------+--------+
| Jack | J | 43 |
+---------+-----------+--------+
| Nicole | N | 12 |
+---------+-----------+--------+
| Mark | M | 22 |
+---------+-----------+--------+
| Karine | K | 25 |
+---------+-----------+--------+
Sometimes, though, it comes like this:
+---------+-----------+--------+
| Name | Initial | Age |
+---------+-----------+--------+
| Jack | J | 43 |
+---------+-----------+--------+
| Nicole | N | 12 |
| Mark | M | 22 |
+---------+-----------+--------+
| Karine | K | 25 |
+---------+-----------+--------+
As you can see, Nicole and Mark are put in the same row, but the data are separated by a carriage return.
I can split by row, but that duplicates the data:
+---------+-----------+--------+
| Nicole | N | 12 |
| | M | 22 |
+---------+-----------+--------+
| Mark | N | 12 |
| | M | 22 |
+---------+-----------+--------+
This makes me lose the fact that Mark is associated with the second row of data.
(The data here is purely an example)
One way to do this is to transform each cell into a list by doing a Text.Split on the line feed / carriage return symbol.
TextSplit = Table.TransformColumns(Source,
{
{"Name", each Text.Split(_,"#(lf)"), type text},
{"Initial", each Text.Split(_,"#(lf)"), type text},
{"Age", each Text.Split(_,"#(lf)"), type text}
}
)
Now each column is a list of lists, which you can combine into one long list using List.Combine, and you can glue these columns together into a table with Table.FromColumns.
= Table.FromColumns(
{
List.Combine(TextSplit[Name]),
List.Combine(TextSplit[Initial]),
List.Combine(TextSplit[Age])
},
{"Name", "Initial", "Age"}
)
Putting this together, the whole query looks like this:
let
Source = <Your data source>,
TextSplit = Table.TransformColumns(Source,{{"Name", each Text.Split(_,"#(lf)"), type text},{"Initial", each Text.Split(_,"#(lf)"), type text},{"Age", each Text.Split(_,"#(lf)"), type text}}),
FromColumns = Table.FromColumns({List.Combine(TextSplit[Name]),List.Combine(TextSplit[Initial]),List.Combine(TextSplit[Age])},{"Name","Initial","Age"})
in
FromColumns
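The same idea (split every cell on the line feed, then pair the pieces back up position by position, as Table.FromColumns does) can be sketched in Python for readers outside Power Query. The function name and sample rows are illustrative, not part of the M solution:

```python
def expand_multiline_rows(rows):
    """Split every cell on '\n' and emit one output row per piece,
    keeping pieces at the same position together."""
    out = []
    for row in rows:
        parts = [cell.split('\n') for cell in row]
        out.extend(zip(*parts))  # pair up the i-th piece of each column
    return [list(r) for r in out]

data = [
    ['Jack', 'J', '43'],
    ['Nicole\nMark', 'N\nM', '12\n22'],  # two records packed into one row
    ['Karine', 'K', '25'],
]
print(expand_multiline_rows(data))
# -> [['Jack', 'J', '43'], ['Nicole', 'N', '12'], ['Mark', 'M', '22'], ['Karine', 'K', '25']]
```

Note that, like the M version, this assumes every cell in a packed row contains the same number of line-separated pieces.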

how to count the number of words in each column delimited by a "|" separator using hive?

input data is
+----------------------+--------------------------------+
| movie_name | Genres |
+----------------------+--------------------------------+
| digimon | Adventure|Animation|Children's |
| Slumber_Party_Massac | Horror |
+----------------------+--------------------------------+
I need output like:
+----------------------+--------------------------------+-----------------+
| movie_name | Genres | count_of_genres |
+----------------------+--------------------------------+-----------------+
| digimon | Adventure|Animation|Children's | 3 |
| Slumber_Party_Massac | Horror | 1 |
+----------------------+--------------------------------+-----------------+
select *
,size(split(coalesce(Genres,''),'[^|\\s]+'))-1 as count_of_genres
from mytable
This solution covers various use cases, including:
NULL values
Empty strings
Empty tokens (e.g. Adventure||Animation or Adventure| |Animation)
This is a really, really bad way to store data. You should have a separate MovieGenres table with one row per movie and per genre.
One method is to use length() and replace():
select t.*,
(1 + length(genres) - length(replace(genres, '|', ''))) as num_genres
from t;
This assumes that each movie has at least one genre. If not, you need to test for that as well.
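Both counting strategies above can be sketched in Python for illustration; these hypothetical helpers mirror the token-based answer (which handles NULLs and empty tokens) and the length-difference answer (which assumes at least one genre):

```python
def count_genres(genres):
    """Token-based count: NULL/empty -> 0; empty tokens ignored."""
    if not genres:
        return 0
    return len([t for t in genres.split('|') if t.strip()])

def count_genres_by_length(genres):
    """Length-difference count: 1 + number of '|' delimiters.
    Assumes a non-empty string with at least one genre."""
    return 1 + len(genres) - len(genres.replace('|', ''))

print(count_genres("Adventure|Animation|Children's"))  # -> 3
print(count_genres('Adventure||Animation'))            # -> 2
print(count_genres(None))                              # -> 0
print(count_genres_by_length('Horror'))                # -> 1
```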

PageObject/Cucumber String being input incorrectly

In my scenario outline I have the following:
Examples:
| user | password | from | to | amount | date | message |
| joel10 | lolpw12 | bankA | bankB | $100 | 1/30/2015 | Transfer Success. |
In my step definitions I have the following:
And(/^the user inputs fields (.*), (.*), (.*)$/) do |from, to, amount|
on(TransferPage).from = /#{from}/
on(TransferPage).to = /#{to}/
on(TransferPage).amount = /#{amount}/
on(TransferPage).date = /#{date}/
end
The FROM, TO, and AMOUNT all come out correctly from the table, but when it inputs the date, it comes out as (?-mix:1/30/2015).
Why is this happening, and how do I fix it?
When you do /#{date}/ you are taking the value returned from the parsing of the step definition and then turning it into a regular expression:
/#{date}/.class
#=> Regexp
You presumably want to leave the value in its original String format:
on(TransferPage).date = date

Grid table with hard line break in first line with pandoc

How can I make pandoc create a grid-table cell whose first element is a line break?
The following makes LaTeX fail with ! LaTeX Error: There's no line here to end.
+-----------------------------------+------------------------------------+
| X | Y |
+===================================+====================================+
|Total revenue:\ | \ |
| - Current year\ | YYY\ |
| - Previous year\ | XXX\ |
|Total profit/loss:\ | \ |
| - Current year\ | TTT\ |
| - Previous year | ZZZ |
+-----------------------------------+------------------------------------+
Using line breaks for layout is frowned upon in Markdown and HTML, not only because it's not semantic, but also because it often doesn't work well (as in your example).
I would make the table.... well, more tabular:
X Y
-------------- -------------- ---- ----
Total revenue:
Current year XXX YYY
Previous year XXX YYY
Total profit:
Current year XXX YYY
Previous year XXX YYY
Using \mbox{} \ will create the expected line breaks with no LaTeX errors:
+-----------------------------------+------------------------------------+
| X | Y |
+===================================+====================================+
|Total revenue:\ | \mbox{} \ |
| - Current year\ | YYY\ |
| - Previous year\ | XXX\ |
|Total profit/loss:\ | \mbox{} \ |
| - Current year\ | TTT\ |
| - Previous year | ZZZ |
+-----------------------------------+------------------------------------+
I have created an issue on GitHub: https://github.com/jgm/pandoc/issues/1733

multiline text in varchar2 field

I have multiline data, and I'd like to insert it into a table and then retrieve it while preserving the positions of the carriage returns.
For example, I have data like this in a text file:
-------------------------------
| ID | text |
| | |
| 01 | This is headline. |
| 02 | This is all the text.|
| | ¤ |
| | Of great story once |
| 03 | Great weather |
-------------------------------
The ¤ is the indicator of a carriage return. When I run the query, the data comes out like this:
-------------------------------
| ID | text |
| | |
| 01 | This is headline. |
| 02 | This is all the text.|
| 03 | Great weather |
-------------------------------
What I'd like to have in the table (I have no idea how to show a carriage return in the example below):
-----------------------------------------------------
| ID | text |
| | |
| 01 | This is headline. |
| 02 | This is all the text. Of great story once |
| 03 | Great weather |
-----------------------------------------------------
The query result is, of course, wrong, as the data for ID 02 wasn't imported completely.
Here is my script:
LOAD DATA
INFILE "file.txt" BADFILE "file.bad" DISCARDFILE "file.dsc"
APPEND
INTO TABLE text_table
FIELDS TERMINATED BY X'7C' TRAILING NULLCOLS
(
employee_id,
exp_pro CHAR(4000)
)
Any ideas?
First make sure the issue isn't with how you're viewing the data (or the IDE used). Sometimes viewers will simply stop at a linefeed (or carriage return, or some binary char).
Try dumping a hex representation of some data first. For example:
with txt as (
select 'This is line 1.' || chr(13) || chr(10) || 'Line 2.' as lines
from dual
)
select dump(txt.lines, 16) from txt;
You should be able to see the 0d0a (crlf) chars, or whatever other "non-printable" chars exists, if any.
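Outside the database, the same sanity check can be sketched in Python. This hypothetical helper is a quick analogue of Oracle's dump(..., 16): it prints each character's hex code so "invisible" CR/LF characters become visible:

```python
def hexdump(s):
    """Show each character of a string as a two-digit hex code,
    exposing non-printable characters such as CR (0d) and LF (0a)."""
    return ' '.join(f'{ord(c):02x}' for c in s)

line = 'Line 1.\r\nLine 2.'
print(hexdump(line))
# The '0d 0a' pair in the output marks the CRLF between the two lines.
```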
