I'd like to compute the frequency of each single emoji in a Hive table. To do that, I'm trying to extract single emojis, or split a sentence into words / emojis properly using HiveQL.
For example, I have 'hotel 💵❤️ *' as UTF-8 format in my Hive table, and I want to get the following as results:
'hotel', '💵','❤️', '*'
My progress so far:
if I use the following code, I could separate data by blanks but not splitting the two emojis
result:
'hotel', '💵❤️', '*'
code:
select split(col_name,'[ ]')
FROM the_table;
I know I could use something like regexp_extract to extract desired emojis, but I don't have a comprehensive list of emojis, so still can't leverage that to get what I want.
SELECT regexp_extract('hotel 💵❤️ *', '[🏨💵⚾️]', 0);
Related
I am trying to normalize a column in my table without having an easy way to do it.
I tried to use the NORMALIZE function and other solutions posted here, but nothing works.
I would like to get a text normalized from the table like "C\u00f3mo funciona" => "Cómo funciona".
The problem was that before saving the data, I made a json_encode that encodes all the data before it can be writted. I only use JSON_QUERY and JSON_VALUE functions of BigQuery to recover and decode the special characters.
I need to get part of a blob field which has some json data. one part of the blob is like this CustomData:{HDFC;1;0;sent} . I need separate values after CustomData like I need to get HDFC, 1, 0, sent separately.
This is what I have tried in two separate queries which works:
This gives me index of CustomData within payment_data blob field for example it returns 11000
select dbms_lob.instr(payment_data, utl_raw.cast_to_raw('CustomData'))
from table_x;
I am specifying 3rd parameter as what first query returned + length of test CustomData: to get {HDFC;1;0;sent}
select UTL_RAW.CAST_TO_VARCHAR2(dbms_lob.substr(payment_data,1000,11011))
from table_x;
Problem is I need to take dynamic offset in 2nd query and not run 1st query as individual. Specifying dynamic offset is not working with dbms_lob.substr() function. Any suggestions how can I combine these two queries into one?
Once I get {HDFC;1;0;sent}, I also need to get these delimited values separately, so combining these three into one would even be better if someone can help with it. I can use regexp_substr to get delimited text once I get first two combined.
If you want extract text data from blob first u need convert it to clob using dbms_lob.converttoclob.
If you have Oracle 12c or higher you may use JSON SQL functions, for example, JSON_TABLE.
If your Oracle version between 10 and 11 you may use regex functions or instr + substr if your version less than 10.
This question is related with this question. However, in that question I made some wrong assumptions...
I have a string that contains a SQL query, with or without one or more parameters, where each parameter has a "&" (ampersand sign) as prefix.
Now I want to extract all parameters, load them into a table in excel where the user can enter the values for each parameter.
Then I need to use these values as a replacement for the variables in the SQL query so I can run the query...
The problem I am facing is that extracting (and therefore also replacing) the parameter names is not that straight forward, because the parameters are not always surrounded with spaces (as I assumed in my previous question)
See following examples
Select * from TableA where ID=&id;
Select * from TableA where (ID<&ID1 and ID>=&ID2);
Select * from TableA where ID = &id ;
So, two parts of my question:
How can I extract all parameters
How can I replace all parameters using another table where the replacements are defined (see also my previous question)
A full solution for this would require getting into details of how your data is structured and would potentially be covering a lot of topics. Since you already covered one way to do a mass find/replace (which there are a variety of ways to accomplish in Power Query), I'll just show you my ugly solution to extracting the parameters.
List.Transform
(
List.Select
(
Text.Split([YOUR TEXT HERE], " "), each Text.Contains(_,"&")
),
each List.Accumulate
(
{";",")"}, <--- LIST OF CHARACTERS TO CLEAN
"&" & Text.AfterDelimiter(_, "&"),
(String,Remove) => Text.Replace(String,Remove,"")
)
)
This is sort of convoluted, but here's the best I can explain what is going on.
The first key part is combining List.Select with Text.Split to extract all of the parameters from the string into a list. It's using a " " to separate the words in the list, and then filtering to words containing a "&", which in your second example means the list will contain "(ID<&ID1" and "ID>=&ID2);" at this point.
The second part is using Text.AfterDelimiter to extract the text that occurs after the "&" in our list of parameters, and List.Accumulate to "clean" any unwanted characters that would potentially be hanging on to the parameter. The list of characters you would want to clean has to be manually defined (I just put in ";" and ")" based on the sample data). We also manually re-append a "&" to the parameter, because Text.AfterDelimiter would have removed it.
The result of this is a List object of extracted parameters from any of the sample strings you provided. You can setup a query that takes a table of your SQL strings, applies this code in a custom column where [YOUR TEXT HERE] is the field containing your strings, then expand the lists that result and remove duplicates on them to get a unique list of all the parameters in your SQL strings.
I have an excel file with a table named 'Table1' in it. I have to perform 'Filter Table' activity in UiPath with the condition "column1 begins with '*my column'". But when I specify the value like this, the column is filtered for 'ends with' operation.
Here is the screenshot for my table-
Below is the screenshot for the steps I followed-
This has been answered many times on UiPath Forum
For example https://forum.uipath.com/t/filter-table-in-excel-data-tables/559/3
If you use *my value as the search / filter pattern, then it'd mean, anything in the beginning and must have my value in the end. So, it is being interpreted correctly as Ends With. If you want to have a Begins With filter, you should have your filter text followed by the wildcard, like - my value*.
Further, if you want to include wildcard as a literal in the search pattern, you'd need to escape that by enclosing it in brackets like [*]my value* - this'd search for text beginning with *my value.
MS Excel / VBA also supports Tilde ~ as an escape character in some cases.
In excel filters, '' represents any series of characters.
The issue in the above case is that the filter value in the condition already contains a ''. Because of this, system always reads it as '*My column' => '[any characters]My column'. i.e., value ends with 'My column'.
To resolve this issue, I have specified contains filter instead of Begins with as 'My column'.
I have also tried to escape '*'. But it threw excel exception.
In addition, you can not specify condition as "Column1 Like '*My column%'". This works file when you are adding filter to 'DataTable'(after performing 'ReadRange' activity). But in this case, you will retrieve all the records and then you will be filtering the columns. This will lead to performance issues if the the excel table is huge.
You can follow the syntax below to perform filter activities in an excel:
DataTableName.Select("[ColumnName]='Datawithwhichweneedtofilter’").CopytoDataTable()
let's say I have some sample rows of data
site1^http://article1.com?datacoll=5|4|3|2|1&test=yes
site1^http://article1.com?test=yes
site1^http://article1.com?datacoll=5|4|3|2|1&test=yes
I want to create a table like so
create table clicklogs (sitename string, url string)
ROW format delimited fields terminated by '^';
As you can see I have some data in the url parameter I'd like to extract, namely
datacoll=5|4|3|2|1
I also want to work with those individual elements seperated by pipes so I can do group bys on them to show for example how many urls had a 2nd position of "4" which would be 2 rows in this case. So in this case I have the "url" field that has additional data I'd like to parse out and use in my queries.
The question is, what is the best way to do that in hive?
thanks!
First, use parse_url(string urlString, string partToExtract [, string keyToExtract]) to grab the data in question:
parse_url('http://article1.com?datacoll=5|4|3|2|1&test=yes', 'QUERY', 'datacol1')
This returns '5|4|3|2|1', which gets us halfway there. Now, use split(string str, string pat) to break those out of each sub-delimiter into an array:
split(parse_url(url, 'QUERY', 'datacol1'), '\|')
With the result of this, you should be able to grab the columns that you want.
See the UDF documentation for more built-in functions.
Note: I wasn't able to verify this works in Hive from where I am, sorry if there are some minor issues.
This looks very similar to something I've done a couple weeks ago, I think the best approach in your case would be to apply a pre-processing step (possibly with hadoop streaming), and change the prototype of your table to be:
create table clicklogs(sitename string, datacol Array<int>) row format delimited fields terminated by '^' collection items terminated by '|'
Once you have that you can easily manipulate your data in Hive using lateral views and the builtin explode. The following code should help you get the counts of URLs per col.
select col, count(1) from clicklogs lateral view explode(datacol) dataTable as col group by col