Hive - Create External Table with text files excluding line terminator - hadoop

I want to create an external table over a set of text files stored in HDFS. Each text file should become one row. An example of one text file is below, and there can be multiple such files.
thanking
you
for
the
participation
Lines are terminated by \n. I want to create an external table over the above text files, where the content of each text file ends up in one row (one cell).
I tried the following CREATE TABLE statement:
CREATE EXTERNAL TABLE IF NOT EXISTS sample_email (
  email STRING
)
STORED AS TEXTFILE
LOCATION '/tmp/txt/sample/';
It creates the table contents as follows:
+--------------------------------------+
| email                                |
+--------------------------------------+
| thanking                             |
| you                                  |
| for                                  |
| the                                  |
| participation                        |
| please                               |
| find                                 |
| the                                  |
| discussed                            |
| points                               |
+--------------------------------------+
But I want it as follows:
+--------------------------------------+
| email                                |
+--------------------------------------+
| thanking you for the participation   |
| please find the discussed points     |
+--------------------------------------+
How can I overcome this issue?
Thanks in advance.

One option is to aggregate the rows back into one line per file, grouping by Hive's virtual column INPUT__FILE__NAME:
select concat_ws(' ', collect_list(email)) as emails
from sample_email
group by input__file__name;
+------------------------------------+
| emails                             |
+------------------------------------+
| thanking you for the participation |
| please find the discussed points   |
+------------------------------------+
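For reference, INPUT__FILE__NAME is a virtual column Hive exposes on every table, holding the HDFS path of the file each row came from. A quick way to see the grouping (a sketch against the same sample_email table) is to select it directly:
-- show which source file each row came from (INPUT__FILE__NAME is a Hive virtual column)
select input__file__name, email
from sample_email;
Note that collect_list does not guarantee row order, so for multi-line files you may need to impose an explicit ordering if word order matters.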

Use tr to replace each \n with a space in the files before loading them:
hadoop fs -cat file.txt | tr '\n' ' ' | hadoop fs -put - new_file.txt

Alternatively, change the record delimiter to a character that does not occur in the data, so each whole file is read as a single record, then replace the embedded newlines with spaces:
set textinputformat.record.delimiter='\0';

select translate(email, '\n', ' ') as emails
from sample_email;
+------------------------------------+
| emails                             |
+------------------------------------+
| thanking you for the participation |
| please find the discussed points   |
+------------------------------------+
Unfortunately, I still don't know how to set textinputformat.record.delimiter back to newline within the same session.
How can I reset textinputformat.record.delimiter to its default value within the Hive CLI / Beeline?
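One possibility worth trying (my own suggestion, not verified in this thread): Hive's RESET statement restores session configuration to its defaults. Note that it clears every override you have set in the session, not just this one property:
-- resets ALL session-level configuration overrides, including textinputformat.record.delimiter
reset;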

Related

PowerAutomate - replace nth occurrence of character

I'm attempting to parse an email body into an Excel file.
After some manipulation, my current output is an array where each line is data related to a product:
[
  "Periods: 01.01.2023 - 01.02.2023 | Code: 111 | Code2: 1111 | product-name",
  "Periods: 01.01.2023 - 01.02.2023 | Code: 222 | Code2: 2222 | product-name2"
]
I need to replace the 3rd occurrence of " | " with " | Product: ", so I can get a Product field before the product name.
I've tried Apply to each -> current item -> various ways to find the 3rd occurrence and replace it, but I can't succeed.
Any suggestions?
You should be able to loop through each item and perform a simple replace expression like this ...
replace(item(), split(item(), ' | ')[3], concat('Product: ', split(item(), ' | ')[3]))
split(item(), ' | ')[3] is the fourth element, i.e. the product name, so replacing it with itself prefixed by Product: effectively inserts the label after the third separator (assuming the product name doesn't also appear earlier in the line). That should get you across the line. Of course, I'm basing my answer off the limited information you provided.
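For the first item in the sample array, the expression should produce (illustrative, based on the data above):
Periods: 01.01.2023 - 01.02.2023 | Code: 111 | Code2: 1111 | Product: product-name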

How to count the number of words in each column delimited by the "|" separator using Hive?

input data is
+----------------------+--------------------------------+
| movie_name           | Genres                         |
+----------------------+--------------------------------+
| digimon              | Adventure|Animation|Children's |
| Slumber_Party_Massac | Horror                         |
+----------------------+--------------------------------+
I need output like this:
+----------------------+--------------------------------+-----------------+
| movie_name           | Genres                         | count_of_genres |
+----------------------+--------------------------------+-----------------+
| digimon              | Adventure|Animation|Children's | 3               |
| Slumber_Party_Massac | Horror                         | 1               |
+----------------------+--------------------------------+-----------------+
select *
     , size(split(coalesce(Genres, ''), '[^|\\s]+')) - 1 as count_of_genres
from mytable;
This splits the string on runs of non-delimiter characters, so the resulting array always has one more element than there are genre tokens; subtracting 1 gives the count. This solution covers varying use-cases, including:
NULL values
Empty strings
Empty tokens (e.g. Adventure||Animation or Adventure| |Animation)
This is a really, really bad way to store data. You should have a separate MovieGenres table with one row per movie and per genre.
One method is to use length() and replace():
select t.*,
(1 + length(genres) - length(replace(genres, '|', ''))) as num_genres
from t;
This assumes that each movie has at least one genre. If not, you need to test for that as well.
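A minimal sketch of that test, assuming the same mytable as above and treating an empty string like a missing genre list:
select t.*,
       case
         when genres is null or genres = '' then 0  -- no genres at all
         else 1 + length(genres) - length(replace(genres, '|', ''))
       end as count_of_genres
from mytable t;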

How to add value description with API Blueprint?

Is there any way to add a description to the possible values of a URI parameter?
## Search Items [/items{?s}]
### Get items [GET]
+ Parameters
    + s (optional, values) ... Sort results by
        + Values
            + `1 - price`
            + `4 - date`
If I use the approach given above, then I cannot define example and default values (e.g., 4), since it expects the full value (4 - date).
No, there is currently no way to add a description to the possible values of URI parameters.
Neither
+ Values
    + `A - means something`
    + `B`
    + `C`
nor
+ Values
    + `A` means something
    + `B`
    + `C`
will work correctly. I filed a feature request in API Blueprint's repository. If you want to be part of the design process and help us get to the best solution for your problem, you can track it and comment under it.
Using tables
When in trouble with API Blueprint, you can always use plain old Markdown in the endpoint's description to supplement or substitute what's missing. E.g. you can freely use tables as an addition or replacement for the Values section:
# My API

## Sample [/endpoint{?id}]
Description.

| Value | Meaning        |
|-------|:--------------:|
| A     | Alaska         |
| B     | Bali           |
| C     | Czech Republic |

+ Parameters
    + id (string)

        Description...

        | Value | Meaning        |
        |-------|:--------------:|
        | A     | Alaska         |
        | B     | Bali           |
        | C     | Czech Republic |

        Description...

        + Values
            + `A`
            + `B`
            + `C`

Select only the rows whose timestamp corresponds to the current month

I am starting to experiment with using Google Spreadsheets as a DB; for that, I am collecting data from different sources and inserting it into a sheet via the Spreadsheets API.
Each row has a value (Column B) and a timestamp (Column A).
+---------------------+------+
| ColA                | ColB |
+---------------------+------+
| 13/10/2012 00:19:01 |   42 |
| 19/10/2012 00:29:01 |  100 |
| 21/10/2012 00:39:01 |   23 |
| 22/10/2012 00:29:01 |    1 |
| 23/10/2012 00:19:01 |   24 |
| 24/10/2012 00:19:01 |    4 |
| 31/10/2012 00:19:01 |    2 |
+---------------------+------+
What I am trying to do is to programmatically put, into a different cell, the sum of all rows in Column B where the Column A timestamp falls in the current month.
Is there any function that I can use for that? Or can anyone point me in the right direction on how to create a custom function that might do something like this? I know how to do it in MySQL, but I couldn't find anything for Google Spreadsheets.
Thanks in advance for any tip in the right direction.
Would native spreadsheet functions do? This one formats each timestamp in Column A as month/year text, compares it with the current month, and sums the matching values in Column B:
=ArrayFormula(SUMIF(TEXT(A:A;"MM/yyyy");TEXT(GoogleClock();"MM/yyyy");B:B))
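For comparison, since you mention knowing the MySQL way, the equivalent query there would look something like this (hypothetical table t, with colA as a DATETIME and colB numeric):
-- sum colB for rows whose colA falls in the current month
select sum(colB)
from t
where date_format(colA, '%Y-%m') = date_format(now(), '%Y-%m');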

Oracle SQL Loader split data into different tables

I have a data file that looks like this:
1 2 3 4 5 6
FirstName1 | LastName1 | 4224423 | Address1 | PhoneNumber1 | 1/1/1980
FirstName2 | LastName2 | 4008933 | Address1 | PhoneNumber1 | 1/1/1980
FirstName3 | LastName3 | 2344327 | Address1 | PhoneNumber1 | 1/1/1980
FirstName4 | LastName4 | 5998943 | Address1 | PhoneNumber1 | 1/1/1980
FirstName5 | LastName5 | 9854531 | Address1 | PhoneNumber1 | 1/1/1980
My DB has 2 tables, one for PERSON and one for ADDRESS, so I need to store columns 1, 2, 3 and 6 in PERSON and columns 4 and 5 in ADDRESS. All examples provided in the SQL*Loader documentation address this case, but only for fixed-size columns, and my data file is pipe-delimited (and splitting it into 2 different data files is not an option).
Does someone know how to do this?
As always, help will be deeply appreciated.
Another option may be to set up the file as an external table and then run inserts selecting the columns you want from the external table.
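A minimal sketch of that route (the column names and the data_dir directory object are illustrative, not from the question; assumes the directory object points at the folder holding the file):
-- external table over the pipe-delimited file, skipping the header row
create table person_ext (
  first_name varchar2(50),
  last_name  varchar2(50),
  id_no      varchar2(20),
  address    varchar2(200),
  phone      varchar2(30),
  birth_date varchar2(10)
)
organization external (
  type oracle_loader
  default directory data_dir
  access parameters (
    records delimited by newline
    skip 1
    fields terminated by '|' optionally enclosed by '"'
  )
  location ('1.dat')
);

-- then pick the columns each target table needs
insert into person (first_name, last_name, id_no, birth_date)
select trim(first_name), trim(last_name), trim(id_no),
       to_date(trim(birth_date), 'MM/DD/YYYY')
from person_ext;

insert into address (address, phone)
select trim(address), trim(phone)
from person_ext;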
options (skip=1)
load data
infile 'data file path'
insert
into table person
fields terminated by '|' optionally enclosed by '"'
trailing nullcols
(first_name, last_name, id_no, address filler, phone filler, birth_date)
into table address
fields terminated by '|' optionally enclosed by '"'
trailing nullcols
(first_name filler position(1), last_name filler, id_no filler, address, phone)
The column names are illustrative. FILLER fields are parsed but not loaded, which is how the unwanted columns are skipped per table, and POSITION(1) on the second INTO TABLE resets field scanning to the start of the record; with delimited fields, SQL*Loader would otherwise continue from where the first field list stopped.
Even if SQL*Loader doesn't support this (I'm not sure), nothing stops you from pre-processing the file with, say, awk and then loading. For example, keeping the output pipe-delimited:
awk -F'|' -v OFS='|' '{print $1, $2, $3, $6}' 1.dat > person.dat
awk -F'|' -v OFS='|' '{print $4, $5}' 1.dat > address.dat
