I have logs that resemble the following:
value1 value2 "value 3 with spaces" value4
using:
"formats": {
"csv": {
"type": "text",
"delimiter": " "
}
}
for the storage plugin delimiting by " " gives me the following columns:
columns[0] | columns[1] | columns[2] | columns[3] | columns[4] | columns[5] | columns[6]
value1 | value2 | value | 3 | with | spaces | value4
what I'd like is:
columns[0] | columns[1] | columns[2] | columns[3]
value1 | value2 | value 3 with spaces | value4
To my knowledge, there is no way to skip delimiters in Drill. However, if the third value is the only one that can contain spaces, a workaround I can think of is:
structure your first query so that columns[3] is always the last, e.g.
select columns[0], columns[1], columns[2], columns[4], columns[3] from dfs.default./path/to/your/file;
use the CONCAT() function to build your variable in a separate column (see the sketch below).
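For illustration, a minimal sketch of that concatenation step, assuming the quoted value always splits into the same number of tokens (here columns[2] through columns[5]); the path and aliases are placeholders:
select columns[0] as value1,
       columns[1] as value2,
       -- rebuild the quoted value from its space-separated pieces
       concat(columns[2], ' ', columns[3], ' ', columns[4], ' ', columns[5]) as value3,
       columns[6] as value4
from dfs.default./path/to/your/file;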
Another way around it would be to change the default delimiter in the file prior to having Drill read it. Depending on where you are ingesting your data from, this may or may not be feasible.
Good luck, and if you are looking for more on Drill, be sure to check out MapR's Community page on Drill, which has code examples that might be helpful: https://community.mapr.com/community/products/apache-drill
I'm completely lost here:
I have a table that looks like this, but it has a variable number of value columns:
+------------+------------+-----------+-----------+
| name1 | value1 | value2 | value3 |
+------------+------------+-----------+-----------+
| name1 | value1 | | value3 |
+------------+------------+-----------+-----------+
| name1 | | value2 | value3 |
+------------+------------+-----------+-----------+
What I need is a table looking like this:
+------------+------------+-----------+-----------+
| name1 | value1 | value2 | value3 |
+------------+------------+-----------+-----------+
| name1 | value1 | value3 | |
+------------+------------+-----------+-----------+
| name1 | value2 | value3 | |
+------------+------------+-----------+-----------+
What I came up with for now is this formula, which only works for the first row of data. Named range is my source table range.
=MTRANS(QUERY(MTRANS({Named Range});"select * where Col1 is not null"))
I cannot just add all the columns to it, as I don't know how many there will be. What secret sauce do I need to add to solve this?
Thank you very much for your help!
@Andii This seems to do what you want:
=ArrayFormula(split(transpose(query(transpose(A5:D7),,9^99))," ",1,1))
I have a sample sheet here:
https://docs.google.com/spreadsheets/d/1Em1V9o5aeAtq0Fo_Yb39xXAZRmIZhwSyAHawa-ExilA/edit?usp=sharing
Let us know if this answers your question.
Input data is:
+----------------------+--------------------------------+
| movie_name | Genres |
+----------------------+--------------------------------+
| digimon | Adventure|Animation|Children's |
| Slumber_Party_Massac | Horror |
+----------------------+--------------------------------+
I need output like:
+----------------------+--------------------------------+-----------------+
| movie_name | Genres | count_of_genres |
+----------------------+--------------------------------+-----------------+
| digimon | Adventure|Animation|Children's | 3 |
| Slumber_Party_Massac | Horror | 1 |
+----------------------+--------------------------------+-----------------+
select *
,size(split(coalesce(Genres,''),'[^|\\s]+'))-1 as count_of_genres
from mytable
This solution covers various use cases, including:
NULL values
Empty strings
Empty tokens (e.g. Adventure||Animation or Adventure| |Animation)
This is a really, really bad way to store data. You should have a separate MovieGenres table with one row per movie and per genre.
One method is to use length() and replace():
select t.*,
(1 + length(genres) - length(replace(genres, '|', ''))) as num_genres
from t;
This assumes that each movie has at least one genre. If not, you need to test for that as well.
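If empty or NULL genre strings are possible, a hedged variant of the same idea (same table and column names as above):
select t.*,
       -- count separators only when a non-empty genre string is present
       (case when genres is null or genres = ''
             then 0
             else 1 + length(genres) - length(replace(genres, '|', ''))
        end) as num_genres
from t;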
I have the following table, which contains data for millions of documents from a JSON file:
+-------+---------------------------------------+------------+
| doc_id| doc_text | doc_lang |
+-------+---------------------------------------+------------+
| doc1 | "first /resource X 'title' " | en |
| doc2 | "<r>ressource 2 #titre en France" | Fr |
| doc3 | "die Tür geöffnet?" | ge |
| doc4 | "$risorsa 4 <in> lingua italiana" | It |
| ... | " ........." | .. |
| ... | "........." | .. |
+-------+---------------------------------------+------------+
I need to do the following:
Tokenizing, filtering, and stopword removal for each document's text using an appropriate analyzer chosen dynamically according to the language shown in the doc_lang field (let's say European languages).
Getting TF and IDF for each term inside the doc_text field (no search operations are required; this is just for scoring).
Q) Could anybody advise me whether Elasticsearch is a good choice in this case?
P.S. I am looking for something compatible with Apache Spark.
Include the language code in the doc_text field name when indexing, like
{ "doc_id": "doc", "doc_text_en": "xxx", "doc_lang": "en"}
Then you will be able to specify a dynamic mapping that applies a language-specific analyzer.
https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-dynamic-mapping.html
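For illustration only, a minimal dynamic-template sketch in the spirit of that guide (index name, field suffixes, and analyzers are assumptions; the exact syntax depends on your Elasticsearch version):
PUT /docs
{
  "mappings": {
    "dynamic_templates": [
      { "text_en": {
          "match": "*_en",
          "mapping": { "type": "text", "analyzer": "english" }
      }},
      { "text_fr": {
          "match": "*_fr",
          "mapping": { "type": "text", "analyzer": "french" }
      }}
    ]
  }
}
With something like this in place, a field named doc_text_en would pick up the English analyzer automatically the first time it is indexed.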
I have put the data from a file into an array, and then kept only the data I want from that array, which looks as follows:
Basically, what I want is to access each column independently. As the file will keep changing, I don't want something hard-coded; I would have done it already :).
Element0: | data | address | type | source | disable |
Element1: | 0x000001 | 0x123456 | in | D | yes |
Element2: | 0x0d0f00 | 0xffffff | out | M | yes |
Element3: | 0xe00ab4 | 0xaefbd1 | in | E | no |
I have tried the regexp /\|\s+.*\s+\|/ but it prints just a few lines (it removes the data I care about). I also tried /\|.*\|/ and it prints everything empty.
I have googled the split method and I know this is happening because the .* removes the data I care about. I have also tried the regexp \|\s*\|, but it prints the whole line. I have tried many regexps, but at this moment I can't think of a way to solve this.
Any recommendation?
line_ary = ary_element.split(/\|\s.*\|/)
puts line_ary unless line_ary.nil?
You should use the CSV class instead of trying to parse it with a regex. Something like this will do:
require 'csv'
data = CSV.read('data.csv', col_sep: '|')
You can access rows and columns as a two-dimensional array, e.g. to access row 2, column 4: data[1][3].
If for example you just wanted to print the address column for all rows you could do this instead:
CSV.foreach('data.csv', col_sep: '|') do |row|
puts row[2]
end
I'd probably use a CSV parser for this, but if you want to use a regex and you're sure that you'll never have | inside one of the column values, then you want to say:
row = line.split(/\s*\|\s*/)
so that the whitespace on either side of the pipe becomes part of the delimiter. For example:
> 'Element0: | data | address | type | source | disable |'.split(/\s*\|\s*/)
=> ["Element0:", "data", "address", "type", "source", "disable"]
> 'Element1: | 0x000001 | 0x123456 | in | D | yes |'.split(/\s*\|\s*/)
=> ["Element1:", "0x000001", "0x123456", "in", "D", "yes"]
Split together with strip might be the easiest option. Have you tried something like this?
"Element3:...".split(/\|/).collect(&:strip)
We have a simple table like the following:
------------------------------------------------------------------------
| Name | Attribute1 | Attribute2 | Attribute3 | ... | Attribute200 |
------------------------------------------------------------------------
| Name1 | Value1 | Value2 | null | ... | Value3 |
| Name2 | null | Value4 | null | ... | Value5 |
| Name3 | Value6 | null | Value7 | ... | null |
| ... |
------------------------------------------------------------------------
But there could be up to hundreds of millions of rows/names.
The data will be populated every hour or so.
The goal is to get results for interactive queries on the data within a couple of seconds.
Most queries look like:
select count(*) from table
where Attribute1 = Value1 and Attribute3 = Value3 and Attribute113 = Value113;
The where clause contains an arbitrary number of attribute name-value pairs.
I'm new to big data and wondering what the best option is in terms of data store (MySQL, HBase, Cassandra, etc.) and processing engine (Hadoop, Drill, Storm, etc.) for interactive queries like the above.
A columnar DB like Vertica (closed source) or MonetDB (open source, though I haven't used it) will handle queries like the ones you mentioned efficiently. From a 50,000-foot view, the reason is that they store each column separately and thus don't read any unneeded columns when querying data; for your example, 3 attributes will be read and the other 197 won't be.
PlayOrm for Cassandra provides decent support for SQL, including joins. Read more at http://buffalosw.com/wiki/SJQL-Support/ and see http://buffalosw.com/wiki/Command-Line-Tool/ for examples.