Is there a way to turn unix commands as map reduce activities? - hadoop

I'm trying to get UNIX outputs in hive query.
For example the following query doesn't work:
select transform ('')
using 'pwd'
as syspath
But this query works:
select transform ('')
using 'hive -e "select 10 as col1"'
as col1
How do I enable UNIX commands or bash scripts as a map reduce job to gets its output available in hive?
Thanks in advance!

count the number of characters from a file
Why would you ever use Hive for that? Spark is so much more flexible.
val charCount = spark.read.textLines("path/to/file.txt")
.flatMap(line => line.toList())
.map(char => (char, 1)) // This is literally just wordcount, now
.reduceByKey(_ + _)
.map((char, count) => count)
.sum() // something like this ...
println(charCount.collect()(0))

Related

Split characters inside Pig field

I have a text input with '|' separator as
0.0000|25000| |BM|BM901002500109999998|SZ
which I split using PigStorage
A = LOAD '/user/hue/data.txt' using PigStorage('|');
Now I need to split the field BM901002500109999998 into different fields based on their position , say 0-2 = BM - Field1 and like wise.
So after this step I should get BM, 90100, 2500, 10, 9999998.
Is there any way in Pig script to achieve this, otherwise I plan to write an UDF and put separator on required positions.
Thanks.
You are looking for SUBSTRING:
A = LOAD '/user/hue/data.txt' using PigStorage('|');
B = FOREACH A GENERATE SUBSTRING($4,0,2) AS FIELD_1, SUBSTRING($4,2,7) AS FIELD_2, SUBSTRING($4,7,11) AS FIELD_3, SUBSTRING($4,11,13) AS FIELD_4, SUBSTRING($4,13,20) AS FIELD_5;
The output would be:
dump B;
(BM,90100,2500,10,9999998)
You can find more info about this function here.
I think that it will be much more efficient to use the built in UDF REGEX_EXTRACT_ALL.
You can get some idea of how to use this UDF from:
http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#REGEX_EXTRACT_ALL
STRSPLIT and REGEX_EXTRACT_ALL in PigLatin

Select in ADO (vb6) with a numeric variable

Excuse me, occasionally I refer with some problem that maybe it's already been fixed. In any case, I would appreciate a clarification on vs.
I have a TariffeEstere table with the fields country, Min, Max, tariff
from which to extract the rate for the country concerned, depending on whether the value is between a minimum and a maximum and I should return a single record from which to extract its tariff:
The query is:
stsql = "Select * from QPagEstContanti Where country = ' Spain '
and min <= ImpAss and max >= ImpAss"
Where ImpAss is a variable of type double.
When I do
rstariffa.open ststql,.....
the recodset contains a record if e.g. ImpAss = 160 (i.e. an integer without decimals), and then the query works, but if it contains 21,77 ImpAss (Italian format) does not work anymore and gives me a syntax error.
To verify the contents of the query string (stsql) in fact I find:
Select * from QPagEstContanti Where country = 'Spain' and min < = 21,77 and max > = 21,77
in practice the bothering and would like a comma decimal, but do not know how do.
I tried to pass even a
format (ImpAss, "####0.00"),
but the value you found in a stsql is 21,77 always.
How can I fix the problem??
It sounds like the underlying language setting in SQL is expecting '.' decimals instead of ',' decimal notation.
To check this out - run the DBCC useroptions command and see what the 'language' value is set to. If the language is set to English or another '.' decimal notation - it explains why your SQL string is failing with values of double.
If that's the problem, the simplest way to fix it is to insert the following line after your stsql = statement:
  stsql = REPLACE(stsql, ",", ".")
Another way to fix it would be to change the DEFAULT_LANGUAGE for the login using the ALTER LOGIN command (but this changes the setting permanently)
Another way to fix it would be to add this command to the beginning of your stsql, which should change the language for the duration of the rs.Open:
  "SET LANGUAGE Italian;"

PigStorage and Variable Schemas from Input

I have a comma separated text file like
1,abc,1,
2,def,1,2,3,4
3,ghi,1,2
4,jkl,1,5,6,7,8,9
5,mno
The text file will always have the first two values, but will have 0 or more values after the second comma.
How can I load this data and give an alias to the first two values?
I can load it and not give an alias to the first two values via:
A = LOAD 'data.txt' USING PigStorage(',');
From here, I can do a B = FOREACH A GENERATE $0 AS foo:chararray, $1 AS bar:chararray; but it would discard the rest. It would be nice to do a wildcard and put the rest in a tuple.
Is there anyway to do this?
Try this
B = foreach A generate $0 as foo:chararray, $1 as bar:chararray, $2..;
reference
Drop single column in Pig
I am not sure about what you need.
Try this one
A = LOAD 'data.txt' USING PigStorage(',') AS (foo:chararray, bar:chararray);
This will ignore the other values after the second comma in your file.
Or you can create a Map for reamining fields.

Expecting QUOTED STRING in pig script

I have written a script to select from vsql:
LOAD 'sql://{select * from sandesh.insights_voice_day
WHERE Observation_date BETWEEN '2011-11-22' AND '2011-11-23' AND
Type='total'
ORDER BY Observation_date}'
It is showing exception as '' Expecting QUOTEDSTRING?. What is problem?
Pig expects a quoted string following a load with the name of the file you are loading. Pig is not SQL, so you have to do something like first dump your query into a file and then:
A = LOAD "your_file" as (column1:datatype, column2:datatype);
B = FITER A by observation date > '2011-11-22' AND observation_date < '2011-11-23' AND
Type='total';
C = ORDER B by observation_date;
DUMP C;
Now, this will order these as strings. So depending on the version of Pig you're using, you'll need to deal with timestamps with the appropriate function. Something like:
http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/evaluation/datetime/convert/CustomFormatToISO.html
The problem seems to be use of single quotes multiple times. Following in a single line seems to compile (pig -c test.pig)
A = LOAD 'sql://{select * from sandesh.insights_voice_day WHERE Observation_date BETWEEN "2011-11-22" AND "2011-11-23" AND Type="total" ORDER BY Observation_date}';

easier way to eval a string from file?

I would like to store a mySQL query in a file. I plan on having to string replace parts of it with variables from my program.
I played around with the 'eval' method in ruby, and it works, but it feels a little clumsy.
Using irb I did the following.
>> val = 7
=> 7
>> myQuery = "select * from t where t.val = \#{val}" #escaped hash simulates reading it from file
=> "select * from t where t.val = \#{val}"
>> myQuery = eval "\"#{myQuery}\""
=> "select * from t where t.val = 7"
As you can see it works! But to make it work I had to wrap the 'myQuery' variable in escaped quotes, and the whole thing looks a little messy.
Is there an easier way?
Generally, you should not use string interpolation to build SQL queries. Doing so will leave you open to SQL injection attacks, in which someone supplies input that has a closing quote character, followed by another query. For instance, using your example:
>> val = '7; DROP TABLE users;'
=> "7; DROP TABLE users;"
>> myQuery = "select * from t where t.val = \#{val}"
=> "select * from t where t.val = \#{val}"
>> eval "\"#{myQuery}\""
=> "select * from t where t.val = 7; DROP TABLE users;"
Even without malicious input, you could simply accidentally execute code that you weren't intending to, if for instance someone included quote marks in their input.
It is also generally a good idea to avoid using eval unless absolutely necessary; it makes it possible that if you have a bug in your program, someone could execute arbitrary code by getting it passed to eval, and it makes code less maintainable since some of your source code will be loaded from places other than your regular source tree.
So, how do you do this instead? Database APIs generally include a prepare command, which can prepare to execute an SQL statement. Within that statement, you can include ? characters, which represent parameters that can be substituted within that statement. You can then call execute on the statement, passing in values for those parameters, and they will be executed safely, with no way for someone to get an arbitrary piece of SQL executed.
Here's how it would work in your example. This is assuming you are using this MySQL/Ruby module; if you are using a different one, it will probably have a similar interface, though it may not be exactly the same.
>> val = 7
>> db = Mysql.new(hostname, username, password, databasename)
>> query = db.prepare("select * from t where t.val = ?")
>> query.execute(val)
You can use ERB templates instead - read them from the files and interpolate the variables (convert <%= something %> tags into the actual values).
Here's the official doc, it's quite complete and straightforward.
You can use printf like syntax for string replacement
"123 %s 456" % 23 # => "123 23 456"
This only works if your program knows in advance which variables to use.
Could you use parametrized queries?
I don't know off hand how to do so in ruby, but basically it involves marking your SQL statement with commands that SQL recognizes are replaces with parameters that are sent in addition to your statement.
This link might help: http://sqlite-ruby.rubyforge.org/sqlite3/faq.html#538670816

Resources