Expecting QUOTED STRING in pig script - hadoop

I have written a script to select from vsql:
LOAD 'sql://{select * from sandesh.insights_voice_day
WHERE Observation_date BETWEEN '2011-11-22' AND '2011-11-23' AND
Type='total'
ORDER BY Observation_date}'
It throws the exception 'Expecting QUOTEDSTRING'. What is the problem?

Pig expects a quoted string after LOAD that names the file (or location) you are loading. Pig is not SQL, so you have to do something like dump your query results into a file first and then:
A = LOAD 'your_file' AS (column1:datatype, column2:datatype);
B = FILTER A BY observation_date >= '2011-11-22' AND observation_date <= '2011-11-23' AND
Type == 'total';
C = ORDER B BY observation_date;
DUMP C;
Now, this will order these as strings. So depending on the version of Pig you're using, you'll need to deal with timestamps with the appropriate function. Something like:
http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/evaluation/datetime/convert/CustomFormatToISO.html
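As a rough illustration of the date-handling concern (in Ruby rather than Pig, and with made-up rows), the sketch below applies the same inclusive range filter and ordering on parsed dates. Dates written as yyyy-MM-dd happen to sort the same way as strings, but other formats do not, which is why parsing first is the safer habit.

```ruby
require "date"

# Hypothetical rows standing in for the dumped query results.
rows = [
  { observation_date: "2011-11-23", type: "total" },
  { observation_date: "2011-11-21", type: "total" },
  { observation_date: "2011-11-22", type: "partial" },
  { observation_date: "2011-11-22", type: "total" },
]

from = Date.iso8601("2011-11-22")
to   = Date.iso8601("2011-11-23")

# Inclusive range check on real dates, mirroring SQL's BETWEEN.
selected = rows.select do |row|
  date = Date.iso8601(row[:observation_date])
  date >= from && date <= to && row[:type] == "total"
end

# Order by the parsed date rather than by the raw string.
ordered = selected.sort_by { |row| Date.iso8601(row[:observation_date]) }
ordered_dates = ordered.map { |row| row[:observation_date] }
puts ordered_dates.inspect

# A day-first format shows why plain string order is unreliable in general.
day_first = ["23-11-2011", "01-12-2011"]
string_order = day_first.sort
date_order = day_first.sort_by { |s| Date.strptime(s, "%d-%m-%Y") }
puts string_order.inspect, date_order.inspect
```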

The problem seems to be the single quotes nested inside the single-quoted LOAD string. The following, written on a single line with double quotes inside the query, seems to compile (pig -c test.pig):
A = LOAD 'sql://{select * from sandesh.insights_voice_day WHERE Observation_date BETWEEN "2011-11-22" AND "2011-11-23" AND Type="total" ORDER BY Observation_date}';

Related

Is there a way to turn unix commands as map reduce activities?

I'm trying to get the output of UNIX commands in a Hive query.
For example the following query doesn't work:
select transform ('')
using 'pwd'
as syspath
But this query works:
select transform ('')
using 'hive -e "select 10 as col1"'
as col1
How do I run UNIX commands or bash scripts as a MapReduce job so that their output is available in Hive?
Thanks in advance!
count the number of characters from a file
Why would you ever use Hive for that? Spark is so much more flexible.
val charCount = spark.sparkContext.textFile("path/to/file.txt")
.flatMap(line => line.toList)
.map(char => (char, 1)) // This is literally just wordcount, now
.reduceByKey(_ + _)
.map { case (char, count) => count }
.sum() // something like this ... (sum() on an RDD returns a Double)
println(charCount.toLong)

How does the REPLACE function in Pig work?

My input file is words.txt, shown below. There are no spaces in any record of this file.
Hi
Hi
How
I am loading this file into Pig
words = LOAD '/user/inputs/words.txt' USING PigStorage() AS (line:chararray);
words_each = FOREACH words GENERATE REPLACE(line,'','|') ;
dump words_each;
I am getting output as
|H|i|
|H|i|
|H|o|w|
But I would like to know exactly how the REPLACE function treats '', which is my second argument.
There are no spaces in my file, so how come I am getting | in my output?
As your statement is written, REPLACE is called with '' (an empty string) as the pattern, which contains no whitespace at all.
If you want to replace spaces, you need to pass ' ' instead.
The two cases are different:
words_each = FOREACH words GENERATE REPLACE(line,'','|') ; -- without space
words_each = FOREACH words GENERATE REPLACE(line,' ','|') ; -- with space
REPLACE treats its second argument as a regular expression, and an empty pattern matches at every position in the string: before the first character, between each pair of characters, and after the last one. That is why the first statement puts a pipe symbol (|) around every character. The second statement has no effect, because there are no spaces in your file.
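The same effect can be reproduced outside Pig. The sketch below uses Ruby's gsub purely as an illustration, on the assumption that its regular-expression replacement behaves like Pig's REPLACE for these two cases; it is not Pig code.

```ruby
# An empty pattern matches at every position, including both ends of the string.
hi  = "Hi".gsub(//, "|")
how = "How".gsub(//, "|")

# A single-space pattern finds nothing to replace in these words.
unchanged = "Hi".gsub(/ /, "|")

puts hi, how, unchanged
```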

Split characters inside Pig field

I have a text input with '|' separator as
0.0000|25000| |BM|BM901002500109999998|SZ
which I split using PigStorage
A = LOAD '/user/hue/data.txt' using PigStorage('|');
Now I need to split the field BM901002500109999998 into separate fields based on character position, for example positions 0-2 = BM as Field1, and likewise for the rest.
After this step I should get BM, 90100, 2500, 10, 9999998.
Is there any way to achieve this in a Pig script? Otherwise I plan to write a UDF that puts a separator at the required positions.
Thanks.
You are looking for SUBSTRING:
A = LOAD '/user/hue/data.txt' using PigStorage('|');
B = FOREACH A GENERATE
SUBSTRING($4,0,2) AS FIELD_1,
SUBSTRING($4,2,7) AS FIELD_2,
SUBSTRING($4,7,11) AS FIELD_3,
SUBSTRING($4,11,13) AS FIELD_4,
SUBSTRING($4,13,20) AS FIELD_5;
The output would be:
dump B;
(BM,90100,2500,10,9999998)
You can find more info about this function here.
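To check the offsets, here is a small Ruby sketch (an illustration only, not Pig) that cuts the same field with half-open ranges, where the start index is included and the stop index is excluded, as in the SUBSTRING calls above. The `bounds` list is simply the index pairs from that statement.

```ruby
field = "BM901002500109999998"

# [start, stop) pairs copied from the SUBSTRING calls.
bounds = [[0, 2], [2, 7], [7, 11], [11, 13], [13, 20]]

# start...stop is Ruby's half-open range: stop itself is excluded.
parts = bounds.map { |start, stop| field[start...stop] }

puts parts.inspect
```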
I think that it will be much more efficient to use the built in UDF REGEX_EXTRACT_ALL.
You can get some idea of how to use this UDF from:
http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#REGEX_EXTRACT_ALL
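The regular-expression approach amounts to one pattern with a capture group per fixed-width field. The Ruby sketch below shows that idea only; the pattern is an assumption derived from the field widths in the question (2, 5, 4, 2 and 7 characters), and the Pig call itself is not shown.

```ruby
field = "BM901002500109999998"

# One capture group per fixed-width field; \A and \z anchor the whole string.
pattern = /\A(.{2})(.{5})(.{4})(.{2})(.{7})\z/

match = pattern.match(field)
groups = match ? match.captures : []

puts groups.inspect
```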
STRSPLIT and REGEX_EXTRACT_ALL in PigLatin

PigStorage and Variable Schemas from Input

I have a comma separated text file like
1,abc,1,
2,def,1,2,3,4
3,ghi,1,2
4,jkl,1,5,6,7,8,9
5,mno
The text file will always have the first two values, but will have 0 or more values after the second comma.
How can I load this data and give an alias to the first two values?
I can load it and not give an alias to the first two values via:
A = LOAD 'data.txt' USING PigStorage(',');
From here, I can do a B = FOREACH A GENERATE $0 AS foo:chararray, $1 AS bar:chararray; but it would discard the rest. It would be nice to do a wildcard and put the rest in a tuple.
Is there any way to do this?
Try this
B = foreach A generate $0 as foo:chararray, $1 as bar:chararray, $2..;
reference
Drop single column in Pig
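The shape of the result (two named values plus "everything else") can be pictured with a short Ruby sketch. This is only an analogy for what `$2..` selects, not Pig code, and the `parse_line` helper is hypothetical.

```ruby
# Hypothetical helper: name the first two values, keep the rest as a list.
def parse_line(line)
  # The -1 limit keeps trailing empty fields, e.g. the one after "1,abc,1,".
  foo, bar, *rest = line.split(",", -1)
  { foo: foo, bar: bar, rest: rest }
end

lines = ["1,abc,1,", "2,def,1,2,3,4", "5,mno"]
parsed = lines.map { |line| parse_line(line) }

parsed.each { |row| puts row.inspect }
```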
I am not sure what you need.
Try this one
A = LOAD 'data.txt' USING PigStorage(',') AS (foo:chararray, bar:chararray);
This will ignore the other values after the second comma in your file.
Or you can create a map for the remaining fields.

easier way to eval a string from file?

I would like to store a MySQL query in a file. I plan to replace parts of the query string with variables from my program.
I played around with the 'eval' method in ruby, and it works, but it feels a little clumsy.
Using irb I did the following.
>> val = 7
=> 7
>> myQuery = "select * from t where t.val = \#{val}" #escaped hash simulates reading it from file
=> "select * from t where t.val = \#{val}"
>> myQuery = eval "\"#{myQuery}\""
=> "select * from t where t.val = 7"
As you can see it works! But to make it work I had to wrap the 'myQuery' variable in escaped quotes, and the whole thing looks a little messy.
Is there an easier way?
Generally, you should not use string interpolation to build SQL queries. Doing so will leave you open to SQL injection attacks, in which someone supplies input that has a closing quote character, followed by another query. For instance, using your example:
>> val = '7; DROP TABLE users;'
=> "7; DROP TABLE users;"
>> myQuery = "select * from t where t.val = \#{val}"
=> "select * from t where t.val = \#{val}"
>> eval "\"#{myQuery}\""
=> "select * from t where t.val = 7; DROP TABLE users;"
Even without malicious input, you could simply accidentally execute code that you weren't intending to, if for instance someone included quote marks in their input.
It is also generally a good idea to avoid using eval unless absolutely necessary; it makes it possible that if you have a bug in your program, someone could execute arbitrary code by getting it passed to eval, and it makes code less maintainable since some of your source code will be loaded from places other than your regular source tree.
So, how do you do this instead? Database APIs generally include a prepare command, which can prepare to execute an SQL statement. Within that statement, you can include ? characters, which represent parameters that can be substituted within that statement. You can then call execute on the statement, passing in values for those parameters, and they will be executed safely, with no way for someone to get an arbitrary piece of SQL executed.
Here's how it would work in your example. This is assuming you are using this MySQL/Ruby module; if you are using a different one, it will probably have a similar interface, though it may not be exactly the same.
>> val = 7
>> db = Mysql.new(hostname, username, password, databasename)
>> query = db.prepare("select * from t where t.val = ?")
>> query.execute(val)
You can use ERB templates instead - read them from the files and interpolate the variables (convert <%= something %> tags into the actual values).
Here's the official doc, it's quite complete and straightforward.
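A minimal sketch of the ERB approach, assuming the template text has already been read from the file (here it is an inline string) and that the only variable is an integer `val`. The `render_query` helper is hypothetical. ERB still substitutes text into the SQL, so the injection caveat from the other answer applies; the `Integer(...)` call is one way to reject non-numeric input first.

```ruby
require "erb"

# Stand-in for the contents of the query file.
template_text = "select * from t where t.val = <%= val %>"

# Hypothetical helper: renders the template with a validated integer.
def render_query(template_text, val)
  val = Integer(val) # raises ArgumentError or TypeError for non-numeric input
  ERB.new(template_text).result(binding)
end

query = render_query(template_text, 7)
puts query
```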
You can use printf-like syntax for string replacement:
"123 %s 456" % 23 # => "123 23 456"
This only works if your program knows in advance which variables to use.
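Applied to the query from the question, a format string with a named reference might look like the sketch below. The file contents are stood in for by an inline string, and the `%<val>d` placeholder is an assumption about how the stored query would be written. A numeric directive such as `%d` also refuses input that is not a number, which plain `%s` would not.

```ruby
# Stand-in for the query text read from a file.
template_text = "select * from t where t.val = %<val>d"

query = format(template_text, val: 7)
puts query

# %d raises for input that is not a valid integer.
rejected = begin
  format(template_text, val: "7; DROP TABLE users;")
  false
rescue ArgumentError
  true
end
puts rejected
```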
Could you use parametrized queries?
I don't know offhand how to do so in Ruby, but basically it involves marking your SQL statement with placeholders that the database recognizes and replaces with parameters sent alongside the statement.
This link might help: http://sqlite-ruby.rubyforge.org/sqlite3/faq.html#538670816

Resources