Python UDF in Hive with a JSON file - hadoop

I have two problems with the Hive view in Ambari.
Problem 1: I have this script:
DELETE FILE /user/admin/hive/scripts/MKT_UDF/fb_audit_ads_creatives.py;
ADD FILE /user/admin/hive/scripts/MKT_UDF/fb_audit_ads_creatives.py;
SELECT TRANSFORM (line) USING 'python fb_audit_ads_creatives.py' as (ad_id) FROM stg_fb_audit_ads_creatives_json where date_time='2018-05-05';
I have run it many times and it usually runs smoothly, but sometimes I get an error.
You can see it in the error log here.
I think it is due to configuration (Hive, Ambari, a timeout, etc.).
Problem 2: I have a JSON file:
{
"body": "https://www.facebook.com/groupkiemhieptinhduyen #truongsinhquyet #sapramat #tsq #game3d",
"thumbnail_url": "https://external.xx.fbcdn.net/safe_image.php?d=AQDU01asRxdnCObW&w=64&h=64&url=https%3A%2F%2Fscontent.xx.fbcdn.net%2Fv%2Ft15.0-10%",
"campaign_id": "23842841688740666"
}
I use the HQL script above with this Python UDF:
import sys
import json

for line in sys.stdin:
    data = json.loads(line)
    print(data)
    print(data['thumbnail_url'])
It runs okay.
But with this Python UDF:
import sys
import json

for line in sys.stdin:
    data = json.loads(line)
    print(data)
    print(data['body'])
I get an error: see the error log.
Can you help me?

Instead of working with Python, I recommend trying this UDTF, which allows working on JSON columns within Hive. It is then possible to manipulate large JSON documents and fetch the needed data in a distributed and optimized way.
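If you do stay with the Python TRANSFORM approach, a more defensive version of the UDF from the question might look like the sketch below. The field names come from the sample JSON; skipping malformed lines and emitting tab-separated columns are assumptions, not part of the original script.

import sys
import json

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        data = json.loads(line)
    except ValueError:
        # Assumption: skip lines that are not valid JSON instead of
        # letting the whole TRANSFORM job fail on one bad record.
        continue
    # Missing keys become empty strings rather than raising KeyError.
    campaign_id = data.get('campaign_id', '')
    body = data.get('body', '')
    thumbnail_url = data.get('thumbnail_url', '')
    # Hive expects one output row per line, columns separated by tabs.
    print('\t'.join([campaign_id, body, thumbnail_url]))

The SELECT TRANSFORM ... AS (...) column list would then need to name three columns instead of just ad_id.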

Related

WindowsError when calling sc.parallelize()

I want to use the sc.parallelize() function, but whenever I try to call it, I get the below error:
File "V:/PyCharmProjects/sample.py", line 9, in <module>
input_data = sc.parallelize(sc.textFile("C:\Users\Spider\Desktop\GM_coding\Sample Data.csv"))
File "V:\spark-2.2.0-bin-hadoop2.7\python\pyspark\context.py", line 497, in parallelize os.unlink(tempFile.name)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: u'C:\\Users\\Spider\\AppData\\Local\\Temp\\spark-fef6debd-ff91-4fb6-85dc-8c3a1da9690a\\pyspark-6ed523e7-358f-4e3c-ad83-a479fb8ecc52\\tmpxffhfi'
Not sure if it's relevant to your error (and cannot test it in Windows), but you are trying to parallelize something that is already an RDD (i.e. "parallelized"); from the docs:
textFile(name, minPartitions=None, use_unicode=True)
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an
RDD of Strings.
You don't need (and shouldn't use) sc.parallelize() here; the output of sc.textFile is already an RDD. You should simply go for
input_data = sc.textFile("C:\Users\Spider\Desktop\GM_coding\Sample Data.csv")
See also the example in the quick start guide.
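As a minimal sketch (using the same SparkContext sc and CSV path from the question), the RDD returned by textFile can be used directly; a raw string is used here only to avoid backslash-escape surprises in the Windows path:

input_data = sc.textFile(r"C:\Users\Spider\Desktop\GM_coding\Sample Data.csv")
print(input_data.count())   # number of lines in the file
print(input_data.take(3))   # peek at the first few records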

Problems reading large JSON file in Ruby

I have problems reading a large JSON file (2.9GB) in Ruby. I am using this code
json_file = File.read(filename)
results = JSON.parse(json_file)
and when I try to read the file I get the error:
Errno::EINVAL: Invalid argument - <filename>
I have tested the same code with smaller files and it works fine. To verify that the file is written correctly I have tried to read it with python and it works.
Is there a limitation on the size of the file for JSON.parse? If so, could you recommend an alternative?
I have looked into msgpack to reduce the size of the files, but unfortunately I am constrained by the fact that I cannot install gems.
This is a limitation of IO.read.
You may split your file into smaller parts (for example, 1 gigabyte) and read them separately:
dirname = File.dirname(filename)
`split -b 1024m #{filename} #{filename}.parts.`
Dir.chdir(dirname)
parts = Dir["#{filename}.parts.*"]
json = ''
parts.each do |partname|
json += File.read(partname)
File.delete(partname)
end
results = JSON.parse(json)
Be patient, this could take a while.

Error in uploading Data bag from Ruby

I have written code in Ruby which converts my .xls to JSON, partly through the spreadsheet gem and partly through my own code. I need to upload this JSON to my data bag on the Chef server. I am using knife commands from the Ruby code and running it. The JSON files upload correctly to my local chef-repo, but when transferring them to the Chef server I get this error:
ERROR: Chef::Exceptions::ValidationFailed: Data Bag Items must contain a Hash or Mash!
I have validated the JSON, the id matches the file name, and I have tried using [] brackets at the start and end, but it doesn't work. This is the start of my JSON:
{
"id": "default_1",
"Sr.No" : "1", ....}
The essentials of my Ruby code look like this:
require 'spreadsheet'
book = Spreadsheet.open('BI.xls')
sheet1 = book.worksheet('Sheet1')
.
.
.
cmd1 = "cd #{current_dir}/chef-repo"
cmd2 = "knife data_bag create TestDB" #tried knife data bag too
cmd4 = "knife data_bag from file TestDB default_1.json" #tried knife data bag too
upload = %x[#{cmd1} && #{cmd2} && #{cmd4} ]
puts "#{upload}"
The command knife node list shows my nodes correctly. I am new to Chef and Ruby; I have searched and tried many things, but nothing works.
It surprisingly got sorted out! I stored all the files in chef-repo and split the one file into two: the first converts the .xls to JSON, and the other executes the knife commands. I now run both Ruby scripts from chef-repo and they work just as expected. Phew!

How to include input arguments in the CQL command - source

In Cassandra Query Language (CQL), there is a handy command called SOURCE which allows the user to execute the CQL statements stored in an external file.
SOURCE
Executes a file containing CQL statements. Gives the output for each
statement in turn, if any, or any errors that occur along the way.
Errors do NOT abort execution of the CQL source file.
Usage:
SOURCE '<file>';
But I am wondering if it is possible to allow this external file to take additional input arguments.
For example, suppose I would like to develop the following cql file with two input arguments:
create keyspace $1 with
strategy_class='SimpleStrategy' and
strategy_options:replication_factor=$2;
and would like to execute this cql in cqlsh by something like:
source 'cql-filename' hello_world 3
I developed the example CQL above, stored it in a file called create-keyspace.cql, and tried the possible commands I could come up with, but none of them works.
cqlsh> source 'create-keyspace.cql'
create-keyspace.cql:2:Invalid syntax at char 17
create-keyspace.cql:2: create keyspace $1 with strategy_class='SimpleStrategy' and strategy_options:replication_factor=$2;
create-keyspace.cql:2: ^
cqlsh> source 'create-keyspace.cql' hello_world 3
Improper source command.
cqlsh> source 'create-keyspace.cql hello_world 3'
Could not open 'create-keyspace.cql hello_world 3': [Errno 2] No such file or directory: 'create-keyspace.cql hello_world 3'
Does CQL have this type of support? If yes, how should I do it properly?
cqlsh is not really intended to be a scripting environment. It sounds like you'd be better served by using the Python CQL driver: https://github.com/datastax/python-driver
cqlsh only supports one parameter, the file containing CQL statements:
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/source_r.html
It's a Python-based command-line client. You can see its source code by looking for a file named cqlsh.py in the official Cassandra repo:
http://git-wip-us.apache.org/repos/asf/cassandra.git
Then search for SOURCE inside that file to see how it's handled.
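Building on the first answer, here is a minimal sketch with the DataStax Python driver (an assumption: the cassandra-driver package is installed and a node is reachable on 127.0.0.1). Keyspace names cannot be bound as query parameters, so the script interpolates them into the statement text, and it uses the CQL 3 replication map syntax rather than the older strategy_class form from the question:

import sys
from cassandra.cluster import Cluster

# Hypothetical usage: python create_keyspace.py hello_world 3
keyspace = sys.argv[1]
replication_factor = int(sys.argv[2])

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# Keyspace names cannot be bound parameters, so format the statement text.
session.execute(
    "CREATE KEYSPACE %s WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': %d}"
    % (keyspace, replication_factor)
)

cluster.shutdown()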

Calling a compiled binary on Amazon MapReduce

I'm trying to do some data analysis on Amazon Elastic MapReduce. The mapper step is a python script which includes a call to a compiled C++ binary called "./formatData". For example:
# myMapper.py
import sys
from subprocess import *

inputData = sys.stdin.readline()
# ...
p1 = Popen('./formatData', stdin=PIPE, stdout=PIPE)
p1Output = p1.communicate(input=inputData)
result = ... # manipulate the formatted data
print "%s\t%s" % (result, 1)
Can I call a binary executable like this on Amazon EMR? If so, where would I store the binary (in S3?), for what platform should I compile it, and how do I ensure my mapper script has access to it (ideally it would be in the current working directory)?
Thanks!
You can call the binary that way, if you make sure the binary gets copied to the worker nodes correctly.
See:
https://forums.aws.amazon.com/thread.jspa?threadID=35158
for an explanation of how to use the distributed cache to make the binary files accessible on the worker nodes.
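As a rough sketch of the mapper side (assumptions: the binary has been shipped to the nodes via the distributed cache, e.g. Hadoop streaming's -files option or a cache file as discussed in the linked thread, which places it in the task's working directory):

# myMapper.py (sketch)
import os
import stat
import sys
from subprocess import Popen, PIPE

# The distributed cache symlinks shipped files into the task's working
# directory, so a relative path works; make sure the execute bit is set.
binary = './formatData'
os.chmod(binary, os.stat(binary).st_mode | stat.S_IEXEC)

for line in sys.stdin:
    p = Popen([binary], stdin=PIPE, stdout=PIPE, universal_newlines=True)
    formatted, _ = p.communicate(input=line)
    # ... manipulate the formatted data into `result` here ...
    result = formatted.strip()
    print("%s\t%s" % (result, 1))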
