ruby - many instances - parallel processing

I'm new to Ruby, and I'm not sure why my program isn't working as expected.
I'm playing with reading a certain website and trying to extract some info, but I don't know how to create many instances of the XsidVehicle class.
There are 200 objects I'm trying to read; if each instance waits until the previous one finishes, reading the info will take at least 200 seconds. I would like to create many instances with different IDs that read different pages at the same time.
require 'open-uri'

class XsidVehicle
  def VehicleID(id)
    @id = id
    individual_source = open("https://www.motocykle.com/cars/id/#{id}").read
    puts individual_source[individual_source.index(id), individual_source.index(id) + 12]
  end
end
source = open('https://www.motocykle.pl/cars').read
newItem = source.index('lId=')
endOfThePage = source.index('</html>')
myNewSource = source[newItem.to_i, endOfThePage.to_i]

for i in 0..199 do
  #puts "This is i: #{i}"
  iid = myNewSource[myNewSource.index('"') + 1, 20]
  iid = iid[0, iid.index('"').to_i]
  myNewSource = myNewSource[myNewSource.index('class=').to_i, myNewSource.length]
  myNewSource = myNewSource[myNewSource.index('listingId=').to_i, myNewSource.length]
  # Here I noticed that each instance is created, navigates to the website to read it,
  # and only then is the next instance created. I want many instances to run at the
  # same time, so they are not restricted to sequential processing.
  puts "iid #{i}: #{iid}"
  vehicle = XsidVehicle.new
  vehicle.VehicleID(iid)
end
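
One way to get the parallel reads described above is to give each XsidVehicle its own thread and join them at the end. This is only a minimal sketch of the idea: it assumes the 200 IDs have first been collected into an array (called ids here, which is not in the original code) and that the fetch has been moved out of the extraction loop; it also puts no cap on how many requests run at once.

# Minimal sketch: run each read in its own thread instead of sequentially.
# `ids` is a hypothetical array holding the 200 extracted IDs.
threads = ids.map do |iid|
  Thread.new(iid) do |id|
    XsidVehicle.new.VehicleID(id)  # each thread fetches its own page
  end
end
threads.each(&:join)  # wait until every page has been read

In practice you would probably limit the number of threads (for example with a small pool of workers pulling IDs from a Queue) so that 200 requests don't hit the site at the same time.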

Related

How to use threading correctly in PyQt?

I am very new to PyQt, so I need some help. I created a MainWindow which is imported from another file; collect_host_status is also imported from another py file. Basically the GUI works, but it freezes, so I need to use threading for the long-running process. So far I have changed my code to look like this, but when I click the button that is supposed to check the hosts, nothing happens. :( I don't really get how to connect the textEdit from the MainWindow class to the Worker class. As it is now, it seems like the Worker class has no clue what self.ui.textEdit really is.
class Worker(QObject):
    finished = Signal()

    def __init__(self):
        super(Worker, self).__init__()

    def run(self):
        hostname = self.ui.textEdit.toPlainText()
        output_text = collect_host_status(hostname)
        for i in output_text:
            if "not found" in i:
                w = i.replace(" not found", "")
                self.ui.textEdit_3.append(w)
            else:
                self.ui.textEdit_2.append(i)
        self.finished.emit()


class MainWindow(QMainWindow):
    def __init__(self):
        QMainWindow.__init__(self)
        self.ui = Ui_MainWindow()
        self.ui.setupUi(self)
        self.ui.exitbutton.clicked.connect(self.close)
        self.ui.actionExit_2.triggered.connect(self.close)
        self.ui.actionOpen_2.triggered.connect(self.openfiles)
        self.ui.pushButton.clicked.connect(self.ui.textEdit.clear)
        self.ui.pushButton.clicked.connect(self.ui.textEdit_2.clear)
        self.ui.pushButton.clicked.connect(self.ui.textEdit_3.clear)
        self.ui.pushButton_2.clicked.connect(self.ui.textEdit_2.clear)
        self.ui.pushButton_2.clicked.connect(self.ui.textEdit_3.clear)
        self.connect(self.ui.pushButton_2, SIGNAL("clicked()",), self.buttonclicked)

    def buttonclicked(self):
        self.thread = QThread()
        self.worker = Worker()
        self.worker.moveToThread(self.thread)
        self.thread.started.connect(self.worker.run)
        self.worker.finished.connect(self.thread.quit)
        self.worker.finished.connect(self.worker.deleteLater)
        self.thread.finished.connect(self.thread.deleteLater)
        self.thread.start()
Qt, as with most UI frameworks, does not allow any kind of access to its objects from outside the main thread, which means that external threads cannot create widgets, reading properties (like toPlainText()) is unreliable, and writing (such as using append()) might even lead to a crash.
Even assuming that were possible, as you pointed out the worker has no clue about self.ui.textEdit and the other objects, and that's pretty obvious: self.ui is an attribute created on the main window instance; the thread has no ui object. I suggest you do some research on how classes and instances work, and how their attributes are accessed.
The only safe and correct way to do this is to use custom signals that are emitted from the thread and connected to slots (functions) that will actually manipulate the UI.
In the following code I made some adjustments to make it work:
the worker thread directly subclasses from QThread;
only one worker thread is created, and it uses a Queue to get requests from the main thread;
two specialized signals are used to notify whether the request is valid or not, and directly connected to the append() function of the QTextEdits;
I removed the finished signal, which is unnecessary since the thread is going to be reused (and, in any case, QThread already provides such a signal);
changed the "old style" self.connect as it's considered obsolete and will not be supported in newer versions of Qt;
class Worker(QThread):
    found = Signal(str)
    notFound = Signal(str)

    def __init__(self):
        QThread.__init__(self)
        self.queue = Queue()

    def run(self):
        while True:
            hostname = self.queue.get()
            output_text = collect_host_status(hostname)
            for i in output_text:
                if "not found" in i:
                    self.notFound.emit(i.replace(" not found", ""))
                else:
                    self.found.emit(i)

    def lookUp(self, hostname):
        self.queue.put(hostname)


class MainWindow(QMainWindow):
    def __init__(self):
        # ...
        self.ui.pushButton_2.clicked.connect(self.buttonclicked)

        self.thread = Worker()
        self.thread.found.connect(self.ui.textEdit_2.append)
        self.thread.notFound.connect(self.ui.textEdit_3.append)
        self.thread.start()

    def buttonclicked(self):
        if self.ui.textEdit.toPlainText():
            self.thread.lookUp(self.ui.textEdit.toPlainText())

Save Google Cloud Speech API operation(job) object to retrieve results later

I'm struggling to use the Google Cloud Speech API with the Ruby client (v0.22.2).
I can execute long-running jobs and can get results if I use
job.wait_until_done!
but this locks up a server for what can be a long period of time.
According to the API docs, all I really need is the operation name (id).
Is there any way of creating a job object from the operation name and retrieving it that way?
I can't seem to create a functional new job object in order to use the id from @grpc_op.
What I want to do is something like:
speech = Google::Cloud::Speech.new(auth_credentials)
job = speech.recognize_job file, options
saved_job = job.to_json #Or some element of that object such that I can retrieve it.
Later, I want to do something like....
job_object = Google::Cloud::Speech::Job.new(saved_job)
job_object.reload!
job_object.done?
job_object.results
Really hoping that makes sense to somebody.
I'm struggling quite a bit with Google's Ruby clients, since everything seems to be translated into objects that are much more complex than what's needed to use the API.
Is there some trick that I'm missing here?
You can monkey-patch this functionality to the version you are using, but I would advise upgrading to google-cloud-speech 0.24.0 or later. With those more current versions you can use Operation#id and Project#operation to accomplish this.
require "google/cloud/speech"
speech = Google::Cloud::Speech.new
audio = speech.audio "path/to/audio.raw",
encoding: :linear16,
language: "en-US",
sample_rate: 16000
op = audio.process
# get the operation's id
id = op.id #=> "1234567890"
# construct a new operation object from the id
op2 = speech.operation id
# verify the jobs are the same
op.id == op2.id #=> true
op2.done? #=> false
op2.wait_until_done!
op2.done? #=> true
results = op2.results
Update: Since you can't upgrade, you can monkey-patch this functionality into an older version using the workaround described in GoogleCloudPlatform/google-cloud-ruby#1214:
require "google/cloud/speech"
# Add monkey-patches
module Google
Module Cloud
Module Speech
class Job
def id
#grpc.name
end
end
class Project
def job id
Job.from_grpc(OpenStruct.new(name: id), speech.service).refresh!
end
end
end
end
end
# Use the new monkey-patched methods
speech = Google::Cloud::Speech.new

audio = speech.audio "path/to/audio.raw",
                     encoding: :linear16,
                     language: "en-US",
                     sample_rate: 16000

job = audio.recognize_job

# get the job's id
id = job.id #=> "1234567890"

# construct a new job object from the id
job2 = speech.job id

# verify the jobs are the same
job.id == job2.id #=> true

job2.done? #=> false
job2.wait_until_done!
job2.done? #=> true

results = job2.results
OK, I have a very ugly way of solving the issue.
Get the id of the Operation from the job object:
operation_id = job.grpc.grpc_op.name
Get an access token to manually use the REST API:
json_key_io = StringIO.new(ENV["GOOGLE_CLOUD_SPEECH_JSON_KEY"])
authorisation = Google::Auth::ServiceAccountCredentials.make_creds(
  json_key_io: json_key_io,
  scope: "https://www.googleapis.com/auth/cloud-platform"
)
token = authorisation.fetch_access_token!
Make an API call to retrieve the operation details.
Once the results are in, the response will include a "done" => true parameter along with the results. If "done" => true isn't there yet, you'll have to poll again later until it is.
HTTParty.get(
  "https://speech.googleapis.com/v1/operations/#{operation_id}",
  headers: { "Authorization" => "Bearer #{token['access_token']}" }
)
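
For completeness, here is a minimal sketch of what the "poll again later" part could look like, wrapped around the same HTTParty call; the loop structure and the 30-second interval are illustrative assumptions, not part of the original answer.

# Hypothetical polling loop around the REST call shown above.
response = nil
loop do
  response = HTTParty.get(
    "https://speech.googleapis.com/v1/operations/#{operation_id}",
    headers: { "Authorization" => "Bearer #{token['access_token']}" }
  )
  break if response["done"]  # "done" => true appears once the results are in
  sleep 30                   # arbitrary interval; tune it to the job length
end
# `response` now holds the full operation body, including the results.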
There must be a better way of doing that; it seems such an obvious use case for the Speech API.
Is there anyone from Google in the house who can explain a much simpler/cleaner way of doing it?

Is it reasonable to use Resque (Ruby) to manage external long-running commands (and log tasks)

I have to run bash heavy-job.sh <data-num> (which takes 0.5~2 days) frequently on my computer to process data located at ~/a/data/num. The script calls a few sub-processes sequentially and writes a log to ~/a/result/num.log. I have done this manually until now.
I wanted to visualize the processed tasks and their status (success or fail), etc. as an HTML table, so I wrote a simple Sinatra app to render a table that shows
the list of ~/a/data/num to be processed
whether ~/a/result/num.log exists (process not launched / processing / done)
its status (whether the log file contains the word "error" or not)
I found it would be convenient if I could launch bash heavy-job.sh <data-num> from the Sinatra app, log the tasks (with info like time, date, etc.) and their args (heavy-job.sh takes some optional args), and show them in the HTML table.
So I need something that manages jobs and logs them to files (or a DB).
First I wrote the code below as a test (just a test, not integrated with my system yet), but later I found that Resque is what I wanted. I am a beginner and not sure whether my decision is reasonable.
My questions are:
Is it reasonable to use Resque to manage external long-running commands (and log tasks),
or should I use another tool (not necessarily a Ruby tool)?
(Extra:) Should the task manager and the Sinatra app work separately (and communicate with each other over REST or something), or not?
The jobs are not critical, since I can retry failed tasks manually later.
I am not good at English and my question may be unclear. I appreciate any help. :)
class TaskSpawn
  def initialize()
    @pids = []
  end

  def spawn(command, options = {})
    @opt = { :pgroup => true }
    @pids << Kernel.spawn(command, options)
  end

  def pids()
    return @pids.clone
  end

  def waitany_nohang()
    delete_idx = nil
    ret = nil
    @pids.each_with_index do |p, idx|
      pid, status = Process.waitpid2(p, Process::WNOHANG)
      unless pid.nil?
        delete_idx = idx
        ret = [pid, status]
        break
      end
    end
    if delete_idx
      @pids.delete_at(delete_idx)
      return ret
    else
      # no task finished
      return nil
    end
  end

  def waitall()
    ret = Process.waitall
    raise "internal error" if ret.size != pids.size
    return ret
  end
end
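
For reference, here is a minimal sketch of what a Resque job wrapping the external command could look like. It is only an illustration of the approach being asked about and assumes a standard Redis-backed Resque setup; the queue name, the .status file, and the argument handling are made up for the example.

require 'resque'

# Hypothetical Resque job that runs heavy-job.sh and records task metadata.
class HeavyJob
  @queue = :heavy_jobs

  def self.perform(data_num, extra_args = [])
    started = Time.now
    ok = system("bash", "heavy-job.sh", data_num.to_s, *extra_args)

    # Record task info separately from the script's own ~/a/result/num.log
    File.open(File.expand_path("~/a/result/#{data_num}.status"), "w") do |f|
      f.puts "started=#{started} finished=#{Time.now} args=#{extra_args.inspect} success=#{ok}"
    end
  end
end

# Enqueued from the Sinatra app; a worker started with
#   QUEUE=heavy_jobs rake resque:work
# picks the job up and blocks on it until the script finishes.
Resque.enqueue(HeavyJob, 42, ['--fast'])

Since each job ties up its worker for the full 0.5~2 days, this only makes sense with a dedicated queue and as many workers as you want concurrent runs; whether that is "reasonable" is mostly a question of how much concurrency you need.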

Streaming to HBase with pyspark

There is a fair amount of info online about bulk loading to HBase with Spark streaming using Scala (these two were particularly useful) and some info for Java, but there seems to be a lack of info for doing it with PySpark. So my questions are:
How can data be bulk loaded into HBase using PySpark?
Most examples in any language only show a single column per row being upserted. How can I upsert multiple columns per row?
The code I currently have is as follows:
if __name__ == "__main__":
    context = SparkContext(appName="PythonHBaseBulkLoader")
    streamingContext = StreamingContext(context, 5)

    stream = streamingContext.textFileStream("file:///test/input")

    stream.foreachRDD(bulk_load)

    streamingContext.start()
    streamingContext.awaitTermination()
What I need help with is the bulk load function
def bulk_load(rdd):
    # ???
I've made some progress previously, with many and various errors (as documented here and here)
So after much trial and error, I present here the best I have come up with. It works well, and successfully bulk loads data (using Puts or HFiles). I am perfectly willing to believe that it is not the best method, so any comments/other answers are welcome. This assumes you're using a CSV for your data.
Bulk loading with Puts
By far the easiest way to bulk load, this simply creates a Put request for each cell in the CSV and queues them up to HBase.
def bulk_load(rdd):
    # Your configuration will likely be different. Insert your own quorum and parent node and table name
    conf = {"hbase.zookeeper.quorum": "localhost:2181",
            "zookeeper.znode.parent": "/hbase-unsecure",
            "hbase.mapred.outputtable": "Test",
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

    # Split the input into individual lines, then convert each CSV line to key-value pairs
    load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                  .flatMap(csv_to_key_value)

    load_rdd.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)
The function csv_to_key_value is where the magic happens:
def csv_to_key_value(row):
    cols = row.split(",")  # Split on commas.
    # Each cell is a tuple of (key, [key, column-family, column-descriptor, value])
    # Works well for n>=1 columns
    result = ((cols[0], [cols[0], "f1", "c1", cols[1]]),
              (cols[0], [cols[0], "f2", "c2", cols[2]]),
              (cols[0], [cols[0], "f3", "c3", cols[3]]))
    return result
The value converter we defined earlier will convert these tuples into HBase Puts.
Bulk loading with HFiles
Bulk loading with HFiles is more efficient: rather than a Put request for each cell, an HFile is written directly and the RegionServer is simply told to point to the new HFile. This will use Py4J, so before the Python code we have to write a small Java program:
import py4j.GatewayServer;
import org.apache.hadoop.hbase.*;

public class GatewayApplication {

    public static void main(String[] args)
    {
        GatewayApplication app = new GatewayApplication();
        GatewayServer server = new GatewayServer(app);
        server.start();
    }
}
Compile this, and run it. Leave it running as long as your streaming is happening. Now update bulk_load as follows:
def bulk_load(rdd):
    # The output class changes, everything else stays
    conf = {"hbase.zookeeper.quorum": "localhost:2181",
            "zookeeper.znode.parent": "/hbase-unsecure",
            "hbase.mapred.outputtable": "Test",
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}  # "org.apache.hadoop.hbase.client.Put"
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

    load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                  .flatMap(csv_to_key_value)\
                  .sortByKey(True)

    # Don't process empty RDDs
    if not load_rdd.isEmpty():
        # saveAsNewAPIHadoopDataset changes to saveAsNewAPIHadoopFile
        load_rdd.saveAsNewAPIHadoopFile("file:///tmp/hfiles" + startTime,
                                        "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
                                        conf=conf,
                                        keyConverter=keyConv,
                                        valueConverter=valueConv)
        # The file has now been written, but HBase doesn't know about it
        # Get a link to Py4J
        gateway = JavaGateway()
        # Convert conf to a fully fledged Configuration type
        config = dict_to_conf(conf)
        # Set up our HTable
        htable = gateway.jvm.org.apache.hadoop.hbase.client.HTable(config, "Test")
        # Set up our path
        path = gateway.jvm.org.apache.hadoop.fs.Path("/tmp/hfiles" + startTime)
        # Get a bulk loader
        loader = gateway.jvm.org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles(config)
        # Load the HFile
        loader.doBulkLoad(path, htable)
    else:
        print("Nothing to process")
Finally, the fairly straightforward dict_to_conf:
def dict_to_conf(conf):
    gateway = JavaGateway()
    config = gateway.jvm.org.apache.hadoop.conf.Configuration()
    for key, value in conf.items():
        config.set(key, value)
    return config
As you can see, bulk loading with HFiles is more complex than using Puts, but depending on your data load it is probably worth it; once you get it working, it's not that difficult.
One last note on something that caught me off guard: HFiles expect the data they receive to be written in lexical order. This is not always guaranteed to be true, especially since "10" < "9". If you have designed your key to be unique, then this can be fixed easily:
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
              .flatMap(csv_to_key_value)\
              .sortByKey(True)  # Sort in ascending order

Ruby: Dynamically defining classes based on user input

I'm creating a library in Ruby that allows the user to access an external API. That API can be accessed via either a SOAP or a REST API. I would like to support both.
I've started by defining the necessary objects in different modules. For example:
soap_connection = Library::Soap::Connection.new(username, password)
response = soap_connection.create Library::Soap::LibraryObject.new(type, data, etc)
puts response.class # Library::Soap::Response

rest_connection = Library::Rest::Connection.new(username, password)
response = rest_connection.create Library::Rest::LibraryObject.new(type, data, etc)
puts response.class # Library::Rest::Response
What I would like to do is allow the user to specify that they only wish to use one of the APIs, perhaps something like this:
Library::Modes.set_mode(Library::Modes::Rest)
rest_connection = Library::Connection.new(username, password)
response = rest_connection.create Library::LibraryObject.new(type, data, etc)
puts response.class # Library::Response
However, I have not yet discovered a way to dynamically set, for example, Library::Connection based on the input to Library::Modes.set_mode. What would be the best way to implement this functionality?
Murphy's law prevails; I found an answer right after posting the question to Stack Overflow.
This code seems to have worked for me:
module Library
  class Modes
    Rest = 1
    Soap = 2

    def self.set_mode(mode)
      case mode
      when Rest
        Library.const_set "Connection", Class.new(Library::Rest::Connection)
        Library.const_set "LibraryObject", Class.new(Library::Rest::LibraryObject)
      when Soap
        Library.const_set "Connection", Class.new(Library::Soap::Connection)
        Library.const_set "LibraryObject", Class.new(Library::Soap::LibraryObject)
      else
        raise "#{mode} is not a valid Library::Mode"
      end
    end
  end
end
A quick test:
Library::Modes.set_mode(Library::Modes::Rest)
puts Library::Connection.superclass == Library::Rest::Connection # true
c = Library::Connection.new(username, password)
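
A possible variation, not from the original answer: since Library::Connection doesn't need any behaviour of its own, the existing classes could be assigned to the constants directly instead of creating anonymous subclasses. In that case Library::Connection is Library::Rest::Connection itself rather than a subclass of it, so the quick test above would compare the constants for equality instead of checking the superclass.

module Library
  class Modes
    # Variation: point the constants at the existing classes instead of subclassing them.
    def self.set_mode(mode)
      case mode
      when Rest
        Library.const_set "Connection", Library::Rest::Connection
        Library.const_set "LibraryObject", Library::Rest::LibraryObject
      when Soap
        Library.const_set "Connection", Library::Soap::Connection
        Library.const_set "LibraryObject", Library::Soap::LibraryObject
      else
        raise "#{mode} is not a valid Library::Mode"
      end
    end
  end
end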
