h2o steam deployed model has no input fields in prediction service

When trying to use the prediction service for a model deployed by steam, here is what happens: when I click the "Predict" button, I get a prediction label response from the model, but no input fields are displayed. Why is this happening?
I start my steam session like this:
I launch h2o flow:
java -Xmx4g -jar h2o.jar
I start the steam jetty server for the prediction service (as instructed here):
java -Xmx6g -jar var/master/assets/jetty-runner.jar var/master/assets/ROOT.war
I use -Xmx6g because I was getting a java.lang.OutOfMemoryError from the prediction service earlier.
I launch the steam server:
./steam serve master --prediction-service-host=localhost --prediction-service-port-range=12345:22345
I use a custom port range for the prediction service because I was having problems deploying models from steam when it could not access port 8080 (if anyone knows a better way around this, please let me know). From there, I import the model from the localhost h2o flow server into steam and deploy it to get the screen shown at the top of this post.
I was having problems before where the prediction service builder server (launched with GRADLE_OPTS=-Xmx6g ./gradlew jettyRunWar, following the instructions here) was not showing input fields for .war files built from MOJOs (see here), but in this case I am using a model imported directly from h2o flow into steam. If anyone knows what is going on here, it would be a big help. Thanks :)
UPDATE
Used a smaller, similar model (POJO size of ~200MB) and can now see input fields (after waiting on the prediction service screen for ~10 sec.). I can't tell what kind of file the model is being transferred as under the hood, though; I assume POJO now. One weird thing is that the input fields also include the model's binomial response labels (as if the user could just choose the response as an input).

As I explained in this other question, Using MOJOs in H2O Steam Prediction Service Builder, this is because the UI has not been updated to handle MOJOs; it currently only handles POJOs.
You can use the command line (or other tools) to send data to and get predictions from the prediction service. How to do this is explained here: https://github.com/h2oai/steam/tree/master/prediction-service-builder
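For example, a minimal sketch in Python with the requests library, assuming the service listens on a port from the range configured above and exposes the /predict endpoint described in that README (the parameter names here are hypothetical; they must match your model's input columns):

import requests

# Hypothetical input columns; use your model's actual field names.
params = {"sepal_len": 5.1, "sepal_wid": 3.5}
# 12345 is the low end of the --prediction-service-port-range set above.
resp = requests.get("http://localhost:12345/predict", params=params)
resp.raise_for_status()
print(resp.json())  # prediction label plus class probabilities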

Related

Update central cache with different system data change in microservices scale architecture

We're building a microservice system in which new data can come from three (or more) different sources and eventually affects the end user.
It doesn't matter what the purpose of the system is for this question, so I'll try to keep it simple. The setup is as follows.
Data can come from the following sources:
Back-office site: defines the system and user configurations.
Main site: where users interact with the site and take actions.
External data sources: such as partners, which can provide additional data (supplementary information) about users.
The services are:
Site-back-office service: serves the back-office site.
User-service: serves the main site.
Import service: imports additional data (supplementary information) from external sources.
User-cache service: syncs with all of the above system data and combines it into pre-prepared cache responses. The reason for this is that the main site must serve hundreds of millions of users at very low latency.
The main idea is:
Each microservice has its own db.
Each microservice can scale.
Each data change in one of the three parts affects the user and should be sent to the cache service so it is eventually reflected on the main site.
The cache (Redis) holds all the data combined into pre-prepared responses for the main site.
Each service's data changes will be published to a pubsub topic for the cache-service to update the Redis db.
The system should serve around 200 million users.
So... the questions are:
Since the user-cache service can (and must) scale, what happens if, for example, there are two update messages waiting on pubsub, one old and one new? How do we process only the new message, and prevent the case where one cache-service instance writes the new message's data to Redis and another cache-service instance then overrides it with the old message?
There is also the case where a cache-service instance needs to first read the current cached user data, apply the change to it, and only then update the cache with the new data. How do we prevent the case where, for example, two instances read the current cache data while a third instance updates it with new data, and they then override it with theirs?
Is it at all possible to pre-prepare responses based on several sources which can change periodically? What is the right approach to this problem?
I'll try to address some of your points; let me know if I misunderstood what you're asking.
1) I believe you're asking how to enforce ordering of messages, so that an old update does not override a newer one. There is a "publish_time" field on each message (https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage) that you can use to coordinate based on the time the cloud pubsub server received your publish request. If you wish to coordinate based on some other time or ordering mechanism, you can add an attribute to your PubsubMessage or payload to do so.
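A minimal sketch of the attribute approach with the google-cloud-pubsub Python client (the project and topic names, the version numbering scheme, and the two helper functions are assumptions, not anything pubsub provides for you):

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
topic_path = publisher.topic_path("my-project", "user-updates")

# Publisher side: stamp each update with a monotonically increasing version.
# Attribute values must be strings.
publisher.publish(topic_path, data=b'{"user_id": 42}', version="17")

# Subscriber side (cache-service): ack everything, but ignore anything older
# than what the cache already holds, so a late old message cannot clobber
# a newer one.
def callback(message):
    incoming = int(message.attributes["version"])
    if incoming > get_cached_version(message):  # hypothetical lookup helper
        apply_to_redis(message.data, incoming)  # hypothetical write helper
    message.ack()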
2) This seems to be a general synchronization problem, not necessarily related to cloud pubsub; I'll leave this to others to answer.
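One common approach to that read-modify-write race, sketched here only as an illustration, is optimistic locking with Redis WATCH/MULTI/EXEC; this is a general Redis pattern rather than anything pubsub-specific. A minimal sketch with the redis-py client:

import redis

r = redis.Redis()

def read_modify_write(key, mutate):
    # Retry loop: WATCH aborts the transaction if another instance
    # writes the key between our read and our EXEC.
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(key)
                current = pipe.get(key)
                new_value = mutate(current)  # apply the change locally
                pipe.multi()
                pipe.set(key, new_value)
                pipe.execute()  # raises WatchError if the key changed
                return new_value
            except redis.WatchError:
                continue  # lost the race; re-read and try again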
3) Cloud dataflow implements a windowing and watermark mechanism similar to what you're describing. Perhaps you could use this to remove conflicting updates and perform preprocessing prior to writing them to the backing store.
https://beam.apache.org/documentation/programming-guide/#windowing
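As an illustration, here is a sketch of that idea with the Apache Beam Python SDK (which Dataflow runs); the topic name, the JSON payload shape, and the "keep the highest version per user" combine step are all assumptions:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

opts = PipelineOptions(streaming=True)
with beam.Pipeline(options=opts) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/user-updates")
     | "Parse" >> beam.Map(json.loads)                       # payload assumed to be JSON
     | "KeyByUser" >> beam.Map(lambda u: (u["user_id"], u))
     | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 60-second windows
     # Within each window, keep only the newest update per user.
     | "Latest" >> beam.CombinePerKey(lambda us: max(us, key=lambda u: u["version"]))
    )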
-Daniel

How to allow sinatra to poll for data smartly

I want to design an application where the back end is constantly polling different sensors while the front end (sinatra) allows this data to be viewed either via a JSON API or by simply displaying the results in HTML.
What considerations should I take into account, and how should I structure the application for best scaling and ease of maintenance?
My first thought is to simply let sinatra poll the sensors every time it receives a request to the proper endpoints, but this seems like it could bog down quite fast, especially since some sensors only update themselves every couple of seconds.
My second thought is to have a background process (or thread) poll the sensors and store the values for sinatra. When a request is received sinatra can then simply poll the background process for a cached value (or pull it from the threaded code) and present it to the client.
I like the second thought more, but I am not sure how I would develop the "background application" so that sinatra could poll it for data to present to the client. The other option would be for sinatra to thread the sensor polling code so that it can simply grab values from it inside the same process rather than requesting them from another process.
Do note that this application will also be responsible for automating different relays and such based on the sensors; sinatra is only responsible for relaying the status of the sensors to the user. I think separating the back end (automation + sensor information) into a background process/daemon, apart from the front end (sinatra), would be ideal, but I am not sure how I would fetch the data for sinatra.
Does anyone have any input on how I could structure this? If possible I would also appreciate a sample application that simply demonstrates the idea, which I could adopt and modify.
Thanks
Edit:
After a bit more research I have discovered DRb (distributed Ruby, http://ruby-doc.org/stdlib-1.9.3/libdoc/drb/rdoc/DRb.html), which allows you to make remote calls on objects over the network. This may be a suitable solution to this problem, as the daemon can automate the relays, read the sensors, and store the values in class objects, then serve those class objects over DRb so that sinatra can call the getters on the remote object to obtain up-to-date data from the daemon. This is what I initially wanted to attempt to do.
What do you guys think? Is this advisable for such an application?
I have decided to go with Sinatra, DRB, and Daemons to meet the requirements I have stated above.
The web front end will run in its own process and only serve up statistical information via DRB interactions with the backend. This will allow quick response times for the clients and allow me to separate front end code from backend code.
The backend will run in its own process and constantly poll the sensors for updates and store them as class objects with getters so that Sinatra can fetch the information over DRB when required. It will also use the gathered information for automation that is project specific.
Finally, the backend and frontend will be wrapped with a Daemons wrapper so that the project can start, restart, stop, report run status, and automatically restart the daemons if they crash or quit for whatever reason.
Source information:
http://phrogz.net/drb-server-for-long-running-web-processes
http://ruby-doc.org/stdlib-1.9.3/libdoc/drb/rdoc/DRb.html
http://www.sinatrarb.com/
https://github.com/thuehlinger/daemons

Using Grafana with Jmeter

I am trying to make Grafana display all my metrics (CPU, memory, etc.).
I have already configured Grafana on my server, I have configured InfluxDB, and of course I have configured the JMeter listener (Backend Listener), but I still cannot display all the graphs. Any idea what I should do to make it work?
It seems that system metrics (CPU/memory, etc.) are not in the scope of the JMeter Backend Listener implementation. Capturing those KPIs is actually part of the PerfMon plugin, which currently doesn't seem to support dumping the metrics to InfluxDB/Graphite (at least it doesn't seem to work for me). It might be a good idea to raise such a request at https://groups.google.com/forum/#!forum/jmeter-plugins. Until this gets done, I guess you also have the option of using some alternative metric-collection tools to feed data into InfluxDB/Graphite. Those would depend on the server OS you want to monitor (e.g. Graphite-PowerShell-Functions for Windows, or collectd for everything else).
Are you sure that JMeter posts the data to InfluxDB? Did you see the default measurements created in influxDB?
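One quick way to check is to list the measurements in the target database, for example with the influxdb Python client (the database name "jmeter" is an assumption here; it comes from your Backend Listener settings):

from influxdb import InfluxDBClient

# Connect to the same InfluxDB instance the Backend Listener writes to.
client = InfluxDBClient(host="localhost", port=8086, database="jmeter")
# If JMeter is posting data, this should list at least one measurement.
print(list(client.query("SHOW MEASUREMENTS")))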
I am able to send the data to InfluxDB using the Backend Listener. I have documented the steps on this site:
http://www.testautomationguru.com/jmeter-real-time-results-influxdb-grafana/

Keeping state in sync between server and GUI in realtime

I am looking for a library that will help me keep some state in sync between my server and my GUI in "real time". I have the messaging and middleware sorted (push updates etc.), but what I need is a protocol on top of that which guarantees that the data stays in sync within some reasonably finite period: an error, dropped message, or exception might cause the data to go out of sync for a few seconds, but it should resync, or at least know it is out of sync, within a few seconds.
This seems like something that should have been solved before, but I can't find anything suitable. Any help is much appreciated.
More detail: I have a Rich Client GUI (Silverlight, but likely to move to Javascript/C# or Java soon) that is served by a JMS-type middleware.
I am looking to re-engineer some of the data interactions along the following lines:
Each user has their own view on several reasonably small data sets for items such as:
Entitlements (what GUI elements to display)
GUI data (e.g. to fill drop down menus etc)
Grids of business data (e.g. a grid of orders)
Preferences (e.g. how the GUI is laid out)
All of these data sets can be changed on the server at any time and the data should update on the client as soon as possible.
Data is changed via the server: the client asks for a change (e.g. cancel a request) and the server validates it against entitlements and business rules, then updates its internal data set, which sends the change back to the GUI. In order to provide user feedback, an interim state may be set on the GUI ("cancel submitted" or similar), which is then overridden by the server response.
At the moment the workflow is:
User authenticates
GUI downloads the initial data sets from the server (which either loads them from the database or some other business objects it has cached)
GUI renders
GUI downloads a snapshot of the business data
GUI subscribes to updates to the business data
As updates come in the GUI updates the model and view on screen
I am looking for a generalised library that would improve on this:
Should be cross language using an efficient payload format (e.g. Java back end, C# front end, protobuf data format)
Should be transport agnostic (we use a JMS style middleware we don’t want to replace right now)
The client should be sent an update when a change occurs to the server-side dataset
The client and server should be able to check for changes to ensure they are up to date
The data sent should be minimal (minimum delta)
Client and Server should cope with being more than one revision out of sync
The client should be able to cache to disk in between sessions and then just get deltas on login.
I think the ideal solution would work something like this:
Any object (or object tree) can be registered with the library code (this should work with data/objects loaded via Hibernate)
When the object changes, the library notifies a listener / callback with the change delta
The listener sends that delta to the client using my JMS
The client gets the update and can hand it to the client-side version of the library, which will update the client-side version of the object
The client should get sufficient information from the update to be able to decide what UI action needs to be taken (notify user, update grid etc)
The client and server periodically check that they are on the same version of the object (e.g. the server sends the version number to the client) and can remediate if necessary, either by the server sending deltas or by a complete refresh.
Thanks for any suggestions
Wow, that's a lot!
I have a project going on which deals with the synchronization aspect of this in Javascript on the front end. There is a testing server written in Node.JS (it was actually easy once the client was settled).
Basically, data is stored by key in a dataset and every individual key is versioned. The server has all versions of all data, and the client can be fed changes from the server. Version conflicts, for when something is modified on both client and server, are handled by a conflict-resolution callback.
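To illustrate the idea (this is a Python sketch of the versioned-key concept, not SyncIt's actual JavaScript API):

class VersionedStore:
    def __init__(self, resolve_conflict):
        self._data = {}  # key -> (version, value)
        self._resolve = resolve_conflict

    def apply(self, key, base_version, new_value):
        current_version, current_value = self._data.get(key, (0, None))
        if base_version == current_version:
            # The change was made against the latest version: accept it.
            self._data[key] = (current_version + 1, new_value)
        else:
            # Both sides changed the key: let the callback pick a winner.
            merged = self._resolve(key, current_value, new_value)
            self._data[key] = (current_version + 1, merged)

# Example: a last-writer-wins conflict-resolution policy.
store = VersionedStore(lambda key, ours, theirs: theirs)
store.apply("prefs", 0, {"theme": "dark"})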
It is not complete; in fact it only has in-memory stores at the moment, but that will change over the next week or so.
The actual notification/downloading and uploading is out of scope for the library, but you could just use Socket.IO for this.
It currently works with jQuery, Dojo and NodeJS; really, it's got hardly any dependencies at all.
The project (with a demo) is located at https://github.com/forbesmyester/SyncIt

tweepy Streaming API integration with Django

I am trying to create a Django webapp that utilizes the Twitter Streaming API via the tweepy.Stream() function. I am having a difficult time conceptualizing the proper implementation.
The simplest functionality I would like to have is to count the number of tweets containing a hashtag in real time. So I would open a stream, filtering by keywords, and every time a new tweet comes over the connection I increment a counter. That counter is then displayed on a webpage and updated with AJAX or otherwise.
The problem is that the tweepy.Stream() function must be continuously running and connected to twitter (that's the point). How can I have this stream running in the background of a Django app while incrementing counters that can be displayed in (near) real time?
Thanks in advance!
There are various ways to do this, but using a messaging lib (celery) will probably be the easiest.
1) Keep a python process running tweepy. Once an interesting message is found, create a new celery task.
2) Inside this celery task, persist the data to the database (the counter, the tweets, whatever). This task can happily run django code (e.g. the ORM).
3) Have a regular django app displaying the results your task has persisted.
As a precaution, it's probably a good idea to run the tweepy process under supervision (supervisord might suit your needs). If anything goes wrong with it, it can be restarted automatically.
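Putting steps 1) and 2) together, a minimal sketch (this assumes tweepy's classic StreamListener API and a hypothetical "counters" Django app; the credentials, hashtag, and model are placeholders, and the celery worker needs Django settings configured):

import tweepy
from celery import Celery

app = Celery("tweet_counter", broker="redis://localhost:6379/0")

@app.task
def record_tweet(hashtag):
    # Runs inside a celery worker that has Django configured,
    # so the ORM is available here.
    from django.db.models import F
    from counters.models import HashtagCount  # hypothetical model
    HashtagCount.objects.filter(tag=hashtag).update(count=F("count") + 1)

class CountingListener(tweepy.StreamListener):
    def on_status(self, status):
        # Hand the work to celery so the streaming loop stays fast.
        record_tweet.delay("#mytag")

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
stream = tweepy.Stream(auth, CountingListener())
stream.filter(track=["#mytag"])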
