Impala Open/Alive Sessions Monitoring - JDBC

I have been looking through the currently available documentation, as well as the following UIs, for a place where I can see the current open sessions:
Cloudera Manager (Impala Tab)
Impala StateStore Web UI
Impala Catalog Server Web UI
Any idea where I could find it, or is there an alternative method for monitoring the live Impala connections?

This is a significant weakness (at least of the version of CM we're using). The only solution I have found so far is:
First:
Cloudera Manager Home > Impala
Click Queries
This will show you all queries that are executing, as well as those that have executed within the selected time window. The detail is high-level, and we have found it often shows queries as "Executing" that have long since failed (this may have been fixed in a more recent version of CDH than the 5.2.6 we run).
In any case, this list will identify the nodes running impalad in the Coordinator field. To get much greater detail, access the web UI node by node. If the host running a query you were interested in is called node12, use
http://node12:25000/queries
and look for "In flight" queries at the top.

https://impala.apache.org/docs/build/html/topics/impala_webui.html
Sessions Page
By default, the sessions page of the debug web UI is at http://impala-server-hostname:25000/sessions (non-secure cluster) or https://impala-server-hostname:25000/sessions (secure cluster).
This page displays information about the sessions currently connected to this impalad instance. For example, sessions could include connections from the impala-shell command, JDBC or ODBC applications, or the Impala Query UI in the Hue web interface.
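
If you need to monitor this continuously rather than eyeball the pages, you can poll each impalad's /sessions page yourself. Below is a minimal Java sketch (assuming Java 11+, the default debug port 25000, and placeholder hostnames); aggregate the results across all coordinators to get a cluster-wide view:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;

    public class ImpalaSessionPoller {
        public static void main(String[] args) throws Exception {
            // Hostnames are placeholders; list every node running impalad.
            List<String> coordinators = List.of("node12", "node13");
            HttpClient client = HttpClient.newHttpClient();

            for (String host : coordinators) {
                // Each impalad serves its own debug web UI on port 25000 by default.
                HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://" + host + ":25000/sessions"))
                    .GET()
                    .build();
                HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
                // Crude check: a real monitor would parse the returned page
                // (or JSON, if your Impala version supports JSON output).
                System.out.printf("%s -> HTTP %d, %d bytes of session page%n",
                    host, response.statusCode(), response.body().length());
            }
        }
    }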

Related

Why does Solr JDBC only support SolrCloud? Can a single node get this feature?

I want to use the Solr JDBC driver in my project, based on a single node. But it only supports SolrCloud? So can a single node get this feature?
The reason for this is that the SQL feature of Solr is a MapReduce task built on top of parallel processing on the nodes. This requires a collection to keep track of SQL plans and available workers. Since this feature uses the collection API (and doesn't have an alternative implementation), and the JDBC driver connects to ZooKeeper (and not to Solr directly) to get information about your Solr cluster, Cloud mode is required for JDBC and SQL support.
You could run in SolrCloud mode with a single node, and you can also run multiple instances on a single server (like having a cluster with three nodes, but only a single server).
I don't think having this support work with a non-cloud setup would be a very high priority task, as the SQL feature is rather new and experimental and cloud mode is becoming more of a norm (and "cloud mode" and "old mode" will probably be merged to a single mode later).
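
To make the ZooKeeper point concrete, here is a rough Java sketch of a Solr JDBC connection (assuming solr-solrj and its dependencies are on the classpath; the ZooKeeper address and collection name are placeholders). Note that the URL names the ZooKeeper ensemble, not a Solr node:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SolrJdbcExample {
        public static void main(String[] args) throws Exception {
            // The URL points at ZooKeeper, not at a Solr node -- which is
            // exactly why SolrCloud mode is required. Host, port, and
            // collection name here are placeholders.
            String url = "jdbc:solr://localhost:9983?collection=mycollection";
            try (Connection con = DriverManager.getConnection(url);
                 Statement stmt = con.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT id FROM mycollection LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString("id"));
                }
            }
        }
    }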

Cache Spark table in Thrift server

When using Jupyter to cache some data into Spark (using sqlContext.cacheTable), I can see the table cached for the SparkContext running within Jupyter. But now I want to access those cached tables from BI tools via ODBC using the Thrift server. When checking the Thrift server cache, I don't see any table. The question is: how do I get those tables cached so they can be consumed from BI tools?
Do I have to send the same Spark commands via JDBC? In that case, is the cache tied to the current session?
Regards,
Miguel
I found the solution. In order to have the tables cached for use with JDBC/ODBC clients via the Thrift server, I have to run CACHE TABLE from one of the clients, for example from beeline. Once this is done, the table is in memory for all the different sessions.
It is also important to be sure you are using the right Spark Thrift server. To check, just run show tables; in beeline: if you get just one column back, you are not using the Spark one, and CACHE TABLE won't work.
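
The same CACHE TABLE statement can also be issued over JDBC instead of from beeline. A minimal Java sketch, assuming the standard Hive JDBC driver is on the classpath (the Spark Thrift Server speaks the HiveServer2 protocol) and using placeholder host, user, and table names:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ThriftServerCache {
        public static void main(String[] args) throws Exception {
            // The Spark Thrift Server listens on the HiveServer2 default
            // port 10000 unless configured otherwise. Host and credentials
            // are placeholders.
            String url = "jdbc:hive2://thrift-host:10000/default";
            try (Connection con = DriverManager.getConnection(url, "user", "");
                 Statement stmt = con.createStatement()) {
                // Cache the table inside the Thrift Server's own SparkContext,
                // so every JDBC/ODBC session against this server can reuse it.
                stmt.execute("CACHE TABLE my_table");
            }
        }
    }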

Can I use a SnappyData JDBC connection with only a Locator and Server nodes?

SnappyData documentation and architecture diagrams seem to indicate that a JDBC thin client connection goes from a client to a Locator and then it is routed to a direct connection to a Server.
If this is true, then I can run JDBC queries without a Lead node, correct?
Yes, that is correct. The locator provides load and connectivity information back to the client, which is then able to connect to one or more servers, either for direct access to a bucket for low-latency queries or, more importantly, for HA: it can fail over and fail back.
So, yes, your connected clients will continue to function even when the locator goes away. Note that the "lead" plays a different role than the locator. Its primary function is to host the Spark driver, orchestrate Spark jobs, and provide HA to Spark. With no lead, you won't be able to run such jobs.
In addition to what @jagsr has mentioned: if you do not intend to run the lead nodes (and thus no Spark jobs or column store), then you can run the cluster as a pure row store using snappy-start-all.sh rowstore (see the rowstore docs).
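
For reference, here is a minimal Java sketch of the thin-client connection described above, assuming the SnappyData client JDBC jar is on the classpath; the locator hostname and table name are placeholders, and 1527 is the default client port:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SnappyThinClient {
        public static void main(String[] args) throws Exception {
            // The thin client connects to the locator, which then routes
            // the client to a data server. Hostname and table name are
            // placeholders.
            String url = "jdbc:snappydata://locator-host:1527/";
            try (Connection con = DriverManager.getConnection(url);
                 Statement stmt = con.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT * FROM my_table")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }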

HBase as database in web application

A big question about using Hadoop or related technologies in a real web application.
I just want to find out how a web app can use HBase as its database. I mean, is that what big data apps do, or do they use normal databases and only use these sorts of technologies for analysis?
Is it OK to have an online store with an HBase database, or something like this?
Yes, it is perfectly fine to have HBase as your backend.
Here is what I am doing to get this done (I have an online community and forum running on my website):
1. Writing C# code to access HBase using Thrift; very easy and simple to get done. (Thrift is a cross-language binding platform; to HBase, only Java is a first-class citizen!)
2. Managing the HBase cluster (I have it on Amazon) using an Amazon AMI
3. Using Ganglia to monitor HBase
Some extra tips on how you can organize the web application:
You can set up your web servers on Amazon Web Services or IBM WebSphere.
You can set up your own HBase cluster using Cloudera, or use Amazon EC2 again here.
Communication between the web server and the HBase master node happens via the Thrift client.
You can generate Thrift code in your own desired programming language.
Here are some links that helped me:
A) Thrift Client
B) Filtering options
Along with these, I refer to the HBase Administration Cookbook by Yifeng Jiang and the HBase reference guide by Lars George in case I don't get answers on the web.
Filtering options provided by HBase are fast and accurate. Let's say you use HBase for storing your product details: you can have sub-stores and a column in your Product table which tells which store a product belongs to, then use Filters to get the products for a specific store.
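
As an illustration of that store filter, here is a hedged sketch using the HBase Java client (HBase 2.x API); the table name, column family, qualifier, and store id are all hypothetical:

    import org.apache.hadoop.hbase.CompareOperator;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ProductsByStore {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("Product"))) {
                // Keep only rows whose "store" column equals the wanted store id.
                SingleColumnValueFilter filter = new SingleColumnValueFilter(
                    Bytes.toBytes("info"), Bytes.toBytes("store"),
                    CompareOperator.EQUAL, Bytes.toBytes("store42"));
                Scan scan = new Scan();
                scan.setFilter(filter);
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result row : scanner) {
                        System.out.println(Bytes.toString(row.getRow()));
                    }
                }
            }
        }
    }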
I think you should read the article below:
"Apache HBase Do’s and Don’ts"
http://blog.cloudera.com/blog/2011/04/hbase-dos-and-donts/

EC2 database server failover strategy

I am planning to deploy my web app to EC2. I have several webserver instances. I have 1 primary database instance. I have 1 failover database instance. I need a strategy to redirect the webservers to the failover database instance IP when the primary database instance fails.
I was hoping I could use an Elastic IP in my connection strings. But, the webservers are not able to access/ping the Elastic IP. I have several brute force ideas to solve the problem. However, I am trying to find the most elegant solution possible.
I am using all .Net and SQL Server. My connection strings are encrypted.
Does anybody have a strategy for failing over a database instance in EC2 using some form of automation or DNS configuration?
Please let me know.
The article at http://alestic.com/2009/06/ec2-elastic-ip-internal tells you how to use the Elastic IP's public DNS name, which resolves to the internal IP when queried from inside EC2.
Haven't used EC2 but surely you need to either:
(a) put your front-end into some custom maintenance mode that you define while you switch the IP over, and have the front-end perform the steps required to manage potential data-integrity and data-loss issues (related to the previous server going down and the new server coming up) when it enters and leaves your custom maintenance mode;
OR, for a zero down-time system:
(b) design the system at the object/relational and transaction levels, from the ground up, to support zero-downtime failover. It's not something you can bolt on quickly to just any application.
(c) use some database support for automatic failover. I am unaware whether SQL Server failover support suitable for your application exists or is appropriate here. I suggest adding a "sql-server" tag to the question to reach the right audience.
If Elastic IPs don't work (which sounds odd, to say the least - shouldn't you talk to EC2 about that?), you may have to be able to instruct your front-end which new database IP to use at the same time as telling it to go from maintenance mode back to normal mode.
If you're willing to shell out a bit of extra money, take a look at RightScale's tools; they've built custom server images and supporting tools that handle database failover (among many other things). This link explains how to do it with MySQL, so it will hopefully show you some principles even though it doesn't use SQL Server.
I always thought there was this possibility in the connection string.
This is taken (but not yet tested) from How to add Failover Partner to a connection string in VB.NET:
If you connect with ADO.NET or the SQL Native Client to a database that is being mirrored, your application can take advantage of the driver's ability to automatically redirect connections when a database mirroring failover occurs. You must specify the initial principal server and database in the connection string, as well as the failover partner server.

Data Source=myServerAddress;Failover Partner=myMirrorServerAddress;
Initial Catalog=myDataBase;Integrated Security=True;

There are of course many other ways to write the connection string using database mirroring; this is just one example pointing out the failover functionality. You can combine this with the other connection string options available.
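
For what it's worth, the Microsoft JDBC driver for SQL Server exposes the same mirroring behavior through the failoverPartner connection property. A minimal sketch with placeholder server and database names (integratedSecurity=true additionally requires the driver's native authentication library on Windows):

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class MirroredConnection {
        public static void main(String[] args) throws Exception {
            // JDBC equivalent of the ADO.NET "Failover Partner" keyword;
            // server and database names are placeholders.
            String url = "jdbc:sqlserver://myServerAddress;"
                + "databaseName=myDataBase;"
                + "failoverPartner=myMirrorServerAddress;"
                + "integratedSecurity=true";
            try (Connection con = DriverManager.getConnection(url)) {
                // After a mirroring failover, the driver retries against
                // the failover partner on the next connection attempt.
                System.out.println("Connected to: " + con.getCatalog());
            }
        }
    }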
To broaden gareth's answer: cloud management software usually solves this type of problem. RightScale is one such tool, but you can also try enStratus or Scalr (disclaimer: I work at Scalr). These tools provide failover solutions like:
Backups: you can schedule automated snapshots of the EBS volume containing the data
Fault-tolerant database: in the event of failure, a slave is promoted to master; the mounted storage is switched over if the failed master and the new master are in the same AZ, or a snapshot of the volume is taken otherwise
If you want to build your own solution, you could replicate the process detailed below, which we use at Scalr:
1. Is there a slave in the same AZ? If so, promote it, switch the EBS volumes (which are limited to a single AZ), switch any Elastic IP you might have, and reconfigure replication of the remaining slaves.
2. If not, is there a slave fully replicated in another AZ? If so, promote it, then do the above.
3. If there is no slave in the same AZ and no slave fully replicated in another AZ, create a snapshot from the master's volume and use this snapshot to create a new volume in an AZ where a slave is running. Then do the above.
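
If it helps to see the branching spelled out, here is a purely hypothetical Java sketch of that decision tree; every type and helper in it is invented for illustration and is not Scalr's actual code:

    import java.util.List;
    import java.util.Optional;

    // Hypothetical sketch of the failover decision tree above; the Slave
    // record and the returned action strings are illustrative only.
    public class FailoverPlanner {

        record Slave(String id, String az, boolean fullyReplicated) {}

        String plan(String masterAz, List<Slave> slaves) {
            // 1. Prefer a slave in the master's AZ: EBS volumes are AZ-local,
            //    so they can simply be re-attached to the promoted slave.
            Optional<Slave> sameAz = slaves.stream()
                .filter(s -> s.az().equals(masterAz)).findFirst();
            if (sameAz.isPresent()) {
                return "promote " + sameAz.get().id()
                    + ", switch EBS volumes, move Elastic IP, reconfigure replication";
            }
            // 2. Otherwise, a fully replicated slave in another AZ.
            Optional<Slave> remote = slaves.stream()
                .filter(Slave::fullyReplicated).findFirst();
            if (remote.isPresent()) {
                return "promote " + remote.get().id() + ", then do the above";
            }
            // 3. Last resort: snapshot the master's volume and rebuild a
            //    volume in an AZ where some slave runs, then do the above.
            return "snapshot master volume, create volume in a slave's AZ, then promote";
        }
    }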
