Multi Label Classification on Data Columns in Tables - algorithm

I am seeking guidance on a machine learning problem involving the tagging of data columns. Currently, I have a system where users can add multiple tags to a column in a table. However, I want to automate the tagging of new columns by using multi-label classification. I have extracted 21 features from each column by running a column analysis on the column values. The features include statistical values such as standard deviation, max, min, kurtosis, and so on. Am I on the right path in using these features as inputs for a multi-label classification model? Right now I am focusing on numeric values in columns.
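To make the setup concrete, here is a minimal sketch of the kind of pipeline I have in mind, assuming scikit-learn and a binary indicator matrix built from the user-assigned tags; the feature values and tag names below are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical training data: one row of 21 statistical features per column
# (std, min, max, kurtosis, ...) and the tags users assigned to that column.
X = np.array([
    [12.3, 0.0, 99.0, 1.2] + [0.0] * 17,    # e.g. an "age"-like column
    [5400.1, 10.0, 1e6, 7.8] + [0.0] * 17,  # e.g. a "revenue"-like column
])
tags = [["age", "demographic"], ["currency"]]

# Turn the tag lists into a binary indicator matrix for multi-label learning.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# One classifier per tag, each trained on the same 21 column features.
clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=100))
clf.fit(X, Y)

# Predict tags for a new, unseen column's feature vector.
new_column_features = np.array([[11.9, 1.0, 95.0, 1.1] + [0.0] * 17])
print(mlb.inverse_transform(clf.predict(new_column_features)))
```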

Related

Use label as metric in Grafana Prometheus source

Hello everyone. In Prometheus I have a metric where the amount is returned as a label and the metric value is the number of payments. How do I show the total amount per day on the dashboard, i.e. something like value_metric * sum?
As far as I know, there is no way to do that because labels aren't meant to be used in calculations. Labels and their values are essentially the index of Prometheus' NoSQL TSDB, they're used to create relations and join pieces of data together. You wouldn't store values and do math with column names of a relational database, would you?
Another problem is that labels with high cardinality greatly increase database size. Here is an excerpt from the Prometheus best practices:
CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.
Though I see that you use somewhat fixed values in labels, maybe a histogram would fit your needs.
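To sketch what that could look like: with the Python prometheus_client, each payment's amount can be observed into a histogram (the metric name and bucket boundaries below are made up), and the daily total then comes from the histogram's _sum series rather than from label math:

```python
from prometheus_client import Histogram, start_http_server

# Hypothetical histogram of payment amounts; every observation is one payment.
# Prometheus then exposes payment_amount_sum (total amount) and
# payment_amount_count (number of payments) automatically.
PAYMENT_AMOUNT = Histogram(
    "payment_amount",
    "Distribution of individual payment amounts",
    buckets=(100, 500, 1000, 5000, 10000, 50000),
)

def record_payment(amount: float) -> None:
    """Observe a single payment wherever payments are processed."""
    PAYMENT_AMOUNT.observe(amount)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    record_payment(1250.0)
```

A Grafana panel over this could then use something like increase(payment_amount_sum[1d]) for the total amount per day, and increase(payment_amount_count[1d]) for the number of payments.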

Interpret Google AutoML Online Prediction Results

We are using Google AutoML Tables with CSV files as input. We have imported the data, linked the schema with nullable columns, trained the model, then deployed it and used online prediction to predict the value of one column.
The column we targeted has values in the range 44-263 (min-max).
When we deployed and ran the online prediction, it returned values like this:
Prediction result
0.49457597732543945
95% prediction interval
[-8.209495544433594, 0.9892584085464478]
Most of the result set is in the above format. How can we convert it back to values in the range 44-263? We didn't find much documentation online about this.
We are looking for a documentation reference and help interpreting the results, along with the meaning of the 95% prediction interval.
Actually to clarify (I'm the PM of AutoML Tables)--
AutoML Tables does not do any normalization of the predicted values for your label data, so if you expect your label data to have a distribution of min/max 44-263, then the output predictions should also be in that range. Two possibilities would make it significantly different:
1) You selected the wrong label column
2) Your input features for this prediction are dramatically different from what was seen in the training data used.
Please feel free to reach out to cloud-automl-tables-discuss#googlegroups.com if you'd like us to help debug further.
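One rough way to sanity-check possibility (2) is to compare the rows you send for online prediction against the feature ranges seen in training; here is a quick pandas sketch, with made-up file names:

```python
import pandas as pd

# Hypothetical file names: the CSV imported into AutoML Tables for training,
# and the rows that were sent to the online prediction endpoint.
train = pd.read_csv("training_data.csv")
serving = pd.read_csv("prediction_requests.csv")

# Flag numeric features whose serving values fall outside the training range,
# i.e. inputs dramatically different from what the model saw during training.
numeric_cols = train.select_dtypes("number").columns.intersection(serving.columns)
for col in numeric_cols:
    lo, hi = train[col].min(), train[col].max()
    outside = serving[(serving[col] < lo) | (serving[col] > hi)]
    if not outside.empty:
        print(f"{col}: {len(outside)} prediction rows outside training range [{lo}, {hi}]")
```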

hbase design concat long key-value pairs vs many columns

Please help me understand the best way to store information in HBase.
Basically, I have a rowkey like hashed_uid+date+session_id with metrics like duration, date, time, location, depth and so on.
I have read a lot of material and I am a bit confused. People have suggested using fewer column families for better performance, so I am facing three options to choose from:
Have each metric sit in its own row, like rowkey_key cf1->alias1:value
Have many columns like rowkey cf1->key1:val1, cf1->key2:val2 ...
Have all the key-value pairs coded into one big string like rowkey cf1->"k1:v1,k2:v2,k3:v3..."
Thank you in advance. I don't know which to choose. The goal of my HBase design is to prepare for incremental windowing functions over a user profiling output, like percentiles, engagement and stat summaries for the last 60 days. Most likely, I will use Hive for that.
Possibly you are confused by the similar names of column family and column; these are different concepts in HBase. A column family consists of several columns. This design improves the speed of access when you only need to read certain types of columns. E.g., if raw data and processed data are stored in separate column families, reading the processed data will not touch the raw data. In practice you can have any number of columns per row key, but a single row must fit in one region, i.e. no more than 10 GB. The design depends on what you want:
The first variant has no alternative when you need to store so much data per row key that it cannot fit in one region (more than 10 GB).
The second is good when you need to read only a few metrics per single read of a row key.
The last variant is suitable when you always read all metrics in a single read of a row key.
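To make the second variant concrete, here is a rough sketch of such a wide row through the happybase Python client (which talks to HBase via the Thrift gateway); the host, table name, row key and qualifiers are made up for illustration:

```python
import happybase

# Hypothetical connection to an HBase Thrift gateway.
connection = happybase.Connection("hbase-thrift-host", port=9090)
table = connection.table("user_sessions")

# Second variant: one row per hashed_uid+date+session_id, each metric stored
# as its own column qualifier inside a single column family cf1.
row_key = b"a1b2c3d4_20240101_sess42"
table.put(row_key, {
    b"cf1:duration": b"354",
    b"cf1:depth": b"7",
    b"cf1:location": b"NYC",
})

# A single read then returns only the metrics you ask for, or all of them.
print(table.row(row_key, columns=[b"cf1:duration", b"cf1:depth"]))
```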

Looking to represent numbers better on Power BI

Is there any way we can reformat whole numbers in Power BI to separate the thousands, millions and billions using the normal comma separator?
For example: 1,047,890 is represented as 1047890 or 1.04M in Power BI, whereas I would like it to be represented as 1,047,890. Is there any way we can do that?
Those features are available in the Power BI Desktop tool, a free download from www.powerbi.com.
On the Data view you can set the default numeric format shown in tables, cards, tooltips, etc. On the Report view you can set the numeric format for chart axes, etc. (those are dynamic by default, based on the aggregated results).

MS Access - matching a small data set with a very large data set

I have a huge Excel file with more than a million rows and a bunch of columns (300) which I've imported into an Access database. I'm trying to run an inner join query on it which matches on a numeric field in a relatively small dataset. I would like to capture all the columns of data from the huge dataset if possible. I was able to get the query to run in about half an hour when I selected just one column from the huge dataset. However, when I select all the columns from the larger dataset and have the query write to a table, it just never finishes.
One consideration is that the smaller dataset's join field is a number, while the larger one's is text. To get around this, I created a query on the larger dataset which converts the text field to a number using the Val function. The text field in question is indexed, but I'm thinking I should convert the field on the table itself to a numeric type to match the smaller dataset. Maybe that would make the lookup more efficient.
Other than that, I could use, and would greatly appreciate, some suggestions for a good strategy to get this query to run in a reasonable amount of time.
Access is a relational database. It is designed to work efficiently if your structure respects the relational model. Volume is not the issue.
Step 1: normalize your data. If you don't have a clue what that means, there is a wizard in Access that can help you with this (Database Tools, Analyze Table), or search for database normalization.
Step 2: index the join fields
Step 3: enjoy fast results
Your idea of having both sides of the join use the same data type IS a must. If you don't do that, indexes and optimisation won't be able to operate.
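To illustrate step 2 and the type alignment, both changes can be applied once on the big table itself; here is a rough sketch through pyodbc with Access SQL, using made-up table, column and path names (it assumes every value in the join field is actually numeric, otherwise the type change will fail):

```python
import pyodbc

# Hypothetical Access database path and object names.
conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\data\big_import.accdb;"
)
cur = conn.cursor()

# Convert the text join field to a numeric type once, on the table itself,
# instead of wrapping it in Val() inside every query.
cur.execute("ALTER TABLE BigTable ALTER COLUMN JoinField LONG")

# Index the converted join field so the inner join can seek instead of scan.
cur.execute("CREATE INDEX idxJoinField ON BigTable (JoinField)")

conn.commit()
conn.close()
```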
