Power Query: improving performance while remaining flexible in input

I have been researching query performance and have read that any unused column should be removed in order to optimise a query.
So I should insert a "remove columns" step in my query, which should improve performance.
However, I ran some tests and noticed that if a new file arrives in which some of the columns I remove in my steps are already missing (for one reason or another), my query breaks.
I was wondering how I could set up the query with more flexibility.
I see the following possibilities to improve my queries:
Using Table.SelectColumns rather than Table.RemoveColumns: but this makes my query horribly long.
Using the optional third parameter of Table.RemoveColumns. But I have trouble understanding the difference between the effects of MissingField.Ignore and MissingField.UseNull.
Here is a sample data:
|------------------|---------------------|-----------------|-------------|
| Wourf | Nap | Banzai | Hypernup |
|------------------|---------------------|-----------------|-------------|
| 0563 | Nap3343 | Picolo | Hyper |
|------------------|---------------------|-----------------|-------------|
| 0 | Coniglio | Grasso | Super |
|------------------|---------------------|-----------------|-------------|
| 0563 | Nao34 | Dimagritto | Hyper |
|------------------|---------------------|-----------------|-------------|
And here is some sample code. Typically, the "Wourf" or "Nap" columns could be removed, because they are not used in the query. But I could also get a data source with those columns already missing, which leads to errors:
#"Promoted Headers" = Table.PromoteHeaders(#"Changed Type", [PromoteAllScalars=true]),
#"Removed Columns" = Table.RemoveColumns(#"Promoted Headers",{"Nap", "Wourf"}),
Here is my current solution:
#"Promoted Headers" = Table.PromoteHeaders(#"Changed Type", [PromoteAllScalars=true]),
#"Removed Columns" = Table.RemoveColumns(#"Promoted Headers",{"Nap", "Wourf"}, MissingField.Ignore),
Is there a better solution that I do not know?
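To make the difference between the two modes concrete, here is a rough Python sketch (not M) that mimics how I understand the documented options: MissingField.Ignore skips columns that are not there, while MissingField.UseNull includes them as null values. The helper names and the list-of-dicts table are made up for illustration. For removal, the sketch assumes both modes end up with the same result, since the column is gone either way; the difference only shows when selecting columns.

```python
def _absent(rows, names):
    """Return the requested column names that exist in no row."""
    present = set().union(*(r.keys() for r in rows))
    return [n for n in names if n not in present]


def remove_columns(rows, unwanted, missing="error"):
    """Loose analogue of Table.RemoveColumns (hypothetical helper)."""
    absent = _absent(rows, unwanted)
    if absent and missing == "error":
        raise KeyError(f"columns not found: {absent}")
    # Assumption: "ignore" and "use_null" behave alike when removing.
    return [{k: v for k, v in r.items() if k not in unwanted} for r in rows]


def select_columns(rows, wanted, missing="error"):
    """Loose analogue of Table.SelectColumns (hypothetical helper)."""
    absent = _absent(rows, wanted)
    if absent and missing == "error":
        raise KeyError(f"columns not found: {absent}")
    if missing == "use_null":
        # Missing columns are created and filled with None.
        return [{c: r.get(c) for c in wanted} for r in rows]
    # "ignore": missing columns are silently left out.
    return [{c: r[c] for c in wanted if c in r} for r in rows]


# A source file that arrives without the "Wourf" column.
rows = [{"Nap": "Nap3343", "Banzai": "Picolo", "Hypernup": "Hyper"}]
print(remove_columns(rows, ["Nap", "Wourf"], missing="ignore"))
print(select_columns(rows, ["Banzai", "Wourf"], missing="use_null"))
```

If that reading is right, MissingField.Ignore is the natural choice for a removal step, and UseNull matters mostly when later steps expect a selected column to exist.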

Related

Using header names as a filter parameter in a dashboard

I have a data table that resembles the structure here:
| Prof | PI | Class |
|:----:|:------:|:-----:|
| Dr.K | Louisa | A |
| Dr.L | Jenny | B |
| Dr.X | Liu | C |
Filter 1: I'd like to create two dropdown, single-selection parameter filters, the first of which contains the headers of the columns. So, filter one would contain the option to select: Prof, PI, or Class.
Filter 2: The second filter would then dynamically change to represent values of the selected column. If a user chose "Prof" in Filter 1, Filter 2 would show: Dr. K, Dr. L, and Dr. X. The table in the dashboard would then reflect the chosen filters.
I believe choosing "only relevant values" on Filter 2 would take care of some of the issues, but I still don't understand how I can turn column headers into a list, and those values still retain the integrity of the original columns. Thank you for any help you can provide!
IF [Parameter 1] = "Prof" THEN [Prof] ELSEIF [Parameter 1] = "PI" THEN [PI] ELSEIF [Parameter 1] = "Class" THEN [Class] END

Crate.io: Facets for search?

Does https://crate.io support facets (for faceted search)?
I didn't find anything in the docs. ElasticSearch replaced facets with aggregations in 2014, but the aggregation section in the crate docs only talks about SQL aggregation functions.
My use case:
I've got a list of web sites; each record has a domain and a language field. When displaying the search results, I want to get a list of all domains that the search results appear in, as well as a list of all languages, ordered by number of occurrences, so search results can be narrowed down. The number of results for each single facet value should also be given.
Screenshot with facets:
There is no way to get the facets I want from crate itself.
Instead we're enabling the ElasticSearch REST API in crate.yml now
es.api.enabled: true
.. and can use the ElasticSearch aggregation API.
Crate doesn't support facets or Elasticsearch aggregations directly. Like you suggested, you can always turn on the Elasticsearch API. However, there are other ways to get these aggregations.
1) Have you considered issuing multiple queries to the cluster? For example, if you load your page dynamically with JavaScript, you can first return the search results and load the facets later. This should also decrease the overall response time of the application.
2) In CrateDB 2.1.x, there will be support for subqueries, which allow you to include the facets within your query:
select q1.id, q1.domain, q1.tag, q2.d_count, q3.t_count from websites q1,
(select domain, count(*) as d_count from websites where text like '%query%' group by domain) q2,
(select tag, count(*) as t_count from websites where text like '%query%' group by tag) q3
where q1.domain = q2.domain and q1.tag = q3.tag and q1.text like '%query%'
order by q1.id
limit 5;
This gives you a result table like this, where you have the search results alongside the domain and tag counts for the query:
+----+-------------+-------------+---------+---------+
| id | domain      | tag         | d_count | t_count |
+----+-------------+-------------+---------+---------+
|  1 | example.com | example     |       2 |       3 |
| 14 | crate.io    | software    |       1 |       4 |
| 17 | google.com  | search      |       5 |       2 |
| 29 | github.com  | open-source |       3 |       3 |
| 47 | linux.org   | software    |       2 |       4 |
+----+-------------+-------------+---------+---------+
Disclaimer: I'm new to Crate :)
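The "multiple queries" idea from point 1) can be sketched roughly as follows. This uses Python's built-in sqlite3 module as a stand-in for a Crate connection, so the table layout, the sample rows, and the facet_counts helper are all assumptions made for illustration; only the GROUP BY pattern is the point.

```python
import sqlite3

# In-memory SQLite stands in for the Crate cluster (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE websites (id INTEGER, domain TEXT, language TEXT, text TEXT)"
)
conn.executemany(
    "INSERT INTO websites VALUES (?, ?, ?, ?)",
    [
        (1, "example.com", "en", "query planner basics"),
        (2, "example.com", "de", "query tuning"),
        (3, "crate.io", "en", "distributed query engine"),
        (4, "linux.org", "en", "kernel news"),
    ],
)

FACET_COLUMNS = ("domain", "language")


def facet_counts(conn, column, term):
    """Count matching rows per facet value, most frequent first."""
    if column not in FACET_COLUMNS:  # column names cannot be bound as parameters
        raise ValueError(f"unknown facet column: {column}")
    sql = (
        f"SELECT {column}, count(*) AS n FROM websites "
        f"WHERE text LIKE ? GROUP BY {column} ORDER BY n DESC, {column}"
    )
    return conn.execute(sql, (f"%{term}%",)).fetchall()


# One extra query per facet, issued after the main search query.
domain_facets = facet_counts(conn, "domain", "query")
language_facets = facet_counts(conn, "language", "query")
print(domain_facets, language_facets)
```

Each facet costs one additional aggregate query, which is why loading them after the main result list keeps the first response fast.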

SELECT ... LIMIT 1 query results in more than one row?

I noticed that LIMIT queries will return more than the expected number of rows when they are executed against tables that contain nested or repeated data. For example, the following query run against the persons sample data set from the developer guide produces the following results:
% bq query 'SELECT fullName, children.name FROM [persons.person] LIMIT 1'
+----------+---------------+
| fullName | children_name |
+----------+---------------+
| John Doe | Jane |
| John Doe | John |
+----------+---------------+
It looks like BQL is applying the LIMIT operator before flattening the results as opposed to the other way around (which I think would make more sense).
Is this a bug in the BQL implementation or is this the expected behavior? If this is the expected behavior can someone please provide an explanation for why this makes sense?
This is expected given the way BigQuery flattens query results. When you run the query, the LIMIT 1 applies to the repeated record. Then the results get flattened in the output, and you get two rows. A workaround is to use an explicit flatten operation. For example:
SELECT fullName, children.name
FROM (FLATTEN([persons.person], children.name)) LIMIT 1
This will return only a single row.
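The ordering effect described above can be illustrated with a small Python sketch. This is only a model of the behaviour, not BigQuery code: the second record is invented, and the function names are hypothetical.

```python
from itertools import islice

# John Doe's children come from the question's output; the second record is made up.
persons = [
    {"fullName": "John Doe", "children": [{"name": "Jane"}, {"name": "John"}]},
    {"fullName": "Another Person", "children": [{"name": "Sam"}]},
]


def flatten(records):
    """Emit one (fullName, child name) row per repeated child."""
    for rec in records:
        for child in rec["children"]:
            yield (rec["fullName"], child["name"])


def limit_then_flatten(records, n):
    """LIMIT applies to the nested records, flattening happens on output."""
    return list(flatten(islice(records, n)))


def flatten_then_limit(records, n):
    """Explicit flatten first, so LIMIT counts the flattened rows."""
    return list(islice(flatten(records), n))


print(limit_then_flatten(persons, 1))  # two rows: one record, two children
print(flatten_then_limit(persons, 1))  # one row
```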

YUI DataTable nested columns with JSON object with unknown keys

I am pretty new to YUI and need some help.
I have a JSON response like this:
{
"Results":[
{
"alpha":57.935,
"beta":{
"delta":2.975,
"omega":1.431
},
"gamma":{
"theta":"0.339",
"lambda":"1.195"
}
},
{
"alpha":87,
"beta":{
"lambda":2.680,
"kappa":0.714
},
"gamma":{
"zeta":"0.288",
"epsilon":"0.289"
}
}
]
}
I would like to have a datatable with nested columns where:
1) alpha, beta and gamma are parent columns.
2) beta and gamma each have two columns formed of the JSON key-value pair (e.g., delta => 2.975).
3) The number of rows, i.e., total key-value pairs, is dynamic.
Basically, something like this:
----------------------------------------------
| alpha | beta | gamma |
----------------------------------------------
| 57.935 | delta | 2.975 | theta | 0.339 |
----------------------------------------------
| | omega | 1.431 | lambda | 1.195 |
----------------------------------------------
| 87.435 | lambda | 2.680 | zeta | 0.288 |
----------------------------------------------
| | kappa | 0.714 | epsilon | 0.289 |
----------------------------------------------
I have been able to generate non-nested, simple JSON responses.
My problems:
1) I have the object for each JSON child ({theta:0.339}, etc.). Both child columns will need data from this same object. How do I use it without modifying it? Should I use the same 'keyName' for both child columns in myColumnDefs?
2) How do I create more than one row where the alpha td is empty?
Any help will be appreciated !
This is not an easy problem to solve. Unless you are able to format the JSON into individual rows before it's sent to the client, you can hack together a solution using some column configurations, formatters, and a custom bodyView modelList attribute setter that flattens the data for display.
http://jsbin.com/3/efigim/1/edit?javascript,live
This would likely involve some breakage of table row -> data record associations since the bodyView's modelList contains its own Models for the rows rather than sharing a clientId. This may or may not get in your way, depending on whether you need additional features.
But since the DataTable's data ModelList preserves the objects for beta and gamma values--only the view's representation is customized--you might be fine.
YMMV, HTH
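For the other route mentioned above, formatting the JSON into individual rows before it is sent to the client, a server-side flattening step could look roughly like this. It is a Python sketch under the assumption that the server can reshape the response; the flat field names (beta_key, beta_value, and so on) are invented, and the DataTable column definitions would have to match whatever names you choose.

```python
from itertools import zip_longest

EMPTY = ("", "")  # placeholder pair for a missing key/value


def flatten_results(results):
    """Turn each record into one flat row per beta/gamma key-value pair."""
    rows = []
    for rec in results:
        beta = list(rec.get("beta", {}).items())
        gamma = list(rec.get("gamma", {}).items())
        # Pair up entries; keep one row even if both objects are empty.
        pairs = list(zip_longest(beta, gamma, fillvalue=EMPTY)) or [(EMPTY, EMPTY)]
        for i, (b, g) in enumerate(pairs):
            rows.append({
                "alpha": rec["alpha"] if i == 0 else "",  # only on the first row
                "beta_key": b[0], "beta_value": b[1],
                "gamma_key": g[0], "gamma_value": g[1],
            })
    return rows


results = [
    {
        "alpha": 57.935,
        "beta": {"delta": 2.975, "omega": 1.431},
        "gamma": {"theta": "0.339", "lambda": "1.195"},
    },
]
for row in flatten_results(results):
    print(row)
```

With rows shaped like this, each child column reads its own flat key, so the same nested object no longer has to feed two columns.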

Dynamic database design for variable length spatio-temporal data in Oracle (need a schema design)

Currently I am working on a research project where I need to store spatio-temporal data and analyze it efficiently. The exact requirements are given below.
The research concerns meteorological data, so the data attributes are temperature, humidity, pressure, wind speed, wind direction, etc. The number of attributes is not known in advance; depending on requirements we may need to add more attributes (a table with dynamic attributes of different data types). The data is captured at various locations and heights, over a certain time span and at a certain time interval.
So, what would be the best way to design a schema for this requirement? We must be able to find relations efficiently.
The purpose of the project is not only to store the data but also to manipulate it.
Sample data in table format -
location | time | height | pressure | temperature | wind-direction | ...
L1 | 2011-12-18 08:04:02 | 7 | 1009.6 | 28.3 | east | ...
L1 | 2011-12-18 08:04:02 | 15 | 1008.6 | 27.9 | east | ...
L1 | 2011-12-18 08:04:02 | 27 | 1007.4 | 27.4 | east | ...
L1 | 2011-12-18 08:04:04 | 7 | 1010.2 | 28.4 | north-east | ...
L1 | 2011-12-18 08:04:04 | 15 | 1009.4 | 28.2 | north-east | ...
L1 | 2011-12-18 08:04:04 | 27 | 1008.9 | 27.6 | north-east | ...
L2 | 2011-12-18 08:04:02 | ..... so on
Here I need to design a schema for the above sample data, where Location is a spatial location that can be implemented using the Oracle MDSYS.SDO_GEOMETRY type.
Constraints are:
1) The no of attributes (table column) is unknown during development. In runtime any new attribute (let say - humidity, refractive index etc.) can be added. So we can't design an attribute-specific table schema.
1.1) For this constraint I thought of using a schema like:
tbl_attributes(attr_id_pk, attr_name, attr_type);
tbl_data(loc, time, attr_id_fk, value);
In this design the attribute value must be of varchar type, and I would cast it as required (not a good idea at all).
But finding relational data with this schema is very difficult using SQL queries only. For example, I want to find:
1.1.1) the average pressure for location L1 when the wind direction is east and the temperature is between 27 and 28
1.1.2) the locations where pressure is maximum at height 15.
1.2) I am also thinking of editing the table schema at runtime, which again is not a good idea, I think.
2) We will use a loader application, which will take care of this dynamic insertion depending on the schema (whatever it may be).
3) We need to retrieve statistical data efficiently, as in the examples given above [1.1.*].
I am not completely sure I understand what you mean when you say that
The no of attributes (table column) is unknown during development. In
runtime any new attribute(let say - humidity, refractive index etc)
can be added.
First of all, I suppose that this is not really happening at random: i.e. when you get a new batch of data from the field you know (before importing) that it has an extra dimension or two. Correct?
Also, the fact that in this new data batch you get "refractive index" will not make the older data magically acquire a proper value for this dimension.
Therefore I would go for a classical object-to-RDBMS mapping where you have:
a header table with the things that exist for every measurement, i.e. time and space, possibly the source (i.e. the lab, sensor, or team which provided the data), and an autogenerated key.
one or more detail tables where the values are defined as proper fields.
Example:
Header
location | time | height | source |Key |
L1 | 2011-12-18 08:04:02 | 7 | team-1 | 002020013 |
L1 | 2011-12-18 08:04:02 | 15 | team-1 | 002020017 |
L1 | 2011-12-18 08:04:02 | 27 | Lab-X | 002020018 |
L1 | 2011-12-18 08:04:04 | 7 | Lab-Y | 002020021 |
L1 | 2011-12-18 08:04:04 | 15 | Lab-X | 002020112 |
Atmospheric data (basic)
Key | pressure | temp | wind-dir |
002020013 | 1009.6 | 28.3 | east |
002020017 | 1019.3 | 29.2 | east |
002020018 | 1011.6 | 26.9 | east |
Light-sensor data
Key | refractive-ind | albedo | Ultraviolet |
002020017 | 79.6 | .37865 | 7.0E-34 |
002020018 | 67.4 | .85955 | 6.5E-34 |
002020021 | 91.6 | .98494 | 8.1E-34 |
In other words: every different set of data will use one or more subtables (these you can add "dynamically", if needed), and you can still create queries by standard means. You will just have to join the subtables using the reference keys, where possible: e.g. if you want to analyze by wind direction AND refractive index, you can, but only for the data sets that have both values.
I believe this is more efficient than using text fields with CSV inside, data blobs, or key-value associations.
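As a rough illustration of the header/detail layout, here is a Python sketch using the built-in sqlite3 module in place of Oracle. The table and column names (header, atmospheric, meas_id) and the integer keys are assumptions; the measurement values are the first three rows of the sample data in the question. The query is example 1.1.1 from the question.

```python
import sqlite3

# SQLite stands in for Oracle here; names and keys are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE header (
    meas_id  INTEGER PRIMARY KEY,
    location TEXT, time TEXT, height REAL, source TEXT
);
CREATE TABLE atmospheric (
    meas_id  INTEGER PRIMARY KEY REFERENCES header(meas_id),
    pressure REAL, temp REAL, wind_dir TEXT
);
""")
conn.executemany("INSERT INTO header VALUES (?, ?, ?, ?, ?)", [
    (1, "L1", "2011-12-18 08:04:02", 7, "team-1"),
    (2, "L1", "2011-12-18 08:04:02", 15, "team-1"),
    (3, "L1", "2011-12-18 08:04:02", 27, "Lab-X"),
])
conn.executemany("INSERT INTO atmospheric VALUES (?, ?, ?, ?)", [
    (1, 1009.6, 28.3, "east"),
    (2, 1008.6, 27.9, "east"),
    (3, 1007.4, 27.4, "east"),
])

# Average pressure at L1 with east wind and temperature between 27 and 28.
avg_pressure = conn.execute("""
    SELECT avg(a.pressure)
    FROM header h
    JOIN atmospheric a ON a.meas_id = h.meas_id
    WHERE h.location = 'L1'
      AND a.wind_dir = 'east'
      AND a.temp BETWEEN 27 AND 28
""").fetchone()[0]
print(avg_pressure)
```

Because pressure and temp are typed columns, the query needs one join and no casts; a new sensor family would get its own detail table keyed by the same meas_id.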
I would definitely go with 1.2 (edit table schema during runtime), at least to begin with. Any sufficiently advanced configuration is indistinguishable from programming; don't think you can magically avoid making changes to your program.
Don't be scared of alter table. Yes, the upfront costs are higher - you may need a process (not just a program) to ensure your schema stays clean. And there are some potential locking problems (that have solutions). But if you do it right you only have to pay the price once for each change.
With a completely generic solution you will pay a small price with every query. Your queries will be complicated, slower, ugly, and more likely to fail. You can never write a query like select avg(value) ..., it may or may not work, depending on how the data is accessed. You can use a PL/SQL function to catch exceptions, or use inline views and hints to force a specific access pattern. Either way, your queries are more complicated and slower, and you have to make sure that everybody understands these problems before they use the data.
And with a generic solution the optimizer will suck because it knows nothing about your data. Oracle can't predict how many rows will be returned by where attr_name = 'temperature' and is_number(value) = 28.4. But it can make a very good guess for where temperature = 28.4. You may have significantly more bad plans (i.e. slow queries) with generic columns.
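To see what the generic design costs per query, here is a rough Python/sqlite3 sketch of the tbl_attributes/tbl_data schema from the question, answering the same "average pressure at L1, east wind, temperature between 27 and 28" example. SQLite stands in for Oracle, and a height column has been added to tbl_data as an assumption, because without it the three readings taken at the same location and time cannot be told apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_attributes (attr_id_pk INTEGER PRIMARY KEY, attr_name TEXT, attr_type TEXT);
-- height is an added assumption; every value is stored as text.
CREATE TABLE tbl_data (loc TEXT, time TEXT, height REAL, attr_id_fk INTEGER, value TEXT);
""")
conn.executemany("INSERT INTO tbl_attributes VALUES (?, ?, ?)", [
    (1, "pressure", "float"),
    (2, "temperature", "float"),
    (3, "wind-direction", "varchar"),
])
samples = [  # (height, pressure, temperature, wind direction) from the question
    (7, "1009.6", "28.3", "east"),
    (15, "1008.6", "27.9", "east"),
    (27, "1007.4", "27.4", "east"),
]
for height, pressure, temperature, wind in samples:
    for attr_id, value in ((1, pressure), (2, temperature), (3, wind)):
        conn.execute(
            "INSERT INTO tbl_data VALUES (?, ?, ?, ?, ?)",
            ("L1", "2011-12-18 08:04:02", height, attr_id, value),
        )

# One self-join per attribute involved, plus a cast for every numeric use.
eav_avg = conn.execute("""
    SELECT avg(CAST(p.value AS REAL))
    FROM tbl_data p
    JOIN tbl_data t ON t.loc = p.loc AND t.time = p.time AND t.height = p.height
    JOIN tbl_data w ON w.loc = p.loc AND w.time = p.time AND w.height = p.height
    WHERE p.attr_id_fk = (SELECT attr_id_pk FROM tbl_attributes WHERE attr_name = 'pressure')
      AND t.attr_id_fk = (SELECT attr_id_pk FROM tbl_attributes WHERE attr_name = 'temperature')
      AND w.attr_id_fk = (SELECT attr_id_pk FROM tbl_attributes WHERE attr_name = 'wind-direction')
      AND p.loc = 'L1'
      AND w.value = 'east'
      AND CAST(t.value AS REAL) BETWEEN 27 AND 28
""").fetchone()[0]
print(eav_avg)
```

Every extra attribute in a condition adds another self-join and another cast, which is the per-query price described above.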
Thank you for the quick response and good guidance. I have taken some concepts from both answers and decided to go with a mixed model. I don't know whether I am on the right path or not, so I would like comments on the model. Below I describe the complete conceptual model with a MySQL code snippet.
Conceptual model
For dynamicity (the number of columns is not defined in advance) I have created 4 tables as follows:
geolocation(locid int, name varchar, geometry spatial_type) - stores information about a particular location, which may be defined with a spatial feature.
met_loc_event(loceventid int, locid* int, record_time timestamp, height float) - identifies a particular event at a place and a certain height.
metfeatures(featureid int, name varchar, type varchar) - stores feature (i.e. column) details with a data type; the type field helps to cast data as required.
metstore(loceventid* int, featureid* int, value varchar) - stores an atomic value for a feature at a particular time.
Up to this point I have designed a column-oriented layout to store the dynamic nature of the table. But as you suggest, this is not a good design for querying the database (some things, like arithmetic functions, will not work). It is also not good if we consider performance.
For efficient querying (to avoid too much joining and to avoid casting values during a query) I extended the model with some helper views, and I wrote a stored procedure to generate the views from the stored data.
First I created a view for each feature (taking the values from the feature table, so initially the number of feature views equals the number of entries) with the help of the met_loc_event, metfeatures and metstore tables. These views hold locid, record_time, height, and the value cast according to the feature type.
Next, from these views, I created a row-oriented view named metrelview, which holds all the relational data row-wise, like a normal table. I plan to run queries against this view, so that query performance will improve.
This view-generation procedure needs to be executed whenever there is an insert, update or delete operation on the features table.
Below is the MySQL procedure that I have developed for the view generation:
CREATE PROCEDURE `buildModel`()
BEGIN
    DECLARE done INT DEFAULT FALSE;
    DECLARE fid INTEGER;
    DECLARE fname VARCHAR(45);
    DECLARE ftype VARCHAR(45);
    DECLARE cur_fatures CURSOR FOR SELECT `featureid`, `name`, `type` FROM `metfeatures`;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
    -- User variables take the @ prefix in MySQL (# would start a comment).
    SET @viewAlias = 'v_';
    SET @metRelView = "metrelview";
    SET @stmtCols = "";
    SET @stmtJoin = "";
    START TRANSACTION;
    OPEN cur_fatures;
    read_loop: LOOP
        FETCH cur_fatures INTO fid, fname, ftype;
        IF done THEN
            LEAVE read_loop;
        END IF;
        IF fname IS NOT NULL THEN
            SET @featureView = CONCAT(@viewAlias, LOWER(fname));
            -- Pick the cast expression that matches the declared feature type.
            IF ftype = 'float' THEN
                SET @featureCastStr = "`value`+0.0";
            ELSEIF ftype = 'int' THEN
                SET @featureCastStr = "CAST(`value` AS SIGNED)";
            ELSE
                SET @featureCastStr = "`value`";
            END IF;
            -- (Re)create the per-feature view.
            SET @stmtDeleteView = CONCAT("DROP VIEW IF EXISTS `", @featureView, "`");
            SET @stmtCreateView = CONCAT("CREATE VIEW `", @featureView, "` AS SELECT le.`loceventid` AS loceventid, le.`locid`, le.`rectime`, le.`height`, ", @featureCastStr, " AS value FROM `metlocevent` le JOIN `metstore` ms ON (le.`loceventid`=ms.`loceventid`) WHERE ms.`featureid`=", fid);
            PREPARE stmt FROM @stmtDeleteView;
            EXECUTE stmt;
            PREPARE stmt FROM @stmtCreateView;
            EXECUTE stmt;
            -- Collect the column list and joins for the combined view.
            SET @stmtCols = CONCAT(@stmtCols, ", ", @featureView, ".`value` AS ", @featureView);
            SET @stmtJoin = CONCAT(@stmtJoin, " ", "LEFT JOIN ", @featureView, " ON (le.`loceventid`=", @featureView,".`loceventid`)");
        END IF;
    END LOOP;
    -- (Re)create the row-oriented view that joins all feature views.
    SET @stmtDeleteView = CONCAT("DROP VIEW IF EXISTS `", @metRelView, "`");
    SET @stmtCreateView = CONCAT("CREATE VIEW `", @metRelView, "` AS SELECT le.`loceventid`, le.`locid`, le.`rectime`, le.`height`", @stmtCols, " FROM `metlocevent` le", @stmtJoin);
    PREPARE stmt FROM @stmtDeleteView;
    EXECUTE stmt;
    PREPARE stmt FROM @stmtCreateView;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
    CLOSE cur_fatures;
    COMMIT;
END;
N.B. - I tried to call the procedure on any event in the features table, so that everything would be automated. But as MySQL does not support dynamic queries in functions or triggers, I can't do it automatically.
I would also welcome criticism before I finalize this as the accepted model. I am not a DBA, so any help on how to improve the performance of the model would be very useful to me.
This sounds like a homework assignment whose underlying subject is: use-cases for abandoning strict normal-form design principles.
The solution to this conundrum is to develop a three-stage solution. Stage 1 is runtime adaptability using the flexible AttributeType, AttributeValue approach, so that rapidly incoming data can be captured and put somewhere temporarily in a quasi-structured manner. Stage 2 involves the analysis of that runtime data to see where the model must be extended with additional columns and validation tables to accommodate any new attributes. Stage 3 is the importing of the as-yet-unimported data into the revised model, which never relaxes its strict datatyping and declarative referential integrity constraints.
As they say: Life, friends, is a trade-off.

Resources