How to extract a number that has a unit in Mathematica? - wolfram-mathematica

If the table data is set up as a value plus a unit,
for example 1. "Kilos" or 34300. "GBP",
how can I get just the number using a functional approach or pattern matching?

A couple of suggestions here: map First or QuantityMagnitude over the values. QuantityMagnitude is the documented way to extract the number from a Quantity expression.

Related

The column of the CSV file in Google AutoML Tables is recognised as text or categorical instead of numeric, as I would like

I tried to train a model using Google AutoML Tables, but I have the following problem.
The CSV file is imported correctly; it has 2 columns and about 1870 rows, all numeric.
The system recognises only one of the columns as numeric, not the other.
The problematic column has 5 digits in each row, separated by spaces.
Is there anything I should do so that the system properly recognises the data as numeric?
Thanks in advance for your help.
The issue is with the definition of the Numeric data type: a numeric value needs to be comparable (greater than, smaller than, equal to).
Two different lists of numbers are not comparable; for example, 2 4 7 is not comparable to 1 5 7. To solve this without using strings, and therefore without losing the "information" in those numbers, you have several options.
For example:
Create an array of numbers by wrapping the values of the second column in [ ]. Take into consideration the relative weighting that AutoML Tables applies to the Array data type, as it may affect the "information" extracted from the sequence.
Create an additional column for every entry of the second column, so that each one holds a single number and is therefore truly numeric.
I would personally go for the second option.
If you are afraid of losing "information" by splitting the numbers, keep in mind that after training the model should deduce by itself the importance of position and whatever other "information" those number sequences might contain (mean, norm/modulus, relative increase, ...), provided the training data is representative.
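If you go with the second option, a minimal sketch of the preprocessing step, assuming pandas and made-up column names ("target", "sequence"), could look like this:

    import pandas as pd

    # Toy frame standing in for the CSV: one genuinely numeric column and one
    # column of 5 space-separated digits (the one AutoML Tables flags as text).
    df = pd.DataFrame({
        "target": [0.1, 0.2],
        "sequence": ["2 4 7 1 5", "1 5 7 3 2"],
    })

    # Split the space-separated column into one numeric column per position.
    parts = df["sequence"].str.split(expand=True).astype(float)
    parts.columns = ["sequence_{}".format(i + 1) for i in range(parts.shape[1])]

    # Replace the original column and write a CSV to re-import into AutoML Tables.
    df = pd.concat([df.drop(columns="sequence"), parts], axis=1)
    print(df.dtypes)                             # every column is now numeric
    df.to_csv("training_data.csv", index=False)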

Fast algorithm for approximate lookup on multiple keys

I have formulated a solution to a problem where I am storing parameters in a set of tables, and I want to be able to look up the parameters based on multiple criteria.
For example, if criteria 1 and criteria 2 can each be either A or B, then I'd have four potential parameters - one for each combination A&A, A&B, B&A and B&B. For these sorts of criteria I could concatenate the fields or something similar and create a unique key to look up each value quickly.
Unfortunately not all of my criteria are like this. Some of the criteria are numerical and I only care about whether or not a result sits above or below a boundary. That also wouldn't be a problem on its own - I could maybe use a binary search or something relatively quick to find the nearest key above or below my value.
My problem is I need to include a number of each in the same table. In other words, I could have three criteria - two with A/B entries, and one with less-than-x/greater-than-x type entries, where x is in no way fixed. So in this example I would have a table with 8 entries. I can't just do a binary search for the boundary because the closest boundary won't necessarily be applicable due to the other criteria. For example, if the first two criteria are A&B, then the closest boundary might be 100, but if the first two criteria are A&A, the closest boundary might be 50. If I want to look up A, A, 101, then I want it to recognise that 50 is the closest boundary that applies - not 100.
I have a procedure to do the lookup, but it gets very slow as the tables get bigger. It basically goes through each criterion, checks whether a match is still possible, and if so looks at the next criterion; if not, it moves on to the next entry in the table. In other words, my procedure cycles through the table entries one by one, checking each for a match. I have tried to optimise that by ensuring the tables that are input to the procedure are as small as possible and by making sure it looks at the criteria that are least likely to match first (so that it rejects each entry as quickly as possible), but it is still very slow.
The biggest tables are maybe 200 rows with about 10 criteria to check, but many are much smaller (maybe 10x5). The issue is that I need to call the procedure many times during my application, so algorithms with some initial overhead don't necessarily make things better. I do have some scope to change the format of the tables before runtime but I would like to keep away from that as much as possible (while recognising it may be the only way forward).
I've done quite a bit of research but I haven't had any luck. Does anyone know of any algorithms that have been designed to tackle this kind of problem? I was really hoping that there would be some clever hash function or something that means I won't have to cycle through the tables, but from my limited knowledge something like that would struggle here. I feel confident that I understand the problem well enough to gradually optimise the solution I have at the moment, but I want to be sure I've not missed a much better solution.
Apologies for the very long and abstract description of the problem - hopefully it's clear what I'm trying to do. I'll amend my question if it's unclear.
Thanks for any help.
This is basically what a query optimizer does in SQL land. There are fast, free, in-memory databases for exactly this purpose. Check out SQLite: https://www.sqlite.org/inmemorydb.html.
It sounds like you are doing what is called a 'full table scan' for each query, which is the last resort for a query optimizer.
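A minimal sketch of that suggestion, using Python's built-in sqlite3 module with an in-memory database (the table layout, the sample data, and the "largest boundary not above the value" interpretation are assumptions for illustration):

    import sqlite3

    # In-memory table standing in for one of the parameter tables.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE params (
            crit1 TEXT,   -- first A/B criterion
            crit2 TEXT,   -- second A/B criterion
            bound REAL,   -- boundary for the numeric criterion
            value REAL    -- the parameter to look up
        )""")
    conn.executemany(
        "INSERT INTO params VALUES (?, ?, ?, ?)",
        [("A", "A", 50.0, 1.1), ("A", "A", 150.0, 1.2),
         ("A", "B", 100.0, 2.1), ("A", "B", 200.0, 2.2)])

    # A composite index lets SQLite answer lookups without a full table scan.
    conn.execute("CREATE INDEX idx_params ON params (crit1, crit2, bound)")

    # Lookup for (A, A, 101): the largest boundary not above 101 among the
    # rows whose categorical criteria match.
    row = conn.execute(
        """SELECT bound, value FROM params
           WHERE crit1 = ? AND crit2 = ? AND bound <= ?
           ORDER BY bound DESC LIMIT 1""",
        ("A", "A", 101.0)).fetchone()
    print(row)  # (50.0, 1.1): 50 is the closest applicable boundary, not 100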
As I understand it, you want to select entries by criteria like
A & not B & x1 >= lower_x1 & x1 < upper_x1 & x2 >= lower_x2 & x2 < upper_x2 & ...
The easiest way is to keep the entries sorted by each possible xi, where i = 1, 2, ..., in separate sets, and to have separate 'words' for the various combinations of A, B, ...
The search works as follows:
Select the proper word by the combination of Boolean criteria
For each i, find the population of the lower_xi..upper_xi range in the corresponding set (this operation is O(log N))
Select i where the population is the lowest
While iterating the instances in the lower_xi..upper_xi range, filter the results by checking the other upper/lower bound criteria (for all xj where j != i)
Note that this is a general solution. Of course, if you know some relation between your bounds, you may use a list sorted by the respective combination(s) of item values.
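A sketch of that scheme in Python (the entries, names, and half-open bound convention below are made up for illustration):

    import bisect
    from collections import defaultdict

    # Each entry has Boolean criteria packed into a tuple (the 'word'),
    # numeric values (x1, x2), and a name to return.
    entries = {
        "p1": ((True, False), (10.0, 5.0)),
        "p2": ((True, False), (20.0, 7.0)),
        "p3": ((True, True), (15.0, 2.0)),
        "p4": ((False, False), (30.0, 9.0)),
    }

    # For every Boolean word, keep one list per numeric criterion, sorted by value.
    index = defaultdict(lambda: ([], []))
    for name, (word, xs) in entries.items():
        for i, x in enumerate(xs):
            index[word][i].append((x, name))
    for lists in index.values():
        for lst in lists:
            lst.sort()

    def lookup(word, bounds):
        """bounds = ((lower_x1, upper_x1), (lower_x2, upper_x2)), half-open."""
        lists = index.get(word)
        if lists is None:
            return []
        # Population of each lower..upper range via binary search, O(log N) each.
        spans = []
        for i, (lst, (lo, hi)) in enumerate(zip(lists, bounds)):
            left = bisect.bisect_left(lst, (lo, ""))
            right = bisect.bisect_left(lst, (hi, ""))
            spans.append((right - left, i, left, right))
        # Iterate the least-populated range, checking the other bounds per entry.
        _, i, left, right = min(spans)
        hits = []
        for _, name in lists[i][left:right]:
            xs = entries[name][1]
            if all(lo <= xs[j] < hi
                   for j, (lo, hi) in enumerate(bounds) if j != i):
                hits.append(name)
        return hits

    print(lookup((True, False), ((5.0, 25.0), (6.0, 8.0))))  # ['p2']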

Find all the rows given the number of matching attributes and the query

Below is my problem definition:
Given a database D in which each row has m categorical attributes, and given a query, which is a vector of m categorical attributes, together with a number of matches k: how can I efficiently find all the row ids whose number of attributes matching the query is greater than or equal to k?
The easier version (I think) is: given a vector of at most m categorical attributes, how to find the ids of all the rows that match those attributes.
In some of the questions (e.g. this one), they need to scan the whole database every time a query comes in. I think this is not fast enough, though I am not actually sure about the complexity.
If it is possible, I want to avoid scanning all the rows in the database. Therefore, I am thinking of building some kind of index, but I am wondering if there is any existing work on this?
In addition, is there a problem similar to this and what is it called? I want to take a look.
Thank you very much for your help.
(Regarding the coding, I mainly code in Python 2.7 for this.)
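For what it's worth, the kind of index hinted at above can be sketched as an inverted index from (attribute position, value) pairs to row ids; the data and names below are made up, and this is only one possible approach, not an answer recorded in the thread:

    from collections import Counter, defaultdict

    # Map every (attribute position, value) pair to the set of row ids that
    # have that value.
    rows = [
        ("red", "small", "metal"),
        ("red", "large", "wood"),
        ("blue", "small", "wood"),
    ]

    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        for pos, value in enumerate(row):
            index[(pos, value)].add(row_id)

    def query(attrs, k):
        """Row ids matching at least k of the m attribute values in attrs."""
        matches = Counter()
        for pos, value in enumerate(attrs):
            for row_id in index.get((pos, value), ()):
                matches[row_id] += 1
        return [row_id for row_id, count in matches.items() if count >= k]

    # Only the posting lists for the query's own values are touched, so the
    # whole database is not scanned when the values are selective.
    print(query(("red", "small", "wood"), 2))  # [0, 1, 2]: each row matches 2 attributes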

Optimizing algorithms for multiple queries of the same kind

There is a particular class of algorithm coding problems which require us to evaluate multiple queries, which can be of two kinds:
Perform search over a range of data
Update the data over a given range
One example which I've recently been working on (though not the only one) is this: Quadrant Queries
Now, to optimize my algorithm, I have had one idea:
I can use dynamic programming to keep the search results for a particular range, and generate data for other ranges as required.
For example, if I have to calculate the sum of the numbers in an array from index 4 to 7, I can already keep the sum of elements up to 4 and the sum of elements up to 7, which is easy, and then I just need the difference of the two plus the 4th element, which is O(1). But this raises another problem: during an update operation, I would have to update my stored search data for all the elements following the updated element. This seems inefficient, though I have not tried it in practice.
Someone suggested that I can combine subsequent update operations using some special data structure. (I actually read it on some forum.)
Question: Is there a known way to optimize these kinds of problems? Is there a special data structure that does it? As for the idea I mentioned: is it possible that it might be more efficient than the direct approach? Should I try it out?
This might help:
Segment Trees (the range-update / range-query part)
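For reference, here is a minimal sketch of a segment tree with lazy propagation, supporting range updates (add a value to every element in a range) and range-sum queries, both in O(log n); this avoids the O(n) update cost of the plain prefix-sum idea described in the question. The class and method names are my own:

    class SegmentTree:
        """Range-add / range-sum segment tree with lazy propagation (a sketch)."""

        def __init__(self, data):
            self.n = len(data)
            self.sums = [0] * (4 * self.n)
            self.lazy = [0] * (4 * self.n)
            self._build(1, 0, self.n - 1, data)

        def _build(self, node, lo, hi, data):
            if lo == hi:
                self.sums[node] = data[lo]
                return
            mid = (lo + hi) // 2
            self._build(2 * node, lo, mid, data)
            self._build(2 * node + 1, mid + 1, hi, data)
            self.sums[node] = self.sums[2 * node] + self.sums[2 * node + 1]

        def _push(self, node, lo, hi):
            # Push a pending range-add down to the two children.
            if self.lazy[node]:
                mid = (lo + hi) // 2
                for child, clo, chi in ((2 * node, lo, mid), (2 * node + 1, mid + 1, hi)):
                    self.sums[child] += self.lazy[node] * (chi - clo + 1)
                    self.lazy[child] += self.lazy[node]
                self.lazy[node] = 0

        def update(self, l, r, delta, node=1, lo=0, hi=None):
            """Add delta to every element in the inclusive range [l, r]."""
            if hi is None:
                hi = self.n - 1
            if r < lo or hi < l:
                return
            if l <= lo and hi <= r:
                self.sums[node] += delta * (hi - lo + 1)
                self.lazy[node] += delta
                return
            self._push(node, lo, hi)
            mid = (lo + hi) // 2
            self.update(l, r, delta, 2 * node, lo, mid)
            self.update(l, r, delta, 2 * node + 1, mid + 1, hi)
            self.sums[node] = self.sums[2 * node] + self.sums[2 * node + 1]

        def query(self, l, r, node=1, lo=0, hi=None):
            """Return the sum of the elements in the inclusive range [l, r]."""
            if hi is None:
                hi = self.n - 1
            if r < lo or hi < l:
                return 0
            if l <= lo and hi <= r:
                return self.sums[node]
            self._push(node, lo, hi)
            mid = (lo + hi) // 2
            return (self.query(l, r, 2 * node, lo, mid) +
                    self.query(l, r, 2 * node + 1, mid + 1, hi))

    st = SegmentTree([1, 2, 3, 4, 5])
    print(st.query(1, 3))   # 9
    st.update(0, 4, 10)     # add 10 to every element
    print(st.query(1, 3))   # 39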

Prefix similarity search

I am trying to find a way to build a fuzzy search where both the text database and the queries may have spelling variants. In particular, the text database is material collected from the web and would likely not benefit from a full-text engine's prep phase (word stemming).
I could imagine using pg_trgm as a starting point and then validating hits with Levenshtein distance.
However, people tend to do prefix queries. E.g., in the realm of music, I would expect "beetho symphony" to be a reasonable search term. So, if someone were typing "betho symphony", is there a reasonable way (using PostgreSQL, with perhaps Tcl or Perl scripting) to discover that the "betho" part should be compared with "beetho" (returning an edit distance of 1)?
What I ended up with is a simple modification of the common algorithm: normally you would just pick up the last value from the matrix (or from the final vector of the vector pair). Referring to the "iterative" algorithm at http://en.wikipedia.org/wiki/Levenshtein_distance, I put the string to be probed as the first argument and the query string as the second one. Now, when the algorithm finishes, the minimum value in the result column gives the proper result.
Sample results:
query "fantas", words in database "fantasy", "fantastic" => 0
query "fantas", wor in database "fan" => 3
The inputs to the edit distance are words selected from a "most words" list based on trigram similarity.
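A sketch of that modification in Python (the function name is mine): it fills the standard Levenshtein matrix and then takes the minimum of the final column, i.e. the cheapest way to consume the whole query against any prefix of the database word, instead of the bottom-right cell:

    def prefix_levenshtein(word, query):
        """Edit distance between query and the best-matching prefix of word."""
        m, n = len(word), len(query)
        # d[i][j] = distance between word[:i] and query[:j]
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if word[i - 1] == query[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return min(d[i][n] for i in range(m + 1))

    print(prefix_levenshtein("fantasy", "fantas"))    # 0
    print(prefix_levenshtein("fantastic", "fantas"))  # 0
    print(prefix_levenshtein("fan", "fantas"))        # 3
    print(prefix_levenshtein("beetho", "betho"))      # 1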
You can modify the edit distance algorithm to give a lower weight to the latter part of the string.
E.g.: Match(i,j) = 1/max(i,j)^2 instead of Match(i,j) = 1 for every i and j (i and j are the positions of the symbols you are comparing).
What this does is make dist('ABCD', 'ABCE') < dist('ABCD', 'EBCD').
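A sketch of that weighting in Python; here the 1/max(i,j)^2 weight is applied to insertions and deletions as well as to mismatches, which is one possible reading of the suggestion:

    def weighted_levenshtein(a, b):
        """Levenshtein variant where differences later in the strings cost less."""
        m, n = len(a), len(b)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = d[i - 1][0] + 1.0 / i ** 2
        for j in range(1, n + 1):
            d[0][j] = d[0][j - 1] + 1.0 / j ** 2
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                w = 1.0 / max(i, j) ** 2           # position-dependent weight
                cost = 0.0 if a[i - 1] == b[j - 1] else w
                d[i][j] = min(d[i - 1][j] + w,         # deletion
                              d[i][j - 1] + w,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[m][n]

    print(weighted_levenshtein("ABCD", "ABCE"))  # 0.0625: mismatch at the end
    print(weighted_levenshtein("ABCD", "EBCD"))  # 1.0: mismatch at the start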
