I would like to create a format in SAS that converts a float into text, e.g. 1.7 should be converted into 'one point seven'. The float contains only three symbols: a digit, a point, and a digit.
I know this could be solved by creating a data set containing all possible values and then building a format from that set (sketched below), but that approach doesn't satisfy me at all.
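For reference, a minimal sketch of that brute-force approach (the format name numtxt and control data set name are made up): build a control data set with all 100 possible values and feed it to PROC FORMAT via CNTLIN.

data ctrl;
  length start $ 3 label $ 20;
  retain fmtname 'numtxt' type 'N';
  array w{0:9} $ 5 _temporary_
    ('zero' 'one' 'two' 'three' 'four' 'five' 'six' 'seven' 'eight' 'nine');
  do i = 0 to 9;
    do j = 0 to 9;
      start = cats(i, '.', j);                  /* e.g. "1.7" */
      label = catx(' ', w{i}, 'point', w{j});   /* e.g. "one point seven" */
      output;
    end;
  end;
  keep fmtname type start label;
run;

proc format cntlin=ctrl;
run;

The resulting format can then be attached to a numeric variable with format x numtxt.; in a later step.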
I'm trying to match a string of numbers (like 370488004) using the typical INDEX MATCH formula. Unfortunately, in one range the numbers are formatted as plain text, and in the other range they are formatted differently, usually as 'Automatic' or 'Number'. I want to avoid having to update the formatting of both ranges whenever the values get updated (usually via a paste from an outside source), especially since it's not always going to be me doing the updating.
Is there a way I can write my INDEX MATCH formula so that it ignores the formatting of the values it's attempting to match?
The value returned by the INDEX formula can be in any format: plain text, number, it doesn't matter. The problem is that the two values I'm matching are in different formats. I need a formula that ignores their formatting.
Here's an example sheet:
https://docs.google.com/spreadsheets/d/1cwO7HGtwR4mRnAqcjxqr1qbhGwJHLjBKkp7-iwzkOqY/edit?usp=sharing
You can use VALUE or INT to force it into a numeric value, or TEXT if you want to keep it as text. An example would be:
=INDEX(VALUE(D1:E4),MATCH(G1,E1:E4,FALSE),1)
The numbers in column D are in fact text, but applying VALUE to the range first puts them all in number format. The formula finds the value associated with "Green" in G1. Without seeing a working example sheet, this is the best solution I can offer.
UPDATE:
You can use VLOOKUP as an array formula with a static range (an open-ended range causes an error), or QUERY to allow an open-ended range:
=ARRAYFORMULA(VLOOKUP(VALUE($G3:$G5),$B3:$C,2,FALSE))
=QUERY(FILTER($B3:$C,$B3:$B=VALUE($G3:$G)),"Select Col2")
I see that Parquet supports dictionary encoding on a per-column basis, and that dictionary encoding is described in the GitHub documentation:
Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)
The dictionary encoding builds a dictionary of values encountered in a given column. The dictionary will be stored in a dictionary page per column chunk. The values are stored as integers using the RLE/Bit-Packing Hybrid encoding. If the dictionary grows too big, whether in size or number of distinct values, the encoding will fall back to the plain encoding. The dictionary page is written first, before the data pages of the column chunk.
Dictionary page format: the entries in the dictionary - in dictionary order - using the plain encoding.
Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32), followed by the values encoded using RLE/Bit packed described above (with the given bit width).
Using the PLAIN_DICTIONARY enum value is deprecated in the Parquet 2.0 specification. Prefer using RLE_DICTIONARY in a data page and PLAIN in a dictionary page for Parquet 2.0+ files.
Okay... so how do I know when to use dictionary encoding and when not to?
Is there any rule of thumb to help? E.g., if 90% of the values in a column are expected to come from some particular small set, should I use it?
I have a use case where I expect three different scenarios for different columns:
integer column where all values lie within a very small set → seems perfect for dictionary encoding
integer column where 99% of values lie within a very small set but 1% are unlikely to form any clustering → not sure
string column where no value is likely to be the same → seems like dictionary encoding is a bad idea
Is there any documentation explaining which strategy is appropriate under various conditions?
I'm not aware of any documentation (on the Arrow side, at least) that recommends when to use dictionary encoding and when not to. It's a good question, and your instincts are reasonable; maybe you can try writing those kinds of data both ways and comparing file size and read/write speed. I'd be interested to see what you find.
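To make that experiment concrete, here is a rough sketch using pyarrow (file names and the synthetic data are made up; use_dictionary can also be given a list of column names to encode only specific columns):

import os
import pyarrow as pa
import pyarrow.parquet as pq

n = 100_000
table = pa.table({
    "small_set": [i % 5 for i in range(n)],                           # few distinct values
    "mostly_small_set": [i % 5 if i % 100 else i for i in range(n)],  # ~1% outliers
    "unique_str": [f"row-{i}" for i in range(n)],                     # all distinct
})

# Write the same data with and without dictionary encoding and compare sizes.
pq.write_table(table, "dict.parquet", use_dictionary=True)
pq.write_table(table, "plain.parquet", use_dictionary=False)

for path in ("dict.parquet", "plain.parquet"):
    print(path, os.path.getsize(path), "bytes")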
I am trying to publish data from our SAS environment into a remote Hadoop/Hive database (as sequence files). I'm performing basic tests by taking some source data from our business users and using a data step to write out to the Hadoop library.
I'm getting errors indicating that a value at row X is out of range.
For example:
ERROR: Value out of range for column BUY_RT1, type DECIMAL(5, 5). Disallowed value is: 0.
The source data has a numeric format of 6.5, and the actual value is .00000.
Why is .00000 out of range? Would the format for Hadoop need to be DECIMAL(6, 5)?
I get the same error when the value is 0.09:
ERROR: Value out of range for column INT_RT, type DECIMAL(5, 5). Disallowed value is: 0.09
You may need to check the actual values in SAS. If a numeric value in SAS has a format applied, you will see the formatted (possibly rounded) version of the numeric value wherever you output the value, but the underlying numeric may still have more significant digits that you're not seeing, due to the format.
For example, you say your source data has a format of 6.5 and the 'actual value' is 0.00000; are you sure that's the actual value? To check, you could try comparing the value to a literal 0, or putting the value to the SAS log with a different format like BEST32. (e.g. put BUY_RT1 best32.;).
If this is the problem, the solution is to properly round the source numeric values, rather than just applying a format.
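As a hedged sketch (the input data set name work.source is made up; BUY_RT1 is the column from the error message):

data work.rounded;
  set work.source;
  put BUY_RT1= best32.;                /* shows the full stored value, not the formatted one */
  BUY_RT1 = round(BUY_RT1, 0.00001);   /* round to 5 decimal places before loading to Hive */
run;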
I am using CDH 5.3.0 and Hive 0.12. I have a Hive table with columns defined as double.
I am loading data into these double columns from an HDFS sequence file, with two digits of precision after the decimal point. For example, in my HDFS sequence file the data looks like 100.23 or 345.00. I need to choose double because a data value can be as big as "3457894545.00".
My requirement is to display two digits of precision after the decimal point when querying the Hive table. So with the example data mentioned above, if I query this column I need to see the values as "100.23" or "345.00".
But with Hive 0.12, I am getting only a single digit of precision after the decimal point, i.e. the value is getting truncated to "100.2" or "345.0".
I tried the "decimal" data type with the syntax "decimal(3,2)", but in that case my value gets completely rounded off, i.e. "100" or "345".
I was googling to see if there is any option to define a custom precision for a double data type and found that custom precision can be given from Hive 0.13 onwards.
Does the Hive 0.12 double data type show only a single digit of precision after the decimal point? Do I need to apply any custom fix? Kindly suggest.
Thanks in advance.
You should declare the column as decimal(5,2).
The syntax is DECIMAL(precision, scale). Precision is the total number of digits in the number, including the digits after the decimal point.
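For example (table and column names are made up), DECIMAL(5,2) holds values up to 999.99 with two digits after the point; a value as large as 3457894545.00 from the question would need a wider precision, e.g. DECIMAL(12,2):

CREATE TABLE rates (
  rate DECIMAL(12,2)   -- 12 digits in total, 2 of them after the decimal point
);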
Here's another one for ya.
I have a column in a database that consists of a formatted string containing titles and values associated to those titles.
such as: "Genre : Science Fiction Style : Vintage StuffTitle : StuffValue"
Parsing the data is no problem: I do a split on the three spaces separating the groups and then another split on the colon (:) to get each title and value.
The problem is that I want Crystal to see this as a record so I can apply formatting to the section containing the values.
Assign the different pieces to variables in one formula, taking care of your type casting, etc., and then create other formulas that each return the value of one of these variables (one formula per piece of data you want to format separately). Then drop them all in the same report section.
For example, if your parsing formula is {#parse}, and in it
...
numbervar sample := <some parsed value>
stringvar another := <some other value>
...
to return and format the first individually:
evaluateafter({#parse});
numbervar sample
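and a matching formula, following the same pattern, to return the second:
evaluateafter({#parse});
stringvar another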