SparkR reading in column as binary instead of string

I have a table in Impala/Hive with a column that is defined as a string:
name, type
tdate, string
area, int
(for example).
When I read in the Parquet file that this is based on:
df<-parquetFile(sqlContext,'/path/to/main/folder')
df
SparkR reports the string column as binary:
DataFrame[tdate:binary, area:int]
How do I get around this?
Thanks!

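I found the solution to the problem. Set the following option before reading the Parquet file:
sql(sqlContext,'SET spark.sql.parquet.binaryAsString=true')
Impala and Hive do not distinguish between binary data and strings when they write the Parquet schema. This flag tells Spark SQL to interpret binary columns as strings, so tdate is read back as a string.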


Can anyone explain the difference between Uuid::generate and DB::generateKey?

Without thinking too hard about it I created a column of type [UUID] and successfully stored "strings" (as noted in the documentation, and generally referred to as a separate type altogether) returned from DB::generateKey in it.
Feels like I've done something I shouldn't have.
Can anyone shed some light on this? Thanks in advance.
The main difference is that they return different types.
For clarity, DB::generateKey is equivalent to Uuid::generate |> toString
According to the standard library docs, the signatures are:
Uuid::generate() -> UUID
Generate a new UUID v4 according to RFC 4122
DB::generateKey() -> Str
Returns a random key suitable for use as a DB key
I believe the UUID type is a bitstring representation, that is, a specific sequence of bits in memory.
Whereas the Str type is a string representation of a UUID.
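As a rough illustration of the same distinction in another language, here is a minimal sketch using Python's standard uuid module (an analogy only, not the platform discussed above). A generated UUID is a 128-bit value with its own type, and converting it to text gives the familiar 36-character string:

```python
import uuid

# A UUID value: 128 bits with a dedicated type (analogous to Uuid::generate).
key = uuid.uuid4()

# Its canonical text form (analogous to what DB::generateKey returns).
key_text = str(key)

print(type(key).__name__, len(key.bytes))      # UUID 16
print(type(key_text).__name__, len(key_text))  # str 36

# The string parses back to an equal UUID, so the two forms carry the same information.
assert uuid.UUID(key_text) == key
```

This is why storing the string form in a UUID column can work: the text is just another encoding of the same 128 bits, provided the storage layer is willing to parse it.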

AX 2012 Database - BaseEnum to String

I am trying to code a "while select" statement.
I have a table "CarBrandTable" with two fields:
CarBrandId and Countries.
CarBrandId is a string.
Countries is a base enum.
Now I want to retrieve the data with a select statement.
When I try to print the value with info(carBrandTable.countries);
the compiler says: "Argument 'txt' is incompatible with the required type"
I know that my base enum is not a string and that I have to convert it somehow,
but I am having trouble doing so.
Does anybody have a tip for me?
Thanks in advance
Wrap the enum value in strFmt so that info receives a string:
info(strFmt('%1', carBrandTable.countries));
Another way is the enum2str function:
info(enum2str(carBrandTable.countries));

InputFormat Decision

I am trying to figure out which of the given answers best suits this question:
Given a directory of files with the following structure: line number,
tab character, string:
Example:
1abialkjfjkaoasdfjksdlkjhqweroij
2kadfjhuwqounahagtnbvaswslmnbfgy
3kjfteiomndscxeqalkzhtopedkfsikj
You want to send each line as one record to your Mapper. Which
InputFormat should you use to complete the line:
conf.setInputFormat(____.class);
A. SequenceFileAsTextInputFormat
B. SequenceFileInputFormat
C. KeyValueFileInputFormat
D. BDBInputFormat
My analysis:
Option A is a format that I found exists, but I'm not sure how it is used correctly or whether it fits as an answer.
Option B is not possible, since SequenceFiles are files of binary (K,V) pairs and so are not suitable for plain text input.
Option C is not possible because there is no KeyValueFileInputFormat. However, if it is a typo and is actually KeyValueTextInputFormat, then I think it would be a good choice. Or wouldn't it?
Option D is not possible because there is no BDBInputFormat, and even if it is a typo and is actually BDInputFormat, it wouldn't suit this case.
Thank You!
The answer is Option C.
KeyValueTextInputFormat splits each line at the first TAB character.
So the line number will be the key and the string will be the value.
As you guessed, Option C probably contains a typo and should be KeyValueTextInputFormat: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/KeyValueTextInputFormat.html.
For more details, see: How to specify KeyValueTextInputFormat Separator in Hadoop-.20 api?
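To make the splitting rule concrete, here is a minimal Python sketch of what happens to each input line. It is an illustration of the behaviour only, not Hadoop code; the helper name split_key_value is hypothetical. The line is cut at the first separator, and a line with no separator becomes a key with an empty value:

```python
def split_key_value(line, separator="\t"):
    """Split a line at the first separator into (key, value).

    Hypothetical helper that mimics the per-line rule described above.
    """
    key, _, value = line.rstrip("\n").partition(separator)
    return key, value


lines = [
    "1\tabialkjfjkaoasdfjksdlkjhqweroij",
    "2\tkadfjhuwqounahagtnbvaswslmnbfgy",
    "3\tkjfteiomndscxeqalkzhtopedkfsikj",
]

for line in lines:
    key, value = split_key_value(line)
    print(key, "->", value)
```

Only the first TAB matters, so any further TAB characters stay inside the value.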

Dealing with huge data

Let's assume that I have a big file (500GB+) and a record type
Sample that represents one row of that file:
data Sample = Sample
  { field1 :: Int
  , field2 :: Int
  }
Now, which data structure is suitable for processing
(filter/map/fold) a collection of these Sample values? Don
Stewart has answered here that the collection should not be
represented as a list [Sample] but as a Vector. My question is: how
does representing it as a Vector solve the problem? Won't
a vector of Sample values holding the whole file also
occupy around 500GB?
What is the recommended method for solving this type of problem?
As far as I can see, the operations you want to use (filter, map and fold) can be done via both conduit (see Data.Conduit.List) and pipes (see Pipes.Prelude).
Both libraries are perfectly capable of manipulating/folding and filtering streaming data. Depending on your scenario they might solve your actual problem.
If, however, you need to inspect values several times, you're better off loading chunks into a vector, as @Don said.
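The streaming idea can be sketched in a few lines. The example below uses Python generators rather than conduit or pipes, purely as an analogy; the whitespace-separated line format and the function names are assumptions made for illustration. Each record is parsed, filtered, and folded one at a time, so memory use stays constant regardless of file size:

```python
import io
from typing import Iterable, Iterator, NamedTuple


class Sample(NamedTuple):
    field1: int
    field2: int


def read_samples(lines: Iterable[str]) -> Iterator[Sample]:
    """Lazily parse one Sample per line (assumed format: two integers)."""
    for line in lines:
        a, b = line.split()
        yield Sample(int(a), int(b))


def sum_field2_where_field1_even(samples: Iterable[Sample]) -> int:
    """Filter, map, and fold in a single pass without building a list."""
    return sum(s.field2 for s in samples if s.field1 % 2 == 0)


# A small in-memory stand-in for the real file; a file object opened
# with open(...) is iterated line by line in the same way.
fake_file = io.StringIO("1 10\n2 20\n3 30\n4 40\n")
total = sum_field2_where_field1_even(read_samples(fake_file))
print(total)  # 60
```

Because nothing is retained after a record has been folded into the running total, this only works for single-pass computations, which is the trade-off the answer above describes.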

Pass data from workspace to a function

I created a GUI and used uiimport to import a dataset into the MATLAB workspace. I would like to pass this imported data to another function in MATLAB. How do I do that? I tried the code below, but it does not pick up the data from the MATLAB workspace. Any ideas?
[file_input, pathname] = uigetfile( ...
{'*.txt', 'Text (*.txt)'; ...
'*.xls', 'Excel (*.xls)'; ...
'*.*', 'All Files (*.*)'}, ...
'Select files');
uiimport(file_input);
M = dlmread(file_input);
X = fread(M);
I think that you need to assign the result of this statement:
uiimport(file_input);
to a variable, like this
dataset = uiimport(file_input);
and then pass that to your next function:
M = dlmread(dataset);
This is a very basic feature of Matlab, which suggests to me that you would find it valuable to read some of the on-line help and some of the documentation for Matlab. When you've done that you'll probably find neater and quicker ways of doing this.
EDIT: Well, @Tim, if all else fails RTFM. So I did, and my previous answer is incorrect. What you need to pass to dlmread is the name of the file to read. So, you either use uiimport or dlmread to read the file, but not both. Which one you use depends on what you are trying to do and on the format of the input file. So, go RTFM and I'll do the same. If you are still having trouble, update your question and provide details of the contents of the file.
Your script uses three different ways to read the file. Choose one of them depending on your file format. But first I would combine the file name with the path:
file_input = fullfile(pathname,file_input);
I wouldn't use UIIMPORT in a script, since the user can change how the data is read, and the resulting variable name depends on the file name and on the user.
With DLMREAD you can only read numerical data from the file. You can also skip a number of rows or columns with
M = dlmread(file_input,'\t',1,1);
which skips the first row and the first column on the left.
Alternatively, you can define a range in an Excel-like style. See the DLMREAD documentation for more details.
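For readers who want to see what the row and column offsets do, here is a small Python sketch of the same idea. It is an analogy, not MATLAB code; the function name read_numeric and the sample data are hypothetical. It reads tab-delimited text and drops the requested number of leading rows and columns before converting the rest to numbers:

```python
import csv
import io


def read_numeric(text_stream, delimiter="\t", skip_rows=0, skip_cols=0):
    """Read delimited numeric text, skipping leading rows and columns."""
    rows = list(csv.reader(text_stream, delimiter=delimiter))[skip_rows:]
    return [[float(cell) for cell in row[skip_cols:]] for row in rows]


# Hypothetical file contents: a header row and an id column, then numbers.
sample = io.StringIO("id\ta\tb\n1\t0.5\t1.5\n2\t2.5\t3.5\n")
matrix = read_numeric(sample, skip_rows=1, skip_cols=1)
print(matrix)  # [[0.5, 1.5], [2.5, 3.5]]
```

Skipping the header row and the id column leaves only numeric cells, which matches the restriction that this kind of reader handles numerical data only.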
The filename you pass to DLMREAD must be a string. Don't pass a file handle or any data; if the argument is not a string, you will get the error "Filename must be a string".
FREAD reads data from a binary file. See the documentation if you really have to do it.
There are many other functions to read the data from file. If you still have problems, show us an example of your file format, so we can suggest the best way to read it.
