Exporting data into Excel using an iterative loop

I am doing an iterative calculation in Maple and I want to store the resulting data (which comes as a column matrix) from each iteration in a specific column of an Excel file. For example, my data is
mydat||1:= <<11,12,13,14>>:
mydat||2:= <<21,22,23,24>>:
mydat||3:= <<31,32,33,34>>:
and so on.
I am trying to export each of them to an Excel file, with each dataset stored in consecutive columns of the same file. For example, mydat||1 goes to column A, mydat||2 goes to column B, and so on. I tried something like the following.
with(ExcelTools):
for k from 1 to 3 do
Export(mydat||k, "data.xlsx", "Sheet1", "A:C"): #The problem is selecting the range.
end do:
How do I select the range appropriately here? Is there any other method to export the data and store it in the way that I explained above?

There are a couple of ways to do this. The easiest is certainly to put all of your data into one data structure and then export that. For example:
mydat1:= <<11,12,13,14>>:
mydat2:= <<21,22,23,24>>:
mydat3:= <<31,32,33,34>>:
mydata := Matrix( < mydat1 | mydat2 | mydat3 > );
This stores your data in a Matrix where mydat1 is the first column, mydat2 is the second column, etc. With the data in this form, either ExcelTools:-Export or the more generic Export command will work:
ExcelTools:-Export( mydata, "data.xlsx" );
Export( "data.xlsx", mydata );
Now since you mention that you are doing an iterative calculation, you may want to write the results out column by column. Here's another method that doesn't involve creating another data structure to house the results. This does assume that each mydati (mydat1, mydat2, mydat3) has been created before the loop.
for i to 3 do
ExcelTools:-Export( cat(`mydat`,i), "data.xlsx", 1, ["A1","B1","C1"][i] );
end do;
If you want to write the data out to a file as you are building it, then just do the Export call after the creation of each of the columns, i.e.
ExcelTools:-Export( mydat1, "data.xlsx", 1, "A1" );
Note that I removed the "||" characters. These are used in Maple for concatenation and caused some issues with the second method.

Related

Performance difference map() vs withColumn()

I have a table with over 100 columns. I need to remove double quotes from certain columns. I found two ways to do it, using withColumn() and map().
Using withColumn()
from pyspark.sql.functions import regexp_replace

cols_to_fix = ["col1", ..., "col20"]
for col in cols_to_fix:
    df = df.withColumn(col, regexp_replace(df[col], "\"", ""))
Using map()
import re

from pyspark.sql import Row

def remove_quotes(row: Row) -> Row:
    row_as_dict = row.asDict()
    cols_to_fix = ["col1", ..., "col20"]
    for column in cols_to_fix:
        if row_as_dict[column]:
            row_as_dict[column] = re.sub("\"", "", str(row_as_dict[column]))
    return Row(**row_as_dict)

df = df.rdd.map(remove_quotes).toDF(df.schema)
Here is my question: I found that using map() takes about 4 times longer than withColumn() on a table that has ~25M records. I would really appreciate it if a fellow Stack Overflow user could explain the reason for the performance difference, so that I can avoid a similar pitfall in the future.
Firstly, one piece of advice: do not convert the DataFrame to an RDD; just do df.map(your function here) directly (this is the Scala/Java Dataset API; PySpark DataFrames do not expose map). This may save a lot of time.
The following page would save us a lot of time:
https://dzone.com/articles/apache-spark-3-reasons-why-you-should-not-use-rdds
Its main conclusion is that RDDs are remarkably slower than DataFrames/Datasets, not to mention the time used for the conversion from DataFrame to RDD.
Now let's talk about map and withColumn without any conversion between DataFrame and RDD.
Conclusion first: map is usually 5x slower than withColumn. The reason is that a map operation always involves deserialization and serialization, while withColumn can operate directly on the column of interest.
To be specific, a map operation has to deserialize each Row into the individual values on which the operation will be carried out.
Here is an example: assume we have a DataFrame which looks like this:
+--------+-----------+
|language|users_count|
+--------+-----------+
| Java| 20000|
| Python| 100000|
| Scala| 3000|
+--------+-----------+
Now we want to increment all the values in column users_count by 1; we can do it like this:
df.map(row => {
  val usersCount = row.getInt(1) + 1
  (row.getString(0), usersCount)
}).toDF("language", "users_count_incremented_by_1")
In the code above, we first need to deserialize every row to extract the value in the 2nd column; after that, we output the modified values and save them as a DataFrame (this step requires serialization of (a, b) into Row(a, b), since a DataFrame is nothing but a Dataset of Rows).
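For comparison, here is a minimal PySpark sketch (PySpark being the API used in the question) of the same increment done with withColumn; the tiny DataFrame below is only an assumed stand-in for the table shown above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
# Hypothetical stand-in for the language/users_count table above
df = spark.createDataFrame(
    [("Java", 20000), ("Python", 100000), ("Scala", 3000)],
    ["language", "users_count"],
)

# withColumn works on a column expression; no per-row (de)serialization
# into user code is needed
incremented = df.withColumn("users_count", col("users_count") + 1)
incremented.show()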
For a more detailed explanation, check the following excellent article:
https://medium.com/@fqaiser94/udfs-vs-map-vs-custom-spark-native-functions-91ab2c154b44
map cannot operate on the column itself but has to operate on the values of the column; getting the values requires deserialization, and saving them as a DataFrame requires serialization.
But map is still of great use: with the map method you can implement very sophisticated operations, whereas only built-in operations are available if you just use withColumn.
To sum it up, map is slower but more flexible, while withColumn is surely the most efficient but its functionality is limited.
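As an illustration of that trade-off, here is a sketch of the question's quote-removal kept entirely in native column expressions, using a single select instead of a chain of withColumn calls. The toy DataFrame and the short cols_to_fix list are assumptions standing in for the real 100-column table:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace

spark = SparkSession.builder.getOrCreate()
# Hypothetical sample data; the real table has over 100 columns
df = spark.createDataFrame(
    [('"a"', '"b"', 1), ('"c"', 'd', 2)],
    ["col1", "col2", "other"],
)
cols_to_fix = ["col1", "col2"]  # placeholder for the question's 20 columns

# One projection instead of 20 chained withColumn calls; every column
# stays a native expression, so no Python function runs per row
cleaned = df.select(
    [regexp_replace(col(c), '"', "").alias(c) if c in cols_to_fix else col(c)
     for c in df.columns]
)
cleaned.show()

Each withColumn call also adds an internal projection to the plan, so for very wide tables a single select like this tends to keep the query plan smaller as well.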

AddField function does not work as expected

I have the following block of code that iterates through my tables and adds the fields of the current table in order to create a number of tableboxes.
'iterate through every table
For i=1 To arrTCount
'the arrFF array holds the names of the fields of each table
arrFF = Split(arrFields(i), ", ")
arrFFCount = UBound(arrFF)
'create a tablebox
Set TB = ActiveDocument.Sheets("Main").CreateTableBox
'iterate through the fields of the array
For j=0 to (arrFFCount - 1)
'add the field to the tablebox
TB.AddField arrFF(j)
'Msgbox(arrFF(j))
Next
Set tboxprop = TB.GetProperties
tboxprop.Layout.Frame.ObjectId = "TB" + CStr(i)
TB.SetProperties tboxprop
Next
The above code creates the tableboxes, but with one field fewer every time (the last one is missing). If I change the For loop from For j=0 To (arrFFCount - 1) to For j=0 To arrFFCount, it creates empty tableboxes and seems to execute forever. Regarding this change, I tested the field names with the Msgbox(arrFF(j)) command and it shows the correct field names, exactly as I want them to appear in the tableboxes in the QlikView UI.
Does anybody have an idea of what seems to be the problem here? Can I do this in a different way?
To clarify the situation and what I have tested so far: I have 11 tables to make tableboxes of, and I have tried with just one of them or with some of them. The result I am seeing with the code is on the left and what I am expecting to see is on the right of the following image. Please note that the number of fields varies for each table and the image shows just one of them as an example.

SAS proc IML error: Not enough memory to store all matrices

Good Morning,
I am trying to program the following simple function in SAS using proc iml, but I get the error "not enough memory to store all matrices". I am trying to read two matrices, one called "matriz_product" and the other "matriz_segment"; these tables have dimensions of 21 columns by more than 1,000,000 rows, and the values are characters. After reading these matrices, I want to create one vector from each of the tables, where the column picked in each row is the one specified in position (another vector that I read).
The code is the following:
proc iml;
use spain.Tabla_product;
read all var {a_def_prdt1 b_def_prdt2 c_def_prdt3 d_def_prdt4 e_def_prdt5 f_def_prdt6 g_def_prdt7 h_def_prdt8 i_def_prdt9 j_def_prdt10 k_def_prdt11 l_def_prdt12 m_def_prdt13 n_def_prdt14 o_def_prdt15 p_def_prdt16 q_def_prdt17 r_def_prdt18 s_def_prdt19 t_def_prdt20} into matrizProduct;
use spain.Tabla_segment;
read all var {a_def_sgmt1 b_def_sgmt2 c_def_sgmt3 d_def_sgmt4 e_def_sgmt5 f_def_sgmt6 g_def_sgmt7 h_def_sgmt8 i_def_sgmt9 j_def_sgmt10 k_def_sgmt11 l_def_sgmt12 m_def_sgmt13 n_def_sgmt14 o_def_sgmt15 p_def_sgmt16 q_def_sgmt17 r_def_sgmt18 s_def_sgmt19 t_def_sgmt20} into matrizsegment;
use spain.contratonodato;
read all var {posi} into position;
n=nrow(matrizsegment);
DEF_PRDT=j(n,1,"zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz");
DEF_SGMT=j(n,1,"zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz");
do i =1 to n;
DEF_PRDT[i,1]=matrizproduct[i,position[i]];
DEF_SGMT[i,1]=matrizsegment[i,position[i]];
end;
create contratosnodato_modi var {"DEF_SGMT" "DEF_PRDT"};
append;
run;
Thank you very much.
Base SAS reads row-by-row and so rarely runs out of memory. Proc IML reads the entire dataset into memory and so can easily run out of memory for larger datasets. For this reason, I only use proc IML when absolutely necessary (e.g. doing matrix multiplication), and when I do I will:
Chunk datasets into smaller pieces that will fit into memory, and do these sequentially.
Optimise algorithms to be able to run within constraints - for example, exploit the structure of a matrix I need to invert to avoid inverting the whole matrix.
Fortunately in this case you don't even appear to need proc IML at all - what you're trying to do can be done in a data step. Try this:
data contratosnodato_modi;
format DEF_PRDT $40. DEF_SGMT $40.;
set spain.Tabla_product;
set spain.Tabla_segment;
set spain.contratonodato;
array product {20} a_def_prdt1 b_def_prdt2 c_def_prdt3 d_def_prdt4 e_def_prdt5 f_def_prdt6 g_def_prdt7 h_def_prdt8 i_def_prdt9 j_def_prdt10 k_def_prdt11 l_def_prdt12 m_def_prdt13 n_def_prdt14 o_def_prdt15 p_def_prdt16 q_def_prdt17 r_def_prdt18 s_def_prdt19 t_def_prdt20;
array segment {20} a_def_sgmt1 b_def_sgmt2 c_def_sgmt3 d_def_sgmt4 e_def_sgmt5 f_def_sgmt6 g_def_sgmt7 h_def_sgmt8 i_def_sgmt9 j_def_sgmt10 k_def_sgmt11 l_def_sgmt12 m_def_sgmt13 n_def_sgmt14 o_def_sgmt15 p_def_sgmt16 q_def_sgmt17 r_def_sgmt18 s_def_sgmt19 t_def_sgmt20;
DEF_PRDT = product{posi};
DEF_SGMT = segment{posi};
keep DEF_PRDT DEF_SGMT;
run;
Here I'm reading the three datasets together, storing the columns of interest in arrays, and only accessing the column specified by posi from the position dataset.

ASP: How to insert contents of array to database?

NOTE: I don't need help with the generic concept of inserting data into a database, just with sorting through the contents of an array depending on the content of the "line", and with determining which "items" in the array correspond to which fields in the database.
I have a glob of data posted to me by a desktop application that I need to sort through. My old solution worked, but was far less than elegant (INSERT each line of the glob into the database, then query for it, re-INSERT, and delete the old rows).
How can I get the following chunk of information (POSTED to me as "f_data") into an array and insert the data into a database?
f_data Contents:
Open~notepad.exe~7/14/2011 2:28:46 PM~COMPUTER01
Open~mspaint.exe~7/14/2011 2:28:55 PM~COMPUTER01
Close~notepad.exe~7/14/2011 2:30:06 PM~COMPUTER01
Close~mspaint.exe~7/14/2011 2:30:06 PM~COMPUTER01
Session~7/14/2011~336~COMPUTER01
Startup~7/18/2011 11:23:12 AM~COMPUTER01
Please keep in mind that I have never used arrays before. 15 years of ASP and I've never had to use an array; how I've been so lucky I don't know, but I think one may be required for this solution. Here is my current code to put "f_data" into an array, with an example of what I want to do:
var_logdata = request.form("f_data")
arr_logdata = Split(var_logdata,"~")
for var_arrayitem = 0 to ubound(arr_logdata)
'Do some stuff here depending on the log type
'If type is "Open"
'insert to tb_applicationlog
'Elseif type is "Close"
'insert to tb_applicationlog
'Elseif type is "Session"
'insert to tb_sessions
'End if
next
What I don't know how to do is determine what "type" of log entry each item in the array is. If you look at the code above, I need to insert into different tables in the database depending on the "type" of log entry. For example, an "Open" or "Close" entry goes into the tb_applicationlog table. Once I determine what type the log entry is, how do I align the items in the array "row" with fields in the database?
Thanks very much in advance,
Beems
I think it would be better to split the posted data on another character (the newline) first, and then split the fields of each line in the resulting array using '~', as below (code not tested):
var_logdata = request.form("f_data")
arr_logdata = Split(var_logdata,vbCrLf)
'split request.form("f_data") using newline so we have an array containing each line
for var_arrayitem = 0 to ubound(arr_logdata)
'now we can split each line by "~"
arr_linelogdata = Split(arr_logdata(var_arrayitem),"~")
'now arr_linelogdata(0) is log type, arr_linelogdata(1) is next field etc
'linetype = arr_linelogdata(0) etc
'use variables derived from array to do what you need to
next

How can I use the map datatype in Apache Pig?

I'd like to use Apache Pig to build a large key -> value mapping, look things up in the map, and iterate over the keys. However, there does not even seem to be syntax for doing these things; I've checked the manual, wiki, sample code, Elephant book, Google, and even tried parsing the parser source. Every single example loads map literals from a file... and then never uses them. How can you use Pig's maps?
First, there doesn't seem to be a way to load a 2-column CSV file into a map directly. If I have a simple map.csv:
1,2
3,4
5,6
And I try to load it as a map:
m = load 'map.csv' using PigStorage(',') as (M: []);
dump m;
I get three empty tuples:
()
()
()
So I try to load tuples and then generate the map:
m = load 'map.csv' using PigStorage(',') as (key:chararray, val:chararray);
b = foreach m generate [key#val];
ERROR 1000: Error during parsing. Encountered " "[" "[ "" at line 1, column 24.
...
Many variations on the syntax also fail (e.g., generate [$0#$1]).
OK, so I munge my map into Pig's map literal format as map.pig:
[1#2]
[3#4]
[5#6]
And load it up:
m = load 'map.pig' as (M: []);
Now let's load up some keys and try lookups:
k = load 'keys.csv' as (key);
dump k;
3
5
1
c = foreach k generate m#key; /* Or m[key], or... what? */
ERROR 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
Hrm, OK, maybe since there are two relations involved, we need a join:
c = join k by key, m by /* ...um, what? */ $0;
dump c;
ERROR 1068: Using Map as key not supported.
c = join k by key, m by m#key;
dump c;
Error 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
Fail. How do I refer to the key (or value) of a map? The map schema syntax doesn't seem to let you even name the key and value (the mailing list says there's no way to assign types).
Finally, I'd just like to be able to find all the keys in my map:
d = foreach m generate ...oh, forget it.
Is Pig's map type half-baked? What am I missing?
Currently, Pig maps need the key to be a chararray (string) constant that you supply, not a variable containing a string. So in map#key, the key has to be a constant string that you supply (e.g. map#'keyvalue').
The typical use case is to load a complex data structure in which one of the elements is a key/value pair; later, in a FOREACH statement, you can refer to a particular value based on the key you are interested in.
http://pig.apache.org/docs/r0.9.1/basic.html#map-schema
In Pig version 0.10.0 there is a new function available called "TOMAP" (http://pig.apache.org/docs/r0.10.0/func.html#tomap) that converts its odd-numbered (chararray) parameters to keys and its even-numbered parameters to values. Unfortunately I haven't found it to be that useful, though, since I typically deal with arbitrary dicts of varying lengths and keys.
I would find a TOMAP function that took a tuple as a single argument, instead of a variable number of parameters, to be much more useful.
This isn't a complete solution to your problem, but the availability of TOMAP gives you some more options for your constructing a real solution.
Great question!
I personally do not like maps in Pig. They have a place in traditional programming languages like Java, C#, etc., where it's really handy and fast to look up a key in a map. On the other hand, maps in Pig have very limited features.
As you rightly pointed out, one cannot look up a variable key in a map in Pig. The key needs to be a constant, e.g. myMap#'keyFoo' is allowed but myMap#$SOME_VARIABLE is not.
If you think about it, you do not need a map in Pig. One usually loads the data from some source, transforms it, joins it with some other dataset, filters it, transforms it, and so on. JOIN actually does a good job of looking up variable keys in the data.
e.g. data1 has 2 columns, A and B, and data2 has 3 columns, X, Y, Z. If you join data1 BY A with data2 BY Z, JOIN does the work of a map (in the traditional-language sense) that maps a value of column Z to a value of column B (via column A). So data1 essentially represents a map A -> B.
So why do we need Map in Pig?
Usually Hadoop data is dumped from different data sources written in traditional languages. If the original data sources contain maps, the HDFS data will contain corresponding maps.
How can one handle the Map data?
There are really 2 use cases:
Map keys are constants.
e.g. HTTP request header data contains time, server, and clientIp as the keys in the map. To access the value of a particular key, one can use a constant key,
e.g. header#'clientIp'.
Map keys are variables.
In these cases, you would most probably want to JOIN the map keys with some other data set. I usually convert the map to a bag using a UDF, MapToBag, which converts the map data into a bag of 2-field tuples (key, value). Once the map data is converted to a bag of tuples, it's easy to join it with other data sets.
I hope this helps.
1) If you want to load map data, it should look like "[programming#SQL,rdbms#Oracle]".
2) If you want to load tuple data, it should look like "(first_name_1234,middle_initial_1234,last_name_1234)".
3) If you want to load bag data, it should look like "{(project_4567_1),(project_4567_2),(project_4567_3)}".
My file pigtest.csv looks like this:
1234|emp_1234#company.com|(first_name_1234,middle_initial_1234,last_name_1234)|{(project_1234_1),(project_1234_2),(project_1234_3)}|[programming#SQL,rdbms#Oracle]
4567|emp_4567#company.com|(first_name_4567,middle_initial_4567,last_name_4567)|{(project_4567_1),(project_4567_2),(project_4567_3)}|[programming#Java,OS#Linux]
My schema:
a = LOAD 'pigtest.csv' using PigStorage('|') AS (employee_id:int, email:chararray, name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray), project_list:bag{project: tuple(project_name:chararray)}, skills:map[chararray]) ;
b = FOREACH a GENERATE employee_id, email, name.first_name, project_list, skills#'programming' ;
dump b;
I think you need to think in terms of relations, where the map is just one field of one record. Then you can apply operations on the relations, like joining the two sets, data and mapping:
Input
$ cat data.txt
1
2
3
4
5
$ cat mapping.txt
1 2
2 4
3 6
4 8
5 10
Pig
mapping = LOAD 'mapping.txt' AS (key:CHARARRAY, value:CHARARRAY);
data = LOAD 'data.txt' AS (value:CHARARRAY);
-- list keys
mapping_keys =
FOREACH mapping
GENERATE key;
DUMP mapping_keys;
-- join mapping to data
mapped_data =
JOIN mapping BY key, data BY value;
DUMP mapped_data;
Output
> # keys
(1)
(2)
(3)
(4)
(5)
> # mapped data
(1,2,1)
(2,4,2)
(3,6,3)
(4,8,4)
(5,10,5)
This answer could also help you if you just want to do a simple look up:
pass-a-relation-to-a-pig-udf-when-using-foreach-on-another-relation
You can load up any data, then convert and store it in key/value format to read back later:
data = LOAD 'somedata.csv' USING PigStorage(',');
STORE data INTO 'folder' USING PigStorage('#');
Then read it back as map data.
