JCL sort to divide a mainframe dataset

I am trying to divide a mainframe PS dataset into several datasets.
For example, if I have a dataset containing 600 records, I want to divide it into 6 files with 100 records each. Is it possible to do this using JCL sort?

The JCL below uses DFSORT to split the SORTIN DD evenly across 3 output datasets (OUT1, OUT2 and OUT3). To split across 6, add 3 more output DD statements and add them to the FNAMES operand.
//SPLIT EXEC PGM=ICEMAN
//SYSOUT DD SYSOUT=*
//SORTIN DD DSN=Y897797.INPUT1,DISP=OLD
//OUT1 DD DSN=Y897797.SPLIT1,DISP=(NEW,CATLG),
// SPACE=(CYL,(5,5)),UNIT=SYSDA
//OUT2 DD DSN=Y897797.SPLIT2,DISP=(NEW,CATLG),
// SPACE=(CYL,(5,5)),UNIT=SYSDA
//OUT3 DD DSN=Y897797.SPLIT3,DISP=(NEW,CATLG),
// SPACE=(CYL,(5,5)),UNIT=SYSDA
//SYSIN DD *
 SORT FIELDS=(21,5,FS,A)
 OUTFIL FNAMES=(OUT1,OUT2,OUT3),SPLIT
/*
SORT FIELDS=(21,5,FS,A) is how you want the SORTIN dataset sorted; here is what this FIELDS statement means:
21 - beginning of the field to be sorted
5 - length of the field to be sorted
FS - floating sign (signed numeric)
A - ascending order
The DFSORT Getting Started manual and Smart DFSORT Tricks have lots of useful examples, including a couple of other ways to split the records out of a dataset.

SPLIT is just SPLIT; you can't associate it with a number.
SPLITBY=n will "rotate" n records between each OUTFIL dataset specified. SPLIT is the same as SPLITBY=1.
SPLIT1R=n will only carry out one "rotation": n records are written to the first OUTFIL dataset, then n to the second OUTFIL, and so on until the final OUTFIL dataset is reached, which receives all remaining records from the input, no matter how many.
OUTFIL FILES=OUT1 is not permissible. It should be OUTFIL FNAMES=(OUT1,OUT2,OUT3),SPLIT.
If using STARTREC/ENDREC or INCLUDE/OMIT, OUTFIL SAVE can be used to establish a file for records that are not written to any of the other OUTFIL datasets.
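If it helps to see the distribution rules outside of DFSORT, here is a small Python model of the three options (an illustration of the record rotation only, not DFSORT itself):
# Illustrative model of how DFSORT's OUTFIL SPLIT, SPLITBY=n and SPLIT1R=n
# distribute input records across the named OUTFIL datasets.

def split_by(records, n_outputs, n=1):
    """SPLITBY=n (plain SPLIT is SPLITBY=1): rotate n records at a time
    across the outputs, wrapping around until the input is exhausted."""
    outputs = [[] for _ in range(n_outputs)]
    for i, rec in enumerate(records):
        outputs[(i // n) % n_outputs].append(rec)
    return outputs

def split_1r(records, n_outputs, n):
    """SPLIT1R=n: one rotation only - n records to the first output, n to the
    second, and so on; the final output receives all remaining records."""
    outputs = [[] for _ in range(n_outputs)]
    for i, rec in enumerate(records):
        outputs[min(i // n, n_outputs - 1)].append(rec)
    return outputs

if __name__ == "__main__":
    recs = list(range(1, 601))              # the question's 600-record input
    counts = [len(o) for o in split_by(recs, 6, n=100)]
    print(counts)                           # [100, 100, 100, 100, 100, 100]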

Deuian's SORT card splits the input file across the output files equally. If we have 3 output files, for instance, each output file ends up with one third of the input records.
We can also specify the count by which the split should happen, as below. This writes the input file to the output files 10,000 records at a time. Say, for example, we have 40,000 records in the input file and we are dividing them into 3 output files: the fourth block of 10,000 rotates back to the first output, so OUT1 ends up with 20,000 records and OUT2 and OUT3 with 10,000 each.
 OUTFIL FNAMES=(OUT1,OUT2,OUT3),SPLITBY=10000
In a nutshell, we can make use of it when we do not have any constraint on the output record counts. When we do have such criteria while splitting, the piece of code below helps:
 SORT FIELDS=COPY
 OUTFIL FNAMES=OUT1,ENDREC=10000
 OUTFIL FNAMES=OUT2,STARTREC=10001,ENDREC=20000
 OUTFIL FNAMES=OUT3,STARTREC=20001,ENDREC=30000
Lastly, if we have more than 30,000 records in the input file and we didn't specify what to do with those records, SORT will not bother about them: the last output file holds only the 10,000 records in its range and anything beyond record 30,000 is simply not written anywhere.
Hope this makes it clear. Do get back in case of further questions.

Suppose you don't know how many records are in a data set, but you want to divide the records as equally as possible between two output data sets. You can use OUTFIL's SPLIT parameter to put the first record into OUTPUT1, the second record into OUTPUT2, the third record into OUTPUT1, the fourth record into OUTPUT2, and so on until you run out of records. SPLIT splits the records one at a time among the data sets specified by FNAMES. The following statements split the records between two OUTFIL data sets:
OPTION COPY
OUTFIL FNAMES=(OUTPUT1,OUTPUT2),SPLIT
With 17 input records, the results produced for OUTPUT1 are:
Record 01
Record 03
Record 05
Record 07
Record 09
Record 11
Record 13
Record 15
Record 17
Similarly, OUTFIL's SPLITBY=n parameter splits the records n at a time among the data sets specified by FNAMES. The following statements split the records four at a time between three OUTFIL data sets:
OPTION COPY
OUTFIL FNAMES=(OUT1,OUT2,OUT3),SPLITBY=4
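For a concrete picture of that rotation, here is a quick Python check of the 17-record, SPLITBY=4 case (illustration only):
# Model of OUTFIL FNAMES=(OUT1,OUT2,OUT3),SPLITBY=4 with 17 input records.
outputs = {"OUT1": [], "OUT2": [], "OUT3": []}
for i in range(17):                              # records 1..17
    dd = ("OUT1", "OUT2", "OUT3")[(i // 4) % 3]  # rotate every 4 records
    outputs[dd].append(i + 1)
print(outputs)
# {'OUT1': [1, 2, 3, 4, 13, 14, 15, 16], 'OUT2': [5, 6, 7, 8, 17], 'OUT3': [9, 10, 11, 12]}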

Related

Improve performance of wide GroupBy + write

I need to tune a job that looks like below.
import pyspark.sql.functions as F
dimensions = ["d1", "d2", "d3"]
measures = ["m1", "m2", "m3"]
expressions = [F.sum(m).alias(m) for m in measures]
# Aggregation
aggregate = (spark.table("input_table")
             .groupBy(*dimensions)
             .agg(*expressions))
# Write out summary table
aggregate.write.format("delta").mode("overwrite").save("output_table")
The input table contains transactions, partitioned by date, 8 files per date.
It has 108 columns and roughly half a billion records. The aggregated result has 37 columns and ~20 million records.
I am unable to make any sort of improvement in the runtime whatever I do, so I would like to understand what affects the performance of this aggregation, i.e. what I can potentially change.
The only thing that seems to help is to manually partition the work, i.e. starting multiple concurrent copies of the same code but with different date ranges.
To the best of my understanding, the groupBy clause currently doesn't include the "date" column, so you are actually aggregating over all dates in the query and not using the input table's partitioning at all.
You can add the "date" column to the groupBy clause, and then you will sum up the measures for each date.
Also, as for the input_table, when it is built, you can additionally partition it by d1, d2, d3 (or at least some of them) if they don't have high cardinality.
Finally, the input_table will benefit from a columnar file format (Parquet), so you won't have to do I/O on all 108 columns if you are using something like CSV. You are presumably already using something like Parquet, but just in case.
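A sketch of those suggestions applied to the job above (assuming the partition column is literally named "date"; adjust to the real column name):
import pyspark.sql.functions as F

dimensions = ["date", "d1", "d2", "d3"]   # "date" added so measures are summed per date
measures = ["m1", "m2", "m3"]
expressions = [F.sum(m).alias(m) for m in measures]

aggregate = (spark.table("input_table")
             .groupBy(*dimensions)
             .agg(*expressions))

# Partition the summary table by date as well, so a re-run for a date range
# only has to touch the affected partitions downstream.
(aggregate.write
          .format("delta")
          .mode("overwrite")
          .partitionBy("date")
          .save("output_table"))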

How to select highly variable genes in bulk RNA seq data?

As a pre-processing step, I need to select the top 1000 highly variable genes (rows) from bulk RNA-seq data which contains about 60k genes across 100 different samples (columns). The column values already contain the mean of the triplicates. The table contains normalized values in FPKM. (Note: I don't have access to raw counts and am not able to use common R packages, as these packages take raw counts as input.)
In this case, what is the best way to select the top 1000 variable genes?
I have tried to filter out the genes using the rowSums() function (to remove the genes with lower row-sum values) and narrowed it down from 60k genes to 10k genes, but I am not sure if that is the right way to select highly variable genes. Any input is appreciated.
The row sum is a first filtration step. After this, your data can be filtered by a log2 fold-change cutoff and an adjusted p-value (0.05 or 0.01, depending on your goal). You can repeat this pathway with different row-sum cutoffs to see the results. I personally discard genes whose row sum is zero.
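If it helps, one common alternative when only normalized values are available is to rank genes by the variance of their log-transformed FPKM across samples and take the top 1000. A minimal pandas sketch (the file and variable names here are made up for illustration):
import numpy as np
import pandas as pd

# Genes as rows, samples as columns, values in FPKM (triplicates already averaged).
fpkm = pd.read_csv("fpkm_matrix.csv", index_col=0)   # hypothetical input file

# Optional pre-filter: drop genes that are essentially unexpressed everywhere.
fpkm = fpkm[fpkm.sum(axis=1) > 0]

# Log-transform so a handful of very highly expressed genes do not dominate
# the variance, then rank genes by their variance across samples.
log_fpkm = np.log2(fpkm + 1)
top1000 = log_fpkm.var(axis=1).sort_values(ascending=False).head(1000).index

hvg_matrix = fpkm.loc[top1000]                       # top 1000 highly variable genes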

Parquet file with uneven columns

I'm trying to figure out how to write a parquet file where the columns do not contain the same number of rows per Row Group. For example, my first column might be a value sampled at 10Hz, while my second column may be a value sampled at only 5Hz. I'd rather not repeat values in the slower column since this can lead to computational errors. However, I cannot write columns of two different sizes to the same Row Group, so how can I accomplish this?
I'm attempting to do this with ParquetSharp.
It is not possible for the columns in a Parquet file to have different row counts.
It is not explicit in the documentation, but if you look at https://parquet.apache.org/documentation/latest/#metadata, you will see that a RowGroup has a num_rows and several ColumnChunks that do not themselves have individual row counts.

Perl processing a trillion records

Looking for some advice or insight on what I consider a simple method in Perl to compare text files to one another.
Let's assume you have 90,000 text files that are all structured similarly, say they have a common theme with a small amount of unique data in each.
My logic says to simply loop through the files (breaking into 1000 lines for simplicity), then loop through the 90,000 files, then loop through the 90,000 files again to compare against each other. This becomes a virtually endless loop of a bazillion lines or processes.
Now the mandatory step here is to "remove" any line that is found in any file except the file we are working on. The ultimate goal is to scrub all the files down to content that is unique across the entire collection, even if it means some files end up empty.
I am saying files, but this could be rows in a database, or elements in an array. (I've tried all.) The fastest solution so far has been to load all the files into MySQL, then run
UPDATE table SET column=REPLACE(column, find, replace); I also tried Parallel::ForkManager when working with MySQL.
The slowest approach actually exhausted my 32 GB of RAM - that was loading all 90k files into an array. 90k files didn't work at all; smaller batches like 1000 work fine, but then don't compare against the other 89,000.
Server specs, if helpful: single quad-core E3-1240 (4 cores x 3.4 GHz w/ HT), 32 GB DDR3 ECC RAM 1600 MHz, 1x 256 GB SSD.
So how does an engineer solve this problem? I am just a Perl hacker...
Tag every line with the filename (and maybe the line number) and sort all the lines using Sort::External. Then you can read the sorted records in order and write only a single unique line to the result files.
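A rough Python sketch of that approach (the answer is about Perl's Sort::External, but the mechanics are the same; the in-memory sort below stands in for a disk-based external sort, and the file locations are made up):
import glob
from itertools import groupby
from operator import itemgetter

# 1. Tag every line with the file it came from.
tagged = []
for path in glob.glob("data/*.txt"):                 # hypothetical location of the files
    with open(path) as fh:
        tagged.extend((line, path) for line in fh)

# 2. Sort by line content. For 90,000 real files, replace this with an
#    external (disk-based) sort - that is what Sort::External provides in Perl.
tagged.sort(key=itemgetter(0))

# 3. Walk the sorted stream and keep a single copy of each distinct line,
#    written back next to the file it first appeared in. (If the goal is to
#    drop every line that occurs in more than one place, keep only the
#    groups of length 1 instead.)
survivors = {}
for line, group in groupby(tagged, key=itemgetter(0)):
    first_path = next(group)[1]
    survivors.setdefault(first_path, []).append(line)

for path, lines in survivors.items():
    with open(path + ".unique", "w") as out:         # scrubbed copy alongside the original
        out.writelines(lines)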
A Bloom filter is perfect for this, if you can handle arbitrarily small error.
To quote wikipedia: "A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either 'possibly in set' or 'definitely not in set'."
In essence, you'll use k hashes to hash each row to k spots on a bit array. Each time you encounter a new row, you are guaranteed you haven't seen it if at least one of the k hashed indices has a '0' bit. You can read up on Bloom filters to see how to size the array and choose k to make false positives arbitrarily small.
Then you go through your files, and either delete rows where you get a positive match, or copy the negative match rows into a new file.
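A minimal Python sketch of that procedure (the question is about Perl, but the logic is identical; the bit-array size and the number of hashes below are illustrative - size them from the expected number of lines and the false-positive rate you can tolerate):
import glob
import hashlib

M = 1 << 27                      # bits in the filter (~16 MB of memory)
K = 5                            # number of hash functions
bits = bytearray(M // 8)

def indices(line):
    """Derive K bit positions for a line from slices of one SHA-256 digest."""
    digest = hashlib.sha256(line.encode()).digest()
    for i in range(K):
        yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % M

def possibly_seen(line):
    return all(bits[i // 8] & (1 << (i % 8)) for i in indices(line))

def mark(line):
    for i in indices(line):
        bits[i // 8] |= 1 << (i % 8)

# Copy only lines the filter has definitely not seen before; anything reported
# as "possibly seen" is dropped (the arbitrarily small false-positive risk).
for path in glob.glob("data/*.txt"):              # hypothetical location of the files
    with open(path) as fh, open(path + ".unique", "w") as out:
        for line in fh:
            if not possibly_seen(line):
                mark(line)
                out.write(line)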
Sort the items using an external merge-sort algorithm and remove the duplicates in the merge phase.
Actually, you can do that efficiently by just calling the sort command with the -u flag. From Perl:
system "sort -u @files >output";
Your sort command may provide several adjustable parameters to improve its performance, for instance the number of parallel processes or the amount of memory it can allocate.

sorting and splitting in a DFSORT together?

Input file layout:
Positions 01-10 - 10-digit account number
Position 53 - indicator with values 'Y' or 'N' (length 1)
Positions 71-80 - timestamp (length 10)
(The rest of the fields are insignificant for this sort.)
Sorting the input file while splitting and eliminating duplicates in the two ways below gives different results. I want to know why.
Case I: splitting and eliminating duplicates in the same step.
 SORT FIELDS=(01,10,CH,A,53,01,CH,A)
 SUM FIELDS=NONE
 OUTFIL FILES=01,
   INCLUDE=(53,01,CH,EQ,C'Y',AND,71,10,CH,GT,&DATE2(-))
 OUTFIL FILES=02,
   INCLUDE=(53,01,CH,EQ,C'N',AND,71,10,CH,GT,&DATE2(-))
Case II: splitting and eliminating duplicates in two different steps.
Step 1:
 SORT FIELDS=(01,10,CH,A,53,01,CH,A)
 SUM FIELDS=NONE
Step 2:
 SORT FIELDS=COPY
 OUTFIL FILES=01,
   INCLUDE=(53,01,CH,EQ,C'Y',AND,71,10,CH,GT,&DATE2(-))
 OUTFIL FILES=02,
   INCLUDE=(53,01,CH,EQ,C'N',AND,71,10,CH,GT,&DATE2(-))
These two cases produce different output. Do you see any difference between them? Please clarify.
You are asking to sort on an Account Number (10 characters, ascending) then on an Indicator (1 character, ascending).
These two fields alone determine the key of the record - the Timestamp is not part of the sort key. Consequently, if there are two or more records with the same key, they could be placed in any (random) order by the sort. There is no telling what order the Timestamp values will appear in.
Keeping the above in mind, consider what happens when you have two records with the same key but different Timestamp values, where one of the Timestamp values meets the given INCLUDE criteria and the other one doesn't.
The SUM FIELDS=NONE parameter asks to remove duplicates based on the key. It does this by grouping all of the records with the same key together and then selecting the last one in the group. Since the key does not include the Timestamp, the chosen record is essentially random. Consequently, it is unpredictable whether you get the record that meets the subsequent INCLUDE condition.
There are a couple of ways to fix this:
Add the Timestamp to the sort key. This might not work because it may leave multiple records for the same Account Number / Indicator; that is, it may break your duplicate-removal requirement.
Request a stable sort.
A stable sort causes records having the same sort key to maintain their relative positions after the sort. This preserves the original order of the Timestamp values in your file for a given key. When the removal of duplicates occurs, DFSORT will choose the last record from the set of duplicates. This should bring the predictability to the duplicate-elimination process you are looking for. Specify a stable sort by adding an OPTION EQUALS control card before the SORT card.
EDIT - Comment: "...picks the VERY FIRST record"
The book I based my original answer on clearly stated that the last record in a group of records with the same key would be selected when SUM FIELDS=NONE is specified. However, it is always best to consult the vendor's own manuals. IBM's DFSORT Application Programming Guide only states that one record with each key will be selected. However, it also has the following note:
The FIRST operand of ICETOOL's SELECT operator can be used to perform the same
function as SUM FIELDS=NONE with OPTION EQUALS. Additionally, SELECT's FIRSTDUP,
ALLDUPS, NODUPS, HIGHER(x), LOWER(y), EQUAL(v), LASTDUP, and LAST operands can be
used to select records based on other criteria related to duplicate and non-duplicate
keys. SELECT's DISCARD(savedd) operand can be used to save the records discarded by
FIRST, FIRSTDUP, ALLDUPS, NODUPS, HIGHER(x), LOWER(y), EQUAL(v), LASTDUP, or
LAST. See SELECT Operator for complete details on the SELECT operator.
Based on this information I would suggest using ICETOOL's SELECT operator to select the correct record.
Sorry for the misinformation.
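To make the "stable sort, then keep one record per key" behaviour concrete, here is a small Python model of what the quoted SELECT ... FIRST (equivalently, SUM FIELDS=NONE with OPTION EQUALS) retains - an illustration with made-up records, not DFSORT:
from itertools import groupby
from operator import itemgetter

# Made-up records as (account, indicator, timestamp) tuples, per the question's layout.
records = [
    ("0000012345", "Y", "2012-03-01"),
    ("0000012345", "Y", "2012-01-15"),   # same key, different timestamp
    ("0000067890", "N", "2012-02-20"),
]

# Stable sort on the key only (account + indicator). Records with equal keys
# keep their original relative order - the effect of OPTION EQUALS.
key = itemgetter(0, 1)
records.sort(key=key)            # Python's sort is stable

# Keep the first record of each key group - what SELECT ... FIRST keeps.
kept = [next(group) for _, group in groupby(records, key=key)]
print(kept)   # one record per (account, indicator); ties resolved by input order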
The problem is as NealB identified.
The easiest thing to do is to "get rid of" the records you don't want by date before the SORT. The SORT will take less time. This assumes that SORTOUT is not required. If it is, you have to keep your INCLUDE= on the OUTFILs.
SELECT is a good option. SELECT uses OPTION EQUALS by default. The control cards below can be included in an xxxxCNTL dataset and actioned from the SELECT with USING(xxxx). SELECT gives you greater flexibility than SUM (you can get the last record, amongst other things).
The whole task sounds flawed. If there are records per account with different dates, I'd expect either the first date or the last date, or something else specific, to be required, not just whatever record happens to be hanging around at the end of the SUM.
OPTION EQUALS
INCLUDE COND=(71,10,CH,GT,&DATE2(-))
SORT FIELDS=(01,10,CH,A,53,01,CH,A)
SUM FIELDS=NONE
OUTFIL FILES=01,
INCLUDE=(53,01,CH,EQ,C'Y')
OUTFIL FILES=02,
INCLUDE=(53,01,CH,EQ,C'N')
Or, if the Y/N cover all records:
OUTFIL FILES=02,SAVE
