Multi level sorting of numerical and string values - sorting

I want to do a multi level sorting of floating values using awk on a data like below:
store:LA----------------400.68
----pens----------------200.34
--------reynolds--------110.34
--------butterflow------90.00
--------trimex----------NA
----copies--------------110.34
--------classmate-------110.34
----pencil--------------90.00
--------HB--------------44.5
--------classmate-------45.5
The numerical value is the amount of available stock.
The sorted result should be like:
store:LA----------------400.68
----pencil--------------90.00
--------HB--------------44.5
--------classmate-------45.5
----copies--------------110.34
--------classmate-------110.34
----pens----------------200.34
--------butterflow------90.00
--------reynolds--------110.34
--------trimex----------NA
In ascending order first based on the product and in product based on the brand with NA value in the last.
I tried picking up the value of $2 for each store first (there are multiple stores), then appending the product value after the store value and finally the brand value, and stored the result in an array.
The array looks something like this:
400.68
400.68:200.34
400.68:200.34:110.34
400.68:200.34:90.00
400.68:200.34:NA
Using asort on this array is not displaying the required result:
{
match($0, /^ */);
offset = RLENGTH;
if (offset == 1) { items[NR] = $2 }
else if (offset > prev_ofst) { items[NR] = items[NR-1]":"$2 }
else if (offset < prev_ofst) {
prev_item = items[NR-1];
gsub("(\\:[^:]+\\:[^:]+)$", "", prev_item);
items[NR] = prev_item":"$2;
}
else {
prev_item = items[NR-1];
gsub("(\\:[^:]+)$", "", prev_item);
items[NR] = prev_item" "$2;
}
prev_ofst = offset;
print items[NR];
}
END{
asort(items);
for (i = 1; i <= NR; i++) {
gsub("[^:]+\\:", "", items[i]);
print items[i];
}
}

It's not clear what "in product based on the brand with NA value in the last" means: are you sorting on the brand, which has no NA value in your sample input, or on the number at the end of each line, which does? Assuming that "pencil" and "copies" are in the wrong order in your posted expected output, here's one way to do what I think you want, using GNU awk (which you're already using for asort()) for true multi-dimensional arrays and sorted_in:
$ cat tst.awk
match($0,/^(-*)([^-]+)(-+)([^-]+)/,a) {
offset = length(a[1])/4 + 1
for (i=offset+1; i<=3; i++) {
tags[i] = ""
}
tags[offset] = a[2]
vals[tags[1]][tags[2]][tags[3]] = $0
}
END {
PROCINFO["sorted_in"] = "@ind_str_asc"
for ( tag1 in vals ) {
for ( tag2 in vals[tag1] ) {
for ( tag3 in vals[tag1][tag2] ) {
print vals[tag1][tag2][tag3]
}
}
}
}
$ awk -f tst.awk file
store:LA----------------400.68
----copies--------------110.34
--------classmate-------110.34
----pencil--------------90.00
--------HB--------------44.5
--------classmate-------45.5
----pens----------------200.34
--------butterflow------90.00
--------reynolds--------110.34
--------trimex----------NA
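
If the intent is instead to order siblings by the trailing number at each level, with NA last, that reading would reproduce the expected output exactly as posted in the question. A minimal sketch of that interpretation in Python rather than awk follows; the function names are hypothetical, and it assumes four dashes per indentation level as in the sample data:

```python
import re

# leading dashes (depth), name, filler dashes, trailing value
LINE = re.compile(r"^(-*)([^-]+)-+(.+)$")

def parse(lines):
    """Build a tree of nodes from dash-indented lines (4 dashes per level)."""
    root = {"children": []}
    stack = [(-1, root)]  # (depth, node) of the current ancestors
    for line in lines:
        m = LINE.match(line)
        depth = len(m.group(1)) // 4
        node = {"line": line, "value": m.group(3), "children": []}
        while stack[-1][0] >= depth:
            stack.pop()  # climb back up to this line's parent
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root

def sort_key(node):
    """Numeric values sort ascending; anything non-numeric (such as NA) goes last."""
    try:
        return (0, float(node["value"]))
    except ValueError:
        return (1, 0.0)

def emit(node, out):
    """Append each child, then its sorted subtree, in value order."""
    for child in sorted(node["children"], key=sort_key):
        out.append(child["line"])
        emit(child, out)
    return out

def sort_stock(lines):
    return emit(parse(lines), [])
```

Because each line travels with its own subtree, brands never get separated from their product, which is the problem with sorting flattened "store:product:brand" strings.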


Parse through rotation choices to produce class schedules

I have a Sheets project with 6 worksheets. I'm using student choices to generate a series of 4 rotations in which students can visit classes.
Form Responses 1 lists names in Column D; Columns I-N list a set of six possible rotation choices
6Rot1, 6Rot2, 6Rot3, and 6Rot4 correspond to each of the four possible rotations
mainRoster6 is a master roster produced to reflect names and rotation choices
I'm trying to parse through Form Responses 1, with the following conditions:
Start by looking at Column I and see if there are "open spots" (<47) in the corresponding column of 6Rot1. If so, add the name to that column. Otherwise, check 6Rot2 and try to perform the same operation. Etc., until all 4 rotations have been checked.
If there are no open spots in any of the 4 rotations, then check Column J and try to fit the student into the column for that class, etc.
The intention is to assign the student's top unique choices to 4 rotations. If a student has already chosen a class for one rotation, then that class should not appear again in a different rotation.
I'm trying to solve two bugs:
If a class has already been added for a student for a rotation, he should not have the same class in a later rotation (I tried to solve this in lines 62-67 but it didn't work). In the example sheet, for the student in row 3, once the student has been placed in "Spanish I" for rotation 2, the program should skip his third choice (a repeated "Spanish I") and proceed to try to place him in "Coding" for his third rotation.
The program should assign in rotations in the order of preference (first, then second, then third, etc.). Right now, if a rotation is full for a specific class, it simply ignores that class (even if the next rotation may be available) and moves on to the next choice. In the example sheet, the first student has "coding" as third choice. Although that class is not available in the 3rd rotation, it's available in the 4th rotation. The program should try to place coding in the 4th rotation and then look at the fourth choice and try to place the student into any available rotation (in this case, rotation 3 would be available).
This is what I have currently:
function myFunction() {
var mainSheet = SpreadsheetApp.openById('1ki0Cya3IWNwdLIBe0fR1vwfz_ansmfa52NilmdFBrd0');
var firstRotation = mainSheet.getSheetByName('6Rot1');
var secondRotation = mainSheet.getSheetByName('6Rot2');
var thirdRotation = mainSheet.getSheetByName('6Rot3');
var fourthRotation = mainSheet.getSheetByName('6Rot4');
var mainRoster = mainSheet.getSheetByName('mainRoster6');
var responsesSheet = mainSheet.getSheetByName('Form Responses 1');
var destinationCell;
var column;
var columnValues;
var columnContainsStudent;
var currentSheet;
var mainRosterLastRow;
var studentName;
var compassTeacher;
var classChoice;
var rotationCounter;
var choicesCounter;
var successCounter;
var rotationOneSuccess;
var rotationTwoSuccess;
var rotationThreeSuccess;
var rotationFourSuccess;
var responsesLastRow = responsesSheet.getLastRow();
var currentRotationsArray;
// var y;
var z;
for (z = 2; z < responsesLastRow + 1; z++) {
studentName = responsesSheet.getRange(z, 4).getValues();
mainRoster.getRange(z, 2).setValue(studentName);
compassTeacher = responsesSheet.getRange(z, 6).getValues();
mainRoster.getRange(z, 1).setValue(compassTeacher);
choicesCounter = 1;
successCounter = 0;
rotationCounter = 1;
rotationOneSuccess = false;
rotationTwoSuccess = false;
rotationThreeSuccess = false;
rotationFourSuccess = false;
while (choicesCounter < 7) {
if (successCounter > 4) {
break;
}
classChoice = responsesSheet.getRange(z, 8 + choicesCounter).getValues();
//test for includes in previous choices
var currentRotationsArray = mainRoster.getRange(z, 4, 1, 4).getValues();
// for (y = 0; y < 4; y++) {
// if (currentRotationsArray[0][y] == classChoice) {
// choicesCounter++;
// break;
// }
// }
while (successCounter < 4) {
if (rotationOneSuccess == false) {
currentSheet = firstRotation;
} else if (rotationTwoSuccess == false) {
currentSheet = secondRotation;
} else if (rotationThreeSuccess == false) {
currentSheet = thirdRotation;
} else {
currentSheet = fourthRotation;
}
// Set column number and get values as an array, depending on class choice
if (classChoice == "Guitar 7") {
column = 1;
columnValues = currentSheet.getRange("A1:A").getValues();
} else if (classChoice == "Communications - Video Production") {
column = 2;
columnValues = currentSheet.getRange("B1:B").getValues();
} else if (classChoice == "Communications - Journalism") {
column = 3;
columnValues = currentSheet.getRange("C1:C").getValues();
} else if (classChoice == "Coding") {
column = 4;
columnValues = currentSheet.getRange("D1:D").getValues();
} else if (classChoice == "Art 7") {
column = 5;
columnValues = currentSheet.getRange("E1:E").getValues();
} else if (classChoice == "Latin I") {
column = 6;
columnValues = currentSheet.getRange("F1:F").getValues();
} else if (classChoice == "Spanish I") {
column = 7;
columnValues = currentSheet.getRange("G1:G").getValues();
} else if (classChoice == "French I") {
column = 8;
columnValues = currentSheet.getRange("H1:H").getValues();
} else {
column = 9;
columnValues = currentSheet.getRange("I1:I").getValues();
}
// Find column's current last row
var columnLength = columnValues.filter(String).length;
//Add the student to the class's rotation, if the student isn't already there and there's >46 people already in
if (columnLength < 48) {
for(var n in columnValues){
if(columnValues[n][0] == studentName){
columnContainsStudent = true;
rotationCounter++;
break;
}
}
if (columnContainsStudent != true) {
destinationCell = currentSheet.getRange(columnLength + 1, column);
destinationCell.setValue(studentName);
if (currentSheet == firstRotation) {
rotationOneSuccess = true;
} else if (currentSheet == secondRotation) {
rotationTwoSuccess = true;
} else if (currentSheet == thirdRotation) {
rotationThreeSuccess = true;
} else {
rotationFourSuccess = true;
}
mainRoster.getRange(z, rotationCounter + 3).setValue(classChoice);
successCounter++;
rotationCounter++;
break;
}
rotationCounter++;
} else {
break
}
rotationCounter++;
}
choicesCounter++;
}
}
}
If you look at the linked sample spreadsheet, the logic is as follows:
We're looking at Form Responses 1 row by row, with the intention of adding 4 "choices" for that person in 4 rotations (6Rot1, 6Rot2, etc.). Columns I-N list 6 possible choices, but ideally I want to put the student in his top 4 choices.
I want the program to first look at Form Responses 1 Column I. In the sample sheet it's "Guitar 7". Then the program should look at 6Rot1 and see if the column for "Guitar 7" (Column A) has any open spaces (i.e. if it has less than 48 rows already filled). In the sample sheet, there are 36 "Placeholder Names" so it adds the first student to the first blank row in Column A (row 37). Once the name has been added to one of the rotations, mainRoster6 reflects that same information (column D in this case reflects "Guitar 7", which contains the same information for 6Rot1).
The program should next look back at Form Responses 1 and consider the student's next choice: in this case it would be the "second choice" (column J, "French I"). I want to look at the first possible rotation (6Rot1, 6Rot2, etc.) that doesn't include the student AND has less than 48 people already in it, so that we can put his choice there. In this case 6Rot2 already had 39 names in it, so it accepted the student into row 40 of column H. After adding the student name in 6Rot2, the program adds "French I" to Column E of mainRoster6 ("Rotation 2").
It next looks back at Form Responses 1, considers the "third choice" in column K, and repeats the process.
If a class is already "full" (has more than 47 names listed already in the column), then the program should try to find the first available rotation. For example, the student in Form Responses 1 has "Coding" as his third choice. 6Rot3 already has no space for column D ("Coding"), so it should first try to add the student's name ("Billy Kid") in Column D of 6Rot4 and then try to put in the student's name in column I of 6Rot3 for "German I".
What I'm trying to do with the code is parsing row by row of Form Responses 1 and putting the first 4 possible "choices" into the 4 possible rotations, and having mainRoster6 reflect that same information as a kind of abbreviated master list.
If a student lists two or more similar choices in Form Responses 1 columns I-N, then the program should ignore the repeated values. For example, in row 3 of Form Responses 1 it should overlook the second "Spanish" in K3 and move on to the next choice for that student.
If there are no more possible choices or rotations that can account for that choice, then the space can be left blank (see for example Form Responses 1 row 4: based on the student's choices and available spots, we were only able to honor the first three choices ("French" was already filled completely in 6Rot4; "Latin I" and "Art 7" were already in the schedule).
Under these conditions, I would have expected mainRoster6 to have this intended output:
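
The placement rules described in this question can be sketched in Python (not Apps Script), independent of the spreadsheet. This is only one reading of the rules: assign_rotations, the enrolment dictionaries, and the capacity default of 47 are illustrative assumptions, not part of the original script.

```python
def assign_rotations(choices, enrolment, capacity=47, n_rotations=4):
    """Place a student's choices, in preference order, into rotations.

    choices   -- class names in order of preference (may contain repeats)
    enrolment -- list of dicts, one per rotation: class name -> head count
                 (updated in place when the student is placed)
    Returns a list with the class placed in each rotation, or None.
    """
    schedule = [None] * n_rotations
    for cls in choices:
        if cls in schedule:
            continue  # repeated choice: the student already has this class
        # try every rotation the student is still free in, earliest first
        for r in range(n_rotations):
            if schedule[r] is None and enrolment[r].get(cls, 0) < capacity:
                schedule[r] = cls
                enrolment[r][cls] = enrolment[r].get(cls, 0) + 1
                break
        if None not in schedule:
            break  # all rotations filled
    return schedule
```

The two bugs map onto two details here: the `cls in schedule` check skips a repeated choice, and the inner loop keeps looking at later rotations when an earlier one is full instead of abandoning the choice.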

Creating sequence number in hive or Pig

I'm facing a data transformation issue:
I have the table below with 3 columns: client, event, timestamp.
I basically want to assign a sequence number to all events for a given client based on timestamp, which is the "Sequence" column I added below.
Client Event TimeStamp Sequence
C1 Ph 2014-01-30 12:15:23 1
C1 Me 2014-01-31 15:11:34 2
C1 Me 2014-01-31 17:16:05 3
C2 Me 2014-02-01 09:22:52 1
C2 Ph 2014-02-01 17:22:52 2
I can't figure out how to create this sequence number in hive or Pig. Would you have any clue ?
Thanks in advance !
Guillaume
Put all the records for a client in a bag (by grouping on client), sort the tuples inside the bag by the timestamp field, and then use an Enumerate function (such as DataFu's datafu.pig.bags.Enumerate).
Something like below (I did not execute the code, so you might need to clean it up a bit):
-- assuming input contains 3 columns: client, event, timestamp
-- grouping by client (rather than GROUP ... ALL) restarts the numbering for each client
input2 = GROUP input BY client;
input3 = FOREACH input2
{
sorted = ORDER input BY timestamp;
sorted2 = Enumerate(sorted);
GENERATE FLATTEN(sorted2);
}
We eventually modified the Enumerate source in the following way, and it works great:
public void accumulate(Tuple arg0) throws IOException {
nevents=13;
i=nevents+1;
DataBag inputBag = (DataBag)arg0.get(0);
Tuple t2 = TupleFactory.getInstance().newTuple();
for (Tuple t : inputBag) {
Tuple t1 = TupleFactory.getInstance().newTuple(t.getAll());
tampon=t1.get(2).toString();
if (tampon.equals("NA souscription Credit Conso")) {
if (i <= nevents) {
outputBag.add(t2);
t2=TupleFactory.getInstance().newTuple();
}
i=0;
t2.append(t1.get(0).toString());
t2.append(t1.get(1).toString());
t2.append(t1.get(2).toString());
i++;
}
else if (i < nevents) {
t2.append(tampon);
i++;
}
else if (i == nevents) {
t2.append(tampon);
outputBag.add(t2);
i++;
t2=TupleFactory.getInstance().newTuple();
}
if (count % 1000000 == 0) {
outputBag.spill();
count = 0;
}
;
count++;
}
if (t2.size()!=0) {
outputBag.add(t2);
}
}
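
The numbering asked for in the question (restart at 1 for every client, ordered by timestamp) can be illustrated outside Pig. It is the same idea as a ROW_NUMBER() window partitioned by client and ordered by timestamp in SQL. A minimal Python sketch, assuming rows are (client, event, timestamp) tuples with timestamps in a text format that sorts chronologically; add_sequence is a hypothetical helper name, not part of Pig or Hive:

```python
from itertools import groupby
from operator import itemgetter

def add_sequence(rows):
    """Append a per-client sequence number, ordered by timestamp, to each row."""
    # sort by client, then timestamp, so each client's events are contiguous
    ordered = sorted(rows, key=itemgetter(0, 2))
    out = []
    for _, events in groupby(ordered, key=itemgetter(0)):
        # numbering restarts at 1 for every client
        for seq, row in enumerate(events, start=1):
            out.append(row + (seq,))
    return out
```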

Subtract Dates in Reducer

The input to reducer is as follows
key: 12
List<values> :
1,2,3,2013-12-23 10:21:44
1,2,3,2013-12-23 10:21:59
1,2,3,2013-12-23 10:22:07
The output needed is as follows:
1,2,3,2013-12-23 10:21:44,15
1,2,3,2013-12-23 10:21:59,8
1,2,3,2013-12-23 10:22:07,0
Please note that the last column is 10:21:59 minus 10:21:44, i.e. Date(next) - Date(current), in seconds.
I tried loading the values into memory and subtracting, but that causes a Java heap memory issue. The data for this key is huge (> 1 GB) and does not fit into main memory. Your help is highly appreciated.
Perhaps something along the lines of this pseudocode in your reduce() method:
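long lastDate = 0;
String lastValue = null; // keep a copy of the text: Hadoop reuses the value object between iterations
for (V value : values) {
long currentDate = parseDateIntoMillis(value);
if (lastValue != null) {
// seconds between the next record and the current one (assumes a Text output value)
context.write(key, new Text(lastValue + "," + (currentDate - lastDate) / 1000));
}
lastDate = currentDate;
lastValue = value.toString();
}
if (lastValue != null) {
// the last record has no successor, so its difference is 0
context.write(key, new Text(lastValue + ",0"));
}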
Obviously there will be tidying up to do but the general idea is fairly simple.
Note that because of your requirement to include the date of the next value as part of the current value calculation, the iteration through the values skips the first write, hence the additional write after the loop to ensure all values are accounted for.
If you have any questions feel free to ask away.
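
The same one-record lookbehind can be tried outside Hadoop. The following is a minimal Python sketch of the idea, not reducer code: with_gap_to_next is a hypothetical name, and it assumes the values arrive already ordered by timestamp with the timestamp as the last comma-separated field. Only the previous record is held in memory at any time.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def with_gap_to_next(values):
    """Yield each record with the seconds until the next record appended.

    The last record has no successor and gets 0.
    """
    prev = None
    prev_ts = None
    for value in values:
        ts = datetime.strptime(value.rsplit(",", 1)[1], FMT)
        if prev is not None:
            # the previous record can only be emitted once its successor is known
            yield f"{prev},{int((ts - prev_ts).total_seconds())}"
        prev, prev_ts = value, ts
    if prev is not None:
        yield f"{prev},0"
```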
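You can do it with the following code:
// needs java.io.IOException, java.text.ParseException, java.text.SimpleDateFormat, java.util.Date
// assumes Text values that arrive ordered by timestamp, with the timestamp as the 4th field
public void reduce(LongWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Date currentDate = null;
LongWritable diff = new LongWritable();
try {
for (Text value : values) {
Date nextDate = format.parse(value.toString().split(",")[3]);
if (currentDate != null) {
// difference between consecutive records, in seconds
diff.set(Math.abs(nextDate.getTime() - currentDate.getTime()) / 1000);
context.write(key, diff);
}
currentDate = nextDate;
}
} catch (ParseException e) {
throw new IOException(e);
}
}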

Comma delimited value Entity Framework in contains Statement

I have a LINQ statement that has a comma delimited value.
I want to see if my Field matches any of the comma delimited values.
public string IdentifyProductSKU(string Serial)
{
int Len = Serial.Length;
var Split = from ModelSplitter in entities.Models
select ModelSplitter.m_validationMask.Split(',');
var Product = (from ModelI in entities.Models
where ModelI.m_validation == 0 &&
ModelI.m_validationLength == Len &&
ModelI.m_validationMask.Contains(Serial.Substring(ModelI.m_validationStart, ModelI.m_validationEnd))
select ModelI.m_name).SingleOrDefault();
return Product;
}
To explain the code: every Model has multiple identifying properties, e.g. XX1, XX5 and XX7 are all the same product. When I pass in a serial number, I want to identify the product based on the validation mask. For example, XX511122441141 is ProductA and YY123414124 is ProductC. I just want to split the mask in this query, in this line:
ModelI.m_validationMask.Contains(Serial.Substring(ModelI.m_validationStart, ModelI.m_validationEnd))
I want to split the validation mask to see whether the serial contains any of the validation mask values. Does this make sense?
This is how you split values into a list
var split = context.Packs.Select(u => u.m_validationMask).ToList();
List<String[]> list=new List<String[]>();
foreach (var name in split)
{
String[] str = name.Split(',');
list.Add(str);
}
Now I need to know how I can use that list in my Final EF Query:
int Len = Serial.Length;
var split = entities.Models.Select(u => u.m_validationMask).ToList();
List<String[]> list = new List<String[]>();
foreach (var name in split)
{
String[] str = name.Split(',');
list.Add(str);
}
var Product = (from ModelI in entities.Models
where ModelI.m_validation == 0 &&
ModelI.m_validationLength == Len &&
list.Contains(Serial.Substring(ModelI.m_validationStart, ModelI.m_validationEnd))
select ModelI.m_name).SingleOrDefault();
return Product;
I don't fully understand what you mean or what you are trying to do. But...
If your ModelSplitter.m_validationMask can indeed be split as you demonstrated, then Split is a list. What I don't understand is whether you are trying to match the entire product A or just the first three characters. If it is the latter, you can modify your query:
var Product = (from ModelI in entities.Models
where ModelI.m_validation == 0 &&
ModelI.m_validationLength == Len &&
ModelI.m_validationMask.Contains(Serial.Substring(ModelI.m_validationStart, ModelI.m_validationEnd))
let productId = ModelI.m_name.Substring(0, 3)
where split.Contains(productId)
select ModelI.m_name).SingleOrDefault();
Product should now be null if it does not match, or an actual product if it does.
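
The matching the question describes (cut a piece out of the serial, then check it against each comma-separated mask value) can be shown in isolation. This is a minimal Python sketch rather than an Entity Framework query; the Model class, its field names, and identify_product are hypothetical stand-ins for the m_* columns, and the second substring argument is treated as a character count, as in C#'s Substring(start, length).

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class Model:
    name: str
    validation: int          # stand-in for m_validation
    validation_length: int   # expected serial length
    validation_start: int    # start index of the compared piece
    validation_count: int    # number of characters compared
    validation_mask: str     # e.g. "XX1,XX5,XX7"

def identify_product(serial: str, models: Iterable[Model]) -> Optional[str]:
    """Return the name of the first model whose mask list contains the serial's piece."""
    for m in models:
        if m.validation != 0 or m.validation_length != len(serial):
            continue
        piece = serial[m.validation_start:m.validation_start + m.validation_count]
        # compare against whole mask entries, not against the raw mask string
        if piece in m.validation_mask.split(","):
            return m.name
    return None
```

Splitting before comparing matters: a plain substring test against "XX1,XX5,XX7" would also accept pieces such as "X1," that straddle two entries. Unlike SingleOrDefault, this sketch returns the first match instead of failing when several models match.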

Group lines of log-file using Linq

I have an array of strings from a log file with the following format:
var lines = new []
{
"--------",
"TimeStamp: 12:45",
"Message: Message #1",
"--------",
"--------",
"TimeStamp: 12:54",
"Message: Message #2",
"--------",
"--------",
"Message: Message #3",
"TimeStamp: 12:55",
"--------"
}
I want to group each set of lines (as delimited by "--------") into a list using LINQ. Basically, I want a List<List<string>> or similar where each inner list contains 4 strings - 2 separators, a timestamp and a message.
I should add that I would like to make this as generic as possible, as the log-file format could change.
Can this be done?
Will this work?
var result = Enumerable.Range(0, lines.Length / 4)
.Select(l => lines.Skip(l * 4).Take(4).ToList())
.ToList();
EDIT:
This looks a little hacky but I'm sure it can be cleaned up
IEnumerable<List<String>> GetLogGroups(string[] lines)
{
var list = new List<String>();
foreach (var line in lines)
{
list.Add(line);
if (list.Count(l => l.All(c => c == '-')) == 2)
{
yield return list;
list = new List<string>();
}
}
}
You should be able to do better than returning a List<List<string>>. If you're using C# 4, you could project each set of values into a dynamic type where the string before the colon becomes the property name and the string after it becomes the value. You then create a custom iterator which reads lines until the closing "--------" of each set appears and then yield returns that row. On MoveNext, you read the next set of lines. Rinse and repeat until EOF. I don't have time at the moment to write up a full implementation, but my sample on reading CSV and using LINQ over dynamic objects may give you an idea of what you can do. See http://www.thinqlinq.com/Post.aspx/Title/LINQ-to-CSV-using-DynamicObject (note that this sample is in VB, but the same can be done in C# with some modifications).
The iterator implementation has the added benefit of not having to load the entire document into memory before parsing. With this version, you only load the amount for one set of blocks at a time. It allows you to handle really large files.
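
The iterator idea above can be sketched briefly. This is an illustration in Python rather than C#: log_groups is a hypothetical name, a dict stands in for the dynamic object, and a separator is assumed to be any line made up only of dashes.

```python
def log_groups(lines, is_separator=lambda s: set(s) == {"-"}):
    """Yield one dict per record found between an opening and a closing separator.

    Works on any iterable of lines, so a file can be streamed without
    loading it fully into memory.
    """
    record = None  # None means we are between records
    for line in lines:
        if is_separator(line):
            if record is None:
                record = {}      # opening separator
            else:
                yield record     # closing separator
                record = None
        elif record is not None:
            # text before the first colon is the field name, the rest is the value
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
```

Because fields are keyed by name, the order of TimeStamp and Message inside a record does not matter, and extra fields are picked up without code changes.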
Assuming that your structure is always
delimiter
TimeStamp
Message
delimiter
public List<List<String>> ConvertLog(String[] log)
{
var logSet = new List<List<String>>();
// step through the log one 4-line record at a time, ignoring an incomplete tail
for (int i = 0; i + 3 < log.Length; i += 4)
{
var set = new List<String>() { log[i], log[i+1], log[i+2], log[i+3] };
logSet.Add(set);
}
return logSet;
}
Or in Linq
public List<List<String>> ConvertLog(String[] log)
{
return Enumerable.Range(0, log.Length / 4)
.Select(l => log.Skip(l * 4).Take(4).ToList())
.ToList();
}
