Organising inconsistent values [closed] - sorting

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 9 years ago.
Improve this question
No idea if this is ok to ask here since it's not programming but I have no idea where else to go:
I want to organise the following data in a consistent way. At the moment it's a mess, with only the first two columns (comma separated) consistent. The remaining columns can number anywhere from 1-9 and are usually different.
In other words, I want to sort it so the text matches (all of the value columns in a row, all of the recoil columns in a row, etc). Then I can remove the text and add a header, and it will still make sense.
bm_wp_upg_o_t1micro, sight, value = 3, zoom = 3, recoil = 1, spread_moving = -1
bm_wp_upg_o_marksmansight_rear, sight, value = 3, zoom = 1, recoil = 1, spread = 1
bm_wp_upg_o_marksmansight_front, extra, value = 1
bm_wp_m4_upper_reciever_edge, upper_reciever, value = 3, recoil = 1
bm_wp_m4_upper_reciever_round, upper_reciever, value = 1
bm_wp_m4_uupg_b_long, barrel, value = 4, damage = 1, spread = 1, spread_moving = -2, concealment = -2
Any suggestions (even on just where the right place is to actually ask this) would be great.
Context is just raw data ripped from a game file that I'm trying to organise.

I'm afraid regex isn't going to help you much here because of the irregular nature of your input (it would be possible to match it, but it would be a bear to get it all arranged one way or another). This could be done pretty easily with any programming language, but for stuff like this, I always go to awk.
Assuming your input is in a file called input.txt, put the following in a program called parse.awk:
BEGIN {
FS=" *, *";
formatStr = "%32s,%8s,%8s,%8s,%10s,%16s,%8s,%18s,%10s,%10s,%16s,%16s\n";
printf( formatStr, "id", "sight", "value", "zoom", "recoil", "spread_moving", "extra", "upper_receiver", "barrel", "damage", "spread_moving", "concealment" );
}
{
split("",a);
for( i=2; i<=NF; i++ ) {
if( split( $(i), kvp, " *= *" ) == 1 ) {
a[kvp[1]] = "x";
} else {
a[kvp[1]] = gensub( /^\s*|\s*$/, "", "g", kvp[2] );
}
}
printf( formatStr, $1, a["sight"], a["value"], a["zoom"], a["recoil"],
a["spread_moving"], a["extra"], a["upper_receiver"],
a["barrel"], a["damage"], a["spread_moving"], a["concealment"] );
}
Run awk against it:
awk -f parse.awk input.txt
And get your output:
id, sight, value, zoom, recoil, spread_moving, extra, upper_receiver, barrel, damage, spread_moving, concealment
bm_wp_upg_o_t1micro, x, 3, 3, 1, -1, , , , , -1,
bm_wp_upg_o_marksmansight_rear, x, 3, 1, 1, , , , , , ,
bm_wp_upg_o_marksmansight_front, , 1, , , , x, , , , ,
bm_wp_m4_upper_reciever_edge, , 3, , 1, , , , , , ,
bm_wp_m4_upper_reciever_round, , 1, , , , , , , , ,
bm_wp_m4_uupg_b_long, , 4, , , -2, , , x, 1, -2, -2
Note that I chose to just use an 'x' for sight, which seems to a present/absent thing. You can use whatever you want there.
If you're using Linux or a Macintosh, you should have awk available. If you're on Windows, you'll have to install it.

I did make another awk version. I think this should a little easier to read.
All value/column are read from the file to make it as dynamic as possible.
awk -F, '
{
ID[$1]=$2 # use column 1 as index
for (i=3;i<=NF;i++ ) # loop through all fields from #3 to end
{
gsub(/ +/,"",$i) # remove space from field
split($i,a,"=") # split field in name and value a[1] and a[2]
COLUMN[a[1]]++ # store field name as column name
DATA[$1" "a[1]]=a[2] # store data value in DATA using field #1 and column name as index
}
}
END {
printf "%49s ","info" # print info
for (i in COLUMN)
{printf "%15s",i} # print column name
print ""
for (i in ID) # loop through all ID
{
printf "%32s %16s ",i, ID[i] # print ID and info
for (j in COLUMN)
{
printf "%14s ",DATA[i" "j]+0 # print value
}
print ""
}
}' file
Output
info spread recoil zoom concealment spread_moving damage value
bm_wp_m4_upper_reciever_round upper_reciever 0 0 0 0 0 0 1
bm_wp_m4_uupg_b_long barrel 1 0 0 -2 -2 1 4
bm_wp_upg_o_marksmansight_rear sight 1 1 1 0 0 0 3
bm_wp_upg_o_marksmansight_front extra 0 0 0 0 0 0 1
bm_wp_m4_upper_reciever_edge upper_reciever 0 1 0 0 0 0 3
bm_wp_upg_o_t1micro sight 0 1 3 0 -1 0 3

Stick with Ethan's answer — this is just me enjoying myself. (And yes, that makes me pretty weird!)
awk script
awk 'BEGIN {
# f_idx[field] holds the column number c for a field=value item
# f_name[c] holds the names
# f_width[c] holds the width of the widest value (or the field name)
# f_fmt[c] holds the appropriate format
FS = " *, *"; n = 2;
f_name[0] = "id"; f_width[0] = length(f_name[0])
f_name[1] = "type"; f_width[1] = length(f_name[1])
}
{
#-#print NR ":" $0
line[NR,0] = $1
len = length($1)
if (len > f_width[0])
f_width[0] = len
line[NR,1] = $2
len = length($2)
if (len > f_width[1])
f_width[1] = len
for (i = 3; i <= NF; i++)
{
split($i, fv, " = ")
#-#print "1:" fv[1] ", 2:" fv[2]
if (!(fv[1] in f_idx))
{
f_idx[fv[1]] = n
f_width[n++] = length(fv[1])
}
c = f_idx[fv[1]]
f_name[c] = fv[1]
gsub(/ /, "", fv[2])
len = length(fv[2])
if (len > f_width[c])
f_width[c] = len
line[NR,c] = fv[2]
#-#print c ":" f_name[c] ":" f_width[c] ":" line[NR,c]
}
}
END {
for (i = 0; i < n; i++)
f_fmt[i] = "%s%" f_width[i] "s"
#-#for (i = 0; i < n; i++)
#-# printf "%d: (%d) %s %s\n", i, f_width[i], f_name[i], f_fmt[i]
#-# pad = ""
for (j = 0; j < n; j++)
{
printf f_fmt[j], pad, f_name[j]
pad = ","
}
printf "\n"
for (i = 1; i <= NR; i++)
{
pad = ""
for (j = 0; j < n; j++)
{
printf f_fmt[j], pad, line[i,j]
pad = ","
}
printf "\n"
}
}' data
This script adapts to the data it finds in the file. It assigns the column heading 'id' to column 1 of the input, and 'type' to column 2. For each of the sets of values in columns 3..N, it splits up the data into key (in fv[1]) and value (in fv[2]). If the key has not been seen before, it is assigned a new column number, and the key is stored as the column name, and the width of key as the initial column width. Then the value is stored in the appropriate column within the line.
When all the data's read, the script knows what the column headings are going to be. It can then create a set of format strings. Then it prints the headings and all the rows of data. If you don't want fixed width output, then you can simplify the script considerably. There are some (mostly minor) simplifications that could be made to this script.
Data file
bm_wp_upg_o_t1micro, sight, value = 3, zoom = 3, recoil = 1, spread_moving = -1
bm_wp_upg_o_marksmansight_rear, sight, value = 3, zoom = 1, recoil = 1, spread = 1
bm_wp_upg_o_marksmansight_front, extra, value = 1
bm_wp_m4_upper_receiver_edge, upper_receiver, value = 3, recoil = 1
bm_wp_m4_upper_receiver_round, upper_receiver, value = 1
bm_wp_m4_uupg_b_long, barrel, value = 4, damage = 1, spread = 1, spread_moving = -2, concealment = -2
Output
id, type,value,zoom,recoil,spread_moving,spread,damage,concealment
bm_wp_upg_o_t1micro, sight, 3, 3, 1, -1, , ,
bm_wp_upg_o_marksmansight_rear, sight, 3, 1, 1, , 1, ,
bm_wp_upg_o_marksmansight_front, extra, 1, , , , , ,
bm_wp_m4_upper_receiver_edge,upper_receiver, 3, , 1, , , ,
bm_wp_m4_upper_receiver_round,upper_receiver, 1, , , , , ,
bm_wp_m4_uupg_b_long, barrel, 4, , , -2, 1, 1, -2

Related

Optimised EmEditor macro to populate column based on another column for a large file

I’ve got a really large file, circa 10m rows, in which I’m trying to populate a column based on conditions on another column via a jsee macro. While it is quite quick for small files, it does take some time for the large file.
//pseudocode
//No sorting on Col1, which can have empty cells too
For all lines in file
IF (cell in Col2 IS empty) AND (cell in Col1 IS NOT empty) AND (cell in Col1 = previous cell in Col1)
THEN cell in Col2 = previous cell in Col2
//jsee code
document.CellMode = true; // Must be cell selection mode
totalLines = document.GetLines();
for( i = 1; i < totalLines; i++ ) {
nref = document.GetCell( i, 1, eeCellIncludeNone );
gsize = document.GetCell( i, 2, eeCellIncludeNone );
if (gsize == "" && nref != "" && nref == document.GetCell( i-1, 1, eeCellIncludeNone ) ) {
document.SetCell( i, 2, document.GetCell( i-1, 2, eeCellIncludeNone ) , eeAutoQuote);
}
}
Input File:
Reference
Group Size
14/12/01819
1
14/12/01820
1
15/01/00191
4
15/01/00191
15/01/00191
15/01/00198
15/01/00292
3
15/01/00292
15/01/00292
15/01/00401
5
15/01/00401
15/01/00402
1
15/01/00403
2
15/01/00403
15/01/00403
15/01/00403
15/01/00404
20/01/01400
1
Output File:
Reference
Group Size
14/12/01819
1
14/12/01820
1
15/01/00191
4
15/01/00191
4
15/01/00191
4
15/01/00198
15/01/00292
3
15/01/00292
3
15/01/00292
3
15/01/00401
5
15/01/00401
5
15/01/00402
1
15/01/00403
2
15/01/00403
2
15/01/00403
2
15/01/00403
2
15/01/00404
20/01/01400
1
Any ideas on how to optimise this and make it run even faster?
I wrote a JavaScript for EmEditor macro for you. You might need to set the correct numbers in the first 2 lines for iColReference and iColGroupSize.
iColReference = 1; // the column index of "Reference"
iColGroupSize = 2; // the column index of "Group Size"
document.CellMode = true; // Must be cell selection mode
sDelimiter = document.Csv.Delimiter; // retrieve the delimiter
nOldHeadingLines = document.HeadingLines; // retrieve old headings
document.HeadingLines = 0; // set No Headings
yBottom = document.GetLines(); // retrieve the number of lines
if( document.GetLine( yBottom ).length == 0 ) { // -1 if the last line is empty
--yBottom;
}
str = document.GetColumn( iColReference, sDelimiter, eeCellIncludeQuotes, 1, yBottom ); // get whole 1st column from top to bottom, separated by TAB
sCol1 = str.split( sDelimiter );
str = document.GetColumn( iColGroupSize, sDelimiter, eeCellIncludeQuotes, 1, yBottom ); // get whole 2nd column from top to bottom, separated by TAB
sCol2 = str.split( sDelimiter );
s1 = "";
s2 = "";
for( i = 0; i < yBottom; ++i ) { // loop through all lines
if( sCol2[i].length != 0 ) {
s1 = sCol1[i];
s2 = sCol2[i];
}
else {
if( s1.length != 0 && sCol1[i] == s1 ) { // same value as previous line, copy s2
if( s2.length != 0 ) {
sCol2[i] = s2;
}
}
else { // different value, empty s1 and s2
s1 = "";
s2 = "";
}
}
}
str = sCol2.join( sDelimiter );
document.SetColumn( iColGroupSize, str, sDelimiter, eeDontQuote ); // set whole 2nd column from top to bottom with the new values
document.HeadingLines = nOldHeadingLines; // restore the original number of headings
To run this, save this code as, for instance, Macro.jsee, and then select this file from Select... in the Macros menu. Finally, select Run Macro.jsee in the Macros menu.

How to use "column" to center a chart?

I was wondering what the best way to sort a chart using the column command to center each column instead of the default left aligned column was. I have been using the column -t filename command.
Current Output:
Label1 label2
Anotherlabel label2442
label152 label42242
label78765 label373737737
Desired Output: Something like this
Label1 label2
Anotherlabel label2442
label152 label42242
label78765 label373737737
Basically, I want it to be centered instead of left aligned.
Here is an awk solution:
# Collect all lines in "data", keep track of maximum width for each field
{
data[NR] = $0
for (i = 1; i <= NF; ++i)
max[i] = length($i) > max[i] ? length($i) : max[i]
}
END {
for (i = 1; i <= NR; ++i) {
# Split record into array "arr"
split(data[i], arr)
# Loop over array
for (j = 1; j <= NF; ++j) {
# Calculate amount of padding required
pad = max[j] - length(arr[j])
# Print field with appropriate padding, see below
printf "%*s%*s%s", length(arr[j]) + int(pad/2), arr[j], \
pad % 2 == 0 ? pad/2 : int(pad/2) + 1, "", \
j == NF ? "" : " "
}
# Newline at end of record
print ""
}
}
Called like this:
$ awk -f centre.awk infile
Label1 label2
Anotherlabel label2442
label152 label42242
label78765 label373737737
The printf statement uses padding with dynamic widths:
The first %*s takes care of left padding and the data itself: arr[j] gets printed and padded to a total width of length(arr[j]) + int(pad/2).
The second %*s prints the empty string, left padded to half of the total padding required. pad % 2 == 0 ? pad/2 : int(pad/2) + 1 checks if the total padding was an even number, and if not, adds an extra space.
The last %s prints j == NF ? "" : " ", i.e., two spaces, unless we're at the last field.
Some older awks don't support the %*s syntax, but the formatting string can be assembled like width = 5; "%" width "s" in that case.
Here's a Python program to do what you want. It's probably too hard to do in bash, so you'll need to use a custom program or awk script. Basic algorithm:
count number of columns
[optional] make sure each line has the same number of columns
figure out the maximum length of data for each column
print each line using the max lengths
.
#!/usr/bin/env python3
import sys
def column():
# Read file and split each line into fields (by whitespace)
with open(sys.argv[1]) as f:
lines = [line.split() for line in f]
# Check that each line has the same number of fields
num_fields = len(lines[0])
for n, line in enumerate(lines):
if len(line) != num_fields:
print('Line {} has wrong number of columns: expected {}, got {}'.format(n, num_fields, len(line)))
sys.exit(1)
# Calculate the maximum length of each field
max_column_widths = [0] * num_fields
for line in lines:
line_widths = (len(field) for field in line)
max_column_widths = [max(z) for z in zip(max_column_widths, line_widths)]
# Now print them centered using the max_column_widths
spacing = 4
format_spec = (' ' * spacing).join('{:^' + str(n) + '}' for n in max_column_widths)
for line in lines:
print(format_spec.format(*line))
if __name__ == '__main__':
column()

AWK printing fields in multiline records

I have an input file with fields in several lines. In this file, the field pattern is repeated according to query size.
ZZZZ
21293
YYYYY XXX WWWW VV
13242 MUTUAL BOTH NO
UUUUU TTTTTTTT SSSSSSSS RRRRR QQQQQQQQ PPPPPPPP
3 0 3 0
NNNNNN MMMMMMMMM LLLLLLLLL KKKKKKKK JJJJJJJJ
2 0 5 3
IIIIII HHHHHH GGGGGGG FFFFFFF EEEEEEEEEEE DDDDDDDDDDD
5 3 0 3
My desired output is one line per total group of fields. Empty
fields should be marked. Example:"x"
21293 13242 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
12345 67890 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
I have been thinking about how can I get the desired output with awk/unix scripts but can't figure it out. Any ideas? Thank you very much!!!
This isn't really a great fit for awk's style of programming, which is based on fields that are delimited by a pattern, not fields with variable positions on the line. But it can be done.
When you process the first line in each pair, scan through it finding the positions of the beginning of each field name.
awk 'NR%3 == 1 {
delete fieldpos;
delete fieldlen;
lastspace = 1;
fieldindex = 0;
for (i = 1; i <= length(); i++) {
if (substr($0, i, 1) != " ") {
if (lastspace) {
fieldpos[fieldindex] = i;
if (fieldindex > 0) {
fieldlen[fieldindex-1] = i - fieldpos[fieldindex-1];
}
fieldindex++;
}
lastspace = 0;
} else {
lastspace = 1;
}
}
}
NR%3 == 2 {
for (i = 0; i < fieldindex; i++) {
if (i in fieldlen) {
f = substr($0, fieldpos[i], fieldlen[i]);
} else { # last field, go to end of line
f = substr($0, fieldpos[i]);
}
gsub(/^ +| +$/, "", f); # trim surrounding spaces
if (f == "") { f = "X" }
printf("%s ", f);
}
}
NR%15 == 14 { print "" } # print newline after 5 data blocks
'
Assuming your fields are separated by blank chars and not tabs, GNU awk's FIELDWITDHS is designed to handle this sort of situation:
/^ZZZZ/ { if (rec!="") print rec; rec="" }
/^[[:upper:]]/ {
FIELDWIDTHS = ""
while ( match($0,/\S+\s*/) ) {
FIELDWIDTHS = (FIELDWIDTHS ? FIELDWIDTHS " " : "") RLENGTH
$0 = substr($0,RLENGTH+1)
}
next
}
NF {
for (i=1;i<=NF;i++) {
gsub(/^\s+|\s+$/,"",$i)
$i = ($i=="" ? "X" : $i)
}
rec = (rec=="" ? "" : rec " ") $0
}
END { print rec }
$ awk -f tst.awk file
2129 13242 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
In other awks you'd use match()/substr(). Note that the above isn't perfect in that it truncates a char off 21293 - that's because I'm not convinced your input file is accurate and if it is you haven't told us why that number is longer than the string on the preceding line or how to deal with that.

how to write bash script in ubuntu to normalize the index of text comparison

I had a input which is a result from text comparison. It is in a very simple format. It has 3 columns, position, original texts and new texts.
But some of the records looks like this
4 ATCG ATCGC
10 1234 123
How to write the short script to normalize it to
7 G GC
12 34 3
probably, the whole original texts and the whole new text is like below respectively
ACCATCGGA1234
ACCATCGCGA123
"Normalize" means "trying to move the position in the first column to the position that changes gonna occur", or "we would remove the common prefix ATG, add its length 3 to the first field; similarly on line 2 the prefix we remove is length 2"
This script
awk '
BEGIN {OFS = "\t"}
function common_prefix_length(str1, str2, max_len, idx) {
idx = 1
if (length(str1) < length(str2))
max_len = length(str1)
else
max_len = length(str2)
while (substr(str1, idx, 1) == substr(str2, idx, 1) && idx < max_len)
idx++
return idx - 1
}
{
len = common_prefix_length($2, $3)
print $1 + len, substr($2, len + 1), substr($3, len + 1)
}
' << END
4 ATCG ATCGC
10 1234 123
END
outputs
7 G GC
12 34 3

Algorithm to evenly distribute items into 3 columns

I'm looking for an algorithm that will evenly distribute 1 to many items into three columns. No column can have more than one more item than any other column. I typed up an example of what I'm looking for below. Adding up Col1,Col2, and Col3 should equal ItemCount.
Edit: Also, the items are alpha-numeric and must be ordered within the column. The last item in the column has to be less than the first item in the next column.
Items Col1,Col2,Col3
A A
AB A,B
ABC A,B,C
ABCD AB,C,D
ABCDE AB,CD,E
ABCDEF AB,CD,EF
ABCDEFG ABC,DE,FG
ABCDEFGH ABC,DEF,GH
ABCDEFGHI ABC,DEF,GHI
ABCDEFHGIJ ABCD,EFG,HIJ
ABCDEFHGIJK ABCD,EFGH,IJK
Here you go, in Python:
NumCols = 3
DATA = "ABCDEFGHIJK"
for ItemCount in range(1, 12):
subdata = DATA[:ItemCount]
Col1Count = (ItemCount + NumCols - 1) / NumCols
Col2Count = (ItemCount + NumCols - 2) / NumCols
Col3Count = (ItemCount + NumCols - 3) / NumCols
Col1 = subdata[:Col1Count]
Col2 = subdata[Col1Count:Col1Count+Col2Count]
Col3 = subdata[Col1Count+Col2Count:]
print "%2d %5s %5s %5s" % (ItemCount, Col1, Col2, Col3)
# Prints:
# 1 A
# 2 A B
# 3 A B C
# 4 AB C D
# 5 AB CD E
# 6 AB CD EF
# 7 ABC DE FG
# 8 ABC DEF GH
# 9 ABC DEF GHI
# 10 ABCD EFG HIJ
# 11 ABCD EFGH IJK
This answer is now obsolete because the OP decided to simply change the question after I answered it. I’m just too lazy to delete it.
function getColumnItemCount(int items, int column) {
return (int) (items / 3) + (((items % 3) >= (column + 1)) ? 1 : 0);
}
This question was the closest thing to my own that I found, so I'll post the solution I came up with. In JavaScript:
var items = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K']
var columns = [[], [], []]
for (var i=0; i<items.length; i++) {
columns[Math.floor(i * columns.length / items.length)].push(items[i])
}
console.log(columns)
just to give you a hint (it's pretty easy, so figure out yourself)
divide ItemCount by 3, rounding down. This is what is at least in every column.
Now you do ItemCount % 3 (modulo), which is either 1 or 2 (because else it would be dividable by 3, right) and you distribute that.
I needed a C# version so here's what I came up with (the algorithm is from Richie's answer):
// Start with 11 values
var data = "ABCDEFGHIJK";
// Split in 3 columns
var columnCount = 3;
// Find out how many values to display in each column
var columnCounts = new int[columnCount];
for (int i = 0; i < columnCount; i++)
columnCounts[i] = (data.Count() + columnCount - (i + 1)) / columnCount;
// Allocate each value to the appropriate column
int iData = 0;
for (int i = 0; i < columnCount; i++)
for (int j = 0; j < columnCounts[i]; j++)
Console.WriteLine("{0} -> Column {1}", data[iData++], i + 1);
// PRINTS:
// A -> Column 1
// B -> Column 1
// C -> Column 1
// D -> Column 1
// E -> Column 2
// F -> Column 2
// G -> Column 2
// H -> Column 2
// I -> Column 3
// J -> Column 3
// K -> Column 3
It's quite simple
If you have N elements indexed from 0 to N-1 and column indexed from 0to 2, the i-th element will go in column i mod 3 (where mod is the modulo operator, % in C,C++ and some other languages)
Do you just want the count of items in each column? If you have n items, then
the counts will be:
round(n/3), round(n/3), n-2*round(n/3)
where "round" round to the nearest integer (e.g. round(x)=(int)(x+0.5))
If you want to actually put the items there, try something like this Python-style pseudocode:
def columnize(items):
i=0
answer=[ [], [], [] ]
for it in items:
answer[i%3] += it
i += 1
return answer
Here's a PHP version I hacked together for all the PHP hacks out there like me (yup, guilt by association!)
function column_item_count($items, $column, $maxcolumns) {
return round($items / $maxcolumns) + (($items % $maxcolumns) >= $column ? 1 : 0);
}
And you can call it like this...
$cnt = sizeof($an_array_of_data);
$col1_cnt = column_item_count($cnt,1,3);
$col2_cnt = column_item_count($cnt,2,3);
$col3_cnt = column_item_count($cnt,3,3);
Credit for this should go to #Bombe who provided it in Java (?) above.
NB: This function expects you to pass in an ordinal column number, i.e. first col = 1, second col = 2, etc...

Resources