How to insert a different delimiter in between two columns in shell - shell

I have a file as below:
ABc def|0|0|0| 1 | 2| 9|
0 2930|0|0|0|0| 1 | 2| 9|
Now I want to split the first column using the same delimiter.
Expected output:
ABc|def|0|0|0| 1 | 2| 9|
0|2930|0|0|0|0| 1 | 2| 9|
Please help me out with awk.

You can use sed for this:
$ sed 's/ /|/' file
ABc|def|0|0|0| 1 | 2| 9|
0|2930|0|0|0|0| 1 | 2| 9|
The way it is defined, it just replaces the first space with a |, which is exactly what you need.
With awk it is a bit longer:
$ awk 'BEGIN{FS=OFS="|"}{split($1, a, " "); $1=a[1]"|"a[2]}1' file
ABc|def|0|0|0| 1 | 2| 9|
0|2930|0|0|0|0| 1 | 2| 9|
After defining the input and output field separators as |, it splits the first field on the space and rebuilds it with a | between the two parts. The trailing 1 then prints the modified line.

Another awk
awk '{sub(/ /,"|")}1' file
ABc|def|0|0|0| 1 | 2| 9|
0|2930|0|0|0|0| 1 | 2| 9|
This works fine as long as the line has no leading space.

You said you want to replace the delimiter (space -> pipe) in the first column.
It could happen that the first column contains no space while other columns do; in that case you don't want to change the line at all. The first column could also contain several spaces, and I guess you want all of them replaced. So I cannot think of a shorter way to solve this:
awk -F'|' -v OFS="|" '{gsub(/ /,"|",$1)}7' file
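The difference between replacing the first space on the line and replacing spaces only inside the first field can be checked with a minimal sketch. The helper names `first_space_anywhere` and `spaces_in_first_field` and the two sample lines are made up for illustration; they are not part of the answers above.

```shell
#!/bin/sh
# Hypothetical helper: replace the first space on the line, wherever it is.
first_space_anywhere() {
    sed 's/ /|/'
}

# Hypothetical helper: replace every space, but only inside the first
# |-separated field; other fields are left untouched.
spaces_in_first_field() {
    awk -F'|' -v OFS='|' '{ gsub(/ /, "|", $1) } 1'
}

# Made-up sample lines: no space in column 1, then two spaces in column 1.
printf '%s\n' 'ABc|0| 1 |' 'a b c|0| 1 |' | first_space_anywhere
printf '%s\n' 'ABc|0| 1 |' 'a b c|0| 1 |' | spaces_in_first_field
```

With the first helper, the line without a space in column 1 is altered in a later column and only one of the two spaces in `a b c` is replaced; with the second, the first line is unchanged and `a b c` becomes `a|b|c`.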

sed 's/^[[:blank:]]\{1,\}/ /;/^\([^|]\{1,\}\)[[:blank:]]\{1,\}\([^|[[:blank:]]\)/ s//\1|\2/'
This assumes that an empty first column is represented by a blank, and that the separator is one or more blanks followed by a character that is neither a blank nor a |.
It handles lines such as:
ABc def|0|0|0| 1 | 2| 9|
def|0|0|0| 1 | 2| 9|
ABc|def|0|0|0| 1 | 2| 9|

Related

Inconsistency in output field separator

We have to find the difference (d) between the last two numbers and display the three rows with the highest value of d, in ascending order.
INPUT
1 | Latha | Third | Vikas | 90 | 91
2 | Neethu | Second | Meridian | 92 | 94
3 | Sethu | First | DAV | 86 | 98
4 | Theekshana | Second | DAV | 97 | 100
5 | Teju | First | Sangamithra | 89 | 100
6 | Theekshitha | Second | Sangamithra | 99 |100
Required OUTPUT
4$Theekshana$Second$DAV$97$100$3
5$Teju$First$Sangamithra$89$100$11
3$Sethu$First$DAV$86$98$12
awk 'BEGIN{FS="|";OFS="$";}{
avg=sqrt(($5-$6)^2)
print $1,$2,$3,$4,$5,$6,avg
}'|sort -nk7 -t "$"| tail -3
Output:
4 $ Theekshana $ Second $ DAV $ 97 $ 100$3
5 $ Teju $ First $ Sangamithra $ 89 $ 100$11
3 $ Sethu $ First $ DAV $ 86 $ 98$12
As you can see, there is a space before and after each $ sign, but not before the last column (avg). Please explain why this is happening.
2)
awk 'BEGIN{FS=" | ";OFS="$";}{
avg=sqrt(($5-$6)^2)
print $1,$2,$3,$4,$5,$6,avg
}'|sort -nk7 -t "$"| tail -3
OUTPUT
4$|$Theekshana$|$Second$|$0
5$|$Teju$|$First$|$0
6$|$Theekshitha$|$Second$|$0
I have not specified | as the output field separator, but it still appears. Why is this happening, and why is the difference zero as well?
I am just 6 days old in Unix, so please answer even if it's easy.
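Your field separator is only the pipe symbol, so the surrounding whitespace is part of each field, and that is what you see in the output. When FS is longer than one character it is treated as a regular expression, in which the pipe means alternation and must be escaped to match literally. In your second case, " | " therefore means "a space or a space" is the field separator: the pipes themselves become fields, $5 and $6 are no longer numbers, and the difference evaluates to 0.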
$ awk 'BEGIN {FS=" *\\| *"; OFS="$"}
{d=sqrt(($NF-$(NF-1))^2); $1=$1;
print d "\t" $0,d}' file | sort -n | tail -3 | cut -f2-
4$Theekshana$Second$DAV$97$100$3
5$Teju$First$Sangamithra$89$100$11
3$Sethu$First$DAV$86$98$12
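A slight rewrite removes the dependency on the number of fields and fixes the format: $NF and $(NF-1) always refer to the last two fields, and $1=$1 forces awk to rebuild the line with the new OFS.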
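How awk interprets the three field separators discussed here can be seen side by side in a small sketch. The helper name `show_fields` is hypothetical; the sample line is row 4 of the question's input. The last separator uses the bracket form `[|]` instead of a backslash escape, which is an equivalent way to match a literal pipe.

```shell
#!/bin/sh
# Sample line taken from the question's input.
line='4 | Theekshana | Second | DAV | 97 | 100'

# Hypothetical helper: print every field of each input line in brackets,
# using the field separator passed as the first argument.
show_fields() {
    awk -v FS="$1" '{ for (i = 1; i <= NF; i++) printf "[%s]", $i; print "" }'
}

printf '%s\n' "$line" | show_fields '|'        # single character: literal pipe
printf '%s\n' "$line" | show_fields ' | '      # regex: "space OR space"
printf '%s\n' "$line" | show_fields ' *[|] *'  # pipe plus any surrounding spaces
```

The first call keeps the spaces inside the fields, the second produces eleven fields in which every second one is a pipe, and the third yields the six clean fields.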

bash looping and extracting of the fragment of txt file

I am analysing a large number of dlg text files located in the workdir. Each file contains a table (usually at a different position in the log) in the following format:
File 1:
CLUSTERING HISTOGRAM
____________________
________________________________________________________________________________
| | | | |
Clus | Lowest | Run | Mean | Num | Histogram
-ter | Binding | | Binding | in |
Rank | Energy | | Energy | Clus| 5 10 15 20 25 30 35
_____|___________|_____|___________|_____|____:____|____:____|____:____|____:___
1 | -5.78 | 11 | -5.78 | 1 |#
2 | -5.53 | 13 | -5.53 | 1 |#
3 | -5.47 | 17 | -5.44 | 2 |##
4 | -5.43 | 20 | -5.43 | 1 |#
5 | -5.26 | 19 | -5.26 | 1 |#
6 | -5.24 | 3 | -5.24 | 1 |#
7 | -5.19 | 4 | -5.19 | 1 |#
8 | -5.14 | 16 | -5.14 | 1 |#
9 | -5.11 | 9 | -5.11 | 1 |#
10 | -5.07 | 1 | -5.07 | 1 |#
11 | -5.05 | 14 | -5.05 | 1 |#
12 | -4.99 | 12 | -4.99 | 1 |#
13 | -4.95 | 8 | -4.95 | 1 |#
14 | -4.93 | 2 | -4.93 | 1 |#
15 | -4.90 | 10 | -4.90 | 1 |#
16 | -4.83 | 15 | -4.83 | 1 |#
17 | -4.82 | 6 | -4.82 | 1 |#
18 | -4.43 | 5 | -4.43 | 1 |#
19 | -4.26 | 7 | -4.26 | 1 |#
_____|___________|_____|___________|_____|______________________________________
The aim is to loop over all the dlg files and take the single line from the table corresponding to the widest cluster (the one with the largest number of hashes in the Histogram column). In the above example this is the third line of the table.
3 | -5.47 | 17 | -5.44 | 2 |##
Then I need to add this line to the final_log.txt together with the name of the log file (that should be specified before the line). So in the end I should have something in following format (for 3 different log files):
"Name of the file 1": 3 | -5.47 | 17 | -5.44 | 2 |##
"Name_of_the_file_2": 1 | -5.99 | 13 | -5.98 | 16 |################
"Name_of_the_file_3": 2 | -4.78 | 19 | -4.44 | 3 |###
A possible model of my BASH workflow would be:
#!/bin/bash
for f in ./*.dlg  # assumed loop header (missing in the post): all dlg files in the workdir
do
file_name2=$(basename "$f")
file_name="${file_name2/.dlg}"
echo "Processing of $f..."
# take a name of the file and save it in the log
echo "$file_name" >> $PWD/final_results.log
# search of the beginning of the table inside of each file and save it after its name
cat $f |grep 'CLUSTERING HISTOGRAM' >> $PWD/final_results.log
# check whether it works
gedit $PWD/final_results.log
done
Here I need to replace the combination of echo and grep with something that extracts the selected part of the table.
You can use this one, expected to be fast enough. Extra lines in your files, besides the tables, are not expected to be a problem.
grep "#$" *.dlg | sort -rk11 | awk '!seen[$1]++'
grep fetches all the histogram lines, which are then sorted in reverse order by the last field, so that the lines with the most # come first; finally awk keeps only the first line for each file name. Note that when grep is given more than one file, it prints the file name at the beginning of each line by default (as with -H), so if you test with a single file, use grep -H.
Result should be like this:
file1.dlg: 3 | -5.47 | 17 | -5.44 | 2 |##########
file2.dlg: 3 | -5.47 | 17 | -5.44 | 2 |####
file3.dlg: 3 | -5.47 | 17 | -5.44 | 2 |#######
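Here is a modification to get the first appearance when several lines in a file share the maximum:
grep "#$" *.dlg | sort -s -rk11 | awk '!seen[$1]++'
We added -s (stable sort, available in GNU and BSD sort), which disables sort's last-resort comparison of whole lines, so lines with equal keys keep their original order. Note that sort -k11 | tac would not help here: it yields the same order as sort -rk11, because -r also reverses the last-resort comparison.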
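The tie-breaking behaviour can be tried on small made-up files. This is a minimal sketch, assuming a `sort` that supports `-s` (GNU and BSD do; the option is not in POSIX). The function name `widest_per_file` and the demo data are hypothetical, and `LC_ALL=C` is set only to make the comparison of the `|###` strings independent of the locale.

```shell
#!/bin/sh
# Hypothetical helper: for each file given, print "file:line" for the line
# with the longest run of '#' at its end; on a tie, the earliest line wins.
widest_per_file() {
    # grep -H forces the file name prefix even when only one file is given.
    # sort -s -r -k11 orders by field 11 onward (the "|###" column), widest
    # first, and -s keeps tied lines in their original order.
    # awk keeps only the first line seen for each file name prefix.
    grep -H '#$' "$@" | LC_ALL=C sort -s -r -k11 | awk '!seen[$1]++'
}

# Made-up demo data in a temporary directory; a.dlg has a tie between rows 2 and 3.
tmp=$(mktemp -d)
printf '%s\n' ' 1 | -5.78 | 11 | -5.78 | 1 |#' \
              ' 2 | -5.53 | 13 | -5.53 | 2 |##' \
              ' 3 | -5.47 | 17 | -5.44 | 2 |##' > "$tmp/a.dlg"
printf '%s\n' ' 1 | -4.99 | 12 | -4.99 | 3 |###' \
              ' 2 | -4.95 |  8 | -4.95 | 1 |#' > "$tmp/b.dlg"
(cd "$tmp" && widest_per_file a.dlg b.dlg)
rm -rf "$tmp"
```

For a.dlg the row with rank 2 is reported, since it appears before the equally wide row with rank 3.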
Second solution
Here using only awk:
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) print i ":" row[i]}' *.dlg
Update: if you run it from a different directory and want to keep only the basename of each file, strip the path prefix from a copy of the key, so that row[i] can still be looked up (path/to is a placeholder for your directory):
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) {f=i; sub(".*/","",f); print f ":" row[i]}}' path/to/*.dlg
Probably makes more sense as an Awk script.
This picks the first line with the widest histogram in the case of a tie within an input file.
#!/bin/bash
awk 'FNR == 1 { if(sel) print sel; sel = ""; max = 0 }
FNR < 9 { next }
length($10) > max { max = length($10); sel = FILENAME ":" $0 }
END { if (sel) print sel }' ./"$prot"/*.dlg
This assumes the histograms are always the tenth field; if your input format is even messier than the lump you show, maybe adapt to taste.
In some more detail, the first line triggers on the first line of each input file. If we have collected a previous line (meaning this is not the first input file), print that, and start over. Otherwise, initialize for the first input file. Set sel to nothing and max to zero.
The second line skips lines 1-8 which contain the header.
The third line checks if the current line's histogram is longer than max. If it is, update max to this histogram's length, and remember the current line in sel.
The last line is spillover for when we have processed all files. We never printed the sel from the last file, so print that too, if it's set.
If you mean to say we should find the lines between CLUSTERING HISTOGRAM and the end of the table, we should probably have more information about what the surrounding lines look like. Maybe something like this, though;
awk '/CLUSTERING HISTOGRAM/ { if (sel) print sel; looking = 1; sel = ""; max = 0 }
!looking { next }
looking > 1 && $1 != looking { looking = 0; nextfile }
$1 == looking { if (length($10) > max) { max = length($10); sel = FILENAME ":" $0 }; looking++ }
END { if (sel) print sel }' ./"$prot"/*.dlg
This sets looking to 1 when we see CLUSTERING HISTOGRAM, then counts up with each table row until the first line whose rank is not the expected next number.
I would suggest processing using awk:
for i in $FILES
do
echo -n \""$i\": "
awk 'BEGIN {
output="";
outputlength=0
}
/(^ *[0-9]+)/ { # process only lines that start with a number
if (length(substr($10, 2)) > outputlength) { # if line has more hashes, store it
output=$0;
outputlength=length(substr($10, 2))
}
}
END {
print output # output the resulting line
}' "$i"
done
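To get the output format asked for in the question (file name without the .dlg suffix, then the selected row), the per-file awk step can be wrapped in two small functions. This is only a sketch under stated assumptions: the function names `best_row` and `summarise` are hypothetical, table rows are assumed to start with a rank number and end in `#` characters with no trailing whitespace, and on a tie the first row is kept.

```shell
#!/bin/sh
# Hypothetical helper: print the table row of one dlg file whose histogram
# (the run of # after the last |) is the longest; the first row wins a tie.
best_row() {
    awk -F'|' '
        /^ *[0-9]+ *\|/ && /#+$/ {      # rows start with a rank and end in hashes
            n = length($NF)             # number of # characters in the histogram
            if (n > max) { max = n; row = $0 }
        }
        END { if (row != "") print row }
    ' "$1"
}

# Hypothetical driver: print "name": row for every dlg file in directory $1.
summarise() {
    for f in "$1"/*.dlg; do
        [ -e "$f" ] || continue         # directory without any dlg files
        printf '"%s": %s\n' "$(basename "$f" .dlg)" "$(best_row "$f")"
    done
}

# Usage (assumes the dlg files are in the current directory;
# final_log.txt is the output name mentioned in the question):
# summarise . >> final_log.txt
```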

unix 'sort' command for inline characters

I have a .txt file of pumpkin sizes that I'm trying to sort by size of pumpkin:
name |size
==========
Joe |5
Mary |10
Bill |2
Jill |1
Adam |20
Mar |5
Roe |10
Mir |3
Foo |9
Bar |12
Baz |0
Currently I'm having great difficulty in getting sort to work properly. Can anyone help me sort my list by pumpkin size without modifying the list structure?
The table headings need special consideration, since sorting them would move them to some arbitrary line. So we use a two-step process:
a) output the table headings;
b) sort the rest numerically (-n), in reverse order (-r), with field separator | (-t), starting at field 2 (-k).
$ awk 'NR<=2' in; awk 'NR>2' in | sort -t '|' -nr -k 2
name |size
==========
Adam |20
Bar |12
Roe |10
Mary |10
Foo |9
Mar |5
Joe |5
Mir |3
Bill |2
Jill |1
Baz |0
The key point is the -k option of sort; see man sort for how it works. A solution for your problem:
sed -n '3,$p' YOUR_FILENAME| sort -hrt '|' -k 2
You can simply remove the header lines
name |size
==========
with sed. Whatever is left can then be sorted with sort:
sed '1,2d' txt | sort -t "|" -k 2 -n
Here, sed '1,2d' removes the first 2 lines.
Then sort splits each line into fields on the '|' character, as set by the -t option.
Since you want to sort by size, which is the second field, select it with -k 2.
Finally, -n tells sort to compare the sizes as numbers.
You can do this in the shell:
{ read; echo "$REPLY"; read; echo "$REPLY"; sort -t'|' -k2n; } < pumpkins.txt
That reads and prints the first 2 header lines, then sorts the rest.
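The approaches above can be combined into one helper that keeps the two header lines and also keeps rows of equal size in their original order. This is a minimal sketch: the function name `sort_by_size` is hypothetical, largest-first order is an assumption (drop the `r` for ascending), and `-s` requires a sort that supports stable sorting (GNU and BSD do).

```shell
#!/bin/sh
# Hypothetical helper: print the first two lines (headings and ruler) unchanged,
# then the remaining lines sorted by the number after the '|', largest first.
# -s keeps rows with equal sizes in their original order.
sort_by_size() {
    sed -n '1,2p' "$1"
    sed '1,2d' "$1" | sort -s -t'|' -k2,2nr
}

# Demo on a shortened copy of the question's table.
demo=$(mktemp)
printf '%s\n' 'name |size' '==========' 'Joe  |5' 'Mary |10' \
              'Bill |2' 'Mar  |5' 'Roe  |10' > "$demo"
sort_by_size "$demo"
rm -f "$demo"
```

Because the sort is stable, Mary stays ahead of Roe and Joe ahead of Mar, unlike in the reverse-sorted output shown earlier.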

Shell Script sub string

Please help me write a script for the case below.
The content of my file looks like this:
AllIdPropert.txt (ID|PropertyBit|)
1|0000000000000000000000000|
2|0000100000000000000000000|
3|0000100000000000000000000|
4|0000100000000000000000000|
5|0000000000000000000000000|
6|0000000000000000000000000|
I need to extract all the IDs where PropertyBit[5] == 1 (i.e. the 5th bit is 1) into a different file, in the format below.
5bitenable.txt
2|
3|
4|
This awk one-liner should do the job:
awk -F'|' 'substr($2,5,1)==1 {print $1FS}' file
Test with your example input:
kent$ awk -F'|' 'substr($2,5,1)==1 {print $1FS}' f
2|
3|
4|
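The one-liner can be generalised to any bit position and redirected into the output file the question asks for. A minimal sketch: the function name `ids_with_bit` is hypothetical, and positions are assumed to be counted from the left starting at 1, as in the question.

```shell
#!/bin/sh
# Hypothetical helper: print "ID|" for every line of an ID|bits| file ($1)
# whose bit at position $2 (1-based, counted from the left) is 1.
ids_with_bit() {
    awk -F'|' -v pos="$2" 'substr($2, pos, 1) == "1" { print $1 FS }' "$1"
}

# Usage (file names taken from the question):
# ids_with_bit AllIdPropert.txt 5 > 5bitenable.txt
```

Comparing against the string "1" rather than the number 1 keeps the test a plain character comparison.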

SAS/STAT 12.1: KEYLABEL in PROC TABULATE: need row total and column total lines for "all" to display different labels

I am working in SAS/STAT 12.1 and I have only one issue with my code below, I need to show "Total" for the bottom row (displaying columns sums and percentages), instead of "Both Genders." And yes, the top right-hand column header (displaying row totals and percentages) still needs to be "Both Genders."
I hope there is a simple way to do this using keylabel, but haven't figured it out so far.
proc tabulate data=dmhrind format=8.1;
format gender $gendfmt. ethnic $ethnic.;
class ethnic gender;
table (ethnic all)*f=4. , (gender all)*(n*f=4. colpctn*f=5.1 rowpctn*f=5.1) ;
title 'Ethnic Distribution by Gender';
label ethnic='Race/Ethnicity';
keylabel N='N' colpctn='%' all='Both Genders' reppctn='%' rowpctn = 'Total';
run;
Thanks in advance for any assistance provided.
The only way to do this that I can see is to make a dummy column that simulates All. Using sashelp.class:
data class;
set sashelp.class;
allage = 'All Ages';
run;
proc tabulate data=class format=8.1;
class sex age allage;
table (age allage=' ')*f=4. , (sex all)*(n*f=4. colpctn*f=5.1 rowpctn*f=5.1) ;
title 'age Distribution by sex';
label age='Age';
label allage='All Ages';
keylabel N='N' colpctn='%' all='Both Sexes' reppctn='%' rowpctn = 'Total';
run;
The dummy variable needs to have the text you want as the label as its actual value. You then replace all in the table statement with that variable (adding it to the class statement) and add =' ' to suppress the extra label subrow.
For this, you need to do the titling within the table statement. The following example is similar to yours, using sashelp.class (as in @Joe's example), where age takes the place of your ethnicity variable:
** This option helps improve proc tabulate output on some systems;
options formchar="|----||---|-/\<>*";
** The key is adding the column titles directly in the table stmt;
proc tabulate data=sashelp.class format=8.1;
class sex age;
table (age all='Total')*f=4. , (sex='' all='Both Sexes')*(n='N'*f=4. colpctn='Col %'*f=5.1 rowpctn='Row %'*f=5.1) ;
run;
The output should look like this:-
---------------------------------------------------------------------------
| | F | M | Both Sexes |
| |----------------|----------------|-----------------
| | N |Col %|Row %| N |Col %|Row %| N |Col %|Row %|
|----------------------|----|-----|-----|----|-----|-----|----|-----|------
|Age | | | | | | | | | |
|----------------------- | | | | | | | | |
|11 | 1| 11.1| 50.0| 1| 10.0| 50.0| 2| 10.5|100.0|
|----------------------|----|-----|-----|----|-----|-----|----|-----|------
|12 | 2| 22.2| 40.0| 3| 30.0| 60.0| 5| 26.3|100.0|
|----------------------|----|-----|-----|----|-----|-----|----|-----|------
|13 | 2| 22.2| 66.7| 1| 10.0| 33.3| 3| 15.8|100.0|
|----------------------|----|-----|-----|----|-----|-----|----|-----|------
|14 | 2| 22.2| 50.0| 2| 20.0| 50.0| 4| 21.1|100.0|
|----------------------|----|-----|-----|----|-----|-----|----|-----|------
|15 | 2| 22.2| 50.0| 2| 20.0| 50.0| 4| 21.1|100.0|
|----------------------|----|-----|-----|----|-----|-----|----|-----|------
|16 | .| .| .| 1| 10.0|100.0| 1| 5.3|100.0|
|----------------------|----|-----|-----|----|-----|-----|----|-----|------
|Total | 9|100.0| 47.4| 10|100.0| 52.6| 19|100.0|100.0|
--------------------------------------------------------------------------|

Resources