The built-in VIM :sort command sorts lines of text. I want to sort words in a single line, e.g. transform the line
b a d c e f
to
a b c d e f
Currently I accomplish this by selecting the line and then using :!tr ' ' '\n' | sort | tr '\n' ' ', but I'm sure there's a better, simpler, quicker way. Is there?
Note that I use bash so if there's a shorter and more elegant bash command for doing this it's also fine.
EDIT: My use-case is that I have a line that says SOME_VARIABLE="one two three four etc" and I want the words in that variable to be sorted, i.e. I want to have SOME_VARIABLE="etc four one three two".
The end result should preferably be mappable to a shortcut key as this is something I find myself needing quite often.
In pure vim, you could do this:
call setline('.', join(sort(split(getline('.'), ' ')), " "))
Edit
To do this so that it works over a range that is less than one line is a little more complicated (this allows either sorting multiple lines individually or sorting part of one line, depending on the visual selection):
command! -nargs=0 -range SortWords call SortWords()
" Add a mapping, go to your string, then press vi",s
" vi" selects everything inside the quotation
" ,s calls the sorting algorithm
vmap ,s :SortWords<CR>
" Normal mode one: ,s to select the string and sort it
nmap ,s vi",s
function! SortWords()
" Get the visual mark points
let StartPosition = getpos("'<")
let EndPosition = getpos("'>")
if StartPosition[0] != EndPosition[0]
echoerr "Range spans multiple buffers"
elseif StartPosition[1] != EndPosition[1]
" This is a multiple line range, probably easiest to work line wise
" This could be made a lot more complicated and sort the whole
" lot, but that would require thoughts on how many
" words/characters on each line, so that can be an exercise for
" the reader!
for LineNum in range(StartPosition[1], EndPosition[1])
call setline(LineNum, join(sort(split(getline(LineNum), ' ')), " "))
endfor
else
" Single line range, sort words
let CurrentLine = getline(StartPosition[1])
" Split the line into the prefix, the selected bit and the suffix
" The start bit
if StartPosition[2] > 1
let StartOfLine = CurrentLine[:StartPosition[2]-2]
else
let StartOfLine = ""
endif
" The end bit
if EndPosition[2] < len(CurrentLine)
let EndOfLine = CurrentLine[EndPosition[2]:]
else
let EndOfLine = ""
endif
" The middle bit
let BitToSort = CurrentLine[StartPosition[2]-1:EndPosition[2]-1]
" Move spaces at the start of the section to variable StartOfLine
while BitToSort[0] == ' '
let BitToSort = BitToSort[1:]
let StartOfLine .= ' '
endwhile
" Move spaces at the end of the section to variable EndOfLine
while BitToSort[len(BitToSort)-1] == ' '
let BitToSort = BitToSort[:len(BitToSort)-2]
let EndOfLine = ' ' . EndOfLine
endwhile
" Sort the middle bit
let Sorted = join(sort(split(BitToSort, ' ')), ' ')
" Reform the line
let NewLine = StartOfLine . Sorted . EndOfLine
" Write it out
call setline(StartPosition[1], NewLine)
endif
endfunction
Using great ideas from your answers, especially Al's answer, I eventually came up with the following:
:vnoremap <F2> d:execute 'normal i' . join(sort(split(getreg('"'))), ' ')<CR>
This maps the F2 button in visual mode to delete the selected text, split, sort and join it and then re-insert it. When the selection spans multiple lines this will sort the words in all of them and output one sorted line, which I can quickly fix using gqq.
I'll be glad to hear suggestions on how this can be further improved.
Many thanks, I've learned a lot :)
EDIT: Changed '<C-R>"' to getreg('"') to handle text with the char ' in it.
Here's the equivalent in pure vimscript:
:call setline('.',join(sort(split(getline('.'),' ')),' '))
It's no shorter or simpler, but if this is something you do often, you can run it across a range of lines:
:%call setline('.',join(sort(split(getline('.'),' ')),' '))
Or make a command
:command -nargs=0 -range SortLine <line1>,<line2>call setline('.',join(sort(split(getline('.'),' ')),' '))
Which you can use with
:SortLine
:'<,'>SortLine
:%SortLine
etc etc
:!perl -ne '$,=" ";print sort split /\s+/'
Not sure if it requires explanation, but if yes:
perl -ne ''
runs whatever is within '' for every line in input - putting the line in default variable $_.
$,=" ";
Sets list output separator to space. For example:
=> perl -e 'print 1,2,3'
123
=> perl -e '$,=" ";print 1,2,3'
1 2 3
=> perl -e '$,=", ";print 1,2,3'
1, 2, 3
Pretty simple.
print sort split /\s+/
is a shortened version of:
print( sort( split( /\s+/, $_ ) ) )
($_ at the end is the default variable).
split splits $_ into a list using the given regexp, sort sorts the given list, and print prints it.
Maybe you prefer Python:
!python -c "import sys; print(' '.join(sorted(sys.stdin.read().split())))"
Visual select text, and execute this line.
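For the SOME_VARIABLE="..." use case from the question, only the words inside the quotes should be reordered. Here is a minimal Python sketch of that idea, assuming the value is a double-quoted string on a single line; the function name sort_quoted_words is made up for illustration.

```python
import re

def sort_quoted_words(line):
    """Sort the whitespace-separated words inside each double-quoted string."""
    def sort_match(match):
        # match.group(1) is the text between the quotes
        return '"' + ' '.join(sorted(match.group(1).split())) + '"'
    return re.sub(r'"([^"]*)"', sort_match, line)

print(sort_quoted_words('SOME_VARIABLE="one two three four etc"'))
# prints: SOME_VARIABLE="etc four one three two"
```

Text outside the quotes is left untouched, so a line without any quoted string comes back unchanged.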
My AdvancedSorters plugin now has a :SortWORDs command that does this (among other sorting-related commands).
I'm having a hard time understanding how to achieve what I want using awk, and after searching for quite some time I couldn't find the solution I'm looking for.
I have an input text that looks like this:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(
Element 4
)
Another line
(
Element 1, span 1 to
Element 5, span 4
)
Another Line
I want to properly format the weird lines between ' (' and ')'. The expected output is as follows:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
Looking up on stack overflow I found this :
How to select lines between two marker patterns which may occur multiple times with awk/sed
So what I'm using now is echo $text | awk '/ \(/{flag=1;next}/\)/{flag=0}flag'
Which almost works except it filters out the non-matching lines, here's the output produced by this very last command:
(Element 4)
(Element 1, span 1 to Element 5, span 4)
Does anyone know how to do this? I'm open to any suggestion, including not using awk if you know better.
Bonus point if you teach me how to remove syntax highlighting on my question code blocks :)
Thanks a billion times
Edit: OK, so I accepted @EdMorton's solution as he provided something using awk (well, GNU awk). However, I'm currently using @aaron's sed voodoo incantations with great success and will probably continue doing so until I hit anything new on that specific use case.
I strongly suggest reading EdMorton's explanation; the last paragraph made my day. If anyone passing by has good resources regarding awk/sed they can share, feel free to do so in the comments.
Here's how I would do it with GNU sed :
s/^\s*(/(/;/^(/{:l N;/)/b e;b l;:e s/\n//g}
Which, for those who don't speak gibberish, means :
remove the leading spaces from lines that start with spaces and an opening bracket
test if the line now starts with an opening bracket. If that's the case, do the following :
mark this spot as the label l, which denotes the start of a loop
add a line from the input to the pattern space
test if you now have a closing bracket in your pattern space
if so, jump to the label e
(if not) jump to the label l
mark this spot as the label e, which denotes the end of the code
remove the linefeeds from the pattern space
(implicitly print the pattern space, whether it has been modified or not)
This can probably be refined, but it does the trick :
$ echo """Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(
Element 4
)
Another line
(
Element 1, span 1 to
Element 5, span 4
)
Another Line """ | sed 's/^\s*(/(/;/^(/{:l N;/)/b e;b l;:e s/\n//g}'
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
Edit : if you can disable history expansion (set +H), this sed command is nicer : s/^\s*(/(/;/^(/{:l N;/)/!b l;s/\n//g}
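For readers who find the loop easier to follow in a general-purpose language, the steps listed above can be sketched in Python. This is only an illustration under assumptions: the function name join_paren_blocks is made up, the input is a list of lines without trailing newlines, and, unlike the sed script, it tests for the closing bracket before reading another line, so a one-line (x) at the start of a line is left alone.

```python
def join_paren_blocks(lines):
    """Join each block that starts on a '(' line and ends at the first ')'."""
    out = []
    it = iter(lines)
    for line in it:
        stripped = line.lstrip()
        if stripped.startswith('('):
            # Drop the leading spaces, then keep appending lines
            # until the buffer contains a closing bracket.
            buf = stripped
            while ')' not in buf:
                nxt = next(it, None)
                if nxt is None:  # input ended before ')' was found
                    break
                buf += nxt       # concatenating without '\n' removes the linefeed
            out.append(buf)
        else:
            out.append(line)     # everything else is passed through unchanged
    return out
```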
sed is for simple substitutions on individual lines, that is all. If you try to do anything else with it then you are using constructs that became obsolete in the mid-1970s when awk was invented, are almost certainly non-portable and inefficient, are always just a pile of indecipherable arcane runes, and are used today just for mental exercise.
The following uses GNU awk for multi-char RS, RT and the \s shorthand for [[:space:]] and works by simply isolating the (...) strings and then doing whatever you want with them:
$ cat tst.awk
BEGIN {
RS="[(][^)]+[)]" # a regexp for the string you want to isolate in RT
ORS="" # disable appending of newlines so we print as-is
}
{
gsub(/\n[[:blank:]]+$/,"\n") # remove any blanks before RT at the start of each line
sub(/\(\s+/,"(",RT) # remove spaces after ( in RT
sub(/\s+\)/,")",RT) # remove spaces before ) in RT
gsub(/\s+/," ",RT) # compress each chain of spaces to one blank char in RT
print $0 RT # print the result
}
$ awk -f tst.awk file
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
If you're considering using a sed solution for this also consider how you would enhance it if/when you have the slightest requirements change. Any change to the above awk code would be trivial and obvious while a change to the equivalent sed code would require first sacrificing a goat under a blood moon then breaking out your copy of the Rosetta Stone...
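The same "isolate the (...) strings, then tidy them" idea can also be sketched in Python. This is a rough equivalent rather than a translation of the awk script: the function name tidy_parens is hypothetical, and it assumes the whole input fits in one string.

```python
import re

def tidy_parens(text):
    """Normalise whitespace inside every (...) group of a multi-line string."""
    def fix(match):
        # Collapse each run of whitespace (including newlines) to one blank
        # and drop the blanks next to the brackets.
        inner = ' '.join(match.group(1).split())
        return '(' + inner + ')'
    # Remove blanks that precede a '(' at the start of a line.
    text = re.sub(r'^[ \t]+(?=\()', '', text, flags=re.MULTILINE)
    # The same pattern as the awk RS: '(' then non-')' characters then ')'.
    return re.sub(r'\(([^)]+)\)', fix, text)
```

Groups that already sit on one line, such as (with something here), pass through unchanged because their whitespace is already normalised.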
It's doable in awk, and maybe there's a slicker way than this. It looks for lines between and including those containing only blanks and either an open or close parenthesis, and processes them specially. Everything else it just prints:
awk '/^ *\( *$/,/^ *\) *$/ {
sub(/^ */, "");
sub(/ *$/, "");
if ($1 ~ /[()]/) hold = hold $1; else hold = hold " " $0
if ($0 ~ /\)/) {
sub(/\( /, "(", hold)
sub(/ \)/, ")", hold)
print hold
hold = ""
}
next
}
{ print }' data
The variable hold is initially empty.
The first pair of sub calls strip leading and trailing blanks (copying the data from the question, there's a blank after span 1 to). The if adds the ( or ) to hold without a space, or the line to hold after a space. If the close parenthesis is present, remove the space after the open parenthesis and before the close parenthesis, print hold, and reset hold to empty. Always skip the rest of the script with next. The rest of the script is { print } — print unconditionally, often written 1 by minimalists.
The file data is copy'n'paste from the data in the question.
Output:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
The 'Another Line' (with capital L) has a trailing blank because the data in the question does.
With awk
$ cat fmt.awk
function rem_wsp(s) { # remove white spaces
gsub(/[\t ]/, "", s)
return s
}
function beg() {return rem_wsp($0)=="("}
function end() {return rem_wsp($0)==")"}
function dump_block() {
print "(" block ")"
}
beg() {
in_block = 1
next
}
end() {
dump_block()
in_block = block = ""
next
}
in_block {
sep = (length(block) > 0) ? " " : "" # reset the separator so a new block does not start with a space
block = block sep $0
next
}
{
print
}
END {
if (in_block) dump_block()
}
Usage:
$ awk -f fmt.awk fime.dat
So I have a text file that contains a large number of lines. Each line is one long string with no spacing, however, the line contains several pieces of information. The program knows how to differentiate the important information in each line. The program identifies that the first 4 numbers/letters of the line coincide to a specific instrument. Here is a small example portion of the text file.
example text file
1002IPU3...
POIPIPU2...
1435IPU1...
1812IPU3...
BFTOIPD3...
1435IPD2...
As you can see, there are two lines that contain 1435 within this text file, which coincides with a specific instrument. However these lines are not identical. The program I'm using can not do its calculation if there are duplicates of the same station (ie, there are two 1435* stations). I need to find a way to search through my text files and identify if there are any duplicates of the partial strings that represent the stations within the file so that I can delete one or both of the duplicates. If I could have BASH script output the number of the lines containing the duplicates and what the duplicates lines say, that would be appreciated. I think there might be an easy way to do this, but I haven't been able to find any examples of this. Your help is appreciated.
If all you want to do is detect if there are duplicates (not necessarily count or eliminate them), this would be a good starting point:
awk '{ if (++seen[substr($0, 1, 4)] > 1) printf "Duplicates found : %s\n",$0 }' inputfile.txt
For that matter, it's a good starting point for counting or eliminating, too, it'll just take a bit more work...
If you want the count of duplicates:
awk '{a[substr($0,1,4)]++} END {for (i in a) {if(a[i]>1) print i": "a[i]}}' test.in
1435: 2
or:
{
a[substr($0,1,4)]++ # put prefixes to array and count them
}
END { # in the end
for (i in a) { # go thru all indexes
if(a[i]>1) print i": "a[i] # and print out the duplicate prefixes and their counts
}
}
Slightly roundabout but this should work-
cut -c 1-4 file.txt | sort -u > list
for i in `cat list`;
do
echo -n "$i "
grep -c ^"$i" file.txt #This tells you how many occurrences of each 'station'
done
Then you can do whatever you want with the ones that occur more than once.
Use the following Python script (Python 2.7 syntax):
#!/usr/bin/python
file_name = "device.txt"
f1 = open(file_name,'r')
device = {}
line_count = 0
for line in f1:
    line_count += 1
    if device.has_key(line[:4]):
        device[line[:4]] = device[line[:4]] + "," + str(line_count)
    else:
        device[line[:4]] = str(line_count)
f1.close()
print device
The script reads each line and treats its first 4 characters as the device name. It builds a dictionary, device, whose keys are the device names and whose values are the comma-separated line numbers where each name occurs.
The output would be:
{'POIP': '2', '1435': '3,6', '1002': '1', '1812': '4', 'BFTO': '5'}
This might help you out!
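Since the question asks for the line numbers and the text of the duplicated lines only, here is a small Python 3 sketch along the same lines. The function name find_duplicate_stations and the width parameter are assumptions for illustration; the sample data is taken from the question.

```python
from collections import defaultdict

def find_duplicate_stations(lines, width=4):
    """Map each repeated prefix to a list of (line_number, line) pairs."""
    seen = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        seen[line[:width]].append((number, line.rstrip('\n')))
    # Keep only the prefixes that occur more than once.
    return {key: hits for key, hits in seen.items() if len(hits) > 1}

sample = ["1002IPU3...", "POIPIPU2...", "1435IPU1...",
          "1812IPU3...", "BFTOIPD3...", "1435IPD2..."]
for station, hits in find_duplicate_stations(sample).items():
    for number, line in hits:
        print(station, number, line)
# prints:
# 1435 3 1435IPU1...
# 1435 6 1435IPD2...
```

Passing an open file object instead of the list works too, because the function only iterates over lines.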
I have a text file that looks like the following:
1000000 45 M This is a line This is another line Another line
that breaks into that also breaks that has a blank
multiple rows into multiple rows - row below.
How annoying!
1000001 50 F I am another I am well behaved.
column that has
text spanning
multiple rows
I would like to convert this into a csv file that looks like:
1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.
The text file output comes from a program that was written in 1984, and I have no way to modify the output. I want it in csv format so that I can convert it to Excel as painlessly as possible. I am not sure where to start, and rather than reinvent the wheel, was hoping someone could point me in the right direction. Thanks!
== EDIT ==
I've modified the text file to have \n between rows - maybe this will be helpful?
== EDIT 2 ==
I've modified the text file to have a blank row.
Using GNU awk
gawk '
BEGIN { FIELDWIDTHS="11 6 5 22 22" }
length($1) == 11 {
if ($1 ~ /[^[:blank:]]/) {
if (f1) print_line()
f1=$1; f2=$2; f3=$3; f4=$4; f5=$5
}
else {
f4 = f4" "$4; f5 = f5" "$5
}
}
function rtrim(str) {
sub(/[[:blank:]]+$/, "", str)
return str
}
function print_line() {
gsub(/[[:blank:]]{2,}/, " ", f4); gsub(/"/, "&&", f4)
gsub(/[[:blank:]]{2,}/, " ", f5); gsub(/"/, "&&", f5)
printf "%s,%s,%s,\"%s\",\"%s\"\n", rtrim(f1), rtrim(f2), rtrim(f3),f4,f5
}
END {if (f1) print_line()}
' file
1000000,45,M,"This is a line that breaks into multiple rows ","This is another line that also breaks into multiple rows - How annoying!"
1000001,50,F,"I am another column that has text spanning multiple rows","I am well behaved. "
I've quoted the last 2 columns in case they contain commas, and doubled any potential inner double quotes.
Here's a Perl script that does what you want. It uses unpack to split the fixed width columns into fields, adding to the previous fields if there is no data in the first column.
As you've mentioned that the widths vary between files, the script works out the widths for itself, based on the content of the first line. The assumption is that there are at least two space characters between each field. It creates a format string like A11 A6 A5 A22 A21, where "A" means any character and the numbers specify the width of each field.
Inspired by glenn's version, I have wrapped any field containing spaces in double quotes. Whether that's useful or not depends on how you're going to end up using the data. For example, if you want to parse it using another tool and there are commas within the input, it may be helpful. If you don't want it to happen, you can change the grep block in both places to simply grep { $_ ne "" }:
use strict;
use warnings;
chomp (my $first_line = <>);
my @fields = split /(?<=\s{2})(?=\S)/, $first_line;
my $format = join " ", map { "A" . length } @fields;
my @cols = unpack $format, $first_line;
while(<>) {
    chomp( my $line = $_ );
    my @tmp = unpack $format, $line;
    if ($tmp[0] ne '') {
        print join(", ", grep { $_ ne "" && /\s/ ? qq/"$_"/ : $_ } @cols), "\n";
        @cols = @tmp;
    }
    else {
        for (1..$#tmp) {
            $cols[$_] .= " $tmp[$_]" if $tmp[$_] ne "";
        }
    }
}
print join(", ", grep { $_ ne "" && /\s/ ? qq/"$_"/ : $_ } @cols), "\n";
Output:
1000000, 45, M, "This is a line that breaks into multiple rows", "This is another line that also breaks into multiple rows - How annoying!"
1000001, 50, F, "I am another column that has text spanning multiple rows", "I am well behaved."
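The same width-detection idea can be sketched in Python, in case that is easier to adapt. This is an illustration under the same assumption as above (at least two spaces between fields on the first line); the function names column_slices and merge_rows are made up, and the sketch returns lists of cells instead of printing.

```python
import re

def column_slices(first_line):
    """Return (start, end) pairs, one per column, from the first record."""
    # A column starts at 0 or at a non-space character preceded by 2+ spaces.
    starts = [0] + [m.end() for m in re.finditer(r'\s{2,}(?=\S)', first_line)]
    ends = starts[1:] + [None]  # the last column runs to the end of the line
    return list(zip(starts, ends))

def merge_rows(lines):
    """Fold continuation rows (blank first column) into the previous record."""
    slices = column_slices(lines[0])
    records = []
    for line in lines:
        cells = [line[a:b].strip() for a, b in slices]
        if cells[0]:      # data in the first column: a new record starts
            records.append(cells)
        elif records:     # continuation row: append the non-empty cells
            for i, cell in enumerate(cells):
                if cell:
                    records[-1][i] = (records[-1][i] + ' ' + cell).strip()
    return records
```

The resulting records can be handed to csv.writer(...).writerows(records) from the standard library, which takes care of quoting fields that contain commas.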
Using this awk:
awk -F ' {2,}' -v OFS=', ' 'NF==5{if (p) print a[1], a[2], a[3], a[4], a[5];
for (i=1; i<=NF; i++) a[i]=$i; p=index($0,$4)}
NF<4 {for(i=2; i<=NF; i++) index($0,$i) == p ? a[4]=a[4] " " $i : a[5]=a[5] $i}
END { print a[1], a[2], a[3], a[4], a[5] }' file
1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.
You can write a script in Python that does that. Read each line and call split on it; if the line is not empty, append its columns to the previous row. If it is empty, treat the next line as the start of a new row in the result set. Finally, use the csv writer to write the result set to a file.
Something along the lines of:
import csv

filename = 'input.txt'            # placeholder input path
csvOutputFileName = 'output.csv'  # placeholder output path

results = []
isNewItem = True
with open(filename, 'r') as inputFile:
    for line in inputFile:
        if line.strip() == '':
            # a blank line means the next line starts a new item
            isNewItem = True
            continue
        temp = line.split()
        if isNewItem:
            results.append(temp)
            isNewItem = False
        else:
            # merge this line into the previous row, column by column
            lastRow = results[-1]
            results[-1] = [left + ' ' + right for left, right in zip(lastRow, temp)]

with open(csvOutputFileName, 'w', newline='') as outFile:
    csv.writer(outFile).writerows(results)
Given the following indented text:
  two spaces
    four
      six
non-leading spaces
I'd like to convert every 2 leading spaces to a tab, essentially converting from soft tabs to hard tabs. I'm looking for the following result (using an 'x' instead of "\t"):
xtwo spaces
xxfour
xxxsix
non-leading spaces
What is the most efficient or eloquent way to do this in ruby?
What I have so far seems to be working, but it doesn't feel right.
input.gsub!(/^ {2}/,"x")
res = []
input.split(/\n/).each do |line|
while line =~ /^x+ {2}/
line.gsub!(/^(x+) {2}/,"\\1x")
end
res << line
end
puts res.join("\n")
I noticed the answer using Perl and \G:
perl -pe '1 while s/\G {2}/\t/gc' input.txt >output.txt
But I can't figure out how to mimic the pattern in Ruby. This is as far as I got:
rep = 1
while input =~ /^x* {2}/ && rep < 10
input.gsub!(/\G {2}/,"x")
rep += 1
end
puts input
What's wrong with using (?:^ {2})|\G {2} in multi-line mode?
The first match will always be at the beginning of the line,
then \G will match successively right next to that, or the match
will fail. The next match will always be at the beginning of a line, and so on.
In Perl it's $str =~ s/(?:^ {2})|\G {2}/x/mg; or $str =~ s/(?:^ {2})|\G {2}/\t/mg;
Ruby http://ideone.com/oZ4Os
input.gsub!(/(?:^ {2})|\G {2}/m,"x")
Edit: Of course the anchors can be factored out and put into an alternation
http://ideone.com/1oDOJ
input.gsub!(/(?:^|\G) {2}/m,"x")
You can just use a single gsub for that:
str.gsub(/^( {2})+/) { |spaces| "\t" * (spaces.length / 2) }
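For comparison, here is the single-substitution approach sketched in Python. Python's standard re module has no \G anchor, so the callback form is the natural fit there. The function name leading_spaces_to_tabs and its width and tab parameters are assumptions for illustration.

```python
import re

def leading_spaces_to_tabs(text, width=2, tab='\t'):
    """Replace each run of `width` leading spaces with one tab character."""
    # Match all complete groups of `width` spaces at the start of each line.
    pattern = re.compile(r'^(?: {%d})+' % width, flags=re.MULTILINE)
    return pattern.sub(lambda m: tab * (len(m.group(0)) // width), text)

sample = "  two spaces\n    four\n      six\nnon-leading spaces"
print(leading_spaces_to_tabs(sample, tab='x'))
# prints:
# xtwo spaces
# xxfour
# xxxsix
# non-leading spaces
```

An odd leftover space at the end of the indentation is kept as-is, since only complete groups are matched.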
I am trying to replace a two letter state abbreviation with text then the abbreviation.
Eventually I want to find and replace the rest. How do I capture the value found? .... I tried \1 and {1}
AL 32.2679134368897 -86.5251510620117
AR 35.2315113544464 -92.2926173210144
AZ 33.3440766538127 -111.955985217148
CO 39.7098631425337 -104.899092934348
if( usState == "AZ") dpos= "33.4736704187888" + " " + "-112.043138087587";
if( usState == "CA") dpos= "36.0783581515733" + " " + " -119.868895584259";
if( usState == "CO") dpos= "39.8950788035537" + " " + " -104.831521872318";
if( usState == "CT") dpos= "41.6001570945562" + " " + " -72.6606015937273";
Update
$1 does not work.
I am finding: [A-Z][A-Z]
replacing with: if( usState == "$1
Oddly enough, Visual Studio regular expressions are different from normal .NET regular expressions. They have a slightly different syntax for tags and replacements. To tag a piece of text for later matching, you must wrap it in braces {}. Then you can use \n in the replacement string, where n is the number of the tagged expression (\1 for the first). For your scenario, here are the strings you should use:
Find: {[A-Z][A-Z]}
Replace: if( usState == "\1")
My regex matcher matches $1. Try that.
I might not have understood your problem, but why don't you record a temporary macro to do the transformation?
This question seems to be a duplicate of https://stackoverflow.com/a/3147177/154480, but I found this one first. Since Visual Studio 2012, you can use (pattern) and $1. For this specific question, find ([A-Z]{2}) and replace it with if( usState == "$1")
Enclose the [A-Z][A-Z] within parentheses, which captures it; then, use \1 in your replacement string to refer to the capture.
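The capture-and-backreference idea is the same in most regex engines. As an illustration only, here is how the whole transformation from the question could look with Python's re module (Python syntax, not Visual Studio's); the function name to_if_statement is made up, and the pattern assumes each line holds a two-letter code followed by two whitespace-separated numbers.

```python
import re

def to_if_statement(line):
    """Turn 'AZ 33.34 -111.95' into an if-statement like those in the question."""
    # Group 1: state code, group 2: first number, group 3: second number.
    pattern = r'^([A-Z]{2})\s+(\S+)\s+(\S+)$'
    replacement = r'if( usState == "\1") dpos= "\2" + " " + "\3";'
    return re.sub(pattern, replacement, line)

print(to_if_statement('AL 32.2679134368897 -86.5251510620117'))
# prints: if( usState == "AL") dpos= "32.2679134368897" + " " + "-86.5251510620117";
```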