Sort by specific column [duplicate]

This question already has answers here:
How to sort numeric and literal columns in Vim
(5 answers)
Closed 9 years ago.
If I have a text file with several tab-separated columns like this:
1 foo bar
3 bar foo
How would I sort based on the second or third column?
I read something like using :'<,'>!sort -n -k 2 in visual mode or :sort /.*\%2v/, but neither of these commands seems to work.

You can use the built-in :sort command.
To sort by the second tab-delimited column you can use :sort /[^\t]*\t/ (the matched text, i.e. the first column and its tab, is skipped, so the lines are compared on what follows).
To sort by the third column you can use :sort /\([^\t]*\t\)\{2}/
Generally, to sort by column N, set the repetition count to N - 1, i.e. the pattern skips the first N - 1 columns.

Sadly, it doesn't seem to be possible with a visual block inside the same file and/or with a single command, because Ex commands are linewise, i.e. Ctrl-v + selection + :'<,'>sort would just sort the whole lines either way.
A somewhat hacky "solution" would be to select whatever you want to sort with a visual block, sort it in another window and apply the changes to your original file. Something like this:
Ctrl-v + selection + x + :tabnew + p + :sort + Ctrl-vG$x + :q + `[P (align paste)
Source: Barry Arthur - Sort Me A Column (bairui in #vim on Freenode).

The external sort, called via :'<,'>!sort -k 2, does work. It is only when the -n flag (for numeric sorting) is given but the column you sort on is non-numeric that the result is not as expected. So to use the external sort, just drop -n from your example.
Remark: Also :'<,'>sort /.*\%2v/ does work for me.
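For reference, here is what the same sort looks like as a stand-alone shell command, assuming the buffer has been saved to a tab-separated file (data.tsv is just an illustrative name); this is a sketch of the external-sort approach, not part of the original answers:
sort -t $'\t' -k2,2 data.tsv
sort -t $'\t' -k3,3 data.tsv
The -k2,2 form restricts the key to the second field only, and -t $'\t' makes the tab the explicit field separator.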

Related

Explain how this command finds the 5 most CPU intensive processes

If I have the command
ps | sort -k 3,3 | tail -n 5
How does this find the 5 most CPU-intensive processes?
I get that it is taking all the processes, sorting them based on a column through the -k option, but what does 3,3 mean?
You can find what you are looking for in the official manual of sort (info sort on Linux); in particular, you are interested in the following extracts:
‘-k POS1[,POS2]’
‘--key=POS1[,POS2]’
Specify a sort field that consists of the part of the line between
POS1 and POS2 (or the end of the line, if POS2 is omitted),
_inclusive_.
and, skipping a few paragraphs,
Example: To sort on the second field, use ‘--key=2,2’ (‘-k 2,2’).
See below for more notes on keys and more examples. See also the
‘--debug’ option to help determine the part of the line being used
in the sort.
So, basically, 3,3 specifies that only the third column shall be considered for sorting, and the others will be ignored.
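As an illustration (not part of the original answer), -k 3 on its own would sort from field 3 to the end of the line, while -k 3,3 sorts on field 3 alone. With ps aux, whose third column is %CPU, a numeric version of the pipeline might look like the first line below, and --debug shows exactly which part of each line is used as the key:
ps aux | sort -n -k 3,3 | tail -n 5
ps aux | sort --debug -n -k 3,3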

How to format output of a select statement in Postgres back into a delimited file?

I am trying to work with some oddly created 'dumps' of some tables in postgres. Due to the tables containing specific data I will have to refrain from posting the exact information but I can give an example.
To give a bit more information, someone thought that this exact command was a good way to back up a table.
echo 'select * from test1'|psql > test1.date.txt
However, in this one example that gives a lot of information that no one needs. To be even more fun, the person saw fit to remove the | that is normally seen with the data.
So what I end up with is something like this.
rowid test1
-------+----------------------
1 hi
2 no
(2 rows)
To also note, for this customer there are multiple tables here. My thought was to use some simple Python to figure out where the + was in each line, mark those points, and then apply those points to each line throughout the file.
I was able to make this work for one set of files, but for some reason the next set of files just doesn't work. What happens instead is that on most lines a pipe gets thrown into the middle of the data.
Maybe there is something I am missing here, but does anyone see an easy way to put something like the above back into a normal delimited file that I could then just load into the database?
Any python or bash related suggestions would also work in this case. Thank you.
As mentioned above, without a real example of where the '|' characters that are causing problems are, or a real example of where you are having trouble, it is hard to know whether we are addressing your actual issue. That said, your two primary swiss-army knives for text processing are sed and awk. If you have data similar to your example, with pipes between data fields that you need to discard, then awk provides a fairly easy solution.
Take your short example and add a pipe in the middle that needs to be discarded, e.g.
$ cat dat/pgsql2.txt
rowid test1
-------+----------------------
1 | hi
2 | no
To process the file in awk discarding the '|' and outputting the remaining records in comma-separated-value format, you could do something like the following:
awk '{
    if (NR > 2) {                    # skip the two header lines
        for (i = 1; i <= NF; i++) {
            if ($i != "|") {         # drop the literal pipe separators
                if (i == 1)
                    printf "%s", $i  # first field: no leading comma
                else
                    printf ",%s", $i
            }
        }
        printf "\n"                  # end the record after all fields
    }
}' inputfile
This simply reads from inputfile (last line) and, for every row after the first two (skipping them omits the heading), loops over the fields (NF, 3 in this case). Each field $i that is not "|" is printed: the first field without a comma, every other field with a preceding comma. A newline is printed once all fields of the row have been handled.
Example Output
1,hi
2,no
awk is a bit awkward at first, but as far as text processing goes, there isn't much that will top it.
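A compact variant of the same approach (assuming, as in the sample, that the pipe never appears as the first field; the file names are only illustrative) can be redirected straight into a CSV file:
awk 'NR > 2 { out = $1; for (i = 2; i <= NF; i++) if ($i != "|") out = out "," $i; print out }' dat/pgsql2.txt > test1.csv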
After trying multiple methods, sadly the only way I could make this work was to just use the import feature of Excel and then play with that to get the columns I needed.

How to use the shell to combine files so the content keeps its original order and has no repeated lines

I have 2 files and I want to combine them.
The combined content must have no repeated lines and must keep the original order.
When I use uniq I have to sort first, and the sort command deranges the order!
So, how can I use the shell to combine the files so that the content keeps its original order and has no repeated lines?
Your question is unclear. Do you mean that you want the order to be stable? Do you want the inputs to be interleaved in some way? If you just want all of the content of file 1 (in its original order) followed by all of the content of file 2 (in its original order) with duplicates removed, you can do:
awk '!a[$0]++' input1 input2
If you want the input interleaved in some way, you'll need to describe the ordering that you want. (By "default order", it sounds like you want the data sorted, but if sort makes the ordering "deranged", then that's clearly not what you want.)
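As a small illustration (the file contents here are invented, not from the question), if input1 holds the lines a, b, c and input2 holds b, d, then
printf 'a\nb\nc\n' > input1
printf 'b\nd\n' > input2
awk '!a[$0]++' input1 input2
prints a, b, c, d: each line appears only the first time it is seen, in input order.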
You can use Ruby to do it - sorting doesn't matter:
array_1 = File.read("file1").split("\n")
array_2 = File.read("file2").split("\n")
union = array_1 | array_2   # Array#| removes duplicates while keeping first-seen order
puts union

Create column that is equal to the value of an existing column in same file +1?

I have a tab delimited text file with several million rows and with 2 columns that looks like this:
1 693731
1 729679
1 730087
1 731718
1 734349
I want to add an additional column to the file that is equal to the value of column 2 + 1. So for the above example it would look like this:
1 693731 693732
1 729679 729680
1 730087 730088
1 731718 731719
1 734349 734350
What would be the best way to do this using unix shell? Any input would be greatly appreciated.
When you're dealing with columnar data, "awk" is a good tool. It also has math capabilities built-in, so it's a natural for this:
awk '{ print $1"\t"$2"\t"($2+1); }' < data.tsv
awk runs this code for every line in the input. For each of those lines, the $ notation indicates a column: $1 is the first column in the current row, $2 is the second, and so on.
Though I prefer explicit column enumeration, you may use ghoti's optimization, where $0 represents all data on the current row:
awk '{ print $0"\t"($2+1); }' < data.tsv
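A variant of the same idea (a sketch, not from the original answers) is to let awk insert the tabs via the output field separator, which keeps the print free of literal "\t" strings:
awk 'BEGIN { OFS = "\t" } { print $1, $2, $2 + 1 }' data.tsv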
Because of the UNIX toolbox approach, there are many ways to solve this problem. Whether or not this is "best" depends on many factors: speed, maintainability, portability, etc.

Find matched and unmatched records and position of key-word is unknown [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I have two files, FILE1 & FILE2, and let's say both have a fixed record length of 30 characters. I need to find the records from FILE1 & FILE2 which contain the string 'COBOL', where the position of this key-word is unknown and changes for every record. To be more clear, below is a sample layout.
FILE1 :
NVWGNLVKAKOIVCOBOLLKASVOIWSNVS
SOSIVSNAVIS7780HLSVHSKSCOBOL56
ZXCVBNMASDFGHJJKKLIIUYYTRREEWQ
1234567890COBOL1234556FCVHJJHH
COBOL1231231231231231341234334
FILE2 :
123456789012345678901234567890
COBOL1231231231231231341234334
GYKCHYYIIHHFTIUIHGJUGTUHGFUYHG
Can anyone explain to me how to do this using SORT or JOINKEYS, and also by using a COBOL program?
I need two output files.
Output FILE-OP1 (which contains all the records having the COBOL key-word from file1 & file2):
NVWGNLVKAKOIVCOBOLLKASVOIWSNVS
SOSIVSNAVIS7780HLSVHSKSCOBOL56
1234567890COBOL1234556FCVHJJHH
COBOL1231231231231231341234334
COBOL1231231231231231341234334
Output FILE-OP2 (which contains only the matching records with the COBOL key-word from file1 & file2):
COBOL1231231231231231341234334
An example, pseudo-codeish, Cobol:
Open File1
Read File1 into The-Record
Perform until End-Of-File
Perform varying II from 1 by 1
until II > length of The-Record
If The-Record (II:5) = 'COBOL'
Display "Found COBOL at position " II
End-If
End-Perform
Read File1 into The-Record
End-perform
Repeat for file2 with the same program pointed at your other file.
As this sounds homework-y, I've left several little quirks that you will need to fix in that code, but you should see where it blows up or fails and be able to resolve those reasonably easily.
If you need to do some sort of matching and dropping between the two files, that is a different animal and you need to get your rules for it. Are you trying to match the files that have "COBOL" located in the same position or something? What behavior do you expect?
For your FILE1, SORT it on the entire input data, only including records which contain COBOL and appending a sequence number (you show your output in the original sequence). If there can be duplicate records, SORT on the sequence-number you attach as well.
Similar for FILE2.
The SORT for each program can be stand-alone (DFSORT or SyncSORT) or within a COBOL program.
You then "match" the files, here's a useful bit of pseudo-code from Bruce Martin: https://stackoverflow.com/a/22950005/1927206
Logically after the match, you then need to SORT both outputs on the sequence-number alone, and after that remove the sequence-numbers.
Remember that you only need to know whether COBOL is present in the data, not where it is or how many times it may appear. If you use COBOL for the first two SORTs, you have a variety of ways to locate the word COBOL: as Joe Zitzelberger showed, you can use a one-byte reference-modification, but be careful not to go beyond the data with your PERFORM VARYING (use compiler option SSRANGE if you are unclear what I mean); you can use INSPECT, UNSTRING or STRING; define your data with an OCCURS for a length of five and use an index for a one-byte table; use OCCURS DEPENDING ON; or do it "byte at a time"; etc.
This is a little bit like free-format number handling.
You can use "SS" (substring search) in DFSORT to find records containing COBOL.
Step 1: read both input files and produce one output file, OP-1, with
INCLUDE COND=(1,30,SS,EQ,C'COBOL')
Step 2: produce a work file in the same way as step 1, using only FILE1.
Step 3: produce a work file in the same way as step 1, using only FILE2.
Run JOINKEYS on these two work files to find the matches ==> output file OP-2.
Essentially this strategy serves to eliminate non-qualifying rows from the join.
