I am writing a bash script. I have a file like this
./1#1#d41d8cd98f00b204e9800998ecf8427e
./11.txt#2#d41d8cd98f00b204e9800998ecf8427e
./12/1#1#d41d8cd98f00b204e9800998ecf8427e
./12/1#2#d41d8cd98f00b204e9800998ecf8427e
./12/1.txt#1#d41d8cd98f00b204e9800998ecf8427e
./12/1.txt#2#d41d8cd98f00b204e9800998ecf8427e
./12/2.txt#1#d41d8cd98f00b204e9800998ecf8427e
./12/2.txt#2#d41d8cd98f00b204e9800998ecf8427e
./1#2#d41d8cd98f00b204e9800998ecf8427e
./13#2#d41d8cd98f00b204e9800998ecf8427e
./2.txt#1#5d74727d50368c4741d76989586d91de
./2.txt#2#5d74727d50368c4741d76989586d91de
I would like to sort this file, but in a specific way. Let's call the characters up to the first # section one, and the characters between the two # characters section two. So, for example, given a line like this:
./1#2#d41d8cd98f00b204e9800998ecf8427e
Section one: ./1
Section two: 2
What I want to achieve is sorting this file according to section one first and then according to section two. So what is wrong with this example is the 9th line; it should be 2nd.
Is there an easy way to achieve this goal? I am unsure how to tackle this problem. Maybe I should somehow sort this file up to the first # and then sort it again only according to the second section? Even if that is the right approach, I am not sure how to do it.
Expected result:
./1#1#d41d8cd98f00b204e9800998ecf8427e
./1#2#d41d8cd98f00b204e9800998ecf8427e
./11.txt#2#d41d8cd98f00b204e9800998ecf8427e
./12/1#1#d41d8cd98f00b204e9800998ecf8427e
./12/1#2#d41d8cd98f00b204e9800998ecf8427e
./12/1.txt#1#d41d8cd98f00b204e9800998ecf8427e
./12/1.txt#2#d41d8cd98f00b204e9800998ecf8427e
./12/2.txt#1#d41d8cd98f00b204e9800998ecf8427e
./12/2.txt#2#d41d8cd98f00b204e9800998ecf8427e
./13#2#d41d8cd98f00b204e9800998ecf8427e
./2.txt#1#5d74727d50368c4741d76989586d91de
./2.txt#2#5d74727d50368c4741d76989586d91de
Seems like you just want to sort by more than one key:
$ sort -t# -k1,1 -k2 file
./1#1#d41d8cd98f00b204e9800998ecf8427e
./1#2#d41d8cd98f00b204e9800998ecf8427e
./11.txt#2#d41d8cd98f00b204e9800998ecf8427e
./12/1#1#d41d8cd98f00b204e9800998ecf8427e
./12/1#2#d41d8cd98f00b204e9800998ecf8427e
./12/1.txt#1#d41d8cd98f00b204e9800998ecf8427e
./12/1.txt#2#d41d8cd98f00b204e9800998ecf8427e
./12/2.txt#1#d41d8cd98f00b204e9800998ecf8427e
./12/2.txt#2#d41d8cd98f00b204e9800998ecf8427e
./13#2#d41d8cd98f00b204e9800998ecf8427e
./2.txt#1#5d74727d50368c4741d76989586d91de
./2.txt#2#5d74727d50368c4741d76989586d91de
-k1,1 means sort by the first field only, then -k2 means sort by everything from the second field to the end of the line. -t# means that fields are separated by a #.
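If section two can ever be more than one digit, you may also want that key compared numerically. A possible refinement (not needed for the sample data, where section two is a single digit):
sort -t# -k1,1 -k2,2n file
Here -k2,2n restricts the second key to field two alone and compares it as a number, so a section two of 10 sorts after 2 rather than before it.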
I have the following data containing a subset of record numbers, formatted like so:
>head pilot.dat
AnalogPoint,206407
AnalogPoint,2584
AnalogPoint,206292
AnalogPoint,206278
AnalogPoint,206409
AnalogPoint,206410
AnalogPoint,206254
AnalogPoint,206266
AnalogPoint,206408
AnalogPoint,206284
I want to compare the list of entries to another subset file called "disps.dat" to find duplicates, which is formatted in the same way:
>head disps.dat
StatusPoint,280264
StatusPoint,280266
StatusPoint,280267
StatusPoint,280268
StatusPoint,280269
StatusPoint,280335
StatusPoint,280336
StatusPoint,280334
StatusPoint,280124
I used the command:
grep -f pilot.dat disps.dat > duplicate.dat
However, the output file "duplicate.dat" is listing records that exist in the second file "disps.dat", but do not exist in the first file.
(Note, both files are big, so the samples shown above don't have duplicates, but I do expect and have confirmed at least 10-12k duplicates to show up in total.)
> head duplicate.dat
AnalogPoint,208106
AnalogPoint,208107
StatusPoint,1235220
AnalogPoint,217270
AnalogPoint,217271
AnalogPoint,217272
AnalogPoint,217273
AnalogPoint,217274
AnalogPoint,217275
AnalogPoint,217277
> grep "AnalogPoint,208106" pilot.dat
>
I tested the above command with a smaller sample of data (10 records), also formatted the same, and the results work fine, so I'm a little bit confused on why it is failing on the larger execution.
I also tried feeding it in as a string with -F, thinking that the "," comma might be the source of the issue. Right now, I am feeding the data through a 'for' loop and echoing each line, which is executing very, very slowly, but at least it will help me rule out the regex possibility.
The -x or -w option is needed to do an exact match.
-x matches the whole line exactly, while -w matches whole words only and blocks matches that run into adjacent word characters, which works in my case to handle trailing digits.
The issue is that a record in the first file such as:
"AnalogPoint,1"
Would end up flagging records in the second file like:
"AnalogPoint,10"
"AnalogPoint,123"
"AnalogPoint,100200"
And so on.
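With -w, a match counts only when it is not adjacent to other word characters, so those partial hits disappear. The fixed command should presumably be:
grep -w -f pilot.dat disps.dat > duplicate.dat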
Thanks to @Barmar for pointing out my issue.
I am trying to use sed to print the contents between two patterns including the first one. I was using this answer as a source.
My file looks like this:
>item_1
abcabcabacabcabcabcabcabacabcabcabcabcabacabcabc
>item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb
>item_3
cdecde
>item_4
defdefdefdefdefdefdef
I want it to start searching from item_2 (inclusive) and finish at the next occurring > (exclusive). So my code is sed -n '/item_2/,/>/{/>/!p;}'.
The result wanted is:
item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb
but I get it without item_2.
Any ideas?
Using awk, split the input on > and print the record(s) matching item_2.
$ awk 'BEGIN{RS=">";ORS=""} /item_2/' file
item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb
I would go for the awk method suggested by oguz for its simplicity. Now, if you are interested in a sed way, out of curiosity, you could fix what you have already tried with a minor change:
sed -n '/^>item_2/ s/.// ; //,/>/ { />/! p }' input_file
The empty regex // recalls the previous regex, which is handy here to avoid duplicating /item_2/. But keep in mind that // is actually dynamic: it recalls the latest regex evaluated at runtime, which is not necessarily the closest regex on its left (although it often is). Depending on the program flow (branching, address ranges), the content of the same // can change, and... actually, here we have an interesting example! (and I'm not just saying that because it's my baby ^^)
On a line where /^>item_2/ matches, the s/.// command is executed and the latest regex before // becomes /./, so the following address range is equivalent to /./,/>/.
On a line where /^>item_2/ does not match, the latest regex before // is /^>item_2/ so the range is equivalent to /^>item_2/,/>/.
To avoid confusion here as the effect of // changes during execution, it's important to note that an address range evaluates only its left side when not triggered and only its right side when triggered.
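For reference, running this on the sample file should print exactly the requested result:
$ sed -n '/^>item_2/ s/.// ; //,/>/ { />/! p }' input_file
item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb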
This might work for you (GNU sed):
sed -n ':a;/^>item_2/{s/.//;:b;p;n;/^>/!bb;ba}' file
Turn off implicit printing with -n.
If a line begins with >item_2, remove the first character, print the line and fetch the next line.
If that line does not begin with a >, repeat the last two instructions.
Otherwise, repeat the whole set of instructions.
If there will always be only one line following >item_2, then:
sed '/^>item_2/!d;s/.//;n' file
I have a file that contains information in the following form:
"dog/3/cat/6/fish/2/78/90"
(we'll not worry about the last two values here)
Is it possible to sort the contents of the file by the numeric values after the odd-numbered slashes with the Unix sort command?
For instance, the output might look like this:
dog/4/house/3/frog/89/100
dog/3/mouse/2/chicken/12/68/80
dog/2/cat/5/bird/12/77/90
This should give you what you want, I think:
sort -t/ -k2,2nr -k4,4nr -k6,6nr
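-t/ makes / the field separator; for each key, n compares numerically and r reverses the comparison, which gives the descending order your example shows. Applied to your sample output lines, it keeps them in that order:
$ sort -t/ -k2,2nr -k4,4nr -k6,6nr file
dog/4/house/3/frog/89/100
dog/3/mouse/2/chicken/12/68/80
dog/2/cat/5/bird/12/77/90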
I have just discovered the command :sort n in vim (how did I not know about that?!), which has almost done exactly what I need.
What I am trying to sort, though, is a long list of IP addresses (it's an "allow hosts" file to be Included into our apache config), and it would be nice for :sort n to be able to recognise that 123.45.6.7 should sort before 123.45.16.7 (for example).
Is it a safe assumption that I should be less OCD about it and not worry, because I'm not going to be able to do this without a mildly-complex sed or awk command or something?
To be clear, the rows all look something like:
Allow from 1.2.3.4
Allow from 5.6.7.8
Allow from 9.10.11.12
etc
Vim sort seems to be stable in practice (but it is not guaranteed). Therefore you can try:
:%sort n /.*\./
:%sort n /\.\d\+\./
:%sort n /\./
:%sort n
This sorts by the number after the last dot (.* is greedy), then by the number after the first ".digits." sequence (the third octet), then by the number after the first dot (the second octet), and finally by the first number on the line (the first octet). Since each pass is stable in practice, the earlier, less significant passes survive as tie-breakers.
A straightforward way to achieve the correct sorting order without relying on the stability of the sorting algorithm implemented by the :sort command is to prepend zeroes to the numbers within the IP addresses, so that all of the components in them consist of exactly three digits.
Prepend zeros to the single-digit and two-digit numbers (the trailing %&& repeats the substitution, so a number that started as a single digit, now two digits, receives a second zero):
:%s/\<\d\d\?\>/0&/g|%&&
Sort the lines comparing IP addresses as text:
:sort r/\(\d\{3}\)\%(\.\d\{3}\)\{3}/
Strip redundant leading zeros:
:%s/\<00\?\ze\d//g
To run all three steps as a single command, one can use the following one-liner:
:%s/\<\d\d\?\>/0&/g|%&&|sor r/\(\d\{3}\)\%(\.\d\{3}\)\{3}/|%s/\<00\?\ze\d//g
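As a sanity check, this is what a line from the question should look like at each stage (assuming the Allow from format shown above):
Allow from 9.10.11.12       original
Allow from 009.010.011.012  after step 1 (padding)
Allow from 9.10.11.12       after steps 2 and 3 (sorted, zeros stripped)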
I'm not a vim user, so I can't offer a direct way to do it with built-in commands; however, it's possible to replace a section of text with the output of running it through an external command. So a simple script like this could be used:
#!/usr/bin/python
import sys

# Sort lines by the numeric value of each octet of the trailing IP address.
input_lines = sys.stdin.readlines()
sorted_lines = sorted(input_lines,
                      key=lambda line: [int(x) for x in line.split()[-1].split('.')])
for line in sorted_lines:
    sys.stdout.write(line)
See https://www.linux.com/learn/tutorials/442419-vim-tips-working-with-external-commands, section "Filtering text through external filters", which explains how you can use this as a filter within vim.
This script should do what you want and will work on any region where all the selected lines end in an IPv4 address.
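Within vim, assuming the script above is saved as sortip.py (a name chosen here for illustration) somewhere on your PATH and marked executable, filtering the whole buffer through it would look like:
:%!sortip.py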
You can use:
:%!sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n
-t . means use . as the delimiter.
Then sort numerically on each of the four dot-separated fields, from the 1st to the 4th.
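One caveat: on lines like Allow from 1.2.3.4, the first .-separated field is actually Allow from 1, which numeric sort treats as 0, so the first octet is never really compared as a number. If your sort is GNU sort, version sort sidesteps the problem entirely (assuming GNU coreutils):
:%!sort -V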
I was sent a large list of URLs in an Excel spreadsheet, each unique according to a certain GET variable in the string (whose value is a number ranging from 5 to 7 digits in length). I have to run some queries on our databases based on those numbers, and I don't want to go through the hundreds of entries weeding out the numbers one by one. What bash commands can be used to parse out the number from each line (it's the only number in each line) and consolidate it all down to one line of comma-separated values?
A sample (shortened) listing of the CSV spreadsheet includes:
http://www.domain.com/view.php?fDocumentId=123456
http://www.domain.com/view.php?fDocumentId=223456
http://www.domain.com/view.php?fDocumentId=323456
http://www.domain.com/view.php?fDocumentId=423456
DocumentId=523456
DocumentId=623456
DocumentId=723456
DocumentId=823456
....
...
The change of format was intentional, as they decided to simply reduce it down to the variable name and value after a few rows. The change of the GET variable from fDocumentId to just DocumentId was also intentional. Ideal output would look similar to:
123456,223456,323456,423456,523456,623456,723456,823456
EDIT: my apologies, I did not notice that halfway through the list they decided to get froggy and change things around; there are entries that, when saved as CSV, appear as:
"DocumentId=098765 COMMENT, COMMENT"
DocumentId=898765 COMMENT
DocumentId=798765- COMMENT
"DocumentId=698765- COMMENT, COMMENT"
With several other entries that look similar to any of the above rows. COMMENT stands for a single string of upper-case characters, no longer than 3 characters per COMMENT.
Assuming the variable is always on its own, and last on the line, how about just taking whatever is to the right of the =?
sed -r "s/.*=([0-9]+)$/\1/" testdata | paste -sd","
EDIT: Ok, with the new information, you'll have to edit the regex a bit:
sed -r "s/.*f?DocumentId=([0-9]+).*/\1/" testdata | paste -sd","
Here anything after DocumentId or fDocumentId will be captured. Works for the data you've presented so far, at least.
Simpler than that :)
cut -d= -f2 file.csv | paste -sd,
If you're not completely committed to bash, the Swiss Army Chainsaw will help:
perl -ne '{$_=~s/.*=//; $_=~s/ .*//; $_=~s/-//; chomp $_ ; print "$_," }' < YOUR_ORIGINAL_FILE
That cuts everything up to and including an =, then everything after a space, then removes any dashes. Run on the above input, it returns
123456,223456,323456,423456,523456,623456,723456,823456,098765,898765,798765,698765,
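If the trailing comma is unwanted, one possible cleanup is to trim it at the end of the pipeline:
perl -ne '{$_=~s/.*=//; $_=~s/ .*//; $_=~s/-//; chomp $_ ; print "$_," }' < YOUR_ORIGINAL_FILE | sed 's/,$//'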