Get first N chars and sort them - bash

I have a requirement where i need to fetch first four characters from each line of file and sort them.
I tried below way. but its not sorting each line
cut -c1-4 simple_file.txt | sort -n
O/p using above:
appl
bana
uoia
Expected output:
alpp
aabn
aiou

sort isn't the right tool for the job in this case, as it used to sort lines of input, not the characters within each line.
I know you didn't tag the question with perl but here's one way you could do it:
perl -F'' -lane 'print(join "", sort #F[0..3])' file
This uses the -a switch to auto-split each line of input on the delimiter specified by -F (in this case, an empty string, so each character is its own element in the array #F). It then sorts the first 4 characters of the array using the standard string comparison order. The result is joined together on an empty string.

Try defining two helper functions:
explodeword () {
test -z "$1" && return
echo ${1:0:1}
explodeword ${1:1}
}
sortword () {
echo $(explodeword $1 | sort) | tr -d ' '
}
Then
cut -c1-4 simple_file.txt | while read -r word; do sortword $word; done
will do what you want.

The sort command is used to sort files line by line, it's not designed to sort the contents of a line. It's not impossible to make sort do what you want, but it would be a bit messy and probably inefficient.
I'd probably do this in Python, but since you might not have Python, here's a short awk command that does what you want.
awk '{split(substr($0,1,4),a,"");n=asort(a);s="";for(i=1;i<=n;i++)s=s a[i];print s}'
Just put the name of the file (or files) that you want to process at the end of the command line.
Here's some data I used to test the command:
data
this
is a
simple
test file
a
of
apple
banana
cat
uoiea
bye
And here's the output
hist
ais
imps
estt
a
fo
alpp
aabn
act
eiou
bey
Here's an ugly Python one-liner; it would look a bit nicer as a proper script rather than as a Bash command line:
python -c "import sys;print('\n'.join([''.join(sorted(s[:4])) for s in open(sys.argv[1]).read().splitlines()]))"
In contrast to the awk version, this command can only process a single file, and it reads the whole file into RAM to process it, rather than processing it line by line.

Related

Reduce Unix Script execution time for while loop

Have a reference file "names.txt" with data as below:
Tom
Jerry
Mickey
Note: there are 20k lines in the file "names.txt"
There is another delimited file with multiple lines for every key from the reference file "names.txt" as below:
Name~~Id~~Marks~~Column4~~Column5
Note: there are about 30 columns in the delimited file:
The delimited file looks like something :
Tom~~123~~50~~C4~~C5
Tom~~111~~45~~C4~~C5
Tom~~321~~33~~C4~~C5
.
.
Jerry~~222~~13~~C4~~C5
Jerry~~888~~98~~C4~~C5
.
.
Need to extract rows from the delimited file for every key from the file "names.txt" having the highest value in the "Marks" column.
So, there will be one row in the output file for every key form the file "names.txt".
Below is the code snipped in unix that I am using which is working perfectly fine but it takes around 2 hours to execute the script.
while read -r line; do
getData `echo ${line// /}`
done < names.txt
function getData
{
name=$1
grep ${name} ${delimited_file} | awk -F"~~" '{if($1==name1 && $3>max){op=$0; max=$3}}END{print op} ' max=0 name1=${name} >> output.txt
}
Is there any way to parallelize this and reduce the execution time. Can only use shell scripting.
Rule of thumb for optimizing bash scripts:
The size of the input shouldn't affect how often a program has to run.
Your script is slow because bash has to run the function 20k times, which involves starting grep and awk. Just starting programs takes a hefty amount of time. Therefore, try an approach where the number of program starts is constant.
Here is an approach:
Process the second file, such that for every name only the line with the maximal mark remains.
Can be done with sort and awk, or sort and uniq -f + Schwartzian transform.
Then keep only those lines whose names appear in names.txt.
Easy with grep -f
sort -t'~' -k1,1 -k5,5nr file2 |
awk -F'~~' '$1!=last{print;last=$1}' |
grep -f <(sed 's/.*/^&~~/' names.txt)
The sed part turns the names into regexes that ensure that only the first field is matched; assuming that names do not contain special symbols like . and *.
Depending on the relation between the first and second file it might be faster to swap those two steps. The result will be the same.

Utilising variables in tail command

I am trying to export characters from a reference file in which their byte position is known. To do this, I have a long list of numbers stored as a variable which have been used as the input to a tail command.
For example, the reference file looks like:
ggaaatgcattcaaacatgc
And the list looks like:
5
10
7
15
I have tried using this code:
list=$(<pos.txt)
echo "$list"
cat ref.txt | tail -c +"list" | head -c1 > out.txt
However, it keeps returning "invalid number of bytes: '+5\n10\n7\n15...'"
My expected output would be
a
t
g
a
...
Can anybody tell me what I'm doing wrong? Thanks!
It looks like you are trying to access your list variable in your tail command. You can access it like this: $list rather than just using quotes around it.
Your logic is flawed even after fixing the variable access. The list variable includes all lines of your list.txt file. Including the newline character \n which is invisible in many UIs and programs, but it is of course visible when you are manually reading single bytes. You need to feed the lines one by one to make it work properly.
Also unless those numbers are indexes from the end, you need to feed them to head instead of tail.
If I understood what you are attempting to do correctly, this should work:
while read line
do
head -c $line ref.txt | tail -c 1 >> out.txt
done < pos.txt
The reason for your command failure is simple. The variable list contains a multi-line string stored from the pos.txt files including newlines. You cannot pass not more than one integer value for the -c flag.
Your attempts can be fixed quite easily with removing calls to cat and using a temporary variable to hold the file content
while IFS= read -r lineNo; do
tail -c "$lineNo" ref.txt | head -c1
done < pos.txt
But then if your intentions is print the desired output in a new-line every time, head does not output that way. It just forms a string atga for your given input in a single line and not across multiple lines with one character at each line.
As Gordon mentions in one of the comments, for much more efficient FASTA files processing, you could just use one invocation of awk though (skipping multiple forks to head/tail). Your provided input does not involve any headers to skip which would be straightforward as
awk ' FNR==NR{ n = split($0,arr,""); for(i=1;i<=n;i++) hash[i] = arr[i] }
( $0 in hash ){ print hash[$0] } ' ref.txt pos.txt
You could use cut instead of tail:
pos=$(<pos.txt)
cut -c ${pos//$'\n'/,} --output-delimiter=$'\n' ref.txt
Or just awk:
awk -F '' 'NR==FNR{c[$0];next} {for(i in c) print $i}' pos.txt ref.txt
both yield:
a
g
t
a

Using both GNU Utils with Mac Utils in bash

I am working with plotting extremely large files with N number of relevant data entries. (N varies between files).
In each of these files, comments are automatically generated at the start and end of the file and would like to filter these out before recombining them into one grand data set.
Unfortunately, I am using MacOSx, where I encounter some issues when trying to remove the last line of the file. I have read that the most efficient way was to use head/tail bash commands to cut off sections of data. Since head -n -1 does not work for MacOSx I had to install coreutils through homebrew where the ghead command works wonderfully. However the command,
tail -n+9 $COUNTER/test.csv | ghead -n -1 $COUNTER/test.csv >> gfinal.csv
does not work. A less than pleasing workaround was I had to separate the commands, use ghead > newfile, then use tail on newfile > gfinal. Unfortunately, this will take while as I have to write a new file with the first ghead.
Is there a workaround to incorporating both GNU Utils with the standard Mac Utils?
Thanks,
Keven
The problem with your command is that you specify the file operand again for the ghead command, instead of letting it take its input from stdin, via the pipe; this causes ghead to ignore stdin input, so the first pipe segment is effectively ignored; simply omit the file operand for the ghead command:
tail -n+9 "$COUNTER/test.csv" | ghead -n -1 >> gfinal.csv
That said, if you only want to drop the last line, there's no need for GNU head - OS X's own BSD sed will do:
tail -n +9 "$COUNTER/test.csv" | sed '$d' >> gfinal.csv
$ matches the last line, and d deletes it (meaning it won't be output).
Finally, as #ghoti points out in a comment, you could do it all using sed:
sed -n '9,$ {$!p;}' file
Option -n tells sed to only produce output when explicitly requested; 9,$ matches everything from line 9 through (,) the end of the file (the last line, $), and {$!p;} prints (p) every line in that range, except (!) the last ($).
I realize that your question is about using head and tail, but I'll answer as if you're interested in solving the original problem rather than figuring out how to use those particular tools to solve the problem. :)
One method using sed:
sed -e '1,8d;$d' inputfile
At this level of simplicity, GNU sed and BSD sed both work the same way. Our sed script says:
1,8d - delete lines 1 through 8,
$d - delete the last line.
If you decide to generate a sed script like this on-the-fly, beware of your quoting; you will have to escape the dollar sign if you put it in double quotes.
Another method using awk:
awk 'NR>9{print last} NR>1{last=$0}' inputfile
This works a bit differently in order to "recognize" the last line, capturing the previous line and printing after line 8, and then NOT printing the final line.
This awk solution is a bit of a hack, and like the sed solution, relies on the fact that you only want to strip ONE final line of the file.
If you want to strip more lines than one off the bottom of the file, you'd probably want to maintain an array that would function sort of as a buffered FIFO or sliding window.
awk -v striptop=8 -v stripbottom=3 '
{ last[NR]=$0; }
NR > striptop*2 { print last[NR-striptop]; }
{ delete last[NR-striptop]; }
END { for(r in last){if(r<NR-stripbottom+1) print last[r];} }
' inputfile
You specify how much to strip in variables. The last array keeps a number of lines in memory, prints from the far end of the stack, and deletes them as they are printed. The END section steps through whatever remains in the array, and prints everything not prohibited by stripbottom.

How to use sort for sorting file in BASH?

I have to sort file. When I use sort, it takes only first word in every line. For example. I have these words in line.
able a abundance around accelerated early acting following ad
I execute
sort file.txt
Output is:
able a abundance around accelerated early acting following ad
If I have just one column, sort works. What is the problem?
You can try something like this:
tr " " "\n" < file.txt | sort | tr "\n" " " > newfile.txt
Output to newfile.txt:
a able abundance accelerated acting ad around early following
A few options for you:
To sort the words in a line, you can use sed to replace spaces with new lines then pipe that to sort:
sed 's/ /\n/g' file.txt | sort
To sort on a specific column use awk to print the column then pipe that to sort:
awk {print $2} | sort
I've used this a lot working with data files, and I have yet to find a way to get the whole line after the sort.
You have to put each word on a separate line:
tr -s '[:blank:]' '\n' < file.txt | sort | paste -d" " -s
a able abundance accelerated acting ad around early following
Just in case you want to sort each line in a a multi line file, the following one-liner can do it for you
python2 -c"print '\n'.join(' '.join(sorted(l.split())) for l in open('FILE'))"
If the above looks useful, you can augment your ~/.bashrc with
csort(){ python2 -c"print '\n'.join(' '.join(sorted(l.split()))for l in open('$1'))";}
and later use it like in
csort FILE
The machinery of the python command is best explained exploding the command line like this
with open('FILE') as f: # f is a file object
for line in f: # iterating on a f.o. returns lines
words = line.split() # by default splits on white space
print " ".join(sorted(words)) # sorted returns a sorted list
# ' '.join() joins the list elements with spaces

How to parse a CSV in a Bash script?

I am trying to parse a CSV containing potentially 100k+ lines. Here is the criteria I have:
The index of the identifier
The identifier value
I would like to retrieve all lines in the CSV that have the given value in the given index (delimited by commas).
Any ideas, taking in special consideration for performance?
As an alternative to cut- or awk-based one-liners, you could use the specialized csvtool aka ocaml-csv:
$ csvtool -t ',' col "$index" - < csvfile | grep "$value"
According to the docs, it handles escaping, quoting, etc.
See this youtube video: BASH scripting lesson 10 working with CSV files
CSV file:
Bob Brown;Manager;16581;Main
Sally Seaforth;Director;4678;HOME
Bash script:
#!/bin/bash
OLDIFS=$IFS
IFS=";"
while read user job uid location
do
echo -e "$user \
======================\n\
Role :\t $job\n\
ID :\t $uid\n\
SITE :\t $location\n"
done < $1
IFS=$OLDIFS
Output:
Bob Brown ======================
Role : Manager
ID : 16581
SITE : Main
Sally Seaforth ======================
Role : Director
ID : 4678
SITE : HOME
First prototype using plain old grep and cut:
grep "${VALUE}" inputfile.csv | cut -d, -f"${INDEX}"
If that's fast enough and gives the proper output, you're done.
CSV isn't quite that simple. Depending on the limits of the data you have, you might have to worry about quoted values (which may contain commas and newlines) and escaping quotes.
So if your data are restricted enough can get away with simple comma-splitting fine, shell script can do that easily. If, on the other hand, you need to parse CSV ‘properly’, bash would not be my first choice. Instead I'd look at a higher-level scripting language, for example Python with a csv.reader.
In a CSV file, each field is separated by a comma. The problem is, a field itself might have an embedded comma:
Name,Phone
"Woo, John",425-555-1212
You really need a library package that offer robust CSV support instead of relying on using comma as a field separator. I know that scripting languages such as Python has such support. However, I am comfortable with the Tcl scripting language so that is what I use. Here is a simple Tcl script which does what you are asking for:
#!/usr/bin/env tclsh
package require csv
package require Tclx
# Parse the command line parameters
lassign $argv fileName columnNumber expectedValue
# Subtract 1 from columnNumber because Tcl's list index starts with a
# zero instead of a one
incr columnNumber -1
for_file line $fileName {
set columns [csv::split $line]
set columnValue [lindex $columns $columnNumber]
if {$columnValue == $expectedValue} {
puts $line
}
}
Save this script to a file called csv.tcl and invoke it as:
$ tclsh csv.tcl filename indexNumber expectedValue
Explanation
The script reads the CSV file line by line and store the line in the variable $line, then it split each line into a list of columns (variable $columns). Next, it picks out the specified column and assigned it to the $columnValue variable. If there is a match, print out the original line.
Using awk:
export INDEX=2
export VALUE=bar
awk -F, '$'$INDEX' ~ /^'$VALUE'$/ {print}' inputfile.csv
Edit: As per Dennis Williamson's excellent comment, this could be much more cleanly (and safely) written by defining awk variables using the -v switch:
awk -F, -v index=$INDEX -v value=$VALUE '$index == value {print}' inputfile.csv
Jeez...with variables, and everything, awk is almost a real programming language...
For situations where the data does not contain any special characters, the solution suggested by Nate Kohl and ghostdog74 is good.
If the data contains commas or newlines inside the fields, awk may not properly count the field numbers and you'll get incorrect results.
You can still use awk, with some help from a program I wrote called csvquote (available at https://github.com/dbro/csvquote):
csvquote inputfile.csv | awk -F, -v index=$INDEX -v value=$VALUE '$index == value {print}' | csvquote -u
This program finds special characters inside quoted fields, and temporarily replaces them with nonprinting characters which won't confuse awk. Then they get restored after awk is done.
index=1
value=2
awk -F"," -v i=$index -v v=$value '$(i)==v' file
I was looking for an elegant solution that support quoting and wouldn't require installing anything fancy on my VMware vMA appliance. Turns out this simple python script does the trick! (I named the script csv2tsv.py, since it converts CSV into tab-separated values - TSV)
#!/usr/bin/env python
import sys, csv
with sys.stdin as f:
reader = csv.reader(f)
for row in reader:
for col in row:
print col+'\t',
print
Tab-separated values can be split easily with the cut command (no delimiter needs to be specified, tab is the default). Here's a sample usage/output:
> esxcli -h $VI_HOST --formatter=csv network vswitch standard list |csv2tsv.py|cut -f12
Uplinks
vmnic4,vmnic0,
vmnic5,vmnic1,
vmnic6,vmnic2,
In my scripts I'm actually going to parse tsv output line by line and use read or cut to get the fields I need.
Parsing CSV with primitive text-processing tools will fail on many types of CSV input.
xsv is a lovely and fast tool for doing this properly. To search for all records that contain the string "foo" in the third column:
cat file.csv | xsv search -s 3 foo
A sed or awk solution would probably be shorter, but here's one for Perl:
perl -F/,/ -ane 'print if $F[<INDEX>] eq "<VALUE>"`
where <INDEX> is 0-based (0 for first column, 1 for 2nd column, etc.)
Awk (gawk) actually provides extensions, one of which being csv processing.
Assuming that extension is installed, you can use awk to show all lines where a specific csv field matches 123.
Assuming test.csv contains the following:
Name,Phone
"Woo, John",425-555-1212
"James T. Kirk",123
The following will print all lines where the Phone (aka the second field) is equal to 123:
gawk -l csv 'csvsplit($0,a) && a[2] == 123 {print $0}'
The output is:
"James T. Kirk",123
How does it work?
-l csv asks gawk to load the csv extension by looking for it in $AWKLIBPATH;
csvsplit($0, a) splits the current line, and stores each field into a new array named a
&& a[2] == 123 checks that the second field is 123
if both conditions are true, it { print $0 }, aka prints the full line as requested.

Resources