I've created a Bash script that looks through files within a directory, pulls data based on an identifier, and then uses that data to fill an sqlite3 table. Some of the information ends up in the same row as its key, and some doesn't. The script is as follows:
#!/bin/bash
sqlite3 review.sql "CREATE TABLE Review(Review_ID INTEGER PRIMARY KEY, Author TEXT, Date TEXT);"
path="/home/me/Downloads/test/*"
for i in $path
do
total=$(grep -c '<Author>' $i)
count=1
while [ $count -le $total ]
do
date=$(grep -m$count '<Date>' $i | sed 's#<Date>##' | tail -n1)
author=$(grep -m$count '<Author>' $i | sed 's#<Author>##' | tail -n1)
sqlite3 review.sql "INSERT INTO Review(Author,Date) VALUES('$author','$date');"
((count++))
done
done
The files I'm looking through look like this:
<Author>john
<Date>Jan 6, 2009
<Author>jacob
<Date>Dec 26, 2008
<Author>rachael
<Date>Dec 14, 2008
When I query for the primary key and the date attribute, I get this as expected:
sqlite> SELECT Review_ID, Date FROM Review;
Review_ID Date
---------- ------------
1 Jan 6, 2009
2 Dec 26, 2008
3 Dec 14, 2008
4 Jan 7, 2009
5 Jan 5, 2009
6 Nov 14, 2008
but when I query for the primary key and the author, I get this:
sqlite> SELECT Review_ID, Author FROM Review;
Review_ID Author
---------- ----------
john
jacob
rachael
Jean
kareem
may
Upon doing some more testing, it definitely seems to have problems with some of the text strings. For example, I tried adding last names and got this result:
Review_ID Author
---------- ------------
1 john jacob
2 jacob richa
rae simon
Jean jak
5 kareem jabr
6 may flower
It does better, but still doesn't like a couple of them. I thought it might be something to do with three-letter strings, but then "may" wouldn't be showing up; and indeed, if I add a letter to "rae" and a letter to "jak", the 3 and 4 do actually show up in the Review_ID column. I noticed the same thing happens if a column contains a "$", as in "$173" for example. I really can't figure out the text, though; there doesn't seem to be an obvious pattern to what it accepts and what it doesn't. I made up the names above to simplify this post, but here are some more examples from the data I'm actually working with, showing strings that work and ones that don't:
1 everywhereman2
RW53
Marilyn1949
fallriverma
8 SweetwaterMill
AuntSusie006
13 Traveler34NewJe
madmatriarch
2 Savvytourist2
greatvictory
25 Lightsleeper999
strollaround
30 Lucygoosey1985
lesbriggs
3 miguelluna019
lulubaby
1 myassesdragon
tomu023
BrettOcean
46 A TripAdvisor M
dmills1956
julcarl
49 A TripAdvisor M
TSW42
lass=
After a dry run, remove the echo to take the safety off.
awk -v RS='' -v FS='\n?<[^>]+>' '{print $2 ":" $3}' \
/home/me/Downloads/test/* | while IFS=: read -r author date; do
echo sqlite3 review.sql "INSERT INTO Review(Author,Date) VALUES('$author','$date');"
done
This is a brittle awk-to-bash solution based on using the regular expression \n?<[^>]+> as the Field Separator and a blank line '' as the Record Separator. The FS expression means "optional newline followed by a string enclosed in angle brackets." We then output the fields with a simple separator : and read them into bash.
You can make system calls in awk with system(), but it gets very messy very quickly. Better to export the clean data in this case.
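A common gotcha with scraped files like these is DOS line endings: a stray carriage return stored inside a value makes sqlite3's column-mode padding overwrite the Review_ID on display, which would match the symptoms above. Here is a defensive variant of the same pipeline (the \r theory and the quoting step are my assumptions, not confirmed from your files) that strips carriage returns and doubles single quotes so the SQL literal stays valid:
awk -v RS='' -v FS='\n?<[^>]+>' '{print $2 ":" $3}' /home/me/Downloads/test/* |
tr -d '\r' |                        # assumption: input may carry DOS \r line endings
while IFS=: read -r author date; do
    author=${author//"'"/"''"}      # double single quotes for the SQL string literal
    date=${date//"'"/"''"}
    sqlite3 review.sql "INSERT INTO Review(Author,Date) VALUES('$author','$date');"
done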
The script with awk and printf is:
#!/bin/bash
sqlite3 review.sql "CREATE TABLE Review(Review_ID INTEGER PRIMARY KEY, Author TEXT, Date TEXT);"
path="/home/drew/Downloads/testcases/*"
for i in $path
do
total=$(grep -c '<Author>' $i)
count=1
while [ $count -le $total ]
do
date=$(grep -m$count '<Date>' $i | sed 's#<Date>##' | tail -n1)
author=$(grep -m$count '<Author>' $i | sed 's#<Author>##' | tail -n1 | awk '{printf "- %s -", $1}')
echo $author
((count++))
done
done
I'm not sure why, but I feel like I shouldn't have to add echo when awk is already using printf; yet without it nothing prints. With the echo I get the following output:
-Jeanjakey
- kareem -
- may -
- RW53 -
-Marilyn1949
-AuntSusie006
-madmatriarch
-strollaround
-lulubaby
-tomu023
-julcarl
-slass
It somewhat works, but the spacing disappears, any last names disappear, and it seems to do different things to different inputs. With string concatenation I use this script:
#!/bin/bash
sqlite3 review.sql "CREATE TABLE Review(Review_ID INTEGER PRIMARY KEY, Author TEXT, Date TEXT);"
path="/home/drew/Downloads/testcases/*"
for i in $path
do
total=$(grep -c '<Author>' $i)
count=1
while [ $count -le $total ]
do
date=$(grep -m$count '<Date>' $i | sed 's#<Date>##' | tail -n1)
author2="- "
author2+=$(grep -m$count '<Author>' $i | sed 's#<Author>##' | tail -n1)
author2+=" -"
echo $author2
((count++))
done
done
with this I get the output:
-Jeanjakey
-kareem jabron
-may flow she
-RW53
-Marilyn1949
-AuntSusie006
-madmatriarch
-strollaround
-lulubaby
-tomu023
-julcarl
-slass
and a string reassignment:
author2=$(grep -m$count '<Author>' $i | sed 's#<Author>##' | tail -n1)
author2="- $author2 - "
gives the same output.
Related
I want to compare the 2nd and 4th columns of lines in a file. In detail, line 1 with 2,3,4...N, then line 2 with 3,4,5...N, and so on.
I have written a script; it works, but it runs very long, over 30 minutes.
The file has 1733 lines including a header; my code is:
for line1 in {2..1733}; do
for line2 in $(seq $((line1+1)) 1733); do  # brace expansion can't take a variable here
i7_diff=$(cmp -bl \
<(sed -n "${line1}p" Lab_primers.tsv | cut -f 2) \
<(sed -n "${line2}p" Lab_primers.tsv | cut -f 2) | wc -l);
i5_diff=$(cmp -bl \
<(sed -n "${line1}p" Lab_primers.tsv | cut -f 4) \
<(sed -n "${line2}p" Lab_primers.tsv | cut -f 4) | wc -l);
if [ $i7_diff -lt 3 ]; then
if [ $i5_diff -lt 3 ]; then
echo $(sed -n "${line1}p" Lab_primers.tsv)"\n" >> primer_collision.txt
echo $(sed -n "${line2}p" Lab_primers.tsv)"\n\n" >> primer_collision.txt
fi;
fi;
done
done
I used nested for loops, then sed to print exactly the wanted line, then cut to extract the desired column, and finally cmp and wc to count the number of differences between the two columns of a pair of lines.
If the condition is met (both the 2nd and 4th columns of a pair of lines have fewer than 3 differences), the code prints the pair of lines to the output file.
Here is an excerpt of the input (it has 1733 lines):
I7_Index_ID index I5_Index_ID index2 primer
D703 CGCTCATT D507 ACGTCCTG 27
D704 GAGATTCC D507 ACGTCCTG 28
D701 ATTACTCG D508 GTCAGTAC 29
S6779 CGCTAATC S6579 ACGTCATA 559
D708 TAATGCGC D503 AGGATAGG 44
D705 ATTCAGAA D504 TCAGAGCC 45
D706 GAATTCGT D504 TCAGAGCC 46
i796 ATATGCGC i585 AGGATAGC R100
D714 TGCTTGCT D510 AACCTCTC 102
D715 GGTGATGA D510 AACCTCTC 103
D716 AACCTACG D510 AACCTCTC 104
i787 TGCTTCCA i593 ATCGTCTC R35
Then the expected output is:
D703 CGCTCATT D507 ACGTCCTG 27
S6779 CGCTAATC S6579 ACGTCATA 559
D708 TAATGCGC D503 AGGATAGG 44
i796 ATATGCGC i585 AGGATAGC R100
D714 TGCTTGCT D510 AACCTCTC 102
i787 TGCTTCCA i593 ATCGTCTC R35
My question is: what is a better way to code this, and how can I reduce the running time?
Thank you for your help!
You could start by sorting on fields 2 and 4.
Then there is no need for a double loop: if a pair exists, the lines will be adjacent.
sort -k 2,2 -k 4,4 myfile.txt
Then we only need to print runs of consecutive lines that share the same fields 2 and 4.
first=yes
sort -k 2,2 -k 4,4 test.txt | while read l
do
fields=(${l})
new2=${fields[1]}
new4=${fields[3]} # Fields 2 and 4, bash-way
if [[ "$new2" = "$old2" ]] && [[ "$new4" = "$old4" ]]
then
if [[ $first ]]
then
# first time we print something for this series: we need
# to also print the previous line (the first of the series)
echo; echo "$oldl"
# But if the next line is identical (series of 3, no need to repeat this line)
first=
fi
echo "$l"
else
# This line is not identical to the previous. So nothing to print
# If the next one is identical to this one, then, this one will
# be the first of its series
first=yes
fi
old2=$new2
old4=$new4
oldl="${l}"
done
One frustrating thing: uniq -D almost does everything this script does, except that it cannot filter on specific fields.
But we can also rewrite the lines so that uniq can work on them.
I'm not very fluent in awk (if I were, I'm pretty sure awk could do the uniq work for me), but still:
sort -k 2,2 -k 4,4 test.txt | awk '{print $0" "$2" "$4}' | uniq -D -f 5 | awk '{printf "%-12s %-9s %-12s %-9s %s\n",$1,$2,$3,$4,$5}'
does the job.
sort sorts the lines by fields 2 and 4. awk appends a copy of fields 2 and 4 to each line, which makes uniq usable, since uniq can ignore the first N fields: here we tell uniq to ignore the first 5 fields, so it compares only the appended copies of fields 2 and 4. With -D, uniq displays only the duplicate lines.
Then the last awk removes the copies of the fields we no longer need.
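For the record, awk alone can also stand in for uniq here, in a single pass over the sorted lines (a sketch, with the same exact-match assumption as above):
sort -k 2,2 -k 4,4 test.txt | awk '
{
    key = $2 FS $4
    if (key == prev) {               # same fields 2 and 4 as the previous line
        if (!printed) print prevline # first duplicate: also emit the head of the series
        print
        printed = 1
    } else
        printed = 0
    prev = key; prevline = $0
}'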
I have a file whose size is approximately 1 GB, containing data in the format below.
A|CD|44123|0|0
B|CD|44124|0|0
C|CD|44125|0|0
D|CD|44126|0|0
E|CD|44127|0|0
F|CD|44128|0|0
J|CD|44129|0|0
I|CD|44130|0|0
In this file I have to replace the third column's value with a value I get after a conversion, for which I have to open the file, read it, and do the replacement. This process takes around 5 hours. Below is the code I am using:
cat $FILE_NAME |\
while read REC
do
DATE=`echo "$REC" | cut -d\| -f3`
DATE_NEW=`$UTIL $DATE | head -1 |cut -d" " -f12`
RECORD="$DATE_NEW,"
echo "$RECORD" >> $New_File
done
Is there a way to make this better and faster?
The desired output is below, where the DATE_NEW value is placed in each 3rd column. DATE_NEW is the converted value I get from this:
DATE_NEW=`$UTIL $DATE | head -1 |cut -d" " -f12`
A|CD|10/20/2020|0|0
B|CD|10/25/2020|0|0
C|CD|10/25/2020|0|0
D|CD|10/25/2020|0|0
E|CD|11/15/2020|0|0
F|CD|11/14/2020|0|0
J|CD|11/16/2020|0|0
I|CD|11/17/2020|0|0
After the comment from @Sundeep ("Why is using a shell loop to process text considered bad practice?") I wrote the logic in Perl, and the 5-7 hours of processing time dropped to 99 seconds.
Give this a try, where Cmd2GetNEWDATE stands for your conversion command; it is invoked once per line with the old date as its argument, and close() lets awk re-run it for each line:
awk -v cmd="Cmd2GetNEWDATE" 'BEGIN{FS=OFS="|"}{c=cmd" "$3; c|getline v; close(c); $3=v} 1' file
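For illustration, here is the same getline pattern with GNU date standing in for the converter; treating column 3 as a spreadsheet-style day serial is purely an assumption for the demo, so substitute your real $UTIL invocation:
awk 'BEGIN { FS = OFS = "|" }
{
    # build one command per line, read its first output line,
    # then close the pipe so we do not exhaust open file handles
    c = "date -d \"1899-12-30 +" $3 " days\" +%m/%d/%Y"
    c | getline v
    close(c)
    $3 = v
    print
}' file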
I have a file with a lot of books, each with an index number, and I want to search for books by their index number.
The file format is something like this:
"The Declaration of Independence of the United States of America,
1
by Thomas Jefferson"
......................
Alice's Adventures in Wonderland, by Lewis Carroll
11
#!/bin/bash
echo "Enter the content your are searching for:"
read content
echo -e "\nResult Showing For: $content\n"
grep "$content" GUTINDEX.ALL
If the user searches for 1, this code prints every line that has a 1 in it (1, 11, ...). I want to print only the line whose index is exactly 1:
"The Declaration of Independence of the United States of America, 1
Simply use the -w flag; read more in grep --help:
grep -w ${line_number} ${file_name}
For example, grep -w 1 books gives:
The Declaration of Independence of the United States of America 1
Bobs's 1 in Wonderland, by Lewis Carroll 11
It may catch book names that contain numbers, so it is better to use a regex anchored at the end of the line, [${digit}]$, for example [1]$, to match the index at the end of the line:
grep -w [${line_number}]$ ${file_name}
For example, grep -w 1$ books gives:
The Declaration of Independence of the United States of America, 1
You need to use a regex. Change grep to egrep.
file:
1
11
111
If you want to match only 1, then you can use:
egrep "^1$" file  # matches lines that start and end with 1
Then you need to extend the script. For example, given file.txt:
file.txt
abc,1
abd,111
abf,11111
you can do:
cat file.txt | while read -r line ; do
res=$(echo "${line}" | awk -v FS=',' '{print $2}' | grep "^1$")
if [ $? -eq 0 ]; then
echo "$line"
fi
done
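The same exact-index match is also a one-liner in awk, assuming the comma-separated layout above:
awk -F, -v n=1 '$2 == n' file.txt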
I have a text file with repeated data patterns, and grep keeps returning all matches without stopping.
for ((count = 1; count != 17; count++)); do # 16 iterations
xuz1[count]=`grep -e "1 O1" $out_file | cut -c10-29`
xuz2[count]=`grep -e "2 O2" $out_file | cut -c10-29`
xuz3[count]=`grep -e "3 O3" $out_file | cut -c10-29`
echo ${xuz1[count]}
echo ${xuz2[count]}
echo ${xuz3[count]}
done
data looks like:
some text.....
Text....
.....
1 O1 111111 111111 111111
2 O2 222211 222211 222211
3 O3 643653 652346 757686
some text.....
1 O1 111122 111122 111122
2 O2 222222 222222 222222
3 O3 343653 652346 757683
some text.....
1 O1 111333 111333 111333
2 O2 222333 222333 222333
3 O3 343653 652346 757684
.
.
.
And the result I'm getting:
xuz1[1] = 111111 111111 111111
xuz2[1] = 222211 222211 222211
xuz3[1] = 643653 652346 757686
xuz1[2] = 111111 111111 111111
xuz2[2] = 222211 222211 222211
xuz3[2] = 643653 652346 757686
...
I'm looking for a result like this:
xuz1[1]=111111 111111 111111
xuz2[1]=222211 222211 222211
xuz3[1]=343653 652346 757683
xuz1[2]=111122 111122 111122
xuz2[2]=222222 222222 222222
xuz3[2]=343653 652346 757684
I also tried "grep -m 1 -e".
Which way should I go?
For now I ended up with this one-liner:
grep -A4 -e "1 O1" $out_file | cut -c10-29
(The "some text" parts are huge blocks of text.)
A little bash script with a single grep is enough:
grep -E '^[0-9]+ +O[0-9]+ +.*' "$out_file" |
while read idx oidx cols; do
if ((idx == 1)); then
let ++i
name=xuz$i
let j=1
fi
echo "$name[$j]=$cols"
let ++j
done
You haven't really described what you want, but I guess something like this.
awk '! /^[1-9][0-9]* O[0-9] / { n++; m=0; if (NR>1) print ""; next }
{ print "xuz" ++m "[" n "]=" substr($0, 10) }' "$out_file"
If the regex doesn't match, we assume we are looking at one of the "some text" pieces, and that this starts a new record. Increment n and reset m. Otherwise, print the output for this item within this record.
If some text could be more than one line, you will need a minor change, but I hope this should be enough at least to send you in the right direction.
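For instance, if the text between blocks can span several lines, one variant is to start a new record only when a data line follows a non-data line (a sketch along the same lines):
awk '/^[1-9][0-9]* O[0-9] / {
         if (!indata) { n++; m = 0; if (n > 1) print "" }  # first data line after text: new record
         indata = 1
         print "xuz" ++m "[" n "]=" substr($0, 10)
         next
     }
     { indata = 0 }' "$out_file"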
You can do this in pure Bash, too, though this is going to be highly inefficient - you would expect a Bash while read loop to be at least a hundred times slower than Awk, and the code is markedly less idiomatic and elegant.
while read -r m x result; do
case $m::$x in
[1-9]::O[1-9])
printf 'xuz%d[%d]=%s\n' $m $n "$result";;
*)
# If n is unset, don't print an empty line
printf '%s' "${n+$'\n'}"
((n++));;
esac
done <"$out_file"
I would aggressively challenge any requirement to do this in pure Bash. If it's for homework, the requirement is unrealistic, and a core skill for shell script authors is to understand the limits of the shell and the strengths of the common support tools like Awk. The Awk language is virtually guaranteed to be available wherever you have a shell, in particular a heavy shell like Bash. (In a limited e.g. embedded environment, a limited shell like Dash would make more sense. Then e.g. the let keyword won't be available, though it should not be hard to make this script properly portable.)
The case statement accepts glob patterns, not regular expressions, so the pattern here is slightly less general (we accept one positive digit in the first field).
Thank you all for participating in the discussion.
(This is my home project to help my wife extract data from research calculations; the speed-up is around 400 times.)
The file I extract data from contains around 2000 lines; the needed data blocks look like the one below and are repeated 10-20 times in the file.
uiyououy COORDINATES
NR ATOM CCCCC X Y Z
1 O1 8.00 0.000000000 0.882236820 -0.789494235
2 O2 8.00 0.000000000 -1.218250722 -1.644061652
3 O3 8.00 0.000000000 1.218328524 0.400260050
4 O4 8.00 0.000000000 -0.882314622 2.033295837
Text text text text
tons of text
To extract the 4 lines I used the expression below:
grep -A4 --no-group-separator -e "1 O1" $from_file | cut -c23-64 > xyz_temp.txt
# grep 4 lines at once to txt
sed -i '/^[ \t]*$/d' xyz_temp.txt
# del empty lines from xyz txt
Next is to convert the strings into numbers (using '| bc -l' for the arithmetic):
while IFS= read line
do
IFS=' ' read -r -a arr_line <<< "$line"
# break line of xyz into 3 numbers
s1=$(echo "${arr_line[0]}" \* 0.529177249 | bc -l)
# some math conversion
s2=$(echo "${arr_line[1]}" \* 0.529177249 | bc -l)
s3=$(echo "${arr_line[2]}" \* 0.529177249 | bc -l)
#-------to array non sorted ------------
arr[$n]=${n}";"${from_file}";"${gd_}";"${frt[count_4s]}";"${n4}";"${s1}";"${s2}";"${s3}
echo ${arr[n]}
#--------------------------------------------
done <"$from_file_txt"
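As an aside, the per-number bc calls are the slow part here; the same unit conversion (the 0.529177249 factor is taken from the script above) could be done in a single awk pass, sketched under the assumption that xyz_temp.txt holds three whitespace-separated numbers per line:
awk '{ printf "%.15f;%.15f;%.15f\n", $1 * 0.529177249, $2 * 0.529177249, $3 * 0.529177249 }' xyz_temp.txt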
Sort the array:
IFS=$'\n' sorted=($(sort -t \; -k4 -k5 -g <<<"${arr[*]}"))
# -t sets the separator ';', -k picks the columns, -g is general numeric sort
# -k4 -k5: sort by column 4, then 5; with IFS=$'\n', ${arr[*]} yields one element per line
#printf "%s\n" "${sorted[*]}"
unset IFS
The last part combines the data into the result view:
echo "$n"
n2=1
n42=1
count_4s2=1
i=0
echo "============================== sorted =============================="
################### loop for empty 4s lines
printf "%s" ";" ";" ";" ";" ";" "${count_4s2}" ";"
printf "%s\n"
printf "%s\n" "${sorted[i]}"
while [ $i -lt $((n-2)) ]
do
i=$((i+1))
if [ "$n42" = "4" ] # 1234
then n42=0
count_4s2=$((count_4s2+1))
printf "%s" ";" ";" ";" ";" ";" "${count_4s2}" ";"
printf "%s\n"
fi
#--------------------------------------------
n2=$((n2+1))
n42=$((n42+1))
printf "%s\n" "${sorted[i]}"
done ############# while
#00000000000000000000000000000000000000
printf "%s\n"
echo ==END===END===END==
Output looks like this
============================== sorted ==============================
;;;;;1;
17;A-13_A1+.out;1.3;0.4;1;0;.221176355474853043;-.523049776514580244
18;A-13_A1+.out;1.3;0.4;2;0;-.550350051428402955;-.734584881824005358
19;A-13_A1+.out;1.3;0.4;3;0;.665269869069959489;.133910683627893251
20;A-13_A1+.out;1.3;0.4;4;0;-.336096173116409577;1.123723974181515102
;;;;;2;
13;A-13_A1+.out;1.3;0.45;1;0;.279265277182782148;-.504490787956469897
14;A-13_A1+.out;1.3;0.45;2;0;-.583907412327951988;-.759310392973448167
15;A-13_A1+.out;1.3;0.45;3;0;.662538493711206290;.146829200993661293
16;A-13_A1+.out;1.3;0.45;4;0;-.357896358566036450;1.116971979936256771
;;;;;3;
9;A-13_A1+.out;1.3;0.5;1;0;.339333719743262501;-.482029749553797105
10;A-13_A1+.out;1.3;0.5;2;0;-.612395507070451545;-.788968880150283253
11;A-13_A1+.out;1.3;0.5;3;0;.658674809217196345;.163289820251690233
12;A-13_A1+.out;1.3;0.5;4;0;-.385613021360830052;1.107708808923212876
==END===END===END==
(Note: some code might not be shown here.)
The next step is to paste the result into Excel with ';' as the separator.
I get the following output:
Pushkin - 100500
Gogol - 23
Dostoyevsky - 9999
Which is the result of the following script:
for k in "${!authors[@]}"
do
echo $k ' - ' ${authors["$k"]}
done
All I want is to get the output like this:
Pushkin - 100500
Dostoyevsky - 9999
Gogol - 23
which means that the keys in associative array should be sorted by value. Is there an easy method to do so?
You can easily sort your output, in descending numerical order of the 3rd field:
for k in "${!authors[@]}"
do
echo $k ' - ' ${authors["$k"]}
done |
sort -rn -k3
See sort(1) for more about the sort command. This just sorts output lines; I don't know of any way to sort an array directly in bash.
I also can't see how the above can give you names ("Pushkin" et al.) as array keys. In bash, indexed array keys are always integers; string keys like these require an associative array (declare -A).
Alternatively you can sort the indexes and use the sorted list of indexes to loop through the array:
authors_indexes=( ${!authors[@]} )
IFS=$'\n' authors_sorted=( $(echo -e "${authors_indexes[@]/%/\n}" | sed -r -e 's/^ *//' -e '/^$/d' | sort) )
for k in "${authors_sorted[#]}"; do
echo $k ' - ' ${authors["$k"]}
done
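On bash 4.4+, a NUL-delimited sort avoids the IFS juggling and also survives keys containing spaces (a sketch):
mapfile -t -d '' authors_sorted < <(printf '%s\0' "${!authors[@]}" | sort -z)
for k in "${authors_sorted[@]}"; do
    echo "$k - ${authors[$k]}"
done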
Extending the answer from @AndrewSchulman: using -rn as a global sort option reverses all columns. In this example, authors with the same associative array value will be output in reverse order of name.
For example
declare -A authors
authors=( [Pushkin]=10050 [Gogol]=23 [Dostoyevsky]=9999 [Tolstoy]=23 )
for k in "${!authors[@]}"
do
echo $k ' - ' ${authors["$k"]}
done | sort -rn -k3
will output
Pushkin - 10050
Dostoyevsky - 9999
Tolstoy - 23
Gogol - 23
Options for sorting specific columns can be provided after the column specifier, e.g. sort -k3rn.
Note that keys can be specified as spans. Here -k3 happens to be fine because it is the final span, but to use only column 3 explicitly (in case further columns were added), it should be specified as -k3,3.
Similarly to sort by column three in descending order, and then column one in ascending order (which is probably what is desired in this example):
declare -A authors
authors=( [Pushkin]=10050 [Gogol]=23 [Dostoyevsky]=9999 [Tolstoy]=23 )
for k in "${!authors[@]}"
do
echo $k ' - ' ${authors["$k"]}
done | sort -k3,3rn -k1,1
will output
Pushkin - 10050
Dostoyevsky - 9999
Gogol - 23
Tolstoy - 23
The best way to sort a bash associative array by VALUE is to NOT sort it.
Instead, get the list of VALUE:::KEYS, sort that list into a new KEY LIST, and iterate through the list.
declare -A ADDR
ADDR[192.168.1.3]="host3"
ADDR[192.168.1.1]="host1"
ADDR[192.168.1.2]="host2"
KEYS=$(
for KEY in ${!ADDR[@]}; do
echo "${ADDR[$KEY]}:::$KEY"
done | sort | awk -F::: '{print $2}'
)
for KEY in $KEYS; do
VAL=${ADDR[$KEY]}
echo "KEY=[$KEY] VAL=[$VAL]"
done
output:
KEY=[192.168.1.1] VAL=[host1]
KEY=[192.168.1.2] VAL=[host2]
KEY=[192.168.1.3] VAL=[host3]
Do something with unsorted keys:
for key in ${!Map[@]}; do
echo $key
done
Do something with sorted keys:
for key in $(for x in ${!Map[@]}; do echo $x; done | sort); do
echo $key
done
Stored sorted keys as array:
Keys=($(for x in ${!Map[@]}; do echo $x; done | sort))
If you can assume the value is always a number (no spaces), but want to allow for the possibility of spaces in the key:
for k in "${!authors[@]}"; do
echo "${authors["$k"]} ${k}"
done | sort -rn | while read number author; do
echo "${author} - ${number}"
done
Example:
$ declare -A authors
$ authors=(['Shakespeare']=1 ['Kant']=2 ['Von Neumann']=3 ['Von Auersperg']=4)
$ for k in "${!authors[@]}"; do echo "${authors["$k"]} ${k}"; done | sort -rn | while read number author; do echo "${author} - ${number}"; done
Von Auersperg - 4
Von Neumann - 3
Kant - 2
Shakespeare - 1
$
The chosen answer seems to work if there are no spaces in the keys, but fails if there are:
$ declare -A authors
$ authors=(['Shakespeare']=1 ['Kant']=2 ['Von Neumann']=3 ['Von Auersperg']=4)
$ for k in "${!authors[@]}"; do echo $k ' - ' ${authors["$k"]}; done | sort -rn -k 3
Kant - 2
Shakespeare - 1
Von Neumann - 3
Von Auersperg - 4
$