Bash: Two "cut" functions in while-loop, only one effective

In a Cygwin environment I want to read a CSV file line by line and get the values from two columns.
So I have
while read line ; do echo `cut -d";" -f5`; done < allk.lst
and the right values are shown.
But:
while read line ; do echo `cut -d";" -f5`; echo `cut -d";" -f4`; done < allk.lst
again shows the values as before...
Any hints to show both values?
Thanks, Bommel

cut -f accepts a list of fields, so there is no need to call cut twice
Using cut command to remove multiple columns
echo `cut -d";" -f5` does not do what you'd expect. For a start, the variable line is never used: the cut inside the backticks reads the rest of the file straight from the loop's stdin, which is why only the first cut appears to do anything.
After applying those fixes, your command would look something like:
while read line; do echo $line | cut -d";" -f4-5 ; done < test.txt
Try a demo online!
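If those two columns are all you need, the while loop is not required at all; a single cut over the whole file does the same work in one process, for example:
cut -d';' -f4,5 allk.lst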

Hmm, thanks first of all.
Something curious, though:
When using
$ time while read line; do echo $line | cut -d";" -f4-5 ; done < allk.txt
K700W1666;S728A0103
K700W1651;S727A7570
K700W1654;S727A7579
K700W1657;S727A7581
K700W1660;S727A7582
K700W3040;S728A0099
K700W3043;S728A0107
K700W3042;S732A4280
K700W3594;S724A5213
K700W3600;S727A7609
K700W3603;S727A7615
K700W3597;S727A7617
K700W3604;S727A7589
K700W3624;S728A1599
K700W2164;S728A0091
K700W2165;S728A0110
K700W3565;S727A7577
K700W3568;S727A7578
K700W3560;S725A4806
K700W3563;S725A8285
K700W3559;S726A1925
K700W3562;S728A0197
K700W2016;S726A1929
K700W2012;S725A5172
K700W2015;S728A0056
K700W2014;S728A0061
K700W2017;S728A0067
real 0m12.165s
user 0m0.482s
sys 0m1.390s
it takes 12 seconds, and there is a semicolon between the two values.
Whereas
$ time while read line; do echo `cut -d";" -f4-5` ; done < allk.txt
K700W1651;S727A7570 K700W1654;S727A7579 K700W1657;S727A7581 K700W1660;S727A7582 K700W3040;S728A0099 K700W3043;S728A0107 K700W3042;S732A4280 K700W3594;S724A5213 K700W3600;S727A7609 K700W3603;S727A7615 K700W3597;S727A7617 K700W3604;S727A7589 K700W3624;S728A1599 K700W2164;S728A0091 K700W2165;S728A0110 K700W3565;S727A7577 K700W3568;S727A7578 K700W3560;S725A4806 K700W3563;S725A8285 K700W3559;S726A1925 K700W3562;S728A0197 K700W2016;S726A1929 K700W2012;S725A5172 K700W2015;S728A0056 K700W2014;S728A0061 K700W2017;S728A0067
real 0m0.308s
user 0m0.015s
sys 0m0.030s
takes only 0.3 sec... but also with a semicolon.
So: what is the best way to read these values into two variables (for building SQL commands)?
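If the end goal is two shell variables per line (e.g. for building SQL statements), the usual trick is to let read do the splitting by setting IFS to the semicolon, so no cut process is started per line at all. A minimal sketch; the table and column names are placeholders for illustration:
while IFS=';' read -r f1 f2 f3 col4 col5 rest
do
    # col4 and col5 hold fields 4 and 5; "rest" soaks up any remaining fields
    echo "INSERT INTO mytable (colA, colB) VALUES ('$col4', '$col5');"   # placeholder table/column names
done < allk.lst
This also avoids spawning cut once per line, which is what made the 12-second run slow.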

Related

Unix: find the difference from a file row-wise

I have some data like
[09359]0000.365604| =>SttSasph_Hmbm_bSPO_PhQmOm (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.365687| =>Hmbm_bSPO_PhQmOm_Wd (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.365879| =>SttSasph_Hmbm_quOuO_PhQmOm (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.365890| =>Hmbm_quOuO_PhQmOm_Wd (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.365979| WSmmOT SDDQ vSQWSbmO not POt, QOvOQtWnH to Onv mOthod
[09359]0001.625300| db_HOt_POPPWon_Wd: aspuQQOnt POPPWon WD WP 1016,59
[09359]0002.365979| WSmmOT SDDQ vSQWSbmO not POt, QOvOQtWnH to Onv mOthod
Every line starts with a process number (which can change) in square brackets.
Then seconds after the module (0001 in this case).
Then microseconds after the full stop.
Then a pipe to terminate.
The rest of the line can be ignored.
What I need is to calculate:
Convert seconds into microseconds.
Add the microseconds to the converted microseconds (from 1).
Find the difference in microseconds, e.g. line2-line1, line3-line2, line4-line3 and so on.
Print the result to a separate file.
I tried to use this logic, but it didn't work.
May I get suggestions for an optimised way to do it, or
improvements to my existing logic?
sec=$(grep '^\[.\{1,\}\]' mass.May28.1 | cut -d "| " -f1 | cut -c8- | cut -d"." -f1)
msec=$(grep '^\[.\{1,\}\]' mass.May28.1 | cut -d "| " -f1 | cut -c8- | cut -d"." -f2)
$f_msec=$((sec * 1000000 + msec)) > final_difference_file
If you are comfortable with awk, then you can use this script:
script.awk
BEGIN{ FS="[\\[\\]\\|]+" }
{ printf("[%s]%011.6f|%s\n", $2,$3-prev,$4)
prev = $3 }
Use it like this: awk -f script.awk yourfile
The first line sets up the field splitting to use the brackets and the pipe (ignore the backslashes; they are needed to escape the symbols, which are regexp metacharacters). The second line prints the fields and calculates the time difference. The last line stores the current time for the calculation in the next line.
This can also be done with a bash script. Since bash lacks floating point arithmetic, we have to handle seconds and microseconds separately (or call an external tool like bc for each line):
script.sh
# split each line on '|', '[', ']' and '.'
IFS='|[].'
factor=1000000
prev=0
while read -r dummy pid secs msecs text
do
    # total timestamp in microseconds (force base 10 in case of leading zeros)
    msecs=$(( 10#$secs * $factor + 10#$msecs ))
    timediff=$(( $msecs - $prev ))
    prev=$msecs
    # split the difference back into seconds and microseconds
    secs=$(( $timediff / $factor ))
    msecs=$(( $timediff - $secs * $factor ))
    printf "[%s]%04d.%06d|%s\n" "$pid" "$secs" "$msecs" "$text"
done < "$1"
Use it like this: bash script.sh yourfile

Get the first real number from a series of files

I am trying to take the first number from each .dat file, which has the form:
5.01 1 56.413481000 -0.00063400 0.00095770
5.01 2 61.193808800 0.00102170 0.00078280
5.01 3 65.974136600 -0.00108170 0.00102620
5.01 4 70.754464300 0.00082490 0.00103630
and then use this number (5.01) as the title of a .png file.
I use a bash script and I know the command line=$(head -n 1 $f), as found in a question here, but this gives me the whole first line of the file $f.
In this case the spaces in the line are kept as well, and the .png file title becomes:
plot 5.01 1 56.413481000 -0.00063400 0.00095770.png
Is there some way to take only 5.01 and get a trimmed title for the plot?
Thanks to all.
I'd probably just do it with perl:
VAL=$( echo "$line" | perl -pe 's/^[^\d]+//g;s/[^\d\.].*$//' )
Something like that anyway.
Should remove:
anything that isn't a digit from the start of the line;
everything from the first character that is neither a digit nor a "." through to the end of the line.
Or with grep:
grep -o "[0-9]*\.[0-9]*" file.dat | head -1
Edit:
Testing without the head -1 for a one-line input:
echo " 5.01 2 61.193808800 0.00102170 0.00078280" | grep -o "[0-9]*\.[0-9]*"
5.01
61.193808800
0.00102170
0.00078280
Using head -1 will return the first match on the first line.
When you know the match will be on the first line, we can ignore files with an incorrect first line (and avoid grepping through complete files):
Make a two-headed monster:
head -1 file.dat | grep -o "[0-9]*\.[0-9]*" | head -1
To extract the first field, assuming they are tab separated:
val=$(head -n 1 $f | cut -f 1)
or, if they are space separated instead:
val=$(head -n 1 $f | cut -f 1 -d ' ')
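One caveat: with a space delimiter, cut treats every single space as a separator, so a line starting with a space (as in the echo test further up) yields an empty first field. awk's default splitting skips leading blanks, which may be the safer choice here; a small sketch, where the echoed filename is only illustrative:
val=$(head -n 1 "$f" | awk '{print $1}')   # first whitespace-separated field, e.g. 5.01
echo "Using ${val}.png as the plot title"  # illustrative use of the trimmed value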
OR you can avoid calling any extra processes and keep all data manipulation in the bash shell with
while read realNum restOfLine ; do
    break
done < $f
echo $realNum
This grabs the first "word" and puts the remaining into "restOfLine".
The break ensures that you only read the first line of the file.
IHTH

Joining 2 variables under loop

I'm trying to join 2 variables in a loop but I can't get it to work.
My script lists newly added movies. I'm trying to make an output in Excel that is clickable. Long story short, I need the script to list the 2 variables like this:
ab
ab
Right now it's doing this
a
a
b
b
This is the code
NEW_MOVIES_DIRLIST=''
for i in $(seq 1 ${NEW_MOVIES_COUNT}); do
MOVIE_PATH=$(echo -e "${NEW_MOVIES_LIST}" | sed -n "${i}p")
NEW_MOVIES_DIRLIST+="$(dirname "${MOVIE_PATH}")/\n"
done
LINKNAME=$(echo -e "${NEW_MOVIES_DIRLIST}" | sed -r 's,FOLTERS_TO_BE_SCANNED/HDD-EXTENDED.-SD./,,g')
NEW_MOVIES_DIRLIST=$(echo -e "${NEW_MOVIES_DIRLIST}" | sed '/^$/d')
NEW_MOVIES_COUNT=$(echo "${NEW_MOVIES_DIRLIST}" | wc -l)
NEW_MOVIES_LIST=''
for ((i = 0; i < ${NEW_MOVIES_COUNT}; i++))
do echo ${NEW_MOVIES_DIRLIST}${LINKNAME}
done
echo "Found ${NEW_MOVIES_COUNT} movies and ${NEW_SERIALS_COUNT} serials!"
${NEW_MOVIES_LIST}
The 2 variables are NEW_MOVIES_DIRLIST and LINKNAME; I can't join them when I run it. Any idea why?
You are adding a newline to the end of the string in the sed. So strip the newline before you display it.
newline=$'\n'
echo "${NEW_MOVIES_DIRLIST//$newline//}$LINKNAME"

UNIX shell-scripting: Split a textfile by its entries

I'm trying to analyze an enormous text file (1.6GB), whose data lines look like this:
20090118025859 -2.400000 78.100000 1023.200000 0.000000
20090118025900 -2.500000 78.100000 1023.200000 0.000000
20090118025901 -2.400000 78.100000 1023.200000 0.000000
I don't even know how many lines there are. But I'm trying to split the file by date. The left number is a time stamp (these lines for example are from 2009, january 18th).
How can I split this file into pieces according to the date?
The number of entries per date differs, so using split with a constant number won't work.
All I can think of would be grep '20090118' file > data20090118.dat, but there surely is a way to do all the dates at once, right?
Thanks in advance,
Alex
Using awk:
awk '{print > ("data" substr($1,1,8) ".dat")}' myfile
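A possible caveat: awk keeps each output file open, so input spanning very many distinct dates can hit the open-file limit in some awk implementations. If the file is sorted by date (as log data usually is), closing the previous file when the date changes avoids that; a sketch of the same idea:
awk '{ f = "data" substr($1,1,8) ".dat"
       if (f != prev) { if (prev) close(prev); prev = f }
       print > f }' myfile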
This should work if the items are in date sequence:
date=20090101 # Change to the earliest date
while IFS= read -rd $'\n' line
do
    if [ "$(echo "$line" | cut -d ' ' -f 1 | cut -c 1-8)" -eq $date ]
    then
        echo "$line" >> "$date.dat"
    else
        let date++
        echo "$line" >> "$date.dat"
    fi
done < log.dat
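A variant that does not rely on consecutive dates is to take the output file name from the line itself (essentially the same idea as the awk answer above); a minimal sketch:
while IFS= read -r line
do
    echo "$line" >> "data${line:0:8}.dat"   # first 8 characters are the YYYYMMDD stamp
done < log.dat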
With the caveats that each day needs to have more than 1 record,
and that the output files will have blank lines:
uniq --all-repeated=separate -w8 file | csplit -s - '/^$/' '{*}'
We could really use an option for uniq to output even unique records.
Also csplit should have an option to suppress the matched line.

What's an easy way to read a random line from a file?

What's an easy way to read a random line from a file in a shell script?
You can use shuf:
shuf -n 1 $FILE
There is also a utility called rl. In Debian it's in the randomize-lines package; it does exactly what you want, though it's not available in all distros. On its home page it actually recommends the use of shuf instead (which didn't exist when it was created, I believe). shuf is part of the GNU coreutils, rl is not.
rl -c 1 $FILE
Another alternative:
head -$((${RANDOM} % `wc -l < file` + 1)) file | tail -1
sort --random-sort $FILE | head -n 1
(I like the shuf approach above even better though - I didn't even know that existed and I would have never found that tool on my own)
This is simple.
cat file.txt | shuf -n 1
Granted this is just a tad slower than the "shuf -n 1 file.txt" on its own.
perlfaq5: How do I select a random line from a file? Here's a reservoir-sampling algorithm from the Camel Book:
perl -e 'srand; rand($.) < 1 && ($line = $_) while <>; print $line;' file
This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.
using a bash script:
#!/bin/bash
# replace with file to read
FILE=tmp.txt
# count number of lines
NUM=$(wc -l < ${FILE})
# generate random number in range 1-NUM
let "X = ${RANDOM} % ${NUM} + 1"
# extract X-th line
sed -n ${X}p ${FILE}
Single bash line:
sed -n $((1+$RANDOM%`wc -l test.txt | cut -f 1 -d ' '`))p test.txt
Slight problem: the filename has to appear twice.
Here's a simple Python script that will do the job:
import random, sys
lines = open(sys.argv[1]).readlines()
print(lines[random.randrange(len(lines))])
Usage:
python randline.py file_to_get_random_line_from
Another way using 'awk'
awk NR==$((${RANDOM} % `wc -l < file.name` + 1)) file.name
A solution that also works on macOS, and should also work on Linux(?):
N=5
awk 'NR==FNR {lineN[$1]; next}(FNR in lineN)' <(jot -r $N 1 $(wc -l < $file)) $file
Where:
N is the number of random lines you want
NR==FNR {lineN[$1]; next}(FNR in lineN) file1 file2
--> save line numbers written in file1 and then print corresponding line in file2
jot -r $N 1 $(wc -l < $file) --> draw N numbers randomly (-r) in the range (1, number_of_lines_in_file) with jot. The process substitution <() makes it look like a file to the interpreter, so it plays the role of file1 in the previous example.
#!/bin/bash
IFS=$'\n' wordsArray=($(<$1))
numWords=${#wordsArray[@]}
sizeOfNumWords=${#numWords}
while [ True ]
do
for ((i=0; i<$sizeOfNumWords; i++))
do
let ranNumArray[$i]=$(( ( $RANDOM % 10 ) + 1 ))-1
ranNumStr="$ranNumStr${ranNumArray[$i]}"
done
if [ $ranNumStr -le $numWords ]
then
break
fi
ranNumStr=""
done
noLeadZeroStr=$((10#$ranNumStr))
echo ${wordsArray[$noLeadZeroStr]}
Here is what I discovered, since my macOS doesn't have all the easy answers. I used the jot command to generate a number, since the $RANDOM variable solutions seemed not to be very random in my test. When testing my solution I got a wide variance in the values produced in the output.
RANDOM1=`jot -r 1 1 235886`
#range of jot ( 1 235886 ) found from earlier wc -w /usr/share/dict/web2
echo $RANDOM1
head -n $RANDOM1 /usr/share/dict/web2 | tail -n 1
The echo of the variable is to get a visual of the generated random number.
Using only vanilla sed and awk, and without using $RANDOM, a simple, space-efficient and reasonably fast "one-liner" for selecting a single line pseudo-randomly from a file named FILENAME is as follows:
sed -n $(awk 'END {srand(); r=rand()*NR; if (r<NR) {sub(/\..*/,"",r); r++;}; print r}' FILENAME)p FILENAME
(This works even if FILENAME is empty, in which case no line is emitted.)
One possible advantage of this approach is that it only calls rand() once.
As pointed out by @AdamKatz in the comments, another possibility would be to call rand() for each line:
awk 'rand() * NR < 1 { line = $0 } END { print line }' FILENAME
(A simple proof of correctness can be given based on induction.)
Caveat about rand()
"In most awk implementations, including gawk, rand() starts generating numbers from the same starting number, or seed, each time you run awk."
-- https://www.gnu.org/software/gawk/manual/html_node/Numeric-Functions.html
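Given that caveat, seeding the generator yourself makes the single-pass pick vary between runs (note that srand() with no argument seeds from the time of day, so runs within the same second can still repeat); a minimal sketch:
awk 'BEGIN { srand() } rand() * NR < 1 { line = $0 } END { print line }' FILENAME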
