KSH Shell script - Process file by blocks of lines - bash

I am trying to write a bash script in a KSH environment that would iterate through a source text file and process it in blocks of lines.
So far I have come up with this code, although it seems to run indefinitely, since tail does not return 0 lines when asked to retrieve lines beyond those in the source text file:
i=1
while [[ `wc -l /path/to/block.file | awk -F' ' '{print $1}'` -gt $((i * 1000)) ]]
do
    lc=$((i * 1000))
    DA=ProcessingResult_$i.csv
    head -$lc /path/to/source.file | tail -1000 > /path/to/block.file
    cd /path/to/processing/batch
    ./process.sh #This will process /path/to/block.file
    mv /output/directory/ProcessingResult.csv /output/directory/$DA
    i=$((i + 1))
done
Before launching the above script I perform a manual 'first injection': head -$lc /path/to/source.file | tail -1000 > /path/to/temp.source.file
Any idea on how to get the script to stop after processing the last lines from the source file?
Thanks in advance to you all

If you do not want to create so many temporary files up front before beginning to process each block, you could try the solution below. It can save a lot of space when processing huge files.
#!/usr/bin/ksh
range=$1
file=$2
b=0; e=0; seq=1
while true
do
    b=$((e+1)); e=$((range*seq));
    sed -n ${b},${e}p $file > ${file}.temp
    [ $(wc -l ${file}.temp | cut -d " " -f 1) -eq 0 ] && break
    ## process the ${file}.temp as per your need ##
    ((seq++))
done
The above code generates only one temporary file at a time.
You could pass the range (block size) and the filename as command-line arguments to the script.
example: extractblock.sh 1000 inputfile.txt
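For the question's workflow, the processing step marked in the loop could be filled in roughly like this (paths taken from the question, purely illustrative):
cp ${file}.temp /path/to/block.file
( cd /path/to/processing/batch && ./process.sh )    # processes /path/to/block.file
mv /output/directory/ProcessingResult.csv /output/directory/ProcessingResult_${seq}.csv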

Have a look at man split:
NAME
split - split a file into pieces
SYNOPSIS
split [OPTION]... [INPUT [PREFIX]]
-l, --lines=NUMBER
put NUMBER lines per output file
For example
split -l 1000 source.file
Or, to extract for example the 3rd chunk (here 1000 is not a number of lines, it is the number of chunks; a chunk is 1/1000 of source.file):
split -nl/3/1000 source.file
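Note that when a specific chunk is requested with l/K/N, GNU split writes that chunk to standard output, so you would normally redirect it, e.g.:
split -n l/3/1000 source.file > chunk_3.txt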
A note on the condition:
[[ `wc -l /path/to/block.file | awk -F' ' '{print $1}'` -gt $((i * 1000)) ]]
It should probably be source.file instead of block.file, and it is quite inefficient on a big file because it re-reads (counts the lines of) the file on every iteration; the number of lines can be stored in a variable once. Also, giving wc its data on standard input avoids the need for awk:
nb_lines=$(wc -l </path/to/source.file )
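Putting both points together, a corrected loop might look like this (a sketch reusing the question's paths, untested):
nb_lines=$(wc -l </path/to/source.file)   # count once, not on every iteration
i=1
while [ $(( (i - 1) * 1000 )) -lt $nb_lines ]
do
    start=$(( (i - 1) * 1000 + 1 ))
    # tail -n +START | head also copes with a short final block
    tail -n +$start /path/to/source.file | head -1000 > /path/to/block.file
    DA=ProcessingResult_$i.csv
    cd /path/to/processing/batch
    ./process.sh    # processes /path/to/block.file
    mv /output/directory/ProcessingResult.csv /output/directory/$DA
    i=$((i + 1))
done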

With Nahuel's recommendation I was able to build the script like this:
i=1
cd /path/to/sourcefile/
split -l 1000 source.file SF
for sf in /path/to/sourcefile/SF*
do
    DA=ProcessingResult_$i.csv
    cd /path/to/sourcefile/
    cat $sf > /path/to/block.file
    rm $sf
    cd /path/to/processing/batch
    ./process.sh #This will process /path/to/block.file
    mv /output/directory/ProcessingResult.csv /output/directory/$DA
    i=$((i + 1))
done
This worked great


limiting file output to only 2 lines [duplicate]

I ping a series of addresses and append the latency results to a file (each address has a separate file). I'm trying to limit the file to only contain the last 2 entries.
outpath=/opt/blah/file.txt
resp_str="0.42"
echo $resp_str >> $outpath
tail -2 $outpath > $outpath
Without tail, the file continues to grow with the new data (simply .42 for this example). But when I call tail, it writes out an empty file. If I redirect the tail output to a file of a different name, then I get the expected result. Can I not write out to a file as I read it? Is there a simple solution?
Here's the complete script:
OUT_PATH=/opt/blah/sites/
TEST_FILE=/opt/blah/test.txt
while IFS=, read -r ip name; do
    if [[ "$ip" != \#* ]]; then
        RESP_STR=$( ping -c 1 -q $ip | grep rtt | awk '{print $4}' | awk -F/ '{ print $2; }')
        echo $RESP_STR >> "$OUT_PATH""test/"$name".txt"
    fi
done << $TEST_FILE
tail -2 $outpath > $outpath
> truncates the file before tail starts reading it.
You need to buffer the output of tail before writing it back to that file. Use sponge to achieve this:
tail -2 $outpath | sponge $outpath
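If sponge (part of the moreutils package) isn't available, a temporary file gives the same buffering effect (a minimal sketch):
tail -2 "$outpath" > "${outpath}.tmp" && mv "${outpath}.tmp" "$outpath"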
You can use a pipe | to send the output of one command to another, in this case tail.
We can then append that output to a file using >>. If we used > we would overwrite the file each time and all previous content would be lost.
This example appends the names of the last 2 entries of the directory listing to log.txt each time it is run.
ls | tail -2 >> log.txt
Assumptions:
need to code for parallel, concurrent processes (use of temp files will require each process to have a uniquely named temp file)
go ahead and code to support a high volume of operations (i.e., reduce the overhead of creating/destroying temp files)
One idea using mktemp to create a temporary file ... we'll wrap this in a function for easier use:
keep2 () {
    # Usage: keep2 filename "new line of text"
    [[ ! -f "${tmpfile}" ]] &&
        tmpfile=$(mktemp)
    tail -1 "$1" > "${tmpfile}"
    { cat "${tmpfile}"; echo "$2"; } > "$1"
}
NOTES:
the hardcoded -1 (tail -1) could be parameterized or reference a user-defined env variable
OP can change the order of the input parameters as desired
Taking it for a test drive:
> logfile
for ((i=1;i<=5;i++))
do
    echo "######### $i"
    keep2 logfile "$i"
    cat logfile
done
This generates:
######### 1
1
######### 2
1
2
######### 3
2
3
######### 4
3
4
######### 5
4
5
In OP's code the following line:
echo $RESP_STR >> "$OUT_PATH""test/"$name".txt"
would be replaced with:
keep2 "$OUT_PATH""test/"$name".txt" "${RESP_STR}"

How to compare 2 files word by word and store the different words in a result output file

Suppose there are two files:
File1.txt
My name is Anamika.
File2.txt
My name is Anamitra.
I want result file storing:
Result.txt
Anamika
Anamitra
I use PuTTY so I can't use wdiff. Is there any other alternative?
Not my greatest script, but it works. Others might come up with something more elegant.
#!/bin/bash
if [ $# != 2 ]
then
    echo "Arguments: file1 file2"
    exit 1
fi
file1=$1
file2=$2
# Do this for both files
for F in $file1 $file2
do
    if [ ! -f $F ]
    then
        echo "ERROR: $F does not exist."
        exit 2
    else
        # Create a temporary file with every word from the file
        for w in $(cat $F)
        do
            echo $w >> ${F}.tmp
        done
    fi
done
# Compare the temporary files, since they are now 1 word per line
# The egrep keeps only the lines where diff marks a difference (starting with > or <)
# The awk keeps only the word (i.e. removes < or >)
# The sed removes any character that is not alphanumeric,
# e.g. a . at the end
diff ${file1}.tmp ${file2}.tmp | egrep -E "<|>" | awk '{print $2}' | sed 's/[^a-zA-Z0-9]//g' > Result.txt
# Cleanup!
rm -f ${file1}.tmp ${file2}.tmp
This uses a trick with the for loop. If you use for to loop over $(cat file), it loops over each word, NOT each line, as bash beginners tend to believe. Here that is actually useful to know, since it transforms the files into one word per line.
Ex: file content == This is a sentence.
After the for loop is done, the temporary file will contain:
This
is
a
sentence.
Then it is trivial to run diff on the files.
One last detail: your sample output did not include a . at the end, hence the sed command to keep only alphanumeric characters.
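As a side note, the same one-word-per-line transformation can be done without the for loop, e.g. with tr (a sketch, not part of the original script):
tr -s '[:space:]' '\n' < "$file1" > "${file1}.tmp"
tr -s '[:space:]' '\n' < "$file2" > "${file2}.tmp"
The diff pipeline afterwards stays exactly the same.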

Counting the number of special-character delimiters in a file - bash shell script performance improvement

Hi, I have a script that counts the number of records in a file and finds the expected number of delimiters per record by dividing the delimiter count (rs_count) by the total record count. It works fine, but it is a little slow on large files. I was wondering if there is a way to improve performance. The RS is a special character, octal \246. I am using a bash shell script.
Some additional info:
A line is a record.
The file will always have the same number of delimiters.
The purpose of the script is to check if the file has the expected number of fields. After calculating it, the script just echos it out.
for file in $SOURCE; do
    echo "executing File -"$file
    if (( $total_record_count != 0 ));then
        filename=$(basename "$file")
        total_record_count=$(wc -l < $file)
        rs_count=$(sed -n 'l' $file | grep -o $RS | wc -l)
        Delimiter_per_record=$((rs_count/total_record_count))
    fi
done
Counting the delimiters (not total records) in a file
On a file with 50,000 lines, I see around a 10-fold speed increase by folding the sed, grep, and wc pipeline into a single awk process:
awk -v RS='Delimiter' 'END{print NR -1}' input_file
Dealing with wc when there's no trailing line break
If you count the instances of ^ (start of line), you will get a true count of lines. Using grep:
grep -co "^" input_file
(Thankfully, even though ^ is a regex, the performance of this is on par with wc)
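A quick illustration of the difference on input that lacks a trailing newline:
printf 'a\nb\nc' | wc -l            # prints 2: wc counts newline characters
printf 'a\nb\nc' | grep -co "^"     # prints 3: grep counts line starts, including the unterminated last line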
Incorporating these two modifications into a trivial test based on your supplied code:
#!/usr/bin/env bash
SOURCE="$1"
RS=$'\246'
for file in $SOURCE; do
    echo "executing File -"$file
    if [[ $total_record_count != 0 ]];then
        filename=$(basename "$file")
        total_record_count=$(grep -oc "^" $file)
        rs_count="$(awk -v RS=$'\246' 'END{print NR -1}' $file)"
        Delimiter_per_record=$((rs_count/total_record_count))
    fi
done
echo -e "\$rs_count:\t${rs_count}\n\$Delimiter_per_record:\t${Delimiter_per_record}\n\$total_record_count:\t${total_record_count}" | column -t
Running this on a file with 50,000 lines on my macbook:
time ./recordtest.sh /tmp/randshort
executing File -/tmp/randshort
$rs_count: 186885
$Delimiter_per_record: 3
$total_record_count: 50000
real 0m0.064s
user 0m0.038s
sys 0m0.012s
Unit test one-liner
(creates /tmp/recordtest, chmod +x's it, creates /tmp/testfile with 10 lines of random characters including octal \246, and then runs the script file on the testfile)
echo $'#!/usr/bin/env bash\n\nSOURCE="$1"\nRS=$\'\\246\'\n\nfor file in $SOURCE; do\n echo "executing File -"$file\n if [[ $total_record_count != 0 ]];then\n filename=$(basename "$file")\n total_record_count=$(grep -oc "^" $file)\n rs_count="$(awk -v RS=$\'\\246\' \'END{print NR -1}\' $file)"\n Delimiter_per_record=$((rs_count/total_record_count))\n fi\ndone\n\necho -e "\\$rs_count:\\t${rs_count}\\n\\$Delimiter_per_record:\\t${Delimiter_per_record}\\n\\$total_record_count:\\t${total_record_count}" | column -t' > /tmp/recordtest ; echo $'\246459ca4f23bafff1c8fc017864aa3930c4a7f2918b\246753f00e5a9278375b\nb\246a3\246fc074b0e415f960e7099651abf369\246a6f\246f70263973e176572\2467355\n1590f285e076797aa83b2ee537c7f99\24666990bb60419b8aa\246bb5b6b\2467053\n89b938a5\246560a54f2826250a2c026c320302529331229255\246ef79fbb52c2\n9042\246bb\246b942408a22f912268ffc78f08c\2462798b0c05a75439\246245be2ea5\n0ef03170413f90e\246e0\246b1b2515c4\2466bf0a1bb\246ee28b78ccce70432e6b\24653\n51229e7ab228b4518404360b31a\2463673261e3242985bf24e59bc657\246999a\n9964\246b08\24640e63fae788ea\246a1777\2460e94f89af8b571e\246e1b53e6332\246c3\246e\n90\246ae12895f\24689885e\246e736f942080f267a275132a348ec1e837b99efe94\n2895e91\246\246f506f\246c1b986a63444b4258\246bc1b39182\24630\24696be' > /tmp/testfile ; chmod +x /tmp/recordtest ; /tmp/./recordtest /tmp/testfile
Which produces this result:
$rs_count: 39
$Delimiter_per_record: 3
$total_record_count: 10
Though there are a number of solutions for counting instances of characters in files, quite a few come undone when trying to process special characters like octal \246.
awk seems to handle it reliably and quickly.

How to check that a file has more than 1 line in a BASH conditional?

I need to check if a file has more than 1 line. I tried this:
if [ `wc -l file.txt` -ge "2" ]
then
echo "This has more than 1 line."
fi
if [ `wc -l file.txt` >= 2 ]
then
echo "This has more than 1 line."
fi
These just report errors. How can I check if a file has more than 1 line in a BASH conditional?
The command:
wc -l file.txt
will generate output like:
42 file.txt
with wc helpfully telling you the file name as well. It does this in case you're checking out a lot of files at once and want individual as well as total stats:
pax> wc -l *.txt
973 list_of_people_i_must_kill_if_i_find_out_i_have_cancer.txt
2 major_acheivements_of_my_life.txt
975 total
You can stop wc from doing this by providing its data on standard input, so it doesn't know the file name:
if [[ $(wc -l <file.txt) -ge 2 ]]
The following transcript shows this in action:
pax> wc -l qq.c
26 qq.c
pax> wc -l <qq.c
26
As an aside, you'll notice I've also switched to using [[ ]] and $().
I prefer the former because it has fewer issues due to backward compatibility (mostly to do with string splitting) and the latter because it's far easier to nest executables.
A pure bash (≥4) possibility using mapfile:
#!/bin/bash
mapfile -n 2 < file.txt
if ((${#MAPFILE[@]}>1)); then
    echo "This file has more than 1 line."
fi
The mapfile builtin stores what it reads from stdin in an array (MAPFILE by default), one line per field. Using -n 2 makes it read at most two lines (for efficiency). After that, you only need to check whether the array MAPFILE has more than one field. This method is very efficient.
As a byproduct, the first line of the file is stored in ${MAPFILE[0]}, in case you need it. You'll find out that the trailing newline character is not trimmed. If you need to remove the trailing newline character, use the -t option:
mapfile -t -n 2 < file.txt
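For example, to combine the line check with access to the first line (a small sketch, not part of the original answer):
mapfile -t -n 2 < file.txt
if ((${#MAPFILE[@]} > 1)); then
    echo "more than one line; first line is: ${MAPFILE[0]}"
fi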
if [ `wc -l file.txt | awk '{print $1}'` -ge "2" ]
...
You should always check what each subcommand returns. Command wc -l file.txt returns output in the following format:
12 file.txt
You need the first column - you can extract it with awk or cut or any other utility of your choice.
How about:
if read -r && read -r
then
    echo "This has more than 1 line."
fi < file.txt
The -r flag is needed to ensure line continuation characters don't fold two lines into one, which would cause the following file to report one line only:
This is a file with _two_ lines, \
but will be seen as one.
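A quick way to see the effect (demo.txt is a hypothetical file name):
printf 'two lines, \\\nor one?\n' > demo.txt
read -r line < demo.txt; echo "with -r:    $line"   # stops at the first physical line
read line < demo.txt; echo "without -r: $line"      # backslash-newline joins the two lines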
change
if [ `wc -l file.txt` -ge "2" ]
to
if [ `cat file.txt | wc -l` -ge "2" ]
If you're dealing with large files, this awk command is much faster than using wc:
awk 'BEGIN{x=0}{if(NR>1){x=1;exit}}END{if(x>0){print FILENAME,"has more than one line"}else{print FILENAME,"has one or less lines"}}' file.txt

ksh: shell script to search for a string in all files present in a directory at a regular interval

I have a directory (output) in unix (SUN). There are two types of files created with a timestamp prefix in the file name. These files are created at a regular interval of 10 minutes.
e.g.:
1. 20140129_170343_fail.csv (some lines are there)
2. 20140129_170343_success.csv (some lines are there)
Now I have to search for a particular string in all the files present in the output directory and if the string is found in fail and success files, I have to count the number of lines present in those files and save the output to the cnt_succ and cnt_fail variables. If the string is not found I will search again in the same directory after a sleep timer of 20 seconds.
Here is my code:
#!/usr/bin/ksh
for i in 1 2
do
    grep -l 0140127_123933_part_hg_log_status.csv /osp/local/var/log/tool2/final_logs/* >log_t.txt; ### log_t.txt will contain all the matching file list
    while read line ### reading the log_t.txt
    do
        echo "$line has following count"
        CNT=`wc -l $line|tr -s " "|cut -d" " -f2`
        CNT=`expr $CNT - 1`
        echo $CNT
    done <log_t.txt
    if [ $CNT > 0 ]
    then
        exit
    fi
    echo "waiitng"
    sleep 20
done
The problem I'm facing is that I'm not able to pick out the _success and _fail files from the list and check their counts separately.
I'm not sure about ksh, but in bash a while ... do ... done loop that is fed by a pipe runs in a subshell and loses whatever variables you set inside it. ksh might be similar.
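In bash, the classic symptom looks like this (a minimal illustration, not from the original answer):
count=0
printf 'a\nb\n' | while read -r x; do count=$((count+1)); done
echo "$count"    # prints 0 in bash: the while loop ran in a subshell
(ksh93 runs the last part of a pipeline in the current shell, so it keeps the variable.)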
If I've understood your question right, SunOS has grep, uniq and sort AFAIK, so a possible alternative might be...
First of all:
$ cat fail.txt
W34523TERG
ADFLKJ
W34523TERG
WER
ASDTQ34T
DBVSER6
W34523TERG
ASDTQ34T
DBVSER6
$ cat success.txt
abcde
defgh
234523452
vxczvzxc
jkl
vxczvzxc
asdf
234523452
vxczvzxc
dlkjhgl
jkl
wer
234523452
vxczvzxc
And now:
egrep "W34523TERG|ASDTQ34T" fail.txt | sort | uniq -c
2 ASDTQ34T
3 W34523TERG
egrep "234523452|vxczvzxc|jkl" success.txt | sort | uniq -c
3 234523452
2 jkl
4 vxczvzxc
Depending on the input data, you may want to see what options sort has on your system. Examining uniq's options may prove useful too (it can do more than just count duplicates).
I think you want something like this (it will work in both bash and ksh):
#!/bin/ksh
while read -r file; do
    lines=$(wc -l < "$file")
    ((sum+=$lines))
done < <(grep -Rl --include="[1|2]*_fail.csv" "somestring")
echo "$sum"
Note this will match files starting with 1 or 2 and ending in _fail.csv; it's not exactly clear whether that's what you want or not.
e.g. Let's say I have two files, one starting with 1 (containing 4 lines) and one starting with 2 (containing 3 lines), both ending in _fail.csv, somewhere under my current working directory:
> abovescript
7
It's important to understand the grep options used here:
-R, --dereference-recursive
Read all files under each directory, recursively. Follow all
symbolic links, unlike -r.
and
-l, --files-with-matches
Suppress normal output; instead print the name of each input
file from which output would normally have been printed. The
scanning will stop on the first match. (-l is specified by
POSIX.)
Finally I'm able to find the solution. Here is the complete code:
#!/usr/bin/ksh
file_name="0140127_123933.csv"
for i in 1 2
do
    grep -l $file_name /osp/local/var/log/tool2/final_logs/* >log_t.txt;
    while read line
    do
        if [ $(echo "$line" |awk '/success/') ] ## will check the success file
        then
            CNT_SUCC=`wc -l $line|tr -s " "|cut -d" " -f2`
            CNT_SUCC=`expr $CNT_SUCC - 1`
        fi
        if [ $(echo "$line" |awk '/fail/') ] ## will check the fail file
        then
            CNT_FAIL=`wc -l $line|tr -s " "|cut -d" " -f2`
            CNT_FAIL=`expr $CNT_FAIL - 1`
        fi
    done <log_t.txt
    if [ $CNT_SUCC -gt 0 ] && [ $CNT_FAIL -gt 0 ]
    then
        echo " Fail count = $CNT_FAIL"
        echo " Success count = $CNT_SUCC"
        exit
    fi
    echo "waiting for next search..."
    sleep 10
done
Thanks everyone for your help.
I don't think I'm getting it right, but can't you differentiate the files?
maybe try:
#...
CNT=`expr $CNT - 1`
if [ $(echo $line | grep -o "fail") ]
then
    #do something with fail count
else
    #do something with success count
fi
