Issue with bash: 2 variables instead of one

I've written this piece of code.
The aim is the following:
for each file in the temp list, it should take the first entry of the list, put it into a variable called $name1, and then the second entry of the list into a second variable called $name2. The variables are file names. With the 2 variables, I do a join.
for files in $(cat temp.lst); do
if [ $(cat temp.lst | wc -l) == 1 ]
then
name=$(head -1 temp.lst)
join -t\; -j 1 file_minus1.txt "$name" | sed 's/;;/;/g' > file1.txt
else
name1=$(head -1 temp.lst)
name2=$(head -2 temp.lst)
echo "var1 "$name1 "var2 "$name2
sed '1,2d' temp.lst > tmpfile.txt
mv tmpfile.txt temp.lst
join -t\; -j 1 "$name1" "$name2" | sed 's/;;/;/g' > file_minus1.txt
fi
;done
Theoretically, it should work but here it is not working, alas.
The echo line I've put in my code is giving me 3 values instead of 2:
var1 ei_etea17_m.tsv var2 ei_etea17_m.tsv ei_eteu25_m.tsv
Worse, the join is not functioning the way I thought, giving me this error message instead:
join: ei_etea17_m.tsv
ei_eteu25_m.tsv: No such file or directory
Please find a sample of my temp.lst
ei_eteu27_m.tsv
ei_eteu28_m.tsv
ei_isbr_m.tsv
ei_isbu_m.tsv
ei_isin_m.tsv
Any suggestions are welcomed.
Best.

To extract 2 lines of a file in a loop, try this:
paste - - < temp.lst |
while read name1 name2; do
if [[ -z $name2 ]]; then
name2=$name1
name1=file_minus1.txt
output=file1.txt
else
output=file_minus1.txt
fi
join -t\; "$name1" "$name2" | sed 's/;;/;/g' > $output
done
Notes
the paste command takes 2 consecutive lines from the file and joins them into a single line (separated by tab)
demo: seq 7 | paste - - (sample output shown after these notes)
read can assign to multiple variables: the line will be split on whitespace (default) and assigned to the named variables.
in the loop body, I basically follow your logic
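For reference, here is what that demo produces (a sketch: the two columns are tab-separated, and the odd final line leaves the second field empty, which is why the loop checks for an empty name2):
$ seq 7 | paste - -
1	2
3	4
5	6
7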

To perform an n-way join, use recursion :)
recursive_join () {
# Zero files: do nothing (special case)
# One file: output it
# Multiple files: join the first with the result of joining the rest
file1=$1
shift || return
[ "$#" -eq 0 ] && cat "$file1" ||
recursive_join "$#" | join -t\; -j1 "$file1" -
}
recursive_join ei_eteu27_m.tsv ei_eteu28_m.tsv ei_isbr_m.tsv ei_isbu_m.tsv ei_isin_m.tsv
Adapting this to use a file listing the input files, rather than using command-line arguments, is a little trickier. As long as none of the input file names contain whitespace or other special characters, you could simply use
recursive_join $(cat temp.lst)
Or, if you want to avail yourself of bash features, you could use an array:
while read; do files+=("$REPLY"); done < temp.lst
recursive_join "${files[#]}"
or in bash 4:
readarray -t files < temp.lst
recursive_join "${files[@]}"
However, if you want to stick with standard shell scripting only, it would be better to modify the recursive function to read the input file names from standard input. This makes the function a little uglier, since in order to detect if there is only one file left on standard input, we have to try to read a second one, and put it back on standard input if we succeed.
recursive_join () {
IFS= read -r file1 || return
IFS= read -r file2 &&
{ echo "$file2"; cat; } | recursive_join | join -t\; -j1 "$file1" - ||
cat "$file1"
}
recursive_join < temp.lst
Creating a function that can take either command-line arguments or read a list from standard input is left as an exercise for the reader.

Variable name1 is getting the first line.
Variable name2 is getting the first two lines.
If you want name2 to have only the second line you could try something like:
name2=$(sed -n '2p' temp.lst)
Also sed -i will remove the need for tmpfile.txt.
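For example, a minimal sketch with GNU sed (BSD/macOS sed wants an argument after -i, e.g. -i ''):
name2=$(sed -n '2p' temp.lst)
sed -i '1,2d' temp.lst   # delete the first two lines in place, no tmpfile.txt/mv needed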

Ok Gents or Ladies.
I found out the why.
head -1 temp.lst is only giving the file name without the extension.
I need to find a way to include the extension. Doable.


How to filter text data in bash more efficiently

I have a data file which I need to filter with a bash script; see the data example:
name=pencils
name=apples
value=10
name=rocks
value=3
name=tables
value=6
name=beds
name=cups
value=89
I need to group name/value pairs like so: apples=10. If the current line starts with name and the next line also starts with name, the first line should be omitted entirely. So the result file should look like this:
apples=10
rocks=3
tables=6
cups=89
I came up with this simple solution which works but is very slow; it takes 5 min to complete for a file with 2000 lines.
VALUES=$(cat input.txt)
for x in $VALUES; do
if [[ -n $(echo $x | grep 'name=') ]]; then
name=$(echo $x | sed "s/name=//")
elif [[ -n $(echo $x | grep 'value=') ]]; then
value=$(echo $x | sed "s/value=//")
echo "${name}=${value}" >> output.txt
fi
done
I'm aware that this kind of task is not very suitable for bash, but the script is already written and this is just a small part of it.
How can I optimize this task in bash?
Do not run any commands in subshells, it slows your script a lot. You can do everything in the current shell.
#! /bin/bash
while IFS== read k v ; do
if [[ $k == name ]] ; then
name=$v
elif [[ $k == value ]] ; then
printf '%s=%s\n' "$name" "$v"
fi
done
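Assuming the script above is saved as, say, filter.sh (a hypothetical name), it reads standard input and writes standard output, so it can be run over the question's files like this:
bash filter.sh < input.txt > output.txt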
There are three easy optimizations you can make that will greatly speed up the script without requiring a major rethink.
1. Replace for with while read
Loading input.txt into a string, and then looping over that string with for x in $VALUES is slow. It requires the whole file to be read into memory even though this task could be done in a streaming fashion, reading a line at a time.
A common replacement for for line in $(cat file) is while read line; do ... done < file. It turns out that loops are compound commands, and like the normal one-line commands we're used to, compound commands can have < and > redirections. Redirecting a file into a loop means that for the duration of the loop, stdin comes from the file. So if you call read line inside the loop then it will read one line each iteration.
while IFS= read -r x; do
if [[ -n $(echo $x | grep 'name=') ]]; then
name=$(echo $x | sed "s/name=//")
elif [[ -n $(echo $x | grep 'value=') ]]; then
value=$(echo $x | sed "s/value=//")
echo "${name}=${value}" >> output.txt
fi
done < input.txt
2. Redirect output outside loop
It's not just input that can be redirected. We can do the same thing for the >> output.txt redirection. Here's where you'll see the biggest speedup. When >> output.txt is inside the loop output.txt must be opened and closed every iteration, which is crazy slow. Moving it to the outside means it only needs to be opened once. Much, much faster.
while IFS= read -r x; do
if [[ -n $(echo $x | grep 'name=') ]]; then
name=$(echo $x | sed "s/name=//")
elif [[ -n $(echo $x | grep 'value=') ]]; then
value=$(echo $x | sed "s/value=//")
echo "${name}=${value}"
fi
done < input.txt > output.txt
3. Shell string processing
One final improvement is to use faster string processing. Calling grep requires forking a subprocess every time just to do a simple string split. It'd be a lot faster if we could do the string splitting using just shell constructs. Well, as it happens that's easy now that we've switched to read. read can do more than read whole lines; it can also split on a delimiter from the variable $IFS (internal field separator).
while IFS='=' read -r key value; do
case "$key" in
name) name="$value";;
value) echo "$name=$value";;
esac
done < input.txt > output.txt
Further reading
BashFAQ/001 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
This explains why I have IFS= read -r in the first two versions of the loop.
BashFAQ/024 - I set variables in a loop that's in a pipeline. Why do they disappear after the loop terminates? Or, why can't I pipe data to read?
cmd | while read; do ... done is another popular use of while read, but it has unique pitfalls; see the short demo after this list.
BashFAQ/100 - How do I do string manipulations in bash?
More in-shell string processing options.
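As a quick illustration of the BashFAQ/024 pitfall mentioned above (a sketch; by default bash runs each part of a pipeline in a subshell, so variables set inside the loop vanish afterwards):
printf 'name=apples\nvalue=10\n' | while IFS='=' read -r k v; do last=$v; done
echo "last=$last"   # prints "last=" because the loop ran in a subshell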
If you have performance issues do not use bash at all. Use a text processing tool like, for instance, awk:
$ awk -F= '$1 == "value" {print name "=" $2} {name = $2}' data.txt
apples=10
rocks=3
tables=6
cups=89
Explanation: -F= defines the field separator as character =. The first block is executed only if the first field of a line ($1) is equal to string value. It prints variable name followed by character = and the second field ($2). The second block is executed on each line and it stores the second field ($2) in variable name.
Normally, if your input resembles what you show, this should automatically skip the first line. Else, we can exclude it explicitly using a test on the NR variable, whose value is the line number, starting at 1:
awk -F= 'NR != 1 && $1 == "value" {print name "=" $2}
NR != 1 {name = $2}' data.txt
All this works on inputs like the one you show but not on inputs where you would have other types of lines or several value=... consecutive lines. If you really want to test that the name/value pair is on two consecutive lines we need something more. For instance, test if the first field is name and use another variable n to store the line number of the last encountered name=... line. With all these tests we can now put the 2 blocks in a slightly more intuitive order (but the opposite would work the same):
awk -F= 'NR != 1 && $1 == "name" {name = $2; n = NR}
NR != 1 && NR == n+1 && $1 == "value" {print name "=" $2}' data.txt
With awk there might be a more elegant solution but you can have:
awk 'BEGIN{RS="\n?name=";FS="\nvalue="} {if($2) printf "%s=%s\n",$1,$2}' inputs.txt
RS="\n?name=" says that the record separator is name=
FS="\nvalue=" says that the field separator for each record is value=
if($2) says to only run the printf if the second field exists
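To see how the records get split (a sketch; using RS as a regular expression requires GNU awk):
awk 'BEGIN{RS="\n?name=";FS="\nvalue="} {printf "record %d: $1=[%s] $2=[%s]\n", NR, $1, $2}' inputs.txt
Records without a value= part (pencils, beds) show an empty $2, which is exactly what the if($2) test filters out.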

BASH loop to change data from 1 csv from other csv

I am trying to change the value of a column based on another column in another CSV.
So let's say we have a CSV_1 with over 1000 lines and 3 columns:
shape Color size
round 2 big
triangle 1 small
square 3 medium
then we have a CSV2 that has only 10 lines, with the following information:
color
1 REd
2 Blue
3 Yellow
etc
Now I want to change the value in the Color column of CSV_1 to the name of the color from CSV2,
so in other words, something like
for (i=0; i<column.color(csv1); i++) {
if color.csv1=1; then
substitute with color.csv2=1 }
so that the loop iterates over the whole Color column of CSV_1 and changes the values to the values from CSV2.
An explicit loop for this would be very slow in bash. Use a command that does the line-wise processing for you.
sed 's/abc/xyz/' searches abc in each line and replaces it by xyz. Use this to search and replace the numbers in your 2nd column by the names from your 2nd file. The sed command can be automatically generated from the 2nd file using another sed command:
The following script assumes a CSV file without spaces around the delimiting ,.
sed -E "$(sed -E '1d;s#^([^,]*),(.*)#s/^([^,]*,)\1,/\\1\2,/#' 2.csv)" 1.csv
Interactive Example
$ cat 1.csv
shape,Color,size
round,2,big
triangle,1,small
square,3,medium
$ cat 2.csv
color
1,REd
2,Blue
3,Yellow
$ sed -E "$(sed -E '1d;s#^([^,]*),(.*)#s/^([^,]*,)\1,/\\1\2,/#' 2.csv)" 1.csv
shape,Color,size
round,Blue,big
triangle,REd,small
square,Yellow,medium
Here is one approach with mapfile, which is a bash 4+ feature, and some common utilities in Linux/Unix.
Assuming both files are delimited with a comma ,
#!/usr/bin/env bash
mapfile -t colors_csv2 < csv2.csv
head -n1 csv1.csv
while IFS=, read -r shape_csv1 color_csv1 size_csv1; do
for color_csv2 in "${colors_csv2[@]:1}"; do
if [[ $color_csv1 == ${color_csv2%,*} ]]; then
printf '%s,%s,%s\n' "$shape_csv1" "${color_csv2#*,}" "$size_csv1"
fi
done
done < <(tail -n +2 csv1.csv)
It would be very slow on a large set of data/files.
If ed is available/acceptable, here is a way with the bash shell.
#!/usr/bin/env bash
ed -s csv1.csv < <(
printf '%s\n' '1d' $'g|.|s|,|/|\\\ns|^|,s/|\\\ns|$|/|' '$a' ',p' 'Q' . ,p |
ed -s csv2.csv
)
To add to @Jetchisel's interesting answer, here is an old bash way to achieve that. It should work with bash release 2, which already supports escape literals, indexed arrays, string expansion, and indirect variable references. It assumes that the color keys in csv2.csv will always be numeric values. Add shopt -s compat31 at the beginning to test it the 'old way' with a recent bash. You can also replace declare -a csv2 with a Bash 4+ declare -A csv2 for an associative array, in which case the key can be anything.
#!/bin/bash
declare -a csv2
esc=$'\x1B'
while read -r colors; do
if [ "${colors}" ] ; then
colors="${colors// /${esc}}"
set ${colors//,/ }
if [ "$1" ] ; then
csv2["$1"]="$2"
fi
fi
done < csv2.csv
while read -r output; do
if [ "${output}" ] ; then
outputfilter="${output// /${esc}}"
set ${outputfilter//,/ }
if [ "$2" ] ; then
color="${csv2["$2"]}"
[ "${color}" ] && { tmp="$1,${color},$3";output="${tmp//${esc}/ }"; };
fi
echo "${output}"
fi
done < csv1.csv
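The core trick used above, shown in isolation (a sketch; spaces are temporarily swapped for an escape byte so that set splits the line only on the commas, then swapped back):
esc=$'\x1B'
line='round,light blue,big'
line="${line// /${esc}}"    # protect embedded spaces
set -- ${line//,/ }         # split on commas into $1 $2 $3
echo "${2//${esc}/ }"       # prints: light blue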

How to only concatenate files with same identifier using bash script?

I have a directory with files, some have the same ID, which is given in the first part of the file name before the first underscore (always). e.g.:
S100_R1.txt
S100_R2.txt
S111_1_R1.txt
S111_R1.txt
S111_R2.txt
S333_R1.txt
I want to concatenate the files with identical IDs (and if possible place the original files in another dir), e.g. output:
original files (folder)
S100_merged.txt
S111_merged.txt
S333_R1.txt
Small note: I imagine that perhaps a solution would be to place all files which will be processed by the code in a new directory and then, in a second step, move the files with the appended "merged" back to the original dir, or something like this...
I am extremely new to bash scripting, so I really can't produce this code. I am used to the R language and I can think how it should be but can't write it.
My pitiful attempt is something like this:
while IFS= read -r -d '' id; do
cat *"$id" > "./${id%.txt}_grouped.txt"
done < <(printf '%s\0' *.txt | cut -zd_ -f1- | sort -uz)
or this:
for ((k=100;k<400;k=k+1));
do
IDList= echo "S${k}_S*.txt" | awk -F'[_.]' '{$1}'
while [ IDList${k} == IDList${k+n} ]; do
cat IDList${k}_S*.txt IDList${k+n}_S*.txt S${k}_S*.txt S${k}_S*.txt >cat/S${k}_merged.txt &;
done
Sometimes there is only one version of the file (e.g. S333_R1.txt), sometimes two (S100*), three (S111*) or more of the same.
I am prepared for harsh critique for this question because I am so far from a solution, but if someone would be willing to help me out I would greatly appreciate it!
while read line;
do
if [[ "$(find . -maxdepth 1 -name "${line}_*.txt" | wc -l)" -gt "1" ]]
then
cat "${line}"_*.txt >> "${line}_merged.txt"
fi
done <<< "$(for i in *_*.txt;do echo $i;done | awk -F_ '{ print $1 }')"
Search for files matching *_*.txt and pipe the output into awk, printing the strings before "_". Run this through a while loop. Check if the number of files for each prefix pattern is greater than 1 using find and if it is, cat the files with that prefix pattern into a merged file.
for id in $(ls | grep -Po '^[^_]+' | uniq) ; do
if [ $(ls ${id}_*.txt 2> /dev/null | wc -l) -gt 1 ] ; then
cat ${id}_*.txt > _${id}_merged.txt
mv ${id}_*.txt folder
fi
done
for f in _*_merged.txt ; do
mv ${f} ${f:1}
done
A plain bash loop with preprocessing:
# first get the list of files
find . -type f |
# then extract the prefix
sed 's#./\([^_]*\)_#\1\t&#' |
# then in a loop merge the files
while IFS=$'\t' read prefix file; do
cat "$file" >> "${prefix}_merged.txt"
done
That script is iterative - one file at a time. To detect if there is only one file with a specific prefix, we have to look at all the files at once. So first an awk script to join the list of filenames sharing a common prefix:
find . -type f | # maybe `sort |` ?
# join filenames with common prefix
awk '{
f=$0; # remember the file path
gsub(/.*\//,"");gsub(/_.*/,""); # extract prefix from filepath and store it in $0
a[$0]=a[$0]" "f # Join path with leading space in associative array indexed with prefix
}
# Output prefix and filenames separated by spaces.
# TBH a tab would be a better separator..
END{for (i in a) print i a[i]}
' |
# Read input separated by spaces into a bash array
while IFS=' ' read -ra files; do
#first array element is the prefix
prefix=${files[0]}
unset files[0]
# rest is the files
case "${#files[#]}" in
0) echo super error; ;;
# one file - preserve the filename
1) cat "${files[#]}" > "$outdir"/"${files[1]}"; ;;
# more files - do a _merged.txt suffix
*) cat "${files[#]}" > "$outdir"/"${prefix}_merged.txt"; ;;
esac
done
Tested on repl.
IDList= echo "S${k}_S*.txt"
Executes the command echo with the environment variable IDList exported and set to empty with one argument equal to S<insert value of k here>_S*.txt.
Filename expansion (i.e. * -> a list of files) is not performed inside "..." double quotes.
To assign the output of a command to a variable, use command substitution: var=$( something | something )
IDList${k+n}_S*.txt
The ${var+pattern} is a variable expansion that does not add two variables together. It uses pattern when var is set and does nothing when var is unset. See shell parameter expansion and this answer of mine on ${var-pattern}, which is similar.
To add two numbers use arithmetic expansion $((k + n)).
awk -F'[_.]' '{$1}'
$1 is just invalid here. To print the first field, use {print $1}.
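Putting those corrections together on one filename, a minimal sketch (the variable names are taken from the question; the overall loop still needs a rethink):
file=S100_R1.txt
IDList=$(echo "$file" | awk -F'[_.]' '{print $1}')   # command substitution: IDList=S100
k=100 n=1
echo $((k + n))                                       # arithmetic expansion: 101, unlike ${k+n}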
Remember to check your scripts with http://shellcheck.net
A pure bash way below. It uses only globs (no need for external commands like ls or find for this question) to enumerate filenames, and an associative array (supported by bash since version 4.0) to compute the frequencies of ids. Parsing ls output to list files is questionable in bash. You may consider reading ParsingLs.
#!/bin/bash
backupdir=original_files # The directory to move the original files
declare -A count # Associative array to hold id counts
# If it is assumed that the backup directory exists prior to call, then
# drop the line below
mkdir "$backupdir" || exit
for file in [^_]*_*; do ((++count[${file%%_*}])); done
for id in "${!count[#]}"; do
if ((count[$id] > 1)); then
mv "$id"_* "$backupdir"
cat "$backupdir/$id"_* > "$id"_merged.txt
fi
done

Read lines from a file and output with specific formatting with Bash

In A.csv, there are
1
2
3
4
How should I read this file and create variables $B and $C so that:
echo $B
echo $C
returns:
1 2 3 4
1,2,3,4
So far I am trying:
cat A.csv | while read A;
do
echo $A
done
It only returns
1
2
3
4
Assuming bash 4.x, the following is efficient, robust, and native:
# Read each line of A.csv into a separate element of the array lines
readarray -t lines <A.csv
# Generate a string B with a space after each item in the array
printf -v B '%s ' "${lines[@]}"
# Prune the trailing space from that string
B=${B% }
# Generate a string C with a comma after each item in the array
printf -v C '%s,' "${lines[@]}"
# Prune the trailing comma from that string
C=${C%,}
As @Cyrus said
B=$(cat A.csv)
echo $B
Will output:
1 2 3 4
That is because bash will not preserve the newlines if the variable is not wrapped in quotes; word splitting collapses them into single spaces. This is dangerous if A.csv contains any characters which might be affected by bash glob expansion, but should be fine if you are just reading simple strings.
If you are reading simple strings with no spaces in any of the elements, you can also get your desired result for $C by using:
echo $B | tr ' ' ','
This will output:
1,2,3,4
If lines in A.csv may contain bash special characters or spaces then we return to the loop.
For why I've formatted the file reading loop as I have, refer to: Looping through the content of a file in Bash?
B=''
C=''
while read -u 7 curr_line; do
if [ "$B$C" == "" ]; then
B="$curr_line"
C="$curr_line"
else
B="$B $curr_line"
C="$C,$curr_line"
fi
done 7<A.csv
echo "$B"
echo "$C"
Will construct the two variables as you desire using a loop through the file contents and should prevent against unwanted globbing and splitting.
B=$(cat A.csv)
echo $B
Output:
1 2 3 4
With quotes:
echo "$B"
Output:
1
2
3
4
I would read the file into a bash array:
mapfile -t array < A.csv
Then, with various join characters
b="${array[*]}" # space is the default
echo "$b"
c=$( IFS=","; echo "${array[*]}" )
echo "$c"
Or, you can use paste to join all the lines with a specified separator:
b=$( paste -d" " -s < A.csv )
c=$( paste -d"," -s < A.csv )
Try this :
cat A.csv | while read A;
do
printf "$A"
done
Regards!
Try this (simpler one):
b=$(tr '\n' ' ' < file)
c=$(tr '\n' ',' < file)
You don't have to read the file line by line for that. If the file comes from Windows, make sure you run the dos2unix command on it first (to remove \r).
Note: dos2unix modifies the file in place, so make a copy of the original first.

Count number of lines under each header in a text file using bash shell script

I can do this easily in python or some other high level language. What I am interested in is doing this with bash.
Here is the file format:
head-xyz
item1
item2
item3
head-abc
item8
item5
item6
item9
What I would like to do is print the following output:
head-xyz: 3
head-abc: 4
Headers will have a specific pattern similar to the example I gave above. Items also have specific patterns like in the example above. I am only interested in the count of items under each header.
You can use awk:
awk '/head/{h=$0}{c[h]++}END{for(i in c)print i, c[i]-1}' input.file
Breakdown:
/head/{h=$0}
For every line matching /head/, set variable h to record the header.
{c[h]++}
For every line in the file, update the array c, which stores a map from header string to line count.
END{for(i in c)print i, c[i]-1}
At the end, loop through the keys in array c and print the key (header) followed by the value (count). Subtract one to avoid counting the header itself.
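A sample run against the question's file (a sketch; the for..in iteration order is unspecified in awk, and this prints a space rather than the colon shown in the desired output):
$ awk '/head/{h=$0}{c[h]++}END{for(i in c)print i, c[i]-1}' input.file
head-abc 4
head-xyz 3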
Note: Bash version 4 only (uses associative arrays)
#!/usr/bin/env bash
FILENAME="$1"
declare -A CNT
while read -r LINE || [[ -n $LINE ]]
do
if [[ $LINE =~ ^head ]]; then HEADLINE="$LINE"; fi
if [ ${CNT[$HEADLINE]+_} ];
then
CNT[$HEADLINE]=$(( ${CNT[$HEADLINE]} + 1 ))
else
CNT[$HEADLINE]=0
fi
done < "$FILENAME"
for i in "${!CNT[#]}"; do echo "$i: ${CNT[$i]}"; done
Output:
$ bash countitems.sh input
head-abc: 4
head-xyz: 3
Does this answer your question, @powerrox?
If you don't consider sed a high-level language, here's another approach:
for file in head-*; do
echo "$file: \c"
sed -n '/^head-/,${
/^head-/d
/^item[0-9]/!q
p
}
' <$file | wc -l
done
In English, the sed script does
Don't print by default
Within lines matching /^head-/ to end of file
Delete the "head line"
After that, quit if you find a non-item line
Otherwise, print the line
And wc -l to count lines.
