Assign content of a file to an array in bash

I have a file that contains parts of file names, separated by newlines (or by spaces). Let's take the following example:
cat file.txt
1
4
500
The actual file names are file_1.dat, file_2.dat, file_3.dat, file_4.dat, file_500.dat, and so on.
I want to combine only those files whose names (or part of the names) are stored in file.txt.
To do so I am doing the following:
## read the file and assign to an array
array=()
while IFS= read -r line; do
array+=($line)
done < file.txt
## combine the contents of the files
for file in ${array[@]}
do
cat "file_$file.dat"
done > output.dat
Now, what I don't like about this solution is the array assignment: I have to run a loop just for that.
I tried to use
mapfile -t array < <(cat file.txt)
I also tried,
array=( $(cat file.txt) )
The array that is finally needed is
array=(1 4 500)
In some of the answers (on this platform), I have seen that the last approach might be harmful. I would like some clarification on what to do for such assignments.
My question is: in this situation what is the best (safe and fast) way to assign the content of a file into an array?

array=( $(cat file.txt) )
does not necessarily put each line in the array. It puts each word resulting from word-splitting and globbing into the array.
Consider this file
1
2 3
*
mapfile -t array < file.txt will create an array with the elements 1, 2 3, and *.
array=( $(cat file.txt) ) will create an array with the elements 1, 2, and 3, along with an element for each file name in the current directory.
Using mapfile is both safer and makes your intent of storing one line per element clearer.
However, there is no need for an array at all. You can process each file as you pull a line from your input file.
while IFS= read -r line; do
cat "file_$line.dat"
done < file.txt > output.dat
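If you do want to keep the array around (for example to reuse the fragment list later), a minimal sketch of the mapfile-based version, using the file names from the question, could be:
mapfile -t array < file.txt
for fragment in "${array[@]}"; do
    cat "file_$fragment.dat"
done > output.dat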

If you don’t want to deduplicate the file name fragments:
readarray -t lines < file.txt
declare -a fragments
for line in "${lines[@]}"; do
fragments+=($line)
done
names=("${fragments[@]/#/file_}")
names=("${names[@]/%/.dat}")
cat "${names[@]}"
If you do want to deduplicate the file name fragments:
readarray -t lines < file.txt
declare -Ai set_of_fragments
for line in "${lines[@]}"; do
for fragment in $line; do
((++set_of_fragments["${fragment}"]))
done
done
fragments=("${!set_of_fragments[@]}")
names=("${fragments[@]/#/file_}")
names=("${names[@]/%/.dat}")
cat "${names[@]}"

Related

BASH loop to change data from 1 csv from other csv

Trying to change the value of a column based on another column in another csv.
So let's say we have a CSV_1 with over 1000 lines and 3 columns:
shape Color size
round 2 big
triangle 1 small
square 3 medium
Then we have a CSV_2 that has only 10 lines, with the following information:
color
1 REd
2 Blue
3 Yellow
etc
Now I want to replace the value in the Color column of CSV_1 with the name of the color from CSV_2.
So in other words, something like:
for (i=0; i<column.color(csv1); i++) {
if color.csv1=1; then
substitute with color.csv2=1 }
so that the loop iterates over the whole Color column of CSV_1 and replaces the values with the values from CSV_2.
An explicit loop for this would be very slow in bash. Use a command that does the line-wise processing for you.
sed 's/abc/xyz/' searches for abc in each line and replaces it with xyz. Use this to search for the numbers in your 2nd column and replace them with the names from your 2nd file. The sed command can be generated automatically from the 2nd file using another sed command:
The following script assumes a CSV file without spaces around the delimiting ,.
sed -E "$(sed -E '1d;s#^([^,]*),(.*)#s/^([^,]*,)\1,/\\1\2,/#' 2.csv)" 1.csv
Interactive Example
$ cat 1.csv
shape,Color,size
round,2,big
triangle,1,small
square,3,medium
$ cat 2.csv
color
1,REd
2,Blue
3,Yellow
$ sed -E "$(sed -E '1d;s#^([^,]*),(.*)#s/^([^,]*,)\1,/\\1\2,/#' 2.csv)" 1.csv
shape,Color,size
round,Blue,big
triangle,REd,small
square,Yellow,medium
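To see what the outer sed actually runs, the inner command substitution can be executed on its own; with the sample 2.csv it should generate one substitution command per color:
$ sed -E '1d;s#^([^,]*),(.*)#s/^([^,]*,)\1,/\\1\2,/#' 2.csv
s/^([^,]*,)1,/\1REd,/
s/^([^,]*,)2,/\1Blue,/
s/^([^,]*,)3,/\1Yellow,/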
Here is one approach, using mapfile (a bash 4+ feature) and some common utilities on linux/unix, assuming both files are delimited with a comma:
#!/usr/bin/env bash
mapfile -t colors_csv2 < csv2.csv
head -n1 csv1.csv
while IFS=, read -r shape_csv1 color_csv1 size_csv1; do
for color_csv2 in "${colors_csv2[@]:1}"; do
if [[ $color_csv1 == ${color_csv2%,*} ]]; then
printf '%s,%s,%s\n' "$shape_csv1" "${color_csv2#*,}" "$size_csv1"
fi
done
done < <(tail -n +2 csv1.csv)
This would be very slow on a large set of data/files.
If ed is available/acceptable, with the bash shell:
#!/usr/bin/env bash
ed -s csv1.csv < <(
printf '%s\n' '1d' $'g|.|s|,|/|\\\ns|^|,s/|\\\ns|$|/|' '$a' ',p' 'Q' . ,p |
ed -s csv2.csv
)
To add to @Jetchisel's interesting answer, here is an old-bash way to achieve this. It should work with bash release 2, since it only relies on escape literals, indexed arrays, string expansion and indirect variable references. It assumes that the color keys in csv2.csv are always numeric. Add shopt -s compat31 at the beginning to test it the 'old way' with a recent bash. You can also replace declare -a csv2 with a Bash 4+ declare -A csv2 for an associative array, in which case the key can be anything.
#!/bin/bash
declare -a csv2
esc=$'\x1B'
while read -r colors; do
if [ "${colors}" ] ; then
colors="${colors// /${esc}}"
set ${colors//,/ }
if [ "$1" ] ; then
csv2["$1"]="$2"
fi
fi
done < csv2.csv
while read -r output; do
if [ "${output}" ] ; then
outputfilter="${output// /${esc}}"
set ${outputfilter//,/ }
if [ "$2" ] ; then
color="${csv2["$2"]}"
[ "${color}" ] && { tmp="$1,${color},$3";output="${tmp//${esc}/ }"; };
fi
echo "${output}"
fi
done < csv1.csv

How to only concatenate files with same identifier using bash script?

I have a directory with files; some share the same ID, which is always the first part of the file name, before the first underscore, e.g.:
S100_R1.txt
S100_R2.txt
S111_1_R1.txt
S111_R1.txt
S111_R2.txt
S333_R1.txt
I want to concatenate the files with identical IDs (and, if possible, place the original files in another dir). Expected output:
original files (folder)
S100_merged.txt
S111_merged.txt
S333_R1.txt
Small note: I imagine a possible solution would be to place all files to be processed by the code in a new directory, and then in a second step move the files with the appended "merged" back to the original dir, or something like that...
I am extremely new to bash scripting, so I really can't produce this code. I am used to the R language and can think through how it should work, but I can't write it.
My pitiful attempt is something like this:
while IFS= read -r -d '' id; do
cat *"$id" > "./${id%.txt}_grouped.txt"
done < <(printf '%s\0' *.txt | cut -zd_ -f1- | sort -uz)
or this:
for ((k=100;k<400;k=k+1));
do
IDList= echo "S${k}_S*.txt" | awk -F'[_.]' '{$1}'
while [ IDList${k} == IDList${k+n} ]; do
cat IDList${k}_S*.txt IDList${k+n}_S*.txt S${k}_S*.txt S${k}_S*.txt >cat/S${k}_merged.txt &;
done
Sometimes there is only one version of a file (e.g. S333_R1.txt), sometimes two (S100*), three (S111*), or more.
I am prepared for harsh critique for this question because I am so far from a solution, but if someone would be willing to help me out I would greatly appreciate it!
while read -r line;
do
if [[ "$(find . -maxdepth 1 -name "${line}_*.txt" | wc -l)" -gt "1" ]]
then
cat "${line}"_*.txt >> "${line}_merged.txt"
fi
done <<< "$(for i in *_*.txt;do echo "$i";done | awk -F_ '{ print $1 }' | sort -u)"
List the files matching *_*.txt and pipe the names into awk, printing the string before the first "_"; sort -u removes duplicate prefixes. Run the result through a while loop: for each prefix, check with find whether the number of matching files is greater than 1, and if so, cat the files with that prefix into a merged file.
for id in $(ls | grep -Po '^[^_]+' | uniq) ; do
if [ $(ls ${id}_*.txt 2> /dev/null | wc -l) -gt 1 ] ; then
cat ${id}_*.txt > _${id}_merged.txt
mv ${id}_*.txt folder
fi
done
for f in _*_merged.txt ; do
mv ${f} ${f:1}
done
A plain bash loop with preprocessing:
# first get the list of files
find . -type f |
# then extract the prefix
sed 's#./\([^_]*\)_#\1\t&#' |
# then in a loop merge the files
while IFS=$'\t' read prefix file; do
cat "$file" >> "${prefix}_merged.txt"
done
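With the question's example files, the sed stage should emit lines like the following: the prefix, a tab, then the original path (note that \t in the replacement is a GNU sed extension):
S100	./S100_R1.txt
S100	./S100_R2.txt
S111	./S111_1_R1.txt
S111	./S111_R1.txt
S111	./S111_R2.txt
S333	./S333_R1.txt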
That script is iterative: one file at a time. To detect whether there is only one file with a specific prefix, we have to look at all the files at once. So first an awk script joins the filenames that share a common prefix:
find . -type f | # maybe `sort |` ?
# join filenames with common prefix
awk '{
f=$0; # remember the file path
gsub(/.*\//,"");gsub(/_.*/,""); # extract prefix from filepath and store it in $0
a[$0]=a[$0]" "f # Join path with leading space in associative array indexed with prefix
}
# Output prefix and filenames separated by spaces.
# TBH a tab would be a better separator..
END{for (i in a) print i a[i]}
' |
# Read input separated by spaces into a bash array
while IFS=' ' read -ra files; do
#first array element is the prefix
prefix=${files[0]}
unset files[0]
# rest is the files
case "${#files[#]}" in
0) echo super error; ;;
# one file - preserve the filename
1) cat "${files[#]}" > "$outdir"/"${files[1]}"; ;;
# more files - do a _merged.txt suffix
*) cat "${files[#]}" > "$outdir"/"${prefix}_merged.txt"; ;;
esac
done
Tested on repl.
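For the question's example files, the awk stage would output one line per prefix (the order depends on find and on awk's array iteration), something like:
S100 ./S100_R1.txt ./S100_R2.txt
S111 ./S111_1_R1.txt ./S111_R1.txt ./S111_R2.txt
S333 ./S333_R1.txt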
IDList= echo "S${k}_S*.txt"
This executes the command echo with the environment variable IDList set to the empty string, and with one argument equal to S<value of k>_S*.txt.
Filename expansion (i.e. * -> list of files) is not performed inside double quotes.
To assign the result of a command to a variable, use command substitution: var=$( something | something_else )
IDList${k+n}_S*.txt
${var+pattern} is a parameter expansion; it does not add two variables together. It expands to pattern when var is set and to nothing when var is unset. See shell parameter expansion and my answer on ${var-pattern}, which is similar.
To add two numbers, use arithmetic expansion: $((k + n)).
awk -F'[_.]' '{$1}'
$1 on its own is not a useful statement here. To print the first field, use {print $1}.
Remember to check your scripts with http://shellcheck.net
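Putting those corrections together, a small hedged sketch (k and n are the asker's counters; the values are made up for illustration):
k=100; n=5
IDList=$(echo S${k}_*.txt)                      # command substitution; the unquoted glob expands if files match
echo "$((k + n))"                               # arithmetic expansion: prints 105
awk -F'[_.]' '{ print $1 }' <<< "S100_R1.txt"   # prints S100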
A pure bash way is below. It uses only globs (no need for external commands like ls or find here) to enumerate filenames, and an associative array (supported by bash since version 4.0) to count the frequency of each id. Parsing ls output to list files is questionable in bash; you may want to read ParsingLs.
#!/bin/bash
backupdir=original_files # The directory to move the original files
declare -A count # Associative array to hold id counts
# If it is assumed that the backup directory exists prior to call, then
# drop the line below
mkdir "$backupdir" || exit
for file in [^_]*_*; do ((++count[${file%%_*}])); done
for id in "${!count[@]}"; do
if ((count[$id] > 1)); then
mv "$id"_* "$backupdir"
cat "$backupdir/$id"_* > "$id"_merged.txt
fi
done
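With the question's example directory, a run of the script above should leave something like this (hypothetical listing):
$ ls
S100_merged.txt  S111_merged.txt  S333_R1.txt  original_files
$ ls original_files
S100_R1.txt  S100_R2.txt  S111_1_R1.txt  S111_R1.txt  S111_R2.txt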

parsing many variables to another file contain many rows

I have an issue appending many values, extracted with the cut command, to another file that contains many rows. I need to append the values to the end of the rows, in sequence.
Example: the file 100.txt contains 1000 rows with 3 fields A,B,C.
Another file called pin contains 1000 rows with a single field, e.g. 2222.
I need to take the values one by one and insert each at the end of the corresponding row of the 100.txt file.
while IFS= read -r line; do
sed -i "/:[0-9]*$/ ! s%$%,$line%" "100.txt"
done < pin.txt
What I have got:
1,2,3,2222,3333
1,2,3,2222,3333
What I expected:
1,2,3,2222
1,2,3,3333
If both files have the same number of lines, paste is your friend:
paste -d, 100.txt pin.txt > tmp.txt
mv -f tmp.txt 100.txt
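A quick illustration, assuming 100.txt holds the two rows implied by the question:
$ cat 100.txt
1,2,3
1,2,3
$ cat pin.txt
2222
3333
$ paste -d, 100.txt pin.txt
1,2,3,2222
1,2,3,3333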
Here is how I would do it, using a while read loop without sed:
while IFS= read -r file1 <&3; do
IFS= read -r file2
printf '%s,%s\n' "$file1" "$file2"
done 3<100.txt < pin.txt
Using mapfile (bash 4+ only):
mapfile -t file1 < 100.txt
mapfile -t file2 < pin.txt
for i in "${!file1[@]}"; do
printf '%s,%s\n' "${file1[$i]}" "${file2[$i]}"
done
Of course, these shell loops would be very slow on large files.

issue with bash : 2 variables instead of one

I've written this piece of code.
The aim is the following:
for each file in the temp list, it should take the first occurrence from the list and put it into a variable called $name1, and then the second occurrence into a second variable called $name2. The variables are file names. With the 2 variables, I do a join.
for files in $(cat temp.lst); do
if [ $(cat temp.lst | wc -l) == 1 ]
then
name=$(head -1 temp.lst)
join -t\; -j 1 file_minus1.txt "$name" | sed 's/;;/;/g' > file1.txt
else
name1=$(head -1 temp.lst)
name2=$(head -2 temp.lst)
echo "var1 "$name1 "var2 "$name2
sed '1,2d' temp.lst > tmpfile.txt
mv tmpfile.txt temp.lst
join -t\; -j 1 "$name1" "$name2" | sed 's/;;/;/g' > file_minus1.txt
fi
;done
Theoretically, it should work but here it is not working, alas.
The echo line I've put in my code is giving me 3 variables instead of 2
var1 ei_etea17_m.tsv var2 ei_etea17_m.tsv ei_eteu25_m.tsv
Worse, the join is not functioning the way I thought, giving me this error message instead:
join: ei_etea17_m.tsv
ei_eteu25_m.tsv: No such file or directory
Please find a sample of my temp.lst
ei_eteu27_m.tsv
ei_eteu28_m.tsv
ei_isbr_m.tsv
ei_isbu_m.tsv
ei_isin_m.tsv
Any suggestions are welcome.
Best.
To extract 2 lines of a file in a loop, try this:
paste - - < temp.lst |
while read name1 name2; do
if [[ -z $name2 ]]; then
name2=$name1
name1=file_minus1.txt
output=file1.txt
else
output=file_minus1.txt
fi
join -t\; "$name1" "$name2" | sed 's/;;/;/g' > $output
done
Notes
the paste command takes 2 consecutive lines from the file and joins them into a single line (separated by tab)
demo: seq 7 | paste - - (output shown after these notes)
read can assign to multiple variables: the line will be split on whitespace (default) and assigned to the named variables.
in the loop body, I basically follow your logic
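For reference, the paste demo from the first note produces (tab-separated):
$ seq 7 | paste - -
1	2
3	4
5	6
7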
To perform an n-way join, use recursion :)
recursive_join () {
# Zero files: do nothing (special case)
# One file: output it
# Multiple files: join the first with the result of joining the rest
file1=$1
shift || return
[ "$#" -eq 0 ] && cat "$file1" ||
recursive_join "$#" | join -t\; -j1 "$file1" -
}
recursive_join ei_eteu27_m.tsv ei_eteu28_m.tsv ei_isbr_m.tsv ei_isbu_m.tsv ei_isin_m.tsv
Adapting this to use a file listing the input files, rather than command-line arguments, is a little trickier. As long as none of the input file names contain whitespace or other special characters, you could simply use
recursive_join $(cat temp.lst)
Or, if you want to avail yourself of bash features, you could use an array:
while read; do files+=("$REPLY"); done < temp.lst
recursive_join "${files[#]}"
or in bash 4:
readarray -t files < temp.lst
recursive_join "${files[@]}"
However, if you want to stick with standard shell scripting only, it would be better to modify the recursive function to read the input file names from standard input. This makes the function a little uglier, since in order to detect if there is only one file left on standard input, we have to try to read a second one, and put it back on standard input if we succeed.
recursive_join () {
IFS= read -r file1 || return
IFS= read -r file2 &&
{ echo "$file2"; cat; } | recursive_join | join -t\; -j1 "$file1" - ||
cat "$file1"
}
recursive_join < temp.lst
Creating a function that can take either command-line arguments or read a list from standard input is left as an exercise for the reader.
Variable name1 is getting the first line.
Variable name2 is getting the first two lines.
If you want name2 to have only the second line you could try something like:
name2=$(sed -n '2p' temp.lst)
Also sed -i will remove the need for tmpfile.txt.
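A small sketch combining both suggestions (the file name temp.lst comes from the question):
name1=$(sed -n '1p' temp.lst)
name2=$(sed -n '2p' temp.lst)
sed -i '1,2d' temp.lst    # in-place edit, so tmpfile.txt is no longer needed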
Ok Gents or Ladies.
I found out the Why.
head -1 temp.lst is only giving the file name without the extension.
I need to find a way to include the extension. Doable.

how to read file from line x to the end of a file in bash

I would like to know how I can read each line of a csv file, from the second line to the end of the file, in a bash script.
I know how to read a file in bash:
while read line
do
echo -e "$line\n"
done < file.csv
But, I want to read the file starting from the second line to the end of the file. How can I achieve this?
tail -n +2 file.csv
From the man page:
-n, --lines=N
output the last N lines, instead of the last 10
...
If the first character of N (the number of bytes or lines) is a '+', print beginning with the Nth item from the start of each file, otherwise, print the last N items in the file.
In English this means that:
tail -n 100 prints the last 100 lines
tail -n +100 prints all lines starting from line 100
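A quick check with seq:
$ seq 5 | tail -n +2
2
3
4
5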
Simple solution with sed:
sed -n '2,$p' <thefile
where 2 is the number of the line you want to start reading from.
Or else (pure bash)...
{ for ((i=1;i--;));do read;done;while read line;do echo $line;done } < file.csv
Better written:
linesToSkip=1
{
for ((i=$linesToSkip;i--;)) ;do
read
done
while read line ;do
echo $line
done
} < file.csv
This works even if linesToSkip == 0 or linesToSkip is greater than the number of lines in file.csv.
Edit:
Changed () to {} as gniourf_gniourf encouraged me to consider: the first syntax generates a sub-shell, while {} doesn't.
Of course, for skipping only one line (as in the original question's title), the loop for ((i=1;i--;));do read;done can simply be replaced by read:
{ read;while read line;do echo $line;done } < file.csv
There are many solutions to this. One of my favorites is:
(head -1 > /dev/null; whatever_you_want_to_do) < file.txt
You can also use tail to skip the lines you want:
tail -n +2 file.txt | whatever_you_want_to_do
Depending on what you want to do with your lines: if you want to store each selected line in an array, the best choice is definitely the builtin mapfile:
numberoflinestoskip=1
mapfile -s $numberoflinestoskip -t linesarray < file
will store each line of file file, starting from line 2, in the array linesarray.
help mapfile for more info.
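As a quick sanity check of what ended up in the array:
mapfile -s 1 -t linesarray < file.csv
printf '%s\n' "${linesarray[@]}"    # prints file.csv from line 2 onward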
If you don't want to store each line in an array, well, there are other very good answers.
As F. Hauri suggests in a comment, this is only applicable if you need to store the whole file in memory.
Otherwise, your best bet is:
{
read; # Just a scratch read to get rid (pun!) of the first line
while read line; do
echo "$line"
done
} < file.csv
Notice: there's no subshell involved/needed.
This will work
i=1
while read line
do
test $i -eq 1 && ((i=i+1)) && continue
echo -e "$line\n"
done < file.csv
I would just use a counter variable.
#!/bin/bash
i=0
while read line
do
if [ $i != 0 ]; then
echo -e $line
fi
i=$((i+1))
done < "file.csv"
UPDATE: the above checks the $i variable on every line of the csv. So if you have a very large csv file with millions of lines, it will eat a significant amount of CPU cycles; no good for Mother Nature.
The following one-liner uses sed to delete the very first line of the CSV file and then feeds the remaining lines to the while loop:
sed 1d file.csv | while read d; do echo $d; done
