How to split large *.csv files with headers in Bash? - bash

I need to split a big *.csv file into several smaller ones. It currently has 661497 rows, and I need each file to have at most 40000. I tried a solution that I found on GitHub, but with no success:
FILENAME=/home/cnf/domains/cnf.com.pl/public_html/sklep/dropshipping-pliki/products-files/my_file.csv
HDR=$(head -1 ${FILENAME})
split -l 40000 ${FILENAME} xyz
n=1
for f in xyz*
do
if [[ ${n} -ne 1 ]]; then
echo ${HDR} > part-${n}-${FILENAME}.csv
fi
cat ${f} >> part-${n}-${FILENAME}.csv
rm ${f}
((n++))
done
The error I get:
/home/cnf/domains/cnf.com.pl/public_html/sklep/dropshipping-pliki/download.sh: line 23: part-1-/home/cnf/domains/cnf.com.pl/public_html/sklep/dropshipping-pliki/products-files/my_file.csv.csv: No such file or directory
thanks for help!

Keep in mind that FILENAME contains both a directory and a file name, so when the script later builds the new file name you get something like:
part-1-/home/cnf/domains/cnf.com.pl/public_html/sklep/dropshipping-pliki/products-files/tyre_8.csv.csv
One quick-n-easy fix would be to split the directory and filename into two separate variables, e.g.:
srcdir='/home/cnf/domains/cnf.com.pl/public_html/sklep/dropshipping-pliki/products-files'
filename='tyre_8.csv'
hdr=$(head -n 1 "${srcdir}/${filename}")
split -l 40000 "${srcdir}/${filename}" xyz
n=1
for f in xyz*
do
if [[ ${n} -ne 1 ]]; then
echo "${hdr}" > "${srcdir}/part-${n}-${filename}"
fi
cat "${f}" >> "${srcdir}/part-${n}-${filename}"
rm "${f}"
((n++))
done
NOTES:
consider using lowercase variables (using uppercase variables raises the possibility of problems if there's an OS variable of the same name)
wrap variable references in double quotes in case string contains spaces
don't need to add a .csv extension on the new filename since it's already part of $filename
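Putting the notes above together, here is a minimal sketch of the same idea wrapped in a function. The name split_csv and the temporary chunk- prefix are made up for illustration, and it differs slightly from the loop above: it splits only the data rows, so every part holds the header plus at most MAX_ROWS data rows (use 39999 if the header must count toward the 40000 limit).

```shell
#!/usr/bin/env bash
# Hypothetical helper: split_csv SRC MAX_ROWS
# Writes part-N-<basename of SRC> next to SRC; every part starts with the header row.
split_csv() {
    local src=$1 max_rows=$2
    local srcdir filename hdr tmpdir n=1 f
    srcdir=$(dirname -- "$src")
    filename=$(basename -- "$src")
    hdr=$(head -n 1 "$src")
    tmpdir=$(mktemp -d) || return 1
    # Split only the data rows (everything after line 1) into temporary chunks.
    tail -n +2 "$src" | split -l "$max_rows" - "$tmpdir/chunk-"
    for f in "$tmpdir"/chunk-*; do
        [[ -e $f ]] || continue   # glob did not match: the file had no data rows
        { printf '%s\n' "$hdr"; cat -- "$f"; } > "$srcdir/part-$n-$filename"
        ((n++))
    done
    rm -rf -- "$tmpdir"
}
```

Called as split_csv "$srcdir/$filename" 40000, this should leave part-1-tyre_8.csv, part-2-tyre_8.csv, and so on in the source directory without touching the original file.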

Related

Creating files in succession

How would one go about creating a script for creating 25 empty files in succession? (I.e 1-25, 26-51, 52-77)
I can create files 1-25 but I’m having trouble figuring out how to create a script that continues that process from where it left off, every time I run the script.
#!/bin/bash
higher=$( find files -type f -exec basename {} \; | sort -n | tail -1 )
if [[ "$higher" == "" ]]
then
start=1
end=25
else
(( start = higher + 1 ))
(( end = start + 25 ))
fi
echo "$start --> $end"
for i in $(seq $start 1 $end)
do
touch files/"$i"
done
I put my files in a directory called "files", hence the find on directory "files".
For each file found, I run basename on it. That returns only integer values, since all the files have a number as their filename.
sort -n puts them in order.
tail -1 extracts the highest number.
If there are no files, higher will be empty, so the indexes will be 1 and 25.
Otherwise, they will be higher + 1 and higher + 26.
I used seq for the for loop to avoid problems with variables inside a range definition (you did {1..25}); brace expansion happens before variable expansion, so {$start..$end} is not expanded as a range.
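As a variation on the steps above, the highest existing number can also be found with a plain glob instead of find, sort and tail. This is only a sketch: the function names next_start and create_batch are invented here, and unlike the script above it creates exactly the requested count (25 by default) on every run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number the next batch should start at,
# i.e. one more than the highest all-digit filename in directory $1.
next_start() {
    local dir=$1 max=0 f name
    for f in "$dir"/*; do
        name=${f##*/}
        [[ $name =~ ^[0-9]+$ ]] || continue   # skip non-numeric names and an unmatched glob
        (( 10#$name > max )) && max=$((10#$name))
    done
    echo $((max + 1))
}

# Hypothetical helper: create $2 empty files (default 25) in directory $1,
# continuing the numbering from where the previous run stopped.
create_batch() {
    local dir=$1 count=${2:-25} start i
    mkdir -p -- "$dir"
    start=$(next_start "$dir")
    for (( i = start; i < start + count; i++ )); do
        : > "$dir/$i"
    done
}
```

Running create_batch files twice should give files 1 to 25 on the first run and 26 to 50 on the second.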
#! /usr/bin/env bash
declare -r base="${1:-base-%d.txt}"
declare -r lot="${2:-25}"
declare -i idx=1
declare -i n=0
printf -v filename "${base}" ${idx}
while [[ -e "${filename}" ]]; do
idx+=1
printf -v filename "${base}" "${idx}"
done
while [[ $n -lt $lot ]]; do
printf -v filename "${base}" ${idx}
if [[ ! -e "${filename}" ]]; then
> "$filename"
n+=1
fi
idx+=1
done
This script accepts two optional parameters:
The first is the basename of your future files, with a %d token that is replaced by the file number. The default value is base-%d.txt.
The second is the number of files to create. The default value is 25.
How the script works:
Variable declarations
base: file basename (constant)
lot: number of files to create (constant)
idx: search index
n: counter for new files
Search for files already created, starting from 1
The loop stops at the first hole in the numbering
Loop to create empty files
The condition in the loop makes it possible to fill in the numbering holes
> filename creates an empty file

Why is "ls -1 $fl | wc -l" not returning value 0 in my for loop?

I am trying to add a condition in a for loop to check the existence of a file as well as check for file size > 0 KB.
Period file contains monthly data:
20180101
20180201
20180301
20180401
20180501
There are individual files created for each month. Suppose a file is not created for one month, (20180201), then the loop below should terminate.
For example:
xxx_20180101.txt
xxx_20180301.txt
xxx_20180401.txt
xxx_20180501.txt
if [[ $STATUS -eq 0 ]]; then
for per in `cat ${PATH}/${PERIOD}.txt | cut -f 1 -d";"`
do
for fl in `ls -1 ${PATH}/${FILE} | grep ${per}`
do
if [[ `ls -1 $fl | wc -l` -eq 0 ]]; then
echo "File not found"
STATUS=1
else
if [[ -s "$fl" ]]; then
echo "$fl contain data.">>/dev/null
else
echo "$fl File size is 0KB"
STATUS=1
fi
fi
done
done
fi
but ls -1 $fl | wc -l is not returning 0 value when the if condition is executed.
The following is a demonstration of what a best-practices rewrite might look like.
Note:
We do not (indeed, must not) use a variable named PATH to store a directory under which we look for data files; doing this overwrites the PATH environment variable used to find programs to execute.
ls is not used anywhere; it is a tool intended to generate output for human consumption, not machines.
Reading through input is accomplished with a while read loop; see BashFAQ #1 for more details. Note that the input source for the loop is established at the very end; see the redirection after the done.
Finding file sizes is done with stat -c here; for more options, portable to platforms where stat -c is not supported, see BashFAQ #87.
Because your filename format is well-formed (with an underscore before the substring from your input file, and a .txt after that substring), we're refining the glob to look only for names matching that restriction. This prevents a search for 001 from also finding xxx_0015.txt, xxx_5001.txt, etc.
#!/usr/bin/env bash
# ^^^^ -- NOT /bin/sh; this lets us use bash-only syntax
path=/provided/by/your/code # replacing buggy use of PATH in original code
period=likewise # replacing use of PERIOD in original code
shopt -s nullglob # generate a zero-length list for unmatched globs
while IFS=';' read -r per _; do
# populate an array with a list of files with names containing $per
files=( "$path/$period/"*"_${per}.txt" )
# if there aren't any, log a message and proceed
if (( ${#files[@]} == 0 )); then
echo "No files with $per found in $path/$period" >&2
continue
fi
# if they *do* exist, loop over them.
for file in "${files[@]}"; do
if [[ -s "$file" ]]; then
echo "$file contains data" >&2
if (( $(stat -c %s -- "$file") >= 1024 )); then
echo "$file contains 1kb of data or more" >&2
else
echo "$file is not empty, but is smaller than 1kb" >&2
fi
else
echo "$file is empty" >&2
fi
done
done < "$path/$period.txt"
Here's a refactoring of Mikhail's answer with the standard http://shellcheck.net/ warnings ironed out. I have not been able to understand the actual question well enough to guess whether this actually solves the OP's problem.
while IFS='' read -r per; do
if [ -e "xxx_$per.txt" ]; then
echo "xxx_$per.txt found" >&2
else
echo "xxx_$per.txt not found" >&2
fi
done <periods.txt
You are over-engineering here. Just iterate over the contents of the file with periods and search for each period in the list of files. Like this:
for per in `cat periods.txt`
do
if ls | grep -q "$per"; then
echo "$per found";
else
echo "$per not found"
fi
done
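On the original symptom: the inner for loop only iterates over names that ls and grep actually produced, so inside it ls -1 $fl | wc -l can never be 0; for a missing month the loop body simply never runs. If the goal is to stop at the first period whose file is missing or empty, a minimal sketch could look like the following. The function name check_periods and the fixed xxx_ prefix are assumptions taken from the example file names in the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check_periods PERIODS_FILE DATA_DIR
# Returns 0 if every period has a non-empty xxx_<period>.txt in DATA_DIR,
# and returns 1 at the first period whose file is missing or empty.
check_periods() {
    local periods_file=$1 data_dir=$2 per file
    while IFS=';' read -r per _; do
        [[ -n $per ]] || continue          # skip blank lines
        file="$data_dir/xxx_${per}.txt"
        if [[ ! -e $file ]]; then
            echo "File not found: $file" >&2
            return 1
        elif [[ ! -s $file ]]; then
            echo "File is empty: $file" >&2
            return 1
        fi
    done < "$periods_file"
    return 0
}
```

The return status can then take the place of the STATUS variable, for example: check_periods "$period_file" "$data_dir" || exit 1 (both variable names are placeholders).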

how can I merge two sets of text files with same suffix as file name?

I have two sets of files, the first set of files are:
apple_sweet_1.txt
apple_sweet_2.txt
apple_sweet_3.txt
Now, the second set of files I have are:
mango_sweet_1.txt
mango_sweet_2.txt
mango_sweet_3.txt
I want to cat the respective files in a bash loop so I could get something like this (of course, I don't want to do this individually):
cat apple_sweet_1.txt mango_sweet_1.txt > sweet_1.txt
cat apple_sweet_2.txt mango_sweet_2.txt > sweet_2.txt
cat apple_sweet_3.txt mango_sweet_3.txt > sweet_3.txt
You can use this for loop:
for i in apple_sweet_*.txt; do
p="${i#apple_}"
[[ -f "mango_$p" ]] && cat "$i" "mango_$p" > "$p"
done
bash solution:
for f in apple_sweet_*.txt; do
if [[ "$f" =~ _([0-9]+)\.txt$ ]]; then
idx=${BASH_REMATCH[1]} # getting file numeric index
mango_fn="mango_sweet_${idx}.txt" # related `mango` filename
[ -f "$mango_fn" ] && cat "$f" "$mango_fn" > "sweet_${idx}.txt"
fi
done

While loop does not execute

I currently have this code:
listing=$(find "$PWD")
fullnames=""
while read listing;
do
if [ -f "$listing" ]
then
path=`echo "$listing" | awk -F/ '{print $(NF)}'`
fullnames="$fullnames $path"
echo $fullnames
fi
done
For some reason, this script isn't working, and I think it has something to do with the way that I'm writing the while loop / declaring listing. Basically, the code is supposed to pull out the actual names of the files, i.e. blah.txt, from the find $PWD.
read listing does not read a value from the string listing; it sets the value of listing with a line read from standard input. Try this:
# Ignoring the possibility of file names that contain newlines
while read; do
[[ -f $REPLY ]] || continue
path=${REPLY##*/}
fullnames+=( "$path" )
echo "${fullnames[@]}"
done < <( find "$PWD" )
With bash 4 or later, you can simplify this with
shopt -s globstar
for f in **/*; do
[[ -f $f ]] || continue
paths+=( "$f" )
done
fullnames=( "${paths[@]##*/}" )
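If file names containing newlines (or other odd characters) are a concern, one option is to have find emit NUL-terminated names and read them with read -d ''. This is a sketch only: the function name collect_basenames is invented here, and -print0 is a GNU/BSD find extension rather than POSIX.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fill the global array "fullnames" with the basenames
# of all regular files under directory $1 (default: the current directory).
collect_basenames() {
    local dir=${1:-$PWD} file
    fullnames=()
    # NUL-delimited input is safe for any file name, including ones with newlines.
    while IFS= read -r -d '' file; do
        fullnames+=( "${file##*/}" )
    done < <(find "$dir" -type f -print0)
}
```

After calling collect_basenames "$PWD", the names can be printed one per line with printf '%s\n' "${fullnames[@]}".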

Renaming Multiples Files with Different Names in a Directory using shell

Most of the questions of this kind that I've found apply the same name change to the entire set of files in a directory.
But I'm faced with a situation where I need to give a different name to every file in the directory, or just add a different prefix.
For example, I have about 200 files in a directory, all of them with numbers in their filename. What I want to do is add a prefix from 1 to 200 to every file, like 1_xxxxxxxx.png, 2_xxxxxxxx.png, ..., 200_xxxxxxxx.png.
I'm trying this, but it doesn't increment my $i every time; rather, it gives a prefix of 1_ to every file.
echo "renaming files"
i=1 #initializing
j=ls -1 | wc -l #Count number of files in that dir
while [ "$i" -lt "$j" ] #looping
do
for FILE in * ; do NEWFILE=`echo $i_$FILE`; #swapping the file with variable $i
mv $FILE $NEWFILE #doing the actual rename
i=`expr $i+1` #increment $i
done
Thanks for any suggestion/help.
To increment with expr, you definitely need spaces (expr $i + 1), but you would probably be better off just doing:
echo "renaming files"
i=1
for FILE in * ; do
mv -- "$FILE" "$((i++))_$FILE"
done
i=1
for f in *; do
mv -- "$f" "${i}_$f"
i=$(($i + 1))
done
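One more detail from the question's code: in $i_$FILE the shell looks up a variable named i_, not i, because an underscore is a valid character in a variable name; writing ${i}_$FILE ends the name after i. A sketch that combines this with the loops above, wrapped in a function (number_files is a made-up name), might look like this:

```shell
#!/usr/bin/env bash
# Hypothetical helper: prefix every regular file in directory $1 with 1_, 2_, ...
number_files() {
    local dir=$1 i=1 f files
    files=( "$dir"/* )            # snapshot the list before renaming anything
    for f in "${files[@]}"; do
        [[ -f $f ]] || continue   # skip directories and an unmatched glob
        # ${i}_ uses braces so the variable name stops at i.
        mv -- "$f" "$dir/${i}_${f##*/}"
        i=$((i + 1))
    done
}
```

The numbers follow the glob's sort order, so which file gets which prefix depends on the existing names and the current locale.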

Resources