Find if null exists in csv file - bash

I have a CSV file. The file has some anomalies, as it contains some unknown characters.
The characters appear at line 1535 in popular editors (images attached below), but the sed command in the terminal does not show anything unusual for this line.
$ sed '1535!d' sample.csv
"sample_id","sample_column_text_1","sample_"sample_id","sample_column_text_1","sample_column_text_2","sample_column_text_3"
However, snapshots of the file in various editors (Sublime Text, Nano, and Vi) do show the characters.
The directory has various csv files that contain this character/chain of characters.
I need to write a bash script to determine which files contain such characters. How can I achieve this?

The following is from:
http://www.linuxquestions.org/questions/programming-9/how-to-check-for-null-characters-in-file-509377/
#!/usr/bin/perl -w
use strict;
my $null_found = 0;
foreach my $file (@ARGV) {
    if ( ! open(F, "<$file") ) {
        warn "couldn't open $file for reading: $!\n";
        next;
    }
    while (<F>) {
        if ( /\000/ ) {
            print "detected NULL at line $. in file $file\n";
            $null_found = 1;
            last;
        }
    }
    close(F);
}
exit $null_found;
If it works as desired, you can save it to a file, nullcheck.pl, and make it executable:
chmod +x nullcheck.pl
It seems to take an array of file names as input, but it will fail (exit non-zero) if it finds a NULL in any of them, so I'd only pass in one at a time. The command below is used to run the script.
for f in $(find . -type f -exec grep -Iq . {} \; -and -print) ; do perl ./nullcheck.pl "$f" || echo "$f has nulls"; done
The above find command is lifted from Linux command: How to 'find' only text files?

You can try grep and tr.
grep -P '\x00' filename
(GNU grep) will tell you whether the file contains NUL (\000) characters.
You can use tr to remove the NULs and produce a NUL-free file:
tr -d '\000' < file-with-nulls > file-without-nulls
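For the "which files in the directory contain such characters" part, here is a minimal bash sketch, assuming GNU grep built with PCRE support (-P) and using a *.csv glob as a placeholder for your file set:
#!/bin/bash
# list every CSV in the given directory (default: current) that contains a NUL byte
dir="${1:-.}"
for f in "$dir"/*.csv; do
    [ -f "$f" ] || continue
    if grep -qP '\x00' "$f"; then
        echo "$f contains NUL bytes"
    fi
done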

Related

Script for printing out file names and their number of appearance starting from a given folder

I need to write a shell script which, starting from a given folder name passed as an argument, prints out the names of the folders and files in it and how many times each name appears in the given folder.
Edit: I need to check only their names, without taking the file extensions into consideration.
#!/bin/bash
folder="$1"
for f in "$folder"
do
echo "$f"
done
And I would expect to see something like this (if I have 3 files with the same name and different extensions, like x.html, x.css, x.sh and so on, in a directory called dir)
x
3 times
after executing the script with dir (the name of the directory) as a parameter.
The find command already does most of this for you.
find . -printf "%f\n" |
sort | uniq -c
This will not work correctly if you have files whose names contain a newline.
If your find doesn't support -printf, maybe try
find . -exec basename {} \; |
sort | uniq -c
To restrict to just file names or directory names, add -type f or -type d, respectively, before the action (-exec or -printf).
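For example, to count only regular-file names (GNU find assumed, as with -printf above):
find . -type f -printf "%f\n" | sort | uniq -c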
If you genuinely want to remove extensions, try
find .... whatever ... |
sed 's%\.[^./]*$%%' |
sort | uniq -c
You can try this:
#!/bin/bash
IFS=$'\n' array=($(ls))
iter=0
for file in "${array[@]}"; do
    filename=$(basename -- "$file")
    extension="${filename##*.}"
    filename="${filename%.*}"
    filenamearray[$iter]=$filename
    iter=$((iter+1))
done
for filename in "${filenamearray[@]}"; do
    echo "$filename"
    grep -o "$filename" <<< "${filenamearray[@]}" | wc -l
done
You can try with find and awk:
find . -type f -print0 |
awk '
BEGIN {
    FS = "/"
    RS = "\0"
}
{
    k = split( $NF , b , "." )
    if ( k > 1 )
        sub ( "\\." b[k] "$" , "" , $NF )
    a[$NF]++
}
END {
    for ( i in a ) {
        j = a[i]>1 ? "s" : ""
        print i
        print a[i] " time" j
    }
}'

Bash - Search and Replace operation with reporting the files and lines that got changed

I have an input file "test.txt" as below:
hostname=abc.com hostname=xyz.com
db-host=abc.com db-host=xyz.com
In each line, the value before the space is the old value, which needs to be replaced by the new value after the space, recursively in a folder named "test". I am able to do this using the shell script below.
#!/bin/bash
IFS=$'\n'
for f in `cat test.txt`
do
OLD=$(echo $f| cut -d ' ' -f 1)
echo "Old = $OLD"
NEW=$(echo $f| cut -d ' ' -f 2)
echo "New = $NEW"
find test -type f | xargs sed -i.bak "s/$OLD/$NEW/g"
done
"sed" replaces the strings on the fly in 100s of files.
Is there a trick or an alternative way by which I can get a report of the files changed, like the absolute path of the file and the exact lines that got changed?
PS - I understand that sed and other stream editors don't support this functionality out of the box. I don't want to use versioning, as it would be overkill for this task.
Let's start with a simple rewrite of your script, to make it a little bit more robust at handling a wider range of replacement values, but also faster:
#!/bin/bash
# escape regexp and replacement strings for sed
escapeRegex() { sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$1"; }
escapeSubst() { sed 's/[&/\]/\\&/g' <<<"$1"; }
while read -r old new; do
    find test -type f -exec sed "s/$(escapeRegex "$old")/$(escapeSubst "$new")/g" -i '{}' \;
done <test.txt
So, we loop over pairs of whitespace-separated fields (old, new) in lines from test.txt and run a standard sed in-place replace on all files found with find.
Pretty similar to your script, but we properly read lines from test.txt (no word splitting, pathname/variable expansion, etc.), we use Bash builtins whenever possible (no need to call external tools like cat, cut, xargs); and we escape sed metacharacters in old/new values for proper use as sed's regexp and replacement expressions.
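As a quick illustration with a made-up value (and assuming the escapeRegex function above has been defined in your current shell), every character gets wrapped in a bracket expression, so sed metacharacters such as the dot lose their special meaning:
$ escapeRegex 'db-host=abc.com'
[d][b][-][h][o][s][t][=][a][b][c][.][c][o][m]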
Now let's add logging from sed:
#!/bin/bash
# escape regexp and replacement strings for sed
escapeRegex() { sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$1"; }
escapeSubst() { sed 's/[&/\]/\\&/g' <<<"$1"; }
while read -r old new; do
find test -type f -printf '\n[%p]\n' -exec sed "/$(escapeRegex "$old")/{
h
s//$(escapeSubst "$new")/g
H
x
s/\n/ --> /
w /dev/stdout
x
}" -i '{}' > >(tee -a change.log) \;
done <test.txt
The sed script above changes each old to new, but it also writes an old --> new line to /dev/stdout (Bash-specific), which we in turn append to the change.log file. The -printf action in find outputs a "header" line with the file name for each file processed.
With this, your "change log" will look something like:
[file1]
hostname=abc.com --> hostname=xyz.com
[file2]
[file1]
db-host=abc.com --> db-host=xyz.com
[file2]
db-host=abc.com --> db-host=xyz.com
Just for completeness, a quick walk-through of the sed script. We act only on lines containing the old value. For each such line, we store it in the hold space (h), change it to new, and append that new value to the hold space (joined with a newline, H), which now holds old\nnew. We swap hold and pattern space (x), so we can run the s command that converts it to old --> new. After writing that to stdout with w, we move the new value back from the hold space to the pattern space, so it gets written (in place) to the file being processed.
From man sed:
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if SUFFIX supplied)
This can be used to create a backup file when replacing. You can then look for any backup files, which indicate which files were changed, and diff those with the originals. Once you're done inspecting the diff, simply remove the backup files.
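A minimal sketch of that workflow, assuming the replacements have already been run with sed -i.bak (as in your original script):
# For every backup left behind, print the file's absolute path and the lines
# that differ from the backup, then delete the backup.
# (sed -i.bak makes a backup even for files it didn't change, hence the cmp filter.)
find test -type f -name '*.bak' | while read -r bak; do
    orig="${bak%.bak}"
    if ! cmp -s "$bak" "$orig"; then
        echo "=== $(readlink -f "$orig")"
        diff "$bak" "$orig"
    fi
    rm -- "$bak"
done
readlink -f (GNU coreutils) is used only to print the absolute path asked for in the question.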
If you formulate your replacements as sed statements rather than a custom format, you can go one step further and use either a sed shebang line or pass the file to -f/--file to do all the replacements in one operation.
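For example (a sketch only; replacements.sed is a hypothetical file holding one s/old/new/g command per line, generated from test.txt however you like):
# apply every substitution in a single pass over each file, keeping .bak backups
find test -type f ! -name '*.bak' -exec sed -i.bak -f replacements.sed '{}' +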
There are several problems with your script; just replace it all with the following (using GNU awk instead of GNU sed for in-place editing):
mapfile -t files < <(find test -type f)
awk -i inplace '
NR==FNR { map[$1] = $2; print; next }   # re-print test.txt so in-place editing does not empty it
{ for (old in map) gsub(old,map[old]); print }
' test.txt "${files[@]}"
You'll find that is orders of magnitude faster than what you were doing.
That still has the issues your existing script does: it fails when the "test.txt" strings contain regexp or backreference metacharacters, it can modify previously-modified strings, and it matches partial words. If any of that is an issue, let us know, as it's easy to work around with awk (and extremely difficult with sed!); one possible approach is sketched below.
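For reference, here is one way the literal-string case could be handled in the same framework. This is only a sketch: it sidesteps regexp and backreference metacharacters by scanning with index() instead of gsub(), but it still does not address partial matches or re-replacement of already-substituted text:
mapfile -t files < <(find test -type f)
awk -i inplace '
NR==FNR { map[$1] = $2; print; next }
{
    for (old in map) {
        # replace every occurrence of old literally, advancing past each replacement
        out = ""; s = $0
        while ((i = index(s, old)) > 0) {
            out = out substr(s, 1, i - 1) map[old]
            s = substr(s, i + length(old))
        }
        $0 = out s
    }
    print
}
' test.txt "${files[@]}"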
To get whatever kind of report you want, you just tweak the { for ... } block to print it, e.g. to print a record of the changes to stderr:
mapfile -t files < <(find test -type f)
awk -i inplace '
NR==FNR { map[$1] = $2; print; next }   # re-print test.txt so in-place editing does not empty it
{
    orig = $0
    for (old in map) {
        gsub(old,map[old])
    }
    if ($0 != orig) {
        printf "File %s, line %d: \"%s\" became \"%s\"\n", FILENAME, FNR, orig, $0 | "cat>&2"
    }
    print
}
' test.txt "${files[@]}"

Efficient way to find paths from a list of filenames

From a list of file names stored in a file f, what's the best way to find the relative path of each file name under dir, outputting this new list to file p? I'm currently using the following:
while read name
do
find dir -type f -name "$name" >> p
done < f
which is too slow for a large list, or a large directory tree.
EDIT: A few numbers:
Number of directories under dir: 1870
Number of files under dir: 80622
Number of filenames in f: 73487
All files listed in f do exist under dir.
The following piece of Python code does the trick. The key is to run find once and store the output in a hashmap, providing an O(1) way to get from a file name to the list of paths for that file name.
#!/usr/bin/env python
import os

file_names = open("f").readlines()
file_paths = os.popen("find . -type f").readlines()

# map each base name to the list of matching paths
# (os.path.basename keeps the trailing newline, so keys match file_names entries)
file_names_to_paths = {}
for file_path in file_paths:
    file_name = os.path.basename(file_path)
    if file_name not in file_names_to_paths:
        file_names_to_paths[file_name] = [file_path]
    else:
        file_names_to_paths[file_name].append(file_path)  # duplicate file name

out_file = open("p", "w")
for file_name in file_names:
    if file_name in file_names_to_paths:
        for path in file_names_to_paths[file_name]:
            out_file.write(path)
Try this perl one-liner:
perl -e '%H=map{chomp;$_=>1}<>;sub R{my($p)=@_;map R($_),<$p/*> if -d$p;($b=$p)=~s|.*/||;print"$p\n" if$H{$b}}R"."' f
1- create a hashmap whose keys are filenames : %H=map{chomp;$_=>1}<>
2- define a recursive subroutine to traverse directories : sub R{}
2.1- recursive call for directories : map R($_),<$p/*> if -d$p
2.2- extract the filename from the path : ($b=$p)=~s|.*/||
2.3- print if the hashmap contains the filename : print"$p\n" if$H{$b}
3- call R with the current directory as path : R"."
EDIT : to traverse hidden directories (.*)
perl -e '%H=map{chomp;$_=>1}<>;sub R{my($p)=@_;map R($_),grep !m|/\.\.?$|,<$p/.* $p/*> if -d$p;($b=$p)=~s|.*/||;print"$p\n" if$H{$b}}R"."' f
I think this should do the trick:
xargs locate -b < f | grep ^dir > p
Edit: I can't think of an easy way to prefix dir/*/ to the list of file names, otherwise you could just pass that directly to xargs locate.
Depending on what percentage of the directory tree is considered a match, it might be faster to find every file, then grep out the matching ones:
find "$dir" -type f | grep -f <( sed 's+\(.*\)+/\1$+' "$f" )
The sed command pre-processes your list of file names into regular expressions that will only match full names at the end of a path.
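For instance, a made-up name foo.conf in the list becomes the anchored pattern /foo.conf$ (note that the dot is still treated as a regexp metacharacter by grep):
$ echo 'foo.conf' | sed 's+\(.*\)+/\1$+'
/foo.conf$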
Here is an alternative using bash and grep
#!/bin/bash
flist(){
    for x in "$1"/*; do
        [ -d "$x" ] && flist "$x" || echo "$x"
    done
}
dir=/etc            #the directory you are searching
list=$(< myfiles)   #the file with file names
#format the list for grep
list="/${list//
/\$\|/}"
flist "$dir" | grep "$list"
...if you need full posix shell compliance (busybox ash, hush, etc...) replace the $list substring manipulation with a variant of chepner's sed and replace $(< file) with $(cat file)

Using bash: how do you find a way to search through all your files in a directory (recursively?)

I need a command that will help me accomplish what I am trying to do. At the moment, I am looking for all the ".html" files in a given directory, and seeing which ones contain the string "jacketprice" in any of them.
Is there a way to do this? And also, for the second (but separate) command, I will need a way to replace every instance of "jacketprice" with "coatprice", all in one command or script. If this is feasible feel free to let me know. Thanks
find . -name "*.html" -exec grep -l jacketprice {} \;
for i in `find . -name "*.html"`
do
sed -i "s/jacketprice/coatprice/g" "$i"
done
As for the second question,
find . -name "*.html" -exec sed -i "s/jacketprice/coatprice/g" {} \;
Use recursive grep to search through your files:
grep -r --include="*.html" jacketprice /my/dir
Alternatively turn on bash's globstar feature (if you haven't already), which allows you to use **/ to match directories and sub-directories.
$ shopt -s globstar
$ cd /my/dir
$ grep jacketprice **/*.html
$ sed -i 's/jacketprice/coatprice/g' **/*.html
Depending on whether you want this recursively or not, perl is a good option:
Find, non-recursive:
perl -nwe 'print "Found $_ in file $ARGV\n" if /jacketprice/' *.html
Will print the line where the match is found, followed by the file name. Can get a bit verbose.
Replace, non-recursive:
perl -pi.bak -we 's/jacketprice/coatprice/g' *.html
Will store original with .bak extension tacked on.
Find, recursive:
perl -MFile::Find -nwE '
BEGIN { find(sub { /\.html$/i && push @ARGV, $File::Find::name }, "/dir"); };
say $ARGV if /jacketprice/'
It will print the file name for each match. Somewhat less verbose might be:
perl -MFile::Find -nwE '
BEGIN { find(sub { /\.html$/i && push @ARGV, $File::Find::name }, "/dir"); };
$found{$ARGV}++ if /jacketprice/; END { say for keys %found }'
Replace, recursive:
perl -MFile::Find -pi.bak -we '
BEGIN { find(sub { /\.html$/i && push @ARGV, $File::Find::name }, "/dir"); };
s/jacketprice/coatprice/g'
Note: In all recursive versions, /dir is the top-level directory you wish to search. Also, if your perl version is less than 5.10, say can be replaced with print followed by a newline, e.g. print "$_\n" for keys %found.

Bash script optimisation

This is the script in question:
for file in `ls products`
do
echo -n `cat products/$file \
| grep '<td>.*</td>' | grep -v 'img' | grep -v 'href' | grep -v 'input' \
| head -1 | sed -e 's/^ *<td>//g' -e 's/<.*//g'`
done
I'm going to run it on 50000+ files, which would take about 12 hours with this script.
The algorithm is as follows:
Find only lines containing table cells (<td>) that do not contain any of 'img', 'href', or 'input'.
Select the first of them, then extract the data between the tags.
The usual bash text filters (sed, grep, awk, etc.) are available, as well as perl.
Looks like that can all be replaced by one gawk command:
gawk '
/<td>.*<\/td>/ && !(/img/ || /href/ || /input/) {
sub(/^ *<td>/,""); sub(/<.*/,"")
print
nextfile
}
' products/*
This uses the gawk extension nextfile.
If the wildcard expansion is too big, then
find products -type f -print | xargs gawk '...'
Here's some quick perl to do the whole thing that should be a lot faster.
#!/usr/bin/perl

process_files($ARGV[0]);

# process each file in the supplied directory
sub process_files($)
{
    my $dirpath = shift;
    my $dh;
    opendir($dh, $dirpath) or die "Cant readdir $dirpath. $!";
    # get a list of files
    my @files;
    do {
        @files = readdir($dh);
        foreach my $ent ( @files ){
            if ( -f "$dirpath/$ent" ){
                get_first_text_cell("$dirpath/$ent");
            }
        }
    } while ($#files > 0);
    closedir($dh);
}

# return the content of the first html table cell
# that does not contain img, href or input tags
sub get_first_text_cell($)
{
    my $filename = shift;
    my $fh;
    open($fh,"<$filename") or die "Cant open $filename. $!";
    my $found = 0;
    while ( ( my $line = <$fh> ) && ( $found == 0 ) ){
        ## capture html and text inside a table cell
        if ( $line =~ /<td>([&;\d\w\s"'<>]+)<\/td>/i ){
            my $cell = $1;
            ## omit anything with the following tags
            if ( $cell !~ /<(img|href|input)/ ){
                $found++;
                print "$cell\n";
            }
        }
    }
    close($fh);
}
Simply invoke it by passing the directory to be searched as the first argument:
$ perl parse.pl /html/documents/
What about this (should be much faster and clearer):
for file in products/*; do
grep -P -o '(?<=<td>).*(?=<\/td>)' "$file" | grep -vP -m 1 '(img|input|href)'
done
The for loop iterates over every file in products. Note the difference from your syntax.
The first grep outputs just the text between <td> and </td>, without those tags, for every cell, as long as each cell is on a single line.
Finally, the second grep outputs just the first of those lines that doesn't contain img, href or input (which is what I believe you wanted to achieve with that head -1), and exits right then, reducing the overall time and letting the next file be processed sooner.
I would have loved to use just a single grep, but then the regex would be really awful. :-)
Disclaimer: of course I haven't tested it
