I have a CSV file like:
col1,col2
A,100foo
A,104foo
B,110bar
C,111bar
Now I have a search string
B,112
which should return the line:
B,110bar
Or a search string
A,103
which should return A,100foo.
So I am always looking for the line 'smaller' than the search string.
The second column is not a number, so I cannot do math operations on it.
What I need is more like an 'inaccurate search'.
Can I do that in Bash?
The file can be sorted alphabetically, so I was thinking of something grep-like and then taking the line before the first match.
It's not really clear how inaccurate the search is allowed to be.
Would searching for all lines that begin with the same first character as the search string do?
str="B,110"
grep "^${str:0:1}" csvfile
Or are there more requirements on the format of the line?
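If the file really is sorted, as the question suggests, one way to sketch the "take the line before" idea is to stop at the first line that sorts after the search string and print the line remembered just before it. This is only a sketch, assuming the header line should be skipped and comparing the numeric prefix of the second column; the file name csvfile and the key value are placeholders:
key="B,112"
awk -F, -v key="$key" '
    BEGIN { split(key, k, ",") }    # k[1] = first column of the key, k[2] = its number
    NR == 1 { next }                # skip the col1,col2 header
    { n = $2 + 0 }                  # numeric prefix of column 2 ("110bar" -> 110)
    $1 > k[1] || ($1 == k[1] && n > k[2] + 0) { exit }   # first line past the key: stop
    { last = $0 }                   # remember the latest line not past the key
    END { if (last != "") print last }                   # prints B,110bar for key B,112
' csvfile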
So I have a text file that contains a large number of lines. Each line is one long string with no spacing; however, the line contains several pieces of information. The program knows how to differentiate the important information in each line. The program identifies the first 4 numbers/letters of the line as corresponding to a specific instrument. Here is a small example portion of the text file.
example text file
1002IPU3...
POIPIPU2...
1435IPU1...
1812IPU3...
BFTOIPD3...
1435IPD2...
As you can see, there are two lines that contain 1435 within this text file, which corresponds to a specific instrument. However, these lines are not identical. The program I'm using cannot do its calculation if there are duplicates of the same station (i.e., there are two 1435* stations). I need to find a way to search through my text files and identify whether there are any duplicates of the partial strings that represent the stations within the file, so that I can delete one or both of the duplicates. If I could have a Bash script output the line numbers of the duplicates and what the duplicate lines say, that would be appreciated. I think there might be an easy way to do this, but I haven't been able to find any examples. Your help is appreciated.
If all you want to do is detect if there are duplicates (not necessarily count or eliminate them), this would be a good starting point:
awk '{ if (++seen[substr($0, 1, 4)] > 1) printf "Duplicates found : %s\n",$0 }' inputfile.txt
For that matter, it's a good starting point for counting or eliminating, too; it'll just take a bit more work...
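For example, here is a minimal sketch of the eliminating side, keeping only the first line seen for each 4-character prefix (the output file name deduped.txt is just a placeholder):
awk '!seen[substr($0, 1, 4)]++' inputfile.txt > deduped.txt    # print a line only the first time its prefix appears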
If you want the count of duplicates:
awk '{a[substr($0,1,4)]++} END {for (i in a) {if(a[i]>1) print i": "a[i]}}' test.in
1435: 2
or:
{
    a[substr($0,1,4)]++    # put prefixes to array and count them
}
END {                      # in the end
    for (i in a) {         # go thru all indexes
        if (a[i] > 1) print i": "a[i]    # and print out the duplicate prefixes and their counts
    }
}
Slightly roundabout, but this should work:
cut -c 1-4 file.txt | sort -u > list
for i in `cat list`;
do
    echo -n "$i "
    grep -c ^"$i" file.txt    # This tells you how many occurrences of each 'station'
done
Then you can do whatever you want with the ones that occur more than once.
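For instance, to narrow that down to just the stations that occur more than once, you could pipe the loop's output through awk; a small sketch building on the loop above:
for i in `cat list`;
do
    echo -n "$i "
    grep -c ^"$i" file.txt
done | awk '$2 > 1'    # keep only stations whose count is greater than 1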
Use the following Python script (Python 2.7 syntax):
#!/usr/bin/python
file_name = "device.txt"
f1 = open(file_name, 'r')
device = {}
line_count = 0
for line in f1:
    line_count += 1
    if device.has_key(line[:4]):
        device[line[:4]] = device[line[:4]] + "," + str(line_count)
    else:
        device[line[:4]] = str(line_count)
f1.close()
print device
Here the script reads each line, takes the first 4 characters of the line as the device name, and builds a dictionary whose keys are the device names and whose values are the line numbers where that device name is found.
The output would be:
{'POIP': '2', '1435': '3,6', '1002': '1', '1812': '4', 'BFTO': '5'}
This might help you out!
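If you'd rather stay in the shell, here is a rough awk sketch of the same idea, mapping each 4-character prefix to the line numbers where it appears (assuming the same device.txt):
awk '{ k = substr($0, 1, 4)                       # first 4 characters are the device name
       if (k in lines) lines[k] = lines[k] "," NR
       else            lines[k] = NR }
     END { for (d in lines) print d ": " lines[d] }' device.txt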
I'm working on a CSV file like the one below: comma delimited, each cell enclosed in double quotes, but some cells contain a double quote and/or comma inside the double-quote enclosure. The actual file contains around 300 columns and 200,000 rows.
"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with "comma" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, "cde" here","cde","cde","cde"
I need to remove some unneeded columns, merge the last few columns (with </br> between them instead of ","), and move the second column to the end. Anything within the cells should stay the same, with the double quotes and commas of the original file. Below is an example of the output that I need.
"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, "cde" here","cde</br>cde</br>cde","cde"
In this example I want to remove column 3 and merge columns 5, 6, and 7.
Below is the code that I tried to use, but it interprets the double quotes and/or commas, and therefore where a row ends, differently than I expected.
awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$4,$5"</br>"$6"</br>"$7,$2}' inputfile.csv
sed -i 's#"</br>"#</br>#g' inputfile.csv
sed is used to remove the beginning and ending double quotes of a cell.
In the output file that I'm getting right now, if the previous field contains a double quote, that quote is treated as the beginning of a cell, so the following values are often pushed up a column.
Other code that I have used treats every comma as the beginning of a cell, so that won't work either.
awk -F',' 'BEGIN{OFS=",";} {print $1,$4,$5"</br>"$6"</br>"$7,$2}' inputfile.csv
sed -i 's#"</br>"#</br>#g' inputfile.csv
Any help is greatly appreciated. Thanks!
CSV is a loose format. There may be subtle variations in formatting. Your particular format may or may not be expressible with a regular grammar/regular expression. (See this question for a discussion about this.) Even if your particular formatting can be expressed with regular expressions, it may be easier to just whip out a parser from an existing library.
It is not a bash/awk/sed solution as you may have wanted or needed, but Python has a csv module for parsing CSV files. There are a number of options to tweak the formatting. Try something like this:
#!/usr/bin/python
import csv

with open('infile.csv', 'r') as infile, open('outfile.csv', 'wb') as outfile:
    inreader = csv.reader(infile)
    outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for row in inreader:
        # Merge fields 5,6,7 (indexes 4,5,6) into one
        row[4] = "</br>".join(row[4:7])
        del row[5:7]
        # Copy second field to the end
        row.append(row[1])
        # Remove second and third fields
        del row[1:3]
        # Write manipulated row
        outwriter.writerow(row)
Note that in Python, indexes start with 0 (e.g. row[1] is the second field). The first index of a slice is inclusive, the last is exclusive (row[1:3] is row[1] and row[2] only). Your formatting seems to require quotes around every field, hence the quoting=csv.QUOTE_ALL. There are more options at Dialects and Formatting Parameters.
The above code produces the following output:
"Column1","Column4","Column5</br>Column6</br>Column7","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, cde"" here""","cde</br>cde</br>cde","cde"
There are two issues with this:
It doesn't treat the first row any differently, so the headers of columns 5, 6, and 7 are merged like the other rows.
Your input CSV contains "some other, "cde" here" (third row, fourth column) with unescaped quotes around the cde. There is another case of this on line two, but it was removed since it is in column 3. The result contains incorrect quotes.
If these quotes are properly escaped, your sample input CSV file becomes
infile.csv (escaped quotes):
"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with ""comma"" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, ""cde"" here","cde","cde","cde"
Now consider this modified Python script that doesn't merge columns on the first row:
#!/usr/bin/python
import csv

with open('infile.csv', 'r') as infile, open('outfile.csv', 'wb') as outfile:
    inreader = csv.reader(infile)
    outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    first_row = True
    for row in inreader:
        if first_row:
            first_row = False
        else:
            # Merge fields 5,6,7 (indexes 4,5,6) into one
            row[4] = "</br>".join(row[4:7])
        del row[5:7]
        # Copy second field (index 1) to the end
        row.append(row[1])
        # Remove second and third fields
        del row[1:3]
        # Write manipulated row
        outwriter.writerow(row)
The output outfile.csv is
"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, ""cde"" here","cde</br>cde</br>cde","cde"
This is your sample output, but with properly escaped "some other, ""cde"" here".
This may not be precisely what you wanted, not being a sed or awk solution, but I hope it is still useful. Processing more complicated formats may justify more complicated tools. Using an existing library also removes a few opportunities to make mistakes.
This might be an oversimplification of the problem, but this has worked for me with your test data:
cat /tmp/inputfile.csv | sed 's#\"\,\"#|#g' | sed 's#"</br>"#</br>#g' | awk 'BEGIN {FS="|"} {print $1 "," $4 "," $5 "</br>" $6 "</br>" $7 "," $2}'
Please note that I am on a Mac; that's probably why I had to wrap the commas in the AWK script in quotation marks.
I would like to search through a file and find all instances where the last non-blank character is a comma and move the line below that up one. Essentially, undoing line continuations like
private static final double SOME_NUMBERS[][] = {
{1.0, -6.032174644509064E-23},
{-0.25, -0.25},
{-0.16624879837036133, -2.6033824355191673E-8}
};
and transforming that to
private static final double SOME_NUMBERS[][] = {
{1.0, -6.032174644509064E-23}, {-0.25, -0.25}, {-0.16624879837036133, -2.6033824355191673E-8}
};
Is there a good way to do this?
As mjswartz suggests in the comments, we need a sed substitution command like s/,\n/ /g. That, however, does not work by itself because, by default, sed reads in only one line at a time. We can fix that by reading in the whole file first and then doing the substitution:
$ sed 'H;1h;$!d;x; s/,[[:blank:]]*\n[[:blank:]]*/, /g;' file
private static final double SOME_NUMBERS[][] = {
{1.0, -6.032174644509064E-23}, {-0.25, -0.25}, {-0.16624879837036133, -2.6033824355191673E-8}
};
Because this reads in the whole file at once, this is not a good approach for huge files.
The above was tested with GNU sed.
How it works
H;1h;$!d;x;
This series of commands reads in the whole file. It is probably simplest to think of this as an idiom. If you really want to know the gory details:
H - Append current line to hold space
1h - If this is the first line, overwrite the hold space with it
$!d - If this is not the last line, delete pattern space and jump to the next line.
x - Exchange hold and pattern space to put whole file in pattern space
s/,[[:blank:]]*\n[[:blank:]]*/, /g
This looks for lines that end with a comma, optionally followed by blanks, followed by a newline and replaces that, and any leading space on the following line, with a comma and a single space.
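For instance, feeding just two of the lines through the command shows the join happening (a small demo; printf here is only used to supply sample input):
printf '{1.0, -6.032174644509064E-23},\n{-0.25, -0.25}\n' | sed 'H;1h;$!d;x; s/,[[:blank:]]*\n[[:blank:]]*/, /g;'
{1.0, -6.032174644509064E-23}, {-0.25, -0.25}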
I think for large files awk would be better:
awk -vRS=", *\n" -vORS=", " '1' file
In lua-shell, just write something like this:
function nextlineup()
    vim:normal("j^y$k$pjddk")
end

vim:open("code.txt")
vim:normal("G")
while vim:k() do
    vim:normal("$")
    if vim.currc == string.byte(',') then nextlineup() end
end
If you are not familiar with vim, this script may seem a bit scary and not robust. In fact, every operation in it is precise (and much quicker, because they are built-in functions).
Since you are processing a code file, I suggest you try it.
Here is a demo.
Here is a Perl solution.
cat file | perl -e '{$c = 0; while (<>) { s/^\s+/ / if ($c); s/,\s*$/,/; print($_); $c = (m/,\s*$/) ? 1 : 0; }}'
I have a CSV like:
1015,5
1015,4
1035,17
1035,11
1009,1
1009,4
1026,9
1004,5
1004,5
1009,1
I'm looking for a way to get the sum of the second numbers for each matching first number:
1015,9
1035,28
1009,6
1026,9
1004,10
Try this:
awk 'BEGIN{FS=OFS=","}{a[$1]+=$2}END{for(i in a){print i,a[i]}}' file
This is the kind of awk snippet that every shell coder should know off the top of their head.
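Spelled out with comments, the same program reads like this (same behavior; note that for (i in a) does not guarantee any particular output order, so the totals may not come out in the order of the sample):
awk 'BEGIN { FS = OFS = "," }             # fields are comma separated on input and output
     { a[$1] += $2 }                      # accumulate column 2 per column-1 key
     END { for (i in a) print i, a[i] }   # print each key and its total
' file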