I'm trying to read my Excel files saved in my Azure storage container like this:
string connectionString = Environment.GetEnvironmentVariable("AZURE_STORAGE_CONNECTION_STRING");
BlobServiceClient blobServiceClient = new BlobServiceClient(connectionString);
BlobContainerClient containerClient = blobServiceClient.GetBlobContainerClient("concursos");

foreach (BlobItem blobItem in containerClient.GetBlobs())
{
    BlobClient blobClient = containerClient.GetBlobClient(blobItem.Name);
    ExcelPackage.LicenseContext = LicenseContext.NonCommercial;

    using (var stream = blobClient.OpenRead(new BlobOpenReadOptions(true)))
    using (ExcelPackage package = new ExcelPackage(stream))
    {
        ExcelWorksheet worksheet = package.Workbook.Worksheets.FirstOrDefault();
        int colCount = worksheet.Dimension.End.Column;
        int rowCount = worksheet.Dimension.End.Row;

        for (int row = 1; row <= rowCount; row++)
        {
            for (int col = 1; col <= colCount; col++)
            {
                Console.WriteLine(" Row:" + row + " column:" + col + " Value:" + worksheet.Cells[row, col].Value.ToString().Trim());
            }
        }
    }
}
But the line
ExcelWorksheet worksheet = package.Workbook.Worksheets.FirstOrDefault();
throws an error: System.NullReferenceException: 'Object reference not set to an instance of an object.' (worksheet was null).
When I debug, the stream and the package look fine.
The Excel files in the blobs are .xls files.
Any idea, please?
Thanks
Please check whether the worksheet is empty. This error occurs when a sheet has no columns or rows.
I tried to reproduce the issue.
First I read an Excel sheet with EPPlus where the starting columns and rows were filled; it executed and read successfully using the same code as yours.
Then I made column 1 empty, stored the file in the blob, tried to read it, and got the null reference exception.
The Dimension object of the ExcelWorksheet is null when the worksheet has just been initialized and is empty, which is what throws the NullReferenceException. As far as I know, the only way around it is to check whether the files are empty, or to add content to them before accessing them, so that empty columns do not cause an exception:
worksheet.Cells[1, 1].Value = "Some text value";
In the same way, you can add a worksheet to avoid an exception in case the blob contains a workbook with no sheets:
ExcelWorksheet worksheet = new ExcelPackage().Workbook.Worksheets.Add("Sheet1");
The code below will not throw an exception, because the Dimension object is initialized by adding content to the worksheet. If the loaded ExcelWorksheet already contains data, you will not face this issue.
ExcelWorksheet worksheet = package.Workbook.Worksheets.First();
//or ExcelWorksheet worksheet = package.Workbook.Worksheets[0];

// Add the line below to create a new sheet, if no sheets are present and you get a null reference exception
//ExcelWorksheet worksheet = new ExcelPackage().Workbook.Worksheets.Add("Sheet1");

// Add the line below to add a column and row, if the sheet is empty and you get a null reference exception
worksheet.Cells[1, 1].Value = " This is the end of worksheet";

int colCount = worksheet.Dimension.End.Column;
int rowCount = worksheet.Dimension.End.Row;

for (int row = 1; row <= rowCount; row++)
{
    for (int col = 1; col <= colCount; col++)
    {
        Console.WriteLine(" Row:" + row + " column:" + col + " Value:" + worksheet.Cells[row, col].Value.ToString().Trim());
    }
}
You can alternatively check whether the cell value is null before calling ToString():
if (worksheet.Cells[row, col].Value != null)
{
    // proceed with code
}
The problem was the file extension of the Excel files in the blobs.
It only works with .xlsx, not with .xls (EPPlus reads the Office Open XML .xlsx format, not the legacy binary .xls format).
Thanks
I have a CSV file with 500,000 records in it. The fields in the CSV file are as follows:
No, Name, Address
Now I want to compare the name and address of each record with the name and address of all the remaining records.
I was doing it in the following way:
List<String> lines = new ArrayList<>();
String line;
BufferedReader firstbufferedReader = new BufferedReader(new FileReader(new File(pathname)));
while ((line = firstbufferedReader.readLine()) != null) {
    lines.add(line);
}
firstbufferedReader.close();

for (int i = 0; i < lines.size(); i++)
{
    csvReader = new CSVReader(new StringReader(lines.get(i)));
    csvReader = null;
    for (int j = i + 1; j < lines.size(); j++)
    {
        csvReader = new CSVReader(new StringReader(lines.get(j)));
        csvReader = null;
        application.linesToCompare(lines.get(i), lines.get(j));
    }
}
The linesToCompare function extracts the name and address from its two parameters and compares them. If I find two records to be 80% matching (based on name and address), I mark them as duplicates.
But this approach is taking too much time to process the CSV file.
I want a faster approach, maybe some kind of map-reduce or anything else.
Thanks in advance
It is taking a long time because it looks like you are parsing the data far more often than necessary.
You first read the file into the lines list, then for every entry you create a CSVReader (and immediately discard it), and inside the inner loop you do it again. Instead of doing this, read the file once into your lines list and then use that to compare the entries against each other.
Something like this might work for you:
List<String> lines = new ArrayList<>();
String line;
BufferedReader firstbufferedReader = new BufferedReader(new FileReader(new File(pathname)));
while ((line = firstbufferedReader.readLine()) != null) {
    lines.add(line);
}
firstbufferedReader.close();

for (int i = 0; i < lines.size(); i++)
{
    for (int j = i + 1; j < lines.size(); j++)
    {
        application.linesToCompare(lines.get(i), lines.get(j));
    }
}
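If linesToCompare itself re-parses the CSV text on every call, it may also help to parse each record once up front. This is only a minimal sketch, assuming OpenCSV's com.opencsv.CSVReader (as in the question) and a stand-in for your linesToCompare that accepts the already-parsed fields:

import com.opencsv.CSVReader;

import java.io.FileReader;
import java.util.List;

public class ParsedComparison {

    public static void findDuplicates(String pathname) throws Exception {
        // Parse the whole file once; each record becomes a String[] of {No, Name, Address}.
        List<String[]> records;
        try (CSVReader csvReader = new CSVReader(new FileReader(pathname))) {
            records = csvReader.readAll();
        }

        // Compare each record against all of the remaining ones using the parsed fields.
        for (int i = 0; i < records.size(); i++) {
            for (int j = i + 1; j < records.size(); j++) {
                linesToCompare(records.get(i), records.get(j));
            }
        }
    }

    // Stand-in for the question's linesToCompare, taking the already-parsed fields:
    // name is at index 1 and address at index 2 of each record.
    static void linesToCompare(String[] a, String[] b) {
        // ... 80% similarity check on name and address goes here ...
    }
}

The pairwise loop is still O(n²) over 500,000 records, so parsing once mainly removes per-comparison overhead rather than reducing the number of comparisons.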
Is there a way to scan an HBase table and get, for example, the first 100
results, then later get the next 100, and so on, just like we do in SQL
with LIMIT and OFFSET?
My row keys are UUIDs.
You can do it multiple ways. The easiest one is a page filter. Below is the code example from HBase: The Definitive Guide, page 150.
private static final byte[] POSTFIX = new byte[] { 0x00 };

Filter filter = new PageFilter(15);
int totalRows = 0;
byte[] lastRow = null;
while (true) {
    Scan scan = new Scan();
    scan.setFilter(filter);
    if (lastRow != null) {
        byte[] startRow = Bytes.add(lastRow, POSTFIX);
        System.out.println("start row: " + Bytes.toStringBinary(startRow));
        scan.setStartRow(startRow);
    }
    ResultScanner scanner = table.getScanner(scan);
    int localRows = 0;
    Result result;
    while ((result = scanner.next()) != null) {
        System.out.println(localRows++ + ": " + result);
        totalRows++;
        lastRow = result.getRow();
    }
    scanner.close();
    if (localRows == 0) break;
}
System.out.println("total rows: " + totalRows);
Or you can set caching on the scan to the limit you want, and then for every fetch change the start row to the last row + 1 from the previous scan.
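This is not from the book, just a rough sketch of that second approach, reusing table, lastRow, and POSTFIX from the snippet above; the page size is a placeholder:

final int PAGE_SIZE = 100;

Scan scan = new Scan();
scan.setCaching(PAGE_SIZE);                 // rows fetched per RPC
if (lastRow != null) {
    // resume just after the last row key returned by the previous page
    scan.setStartRow(Bytes.add(lastRow, POSTFIX));
}
ResultScanner scanner = table.getScanner(scan);
Result[] page = scanner.next(PAGE_SIZE);    // at most PAGE_SIZE results for this "page"
for (Result result : page) {
    System.out.println(result);
    lastRow = result.getRow();              // remember where to resume next time
}
scanner.close();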
Given a web log which consists of the fields 'User' and 'Page URL', we have to find the most frequent 3-page sequence that users take.
There is a timestamp, and it is not guaranteed that a single user's accesses are logged sequentially. The log could look like: user1 Page1, user2 PageX, user1 Page2, user10 PageX, user1 Page3. Here user1's page sequence is Page1 -> Page2 -> Page3.
Assuming your log is stored in timestamp order, here's an algorithm to do what you need:
Create a hashtable 'user_visits' mapping user ID to the last two pages you observed them to visit
Create a hashtable 'visit_count' mapping 3-tuples of pages to frequency counts
For each entry (user, URL) in the log:
If 'user' exists in user_visits with two entries, increment the entry in visit_count corresponding to the 3-tuple of URLs by one
Append 'URL' to the relevant entry in user_visits, removing the oldest entry if necessary.
Sort the visit_count hashtable by value. This is your list of most popular sequences of URLs.
Here's an implementation in Python, assuming your fields are space-separated:
fh = open('log.txt', 'r')
user_visits = {}
visit_counts = {}
for row in fh:
    user, url = row.split()
    prev_visits = user_visits.get(user, ())
    if len(prev_visits) == 2:
        visit_tuple = prev_visits + (url,)
        visit_counts[visit_tuple] = visit_counts.get(visit_tuple, 0) + 1
    user_visits[user] = (prev_visits + (url,))[-2:]
fh.close()
popular_sequences = sorted(visit_counts.items(), key=lambda kv: kv[1], reverse=True)
Quick and dirty:
Build a list of URL/timestamp pairs per user
Sort each list by timestamp
Iterate over each list
For each 3-URL sequence, create or increment a counter
Find the highest count in the URL sequence count list
foreach(entry in parsedLog)
{
    users[entry.user].urls.add(entry.time, entry.url)
}

foreach(user in users)
{
    user.urls.sort()
    for(i = 0; i < user.urls.length - 2; i++)
    {
        key = createKey(user.urls[i], user.urls[i+1], user.urls[i+2])
        sequenceCounts.incrementOrCreate(key);
    }
}

sequenceCounts.sortDesc()
largestCountKey = sequenceCounts[0]
topUrlSequence = parseKey(largestCountKey)
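A rough, runnable Java sketch of the same idea; the LogEntry type, its field names, and the "->" key format are just placeholders for however the parsed log is actually represented:

import java.util.*;

class LogEntry {
    String user;
    long time;
    String url;
    LogEntry(String user, long time, String url) {
        this.user = user; this.time = time; this.url = url;
    }
}

public class TopSequence {
    public static String topThreePageSequence(List<LogEntry> parsedLog) {
        // Build a list of visits per user.
        Map<String, List<LogEntry>> users = new HashMap<>();
        for (LogEntry entry : parsedLog) {
            users.computeIfAbsent(entry.user, u -> new ArrayList<>()).add(entry);
        }

        // Sort each user's visits by timestamp and count every 3-URL window.
        Map<String, Integer> sequenceCounts = new HashMap<>();
        for (List<LogEntry> visits : users.values()) {
            visits.sort(Comparator.comparingLong((LogEntry e) -> e.time));
            for (int i = 0; i + 2 < visits.size(); i++) {
                String key = visits.get(i).url + " -> " + visits.get(i + 1).url
                        + " -> " + visits.get(i + 2).url;
                sequenceCounts.merge(key, 1, Integer::sum);
            }
        }

        // Return the sequence with the highest count (null if there is none).
        return sequenceCounts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}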
Here's a bit of SQL, assuming you could get your log into a table such as:
CREATE TABLE log (
    ord int,
    user VARCHAR(50) NOT NULL,
    url VARCHAR(255) NOT NULL,
    ts datetime
) ENGINE=InnoDB;
If the data is not sorted per user, then (assuming that the ord column is the line number from the log file):
SELECT t.url, t2.url, t3.url, count(*) c
FROM
log t INNER JOIN
log t2 ON t.user = t2.user INNER JOIN
log t3 ON t2.user = t3.user
WHERE
t2.ord IN (SELECT MIN(ord)
FROM log i
WHERE i.user = t.user AND i.ord > t.ord)
AND
t3.ord IN (SELECT MIN(ord)
FROM log i
WHERE i.user = t.user AND i.ord > t2.ord)
GROUP BY t.user, t.url, t2.url, t3.url
ORDER BY c DESC
LIMIT 10;
This will give the top ten 3-page paths per user. Alternatively, if you can get it ordered by user and time, you can join on row numbers more easily.
Source code in Mathematica
s= { {user},{page} } (* load List (log) here *)
sortedListbyUser=s[[Ordering[Transpose[{s[[All, 1]], Range[Length[s]]}]] ]]
Tally[Partition [sortedListbyUser,3,1]]
This problem is similar to
Find k most frequent words from a file
Here is how you can solve it:
Group each triplet (page1, page2, page3) into a single word
Apply the algorithm mentioned there
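For the selection step, here is a minimal sketch of that reduction, assuming the triplet counts have already been collected into a Map<String, Integer> (k and the map contents are placeholders). It keeps a min-heap of size k, as in the k-most-frequent-words problem:

import java.util.*;

public class TopKSequences {
    // Returns the k most frequent "words" (page triplets) from a count map,
    // most frequent first, using a size-k min-heap over the counts.
    public static List<String> topK(Map<String, Integer> tripletCounts, int k) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Integer> entry : tripletCounts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > k) {
                heap.poll();  // drop the least frequent of the k+1 candidates
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);  // most frequent first
        return result;
    }
}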
1. Read the user page access URLs from the file line by line; the fields are separated by a separator, e.g.:
u1,/
u1,main
u1,detail
Here the separator is a comma.
2. Store each page path's visit count in the map pageVisitCounts.
3. Sort the visit count map by value in descending order.
public static Map<String, Integer> findThreeMaxPagesPathV1(String file, String separator, int depth) {
    Map<String, Integer> pageVisitCounts = new HashMap<String, Integer>();
    if (file == null || "".equals(file)) {
        return pageVisitCounts;
    }
    try {
        File f = new File(file);
        FileReader fr = new FileReader(f);
        BufferedReader bf = new BufferedReader(fr);
        Map<String, List<String>> userUrls = new HashMap<String, List<String>>();
        String currentLine = "";
        while ((currentLine = bf.readLine()) != null) {
            String[] lineArr = currentLine.split(separator);
            if (lineArr == null || lineArr.length != (depth - 1)) {
                continue;
            }
            String user = lineArr[0];
            String page = lineArr[1];
            List<String> urlLinkedList = null;
            if (userUrls.get(user) == null) {
                urlLinkedList = new LinkedList<String>();
            } else {
                urlLinkedList = userUrls.get(user);
                String pages = "";
                if (urlLinkedList.size() == (depth - 1)) {
                    pages = urlLinkedList.get(0).trim() + separator + urlLinkedList.get(1).trim() + separator + page;
                } else if (urlLinkedList.size() > (depth - 1)) {
                    urlLinkedList.remove(0);
                    pages = urlLinkedList.get(0).trim() + separator + urlLinkedList.get(1).trim() + separator + page;
                }
                if (!"".equals(pages) && null != pages) {
                    Integer count = (pageVisitCounts.get(pages) == null ? 0 : pageVisitCounts.get(pages)) + 1;
                    pageVisitCounts.put(pages, count);
                }
            }
            urlLinkedList.add(page);
            System.out.println("user:" + user + ", urlLinkedList:" + urlLinkedList);
            userUrls.put(user, urlLinkedList);
        }
        bf.close();
        fr.close();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return pageVisitCounts;
}

public static void main(String[] args) {
    String file = "/home/ieee754/Desktop/test-access.log";
    String separator = ",";
    Map<String, Integer> pageVisitCounts = findThreeMaxPagesPathV1(file, separator, 3);
    System.out.println(pageVisitCounts.size());
    Map<String, Integer> result = MapUtil.sortByValueDescendOrder(pageVisitCounts);
    System.out.println(result);
}
I have the following query (column log is of type CLOB):
UPDATE table SET log=? where id=?
The query above works fine when using the setAsciiStream method to put a value longer than 4000 characters into the log column.
But instead of replacing the value, I want to append it, hence my query looks like this:
UPDATE table SET log=log||?||chr(10) where id=?
The above query DOES NOT work any more and I get the following error:
java.sql.SQLException: ORA-01461: can bind a LONG value only for insert into a LONG column
It looks to me like you have to use a PL/SQL block to do what you want. The following works for me, assuming there's an entry with id 1:
import oracle.jdbc.OracleDriver;

import java.sql.*;
import java.io.ByteArrayInputStream;

public class JDBCTest {
    // How much test data to generate.
    public static final int SIZE = 8192;

    public static void main(String[] args) throws Exception {
        // Generate some test data.
        byte[] data = new byte[SIZE];
        for (int i = 0; i < SIZE; ++i) {
            data[i] = (byte) (64 + (i % 32));
        }
        ByteArrayInputStream stream = new ByteArrayInputStream(data);

        DriverManager.registerDriver(new OracleDriver());
        Connection c = DriverManager.getConnection(
            "jdbc:oracle:thin:@some_database", "user", "password");

        String sql =
            "DECLARE\n" +
            "  l_line CLOB;\n" +
            "BEGIN\n" +
            "  l_line := ?;\n" +
            "  UPDATE table SET log = log || l_line || CHR(10) WHERE id = ?;\n" +
            "END;\n";

        PreparedStatement stmt = c.prepareStatement(sql);
        stmt.setAsciiStream(1, stream, SIZE);
        stmt.setInt(2, 1);
        stmt.execute();
        stmt.close();

        c.commit();
        c.close();
    }
}
LOBs are not mutable from plain SQL (well, besides setting them to NULL), so to append you would have to download the LOB first, concatenate locally, and upload the result again.
The usual solution is to write several records to the database with a common key and a sequence column which tells the DB how to order the rows.
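A minimal JDBC sketch of that approach, assuming a hypothetical table log_lines(id, seq, line) in place of the single CLOB column (the table and column names are placeholders, and this is not safe under concurrent appends without extra locking):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class AppendAsRows {
    // Append one more line for the given id by inserting a new row with the
    // next sequence number for that id.
    static void appendLine(Connection c, int id, String line) throws Exception {
        String sql =
            "INSERT INTO log_lines (id, seq, line) " +
            "SELECT ?, COALESCE(MAX(seq), 0) + 1, ? FROM log_lines WHERE id = ?";
        try (PreparedStatement stmt = c.prepareStatement(sql)) {
            stmt.setInt(1, id);
            stmt.setString(2, line);
            stmt.setInt(3, id);
            stmt.executeUpdate();
        }
    }

    // Read the full log back by concatenating the rows in sequence order.
    static String readLog(Connection c, int id) throws Exception {
        StringBuilder log = new StringBuilder();
        String sql = "SELECT line FROM log_lines WHERE id = ? ORDER BY seq";
        try (PreparedStatement stmt = c.prepareStatement(sql)) {
            stmt.setInt(1, id);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    log.append(rs.getString(1)).append('\n');
                }
            }
        }
        return log.toString();
    }
}

Reading the log back is then just a matter of concatenating the rows in seq order.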