draw network in pajek3xl - social-networking

I installed pajek3xl but just do not seem to be able to simply draw a network given a .net file. There used to be a draw menu but I cannot find it. is it still possible to draw a network in pajek3xl?

Pajek3XL is meant for analysis of huge networks (over 2 billion of vertices).
For such networks visualization is not needed.
If you need visualization simply use ordinary Pajek,
which is installed when running Install Shield as well.

Related

Cytoscape vs STRING for long list of proteins

I am mid-way through my university project, and I have run into an issue. I have a long list of around 1000 proteins that I wanted to analyse in STRING, however, my list is too large. I decided to try and utilise Cytoscape (and downloaded the stringApp), but the networks generated are still very messy. I've attached a screenshot here. Is there any way to improve the presentation of the network by downloading any Cytoscape apps or by tweaking the settings?
Thanks in advance
Well, the short answer is "no". A slightly longer answer is "it depends".
Showing a hairball really isn't helpful, usually, so you need to refine things somewhat. What is your data source (i.e. where did the 1000 proteins come from)? What do you hope to see in the network? If you are looking for particular groups of proteins (e.g. complexes), you would probably want to use MCL to cluster them first. If you have some other data you want to map, such as transcriptomic or proteomic data, you could refine your network based on fold change or abundance values.
All that being said, somethings you might try. First, you are seeing the "fast" version of the network. Try clicking on the show graphics details button (the diamond in the network view tool bar). That will give you the full graphics details. Second, you might try spreading the network out a bit by using the Layout->Layout Tools. Turn off the "Selected Only" and then adjust the scale. Finally, depending on your biological question, you might want to eliminate proteins that are only present in the nucleus or cytoplasm, or are only in lung tissue. This is all possible using the sliders provided by the stringApp's Results Panel.
-- scooter

Music21 and D3.js for music feature extraction and visualization?

I am looking for suggestions on what tools could be used for the following scenarios about music feature extraction and visualization (on my Mac):
identify and group notes in a score (from different voices/instruments) that sound concurrently (even if they are attacked in different time offsets, though sound at some point together due to different duration lengths); then connect them graphically (e.g. on a score representation, with a line connecting them)
identify melodic and accompanying parts (assigned to different voices/instruments, perhaps interchangeably within the same voice/instrument)
extract initial tonality and following modulations; then map all extracted tonalities on a scale based on the circle of 5ths (where 0 is the initial tonality, -1 is one 5th lower, +1 one 5th higher, etc.)
I have been thinking of using music21 (the music works I am interested in are part of its corpus), but I am not sure if this is the right way to go. Are there also other tools (e.g. jSymbolic2??) that could help?
And what about visualization? Could the above scenarios be visually "solved" within music21 or would I need an additional tool, like D3.js (which I have briefly used in the past)?
If you would have an advice on any of the above scenarios, that would help me a lot! Thanks, Ilias

d3js force large number of nodes

pl. help me with this noob questions. I want to show a network with large number (70000) of nodes, and 2.1 million links in force layout. Looking for a good and scalable way to do this.
How do we actually show such large nodes practically, can we do some kind of approximation and show semantically same network (e.g: http://www.visualcomplexity.com/vc/project.cfm?id=76 )
How do we actually reduce such data in back end [ say using KDE ? We cannot afford to use science.js in front end as the volume is large ]
Initial view can be the network with pre-determined locations of the nodes or clusters. How do we predertmine the locations in back end, before sending the data to d3js. Do we have to use topojson ?
Any such examples are available using d3js (and a backend - say java, python etc) ?
Sorry about the question, but do you really need to show all that information in one shot?
If you really need it, have first a look with Gephi and see what it looks like, then pass to the next step.
If you see that you can focus on specific nodes or patterns at the beginning and then explore the result of the chart, probably this is the best solution from a performance point of view.
In case the discovery approach works but you are still having troubles with many items on the screen, just control the force layout with a time based threshold. It's not perfect but it will work for hundred nodes.
Next step
If you decide to go anyway on this path, I would recommend the followings:
Aggregate: that's probably the most useful thing you can do here: let the user interact with the data and dig in it to see more in detail. That is the best solution if you have to serve many clients.
Do not run the force directed layout on the front end with the entire network as is: it will eat all the browser resources for at least tens of minutes in any case.
Compute the layout on the back end - e.g. using JUNG or Gephi core itself in Java or NetworkX in Python - and then just display the result.
Cache the result of the point above as well: they are many even for the server if you have many clients, so cache it.
When the user drag the network, hide the links: it should speed up the computation ( sigmajs uses this trick)

OCR for scanning printed receipts. [duplicate]

Would OCR Software be able to reliably translate an image such as the following into a list of values?
UPDATE:
In more detail the task is as follows:
We have a client application, where the user can open a report. This report contains a table of values.
But not every report looks the same - different fonts, different spacing, different colors, maybe the report contains many tables with different number of rows/columns...
The user selects an area of the report which contains a table. Using the mouse.
Now we want to convert the selected table into values - using our OCR tool.
At the time when the user selects the rectangular area I can ask for extra information
to help with the OCR process, and ask for confirmation that the values have been correct recognised.
It will initially be an experimental project, and therefore most likely with an OpenSource OCR tool - or at least one that does not cost any money for experimental purposes.
Simple answer is YES, you should just choose right tools.
I don't know if open source can ever get close to 100% accuracy on those images, but based on the answers here probably yes, if you spend some time on training and solve table analisys problem and stuff like that.
When we talk about commertial OCR like ABBYY or other, it will provide you 99%+ accuracy out of the box and it will detect tables automatically. No training, no anything, just works. Drawback is that you have to pay for it $$. Some would object that for open source you pay your time to set it up and mantain - but everyone decides for himself here.
However if we talk about commertial tools, there is more choice actually. And it depends on what you want. Boxed products like FineReader are actually targeting on converting input documents into editable documents like Word or Excell. Since you want actually to get data, not the Word document, you may need to look into different product category - Data Capture, which is essentially OCR plus some additional logic to find necessary data on the page. In case of invoice it could be Company name, Total amount, Due Date, Line items in the table, etc.
Data Capture is complicated subject and requires some learning, but being properly used can give quaranteed accuracy when capturing data from the documents. It is using different rules for data cross-check, database lookups, etc. When necessary it may send datafor manual verification. Enterprises are widely usind Data Capture applicaitons to enter millions of documents every month and heavily rely on data extracted in their every day workflow.
And there are also OCR SDK ofcourse, that will give you API access to recognition results and you will be able to program what to do with the data.
If you describe your task in more detail I can provide you with advice what direction is easier to go.
UPDATE
So what you do is basically Data Capture application, but not fully automated, using so-called "click to index" approach. There is number of applications like that on the market: you scan images and operator clicks on the text on the image (or draws rectangle around it) and then populates fields to database. It is good approach when number of images to process is relatively small, and manual workload is not big enough to justify cost of fully automated application (yes, there are fully automated systems that can do images with different font, spacing, layout, number of rows in the tables and so on).
If you decided to develop stuff and instead of buying, then all you need here is to chose OCR SDK. All UI you are going to write yoursself, right? The big choice is to decide: open source or commercial.
Best Open source is tesseract OCR, as far as I know. It is free, but may have real problems with table analysis, but with manual zoning approach this should not be the problem. As to OCR accuracty - people are often train OCR for font to increase accuracy, but this should not be the case for you, since fonts could be different. So you can just try tesseract out and see what accuracy you will get - this will influence amount of manual work to correct it.
Commertial OCR will give higher accuracy but will cost you money. I think you should anyway take a look to see if it worth it, or tesserack is good enough for you. I think the simplest way would be to download trial version of some box OCR prouct like FineReader. You will get good idea what accuracy would be in OCR SDK then.
If you always have solid borders in your table, you can try this solution:
Locate the horizontal and vertical lines on each page (long runs of
black pixels)
Segment the image into cells using the line coordinates
Clean up each cell (remove borders, threshold to black and white)
Perform OCR on each cell
Assemble results into a 2D array
Else your document have a borderless table, you can try to follow this line:
Optical Character Recognition is pretty amazing stuff, but it isn’t
always perfect. To get the best possible results, it helps to use the
cleanest input you can. In my initial experiments, I found that
performing OCR on the entire document actually worked pretty well as
long as I removed the cell borders (long horizontal and vertical
lines). However, the software compressed all whitespace into a single
empty space. Since my input documents had multiple columns with
several words in each column, the cell boundaries were getting lost.
Retaining the relationship between cells was very important, so one
possible solution was to draw a unique character, like “^” on each
cell boundary – something the OCR would still recognize and that I
could use later to split the resulting strings.
I found all this information in this link, asking Google "OCR to table". The author published a full algorithm using Python and Tesseract, both opensource solutions!
If you want to try the Tesseract power, maybe you should try this site:
http://www.free-ocr.com/
Which OCR you are talking about?
Will you be developing codes based on that OCR or you will be using something off the shelves?
FYI:
Tesseract OCR
it has implemented the document reading executable, so you can feed the whole page in, and it will extract characters for you. It recognizes blank spaces pretty well, it might be able to help with tab-spacing.
I've been OCR'ing scanned documents since '98. This is a recurring problem for scanned docs, specially for those that include rotated and/or skewed pages.
Yes, there are several good commercial systems and some could provide, once well configured, terrific automatic data-mining rate, asking for the operator's help only for those very degraded fields. If I were you, I'd rely on some of them.
If commercial choices threat your budget, OSS can lend a hand. But, "there's no free lunch". So, you'll have to rely on a bunch of tailor-made scripts to scaffold an affordable solution to process your bunch of docs. Fortunately, you are not alone. In fact, past last decades, many people have been dealing with this. So, IMHO, the best and concise answer for this question is provided by this article:
https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/
Its reading is worth! The author offers useful tools of his own, but the article's conclusion is very important to give you a good mindset about how to solve this kind of problem.
"There is no silver bullet."
(Fred Brooks, The Mitical Man-Month)
It really depends on implementation.
There are a few parameters that affect the OCR's ability to recognize:
1. How well the OCR is trained - the size and quality of the examples database
2. How well it is trained to detect "garbage" (besides knowing what's a letter, you need to know what is NOT a letter).
3. The OCR's design and type
4. If it's a Nerural Network, the Nerural Network structure affects its ability to learn and "decide".
So, if you're not making one of your own, it's just a matter of testing different kinds until you find one that fits.
You could try other approach. With tesseract (or other OCRS) you can get coordinates for each word. Then you can try to group those words by vercital and horizontal coordinates to get rows/columns. For example to tell a difference between a white space and tab space. It takes some practice to get good results but it is possible. With this method you can detect tables even if the tables use invisible separators - no lines. The word coordinates are solid base for table recog
We also have struggled with the issue of recognizing text within tables. There are two solutions which do it out of the box, ABBYY Recognition Server and ABBYY FlexiCapture. Rec Server is a server-based, high volume OCR tool designed for conversion of large volumes of documents to a searchable format. Although it is available with an API for those types of uses we recommend FlexiCapture. FlexiCapture gives low level control over extraction of data from within table formats including automatic detection of table items on a page. It is available in a full API version without a front end, or the off the shelf version that we market. Reach out to me if you want to know more.
Here are the basic steps that have worked for me. Tools needed include Tesseract, Python, OpenCV, and ImageMagick if you need to do any rotation of images to correct skew.
Use Tesseract to detect rotation and ImageMagick mogrify to fix it.
Use OpenCV to find and extract tables.
Use OpenCV to find and extract each cell from the table.
Use OpenCV to crop and clean up each cell so that there is no noise that will confuse OCR software.
Use Tesseract to OCR each cell.
Combine the extracted text of each cell into the format you need.
The code for each of these steps is extensive, but if you want to use a python package, it's as simple as the following.
pip3 install table_ocr
python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png
That package and demo module will turn the following table into CSV output.
Cell,Format,Formula
B4,Percentage,None
C4,General,None
D4,Accounting,None
E4,Currency,"=PMT(B4/12,C4,D4)"
F4,Currency,=E4*C4
If you need to make any changes to get the code to work for table borders with different widths, there are extensive notes at https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html

How to handle large numbers of pushpins in Bing Maps

I am using Bing Maps with Ajax and I have about 80,000 locations to drop pushpins into. The purpose of the feature is to allow a user to search for restaurants in Louisiana and click the pushpin to see the health inspection information.
Obviously it doesn't do much good to have 80,000 pins on the map at one time, but I am struggling to find the best solution to this problem. Another problem is that the distance between these locations is very small (All 80,000 are in Louisiana). I know I could use clustering to keep from cluttering the map, but it seems like that would still cause performance problems.
What I am currently trying to do is to simply not show any pins until a certain zoom level and then only show the pins within the current view. The way I am currently attempting to do that is by using the viewchangeend event to find the zoom level and the boundaries of the map and then querying the database (through a web service) for any points in that range.
It feels like I am going about this the wrong way. Is there a better way to manage this large amount of data? Would it be better to try to load all points initially and then have the data on hand without having to hit my web service every time the map moves. If so, how would I go about it?
I haven't been able to find answers to my questions, which usually means that I am asking the wrong questions. If anyone could help me figure out the right question it would be greatly appreciated.
Well, I've implemented a slightly different approach to this. It was just a fun exercise, but I'm displaying all my data (about 140.000 points) in Bing Maps using the HTML5 canvas.
I previously load all the data to the client. Then, I've optimized the drawing process so much that I've attached it to the "Viewchange" event (which fires all the time during the view change process).
I've blogged about this. You can check it here.
My example does not have interaction on it but could be easily done (should be a nice topic for a blog post). You would have thus to handle the events manually and search for the corresponding points yourself or, if the amount of points to draw and/or the zoom level was below some threshold, show regular pushpins.
Anyway, another option, if you're not restricted to Bing Maps, is to use the likes of Leaflet. It allows you to create a Canvas Layer which is a tile-based layer but rendered in client-side using HTML5 canvas. It opens a new range of possibilities. Check for example this map in GisCloud.
Yet another option, although more suitable to static data, is using a technique called UTFGrid. The lads that developed it can certainly explain it better than me, but it scales for as many points as you want with a fenomenal performance. It consists on having a tile layer with your info, and an accompanying json file with something like an "ascii-art" file describing the features on the tiles. Then, using a library called wax it provides complete mouse-over, mouse-click events on it, without any performance impact whatsoever.
I've also blogged about it.
I think clustering would be your best bet if you can get away with using it. You say that you tried using clustering but it still caused performance problems? I went to test it out with 80000 data points at the V7 Interactive SDK and it seems to perform fine. Test it out yourself by going to the link and change the line in the Load module - clustering tab:
TestDataGenerator.GenerateData(100,dataCallback);
to
TestDataGenerator.GenerateData(80000,dataCallback);
then hit the Run button. The performance seems acceptable to me with that many data points.

Resources