This site has moved

You can find my new website along with a lossy port of this blog at jpfairbanks.net

Posted in Uncategorized

Followup to Python for Data Analysis

I have read and thoroughly enjoyed the book by Wes McKinney. Ever since, I have been using Python for all of my coding, except for the CUDA and Cilk work required for my High Performance Computing algorithms class. The Pandas library has been a wonderful tool. It builds atop and plays nicely with the NumPy, SciPy, and pyplot stack for interactive data analysis. Making a quick plot to see a trend in some data is really easy, and it scales all the way up to publication-quality figures. I have been developing a workflow for a paper that I am writing with David Ediger and Rob McColl. The Python code is an integral part of the paper because it is the record of how I analysed the data and prepared the figures. My goal for the finished paper is to distribute a tarball with Bash code for gathering data; STINGER C code for the graph experiments; Python and Pandas code for data analysis; LaTeX for document preparation; and a Makefile that runs the whole pipeline to produce the paper. That way, if we get a new dataset, all of the hard work I have been doing these past months can be replicated with a single command. This will be a real step forward for reproducible research.
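The Makefile stage of that pipeline boils down to one rule: rebuild a target only when one of its sources is newer. A minimal sketch of that check in Python, just to show the idea (the file names here are placeholders, not the paper's actual artifacts):

```python
import os
import tempfile

def needs_rebuild(target, sources):
    """True if target is missing or older than any source, as make would decide."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)

# Demonstrate with throwaway stand-in files.
workdir = tempfile.mkdtemp()
raw_data = os.path.join(workdir, "data.csv")
figure = os.path.join(workdir, "figure.pdf")
open(raw_data, "w").close()
missing = needs_rebuild(figure, [raw_data])  # no figure yet -> rebuild
open(figure, "w").close()
fresh = needs_rebuild(figure, [raw_data])    # figure newer than data -> skip
```

In the real pipeline, make itself does this bookkeeping across the Bash, C, Python, and LaTeX stages.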

I think I also have a clearer picture of how one should organize code for using IPython to interactively explore data. I started off with a little script that would load the data, analyse it, and make the figures. Then I realized that a lot of that code was being reused, so I collapsed it into functions with general parameters. I had tried to write functions directly from the start, but that makes debugging a lot harder because you don't end up in their namespace when they finish executing. At the outset I also don't really know where functionality should live or what data will be reused later. So writing code in a static style, where you plan first and then code, is not going to work for exploratory data analysis. As this script grew larger, it accumulated a lot of computation that was performed on every run. But the IO takes most of the time, and I do not need to reload the data every time I make a small change to a feature.
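One trick that keeps this pleasant at the IPython prompt is guarding the slow IO behind a cache, so `%run` after a small edit does not reload everything. A minimal sketch of the pattern, with an in-memory stand-in for the real data files (the names and numbers are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical sample data standing in for the slow-to-load experiment files.
RAW = "trial,seconds\n1,0.50\n2,0.71\n3,0.62\n"

_cache = {}

def load_data(name="experiment"):
    """Parse the data once; later calls return the cached DataFrame."""
    if name not in _cache:
        _cache[name] = pd.read_csv(io.StringIO(RAW))
    return _cache[name]

df = load_data()        # slow the first time
df_again = load_data()  # instant; same object from the cache
```

The same guard works for real files: the expensive `read_csv` runs once per session, and everything downstream can be re-run freely.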

This brings us to the file layout I am now using. The first file does all of the loading of data and cleans up afterward by closing files. The second file does the computation and defines a bunch of functions that can be used at the IPython prompt. The third file does all of the visualization in a general way, i.e., with no string constants. The paper_figures file uses constants to make the final figures that go into the paper and deposits them into files where LaTeX can find them. This lets you iterate on the computation file without re-running the IO every time you change a computation. By separating these tasks we can make interactive data analysis more fun and responsive. The data-loading and paper_figures files are really for the batch mode that comes later, once the data has been explored and we are distributing the results to other researchers and decision makers. The files in the middle evolve into a little library of things you do with your data that can be reused next time.
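The split can be sketched as four stages, here collapsed into one snippet with a function standing in for each file (the names and toy data are mine for illustration, not the paper's actual code):

```python
def load_stage():
    """Stand-in for the loading file: slow IO, run once per session."""
    return [1, 2, 3, 4, 5]  # pretend these came off disk

def compute_stage(data, scale=2.0):
    """Stand-in for the computation file: general, parameterized functions."""
    return [scale * x for x in data]

def visualize_stage(results):
    """Stand-in for the visualization file: general, no string constants."""
    return {"n": len(results), "max": max(results)}  # summary in place of a plot

def paper_figures(summary, outfile="figure1.pdf"):
    """Stand-in for paper_figures: constants pinned down, output left for LaTeX."""
    return (outfile, summary)

data = load_stage()                             # expensive, done once
summary = visualize_stage(compute_stage(data))  # cheap, iterate on this freely
```

The point of the shape is the dependency direction: everything downstream of `load_stage` can be edited and re-run without touching the IO.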

Let me know if you have had a similar experience or wisdom to share on this subject.

Posted in Uncategorized

Python for Data Analysis

I am reading this book and it is really good. Everyone who wants to do data analysis should read it and consider using these tools. It presents NumPy and SciPy for numeric and vectorized operations, matplotlib for fast and programmatic plotting, and Pandas for a robust data structure framework. It also goes over common data formats and tools for parsing them.
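As a taste of how those pieces fit together, here is a tiny sketch of my own (the array contents are made up, and none of this comes from the book itself): a NumPy array feeds straight into a Pandas DataFrame, and vectorized operations replace explicit loops.

```python
import numpy as np
import pandas as pd

# Made-up measurements: a NumPy array becomes a labeled DataFrame.
arr = np.arange(10, dtype=float).reshape(5, 2)
df = pd.DataFrame(arr, columns=["a", "b"])

col_sums = df.sum()  # vectorized column sums, no explicit loop
# df.plot() would hand the same data to matplotlib for a quick look
```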

Posted in Uncategorized

Graph 500 List

At SC12 they released these rankings of the top machines for graph applications, measured by performance on breadth-first search. I am looking forward to new challenges for more complex problems like maximal independent set and vertex coloring. Hopefully the incentive of the rankings will spur some innovation on these problems, like that seen in BFS.

Posted in Uncategorized

SC12

I am going to Supercomputing 2012 next week. It looks cold in Utah, but it should be a good trip. I will be at the GaTech booth for some of the show, and I am definitely going to the Python in HPC talk. I am also looking forward to the unveiling of the new Graph500 list: http://www.graph500.org/.

http://sc12.supercomputing.org/

Posted in Uncategorized

dK Graph sequence

In reading about graph generators I found the dK-series, a sequence of statistics computed from a graph that tries to capture the degree correlations between vertices. The authors say that for large enough K the series converges to specifying a single graph. If you could show that certain graph properties behave well under this metric, it could be a very useful sequence. The authors illustrate this empirically by generating random graphs with the same degree sequence and plotting various properties of them. I am looking forward to a heavy reading of this paper: http://arxiv.org/abs/cs/0605007. It was also used in another paper to fit random graph models.
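For concreteness, the second statistic in the series (dK-2, the one capturing pairwise degree correlations) just counts, for every edge, the degrees of its two endpoints. A toy sketch of my own on a made-up edge list, not code from the paper:

```python
from collections import Counter

edges = [(0, 1), (0, 2), (0, 3), (1, 2)]  # toy undirected graph

# dK-1: the degree of each vertex
deg = Counter()
for u, v in edges:
    deg[u] += 1
    deg[v] += 1

# dK-2: how many edges join a degree-j vertex to a degree-k vertex
dk2 = Counter(tuple(sorted((deg[u], deg[v]))) for u, v in edges)
```

Higher-K members of the series constrain the degrees seen along longer paths, which is why the series eventually pins down the graph.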

Posted in Uncategorized

Stochastic Kronecker Graphs

I just read this paper out of Sandia and thought it was a really good read. It highlights why the Stochastic Kronecker Graph model is flawed and provides a good fix.

Posted in Uncategorized