Four Years Remaining

How to Tame Python and Uninstall Packages in Mac OS X

Posted by Swen 27.11.2008 4 Comments
Note: This is again a post for sharing personal computer-taming experiences, which aims to help others who might stumble upon similar problems. Also note that this is a guest post kindly presented by Swen Laur.

We as dumb users often hurt ourselves through ignorance. One of many such traps is installing Python on Mac OS X Leopard. First, for most users a new install is completely unnecessary, since Leopard comes with Python pre-installed. However, as the top Google hits kindly suggest installing MacPython, we as ignorant users follow the advice. As a result, there will be two versions of Python on the computer. Now if we install setuptools and easy_install, spooky things start to happen: packages that are installed through easy_install are suddenly inaccessible and nothing seems to work properly.

The explanation behind this phenomenon is simple though a bit technical. Obviously, after having installed MacPython, we have two distributions of Python in our computer:
- /System/Library/Frameworks/Python.framework/Versions/Current/bin/
- /Library/Frameworks/Python.framework/Versions/Current/bin/
which are accessible from the command line through the links in the directories /usr/bin and /usr/local/bin, respectively. On installation, MacPython modifies the .bash_login file so that the new path is

/Library/Frameworks/Python.framework/Versions/Current/bin/:${PATH}
and thus the command line invocation of python is translated to

/Library/Frameworks/Python.framework/Versions/Current/bin/python

which, naturally, corresponds to MacPython.
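The lookup rule behind this can be sketched in a few lines of Python. This is only an illustration of how a shell walks the PATH from left to right; the helper name `first_match` is made up for this example and is not part of any tool mentioned here.

```python
import os

def first_match(command, path_value):
    """Return the first executable called `command` in `path_value`, or None.

    Directories are tried from left to right, which is why a directory
    prepended to PATH wins over everything that comes after it.
    """
    for directory in path_value.split(os.pathsep):
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

# Show which "python" the current PATH resolves to (may be None).
print(first_match("python", os.environ.get("PATH", "")))
```

Calling `first_match("easy_install", ...)` with the same PATH shows in the same way which copy of a tool the shell would pick.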

As a second clue, note that setuptools and easy_install come with Leopard's Python itself. More precisely, the corresponding command line tool is /usr/bin/easy_install. However, if you install setuptools for your new MacPython, the corresponding binary will be /usr/local/bin/easy_install. Now note that in the default configuration, /usr/bin goes before /usr/local/bin in the PATH variable. As a direct consequence, each invocation of easy_install puts packages into Apple's Python (into the directory /Library/Python/<version>/site-packages, to be precise), and these packages are naturally not accessible to MacPython, which is what you normally execute from a shell. MacPython searches for packages in the directory

/Library/Frameworks/Python.framework/Versions/<version>/lib/python2.5/site-packages.
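To check which interpreter is actually running and where it looks for third-party packages, a small diagnostic like the following may help. It is a sketch that assumes a modern Python 3, where the `sysconfig` module is available; the function name `describe_interpreter` is made up for this example.

```python
import sys
import sysconfig

def describe_interpreter():
    """Collect the facts that tell two Python installations apart."""
    return {
        "executable": sys.executable,                       # the binary being run
        "prefix": sys.prefix,                               # root of this installation
        "site_packages": sysconfig.get_paths()["purelib"],  # where pure-Python packages go
    }

for key, value in describe_interpreter().items():
    print(f"{key}: {value}")
```

Running it once with each interpreter makes the mismatch visible: a package installed into one `site_packages` directory is not on the other interpreter's search path.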

As a simple solution, we must change the configuration of easy_install. To do that, we have to edit the configuration file .pydistutils.cfg located in your home directory. It should contain the following lines:
```
[install]
install_lib=/Library/Frameworks/Python.framework/Versions/Current/lib/python$py_version_short/site-packages/
install_scripts =/Library/Frameworks/Python.framework/Versions/Current/bin
```
The first line specifies where all Python modules should be installed and the second line specifies where all executables should be located (this directory comes first in the PATH and thus MacPython executables are the first to be executed). After that one should reinstall setuptools and easy_install. During the installation of setuptools, the program might complain. In this case you should silence it by modifying PYTHONPATH. As a result, easy_install is located in

/Library/Frameworks/Python.framework/Versions/Current/bin

and thus the packages are put into the correct directory. In this way we solve the problem with easy_install, but we are still vulnerable to other easter eggs coming from the fact that we do not use the Apple distribution of Python. In particular, you cannot do any user-interface related programming, since the corresponding PyObjC module is missing from non-Apple Python distributions.
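The two settings in .pydistutils.cfg shown earlier can also be generated rather than typed by hand. The sketch below only renders the text and prints it; it deliberately does not touch the real file in the home directory. The names `FRAMEWORK` and `render_pydistutils_cfg` are made up for this example.

```python
import configparser
import io

# Assumed location of the MacPython framework, as given in the text.
FRAMEWORK = "/Library/Frameworks/Python.framework/Versions/Current"

def render_pydistutils_cfg(framework=FRAMEWORK):
    """Return the text of an [install] section pointing at `framework`."""
    # RawConfigParser leaves "$py_version_short" untouched (no interpolation).
    parser = configparser.RawConfigParser()
    parser.add_section("install")
    parser.set("install", "install_lib",
               framework + "/lib/python$py_version_short/site-packages/")
    parser.set("install", "install_scripts", framework + "/bin")
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()

print(render_pydistutils_cfg())
```

Review the printed text before saving it as ~/.pydistutils.cfg, since a wrong path here silently redirects every later install.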

Another, more complex solution is to remove MacPython cleanly. The latter is not an easy thing to do, since package management of open-source tools is more than unsatisfactory in Mac OS X (in fact, open-source tools are often difficult to install and uninstall under many non-Linux operating systems). The situation is even worse when you have installed some other Apple packages (.mpkg files) to extend basic configuration of MacPython. In the following, we describe how to do a successful uninstall in such cases. This procedure is not for the fainthearted — if you have a time-machine copy of the system state before installing MacPython, then it might be easier to revert to this state.

The key to a successful uninstallation procedure is a basic understanding of what a package is and how Installer.app works. First, note that a package is nothing more than a directory (one can expand the contents by right-clicking the package) with the following structure
```
Contents/Archive.bom
Contents/Info.plist
Contents/Packages/..
Contents/PkgInfo
Contents/Resources
```
The file Info.plist provides an XML description of how to install the package. The file Archive.bom contains a semi-binary description of all files that will be copied to the various install locations. The directories Packages and Resources contain sub-packages and files to be copied to the system. Installer.app reads Info.plist and Archive.bom to create the necessary files and then copies the package, without the Resources directory, to /Library/Receipts. The receipts are in principle enough to do a clean uninstall if the files are not used by other programs. Normal practice for Apple software requires that all files be stored in the directory /Applications/Installed-Application.app, or somewhere in /Library if the corresponding piece of software is shared by several applications. Mostly, the corresponding directories are stored in /Library/Frameworks. The integrity of files installed under /System is not guaranteed across system updates, and therefore sane persons do not store files there. The latter is not true for open-source software, which often tries to write into the directories /usr/local/bin and /usr/bin (extremely dangerous).

To uninstall a package, we first have to find out where it was written. The key IFPkgFlagDefaultLocation or IFPkgRelocatedPath in Info.plist contains the default or actual installation directory. With the command line tool lsbom we can also see the list of all files and directories that were installed. To be precise, lsbom prints paths relative to the install directory.
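Both lookups can be scripted. The following is a rough sketch rather than a tested uninstaller: it assumes the receipt's Info.plist is readable by Python's `plistlib`, that the relocated path (when present) should win over the default one, and that each line of lsbom output starts with the relative path followed by tab-separated fields. The function names are made up for this example.

```python
import plistlib
import posixpath

def install_location(info_plist_path):
    """Return the install directory recorded in a package's Info.plist."""
    with open(info_plist_path, "rb") as handle:
        info = plistlib.load(handle)
    # Assumption: the relocated path, if any, is the actual install directory.
    return info.get("IFPkgRelocatedPath") or info.get("IFPkgFlagDefaultLocation")

def absolute_paths(install_dir, lsbom_lines):
    """Turn lsbom's relative paths into absolute ones under `install_dir`."""
    result = []
    for line in lsbom_lines:
        relative = line.split("\t")[0].strip()  # first field is the path
        if relative:
            result.append(posixpath.normpath(posixpath.join(install_dir, relative)))
    return result

# Example with made-up lsbom output:
print(absolute_paths("/usr/local/bin", ["./python\t100755\t0/0", "./idle"]))
```

Feeding the real output of lsbom for a receipt's Archive.bom into `absolute_paths` gives a candidate list of files to inspect before deleting anything.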

Now applying this knowledge, we can deduce that MacPython consists of five packages, which create a directory /Library/Frameworks/Python.framework
and write the following files
```
idle
idle2.5
pydoc
pydoc2.5
python
python-config
python2.5
python2.5-config
pythonw
pythonw2.5
smtpd.py
smtpd2.5.py
```
to the directory /usr/local/bin in addition to /Applications/MacPython <>. Hence, if we remove these files, we restore the initial state unless we installed some other packages. In this case the procedure can be more tedious.
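Before deleting anything, it is worth checking which of the files listed above are actually present. The sketch below is a dry run only: it lists what it finds and removes nothing. The names `MACPYTHON_LINKS` and `find_leftovers` are made up for this example.

```python
import os

# File names taken from the list above.
MACPYTHON_LINKS = [
    "idle", "idle2.5", "pydoc", "pydoc2.5", "python", "python-config",
    "python2.5", "python2.5-config", "pythonw", "pythonw2.5",
    "smtpd.py", "smtpd2.5.py",
]

def find_leftovers(bin_dir="/usr/local/bin", names=MACPYTHON_LINKS):
    """Return the listed files that exist in `bin_dir` (dangling links included)."""
    found = []
    for name in names:
        path = os.path.join(bin_dir, name)
        if os.path.lexists(path):  # lexists also reports broken symlinks
            found.append(path)
    return found

# Dry run: print the candidates, delete nothing.
for path in find_leftovers():
    print(path)
```

Keeping the check separate from the deletion makes it easy to compare the output with the lsbom listing first.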

Finally, we should restore the old ~/.bash_login from the file ~/.bash_login.pysave.

To summarize, uninstallation of packages is more difficult than application-bundles (yet nevertheless, it is not hopeless). But still, I think applications without installers are the way to go.

Important update: How to install scipy from source code

Scipy is a nice extension of Python that provides the power of Matlab functions. For some odd reason the maintainers of the scipy package suggest installing a non-Apple distribution of Python. In other words, they force you to abandon the PyObjC library, which is a core graphical library on Mac OS X. Fortunately, it is still possible to have both by installing the scipy package from source. There are some hiccups in the procedure, but in general it is doable:
1. Install umfpack together with the UFconfig and AMD libraries from the source
  1. Download source files from the url http://www.cise.ufl.edu/research/sparse/umfpack/
  2. Change the configuration file UFconfig according to your computer. The flag
    
    UMFPACK_CONFIG = -DNBLAS
    
    is a safe though a bit slow choice
  3. Run make and pray
  4. Copy libumfpack.a and libamd.a
  5. Copy all files from the Include directories of UMFPACK, UFconfig and AMD:
    
    umfpack*.h UFconfig.h amd.h amd_internal.h
2. Download and install the numpy and scipy packages. You might change the compilation target by typing the following command in the shell before compiling and installing:
  
  export MACOSX_DEPLOYMENT_TARGET=10.4
  
  The target value 10.5 is not a good choice, since it automatically creates compile problems (joys of using freeware). Note that the current tarball scipy-0.7.0b1 is incomplete (joys of using freeware) and thus you have to use the svn trunk for that. But otherwise, the installation guide http://www.scipy.org/Installing_SciPy/Mac_OS_X is correct.
Important update: An update to the update

It seems that they have now fixed the 10.5 target, so you can use the command

export MACOSX_DEPLOYMENT_TARGET=10.5

before compiling if you have Leopard.
Tags: How to, Mac, Python
Maximum Likelihood for Incomplete Data

Posted by Konstantin 13.11.2008 1 Comment

It is good to attend lectures from time to time, even if these are on the topics that you "already know". Because, once in a while, you get to find out something new about things that are so "simple" and "familiar", that you thought there was nothing new you could possibly discover about them. So, today I've had an accidental revelation regarding a simple yet amusing property of the maximum likelihood estimation method. I think I should have been told that years ago, right on the first statistics course.

Maximum Likelihood

Maximum likelihood estimation is one of the simplest and most popular methods for fitting mathematical models to given data. It is most easily illustrated by an example. Suppose you are given a list of numbers $(x_1, x_2, \dots, x_n)$, which, you believe, are all instances of a normally distributed random variable with mean $\mu$ and variance $\sigma^2$. The problem is that the value of $\mu$ is unknown to you and you would like to estimate it from the data. How do you do it? The maximum likelihood method suggests searching for the value of $\mu$ for which the probability (or probability density) of obtaining exactly this set of numbers is maximal. That is:

$\hat\mu = \mathrm{argmax}_{\mu} P[x_1, x_2, \dots, x_n\,|\,\mu]$

In most cases this turns out to be quite a reasonable suggestion, which leads to good estimates (although, there are some rare exceptions). Here, for example, it turns out that to estimate the mean you should simply compute the average of the given numbers:

$\hat\mu = \frac{1}{n}\sum_i x_i$
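A quick numerical check of this claim, as a sketch: the data values are made up, the variance is fixed at 1, and the maximizer is found by a plain grid search rather than by calculus.

```python
import math

def normal_log_likelihood(mu, data, sigma=1.0):
    """Log-likelihood of `data` under a normal model with mean `mu`."""
    n = len(data)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2))

data = [2.1, 3.4, 1.9, 2.8, 3.0]              # made-up observations
sample_mean = sum(data) / len(data)

# Try every candidate mean between 0 and 5 in steps of 0.001.
grid = [i / 1000 for i in range(0, 5001)]
best_mu = max(grid, key=lambda mu: normal_log_likelihood(mu, data))

print("sample mean:       ", sample_mean)
print("grid-search argmax:", best_mu)
```

The two printed values agree up to the grid resolution, which is what the closed-form result predicts.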

This is more or less all that a typical textbook has to say about maximum likelihood, but there's more to it.

Counting Lightbulbs

Consider the following example. Suppose you want to measure the average lifetime of a common lightbulb (that is, the amount of time it will glow until burning out). For that you buy 1000 lightbulbs in a general store, switch all of them on and wait for a year, keeping track of the timepoints at which the individual lightbulbs burn out. Suppose the year has passed and you managed to record the lifetime of just 50 "dead" lightbulbs (the remaining 950 keep on glowing even after the whole year of uninterrupted work). What do you do now? Do you wait for the other lightbulbs to burn out? But this can take ages! Do you just discard the 950 lightbulbs and only consider the 50 for which you have "available data"? This is also not good, because, clearly, the knowledge that 95% of lightbulbs live longer than a year is important. Indeed, from this fact alone you could infer that the average lifetime must be between 18 and 21 years with 95% confidence (an interested reader is invited to verify this claim). So what do you do?

Assume that the lifetime of a lightbulb can be modeled using an exponential distribution with mean $\lambda$ (this is a natural assumption because the exponential distribution is well-suited for describing the lifetime of electric equipment). Then the probability (probability density, in fact) that lightbulb i burns out after $t_i$ years can be expressed as:

$P[X=t_i\,|\,\lambda] = \frac{1}{\lambda}e^{-\frac{t_i}{\lambda}}$ .

Naturally, the likelihood of lightbulbs 1,2,3,...,50 burning out after $t_1,t_2,\dots,t_{50}$ years, respectively, is just the product of the corresponding expressions:

$\prod_{i=1}^{50}\frac{1}{\lambda}e^{-\frac{t_i}{\lambda}}$ .

Now what about the lightbulbs that are still alive? The likelihood that a lightbulb will glow longer than a year can be easily expressed as:

$P[X > 1\,|\,\lambda] = e^{-\frac{1}{\lambda}}$ .

Therefore, the total likelihood of having 50 lightbulbs burn out at time moments $t_1,t_2,\dots,t_{50}$ and 950 lightbulbs keep working longer than a year is the following product:

$P[X_1 = t_1, X_2 = t_2, \dots, X_{50} = t_{50}, X_{51} > 1, \dots, X_{1000} > 1|\,\lambda] =$
$= \left(\prod_{i=1}^{50}P[X_i = t_i\,|\,\lambda]\right) \cdot \left(P[X > 1 \,|\,\lambda]\right)^{950} =$
$= \left(\prod_{i=1}^{50}\frac{1}{\lambda}e^{-\frac{t_i}{\lambda}}\right)\cdot e^{-\frac{950}{\lambda}}$ .

Finding the estimate of $\lambda$ that maximizes this expression is now a simple technical detail of minor importance. What is important is the ease with which we managed to handle the "uncertain measurements" for the 950 lightbulbs.
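For completeness, here is a sketch of that technical detail. Taking the logarithm of the product above gives $-50\ln\lambda - (\sum_i t_i + 950)/\lambda$, and setting its derivative to zero yields $\hat\lambda = (\sum_i t_i + 950)/50$. The failure times below are made up (50 values spread evenly over the year), since the post does not give actual data.

```python
import math

def censored_log_likelihood(lam, failure_times, n_survivors, cutoff=1.0):
    """Log of the likelihood above: observed failures plus bulbs alive at `cutoff`."""
    k = len(failure_times)
    total_time = sum(failure_times) + n_survivors * cutoff
    return -k * math.log(lam) - total_time / lam

def censored_mle(failure_times, n_survivors, cutoff=1.0):
    """Closed-form maximizer: total observed burning time per observed failure."""
    return (sum(failure_times) + n_survivors * cutoff) / len(failure_times)

# Made-up data: 50 failure times spread evenly over the first year.
failure_times = [(i + 0.5) / 50 for i in range(50)]
lam_hat = censored_mle(failure_times, n_survivors=950)

print("estimated mean lifetime (years):", lam_hat)
print("log-likelihood at the estimate: ", censored_log_likelihood(lam_hat, failure_times, 950))
```

With these made-up failure times the estimate comes out at about 19.5 years, dominated by the 950 bulb-years contributed by the survivors.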

Final Remarks

Note that the idea can be extended to nearly arbitrary kinds of uncertainty. For example, if, for some lightbulb we knew that its lifetime was a number between a and b, we would include this piece of knowledge by using the term $P[a < X < b]$ in the likelihood function. This inherent flexibility of the maximum likelihood method looks like a feature quite useful for the analysis of noisy data.

Surprisingly, however, it does not seem to be applied much in practice. The reason for that could lie in the fact that many nontrivial constraints on the data will make the likelihood function highly nonconvex and hard to optimize. Consider, for example, the probability density

$P[X = x_1 \vee X = x_2 \vee X = x_3 \,|\,\lambda] = P[x_1\,|\,\lambda] + P[x_2\,|\,\lambda] + P[x_3\,|\,\lambda]$

for a normally distributed variable X. Depending on the choice of $x_1$ , $x_2$ , and $x_3$ , the likelihood function will have either three different peaks at $\lambda = x_1, x_2, x_3$ , two peaks or just one. Thus, finding an exact optimum is much harder than in the typical textbook case. But on the other hand, there are various cases where the likelihood function is nicely convex (consider the interval constraint mentioned above) and besides there are numerous less efficient algorithms around anyway. So maybe the trick is not used much, simply because it's not in the textbook?
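The peak-counting claim can be checked numerically. The sketch below assumes unit variance, treats the unknown parameter as the mean of the normal distribution (called $\lambda$ in the formula above), and counts local maxima on a fixed grid; the three example point sets are made up.

```python
import math

def disjunction_likelihood(mean, points, sigma=1.0):
    """Sum of normal densities at `points`, viewed as a function of the mean."""
    norm = sigma * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - mean) / sigma) ** 2) for x in points) / norm

def count_peaks(points, lo=-5.0, hi=25.0, steps=3000):
    """Count strict local maxima of the likelihood on an evenly spaced grid."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    values = [disjunction_likelihood(mean, points) for mean in grid]
    return sum(1 for i in range(1, steps)
               if values[i - 1] < values[i] > values[i + 1])

print(count_peaks([0.0, 10.0, 20.0]))  # well-separated points
print(count_peaks([0.0, 0.5, 10.0]))   # two points close together
print(count_peaks([0.0, 0.5, 1.0]))    # all three close together
```

A hill-climbing optimizer started at the wrong place would stop at whichever peak is nearest, which is the practical difficulty described above.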

Tags: Data analysis, Probability theory, Statistics
The Rule of The Empty Inbox

Posted by Konstantin 06.11.2008 No Comments
Effective time management is an important and interesting issue. It is important because, theoretically, it should provide a way to be more efficient and thus spend less time doing work and more time having fun. It is interesting because of the wide range of methodologies, techniques and advice lying around: there is clearly no universal method that fits everyone. Some people say that the key trick to efficiency is being goal-oriented and managing priorities, others believe it is all about proper todo-lists, calendars and reminders. Some say you must constantly motivate yourself to work hard, others believe you had better learn to go with the flow and follow your internal desires. There are also those who say that being organized is an inherent character trait and not a skill that one can acquire.

I strongly disagree with the latter statement. Being in general a lazy, careless, unmotivated and not at all a responsible person, I still manage to be somewhat productive from time to time, and I think a significant role in that is due to simple tricks. I realized it most clearly amidst master's studies, when my lecture schedule was suddenly over and I had to figure out myself what to do with my time.

Now, the whole "time management theory" is actually quite trivial, just a bunch of obvious pieces of advice. One can read through a book or two of those in a couple of days. It takes much longer, however, to figure out which of those pieces of advice would work for you personally. Take for example the popular suggestion to keep a notebook with a todo-list and spend 5 minutes every day planning the day. I tried that several times, and although it does seem to help temporarily, I just can't make myself follow the routine. First of all, the requirement to write trivialities into the notebook is demanding. The mind resists it and tries to optimize the notebook away as quickly as possible. Secondly, the need to follow some routine "5 minutes every day" is a disciplinary burden too heavy for my lazy personality.

Here's another popular suggestion that I found to be somewhat useless: plan your projects, set yourself deadlines and stick to them. This doesn't work for me for several reasons. Firstly, there is not much added value in explicit planning for most day-to-day tasks; intuition does a pretty good job here anyway. Secondly, no matter how detailed a plan you have for each project, you'll still have the problem of dividing your time among them, as well as handling all those tiny routine events and unexpected tasks. Finally, the requirement of sticking to deadlines sounds more like a problem than a solution to me.

Fortunately, there are tricks that work much better, and those are what I was planning to write about. The best time management ideas that I know of have been nicely summarized by D. Allen in his GTD ideology. His advice is based on three obvious (as usual), but nonetheless insightful observations:
1. A time management system won't work unless you put everything in it. This includes every smallest idea that you ever had to think about, including "I want to go see that stupid movie some day" and all the less important things. Although it might seem like overkill, it's not. These small ideas, unless materialized somehow, can use up quite a lot of your brainpower by just "being on your mind". As soon as you write them out in a safe place, your mind gets much clearer. Also, you can only manage your schedule if you really know that everything is there.
2. If there's an idea, it is not immediately clear what action it implies. For example, "I want to see a movie" is an idea. The first action that needs to be done to make it happen is actually "Call a friend". It's not a big deal to figure it out, but nonetheless there's some amount of thought involved. Surprisingly often, this small amount of thought is a terrible hurdle while undone. Indeed, isn't it much more pleasant to know that you need to "call a friend" than to have a vague "go see a movie" entry on the todo-list? What GTD essentially suggests is to make this "first action" decision as early as possible and keep the todo-lists in terms of actions, not ideas.
3. A proper file management/archiving system is crucial. There has to be a place where you can store all these ideas, "first actions", reference materials, schedule, reminders, etc. Ideally, this storage has to be organized in an associative manner, i.e. you should be able to retrieve things in context easily. "I'm in a mood to write an email now, do I have a task that fits this context?", "I'm working on this project now, what are the related materials?". Once again, if you can trust your system to store things for you, your mind is relieved.
The GTD ideology describes a simple workflow process based on these observations. The whole process uses heavily the notion of "inboxes" and goes as follows:
1. All the ideas or tasks you might have arrive at your "IN" inbox.
2. You review this inbox regularly, and decide for each item its first action. Upon decision, you must move items out of the "IN" inbox, and you have the following limited set of options:
  1. There is nothing to do about it. Delete and forget.
  2. This is some useful information. Archive for future reference.
  3. This requires you to do something small (it'll take 2 minutes or so). Do it now. Delete message.
  4. This requires you to do something that will take longer than 2 minutes. Then:
    
    1. You can schedule doing it to a fixed date and time (add an entry to the calendar). Delete message or archive for reference.
    2. You can move the message into the "ASAP" inbox. You'll review that inbox next time you have some free time. Note that the choice between (1) and (2) allows you to balance conveniently between "100% planning" and "no scheduling at all" while still keeping track of all your stuff.
    3. You can move it into the "Someday" inbox. You'll review that inbox someday later.
    4. This is something you have to delegate. Then you move the item to the "Pending" inbox, so that you won't forget about it.
3. Besides the inboxes "IN", "ASAP", "Someday" and "Pending" you also need to keep a list of ongoing projects as folders with relevant information. You can review this list from time to time to make sure you haven't forgotten anything.
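The decision procedure in step 2 is simple enough to write down as a function. This is a toy sketch: the dictionary fields (`actionable`, `useful`, `minutes`, `delegated`, `fixed_date`, `someday`) are invented for the illustration and are not part of GTD itself.

```python
def route(item):
    """Decide where an item leaving the "IN" inbox should go."""
    if not item.get("actionable", False):
        # Nothing to do: either keep it for reference or forget it.
        return "archive" if item.get("useful", False) else "delete"
    if item.get("minutes", 0) <= 2:
        return "do now"                   # small enough to finish immediately
    if item.get("delegated", False):
        return "Pending"                  # someone else acts, you keep track
    if item.get("fixed_date"):
        return "calendar"                 # scheduled for a concrete time
    if item.get("someday", False):
        return "Someday"
    return "ASAP"                         # default: next free moment

print(route({"actionable": True, "minutes": 30}))
```

Every branch moves the item out of "IN", which is the invariant the whole workflow relies on.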
The best part of this whole process (and this is what actually makes it suitable for me personally) is the fact that it can be implemented using e-mail. This way it does not demand much additional discipline: I'm reading email several times a day anyway as part of my compulsory procrastination activities. Most of my tasks, even the smallest ones, are in one way or another reflected in e-mail. It is then only a matter of keeping the proper folder structure and moving emails out of INBOX according to the above protocol, remembering to think about the "next action" in the process. That is quite easy, it forces me to manage and schedule more or less all of my activities, and it keeps the mind reasonably free, too.
Tags: Personal, Time management