Talk:NLP Reading Group

From CLSP Wiki
Jump to: navigation, search

Feel free to post ideas for future papers and topics here!

Plagiarism detection

(from 11/8/2001)

Occurs to me that plagiarism detection systems might be relevant to bitext alignment. A message to the Corpora list yesterday announced the following review paper: [1], [2]]

The bibliography includes links to some online papers.

I doubt many of these approaches are easy to convert to the multilingual case -- e.g., some use hashing to detect substring exact match, and they don't seem to use probabilities. However, plagiarism detection does have to face some characteristic problems that show up in translation:

  • word substitution (via thesaurus)
  • movement of blocks of text rather than independent movement of individual words

So it might be interesting to look for a potentially relevant paper.

Tree automata and transducers

(from 1/14/02)

TATA book

Using Tree Automata and Regular Expressions to Manipulate Hierarchically Structured Data

Reinforcement learning

Date: Mon, 12 Aug 2002 16:15:10 +0200
From: Juergen Schmidhuber <juergen@idsia.ch>
To: connectionists@cs.cmu.edu
Subject: optimal universal reinforcement learner

I'd like to draw your attention to the first optimal, universal
reinforcement learner.
 
While traditional RL requires unrealistic Markovian assumptions,
the recent AIXI model of Marcus Hutter just needs an environment
whose reactions to control actions are sampled from an unknown
but computable distribution mu.  This includes basically
every environment we can reasonably talk about:
 
M. Hutter. Towards a Universal Theory of Artificial Intelligence
based on Algorithmic Probability and Sequential Decisions.
Proc. ECML-2001, p. 226-238.
http://www.idsia.ch/~marcus/ai/paixi.pdf
ftp://ftp.idsia.ch/pub/techrep/IDSIA-14-00.ps.gz
 
Self-Optimizing and Pareto-Optimal Policies in General
Environments based on Bayes-Mixtures. Proc. COLT-2002, 364-379.
http://www.idsia.ch/~marcus/ai/selfopt.pdf
ftp://ftp.idsia.ch/pub/techrep/IDSIA-04-02.ps.gz
 
How does AIXI work?  An optimal predictor using a universal
Bayesmix XI predicts future events including reward.  Here XI
is just a weighted sum of all distributions nu in a set M.
AIXI now simply selects those action sequences that maximize
predicted reward.
 
It turns out that this method really is self-optimizing in the
following sense:  for all nu in the mix, the average value of
actions, given the history, asymptotically converges to the
optimal value achieved by the unknown policy which knows the
true mu in advance!
 
The necessary condition is that M does admit self-optimizing
policies. This is also sufficient! And there is no other policy
yielding higher or equal value in all environments nu and a
strictly higher value in at least one.
 
Interestingly, the right way of treating the temporal horizon
is not to discount it exponentially, as done in most traditional
RL work, but to let the future horizon grow in proportion to the
learner's lifetime so far.
 
To quote some AIXI referees: "[...] Clearly fundamental and
potentially interesting research direction with practical
applications. [...] Great theory. Extends a major theoretical
direction that led to practical MDL and MML. This approach may
do the same thing (similar thing) wrt to decision theory and
reinforcement learning, to name a few." "[...] this could be
the foundation of a theory which might inspire AI and MC for
years (decades?)."
 
Juergen Schmidhuber, IDSIA
http://www.idsia.ch/~juergen/unilearn.html

Split-and-merge EM

Date: Mon,  4 Nov 2002 12:20:28 -0500
From: "Jason Eisner" <jason@cs.jhu.edu>
Subject: paper for a future week

Hinton described this work to me; the paper might be worth
looking at sometime in the reading group.

http://www.cs.toronto.edu/~hinton/absps/ueda.html

SMEM Algorithm for Mixture Models 

Ueda, Nakano, Ghahramani & Hinton

We Present a split and merge EM (SMEM) algorithm to overcome the local
maxima problem in parameter estimation of finite mixture models.  In
the case of mixture models, local maxima often involve having too many
components of a mixture model in one part of the space and too few in
another, widely separated part of the space.  To escape from such
configurations we repeatedly perform simultaneous split and merge
operations using a new criterion for efficiently selecting the split
and merge candidates.  We apply the proposed algorithm to the training
of Gaussian mixtures and mixtures of factor analyzers using synthetic
and real data and show the effectiveness of using the split and merge
operations to improve the likelihood of both the training data and of
held-out test data.  We also show the practical usefulness of the
proposed algorithm by applying it to image compression and pattern
recognition problems.

Boosting

from 12/2002: Review paper on boosting

from 3/2003: Paper associated with the talk below

This is a talk at Cornell next week:

>Title:          How to be a Bayesian without believing
>Speaker:        Yoav Freund
>
>Abstract: Most of the study of Bayesian prediction procedures is premised 
>on some strong assumptions regarding the prior distribution. In 
>particular, an assumption that always needs to be made is that the prior 
>distribution assigns a non-zero probability to the correct model.  In 
>practice, however, we often have to restrict the set of models (in the 
>support of the prior) in order to make it feasible to compute the 
>posterior average.  As a result, we often can't assume that the correct 
>model is in our set and the standard Bayesian theory cannot be applied.
>
>In this work we show a classification procedure which uses model averaging 
>and can be interpreted as a Bayesian procedure.  We show that this 
>procedure has some desirable properties when it is applied to *any* source 
>of IID examples, regardless of whether or not the source distribution is 
>related to the models in the support of the prior.  Our main result is 
>that the predictions made by this procedure are very stable with regard to 
>the choice of the random training set.
>
>This stability property has some far-reaching implications on a variety of 
>issues, including Bias-Variance decomposition of classification error, the 
>"curse of dimensionality" and Bagging.
>
>This is joint work with Yishay Mansour and Rob Schapire.

Collaborative filtering

Date: Tue, 25 Feb 2003 12:24:24 -0500
From: "Jason Eisner" <jason@cs.jhu.edu>
Subject: possible paper - collaborative filtering

Another possible paper for reading group.  Collaborative filtering is
a smoothing problem: guess which objects (e.g., documents, movies,
etc.)  a given user will like, based on other knowledge about who
likes what.

Abstractly, this is very similar to predicting which documents are
relevant to a given term, or which nouns are likely to be the objects
of a given verb.

----------------------------------------------------------------------

http://research.microsoft.com/~carlk/papers/cfw.htm

CFW: A Collaborative Filtering System Using Posteriors Over Weights of
Evidence

Carl M. Kadie, Christopher Meek, & David Heckerman
Microsoft Research

We describe CFW, a computationally efficient algorithm for collaborative
filtering that uses posteriors over weights of evidence. In experiments
on real data, we show that this method predicts as well or better than
other methods in situations where the size of the user query is small.
The new approach works particularly well when the user's query contains
low frequency (unpopular) items. The approach complements that of
dependency networks which perform well when the size of the query is
large. Also in this paper, we argue that the use of posteriors over
weights of evidence is a natural way to recommend similar items---a task
that is somewhat different from the usual collaborative-filtering task.

Proceedings of the Eighteenth Conference on Uncertainty in Artificial
Intelligence, Edmonton, Alberta, August, 2002.

Software for graphical models

See email below. Others that would make the list now are

Jason 23:43, 3 April 2008 (EDT) and Jason 15:49, 24 June 2008 (EDT)

Date: 3 Jul 2003 21:38:38 -0400
From: "Jason Eisner" <jason@cs.jhu.edu>

I have a suggestion for a line of inquiry when the NLP reading group resumes:

Why don't we look at some of the best, most coherent software for graphical models. We'd start by reading the tutorial documentation together and playing around with the toolkits. Then once we understand how to use them and have an intuition about how they work, we'd delve into the papers that explain the algorithms that they use.

I like the idea of starting with concrete examples and documentation, and then working backward to understand the theory. We could also try this approach for machine learning toolkits such as WEKA. Of course, it only makes sense in areas where the theory and software have both reached a level of elegance.

Here are a couple of packages that have caught my eye recently:

This is a beautiful system for directed graphical models. It was released only a couple of days ago, and my reaction was that someone had finally done this right. The language looks like ML, but all the variables are random variables, which may be defined in terms of one another via ordinary expressions, and hence are not independent. Evaluation is lazy. You can do inference (ask for the joint or marginal distributions on variables given observations of some other variables), and you can do learning (I think using EM with Dirichlet priors). The language is general enough that you can specify and train PCFGs. (It is possible that this is quite slow -- I don't yet know how things work under the hood.)
"Index to documentation for software that implements flexible Bayesian models based on neural networks, Gaussian processes, mixtures, and Dirichlet diffusion trees, and that demonstrates various Markov chain Monte Carlo methods."
Neal is very good, and I suspect this is the best software that uses Markov chain sampling (MCMC) to do inference and learning when it might otherwise be infeasible.
  • Daphne Koller's stuff
... if it has been released. I think that it uses loopy belief propagation, but I don't know if it's been released in a general form.
  • (the text below refers to MALLET)
Andrew McCallum and his students are apparently working on some general information extraction software based on conditional random fields, presumably modeled after the stuff he worked on at the late lamented Whizbang!. It is already at the point where they are using it locally, so we should keep an eye out for release.
  • There is also a nice-looking short paper by Yoav Shoham that surveys recent attempts to push AI into programming languages, including IBAL.

Probabilistic linguistics

From: Philip Resnik <resnik@umiacs.umd.edu>
Date: Mon, 8 Sep 2003 10:22:42 -0400 (EDT)

Another thought on possible readings: Bod et al. (2003) have a new
book called "Probabilistic Linguistics", with some chapters
potentially interesting from the point of view of the
linguistics/models interface (i.e. with the goal of linguistic
explanation as opposed to task-based performance).  Chris Manning has
a nice chapter on probabilistic syntax.

  Philip

Time series analysis (cointegration, leading indicators)

From: "Jason Eisner" <jason@cs.jhu.edu>
Date: Wed,  8 Oct 2003 22:59:40 -0400

An odd but newsworthy source of possible reading:

Clive Granger just won (half of) the Nobel Prize in economics, for his
work on time series variables.=20=20

His cointegration technique from the 80's tries to determine whether
two time-series variables have a statistically significant
relationship.  (It's not enough for them to follow the same overall
trend.  To get significance, I suspect, smaller fluctuations also have
to be linked more often than expected by chance.)=20=20

Reading about this work suggested several NLP analogies to me.

First, I wonder whether this kind of analysis could be used on
clustering and k-nearest neighbor methods, especially those based on
distributional similarity (e.g., Lillian Lee's work).  In such
methods, we are assuming that past similarity of p(. | x) and p(. | y)
will predict their future similarity -- just as in economic prediction.
Such an assumption should be stronger if the hills and valleys of the
two distributions are well matched.

Granger also looked at the causal relationships among time series,
i.e., which variables are leading indicators of the others.  This
seems analogous to transformation models -- i.e., is NP -> Det Adj N
derived from NP -> Det N or vice-versa?

Of course, we are less interested in these yes/no questions (are x and
y significantly related, does x lead y, etc.) than in the problem of
actually predicting y from x.  But that's also true of economists, and
I'm sure they've addressed those issues by now.

There are many textbook treatments of these topics.  Cointegration and
analysis of leading indicators are widely used now by economic
forecasters.

jason

----------------------------------------------------------------------

http://news.ft.com/servlet/ContentServer?pagename=3DFT.com/StoryFT/FullStor=
y&c=3DStoryFT&cid=3D1059480455082

Designers of forecasting toolkit win Nobel prize
By Alan Beattie, International Economy Correspondent
Published: October 9 2003 5:00 | Last Updated: October 9 2003 5:00

Two academics who designed essential parts of the modern statistical
toolkit for analysing financial markets and macroeconomic forecasting
have won this year's Nobel prize for economics.

Clive Granger, a British economist now at the University of California
in San Diego, and Robert Engle, an American at New York University,
made critical breakthroughs in techniques to handle "time series"
variables - data recorded at regular intervals. They will share the
prize of SKr10m ($1.3m).

Prof Granger's major discovery was in the invention of
"cointegration", a means of analysing the relationship between two
variables, such as wealth and consumption, both of which had a trend
over time. Previously, economists had often claimed a correlation
between two or more such variables which turned out to be spurious,
being driven simply by the fact that both followed a trend - an error
pointed out by Prof Granger in the mid-1970s.

In subsequent work over a decade later, Prof Granger established that
such "non-stationary" variables could be combined in ways which
allowed accurate statistical inferences to be drawn. He also developed
the "Granger causality test" to estimate which of several variables
led movements in the others.

Because many macroeconomic variables such as income, consumption, the
price level and gross domestic product have persistent trends, the
techniques that Prof Granger developed are now routinely used by
economic forecasters at central banks, at finance ministries, in
academia and in the financial markets.

"Suddenly, after Clive's breakthrough, we had apparatus that allowed
us to disentangle those things that merely moved together over time
from those that were actually related to each other," said Kenneth
Wallis, professor of economics at Warwick University in the UK and
former editor of the academic journal Econometrica, which published
pathbreaking articles by both men.

Although Prof Engle did some work with Prof Granger, his award was for
separate research on variables whose volatility varies over time. It
is commonly used to analyse prices in financial markets. He told
reporters yesterday he was delighted with the prize, which reflected
how widely the techniques he developed had become used.

Traditional tools of analysis assumed that volatility was
constant. The techniques Prof Engle developed, particularly
autoregressive conditional heteroskedasticity (ARCH) models which
predicted volatility would depend on its own behaviour in the past,
are now widely employed by investors and financial institutions
assessing the risk of portfolios.

Prof Wallis said that the two academics had created a highly
influential centre of excellence at the University of California at
San Diego, though Prof Engle was tempted away to NYU.

Sparse learning

ICML/UAI/COLT 2008 is holding a workshop on Sparse Optimization and Variable Selection, whose CFP has many references to techniques, as well as links to past workshops.

Also, in 4/2003 I posted a note about JMLR's special issue on variable and feature selection, including the intro paper.

Jason 23:18, 3 April 2008 (EDT)

Deep learning

Just ran across an interesting Deep Learning Workshop from NIPS 2007.

By "deep learning," they mean learning neural networks with many layers, or other systems that build up high-level representations one step at a time, e.g., for vision.

The website above has videos of all talks, plus links to quite a number of background papers from 2006 and 2007. This looks like an interesting future topic. We could even watch videos in reading group! Jason 22:34, 3 April 2008 (EDT)

Enhanced messaging

Improving email is a great end-user NLP application. It'd be fun to look at this AAAI 2008 Workshop on Enhanced Messaging once it's over. Jason 00:05, 4 April 2008 (EDT)

See also the CALO project.

space-efficient search, etc.

We should follow up on some of the references noted at 5/15/08.

Also, Kevin Murphy's thesis has other good stuff in it, all about HMMs and their many generalizations. At some point, we should probably read chapter 5 on stochastic methods for approximate inference in DBNs, notably Rao-Blackwellized particle filtering.

Data structures

A taxonomy of suffix array construction algorithms. Simon J. Puglisi, W. F. Smyth, Andrew H. Turpin. July 2007. Computing Surveys (CSUR) , Volume 39 Issue 2.